Implementing SaaS Monitoring

Introduction

One of the key problems with the shift to SaaS applications is the lack of visibility into performance and availability. Traditional monitoring tools are often not designed to handle SaaS. Limited access to metrics, multi-tenancy, and the unwillingness of SaaS providers to share monitoring data make SaaS inherently difficult to monitor. Historically, organizations used on-premise solutions where they had full control over the infrastructure and applications. At my current organization, until recently, we hosted a majority of our productivity tools on-premise, giving us complete visibility into performance, reliability, and the user experience. This benefit is lost when shifting to SaaS.

In order to ensure business continuity and optimal user experience, it’s crucial to have a robust monitoring solution that can provide real-time insights into the health of the SaaS tooling used. Good monitoring ensures that failover plans can be executed in a timely manner, and we can continue to operate regardless of external factors.

The crux of the issue is the lack of commercially available SaaS monitoring solutions that effectively address these challenges. Many existing tools are either too generic or too specialized, making it difficult to find a solution that meets all the requirements.

This is the challenge we faced when moving our on-premise productivity tools to SaaS. In this blog post, we will explore the challenges of monitoring SaaS applications, discuss potential solutions, and describe the solution we chose.

Challenges of Monitoring SaaS

Monitoring SaaS apps presents several unique challenges that need to be addressed in order to ensure effective monitoring and management. Some of the key challenges faced include:

Multi-Tenancy: SaaS applications serve multiple tenants from a single instance, which can lead to challenges in isolating performance issues specific to a particular tenant.
1. Realistically, an incident or outage rarely affects only one specific tenant, so this level of granularity is often not needed.
Limited Access: SaaS providers may limit or outright block access to certain metrics and logs, making it challenging to gather the necessary data for effective monitoring.
1. Most SaaS providers only expose limited data via APIs or status pages. Even on enterprise plans, they are often reluctant at best to provide any additional data.
Data Privacy and Security: Monitoring may involve handling sensitive data, which requires strict adherence to data privacy and security regulations.
1. Related to the above point on limited access, this can often be attributed to data privacy concerns. Providers host data for multiple customers and need to ensure that monitoring data does not expose sensitive information.
Synthetic Monitoring: Traditional monitoring techniques may not be sufficient for SaaS, necessitating the use of synthetic monitoring to simulate user interactions and measure performance.
1. This is essentially the only way to truly measure the performance and availability of SaaS apps from an end-user perspective. The problem is that it is highly complex: simple UI changes can break monitors, so they require constant maintenance.
Lack of Commercial Solutions: While some tooling exists, such as LogicMonitor and New Relic, none of these solutions actually address the core challenges or provide significant enhancements over what can be built in-house.
1. This was the most surprising challenge for us. We were hopeful that some solutions had reached agreements with major SaaS providers to provide enhanced monitoring capabilities, but this was not the case. Not a single vendor could provide a solution greater than what we could build ourselves.

Potential Solutions

To address the challenges of monitoring SaaS, several potential solutions can be considered:

Third-Party Monitoring Tools: Evaluating and utilizing third-party monitoring tools that specialize in SaaS monitoring can provide additional capabilities and insights.
Collaboration with SaaS Providers: Establishing strong relationships with SaaS providers can facilitate better access to monitoring data and support.
Custom Monitoring Solutions: Developing custom monitoring solutions tailored to the specific needs of SaaS applications can provide greater flexibility and control over monitoring capabilities.
API Integration: Leveraging APIs provided by SaaS vendors can enable the collection of relevant metrics and logs for monitoring purposes.
Synthetic Monitoring: Implementing synthetic monitoring techniques can help simulate user interactions and measure the performance of SaaS tooling from an end-user perspective.

The issues with each potential solution

Third-Party Monitoring Tools: While third-party tools can provide some level of monitoring, they may not fully address the unique challenges of SaaS applications, such as multi-tenancy and limited access to data. Implementing additional SaaS tools also adds cost and complexity to the monitoring stack.
1. In our chat with LogicMonitor, we found that while they had some SaaS monitoring capabilities, they only leveraged the same APIs that we could access ourselves. This meant that we would still face the same limitations in terms of data access and granularity. Additionally, this would result in providing access to our SaaS monitoring data to a third party, which raised security and privacy concerns for us.
Collaboration with SaaS Providers: Relying on SaaS providers for monitoring data can be challenging, as providers may have varying levels of willingness to share data and support monitoring efforts. Additionally, this approach may not be scalable for organizations using multiple SaaS applications from different providers as it would still require some custom work to integrate each provider.
1. At my current organization, we recently migrated from Mattermost on-premise to Slack as our primary communication tool. (This could be its own blog post given the challenges we faced!) Slack only provides a status page and little to no monitoring for platform reliability and performance. We asked Slack if they would be willing to expose some metrics via an API or other means, but they declined our request. This limited our ability to monitor Slack’s performance and availability effectively.
API Integration: While APIs can provide access to some monitoring data, they may not offer comprehensive insights into the performance and availability of SaaS applications. Additionally, API rate limits and data granularity can pose challenges for effective monitoring.
1. Initially, as a stop-gap for Slack, we implemented a simple solution to retrieve the status from the Slack status page and alert on changes. While this provided some basic monitoring capabilities, it was limited in scope and did not provide the depth of insights we needed to effectively monitor Slack’s performance and availability. We also quickly found that the page can be unreliable at times and would often not reflect real-time issues.
Synthetic Monitoring: While synthetic monitoring can provide valuable insights into the performance of SaaS applications, it may not capture all aspects of user experience and lacks the depth of monitoring that traditional tools can provide.
1. We looked into New Relic for synthetic monitoring of our SaaS tools, but found that it did not fully meet our needs and was cost-prohibitive for our use case. They also did not provide the level of customization we required for our monitoring needs.
Custom Monitoring Solutions: Developing custom monitoring solutions can be resource-intensive and may require specialized expertise. Additionally, maintaining and updating custom solutions can be challenging as SaaS applications evolve.
1. Ah yes, the dreaded custom solution. While we knew this would be a significant undertaking, we felt that it was the best way to ensure that our monitoring solution met our specific needs and addressed the unique challenges of SaaS.
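The status-page stop-gap described under API Integration above can be illustrated with a minimal Python sketch. It is not the code we run; the endpoint URL, the "status" field name, and the function names are assumptions made for illustration and should be checked against the provider's documentation. The idea is simply to remember the last status seen and raise an alert only when it changes.

```python
import json
import urllib.request
from typing import Optional, Tuple

# Assumption: a public JSON status endpoint with a top-level "status" field.
# Both the URL and the payload shape are illustrative, not confirmed here.
STATUS_URL = "https://slack-status.com/api/v2.0.0/current"


def fetch_status(url: str = STATUS_URL, timeout: float = 10.0) -> dict:
    """Fetch the status document as a dict (performs a network call)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def detect_change(previous: Optional[str], payload: dict) -> Tuple[str, Optional[str]]:
    """Return (current_status, alert_message); the message is None if nothing changed."""
    current = str(payload.get("status", "unknown"))
    if previous is None or current == previous:
        # First observation, or no change: nothing to alert on.
        return current, None
    return current, f"Status page changed: {previous} -> {current}"
```

A scheduler would call fetch_status() on an interval, pass the payload to detect_change() together with the previously stored status, and forward any message to the alerting system. Because the only input is the status page, this approach inherits the page's unreliability noted above.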

Our Approach and Decision

Initially, we evaluated several third-party monitoring tools, including LogicMonitor and New Relic, but found that they did not fully address our specific needs for SaaS monitoring. We also explored the possibility of collaborating with SaaS providers to gain better access to monitoring data, but this approach proved to be a dead end in almost all cases.

Ultimately, we decided to develop a custom monitoring solution tailored to our specific needs. It leverages APIs where possible to gather relevant metrics and logs from our SaaS applications. Additionally, we intend to implement synthetic monitoring to simulate user interactions and measure performance from an end-user perspective. This is a massive undertaking and will likely be developed in several iterations over an extended period of time.

This approach gives us greater control over our monitoring capabilities and the flexibility to adapt as SaaS environments change, so that even if we change tools we can continue to monitor with little effort. While developing a custom solution requires significant resources and expertise, we believe it is the best approach to effectively monitor our SaaS tooling and ensure we do not lose the visibility we have grown accustomed to with on-premise solutions.

System Architecture & Design

Our custom SaaS monitoring solution is designed to be modular and scalable, allowing us to easily add new SaaS tools and monitoring capabilities as needed. We also needed to ensure that the platform was tool agnostic and allowed new monitors and SaaS apps to be onboarded with minimal effort.

This posed many design challenges, as each SaaS app has its own unique and complex monitoring requirements, such as different authentication mechanisms, data formats, and API limitations. To address these challenges, we designed the system with the following key components:

Config Management DB: A centralized database to store config data for each SaaS tool and its associated monitors. This allows us to easily manage and update monitoring configurations as needed.
Monitor Workers: Dedicated workers that can be deployed in multiple regions to perform the actual monitoring tasks. Each worker collects tasks from a centralized task queue, executes the monitoring logic, and reports the results back to the central management node for processing, alerting, and visualization.
Data Collection & Processing: A robust data collection and processing pipeline that ingests monitoring data from the various monitors and processes it for storage and analysis.
Alerting & Notification: Fortunately, we were able to leverage our existing alerting and notification systems to handle alerts generated by the SaaS monitoring solution. This system is capable of sending alerts via multiple channels, including SMS and chat (Slack).
Dashboard & Visualization: User-friendly dashboards that provide real-time insights into the performance and availability of the monitored SaaS apps. These dashboards are designed to be customizable and can display various metrics and visualizations.
1. We were able to leverage our existing Grafana instances to visualize the data collected from the SaaS monitoring solution. This allowed us to quickly set up dashboards and visualizations.
2. We also implemented a pre-defined template system that allows users to quickly create dashboards for new SaaS apps based on common monitoring metrics and visualizations. This template system significantly reduces the time and effort required to onboard new SaaS applications and ensures consistency across dashboards.
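To make the components above more concrete, here is a minimal Python sketch of how a monitor definition and a worker's result could be modeled. The class names, field names, and defaults (MonitorConfig, CheckResult, evaluate, the 60-second interval) are hypothetical and chosen for illustration; they are not our actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MonitorConfig:
    """One hypothetical config DB record: what to check and how."""
    app: str
    endpoint: str
    auth_method: str            # e.g. "oauth2", "api_key", "basic"
    expected_status: int = 200
    interval_seconds: int = 60  # assumed default, not from the real system
    enabled: bool = True


@dataclass(frozen=True)
class CheckResult:
    """Tool-agnostic result a worker reports back for processing."""
    app: str
    endpoint: str
    healthy: bool
    latency_ms: float
    detail: Optional[str] = None


def evaluate(config: MonitorConfig, status_code: int, latency_ms: float) -> CheckResult:
    """Turn a raw HTTP outcome into a normalized result."""
    healthy = status_code == config.expected_status
    detail = None if healthy else f"expected {config.expected_status}, got {status_code}"
    return CheckResult(config.app, config.endpoint, healthy, latency_ms, detail)
```

Keeping the result type identical for every SaaS app is what lets alerting and dashboards stay tool agnostic: only the code that produces the raw status code and latency needs to know about a specific provider.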

I should also mention we want to store historical data for ensuring SLA compliance and trend analysis. We opted to use a time-series database (TSDB) for this purpose, as it is optimized for storing and querying time-series data. We also store incidents in a relational database for easy reporting. This allows us to track incidents over time and generate reports on SLA compliance and performance trends.
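As a small worked example of the SLA arithmetic this historical data supports, the following Python sketch computes availability from check counts and the downtime budget implied by a target. The 99.9% target and 30-day period are assumptions for illustration, not our actual SLA terms, and the function names are hypothetical.

```python
def availability_percent(successful_checks: int, total_checks: int) -> float:
    """Share of checks that succeeded, as a percentage."""
    if total_checks <= 0:
        raise ValueError("total_checks must be positive")
    return 100.0 * successful_checks / total_checks


def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Downtime budget, in minutes, implied by an SLA target over a period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - sla_percent / 100.0)
```

For instance, a 99.9% target over 30 days leaves a budget of about 43.2 minutes of downtime. With one check per minute, that is roughly 43 failed checks out of 43,200 before the target is missed.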

The overall architecture of the SaaS monitoring solution is designed to be flexible and adaptable, allowing us to easily add new SaaS apps and monitors as needed. By leveraging existing infrastructure and tools where possible, we were able to minimize development effort and focus on building the core monitoring capabilities required for effective SaaS monitoring.

Some Challenges We Faced

API Rate Limits: Many SaaS providers impose rate limits on their APIs, which makes it challenging to poll endpoints frequently enough to get accurate monitoring data. To address this, we implemented rate limiting and backoff mechanisms in our monitor workers to ensure that we stay within the allowed limits while still collecting the necessary data.
1. This was actually something we had worked around with Slack during the migration from Mattermost. A large portion of the rate limiting logic was therefore already implemented, and it was just a matter of adapting it for this use case.
Credential Management: Managing credentials for multiple SaaS apps can be complex and requires careful handling to ensure security.
1. We chose HashiCorp Vault to store and manage credentials, as we already use it for other purposes. Credentials are only accessible to the monitor workers that need them, and we implemented strict access controls to prevent unauthorized access.
Data Normalization: Different SaaS apps may provide monitoring data in different formats, making it challenging to aggregate and analyze the data.
1. We opted for a data normalization layer in our data collection pipeline to standardize the data format and ensure consistency across different SaaS tools.
Alert Fatigue: With multiple SaaS apps being monitored, there is a risk of generating excessive alerts, leading to alert fatigue amongst the product owners.
1. To mitigate this, we opted to implement intelligent alerting mechanisms that prioritize alerts based on severity and impact, ensuring that only critical issues are escalated to the operations team. We also account for jitter and transient errors to reduce false positives.
Scalability: As the number of monitored SaaS tools grows, the monitoring solution needs to be able to scale accordingly.
1. We designed the system to be modular and scalable, allowing us to easily add new monitor workers and deploy them in multiple regions. We also designed the system to ensure teams can manage their own SaaS app monitors without relying on a centralized team. This ensures we aren’t constantly working on adding and changing new monitors while enabling teams to work at a fast pace.
Maintenance and Updates: SaaS apps are constantly evolving, with new features and changes being introduced regularly. This requires ongoing maintenance and updates to the monitoring solution to ensure compatibility and effectiveness.
1. Due to the ease of adding and modifying monitors, teams can quickly adapt monitors to new changes in the SaaS tools and monitor new endpoints and services as needed.
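The rate-limit handling mentioned above can be sketched as follows. This is a minimal illustration, not our worker code: it assumes the provider signals throttling with HTTP 429 and optionally a Retry-After header given in seconds, and the names backoff_delay and call_with_backoff, as well as the base and cap values, are hypothetical.

```python
import random
import time
from typing import Callable, Mapping, Optional, Tuple

Response = Tuple[int, Mapping[str, str]]  # (status_code, headers)


def backoff_delay(attempt: int, retry_after: Optional[str],
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before the next attempt (attempt starts at 0)."""
    if retry_after is not None:
        try:
            # Honor the server's hint when it is a number of seconds.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # Retry-After can also be an HTTP date; fall back to backoff.
    # Exponential backoff with full jitter: random delay up to the capped bound.
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_backoff(send: Callable[[], Response], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send() until it returns a non-429 response or attempts run out."""
    response = send()
    for attempt in range(max_attempts - 1):
        status, headers = response
        if status != 429:
            return response
        # Header lookup is case-sensitive here; a real client should normalize.
        sleep(backoff_delay(attempt, headers.get("Retry-After")))
        response = send()
    return response
```

Passing the sleep function as a parameter keeps the retry loop testable without real delays. The jitter matters when several workers in different regions hit the same provider, since it spreads their retries out instead of having them all retry at the same moment.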

Some Implementation Details

API Monitors: This is the first type of monitor we are implementing. These monitors interact with the SaaS application’s API to collect relevant metrics and logs. We designed these monitors to be modular and reusable, allowing us to easily add new API monitors for different SaaS apps.
1. The monitors are simple to create: they follow a simple form asking for the endpoint, authentication method, and expected response. From there, the monitor worker handles the rest.
2. We provided several authentication methods, including OAuth2, API keys, and basic authentication, to accommodate the different authentication mechanisms used by various SaaS platforms.
Synthetic Monitors: While not yet implemented, we plan to develop synthetic monitors that simulate user interactions with the SaaS tools. These monitors will help us measure performance and availability from an end-user perspective.
1. We plan to use headless browsers and scripting frameworks to create these synthetic monitors, allowing us to simulate complex user interactions and workflows.
2. The intention is to create a method for product owners to define key user journeys and workflows that can be monitored for performance and availability by defining steps in a simple UI.
3. These synthetic monitors will be designed to be modular and reusable, similar to the API monitors, allowing us to easily add new synthetic monitors for different SaaS applications.
Task Queue: We opted for a centralized task queue to manage the distribution of monitoring tasks to the monitor workers. This allows us to efficiently allocate resources and ensure that monitoring tasks are executed in a timely manner.
1. We had to implement task locking and retry mechanisms to handle Celery’s at-least-once delivery semantics. This ensures that tasks are not executed multiple times and that failed tasks are retried as needed.
Config Management DB: We designed a centralized configuration management database to store monitoring configurations for each SaaS application. This database allows us to easily manage and update monitoring configurations as needed. We also opted to never allow for delete operations on configurations, only disable. This allows us to maintain a history of configurations and easily revert to previous versions if needed.
1. We designed the database schema to be flexible and extensible, allowing us to easily add new configuration parameters and monitoring capabilities as needed. We also store historical incidents in a relational database for easy querying and reporting.
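The task locking mentioned under Task Queue can be illustrated with a minimal Python sketch. It uses an in-memory lock as a stand-in for a shared store with expiring keys, which a multi-worker deployment would need instead. TaskLock, run_once, the key format, and the 300-second TTL are hypothetical names and values, and no task queue library is called here.

```python
import threading
import time
from typing import Callable, Dict


class TaskLock:
    """In-memory stand-in for a shared lock store with expiring keys."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._expires: Dict[str, float] = {}
        self._guard = threading.Lock()
        self._clock = clock

    def acquire(self, key: str, ttl_seconds: float) -> bool:
        """Return True only for the first caller until the TTL expires."""
        now = self._clock()
        with self._guard:
            expiry = self._expires.get(key)
            if expiry is not None and expiry > now:
                return False
            self._expires[key] = now + ttl_seconds
            return True

    def release(self, key: str) -> None:
        with self._guard:
            self._expires.pop(key, None)


def run_once(lock: TaskLock, monitor_id: str, scheduled_at: int,
             check: Callable[[], None], ttl_seconds: float = 300.0) -> bool:
    """Run check() unless this scheduled run was already claimed."""
    key = f"{monitor_id}:{scheduled_at}"
    if not lock.acquire(key, ttl_seconds):
        return False  # duplicate delivery of the same run: skip it
    try:
        check()
    except Exception:
        lock.release(key)  # free the claim so a retry can run the check
        raise
    return True
```

Keying the lock on the monitor and its scheduled time means a redelivered task is skipped, while the next scheduled run of the same monitor still executes. Releasing the lock on failure is what allows the retry mechanism to pick the run up again.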

Conclusion

In less formal terms, developing a custom solution to monitor SaaS applications that were never designed to be monitored was a royal pain. We needed to account for so many different edge cases and unique requirements for each SaaS app. You can also never fully trust a user to properly configure monitors or alerting, so we needed to account for this as well. Despite the challenges, we are well on our way to having a robust SaaS monitoring solution that provides real-time insights into the performance and availability of our SaaS tooling.

If anyone has the drive and will to build a SaaS monitoring product, I think there is a real market gap here. The existing tooling is severely lacking, and there is a real need for effective SaaS monitoring solutions. As the recent outages of major SaaS providers and the single points of failure created by reliance on public cloud services have shown, the need for robust SaaS monitoring is more critical than ever. As organizations, we need to ensure business continuity and optimal user experience in the face of these challenges. Feel free to reach out if you have any questions or would like to discuss further!

Final Notes

This project is still a work in progress and there are many features and improvements that we plan to implement in the future. As such, I will likely do a follow-up blog post once we have more features implemented and have had more time to refine the solution and provide some visual diagrams and data on effectiveness.

Stay tuned!