Are Prometheus & Grafana Sufficient To Support Modern IT?
The Prometheus and Grafana combination is rapidly becoming ubiquitous in the world of IT monitoring. There are many good reasons for this. Both are free, open source toolkits, so they are easy to get hold of and try out, and there is a lot of crowd-sourced help available online for getting started. This even includes documentation from the developers of middleware such as IBM MQ and RabbitMQ.
At Nastel, where we focus on supporting integration infrastructure for leading edge enterprise customers, we are finding that many of our customers have made a policy decision to use Prometheus and Grafana across the board for their monitoring. However, they are finding that it’s not sufficient in all situations.
Business Agility and Improving Time to Market
Speed is King. The business is constantly requesting new and updated applications and architecture, driven by changes in customer needs, competition, and innovation. Application developers must not be the blocker to business. We need business changes at the speed of life, not at the speed of software development.
For Black Friday, a large ecommerce site that typically has 3,000 concurrent visitors suddenly had to handle 1 million in a day! How can they handle more than 300 times as many visitors? If they can’t cope, then this could change from Black Friday to a very red one, with a very public outage: a high-profile loss with serious reputational damage.
IT is constantly evolving. Most companies moved their IT to the agile development methodology and then they added DevOps with automation, continuous integration, continuous deployment, and constant validation against the ever-changing requirements.
With agile, companies reduced application development time from two years to six months. With DevOps it went down to a month, and now, with cloud added in, companies like Netflix and Eli Lilly can get from requirements, to code, to test, to production in an hour. They’ve evolved their architectures from monolithic applications to service-oriented architectures to multi-cloud, containers and microservices. Microservices can quickly be pushed out by DevOps and moved between clouds, and they can be ephemeral, serverless and containerized. They use hyperscale clouds, so named because they have the elasticity to grow and shrink based on these dynamic needs. They have stretch clusters and cloud bursting, so that burstable applications can extend from on-premises into a large temporary cloud environment with the necessary resources for Black Friday and then scale down again.
So now a single business transaction can flow across many different services, hosted on different servers in different countries, sometimes managed by different companies. The microservices are continuously getting updated, augmented, and migrated in real time. The topology of the application can change significantly on a daily basis.
Supporting Agile IT
So how is all this IT supported, and how do we know if it is working? How do we get alerted if there is a breakdown in the transaction flow or if a message gets lost? How can we monitor an application that was spun up or moved for only a short period of time?
By building Grafana dashboards on top of the Prometheus platform, a monitoring, visualization and alerting solution can be constructed which provides visibility of the environment. The question is whether this can be built and adjusted fast enough to keep up with the ever-changing demands of the business and IT.
Nastel’s Integration Infrastructure Platform
Nastel’s Integration Infrastructure Management platform addresses this. Its XRay component dynamically builds a topology view based on the real data that flows through the distributed application. It gives end-to-end observability of the true behavior of your application in a constantly changing agile world. It highlights problem transactions based on anomalies, and enables you to drill down to those transactions, carry out root cause analytics and perform remediation. The Nastel solution receives data from across the IT estate, including from Prometheus, and allows rapid creation and modification of visualizations and alerts in a single tool.
Furthermore, the Nastel technology adds in almost 30 years of experience of supporting large production middleware environments. It has deep granular knowledge of the technologies and issues, and uses learned and derived data to provide AIOps and Observability, in addition to traditional event-based monitoring.
Enhancing your existing infrastructure
Companies use Nastel’s Integration Infrastructure Management (i2M) solution to enhance their existing tools. We take a proactive SRE approach, based on an in-depth understanding of the middleware and monitoring history, to prevent the outage altogether by monitoring key derived indicators such as:
- Latency – time to service a request
- Traffic – demand placed on the system
- Error Rate – rate of failed requests
- Saturation – which resources are most constrained
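As a rough illustration of how these four indicators can be derived from raw request data, here is a minimal Python sketch. It is not Nastel's implementation, nor how Prometheus or Grafana calculate anything. The names `Request` and `golden_signals`, the field names, the use of a 95th percentile for latency, and the idea of passing in used and total capacity are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape for this sketch)."""
    duration_ms: float  # time taken to service the request
    failed: bool        # True if the request ended in an error


def golden_signals(requests, window_seconds, used_capacity, total_capacity):
    """Summarise one observation window into the four indicators."""
    count = len(requests)
    durations = sorted(r.duration_ms for r in requests)

    # Latency: nearest-rank 95th percentile of the service times.
    if count:
        p95 = durations[math.ceil(0.95 * count) - 1]
    else:
        p95 = 0.0

    failures = sum(1 for r in requests if r.failed)

    return {
        "latency_p95_ms": p95,
        # Traffic: requests per second over the window.
        "traffic_rps": count / window_seconds,
        # Error rate: fraction of requests that failed.
        "error_rate": failures / count if count else 0.0,
        # Saturation: how full the most constrained resource is (0.0 to 1.0).
        "saturation": used_capacity / total_capacity,
    }


if __name__ == "__main__":
    sample = [Request(duration_ms=10.0 * i, failed=(i > 8)) for i in range(1, 11)]
    print(golden_signals(sample, window_seconds=5, used_capacity=80, total_capacity=100))
```

In practice the interesting work is in choosing the window, and in deciding which resource's capacity represents saturation for a given piece of middleware, for example queue depth against maximum queue depth.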
Prometheus & Grafana can give high-level monitoring of a static environment, but they are two separate products and time is money. Fixing issues quickly, without war rooms and handovers from support back to development, is critical. Do you have enough diagnostic data, skills and tooling? Is your monitoring provider able to support you on the phone with middleware expertise throughout the outage? Just how important is your integration infrastructure? The Nastel Platform includes Nastel Navigator to quickly fix the problem too.
As business requirements change, it is crucial to be able to change the dashboards and alerts in line with them. Nastel XRay is built with this as a core focus. Its time series database, dashboards and alerting are all integrated. The dashboards are dynamically created and change automatically as the infrastructure and message flows change. This requires minimal set-up time, so monitoring is not a blocker to business. Rather than asking for a screenshot of a monitoring environment in use, ask for a demo of how long it takes to build a dashboard.
Nastel is the leader in Integration Infrastructure Management (i2M). You can read how a leading retailer used Nastel software to manage their transactions here and you can hear their service automation leader discussing how he used Nastel software to manage changing thresholds for peak periods here.
Is your IT environment evolving like this? What are your experiences with Prometheus and Grafana? Please leave a comment below or contact me directly and let’s discuss it.