with Opsgenie in
Jira Service Management
- Overview of Incident Management & IT teams
- Benefits of robust Incident Management
- Atlassian, Opsgenie and Jira Service Management
- What is Opsgenie?
- What does Opsgenie do?
- The evolution of Jira Service Desk into Jira Service Management
- How Opsgenie helps ITSM/DevOps teams stay in control of critical incidents
- Monitoring & logging integrations
- Discovering & setting up Opsgenie in Jira Service Management
Overview of Incident Management & IT Teams
Being woken up while on-call at 2:30 AM for a major incident is sure to make even the most seasoned IT teams freak out. To make it worse, time wasted sifting through notifications and dealing with slowness (or worse, radio-silence) from key stakeholders can result not only in loss of revenue and reputation, but also all kinds of exposures to clients that trust you. Let’s not forget, we’re living in the age of social media, which makes preserving organizational reputation an even more challenging affair.
While operational efficiency is a by-product of modern IT Service Management, an increased reliance on software-enabled workflows means more disruptions in streamlined service processes and, at times, degradation in the quality of services.
In ITIL terminology, such an unplanned interruption is defined as an ‘incident’. This includes events raised by systems as well as events reported directly by users through an interface (a service desk or incident management platform).
Incident Management is the process of responding to such unplanned, critical escalations, restoring services to normalcy as soon as possible, and putting in place the knowledge and action plan to avoid future disruptions and slowdowns. It comprises a set of practices, processes, and solutions that enable teams to detect, investigate, and respond to incidents. Teams manage incidents with a view to minimizing disruptive impact on business operations.
Did You Know?
Reports reveal that the Incident Response Market size is predicted to reach $37.11 billion by 2025, growing at an estimated CAGR of 17.23% during 2020-2025.
Benefits of robust Incident Management
- Improved Mean Time To Resolution (MTTR)
- High visibility of incident status to key team members
- Clarity in business perception of IT as a value center in resolving incidents
- Risk mitigation, cost control and better human resource utilization in an organization
- Aligning Incident Management activities with business goals
- Timely resolution of incidents which results in lower downtime and better service
- Better user satisfaction with delivery of quality IT services
Did You Know?
- Starbucks' use of real-time alerts, as part of its incident management tool, cut response times to disruptions across various operations from 1-2 weeks to 2-3 days.
- Dealing with IT outages and downtime is one of the biggest technical challenges in the digital age, costing North American businesses approximately $700 billion per year.
Atlassian, Opsgenie and Jira Service Management
Atlassian’s history with incident management started with Jira (launched in 2002). Initially introduced as a bug-tracking tool for software teams, its flexible core workflow engine gradually became instrumental for IT operations. By 2011, around 40% of Jira customers were using it for ITSM. Jira Service Desk was introduced in 2013, giving IT teams (Agents) a Jira platform to manage the intake of service requests without requiring licenses for requestors (Customers). As an ITSM solution, Jira Service Desk carved its niche as a lightweight, easy-to-deploy solution, particularly for internal employee support tickets, with full service level agreement features and ITIL-certified capabilities.
Atlassian’s broader investments over the past half decade signaled a further footprint in DevOps and business teams. While Trello brought a layer of ‘prosumer’ project management, AgileCraft (Jira Align) brought into the fold best-in-class Agile-at-scale portfolio management. However, Jira Service Desk, Atlassian’s fastest-growing core product, needed more native, mature features. Fast forward to 2018, and Atlassian acquired Opsgenie.
Did You Know?
50% of Incident Response engagements had insufficient endpoint or network visibility to respond successfully.
What is Opsgenie?
- Founded in 2012
- Acquired by Atlassian on September 4, 2018
- 3200+ customers with over 50,000 users
- Headquartered in Boston, Offices in Virginia & Ankara
- 200+ powerful integrations with the most popular ITSM tools
As Atlassian customers (IT teams) evolved to provide more services, more automation, and more information, Jira Service Desk needed to help them stay aware and in control of high-impact issues in a more innovative, evolved fashion.
Driven by the goal of helping ITSM teams track outages and reduce downtime, Atlassian acquired Opsgenie in September 2018. Massive demand in the workflow automation market, coupled with Atlassian’s aim to extend its reach from just issue tracking to Incident Management, was the driving force behind this $295 million deal.
The opportunity to merge a mature Incident Response platform with Atlassian’s already established, easy-to-deploy, highly extensible and customizable suite of products (and potential 240,000+ customers) was a win-win. The product, as well as the people-centric and innovation-oriented culture of both companies, resulted in the acquisition’s successful outcome.
What does Opsgenie do?
Opsgenie provides a platform that helps organizations manage the barrage of highly critical IT alerts that are an integral part of operating “always-on” IT services. This is achieved by streamlining alerts, sending instant notifications to the right people at the right time, and enabling teams to collaborate and take rapid action.
Opsgenie helps notify all the right people through a sophisticated combination of scheduling, escalation paths, and notifications. The tool also takes factors like time zones and holidays into account as a part of its functionality. Catering to the needs of various teams in a fast-paced digital environment, the product has now evolved into a platform for empowering them to prepare for incidents in advance, collaborate on solutions, and analyze their response processes for enhanced operational efficiency.
The Evolution of Jira Service Desk into Jira Service Management
In late 2020, Atlassian announced the launch of Jira Service Management as a re-brand of Jira Service Desk. The new name itself connotes an increasing reach into the ITSM/ESM space, but the product was more than just a change of name. With new DevOps & IT workflows, Risk Assessment Automation and a host of power-packed features, Jira Service Management stood out as a gamechanger in ITSM solutions.
In a blog post announcing the launch of Jira Service Management, Edwin Wong, Atlassian’s head of product and IT, said that most ITSM tools aren’t fit for modern workflows, as they don’t truly facilitate cooperation and are far too rigid to push “business agility”. Taking Jira Service Desk to the next level, high-velocity ITSM was aimed at providing development and IT operations teams with “a unified platform to collaborate at high velocity”, so that they can react to new challenges more quickly.
The new Jira Service Management brought deeper capabilities of a full enterprise service management solution, including…
- Better visibility, proximity, and responsiveness to related issues/incidents
- Accelerated flow of work between Support, Development & Operations
- Change management built for the DevOps era
- Richer contextual information from software dev as well as infrastructure-related tools
- Improved Agent experience
Most importantly, however, it brought Opsgenie natively into Jira Service Management as a modern incident management platform (for Cloud-hosted customers). All Jira Service Management Cloud plans come with major incident management, on-call scheduling, alerting, incident swarming, and more. With Opsgenie, Atlassian is able to offer customers a simpler approach to dealing with service disruptions: a one-stop shop to manage incident response.
As of February 2, 2021, added features inside Jira Service Management include:
- On-call overrides: Teams can manage any scheduling change or conflict themselves by exchanging shifts and transferring responsibility.
- On-call reminder notifications: Opsgenie keeps team members updated about their duties via automatic notifications on shift timings.
- Alert Enrichment: With no character limits, you can add optional fields to your alerts and attach charts, logs, runbooks, and more for better context.
- Call Routing: Opsgenie’s on-call schedules route each call to the right person at the right time, leaving little room for failure. If no one is available, Opsgenie takes a message, generates an alert, and notifies the right person via their preferred mode of communication. Because call details are attached to the notification, recipients can listen to the message without wasting time.
- Opsgenie mobile app: Designed to fit in with our 'smart' lifestyle, Opsgenie’s latest mobile app version is packed with multiple enhancements, including customizable saved searches, a post-mortem view on the go, optimized login and settings pages, an app-switcher, a quick lock option for added security, and more.
How Opsgenie helps ITSM/DevOps teams stay in control of critical incidents:
- Centralized alerting: Optimize alert flow with dynamic schedules, escalations & custom actions. One can also add prescriptive actions to any alert.
- Streamlined team management: Build and maintain adaptive on-call schedules & custom notification policies.
- Effective collaboration: Work seamlessly across chat environments.
- Automate Incident Response: Execute system commands directly from the Opsgenie interface.
- End-to-end alerting health: Never miss a failure alert with the option to monitor network heartbeats.
- Improved reporting & analytics: In-depth root cause analysis to provide data on team productivity with features like operational efficiency statistics, monthly overview statistics, etc.
Monitoring & Logging Integrations
- New Relic
- AWS CloudWatch
- Automation integrations: These facilitate centralization of patch management, continuous integration, git repository visibility, general issue tracking, etc.
- Chat collaboration: Internal communication tools such as Slack and Microsoft Teams & customer communication tools such as Twilio & Moxtra.
Did You Know?
The Edge Encryption app acts as a bridge between Opsgenie and third-party tools. With this feature, sensitive data are encrypted using the Advanced Encryption Standard.
Discovering & Setting up Opsgenie in Jira Service Management
- How to create a new Jira Service Management ITSM project?
- Get a view of the services you support
- How to manage people on the on-call & Rotation schedule
- Entering/Receiving a report of an Incident
- Initial review and establishing an impacted service
- Creating a major incident
- Working an incident in Opsgenie
- Command center options
- Providing updates to stakeholders
- Investigation with Bitbucket
- Assigning responders and the roles process
- The timeline
- Generating a Post-Mortem Report
- Setting up Alerts
- Using Analytics to improve future performance
- User Management in Opsgenie
1. How to create a new Jira Service Management ITSM project?
To use Opsgenie, you must first create a project in Jira Service Management. A template is available under the ‘change template’ section, which is a new Cloud feature. As long as ‘service project’ is selected for your new project, you can select IT Service Management.
Clicking ‘create here’ automatically creates a service management project with options for Incident Management, along with queues built out for problem management, change request management, etc.
Everything you would want in a full ITSM solution is built in, including ITIL-based workflows.
Opsgenie can act as the intake for issues or problems that come up on your services from any source. Ideally, that source is something like SolarWinds or AWS CloudWatch: a tool that proactively tells us we have a problem.
However, it is not uncommon to have a user sometimes come up and bring to your notice that the system is down, and access is denied. Let’s walk through that scenario.
a) Get a view of the services you support under ‘operations’
You can find this under the new operations menu, whose entries are either views of or direct links to Opsgenie. Under operations, you can easily find a list of the services that the team supports. This segment represents your products, or anything user-facing that your customers may rely on. Having services designated under the operations segment makes it easy to communicate with users about an outage or an event. It also provides better back-end metrics on uptime, along with a back-end view of where incidents have occurred.
b) View people on the on-call & Rotation schedule
This is great for service management team members who may not have full Opsgenie access but need to see who is on call. One look at this section tells you who the on-call people are for the different services the team has set up. Even more handy, you can go straight into Opsgenie to edit the list.
Schedules are defined via the Team > On-call section. In addition, a team schedule is created automatically for each team when the team is created, which can be managed through the team dashboard as well.
Admins (and team admins) can set up schedules with daily, weekly, and custom rotations, and specify the schedule for recipients of the Opsgenie alerts to determine who is notified according to the on-call schedule.
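To make the idea of a rotation concrete, here is a minimal sketch of how a simple round-robin rotation could be resolved to a person for a given moment. It is an illustration only, not how Opsgenie implements schedules; the function name `on_call_at`, the participant names, and the dates are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def on_call_at(participants, rotation_start, shift_length, moment):
    """Return who is on call at `moment` in a simple round-robin rotation.

    Hypothetical helper: each participant covers one shift of `shift_length`,
    in list order, starting at `rotation_start`.
    """
    if moment < rotation_start:
        raise ValueError("moment is before the rotation starts")
    # Dividing two timedeltas with // yields the number of whole shifts elapsed.
    shifts_elapsed = (moment - rotation_start) // shift_length
    return participants[shifts_elapsed % len(participants)]


# Assumed example data: a three-person team on a weekly rotation.
team = ["alice", "bob", "carol"]
start = datetime(2021, 2, 1, 9, 0, tzinfo=timezone.utc)
weekly = timedelta(weeks=1)

# Who gets the 2:30 AM call in the second week of the rotation?
print(on_call_at(team, start, weekly, datetime(2021, 2, 10, 2, 30, tzinfo=timezone.utc)))
```

Passing `timedelta(days=1)` instead of `weekly` turns the same sketch into a daily rotation; a custom rotation is just a different shift length.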
2. Entering/Receiving a report of an Incident
a) Initial review and establishing an impacted service: From the end user’s perspective in the service management portal, all of these request types are part of the ITSM templates that Atlassian has made available. Anyone with access to the portal can submit an incident.
Once submitted, the user (and watchers) can view the ticket along with its status (below).
b) Creating a major incident: On the agent side, once the ticket is picked up, and assuming it is confirmed that a system is down, the agent can escalate it inline as a major incident.
Creating this incident does two things. First, an incident ticket is generated inside Jira Service Management (which can be linked to any new reports of the issue for tracking purposes). Second, an incident is created on the Opsgenie side. That is where escalation to your on-call team comes into play. From this point on, all actions and knowledge are tethered to your investigation and root cause, and eventually to the closure of the incident itself.
3. Working an Incident in Opsgenie
a) Command center options: On the Opsgenie interface, the incident comes up automatically and lets you know ‘Help! Outlook is down’. Users can then navigate to the Incident Command Center.
This is the Incident Command Center, where the way an alert is processed can be customized to your preference. For instance, you can schedule a push notification first, followed by a phone call triggered after five minutes if the push notification is not acknowledged.
‘Impacted Services’ is shown on the left side, which is useful for figuring out whether the disruption is connected to other vital services. If there are interrelated services in your system, you need to make sure people understand the scope of the problem, and that the whole scenario is communicated back to the service desk for awareness.
Any related service management issues are found here. This aggregated view helps when many tickets report a single incident: as your agents become aware of the critical incident, they can tie those tickets together. It is much more convenient to do that clean-up on the back end, while your users learn collectively when services are restored and change/problem tickets can be deployed.
Link to software issues: This section ties into the potential causes section. Say we have gone through the process, identified the root cause of the issue, and can restore service by making a change. This is where we would create that change so that everything is tied together. Doing so makes it clear why an emergency adjustment to production was necessary: it was made to restore service as a result of this particular incident.
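The push-then-phone-call example can be pictured as an ordered list of notification steps that stops once the alert is acknowledged. The sketch below is a simplified model under that assumption; `NotifyStep`, `steps_fired`, and the channel names are hypothetical and do not mirror Opsgenie’s own configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NotifyStep:
    delay_minutes: int  # minutes after the alert is created
    channel: str        # illustrative channel name, e.g. "push" or "voice"


def steps_fired(steps, minutes_since_alert, acknowledged_at=None):
    """Return the channels contacted so far.

    Steps scheduled at or after the acknowledgement time are skipped,
    because an acknowledged alert needs no further notifications.
    """
    fired = []
    for step in sorted(steps, key=lambda s: s.delay_minutes):
        if step.delay_minutes > minutes_since_alert:
            break  # this step (and all later ones) is not due yet
        if acknowledged_at is not None and step.delay_minutes >= acknowledged_at:
            break  # the alert was acknowledged before this step was due
        fired.append(step.channel)
    return fired


# Push notification immediately, phone call after five minutes.
rules = [NotifyStep(0, "push"), NotifyStep(5, "voice")]

print(steps_fired(rules, minutes_since_alert=3))                     # only the push so far
print(steps_fired(rules, minutes_since_alert=6))                     # push, then the call
print(steps_fired(rules, minutes_since_alert=6, acknowledged_at=2))  # call suppressed
```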
b) Providing updates to stakeholders: The right section of the Command Center has a list of stakeholders that agents can associate with a particular service. Any time an issue happens with that service, they are automatically notified.
Stakeholder communications for extended incidents are easier, too. Send updates out to stakeholder groups using the ‘Send Update’ option. Make the most of this to quickly communicate the status of your work to your colleagues, so you can concentrate on addressing the actual issue at hand.
d) Assigning responders and the roles process: What if you want to assign people different roles? Or maybe it’s not an issue that your team is going to be able to address immediately, and you need to bring in the infrastructure team or networking team? Simply add the individuals concerned as Responders.
e) The Timeline: This feature comes more into play once we resolve the issue and move on to the post-mortem process.
4. Generating a Post Mortem Report
Congratulations, the issue associated with the incident is now resolved. But what is next?
Once an incident is closed, a templatized section pops up at the top. All you have to do next is select ‘Create Post-Mortem.’ A page opens where you can write up (using the ITIL methodology) the key pieces of the problem, and it automatically loads in the Command Center Session.
All context is at your fingertips. Look at all the timings associated with the incident including respondents, attachments added, etc. Next are metrics and insights that help you avoid making similar mistakes in future. Examples of these include how long it took people to respond, duration of the outage, and a host of other determinants.
5. Setting up Alerts
Alerts are effective as entry points for potential issues that can come from a variety of sources. For instance, if Outlook is down, the notification could arrive via Microsoft Teams, a Slack message, or even some of the more complex monitoring tools that are in place. These are forms of proactive monitoring.
The alert itself can be seen as a ticket of sorts, but different clients manage this in different ways. For instance, for one of Trundl’s clients, the team configured the system so that when an incident is created, a call is made out to Jira. A ticket is then automatically created for them in their Jira Service Management, which they use to track that particular outage. The bottom line is that it can be set up in accordance with the client’s integration preferences.
If ‘Send Alert Updates Back to Jira Service Management’ is enabled, actions for Jira Service Management are executed in Jira Service Management when the chosen action is executed in Opsgenie for alerts which are created by the Jira Service Management integration.
If ‘Create Jira Service Management Issues for Opsgenie Alerts’ is enabled, actions for Jira Service Management are executed in Jira Service Management when the chosen action is executed in Opsgenie for alerts which have a source other than the Jira Service Management integration.
Any tool or system that can send an email or make an API call can create an alert in Opsgenie. So the options are virtually limitless for how you can monitor your applications and then make sure that people are made aware of a potential issue.
Next, we receive an alert inside the alert queue, and at that point, whoever we’ve defined to be on call in the on-call schedule gets a notification. The next step is to acknowledge that alert and assess it to see whether it’s a false alarm or a real issue.
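As an illustration of the API route, the sketch below builds (but does not send) an HTTP request that would create an alert. It assumes Opsgenie’s public Alert API as commonly documented: a `POST` to `https://api.opsgenie.com/v2/alerts` authenticated with a `GenieKey` header. Verify the endpoint and fields against the current API reference before relying on it. The helper name `build_alert_request` and the API key placeholder are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint (US instance); check the current Opsgenie API reference.
OPSGENIE_ALERTS_URL = "https://api.opsgenie.com/v2/alerts"


def build_alert_request(api_key, message, alias=None, priority="P3", tags=None):
    """Build a create-alert request without sending it (hypothetical helper)."""
    payload = {"message": message, "priority": priority}
    if alias:
        payload["alias"] = alias  # an identifier the sender chooses for the alert
    if tags:
        payload["tags"] = list(tags)
    return urllib.request.Request(
        OPSGENIE_ALERTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"GenieKey {api_key}",
        },
        method="POST",
    )


req = build_alert_request(
    "YOUR-API-KEY",  # placeholder, not a real key
    "Outlook is down",
    alias="outlook-down",
    priority="P1",
    tags=["email"],
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# urllib.request.urlopen(req) would actually send it; omitted so the sketch runs offline.
```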
6. Using Analytics to Improve Future Performance
There are built-in analytics on Opsgenie which primarily revolve around understanding the level of responsiveness and number of alerts raised.
Opsgenie’s reporting and analytics features show a team’s performance in acknowledging and resolving incidents, and how on-call workloads are distributed. This gives insights into the volume of alerts a team has handled over a specific duration, and the corresponding mean-time-to-acknowledge and mean-time-to-resolve.
For instance, Opsgenie’s standard dashboard is used to analyze the monthly alert distribution and response trends. The findings can be shared with the people concerned by exporting the reports in PDF and other formats. Consider this a snapshot of how well your team is doing at responding to and resolving potential issues.
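For readers who want to see what these two metrics boil down to, here is a small sketch that computes mean-time-to-acknowledge and mean-time-to-resolve from alert timestamps. The records, field names, and the `mean_minutes` helper are made up for illustration; Opsgenie computes these figures for you in its dashboards.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(alerts, start_key, end_key):
    """Average minutes between two timestamps, skipping alerts missing the end one."""
    durations = [
        (a[end_key] - a[start_key]).total_seconds() / 60
        for a in alerts
        if a.get(end_key) is not None
    ]
    return mean(durations) if durations else None


# Hypothetical alert records.
alerts = [
    {
        "created": datetime(2021, 2, 2, 2, 30),
        "acknowledged": datetime(2021, 2, 2, 2, 34),
        "resolved": datetime(2021, 2, 2, 3, 10),
    },
    {
        "created": datetime(2021, 2, 3, 14, 0),
        "acknowledged": datetime(2021, 2, 3, 14, 2),
        "resolved": datetime(2021, 2, 3, 14, 20),
    },
]

mtta = mean_minutes(alerts, "created", "acknowledged")  # (4 + 2) / 2 minutes
mttr = mean_minutes(alerts, "created", "resolved")      # (40 + 20) / 2 minutes
print(f"MTTA: {mtta} min, MTTR: {mttr} min")
```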
7. User Management in Opsgenie
Pick a user. Opsgenie will show you what products the person has access to and allow you to change their access status, if need be.
Heartbeats are infrastructure tools that are used to detect if a service is down. Opsgenie offers various heartbeat options to be set up within the system. When Heartbeats are activated, they touch some part of a system to check if it’s reacting. In the event that it doesn’t react, the Heartbeat will alert you that it’s down.
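One common way to implement this kind of check is a ‘dead man’s switch’: the monitored system sends a periodic ping, and the heartbeat is considered expired once no ping has arrived within the expected interval. The sketch below models only that expiry logic, under that assumption; the `Heartbeat` class and its names are hypothetical and are not Opsgenie’s API.

```python
from datetime import datetime, timedelta


class Heartbeat:
    """Tracks the last ping from a monitored system (illustrative model only)."""

    def __init__(self, name, interval):
        self.name = name
        self.interval = interval  # maximum allowed gap between pings
        self.last_ping = None

    def ping(self, now):
        """Record that the monitored system checked in at `now`."""
        self.last_ping = now

    def is_expired(self, now):
        """True when a ping was seen before but none within `interval` of `now`."""
        if self.last_ping is None:
            return False  # never pinged: treated as 'not started' in this sketch
        return now - self.last_ping > self.interval


hb = Heartbeat("nightly-backup", interval=timedelta(minutes=10))
t0 = datetime(2021, 2, 2, 2, 0)
hb.ping(t0)

print(hb.is_expired(t0 + timedelta(minutes=5)))   # still within the interval
print(hb.is_expired(t0 + timedelta(minutes=11)))  # overdue: this is when an alert would fire
```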
This functionality exists in Jira on-premise and the Data Center version, as well. Remember, you can find it on the most recent version of the application ONLY.
You can set an override at any point, for example if someone is sick or you need someone to cover for you while you are on leave.
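Conceptually, an override is a time window that takes precedence over the regular schedule. A minimal sketch of that lookup follows, with hypothetical names (`effective_on_call`) and made-up data; it is not Opsgenie’s implementation.

```python
from datetime import datetime


def effective_on_call(scheduled_user, overrides, moment):
    """Return the override user if `moment` falls inside an override window,
    otherwise the regularly scheduled user. The window end is exclusive."""
    for override in overrides:
        if override["start"] <= moment < override["end"]:
            return override["user"]
    return scheduled_user


# Assumed example: bob covers for alice over a long weekend.
overrides = [
    {"user": "bob", "start": datetime(2021, 2, 5, 9, 0), "end": datetime(2021, 2, 8, 9, 0)},
]

print(effective_on_call("alice", overrides, datetime(2021, 2, 6, 2, 30)))  # inside the window
print(effective_on_call("alice", overrides, datetime(2021, 2, 9, 2, 30)))  # after the window
```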
You can have multiple people notified as soon as a particular alert comes up. Or, let’s say you alert someone who forgot to leave their phone on overnight, and no one acknowledges the alert after about 10 minutes. For such situations, you can have additional tiers of people as on-call coverage.
If you want a person, or even a group of people, to be alerted to something that comes into Opsgenie, you need to set up a schedule. For a group, you would load all of its members into the schedule as being on rotation all the time. Ideally, however, there are a few primary people to reach out to at specific times for any escalation.
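The tiers described here can be thought of as a list of (delay, people) pairs: the longer an alert stays unacknowledged, the more tiers are brought in. The sketch below illustrates that idea with hypothetical names and delays; real escalation policies offer more options than this.

```python
def who_to_notify(tiers, minutes_unacknowledged):
    """Return everyone in every tier whose delay has already elapsed."""
    notified = []
    for delay, people in sorted(tiers, key=lambda tier: tier[0]):
        if minutes_unacknowledged >= delay:
            notified.extend(people)
    return notified


# Assumed policy: primary first, secondary after 10 minutes, leadership after 20.
tiers = [
    (0, ["primary-oncall"]),
    (10, ["secondary-oncall"]),
    (20, ["team-lead", "ops-manager"]),
]

print(who_to_notify(tiers, 0))
print(who_to_notify(tiers, 12))
print(who_to_notify(tiers, 25))
```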
Yes. If an incident is created and you have a Statuspage set up for that particular service, then it is going to be reflected on the Statuspage.
It is also convenient to tie services together for the best Incident Management outcomes. Suppose Jira Software Cloud goes down; this automatically impacts Confluence and Bitbucket too, as they integrate with it. So if a user from the service desk reports that Jira Software Cloud is down and those services are marked as impacted, the incident manager can see that this is also a problem for Bitbucket and Confluence, and put that on the Statuspage as well. Mission accomplished: the people concerned now know that the integrations with Jira Software are not working because Jira is down.
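The way an outage ripples through linked services can be sketched as a walk over a dependency map. The example below is an illustration only: the `dependents` mapping and the `impacted_services` helper are hypothetical, and in practice the relationships are maintained in the product rather than in code.

```python
def impacted_services(dependents, down):
    """Return `down` plus every service that directly or indirectly depends on it."""
    impacted, stack = set(), [down]
    while stack:
        service = stack.pop()
        if service in impacted:
            continue  # already visited; also guards against cycles
        impacted.add(service)
        stack.extend(dependents.get(service, []))
    return impacted


# Assumed relationships, following the example in the text.
dependents = {"Jira Software Cloud": ["Confluence", "Bitbucket"]}

print(sorted(impacted_services(dependents, "Jira Software Cloud")))
```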