An on-call procedure ensures that critical issues and incidents are resolved quickly and effectively. This on-call procedure outlines the steps to take when an issue arises, as well as the roles and responsibilities of the on-call team members.
Expectations for On-Call
- If you are on call, you are expected to be available and ready to respond to BetterUptime pings as soon as possible, and always within the response times set by our SLA in the case of Customer Emergencies. For example, if you have plans away from your workspace while on call, this may require bringing a laptop and a reliable internet connection with you.
- We take on-call seriously. Escalation policies are in place so that if a first responder does not respond fast enough, one or more other team members are alerted. These policies should essentially never be triggered, but they cover extreme and unforeseeable circumstances.
- Because Novu is an asynchronous workflow company, @mentions of On-Call individuals in Discord will be treated like normal messages and no SLA for response will be attached or associated with them.
- Provide support to the release managers in the release process.
- After being on-call, take time off. Being available for issues and outages will wear you out even if you had no pages, and resting is critical for proper functioning. Just let your team know.
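The escalation policy mentioned above can be pictured as an ordered list of steps, each with an acknowledgement timeout. The following is only an illustrative sketch: the `EscalationStep` class, the `who_to_page` function, the responder names, and the timeout values are all hypothetical and do not reflect the actual policy configured in BetterUptime.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EscalationStep:
    # Hypothetical fields: who is paged at this step, and how many minutes
    # they have to acknowledge before the next step is triggered.
    responders: List[str]
    ack_timeout_minutes: int


def who_to_page(steps: List[EscalationStep], minutes_since_alert: int) -> List[str]:
    """Return everyone who should have been paged so far for an alert
    that is still unacknowledged after `minutes_since_alert` minutes."""
    paged: List[str] = []
    threshold = 0  # minutes after which the current step is triggered
    for step in steps:
        if minutes_since_alert >= threshold:
            paged.extend(step.responders)
        threshold += step.ack_timeout_minutes
    return paged


# Example policy (made-up names and timeouts).
policy = [
    EscalationStep(responders=["first-responder"], ack_timeout_minutes=5),
    EscalationStep(responders=["second-responder"], ack_timeout_minutes=10),
    EscalationStep(responders=["engineering-lead"], ack_timeout_minutes=15),
]

print(who_to_page(policy, 0))   # only the first responder
print(who_to_page(policy, 5))   # first and second responder
print(who_to_page(policy, 20))  # everyone in the policy
```

In this sketch, a first responder who acknowledges within the first timeout keeps the later steps from ever firing, which matches the expectation that escalation is a safety net and not the normal path.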
Here is an example on-call procedure that could be used:
- Identify the issue: When an issue or incident occurs, the first step is to identify and confirm the nature of the issue. This can involve gathering information from users, monitoring tools, or other sources to determine the scope and impact of the issue.
- Escalate the issue: Once the issue has been identified, it should be escalated to the on-call team for resolution. This can involve sending an email or SMS notification to the on-call team members, as well as providing details about the issue and any relevant information.
- Assign a primary responder: The on-call team should determine who will be the primary responder for the issue. This should be a team member who has the knowledge, skills, and availability to effectively address the issue.
- Take appropriate action: The primary responder should take appropriate action to resolve the issue, such as restarting a service, rolling back a code change, or applying a patch. If necessary, the primary responder can enlist the help of other team members or seek assistance from external sources such as vendors or support teams.
- Document the incident: After the issue has been resolved, the primary responder should document the incident, including details about the nature of the issue, the steps taken to resolve it, and any lessons learned. This documentation can be used to improve the on-call procedure and prevent similar issues from occurring in the future.
- Follow-up: Once the issue has been resolved and documented, the on-call team should follow up to ensure that the fix has held and that no further action is needed. This can involve checking with users, monitoring tools, or other sources to confirm that the system is functioning as expected.
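The steps above can be sketched as a small incident record that tracks which step comes next and collects the details needed for the write-up. This is a minimal sketch under stated assumptions: the `Incident` class, the step names in `STEPS`, and the rule that steps are completed strictly in order are all hypothetical and are not part of any tool we use.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical short names for the six steps of the example procedure.
STEPS = ["identify", "escalate", "assign", "act", "document", "follow_up"]


@dataclass
class Incident:
    summary: str
    primary_responder: Optional[str] = None
    actions_taken: List[str] = field(default_factory=list)
    lessons_learned: List[str] = field(default_factory=list)
    completed_steps: List[str] = field(default_factory=list)

    def next_step(self) -> Optional[str]:
        """Return the next step to perform, or None when all are done."""
        if len(self.completed_steps) < len(STEPS):
            return STEPS[len(self.completed_steps)]
        return None

    def complete(self, step: str) -> None:
        """Mark a step as done; steps must be completed in order
        (an assumption made for this sketch)."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"expected step {expected!r}, got {step!r}")
        self.completed_steps.append(step)


# Example walk-through with made-up details.
incident = Incident(summary="Elevated error rate on the API")
incident.complete("identify")
incident.complete("escalate")
incident.primary_responder = "first-responder"
incident.complete("assign")
incident.actions_taken.append("Rolled back the latest deployment")
incident.complete("act")
incident.lessons_learned.append("Add an alert on error rate after deploys")
incident.complete("document")
incident.complete("follow_up")
print(incident.next_step())  # None: every step has been completed
```

Keeping the actions taken and lessons learned on the same record makes the documentation step a matter of writing up what was already captured during the response.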
We use BetterUptime to set the on-call schedules, and to route notifications to the appropriate individual(s).