Problem Management Practice
A Problem is described as the unknown cause of one or more Incidents. Problem Management works together with Incident Management and Change Management to ensure that IT service availability and quality are increased by managing and removing defects (bugs) from the IT infrastructure.
The primary goal and objectives of Problem Management are to:
- Prevent Problems and resulting Incidents from occurring
- Eliminate recurring Incidents
- Minimize the impact of Incidents that cannot be prevented
Practice Scope
Problem Management includes activities required to diagnose the root cause of Incidents, to determine the resolution to these causes, and to provide appropriate workarounds so that the organization is able to reduce the impact of Incidents while the Problem still exists.
- Although Incident and Problem Management are separate practices, they are closely related and will typically use the same tools, and may use similar categorization, impact and priority coding systems.
- Problem Management is also responsible for ensuring that the resolution is implemented through the control procedures of Change and Release Management.
RACI Matrix
A RACI Matrix, also known as Responsibility Assignment Matrix (RAM), clarifies to all involved with a practice which activities each person, group, or team is expected to fulfill. It is also helpful in clarifying the staffing model necessary for operation and improvement.
The RACI model specifies that only one role is accountable for an activity, although several people may be responsible, consulted, and informed for parts of the activity.
RACI stands for the four main practice activity roles, as follows:
RACI | Description |
---|---|
A = Accountable | The single owner who is accountable for the final outcome of the activity. |
R = Responsible | The executor(s) of the activity step. |
C = Consulted | The expert(s) providing information for the activity step. |
I = Informed | The stakeholder(s) who must be notified of the activity step. |
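The single-Accountable rule lends itself to an automated check. The following is a minimal sketch, not part of the practice itself: the dictionary layout, the role labels, and the function name `find_raci_violations` are assumptions made for illustration. The sample assignments are transcribed from steps 2.0 and 3.0 of this document, and the check for at least one Responsible role is an extra assumption beyond what the RACI model states here.

```python
from typing import Dict, List

# Hypothetical encoding: activity -> {role: RACI letters}.
# Sample rows follow steps 2.0 and 3.0 of this document.
RACI_CHART: Dict[str, Dict[str, str]] = {
    "2.0 Problem Logging": {
        "Problem Manager": "A",
        "Problem Analyst": "R",
        "Requesting practice": "CI",
    },
    "3.0 Problem Categorization": {
        "Problem Analyst": "AR",
        "Problem Manager": "RI",
        "Requesting roles": "C",
    },
}


def find_raci_violations(chart: Dict[str, Dict[str, str]]) -> List[str]:
    """Return a message for every activity that breaks the RACI rules."""
    violations = []
    for activity, assignments in chart.items():
        # Exactly one role may be Accountable for an activity.
        accountable = [role for role, letters in assignments.items() if "A" in letters]
        if len(accountable) != 1:
            violations.append(
                f"{activity}: expected exactly one Accountable role, "
                f"found {len(accountable)}"
            )
        # Assumption: every activity also needs at least one executor.
        if not any("R" in letters for letters in assignments.values()):
            violations.append(f"{activity}: no Responsible role assigned")
    return violations


if __name__ == "__main__":
    for message in find_raci_violations(RACI_CHART):
        print(message)
```

A chart that passes produces no messages; an activity with two Accountable roles, or none, is reported by name.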
RACI Matrix for Problem Management Practice
Practice Roles Described
Role | Description |
---|---|
Problem Owner | The Problem Owner is responsible for ensuring that all activities defined within the practice are undertaken and that the practice achieves its goals and objectives. |
Problem Manager | The Problem Manager is responsible for practice design and for the day to day management of the practice. The manager has authority to manage Problems effectively through investigation and resolution. |
Problem Analyst | The Problem Analyst is responsible for implementing and executing the Problem practice as defined by the Problem Owner/Manager, and to be a point of contact for escalated issues, questions, or concerns. |
Problem Solving Groups | Problem Solving Groups are technical teams responsible for IT hardware and/or Software components related to a Problem investigation. These groups typically correspond to Level 1, Level 2, and Level 3 support groups of the Incident Management practice. |
Service Desk | The Service Desk organization and functional role. |
Incident Management | The Incident Management practice role. |
Request Fulfilment | The Request Fulfilment practice role. |
Problem Management | The Problem Management practice role. |
Change Management | The Change Management practice role. |
Service Level Management | The Service Level Management practice role. |
Problem Management practice RACI Chart Example
Practice Steps Described
1.0 Problem Identification
- Objective: To reactively and proactively identify Incidents that pose threats to the stability and integrity of IT Services.
- Policy:
- All Incidents and Customer IT Service performance Complaints escalated to Problem Management will be assessed for Root Cause Investigation.
- All significant Incident trends derived from the Service Desk Monthly Incident Reports will be assessed for Problem investigation.
- Input(s):
- Major Incident Record
- Hierarchically Escalated Incident Record
- Service Desk Monthly Incident Reports
- Customer IT Service Performance Complaints
- Output(s): Assessment decision to open Problem Record
- Status: None
- Description:
- (A,R,I) The Problem Manager is Accountable, Responsible, and Informed to ensure that all escalated Incidents and Service Level Complaints are acknowledged and assessed, and that a Problem Record is opened where warranted.
- This role is also Accountable for ensuring the Service Desk Monthly Incident Report is received and analyzed for Incident trends that show significant change to suggest an underlying Problem.
- (R,C,I) The Problem Analyst is Informed of all requests for Problem investigation and for receipt of the Service Desk Monthly Incident Report.
- This role is Responsible and Consulted to analyze the Monthly Incident Report for Incident trends that show significant change to suggest an underlying Problem, and to recommend to the Problem Manager the top trends that should be investigated.
- (R) The Service Desk function is Responsible to provide accurate and complete Monthly Incident Reports.
- (C) All roles requesting Problem investigation (typically the Service Desk, Incident Management, or Service Level Management) are available to be Consulted for details.
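The trend assessment in step 1.0 can be sketched as a simple comparison of the latest month against earlier months. This is an illustration only: the practice does not define what counts as a "significant change", so the thresholds `min_increase` and `min_incidents`, the function name, and the sample counts are all assumptions.

```python
from statistics import mean
from typing import Dict, List


def flag_incident_trends(
    monthly_counts: Dict[str, List[int]],
    min_increase: float = 1.5,   # assumed: latest month must be 1.5x the baseline
    min_incidents: int = 5,      # assumed: ignore categories with very few Incidents
) -> List[str]:
    """Return categories whose latest monthly count rose sharply.

    Each list holds Incident counts per month, oldest first.
    """
    flagged = []
    for category, counts in monthly_counts.items():
        if len(counts) < 2:
            continue  # no history to compare against
        *history, latest = counts
        baseline = mean(history)
        if latest >= min_incidents and latest >= baseline * min_increase:
            flagged.append(category)
    return sorted(flagged)


if __name__ == "__main__":
    # Made-up figures standing in for Service Desk Monthly Incident Reports.
    sample = {
        "Email": [10, 12, 11, 25],
        "VPN": [8, 9, 7, 8],
        "Printing": [1, 0, 1, 3],
    }
    print(flag_incident_trends(sample))
```

The flagged categories are candidates for the Problem Analyst to recommend to the Problem Manager; the decision to open a Problem Record remains a human assessment.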
2.0 Problem Logging
- Objective: To create a single source for documentation of all relevant problem details and for the management of problems.
- Policy: Each Problem that is opened for investigation must be fully and completely documented in a Problem Record.
- Input(s):
- Forwarded Incident Records and Primary Incident Records
- Problem and Known Error records
- Complaint Details
- Incident Trend Reports
- Output(s):
- A detailed Problem Record
- Related Incident Records
- Status: Open
- Description:
- (A) The Problem Manager is Accountable to ensure that a Problem record is opened and documented as the single source for information related to the accepted Problem investigation.
- (R) The Problem Analyst is Responsible to open and document a Problem Record, and is further Responsible to provide:
- All related Problem documents and attachments.
- All known Incident records, documents and attachments.
- (C,I) The Problem requesting practice is Consulted and Informed for all relevant information when logging a Problem Record.
3.0 Problem Categorization
- Objective: To properly categorize every Problem Record to match the category used for Problem trend reports and to match Problem solutions and workarounds to related Incidents.
- Policy: The Categorization Model is used by Problem Management to categorize the Problem Record using the same category as the escalated Incident Record, the same category that identified the Problem from the Incident Trend Reports, or the same category as indicated by a Customer complaint from Service Level Management.
- Input(s):
- Open Problem Record
- Related Incident Records
- Customer Complaint from Service Level Management
- Output(s): Categorized Problem Record
- Status: Open
- Description: All Open Records are first categorized as Problem type. All Records are further categorized using categories from the Categorization Model, and all Records are matched to the related Services as outlined in the Service Catalog.
- (A,R) The Problem Analyst is Accountable and Responsible for properly identifying the category for the Problem Record.
- (R,I) The Problem Manager is Informed and Responsible for assisting in proper categorizing of all Records.
- (C) All roles requesting Problem assessment are available to be Consulted for Problem category details.
4.0 Problem Prioritization
- Objective: To set an appropriate Priority for handling the Problem and for assigning Problem investigation workload.
- Policy: The Prioritization Model is used by Problem Management to prioritize the Problem Record, considering the priorities of escalated and related Incident Records or priority indicated for a Customer complaint coming through Service Level Management.
- Input(s): Open, Categorized Problem Record
- Output(s): Open, Categorized and Prioritized Problem Record
- Status: Open
- Description:
- (A,R) The Problem Analyst is Accountable and Responsible for properly identifying the priority of the Problem Record.
- (R,I) The Problem Manager is Informed and Responsible for assisting in proper prioritizing of all Records.
- (C) All roles requesting Problem assessment are available to be Consulted for Problem priority details.
5.0 Investigation and Diagnosis
- Objective: To determine the Root Cause point(s) of failure of related Incident(s) to assure failures in the IT infrastructure are removed or their impact reduced.
- Policy: The assigned Problem roles shall make all effort to determine the Problem Root Cause point(s) of failure of related Incident(s).
- Input(s): Open, Categorized and Prioritized Problem Record
- Output(s): Diagnosed Problem Record
- Status: Assigned, Diagnosed
- Description:
- (A,R,C,I) The Problem Analyst is Accountable and Responsible for (a) determining the most appropriate Problem Solving Groups (typically one or more IT groups that were involved in the original Incident Records) and (b) for ensuring that all Problems continue to be worked on by assigned Problem Solving Groups, researching both Root Cause and Workarounds.
- The Problem Analyst will be the Single Point of Contact Consulted and Informed of new or changed information to the Problem, and to update the main Problem record.
- The Problem Analyst is Responsible for conveying all new or changed information to all Problem Solving Group roles.
- The Problem Analyst is Responsible for Informing the Service Desk and providing all new or changed information to be related to the Problem Record (in the form of attachments).
- The Problem Analyst is Responsible for assessing changes to Category and Priority based on feedback from the Problem Solving Group.
- (R,C,I) Problem Solving Group roles will be Responsible for researching both Root Cause and Workarounds for Problem Assignments escalated to them.
- The Problem Solving Group roles will be Consulted for information related to the Incident Record, and in turn will be Informed of any relevant new or changed information to the Incident.
- The Problem Solving Group roles are Responsible for conveying all new or changed information to the coordinating Problem Analyst.
- The Problem Solving Group roles are Responsible for conveying perceived changes to Category and Priority with the coordinating Problem Analyst.
- (I,R) The Problem Manager is Informed and Responsible for assisting the Problem Analyst and ensuring the Problem investigation continues to move forward.
6.0 Workaround?
- Objective: To minimize the impact of recurring Incidents caused by Problems where there is no immediate solution to eliminate the Problem.
- Policy: Workarounds are the first priority of Problem investigation and will be conveyed to the Service Desk and Incident Management in order to minimize the Impact of recurring Incidents.
- Input(s): Problem Investigation
- Output(s): Problem Workarounds
- Status:
- Assigned, Workaround
- Assigned, Diagnosed and Workaround (Known Error)
- Description: Workarounds should be raised as early in the Investigation activity as possible in order to minimize recurring Incidents. This activity may happen at the start of the step 5.0 Investigation and Diagnosis activity and happen several times thereafter should new and more relevant workarounds be determined.
- Workarounds may be determined by consulting the Service Desk and Incident Support Level 2/3 roles involved with incident resolutions.
- All effort should be made to find and settle on using the most effective workaround(s) should the Incident reoccur.
- (A,R,I) The Problem Analyst is Accountable for ensuring that the most effective workarounds are identified and communicated back to the Service Desk for recurring Incidents related to Problems.
- The Problem Analyst is Informed and Responsible for Consolidating and Informing Problem Solving Groups of new or changed workarounds.
- The Problem Analyst will Consult with the Problem Manager and the Service Desk to determine the workarounds to be used at the Service Desk and Incident Support Teams.
- (R,C,I) Problem Solving Group roles will be Responsible for researching workarounds for Problems escalated to them.
- The Problem Solving Group roles are Responsible for, and will be Consulted on, workarounds. In turn, they will be Informed of any relevant new or changed workarounds, along with recommendations on which are most effective.
- (C) The Service Desk, Incident Manager and Problem Manager are Consulted for recommendations of which workarounds to use at the Service Desk and Incident Support Teams.
7.0 Create Known Error Record
- Objective: To record the cause of a Problem, once determined, in a Known Error Record that the Service Desk and Incident Support roles can reference for Problem details and workarounds, shortening the recovery period when Incidents recur.
- Policy: All final Problem details and Workarounds are communicated to the Service Desk and Incident Management to be documented in a Problem Record, and incorporated into support scripts and tools for future use when similar Incidents reoccur.
- Input(s): Workarounds
- Output(s):
- Support Scripts and Workarounds
- Updated Problem Record
- Status: Known Error
- Description:
- (A,R) The Problem Analyst is Accountable and Responsible for updating the Service Management System (shared Ticketing tool) with Known Error details.
- (C) Problem Solving Groups will be Consulted for details related to Problems and Workarounds.
- (R,I) The Service Desk and Incident Management practice are Informed of Known Error updates and are Responsible for updating support scripts that reference agreed workarounds and Known Error matching for similar Incidents that may reoccur.
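Known Error matching for similar Incidents can be sketched as a lookup by category and keyword. This is a hypothetical illustration: the `KnownError` fields, the identifier `PRB-001`, the keyword approach, and the function name are assumptions, not features of any particular Service Management System.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class KnownError:
    problem_id: str            # hypothetical identifier format
    category: str              # same category as the related Incidents
    keywords: FrozenSet[str]   # lowercase symptom words, an assumed matching aid
    workaround: str


def match_known_errors(
    category: str, description: str, kedb: List[KnownError]
) -> List[KnownError]:
    """Return Known Errors in the same category sharing at least one keyword."""
    words = set(description.lower().split())
    return [ke for ke in kedb if ke.category == category and ke.keywords & words]


if __name__ == "__main__":
    kedb = [
        KnownError("PRB-001", "Email", frozenset({"timeout", "sync"}),
                   "Restart the mail sync service"),
    ]
    for ke in match_known_errors("Email", "Outlook sync fails with timeout", kedb):
        print(ke.problem_id, "->", ke.workaround)
```

Matching on the shared category is what makes the categorization in step 3.0 useful here: an Incident logged under a different category would not find the workaround.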
8.0 Analyze and Provide Solution Options
- Objective: To assess a range of solution options and make a considered decision to fix the root cause of a Problem or live with the Workaround.
- Policy: All Problems will be assessed for an optimal solution by considering the business impact and cost of raising a Change to correct the Problem Cause, or to live with the Workaround.
- Input(s):
- Support Scripts and Workarounds
- Updated Problem Record
- Output(s): Assessments and Recommendations to Solve the Problem or live with the Workaround
- Status: Assigned, Diagnosed
- Description:
- (A,R) The Problem Manager is Accountable for ensuring that all solution recommendations are assessed for business impact and cost, and that the optimal Problem solution is taken.
- (R,C,I) Problem Solving Groups, the Service Desk, Incident Management and/or Service Level Management are Responsible for providing a range of solutions (where possible) related to Problems they are involved with, and will be Consulted for details and Informed of selected options.
- (R,I) The Problem Analyst is Informed of and Responsible for coordinating and escalating all solution options to the attention of the Problem Manager.
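The trade-off between raising a Change and living with the Workaround can be illustrated with a simple cost comparison. This is a sketch under stated assumptions: the cost model (a one-time cost plus a recurring cost per Incident), the class and function names, and all figures are invented for illustration, and a real assessment would also weigh factors that do not reduce to a single number.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SolutionOption:
    name: str
    one_time_cost: float        # e.g. cost of raising and implementing a Change
    cost_per_incident: float    # business impact each time the Incident recurs
    incidents_per_year: float   # expected recurrence rate with this option

    def total_cost(self, horizon_years: float) -> float:
        """One-time cost plus the recurring Incident cost over the horizon."""
        recurring = self.cost_per_incident * self.incidents_per_year * horizon_years
        return self.one_time_cost + recurring


def recommend(options: List[SolutionOption], horizon_years: float) -> SolutionOption:
    """Pick the option with the lowest total cost over the horizon."""
    return min(options, key=lambda option: option.total_cost(horizon_years))


if __name__ == "__main__":
    # Made-up figures for illustration only.
    options = [
        SolutionOption("Raise a Change", 20000.0, 0.0, 0.0),
        SolutionOption("Live with the Workaround", 0.0, 500.0, 24.0),
    ]
    for years in (1, 3):
        print(years, "year(s):", recommend(options, years).name)
```

With these figures the Workaround is cheaper over one year and the Change is cheaper over three, which shows why the planning horizon should be agreed before options are compared.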
9.0 Change Required?
- Objective: To ensure that all Problem solutions that involve changes to IT Services are handled under change control.
- Policy: All Problem solutions that involve making IT changes will be handled through the Change Management practice.
- Input(s):
- Support Scripts and Workarounds
- Updated Problem Record
- Recommended Solutions to Solve the Problem
- Output(s):
- Submission of a Request for Change (RFC)
- Or Decision to continue Resolution of Problem record
- Status:
- Assigned, Pending Change
- Assigned, Diagnosed
- Description:
- (A,R) The Problem Manager is Accountable and Responsible for ensuring that the chosen solution (if it involves an IT Change to software and/or hardware components) is assessed for handling under Change Management.
- (I,R) The Change Management practice is Informed and Responsible for handling the IT Change.
- (R) The Problem Analyst is Responsible for coordinating the Problem recommendation through Change Management, and may be the Change Requestor.
- (R,I) The Problem Solving Group(s) with ownership of the IT Change components are Informed of the Problem resolution selected and are Responsible for coordinating the Problem resolution through Change Management.
10.0 Resolution
- Objective: To ensure that all Problem solutions have actually resolved the root cause of the Problem.
- Policy: All implemented Problem solutions will be verified to have removed the root cause of the Problem if resolved by a Change, or will be verified to be effective in minimizing the impact of recurring Incidents if the decision is made to live with the workaround.
- Input(s):
- Change Post Implementation Review (PIR)
- Problem Solution not requiring Change Management
- Output(s): Resolved Problem
- Status:
- Resolved Fixed
- Resolved Workaround
- Description: To ensure a Problem has been removed, Problem Resolution may involve a period of monitoring at the Service Desk to ensure like Incidents do not reoccur, and/or to ensure that Workarounds are functioning as expected when like Incidents reoccur.
- (A,R) The Problem Manager is Accountable to ensure that the chosen solution is validated to have removed the root cause of the Problem or that effective Workarounds are in place.
- The Problem Manager is Responsible to make the decision to classify the Problem record as “Resolved”.
- (R,I) The Problem Analyst is Informed and Responsible for assessing the success of Problem resolutions and updating the Problem record.
- (C,I) Problem Solving Groups are Informed of implemented Problem resolutions and Consulted for validation of these resolutions.
- (R,C,I) The Service Desk and Incident Management are Informed of Problem resolutions and are Responsible for updating support scripts as necessary. They will also monitor, and be Consulted for reports on, Incidents related to the Problem record that may recur.
- (I) The Change Management practice is Informed of assessment results to allow Change Records to be closed.
11.0 Major Problem?
- Objective: To assess all Major Problems and their relation to Agreements and Contracts, and make recommendations for follow up (if required) to IT Management and Service Level Management.
- Policy: All Problems of Priority 1 and 2 are considered Major Problems, and will be brought to the attention of IT Management and Service Level Management for assessment and decisions.
- Input(s): Resolved Problem
- Output(s):
- Possible recommendations to Agreements and Contracts
- Possible recommendations to IT Infrastructure
- Status: NA
- Description:
- (A,R) The Problem Manager is Accountable for and may be Responsible for ensuring that all Major Problems are assessed against existing Agreements for breaches or concerns, and that such information is forwarded to the Service Level Management practice.
- The Problem Manager is also responsible to ensure that Major Problems are assessed against the IT Infrastructure, and that such information is forwarded to IT Management.
- (R) The Problem Analyst may be delegated Responsibility for assessing Major Problems against Agreements and Contracts.
- (I) The Service Level Management practice may be Informed to follow up Major Problems that may be a breach of contract.
12.0 Problem Closure
- Objective: To close all open Problem records and related practice activities.
- Policy: The Service Desk shall be responsible to close all Problem Records.
- Input(s): Request to Close Problem Record
- Output(s): Closed Problem Record
- Status:
- Closed, Fixed
- Closed, Workaround
- Description:
- (A,R) The Problem Manager is Accountable and Responsible to inform the Problem Analyst to close the Problem record.
- (R,I) The Problem Analyst is Informed and Responsible to close the Problem record and all related Records (such as Known Error record, but only if the Problem is removed and the workaround is no longer required to be used).
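The Status values listed across steps 2.0 to 12.0 suggest a Problem Record lifecycle, which can be sketched as a small state machine. The status names below are simplified from the steps above (the compound "Assigned, Diagnosed" style statuses are collapsed), and the set of allowed transitions is an assumption inferred from the step order, not a rule stated by the practice.

```python
from typing import Dict, List, Set

# Assumed transitions, inferred from the order of the practice steps.
ALLOWED_TRANSITIONS: Dict[str, Set[str]] = {
    "Open": {"Assigned"},
    "Assigned": {"Known Error", "Pending Change", "Resolved Fixed", "Resolved Workaround"},
    "Known Error": {"Pending Change", "Resolved Fixed", "Resolved Workaround"},
    "Pending Change": {"Assigned", "Resolved Fixed"},
    "Resolved Fixed": {"Closed Fixed"},
    "Resolved Workaround": {"Closed Workaround"},
    "Closed Fixed": set(),
    "Closed Workaround": set(),
}


def change_status(current: str, new: str) -> str:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if current not in ALLOWED_TRANSITIONS:
        raise ValueError(f"Unknown status: {current!r}")
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Transition {current!r} -> {new!r} is not allowed")
    return new


def walk(path: List[str]) -> str:
    """Apply a sequence of statuses in order and return the final one."""
    status = path[0]
    for next_status in path[1:]:
        status = change_status(status, next_status)
    return status


if __name__ == "__main__":
    print(walk(["Open", "Assigned", "Known Error", "Pending Change",
                "Resolved Fixed", "Closed Fixed"]))
```

Encoding the lifecycle this way prevents a record from being closed before it has been resolved, which mirrors the ordering of steps 10.0 and 12.0.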
Terms & Definitions
- Problem
- A cause of one or more Incidents. The cause is not usually known at the time a Problem Record is created and the Problem Management practice is responsible for further investigation.
- Root Cause Analysis (RCA)
- Solving a Problem in a structured and organized manner in order to identify the true and underlying cause(s), categorize the issue and eliminate the cause(s), thus preventing future recurrences.
- Root Cause
- The underlying or original cause of an Incident or Problem.
- Workaround
- Reducing or eliminating the impact of an Incident or Problem for which a full resolution is not yet available, e.g., by restarting a failed CI. Workarounds for Problems are documented in Known Error Records. Workarounds for Incidents that do not have associated Problem Records are documented in the Incident Record.
- Known Error
- A Problem that has a documented root cause and a workaround. Known errors are created and managed throughout their Lifecycle by Problem Management. Known Errors may also be identified by development or suppliers.
- Known Error Database (KEDB)
- A database containing all Known Error Records. This database is created by Problem Management and used by Incident and Problem Management. The Known Error Database is part of the Service Knowledge Management System (SKMS).
- Trend Analysis
- Analysis of data to identify time related patterns. Trend Analysis is used in Problem Management to identify common failures or fragile CI; in Capacity Management as a modeling tool to predict future behavior; and as a management tool for identifying deficiencies in IT Service Management practices.
- Impact
- A measure of the effect of an Incident, Problem or change on business processes. Impact is often based on how service levels will be affected. Impact and urgency are used to assign priority.
- Urgency
- A measure of how long it will be until an Incident, Problem or change has a significant impact on the business. For example, a high-impact Incident may have low urgency if the impact will not affect the business until the end of the financial year. Impact and urgency are used to assign priority.
- Priority
- A category used to identify the relative importance of an Incident, Problem or change. Priority is based on impact and urgency, and is used to identify required times for actions to be taken. For example, the SLA may state that priority 2 Incidents must be resolved within 12 hours.
- Service Level Agreement (SLA)
- Written agreement between a Service Provider and the customer(s) that documents agreed service levels for a service.
- Operating Level Agreement (OLA)
- An agreement between an IT Service Provider and another part of the same organization. An OLA supports the IT Service Provider’s delivery of IT services to customers. The OLA defines the goods or services to be provided and the responsibilities of both parties.
- Underpinning Contract (UC)
- A contract between an IT Service Provider and a third-party. The third-party provides goods or services that support delivery of an IT service to a customer. The UC defines targets and responsibilities that are required to meet agreed service target levels in an SLA.
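The relationship between Impact, Urgency, and Priority defined above can be sketched as a lookup. The three-level scales and the formula `impact + urgency - 1` are assumptions (a common convention, not something this practice prescribes); the rule that Priority 1 and 2 Problems are Major Problems comes from step 11.0.

```python
def priority(impact: int, urgency: int) -> int:
    """Derive a Priority from 1 (highest) to 5 (lowest).

    Assumes Impact and Urgency are each rated 1 (high), 2 (medium) or 3 (low).
    """
    for name, value in (("impact", impact), ("urgency", urgency)):
        if value not in (1, 2, 3):
            raise ValueError(f"{name} must be 1, 2 or 3, got {value}")
    return impact + urgency - 1


def is_major_problem(problem_priority: int) -> bool:
    """Per step 11.0, Problems of Priority 1 and 2 are Major Problems."""
    return problem_priority in (1, 2)


if __name__ == "__main__":
    for impact, urgency in ((1, 1), (1, 2), (2, 2), (3, 3)):
        p = priority(impact, urgency)
        print(f"impact={impact} urgency={urgency} -> priority {p}, "
              f"major={is_major_problem(p)}")
```

An organization's own Prioritization Model may use different scales or a hand-built matrix; the point is only that Priority is derived from the other two values rather than chosen independently.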