@Nadeesha Nanayakkara Good questions. SRE is a maturing area, and some of its concepts are more descriptive than prescriptive. Here is how I would answer your queries:
Ans1: This relates to your earlier query. Yes, an Error Budget is set for a given SLO you have defined. As a more advanced SRE practice, though, you can combine two or more SLOs into what is called a composite SLO and use that to determine the Error Budget. A similar approach works for SLIs (defining composite SLIs) when you want to define a single 'larger' SLO. Either approach can get you to your end goal. The impact of breaching an SLO should primarily help you decide whether your development/build focus goes to new features or to stability, rather than being expressed directly as a $$ value.
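To make the composite idea concrete, here is a minimal sketch of one way to roll several SLIs into a single composite SLI and read the remaining error budget off it. The weighted good/total ratio, the `SliWindow` type, the function names and the sample numbers are all assumptions made for illustration; they are not a standard SRE formula or any tool's API.

```python
from dataclasses import dataclass


@dataclass
class SliWindow:
    """Event counts for one SLI over the budget period (hypothetical type)."""
    name: str
    good_events: int
    total_events: int
    weight: float = 1.0  # assumed relative importance of this SLI


def composite_sli(windows):
    """Weighted ratio of good events to total events across all SLIs."""
    weighted_good = sum(w.weight * w.good_events for w in windows)
    weighted_total = sum(w.weight * w.total_events for w in windows)
    if weighted_total == 0:
        raise ValueError("no events recorded in the window")
    return weighted_good / weighted_total


def error_budget_remaining(sli, slo_target):
    """Fraction of the budget left: 1.0 untouched, 0.0 exhausted, negative overspent."""
    allowed_failure = 1.0 - slo_target
    if allowed_failure <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


# Illustrative numbers only.
windows = [
    SliWindow("availability", good_events=9_990, total_events=10_000),
    SliWindow("latency", good_events=9_950, total_events=10_000),
]
sli = composite_sli(windows)                   # 19,940 good out of 20,000
remaining = error_budget_remaining(sli, 0.995)  # about 40% of the budget left
print(f"composite SLI={sli:.4f}, budget remaining={remaining:.0%}")
```

Raising the `weight` of one SLI is one possible way to let a critical indicator consume the shared budget faster; whether that is appropriate is a decision for the stakeholders of the service.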
Ans2: Determining the right error budget involves multiple factors and is not a decision for one team alone. All stakeholders of the service need to be part of it. One consideration is selecting the right time period. This usually depends on factors such as your development sprint duration, the nature of your service, the stage of your product (early or mature), your industry and your current SLA.
Having a common time period for all SLOs helps you use the error budget effectively, because you can judge how aggressively to add features or enhance functionality versus invest in reliability improvements. If your application has a common dev team for all services, a common time period helps; if you have dedicated dev teams for different services, you can choose different time periods for their SLOs.
Please keep in mind that a core reason to define an error budget is to be able to make go / no-go decisions on development, which can then guide the other decisions you have to make.
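As a rough illustration of how the time period and the go / no-go decision fit together, the sketch below turns an availability SLO and a window length into allowed downtime, then compares it with the downtime already spent. The function names, the `freeze_threshold` parameter and the sample numbers are assumptions for illustration; a real release policy would be agreed with the stakeholders.

```python
def allowed_downtime_minutes(slo_target, window_days):
    """Downtime the SLO tolerates over the window, in minutes."""
    return (1.0 - slo_target) * window_days * 24 * 60


def release_decision(slo_target, window_days, downtime_minutes, freeze_threshold=0.0):
    """Return 'go' while more than freeze_threshold of the budget remains.

    freeze_threshold is a hypothetical policy knob: 0.0 means feature work
    stops only once the budget is fully spent.
    """
    budget = allowed_downtime_minutes(slo_target, window_days)
    if budget <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    remaining_fraction = (budget - downtime_minutes) / budget
    return "go" if remaining_fraction > freeze_threshold else "no-go"


# A 99.9% availability SLO over a 30-day window tolerates about 43.2 minutes.
print(round(allowed_downtime_minutes(0.999, 30), 1))
print(release_decision(0.999, 30, downtime_minutes=20))  # budget left
print(release_decision(0.999, 30, downtime_minutes=50))  # budget overspent
```

Note how the same 20 minutes of downtime would be a much larger share of the budget over a 14-day window than over a 30-day one, which is why the choice of period matters.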
@Shamayel and @Vishnu thanks for the responses. In addition, a few related queries on the Error Budget:
1) I believe the budget is set per SLO (as the unit of consumption would vary for each SLO), but is there a concept of maintaining an application-level budget using a common denominator, for example by assigning a $$ value per unit of the budget?
2) Also, what are your thoughts on the time duration over which the Error Budget is considered? Should it be a common duration for all SLOs of an application?
Hi Nadeesha, it would be good to have SLOs at the individual service level. This lets you track each service's SLIs and find the problematic one when there are issues. It also helps in defining different SLOs based on how the service is used. For example, SLOs can be set higher for critical services than for less critical ones.
What are the best practices around creating SLIs and SLOs? Are they created at a specific service level or at the application level? For example, if application AB is dependent on service A + service B, would an SLI or SLO on 'Availability' be defined for AB as a whole or for each service independently, with the possibility of varying thresholds?
Hi Nadeesha, while it is possible to do it both ways, ideally an SLO should be defined at the service level, for a service whose SLI you can measure independently. This keeps the approach simple and clean with respect to meeting the SLOs. The one challenge is that it can become tricky to map multiple individual SLOs to one single SLA, as SLAs are typically defined at the application level, not at the individual service level. You will have to do some pre-work to map the multiple SLOs to the one SLA, as only then can you ensure a proper relationship between the SLOs and the SLA and avoid SLA breaches.
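One simple form that pre-work can take, for the application AB example in this thread, is sketched below. It assumes AB needs both service A and service B to be up (a serial dependency) and that their failures are independent, so the availability AB can expect is the product of the two service SLOs. The function names and targets are made up for illustration.

```python
import math


def serial_availability(service_slos):
    """Expected availability of an app that needs every listed service.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return math.prod(service_slos.values())


def supports_sla(service_slos, sla_target):
    """True if the combined service SLOs are at least as strict as the SLA."""
    return serial_availability(service_slos) >= sla_target


# Hypothetical availability SLOs for the two services behind application AB.
slos = {"service A": 0.999, "service B": 0.9995}

print(f"{serial_availability(slos):.7f}")    # lower than either SLO on its own
print(supports_sla(slos, sla_target=0.995))  # True: SLOs leave headroom under the SLA
print(supports_sla(slos, sla_target=0.999))  # False: SLOs are too loose for this SLA
```

The point to take away is that the combined figure is always lower than the weakest individual SLO, so per-service SLOs usually need to be stricter than the application SLA they are meant to protect.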
please add me to this community