The History of the SPICE Trials

Phase 2

Purpose and Scope

Phase 2 of the trials started in September 1996.  This phase has a broader scope than the initial phase of the trials, with participation from the whole software engineering community.  As well as evaluating the complete document set and the design decisions behind it, its objectives include providing guidance for applying the emerging standard most effectively.  Phase 2 of the SPICE Trials is evaluating the ISO/IEC PDTR 15504 documents (also known as SPICE Version 2.0).

Two features differentiate the second phase from the first phase of the trials.  First, a much larger number of assessments is expected.  This interim report is based on 30 assessments; however, at the time of writing, data from close to 60 assessments are already available to the trials team.  We expect that this number will increase before the end of the Phase 2 Trials.  Second, more ‘local studies’ were encouraged during this phase.  These are small scale studies conducted in individual SPICE regions with a small number of organisations and assessors.  Such studies focus on a particular issue and attempt to investigate it in detail.  One benefit of this approach is that more studies can be conducted:  this document reports the results of thirteen different studies (including ‘local studies’), whereas only three studies were conducted in Phase 1.  In addition, when there are opportunities to conduct more studies, different empirical research methods can be applied to investigate the same phenomenon.  Here we have surveys as well as what may be classified as field experiments, in addition to the data that is normally collected in the trials database.
 

Summary of Findings

The following is a summary of all the findings in the interim report.  Because this is only an interim report, we refrain at this point from making recommendations for the SPICE Project.  Furthermore, the current studies have raised a number of issues that we are investigating before the end of Phase 2.  The results of these investigations will provide stronger and better supported recommendations, which will be made in the final report.  However, some of the current findings do point to obvious recommended actions.

Trials Assessments and Ratings

The major findings on trials assessments and ratings are:
  1. Only two regions have participated in the trials by providing data thus far:  Europe and South Asia Pacific. 
  2. We have data from 30 assessments conducted in 23 different organisations in these two regions.
  3. There was a good distribution in terms of OU size (both large and small).  However, there was no participation for OUs whose primary business sector was:  business services, petroleum, automotive, aerospace, public administration, consumer goods, retail, health and pharmaceuticals, leisure and tourism, manufacturing, construction, and travel.
  4. Most assessments involved only one project in the OU.
  5. All processes in the Reference Model were covered.  ENG.2 and MAN.1 were covered most extensively.
  6. The median number of process instances per assessment is seven.
  7. In general, we found that the attributes corresponding to the higher capability levels receive the higher ratings less often than those corresponding to the lower capability levels.
  8. In a significant number of cases, process instances fail to achieve a particular capability level because of inadequacies at the previous level, rather than at the level in question.
  9. Approximately 19% of the process instances were at level 0, 50% at level 1, and 19% at level 2.
  10. The most costly activity during an assessment is the collection of evidence, and the least costly is the final presentation.  Preparation of the assessment inputs consumed 15% of the total assessment effort on average.

Part 5 Evaluation

A majority of the assessors who used Part 5 used it as a source of indicators for conducting their assessments.  In general, they found it to be useful and easy to use.  The amount of detail in that document was considered to be a benefit.  Furthermore, ratings at the process instance level were found to be meaningful, but a smaller majority felt that the grouping of processes into categories was meaningful.

The assessors found that the most difficult part of rating was making the distinction between the ‘F’ and ‘L’ responses.  Further guidance on making this distinction in particular would be of benefit.  The practices in Part 5 were consistently found to be helpful in interpreting Part 2.  However, when considering Part 2 by itself, there seemed to be less confidence in understanding and rating the attributes at the higher capability levels.  Furthermore, when considering Part 5 by itself, the assessors doubted that they understood the Work Products, the Work Product Characteristics, and the Process Capability Indicators well enough to make direct, repeatable ratings on them, although they are not required to do so.  More assessors felt it was easier to relate the Management Practices to the OU than it was to relate the Base Practices.

The Reliability of Assessments

We conducted a large number of studies on the reliability of assessments.  The main findings are summarised below:
  1. The SPICE Version 1.0 capability dimension had a high internal consistency, sufficiently high that it is usable for practical purposes.  This consistency serves as a baseline with which to compare ISO/IEC PDTR 15504.
  2. The internal consistency of the complete SPICE Version 1.0 capability dimension (all 26 generic practices) was very similar to the internal consistency of an already widely-used assessment instrument, the 1987 SEI Maturity Questionnaire.
  3. The ISO/IEC PDTR 15504 capability dimension (with nine attributes) still has a very high internal consistency compared to SPICE Version 1.0, which makes it usable in practice.  Therefore, from an internal consistency perspective, the reduction of the capability dimension from 26 items to only nine items did not cause any deterioration.
  4. The interrater agreement of assessments using ISO/IEC PDTR 15504 has been demonstrated to be high.  This is true for both the attribute ratings and the capability level ratings.  Given that this has been one of the major concerns of the developers of ISO/IEC 15504, the evidence presented here should serve as an acknowledgement that their efforts were successful.
  5. Combining the 4-point scale categories into a 2- or a 3-point scale does not have a substantial impact on interrater agreement.  This means that converting the 4-point scale to a 2-point or a 3-point scale by combining categories has not improved interrater agreement.
  6. We now have a benchmark to evaluate the quality of ISO/IEC 15504 assessments in terms of their interrater agreement.  The benchmark consolidates all the data that was collected prior to Phase 2.  This benchmark is actually quite pessimistic and we hope to relax it through the collection of further data.  Being a pessimistic benchmark, it is most useful for identifying the assessments that have quite low reliability.
  7. Assessment team competence and the clarity of the documents are the two most important factors that have an impact on the reliability of assessments, according to the perceptions of experienced assessors.
  8. Trained assessors who have not conducted assessments in the past should not be the primary source of ratings for an assessment because, according to the results of the current study, the ratings of a novice assessor may differ substantially from the ratings of a more experienced assessor.
  9. For low capability levels, there is a difference in reliability between rating processes early in an assessment versus late in an assessment.  It is better to rate the low-capability processes late in the assessment after collecting information about other related processes (i.e. other processes for the project or for the organisation).  For higher-capability processes, it does not make a difference whether ratings are done early or late in an assessment.
  10. From the perspective of future research on the reliability of assessments, the number of previously-performed assessments is a variable that must be controlled to avoid it having an effect on the results.
  11. It is of questionable value to conduct studies in “laboratory” settings, for example, during an assessor training course where the assessors have not conducted actual assessments in the past.  The reason is that the trainee assessors’ ratings may be quite different from the ratings that would be obtained in the field.
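
The effect of combining rating categories on interrater agreement (finding 5) can be illustrated with a small sketch.  It assumes that agreement between two assessors is measured with the unweighted Cohen's kappa; this summary does not state which coefficient the trials team used, nor which categories were combined, so the function names, the category groupings and the sample ratings below are all hypothetical.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equally long lists of ratings."""
    n = len(rater_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Undefined (division by zero) if both raters use one single category.
    return (observed - expected) / (1 - expected)

# Assumed groupings for collapsing the 4-point N/P/L/F scale.
TWO_POINT = {"N": "low", "P": "low", "L": "high", "F": "high"}
THREE_POINT = {"N": "low", "P": "low", "L": "L", "F": "F"}

def collapse(ratings, mapping):
    """Map each rating onto the coarser scale defined by `mapping`."""
    return [mapping[r] for r in ratings]

# Hypothetical attribute ratings from two assessors (not trials data).
assessor_1 = list("FFLLPPNNFL")
assessor_2 = list("FLLLPNNNFF")

kappa_4 = cohen_kappa(assessor_1, assessor_2)
kappa_3 = cohen_kappa(collapse(assessor_1, THREE_POINT),
                      collapse(assessor_2, THREE_POINT))
kappa_2 = cohen_kappa(collapse(assessor_1, TWO_POINT),
                      collapse(assessor_2, TWO_POINT))
```

Comparing kappa_4, kappa_3 and kappa_2 on real paired ratings is one way to check whether a coarser scale changes agreement.  Because the scale is ordinal, a weighted coefficient may be more appropriate than the unweighted one sketched here.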

Measurement of Process Capability

Thus far, there have been two ways of measuring process capability of a process instance in ISO/IEC PDTR 15504:  using the attribute ratings and converting the attribute ratings into capability levels.  We have identified a third approach.

The ISO/IEC PDTR 15504 capability dimension actually has two dimensions:  Process Implementation (levels 1 to 3) and Quantitative Process Management (levels 4 and 5).  These were identified to be two separate constructs that can be measured about a process’ capability (see Section 7).

The rating of attributes is necessary for the latter two capability measures.  Through further studies we plan to identify the efficacy of the third measure.  However, a priori it has the advantage of an empirically-determined internal consistency.

Benchmarking Process Capability

We constructed a number of initial benchmarks that can be used for comparing obtained capability with the capability of other similar processes in the trials database.  The benchmarks indicate the feasibility of performing sophisticated benchmarking using the trials data.  Furthermore, we are still exploring ways for further improving these benchmarks.

Success Factors for SPI

The recommendations presented here are focused on maximising the effectiveness of software process assessments for the purpose of SPI:
  1. A follow-up study of organisations that took part in Phase 1 of the trials indicates that many organisations struggle with achieving successful SPI based on process assessments.  A good proportion of them have not taken the steps that are generally recommended for successful SPI.
  2. We found that the more an organisation’s SPI effort is determined by the findings of an assessment, the greater the extent to which the assessment findings are successfully addressed.  Therefore, it is important to ensure that the SPI effort is determined by the assessment findings.
  3. To increase the possibility that the assessment’s findings determine the SPI effort of the organisation, the following factors were found to be important:
  4. Surprisingly, it was found that increased so-called “Organisational politics” and “ambitious recommendations” from the assessment tend to increase the extent to which the SPI effort is determined by the assessment findings.  Therefore, in this context, these two factors are not necessarily a bad thing!
  5. Ensuring that the SPI effort is determined by the assessment findings is not the only factor that affects the success in addressing the assessment findings.  Other factors that should be taken into account are:

Sensitivity of the Capability Level Determination Scheme

The results of a sensitivity analysis of the capability level determination scheme indicate a few interesting things, with some useful implications for conducting assessments.   We performed the sensitivity analysis by distorting the attribute ratings in the trials database for each process instance and calculating the amount of change in capability level.  Sensitivity analysis can be useful for dealing with confusion between categories on the 4-point rating scales during an assessment.  Confusion can occur, for example, when an assessor is not sure which of two categories to choose on the 4-point achievement scale.
  1. It is clear that distortions downward have a much larger impact on capability levels than distortions upward.  A larger impact here means that more process instances will have their capability levels change.  Therefore, in cases where there is potential confusion between categories, automatically choosing the higher attribute rating will lead to a much smaller error compared to automatically choosing the lower one.
  2. Even if the higher rating is chosen when there is confusion, the sensitivity of the capability levels to upward distortions on the level 1 attribute tends to be larger than for the other attributes.  Therefore, to be prudent, one could avoid automatically choosing the higher rating for attribute 1 when there is confusion between two ratings.  Instead, existing evidence should be reexamined and/or further evidence collected to decide on the rating.
  3. When ratings are distorted upwards to an ‘F’, sensitivity is still not negligible.  Therefore prudence should be exercised when choosing the higher rating where the higher rating is an ‘F’.
  4. The above guideline is especially true for attribute 1.  For example, up to 14% of process instances may be in error for attribute 1, with the next largest sensitivity being up to 6% for attribute 2.2.  The result does amount to only a 3% increase in capability for attribute 1 and around 1-2% for attribute 2.2.  Therefore, especially for attribute 1, prudence should be exercised when following the above guideline and choosing the higher rating where the higher rating is an ‘F’.
  5. Apart from choosing the higher rating when it is an ‘F’, also choosing the higher rating when the lower one is a ‘P’ leads to non-negligible sensitivity.  Therefore, prudence should be exercised in choosing the higher rating when the lower rating is a ‘P’.
  6. The above guideline is especially true for attribute 1 where up to 14% of process instances can increase their capability level, and to a lesser extent, 6% for attribute 2.2.  Therefore, especially for attribute 1, prudence should be exercised when following the above guideline and in choosing the higher rating when the lower rating is a ‘P’.
  7. Distortions in the higher level attributes have negligible impact on capability levels, especially for levels 4 and 5.  This is due to the fact that very few process instances achieve capabilities at the higher levels in practice.
  8. Confusion between the ‘P’ and ‘N’ ratings has no impact on the capability levels.  Therefore, if one is interested only in the capability levels of the process instances as defined in ISO/IEC 15504, then s/he would be justified in paying little attention to deciding whether an attribute is a ‘P’ or an ‘N’.
  9. For the level 2 to level 5 attributes, if one of the two attributes at that level is definitely either a ‘P’ or an ‘N’, it does not matter what the other attribute is.  Therefore, if it is confirmed that one of the attributes is a ‘P’ or an ‘N’, then there is no point in expending effort deciding what the other attribute should be.
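
The kind of sensitivity analysis described above can be sketched as follows.  The sketch assumes the usual ISO/IEC 15504 derivation rule, which is consistent with findings 8 and 9 but is not spelled out in this summary:  a process instance achieves a capability level when the attributes at that level are rated ‘L’ or ‘F’ and all attributes at the lower levels are rated ‘F’.  The attribute labels, function names and sample profiles are hypothetical, and the distortion shown (moving one attribute by one category) is only one possible way of distorting ratings.

```python
# Attributes grouped by capability level (nine attributes in total).
ATTRIBUTES = {1: ["1"], 2: ["2.1", "2.2"], 3: ["3.1", "3.2"],
              4: ["4.1", "4.2"], 5: ["5.1", "5.2"]}
ORDER = [a for lvl in sorted(ATTRIBUTES) for a in ATTRIBUTES[lvl]]
SCALE = "NPLF"  # Not, Partially, Largely, Fully achieved

def parse(profile):
    """Build a ratings dict from a 9-letter profile such as 'FLFPNNNNN'."""
    return dict(zip(ORDER, profile))

def capability_level(ratings):
    """Capability level under the assumed derivation rule."""
    level = 0
    for lvl in range(1, 6):
        current = [ratings[a] for a in ATTRIBUTES[lvl]]
        if not all(r in ("L", "F") for r in current):
            break  # a 'P' or 'N' at this level blocks it and all above
        level = lvl
        if not all(r == "F" for r in current):
            break  # higher levels require this level to be fully achieved
    return level

def distort(ratings, attribute, step):
    """Copy of `ratings` with one attribute moved `step` categories (clamped)."""
    idx = SCALE.index(ratings[attribute]) + step
    idx = max(0, min(len(SCALE) - 1, idx))
    return {**ratings, attribute: SCALE[idx]}

def sensitivity(instances, attribute, step):
    """Fraction of process instances whose capability level changes."""
    changed = sum(
        capability_level(distort(r, attribute, step)) != capability_level(r)
        for r in instances
    )
    return changed / len(instances)

# Hypothetical process instances (not trials data).
instances = [parse("FLFPNNNNN"), parse("LLLNNNNNN"), parse("PNNNNNNNN")]
downward = sensitivity(instances, "1", -1)  # attribute 1 rated one step lower
upward = sensitivity(instances, "1", +1)    # attribute 1 rated one step higher
```

Under this rule a ‘P’ and an ‘N’ at the same position always yield the same capability level, which matches finding 8.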

Download the Phase 2 Report

The interim report from Phase 2 of the Trials can be downloaded as an Adobe Acrobat (PDF) file (1,551 Kb).
 
