The History of the SPICE Trials
Phase 2
Purpose and Scope
Phase 2 of the trials started in September 1996. This phase has a
broader scope than the initial phase of the trials, with participation
from the whole software engineering community. As well as evaluating
the complete document set and design decisions, its objectives include
providing guidance for applying the emerging standard most effectively.
Phase 2 of the SPICE Trials is evaluating the ISO/IEC PDTR 15504 documents
(aka SPICE Version 2.0).
Two features differentiate the second phase from the first phase of the
trials. First, a much larger number of assessments is expected.
For this interim report 30 assessments were conducted; however, at the
time of writing, data from close to 60 assessments are already available
to the trials team. We expect that this number will increase before
the end of the Phase 2 Trials. Second, during this phase more ‘local
studies’ were encouraged. These are small scale studies conducted
in individual SPICE regions with a small number of organisations and assessors.
Such studies focus on a particular issue and attempt to investigate it
in detail. The benefits of this approach include the fact that more
studies can be conducted. In this document, we have the results from
thirteen different studies (including ‘local studies’), whereas in Phase
1 three studies were conducted. In addition, different empirical
research methods can be applied to investigate the same phenomenon when
there are opportunities to conduct more studies. Here we have surveys
as well as what may be classified as field experiments, in addition to
the data that is normally collected in the trials database.
Summary of Findings
The following is a summary of all the findings in the interim report.
We refrain from making recommendations for the SPICE Project at this
point, as this is only an interim report. Furthermore, the current
studies have raised a number of issues that we are investigating before
the end of Phase 2. The results of those investigations will provide
stronger and better supported recommendations, which will be made in the
final report. However, some of the current findings do indicate some
obvious recommended actions.
Trials Assessments and Ratings
The major findings on trials assessments and ratings are:
-
Only two regions have participated in the trials by providing data thus
far: Europe and South Asia Pacific.
-
We have data from 30 assessments conducted in 23 different organisations
in these two regions.
-
There was a good distribution in terms of organisational unit (OU) size
(both large and small).
However, there was no participation from OUs whose primary business sector
was: business services, petroleum, automotive, aerospace, public
administration, consumer goods, retail, health and pharmaceuticals, leisure
and tourism, manufacturing, construction, and travel.
-
Most assessments involved only one project in the OU.
-
All processes in the Reference Model were covered. ENG.2 and MAN.1
were covered most extensively.
-
The median number of process instances per assessment is seven.
-
In general, we found that the attributes corresponding to the higher capability
levels receive the higher ratings less often than those corresponding to
the lower capability levels.
-
In a significant number of cases, process instances fail to achieve a particular
capability level because of inadequacies at the previous level, rather
than at the level in question.
-
Approximately 19% of the process instances were at level 0, 50% at level
1, and 19% at level 2.
-
The most costly activity during an assessment is the collection of evidence,
and the least costly is the final presentation. Preparation of the
assessment inputs consumed 15% of the total assessment effort on average.
Part 5 Evaluation
A majority of the assessors who used Part 5 used it as a source of indicators
for conducting their assessments. In general, they found it to be
useful and easy to use. The amount of detail in that document was
considered to be a benefit. Furthermore, ratings at the process instance
level were found to be meaningful, but a smaller majority felt that the
grouping of processes into categories was meaningful.
The assessors found making the distinction between the ‘F’ and
‘L’ responses during rating the most difficult. Further guidance
for making this distinction in particular would be of benefit. The
practices in Part 5 were consistently found to be helpful in interpreting
Part 2. However, when considering Part 2 by itself, there seemed
to be less confidence in understanding and rating the attributes at the
higher capability levels. Furthermore, when considering Part 5 by
itself, the assessors doubted that they understood the
Work Products, the Work Product Characteristics, and the Process Capability
Indicators well enough to make direct, repeatable ratings on these, although
they are not required to do so. More assessors felt it was easier to relate
the Management Practices to the OU than to relate the Base Practices.
The Reliability of Assessments
We conducted a large number of studies on the reliability of assessments.
The main findings are summarised below:
-
The SPICE Version 1.0 capability dimension had a high internal consistency,
sufficiently high that it is usable for practical purposes. This
consistency serves as a baseline with which to compare ISO/IEC PDTR 15504.
-
The internal consistency of the complete SPICE Version 1.0 capability dimension
(all 26 generic practices) was very similar to the internal consistency
of an already widely-used assessment instrument, the 1987 SEI Maturity Questionnaire.
-
The ISO/IEC PDTR 15504 capability dimension (with nine attributes) still
has a very high internal consistency when compared with SPICE Version 1.0,
high enough to make it usable in practice. Therefore, from an internal consistency
perspective, the reduction of the capability dimension from 26 items to
only nine items did not degrade its internal consistency.
-
The interrater agreement of assessments using ISO/IEC PDTR 15504 has been
demonstrated to be high. This is true for attribute ratings and the
capability level ratings. Given that this has been one of the major concerns
of the developers of ISO/IEC 15504, the evidence presented here should
serve as an acknowledgement that their efforts were successful.
-
Combining the 4-point scale categories into a 2- or a 3-point scale does
not have a substantial impact on interrater agreement. This means
that converting the 4-point scale to a 2-point or a 3-point scale by combining
categories has not improved interrater agreement.
-
We now have a benchmark to evaluate the quality of ISO/IEC 15504 assessments
in terms of their interrater agreement. The benchmark consolidates
all the data that was collected prior to Phase 2. This benchmark
is actually quite pessimistic and we hope to relax it through the collection
of further data. Being a pessimistic benchmark, it is most useful
for identifying the assessments that have quite low reliability.
-
Assessment team competence and the clarity of the documents are the two
most important factors that have an impact on the reliability of assessments,
according to the perceptions of experienced assessors.
-
Trained assessors who have not conducted assessments in the past should
not be the primary source of ratings for an assessment because, according
to the results of the current study, the ratings of a novice assessor may
differ substantially from the ratings of a more experienced assessor.
-
For low capability levels, there is a difference in reliability between
rating processes early in an assessment versus late in an assessment.
It is better to rate the low-capability processes late in the assessment
after collecting information about other related processes (i.e. other
processes for the project or for the organisation). For higher-capability
processes, it does not make a difference whether ratings are done early
or late in an assessment.
-
From the perspective of future research on the reliability of assessments,
the number of previously-performed assessments is a variable that must
be controlled to avoid it having an effect on the results.
-
It is of questionable value to conduct studies in “laboratory” settings,
for example, during an assessor training course where the assessors have
not conducted actual assessments in the past. The reason is that
the trainee assessors’ ratings may be quite different from the ratings
that would be obtained in the field.
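As an illustration only: interrater agreement on a categorical scale is commonly summarised with Cohen's kappa. The summary above does not name the coefficient used in the trials, so the choice of kappa, the two ways of merging categories, and the ratings themselves are assumptions made for this sketch.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who rated the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's marginal frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single, identical category
    return (observed - expected) / (1 - expected)


def merge(ratings, mapping):
    """Relabel ratings according to a category-merging scheme."""
    return [mapping[r] for r in ratings]


# Invented attribute ratings from two assessors on the N, P, L, F scale.
assessor_1 = list("NPLFFLPLLF")
assessor_2 = list("NPLFFLLPLF")

# Hypothetical merge schemes; the text does not say which categories were combined.
THREE_POINT = {"N": "NP", "P": "NP", "L": "L", "F": "F"}
TWO_POINT = {"N": "NP", "P": "NP", "L": "LF", "F": "LF"}

kappa_4 = cohen_kappa(assessor_1, assessor_2)
kappa_3 = cohen_kappa(merge(assessor_1, THREE_POINT), merge(assessor_2, THREE_POINT))
kappa_2 = cohen_kappa(merge(assessor_1, TWO_POINT), merge(assessor_2, TWO_POINT))
print(f"4-point: {kappa_4:.2f}, 3-point: {kappa_3:.2f}, 2-point: {kappa_2:.2f}")
```

In this invented data the two disagreements sit on the ‘P’/‘L’ boundary, which survives both merges, so merging categories lowers kappa rather than raising it. Whether merging helps in practice depends on where the disagreements actually fall.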
Measurement of Process Capability
Thus far, there have been two ways of measuring the capability of a
process instance in ISO/IEC PDTR 15504: using the attribute ratings directly
and converting the attribute ratings into capability levels. We have
identified a third approach.
The ISO/IEC PDTR 15504 capability dimension actually has two dimensions:
Process Implementation (levels 1 to 3) and Quantitative Process Management
(levels 4 and 5). These were identified as two separate constructs
of a process's capability that can each be measured (see Section 7).
The rating of attributes is necessary for the latter two capability
measures. Through further studies we plan to identify the efficacy
of the third measure. However, a priori it has the advantage of an
empirically-determined internal consistency.
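As an illustration only: the internal consistency of a multi-item scale is commonly summarised with Cronbach's alpha. The summary does not say which coefficient underlies its findings, so the use of alpha here is an assumption, as are the numeric coding of the ratings (N=1 to F=4) and the ratings themselves, which are invented. The sketch computes alpha for the five attributes at levels 1 to 3.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha; `rows` holds one list of item scores per rated unit."""
    n_items = len(rows[0])
    if n_items < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two rows")
    if any(len(row) != n_items for row in rows):
        raise ValueError("every row must score the same items")
    item_variances = [variance([row[i] for row in rows]) for i in range(n_items)]
    total_variance = variance([sum(row) for row in rows])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


CODES = {"N": 1, "P": 2, "L": 3, "F": 4}  # assumed numeric coding of the scale

# Invented ratings: one string per process instance, one letter for each of
# the five attributes at levels 1 to 3.
implementation_ratings = ["FFLPP", "FLLPN", "LPPNN", "FFFLP", "PPNNN"]
rows = [[CODES[r] for r in ratings] for ratings in implementation_ratings]

alpha = cronbach_alpha(rows)
print(f"alpha = {alpha:.2f}")
```

The same function could be applied separately to the level 4 and 5 attributes to obtain one coefficient per sub-dimension.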
Benchmarking Process Capability
We constructed a number of initial benchmarks that can be used for comparing
obtained capability with the capability of other similar processes in the
trials database. The benchmarks indicate the feasibility of performing
sophisticated benchmarking using the trials data. We are still
exploring ways to improve these benchmarks further.
Success Factors for SPI
The recommendations presented here are focused on maximising the effectiveness
of software process assessments for the purpose of software process
improvement (SPI):
-
A follow-up study of organisations that took part in Phase 1 of the trials
indicates that many organisations struggle with achieving successful SPI
based on process assessments. A good proportion of them have not
taken the steps that are generally recommended for successful SPI.
-
We found that the more an organisation’s SPI effort is determined by the
findings of an assessment, the greater the extent to which the assessment
findings are successfully addressed. Therefore, it is important to
ensure that the SPI effort is determined by the assessment findings.
-
To increase the possibility that the assessment’s findings determine the
SPI effort of the organisation, the following factors were found to be
important:
-
Senior management monitoring of SPI
-
Compensated SPI responsibilities
-
SPI goals well understood
-
Technical staff involvement in SPI
-
Staff and time resources availability for SPI
-
SPI people well respected.
-
Surprisingly, it was found that increased so-called “Organisational politics”
and “ambitious recommendations” from the assessment tend to increase the
extent to which the SPI effort is determined by the assessment findings.
Therefore, in this context, these two factors are not necessarily a bad
thing!
-
Ensuring that the SPI effort is determined by the assessment findings is
not the only factor that affects the success in addressing the assessment
findings. Other factors that should be taken into account are:
-
Ensuring that SPI goals are well understood
-
Technical staff involvement in SPI
-
Creating process action teams.
Sensitivity of the Capability Level Determination Scheme
The results of a sensitivity analysis of the capability level determination
scheme indicate a few interesting things, with some useful implications
for conducting assessments. We performed the sensitivity analysis
by distorting the attribute ratings in the trials database for each process
instance and calculating the amount of change in capability level.
Sensitivity analysis can be useful for dealing with confusion between categories
on the 4-point rating scales during an assessment. Confusion can
occur, for example, when an assessor is not sure which of two categories
to choose on the 4-point achievement scale.
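The procedure described above can be sketched as follows. This is a minimal illustration, not the trials team's analysis. It assumes the usual reading of the capability level scheme (a level is reached when its own attributes are rated ‘L’ or ‘F’ and all lower-level attributes are rated ‘F’), it assumes a distortion means moving one attribute one step along the N, P, L, F scale, and it uses invented process instances. The function and attribute names are hypothetical.

```python
SCALE = "NPLF"  # Not, Partially, Largely, Fully achieved, in ascending order

# Nine attributes grouped by capability level; "1.1" stands for the text's
# "attribute 1" (assumed naming).
LEVEL_ATTRIBUTES = {1: ["1.1"], 2: ["2.1", "2.2"], 3: ["3.1", "3.2"],
                    4: ["4.1", "4.2"], 5: ["5.1", "5.2"]}
ATTRIBUTES = [a for k in sorted(LEVEL_ATTRIBUTES) for a in LEVEL_ATTRIBUTES[k]]


def capability_level(ratings):
    """Assumed rule: level k needs L or F at level k and F at every lower level."""
    level = 0
    for k in sorted(LEVEL_ATTRIBUTES):
        values = [ratings[a] for a in LEVEL_ATTRIBUTES[k]]
        if any(v not in "LF" for v in values):
            break  # level k is not achieved
        level = k
        if any(v != "F" for v in values):
            break  # level k is not fully achieved, so higher levels are closed
    return level


def distort(ratings, attribute, step):
    """Copy of `ratings` with one attribute moved `step` places along SCALE."""
    position = SCALE.index(ratings[attribute]) + step
    position = max(0, min(len(SCALE) - 1, position))  # clamp at 'N' and 'F'
    return {**ratings, attribute: SCALE[position]}


def sensitivity(instances, attribute, step):
    """Fraction of process instances whose capability level changes."""
    changed = sum(
        capability_level(distort(r, attribute, step)) != capability_level(r)
        for r in instances
    )
    return changed / len(instances)


# Invented process instances, one rating letter per attribute in ATTRIBUTES order.
instances = [dict(zip(ATTRIBUTES, codes)) for codes in
             ("PNNNNNNNN", "LPNNNNNNN", "FLLPNNNNN", "FFLLPNNNN", "LLLNNNNNN")]

baseline_levels = [capability_level(r) for r in instances]
upward = sensitivity(instances, "1.1", +1)
downward = sensitivity(instances, "1.1", -1)
print(baseline_levels, upward, downward)
```

With real trials data, `sensitivity` would be evaluated for every attribute and both directions. The five invented instances here only exercise the mechanics and say nothing about the actual sensitivities.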
-
It is clear that distortions downward have a much larger impact on capability
levels than distortions upward. A larger impact here means that more
process instances will have their capability levels change. Therefore,
in cases where there is potential confusion between categories, automatically
choosing the higher attribute rating will lead to a much smaller error
compared to automatically choosing the lower one.
-
Even if the higher rating is chosen when there is confusion, the sensitivity
of the capability levels to upward distortions on the level 1 attribute
tends to be larger than for the other attributes. Therefore, to be
prudent, one could avoid automatically choosing the higher rating for attribute
1 when there is confusion between two ratings. Instead, existing
evidence should be reexamined and/or further evidence collected to decide
on the rating.
-
When ratings are distorted upwards to an ‘F’, sensitivity is still not
negligible. Therefore prudence should be exercised when choosing
the higher rating where the higher rating is an ‘F’.
-
The above guideline is especially true for attribute 1. For example,
up to 14% of process instances may be in error for attribute 1, with the
next largest sensitivity being up to 6% for attribute 2.2. The result
does amount to only a 3% increase in capability for attribute 1 and around
1-2% for attribute 2.2. Therefore, especially for attribute 1, prudence
should be exercised when following the above guideline and choosing the
higher rating where the higher rating is an ‘F’.
-
In addition to the case where the higher rating is an ‘F’, choosing
the higher rating when the lower one is a ‘P’ also leads to non-negligible sensitivity.
Therefore, prudence should be exercised in choosing the higher rating when
the lower rating is a ‘P’.
-
The above guideline is especially true for attribute 1 where up to 14%
of process instances can increase their capability level, and to a lesser
extent, 6% for attribute 2.2. Therefore, especially for attribute
1, prudence should be exercised when following the above guideline and
in choosing the higher rating when the lower rating is a ‘P’.
-
Distortions in the higher level attributes have negligible impact on capability
levels, especially for levels 4 and 5. This is due to the fact that
very few process instances achieve capabilities at the higher levels in
practice.
-
Confusion between the ‘P’ and ‘N’ ratings has no impact on the capability
levels. Therefore, if one is interested only in the capability levels
of the process instances as defined in ISO/IEC 15504, then one would be
justified in paying little attention to deciding whether an attribute is
a ‘P’ or an ‘N’.
-
For the level 2 to level 5 attributes, if one of the two attributes at
that level is definitely either a ‘P’ or an ‘N’, it does not matter what
the other attribute is. Therefore, if it is confirmed that one of
the attributes is a ‘P’ or an ‘N’, then there is no point in expending
effort deciding what the other attribute should be.
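The last two observations follow directly from the form of the level determination rule, and they can be checked exhaustively. The sketch below assumes the usual reading of that rule (a level is reached when its own attributes are ‘L’ or ‘F’ and all lower-level attributes are ‘F’) and an attribute order of one level 1 attribute followed by two attributes for each of levels 2 to 5. Both are assumptions for illustration, as are all the names.

```python
from itertools import product

# Positions of each level's attributes in a nine-character rating string
# (assumed order: one level 1 attribute, then two per level for levels 2-5).
LEVEL_SLOTS = [(0,), (1, 2), (3, 4), (5, 6), (7, 8)]


def capability_level(profile):
    """Assumed rule: level k needs L or F at level k and F at every lower level."""
    level = 0
    for k, slots in enumerate(LEVEL_SLOTS, start=1):
        values = [profile[i] for i in slots]
        if any(v not in "LF" for v in values):
            break
        level = k
        if any(v != "F" for v in values):
            break
    return level


# Capability level of every possible rating profile (4**9 of them).
levels = {"".join(p): capability_level(p) for p in product("NPLF", repeat=9)}

# Check 1: turning every 'P' into an 'N' never changes the level.
pn_matters = any(levels[p] != levels[p.replace("P", "N")] for p in levels)

# Check 2: at levels 2-5, once one attribute is 'P' or 'N', the level is the
# same whatever the other attribute at that level is.
partner_matters = False
for slots in LEVEL_SLOTS[1:]:
    for fixed, other in (slots, slots[::-1]):
        for p in levels:
            if p[fixed] in "NP":
                baseline = p[:other] + "N" + p[other + 1:]
                if levels[p] != levels[baseline]:
                    partner_matters = True

print(pn_matters, partner_matters)
```

Under the assumed rule both flags come out `False`, which matches the two observations. By contrast, confusing ‘P’ with ‘L’ does matter: `levels["PNNNNNNNN"]` is 0 while `levels["LNNNNNNNN"]` is 1.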
The interim report from Phase 2 of the Trials can be downloaded as an Adobe Acrobat
(PDF) file (1,551 Kb).