Earlier this week I presented a very influential paper to our reading group: Damaged Merchandise? A Review of Experiments that Compare Usability Evaluation Methods, by Wayne Gray and Marilyn Salzman. Reading it again reminded me why it had such an impact on me the first time around, and I thought I’d share why I think it’s such a worthwhile read, even 11 years after it was published.
The paper critiques 5 prominent studies (i.e. published in prominent academic publications and subsequently cited) that compared different Usability Evaluation Methods (UEMs). It finds that in each study the experimental design casts doubt on the validity of the conclusions drawn.
In a clear and accessible fashion, the paper:
- outlines the value of UEMs in interface design, and explains the relative merits of empirical UEMs (involving watching users interact with a system) and analytical UEMs (using some pre-defined knowledge to methodically assess the system for potential barriers) in identifying true barriers and providing the design team with information necessary to fix them;
- reminds us that the value of experiments is in establishing causality (that X causes Y) and generality (that X will cause Y across different circumstances);
- introduces 4 measures of validity that can be applied to an experiment (from Quasi-Experimentation: Design and Analysis Issues for Field Settings; Cook T and Campbell D, 1979);
- uses these measures to identify ‘threats to validity’ that might exist in the design of an experiment;
- treats each UEM comparison as a case study of how the validity of an experiment, and of the results it presents, can be questioned;
- offers advice for minimising threats to validity through experimental design and analysis.
What are the four measures of validity? Two concern causality, and two concern generality.
- Causality issues:
  - Statistical conclusion validity: concerning whether real differences do exist between experiment groups. Did the experiment really find differences in the results of using different UEMs? Threats include low numbers of participants, a lack of appropriate statistical analysis, and ‘random heterogeneity’ (the influence of wildcard participants on results). This is explored in a post on bias in DIY usability testing.
  - Internal validity: concerning whether measured differences are causal or correlational. Were these differences definitely due to using different UEMs? Or could some other factor have influenced results? Selection (of participant groups) and setting (conditions under which the experiment was carried out) can influence internal validity.
- Generality issues:
  - Construct validity: in the words of the authors, “are the experimenters manipulating what they claim to be manipulating?” (this is causal construct validity) and “are they measuring what they claim to be measuring?” (this is effect construct validity).
  - External validity: how valid are claims that results can be generalised across different settings and persons?
There is a fifth validity issue: conclusion validity, which is threatened when conclusions are not based on the data generated by the experiment. The authors note the tendency of usability evaluators to include general ‘good advice’ amongst conclusions based on the findings of an experiment, when the data gathered cannot possibly support this advice. If it is accepted as good advice, it should be presented as such, not as the findings of the experiment.
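The point about low participant numbers under statistical conclusion validity can be made concrete with a small calculation. What follows is a minimal sketch of my own, not anything taken from the paper: it runs an exact permutation test on the difference in the mean number of problems found by two groups of evaluators. The function names and all of the counts are hypothetical, invented purely for illustration.

```python
from itertools import combinations


def mean_difference(pooled, indices_a):
    """Absolute difference in means between group A (picked by index) and the rest."""
    chosen = set(indices_a)
    a = [x for i, x in enumerate(pooled) if i in chosen]
    b = [x for i, x in enumerate(pooled) if i not in chosen]
    return abs(sum(a) / len(a) - sum(b) / len(b))


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in group means.

    Every way of relabelling the pooled scores into two groups of the
    original sizes is enumerated; the p-value is the share of relabellings
    whose mean difference is at least as large as the one observed.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = mean_difference(pooled, range(n_a))
    splits = list(combinations(range(len(pooled)), n_a))
    # Small tolerance so that ties are not lost to floating-point rounding.
    extreme = sum(1 for s in splits if mean_difference(pooled, s) >= observed - 1e-12)
    return extreme / len(splits)


# Hypothetical counts of usability problems found per evaluator.
print(permutation_p_value([4, 5, 6], [6, 7, 9]))     # 0.2
# Even a complete separation between the groups cannot go below 0.1.
print(permutation_p_value([1, 2, 3], [10, 11, 12]))  # 0.1
```

With three evaluators per group there are only 20 ways to split the six scores, so the smallest two-sided p-value this test can produce is 2/20 = 0.1. Under these assumptions, no result from such a study could reach the conventional 0.05 threshold, however large the real difference between the methods.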
Why is this work important? Well, given that these studies were selected as being of particularly high impact in the community, there is potential for major decisions to have been made relating to using one UEM over another, or for further research to have been conducted, based on unsafe assertions. What’s not clear to me, 11 years on, is just how much the flaws identified in these studies have affected the design of usable technology.
But more practically, for all of us who do usability or accessibility testing, this paper reminds us of the difference between analytical evaluation methods and empirical methods. There’s a danger that our eagerness to promote what we believe is best practice may obscure what we actually find out in empirical testing (the “guideline compliance vs designing for humans” argument in another form). Finding participants can be difficult, and finding disabled participants for testing is harder still, so sample sizes are often small. User involvement is of course still recommended for the valuable insight it brings, but presenting results with due qualifications and caveats is essential.
For a lot of people, this stuff will be nothing new – it’s basic good practice in science. But, like many people who have come into applied science from other areas, I don’t have a background in rigorous experimental design. And while designing major experiments is not something I do often, knowing how to devise and follow a process of generating new knowledge that is reliable and repeatable – such as conducting a usability testing programme of a software application or web application – is certainly wisdom worth having.
- Damaged Merchandise – the original paper, and a rejoinder – commenting on feedback the authors received.
- Wikipedia on experimental design.