
Measure Development and Testing Research: Some Pointers

Have you ever had one of those dreams where you’re running towards something, and the faster you go the further away it seems to get? That, to me, is what doing research in the measure development field seems like. Every time I think I have mastered the key methods, some bright spark seems to have come up with a new procedure or analysis that is de rigueur for publishing in the field. No matter: I have to say that developing measures has been one of the most satisfying and even exhilarating elements of my research career, however humbling it might be at times. And, indeed, having gone from knowing next to nothing about measure development to creating, or helping to test, some fairly well-used measures (including one with my name on it, the Cooper-Norcross Inventory of Preferences!), I’m pretty confident that it’s a research process that anyone—who’s willing to devote the time—can get involved in.

And, of course, the point of developing and validating measures is not just for the narcissistic glory. It’s research that can help to define phenomena and explore their relationship to other factors and processes. Take racial microaggressions in therapy, for instance. Measures can help us see where these are taking place, what’s leading to them, and assess methods for reducing their prevalence. Of course, the downside of measures is that they take complex phenomena and reduce them down to de-contextualised, linear variables. But, in doing so, we can examine—over large, representative samples—how these variables relate to others. Do different ethnic groups, for instance, experience different levels of racial microaggressions in therapy? We could use qualitative methods to interview clients of different ethnicities, but comparing their responses and drawing conclusions is tricky. Suppose, for instance, that of the Afro-Caribbean clients, four identified ‘some’ microaggressions, two ‘none’, and three ‘it depended on the therapist’. Then, for the Asian clients, we had two saying, ‘I wasn’t sure’, three saying ‘no’, and two saying, ‘it was worse in the earlier sessions’. And one Jewish client felt that their therapist made an anti-Semitic comment while another didn’t. So who had more or less? By contrast, if Afro-Caribbean clients have an average rating of 3.2 on our 1 to 5 scale of in-therapy racial microaggressions, and Asian clients have an average rating of 4.2, and our statistical analysis shows that the likelihood of this difference being due to chance is less than 1 in 1,000 (see blog on quantitative analysis), then we can say something much more definitive.

From a pluralistic standpoint, then, measure development research—like all research methods—has a particular value at particular points in time: it all depends on the question(s) that we are asking. And while, as we will see, it tends to be based on positivistic assumptions (that there is a real, underlying reality—which we can get closer to knowing through scientific research), it can also be conducted from a more relativist, social constructionist perspective (that no objective ‘reality’ exists, just our constructions of it).

What is Measure Development and Testing Research?

Measure development research, as the name suggests, is the development of ‘measures’, ‘scales’, or ‘instruments’ (also known as the field of psychometrics); and measure testing research is assessing those measures’ quality. Measure development studies will always involve some degree of measure testing, but you can have measure testing studies that do not develop or alter the original measure.

A measure can be defined as a means of trying to assess ‘the size, capacity, or quantity of something’: for instance, the extent to which clients experience their therapist as empathic, or therapists’ commitment to a spiritual faith. In this sense (and particularly from a positivist standpoint), we can think of psychological measures as a bit like physical measures, for instance rulers or thermometers: tools for determining what’s out there (like the length of things, or their temperature).

Well known examples of measures in the counselling and psychotherapy field are the CORE-OM (Clinical Outcomes in Routine Evaluation – Outcome Measure), which measures clients’ levels of psychological distress; and the Working Alliance Inventory, which measures the strength of therapist-client collaboration and bond. There’s more information on a range of widely used ‘process’ and ‘outcome’ measures for counselling and psychotherapy here.

Measures generally consist of several ‘items’ combined into a composite score. For instance, on the CORE-OM, two of the 34 items are ‘I have felt terribly alone and isolated’ and ‘I have felt like crying’. Respondents are then asked to score such items on a response scale (these vary widely between measures): on the CORE-OM, for instance, clients are asked to rate the items from 0 (‘not at all’) to 4 (‘most or all of the time’), such that a total score can be calculated. Note, in this way, measures are different from ‘questionnaires’, ‘surveys’, or ‘checklists’ that have lots of different items asking about lots of different things. Indeed, as we will see, the ‘combinability’ of items into one, or a few, scales tends to be a defining feature of measures.

A measure can consist of:

  • One scale. An example is the Relational Depth Frequency Scale, which measures the frequency of experiencing relational depth in therapy.

  • Two or more scales. An example is the Cooper-Norcross Inventory of Preferences, which has scales for ‘client preference for warm support vs focused challenge’, and ‘client preference for past focus vs present focus’.

  • Two or more subscales: meaningful in their own right, but also summable to make a main scale score. An example is the Strengths and Difficulties Questionnaire for children, which has such subscales as ‘peer problems’ and ‘emotional symptoms’, combining together to make a ‘total difficulties’ score.

Generally, a single-scale measure or a subscale will have between about four and ten items. Fewer than that and the internal consistency starts to become problematic (see below); more than that and the measure may be too long to complete, with items that become redundant.

Measures can be designed for completion by therapists, by clients, or by observers. They can also be nomothetic (where everyone completes the same, standardised items), or idiographic (where people develop their own items, for instance on a Goals Form).

Underlying Principles

Most measure development and testing research is underpinned by a set of principles known as classical test theory. These are fairly positivistic, in that they assume that there are certain dimensions out there in the world (known as latent variables) that exist across all members of the population, and exist independently of our constructions of them. So people’s ‘experiencing of racial microaggressions’ is a real thing, just like people’s temperature or the length of their big toe: it’s an actual, existent thing, and the point of our measure is to try and get as close as possible to accurately assessing it.

You might think, ‘If we want to know about clients’ experiences of racial microaggressions in therapy, why don’t we just ask them the question, “To what extent do you experience racial microaggressions in your therapy?”’ The problem is, from a classical test theory perspective, a respondent’s answer (the ‘observed score’) is going to consist of two components. The first component is going to be the part that genuinely reflects their experiencing of microaggressions (the ‘true score’ on the latent variable). But, then, a second part is going to be determined by various random factors that influence how they answer that specific question (the ‘error’). For instance, perhaps the client doesn’t understand the word ‘microaggressions’, or misunderstands it, so that their responses to this particular item don’t wholly reflect the microaggressions that they have experienced. Here, what we might do is to try and minimise that error by asking the question in a range of different ways—for instance, ‘Did your therapist make you feel bad about your race?’ ‘Did your therapist deny your experiences of racism?’—so that the errors start to even out. And that’s essentially what measure development based on classical test theory is all about: developing measures that have as little error as possible, so that they’re evaluating, as accurately as they can, respondents’ true positioning on the latent variable. No one wants a broken thermometer or a wonky ruler and, likewise, a measure of the experiencing of racial microaggressions in therapy that only reflects error variance isn’t much good.
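To make this concrete, here’s a minimal simulation sketch in Python (all the numbers are invented for illustration). Each observed item score is modelled as the respondent’s true score plus random error, and you can see how averaging across several items lets the errors start to even out:

```python
import numpy as np

rng = np.random.default_rng(42)
n_respondents, n_items = 500, 8

true_score = rng.normal(0, 1, n_respondents)        # latent 'true' level
error = rng.normal(0, 1, (n_respondents, n_items))  # random error, per item

# Classical test theory: observed score = true score + error
observed = true_score[:, None] + error

# A single item is a noisy read-out of the latent variable...
r_single = np.corrcoef(true_score, observed[:, 0])[0, 1]
# ...but the mean of several items tracks it much more closely
r_composite = np.corrcoef(true_score, observed.mean(axis=1))[0, 1]

print(f"single item vs true score:  r = {r_single:.2f}")    # around .7 here
print(f"8-item mean vs true score:  r = {r_composite:.2f}")  # around .94 here
```

That, in a nutshell, is why measures use several items rather than one question: the more (decent) items, the less the error dominates.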

As you can see, all this is based on very positivist assumptions: a ‘true’, underlying (i.e., latent) reality out there in the world; acting according to laws that are true for us all; and with ‘error’ like an uninvited guest that we’re trying to escort out of the party. Not much room for the existence of unpredictability, chaos, or individual uniqueness; or the idea that ‘reality’ is something we construct according to social mores and traditions. Having said that, adopting classical test theory assumptions, for the purposes of measure development, doesn’t mean you have to be a fully-fledged positivist. From a pragmatic standpoint, for instance, you can see measure development as a means of identifying and assessing something of meaning and importance—but whether or not it is something ‘real’ can be considered a moot point. We know, for instance, that there is something like racial microaggressions that can hurt clients and damage the therapeutic relationship, so we can do our best to find ways of assessing it, while also acknowledging the inherent vagaries of whatever we do. And, perhaps, what we call ‘racial microaggressions’ will change over time and vary across cultures and individuals, but that shouldn’t stop us from trying to get some sort of handle on it, so that we can do our best to find out more and intervene.

Developing a Measure

So how do you actually go about developing a measure? It might seem like most measures are developed on the back of the proverbial ‘fag packet’ but, OMG, it is vastly more complicated and time-consuming than that. I worked out that, when Gina di Malta (with myself and Chris Evans) developed the 6-item Relational Depth Frequency Scale, it took something like six years! That’s one year per item.

That’s why, for most of us who have developed measures, the first thing we say to people who want to develop their own is: see if you can use measures that are already out there. That’s unless you really have the time and resources to do the work that’s needed to develop and validate your own measure. Bear in mind, a half-validated measure isn’t really a validated measure at all.

So why does it take so long? To a great extent, it’s because there’s a series of stages that you need to go through, detailed below. These aren’t exact, and every measure development study will do them slightly differently, but the sections below should give you a rough idea of what steps a measure development study will take.

Defining the Latent Variable

Before you develop a measure, you have to know what it is that you are trying to measure. To some extent, this may emerge and evolve through your analysis, but the clearer you are about what you’re looking for, the more likely your measure will be fit for finding it.

‘I’d like to know whether clients feel that they’ve got something out of a session.’ OK, great, but what do we mean by ‘got something out of’? Is this feeling that they’ve learnt something, or finding the session worthwhile, or experiencing some kind of progress in their therapy? ‘Maybe all of those things.’ OK, but feeling like you’ve learnt something from a session may not necessarily correlate with feeling like you’ve made progress. They may seem similar, but perhaps some clients feel there’s a lot they’ve learnt while still coming out of a session feeling stuck and hopeless.

Things that just naturally seem to go together in your mind, then, may not do so in the wider world, and disentangling what you want to focus on is an important starting point for the measure development work. How do you do that? Read the literature in the area, talk to colleagues, journal, look at dictionaries and encyclopaedias: think around the phenomenon—critically—as much as you can. What you want to identify is one discrete variable, or field, that you can really, clearly define. It could be broader (like ‘the extent to which clients value their sessions’) or narrower (like ‘the extent to which clients feel they have developed insight in their sessions’), but be clear about what it is.

Item Generation

Once you know what latent variable you want to measure, the next step is to generate items that might be suitable for its assessment. At this stage, don’t worry too much about whether the items are right or not: brainstorm—generate as many items as you can. In fact, one thing I’ve learnt over the years is that you can never have too many items at this stage, and often you can have too few. Probably around 80% or so of items end up getting discarded through the measure development process, so if you want to end up with a scale of around 5-10 items, you probably want to start with around 25-50 potential ones. Bear in mind that you can always drop items if you get to the end of the measure development process and have too many, but it’s much more difficult to generate new items if you get to the end and find you have too few.

Ideally, you want to do this item generation process in one or more systematic ways, so it is not just the first, ad hoc, items that come into your head. Some strategies for generating items are:

  • Search the literature on the topic. Say we wanted to develop a measure to assess the extent to which adolescent clients feel awkward in therapy (we’re interested in differences in awkwardness across types of therapies, and types of clients). So let’s go to Google Scholar to see what papers there are on young people’s awkwardness in therapy, and we should also check the more established psychological search engines like PsycINFO and Web of Science (if we have access, generally through a university). Suppose, there, we find research where young people say things like, ‘I felt really uncomfortable talking to the counsellor’ or ‘The therapist really weirded me out’. We can use statements like these (or modified forms of them) as items for our measure, and they might also trigger ideas for further items, like ‘I felt really comfortable talking to the counsellor’ (a reverse of the first statement here), or ‘The therapist seemed really weird’ (a modification of the second statement).

  • Interviews and focus groups. Talk to people in the target population to see what terms they use to talk about the phenomena. For instance, an interview with young clients about their experiences of counselling (approved, of course, through the appropriate ethical procedures) might be an ideal way of finding out how they experience ‘awkwardness’ in therapy. What sort of words do they use to talk about it? How does it feel to them?

  • Dictionaries and thesauruses. Always a valuable means of finding different synonyms and antonyms for a phenomenon.

Remember, what you are trying to do is to generate a range of items which are, potentially, a means of ‘tapping into’ your latent variable. Have a mixture of phrasings, with some items that are as closely worded to your latent variable as possible (for instance, ‘I felt awkward in therapy’), but others that come at it from a different angle, providing ‘triangulation’ (for instance, ‘The interaction with my therapist seemed unusual’). It’s also good to try reversing some items (so, for instance, having items that are about not feeling awkward, as well as feeling awkward)—though having such items in a final scale is no longer considered essential.
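One small practical point that follows from this: reverse-keyed items need to be flipped before you combine anything into a total score. A minimal sketch in Python/pandas, with hypothetical items and scores on a 1-5 scale:

```python
import pandas as pd

# Hypothetical responses to three 1-5 'awkwardness' items; item_2 is
# positively worded ('I felt really comfortable...'), so it is reverse-keyed
df = pd.DataFrame({
    "item_1": [4, 2, 5],   # 'I felt awkward in therapy'
    "item_2": [2, 4, 1],   # 'I felt really comfortable talking to the counsellor'
    "item_3": [5, 1, 4],   # 'The interaction with my therapist seemed unusual'
})

SCALE_MIN, SCALE_MAX = 1, 5

# Flip the reverse-keyed item so that high always means 'more awkward'
df["item_2"] = (SCALE_MIN + SCALE_MAX) - df["item_2"]

df["total"] = df[["item_1", "item_2", "item_3"]].sum(axis=1)
print(df)
```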

At this point, you’ll also need to start thinking about your response categories: the ways that people score your items. For instance, do people rate the items on a 3- or 5-point scale, and what labels might you use to describe these different points? This is an enormous field of science in itself, and usually it’s best to keep it simple and use something that’s already out there, so that it’s been tried and tested. For instance, suppose you decide to develop your own five-point scale with labels like 1 = Not at all, 2 = A really small amount, 3 = Quite a bit, 4 = Moderately, 5 = Mostly: how do you know that ‘Quite a bit’ means less to people than ‘Moderately’? And couldn’t the difference between 2 and 3 (A really small amount and Quite a bit) be a lot more than the difference between 4 and 5 (Moderately and Mostly)? So have a look at what other validated and well-used measures use as response categories and see if anything there suits. Two common ones are:

  1 = Strongly disagree
  2 = Moderately disagree
  3 = Mildly disagree
  4 = Mildly agree
  5 = Moderately agree
  6 = Strongly agree

Or:

  1 = Not at all
  2 = Only occasionally
  3 = Sometimes
  4 = Often
  5 = Most or all of the time

At this point, you’ll also need some idea of how you’ll phrase the introduction to your measure. Generally, you’ll want to keep it as short as possible, but there may be some essential instructions to give, such as who or what to rate. For instance, for our racial microaggressions measure, we might want to say something like:

Please think of your relationship with your current therapist. To what extent did you experience each of the following?

In this instance, we might also consider it essential to say whether or not the clients’ therapists will see their scores, as this may make a big difference to their responses.

Testing Items

Expert Review

The next stage of the measure development process is to pilot test our items. What we would do is to show each of our items to experts in the field (ideally experts by experience, as well as mental health professionals)—say between about 3 and 10 of them—and ask them to rate each of our potential items for how ‘good’ they are. We could do this as a survey questionnaire, on hard copy, or through questionnaire software such as Qualtrics. An example of a standardised set of questions for asking this comes from DeVellis’s brilliant book on scale development. Here, experts can be asked to rate each item on a four-point scale (1 = not at all, 2 = a little, 3 = moderately, and 4 = very well) with respect to three criteria:

  1. How well the item matches the definition of our latent variable (which the experts are provided with)

  2. How well formulated the item is for participants to fill in

  3. How well, overall, the item is suited to the measure

Once the responses are in, those items with the lowest ratings (for instance, with an average < 3) can be discarded, leaving only the most well formulated and suitable items to go forward for further testing and analysis.
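This filtering step is simple to compute. Here’s a sketch in Python/pandas, simplified down to one overall 1-4 suitability rating per item from a hypothetical panel of four experts:

```python
import pandas as pd

# Hypothetical ratings: rows = experts, columns = candidate items
ratings = pd.DataFrame({
    "item_1": [4, 3, 4, 4],
    "item_2": [2, 1, 3, 2],
    "item_3": [3, 4, 3, 4],
}, index=["expert_A", "expert_B", "expert_C", "expert_D"])

item_means = ratings.mean()
retained = item_means[item_means >= 3].index.tolist()

print(item_means.round(2))
print("Items going forward:", retained)   # item_2 is discarded (mean = 2.0)
```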

The Three-Step Test Interview

Something else that I’ve learnt, from Joel Vos, that can be really useful for selecting items in these early stages is called The Three-Step Test Interview. This essentially involves asking a few respondents (ideally the kind of people the measure is for) to ‘think aloud’ while completing the measure, and then to answer some interview questions about their experiences and perceptions of completing the measure. This, then, gives us a vivid sense of what the experience of completing the measure is like, and what’s working and what’s not. Through this process, for instance, it might become evident that certain items—even if the experts thought they were OK—don’t make much sense to participants, or are experienced as boring or repetitive. And respondents might also have ideas for how items can be better worded. Again, questions that don’t work well can be removed at this stage and, potentially, new or modified items could be added (though bear in mind they haven’t been through the expert review process).

Exploratory Psychometrics

You’re now at the stage of sending your measure out to survey. The number of respondents you need at this stage is another question that is a science in itself. However, standard guidance is a minimum of 10 respondents per item, with other guidance suggesting at least 50 respondents overall if the aim is to detect one dimension/scale, and 100 for two (see, for instance, here).

At this point, you almost certainly want to be inviting respondents to complete the measure online: for instance, through Qualtrics or Survey Monkey. Hard copies are an option, but add considerably to the processing burden and, these days, may make prospective participants less likely to respond.

Ideally, you want respondents to be reflective of the people who are actually going to use the measure. For instance, if it’s a measure intended for use with a clinical population, it’s not great if it’s been developed only with undergraduate students or with just your social media contacts. Obviously, it’s also important to aim for representativeness across ethnicity/race, gender, age, and other characteristics.

If you’ve got funding, one very good option here can be to use an online participant recruitment service, such as Prolific. This is, essentially, a site where people get paid to complete questionnaires; and because it draws on such a large pool of people, from all over the world, it means you’ve got more chance of recruiting the participants you need. We used this, for instance, to gather data on the reliability and validity of the Cooper-Norcross Inventory of Preferences (see write-up here), and it allowed us to get US and UK samples that were relatively representative in terms of ethnicity, gender, and age—not something we could have easily achieved just by reaching out to our contacts.

Once you’ve got your responses back, you’re on to the statistical analysis. The aim, at this point, is to get to a series of items that can reliably assess one or more latent dimensions, in a way that is as parsimonious as possible (i.e., with the fewest items necessary). This scale shortening process can be done in numerous ways, but one of the most common starting points is to use exploratory factor analysis (EFA).

EFA is a system for identifying the dimension(s) that underlie scores from a series of items. It’s a bit like putting an unknown liquid on a dish and then boiling it off to see what’s left: perhaps there are crystals of salt, or maybe residues of copper or gold. EFA has to be done using statistical software, like SPSS or R (not Excel), and you need to know what you’re doing and looking for. On a 1-10 scale of difficult stats, it’s probably about a 5: not impossible to pick up, but it does require a fair degree of training, particularly if you don’t have a psychology degree. What follows (as with all the stats below) is just a basic overview to give you an idea of the steps that are needed.

The first thing you do in EFA is to see how many dimensions actually underlie your data. For instance, the data from our ‘experiences of racial microaggression’ items may suggest that they are all underpinned by just one dimension: How much or how little people have experienced microaggressions from their therapists. But, alternatively, we may find that there were more latent dimensions underlying our data: for instance, perhaps people varied in how much they experienced microaggressions, but also the degree to which they felt hurt by the microaggressions they experienced. So while some people could have experienced a lot of microaggressions and a lot of hurt, others might have experienced a lot of microaggressions but not much hurt; and any combination across these two variables might be possible.

What EFA also does is to help you see how well different items ‘load’ onto the different dimensions: that is, whether scores on the items correlate well with the latent dimension(s) identified, or whether they are actually independent of all the underpinning dimensions on the measure. That way, it becomes possible to select just those items that reflect the latent dimension well, discarding those that are uncorrelated with what you have actually identified as a latent scale. It’s also common to discard items that load onto multiple scales: what you want is items that are specifically and uniquely tied to particular latent variables. Beyond that, there are many other decision rules that can be used for selecting items. For instance, you might want items that have a good range (i.e., drawing responses along the full length of the scale), rather than all scores clustering in the higher or lower regions; and the items also need to be meaningful when grouped together. So this process of scale shortening is not just a manualised one, following clearly-defined rules, but a complex, nuanced, and selective art: as much alchemy as it is science.
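To give a flavour of what this looks like in practice, here’s a rough sketch in Python using the factor_analyzer package. The file name, item columns, and .4 loading cut-off are illustrative assumptions, and in a real analysis you’d support the factor-count decision with scree plots or parallel analysis rather than the Kaiser criterion alone:

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer  # pip install factor_analyzer

# Hypothetical data file: respondents x items, 1-5 ratings
df = pd.read_csv("microaggression_items.csv")

# Step 1: how many dimensions? A rough first check is how many eigenvalues
# of the correlation matrix exceed 1 (the Kaiser criterion)
ev, _ = FactorAnalyzer().fit(df).get_eigenvalues()
n_factors = int((ev > 1).sum())

# Step 2: fit the EFA with that many factors and an oblique rotation
# (oblique, because psychological dimensions usually correlate)
efa = FactorAnalyzer(n_factors=n_factors, rotation="oblimin")
efa.fit(df)

loadings = pd.DataFrame(efa.loadings_, index=df.columns)
print(loadings.round(2))

# Candidate keepers: items loading > .4 on exactly one factor
clean = loadings[(loadings.abs() > 0.4).sum(axis=1) == 1]
print("Items loading cleanly on a single factor:")
print(clean.round(2))
```

The numbers the software gives you are only a starting point, though: the nuanced item-selection judgments described above still have to be made by you.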

By the end of this exploratory process, you should have a preliminary set of items for each scale or subscale. And what you’ll then need to do is to look at the items for each scale or subscale and think about what they’re assessing: how will you label this dimension? It may be that the alchemical process leads you back to what you set out to find: a ‘prevalence of racial microaggressions’ dimension, for instance. But perhaps what crystallised out was a range of factors that you hadn’t anticipated. When we conducted our first Cooper-Norcross Inventory of Preferences study, for instance (see here), we didn’t really know what preference dimensions would emerge from it. I thought, for instance, that we might find a ‘therapist directed vs client directed’ dimension, as we did, but I was surprised to see that there was also a ‘focused challenge vs warm support’ dimension emerging as well—I had just assumed that therapist directiveness and challenge were the same thing.

Testing the Measure

As with exploratory measure development, there are numerous methods for testing the psychometric properties of a measure, and procedures for developing and testing measures are often iterative and overlap. For instance, as part of finalising items for a subscale, a researcher may assess the subscale’s internal reliability (see below) and, if problematic, adjust its items. These tests may also be conducted on the same sample that was used for the EFA, or else a new sample of data may be collected with which to assess the measure’s psychometric properties.

Two basic sets of tests exist that most researchers will use at some point in measure development research: the first concerned with the reliability of the measure and the second concerned with its validity.

Basic Reliability Tests

The reliability of a measure is the extent to which it produces consistent, reproducible estimates of an underlying variable. A thermometer, for instance, that gave varied readings from one moment to the next wouldn’t be much use.

  • Internal consistency is probably the most important, and frequently reported, indicator of a scale’s ‘goodness’ (aside from when the measure is idiographic). It refers to the extent to which the different items in the scale all correlate together to measure the same thing. If the internal consistency is low, it means that the items, in fact, are not particularly well associated; if high, it means that they are all aligned. Traditionally, internal consistency was assessed with a statistic called ‘Cronbach’s alpha (α)’, with a score of .7 or higher generally considered adequate. Today, there is increasing use of a statistic called ‘McDonald’s omega (ω)’, which is seen as giving a less biased assessment. (There’s a simple sketch of calculating alpha after this list.)

  • Test-retest reliability is very commonly used in the field of psychology, but is, perhaps, a little less prevalent in the field of counselling and psychotherapy research, where stability over time is not necessarily assumed or desired. Test-retest reliability refers to the stability of scores over a period of time, where you would expect people to score roughly the same on a measure (particularly if it is a relatively stable trait). If respondents, for instance, had wildly fluctuating scores on a measure of self-esteem from one week to the next, it would suggest that the measure may not be tapping into this underlying characteristic. Test-retest stability is often calculated by simply looking at the correlation of scores from Time 1 to Time 2 (an interval of about two weeks is typically used), though there are more sophisticated statistics for this calculation. Assessing test-retest reliability requires additional data to be collected after the original survey—often with a subset of the original respondents.

  • Inter-rater reliability is used where you have an observer-completed measure. Essentially, if the measure is reliable, then different raters should be giving approximately the same ratings on the scales. In our assessment of an auditing measure for person-centred practice in young people, for instance (see here), we found quite low correlations between how the raters were assessing segments of person-centred practice. That was a problem, because if one rater, on the measure, is saying that the practice is adherent to person-centred competencies, and another is saying it isn’t, then it suggests that the measure isn’t a reliable means of assessing what is and is not a person-centred way of working.
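As promised, here’s a minimal sketch of calculating Cronbach’s alpha directly from its formula in Python (the scores below are invented):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items array of scores for ONE scale or subscale."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 1-5 ratings from six respondents on a four-item scale
scores = np.array([
    [4, 4, 5, 4],
    [2, 1, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [3, 4, 3, 3],
])

print(f"alpha = {cronbach_alpha(scores):.2f}")  # .70+ is generally considered adequate
```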

Basic Validity Tests

The validity of a measure is the extent to which it measures the actual thing that it is intended to. Validity can be seen as the ‘outward-facing’ element of a measure (how it relates to what is really going on in the world), whereas reliability can be seen as the ‘inward-facing’ element (how the different parts within it relate together).

  • Convergent validity tends to be the most widely emphasised, and reported, test of validity in the counselling and psychotherapy research field. It refers to the extent that scores on the measure correlate with scores on a well-established measure of a similar construct. Suppose we were developing a measure to assess how prized clients feel by their therapists. No measure of this exact construct exists out there in the field (indeed, if one did, we wouldn’t be doing this work), but there are almost certainly other scales, subscales, or even individual items out there that we’d expect our measure to correlate with: for instance, the Barrett-Lennard Relationship Inventory’s ‘Level of Regard’ subscale. So we would expect to find relatively high correlations between scores on our new prizing measure and those on the Level of Regard subscale, say around .50 or so. If the correlations were zero, it might suggest that we weren’t really measuring what we thought we were. But bear in mind that correlations can also be too high. For instance, if we found that scores on our prizing measure correlated extremely closely with scores on Level of Regard (> .80 or so), it would suggest that our new measure is pretty redundant: the latent variable we were hoping to tap has already been identified as Level of Regard. Assessing convergent validity means that, in our original survey, we might also want to ask respondents to complete some related measures. That way, we don’t have to do a further round of surveying to be able to assess this psychometric property. (There’s a brief sketch of this kind of correlation check after this list.)

  • Divergent validity is the opposite of convergent validity, and is essentially the degree to which our scale or subscale doesn’t correlate with a dimension that should be unrelated. For instance, our measure of how prized clients feel wouldn’t be expected to correlate with a measure of their degree of extraversion, or their level of mental wellbeing. If it did, it would suggest that our measure is measuring something other than what we think it is. Measures of ‘social desirability’ are good tools to assess divergent validity against, because we really don’t want our measure to be associated with how positively people try to present themselves. As with assessing convergent validity, assessing divergent validity means that we may need to add a few more measures to our original survey, if we don’t want to go through a subsequent stage of additional data collection.

  • Structural validity is the degree to which the scores on the measure are an adequate reflection of the dimensions being assessed. EFA, as discussed above, can be used to identify one or more underlying dimensions, but this structure needs validating in further samples. So this means collecting more data (or splitting the original data into ‘exploratory’ and ‘confirmatory’ subsamples), and then the new data can be analysed using a procedure called confirmatory factor analysis (CFA). CFA is a complex statistical process (about a 9 on the 1-10 scale), but it essentially involves testing whether the new data fits to our ‘model’ of the measure (i.e., its hypothesised latent dimension(s) and associated items). CFA is a highly rigorous check of a measure, and it’s a procedure that’s pretty much essential now if you want to publish a measure development study in one of the higher impact journals.

  • Sensitivity to intervention effects is specific to outcome measures, and refers to the question of whether or not the measure picks up on changes brought about by therapy. We know that therapy, overall, has positive benefits, so if scores on a measure do not show any change from beginning to end of intervention, it suggests that the measure is not a particularly valid indicator of mental wellbeing or distress. To assess this sensitivity, we need to use the measure at two time points with clients in therapy: ideally at the start (baseline) and at the end (endpoint). Measures that show more change may be particularly useful for assessing therapeutic effects. For instance, in our psychometric analysis of a goal-setting measure for young people (the Goal Based Outcome Tool), we found that this measure indicated around 80% of the young people had improved in therapy, as compared with 30% for the YP-CORE measure of psychological distress.
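And here’s the correlation check sketched above for convergent and divergent validity. In practice, it’s little more than a correlation matrix between your new measure’s total score and the comparison measures you built into the survey (the file and column names here are hypothetical):

```python
import pandas as pd

# Hypothetical survey data: total scores on our new 'prizing' measure, an
# established similar construct, and a construct that should be unrelated
df = pd.read_csv("validity_survey.csv")

print(df[["prizing", "level_of_regard", "extraversion"]].corr().round(2))

# Rough expectations, following the logic above:
#   prizing vs level_of_regard: moderately high (~.5), but not > .8 (redundancy)
#   prizing vs extraversion:    near zero (divergent validity)
```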

Advanced Testing

…And there’s more. That’s just some of the basic psychometric tests and, like I said earlier, there seem to be new ones to catch up with every day, with numerous journals and books on the topic. For instance, testing for ‘measurement invariance’ seems to be becoming increasingly dominant in the field: this uses complex statistical processes to look at whether the psychometrics of a measure are consistent across different groups, times, and contexts (this is about a 15 out of 10 for me!). And then there’s ‘Rasch analysis’ (see here), which uses another set of complex statistical procedures to explore the ways that respondents are scoring items (for instance, is the gap between a score of ‘1’ and ‘2’ on a 1-5 scale the same as the gap between ‘3’ and ‘4’?). So if you’re wanting to publish a measure development study in the highest impact journals, you’ll almost certainly need to have a statistician—if not a psychometrician—on board with you, if you’re not one already.

Developing Benchmarks

Once you’ve got a reliable and valid measure, you may want to think about developing ‘benchmarks’ or ‘cutpoints’, so that people know how to interpret the scores from it. This can be particularly important when you’re developing a clinical outcome measure. Letting a client know, for instance, that they’ve got a score of ‘16’ on the PHQ-9 measure of depression, in itself, doesn’t tell them too much; letting them know that this is in the range of ‘moderately severe depression’ means a lot more.

There’s no one way of defining or making benchmarks. For mental health outcome measures, however, what’s often established is a clinical cut-off point (which distinguishes between those who can be defined as being in a ‘clinical range’ and those in a ‘non-clinical range’); and a measure of reliable change, which indicates how much someone has to change on a measure for it to be unlikely that this is just due to chance variations. For instance, on the Young Person’s CORE measure of psychological distress, where scores can vary from 0 to 40, we established a clinical cut-off point of 10.3 for males in the 11-13 age range, and a reliable change index of 8.3 points (see here). The calculations for these benchmark statistics are relatively complex, but there are some online sites which can help, such as here. You can also set benchmarks very simply: for instance, for our Cooper-Norcross Inventory of Preferences, we used scores for the top 25% and bottom 25% on each dimension as the basis for establishing cut-off points for ‘strong preferences’ in each direction.
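For those who want to see the mechanics, here’s a sketch of the standard Jacobson and Truax calculations behind these two benchmarks, in Python. The reliability, means, and SDs below are invented for illustration, not real norms for any measure:

```python
import math

reliability = 0.85                      # e.g. the measure's internal consistency
sd_clin, mean_clin = 7.0, 20.0          # clinical sample SD and mean
sd_nonclin, mean_nonclin = 5.0, 8.0     # non-clinical sample SD and mean

# Reliable change: how large a pre-post difference must be before it is
# unlikely (p < .05) to be just measurement error
sem = sd_clin * math.sqrt(1 - reliability)   # standard error of measurement
rci = 1.96 * math.sqrt(2) * sem              # reliable change index

# Clinical cut-off 'c': the point between the clinical and non-clinical
# population means, weighted by their standard deviations
cutoff = (sd_clin * mean_nonclin + sd_nonclin * mean_clin) / (sd_clin + sd_nonclin)

print(f"Change of more than {rci:.1f} points counts as reliable change")
print(f"Scores above {cutoff:.1f} fall in the clinical range")
```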

The Public Domain

Once it’s all finalised and you’re happy with your measure, you still need to think about how you’re going to let others know about it. There are some journals that specifically focus on the development of measures, like Assessment, though they’re by no means easy to get published in. Most counselling and psychotherapy journals, however, will publish measure development studies in the therapy field, and that puts your measure out into the wider public domain.

At this stage you’ll also need to finalise a name for your measure—and also an acronym. In my experience, the latter often ends up being the toughest part of the measure development process, though sites like Acronymify can help you work out what the options might be. Generally, you want a title that is clear and specific to what your measure is trying to do; and a catchy, easy-to-pronounce acronym. If the acronym actually means or sounds something like what the measure is about—like ‘CORE’—that’s even better.

If there are any complexities or caveats to the measure at all in terms of its use in research or clinical practice, it’s good to produce really clear guidelines for those who want to use it. Even a page or so can be helpful and minimise any ambiguities or potential problems with its application. Here is an example of the instructions we produced for our Goals Form.

It can also be great to develop a website where people can access the measure, its instructions, and any translations. You can see an example of this for our C-NIP website here.

Regarding translations, it’s important that people who may want to translate your measure follow a standardised procedure, so that it stays as consistent as possible with the original measure. For instance, a standard process is to ‘back translate’ an initial draft translation of the measure to check that the items still mean the same thing.

In terms of copyright, you can look at charging for use of the measure, but personally I think it’s great if people can make these freely available for non-commercial use. But to protect the measure from people amending it (and you really don’t want people doing their own modifications of your measure) you can use one of the Creative Commons licenses. With the measures I’ve been involved with, we’ve used ‘© licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)’ so that others can use it freely, but can’t change it or make money from its use (for instance, by putting it on their own website and then charging people to use it).

Conclusion

At the most advanced levels, measure development and testing studies can be bewildering. Indeed, even at the most basic level they can be bewildering—particularly for those who are unfamiliar with statistics. But don’t let that put you off. There’s a lot of the basic item generation and testing that you can do without knowing complex stats, and if you’re based at an institution there’s generally someone you can ask to help you with the harder stuff. There’s also loads of information that you can google. And what you get at the end of it is a way of operationalising something that may be of real importance to you: creating a tool which others can use to develop knowledge in this field. So although measure development research can feel hard, and like a glacially slow process at times, you’re creating something that can really help build up understandings in a particular area—and with that the potential to develop methods and interventions that can make a real difference to people’s lives.


Evaluating and Auditing Counselling and Psychotherapy Services: Some Pointers

How do you go about setting up an evaluation or audit of your therapy service—whether it’s a large volunteer organisation or your own private practice?

Clarifying your Aims

There’s lots of reasons for setting up a service evaluation or audit, and being clear about what yours are is a vital first step forward. Some possible aims might be:

  • Showing the external world (e.g., commissioners, policy makers, potential clients) that your therapy is effective.

  • Knowing for yourself, at the practitioner or service level, what’s working well and what isn’t.

  • Enhancing outcomes by providing therapists, and clients, with ‘systematic feedback’.

  • Developing evidence for particular forms of therapy (e.g., person-centred therapy) or therapeutic processes (e.g., the alliance).

And, of course, there’s also:

  • Because you have to!

Choosing an Evaluation Design

There’s lots of different designs you can adopt for your evaluation and audit study, and these can be combined in a range of ways.

Audit only

This is the most basic type of design, where you’re just focusing on who’s coming in to use your service and the type of service you are providing.

Pre-/Post-

This is probably the most common type of evaluation design, particularly if your main concern is to show outcomes. Here, clients’ levels of psychological problems are assessed at the beginning and end of therapy, so that you can assess the amount of change associated with what you’re doing.

Qualitative

You could also choose to do interviews with clients at the end of therapy about how they experienced the service. A simpler form of this would be to use a questionnaire at the end of treatment. John McLeod has produced a very useful review of qualitative tools for evaluation and routine outcome monitoring (see here).

Experimental

If you’ve got a lot of time and resources to hand—and/or if you need to provide the very highest level of evidence for your therapy—you could also choose to adopt an experimental design. Here, you’re comparing changes in people who have your therapy with those who don’t (a ‘control group’). These kinds of studies are much, much more complex and expensive than the other types, but an experimental design is the only one that can really show that the therapy, itself, is causing the changes you’ve identified (pre-/post- evaluations can only ever show that your therapy is associated with change).

Choosing Instruments

There’s thousands of tools and measures out there that can be used for evaluation purposes, so where do you start?

Tools for use in counselling and psychotherapy evaluation and audit studies can be divided into three types. These are described below and, for each type, I have suggested some tools for a ‘typical’ service evaluation in the UK. Unless otherwise stated, all these measures are free to use, well-validated (which means that they show what they’re meant to show), and fairly well-respected by people in the field. All the measures described below are also ‘self-rated’. This means that clients, themselves, fill them in. There are also many therapist- and observer-rated measures out there, but the trend is towards using self-rated measures and trusting that clients, themselves, know their own states of mind best.

Just to add: however tempting it might be, I’d almost always advise you not to develop your own instruments and measures. You’d be amazed how long it takes to create a validated measure (we once took about six years to develop one with six items!) and, if you create your own, you can never compare your findings with those of other services. Also, for the same reason, it is almost always unhelpful to modify measures that are out in the public domain—even minimally. Just changing the wording on an item from ‘often’ to ‘frequently’, for instance, may make a large difference in how people respond to it.

Outcome Tools

Outcome tools are instruments that can be used to assess how well clients are getting on in their lives, in terms of symptoms, problems, and/or wellbeing. These are the kinds of tools that can then be used in pre-/post-, or experimental, designs to see how clients change over the course of therapy. These tools primarily consist of forms with around 10 ‘items’ or so, like, ‘I’ve been worrying’ or ‘I’ve been finding it hard to sleep’. The client indicates how frequently or how much they have been experiencing this, and then their responses can be totalled up to give an overall indication of their mental and emotional state.

It’s generally good practice to integrate clients’ responses to the outcome tools into the session, rather than divorcing them from the therapeutic process. For instance, a therapist might say, ‘I can see on the form that this has been a difficult week for you,’ or, ‘Your levels of anxiety seem to be going down again.’ This is particularly important if the aim of the evaluation is to enhance outcomes through systematic feedback.

General

A popular measure of general psychological distress (both with therapists and clients), particularly in the UK, is the CORE-OM (Clinical Outcomes in Routine Evaluation – Outcome Measure).

This can be used in a wide range of services to look at how overall levels of distress, wellbeing, and functioning change over time. A shortened, more easily usable version of this (particularly for weekly outcome monitoring, see below) is the CORE-10.

Another very popular, and particularly brief, general measure of how clients are doing is the ORS (Outcome Rating Scale).

Two other very widely used measures of distress in the UK are the PHQ-9 and the GAD-7.

The PHQ-9 is a depression-specific measure, and the GAD-7 is a generalised-anxiety specific measure, but because these problems are so common they are often used as general measures for assessing how clients are doing, irrespective of their specific diagnosis. They do also have the dual function of being able to show whether or not clients are in the ‘clinical range’ for these problems, and at what level of severity.

Problem-specific

There are also many measures that are specific to particular problems. For instance, for clients who have experienced trauma there is:

And for eating problems there is:

If you are working in a clinic with a particular population, it may well be appropriate to use both a general measure, and one that is more specific to that client group.

Wellbeing

For those of us from a more humanistic, or positive psychology, background, there may be a desire to assess ‘wellness’ and positive functioning instead of (or as well as) distress. Aside from the ORS, probably the most commonly used wellbeing measure is the Warwick-Edinburgh Mental Wellbeing Scale (WEMWBS).

There’s both a 14-item version and a shortened 7-item version for more regular measurement.

Personalised measures

All the measures above are nomothetic, meaning that they have the same items for each individual. This is very helpful if you want to compare outcomes across individuals, or across services, and to use standardised benchmarks. However, some people feel that it is more appropriate to use measures that are tailored to the specific individual, with items that reflect their unique goals or problems. In the UK, probably the best known measure here is:

This can be used with children and young people as well as adults, and invites them to state their specific problem(s) and how intense they are. Another personalised, problem-based tool is:

If you are more interested in focusing on clients’ goals, rather than their problems, then you can use a goal-based tool such as the Goals Form.

Service Satisfaction

At the end of therapy, clients can be asked about how satisfied they were with the service. There isn’t any one generic standard measure here, but the one that seems to be used throughout IAPT is the Patient Experience Questionnaire (PEQ).

Children and young people

The range of measures for young people is almost as good as it is for adults, although once you get below 11 years old or so the tools are primarily parent/carer- or teacher-report. Some of the most commonly used ones are:

  • YP-CORE: Generic, brief distress outcome measure

  • SDQ: Generic distress outcome measure, very well validated and in lots of languages

  • CORS: Generic, ultra-brief measure of wellbeing (available via license)

  • RCADS: Diagnosis-based outcome measure

  • GBO Tool: Personalised goal-based outcome measure

  • ESQ: Service satisfaction measure.

A brilliant resource for all things related to evaluating therapy with children and young people is corc.uk.net/

Process Tools

Process measures are tools that can help assess how clients are experiencing the therapeutic work, itself: so whether they like/don’t like it, how they feel about their therapist, and what they might want differently in the therapeutic work. These are less widely used than outcome measures, and are more suited to evaluations where the focus is on improving outcomes through systematic feedback, rather than on demonstrating what the outcomes are.

Probably the most widely used process measure in everyday counselling and psychotherapy is:

  • SRS (available via license)

This form, the Session Rating Scale, is part of the PCOMS family of measures (along with the ORS), and is an ultrabrief tool that clients can complete at the end of each session to rate such in-session experiences as whether they feel heard and understood.

For a more in-depth assessment of particular sessions, there is the Helpful Aspects of Therapy (HAT) questionnaire.

This has been widely used in a research context, and includes qualitative (word-based) as well as quantitative (number-based) items.

Several well-validated research measures also exist to assess various elements of the therapeutic relationship. These aren’t so widely used in everyday service evaluations, but may be helpful if there is a research component to the evaluation, or if there is an interest in a particular therapeutic process. The most common of these is the Working Alliance Inventory (WAI).

This comes in various versions, and assesses the client’s (or therapist’s) view of the level of collaboration between members of the therapeutic dyad. Another relational measure, specific to the amount of relational depth, is the Relational Depth Frequency Scale.

A process tool that we have been developing to help elicit, and stimulate dialogue on, clients’ preferences for therapy is the Cooper-Norcross Inventory of Preferences (C-NIP).

This invites clients to indicate how they would like therapy to be on a range of dimensions, such that the practitioner can identify any strong preferences that the client has. This can either be used at assessment, or in the ongoing therapeutic work. An online tool for this measure can be accessed here.

Interviews

If you really want to find out how clients have experienced your service, there’s nothing better you can do than actually talk to them. Of course, you shouldn’t interview your own clients (there would be far too much pressure on them to present a positive appraisal), but an independent colleague or researcher can ask some key questions (for instance, ‘What did you find helpful? What did you find unhelpful? What would you have liked more/less of?’), which can be shared with the therapist or the service more widely (with the client’s permission). There’s also an excellent, standardised protocol that can be used for this purpose:

Note, as an interviewing approach has the potential to feel quite invasive to clients (though also, potentially, very rewarding), it’s important to have appropriate ethical scrutiny of your procedures before carrying these out.

Children and young people

Process tools for children and young people are even thinner on the ground, but there is the child version of the Session Rating Scale, the CSRS.

Demographic/Service Audit Tools

As well as knowing how well clients are doing, in and out of therapy, it can also be important to know who they are—particularly for auditing purposes. Demographic forms gather data about basic characteristics, such as age and gender, and also the kinds of problems or complexity factors that clients are presenting with. These tools do tend to be less standardised than outcome or process measures, and it’s not so problematic here to develop your own forms.

For adults, a good basic assessment form is the CORE Assessment Form.

For children and young people, one of the most common, and thorough, forms is the Current View tool.

Choosing Measurement Points

So when are you actually going to ask clients, and/or therapists, to complete these measures? The demographic/audit measures can generally be done just once at the beginning of therapy, although you may want to update them as you go along. Service satisfaction measures and interviews tend to be done just at the end of the treatment.

For the other outcome and process measures, the current trend is to do them every session. Yup, every session. Therapists often worry about that—indeed, they often worry about using measures altogether—but generally the research shows that clients are OK with it, provided that they don’t take up too much of the session (say not more than 5-10 minutes in total). So, for session-by-session outcome monitoring, make sure you use just one or two of the briefer forms, like the CORE-10 or SRS, rather than longer and more complex measures.

Why every session? The reason is that clients, unfortunately, do sometimes drop out, and if you try to administer measures just at the beginning and end, you miss out on those clients who terminated therapy prior to a planned ending. In fact, that can make your results look better (because you’re only looking at the outcomes of those who finished properly, who tend to do better), but it’s biased and inaccurate. Session-by-session monitoring means that you’ve always got a last score for every client, and most funders or commissioners would now expect to see data gathered in that way. If you’ve only got results from 30% of your sample, it really can’t tell you much about the overall picture.
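Pulling out that last available score for every client is straightforward if your data sits in one table with a row per completed form. A Python/pandas sketch (the file and column names are hypothetical):

```python
import pandas as pd

# Hypothetical data: one row per completed measure
df = pd.read_csv("session_scores.csv")  # columns: client_id, session, core10_total

# With session-by-session monitoring, every client has a usable first
# and last score, even those who dropped out before a planned ending
by_client = df.sort_values("session").groupby("client_id")["core10_total"]
change = by_client.first() - by_client.last()   # positive = distress has reduced

print(change.describe())
```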

Generally, outcome measures are completed at the start of a session—or before the start of a session—so that clients’ responses are not too affected by the session content. Process measures are generally completed towards the end of a session as they are a reflection on the session itself (but with a bit of time to discuss any issues that might come up).

Analysing the Data

Before you start a service evaluation, you have to know what you are going to do with the data. After all, what you don’t want is a big pile of CORE-OM forms in one corner of your storage room!

That means making sure you price into any evaluation the costs, or resources, of inputting the data, analysing it, and writing it up. It’s simply not fair to ask clients, and therapists, to complete hundreds of evaluation forms if nothing is ever going to happen with them.

The good news is that most of the forms, or the sites that the forms come from, tell you how to analyse the data from that form.

The simplest form of analysis, for pre-/post- evaluations, is to look at the average score of clients at the beginning of therapy on the measure, and then their average score at the end. Remember to only use clients who have completed both pre- and post- forms. That will show you whether clients are improving (hopefully) or getting worse.

With slightly more sophisticated statistics you can calculate the ‘effect size’. This is a standardised measure of the magnitude of change (after all, different measures will change by different amounts). The effect size can be understood as the difference between average pre- and post- scores divided by the ‘standard deviation’ of the pre- scores (this is the amount of variation in scores, which you can work out in Excel using the function ‘stdev’). Typically, in counselling and psychotherapy services, the effect size is around 1, and you can compare your statistics with other services in your field, or with IAPT, to see how your service is doing (although, of course, any such comparisons are ultimately very approximate).
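If you’d rather do this in Python than Excel, the calculation is only a few lines (the file and column names are hypothetical; pandas’ std() gives the same sample standard deviation as Excel’s STDEV):

```python
import pandas as pd

df = pd.read_csv("pre_post_scores.csv")        # one row per client
paired = df.dropna(subset=["pre", "post"])     # only clients with both forms

effect_size = (paired["pre"].mean() - paired["post"].mean()) / paired["pre"].std()
print(f"Pre-post effect size: {effect_size:.2f}")   # around 1 is typical for services
```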

What you can also do is to find out the percentage of your clients who have shown ‘reliable change’ (change of more than a particular amount, to compensate for the fact that measures will always be imprecise), and ‘clinical change’ (the proportion of clients who have gone from the clinical to the non-clinical band, or vice versa). If you look around on the internet, you can normally find the clinical and reliable change ‘indexes’ for the measures that you are using (though some don’t have them). For the PHQ-9 and GAD-7, you can look here to see both the calculations for reliable and clinical change, and the percentages for each of these statistics that were found in IAPT.
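And once you’ve found (or calculated) the reliable change index and clinical cut-off for your measure, the percentages are easy to compute. A sketch with invented index values; substitute the published figures for your own measure:

```python
import pandas as pd

df = pd.read_csv("pre_post_scores.csv").dropna(subset=["pre", "post"])

RCI = 6.0      # reliable change index -- invented here; look up your measure's
CUTOFF = 10.0  # clinical cut-off -- invented here; look up your measure's

reliably_improved = (df["pre"] - df["post"]) >= RCI
reliably_deteriorated = (df["post"] - df["pre"]) >= RCI
# 'Clinical change': moving from the clinical band to the non-clinical band
clinically_recovered = (df["pre"] >= CUTOFF) & (df["post"] < CUTOFF)

print(f"% reliably improved:     {100 * reliably_improved.mean():.0f}%")
print(f"% reliably deteriorated: {100 * reliably_deteriorated.mean():.0f}%")
print(f"% clinically recovered:  {100 * clinically_recovered.mean():.0f}%")
```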

Online Services

One way around having to input and analyse masses of data yourselves is to use an online evaluation service. This can simplify the process massively, and is particularly appropriate if you want to combine service evaluation with regular systematic feedback for clinicians and clients. Most of these (though not all) can host a wide range of measures, so they can support the particular evaluation that you choose to develop. However, these services come at a price: a license, even for an individual practitioner, can be in the hundreds or thousands of pounds. Normally, you’d also need to cost in the price of digital tablets for clients to enter the data on.

My personal recommendation for one of these services is:

At the CREST Research Clinic we’ve been using this system for a few years now, and we’ve been consistently impressed with the support and help we’ve received from the site developers. Bill and Tony are themselves psychotherapists with an interest in—and understanding of—how to deliver the best therapy.

Other sites that I would recommend for consideration, but that I haven’t personally used, are:

Challenges

In terms of setting up and running a service evaluation, one of the biggest challenges is getting counsellors and psychotherapists ‘on board’. Therapists are often sceptical about evaluation, and feel that using measures goes against their basic values and ways of doing therapy. Here, it can be helpful for them to hear that clients, in fact, often find evaluation tools quite useful, and are often (though not always) much more positive about it than therapists may assume. It’s perhaps also important for therapists to see the value that these evaluations can have in securing future funding and support for services.

Another challenge, as suggested above, is simply finding the time and person-power to analyse the forms. So, just to repeat, do plan and cost that in at the beginning. And if it doesn’t feel like that is going to be possible, do consider using an online service that can process the data for you.

For the evaluation to be meaningful, it needs to be consistent and it needs to be comprehensive. That means it’s not enough to have a few forms from a few clients across a few sessions, or just forms from assessment but none at endpoint. Rather, whatever you choose to do, all therapists need to do it, all of the time. In that respect, it’s better just to do a few things well, rather than trying to overstretch yourself and ending up with a range of methods done patchily.

Some ‘Template’ Evaluations

Finally, I wanted to suggest some examples of what an evaluation design might look like for particular aims, populations, and budgets:

Aim: Showing evidence of effectiveness to the external world. Population: adults with range of difficulties. Budget: minimal

  • CORE-10: Assessment, and every session

  • CORE Assessment Form

  • Analysis: Service usage statistics; pre- to post- change, effect size, % reliable and clinical change

Aim: Showing evidence of effectiveness to the external world, enhancing outcomes. Population: young people with range of difficulties. Budget: minimal

  • YP-CORE: Assessment, and every session

  • Current View: Assessment

  • ESQ: End of therapy

  • Analysis: Service usage statistics; pre- to post- change, effect size, % reliable and clinical change; satisfaction (quantitative and qualitative analysis)

Aims: Showing evidence of effectiveness to the external world, enhancing outcomes. Population: adults with depression. Budget: medium

  • PHQ-9: Assessment and every session

  • CORE Assessment Form

  • Helpful Aspects of Therapy Questionnaire

  • Patient Experience Questionnaire: End of Therapy

  • Analysis: Service usage statistics; pre- to post- change, effect size, % reliable and clinical change; helpful and unhelpful aspects of therapy (qualitative analysis); satisfaction (quantitative and qualitative analysis)

And finally…

Please note, the information, materials, opinions or other content (collectively Content) contained in this blog have been prepared for general information purposes. Whilst I’ve endeavoured to ensure the Content is current and accurate, the Content in this blog is not intended to constitute professional advice and should not be relied on or treated as a substitute for specific advice relevant to particular circumstances. That means that I am not responsible for, nor will be liable for any losses incurred as a result of anyone relying on the Content contained in this blog, on this website or any external internet sites referenced in or linked in this blog. I also can’t offer advice on individual evaluations. Sorry… but hope the information here is useful.