
Measure Development and Testing Research: Some Pointers

Have you ever had one of those dreams where you’re running towards something, and the faster you go the further away it seems to get? That, to me, is what doing research in the measure development field seems like. Every time I think I have mastered the key methods some bright spark seems to have come up with a new procedure or analysis that is de rigueur for publishing in the field. Mind you, I have to say that developing measures has been one of the most satisfying and even exhilarating elements of my research career, however humbling it might be at times. And, indeed, having gone from knowing next to nothing about measure development to creating, or helping to test, some fairly well-used measures (including one with my name on it, the Cooper-Norcross Inventory of Preferences!), I’m pretty confident that it’s a research process that anyone—who’s willing to devote the time—can get involved in.

And, of course, the point of developing and validating measures is not just the narcissistic glory. It’s research that can help to define phenomena and explore their relationship to other factors and processes. Take racial microaggressions in therapy, for instance. Measures can help us see where these are taking place and what’s leading to them, and help us assess methods for reducing their prevalence. Of course, the downside of measures is that they take complex phenomena and reduce them down to de-contextualised, linear variables. But, in doing so, we can examine—over large, representative samples—how these variables relate to others. Do different ethnic groups, for instance, experience different levels of racial microaggressions in therapy? We could use qualitative methods to interview clients of different ethnicities, but comparing their responses and drawing conclusions is tricky. Suppose, for instance, that of the Afro-Caribbean clients, four identified ‘some’ microaggressions, two ‘none’, and three said ‘it depended on the therapist’. Then, for the Asian clients, we had two saying, ‘I wasn’t sure’, three saying ‘no’, and two saying, ‘it was worse in the earlier sessions’. And one Jewish client felt that their therapist had made an anti-Semitic comment, while another didn’t. So who had more or less? By contrast, if Afro-Caribbean clients have an average rating of 3.2 on our 1 to 5 scale of in-therapy racial microaggressions, and Asian clients have an average rating of 4.2, and our statistical analysis shows that the likelihood of this difference being due to chance is less than 1 in 1,000 (see blog on quantitative analysis), then we can say something much more definitive.

From a pluralistic standpoint, then, measure development research—like all research methods—has a particular value at particular points in time: it all depends on the question(s) that we are asking. And while, as we will see, it tends to be based on positivistic assumptions (that there is a real, underlying reality—which we can get closer to knowing through scientific research), it can also be conducted from a more relativist, social constructionist perspective (that no objective ‘reality’ exists, just our constructions of it).

What is Measure Development and Testing Research?

Measure development research, as the name suggests, is the development of ‘measures’, ‘scales’, or ‘instruments’ (also known as the field of psychometrics); and measure testing research is assessing those measures’ quality. Measure development studies will always involve some degree of measure testing, but you can have measure testing studies that do not develop or alter the original measure.

A measure can be defined as a means of trying to assess ‘the size, capacity, or quantity of something’: for instance, the extent to which clients experience their therapist as empathic, or therapists’ commitment to a spiritual faith. In this sense (and particularly from a positivist standpoint), we can think of psychological measures as a bit like physical measures, for instance rulers or thermometers: tools for determining what’s out there (like the length of things, or their temperature).

Well-known examples of measures in the counselling and psychotherapy field are the CORE-OM (Clinical Outcomes in Routine Evaluation – Outcome Measure), which measures clients’ levels of psychological distress; and the Working Alliance Inventory, which measures the strength of therapist-client collaboration and bond. There’s more information on a range of widely used ‘process’ and ‘outcome’ measures for counselling and psychotherapy here.

Measures generally consist of several ‘items’ combined into a composite score. For instance, on the CORE-OM, two of the 34 items are ‘I have felt terribly alone and isolated’ and ‘I have felt like crying’. Respondents are then asked to score such items on a wide range of different scales—for instance, on the CORE-OM, clients are asked to rate the items from 0 (not at all) to 4 (most or all of the time)—such that a total score can be calculated. Note, in this way, measures are different from ‘questionnaires’, ‘surveys’, or ‘checklists’ that have lots of different items asking about lots of different things. Indeed, as we will see, the ‘combinability’ of items into one, or a few, scales tends to be a defining feature of measures.
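If it helps to make the idea of a composite score concrete, here’s a minimal sketch in Python (the third item and the scoring rule here are purely illustrative, not the official CORE-OM scoring procedure):

```python
# Purely illustrative: combining item ratings into a composite score.
# The third item and the simple sum/mean rule are hypothetical; this is
# not the official CORE-OM scoring procedure.
responses = {
    "I have felt terribly alone and isolated": 3,
    "I have felt like crying": 2,
    "I have felt overwhelmed": 4,  # hypothetical item
}

total = sum(responses.values())    # composite as a simple sum
mean = total / len(responses)      # or as a mean, for comparability
print(f"Total: {total}, Mean: {mean:.2f}")
```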

A measure can consist of:

  • One scale. An example is the Relational Depth Frequency Scale, which measures the frequency of experiencing relational depth in therapy.

  • Two or more scales. An example is the Cooper-Norcross Inventory of Preferences, which has scales for ‘client preference for warm support vs focused challenge’, and ‘client preference for past focus vs present focus’.

  • Two or more subscales: meaningful in their own right, but also summable to make a main scale score. An example is the Strengths and Difficulties Questionnaire for children, which has such subscales as ‘peer problems’ and ‘emotional symptoms’, which combine to make a ‘total difficulties’ score.

Generally, a single-scale measure or a subscale will have between about four and 10 items. Fewer than that and the internal consistency starts to become problematic (see below); more than that and the measure may be too long to complete, with items that are redundant.

Measures can be designed for completion by therapists, by clients, or by observers. They can also be nomothetic (where everyone completes the same, standardised items), or idiographic (where people develop their own items, for instance on a Goals Form).

Underlying Principles

Most measure development and testing research is underpinned by a set of principles known as classical test theory. These are fairly positivistic, in that they assume that there are certain dimensions out there in the world (known as latent variables) that exist across all members of the population, and are there independent of our constructions of them. So people’s ‘experiencing of racial microaggressions’ is a real thing, just like people’s temperature or the length of their big toe: it’s an actual, existent thing, and the point of our measure is to try and get as close as possible to accurately assessing it.

You might think, ‘If we want to know about clients’ experiences of racial microaggressions in therapy, why don’t we just ask them the question, “To what extent do you experience racial microaggressions in your therapy?”’ The problem is, from a classical test theory perspective, a respondent’s answer (the ‘observed score’) is going to consist of two components. The first component is going to be the part that genuinely reflects their experiencing of microaggressions (the ‘true score’ on the latent variable). But, then, a second part is going to be determined by various random factors that influence how they answer that specific question (the ‘error’). For instance, perhaps the client doesn’t understand the word ‘microaggressions’, or misunderstands it, so that their responses to this particular item don’t wholly reflect the microaggressions that they have experienced. Here, what we might do is to try and minimise that error by asking the question in a range of different ways—for instance, ‘Did your therapist make you feel bad about your race?’ ‘Did your therapist deny your experiences of racism?’—so that the errors start to even out. And that’s essentially what measure development based on classical test theory is all about: developing measures that have as little error as possible, so that they’re evaluating, as accurately as they can, respondents’ true positioning on the latent variable. No one wants a broken thermometer or a wonky ruler and, likewise, a measure of the experiencing of racial microaggressions in therapy that only reflects error variance isn’t much good.
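Here’s a small simulation to illustrate that last point, if it helps: just the classical test theory idea of ‘observed = true + error’ in code, with arbitrary numbers.

```python
import numpy as np

rng = np.random.default_rng(42)
n_respondents = 500

# Classical test theory: observed score = true score + random error.
true_score = rng.normal(0, 1, n_respondents)

def composite(n_items: int) -> np.ndarray:
    """Average n_items noisy items, each = true score + its own error."""
    items = true_score[:, None] + rng.normal(0, 1, (n_respondents, n_items))
    return items.mean(axis=1)

# The more items we average, the more the random errors cancel out, and
# the closer the composite gets to the latent true score.
for k in (1, 5, 10):
    r = np.corrcoef(true_score, composite(k))[0, 1]
    print(f"{k:2d} item(s): correlation with true score = {r:.2f}")
```

Run it and you should see the composite’s correlation with the true score rise as items are added: exactly why measures use several items rather than one.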

As you can see, all this is based on very positivist assumptions: a ‘true’, underlying (i.e., latent) reality out there in the world; acting according to laws that are true for us all; and with ‘error’ like an uninvited guest that we’re trying to escort out of the party. Not much room for the existence of unpredictability, chaos, or individual uniqueness; or the idea that ‘reality’ is something we construct according to social mores and traditions. Having said that, adopting classical test theory assumptions, for the purposes of measure development, doesn’t mean you have to be a fully-fledged positivist. From a pragmatic standpoint, for instance, you can see measure development as a means of identifying and assessing something of meaning and importance—but whether or not it is something ‘real’ can be considered a moot point. We know, for instance, that there is something like racial microaggressions that can hurt clients and damage the therapeutic relationship, so we can do our best to find ways of assessing it, while also acknowledging the inherent vagaries of whatever we do. And, perhaps, what we call ‘racial microaggressions’ will change over time and vary across cultures and individuals, but that shouldn’t stop us from trying to get some sort of handle on it, so that we can do our best to find out more and intervene.

Developing a Measure

So how do you actually go about developing a measure? It might seem like most measures are developed on the back of the proverbial ‘fag packet’ but, OMG, it is vastly more complicated and time-consuming than that. I worked out that, when Gina di Malta (with myself and Chris Evans) developed the 6-item Relational Depth Frequency Scale, it took something like six years! That’s one year per item.

That’s why, for most of us who have developed measures, the first thing we say to people who want to develop their own is: see if you can use a measure that’s already out there. That’s unless you really have the time and resources to do the work that’s needed to develop and validate your own measure. Bear in mind, a half-validated measure isn’t really valid at all.

So why does it take so long? To a great extent, it’s because there’s a series of stages that you need to go through, detailed below. These aren’t exact, and every measure development study will do them slightly differently, but the sections below should give you a rough idea of what steps a measure development study will take.

Defining the Latent Variable

Before you develop a measure, you have to know what it is that you are trying to measure. To some extent, this may emerge and evolve through your analysis, but the clearer you are about what you’re looking for, the more likely your measure will be fit for finding it.

‘I’d like to know whether clients feel that they’ve got something out of a session.’ OK, great, but what do we mean by ‘got something out of’? Is this feeling that they’ve learnt something, or finding the session worthwhile, or experiencing some kind of progress in their therapy? ‘Maybe all of those things.’ OK, but feeling like you’ve learnt something from a session may not necessarily correlate with feeling like you’ve made progress. They may seem similar, but perhaps some clients feel there’s a lot they’ve learnt while still coming out of a session feeling stuck and hopeless.

Things that just naturally seem to go together in your mind, then, may not do so in the wider world, and disentangling what you want to focus on is an important starting point for the measure development work. How do you do that? Read the literature in the area, talk to colleagues, journal, look at dictionaries and encyclopaedias: think around the phenomenon—critically—as much as you can. What you want to identify is one discrete variable, or field, that you can really, clearly define. It could be broader (like ‘the extent to which clients value their sessions’) or narrower (like ‘the extent to which clients feel they have developed insight in their sessions’), but be clear about what it is.

Item Generation

Once you know what latent variable you want to measure, the next step is to generate items that might be suitable for its assessment. At this stage, don’t worry too much about whether the items are right or not: brainstorm—generate as many items as you can. In fact, one thing I’ve learnt over the years is that you can never have too many items at this stage, but you can easily have too few. Probably around 80% or so of items end up getting discarded through the measure development process, so if you want to end up with a scale of around 5-10 items, you probably want to start with around 25-50 potential ones. Bear in mind that you can always drop items if you get to the end of the measure development process and have too many, but it’s much more difficult to generate new items if you get to the end and find you have too few.

Ideally, you want to do this item generation process in one or more systematic ways, so it is not just the first, ad hoc, items that come into your head. Some strategies for generating items are:

  • Search the literature on the topic. Say we wanted to develop a measure to assess the extent to which adolescent clients feel awkward in therapy (we’re interested in differences in awkwardness across types of therapies, and types of clients). So let’s go to Google Scholar to see what papers there are on young people’s awkwardness in therapy, and we should also check the more established psychological search engines like PsycInfo and Web of Science (if we have access, generally through a university). Suppose, there, we find research where young people say things like, ‘I felt really uncomfortable talking to the counsellor’ or ‘The therapist really weirded me out’. We can use statements like these (or modified forms of them) as items for our measure, and they might also trigger ideas about further items, like ‘I felt really comfortable talking to the counsellor’ (a reversal of the first statement here), or ‘The therapist seemed really weird’ (a modification of the second statement).

  • Interviews and focus groups. Talk to people in the target population to see what terms they use to talk about the phenomena. For instance, an interview with young clients about their experiences of counselling (approved, of course, through the appropriate ethical procedures) might be an ideal way of finding out how they experience ‘awkwardness’ in therapy. What sort of words do they use to talk about it? How does it feel to them?

  • Dictionaries and thesauruses. Always a valuable means of finding synonyms and antonyms for a phenomenon.

Remember, what you are trying to do is to generate a range of items which are, potentially, a means of ‘tapping into’ your latent variable. Have a mixture of phrasings, with some items that are as closely worded to your latent variable as possible (for instance, ‘I felt awkward in therapy’), but others that might come at it from a different angle, providing ‘triangulation’ (for instance, ‘The interaction with my therapist seemed unusual’). It’s also good to try reversing some items (so, for instance, having items that are about not feeling awkward, as well as feeling awkward)—though having such items in a final scale is no longer considered essential.

At this point, you’ll also need to start thinking about your response categories: the ways that people score your items. For instance, do people rate the items on a 3- or 5-point scale, and what labels might you use to describe these different points? This is an enormous field of science in itself, and usually it’s best to keep it simple and use something that’s already out there, so that it’s been tried and tested. For instance, if you decide to develop your own five-point scale with labels like 1 = Not at all, 2 = A really small amount, 3 = Quite a bit, 4 = Moderately, 5 = Mostly, how do you know that ‘Quite a bit’ means less to people than ‘Moderately’? And couldn’t the difference between 2 and 3 (A really small amount and Quite a bit) be a lot more than the difference between 4 and 5 (Moderately and Mostly)? So have a look at what other validated and well-used measures use as response categories and see if anything there suits. Two common ones are:

  1 = Strongly disagree
  2 = Moderately disagree
  3 = Mildly disagree
  4 = Mildly agree
  5 = Moderately agree
  6 = Strongly agree

Or:

  1 = Not at all
  2 = Only occasionally
  3 = Sometimes
  4 = Often
  5 = Most or all of the time

At this point, you’ll also need some idea of how you’ll phrase the introduction to your measure. Generally, you’ll want to keep it as short as possible, but there may be some essential instructions to give, such as who or what to rate. For instance, for our racial microaggressions measure, we might want to say something like:

Please think of your relationship with your current therapist. To what extent did you experience each of the following?

In this instance, we might also consider it essential to say whether or not the clients’ therapists will see their scores, as this may make a big difference to their responses.

Testing Items

Expert Review

The next stage of the measure development process is to pilot test our items. What we would do is to show each of our items to experts in the field (ideally experts by experience, as well as mental health professionals)—say between about 3 and 10 of them—and ask them to rate each of our potential items for how ‘good’ they are. We could do this as a survey questionnaire, on hard copy, or through questionnaire software such as Qualtrics. An example of a standardised set of questions for asking this comes from DeVellis’s brilliant book on scale development. Here, experts can be asked to rate each item on a four-point scale (1 = not at all, 2 = a little, 3 = moderately, and 4 = very well) with respect to three criteria:

  1. How well the item matches the definition of our latent variable (which the experts are provided with)

  2. How well formulated the item is for participants to fill in

  3. How well, overall, the item is suited to the measure

Once the responses are in, those items with the lowest ratings (for instance, with an average < 3) can be discarded, leaving only the most well formulated and suitable items to go forward for further testing and analysis.
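In practice, this filtering step is simple enough to script. Here’s a minimal sketch (the items, the number of experts, and the ratings are all hypothetical):

```python
import pandas as pd

# Hypothetical expert ratings (rows = experts, columns = candidate items),
# on a DeVellis-style 1-4 scale for one criterion (e.g. match to definition).
ratings = pd.DataFrame({
    "I felt awkward in therapy":         [4, 4, 3, 4],
    "The therapist seemed really weird": [3, 2, 3, 3],
    "I enjoy long walks":                [1, 2, 1, 2],  # off-construct item
})

means = ratings.mean()
retained = means[means >= 3].index.tolist()   # discard items averaging < 3
print(means.round(2), "\nRetained:", retained)
```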

The Three-Step Test Interview

Something else that I’ve learnt, from Joel Vos, that can be really useful for selecting items in these early stages is called The Three-Step Test Interview. This essentially involves asking a few respondents (ideally the kind of people the measure is for) to ‘think aloud’ while completing the measure, and then to answer some interview questions about their experiences and perceptions of completing the measure. This, then, gives us a vivid sense of what the experience of completing the measure is like, and what’s working and what’s not. Through this process, for instance, it might become evident that certain items—even if the experts thought they were OK—don’t make much sense to participants, or are experienced as boring or repetitive. And respondents might also have ideas for how items can be better worded. Again, questions that don’t work well can be removed at this stage and, potentially, new or modified items could be added (though bear in mind they haven’t been through the expert review process).

Exploratory Psychometrics

You’re now at the stage of sending your measure out to survey. The number of respondents you need at this stage is another question that’s a science in itself. However, standard guidance is a minimum of 10 respondents per item, with other guidance suggesting at least 50 respondents overall if the aim is to detect one dimension/scale, and 100 for two (see, for instance, here).

At this point, you almost certainly want to be inviting respondents to complete the measure online: for instance, through Qualtrics or Survey Monkey. Hard copies are an option, but add considerably to the processing burden and, these days, may make prospective participants less likely to respond.

Ideally, you want respondents to be reflective of the people who are actually going to use the measure. For instance, if it’s a measure intended for use with a clinical population, it’s not great if it’s been developed only with undergraduate students or with just your social media contacts. It’s also important, obviously, to aim for representativeness across ethnicity/race, gender, age, and other characteristics.

If you’ve got funding, one very good option here can be to use an online participant recruitment platform, such as Prolific. This is, essentially, a site where people get paid to complete questionnaires; and because it’s such a large pool of people, from all over the world, it means you’ve got more chance of recruiting the participants you need. We used this, for instance, to gather data on the reliability and validity of the Cooper-Norcross Inventory of Preferences (see write-up here), and it allowed us to get US and UK samples that were relatively representative in terms of ethnicity, gender, and age—not something we could have easily achieved just by reaching out to our contacts.

Once you’ve got your responses back, you’re on to the statistical analysis. The aim, at this point, is to get to a series of items that can reliably assess one or more latent dimensions, in a way that is as parsimonious as possible (i.e., with the fewest items necessary). This scale shortening process can be done in numerous ways, but one of the most common starting points is to use exploratory factor analysis (EFA).

EFA is a system for identifying the dimension(s) that underlie scores from a series of items. It’s a bit like putting an unknown liquid on a dish and then boiling it off to see what’s left: perhaps there’s crystals of salt, or maybe residues of copper or gold. EFA has to be done using statistical software, like SPSS or R (not Excel), and you need to know what you’re doing and looking for. On a 1-10 scale of difficult stats, it’s probably about a 5: not impossible to pick up, but it does require a fair degree of training, particularly if you don’t have a psychology degree. What follows (as with all the stats below) is just a basic overview to give you an idea of the steps that are needed.

The first thing you do in EFA is to see how many dimensions actually underlie your data. For instance, the data from our ‘experiences of racial microaggression’ items may suggest that they are all underpinned by just one dimension: How much or how little people have experienced microaggressions from their therapists. But, alternatively, we may find that there were more latent dimensions underlying our data: for instance, perhaps people varied in how much they experienced microaggressions, but also the degree to which they felt hurt by the microaggressions they experienced. So while some people could have experienced a lot of microaggressions and a lot of hurt, others might have experienced a lot of microaggressions but not much hurt; and any combination across these two variables might be possible.

What EFA also does is to help you see how well different items ‘load’ onto the different dimensions: that is, whether scores on the items correlate well with the latent dimension(s) identified, or whether they are actually independent of all the underpinning dimensions on the measure. That way, it becomes possible to select just those items that reflect the latent dimension well, discarding those that are uncorrelated with what you have actually identified as a latent scale. At this point, it’s also common to discard items that load onto multiple scales: what you want is items that are specifically and uniquely tied to particular latent variables. There are many other decision rules that can also be used for selecting items. For instance, you might want items that have a good range (i.e., going the full length of the scale), rather than all scores clustering in the higher or lower regions; and the items also need to be meaningful when grouped together. So this process of scale shortening is not just a manualised one, following clearly-defined rules, but a complex, nuanced, and selective art: as much alchemy as it is science.
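To give a flavour of what this looks like in practice, here’s a rough sketch of those two EFA steps in Python, using the third-party factor_analyzer package (the data is simulated so that it genuinely contains two latent dimensions; in a real study you’d load your survey responses instead):

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer  # pip install factor-analyzer

# Simulated survey data: 200 respondents x 10 candidate items, built so
# that items 1-5 tap one latent dimension and items 6-10 another.
rng = np.random.default_rng(0)
prevalence = rng.normal(size=(200, 1))  # e.g. 'prevalence of microaggressions'
hurt = rng.normal(size=(200, 1))        # e.g. 'hurt experienced'
items = np.hstack([
    prevalence + rng.normal(0, 0.6, (200, 5)),
    hurt + rng.normal(0, 0.6, (200, 5)),
])
df = pd.DataFrame(items, columns=[f"item{i}" for i in range(1, 11)])

# Step 1: how many dimensions underlie the data? Inspect the eigenvalues
# (scree plot / Kaiser criterion); here two should stand out clearly.
fa_unrotated = FactorAnalyzer(rotation=None)
fa_unrotated.fit(df)
eigenvalues, _ = fa_unrotated.get_eigenvalues()
print("Eigenvalues:", eigenvalues.round(2))

# Step 2: fit a two-factor solution and inspect how items 'load' onto the
# factors; items with weak (< ~.40) or cross-loadings are candidates to drop.
fa = FactorAnalyzer(n_factors=2, rotation="oblimin")
fa.fit(df)
print(pd.DataFrame(fa.loadings_, index=df.columns).round(2))
```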

By the end of this exploratory process, you should have a preliminary set of items for each scale or subscale. And what you’ll then need to do is to look at the items for each scale or subscale and think about what they’re assessing: how will you label this dimension? It may be that the alchemical process leads you back to what you set out to find: a ‘prevalence of racial microaggressions’ dimension, for instance. But perhaps what crystallised out was a range of factors that you hadn’t anticipated. When we conducted our first Cooper-Norcross Inventory of Preferences study, for instance (see here), we didn’t really know what preference dimensions would emerge from it. I thought, for instance, that we might find a ‘therapist directed vs client directed’ dimension, as we did, but I was surprised to see that there was also a ‘focused challenge vs warm support’ dimension emerging as well—I had just assumed that therapist directiveness and challenge were the same thing.

Testing the Measure

As with exploratory measure development, there are numerous methods for testing the psychometric properties of a measure, and procedures for developing and testing measures are often iterative and overlap. For instance, as part of finalising items for a subscale, a researcher may assess the subscale’s internal reliability (see below) and, if problematic, adjust its items. These tests may also be conducted on the same sample that was used for the EFA, or else a new sample of data may be collected with which to assess the measure’s psychometric properties.

Two basic sets of tests exist that most researchers will use at some point in measure development research: the first concerned with the reliability of the measure and the second concerned with its validity.

Basic Reliability Tests

The reliability of a measure is the extent to which it produces consistent, reproducible estimates of an underlying variable. A thermometer, for instance, that gave varied readings from one moment to the next wouldn’t be much use.

  • Internal consistency is probably the most important, and most frequently reported, indicator of a scale’s ‘goodness’ (aside from when the measure is idiographic). It refers to the extent to which the different items in the scale all correlate together to measure the same thing. If the internal consistency is low, it means that the items are not, in fact, particularly well associated; if high, it means that they are all aligned. Traditionally, internal consistency was assessed with a statistic called ‘Cronbach’s alpha (α)’, with a score of .7 or higher generally considered adequate. Today, there is increasing use of a statistic called ‘McDonald’s omega (ω)’, which is seen as giving a less biased assessment. (There’s a minimal sketch of these reliability calculations after this list.)

  • Test-retest reliability is very commonly used in the field of psychology, but is, perhaps, a little less prevalent in the field of counselling and psychotherapy research, where stability over time is not necessarily assumed or desired. Test-retest reliability refers to the stability of scores over a period of time, where you would expect people to score roughly the same on a measure (particularly if it is a relatively stable trait). If respondents, for instance, had wildly fluctuating scores on a measure of self-esteem from one week to the next, it would suggest that the measure may not be tapping into this underlying characteristic. Test-retest stability is often calculated by simply looking at the correlation of scores from Time 1 to Time 2 (an interval of about two weeks is typically used), though there are more sophisticated statistics for this calculation. Assessing test-retest reliability requires additional data to be collected after the original survey—often with a subset of the original respondents.

  • Inter-rater reliability is used where you have an observer-completed measure. Essentially, if the measure is reliable, then different raters should be giving approximately the same ratings on the scales. In our assessment of an auditing measure for person-centred practice in young people, for instance (see here), we found quite low correlations between how the raters were assessing segments of person-centred practice. That was a problem, because if one rater, on the measure, is saying that the practice is adherent to person-centred competencies, and another is saying it isn’t, then it suggests that the measure isn’t a reliable means of assessing what is and is not a person-centred way of working.
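As promised above, here’s a minimal sketch of the two most common reliability calculations: Cronbach’s alpha (computed from scratch, from its standard formula) and a simple test-retest correlation. The data is simulated; packages such as pingouin can also compute alpha for you.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

rng = np.random.default_rng(1)
latent = rng.normal(0, 1, (300, 1))
scores = latent + rng.normal(0, 0.8, (300, 6))  # six items, one dimension

print(f"alpha = {cronbach_alpha(scores):.2f}")  # ~.7+ conventionally adequate

# Test-retest reliability, at its simplest: correlate Time 1 with Time 2.
time1 = scores.mean(axis=1)
time2 = time1 + rng.normal(0, 0.4, 300)  # e.g. readministered two weeks later
print(f"test-retest r = {np.corrcoef(time1, time2)[0, 1]:.2f}")
```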

Basic Validity Tests

The validity of a measure is the extent to which it measures the actual thing that it is intended to. Validity can be seen as the ‘outward-facing’ element of a measure (how it relates to what is really going on in the world), whereas reliability can be seen as the ‘inward-facing’ element (how the different parts within it relate together).

  • Convergent validity tends to be the most widely emphasised, and reported, test of validity in the counselling and psychotherapy research field. It refers to the extent that scores on the measure correlate with scores on a well-established measure of a similar construct. Suppose we were developing a measure to assess how prized clients feel by their therapists. No measure of this exact construct exists out there in the field (indeed, if one did, we wouldn’t be doing this work), but there are almost certainly other scales, subscales, or even individual items out there that we’d expect our measure to correlate with: for instance, the Barrett-Lennard Relationship Inventory’s ‘Level of Regard’ subscale. So we would expect to find relatively high correlations between scores on our new prizing measure and those on the Level of Regard subscale, say around .50 or so. If the correlations were zero, it might suggest that we weren’t really measuring what we thought we were. But bear in mind that correlations can also be too high. For instance, if we found that scores on our prizing measure correlated extremely closely with scores on Level of Regard (> .80 or so), it would suggest that our new measure is pretty redundant: the latent variable we were hoping to tap has already been identified as Level of Regard. Assessing convergent validity means that, in our original survey, we might also want to ask respondents to complete some related measures. That way, we don’t have to do a further round of surveying to be able to assess this psychometric property. (There’s a minimal sketch of this kind of correlation check after this list.)

  • Divergent validity is the opposite of convergent validity, and is essentially the degree to which our scale or subscale doesn’t correlate with a dimension that should be unrelated. For instance, our measure of how prized clients feel wouldn’t be expected to correlate with a measure of their degree of extraversion, or level of mental wellbeing. If it did, it would suggest that our measure is measuring something other than what we think it is. Measures of ‘social desirability’ are good tools to assess divergent validity against, because we really don’t want our measure to be associated with how positively people try to present themselves. As with assessing convergent validity, assessing divergent validity means that we may need to add a few more measures to our original survey, if we don’t want to go through a subsequent stage of additional data collection.

  • Structural validity is the degree to which the scores on the measure are an adequate reflection of the dimensions being assessed. EFA, as discussed above, can be used to identify one or more underlying dimensions, but this structure needs validating in further samples. So this means collecting more data (or splitting the original data into ‘exploratory’ and ‘confirmatory’ subsamples), and then the new data can be analysed using a procedure called confirmatory factor analysis (CFA). CFA is a complex statistical process (about a 9 on the 1-10 scale), but it essentially involves testing whether the new data fits to our ‘model’ of the measure (i.e., its hypothesised latent dimension(s) and associated items). CFA is a highly rigorous check of a measure, and it’s a procedure that’s pretty much essential now if you want to publish a measure development study in one of the higher impact journals.

  • Sensitivity to intervention effects is specific to outcome measures, and refers to the question of whether or not the measure picks up on changes brought about by therapy. We know that therapy, overall, has positive benefits, so if scores on a measure do not show any change from beginning to end of intervention, it suggests that the measure is not a particularly valid indicator of mental wellbeing or distress. To assess this sensitivity, we need to use the measure at two time points with clients in therapy: ideally at the start (baseline) and at the end (endpoint). Measures that show more change may be particularly useful for assessing therapeutic effects. For instance, in our psychometric analysis of a goal-setting measure for young people (the Goal Based Outcome Tool), we found that this measure indicated around 80% of the young people had improved in therapy, as compared with 30% for the YP-CORE measure of psychological distress.
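To make the convergent/divergent logic concrete, here’s a small sketch with simulated scores (the variable names and effect sizes are hypothetical, purely for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 250

# Hypothetical scores: our new 'prizing' measure, the established Level of
# Regard subscale (similar construct), and extraversion (unrelated construct).
regard = rng.normal(0, 1, n)
prizing = 0.6 * regard + rng.normal(0, 0.8, n)  # related but not identical
extraversion = rng.normal(0, 1, n)

print(pd.DataFrame({
    "prizing": prizing,
    "regard": regard,
    "extraversion": extraversion,
}).corr().round(2))
# Hoped-for pattern: a moderate prizing-regard correlation (convergent
# validity), near-zero prizing-extraversion (divergent validity), and
# nothing so high (> ~.80) that the new measure looks redundant.
```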

Advanced Testing

…And there’s more. That’s just some of the basic psychometric tests and, like I said earlier, there seem to be new ones to catch up with every day, with numerous journals and books on the topic. For instance, testing for ‘measurement invariance’ seems to be becoming increasingly prominent in the field: this uses complex statistical processes to look at whether the psychometrics of the measure are consistent across different groups, times, and contexts (this is about a 15 out of 10 for me!). And then there’s ‘Rasch analysis’ (see here), which uses another set of complex statistical procedures to explore the ways that respondents are scoring items (for instance, is the gap between a score of ‘1’ and ‘2’ on a 1-5 scale the same as the gap between ‘3’ and ‘4’?). So if you’re wanting to publish a measure development study in the highest impact journals, you’ll almost certainly need to have a statistician—if not a psychometrician—on board with you, if you’re not one already.

Developing Benchmarks

Once you’ve got a reliable and valid measure, you may want to think about developing ‘benchmarks’ or ‘cutpoints’, so that people know how to interpret the scores from it. This can be particularly important when you’re developing a clinical outcome measure. Letting a client know, for instance, that they’ve got a score of ‘16’ on the PHQ-9 measure of depression, in itself, doesn’t tell them too much; letting them know that this is in the range of ‘moderately severe depression’ means a lot more.

There’s no one way of defining or making benchmarks. For mental health outcome measures, however, what’s often established is a clinical cut-off point (which distinguishes between those who can be defined as being in a ‘clinical range’ and those in a ‘non-clinical range’); and a measure of reliable change, which indicates how much someone has to change on a measure for it to be unlikely that this is just due to chance variations. For instance, on the Young Person’s CORE measure of psychological distress, where scores can vary from 0 to 40, we established a clinical cut-off point of 10.3 for males in the 11-13 age range, and a reliable change index of 8.3 points (see here). The calculations for these benchmark statistics are relatively complex, but there are some online sites which can help, such as here. You can also set benchmarks very simply: for instance, for our Cooper-Norcross Inventory of Preferences, we used scores for the top 25% and bottom 25% on each dimension as the basis for establishing cut-off points for ‘strong preferences’ in both ways.
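For what it’s worth, the standard Jacobson-Truax formulas behind these two benchmarks are straightforward to compute once you have the descriptive statistics. Here’s a sketch with purely hypothetical numbers:

```python
import math

def clinical_cutoff(m_clin, sd_clin, m_nonclin, sd_nonclin):
    """Jacobson & Truax 'criterion c': the point between the clinical and
    non-clinical score distributions, weighted by their standard deviations."""
    return (sd_clin * m_nonclin + sd_nonclin * m_clin) / (sd_clin + sd_nonclin)

def reliable_change_index(sd_baseline, reliability):
    """Change needed for 95% confidence that it is not just measurement
    error: 1.96 times the standard error of the difference."""
    se_measurement = sd_baseline * math.sqrt(1 - reliability)
    return 1.96 * math.sqrt(2) * se_measurement

# Purely hypothetical numbers, for illustration only:
print(f"clinical cut-off: {clinical_cutoff(20.0, 7.0, 8.0, 5.0):.1f}")
print(f"reliable change: {reliable_change_index(7.0, 0.85):.1f} points")
```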

The Public Domain

Once it’s all finalised and you’re happy with your measure, you still need to think about how you’re going to let others know about it. There are some journals that specifically focus on the development of measures, like Assessment, though they’re by no means easy to get published in. Most counselling and psychotherapy journals, though, will publish measure development studies in the therapy field, and that puts your measure out into the wider public domain.

At this stage you’ll also need to finalise a name for your measure—and also an acronym. In my experience, the latter often ends up being the toughest part of the measure development process, though sites like Acronymify can help you work out what the options might be. Generally, you want a title that is clear and specific to what your measure is trying to do; and a catchy, easy-to-pronounce acronym. If the acronym actually means or sounds something like what the measure is about—like ‘CORE’—that’s even better.

If there are any complexities or caveats to the measure in terms of its use in research or clinical practice, it’s good to produce really clear guidelines for those who want to use it. Even a page or so can be helpful and minimise any ambiguities or potential problems with its application. Here is an example of the instructions we produced for our Goals Form.

It can also be great to develop a website where people can access the measure, its instructions, and any translations. You can see an example of this for our C-NIP website here.

Regarding translations, it’s important that people who may want to translate your measure follow a standardised procedure, so that it stays as consistent as possible with the original measure. For instance, a standard process is to ‘back translate’ an initial draft translation of the measure to check that the items still mean the same thing.

In terms of copyright, you can look at charging for use of the measure, but personally I think it’s great if people can make these freely available for non-commercial use. But to protect the measure from people amending it (and you really don’t want people doing their own modifications of your measure) you can use one of the Creative Commons licenses. With the measures I’ve been involved with, we’ve used ‘© licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)’ so that others can use it freely, but can’t change it or make money from its use (for instance, by putting it on their own website and then charging people to use it).

Conclusion

At the most advanced levels, measure development and testing studies can be bewildering. Indeed, even at the most basic level they can be bewildering—particularly for those who are unfamiliar with statistics. But don’t let that put you off. There’s a lot of the basic item generation and testing that you can do without knowing complex stats, and if you’re based at an institution there’s generally someone you can ask to help you with the harder stuff. There’s also loads of information that you can google. And what you get at the end of it is a way of operationalising something that may be of real importance to you: creating a tool which others can use to develop knowledge in this field. So although measure development research can feel hard, and like a glacially slow process at times, you’re creating something that can really help build up understandings in a particular area—and with that the potential to develop methods and interventions that can make a real difference to people’s lives.

Acknowledgements

Photo by Tran Mau Tri Tam ✪ on Unsplash

Disclaimer

 The information, materials, opinions or other content (collectively Content) contained in this blog have been prepared for general information purposes. Whilst I’ve endeavoured to ensure the Content is current and accurate, the Content in this blog is not intended to constitute professional advice and should not be relied on or treated as a substitute for specific advice relevant to particular circumstances. That means that I am not responsible for, nor will be liable for any losses incurred as a result of anyone relying on the Content contained in this blog, on this website, or any external internet sites referenced in or linked in this blog.