Saturday, March 19, 2016

2700 Words on Literature Review on Language Test Validation Paradigms and Migration

[Not proofed yet, and will likely want some editing before submission. Still undecided whether or not to give a few lines each to Weir, 2005, and Chapelle, 2012. Will finish it on Monday. Allotment tomorrow.]

This research is concerned with the validity of language tests used by the immigration authorities to measure the fitness of people resident in Scotland to become citizens of the United Kingdom. The first issue is to establish what is meant by validity in the context of assessing abilities in English as a second language (L2) for people who have a different first language (L1). For the purposes of this submission, I use ‘citizenship tests’ to mean the language element (in practice, a test in the speaking and listening domains at the B1 level of the Common European Framework of Reference). The literature reviewed could, however, apply equally to all tests of language ability connected with migration to the UK, and in particular to Scotland. These include the A1 level tests for spousal visas, and the B2 level tests for visa applicants who wish to work as Ministers of Religion.

The problem of language test validity was stated long ago by Kelley (1927) as being ‘whether a test measures what it purports to measure’ (p14). Lado (1961) put it thus: ‘Does a test measure what it is supposed to measure? If it does, it is valid’ (p321). But if we hold a citizenship test up to this formulation the difficulties are quickly apparent: what precisely is the test measuring? For Kiwan (2011), those involved in formulating UK policy on citizenship and citizenship education see its defining characteristic as ‘active participation’ (per Crick, 2000). Kiwan sees policy as being framed in educational terms (p271): all citizenship education is seen as ‘providing knowledge and skills of participation’, and there is an assumption on the part of policy makers that acquisition of such skills will lead to citizenship as ‘active participation’ (p272).

Biesta (2008) synthesises policy on citizenship education to find four characteristics of official views of what citizenship means in Scotland: a focus on capabilities, which suggests a powerful emphasis on the individual; a concept wide enough to take in ‘political, economic, social and cultural life’; citizenship that is active, not passive; and citizenship that must be exercised in a community (p40).

It can be seen that the concept of citizenship, both in the UK as a whole and in Scotland, is, whilst tangible, far from straightforward. How then can tests which purport to assess a person’s fitness to exercise citizenship be validated? More specifically, how can we validate a test which purports to show that a person has reached a stage of sophistication in an L2 which will enable them to exercise citizenship?

Linn et al (1991) observed that, in an era when language testing was moving away from multiple choice items and into more sophisticated tests of ability, existing validity paradigms were insufficient (p16). Traditional practice was to validate through correlation (often described as ‘criterion validity’ in the literature), whereby test scores would be compared with other measures of the same ability, such as teacher norming or other tests (p16). These routes to validity are not open in language tests of fitness for citizenship. (The ‘Knowledge of Life in the UK’ aspect of the citizenship test, which is assessed by multiple choice items, could actually be validated on the basis of correlation, since it is taught from a single text. However, the limits of ‘criterion validity’ can be seen in the fact that a high score on such a test could be achieved through rote-learning, which policy makers would be unlikely to endorse as valid.)

Chapelle (1999) suggests that our understanding of a language test’s validity is essential so that we can say it is a good test for the situation in which it is being deployed. She also underlines (p264) the responsibility of those who give tests to demonstrate their test’s fitness for the use to which it is being put. No such demonstration has been given in respect of UK citizenship tests, which have now been in place for a decade.

Recall that Kiwan (2011) opined that policy makers saw the acquisition of knowledge and skills as enabling people to participate actively as citizens. Messick (1984) would agree that what a person knows and what she can do are markers of educational achievement in any given area (p155). So, for example, taking a detail of one of Biesta’s (2008) four characteristics, namely political life: the new Scot aspiring to citizenship should know the names of the principal Scottish political parties, and should be able to ask and answer questions about each party’s policies.

However, Messick’s understanding of knowledge and ability goes further. Knowledge is not only declarative and procedural, (the what and the how), it is also strategic, (the which, the when and the why), (p156). Ability can be regarded as a ‘multidimensional organization of stable assemblies of information-processing components that are combined functionally in task performance, learning, problem solving, and creative production’. It is multidimensional because it operates in multiple situations: for example, in asking and answering questions about Scottish politics, a person is using the same abilities he uses in asking and answering questions about (say) the growing of turnips. Functional combinations exist when abilities act as sub-routines for others: when deploying an ability to ask and answer questions about politics, our new Scot could deploy pragmatic sub-routines (such as turn-taking) learned in any conversational setting.

For each of these multiple and interdependent constructs of knowledge and ability, Messick demands that we assess by developing construct-specific measurements which, as our understanding of the constructs develops over time, must be recalibrated appropriately (p169). So for Messick, only construct-specific measures can be validated, and we must acknowledge that such validation will always be provisional, because our understanding evolves over time.

It has been objected that Messick’s approach, whilst fundamentally sound, is impractical. Chalhoub-Deville, 2015, notes that his ‘hold on the [language testing] profession is waning’. That may be so, but construct validity, (as it is usually referred to) still has relevance. We can frame the knowledge and abilities required of new Scots as constructs, for which we could attempt to develop specific measurements.

Bachman (2005) builds on Toulmin’s (2003) argument theory to put forward an argument-based form of test validity. This involves those relying on the test putting forward a claim for the interpretation of a test score. The claim is based on data, with the step from data to claim supported by a warrant. However, because a test performance has many variables, the claim is subject to rebuttal (Bachman, 2005, p9). In the case of a citizenship language test, the claim might be that a score in the test indicated that the test taker was fit to engage in citizenship in the UK. Data would come from the test score. This might in turn be obtained from a rubric based on Common European Framework of Reference (CEFR) ‘Can Do’ statements. The warrants for these scores are more problematic: if we rely on the CEFR itself for backing, we get caught up in a kind of validity negative feedback loop (see Weir, 2005). There is nothing in research or theory to link CEFR descriptors to the citizenship frameworks outlined in Kiwan (2011) and Biesta (2008).

Test validity was somewhat reframed by Bachman (2005), who, in addition to examining argument theory, put forward the new concept of the assessment use argument. This was developed by Bachman & Palmer (2010), a work which, whilst being a significant development in test validity theory, actually sets out to be a practical guide for test development ‘in the real world’. (The term justification is preferred over validity: Bachman & Palmer, 2010, p135n.)

Building on Toulmin’s (2003) argument theory, Bachman & Palmer urge that all language assessments be subjected to overt protocols for both their development and use. This is done by means of an Assessment Use Argument (AUA) (p30). The AUA consists of a series of four claims, together with warrants for those claims (which are themselves backed up) (p103). The claims should be: that the test’s consequences are beneficial; that decisions made as a result of it are sensitive to societal values, and fair; that interpretations of test scores are ‘meaningful’, ‘generalizable’, ‘relevant’, ‘impartial’ and ‘sufficient’ to enable decisions to be made; and finally, that test data (records of scores) are consistent (p105-127).

In connection with the claim of (beneficial) consequences, Bachman & Palmer introduce the important topic of ‘stakeholders’ (p255), who are identified as people and organisations affected by the test’s usage and by the decisions made as a result of it. Stakeholders in a citizenship test in Scotland would be the aspiring new Scot, the organisation that designs and delivers the test, and policy makers in the UK and Scottish Governments. But it could be argued that, as the test-taker aspires to citizenship, existing citizens, in particular members of her community, are also stakeholders. Claims for any language test upon which citizenship depends need to focus on the beneficial consequences of test results for all of these stakeholders.

The argument form of validity is further delineated by Kane (2013), who refines Bachman & Palmer’s Assessment Use Argument (AUA) into an Interpretation/Use Argument (IUA), because he is minded to give interpretation of test scores ‘equal billing’ with use (p2). In ‘the real world’ (Bachman & Palmer, 2010) the distinction may seem rather fine: if a test score is interpreted to mean that the test taker is unfit for citizenship, it is difficult to see a difference if the test score is used to the same end. Kane observes that test scores will not be used to say that a test taker got a certain score on a certain date but rather that he (amongst other interpretations) ‘has some standing on a trait’, a useful formulation of what might be claimed for a citizenship language test.

Claims made for tests often go ‘far beyond’ a test score, and as they are usually not self-evident, such claims will ‘merit evaluation’. Moreover, claims for test score use in the public domain require public justification (Kane, 2013, p1). In practical terms, Kane sees such validation or justification arising when the IUA is developed alongside the test, so that the IUA reflects the proposed use to which test scores will be put. Once the test and the IUA are developed, there should be a critical ‘appraisal stage’, conducted if possible by a ‘neutral or skeptical’ third party (Kane, 2013, p16-17). Any interpretation or use of test scores which has been critically appraised with regard to ‘coherence’ and ‘plausibility’ can be regarded as valid, unless and until ‘new evidence’ demands a re-evaluation (p18).

It is suggested that Bachman & Palmer’s (2010) guidance for test validity arising from the Assessment Use Argument, taken with Kane’s (2013) overlapping Interpretation/Use Argument framework, together mark the current paradigm of language test validity. But the matter is by no means settled. Chalhoub-Deville (2015) addresses validity in the shadow of global educational reform movements (Sahlberg, 2011) which ‘increasingly employ high-stakes accountability tests’. Manifestations of these reform movements are ‘Race to the Top’ in the US (p3) and Curriculum for Excellence in Scotland.

Chalhoub-Deville (2015) is pertinent to this work for two reasons. First, in respect of understanding the current paradigms of test validity theory, it reminds us of the central importance of test consequences in validity frameworks and in any future validity research (p12). Second, it does so in relation to educational reform, and in the UK, and in Scotland, citizenship testing has developed in parallel to citizenship education (Kiwan, 2011; Biesta, 2008). Consequences can also be termed impact, backwash and washback (p6). Chalhoub-Deville conceptualises consequences in a validity framework (for reformed education) through the prism of Theory of Action (Bennett et al, 2011), which gives an ineluctably social aspect to validation (p10). It also means (p13) that all stakeholders need to be privy to documentation concerning test development and appraisal.

No language testing validation theorist has addressed, in published research, specific questions regarding language tests used in relation to migration. Chalhoub-Deville (2015) is concerned with global educational reform movements. The work of both Bachman and Kane is of general application. The former has said that ‘all language testing connected with migration is ethically problematical’ (L. Bachman, personal communication, 2012); the latter is primarily concerned with large scale high stakes testing. Both Kiwan (2011) and Biesta (2008), whilst considering questions regarding citizenship in education, do not refer to test validity.

Researchers have considered the use of language tests as ‘boundary objects’ (Macqueen et al, 2015, in relation to health services), but the validity of (or justification for, per Bachman & Palmer, 2010) such tests, as Chalhoub-Deville’s language testing ‘profession’ would understand those terms, has not been the subject of any research known to this author. No stakeholders in any tests used for the purposes of migration have published Assessment Use Arguments, Interpretation/Use Arguments, or any other validation documentation.


Bachman, L. F. (2005). Building and Supporting a Case for Test Use. Language Assessment Quarterly: An International Journal, 2(1), 1–34.

Bachman, L. F., & Palmer, A. (2010). Language assessment in practice: Developing language assessments and justifying their use in the real world. Oxford, UK: Oxford University Press.

Bennett, R. E., Kane, M., & Bridgeman, B. (2011). Theory of action and validity argument in the context of through-course summative assessment. Paper presented at invitational Research Symposium on Through Course Summative Assessment, Atlanta, GA. Retrieved from: Bridgeman.pdf.

Biesta, G. (2008). What kind of citizen? What kind of democracy? Citizenship Education and the Scottish Curriculum for Excellence. Scottish Educational Review, 40(2), 38–52.

Chalhoub-Deville, M. (2015). Validity theory: Reform policies, accountability testing, and consequences. Language Testing.

Chapelle, C. A. (1999). Validity in language assessment. Annual Review of Applied Linguistics, 19, 254–272.

Chapelle, C. A. (2012). Validity argument for language assessment: The framework is simple… Language Testing, 29, 19–27.

Crick, B. (2000). In Defence of Politics (5th ed.). London and New York: Continuum.

Kane, M. (1982). A sampling model for validity. Applied Psychological Measurement, 6, 125–160.

Kane, M. (1992). An argument-based approach to validation. Psychological Bulletin, 112, 527–535.

Kane, M. (2013). Validating the Interpretations and Uses of Test Scores. Journal of Educational Measurement, 50(1), 1–73.

Kelley, T. L. (1927). The Interpretation of Educational Measurements. Yonkers-on-Hudson, NY: New World Book Company.

Kiwan, D. (2011). “National” citizenship in the UK? Education and naturalization policies in the context of internal division. Ethnicities, 11(3), 269–280.

Lado, R. (1961). Language Testing: The Construction and Use of Foreign Language Tests. A Teacher's Book. London: Longmans.

Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex, Performance-Based Assessment: Expectations and Validation Criteria. Educational Researcher, 20(November 1991), 15–21.

Macqueen, S., Pill, J., & Knoch, U. (2015). Language test as boundary object: Perspectives from test users in the healthcare domain. Language Testing, 1–18.

Messick, S. (1984). Abilities and Knowledge in Educational Achievement Testing: The Assessment of Dynamic Cognitive Structures. In B. S. Plake (Ed.), Social and Technical Issues in Testing: Implications for Test Construction and Usage. Hillsdale, NJ: Lawrence Erlbaum Associates.

Sahlberg, P. (2011). Global educational reform movement is here! [Blog Post] Retrieved from:

Toulmin, S. E. (2003). The uses of argument. Cambridge, UK: Cambridge University Press.

Weir, C. J. (2005). Limitations of the Common European Framework for developing comparable examinations and tests. Language Testing, 22(3), 281–300.