Test Validity! When did we last talk? 2010! What have you been up to since then?
It's a fact that seems to leave academics scoobied. When a person finishes their MA and gets a job outside the University, the keys to the Ivory Tower are taken away without ceremony. This means you can't keep up with published research and thought in your discipline (or on anything else academic). My keys were taken from me in 2010; I only got them back two months ago, and all the paradigm shifts during that time have passed me by. I'm having to work hard to catch up. For example, Bonfiglio's (2010) castigation of the "native speaker" construct and Garcia's (2009) taxonomical "emergent bilingual" are, I would suggest, significant shifts which I needed to get to grips with, and which have, furthermore, had time to mature.
Today and tomorrow I need to see what test validity has been up to since 2010. My starting point will be Language Testing (the assessment person's trade journal), starting with 2015 and working backwards, noting all promising-looking references but not going to them yet. [NB: I'm also going to be getting a handle on the bibliographical software, Mendeley and EndNote, to see whether they can work together, or preferably which one of them I can abandon.] In reading and writing I'm cognisant of the need to re-write formative submissions 1-3, so I will be critical at all times, will write good abstracts, and will keep in mind the overall structure of the literature review.
Here we go with Powers & Powers (2014). This paper looks at the ability of the TOEIC test to predict real-world abilities in the four domains, with particular reference to speaking. One of its authors is an employee of the commercial language testing company ETS, which produces the TOEIC test. "Validity" is equated with the meaning of test scores: "One of the most prominent standards concerns the meaning (or validity) of test scores" (p. 152).
This sentence very neatly (apparently) defines validity, but it could be regarded as an assertion. Yes, it can be argued that validity is a prominent standard, but is it simply the meaning of test scores? That does appear to be an oversimplification. But let's look at a simple example of construct validity. The construct is SVO order. The student must put the words "football love I" in the best order. The test score is binary: right or wrong.
The construct could be considered validated if a large proportion of learners produce "I love football". But what about those in the tested population who invert the subject and object? Perhaps they don't know the vocabulary, but they do recognise that they've got two nouns and a verb, and that in English we use SVO for simple clauses. Should those who choose "football I love" get 0.5? They have hit the SVO construct, but have not put the words in the best order. The point is that even with this apparently straightforward construct, at CEFR A1, we can query validity. These problems multiply in the more difficult domains, Listening and Speaking, and multiply exponentially at more sophisticated levels.
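To make the scoring question concrete, here's a toy sketch (entirely my own illustration, nothing to do with the paper): the binary scheme the item actually uses, alongside a hypothetical partial-credit scheme that gives 0.5 to the learner who hits the construct but misorders the words.

```python
# Toy illustration of two scoring schemes for the word-ordering item.
# The target answer and the 0.5 case are taken from the discussion
# above; everything else is an invented example.

TARGET = ["I", "love", "football"]

def score_binary(answer):
    """The test's scheme: 1 for the exact target order, 0 otherwise."""
    return 1.0 if answer == TARGET else 0.0

def score_partial(answer):
    """A hypothetical scheme: full credit for the target; half credit
    for the learner who inverts subject and object but shows awareness
    of the clause structure; 0 otherwise."""
    if answer == TARGET:
        return 1.0
    if answer == ["football", "I", "love"]:  # the 0.5 case discussed above
        return 0.5
    return 0.0

for answer in (["I", "love", "football"],
               ["football", "I", "love"],
               ["love", "I", "football"]):
    print(" ".join(answer), "->",
          "binary:", score_binary(answer),
          "partial:", score_partial(answer))
```

The point of the sketch is that the two schemes disagree about exactly the learner whose answer is most interesting for validity: the binary score throws away the evidence that the construct has been (partly) acquired.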
That's construct validity; with consequential validity there's a whole other layer of complexity and data to consider. We can validate items and whole tests, but it involves a lot of work beyond a straightforward, intelligent and informed consideration of test scores. [NB2 for submission: get refs on construct & consequential validity]
At p. 153, Powers & Powers refer to their "test users" (meaning, it would appear, stakeholders such as employers and educational institutions, as contrasted with "test takers"), who "seem to value most often, we believe, is a person's ability to communicate in English in contexts that require multiple language skills." Can't we have a bit of meat on this, please? Surely ETS will have been able to survey their "users" to see what they want? The hedging language "we believe" is revealing; it would not be necessary if we had some data, perhaps from a survey.
On p. 154 it is suggested that scores across domains are correlated, but "not to the degree that a measure of one can adequately substitute for a measure of another". Not "substitute", perhaps, but they do indicate. And then we have a funny sentence: "Although results are not completely consistent across studies, recent research has concluded, generally, that language skill entails multiple components." I could spend all day unpacking that one. Briefly, what do we mean by "language skill"? Of course overall language skill entails multiple components, and learning across the four domains will be useful in SLA terms [check refs]. But ESL teachers know about "spiky profiles", and they are not necessarily a handicap in real-world terms. The tour guide with good listening and speaking scores will be fine if she gets some help with reading and writing back at the office.
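The correlated-but-not-substitutable point can be sketched numerically. This toy example (invented numbers, not the paper's data) shows how two domain scores can correlate strongly at group level while an individual profile remains spiky, which is why one score indicates, but cannot substitute for, another.

```python
# Toy illustration: group-level correlation vs individual "spiky
# profiles". All scores are invented for the sketch.

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical speaking and writing scores for ten test takers.
speaking = [35, 42, 50, 55, 61, 66, 70, 78, 85, 90]
writing  = [30, 48, 45, 60, 55, 72, 58, 80, 82, 88]

print(f"group-level correlation: r = {pearson(speaking, writing):.2f}")

# The "tour guide" profile: strong speaking, noticeably weaker writing.
# A high group correlation tells us nothing reliable about her.
print("test taker 7: speaking", speaking[6], "vs writing", writing[6])
```

Even with a correlation above 0.9 in the group, test taker 7's writing score would badly mispredict her speaking, and vice versa.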
But the most interesting aspect of that sentence, from a critical perspective, is the device of admitting that research results "are not completely consistent." That's true, but it opens the door to listing the research which [I haven't checked it yet] agrees with this thesis, while the research which is not "consistent" can be ignored and left unreferenced, now that its existence has been admitted.
The Methods section of this paper tells us that, although the TOEIC covers much of the world, only Japanese and Korean test takers have been used in the data: "it is generally acknowledged that a very high proportion of TOEIC test takers come from Korea and Japan". We have to rely on this sweeping statement because TOEIC worldwide data is "not publicly available" (p. 155). Test scores in the domains are correlated with "self assessment statements", and it is admitted that this is not ideal, but "the best criteria are usually difficult, time-consuming, or otherwise infeasible to collect."
[NB3 come back to this later to scrutinise data].
On the point of bibliographic software, long story short, I'm going to stick with Mendeley, using the method of saving the PDF, dragging it into the app, annotating it there, and then pasting quotes and notes into the blog or Pages.
More fun with test validity next week.
REFERENCES
Bonfiglio, T. (2010). Mother tongues and nations. Berlin: De Gruyter Mouton.
Garcia, O. (2009). Emergent bilinguals and TESOL: What's in a name? TESOL Quarterly, 43(2), 322-326. https://doi.org/10.2307/27785009
Powers, D. E., & Powers, A. (2014). The incremental contribution of TOEIC® Listening, Reading, Speaking, and Writing tests to predicting performance on real-life English language tasks. Language Testing, 32(2), 151-167. https://doi.org/10.1177/0265532214551855