CLA: China Language Assessment (中国外语测评中心)

Latest Paper Picks | Assessment Literacy, Automated Speaking Scoring, Gaokao Test Paper Analysis

2020-09-11

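Every two months, we handpick newly published research in language testing and provide basic information such as abstracts, authors, and keywords. The aim is to help readers interested in language testing keep up with the field's latest research directions and findings, and to steadily build their assessment literacy and research skills.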

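This issue covers papers from July and August 2020, drawn from journals including China Examinations (中国考试), Educational Measurement and Evaluation (教育测量与评价), Language Assessment Quarterly, and Language Testing. The recommended papers mainly address validation of automated scoring for speaking tests, foreign language teachers' assessment literacy, score report design, analysis of the Gaokao English papers, applications of the Rasch model in assessment, and a review of at-home online tests.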

▶ 01 A Preliminary Study of the Validity of Automated Scoring for the College English Test Band 4 Spoken English Test

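Authors: Jin Yan (金艳); Wang Wei (王伟); Zhang Xiaoyi (张晓艺); Zhao Yinghua (赵英华)
Journal: China Examinations (中国考试)
Issue: No. 7
Abstract: To verify the validity of the automated scoring system for the College English Test Band 4 Spoken English Test, this study adopts an argument-based scoring validity framework and focuses on three inferences: evaluation, generalization, and explanation. It argues for the system's validity through a comparative analysis of human and machine scores and through experts' descriptions of the typical speaking features of test takers at each level. The results show that human and machine scores correlate well and agree well at the level of score bands, although machine scores are slightly less dispersed than human scores. Machine scoring is not equally sensitive to all language features: it is relatively sensitive to language accuracy and to the relevance and richness of content, but discriminates poorly on pronunciation and strategy use. Validating the automated scoring system will require continued research on other dimensions.
Keywords: College English Test Band 4 and Band 6; speaking test; automated scoring; scoring validity; validity argument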
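
For readers new to human-machine score comparison, the three statistics mentioned in the abstract above (correlation, band-level agreement, and score dispersion) can be illustrated with a minimal sketch. The score lists below are hypothetical and invented purely for illustration; they are not data from the paper, and the function names are our own. The paper may well use different or additional agreement indices.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def exact_agreement(x, y):
    """Share of test takers placed in the same band by both raters."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


# Hypothetical band scores for eight test takers (not from the study).
human = [1, 2, 2, 3, 3, 4, 4, 5]
machine = [2, 2, 2, 3, 3, 3, 4, 4]

r = pearson_r(human, machine)
agreement = exact_agreement(human, machine)
# A smaller standard deviation means the scores are less spread out.
sd_human, sd_machine = pstdev(human), pstdev(machine)

print(f"r = {r:.2f}, exact agreement = {agreement:.3f}")
print(f"SD human = {sd_human:.2f}, SD machine = {sd_machine:.2f}")
```

In this invented example the machine scores cluster toward the middle bands, so their standard deviation is smaller than that of the human scores even though the two sets correlate highly, which is the same pattern the abstract describes.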

▶ 02 Revisiting Foreign Language Teachers' Language Assessment Literacy: Based on Interviews with Language Testing Experts

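Author: Pan Mingwei (潘鸣威)
Journal: China Examinations (中国考试)
Issue: No. 7
Abstract: Drawing on the relevant literature and on in-depth interviews with 10 experts and scholars in foreign language assessment in China, this study identifies the blind spots and weak points in Chinese foreign language teachers' language assessment literacy, extracts new developments in and extensions of the language assessment literacy construct, and proposes a revised model of the construct comprising language assessment knowledge, language assessment skills, language assessment principles, and receptiveness to language assessment. It also analyzes the priorities and difficulties in improving language assessment literacy, with the aim of offering a reference for developing the language assessment literacy of foreign language teachers at every stage of education.
Keywords: assessment literacy; language testing; foreign language teachers; foreign language education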

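▶ 03 Implementing the Assessment Framework, Promoting All-Round Development, Testing Key Competencies, Highlighting the Direction of Reform: A Review of the 2020 Gaokao English National Papers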

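Author: National Education Examinations Authority, Ministry of Education (教育部考试中心)
Journal: China Examinations (中国考试)
Issue: No. 8
Abstract: Guided by the overall requirements of the Gaokao assessment framework and oriented toward subject competencies, the 2020 Gaokao English papers deepen the reform of test content and strengthen the assessment of key competencies such as reading comprehension, practical writing, language expression, and critical thinking. They keep the test structure stable and control test difficulty appropriately to ensure fairness. The items aim to align with the reform of how senior high schools educate students and to steer secondary school English teaching toward strengthening students' key competencies.
Keywords: Gaokao; new Gaokao; Gaokao English; Gaokao item writing; Gaokao assessment framework; test content reform; test item review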

▶ 04 The Design of the TOEFL Junior Score Report and Its Implications

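Author: Guo Jie (郭洁)
Journal: Educational Measurement and Evaluation (教育测量与评价)
Issue: No. 8
Abstract: The TOEFL Junior test has been administered in China for nearly ten years and is a product of the trend toward studying abroad at a younger age. Its score report provides not only the total score and an overall performance description but also section scores and ability descriptions for Listening Comprehension, Language Form and Meaning, and Reading Comprehension. It also reports the level obtained by aligning scores with the Common European Framework of Reference for Languages (CEFR), and a Lexile measure obtained by aligning the test taker's Reading Comprehension performance with the Lexile reading framework. The score report reflects test takers' English learning fairly comprehensively and accurately, is characterized by standardization, authority, alignability, and practicality, and gives schools, students, and parents an objective international standard for assessing English. Drawing on this design, score reports for large-scale tests in China should go beyond reporting scores: they should describe ability levels in each subject area in detail, provide reference levels for student ability, offer learning suggestions that guide and support students' study after class, and improve their wording, for example by making more use of encouraging, advisory, positively framed descriptions.
Keywords: TOEFL Junior test; score report; international language assessment instrument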

▶ 05 Exploring the Construct Validity of the ECCE: Latent Structure of a CEFR-Based High-Intermediate Level English Language Proficiency Test

Authors: Minkyung Kim; Scott A. Crossley
Journal: Language Assessment Quarterly
Published: Published online July 4, 2020
Abstract: This paper focuses on the Examination for the Certificate of Competency in English (ECCE), which is based on the CEFR and assesses high intermediate level English proficiency. Specifically, the study explores the latent structure of the ECCE and its generalizability across groups (i.e., gender, age, and first language [L1]) to examine its construct validity, dimensionality of language proficiency, and commensurability with the CEFR macro-functions. The results indicated that test-takers’ performance on the ECCE could be best represented by a correlated three-factor model (i.e., reading/listening/lexico-grammar, writing, and speaking abilities). The correlated three-factor model also held irrespective of gender, age, and L1 (with the exception of vocabulary scores). Overall, the findings indicate that the correlated three-factor model is consistent with the constructs that the ECCE proposes to measure, is in line with the current multi-componential view of language proficiency, and is partly commensurate with the CEFR macro-functions.

▶ 06 A comprehensive review of Rasch measurement in language assessment: Recommendations and guidelines for research

Authors: Vahid Aryadoust; Li Ying Ng; Hiroki Sayama
Journal: Language Testing
Published: First Published July 8, 2020
Abstract: In the present study, we reviewed and coded 215 papers using Rasch measurement published in 21 applied linguistics journals for multiple features. We found that seven Rasch models and 23 software packages were adopted in these papers, with many-facet Rasch measurement (n=100) and Facets (n=113) being the most frequently used Rasch model and software, respectively. Significant differences were detected between the number of papers that applied Rasch measurement to different language skills and components, with writing (n=63) and grammar (n=12) being the most and least frequently investigated, respectively. In addition, significant differences were found between the number of papers reporting person separation (n=73, not reported: n=142) and item separation (n=59, not reported: n=156) and those that did not. An alarming finding was how few papers reported unidimensionality check (n=57 vs 158) and local independence (n=19 vs 196). Finally, a multilayer network analysis revealed that research involving Rasch measurement has created two major discrete communities of practice (clusters), which can be characterized by features such as language skills, the Rasch models used, and the reporting of item reliability/separation vs person reliability/separation. Guidelines and recommendations for analyzing unidimensionality, local independence, data-to-model fit, and reliability in Rasch model analysis are proposed.
Keywords: Fit; language assessment; local independence; network analysis; modularity maximization method; Rasch measurement; reliability and separation; unidimensionality
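
For readers unfamiliar with Rasch measurement, the sketch below shows the basic dichotomous Rasch model, in which the probability of a correct response depends only on the difference between a person's ability and an item's difficulty (both in logits). This is the simplest member of the Rasch family, offered only as background; it is not the many-facet model that the review found most common, and the function names and numeric values are our own illustrative assumptions.

```python
from math import exp


def rasch_probability(ability, difficulty):
    """P(correct response) under the dichotomous Rasch model.

    Both arguments are on the same logit scale.
    """
    return 1.0 / (1.0 + exp(-(ability - difficulty)))


def expected_raw_score(ability, difficulties):
    """Expected number of correct answers across a set of items."""
    return sum(rasch_probability(ability, b) for b in difficulties)


# Hypothetical item difficulties in logits (illustrative only).
items = [-1.0, 0.0, 1.0]

# A person whose ability equals an item's difficulty has a 50% chance.
print(rasch_probability(0.0, 0.0))
print(round(expected_raw_score(0.0, items), 3))
```

Because the hypothetical difficulties are symmetric around the person's ability, the expected raw score here is half the number of items.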

▶ 07 Test review: Current options in at-home language proficiency tests for making high-stakes decisions

Authors: Daniel R. Isbell; Benjamin Kremmel
Journal: Language Testing
Published: First Published July 16, 2020
Abstract: The switch to accepting at-home proficiency tests for high-stakes decisions raises many concerns for stakeholders, such as technological demands, exam security, and validity of score use. Along these lines, this thematic review addresses such concerns and features brief reviews of seven options in at-home proficiency testing: ACTFL Assessments, Duolingo English Test, IELTS Indicator, LanguageCert, TEF Express, TOEFL iBT Special Home Edition, and Versant. Considering at-home testing more broadly, we discuss key considerations for selecting an at-home test. We close with speculation on how at-home tests may shape language testing going forward: Beyond adapting to the current pandemic, at-home testing might address longstanding issues in access to language testing services and the representation of real-world communication practices in language tests.
Keywords: ACTFL Assessments; at-home test; Duolingo English Test; IELTS Indicator; LanguageCert; security; TEF Express; TOEFL iBT Special Home Edition; Versant

▶ 08 Predicting communicative effectiveness in the international workplace: Support for TOEIC Speaking test scores from linguistic laypersons

Authors: Jonathan Schmidgall; Donald E. Powers
Journal: Language Testing
Published: First Published July 21, 2020
Abstract: In this study we examined the extent to which TOEIC® Speaking test scores relate to evaluations by professionals in the international workplace, the target language use domain of TOEIC tests. Linguistic laypersons in 10 countries were invited to participate in an online research survey. The survey incorporated a stratified sample of test-taker (N = 99) responses to three representative tasks from the TOEIC Speaking test (reading a text aloud, responding to questions, expressing an opinion) that were cast as workplace role-play tasks. After completing each role-play task, participants used brief, descriptive six-point rating scales to rate the communicative effectiveness (comprehensibility, task fulfillment, elaboration, and coherence) of each of several speakers. Communicative effectiveness ratings from linguistic laypersons were strongly correlated with TOEIC Speaking test scaled scores (r = 0.84). In addition, regression analysis was used to plot the relationship between layperson and test-based evaluations of speaking proficiency. Results suggested that test takers’ performances can be expected to be perceived as effective at score ranges typically associated with important decisions. The results are discussed in terms of their implications for claims about the generalizability of TOEIC Speaking test score interpretations in relation to the evaluations of linguistic laypersons in the international workplace.
Keywords: Communicative effectiveness; comprehensibility; English as a lingua franca; speaking; validation; workplace
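
The regression step mentioned in the abstract above can be pictured with a minimal sketch: fit a straight line that predicts the layperson rating from the test score, then read off the rating expected at a given score. The abstract does not specify the regression model used, so ordinary least squares with one predictor is our assumption, and the data points below are invented for illustration and are not from the study.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


# Hypothetical scaled scores and mean six-point layperson ratings.
scores = [100, 120, 140, 160, 180]
ratings = [2.0, 2.9, 3.5, 4.4, 5.0]

slope, intercept = fit_line(scores, ratings)

# Rating the fitted line predicts for a hypothetical score of 160.
predicted = intercept + slope * 160
print(f"slope = {slope:.4f}, intercept = {intercept:.2f}")
print(f"predicted rating at 160 = {predicted:.2f}")
```

A fitted line of this kind is what lets a score range be translated into an expected level of perceived communicative effectiveness.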