International Journal of Testing ISSN: 1530-5058 (Print) 1532-7574 (Online) Journal homepage: www.tandfonline.com/journals/hijt20 Fitting the mixed Rasch model to the listening comprehension section of the IELTS: Identifying latent class differential item functioning Farshad Effatpanah, Purya Baghaei, Hamdollah Ravand & Olga Kunina- Habenicht To cite this article: Farshad Effatpanah, Purya Baghaei, Hamdollah Ravand & Olga Kunina- Habenicht (2025) Fitting the mixed Rasch model to the listening comprehension section of the IELTS: Identifying latent class differential item functioning, International Journal of Testing, 25:1, 50-89, DOI: 10.1080/15305058.2024.2414423 To link to this article: https://doi.org/10.1080/15305058.2024.2414423 © 2024 The Author(s). Published with license by Taylor & Francis Group, LLC. Published online: 20 Oct 2024. Submit your article to this journal Article views: 1466 View related articles View Crossmark data Citing articles: 2 View citing articles Full Terms & Conditions of access and use can be found at https://www.tandfonline.com/action/journalInformation?journalCode=hijt20 https://www.tandfonline.com/journals/hijt20?src=pdf https://www.tandfonline.com/action/showCitFormats?doi=10.1080/15305058.2024.2414423 https://doi.org/10.1080/15305058.2024.2414423 https://www.tandfonline.com/action/authorSubmission?journalCode=hijt20&show=instructions&src=pdf https://www.tandfonline.com/action/authorSubmission?journalCode=hijt20&show=instructions&src=pdf https://www.tandfonline.com/doi/mlt/10.1080/15305058.2024.2414423?src=pdf https://www.tandfonline.com/doi/mlt/10.1080/15305058.2024.2414423?src=pdf http://crossmark.crossref.org/dialog/?doi=10.1080/15305058.2024.2414423&domain=pdf&date_stamp=20%20Oct%202024 http://crossmark.crossref.org/dialog/?doi=10.1080/15305058.2024.2414423&domain=pdf&date_stamp=20%20Oct%202024 https://www.tandfonline.com/doi/citedby/10.1080/15305058.2024.2414423?src=pdf https://www.tandfonline.com/doi/citedby/10.1080/15305058.2024.2414423?src=pdf https://www.tandfonline.com/action/journalInformation?journalCode=hijt20 InternatIonal Journal of testIng 2025, Vol. 25, no. 1, 50–89 Fitting the mixed Rasch model to the listening comprehension section of the IELTS: Identifying latent class differential item functioning Farshad Effatpanaha , Purya Baghaeib , Hamdollah Ravandc and Olga Kunina-Habenichta aresearch unit of Psychological assessment, tu Dortmund university, Dortmund, germany; bDepartment of english, Islamic azad university, Mashhad Branch, Mashhad, Iran; cDepartment of english, Vali-e-asr university of rafsanjan, rafsanjan, Iran ABSTRACT This study applied the Mixed Rasch Model (MRM) to the listening comprehension section of the International English Language Testing System (IELTS) to detect latent class differential item functioning (DIF) by exploring multiple profiles of second/foreign language listeners. Item responses of 462 examinees to an IELTS listening test were sub- jected to MRM analysis. Three classes emerged: (1) ‘Medium-level Stimulus Processors’ who can somewhat synchronize top-down and bottom-up processing, handle multitasking to a certain extent, com- prehend moderately complex items, and manage input delivered at a relatively fast pace; (2) ‘High-level Stimulus Processors’ who have greater abilities in synchronizing top-down and bottom-up processing, multi- tasking, understanding complex items, and handling fast delivery input and more paraphrased content; and (3) ‘Low-level Stimulus Processors’ who rely more on bottom-up processing, have limited lexi- co-grammatical knowledge, struggle with multitasking and complex items, and find fast delivery input and paraphrased content challeng- ing. Differences across the classes were further explained. Introduction Listening comprehension is the ability to process, integrate, and discern implicit and explicit meaning from perceptual oral and/or visual stimuli (Buck, 2001). This skill represents a highly intricate and multidimensional cognitive process in which several (meta)cognitive and (non)linguistic skills are involved (Du & Man, 2022). Despite its pivotal role in language acquisition, production, daily communication, and academic learning (Graham, 2017), listening remains the least-researched skill, often referred to as the “Cinderella skill” in second/foreign language (L2) learning (Field, 2013). This underrepresentation is evident in the limited research on listening test performance and the distinct cognitive processes that lead to unique listening https://doi.org/10.1080/15305058.2024.2414423 © 2024 the author(s). Published with license by taylor & francis group, llC. CONTACT farshad effatpanah farshad.effatpanah@tu-dortmund.de research unit of Psychological assessment, tu Dortmund university, emil-figge street 50, 44227 Dortmund, germany this is an open access article distributed under the terms of the Creative Commons attribution license (http://creativecommons.org/ licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. the terms on which this article has been published allow the posting of the accepted Manuscript in a repository by the author(s) or with their consent. KEYWORDS IELTS; latent class differential item functioning; L2 listening comprehension; mixed Rasch model; multiple profiles http://orcid.org/0000-0003-3970-5588 http://orcid.org/0000-0002-5765-0413 http://orcid.org/0000-0002-8757-3850 http://orcid.org/0000-0002-1646-8260 https://doi.org/10.1080/15305058.2024.2414423 mailto:farshad.effatpanah@tu-dortmund.de http://creativecommons.org/licenses/by/4.0/ http://creativecommons.org/licenses/by/4.0/ http://crossmark.crossref.org/dialog/?doi=10.1080/15305058.2024.2414423&domain=pdf&date_stamp=2025-2-12 http://www.tandfonline.com InTERnaTIOnaL JOuRnaL OF TESTIng 51 proficiencies (Aryadoust, 2015; Goh & Vandergrift, 2022). As a result, assessing listening comprehension ability becomes crucial across a wide spectrum of educational and pro- fessional settings, where accurate measurement can provide insight into language learn- ers’ proficiencies. Listening comprehension tests, designed to assess examinees’ understanding of spoken stimuli, are widely employed in language testing and assessment worldwide, serving as a means to measure individuals’ listening proficiency. These tests are commonly integrated into a variety of standardized language proficiency assess- ments, each designed to fulfill distinct objectives such as placement, achievement, evaluation, and diagnostic purposes. Researchers have categorized listening tests into two types: post-listening performance (PLP) tests and while-listening performance (WLP) tests (Aryadoust, 2012, p. 41). PLP tests involve examinees listening to oral input, taking notes, and subsequently responding to a set of test items (Marx et  al., 2017). Notable examples of such tests include the Michigan Language Assessment Battery (MELAB; Goh & Aryadoust, 2010) and the Test of English as a Foreign Language (TOEFL®; Isbell & Kremmel, 2020). WLP tests, by contrast, necessitate that examinees listen attentively to oral stimuli while simultaneously reading items, tak- ing notes, selecting or composing responses, and keeping pace with the ongoing oral input stream (Aryadoust, 2012, p. 41). Prominent instances of WLP tests encompass the Certificate in Advanced English (CAE; Geranpayeh & Kunnan, 2007) and the International English Language Testing System (IELTSTM; Isbell & Kremmel, 2020), developed by the University of Cambridge ESOL (English for Speakers of Other Languages) Examination Syndicate. Despite the widespread utilization of WLP tests in high-stakes standardized assessments, little attention has been devoted to such tests. The literature in this area is scant, and there is a controversy among research- ers concerning the theoretical and practical aspects of these tests (Aryadoust, 2012; Field, 2009). In most listening tests, test takers are presented with audio recordings, which can range from conversations and interviews to academic lectures or everyday spoken lan- guage scenarios. These recordings are followed by questions designed to evaluate the test takers’ comprehension of the content. Test takers usually need to select the correct answer from several options, complete sentences or summaries using information from the audio, determine the truth of a statement based on what they hear, provide brief written responses based on the information from the audio, or match speakers or pieces of information to specific categories or statements. Scoring is typically binary, with responses being marked as either correct (1 point) or incorrect (0 point); partial credit is not given for incomplete or partially correct answers. The total score is calculated as the sum of all correct responses, which may then be converted into a scaled score to reflect the overall proficiency level. Furthermore, listening tests generally measure a range of cognitive skills necessary for understanding spoken language, including under- standing explicit information, making inference and deduction, recognizing opinions, attitudes, or purposes, and following complex sequences. Many listening tests also report a composite score that reflects overall listening proficiency. However, some tests may report scores for specific sub-skills or dimensions of listening ability (e.g., understanding main ideas vs. specific details). These subscales, if present, are usually designed to give 52 F. EFFaTPanaH ET aL. more granular feedback to learners and instructors about specific areas of strength or weakness (Rost, 1990). Scores derived from listening tests represent examinees’ listening proficiency levels and furnish valuable evidence for making informed decisions about each examinee. When interpreting the results of listening tests, test users and developers must priori- tize the validity of these tests—ensuring that they are appropriate for their intended purposes and uses (He & Jiang, 2020). This emphasis on validity is of paramount importance as the inferences and decisions drawn from test scores bear consequences for all stakeholders involved. Consequently, concerns pertaining to the quality and reli- ability of listening tests remain prevalent within the field of language assessment research. In the field of educational testing, two main statistical models are used to scale examinees. These models are the Item Response Theory (IRT; Embretson & Reise, 2000) and the Rasch model (Rasch, 1980). These models determine the position of examinees’ scores along a single continuous proficiency scale. Subsequently, these scores are utilized to facilitate comparisons or rank-ordering of examinees against specific criteria or in relation to their peers. The primary assumption underlying psychometric models is that individual differences among examinees are quantitative variations (Rost, 1990). While it is true that all examinees employ varying degrees of the same strategies and skills to answer a set of test items, it is important to recognize that they may employ these strategies and skills in distinct ways or even opt for entirely different solution patterns or strategies. This indicates the presence of qualitative or structural variations among examinees as well (Rost, 1990). Consequently, test scores reflect not only quantitative differences in a single con- struct but also represent the qualitative differences in the types of strategies, skills, or processes that examinees adopt to give a correct response to a given test item. Without acknowledging these qualitative differences, examinees would essentially be compared across different test constructs using the same test (Baghaei et  al., 2019). A statistical approach that proves invaluable in specifying the test construct and, nota- bly, in identifying both quantitative and qualitative distinctions among examinees is the analysis of Differential Item Functioning (DIF; Holland & Wainer, 1993). DIF methods are used to ascertain whether different subgroups with the same levels of ability exhibit varying responses to specific test items. When subgroups employ different strategies and processes in addressing test items, observed DIF signals the presence of qualitative differ- ences among them. As noted by De Ayala and Santiago (2017), these differences may stem from variations in language or educational backgrounds, self-concept, and diverse test-taking experiences, among other factors. The occurrence of DIF calls into question the assumption of unidimensionality and raises doubts about the credibility of test score interpretations and uses. Most significantly, DIF signifies that the test is measuring dis- tinct constructs across subgroups, rendering the ranking or comparison of examinees along the same proficiency continuum inappropriate. Given this background, the primary objective of this study is to employ the Mixed Rasch Model (MRM; Rost, 1990) in the analysis of the listening comprehension section of the IELTS, a well-established standardized WLP test. The aim is to investigate latent class DIF by exploring multiple profiles of L2 listeners who are likely to employ diverse listening comprehension processes in order to correctly respond to a set of test items. InTERnaTIOnaL JOuRnaL OF TESTIng 53 Background Listening comprehension and multiple profiles Numerous researchers have proposed various models aimed at elucidating the intrica- cies of the listening comprehension process and its relationship with a plethora of (non)cognitive attributes. As delineated by Aryadoust (2018), these models can be broadly categorized into two groups. The first group encompasses general models that primarily focus on listening within non-assessment contexts. For instance, models such as those proposed by Imhof and Janusik (2006), Rost (2016), and Goh and Vandergrift (2022) posit that listening involves a multifaceted cognitive journey that commences with the pre-comprehension phase, characterized by processes of perception and rec- ognition. During perception, auditory organs receive sound waves, which are subse- quently converted into electrical impulses and transmitted to the brain for processing via the nervous system. Recognition, the second process, involves the identification and retrieval of phonemes and words through lexical access and segmentation. In the comprehension stage, syntactic knowledge is applied in a bottom-up fashion (i.e., the use of sounds, individual words, and smaller units to construct meaning from the auditory stimuli) to amalgamate the identified words. This amalgamation results in the creation of a localized mental representation of perceived chunks or sentences (Kintsch, 1998). These representations contain occasional gaps and are comprised of proposi- tions and mental imagery. The task of storing these mental representations falls upon long-term memory, as listeners decode, analyze, and encode the received chunks or sentences. When a new stimulus is encountered, a similar comprehension process unfolds, connecting the new input to the previously stored mental representation of the listening stimuli. This connection is facilitated through a top-down process (i.e., the use of linguistic, contextual, pragmatic, and sociolinguistic knowledge to facilitate auditory comprehension) that draws upon one’s background knowledge and the for- mulation of inferences. This process enables listeners to bridge gaps and construct a coherent mental representation referred to as the situation model (Aryadoust, 2018). Both bottom-up and top-down processes rely on the listener’s memory as their syn- chronization constrains working memory capacity and impacts test performance (Buck, 2001). Researchers have indicated that cognitive processing occurs in an interactive rather than a linear manner and depends not only on linguistic knowledge but also on world and topical knowledge (Field, 2013; Goh & Vandergrift, 2022). More importantly, vari- ous levels of processing are required to comprehend an oral input. Lower-level cognitive processes (e.g., lexico-grammatical knowledge, word recognition, parsing, and acous- tic-phonetic decoding) are engaged to grasp the literal meaning of a text, whereas high- er-level processes (e.g., semantic processing, world knowledge resources, inferencing, multitasking, and speakers’ intentions and prosodic patterns) are activated to compre- hend the discourse and implied meaning of an input (Field, 2013; Rukthong & Brunfaut, 2020). Previous studies have shown that automatic or fluent use of lower-level processes reduces the cognitive processing load, enabling listeners to allocate more attentional capacity to higher-level processes (Goh & Vandergrift, 2022). The second group encompasses models that are specifically tailored for assessment purposes (Bejar et  al., 2000; Buck, 2001; Buck & Tatsuoka, 1998; Field, 2013; Freedle 54 F. EFFaTPanaH ET aL. & Kostine, 1996). These models take into account both the default listening mecha- nisms and a range of test- and test-taker-related characteristics (Aryadoust, 2018). Among these assessment-specific models, one of the most influential models has been presented by Bejar et  al. (2000), which delineates listening comprehension into two stages: the listening stage and the response stage. During the listening stage, verbal stimuli are received by the listener’s auditory system and subsequently processed. To comprehend incoming signals, listeners must have real-time access to at least three knowledge sources: (1) Situational knowledge, which underscores the significance of contextual knowledge and visual cues in aiding listening comprehension, (2) Linguistic knowledge, encompassing grammar (phonology, vocabulary, morphology, and syntax), discourse, and pragmatics, and (3) Background knowledge, pertaining to one’s knowl- edge of the world and the current situation. Throughout this stage, the oral input is transformed into a set of propositions. Owing to variations in knowledge, cognitive processing, and working memory capacity among different listeners, various sets of propositions may be generated (Bejar et  al., 2000). In the response stage, the incoming input is processed to formulate a response, which can take the form of spoken or written expression (e.g., selecting an option, filling in a blank space, or providing words or phrases). Bejar et  al. (2000) also emphasize the role of several factors in listening test performance and the interconnectedness of these fac- tors. These factors include task characteristics (e.g., delivery speed, task complexity, exposure to listening input, discourse register, genre, audio file repetition, test materials, and rubric), individual characteristics (e.g., age, gender, and background knowledge), and linguistic and cognitive knowledge (e.g., lexico-grammatical knowledge, concentra- tion, and working memory capacity). Furthermore, alongside these models, numerous researchers have posited that lis- tening comprehension comprises various subskills and have proposed various taxono- mies (e.g., Field, 2013; Richards, 1983; Rost, 2016, as comprehensively reviewed by Aryadoust, 2018). While these models and taxonomies have yielded valuable insights into the nature of listening comprehension and provided implications for pedagogy, assessment, and research, they often assume that listeners exhibit similar processes and patterns when processing listening input and responding to test items. This assumption implies homogeneity among listeners in terms of listening processes. However, it is important to recognize that individual listeners may differ in their lis- tening processes, and there may be multiple viable configurations through which lis- teners achieve successful listening comprehension. As posited by Wolvin (2013, p. 105), the individual differences of listeners might take us to a reconceptualization of listening in the broader communication context. Rather than a ‘one size fits all’ model of the listening process, perhaps we should focus on individual listeners’ processing strategies. It may be that listeners vary considerably in their cognitive functioning while engaging as listening communicators. Moreover, Hickendorff et  al. (2018) assert that total raw scores are typically employed in statistical analyses to develop models. The use of total scores assumes homogeneity in response patterns, with any heterogeneity within and between individuals considered primarily as statistical noise. InTERnaTIOnaL JOuRnaL OF TESTIng 55 Therefore, only a limited number of researchers have adopted a profiling analysis approach and designed scales to measure individual differences by identifying styles, habits, and patterns of listening, and categorizing listeners into different profiles. For instance, Watson et  al. (1995) developed a sixteen-item Listening Styles Profile (LSP-16) inventory to assess intra-individual variability in listening styles. They defined listening styles as “attitudes, beliefs, and predispositions about the how, where, when, who, and what of the information reception and encoding process” (p. 2). Using factor analysis, they derived four distinct patterns of listening styles: content-oriented, time-oriented, action-oriented, and people-oriented listening styles. Addressing a common criticism of the LSP-16 regarding its consistently low estimates of internal consistency (Bodie & Worthington, 2010), Bodie et  al. (2013) proposed a revised 24-item measure of LSP (LSP-R). Employing exploratory factor analysis (EFA), they investigated the underlying factor structure of the measure, which assesses four listening styles: task-oriented, ana- lytical, relational, and critical listening. In another study, Bodie et  al. (2020) first devel- oped a typology of listening habits based on facets of meaning construction and subsequently designed a corresponding measure to capture individual differences. Utilizing EFA, they identified four distinct listening habits: Connective, Reflective, Analytical, and Conceptual. Chon and Shin (2019) also conducted a study to theorize and substantiate intra-individual differences in students’ motivational-metacognitive pro- files regarding their listening proficiency. Using latent class analysis (LCA), they identi- fied four cluster solutions: Amotivated-Translators, Externally Motivated-Don’t do much Planning or Evaluation, Introjected-Totally Alert, and High Autonomous Motivation- Achievement Strategists. This was based on the responses of 312 Korean middle school learners of English to a metacognitive awareness listening questionnaire and an aca- demic self-regulation questionnaire. While these studies have shown the presence of different profiles of listening styles and processes, they are subject to several limitations. First, these studies have often employed inadequate methodologies for exploring listening profiles. Early studies relied on factor analysis methods, while recent ones have tended to use latent class approaches. Factor analysis methods are limited in their ability to accu- rately describe heterogeneity and complex, non-linear listening patterns. According to Hickendorff et  al. (2018, p. 2), factor analysis methods and other traditional analytical approaches, including correlation, regression-based techniques, and Analysis of Variance (ANOVA), are variable-centered and focus on relationships between variables. They assume that the relationship between variables applies uni- formly to all individuals, suggesting homogeneity in the nature of individual differ- ences (Hickendorff et  al., 2018). Consequently, they are ill-suited for providing a clear representation of non-linear and interactive patterns and addressing heteroge- neity within and between individuals. Latent class approaches, such as LCA, also rely on observed variables and mean scores (Tabachnick & Fidell, 2013) and are incapable of modeling item responses (Aryadoust, 2015). Second, these empirical studies have primarily focused on intra-individual differences to characterize listen- ing patterns. Third, these studies were conducted under non-assessment conditions. In assessment contexts, if qualitatively different groups exist within a population, it serves as empirical evidence for the presence of DIF. 56 F. EFFaTPanaH ET aL. All in all, the research discussed above underscores the complexity and variability of listening comprehension processes, highlighting that different listeners exhibit dis- tinct cognitive patterns and strategies. While previous studies have explored these indi- vidual differences through various profiling approaches, there remains a gap in effectively capturing the heterogeneity of listening processes. In the context of language assessments like IELTS, such variability can manifest as DIF, where certain test items may function differently for different groups of listeners. Consequently, further research is needed to detect DIF and better account for the diverse listening profiles that exist within the population that enhance our understanding of how individual differences impact assessment outcomes. Differential item functioning (DIF) DIF is identified when individuals with the same level of the construct being measured, but from different predefined groups (such as age, gender, ethnicity/race, education, etc.), have different probabilities of endorsing an item (Zumbo, 2007). Essentially, DIF can be seen as a form of measurement bias where examinees’ responses to test items are influenced by factors beyond the primary construct the test intends to measure (Ravand, 2015; Roussos & Stout, 1996). In other words, item responses are not solely determined by the primary construct but also by group membership, indicating the presence of multidimensionality. This situation can threaten the validity of test interpre- tations and uses (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 2014). Within the context of IRT and the Rasch model, DIF detection methods involve comparing areas between item response functions (Raju, 1990) or item difficulty param- eters (Thissen et  al., 1993). To maintain unidimensionality, test scores or item difficulty1 estimates should be invariant across groups of the population with similar ability levels. When difficulty estimates for items differ among groups, it implies that examinees from different groups employ distinct cognitive processes and strategies to respond to items, revealing that group membership influences test performance. Several statistical methods are available for detecting DIF, including Mantel-Haenszel (Holland & Thayer, 1988), logistic regression (Swaminathan & Rogers, 2000), multi- ple-group factor analysis (Meredith, 1993; Ravand, 2024), multiple indicator multiple cause (MIMIC; Finch, 2005; Ravand et  al., 2019), IRT-/Rasch-based analytical methods (Raju, 1988; Steinberg & Thissen, 2006), and multidimensional IRT (MIRT; Oshima et  al., 1997). These methods typically examine DIF by analyzing the manifest character- istics of examinees (e.g., age, gender, ethnic group, race, etc.). However, this approach may not always identify the root causes of DIF effectively, as it focuses on research- er-defined characteristics and may overlook unidentified characteristics, leading to het- erogeneity among emerging groups (Cohen & Bolt, 2005; Geranpayeh & Kunnan, 2007). 1 DIF is not necessarily limited to item difficulty because it could also be applied to item discrimination (e.g., Humphry & Montuoro, 2021; Lord, 1980). although the focus on item difficulty fits with the use of the Rasch model, it must be pointed out that other characteristics of the item can vary as well. InTERnaTIOnaL JOuRnaL OF TESTIng 57 As argued by Ackerman et  al. (2003), psychometric DIF analysis should always be fol- lowed by a substantive investigation of the sources of DIF, bridging the gap between statistical and substantive analyses. While statistical analysis is crucial for validity, understanding the underlying causes of DIF is enlightening from both theoretical and substantive perspectives, especially in terms of construct validation (Borsboom et  al., 2004; Van Nijlen & Janssen, 2011). DIF in L2 listening comprehension DIF analysis in listening comprehension research has conventionally been conducted using several covariates such as gender, age, grade, nationality, place of origin, aca- demic background, first language or language background, prior exposure to similar tests (test-wiseness or practice effect), familiarity with topic, and familiarity with item type (e.g., Aryadoust, 2012; Geranpayeh & Kunnan, 2007; Seo et  al., 2016). Among the covariates, gender has been the most well-researched factor in the DIF literature on listening assessment. In previous L2 listening comprehension studies conducting DIF analyses, gender was either the sole variable examined (e.g., Aryadoust et  al., 2011; Lin & Wu, 2003; Park, 2008) or one of several variables investigated for its potential impact on differential test performance (e.g., Aryadoust, 2012; Cid et  al., 2017; Geranpayeh & Kunnan, 2007; Seo et  al., 2016). However, the findings of the studies for the presence of DIF were inconclusive and contradictory. A group of stud- ies reported that males outperformed females, or males were disadvantaged on some items and advantaged on others (Alavi et  al., 2018; Aryadoust, 2012; Aryadoust et  al., 2011; Cole, 1997; Park, 2008; Zansen et  al., 2022). By contrast, another group of stud- ies found that females outperformed their male counterparts (Lin & Wu, 2003). Some studies also reported no gender-based DIF of practical concern (Bourdeaud’Hui et  al., 2021; Cid et  al., 2017). These mixed findings are sensible because the studies used different tests and detection methods for DIF analyses. Additionally, previous studies indicated that nationality and age do not induce DIF, but grade, place of origin, famil- iarity with topic and item type, prior exposure to tests, academic background, and first language can cause DIF (Aryadoust, 2011, 2012; Banerjee & Papageorgiou, 2016; Geranpayeh & Kunnan, 2007; Harding, 2012; Lia & Yao, 2021; Nishizawa, 2023; Pae, 2004; Raquel, 2019; Shin et  al., 2021). Some researchers further focused on investigat- ing the interaction of several covariates in inducing DIF (e.g., Aryadoust et  al., 2024; Pae, 2012). Researchers have explored various hypotheses to explain differential performance on listening items in standardized tests, with a focus on gender differences. Park (2008) attributed observed DIF in the English listening subtest of the Korea College Scholastic Ability Test to factors such as item content, topic areas, and language type, noting that items about travel and sports favored males, while those about theater and shopping favored females. Other studies echoed these findings, showing items favoring females typically relate to arts and social sciences, while those favoring males involve natural sciences and technical content (Carlton & Harris, 1992; Curley & Schmitt, 1993; O’Neill & McPeek, 1993; Scheuneman & Gerritz, 1990). Females also excel in computations and symbolic items, whereas males perform better on geometry and items involving tables and graphs (O’Neill & McPeek, 1993). Additionally, females tend to outperform 58 F. EFFaTPanaH ET aL. males on items involving mood, contextual clues, or abstract concept understanding, while males do better on logical inference tasks (O’Neill & McPeek, 1993). Studies also indicate that females excel in oral and constructed response formats (e.g., essay-type, short answers, and fill-in-the blanks) due to stronger writing skills, while males often perform better on multiple-choice (MC) items, partly due to a greater willingness to guess (Aryadoust, 2012; Bolger & Kellaghan, 1990; Mazzeo et  al., 1993; Pae, 2012; Willingham & Cole, 1997). Cognitive processing differences also contribute, with females showing stronger verbal skills and males relying more on spatial processing, affecting performance across various task types (O’Neill & McPeek, 1993). For instance, males excel in spatial tasks like map labeling, which involve visualization, real-life information, and spatial understanding, while females perform better in tasks involving linguistic analysis, detailed comprehension, and linguistic inference such as sentence and table completion (Aryadoust, 2012). Table 1 provides descriptions of abilities required for different listening comprehension item types, adapted from Buck (2001) and Aryadoust (2012). Although previous DIF studies explain the causes of DIF and characterize the differ- ential performance of examinees on listening comprehension tests, only a few studies have specifically focused on WLP tests, particularly the IELTS exam (e.g., Alavi et  al., 2018; Aryadoust, 2012). Focusing on WLP tests is crucial because it ensures that the Table 1. listening comprehension item types and their description. Item type Description Multiple-choice - ability to attentively listen to comprehend the main idea and details of oral stimuli. - ability to identify main information, such as main points, supporting details, and inferred meanings. - ability to understand context and use inference skills to deduce implied meanings. - ability to distinguish between similar options and eliminate distractors. Map labeling - spatial awareness and visualization skills to understand and interpret maps. - ability to follow the sequence of directions and comprehend spatial relationships between locations. - ability to accurately identify and label locations on a map based on verbal descriptions. - ability to attentively listen for specific details related to geographical features, landmarks, and directions. sentence completion - ability to understand the context and meaning of sentences. - ability to predict and infer missing information based on preceding content. - ability to recognize keywords and phrases that appropriately complete missing information. - Knowledge of grammar, vocabulary, sentence structure, phonology, and morphology to accurately complete sentences. table completion - ability to extract and comprehend information presented in tables. - ability to identify patterns and relationships within the table. - ability to attentively listen for specific details related to numerical data, categories, or attributes (e.g., numbers, dates, names, etc.). - ability to retain and accurately transfer information from the oral input to the table. Matching items - ability to understand relationships between items presented in the oral stimuli. - ability to recognize similarities and differences between items. - ability to categorize and classify information based on shared characteristics. - ability to identify key information and making connections between items. - ability to retain and manipulate information in memory while processing the task. - ability to follow the sequence of information presented and retain the order in memory. - ability to infer the correct matches based on given information. Classification - ability to categorize and group information based on shared characteristics. - ability to identify similarities and differences between items. - ability to understand the overall structure and organization of information presented. - ability to recognize patterns and relationships to accurately classify items. - ability to retain, remember, and retrieve specific details from the oral stimuli. InTERnaTIOnaL JOuRnaL OF TESTIng 59 assessments accurately reflect the real-time processing abilities of examinees, which is an essential component of effective listening comprehension. The limited attention to DIF analysis in WLP ESOL tests may be attributed to Cambridge University’s standards that prioritize score consistency over other aspects of test quality, such as fairness and test validity (Geranpayeh & Kunnan, 2007). Mixed Rasch model The MRM (Rost, 1990) is a combination of the Rasch model (Rasch, 1980) and the latent class model (LCM; Lazarsfel & Henry, 1968). The Rasch model assumes that the probability of getting an item right depends on the ability of a person and the diffi- culty of an item. The greater a person’ ability relative to an item difficulty, the higher is the probability of a right answer. The position of an item on a latent variable con- tinuum corresponds with the position of person at which there is a 0.5 probability of a correct response to the item. The probability of a correct response is dependent on the difference between the ability of a person and the difficulty of an item. Generally, when the difference is negative, the probability of responding correctly to the item is less than 0.50; when the difference is 0, the probability is equal to 0.50; and when the difference is positive, the probability is greater than 0.50. For the standard Rasch model, the item response function is expressed as: P vj v j v j v j ( | . ) ( ) Χ = = −( ) + − 1 1 θ β θ β θ β exp exp (1) where P vj( )Χ denotes the probability of solving item j for examinee v; θv is the ability of person ν; and β j is the difficulty of the item j. The Rasch model relies heavily on the assumption of parameter invariance, which asserts that the difficulties of items should remain constant across all members of the population. In fact, the ordering of item difficulties should be equal for all individuals. However, this assumption might not hold in cases where there are qualitative distinc- tions, such as variations in cognitive approaches, among different (sub)groups. To mit- igate this issue, the MRM relaxes the assumption by permitting item parameters to vary across latent population classes. The assumption of MRM is that the population is heterogeneous but involves various non-overlapping subpopulations that are different in terms of item response probabili- ties. The MRM can be expressed as: P gvj vg jg v vg jg vg jg ( | . . )Χ = = −( ) + −( ) 1 1 θ β θ β θ β exp exp (2) where P(Χvj) is the probability of a correct response; θvg is the ability parameter of person v in class g; β jg is the class specific difficulty parameter of item j in class g; and gv is a latent class person v belongs to it. It must be noted that unlike the LCM, which 60 F. EFFaTPanaH ET aL. assumes that there are no individual differences within each class regarding the response probabilities, the MRM allows for individual differences within latent classes. Therefore, while the unidimensional Rasch model holds within each latent class, it does not hold for the entire population. Since latent classes are a priori unknown and disjoint, each person must belong to only one latent class with the highest probability (Rost, 1990). There are typically two approaches to the analysis of MRM: (1) exploratory analysis approach and (2) confirmatory analysis approach. In the exploratory approach, a num- ber of latent classes are firstly detected, and then class-specific item profiles are analyzed to understand the nature of differences among the latent classes. This entails a thorough examination of the content and patterns of variation in item parameters within each class. Such scrutiny can illuminate qualitative differences among the classes for research- ers. The association between several covariates and the latent class membership can also be explored to elucidate the qualitative distinctions among these classes. In contrast, the confirmatory approach incorporates several covariates directly into the model while esti- mating the latent classes (De Ayala & Santiago, 2017). The presence of substantial a priori evidence aids in confirming whether the covariates moderate person and item parameters (Sen & Cohen, 2019). The MRM has already been used in methodological and practical studies to test fit of the unidimensional Rasch model, standard setting, test calibration, DIF, test speededness, problem-solving strategies, and response styles and faking personality (see Sen & Cohen, 2019 for a comprehensive review of applications of IRT mixture models). The application of MRM to listening comprehension tests has received too little attention. Most directly, Aryadoust (2015) applied the MRM to an EFL listening comprehension test. In addition to the listening test, respondents were given a metacognitive awareness listening question- naire and a lexico-grammatical test. Two latent groups were detected. The first class con- sisted of examinees with higher ability in multitasking and lexico-grammatical knowledge, and with higher scores in planning, evaluation, and problem solving; however, the second class comprised examinees with lower ability in multitasking and lexico-grammatical knowledge, and with lower scores in mental translation, person knowledge, and directed attention. The class analysis revealed that examinees in Class 1 outperformed examinees in Class 2 on matching items (items for which examinees must select the accurate options from a list of choices and write the letters next to the item numbers). Aryadoust (2015) argued that these concurrent cognitive-motor tasks are extraneous to listening compre- hension. Although the study provided valuable information about individual differences in listening test performance, there was a priori hypothesis for using the MRM. A study with a priori hypotheses concerning the causes of DIF using MRM is not a reasonable strategy. The purpose of MRM is to identify DIF across latent classes that are otherwise unknown, and no a priori assumption is required to detect DIF across the classes (Baghaei et  al., 2019). Results of MRM would not be informative when researchers first perform MRM and then attempts to make an association between the classes and covariates that have been chosen a priori based on a theory. This method can be utilized when analysis of class-specific item profiles fails to produce interpretable results. When researchers have covariates that are speculated to be the causes of DIF, conventional DIF detection tech- niques for known groups can be used to analyze DIF for these variables (Baghaei et  al., 2019; Rost, 1990). InTERnaTIOnaL JOuRnaL OF TESTIng 61 The present study The present study aims to apply the MRM (Rost, 1990) to the listening comprehension section of the IELTS to investigate latent class DIF by exploring multiple profiles of L2 listeners, focusing on examining the patterns of item difficulty parameters across latent classes and analyzing the content of test items to capture qualitative differences among them. For the purpose of this study, the following research questions were addressed: RQ1: How many latent classes (qualitative profiles in listening mechanism) do exist among examinees in taking IELTS listening comprehension items? RQ2: Which items exhibit a significant difference in item difficulty estimates across latent classes? RQ3: What are the distinct cognitive abilities required for effectively answering various types of listening comprehension items, and how do these abilities differ across various question types? Method Data The present study capitalized on a dataset that had been previously utilized by Aryadoust (2012) and Effatpanah (2019). The data consisted of item responses of 462 international students to forty items of the listening comprehension section of a sample paper of the IELTS. Participants were studying in language schools and tertiary-level institutions in Iran, Malaysia, Singapore, and the Philippines. There were 191 (41.34%) males and 271 (58.66%) females, and their mean age was 24.46 (SD = 4.27). Participants were from Iran (76.62%), China (11.91%), Malaysia (5.84%), and from other countries, mostly Arab states in the Persian Gulf region (5.63%). They were preparing for the IELTS exam and their participation was voluntary. A written informed consent had been obtained from all individual participants included in the study. They had been reassured that their information would remain confidential and anonymous. They also had received com- plete feedback on their performance. The feedback contained their raw scores, IELTS band scores, information regarding their deficiencies in listening comprehension, and suggestions for improving their listening skills. Due to confidentiality reasons, the version of the IELTS listening test had been taken from Official IELTS Practice Materials (2007) (www.IELTS.org,). Just similar to other IELTS versions, this test was subjected to a rigorous test design, development, and val- idation process which rely on the seven-stage Cambridge ESOL Question Paper Production Cycle (CEQPPC). This process not only assures that every version of the test is of a comparable level of difficulty but also provides evidence for the plausibility of the interpretations and uses of the test scores. The listening test usually lasts for a total of 40 minutes. The recordings are heard only once and include a range of accents, including British, American, Australian, New Zealand, and Canadian. The items are designed so that the answers appear in the order they are heard in the audio. One mark is awarded for each correct answer, and care should be taken when writing answers on the answer sheet because poor spelling and grammar are penalized. There are no par- tially correct answers; each answer is marked as either 1 (correct) or 0 (incorrect). For http://www.IELTS.org 62 F. EFFaTPanaH ET aL. the paper-based listening test, after the recording ends, examinees are given 10 minutes to transfer their answers to the answer sheet. The test that had been administered to the participants comprised four sections (four audio stimuli), each with ten questions. A variety of question types were used, including multiple-choice (MC), map labeling (ML), sentence completion (SC), and table comple- tion (TC). No sample items are provided in this article due to copyright constraints. Readers can refer to the IELTS website (n.d.). The first section of the test consisted of two ML items, four MC items, and four TC items on a woman being interviewed by a police officer. The second section was composed of five MC items and five TC items on providing commercial information about an English Hotel. The third section included one MC item and nine SC items based on a conversation among three students on campus. The last section comprised six SC items and four TC items based on a talk presented by a university lecturer on a bird of prey. The total score in the test ranged between 4 and 40 with a mean of 21.32 and standard deviation of 9.12. Reliability coefficients of the test were estimated using Cronbach alpha, and a value of 0.92 was obtained. Data analysis The item responses of all examinees were analyzed with the MRM using the TAM package (Robitzsch et  al., 2024) in R Core Team (2024). The TAM package uses mar- ginal maximum likelihood estimation (MMLE) and joint maximum likelihood estima- tion (JMLE) methods for unidimensional and multidimensional IRT models as well as dichotomous and polytomous models. Most of conventional (parametric) DIF detection methods are based on the premise that a certain group of items, called anchor items, are free from DIF. These anchor items serve to create a constant scale for comparing item difficulty across different (sub)groups (Glas & Verhelst, 1995). The presence of suitable anchor items is critical for accurate DIF analysis. The violation of this assump- tion can result in inflated type I error rate and erroneous inferences (Wang et  al., 2022; Yuan et  al., 2021). The TAM package uses the equal-mean-difficulty anchor method for DIF analysis. This method assumes that the mean of difficulty across items is the same across classes or (sub)groups (typically equal to zero). Therefore, in this study, the mean of difficulty across items was constrained to be equal to zero for each latent class in the MRM analysis. Several researchers (e.g., Alexeev et  al., 2011; Sen, 2018) have indicated that when a test aligns with, for instance, the two or three parameters logistic (2PL or 3PL) IRT models, utilizing an MRM might lead to the detection of spurious latent classes, which, in turn, can lead to erroneous or ambiguous conclusions that have consider- able effects on practitioners. Therefore, it is essential to first examine whether the tests align with the 2PL and 3PL IRT models (Tay et  al., 2011), unless there exists a valid rationale why these models would not yield as much valuable insight as MRM. The estimation of the 3PL IRT model with an additional guessing parameter appeared not reasonable for the data used in this study due to the multi-item format of the listening test, including MC and fill-in-the-gap items. Thus, a 2PL mixture IRT model was fitted to the data and its results were compared against the MRM. InTERnaTIOnaL JOuRnaL OF TESTIng 63 Since the number of classes is not a model parameter to be estimated, Rasch models with one to five latent classes were fitted to the data and compared to explore the opti- mal number of classes. Their results were also compared against the 2PL mixture IRT models with one to five latent classes. The models were compared with regard to rela- tive fit statistics: Akaike’s Information Criterion (AIC) = −2 log L + 2 P, where L is the maximum likelihood function value, P is the number of parameters; Bayesian Information Criterion (BIC) = −2 log L+ P ln[n], where ln[n] is the natural log of sample size; and Bozdogan’s Consistent AIC (CAIC) = −2 log L + p [ln(n) + 1]. The model with the least information criteria is considered the best model. Burnham and Anderson (2002) argued that AIC asymptotically selects the model that reduces the mean square error predic- tion. With a simulation study, Li et  al. (2009) also indicated that AIC is less accurate and inconsistent unless the true model is among the rival models. However, because AIC does not impose any penalty for sample sizes, the more complex or highly param- eterized model is selected with an increase in sample size. Numerous researchers, on the contrary, have shown that BIC has the superiority to detect the correct number of latent classes because it imposes a large penalty for the number of parameters and sample sizes (Choi et  al., 2017; Li et  al., 2009; Nylund et  al., 2007; Preinerstorfer & Formann, 2012; Sen et  al., 2019), so BIC tends to choose models with a smaller number of param- eters compared to AIC. The model identification was followed by estimating item difficulty and person parameters as well as examining mean square (MNSQ) fit statistics for each latent class. Item difficulty and person ability parameters with logit units or log odds units for test items/persons indicate the location of each item/person on the latent trait con- tinuum (e.g., L2 listening comprehension ability). Class-specific item difficulty param- eters were also graphically compared to identify the items that cause qualitative differences between classes. The mean and standard deviation of the weighted likeli- hood estimate (WLE) of person parameters for latent classes were also computed. WLE person parameter estimates show the ability scores on the Rasch scale for latent classes after controlling for the item difficulty pattern difference. To detect further differences between the resulting groups, they were compared with regard to their reliability, means of raw scores across latent classes, and the correlation of item difficulties across the classes, along with their confidence intervals (CIs). Furthermore, the fit of individual items for each latent class was tested using outfit and infit MNSQ fit indices (Linacre, 2002), their z-standardized values, and p-value. According to Linacre (2002), infit MNSQ is a t-standardized information-weighted sta- tistic which is sensitive to inliers, whereas outfit MNSQ is a t-standardized unweighted statistic which is sensitive to outliers. ZStd provides a t-test to examine whether the data have a perfect fit to the model. A p-value is used to test the statistical significance of the observed misfit. The acceptable range for infit and outfit MNSQs is 0.70 − 1.30 (Bond et  al., 2020; Linacre, 2024). Overall, larger values of infit MNSQ show that the items of a test do not perform well for the examinees on whom the items are targeted, suggesting a more serious threat to validity (Linacre, 2024). Finally, to further identify which items had a significant difference in item difficulty across latent classes, a post hoc t-test was conducted. The Welch t-test was used to assess whether the differences between the difficulty estimates based on each class are statistically significant. In this method, item difficulty parameters are separately 64 F. EFFaTPanaH ET aL. estimated for each group through logistic regression. The difference between these esti- mates is then tested for statistical significance. DIF items are identified when p < 0.05 (Linacre, 2024). When DIF is both statistically and substantively significant and consis- tently replicates across various subgroups, the researcher can more confidently conclude that the item functions differently across these subgroups (Aryadoust, 2012). The for- mula for the Welch t-test is as follows: t d d s s j j j j = − − 2 1 2 2 1 2 (3) where di1 denotes the difficulty of item j for group g, and si1 2 is the standard error of estimate for item j for group g. Researchers at Educational Testing Service (ETS) devel- oped a standardized metric, known as the ETS DIF classification scheme (Holland & Thayer, 1988), to categorize DIF into three levels based on effect size ( )∆ : negligible DIF (A) if |Δ| ≤ 1, intermediate DIF (B) if 1 1 5≤ ≤∆ . , and large DIF (C) if ∆ ≥1 5. . These categories help identify whether test items function differently across different groups after controlling for the overall ability level. Results To explore the appropriate number of latent classes, several Rasch models with one to five latent classes were fitted to the data, and their results were compared against those of 2PL mixture IRT models with different latent classes. Table 2 provides the relative model fit statistics across the models. As can be seen, information criteria (e.g., AIC and BIC) did not produce consistent results. Such discrepancies between information criteria have been frequently reported in previous studies for model comparisons (Sen & Cohen, 2019). Nylund et  al. (2007) argued that compared to AIC, BIC has a higher probability of selecting the correct number of latent classes in factor mixture modeling. With regard to AIC, the value of the 2PL mixture IRT model with five latent classes was the smallest due to its higher complexity. However, the values of BIC showed that 2PL mixture IRT models had a poor fit compared to the Rasch models with different latent classes, except for 2PL mixture IRT model with one latent class that outperformed the one-latent class Rasch model. This suggests that the 2PL item response functions do Table 2. Model-data relative fit information for the estimated MrMs and 2Pl mixture Irt models. Models aIC BIC rasch one-latent Class 18752 18922 rasch two-latent Class 18309 18652 rasch three-latent Class 18071 18587 rasch four-latent Class 17907 18598 rasch five-latent Class 17779 18643 2Pl one-latent Class 18294 18625 2Pl two-latent Class 17954 18620 2Pl three-latent Class 17741 18741 2Pl four-latent Class 17519 18855 2Pl five-latent Class 17459 19130 InTERnaTIOnaL JOuRnaL OF TESTIng 65 not accurately capture the underlying structure of the data. The BIC value for a three- latent-class Rasch model was the smallest, and thus this model was determined to be the most effective for the sample in this study and chosen for subsequent analyses. This indicates that assuming three homogeneous subgroups offers a more accurate represen- tation of the data compared to assuming a single population. The class size decimals for each latent class are presented in Table 3. The sum of these sizes is equal to one, and they are interpreted as percentages, suggesting the num- ber of examinees assigned to be members of each latent class. About 50%, 28%, and 21% of the examinees were identified as members of Classes 1, 2, and 3, respectively. Table 3 also gives the mean hypothetical class assignment probabilities of the three latent classes. For instance, members of Class 1 had a very high probability of being classified in Class 1 (96.1%) and a low probability of assignment to Class 2 (3.7%) and Class 3 (0.2%). Members of Class 2 had a chance of 94.3% to be in Class 2; the chances to be in one of the other two classes were 3.6% and 2.1%, respectively. Similarly, mem- bers of Class 3 had a probability of 95.2% of being in Class 3, and they had 4.2% and 0.6% probabilities of being classified in Classes 2 and 1, respectively. Three well-sepa- rated classes were thus clearly identified. The off-diagonal indices on the probability matrix are much smaller than the diagonal statistics, indicating a high classification accuracy of the model (Baghaei & Carstensen, 2013; Effatpanah et  al., 2024). The mean and standard deviation of WLE of person parameters for the three latent classes were computed. The mean and standard deviation of person parameters for Class 1 were 1.379 and 0.941; for Class 2 they were 1.485 and 0.832; and for Class 3 they were −1.055 and 0.800. This suggests the higher ability of Class 2 members. Class 1 had the highest standard deviation, indicating greater variability in listening compre- hension abilities among its members. Table 4 shows the difficulty parameters for the items, their standard errors, and their infit and outfit MNSQ values. As can be seen, although most infit and outfit values were within the acceptable range of 0.70–1.30 (Bond et  al., 2020; Linacre, 2024), there were some misfitting items across the three classes. For instance, the outfit values for Items 29 (0.69), 36 (1.95), and 38 (0.67) in Class 1 were out of the acceptable boundary, but their misfit was insignificant. For Class 2, Items 3 (1.38), 9 (1.61), 14 (1.75), 15 (1.41), 22 (1.34), and 37 (0.65) were beyond the criteria. Misfit was significant only for Item 3. And, for Class 3, although twelve items (i.e., 2, 3, 5, 8, 9, 10, 13, 14, 17, 21, 24, and 34) fell out of the expected range of outfit MNSQ, only Item 5 had a significant misfit. Items with values lower than 0.70 are overfit and benign. However, items with values exceeding 1.30 indicate abnormal response patterns that deviate from the model’s expectations, and suggest that the test is not unidimensional. A possible reason for this can be due to the presence of several item types in the listening section of the test that can potentially affect the unidimensionality of the test. When different item types are introduced in a listening test, they may require different cognitive processes. For exam- ple, MC items might assess recognition and selection skills, while constructed response Table 3. Class-specific statistics across the three-latent class model. Class Class size Mean probability class 1 Mean probability class 2 Mean probability class 3 1 0.509 0.961 0.037 0.002 2 0.278 0.036 0.943 0.021 3 0.213 0.006 0.042 0.952 66 F. EFFaTPanaH ET aL. Ta bl e 4. I te m d iff ic ul ty p ar am et er s an d ite m f it st at ist ic s fo r th e th re e la te nt c la ss es . Ite m s Cl as s 1 Cl as s 2 Cl as s 3 es tim at e s. e. In fit M n sQ o ut fit M n sQ es tim at e s. e. In fit M n sQ o ut fit M n sQ es tim at e s. e. In fit M n sQ o ut fit M n sQ 1 −1 .5 0 0. 37 0. 92 0. 86 −2 .9 8 0. 69 1. 00 0. 81 −1 .6 6 0. 15 1. 01 0. 76 2 −2 .5 5 0. 59 0. 95 0. 92 −3 .6 5 0. 87 1. 01 0. 96 −1 .7 0 0. 15 0. 93 0. 22 3 0. 88 0. 20 1. 10 * 1. 13 * −2 .2 4 0. 48 1. 23 * 1. 38 * −0 .8 5 0. 14 0. 99 1. 63 4 −0 .4 1 0. 25 1. 04 1. 09 0. 57 0. 22 0. 98 0. 98 −0 .2 1 0. 15 1. 03 0. 98 5 0. 29 0. 22 1. 03 1. 03 2. 16 0. 24 1. 02 1. 06 −0 .7 0 0. 14 1. 24 * 1. 36 * 6 2. 11 0. 20 1. 02 1. 06 −0 .4 9 0. 27 0. 94 0. 95 0. 91 0. 18 1. 18 1. 09 7 −1 .8 1 0. 42 0. 90 * 0. 88 * −1 .8 1 0. 43 0. 97 1. 00 −0 .6 0 0. 14 1. 06 0. 87 8 −1 .8 1 0. 42 1. 03 1. 01 −2 .5 6 0. 59 0. 97 1. 06 −2 .9 8 0. 20 1. 07 1. 68 9 −1 .2 5 0. 34 1. 06 1. 06 −2 .9 8 0. 71 1. 12 1. 61 −2 .3 4 0. 16 0. 89 0. 25 10 0. 24 0. 22 0. 99 1. 00 −0 .9 1 0. 31 1. 01 0. 99 −0 .2 9 0. 14 1. 10 1. 41 11 0. 80 0. 20 1. 05 1. 07 0. 20 0. 23 0. 99 0. 92 −0 .3 4 0. 14 0. 97 0. 94 12 −2 .9 6 0. 67 1. 09 1. 12 −0 .4 2 0. 27 1. 02 1. 16 −1 .6 6 0. 15 0. 93 0. 85 13 −2 .9 6 0. 71 1. 03 1. 18 −3 .6 8 0. 93 1. 01 0. 96 −2 .0 6 0. 16 0. 98 0. 37 14 −1 .1 4 0. 32 0. 93 0. 92 −1 .8 1 0. 43 1. 09 1. 75 −1 .9 9 0. 15 1. 07 1. 86 15 −0 .4 1 0. 25 0. 94 0. 93 −0 .2 1 0. 25 1. 07 1. 41 −1 .1 7 0. 14 1. 01 0. 91 16 −0 .2 2 0. 24 1. 08 1. 12 −1 .8 1 0. 43 1. 05 1. 18 −1 .2 9 0. 14 1. 07 1. 29 17 0. 76 0. 20 0. 98 0. 95 3. 03 0. 28 0. 99 0. 96 0. 07 0. 15 0. 79 0. 58 18 −0 .7 7 0. 28 0. 98 0. 96 −0 .0 3 0. 24 1. 03 1. 12 −0 .2 7 0. 14 1. 00 1. 22 19 −0 .2 8 0. 25 1. 13 * 1. 18 * −0 .3 5 0. 26 1. 07 1. 24 −1 .5 9 0. 14 0. 98 0. 89 20 −0 .2 2 0. 24 0. 93 0. 88 0. 15 0. 24 1. 04 1. 04 0. 52 0. 16 1. 07 1. 09 21 1. 11 0. 20 1. 01 0. 94 3. 31 0. 31 1. 04 1. 04 1. 50 0. 22 1. 18 1. 39 22 −0 .1 1 0. 23 0. 90 0. 85 0. 72 0. 22 1. 11 1. 34 0. 14 0. 15 1. 01 1. 01 23 −0 .1 7 0. 24 0. 97 0. 83 1. 68 0. 22 0. 97 0. 89 1. 45 0. 22 0. 97 0. 95 24 2. 28 0. 21 0. 99 1. 02 3. 13 0. 30 0. 96 1. 06 3. 26 0. 45 0. 76 0. 53 25 −0 .4 1 0. 25 1. 05 1. 05 −0 .0 8 0. 25 0. 99 0. 97 0. 05 0. 15 0. 97 0. 82 26 1. 57 0. 19 1. 04 1. 02 3. 52 0. 33 0. 99 0. 98 1. 45 0. 21 1. 04 1. 15 27 1. 30 0. 19 0. 92 0. 93 −1 .5 7 0. 34 0. 93 0. 94 1. 50 0. 22 0. 95 0. 71 28 −1 .8 1 0. 41 1. 03 1. 01 −2 .2 4 0. 48 0. 98 0. 91 −0 .0 2 0. 15 1. 00 1. 07 29 −0 .7 6 0. 27 0. 97 0. 69 1. 74 0. 23 1. 05 1. 08 2. 76 0. 34 1. 20 * 1. 18 InTERnaTIOnaL JOuRnaL OF TESTIng 67 30 −0 .8 5 0. 29 1. 02 0. 93 −0 .2 1 0. 25 1. 01 0. 93 0. 42 0. 16 1. 06 1. 01 31 1. 53 0. 19 1. 01 1. 00 1. 25 0. 22 0. 99 0. 99 −0 .9 1 0. 14 1. 03 1. 02 32 0. 60 0. 20 0. 96 0. 84 0. 15 0. 24 0. 92 0. 89 1. 24 0. 20 0. 87 0. 80 33 −0 .1 3 0. 23 0. 97 0. 94 −0 .5 7 0. 28 0. 89 0. 77 0. 01 0. 15 1. 02 0. 89 34 1. 34 0. 19 1. 01 1. 12 −0 .4 9 0. 27 0. 91 0. 89 1. 08 0. 19 0. 81 0. 62 35 0. 47 0. 21 1. 03 1. 03 −0 .8 2 0. 30 0. 86 0. 79 0. 75 0. 17 0. 99 0. 82 36 0. 76 .0 20 1. 04 1. 95 2. 00 0. 23 0. 86 0. 81 3. 25 0. 42 0. 80 * 0. 73 * 37 −0 .9 4 0. 30 0. 99 0. 99 −1 .3 5 0. 36 0. 92 0. 65 −0 .4 2 0. 14 1. 06 0. 94 38 2. 45 0. 21 1. 00 0. 67 1. 68 0. 22 0. 96 0. 96 4. 69 0. 70 0. 98 0. 94 39 1. 95 0. 20 0. 95 0. 74 0. 09 0. 24 0. 92 0. 89 2. 53 0. 32 1. 02 0. 97 40 1. 34 0. 19 0. 95 0. 79 0. 86 0. 22 1. 09 1. 08 1. 55 0. 22 1. 02 0. 98 N ot e. * p < 0. 05 ; s .e . = s ta nd ar d er ro r of m ea su re m en t; M n sQ = M ea n- sq ua re . 68 F. EFFaTPanaH ET aL. items (e.g., short answers) might assess production and elaboration skills. These differ- ences can introduce multiple dimensions to the test. The pattern of class-specific item difficulty parameters across the three latent classes is graphically presented in Figure 1. The vertical axis shows difficulty estimates of the items in logit scale, and the horizontal axis indicates 40 items of the test. Class 1 is depicted with a solid green line, Class 2 with a dotted blue line, and Class 3 with a dashed yellow line. The patterns of difficulty parameters showed an inconsistent pattern across the three classes, suggesting different cognitive structures for examinees from the three classes (Effatpanah et  al., 2024; Rost, 1990). That is, there are substantial qualita- tive differences between examinees with regard to their listening comprehension pro- cess. Item difficulty parameter estimates for Class 3 showed greater variability (ranging from −2.98 to 4.69) than Classes 1 and 2 in which item difficulty parameter estimates ranged from −2.96 to 2.45 and −3.68 to −3.52, respectively. The easiest items for Class 1 were Items 12, 13, and 2, while Items 13, 2, 1, and 9 were the easiest items for Class 2, and Items 8, 9, and 13 for Class 3. On the other hand, Items 38, 24, and 6 were the most difficult items for Class 1, whereas the most difficult items for Class 2 were 26, 21, and 24, and Items 38, 24, and 36 for Class 3. As illustrated in Figure 1, almost all items contributed to variability among the three latent classes, especially Items 3, 4, 5, 6, 12, 17, 29, 36, and 38. To detect which items exhibit significant differences in difficulty parameters across the classes and which items have consistent difficulty levels across them, post hoc t-tests on the difficulty parameter differences for each item across the classes were conducted. Table 5 presents the local difficulty contrasts across classes, the standard error of the Figure 1. Item difficulty parameters across the three-latent classes. InTERnaTIOnaL JOuRnaL OF TESTIng 69 Ta bl e 5. W el ch t -t es t in t he d iff ic ul ty p ar am et er d iff er en ce f or e ac h ite m a cr os s th e th re e cl as se s. Ite m s Cl as s 1 vs . C la ss 2 Cl as s 1 vs . C la ss 3 Cl as s 2 vs . C la ss 3 Ite m fo rm at ty pe o f sp ee ch D If C on tr as t Jo in t s. e. t p- va lu e et s D If C on tr as t Jo in t s. e. t p- va lu e et s D If C on tr as t Jo in t s. e. t p- va lu e et s 1 1. 48 0. 78 1. 89 0. 06 0 B 0. 16 0. 40 0. 40 0. 68 9 a −1 .3 2 0. 70 −1 .8 7 0. 06 3 B M l D ia lo gu e 2 1. 11 1. 05 1. 05 0. 29 4 B −0 .8 5 0. 60 −1 .4 0 0. 16 3 a −1 .9 5 0. 88 −2 .2 1 0. 02 9* C M l 3 3. 12 0. 52 6. 00 0. 00 0* C 1. 74 0. 24 7. 17 0. 00 0* C −1 .3 9 0. 50 −2 .7 6 0. 00 6* B M C 4 −0 .9 8 0. 34 −2 .8 9 0. 00 4* a −0 .2 0 0. 29 −0 .6 8 0. 50 0* a 0. 78 0. 27 2. 93 0. 00 4* a M C 5 −1 .8 7 0. 32 −5 .8 6 0. 00 0* C 0. 99 0. 26 3. 85 0. 00 0* a 2. 86 0. 27 0. 41 0. 00 0* C M C 6 2. 60 0. 34 7. 73 0. 00 0* C 1. 20 0. 27 4. 43 0. 00 0* B −1 .4 0 0. 33 −4 .3 0 0. 00 0* B M C 7 0. 00 0. 60 0. 00 0. 99 8 a −1 .2 1 0. 44 −2 .7 3 0. 00 7* B −1 .2 1 0. 45 −2 .7 0 0. 00 8* B tC 8 0. 75 0. 73 1. 02 0. 30 8 a 1. 17 0. 47 2. 50 0. 01 3* B 0. 42 0. 62 0. 67 0. 50 1 a tC 9 1. 74 0. 79 2. 20 0. 02 9* C 1. 10 0. 38 2. 92 0. 00 4* B −0 .6 4 0. 73 −0 .8 7 0. 38 6 a tC 10 1. 15 0. 38 3. 06 0. 00 3* B 0. 54 0. 26 2. 07 0. 04 0* a −0 .6 1 0. 34 −1 .8 0 0. 07 3 a tC 11 0. 60 0. 31 1. 94 0. 05 3 a 1. 14 0. 25 4. 63 0. 00 0* B 0. 54 0. 28 1. 96 0. 05 1 a M C M on ol og ue 12 −2 .5 4 0. 72 −3 .5 2 0. 00 1* C −1 .3 0 0. 69 −1 .8 9 0. 06 0* B 1. 24 0. 30 4. 07 0. 00 0* B M C 13 0. 71 1. 17 0. 61 0. 54 2 a −0 .9 0 0. 73 −1 .2 4 0. 21 6 a −1 .6 1 0. 94 −1 .7 2 0. 08 8 C M C 14 0. 67 0. 54 1. 26 0. 21 0 a 0. 86 0. 36 2. 41 0. 01 7* a 0. 18 0. 46 0. 40 0. 69 1 a M C 15 −0 .2 0 0. 36 −0 .5 5 0. 58 6 a 0. 76 0. 29 2. 61 0. 01 0* a 0. 95 0. 29 3. 28 0. 00 1* a M C 16 1. 58 0. 49 3. 22 0. 00 2* C 1. 06 0. 28 3. 79 0. 00 0* B −0 .5 2 0. 45 −1 .1 6 0. 24 8 a tC 17 −2 .2 7 0. 34 −6 .6 6 0. 00 0* C 0. 69 0. 25 2. 74 0. 00 7* a 2. 96 0. 31 9. 41 0. 00 0* C tC 18 −0 .7 4 0. 38 −1 .9 8 0. 04 9* a −0 .4 9 0. 32 −1 .5 5 0. 12 2 a 0. 25 0. 28 0. 87 0. 38 4 a tC 19 0. 06 0. 36 0. 17 0. 86 2 a 1. 31 0. 28 4. 59 0. 00 0* B 1. 25 0. 30 4. 16 0. 00 0* B tC 20 −0 .3 7 0. 34 −1 .1 0 0. 27 2 a −0 .7 5 0. 29 −2 .5 5 0. 01 1* a −0 .3 7 0. 29 −1 .2 9 0. 19 7 a tC 21 −2 .2 0 0. 36 −6 .0 4 0. 00 0* C −0 .3 8 0. 29 −1 .3 1 0. 19 1 a 1. 81 0. 38 4. 81 0. 00 0* C sC D ia lo gu e 22 −0 .8 3 0. 32 −2 .5 7 0. 01 1* a −0 .2 6 0. 28 −0 .9 1 0. 36 3 a 0. 57 0. 27 2. 14 0. 03 4* a sC 23 −1 .8 5 0. 33 −5 .6 7 0. 00 0* C −1 .6 2 0. 32 −5 .0 5 0. 00 0* C 0. 23 0. 31 0. 75 0. 45 4 a M C 24 −0 .8 5 0. 36 −2 .3 6 0. 02 0* a −0 .9 9 0. 50 −1 .9 8 0. 04 9* a −0 .1 3 0. 54 −0 .2 5 0. 80 4 a sC 25 −0 .3 3 0. 36 −0 .9 4 0. 35 1 a −0 .4 6 0. 30 −1 .5 6 0. 12 0 a −0 .1 3 0. 29 −0 .4 4 0. 65 8 a sC 26 −1 .9 6 0. 38 −5 .1 3 0. 00 0* C 0. 12 0. 29 0. 40 0. 68 8 a 2. 07 0. 39 5. 29 0. 00 0* C sC 27 2. 88 0. 39 7. 35 0. 00 0* C −0 .1 9 0. 29 −0 .6 7 0. 50 6 a −3 .0 7 0. 40 −7 .6 0 0. 00 0* C sC 28 0. 43 0. 63 0. 68 0. 49 8 a −1 .7 9 0. 44 −4 .0 8 0. 00 0* C −2 .2 2 0. 50 −4 .4 1 0. 00 0* C sC 29 −2 .4 9 0. 35 −7 .1 1 0. 00 0* C −3 .5 2 0. 43 −8 .1 2 0. 00 0* C −1 .0 2 0. 41 −2 .5 1 0. 01 2* B sC 30 −0 .6 4 0. 39 −1 .6 5 0. 10 0 a −1 .2 7 0. 33 −3 .8 0 0. 00 0* B −0 .6 3 0. 30 −2 .0 8 0. 03 8* a sC (C on tin ue d) 70 F. EFFaTPanaH ET aL. Ite m s Cl as s 1 vs . C la ss 2 Cl as s 1 vs . C la ss 3 Cl as s 2 vs . C la ss 3 Ite m fo rm at ty pe o f sp ee ch D If C on tr as t Jo in t s. e. t p- va lu e et s D If C on tr as t Jo in t s. e. t p- va lu e et s D If C on tr as t Jo in t s. e. t p- va lu e et s 31 0. 28 0. 29 0. 97 0. 33 3 a 2. 44 0. 24 0. 21 0. 00 0* C 2. 16 0. 26 8. 33 0. 00 0* C sC M on ol og ue 32 0. 45 0. 31 1. 44 0. 15 2 a −0 .6 4 0. 29 −2 .2 3 0. 02 7* a −1 .0 9 0. 31 −3 .5 1 0. 00 1* B sC 33 0. 44 0. 36 1. 21 8. 22 9 a −8 .1 3 8. 28 −8 .4 8 0. 63 3 C −0 .5 7 0. 32 −1 .8 1 0. 07 2 a sC 34 1. 83 0. 33 5. 49 0. 00 0* C 0. 26 0. 27 0. 95 0. 34 3 a −1 .5 7 0. 33 −4 .7 4 0. 00 0* C tC 35 1. 29 0. 36 3. 53 0. 00 1* B −0 .2 8 0. 27 −1 .0 3 0. 30 3 a −1 .5 7 0. 35 −4 .5 3 0. 00 0* C tC 36 −1 .2 3 0. 31 −4 .0 1 0. 00 0* B −2 .4 9 0. 47 −5 .3 5 0. 00 0* C −1 .2 5 0. 48 −2 .6 1 0. 00 9* B tC 37 0. 41 0. 47 0. 87 0. 38 3 a −0 .5 2 0. 33 −1 .5 6 0. 11 9 a −0 .9 3 0. 39 −2 .4 1 0. 01 7* a tC 38 0. 77 0. 31 2. 49 0. 01 4* a −2 .2 4 0. 73 −3 .0 6 0. 00 2* C −3 .0 1 0. 74 −4 .0 9 0. 00 0* C sC 39 1. 86 0. 31 5. 99 0. 00 0* C −0 .5 8 0. 38 −1 .5 3 0. 12 7 a −2 .4 4 0. 40 −6 .0 8 0. 00 0* C sC 40 0. 48 0. 29 1. 63 0. 10 5 a −0 .2 1 0. 30 −0 .7 0 0. 48 7 a −0 .6 8 0. 31 −2 .1 8 0. 03 0* a sC N ot e. s .e .: st an da rd e rro r of m ea su re m en t. * p < 0. 05 ; e ts : e du ca tio na l t es tin g se rv ic e; a : n eg lig ib le ; B : m od er at e; a nd C : l ar ge . Ta bl e 5. C on tin ue d. InTERnaTIOnaL JOuRnaL OF TESTIng 71 DIF contrasts, the Welch t value, the p-value for the contrasts, and ETS DIF classifica- tion of effect size measures. The ‘DIF Contrast’ columns provide the difference between the local difficulty estimates of the items across the classes. A positive DIF contrast indicates that the item is more difficult for the first, left-hand-listed class, and a nega- tive DIF contrast indicates that the item is more difficult for the second, right-hand- listed class. The Joint S.E. is the standard error of the DIF contrast. The Welch t value also shows the statistical significance between the local difficulties of items as a Student’s two-sided t statistic (Linacre, 2024). The null hypothesis is that the two estimates are the same, except for measurement error. For instance, as illustrated in Table 4, the dif- ficulty of Item 1 is −1.50 for Class 1 and −2.98 for Class 2; the contrast in difficulty is 1.48 with a joint SE of 0.78, as depicted in Table 5, indicating that Item 1 is more difficult for members of Class 1; the Welch t value of this contrast is 1.89; and the p-value of the contrast is 0.060, which does not meet the established significance thresh- old of 0.05. The DIF contrast value shows a moderate (B) DIF. As can be seen in Table 5, the DIF analysis between Classes 1 and 2 identified 22 items (i.e., 3, 4, 5, 6, 9, 10, 12, 16, 17, 18, 21, 22, 23, 24, 26, 27, 29, 34, 35, 36, 38, and 39) with significant DIF at p < 0.05. The analysis between Classes 1 and 3 also revealed that most items function differentially across the classes, and fifteen items (i.e., 1, 2, 13, 18, 21, 22, 25, 26, 27, 33, 34, 35, 37, 39 and 40) did not exhibit differential functioning between the two classes. Similarly, the analysis between Classes 2 and 3 indicated that only 14 out of 40 items did not function differently across the classes (i.e., 1, 8, 9, 10, 11, 13, 14, 16, 18, 20, 23, 24, 25, and 33). Items 3, 4, 5, 6, 12, 17, 29, 36, and 38 showed differential functioning across the three classes. The substantial difference across the classes was supported by a moderate Spearman rank-order correlation between difficulty parameter estimates of Classes 1 and 2 (r = 0.605, 95% CI [0.353–0.775], p < 0.001), of Classes 1 and 3 (r = 0.697, 95% CI [0.485–0.831], p < 0.001), and of Classes 2 and 3 (r = 0.676, 95% CI [0.454–0.819], p < 0.001). This sug- gests slight agreement between the item parameter estimates across the classes. The mean difference in raw scores can be used to represent the difference in listening com- prehension ability across classes because most items for the three classes fitted the Rasch model well, and the mean of item difficulty was assumed to be consistent across the classes to link the metrics of classes to a common scale (Rasch, 1977; Wang, 2004). Independent-sample t-tests were thus performed to investigate whether mean of raw scores across the three classes were statistically significant. The results showed a signifi- cant difference in mean between Class 2 (M = 29.63, SD = 5.000) and Class 1 (M = 27.77, SD = 5.026, t (222) = 2.748), p = 0.006), suggesting that members of Class 2 showed con- siderably better test performance compared to Class 1 members. There was also a sig- nificant difference in mean between Class 2 and Class 3 (M = 13.84, SD = 5.436, t (363) = 27.177), p < 0.001) as well as between Class 1 and Class 3 (t (333) = 21.744), p < 0.001). This indicates that members of Class 1 had a significantly better performance relative to Class 3 members and that Class 2 members significantly outperformed members of Class 3. A minor difference in reliability was also observed across the three latent classes. For Class 1, the reliability was 0.79, with 99% CIs ranging from 0.75 to 0.83. Class 2 showed a reliability of 0.79, with CIs between 0.71 and 0.82, while Class 3 had a reliability of 0.79, with CIs of 0.73–0.85. 72 F. EFFaTPanaH ET aL. Discussion This study applied the MRM to examine latent class DIF in the listening comprehen- sion section of the IELTS. The goal was to identify multiple profiles of listeners who exhibit qualitative differences in their listening processes when answering test items. Alternative Rasch and 2-PL models with one to five latent classes were considered, and the Rasch model with three latent classes yielded the best fit. Class 1 comprised approximately 50% of the sample, while Classes 2 and 3 included about 28% and 21%, respectively. To capture qualitative differences among the classes, some speculations regarding the individual differences of the classes are first proposed. The processes examinees of the three classes may distinctively use to answer the test items are char- acterized. Then, a content analysis is conducted to identify potential causes of DIF in test items. Labeling and characterizing latent classes The emergence of three latent classes indicates significant qualitative differences in how examinees in each class approached the listening test items. The results of person parameters and mean test performance across the classes revealed that Class 2 consists of high-level examinees with the highest listening ability and less variability; Class 1 includes moderate-level examinees with high proficiency but more variability; and Class 3 is comprised of low-level examinees with the lowest ability and moderate variability. Most of the items generally favored examinees in Class 2, although the items from Section 3 mostly functioned in favor of examinees in Class 1. The easiest items for the three groups belonged to Sections 1 and 2, while the most difficult items for Classes 1 and 3 were from Sections 3 and 4. Examinees in Class 2 found items in Section 3 as the most challenging items. Overall, the results appear to emphasize that examinees with varying levels of listen- ing proficiency approach the items differently. In fact, there is a different pattern in the information processing capacity of examinees with different listening abilities. To com- prehend listening input and produce correct answers, L2 examinees need to utilize both lower- and higher-level processing skills. As the L2 listening literature suggests (Field, 2013; Rukthong & Brunfaut, 2020), it is unlikely that examinees can successfully activate higher-level cognitive processes without effectively engaging in lower-level cognitive processing, such as lexico-grammatical knowledge, word recognition, and parsing. Examinees must utilize higher-level cognitive processes to understand the main point of the input they are listening to (Rost, 2016). Therefore, higher-level examinees use both types of cognitive processes to answer test items. As articulated by Goh and Vandergrift (2022), examinees with higher-level listening abilities can seamlessly synchronize the higher- and lower-level as well as top-down and bottom-up processes in a rapid, almost subconscious manner. However, low and moderate-ability examinees are dependent on “controlled listening processes which entail conscious attention to and processing of ele- ments in the speech stream.” (p. 19). Fluent or automatic accessing of lower-level skills takes up little of examinees’ attention, thereby leaving sufficient cognitive capacity for higher-level skills (Rost, 2016). This aligns with Cognitive Load Theory (Sweller, 1988), which emphasizes the importance of managing cognitive load to maximize learning effi- ciency. More particularly, fluent access to linguistic repertoire significantly enhances the InTERnaTIOnaL JOuRnaL OF TESTIng 73 efficient use of working memory (Field, 2013). This efficient use of working memory is crucial, as Limited Attentional Capacity theory suggests that finite attentional resources must be allocated effectively to process information successfully (Kahneman, 1973). Therefore, it seems that Class 2 includes examinees who can effectively activate both top-down and bottom-up processes as well as lower- and higher-level skills. They are also efficient at decoding sounds, words, and syntactic structures due to their high listening proficiency and greater lexico-grammatical knowledge. Moderate-level exam- inees (Class 1) seems to possess both some ability in either top-down or bottom-up processes, but lack proficiency in integrating both effectively, and a decent grasp of lower-level skills, but struggle with higher-level comprehension tasks. Also, they are generally good at decoding sounds and words, but may occasionally struggle with more complex or less familiar linguistic structures, and can parse syntax and grammar effec- tively, though perhaps not as quickly or accurately as the high-level group. However, due to their limited linguistic repertoire, low-level examinees (Class 3) appears to struggle with recognizing sounds, words, and syntactic structures, leading to difficulties in constructing appropriate meaning from oral input, misinterpreting or missing parts of the stimuli, and hindering overall comprehension. More importantly, members of this class are likely to depend on bottom-up listening processing which may potentially hinder their ability to effectively engage in top-down processing and develop a com- prehensive mental representation of the auditory stimuli (Imhof & Janusik, 2006). In essence, examinees in Class 3 exhibit a lower aptitude for coordinating between top- down and bottom-up processes and retrieving pertinent knowledge from memory com- pared to their counterparts in Classes 2 and 1. Any difficulty in retrieving lower-level processes also increases the cognitive processing load and prevent a listener from leav- ing sufficient cognitive capacity for higher level processes. This finding also aligns with Limited Attentional Capacity theory (Kahneman, 1973) stating that individuals have limited attentional resources, and the involvement of higher-level subskills likely increases the cognitive load. Class 3 examinees may have struggled to allocate their attention effectively between the low- and high-level skills, indicating their lack of competence in distributing their attentional resources across the skills more effectively. Given their limited lexico-grammatical knowledge, examinees in Class 3 seem to struggle with vocabulary recognition and use most of their memory capacity to recover word meanings (Aryadoust, 2012). Consequently, these examinees are more likely to employ specific strategies to mitigate the adverse impact of working memory limita- tions. Several researchers have posited that individuals with lower to moderate listening abilities employ metacognitive strategies and compensatory mechanisms (e.g., relying on general world knowledge, common sense, mental translation, cultural information, and visual, contextual, or paralinguistic cues) to manage their listening processes. These strategies help compensate for their deficiency in specific subskills of the target language and facilitate the coordination between lower-level and higher-level processing (Effatpanah, 2019; Goh & Vandergrift, 2022). Echoing this perspective, Harding et  al. (2015, p. 12) contend that “comprehension does not strictly adhere to a linear progres- sion from lower-level to higher-level processing; instead, various levels may operate con- currently, with difficulties at one level being offset by ‘positive information’ at another, or with simultaneous challenges at both higher and lower levels leading to overall miscomprehension.” 74 F. EFFaTPanaH ET aL. The study also highlighted the impact of genre on test performance, with Sections 1 and 2 focusing on general topics and Sections 3 and 4 dealing with academic contexts. Genre has been found to significantly affect examinees’ performance (Chen & Chen, 2021), emphasizing the importance of considering genre-related factors in listening comprehension tests. Previous studies showed that although genre is not an important factor for grasping the theme of the text, it has a great impact on understanding the details and main points of lectures (Chen & Chen, 2021). The results revealed that examinees across the classes had better performance in the two first sections relating to general topics, while examinees in Class 2 with higher listening comprehension ability outperformed in the last two sections focusing on academic places. In the later sections of the IELTS listening test, items become more complex, includ- ing more paraphrased content, longer sentences, multitasking tasks, and faster delivery. Generally, items in Sections 1 and 2 mainly involve understanding smaller chunks of oral input and require lower-level cognitive processing, such as grasping factual infor- mation and functional relationships. In contrast, items in Sections 3 and 4 involved understanding longer chunks of oral input and demanded higher-level cognitive pro- cessing, such as making inferences, paraphrasing, and integrating listening skills with other abilities like reading, note-taking, and writing (Aryadoust, 2012; Effatpanah, 2019). It appears that the better performance of examinees in Class 2 can emanate from their ability to handle more challenging items and possess more cognitive capacity and stim- ulus-focused attention for increasing their speed of processing. However, certain factors like lengthy sentences and unclear item instructions may inadvertently tap into reading comprehension processes and negatively affect the performance of examinees. Class 3 examinees struggled in Section 4, which required them to follow a fast-paced stream of oral input and integrate listening ability with other skills. Integrated tasks like lectures in Section 4 are more authentic but demanding. Examinees in Class 3 may have lagged behind the audio stream, leading to missed items. It has been shown that integrated tasks such as the lecture type in Section 4 of the IELTS test are more authentic than conventional item formats, including MC and matching items, and reduce the effect of background knowledge (Rukthong & Brunfaut, 2020). However, Aryadoust (2012) argues that the concurrent exposure to written and oral inputs impedes note-taking. Therefore, it can be assumed that those examinees who fell behind the stream of oral stimuli missed several questions. This can be attributed to restricted reading skills, memory capacity, stimulus-focused attention, test wiseness, test-taking strategies, and other con- fining factors (Aryadoust, 2012; Estaji & Banitalebi, 2023; He & Jiang, 2020). Taken together, the three distinct classes of listeners can be labeled as: “High-level Stimulus Processors” (Class 2); “Moderate-level Stimulus Processors” (Class 1); and “Low- level Stimulus Processors” (Class 3). Class 2 examinees exhibited better abilities in syn- chronizing top-down and bottom-up processing, operationalizing higher-level cognitive processes, understanding longer chunks of oral input, possessing more cognitive capac- ity and stimulus-focused attention, handling multitasking and integrated items, compre- hending complex items, and understanding items with a high speed of delivery and paraphrased content. Class 1 examinees exhibited a balanced approach to listening tasks. They are capable of effectively combining top-down and bottom-up processing but may not do so as seamlessly as Class 2. These examinees can understand and interpret oral input well but may require a bit more effort to comprehend longer or more complex InTERnaTIOnaL JOuRnaL OF TESTIng 75 stimuli. They demonstrate good cognitive capacity and can manage multitasking and integrate items reasonably well. However, their performance may vary depending on delivery speed and paraphrased content. Class 3 examinees, in contrast, struggled with these tasks, relying more on lower-level processing and metacognitive strategies to com- pensate for their limitations. These findings shed light on diverse approaches individuals take in listening comprehension tests and highlight the impact of various cognitive pro- cesses on test performance. Content analysis of test items The post-hoc analysis and the item difficulty profiles across the three classes indicated that almost all items contributed to the differences among the classes. A content anal- ysis of the items was conducted to identify the causes of observed DIF and further analyze the above-mentioned speculations about the information processing of examin- ees across the classes. The identification of main sources of DIF is often demanding, especially in exploratory DIF studies where a priori hypothesis is absent (Zumbo, 2007). The authors consulted previous studies (i.e., Aryadoust, 2012; Effatpanah, 2019; Geranpayeh & Taylor, 2008) in which attributes required to correctly answer listening items of the IELTS were discussed. Test items primarily tap examinees’ linguistic knowl- edge, understanding of detailed information, comprehending explicitly stated general and literal information, understanding of paraphrases, making inferences, and compre- hending of illocutionary meaning. Geranpayeh and Taylor (2008) argue that the listen- ing inputs, developed by the University of Cambridge ESOL Examination Syndicate for WLP tests, are designed “with some internal repetition”, and that test items focus on “explicit and easily accessible information” to decrease “any potential negative impact of hearing the text only once in slightly adverse conditions”, with “key information rephrased and repeated within the text” to allow examinees to confirm their answers as they listen (p. 3). Therefore, such tests appear to narrowly reflect the listening construct by concentrating mainly on the understanding of details (Aryadoust, 2012), which impose severe demands on the examinees’ memory load as they need to focus on spe- cific details and memory skills (Shohamy & Inbar, 1991). In the following sections, the potential causes of DIF across the four sections of the test are discussed. Section 1 Items 1 and 2.  In this section (social dimension), there is a dialogue between a woman and a police officer in which the woman tells the story of a robbery. The attributes or primary dimension targets for these items are linguistic knowledge, world knowledge, making inferences, ability to understand detailed or specific factual information, and ability to integrate listening with visual skills. Our post-hoc analysis showed that Item 2 was significantly easier for Class 2 members than Class 3. They were generally expected to outperform on this item because they can more accurately identify locations and effectively follow instructions (i.e., spatial and directional understanding) due to their higher listening comprehension ability. The significant difference across the classes could also be due to the confounding role of gender. Previous studies reported that males tend to perform better on map labeling items than females because of their higher verbal processing capacity (Aryadoust, 2012; O’Neill & McPeek, 1993). 76 F. EFFaTPanaH ET aL. Items 3–6.  These items significantly contributed to variability among examinees across the three classes. The primary dimensions are linguistic knowledge, world knowledge, ability to make paraphrase, the ability to understand detailed or specific factual information, and making inferences. As expected, Items 3 and 6 favored Class 2 with the lowest item difficulties. However, Items 4 and 5 functioned in favor of low- and medium-level examinees (Classes 3 and 1). This unexpected result accords with previous studies reporting that examinees whose listening comprehension ability ranges from low to moderate level tend to achieve higher scores on MC items (Aryadoust, 2012; Chang & Read, 2013). It could be attributed to the influence of test-taking strategies, with low- and moderate-ability listeners being more inclined to guess, and these guesses sometimes result in correct answers. With three options for each MC question (Items 3 to 5), there is a relatively high probability of success (33%) when making a random guess. Furthermore, research on cognitive psychology has shown that males exhibit a pro- pensity to take greater risks, such as opting for a lucky guess, when confronted with a problem (Buck, 2001). Consequently, males tend to adopt a more risk-taking approach, potentially resulting in higher scores and excel over females in MC items (Bolger & Kellaghan, 1990; Mazzeo et  al., 1993). However, females exhibit greater reluctance to guess on MC items compared to males and are inclined to skip items they are uncertain about (Aryadoust, 2012). Another possible reason is that high-level examinees might miss easy items due to carelessness as there is a lack of shared incorrect answers among Class 2 members. This conjecture finds reinforcement in the outfit MNSQ patterns of this Class: numerous incorrect responses by high-ability examinees on easy items had outfit MNSQ values exceeding 1.3, indicating that their performance on these items was unanticipated. Items 7–10.  These items measure examinees’ linguistic knowledge, world knowledge, ability to understand detailed or specific factual information, and ability to integrate listening, reading, short-term memory span, and writing abilities. Our post-hoc analysis indicated that except for Item 8, the other items functioned in favor of Class 2, with the smallest item difficulty values. However, the results showed some unexpected patterns. For example, the advantage of Item 8 for Class 3 was unexpected. Items 9 and 10 also indicated a disadvantage for Class 1 compared to Class 3. Given that outfit MNSQ is sensitive to outliers, this suggests that some high- and moderate- level examinees missed easy items, and low-level examinees correctly answered more difficult items. Moreover, the differences across the classes could be attributed to the effect of gender. Research has indicated that females have better performance on table completion items because these items require linguistic inference and detailed comprehension (Aryadoust, 2012). Section 2 Items 11–15.  In this section (social dimension), there is a woman providing commercial information about an English Hotel (Bridge Hotel). The primary dimensions are linguistic knowledge, world knowledge, ability to make paraphrases, ability to understand detailed or specific factual information, and make inferences. The results of post-hoc analysis revealed that except for Item 13 favoring Class 2, Items 11 and 12 favored Classes 3 and 1, respectively. Similar to MC items in section 1, low- and moderate-level InTERnaTIOnaL JOuRnaL OF TESTIng 77 examinees show more tendency to get higher scores on MC items because of test-taking strategies and guessing factors. As noted above, males are also more risk averters than females in MC items. Items 14 and 15 also indicated a significant advantage for Class 3. For these items, examinees should choose two out of five choices. The better performance of Class 3 members can be justified by two main reasons. First, there is an overlap in wording between oral input, the correct choices, and the distractors. Researchers showed that lex- ical overlap between the text and the answer options impacts item difficulty (Freedle & Kostin, 1996); the more overlap, the easier item will be. However, the values of outfit MNSQ for Item 14 show that the better performance of Class 3 members is unexpected, likely due to the lack of common correct answers in the responses of the group members. Therefore, some examinees in this group might have correctly answered the item by chance. Second, although the answers are clearly stated in the oral stimuli, it seems that the combined length of these two items and the provided options impacted the perfor- mance of examinees. Researchers argued that the presence of two consecutive items in the IELTS listening can increase item difficulty and negatively affect the compre- hension of examinees (Coleman & Heap, 1998). These items might be assessing exam- inees’ reading speed and memory span, which are not relevant to the listening ability being measured. Such challenges affected the performance of high-ability examinees in Class 2. Additionally, it appears that examinees in Class 2 missed these two items out of carelessness. This supposition is supported by the outfit MNSQ patterns of this group, suggesting that their performance on these items was abnormal. Overall, the results indicate that the interaction between the combined length of items and the overlap in wording can significantly impact the performance of examinees. Items 16–20.  These items measure examinees’ linguistic knowledge, world knowledge, ability to understand detailed or specific factual information, and ability to integrate listening, reading, short-term memory span, and writing abilities. Our post-hoc analysis showed that (1) Item 17 contributed to variability across all the classes, favoring Class 3; (2) Item 16 favored Class 2; (3) Items 17 and 19 functioned in favor of Class 3; and (4) Items 18 and 20 favored Class 1. The results show that some test items seem to have imposed listening-construct-irrelevant challenges on examinees. For instance, a portion of the input including the correct response for Item 17 is “… full cooked breakfast and evening entertainment …”. Examinees should simultaneously read the stem (i.e., Full cooked breakfast Entertainment in the …), listen to the input, and write their answer. The item also requires examinees to mentally rearrange the vocabulary. Due to the difficulty of reading this specific item format simultaneously, this rearrangement might place additional memory demands on examinees, particularly because the response to Item 16 is just a few words earlier in the oral input. This indicates that the test item format poses a challenge for examinees (Field, 2009). Such item formats require understanding numerous details, which hinders deep comprehension of the material (Field, 2009). Therefore, this could be a possible reason why high-level examinees (Class 2) missed Item 17 compared to low- and moderate- level examinees, who might have missed Item 16 and only focused on Item 17. Another noteworthy point that warrants special consideration is the effect of item instruction on the performance of examinees. The instruction of Items 16–20 78 F. EFFaTPanaH ET aL. is “Complete the sentences below. Write NO MORE THAN TWO WORDS AND/ OR A NUMBER for each answer.” The interpretation of the statement can vary depending on linguistic, cultural, and educational background of examinees. Researchers have argued how language, culture, and education shape interpretation (Hofstede, 2001; Weir, 1990). Therefore, more research is required to illuminate to what extent item instruction can cause different item performance across examin- ees with different nationalities and L1 background. This variability in the interpre- tation of the item instruction might have caused a misunderstanding among examinees, although the word limit for all production items is three. The answer to Item 18 is “(four-course) dinner,” with “four-course” being optional. It appears that many high-level examinees may have erroneously assumed that providing all three words was necessary for the correct response, and some of them may have written ‘four course dinner’ instead of ‘four-course dinner’. Despite the instructions not clearly stipulating the required word count, it seems that the cognitive load of maintaining three words simultaneously proved challenging for these examinees. Conversely, examinees with low and moderate abilities who managed to recall the final word, ‘dinner’, were able to successfully include it in their response. It is also possible that due to misspelling, high-level examinees did not receive any partial credit, highlighting the importance of using polytomous scoring scale (Bodie et  al., 2011). These findings are totally in line with Aryadoust (2015). The better performance of examinees with low to moderate abilities diverges with Coleman and Heap (1998) study stating that table completion items are the most chal- lenging items for examinees in the listening section of the IELTS, because test formats requiring examinees to write words in gaps or to compose responses to short-answer questions impose a substantial additional demand unrelated to the construct of listening ability. The variances observed among the classes may also be linked to gender, with females showing a better performance in tasks such as table completion. Section 3 Items 23.  In this section (academic dimension), there is a conversation between three students on campus talking about study programs. The primary dimension targets for these items are linguistic knowledge, world knowledge, ability to make paraphrases, ability to understand detailed or specific factual information, and making inferences. Our post-hoc analysis indicated that the item favored examinees with low to moderate abilities, likely due to test-taking strategies and guessing, with males exhibiting more tendency to take a risk than females. Items 21–22 and 24–30.  The attributes are linguistic knowledge, world knowledge, ability to make paraphrases, ability to understand detailed or specific factual information, and ability to integrate listening, reading, short-term memory span, and writing abilities. The post-hoc analysis showed that Items 27 and 28 favored Class 2, and the remaining items mostly functioned in favor of moderate-level examinees (Class 1), followed by Class 2. The instruction for the items is “Complete the sentences below. Write NO MORE THAN TWO WORDS AND/OR A NUMBER for each answer.” The performance of examinees might have been impacted by the item instructions, with high-level InTERnaTIOnaL JOuRnaL OF TESTIng 79 examinees potentially missing easy items due to carelessness and unclear guidance. This finding disagrees with previous studies (e.g., Aryadoust, 2012) reporting limited produc- tion items tend to favor high-ability examinees due to their greater difficulty. However, in this study, low- and moderate-level examinees had a better performance on such items. Additionally, females generally tend to exhibit better performance in SC items than males (Aryadoust, 2012). Section 4 Items 31–33 and 38–40.  In this section (academic dimension), a guest university lecturer presents a talk about a bird of prey (Peregrine Falcons). The primary dimensions are linguistic knowledge, world knowledge, ability to make paraphrases, ability to understand detailed or specific factual information, making inferences, and ability to integrate listening, reading, short-term memory span, and writing abilities. Our post-hoc analysis showed that except for Item 31 favoring Class 3, the other items functioned in favor of Class 2, followed by Class 1. There are two possible reasons for the better performance of low-level examinees in Item 31. First, the necessary information for answering the item is located at the beginning of the oral input, making answering the item easier. Research has shown that the location of required information for answering a test item affects item difficulty, with items located at the beginning of input being easier compared to items located in the middle or at the end (Yanagawa & Green, 2008). Second, the instruction for Items 31–33 is “Complete the sentences below. Write NO MORE THAN THREE WORDS for each answer.” Similar to TC items in section two, the performance of moderate- and high-level examinees might have been affected by the lack of clarity in the item instruction. Females also tend to outperform their male counterparts in TC items (Aryadoust, 2012). Items 34–37.  The attributes for these items are linguistic knowledge, world knowledge, ability to make paraphrases, ability to understand detailed or specific factual information, making inferences, and ability to integrate listening, reading, short-term memory span, and writing abilities. The post-hoc analysis showed that except for Item 36, the remaining items functioned in favor of high-level examinees (Class 2). Item responses of examinees indicate that the item appears to have presented challenges to examinees that are unrelated to the listening construct being assessed. Item 36 was designed to assess the comprehension of specific details. The correct answer for the item is “leave the nest”. However, some high-level examinees wrote “live” instead of “leave” on their answer sheets, suggesting that phoneme recognition abilities, while different from listening comprehension, played a notable role. It could be a viable reason to state that some high-level examinees missed this item out of carelessness. Unlike TC items in section 2 where low- and moderate-level examinees outperformed their high-level counterparts, Class 2 members had a better performance on TC items in section 4. This contradiction between the findings can be ascribed to the effect of other factors, such as more complex items, more paraphrased content, longer sentences, more multitasking items, and faster speech delivery in section 4 compared to section 2. Therefore, the location of item formats can affect item difficulty, especially in the IELTS listening test. Females also exhibit better performance in SC items than males (Aryadoust, 2012). 80 F. EFFaTPanaH ET aL. Implications, limitations, and directions for future research This study holds several methodological, theoretical, and pedagogical implications. From a methodological standpoint, the current study builds upon and extends prior applications of MRM in educational testing, particularly in the identification of latent class DIF and the exploration of multiple profiles in L2 listening comprehension. From a theoretical standpoint, gaining a deeper understanding of individual difference profiles would allow scholars to model language processing in a more cohesive manner and develop more robust theories and models of language acquisition and performance, especially with respect to L2 listening comprehension. From a pedagogical perspective, the findings of this study underscore the significance of detecting latent class DIF and exploring multiple profiles in L2 listening comprehension. Analyzing each profile aids test developers in understanding the test-taking patterns of examinees and the mental processes employed by them to achieve correct responses. It also provides teachers with insights into individ- ual differences among students and their learning status. Consequently, they can adapt their classroom instructions to enhance students’ learning, refine instructional materials and activities, customize instruction based on students’ needs and challenging areas, and ultimately provide feedback to promote effective teaching and learning. Several limitations should also be considered when interpreting the results of this study. Firstly, the examinees who took the listening test were drawn from a relatively small international population. Additional research with a larger and more diverse pop- ulation is required to corroborate the generalizability of the findings. Secondly, this study considered only a limited number of item types (e.g., multiple-choice, map label- ing, table completion, and sentence completion) in the administered test. In future stud- ies, researchers may consider incorporating a broader range of item formats such as classifying and matching items in their tests. Third, it is important to note that the sample size for the present study may be considered relatively modest for the applica- tion of MRM. One limitation of MRM is its requirement for larger sample sizes, par- ticularly when extending it to polytomous models, where increasing the number of latent classes necessitates larger samples. As argued by von Davier and Rost (1995), to obtain accurate parameter estimates in a multidimensional analysis, the required sample size should be multiplied by the number of classes. In previous studies, sample sizes greatly varied from 99 to 251,278. A number of studies showed that despite the poten- tial for higher standard error of estimates with small sample sizes and an increase in test items and categories, MRM can still produce stable parameter estimates with rela- tively modest sample sizes (e.g., Aryadoust, 2015; Frick et  al., 2015). Future studies can use larger sample sizes to apply the MRM to explore individual difference profiles across language skills and components. Another limitation of this study was the inability to fully characterize the three latent classes identified due to the lack of access to covariates or variables typically used for such characterization. The study relied solely on item responses from IELTS examinees. Although this practice of analysis is the main purpose of using the MRM, this may constrain the depth of insight into the latent classes’ characteristics and associations with relevant covariates, potentially leading to a less nuanced understanding of student performance or behavior. Therefore, an intriguing avenue for further investigation involves considering a range of covariates and/or factors to provide a comprehensive InTERnaTIOnaL JOuRnaL OF TESTIng 81 picture of individual difference profiles in L2 listening comprehension. Covariates such as lexico-grammatical knowledge, metacognitive strategies, age, gender, working mem- ory capacity, self-efficacy, motivation, and contextual factors could provide valuable insights into the analysis of listening profiles. In particular, it is important to consider working memory capacity in future studies because it significantly affects the performance of examinees with different working memory levels. Working memory capacity influences how well individuals can process and retain information, which is crucial in tasks such as L2 listening comprehension (Goh & Vandergrift, 2022). By accounting for this variable, researchers can better understand and interpret the differences in performance among examinees. Also, the results showed that there should be an interaction among covariates (e.g., gender and item types) causing DIF. As a recent development in DIF analysis, future studies can apply advanced tree models (e.g., Grassi & Tarantino, 2023; Henninger et  al., 2023) for investigating DIF of the IELTS listening test. Moreover, the assumptions about what different parts of the IELTS listening test mea- sure, as shown in Table 1 and argued in the discussion, were based on the subjective judgment of the authors of this study and those in similar studies (e.g., Aryadoust, 2012; Buck, 2001). These descriptions are broad and refer to attributes collectively measured by each item type across the four parts of the test. For instance, MC items are intended to measure a wide range of traits, such as the ability to attentively listen and comprehend the main idea and details of oral stimuli, identify key information like main points and supporting details, understand context, use inference skills to deduce implied meanings, and distinguish between similar options to eliminate distractors. However, a broad descrip- tion like this does not specify which attributes are measured by each MC item. Typically, an item cannot measure the entire range of attributes mentioned. Future studies could employ empirical methods, such as diagnostic classification models (DCMs; Kunina- Habenicht et  al., 2009; Ravand & Baghaei, 2020; Ravand et  al., 2019), to identify the attributes measured by each individual item rather than by each item type as a whole. This approach would provide a clearer understanding of the causes of DIF when an item is flagged for DIF. In the absence of covariates to explain the differences in the latent classes, one can refer to the abilities and traits measured by different items on the test to explain the differential performances of the three latent classes. The TAM package (Robitzsch et  al., 2024) employed in this study uses equal-mean-dif- ficulty anchor method, which assumes the mean of difficulty across items is equal among classes or (sub)groups (commonly equal to zero). Therefore, in this study, the mean of difficulty parameters across items was constrained to be equal to zero for each latent class in MRM analysis. The assumption might be violated when the direction of DIF is unbal- anced (Kopf et  al., 2015) between classes. Another popular anchor method is the constant anchor method, where a set of items as DIF-free items are prespecified (Meade & Wright, 2012). Several researchers have shown that the use of constant anchor method in mixture IRT models for identifying latent classes leads to better model-data fit (Chen et  al., 2023), and that the purification for the constant anchor method reduces the type I error rate in DIF analysis (Meade & Wright, 2012). Future studies can thus use the constant anchor method for exploring multiple profiles of L2 listeners. Numerous researchers have developed both unidimensional (e.g., Sen et  al., 2019; Tseng & Wang, 2021) and multidimensional generalizations (e.g., von Davier, 2008) of MRM, demonstrating successful application of the models to language data. Researchers 82 F. EFFaTPanaH ET aL. could further extend the application of MRM using these models to investigate their effectiveness in exploring profiles of examinees within the domain of educational mea- surement and language assessment. Of particular interest are also potential applications for MRM in the area of PLP listening tests. Previous studies have predominantly utilized WLP tests to explore multiple profiles of listeners and investigate latent class DIF. However, no prior research has ventured into apply- ing the MRM to PLP tests to analyze cognitive processes of examinees and identify their solution patterns while answering a set of test items. Future studies could delve into examin- ing to what extent the introduction of visual input into listening tests may alter listening processing patterns. A number of researchers have advocated for the inclusion of nonverbal information such as gestures, posture, facial expressions, and body movement, often observed in authentic communications, as integral components of the L2 proficiency construct (Lesnov, 2022; Park et  al., 2022; Wagner, 2013; Wolvin & Coakley, 1993). Finally, future studies could directly hypothesize and investigate the bottom-up and top- down listening processing strategies employed by examinees, which yield valuable insights into language comprehension and test performance. Various methodologies such as eye-tracking technology, cognitive interviews, or neuroimaging techniques can be used to observe and analyze how examinees engage in these processes. By understanding which strategies are more effective for different individuals or in various contexts, educators and test developers can tailor instructional approaches and assessment designs to better support language learning and assessment. Disclosure statement No potential conflict of interest was reported by the author(s). Research data policy and data availability statements The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable requests. Funding The author(s) received no specific funding for this work from any funding agencies. ORCID Farshad Effatpanah http://orcid.org/0000-0003-3970-5588 Purya Baghaei http://orcid.org/0000-0002-5765-0413 Hamdollah Ravand http://orcid.org/0000-0002-8757-3850 Olga Kunina-Habenicht http://orcid.org/0000-0002-1646-8260 References Ackerman, T. A., Gierl, M. J., & Walker, C. M. (2003). Using multidimensional item response theory to evaluate educational and psychological tests. Educational Measurement: Issues and Practice, 22(3), 37–51. https://doi.org/10.1111/j.1745-3992.2003.tb00136.x https://doi.org/10.1111/j.1745-3992.2003.tb00136.x InTERnaTIOnaL JOuRnaL OF TESTIng 83 Alavi, S. M., Kaivanpanah, S., & Masjedlou, A. P. (2018). Validity of the listening module of international English language testing system: Multiple sources of evidence. Language Testing in Asia, 8(1), 1–17. https://doi.org/10.1186/s40468-018-0057-4 Alexeev, N., Templin, J. L., & Cohen, A. S. (2011). Spurious latent classes in the mixture Rasch model. Journal of Educational Measurement, 48(3), 313–332. https://doi.org/10.1111/j.1745- 3984.2011.00146.x American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME) (2014). Standards for educational and psychological testing. AERA. Aryadoust, V. (2012). Differential item functioning in while-listening performance tests: The case of IELTS listening test. International Journal of Listening, 26(1), 40–60. https://doi.org/10.1080/ 10904018.2012.639649 Aryadoust, V. (2015). Fitting a mixture Rasch model to EFL listening tests: The role of cognitive and background variables in explaining latent differential item functioning. International Journal of Testing, 15(3), 216–238. https://doi.org/10.1080/15305058.2015.1004409 Aryadoust, V, T (2018). Taxonomies of listening skills. In J. I. Liontas, & M. DelliCarpini (Eds.), The TESOL encyclopedia of English language teaching. (pp. 1–8). https://doi. org/10.1002/9781118784235.eelt0577 Aryadoust, V., Goh, C. C. M., & Kim, L. O. (2011). An investigation of differential item functioning in the MELAB listening test. Language Assessment Quarterly, 8(4), 361–385. https://doi.org/10.1080/15434303.2011.628632 Aryadoust, V., Min, S., & Chen, X. (2024). Investigating differential item functioning across in- teraction variables in listening comprehension assessment. Studies in Educational Evaluation, 80, 101322. https://doi.org/10.1016/j.stueduc.2024.101322 Baghaei, P., & Carstensen, C. H. (2013). Fitting the mixed Rasch model to a reading comprehen- sion test: Identifying reader types. Practical Assessment, Research, & Evaluation, 18(5), 1–13. https://doi.org/10.7275/n191-pt86 Baghaei, P., Kemper, C. J., Reichert, M., & Greiff, S. (2019). Applying the mixed Rasch model in assessing reading comprehension. In V. Aryadoust & M. Raquel (Eds.), Quantitative data anal- ysis for language assessment Volume II: Advanced methods. (pp. 15–32) Routledge. Banerjee, J., & Papageorgiou, S. (2016). What’s in a topic? Exploring the interaction between test-taker age and item content in high-stakes testing. International Journal of Listening, 30(1-2), 8–24. https://doi.org/10.1080/10904018.2015.1056876 Bejar, I., Douglas, D., Jamieson, J., Nissan, S., & Turner, J. (2000). TOEFL 2000 listening frame- work: A working paper. (TOEFL Monograph Series No. MS-19) Educational Testing Service. Bodie, G. D., & Worthington, D. L. (2010). Revisiting the listening styles profile (LSP-16): A confirmatory factor analytic approach to scale validation and reliability estimation. International Journal of Listening, 24(2), 69–88. https://doi.org/10.1080/10904011003744516 Bodie, G. D., Worthington, D., & Fitch-Hauser, M. (2011). A comparison of four measurement models for the Watson-Barker Listening Test (WBLT)-Form C. Communication Research Reports, 28(1), 32–42. https://doi.org/10.1080/08824096.2011.540547 Bodie, G. D., Worthington, D. L., & Gearhart, C. C. (2013). The listening styles profile revised (LSP-R): A scale revision and evidence for validity. Communication Quarterly, 61(1), 72–90. https://doi.org/10.1080/01463373.2012.720343 Bodie, G. D., Winter, J., Dupuis, D., & Tompkins, T. (2020). The echo listening profile: Initial validity evidence for a measure of four listening habits. International Journal of Listening, 34(3), 131–155. https://doi.org/10.1080/10904018.2019.1611433 Bolger, N., & Kellaghan, T. (1990). Method of measurement and gender differences in scholastic achievement. Journal of Educational Measurement, 27(2), 165–174. https://doi.org/10.1111/j.1745-3984. 1990.tb00740.x Bond, T. G., Yan, Z., & Heene, M. (2020). Applying the Rasch model: Fundamental measurement in the human sciences. (4th Ed.) Routledge. Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2004). The concept of validity. Psychological Review, 111(4), 1061–1071. https://doi.org/10.1037/0033-295X.111.4.1061 https://doi.org/10.1186/s40468-018-0057-4 https://doi.org/10.1111/j.1745-3984.2011.00146.x https://doi.org/10.1111/j.1745-3984.2011.00146.x https://doi.org/10.1080/10904018.2012.639649 https://doi.org/10.1080/10904018.2012.639649 https://doi.org/10.1080/15305058.2015.1004409 https://doi.org/10.1002/9781118784235.eelt0577 https://doi.org/10.1002/9781118784235.eelt0577 https://doi.org/10.1080/15434303.2011.628632 https://doi.org/10.1016/j.stueduc.2024.101322 https://doi.org/10.7275/n191-pt86 https://doi.org/10.1080/10904018.2015.1056876 https://doi.org/10.1080/10904011003744516 https://doi.org/10.1080/08824096.2011.540547 https://doi.org/10.1080/01463373.2012.720343 https://doi.org/10.1080/10904018.2019.1611433 https://doi.org/10.1111/j.1745-3984.1990.tb00740.x https://doi.org/10.1111/j.1745-3984.1990.tb00740.x https://doi.org/10.1037/0033-295X.111.4.1061 84 F. EFFaTPanaH ET aL. Bourdeaud’Hui, H., Aesaert, K., & van Braak, J. (2021). Exploring the validity of a comprehensive listening test to identify differences in primary school students’ listening skills. Language Assessment Quarterly, 18(3), 228–252. https://doi.org/10.1080/15434303.2020.1860059 Buck, G. (2001). Assessing listening. Cambridge University Press. Buck, G., & Tatsuoka, K. (1998). Application of the rule-space procedure to language testing: Examining attributes of a free response listening test. Language Testing, 15(2), 119–157. https:// doi.org/10.1191/026553298667688289 Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach. (2nd Ed.) Springer. https://doi.org/10.1007/b97636 Carlton, S. T., & Harris, A. M. (1992). Characteristics associated with differential item function- ing on the scholastic aptitude test: Gender and majority/minority group comparisons. ETS Research Report Series, 1992(2), i–143. https://doi.org/10.1002/j.2333-8504.1992.tb01495.x Chang, A. C.-S., & Read, J. (2013). Investigating the effects of multiple-choice listening test items in the oral versus written mode on L2 listeners’ performance and perceptions. System, 41(3), 575–586. https://doi.org/10.1016/j.system.2013.06.001 Chen, H., & Chen, J. (2021). Investigating the relationships between listening skills and genre competence through cognitive diagnosis approach. Sage Open, 11(4), 1–14. https://doi. org/10.1177/21582440211061342 Chen, C., W., Andersson, B., & Zhu, J. (2023). A factor mixture model for item responses and certainty of response indices to identify student knowledge profiles. Journal of Educational Measurement, 60(1), 28–51. https://doi.org/10.1111/jedm.12344 Choi, I. H., Paek, I., & Cho, S. J. (2017). The impact of various class-distinction features on model selection in the mixture Rasch model. The Journal of Experimental Education, 85(3), 411–424. https://doi.org/10.1080/00220973.2016.1250208 Chon, Y. V., & Shin, T. (2019). Profile of second language learners’ metacognitive awareness and academic motivation for successful listening: A latent class analysis. Learning and Individual Differences, 70, 62–75. https://doi.org/10.1016/j.lindif.2019.01.007 Cid, J., Wei, Y., Kim, S., & Hauck, C. (2017). Statistical analyses for the updated TOEIC® listening and reading test (Research Memorandum No. RM-17-05). Educational Testing Service. Cohen, A., & Bolt, D. M. (2005). A mixture model analysis of differential item functioning. Journal of Educational Measurement, 42(2), 133–148. https://doi.org/10.1111/j.1745-3984.2005.00007 Cole, N. S. (1997). The ETS gender study: How males and females perform in educational settings. Educational Testing Service, College Entrance Examination. Coleman, G., Heap, S. (1998). The misinterpretation of directions for the questions in the Academic Reading and Listening sub-tests of the IELTS test. (Research Report No. 1, IELTS Australia). URL: https://ielts.org/researchers/our-research/research-reports/the-misinterpretation-of- directions-for-the-questions-in-the-academic-reading-and-listening-sub-tests-of-the-ielts-test Curley, W., & Schmitt, A. P. (1993). Revising SAT®-Verbal items to eliminate differential item function- ing. ETS Research Report Series, 1993(2), i–18. https://doi.org/10.1002/j.2333-8504.1993.tb01572.x De Ayala, R. J., & Santiago, S. Y. (2017). An introduction to mixture item response theory mod- els. Journal of School Psychology, 60, 25–40. https://doi.org/10.1016/j.jsp.2016.01.002 Du, G., & Man, D. (2022). Person factors and strategic processing in L2 listening comprehension: Examining the role of vocabulary size, metacognitive knowledge, self efficacy, and strategy use. System, 107, 102801. https://doi.org/10.1016/j.system.2022.102801 Effatpanah, F. (2019). Application of cognitive diagnostic models to the listening section of the International English Language Testing System (IELTS). International Journal of Language Testing, 9(1), 1–28. https://www.ijlt.ir/article_114295.html Effatpanah, F., Baghaei, P., & Karimi, M. N. (2024). A mixed Rasch model analysis of multiple profiles in L2 writing. Assessing Writing, 59, 100803. https://doi.org/10.1016/j.asw.2023.100803 Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Lawrence Erlbaum Associates Publishers. Estaji, M., & Banitalebi, Z. (2023). A study of test-taking strategies of Iranian IELTS repeaters: Any change in the strategy use? International Journal of Testing, 23(3), 205–230. https://doi.or g/10.1080/15305058.2023.2195662 https://doi.org/10.1080/15434303.2020.1860059 https://doi.org/10.1191/026553298667688289 https://doi.org/10.1191/026553298667688289 https://doi.org/10.1007/b97636 https://doi.org/10.1002/j.2333-8504.1992.tb01495.x https://doi.org/10.1016/j.system.2013.06.001 https://doi.org/10.1177/21582440211061342 https://doi.org/10.1177/21582440211061342 https://doi.org/10.1111/jedm.12344 https://doi.org/10.1080/00220973.2016.1250208 https://doi.org/10.1016/j.lindif.2019.01.007 https://doi.org/10.1111/j.1745-3984.2005.00007 https://ielts.org/researchers/our-research/research-reports/the-misinterpretation-of-directions-for-the-questions-in-the-academic-reading-and-listening-sub-tests-of-the-ielts-test https://ielts.org/researchers/our-research/research-reports/the-misinterpretation-of-directions-for-the-questions-in-the-academic-reading-and-listening-sub-tests-of-the-ielts-test https://doi.org/10.1002/j.2333-8504.1993.tb01572.x https://doi.org/10.1016/j.jsp.2016.01.002 https://doi.org/10.1016/j.system.2022.102801 https://www.ijlt.ir/article_114295.html https://doi.org/10.1016/j.asw.2023.100803 https://doi.org/10.1080/15305058.2023.2195662 https://doi.org/10.1080/15305058.2023.2195662 InTERnaTIOnaL JOuRnaL OF TESTIng 85 Field, J. (2009). A cognitive validation of the lecture-listening component of the IELTS listening paper. In L. Taylor (Ed.), IELTS research reports. (Vol. 9, pp. 17–65) Pty Ltd & British Council. Field, J. (2013). Cognitive validity. In A. Geranpayeh & L. Taylor (Eds.), Examining listening: Research and practice in assessing second language listening. (pp. 77–151) Cambridge University Press. Finch, H. (2005). The MIMIC model as a method for detecting DIF: Comparison with Mantel- Haenszel, SIBTEST, and the IRT likelihood ratio. Applied Psychological Measurement, 29(4), 278– 295. https://doi.org/10.1177/0146621605275728 Freedle, R., & Kostin, I. (1996). The prediction of TOEFL listening comprehension item difficulty for mini-talk passages: Implications for construct validity. ETS Research Report Series, 1996(2), i–61. https:// doi.org/10.1002/j.2333-8504.1996.tb01707.x Frick, H., Strobl, C., & Zeileis, A. (2015). Rasch mixture models for DIF detection: A comparison of old and new score specifications. Educational and Psychological Measurement, 75(2), 208–234. https://doi.org/10.1177/0013164414536183 Geranpayeh, A., & Kunnan, A. J. (2007). Differential item functioning in terms of age in the certificate in advanced English examination. Language Assessment Quarterly, 4(2), 190–222. https://doi.org/10.1080/15434300701375758 Geranpayeh, A., & Taylor, L. (2008). Examining listening: Developments and issues in assessing second language listening. Cambridge Research Notes, 32, 3–5. https://www.cambridgeenglish. org/images/23151-research-notes-32.pdf Glas, C. A. W., & Verhelst, N. D. (1995). Testing the Rasch model. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications. (pp. 69–95) Springer. https:// doi.org/10.1007/978-1-4612-4230-7_5 Goh, C., & Aryadoust, S. V. (2010). Investigating the construct validity of MELAB listening test through the Rasch analysis and correlated uniqueness modeling. Spaan Fellowship Working Papers in Second of Foreign Language Assessment, 8, 31–68. https://michiganassessment.org/wp- content/uploads/2020/02/20.02.pdf.Res_.InvestigatingtheConstructValidityoftheMELABListening TestthroughtheRaschAnalysisandCorrelatedUniquenessModeling.pdf Goh, C. C. M., & Vandergrift, L. (2022). Teaching and learning second language listening: Metacognition in action. (2nd Ed.) Routledge. Graham, S. (2017). Research into practice: Listening strategies in an instructed classroom setting. Language Teaching, 50(1), 107–119. https://doi.org/10.1017/S0261444816000306 Grassi, M., & Tarantino, B. (2023). SEMtree: Tree-based structure learning methods with struc- tural equation models. Bioinformatics, 39(6), 1–9. https://doi.org/10.1093/bioinformatics/btad377 Harding, L. (2012). Accent, listening assessment and the potential for a shared-L1 advantage: A DIF perspective. Language Testing, 29(2), 163–180. https://doi.org/10.1177/0265532211421161 Harding, L., Alderson, J. C., & Brunfaut, T. (2015). Diagnostic assessment of reading and listening in a second or foreign language: Elaborating on diagnostic principles. Language Testing, 32(3), 317–336. https://doi.org/10.1177/0265532214564505 He, L., & Jiang, Z. (2020). Assessing second language listening over the past twenty years: A review within the socio-cognitive framework. Frontiers in Psychology, 11, 2123. https://doi. org/10.3389/fpsyg.2020.02123 Henninger, M., Debelak, R., & Strobl, C. (2023). A new stopping criterion for Rasch trees based on the Mantel–Haenszel effect size measure for differential item functioning. Educational and Psychological Measurement, 83(1), 181–212. https://doi.org/10.1177/00131644221077135 Hickendorff, M., Edelsbrunner, P. A., McMullen, J., Schneider, M., & Trezise, K. (2018). Informative tools for characterizing individual differences in learning: Latent class, latent profile, and latent transition analysis. Learning and Individual Differences, 66, 4–15. https:// doi.org/10.1016/j.lindif.2017.11.001 Hofstede, G. (2001). Culture’s consequences: Comparing values, behaviors, institutions, and organi- zations across nations. (2nd Edition) Sage. Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity. (pp. 129–145) LEA. Holland, P. W., & Wainer, H. (1993). Differential item functioning. Lawrence Erlbaum Associates, Inc. https://doi.org/10.1177/0146621605275728 https://doi.org/10.1002/j.2333-8504.1996.tb01707.x https://doi.org/10.1002/j.2333-8504.1996.tb01707.x https://doi.org/10.1177/0013164414536183 https://doi.org/10.1080/15434300701375758 https://www.cambridgeenglish.org/images/23151-research-notes-32.pdf https://www.cambridgeenglish.org/images/23151-research-notes-32.pdf https://doi.org/10.1007/978-1-4612-4230-7_5 https://doi.org/10.1007/978-1-4612-4230-7_5 https://michiganassessment.org/wp-content/uploads/2020/02/20.02.pdf.Res_.InvestigatingtheConstructValidityoftheMELABListeningTestthroughtheRaschAnalysisandCorrelatedUniquenessModeling.pdf https://michiganassessment.org/wp-content/uploads/2020/02/20.02.pdf.Res_.InvestigatingtheConstructValidityoftheMELABListeningTestthroughtheRaschAnalysisandCorrelatedUniquenessModeling.pdf https://michiganassessment.org/wp-content/uploads/2020/02/20.02.pdf.Res_.InvestigatingtheConstructValidityoftheMELABListeningTestthroughtheRaschAnalysisandCorrelatedUniquenessModeling.pdf https://doi.org/10.1017/S0261444816000306 https://doi.org/10.1093/bioinformatics/btad377 https://doi.org/10.1177/0265532211421161 https://doi.org/10.1177/0265532214564505 https://doi.org/10.3389/fpsyg.2020.02123 https://doi.org/10.3389/fpsyg.2020.02123 https://doi.org/10.1177/00131644221077135 https://doi.org/10.1016/j.lindif.2017.11.001 https://doi.org/10.1016/j.lindif.2017.11.001 86 F. EFFaTPanaH ET aL. Humphry, S., & Montuoro, P. (2021). The Rasch model cannot reveal systematic differential Item func- tioning in single tests: Subset DIF analysis as an alternative methodology. Frontiers in Education, 6, 742560. https://doi.org/10.3389/feduc.2021.742560 Imhof, M., & Janusik, L. A. (2006). Development and validation of the Imhof-Janusik listening concepts inventory to measure listening conceptualization differences between cultures. Journal of Intercultural Communication Research, 35(2), 79–98. https://doi.org/10.1080/ 17475750600909246 Isbell, D. R., & Kremmel, B. (2020). Test review: Current options in at-home language proficien- cy tests for making high-stakes decisions. Language Testing, 37(4), 600–619. https://doi. org/10.1177/0265532220943483 Kahneman, D. (1973). Attention and effort. Prentice-Hall. Kintsch, W. (1998). Comprehension: A paradigm for cognition. Cambridge University Press. Kopf, J., Zeileis, A., & Strobl, C. (2015). Anchor selection strategies for DIF analysis: Review, assessment, and new approaches. Educational and Psychological Measurement, 75(1), 22–56. https://doi.org/10.1177/0013164414529792 Kunina-Habenicht, O., Rupp, A. A., & Wilhelm, O. (2009). A practical illustration of multidimen- sional diagnostic skills profiling: Comparing results from confirmatory factor analysis and di- agnostic classification models. Studies in Educational Evaluation, 35(2–3), 64–70. https://doi. org/10.1016/j.stueduc.2009.10.003 Lazarsfeld, P. F., & Henry, N. W. (1968). Latent structure analysis. Houghton Mifflin. Lesnov, R. O. (2022). Furthering the argument for visually inclusive L2 academic listening tests: The role of content-rich videos. Studies in Educational Evaluation, 72, 101087. https://doi. org/10.1016/j.stueduc.2021.101087 Li, F., Cohen, A. S., Kim, S.-H., & Cho, S.-J. (2009). Model selection methods for mixture di- chotmous IRT models. Applied Psychological Measurement, 33(5), 353–373. https://doi. org/10.1177/0146621608326422 Liao, L., & Yao, D. (2021). Grade-related differential item functioning in general English proficiency test-kids listening. Frontiers in Psychology, 12, 767244. https://doi.org/10.3389/fpsyg.2021.767244 Lin, J., & Wu, F. (2003, April 22-24). Differential performance by gender in foreign language testing [Poster presentation]. The Annual Meeting of the National Council on Measurement in Education (NCME), Chicago, IL, U.S.A. URL: https://files.eric.ed.gov/fulltext/ED478206.pdf Linacre, J. M. (2002). What do infit and outfit, mean-square and standardized mean? Rasch Measurement Transactions, 16(2), 878. https://www.rasch.org/rmt/rmt162f.htm Linacre, J. M. (2024). A user’s guide to WINSTEPS. Winsteps. Lord, F. M. (1980). Applications of item response theory to practical testing problems. Lawrence Erbaum Associates. Marx, A., Heppt, B., & Henschel, S. (2017). Listening comprehension of academic and everyday language in first language and second language students. Applied Psycholinguistics, 38(3), 571– 600. https://doi.org/10.1017/S0142716416000333 Mazzeo, J., Schmitt, A. P., & Bleistein, C. A. (1993). Sex-related performance differences on con- structed-response and multiple-choice sections of Advanced Placement Examinations. ETS Research Report Series, 1, i–29. https://doi.org/10.1002/j.2333-8504.1993.tb01516.x Meade, A. W., & Wright, N. A. (2012). Solving the measurement invariance anchor item problem in item response theory. The Journal of Applied Psychology, 97(5), 1016–1031. https://doi. org/10.1037/a0027934 Meredith, W. (1993). Measurement invariance, factor analysis, and factorial invariance. Psychometrika, 58(4), 525–543. https://doi.org/10.1007/BF02294825 Nishizawa, H. (2023). Construct validity and fairness of an operational listening test with world Englishes. Language Testing, 40(3), 493–520. https://doi.org/10.1177/02655322221137869 Nylund, K. L., Asparouhov, T., & Muthén, B. O. (2007). Deciding on the number of classes in latent class analysis and growth mixture modeling: A Monte Carlo simulation study. Structural Equation Modeling: A Multidisciplinary Journal, 14(4), 535–569. https://doi. org/10.1080/10705510701575396 Official IELTS Practice Materials (2007). Available from. www.IELTS.org https://doi.org/10.3389/feduc.2021.742560 https://doi.org/10.1080/17475750600909246 https://doi.org/10.1080/17475750600909246 https://doi.org/10.1177/0265532220943483 https://doi.org/10.1177/0265532220943483 https://doi.org/10.1177/0013164414529792 https://doi.org/10.1016/j.stueduc.2009.10.003 https://doi.org/10.1016/j.stueduc.2009.10.003 https://doi.org/10.1016/j.stueduc.2021.101087 https://doi.org/10.1016/j.stueduc.2021.101087 https://doi.org/10.1177/0146621608326422 https://doi.org/10.1177/0146621608326422 https://doi.org/10.3389/fpsyg.2021.767244 https://files.eric.ed.gov/fulltext/ED478206.pdf https://www.rasch.org/rmt/rmt162f.htm https://doi.org/10.1017/S0142716416000333 https://doi.org/10.1002/j.2333-8504.1993.tb01516.x https://doi.org/10.1037/a0027934 https://doi.org/10.1037/a0027934 https://doi.org/10.1007/BF02294825 https://doi.org/10.1177/02655322221137869 https://doi.org/10.1080/10705510701575396 https://doi.org/10.1080/10705510701575396 http://www.IELTS.org InTERnaTIOnaL JOuRnaL OF TESTIng 87 O’Neill, K. A., & McPeek, W. M. (1993). Item and test characteristics that are associated with differential item functioning. In Holland, P. W. & Wainer, H. (Eds.), Differential item function- ing., (pp. 255–276) Lawrence Earlbaum. Oshima, T. C., Raju, N. S., & Flowers, C. P. (1997). Development and demonstration of multidi- mensional IRT-based internal measures of differential functioning of items and tests. Journal of Educational Measurement, 34(3), 253–272. https://doi.org/10.1111/j.1745-3984.1997.tb00518.x Pae, T.-I. (2004). DIF for examinees with different academic backgrounds. Language Testing, 21(1), 53–73. https://doi.org/10.1191/0265532204lt274oa Pae, T.-I. (2012). Causes of gender DIF on an EFL language test: A multiple-data analysis over nine years. Language Testing, 29(4), 533–554. https://doi.org/10.1177/0265532211434027 Park, G. P. (2008). Differential item functioning on an English listening test across gender. TESOL Quarterly, 42(1), 115–123. https://doi.org/10.1002/j.1545-7249.2008.tb00212.x Park, Y., Lee, S., & Shin, S. Y. (2022). Developing a local academic English listening test using authentic unscripted audio-visual texts. Language Testing, 39(3), 401–424. https://doi. org/10.1177/02655322221076024 Preinerstorfer, D., & Formann, A. K. (2012). Parameter recovery and model selection in mixed Rasch models. British Journal of Mathematical and Statistical Psychology, 65(2), 251–262. https://doi.org/10.1111/j.2044-8317.2011.02020.x R Core Team (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Vienna, Austria. URL:https://www.R-project.org Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53(4), 495– 502. https://doi.org/10.1007/BF02294403 Raju, N. S. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14(2), 197–207. https://doi. org/10.1177/014662169001400208 Raquel, M. (2019). The Rasch measurement approach to differential item functioning (DIF) anal- ysis in language assessment research. In V. Aryadoust & M. Raquel (Eds.), Quantitative data analysis for language assessment (volume I): Fundamental techniques. (pp. 103–131) Routledge. Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. (expanded edi- tion). University of Chicago Press. Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. In M. Blegvad (Ed.), The Danish yearbook of philosophy. Munksgaard. https://doi.org/10.1163/24689300-01401006 Ravand, H. (2015). Assessing testlet effect, impact, differential testlet, and item functioning using cross-classified multilevel measurement modeling. Sage Open, 5(2), 1–9. https://doi. org/10.1177/2158244015585607 Ravand, H. (2024). Assessing measurement invariance in a university entrance exam: A compar- ison of multigroup confirmatory factor analysis alignment method vs. multigroup item re- sponse theory. Educational Methods & Psychometrics, 2, 11. https://doi.org/10.61186/emp.2024.4 Ravand, H., & Baghaei, P. (2020). Diagnostic classification models: Recent developments, practical issues, and prospects. International Journal of Testing, 20(1), 24–56. https://doi.org/10.1080/153 05058.2019.1588278 Ravand, H., Rohani, G., & Firoozi, T. (2019). Investigating gender and major DIF in the Iranian National University Entrance Exam using multiple-indicators multiple-causes structural equation modelling. Issues in Language Teaching, 8(1), 33–61. https://doi.org/10.22054/ilt.2020.49509.460 Ravand, H., Baghaei, P., & Doebler, P. (2019). Examining parameter invariance in a general diagnostic classification model. Frontiers in Psychology, 10, 2930. https://doi.org/10.3389/fpsyg.2019.02930 Richards, J. C. (1983). Listening comprehension: Approach, design, procedure. TESOL Quarterly, 17(2), 219–240. https://doi.org/10.2307/3586651 Robitzsch, A., Kiefer, T., & Wu, M. (2024). TAM: Test Analysis Modules. R package version 4.2- 21. URL: https://cran.r-project.org/web/packages/TAM Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14(3), 271–282. https://doi.org/10.1177/ 014662169001400305 https://doi.org/10.1111/j.1745-3984.1997.tb00518.x https://doi.org/10.1191/0265532204lt274oa https://doi.org/10.1177/0265532211434027 https://doi.org/10.1002/j.1545-7249.2008.tb00212.x https://doi.org/10.1177/02655322221076024 https://doi.org/10.1177/02655322221076024 https://doi.org/10.1111/j.2044-8317.2011.02020.x https://www.R-project.org https://doi.org/10.1007/BF02294403 https://doi.org/10.1177/014662169001400208 https://doi.org/10.1177/014662169001400208 https://doi.org/10.1163/24689300-01401006 https://doi.org/10.1177/2158244015585607 https://doi.org/10.1177/2158244015585607 https://doi.org/10.61186/emp.2024.4 https://doi.org/10.1080/15305058.2019.1588278 https://doi.org/10.1080/15305058.2019.1588278 https://doi.org/10.22054/ilt.2020.49509.460 https://doi.org/10.3389/fpsyg.2019.02930 https://doi.org/10.2307/3586651 https://cran.r-project.org/web/packages/TAM https://doi.org/10.1177/014662169001400305 https://doi.org/10.1177/014662169001400305 88 F. EFFaTPanaH ET aL. Rost, M. (2016). Teaching and researching listening. (3rd Ed.) Longman. Roussos, L., & Stout, W. (1996). A multidimensionality-based DIF analysis paradigm. Applied Psychological Measurement, 20(4), 355–371. https://doi.org/10.1177/014662169602000404 Rukthong, A., & Brunfaut, T. (2020). Is anybody listening? The nature of second language listen- ing in integrated listening-to-summarize tasks. Language Testing, 37(1), 31–53. https://doi. org/10.1177/0265532219871470 Scheuneman, J. D., & Gerritz, K. (1990). Using differential item functioning procedures to ex- plore sources of item difficulty and group performance characteristics. Journal of Educational Measurement, 27(2), 109–131. https://doi.org/10.1111/j.1745-3984.1990.tb00737.x Sen, S. (2018). Spurious latent class problem in the mixed Rasch model: A comparison of three maximum likelihood estimation methods under different ability distributions. International Journal of Testing, 18(1), 71–100. https://doi.org/10.1080/15305058.2017.1312408 Sen, S., & Cohen, A. (2019). Applications of mixture IRT models: A literature review. Measurement: Interdisciplinary Research and Perspectives, 17(4), 177–191. https://doi.org/10.1080/15366367.20 19.1583506 Sen, S., Cohen, A. S., & Kim, S. H. (2019). Model selection for multilevel mixture Rasch models. Applied Psychological Measurement, 43(4), 272–289. https://doi.org/10.1177/0146621618779990 Seo, D., Taherbhai, H., & Frantz, R. (2016). Psychometric evaluation and discussions of English language learners’ listening comprehension. International Journal of Listening, 30(1-2), 47–66. https://doi.org/10.1080/10904018.2015.1065747 Shin, S. Y., Lee, S., & Lidster, R. (2021). Examining the effects of different English speech vari- eties on an L2 academic listening comprehension test at the item level. Language Testing, 38(4), 580–601. https://doi.org/10.1177/0265532220985432 Shohamy, E., & Inbar, O. (1991). Validation of listening comprehension tests: The effect of text and question type. Language Testing, 8(1), 23–40. https://doi.org/10.1177/026553229100800103 Steinberg, L., & Thissen, D. (2006). Using effect sizes for research reporting: Examples using item response theory to analyze differential item functioning. Psychological Methods, 11(4), 402–415. https://doi.org/10.1037/1082-989X.11.4.402 Swaminathan, H., & Rogers, H. J. (2000). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361–370. https://doi. org/10.1111/j.1745-3984.1990.tb00754.x Sweller, J. (1988). Cognitive load during problem solving: Effects on learning. Cognitive Science, 12(2), 257–285. https://doi.org/10.1207/s15516709cog1202_4 Tabachnick, B. G., & Fidell, L. S. (2013). Using multivariate statistics. (6th Ed.) Pearson Education. Tay, L., Newman, D. A., & Vermunt, J. K. (2011). Using mixed-measurement item response theory with covariates (MM-IRT-C) to ascertain observed and unobserved measurement equivalence. Organizational Research Methods, 14(1), 147–176. https://doi.org/10.1177/1094428110366037 Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P.W. Holland & H. Wainer (Eds.), Differential item functioning. (pp. 67–113) Erlbaum. Tseng, M. C., & Wang, W. C. (2021). The Q-Matrix anchored mixture Rasch model. Frontiers in Psychology, 12, 564976. https://doi.org/10.3389/fpsyg.2021.564976 Van Nijlen, D., & Janssen, R. (2011). Measuring mastery across grades: An application to spelling ability. Applied Measurement in Education, 24(4), 367–387. https://doi.org/10.1080/08957347.20 11.607064 von Davier, M. (2008). The mixture general diagnostic model. In G. R. Hancock & K. M. Samuelsen (Eds.), Advances in latent variable mixture models. (pp. 1–24) Information Age Publishing. von Davier, M., & Rost, J. (1995). Polytomous mixed Rasch models. In G. H. Fischer & I. W. Molennar (Eds.), Rasch models: Foundations, recent developments, and applications. (pp. 371–379) Springer Verlag. Wagner, E. (2013). An investigation of how the channel of input and access to test questions affect L2 listening test performance. Language Assessment Quarterly, 10(2), 178–195. https:// doi.org/10.1080/15434303.2013.769552 https://doi.org/10.1177/014662169602000404 https://doi.org/10.1177/0265532219871470 https://doi.org/10.1177/0265532219871470 https://doi.org/10.1111/j.1745-3984.1990.tb00737.x https://doi.org/10.1080/15305058.2017.1312408 https://doi.org/10.1080/15366367.2019.1583506 https://doi.org/10.1080/15366367.2019.1583506 https://doi.org/10.1177/0146621618779990 https://doi.org/10.1080/10904018.2015.1065747 https://doi.org/10.1177/0265532220985432 https://doi.org/10.1177/026553229100800103 https://doi.org/10.1037/1082-989X.11.4.402 https://doi.org/10.1111/j.1745-3984.1990.tb00754.x https://doi.org/10.1111/j.1745-3984.1990.tb00754.x https://doi.org/10.1207/s15516709cog1202_4 https://doi.org/10.1177/1094428110366037 https://doi.org/10.3389/fpsyg.2021.564976 https://doi.org/10.1080/08957347.2011.607064 https://doi.org/10.1080/08957347.2011.607064 https://doi.org/10.1080/15434303.2013.769552 https://doi.org/10.1080/15434303.2013.769552 InTERnaTIOnaL JOuRnaL OF TESTIng 89 Wang, W.-C. (2004). Effects of anchor item methods on the detection of differential item function- ing within the family of Rasch models. The Journal of Experimental Education, 72(3), 221–261. https://doi.org/10.3200/JEXE.72.3.221-261 Wang, W., Liu, Y., & Liu, H. (2022). Testing differential item functioning without predefined anchor items using robust regression. Journal of Educational and Behavioral Statistics, 47(6), 666–692. https://doi.org/10.3102/10769986221109208 Watson, K. W., Barker, L. L., & Weaver, J. B. (1995). The listening styles profile (LSP-16): Development and validation of an instrument to assess four listening styles. International Journal of Listening, 9(1), 1–13. https://doi.org/10.1080/10904018.1995.10499138 Weir, C. (1990). Communicative language testing. Prentice Hall. Willingham, W. W., & Cole, N. S. (1997). Fairness issues in test design and use. In Willingham, W.W. & Cole, N. S. (Eds.), gender and fair assessment. (pp. 227–346) Lawrence Erlbaum. https://doi.org/10.4324/9781315045115 Wolvin, A. D. (2013). Understanding the listening process: Rethinking the “one size fits all” mod- el. International Journal of Listening, 27(2), 104–106. https://doi.org/10.1080/10904018.2013.783 351 Wolvin, A. D., & Coakley, C. G. (1993). A listening taxonomy. In A. D. Wolvin & C. G. Coakley (Eds.), Perspectives on listening. (pp. 15–22) Ablex. Yanagawa, K., & Green, A. (2008). To show or not to show: The effects of items stems and an- swer options on performance on a multiple-choice comprehension test. System, 36(1), 107–122. https://doi.org/10.1016/j.system.2007.12.003 Yuan, K. H., Liu, H., & Han, Y. (2021). Differential item functioning analysis without a priori information on anchor items: QQ plots and graphical test. Psychometrika, 86(2), 345–377. https://doi.org/10.1007/s11336-021-09746-5 Zansen, A. V., Hilden, R., & Laihanen, E. (2022). The multimodal listening test in a high stakes context: Gender-neutral or not? International Journal of Listening, 36(2), 152–170. https://doi. org/10.1080/10904018.2021.1993446 Zumbo, B. D. (2007). Three generations of DIF analyses: Considering where it has been, where it is now, and where it is going. Language Assessment Quarterly, 4(2), 223–233. https://doi. org/10.1080/15434300701375832 https://doi.org/10.3200/JEXE.72.3.221-261 https://doi.org/10.3102/10769986221109208 https://doi.org/10.1080/10904018.1995.10499138 https://doi.org/10.4324/9781315045115 https://doi.org/10.1080/10904018.2013.783351 https://doi.org/10.1080/10904018.2013.783351 https://doi.org/10.1016/j.system.2007.12.003 https://doi.org/10.1007/s11336-021-09746-5 https://doi.org/10.1080/10904018.2021.1993446 https://doi.org/10.1080/10904018.2021.1993446 https://doi.org/10.1080/15434300701375832 https://doi.org/10.1080/15434300701375832 Fitting the mixed Rasch model to the listening comprehension section of the IELTS: Identifying latent class differential item functioning ABSTRACT Introduction Background Listening comprehension and multiple profiles Differential item functioning (DIF) DIF in L2 listening comprehension Mixed Rasch model The present study Method Data Data analysis Results Discussion Labeling and characterizing latent classes Content analysis of test items Section 1 Section 2 Section 3 Section 4 Implications, limitations, and directions for future research Disclosure statement Research data policy and data availability statements Funding ORCID References