Post-exam analysis: Implications for intervention

The difficulty index, discrimination index and distracter efficiency of a college-level exam paper were analyzed as an input for future test development. The exam papers of 176 first-year regular pre-service diploma students at Gondar CTE were analyzed descriptively. Difficulty indices and distracter efficiencies were calculated using Microsoft Excel 2007; other test statistics such as the mean, biserial correlations and reliability coefficients were computed using SPSS version 20. The mean test score, out of 31, was 17.23 ± 3.85. The average difficulty and discrimination indices were 0.56 (SD 0.20) and 0.16 (SD 0.28), respectively, and the mean distracter efficiency was 92.1% (SD 17.2%). The reliability of the test was 0.58. Of the 31 items, 13 (41.9%) were either too easy or too difficult, and only two items showed good or excellent discrimination power. Inconsistent option formats and inappropriate stems were also observed in the exam paper. Based on these results, the college-level exam paper has an acceptable difficulty index and distracter efficiency. However, the average discrimination power of the exam was very low (0.16; acceptable ≥ 0.4), and the internal consistency reliability was below the acceptable level (0.58; acceptable ≥ 0.7). Future test development interventions should therefore give due emphasis to item reliability, discrimination coefficients and item face validity.


The rapid expansion of higher education institutions in Ethiopia seems to have compromised the quality of education in the country (4). According to Arega Yirdaw (2016), problems related to the teaching-learning process ranked first among the key factors that determine the quality of education in private higher education institutions in Ethiopia. To this end, effective assessment tools, among others, have to be in place to determine whether the required outcomes are being achieved.

Higher education institutions need to combine different approaches and instruments for assessing students (5), because student assessment and evaluation are an integral part of the teaching-learning process (2). Assessments should be relevant while tracking each student's performance in a given course. Considering this, instructors at higher education institutions must be aware of the quality and reliability of their tests; otherwise, the final results may be influenced by the test itself, which could lead to a biased assessment (5). Usually, instructors receive little or no training on assessment quality. When training is given, it does not focus on test construction strategies or item-writing rules but only on large-scale test administration and standardized test score interpretation (2). Tavakol and Dennick (2011) pointed out the importance of post-exam item analysis for improving the quality and reliability of assessments.

Item analysis is the process of collecting, summarizing and using information from students' responses to assess the quality of test items (21). It allows teachers to identify items that are too difficult or too easy, items that do not discriminate between high- and low-ability students, and items that have implausible distracters (2, 3). In these cases, teachers can remove the too easy, too difficult or non-discriminating items. Item analysis also helps teachers modify instruction to correct misunderstandings about the content or adjust the way they teach (2).
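For illustration, the following is a minimal Python sketch of the two item statistics described above, computed from a hypothetical 0/1 score matrix. It is not the Excel/SPSS workflow used in this study, and the data and function names are invented; discrimination is computed here as the point-biserial correlation between each item and the rest-of-test score, one common implementation.

import numpy as np

def item_difficulty(scores):
    """Difficulty index p: proportion of students answering each item correctly.
    scores is an (n_students, n_items) matrix of 0/1 item marks."""
    return scores.mean(axis=0)

def item_discrimination(scores):
    """Discrimination as the point-biserial correlation between each item
    and the total score on the remaining items."""
    n_items = scores.shape[1]
    total = scores.sum(axis=1)
    disc = np.empty(n_items)
    for j in range(n_items):
        rest = total - scores[:, j]            # exclude the item itself
        disc[j] = np.corrcoef(scores[:, j], rest)[0, 1]
    return disc

# Hypothetical data: 5 students x 4 dichotomously scored items
scores = np.array([[1, 0, 1, 1],
                   [1, 1, 1, 0],
                   [0, 0, 1, 0],
                   [1, 1, 1, 1],
                   [0, 0, 0, 0]])
print("difficulty:", item_difficulty(scores))
print("discrimination:", item_discrimination(scores))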

A number of reports on Ethiopian education quality indicated that there is a serious problem in the quality of education (4, 19). Assessment for grading students' achievement is part of this concern about education quality in Ethiopia (20). Objective examination results can be analyzed to improve the validity and reliability of assessments, and post-exam analysis is one intervention to improve assessment quality and reliability (17). To the researcher's knowledge, no item analysis had been conducted at Gondar CTE. Hence, the objective of this study was to improve the skills of college instructors in systematically using standardized and validated student assessments, with the autonomy of the department.

The test paper was also reviewed for face validity parameters. The internal consistency reliability of the exam paper was determined, as this was considered the most relevant and accurate method for determining test reliability. The acceptable value for test reliability in most of the literature is α ≥ 0.7. KR-20 is recommended for determining the internal consistency of dichotomous items (17), since objective test items can be dichotomously scored as either right or wrong (17). In this study, the Kuder-Richardson method was therefore employed to estimate test reliability, and a KR-20 value of 0.7 or greater was considered reliable. A non-functional distracter (NFD) was defined as an incorrect option in an MCQ selected by fewer than 5% of students.

The mean test score was 17.23 ± 3.85 (Table 1); the median score was 17.0, slightly lower than the mean. When discrimination power is considered, no single item could be labeled as "excellent" (Table 4).
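For illustration, a minimal Python sketch of the distracter efficiency (DE) calculation implied by the NFD definition above. The option counts and the helper name are hypothetical; this is not the spreadsheet procedure used in the study.

def distracter_efficiency(option_counts, key, n_examinees, nfd_threshold=0.05):
    """Percentage of distracters that are functional, i.e. chosen by at least
    5% of examinees; options below the threshold are non-functional (NFDs)."""
    distracters = {opt: c for opt, c in option_counts.items() if opt != key}
    functional = [opt for opt, c in distracters.items()
                  if c / n_examinees >= nfd_threshold]
    return 100.0 * len(functional) / len(distracters)

# Hypothetical item with key 'B': distracter 'D' is chosen by 4 of 176 examinees
# (< 5%), so it is an NFD and DE = 2/3 = 66.7%
counts = {"A": 30, "B": 100, "C": 42, "D": 4}
print(distracter_efficiency(counts, key="B", n_examinees=176))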

Furthermore, easy items (p > 0.7) such as q.3, q.5, q.10 and q.24 have poor discrimination. Figure 2 shows the graphical representation of the difficulty index and discrimination power. Items with NFDs are listed in Appendix C. In addition, four items (q.12, q.14, q.18 and q.20) have distracters selected by more students than the correct answer (the key). There are no items with three NFDs, and only one item (q.13) has two NFDs (Table 5). The overall mean DE was 92.1%, with a minimum of 33.3% and a maximum of 100%.

The internal consistency reliability calculated for this summative test was 0.58. This value is a bit less than the expected range in most standardized assessments (α ≥ 0.7). According to (8), a Cronbach alpha of 0.71 was obtained in a standardized Italian case study. Reliability can be categorized as excellent if α > 0.9, very good if between 0.8 and 0.9, and good if between 0.6 and 0.7 (1).

If the reliability value lies within 0.5-0.6, revision of the test is required; reliability is questionable if it falls below 0.5 (1).
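For reference, a minimal Python sketch of the KR-20 coefficient used for the reliability estimate discussed above, assuming a 0/1 score matrix; the study itself computed reliability in SPSS version 20, and the example matrix is invented.

import numpy as np

def kr20(scores):
    """Kuder-Richardson 20 reliability for dichotomously scored items.
    scores is an (n_students, n_items) matrix of 0/1 item marks."""
    k = scores.shape[1]                           # number of items
    p = scores.mean(axis=0)                       # proportion correct per item
    q = 1.0 - p
    total_var = scores.sum(axis=1).var(ddof=1)    # sample variance of total scores
    return (k / (k - 1)) * (1.0 - (p * q).sum() / total_var)

# Hypothetical 5 x 4 score matrix, as in the earlier sketch
scores = np.array([[1, 0, 1, 1],
                   [1, 1, 1, 0],
                   [0, 0, 1, 0],
                   [1, 1, 1, 1],
                   [0, 0, 0, 0]])
print(kr20(scores))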

This might also imply that college educators need to validate their assessment tools through item analysis. Non-functional distracters make an item easy and reduce its discrimination (10). Question number 31 (Table 7) has moderate difficulty and excellent discrimination power, probably because it has no NFDs. However, this pattern does not hold for the other test items, probably due to random guessing or flaws in item writing (10).

Post-exam item analysis is a simple but effective method for assessing the validity and reliability of a test. It detects specific technical flaws and provides information for further test improvement. An item with average difficulty (p = 0.3-0.7), high discrimination (r ≥ 0.4) and a high DE value (> 70%) is considered an ideal item. In this study, the summative test as a whole has moderate difficulty (mean = 0.56) and good distracter efficiency (mean = 92.1%), but it poorly discriminates between high- and low-achieving students. The test as a whole needs revision, as its reliability was not reasonably good, and some flaws in item writing were also observed.
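As a compact restatement of these criteria, the following small Python helper (our own illustration, not part of the study's analysis) flags whether an item meets the thresholds stated above.

def is_ideal_item(p, r, de):
    """True when an item meets the stated thresholds: moderate difficulty,
    acceptable discrimination and high distracter efficiency."""
    return 0.3 <= p <= 0.7 and r >= 0.4 and de > 70.0

# Applying the criteria to the test-level averages reported in this study
# (for orientation only; the criteria are meant for individual items):
print(is_ideal_item(p=0.56, r=0.16, de=92.1))   # False: discrimination too low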

According to Xu and Liu (2009), cited in (1), teachers' knowledge of assessment and evaluation is not static but rather a complex, dynamic and ongoing activity. It is therefore plausible to suggest that teachers and instructors should receive in-service seminars on test development. Since most of the summative tests constructed within the college are objective types, item analysis is recommended for instructors at various points in their teaching careers. It is also suggested that a specific unit be made responsible for testing and for the analysis of items after exam administration.

Competing interests

The author declares that there is no competing interest.