This summer Digital Assess undertook a study of Comparative Judgement to investigate a number of niggly questions, partly in response to several informal studies that have challenged the ability of Adaptive Comparative Judgement (ACJ) to produce the level of consistency that Digital Assess and their customers have repeatedly achieved in studies over recent years. First, we wanted to compare the ranking produced by ACJ with the one produced by non-adaptive Comparative Judgement (CJ), and we wanted to do it using real judges, not abstract computer simulations. Second, we wanted to look at the suggestion that random judgements have the power to distort the results of ACJ and, since they go undetected by the ACJ system, can potentially undermine its effectiveness; we wanted to test the plausibility and accuracy of that argument. Third and finally, we wanted to look at what we really mean when we talk about the internal consistency measure, and to explore the correlation between the estimated ranking and the true one in simulated ACJ sessions.
So, with our objectives clear, the Digital Assess team ran a trial of non-adaptive CJ over the summer break, with teachers judging work from secondary Design and Technology students. There were 20 scripts in total, all taken from a previous ACJ session which has been the basis of a number of our studies and is the subject of the well-known paper Kimbell, R., Wheeler, T., Stables, K., Shepard, T., Martin, F., Davies, D., Pollitt, A., Whitehouse, G. (2009). E-scape portfolio assessment phase 3 report. London: Goldsmiths, University of London. For the study we used 6 new (?) judges. Each judge was asked to make all possible pairwise comparisons between the scripts and in doing so completed a full CJ of that data. The judgements of all the teachers together were then used to determine an overall ranking.
On the reliability of the ACJ ranking, our overall result was produced from every possible comparative judgement made by each of our judges. This is why we limited ourselves to a small number of scripts: 20 scripts give 190 possible pairs, so our 6 judges had to make a total of 1140 judgements to generate the overall ranking.
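The judgement count quoted above follows directly from the number of distinct pairs. A minimal sketch of the arithmetic (the variable names are ours, chosen for illustration):

```python
from itertools import combinations

n_scripts = 20
n_judges = 6

# Every unordered pair of distinct scripts is judged once by each judge.
pairs = list(combinations(range(n_scripts), 2))
pairs_per_judge = len(pairs)                     # n * (n - 1) / 2
total_judgements = pairs_per_judge * n_judges

print(pairs_per_judge, total_judgements)         # 190 1140
```

Because the number of pairs grows with the square of the number of scripts, a full CJ quickly becomes impractical for larger sets, which is the efficiency argument for the adaptive approach.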
Once we had generated the overall ranking of the 6 judges based on all the possible judgement combinations, we compared it to the ACJ ranking that was generated in the original study. We wanted to show that the substantially more efficient ACJ approach produced comparable results with considerably less effort. Our study got a more than satisfactory Spearman correlation coefficient of 0.91 when compared to the original study. We also looked at the correlation between the ranking of the individual judges and the new overall result. These are listed in the table below.
 | Judge_1 | Judge_2 | Judge_3 | Judge_4 | Judge_5 | Judge_6 | ACJ Rank
Correlation Coef. | 0.91 | 0.76 | 0.78 | 0.82 | 0.91 | 0.78 | 0.91
We see that the ranking of each judge is well correlated with the overall ranking produced by the full CJ session, with the overall result almost completely in line.
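For readers unfamiliar with the Spearman coefficient used above, here is a minimal sketch of how it can be computed for two rankings without ties. The function name and the five-script data are hypothetical and are not taken from the study.

```python
def spearman(rank_a, rank_b):
    """Spearman rank correlation for two rankings of the same items, no ties.

    rank_a, rank_b: dicts mapping script id -> rank position (1 = best).
    """
    if rank_a.keys() != rank_b.keys():
        raise ValueError("both rankings must cover the same scripts")
    n = len(rank_a)
    # Sum of squared differences between the two rank positions of each script.
    d_squared = sum((rank_a[s] - rank_b[s]) ** 2 for s in rank_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical rankings: the second swaps the scripts in positions 2 and 3.
full_cj = {"s1": 1, "s2": 2, "s3": 3, "s4": 4, "s5": 5}
acj = {"s1": 1, "s2": 3, "s3": 2, "s4": 4, "s5": 5}

rho = spearman(full_cj, acj)   # 1 - 6*2 / (5*24) = 0.9
```

A value of 1 means the two rankings are identical and a value of -1 means one is the exact reverse of the other.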
Next, on the issue of random judgements, cited elsewhere as an example of where ACJ may fall down, we wanted to show that this concern doesn’t hold up in the reality of actual CJ sessions. ACJ relies on judges being consistent with themselves, and some of the studies that have brought ACJ into question begin with the assumption that this may not be the case. We felt it was important to demonstrate that self-consistency is a reasonable assumption even for judges who are not experienced markers, so in our study we investigated the self-consistency of each judge. A judge being consistent with himself means that if he deems that script A beats script B and that script B beats script C, then he will tend to decide that script A beats script C. We examined that consistency using the consistency ratio, a well-established metric devised by US Professor Thomas Saaty (http://people.revoledu.com/kardi/tutorial/AHP/Consistency.htm).
To compute the consistency ratio, one first builds a comparison matrix, which is the matrix of all possible pairs of comparisons for some selected scripts. Then, one computes the largest eigenvalue (which is a characteristic number) of that matrix. One then subtracts the number of scripts from that eigenvalue to get a number N. One then repeats this process for a comparison matrix made up of random judgements and gets a number R. The consistency ratio is the ratio of N to R. More details of that calculation are included in the appendix.
The table below lists the values of the consistency ratios for each judge in our trial and for different samples of the scripts.
According to Saaty’s theory, a consistency ratio of less than 10% means that the judge is consistent. Every judge in our trial stayed below that threshold for every sample size, even though at least one of them had very little experience as a marker.
 | Judge_1 | Judge_2 | Judge_3 | Judge_4 | Judge_5 | Judge_6
5 scripts | 4.4% | 4.4% | 4.4% | 4.4% | 4.4% | 4.4%
10 scripts | 4.4% | 4.4% | 4.4% | 4.4% | 5.34% | 4.4%
20 scripts | 4.52% | 4.52% | 4.86% | 5.15% | 6.44% | 4.65%
Furthermore, the presence of a judge making random judgements in ACJ can also be detected through the use of misfit statistics. Misfit statistics are a tool for detecting judges or scripts that perform differently from their peers in a single ACJ or CJ session. The misfit metric is an average of information-weighted squared residuals (the residual being the difference between the observed result of the judgement and its predicted probability). A misfitting judge is then defined as one whose metric lies more than 2 standard deviations above the mean of all judges (lying below the mean is not misfitting; it is actually very good, as there is a smaller difference between the actual result and the prediction). A loosely misfitting judge is defined as one lying between 1 and 2 standard deviations from that mean. For instance, in our trial with 6 judges, the less experienced judge did show up in the misfit statistics as shown below. She was the only one loosely misfitting. Her judgements were not bad; it’s just that they were not as good as her peers’.
We also ran a simulation where we combined the real judgements of the trial with completely random judgements made by an additional fictitious judge. From the misfits plot below, it is very clear that the presence of such a judge can be easily detected.
(We note that the misfit metric is a relative scale and not an absolute one, which is why the absolute values for the real judges decrease when we introduce the additional fictitious judge.)
Thus it is clear that even in the unlikely event of a judge purposely or otherwise making random responses to comparisons, the system would pick this behaviour up very clearly through the analysis of the misfit stats.
Where this is of particular concern, the issue can also be addressed by introducing a consistency test for each judge before the ACJ session and then running a misfit statistics analysis during the session.
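The misfit screening described above can be sketched as follows. This is an illustration under stated assumptions, not the Digital Assess implementation: we assume a logistic model for the predicted probability, compute the metric as the sum of squared residuals divided by the sum of the information terms p(1 - p), and use the population standard deviation across judges. All function names and data are hypothetical.

```python
import math
from statistics import mean, pstdev


def win_probability(value_a, value_b):
    """Predicted probability that script a beats script b (assumed logistic model)."""
    return 1.0 / (1.0 + math.exp(-(value_a - value_b)))


def judge_misfit(judgements, values):
    """Information-weighted mean square residual for one judge.

    judgements: list of (winner, loser) script ids decided by this judge.
    values: dict mapping script id -> estimated parameter value.
    """
    squared_residuals = 0.0
    information = 0.0
    for winner, loser in judgements:
        p = win_probability(values[winner], values[loser])
        squared_residuals += (1.0 - p) ** 2   # observed result is 1: the winner won
        information += p * (1.0 - p)          # information carried by this judgement
    return squared_residuals / information


def classify_judges(misfits):
    """Label each judge by how far above the mean its misfit lies, in SD units."""
    centre = mean(misfits.values())
    spread = pstdev(misfits.values())         # assumption: population SD
    labels = {}
    for judge, value in misfits.items():
        z = (value - centre) / spread if spread > 0 else 0.0
        if z > 2:
            labels[judge] = "misfitting"
        elif z > 1:
            labels[judge] = "loosely misfitting"
        else:
            labels[judge] = "ok"
    return labels


# Hypothetical data, for illustration only.
values = {"A": 2.0, "B": 0.0, "C": -2.0}
expected = judge_misfit([("A", "B"), ("B", "C"), ("A", "C")], values)
surprising = judge_misfit([("B", "A"), ("C", "B"), ("C", "A")], values)
labels = classify_judges({"J1": 1.0, "J2": 1.0, "J3": 1.0, "J4": 1.0,
                         "J5": 1.0, "J6": 1.0, "random": 5.0})
```

In this sketch a judge who always agrees with the predicted outcome scores well below 1, while a judge who always contradicts it scores well above 1. Because the labels depend on the mean and spread of the whole group, adding or removing a judge changes where everyone else sits, which is the relative-scale effect noted above.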
Finally, on the issue of the internal consistency measure (i.e. reliability) not corresponding to the squared value of the correlation between the obtained parameters and the true generating ones, we performed various simulations. In these simulations, the true parameter values of the generating scripts were known beforehand and could then be compared to the parameters estimated by ACJ.
First of all, the alpha coefficient should not be described as a value for the reliability of the parameters obtained, but as a measure of the internal consistency of the complete set of judgements – the Judgement Consistency Coefficient. The measure applies to the scaled rank produced by this particular community of judges with this particular set of scripts. The consistency measure is an indication of our ability to repeat the exercise and achieve the same scaled rank, knowing that if we were marking and we mixed the judges and scripts up we would be more likely to get a different set of marks.
We know that theoretically, according to Classical Test Theory, the judgement consistency measure should correspond to the squared Pearson correlation between the true parameters and the estimated ones. We didn’t observe this relationship, so this could be a subject of further investigation. However, our simulations show high values of (Pearson) correlation between the true parameters and the estimated ones. They also show high values of Spearman correlation between the true ranking of the scripts and the estimated ranking, as shown in the figure below. We plotted the Spearman correlation as a function of the true standard deviation of the generating scripts. This figure was obtained after 13 rounds of judgement (or an average of 13 judgements per script).
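For reference, the Classical Test Theory relationship mentioned above can be written as follows, where X is the estimated parameter, T the true parameter and E the error, assumed uncorrelated with T. The symbols are introduced here for illustration and do not come from the study.

```latex
X = T + E, \qquad
\text{reliability} \;=\; \frac{\sigma_T^2}{\sigma_X^2} \;=\; \rho_{XT}^2
```

Here \rho_{XT} is the Pearson correlation between the estimated and true parameters, so under this theory the consistency measure would be expected to equal its square.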
The Spearman correlation (which is a correlation between rankings and not values) is what matters most because we want to know the similarity between the estimated ranking of the scripts and the true one. It is less important that the obtained parameter values closely match the true ones, since these parameter values get rescaled onto a new scale from 0 to 100. The scaled ranking of the students is what matters most. A correlation coefficient gets squared when you want a coefficient of determination to see how well the data points fit a linear model, but that is not the objective here. We want to establish a ranking as similar as possible to the true one without caring too much about the actual parameter values.
Furthermore, we observe a nonlinear stochastic relationship between the Spearman correlation of the estimated and true rankings (i.e. the validity) and a number of measured indicators in an ACJ run. In the figures below we’ve plotted the Spearman correlation coefficient as a function of the judgement consistency coefficient (left figure) and the observed standard deviation of the parameter values (right figure).
Because both the judgement consistency coefficient and the standard deviation are related non-linearly to the Spearman correlation coefficient, they can be used as stochastic predictors in a predictive analytics exercise. We have used these two predictors, as well as many others, in a machine learning algorithm to obtain the predicted validity shown in the following figure. Its accuracy is fairly good when the predicted validity is above 0.94, with a mean error of 0.01 and a standard deviation of 0.01. However, when the predicted validity is less than 0.94, it is more error-prone, with a mean error of 0.02 and a standard deviation of 0.02.
Summary of findings:
In summary, our study used full CJ with 20 Design and Technology scripts that were previously marked in an ACJ session. We asked each of 6 judges to perform a session of full CJ. We then used the decisions from all the judges to produce an overall ranking that we used as our overall result.
The results show that:
The ranking produced by ACJ is strongly correlated with the “Gold standard”.
Any inconsistency in the judging is picked up via the real-time analysis of the misfit statistics of the judgements. If this is of particular concern for the assessment context in question, a pre-ACJ self-consistency test for each judge, which can also be applied in marking-based contexts, will identify any problematic judges ahead of the assessment. In this pilot, as expected, our judges were very consistent with themselves.
Furthermore, simulations show that the correlation between the estimated rankings and the real rank positions achieved through ACJ is very high.
Appendix:
The consistency index was invented by US Professor Thomas Saaty. You calculate it in the following way.
First you build a matrix of all possible pairs of comparisons between your scripts (this matrix is obviously specific to each judge).
For example, for a set of 3 scripts (A, B, C), you can have the following matrix:
 | A | B | C
A | 1 | 2 | 0.5
B | 0.5 | 1 | 2
C | 2 | 0.5 | 1
where 2 means the script in the row beats the script in the column (in this example, A beats B and B beats C),
0.5 = 1/2 means the script in the row loses to the script in the column (in this example, A loses to C and B loses to A), and
1 means the script in the row draws with the script in the column (which is the case only for the diagonal elements of the matrix, where the script is paired with itself).
So if the entry [A,B] = 2 then automatically the entry [B,A] = 0.5 = 1/2, and all the diagonal entries ([A,A], [B,B], [C,C]) have to be equal to 1.
Furthermore, the comparison matrix is square. For N scripts, the comparison matrix has dimensions N by N (i.e. N rows by N columns).
Once you’ve built your comparison matrix, you calculate its largest real eigenvalue.
An eigenvalue is a characteristic value of a matrix, which you can obtain easily using a function built into most programming languages or their numerical libraries.
Let’s call this largest real eigenvalue Emax.
Then your consistency index is
CI = (Emax – N) / (N – 1)
The consistency ratio (CR) is then the ratio
CR = Consistency Index (CI) / Random Index(RI)
The Random index is an average consistency index for matrices that are built with pure random judgements. You have one specific RI value per size N of matrix.
According to Saaty, if the Consistency Ratio is less than 10% (0.1), then the judge’s level of inconsistency is acceptable and he is deemed to be consistent with himself. Otherwise he is deemed to be inconsistent.
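The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the code used in the study. To keep it dependency-free, the largest eigenvalue is found by power iteration, which works here because every matrix entry is positive; a library eigenvalue routine would serve equally well. The Random Index of 0.58 is the value commonly tabulated for N = 3, and whether the study used tabulated values or simulated its own is not stated. All function names are hypothetical.

```python
def build_matrix(scripts, wins):
    """Comparison matrix: 2 if the row script beat the column script, 0.5 if it
    lost, 1 on the diagonal. `wins` is a set of (winner, loser) pairs and must
    cover every pair of scripts exactly once."""
    matrix = []
    for row in scripts:
        line = []
        for col in scripts:
            if row == col:
                line.append(1.0)
            elif (row, col) in wins:
                line.append(2.0)
            else:
                line.append(0.5)
        matrix.append(line)
    return matrix


def largest_eigenvalue(matrix, iterations=1000, tol=1e-12):
    """Power iteration for a matrix whose entries are all positive."""
    n = len(matrix)
    vec = [1.0 / n] * n                       # start vector, entries sum to 1
    previous = 0.0
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        estimate = sum(product)               # eigenvalue estimate, since sum(vec) == 1
        vec = [x / estimate for x in product]
        if abs(estimate - previous) < tol:
            break
        previous = estimate
    return estimate


def consistency_index(matrix):
    n = len(matrix)
    return (largest_eigenvalue(matrix) - n) / (n - 1)


def consistency_ratio(matrix, random_index):
    return consistency_index(matrix) / random_index


scripts = ["A", "B", "C"]
RI_3 = 0.58  # assumption: commonly tabulated Random Index for N = 3

# The appendix example: A beats B, B beats C, but C beats A.
circular = build_matrix(scripts, {("A", "B"), ("B", "C"), ("C", "A")})
# A transitive judge: A beats B, B beats C, and A beats C.
transitive = build_matrix(scripts, {("A", "B"), ("B", "C"), ("A", "C")})

cr_circular = consistency_ratio(circular, RI_3)
cr_transitive = consistency_ratio(transitive, RI_3)
```

Under these assumptions the appendix matrix, which contains a circular set of judgements (A beats B, B beats C, C beats A), gives Emax = 3.5 and CI = 0.25, so it fails the 10% threshold, whereas the transitive set of judgements passes it.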
By Dr Ardavan Alamir, Digital Assess Data Scientist