Examining Inter-Rater Reliability in a Reality Baking Show

R
statistics
code
reliability
test theory
Author

Mattan S. Ben-Shachar

Published

October 18, 2018

Game of Chefs is an Israeli reality cooking competition show, where chefs compete for the title of “Israel’s most talented chef”. The show has four stages: blind auditions, training camp, kitchen battles, and finals. Here I will examine inter-rater reliability in the blind auditions of the 4th season (Game of Chefs: Confectioner), using some simple R code.

The Format

In the blind auditions, candidates have 90 minutes to prepare a dessert, confection or other baked good, which is then sent to the show’s four judges for a blind taste test. The judges don’t see the candidate, and know nothing about him or her, until after they deliver their decision. A “pass” decision is signified by the judge granting the candidate a kitchen knife; a “fail” decision is signified by the judge not granting a knife. A candidate who receives a knife from at least three of the judges continues on to the next stage, the training camp.

The 4 judges and host, from left to right: Yossi Shitrit, Erez Komarovsky, Miri Bohadana (host), Assaf Granit, and Moshik Roth

Inter-Rater Reliability

I’ve watched all 4 episodes of the blind auditions (for purely academic reasons!), for a total of 31 candidates. For each dish, I recorded the verdict (fail / pass) of each of the four judges.

Let’s load the data.

df <- read.csv('IJR.csv')
head(df)
  Moshik Assaf Erez Yosi
1      1     1    1    1
2      1     0    0    0
3      1     1    1    0
4      1     0    0    0
5      0     1    1    1
6      1     1    1    1

We will need the following packages:

library(dplyr) # for manipulating the data
library(magrittr) # <3
library(psych) # for computing Cohen's Kappa
library(corrr) # for manipulating matrices

We can now use the psych package to compute Cohen’s Kappa coefficient for inter-rater agreement for categorical items, such as the fail/pass categories we have here. \(\kappa\) ranges from -1 (full disagreement) through 0 (no pattern of agreement) to +1 (full agreement). Normally, \(\kappa\) is computed between two raters, and for the case of more than two raters, the mean \(\kappa\) across all pairs of raters is used.
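
For reference, the standard two-rater definition can be written as follows, where \(p_o\) and \(p_e\) are notation introduced here for illustration and do not appear in the psych output:

\[
\kappa = \frac{p_o - p_e}{1 - p_e}
\]

Here \(p_o\) is the observed proportion of dishes on which the two judges give the same verdict, and \(p_e\) is the proportion of agreement expected by chance given each judge’s overall pass rate. Perfect agreement (\(p_o = 1\)) gives \(\kappa = 1\), and agreement no better than chance (\(p_o = p_e\)) gives \(\kappa = 0\).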

cohen.kappa(df)

Cohen Kappa (below the diagonal) and Weighted Kappa (above the diagonal) 
For confidence intervals and detail print with all=TRUE
       Moshik Assaf  Erez  Yosi
Moshik   1.00  0.34  0.17  0.29
Assaf    0.34  1.00  0.17  0.68
Erez     0.17  0.17  1.00 -0.16
Yosi     0.29  0.68 -0.16  1.00

Average Cohen kappa for all raters  0.25
Average weighted kappa for all raters  0.25

We can see that overall \(\kappa=0.25\) - surprisingly low for what might be expected from a group of pristine, accomplished, professional chefs and bakers in a blind taste test.
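
To make the numbers above less of a black box, here is a minimal base-R sketch of the two-rater computation: the observed agreement is compared with the agreement expected by chance. The helper `kappa_by_hand()` is a hypothetical name made up for this illustration (it is not part of psych), and the example uses only the six rows printed by `head(df)`, so its result is not comparable to the full-data coefficients.

```r
# Hypothetical helper (not from psych): Cohen's kappa for two raters
# who each give a binary 0/1 verdict on the same dishes.
kappa_by_hand <- function(a, b) {
  stopifnot(length(a) == length(b))
  p_o <- mean(a == b)  # observed proportion of agreement
  # agreement expected by chance, from each rater's marginal pass rate
  p_e <- mean(a) * mean(b) + (1 - mean(a)) * (1 - mean(b))
  (p_o - p_e) / (1 - p_e)
}

# Only the six rows shown by head(df), NOT the full 31 candidates
moshik <- c(1, 1, 1, 1, 0, 1)
assaf  <- c(1, 0, 1, 0, 1, 1)
k_six <- kappa_by_hand(moshik, assaf)

# The six pairwise kappas printed below the diagonal of the matrix above
pairwise <- c(0.34, 0.17, 0.29, 0.17, 0.68, -0.16)
mean_pairwise <- mean(pairwise)  # rounds to the reported average of 0.25
```

Averaging the six rounded pairwise values reproduces the reported overall \(\kappa\) of 0.25, which is consistent with the "mean of all pairs" description. The six-row value `k_six` is only there to show the mechanics; with so few dishes it says nothing about the judges.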

When examining the pair-wise coefficients, we can also see that Erez seems to be in lowest agreement with each of the other judges (and even in a slight disagreement with Yossi!). This might be because Erez is new on the show (this is his first season as a judge), but it might also be because, of the four judges, he is the only one who is actually a confectioner (the other 3 are restaurant chefs).

For curiosity’s sake, let’s also look at the \(\kappa\) coefficient between each judge’s rating and the total fail/pass decision, based on whether a dish got a “pass” from at least 3 judges.

df <- df %>% 
  mutate(PASS = as.numeric(rowSums(.) >= 3))
head(df)
  Moshik Assaf Erez Yosi PASS
1      1     1    1    1    1
2      1     0    0    0    0
3      1     1    1    0    1
4      1     0    0    0    0
5      0     1    1    1    1
6      1     1    1    1    1
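
The three-knife rule can also be sketched in base R without the pipe. This is an illustrative variant, not the code used above: the data frame `six` and the vector `judges` are names introduced here, and `six` holds only the six rows printed by `head(df)`.

```r
# Only the six rows shown by head(df), NOT the full data set
six <- data.frame(
  Moshik = c(1, 1, 1, 1, 0, 1),
  Assaf  = c(1, 0, 1, 0, 1, 1),
  Erez   = c(1, 0, 1, 0, 1, 1),
  Yosi   = c(1, 0, 0, 0, 1, 1)
)

# Name the judge columns explicitly so the sum never picks up other columns
judges <- c("Moshik", "Assaf", "Erez", "Yosi")

# Count knives per dish, then apply the "at least three knives" rule
six$PASS <- as.numeric(rowSums(six[, judges]) >= 3)
```

Selecting the judge columns by name is a small safety measure: `rowSums(.)` sums every column in the data frame, so re-running the piped version after `PASS` already exists would count `PASS` as a fifth vote.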

We can now use the wonderful new corrr package, which is intended for exploring correlations, but can also generally be used to manipulate any symmetric matrix in a tidy fashion.

cohen.kappa(df)$cohen.kappa %>%
  as_cordf() %>%
  focus(PASS)
# A tibble: 4 × 2
  term    PASS
  <chr>  <dbl>
1 Moshik 0.619
2 Assaf  0.746
3 Erez   0.159
4 Yosi   0.676

Perhaps unsurprisingly (to people familiar with previous seasons of the show), it seems that Assaf’s judgment of a dish is a good indicator of whether or not a candidate will continue on to the next stage. Also, once again we see that Erez is barely aligned with the other judges’ total decision.

Conclusion

Every man to his taste…

Even among the experts there is little agreement on what is fine cuisine and what is not worth a doggy bag. Having said that, if you still have your heart set on competing in Game of Chefs, it seems that you should at least appeal to Assaf’s palate.

Bon Appetit!