Crowdsourcing Adverse Test Sets to Help Surface AI Blindspots

For every participant we maintain the following in the leaderboard:

  • Submitted Images: number of submitted images per participant;
  • Remaining Quota: remaining number of images a participant is allowed to submit;
  • Adverse Examples: number of adverse examples (i.e. human-verified false positives or false negatives) that this participant identified;
  • Bonus Quota: number of additional images that this participant is allowed to submit.

Bonus Quota: A participant's bonus quota is the number of Adverse Examples they have discovered multiplied by 5. In other words, for every example that is scored as adverse, the participant is allowed to submit 5 more image-label pairs.
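The bonus quota rule is simple arithmetic, and a minimal sketch makes it concrete. The function name and constant below are hypothetical and chosen for illustration only; the multiplier of 5 is the only value taken from the rules above.

```python
# Multiplier stated in the challenge rules: 5 extra submissions per adverse example.
BONUS_PER_ADVERSE_EXAMPLE = 5


def bonus_quota(adverse_examples: int) -> int:
    """Return the number of additional image-label pairs a participant may submit.

    `adverse_examples` is the participant's count of human-verified
    adverse examples (hypothetical parameter name).
    """
    if adverse_examples < 0:
        raise ValueError("adverse_examples must be non-negative")
    return BONUS_PER_ADVERSE_EXAMPLE * adverse_examples
```

For instance, a participant with 3 verified adverse examples would earn 15 additional submissions under this rule.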

Human verification: Human raters rate all image-label pairs in each participant's submission queue continuously throughout the challenge.

Awarding points: If multiple participants submit the same image-label pair, the point is awarded to the participant who submitted it first, based on the timestamp in the submitted images queue.
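One way to implement the first-submitter rule is to keep, for each image-label pair, only the submission with the earliest timestamp. The sketch below assumes a hypothetical `Submission` record with a numeric timestamp; the rules do not say how exact timestamp ties are broken, so here the submission seen first in the input keeps the point.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Submission(NamedTuple):
    """Hypothetical record for one entry in the submitted images queue."""
    participant: str
    image_id: str
    label: str
    timestamp: float  # assumed: seconds since some fixed epoch


def first_submitters(submissions: Iterable[Submission]) -> Dict[Tuple[str, str], str]:
    """Map each (image_id, label) pair to the participant who submitted it first."""
    earliest: Dict[Tuple[str, str], Submission] = {}
    for sub in submissions:
        key = (sub.image_id, sub.label)
        # Keep the strictly earlier submission; on a tie the existing one stays.
        if key not in earliest or sub.timestamp < earliest[key].timestamp:
            earliest[key] = sub
    return {key: sub.participant for key, sub in earliest.items()}
```

Only the participant returned for a given pair would be eligible for the point, provided the pair is later verified as adverse.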

Adverse examples: Image-label pairs for which the human verification disagrees with the machine prediction, i.e.

  • human verification = Y, machine prediction = N (false negative);
  • human verification = N, machine prediction = Y (false positive).
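The two cases above amount to a disagreement check between two yes/no answers. A minimal sketch follows, assuming both the human verification and the machine prediction are available as booleans; the function and parameter names are hypothetical.

```python
from typing import Optional


def classify_pair(human_says_yes: bool, machine_says_yes: bool) -> Optional[str]:
    """Classify an image-label pair as adverse or not.

    Returns "false negative" or "false positive" when the human verification
    and the machine prediction disagree, and None when they agree
    (i.e. the pair is not an adverse example).
    """
    if human_says_yes and not machine_says_yes:
        return "false negative"  # human = Y, machine = N
    if machine_says_yes and not human_says_yes:
        return "false positive"  # human = N, machine = Y
    return None
```

A pair counts toward a participant's Adverse Examples total exactly when this function returns a non-None value.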

Winner: The winner is the participant with the highest number of Adverse Examples when the competition closes.
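Selecting the winner reduces to taking the maximum over the Adverse Examples column of the leaderboard. The sketch below assumes the leaderboard is available as a mapping from participant name to adverse example count (a hypothetical representation). Because the rules do not state how ties are resolved, it returns every participant who shares the top count.

```python
from typing import Dict, List


def top_participants(adverse_counts: Dict[str, int]) -> List[str]:
    """Return the participant(s) with the highest number of adverse examples.

    Ties are not resolved here, since the rules do not specify a tie-breaker;
    all tied participants are returned in alphabetical order.
    """
    if not adverse_counts:
        return []
    top = max(adverse_counts.values())
    return sorted(name for name, count in adverse_counts.items() if count == top)
```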

