That’s why it gives you a panel of 9 images. It has high confidence on some images and low confidence on others. When you pick the correct images and skip the incorrect ones, it uses the ones it’s confident about as “validation” and takes your feedback on the low-confidence images to update the training data.
What this means in practice is that the only images actually being “graded” are the ones bots can solve anyway.
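To make that concrete, here is a rough sketch of the split I’m describing. Everything in it is made up for illustration: the `Tile` fields, the `grade_panel` function, the 0.9 threshold, and the choice to throw away feedback from a failed attempt are all my assumptions, not how any real CAPTCHA service is known to work.

```python
from dataclasses import dataclass

# Hypothetical cutoff: at or above this, the system trusts its own label.
CONFIDENCE_THRESHOLD = 0.9


@dataclass
class Tile:
    tile_id: int
    predicted_match: bool  # model's guess: does this tile contain the target?
    confidence: float      # model's confidence in that guess, 0.0 to 1.0


def grade_panel(tiles, selected_ids):
    """Return (passed, new_labels) for one panel attempt.

    High-confidence tiles act as the "validation" set: the user must agree
    with the model on every one of them. Low-confidence tiles are never
    graded; the user's answer on them is collected as a new label.
    """
    passed = True
    new_labels = {}
    for tile in tiles:
        user_says_match = tile.tile_id in selected_ids
        if tile.confidence >= CONFIDENCE_THRESHOLD:
            # Graded tile: any disagreement with the model fails the attempt.
            if user_says_match != tile.predicted_match:
                passed = False
        else:
            # Ungraded tile: record the user's answer as candidate training data.
            new_labels[tile.tile_id] = user_says_match
    # Assumption: feedback is only trusted when the validation tiles were passed.
    if not passed:
        new_labels = {}
    return passed, new_labels


# Nine tiles: ids 0-5 are high confidence, ids 6-8 are low confidence.
panel = [Tile(i, i % 2 == 0, 0.95 if i < 6 else 0.5) for i in range(9)]
passed, labels = grade_panel(panel, {0, 2, 4, 7})
```

Notice that whether you pass depends only on tiles 0 through 5, which are the ones the model already labels confidently on its own. Whatever you click on tiles 6 through 8 never affects the result.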
Not anymore