I've confirmed data overlap in the system: a fraction of the validation data was seen during training, so the R1 scores aren't trustworthy in their current state. Even without the overlap, the early-stage model can be trained directly to R1@100% within 3 epochs, so that isn't the crucial failure point. The failure point is that the large-scale tests may have overlap, and they cannot have any for the big training run.
The overlap happened when I switched from the 200k set to the 500k set, so the full 12-million set will need a validation target other than itself. I only ran it twice, but both runs likely bled about 20k images of mixed origin from one set into the other. That's probably under 8% bleedover, but it's enough to taint the outcome.
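A quick way to quantify that kind of bleedover is to hash every file in both splits and intersect the sets. This is a minimal sketch, not the actual pipeline: the directory paths are hypothetical, and it assumes duplicates are byte-identical files (near-duplicates would need perceptual hashing instead).

```python
import hashlib
from pathlib import Path

def file_hashes(root: str) -> set[str]:
    """SHA-256 every file under a directory tree."""
    hashes = set()
    for p in Path(root).rglob("*"):
        if p.is_file():
            hashes.add(hashlib.sha256(p.read_bytes()).hexdigest())
    return hashes

# Hypothetical split locations.
train = file_hashes("data/train_500k")
val = file_hashes("data/val")

overlap = train & val
print(f"{len(overlap)} shared files ({len(overlap) / max(len(val), 1):.1%} of val)")
```

Anything above 0% here means the validation score is tainted and the split has to be rebuilt.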
I'll need another dataset to validate against, something completely removed from the attribution and completely differentiated. I'll most likely use my own dataset as validation: essentially a billion trash prompts that can't simply be solved and often make zero sense.
Even a small percentage of the validation data having been trained is enough for me to resort to extreme measures, and the damn thing reporting R1 100% all the time is annoying me anyway. I want a legitimate series of impossible combinations that cannot be represented: essentially garbage noise mixed with pure captions the model has never learned from.
The model can't easily solve these, which gives an honest measure. A million of them is probably the best possible impossible goal.
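One cheap way to manufacture that garbage-noise set is to splice words from unrelated captions together with random letter strings, so every prompt mixes real vocabulary with tokens no caption ever contained. A minimal sketch under my own assumptions; the helper name and the sample captions are illustrative, and the real run would draw from the full prompt dump:

```python
import random
import string

def garbage_prompt(captions: list[str], rng: random.Random,
                   noise_tokens: int = 4) -> str:
    """Mix one word from each of 3 unrelated captions with pure noise
    tokens, yielding a prompt no training caption can match."""
    words = [rng.choice(c.split()) for c in rng.sample(captions, k=3)]
    noise = ["".join(rng.choices(string.ascii_lowercase, k=6))
             for _ in range(noise_tokens)]
    mixed = words + noise
    rng.shuffle(mixed)
    return " ".join(mixed)

rng = random.Random(0)  # seeded so the validation set is reproducible
captions = ["a dog on a beach", "red car in snow",
            "two cats sleeping", "a plane over mountains"]
validation = [garbage_prompt(captions, rng) for _ in range(5)]
for p in validation:
    print(p)
```

Scaled up to a million draws, this gives a held-out target the model genuinely cannot have memorized, so any R1 it reports there is meaningful.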


