I used https://oneshotlora.com/ to generate this. The process was pretty elegant: you give them a YouTube link and they handle the rest. Looking through their dataset, I wasn't completely satisfied with the captions, but they provide the training data, so I'm free to fix them if I want.