Hello! I am currently creating frequency lists for each type of media. Most media have pretty good transcripts available, but actual real-life conversation is proving trickier to collect without introducing bias.
I am looking for help finding any kind of media where people are just having everyday conversations. I am trying very hard not to bias my list in any direction, so I have come up with lots of rules for what kind of media, and how much of it, I will allow into my list. I am trying to make something that is as true to real life as possible and doesn't over-represent certain types of conversations, accents or speech patterns.
Below I list some conditions that would make something an ideal suggestion, but any and all suggestions are appreciated and welcome. I am more than happy to filter through all the videos I am given and decide which are worth adding and which are not. That said, videos that fall under any of these categories would be ideal:
– Fast, slurred speech with good transcriptions. Automatic subtitle generators have a tough time with people whose speech is fast and slurred. As a result, these speech patterns risk being under-represented in the data. If you know of any accurate transcripts of a person or group that you think falls under this kind of speech pattern, please let me know!
– Accents other than Tokyo or Kansai. Tokenizers have a tough time with this type of speech, but I want to try my best to include these accents as they make up a massive amount of real-life speech. It may be inevitable that the data set will be biased towards Tokyo-ben and Kansai-ben, but I will try to have some meaningful inclusion of other accents too.
– Conversation that is not about love. The one thing that seems very easy to get access to in Japanese is 恋話 (talk about love). So many Japanese reality shows are themed around people liking each other, or put people in scenarios where they are likely to pair up. As these are easy to get good transcripts of, they are not as useful to me. Of course these kinds of conversations are common in real life, but they are at risk of being over-represented because of this type of media. So I would ideally like conversations that are not about love or relationships (or at least where that isn't a central theme).
– Essentially any other type of speech that you would expect an automatic subtitle generator to have trouble with. That is mainly what I am looking for. These generators have gotten quite a bit better but are far from infallible, so human-made transcripts are preferred.
Thank you for any suggestions and help!
by Goldeyloxy