I’ve recently been playing around with Python combined with Selenium and other web packages to start creating lists for TV shows and series that I enjoy on the web. The motivation is that on tracked readers I like using the percentage known after an import to gauge difficulty, as well as being able to do some first-exposure studying on special words that might be domain specific. For example, I find it helpful to read a little about old Chinese words before jumping into some period dramas to aid my comprehension.
All this got me thinking about what’s the best way to approach this. I know many people are fond of frequency lists, but these can turn into massive decks with long import times, and they come with the pain of having words you’ve seen before get tokenized in different ways, effectively bogging you down with duplicates.
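As a rough sketch of how a frequency list could be kept smaller, here is one way to count words while dropping ones you already know. It assumes the transcript has already been tokenized by whatever segmenter you use; `frequency_list`, `known_words`, and `top_n` are names made up for this example, not part of any Skritter tool.

```python
from collections import Counter


def frequency_list(tokens, known_words=frozenset(), top_n=None):
    """Count tokens, skip already-known words, return (word, count) pairs.

    tokens: iterable of already-segmented words (assumed input).
    known_words: words to leave out, e.g. exported from an existing deck.
    top_n: optional cap on the list length, e.g. 2000.
    """
    # Strip whitespace so " 微笑" and "微笑" are not counted as two entries.
    cleaned = (t.strip() for t in tokens)
    counts = Counter(t for t in cleaned if t)

    # Remove words that are already known so they are not re-imported.
    for word in known_words:
        counts.pop(word, None)

    # most_common sorts by count, highest first.
    return counts.most_common(top_n)


if __name__ == "__main__":
    sample = ["你", "微笑", "时", "很", "美", "微笑", " 微笑", "很"]
    for word, count in frequency_list(sample, known_words={"你", "很"}):
        print(word, count)
```

This only removes exact duplicates. It does nothing about the same stretch of text being split differently by different tokenizers, which still needs a manual pass.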
As a result, I’ve made and published some Skritter decks based on these thoughts so I can trial some of my ideas with different shows and media forms.
The shows/videos I’m sharing include Skritter’s most recent Tangyuan video, Netflix’s ‘The Legend of the White Snake’, and Youku’s Falling Into Your Smile 你微笑时很美.
Falling Into Your Smile:
Analysis (full show) Skritter - Learn to Write Chinese and Japanese Characters
Frequency (full show, first 2k) Skritter - Learn to Write Chinese and Japanese Characters
The Legend of the White Snake:
Analysis (full show) (link limit on post, will put in comments)
Analysis (link limit on post, will put in comments)
I’ve also been creating private lists where I pull in full shows or a creator’s uploads from YouTube, like the Falling Into Your Smile lists above. I’m wondering if it might be better to do it per episode, or to take the analysis file/frequency list afterwards and categorize it into themes. I’ve found this SUPER helpful when I’ve wanted to get into a specific topic. For example, I found one creator who had made a couple hundred videos on baking, so I made an analysis deck for that, and the focused approach has really helped. However, for shorter videos this doesn’t really make sense: the transcript is often quite short, so every word occurs only a few times, and an automated frequency/analysis list might rank words that aren’t important to the theme higher than it should.
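One possible way around the short-transcript problem is to pool several videos from the same creator and rank words by how many videos they show up in, instead of by raw count inside a single video. The sketch below is just that idea under some assumptions: each transcript is already tokenized, and `rank_by_video_coverage` and `min_videos` are hypothetical names for illustration.

```python
from collections import Counter


def rank_by_video_coverage(videos, min_videos=2):
    """Rank words by the number of videos they appear in.

    videos: list of token lists, one per transcript (assumed pre-tokenized).
    min_videos: drop words seen in fewer videos than this.
    Returns (word, videos_containing_word, total_count) tuples.
    """
    total = Counter()     # how often each word occurs overall
    coverage = Counter()  # how many videos contain each word

    for tokens in videos:
        total.update(tokens)
        coverage.update(set(tokens))  # count each word once per video

    ranked = [
        (word, coverage[word], total[word])
        for word in total
        if coverage[word] >= min_videos
    ]
    # Widest coverage first, then highest total count, then the word itself
    # so the order is deterministic.
    ranked.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return ranked


if __name__ == "__main__":
    baking = [
        ["面粉", "黄油", "烤箱"],
        ["面粉", "糖", "烤箱", "烤箱"],
        ["面粉", "今天"],
    ]
    for row in rank_by_video_coverage(baking):
        print(row)
```

A word that appears in only one short video never reaches the list, which is the behaviour I would want for a themed deck, though the right `min_videos` probably depends on how many videos are pooled.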
Any ideas are welcome on this. How do you like your Skritter decks? What do you find motivational? What are the limitations of this approach? The biggest limitations I’m finding are that frequency lists end up being way too big (the FFYS list was over 50 sections long in its original output) and that lots of words end up being incorrectly tokenized by my current analysis packages and need manual attention.
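For the two limitations above, here is a minimal sketch of one way to cut down the manual work: check each word against a reference word list, set aside anything not found for review, then cap the list and split it into sections. The reference `dictionary` is assumed to be a set of headwords you load yourself, and `split_for_review`, `into_sections`, `max_words`, and `section_size` are hypothetical names, not anything from Skritter.

```python
def split_for_review(ranked_words, dictionary, max_words=None):
    """Separate words found in a reference dictionary from likely mis-tokenizations.

    ranked_words: words in ranked order (most useful first).
    dictionary: set of known-good headwords (assumed to be supplied by you).
    max_words: optional cap on the accepted list.
    Returns (accepted, needs_review), both keeping the original order.
    """
    accepted, needs_review = [], []
    for word in ranked_words:
        if word in dictionary:
            accepted.append(word)
        else:
            needs_review.append(word)
    if max_words is not None:
        accepted = accepted[:max_words]
    return accepted, needs_review


def into_sections(words, section_size=20):
    """Split a word list into consecutive sections of at most section_size."""
    if section_size < 1:
        raise ValueError("section_size must be at least 1")
    return [words[i:i + section_size] for i in range(0, len(words), section_size)]


if __name__ == "__main__":
    ranked = ["微笑", "笑时", "电竞", "比赛"]
    headwords = {"微笑", "电竞", "比赛"}
    accepted, needs_review = split_for_review(ranked, headwords, max_words=2)
    print(into_sections(accepted, section_size=1))
    print("check by hand:", needs_review)
```

Names, slang, and show-specific terms will often be missing from a general dictionary, so the review pile is a list of things to look at, not a list of errors.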