Thoughts on automation / frequency lists / analysis - Decks Inside

laiman · December 21, 2021, 2:44am

Hi All,
I’ve recently been playing around with Python, Selenium and other web packages to build word lists for TV shows and series that I enjoy on the web. My motivation is that on tracked readers I like using the percentage known after an import to gauge difficulty, and I like being able to do some first-exposure study of special words that might be domain specific. For example, I find it helpful to read a little about old Chinese words before jumping into period dramas, to aid my comprehension.
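
For anyone curious what a "percentage known" figure looks like in code, here is a minimal sketch. It assumes the transcript has already been split into words by some segmenter and that your known words are available as a set; the function name `percent_known` and the sample data are made up for illustration, and this is not how any particular reader calculates its number.

```python
from collections import Counter

def percent_known(tokens, known_words):
    """Return (token coverage, unique-word coverage) as percentages.

    tokens: list of already-segmented words from a transcript (assumed input).
    known_words: set of words the learner already knows (assumed input).
    """
    if not tokens:
        return 0.0, 0.0
    counts = Counter(tokens)
    # Coverage by running words: repeated words count every time they occur.
    known_tokens = sum(n for word, n in counts.items() if word in known_words)
    # Coverage by unique words: each distinct word counts once.
    known_types = sum(1 for word in counts if word in known_words)
    return 100 * known_tokens / len(tokens), 100 * known_types / len(counts)

# Hypothetical sample data, purely for illustration.
tokens = ["我", "喜欢", "看", "古装", "剧", "我", "喜欢", "白蛇"]
known = {"我", "喜欢", "看"}
token_pct, type_pct = percent_known(tokens, known)
print(f"{token_pct:.1f}% of running words known, {type_pct:.1f}% of unique words known")
```

The two numbers can differ a lot: a show may feel easy because the common words are known, even when most of the unique vocabulary is new.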

All this got me thinking about what’s the best way to approach this. I know many people are fond of frequency lists, but these can turn into massive decks with long import times. There is also the pain of words you’ve seen before getting tokenized in different ways, which bogs you down with duplicates.
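
One possible way to cut down on those duplicates, sketched under some assumptions: before counting, re-join adjacent tokens whenever their concatenation is a word you already know, then drop known words from the frequency list. The helper names (`merge_known`, `new_word_frequencies`), the `max_parts` limit and the 2,000-word cap are all illustrative choices, not part of any existing tool.

```python
from collections import Counter

def merge_known(tokens, known_words, max_parts=3):
    """Greedily re-join adjacent tokens whose concatenation is a known word."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest run first, down to runs of two tokens.
        for size in range(min(max_parts, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + size])
            if candidate in known_words:
                merged.append(candidate)
                i += size
                break
        else:
            # No run starting here forms a known word; keep the token as-is.
            merged.append(tokens[i])
            i += 1
    return merged

def new_word_frequencies(tokens, known_words, top_n=2000):
    """Count only words that are still unknown after merging, most frequent first."""
    merged = merge_known(tokens, known_words)
    counts = Counter(word for word in merged if word not in known_words)
    return counts.most_common(top_n)

# Hypothetical example: the segmenter split the known word 白娘子 into two tokens.
tokens = ["白", "娘子", "喜欢", "许仙", "白", "娘子", "许仙", "许仙"]
known = {"白娘子", "喜欢"}
print(new_word_frequencies(tokens, known))
```

This only catches the case where a known word was split into smaller pieces. Words that were wrongly glued together would still need manual attention.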

As a result, I’ve made and published some Skritter decks based on these thoughts, so I can trial some of my ideas with different shows and media forms.

The shows/videos I’m sharing include Skritter’s most recent Tangyuan video, Netflix’s ‘The Legend of the White Snake’, and Youku’s Falling Into Your Smile 你微笑时很美.

Falling Into Your Smile:
Analysis (full show) Skritter - Learn to Write Chinese and Japanese Characters
Frequency (full show, first 2k) Skritter - Learn to Write Chinese and Japanese Characters

The Legend of the White Snake:
Analysis (full show) (link limit on post, will put in comments)

Tangyuan:
Analysis (link limit on post, will put in comments)

I’ve also been creating private lists where I pull in full shows or a creator’s uploads from YouTube, like the Falling Into Your Smile lists above. I’m wondering if it might be better to do it per episode, or to take the analysis file/frequency list afterwards and categorize it into themes? I’ve found themed lists SUPER helpful when I’ve wanted to get into a specific topic. For example, I found one creator who had made a couple hundred videos on baking, so I made an analysis deck for that, and the focused approach has really helped. However, for shorter videos this doesn’t really make sense: the transcript is often quite short, so each word occurs only a few times and an automated frequency/analysis list might rank words that aren’t important to the theme too highly.
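
One idea for the short-video problem, as a rough sketch only: pool the transcripts for a theme and rank words by how many videos they appear in, rather than by raw counts inside a single short transcript. The function name `rank_across_videos`, the `min_videos` threshold and the baking sample data are assumptions for illustration.

```python
from collections import Counter

def rank_across_videos(transcripts, min_videos=2):
    """Rank words by the number of videos they appear in, then by total count.

    transcripts: list of token lists, one per video (assumed already segmented).
    min_videos: words seen in fewer videos than this are dropped (illustrative default).
    """
    total = Counter()
    video_count = Counter()
    for tokens in transcripts:
        total.update(tokens)
        # set() so a word counts at most once per video.
        video_count.update(set(tokens))
    ranked = [word for word in total if video_count[word] >= min_videos]
    # Sort by videos (descending), then total count (descending), then the word itself.
    ranked.sort(key=lambda word: (-video_count[word], -total[word], word))
    return ranked

# Hypothetical transcripts from three short baking videos.
transcripts = [
    ["面粉", "黄油", "烤箱", "今天"],
    ["面粉", "烤箱", "面粉", "朋友"],
    ["面粉", "黄油", "糖"],
]
print(rank_across_videos(transcripts))
```

Words that show up once in one video (like 今天 or 朋友 here) fall out of the list, while words that recur across the theme float to the top even if each transcript is short.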

Any ideas on this are welcome. How do you like your Skritter decks? What do you find motivational? What are the limitations of this approach? The biggest limitation I’m finding is that frequency lists end up being way too big (the Falling Into Your Smile list was over 50 sections long in the original output), and lots of words end up incorrectly tokenized by my current analysis packages and need manual attention.
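
On list size, a small sketch of capping a ranked word list and splitting it into fixed-size sections. The section size of 100 and the cap are illustrative assumptions, not Skritter limits, and `split_into_sections` is a made-up helper name.

```python
def split_into_sections(words, section_size=100, max_words=None):
    """Optionally cap a ranked word list, then split it into sections."""
    if section_size <= 0:
        raise ValueError("section_size must be positive")
    if max_words is not None:
        # Keep only the top of the ranking, e.g. the first 2,000 words.
        words = words[:max_words]
    return [words[i:i + section_size] for i in range(0, len(words), section_size)]

# Hypothetical ranked list of 5,200 placeholder words.
ranked = [f"word{i}" for i in range(5200)]
print(len(split_into_sections(ranked)))                  # uncapped
print(len(split_into_sections(ranked, max_words=2000)))  # capped at the first 2,000
```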

laiman · December 21, 2021, 2:45am

Links from the post above that didn’t make it:

The Legend of the White Snake: Skritter - Learn to Write Chinese and Japanese Characters
Tangyuan: Skritter - Learn to Write Chinese and Japanese Characters

Therebackagain · December 21, 2021, 6:21pm

Thanks so much for this!

The other problem is that Skritter’s browse function is really terrible - you have to scroll forever and if you stop to check something out, it takes you right back to the beginning of the browse list and you have to start going down the list all over again.

So unless someone knows the exact name of your very helpful lists, no one will ever find them.

laiman · December 22, 2021, 10:05am

Ah yeah, the browse can be frustrating at the best of times; it would be great to see some improvements there. For now I’ve just been sharing with friends when we decide to tackle a series together, and waiting until we get a week or so in for feedback. I’m just curious to see if anyone has any thoughts while I develop this tool (I’m hoping to release it as open source once it gets to a point where I’m happy to open up my terrible code to the world, since more content benefits everyone).