Chinese word list ordered by frequency and clustered by similarity

I’ve constructed a new list based on the 3000 most common words according to SUBTLEX-CH-WF:

My goal was to try to recreate some good experiences I had early on with curated lists like Skritter 101. I find it much easier to learn when things are grouped by similarity:

  • Characters that have similar components (either phonetic or radical). Seeing them together makes patterns more obvious, and multiplies the number of reviews you get for the common parts.
  • Words sharing common characters. This multiplies reviews for the common character, and also serves as a group of examples of how a character is used that is easier to understand than its definition alone.
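As a rough illustration of these two kinds of grouping (a sketch only, not the program used to build the list), the snippet below indexes characters by shared component and words by shared character. The `COMPONENTS` table is hand-written and simplified for the example, and the function names are made up here; a real run would load decompositions from a character-structure dataset.

```python
from collections import defaultdict

# Hand-written, simplified decompositions for illustration only.
# A real run would load these from a character-structure dataset.
COMPONENTS = {
    "请": ["讠", "青"],
    "清": ["氵", "青"],
    "情": ["忄", "青"],
    "想": ["相", "心"],
    "箱": ["竹", "相"],
}


def group_by_component(chars, components=COMPONENTS):
    """Map each component to the characters containing it (2+ members only)."""
    groups = defaultdict(list)
    for ch in chars:
        for part in components.get(ch, []):
            groups[part].append(ch)
    return {part: members for part, members in groups.items() if len(members) > 1}


def group_words_by_char(words):
    """Map each character to the words containing it (2+ members only)."""
    groups = defaultdict(list)
    for word in words:
        # dict.fromkeys drops repeated characters within a word, keeping order
        for ch in dict.fromkeys(word):
            groups[ch].append(word)
    return {ch: ws for ch, ws in groups.items() if len(ws) > 1}


print(group_by_component("请清情想箱"))
print(group_words_by_char(["以前", "以后", "然后", "最后"]))
```

Each group is a candidate cluster: the shared part is what gets the extra reviews.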

This kind of clustering is going to work better on your own personal list of goal words, since any words you are already studying will be skipped in a new list. The list I’m sharing is one I built for myself as an experiment to see if my ideas worked in practice.

One of the reasons I’ve made a fairly large list is that the clustering needs enough material to form meaningful clusters. The list roughly retains the frequency order, though words are often moved by hundreds of positions.
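One simple way to get "roughly frequency order, but clustered" is a greedy pass that pulls related words forward within a limited window. This is only a minimal sketch under that assumption, not the actual algorithm behind the list; the function name and the `window` parameter are hypothetical.

```python
def cluster_by_shared_chars(words, window=100):
    """Greedy reorder of a frequency-sorted word list.

    After each word is emitted, any of the next `window` remaining words
    that share a character with it are pulled forward to sit right after it.
    Relative frequency order is kept inside each pulled group.
    """
    remaining = list(words)
    result = []
    while remaining:
        head = remaining.pop(0)
        result.append(head)
        head_chars = set(head)
        related = [w for w in remaining[:window] if head_chars & set(w)]
        for w in related:
            remaining.remove(w)
        result.extend(related)
    return result


by_frequency = ["然后", "我", "以前", "最后", "你", "以后"]
print(cluster_by_shared_chars(by_frequency))
```

A small `window` keeps the result close to straight frequency order; a larger one allows bigger clusters at the cost of moving words further from their original positions.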

Here’s an example of how the clustering works out (starting around position 215 in the list). Several of those words moved by about 50 places (earlier or later). For example, 以后 and 以前 are about 100 places apart on the straight frequency list:

  • 然后
  • 当然
  • 最后
  • 以为
  • 以前
  • 以后
  • 觉得
  • 感觉
  • 记得
  • 亲爱
  • 父亲

The character clustering does not have as big an impact because the goal for this list was to learn words, so single characters are only prominent near the very beginning of the list. One amusing stretch is where it clustered a bunch of interjections together because of the common 口:

  • (号)

Cool! I’m curious, did you do this fully algorithmically, manually, or some of each?

I tend to use textbooks and their associated lists because good textbooks basically do this for you as they progress. I’m curious if you have tried them and found them better or worse than this?

I wrote a program to do it, then ran it several times and tuned it based on the results. It uses CEDICT to get pinyin, so it knows the difference between 觉 in 感觉 and 睡觉. It uses IDS data (Ideographic Description Sequences, from cjkvi-ids) for character structure.
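To make the pinyin point concrete, here is a minimal sketch (not the actual program) of keying each character by its reading in the word, so that 觉 in 感觉 and 觉 in 睡觉 count as different items. The sample lines are hand-written in CC-CEDICT's `traditional simplified [pinyin] /gloss/` layout with shortened glosses; the function names are hypothetical, and taking only the first reading found for a word is a simplifying assumption.

```python
import re

# Hand-written sample in CC-CEDICT's line layout (glosses shortened).
SAMPLE = """\
# comment lines start with '#'
感覺 感觉 [gan3 jue2] /to feel/
睡覺 睡觉 [shui4 jiao4] /to sleep/
覺得 觉得 [jue2 de5] /to think/
"""

LINE_RE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /")


def load_readings(text):
    """Return {simplified word: [syllable, ...]}, keeping the first entry seen."""
    readings = {}
    for line in text.splitlines():
        if line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if not match:
            continue
        simplified = match.group(2)
        syllables = match.group(3).lower().split()
        # Only keep entries where syllables line up one-to-one with characters.
        if len(syllables) == len(simplified):
            readings.setdefault(simplified, syllables)
    return readings


def char_keys(word, readings):
    """Pair each character with its reading in this word (None if unknown)."""
    syllables = readings.get(word)
    if syllables is None:
        return [(ch, None) for ch in word]
    return list(zip(word, syllables))


readings = load_readings(SAMPLE)
print(char_keys("感觉", readings))
print(char_keys("睡觉", readings))
```

Clustering on these (character, reading) pairs instead of bare characters would put 感觉 with 觉得 while keeping 睡觉 apart.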

As far as lists go, this isn’t really a replacement for lists (selecting which words); it’s a tool to apply to any list of words so you can learn related things at the same time, hopefully learning all of them faster. One of the main things that inspired me was an experience encountering lots of variants of 青 and 相 grouped together and feeling like I was studying on turbo mode.

The reason I chose the SUBTLEX-CH-WF list as my target was my experience using HSK lists and some of the “top N characters” lists on Skritter. Just using straight word frequency is working much better for me in terms of understanding content. The “top N character” lists especially were just becoming too abstract, so I decided I wanted to switch to words and pick up characters along the way.

At some point straight frequency is going to break down. Personally I’m not nearly there yet. At that point the best advice I’ve seen is to focus on content-specific lists. The idea then would be to export the list of words I know from Skritter, subtract that from some list (e.g. all the words from some book), cluster the rest, and study that list.
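The "subtract what I already know" step could look something like the sketch below. It assumes both lists are already available as plain Python lists of words; the export format is not assumed here, and `remaining_words` is a hypothetical name.

```python
def remaining_words(target_words, known_words):
    """Return target words not yet known, in original order, without repeats."""
    known = set(known_words)
    seen = set()
    result = []
    for word in target_words:
        if word not in known and word not in seen:
            seen.add(word)
            result.append(word)
    return result


book_words = ["警察", "证据", "警察", "监狱"]  # e.g. words pulled from some book
known = ["证据"]                              # e.g. words already being studied
print(remaining_words(book_words, known))
```

The output keeps the order of the target list, so if that list is sorted by frequency, the leftover words can go straight into the clustering step.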

One amusing thing about the SUBTLEX word frequency is that it is generally a good list of words by frequency, but sometimes the bias of subtitles shows up. For example, it has a surprising number of high-frequency words related to crime, like police, policeman, jail, a dozen words for murder/murderer, evidence, …

Was about to simply do a copy&paste list of SUBTLEX-CH words but turns out someone already did a better job than me!

Going to ditch my HSK6 list for a while! :joy:

