Do you have a contact you can reach out to about lists for some of the newer books like Emma? I’ve emailed them with no response.
I’ve reached out to John and Jared about getting the new Mandarin Companion book lists up on Skritter, but I know they’re both busy with a million things related to that and other projects. I’ll be following up again soon if necessary, and I hope we can get them on the site in the near future.
Thanks for posting about these books. We love them, and it’s great to hear that you’re using Skritter and MC in conjunction to study!
If they are willing to supply an electronic copy of the text I can break it into words and upload that as a Skritter list.
We’ve found it best if publishers willing to work with us upload the contents on their own. That way any issues or updates to the list are under their control in the future.
It’s on my list of things to work on and we’ll get the new books uploaded as soon as we can!
Which books do you need a list for? I own all the books except Journey to the Center of the Earth, so I could create lists for all the others. They would include some redundant vocabulary, 不能 for example, because it is also a separate entry in the dictionary I am using for extraction, but you can manually delete those before starting the list or just ban them if they annoy you.
Emma is the main one I was looking for.
How are you splitting texts? I was using CEDICT and a greedy approach, but there are lots of things in CEDICT that you might not want as words in Skritter. I’ve been using banning to avoid those.
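For anyone following along, here is a rough Python sketch of the first step described above: pulling headwords out of a CEDICT file while leaving out entries you already know you don't want. It assumes the usual CC-CEDICT line layout (`Traditional Simplified [pinyin] /gloss/`). The function name `load_cedict_words` and the sample lines are made up for illustration, and filtering before upload is only a stand-in for Skritter's own ban feature, not a description of it.

```python
def load_cedict_words(lines, banned=frozenset()):
    """Collect simplified headwords from CC-CEDICT formatted lines.

    `banned` is a hypothetical personal skip list, not a Skritter API.
    """
    words = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and the '#' comment header of the file.
        if not line or line.startswith("#"):
            continue
        # Assumed entry layout: Traditional Simplified [pinyin] /gloss 1/gloss 2/
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue  # not a well-formed entry
        simplified = parts[1]
        if simplified not in banned:
            words.add(simplified)
    return words


# Illustrative lines in the CEDICT layout (not copied from the real file).
sample = [
    "# CC-CEDICT sample lines",
    "不能 不能 [bu4 neng2] /cannot/must not/",
    "朋友 朋友 [peng2 you5] /friend/",
    "學習 学习 [xue2 xi2] /to learn/to study/",
]
words = load_cedict_words(sample, banned={"不能"})
print(sorted(words))
```

With a real file you would pass an open file handle instead of `sample`, for example `open(path, encoding="utf-8")`.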
I am also using the CEDICT dictionary file, and dealing with all these compound words you wouldn’t want to study is exactly the problem I have been running into as well. So far I am just using an external library (jieba, you can find it on GitHub), but I want to write some sort of greedy algorithm with word frequency rules myself as well.
Here is the list for Emma: https://skritter.com/vocablists/view/6529413028118528 (not sure if you can follow that link, let me know if it does not work for you)
Regarding the cedict dictionary, have you found any other alternatives, some smaller dictionaries maybe that could prevent getting so many unwanted matches?
Thanks for putting that list up. Where did you get the source text? I just have the paper book.
SUBTLEX (word frequencies from movie subtitles) has much more semantic data (like parts of speech) as well as frequency, but it still contains 100,000 “words” (CEDICT has about 115,000). You could try to choose some kind of balance between frequency and length, but there are still high-frequency items that I wouldn’t want to study in Skritter like 是不是.
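One way the "balance between frequency and length" idea could look in Python is sketched below. Everything here is an assumption for illustration: the function name `keep_word`, the threshold numbers, and the counts in `toy_counts` are invented, not taken from SUBTLEX. The rule is simply that longer candidates must be more frequent to be kept, with an explicit skip list for items that frequency alone cannot catch.

```python
def keep_word(word, count, base_min=50, per_extra_char=200, banned=frozenset()):
    """Decide whether a candidate is worth studying as its own entry.

    Hypothetical rule: words longer than two characters need a higher
    frequency count; anything in `banned` is always rejected.
    """
    if word in banned:
        return False
    threshold = base_min + per_extra_char * max(0, len(word) - 2)
    return count >= threshold


# Invented counts, only to exercise the rule.
toy_counts = {"朋友": 900, "是不是": 5000, "一个月": 120, "面子": 60}

# Frequency alone still lets 是不是 through, which is the problem noted above...
print(keep_word("是不是", toy_counts["是不是"]))  # True

# ...so a personal skip list is still needed on top of the rule.
kept = [w for w, c in toy_counts.items() if keep_word(w, c, banned={"是不是"})]
print(kept)  # ['朋友', '面子']
```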
(The greedy approach is pretty easy to write: You just use a Trie, which is one of those data structures that’s awesome but you rarely have a real use for…)
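A minimal sketch of the greedy (longest-match) approach with a Trie, in Python. The class and function names are my own, the word list is a toy one, and this is not necessarily how either poster implemented it. Characters with no dictionary match are emitted on their own.

```python
class Trie:
    """Character trie built from nested dicts; the key None marks end of word."""

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True  # end-of-word marker

    def longest_match(self, text, start):
        """Length of the longest word beginning at text[start], or 0 if none."""
        node = self.root
        best = 0
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if None in node:
                best = i - start + 1
        return best


def segment(text, trie):
    """Greedy left-to-right segmentation: always take the longest match."""
    tokens = []
    i = 0
    while i < len(text):
        n = trie.longest_match(text, i) or 1  # unknown character: emit it alone
        tokens.append(text[i:i + n])
        i += n
    return tokens


trie = Trie(["我", "爱", "你", "我爱你", "朋友", "男女朋友"])
print(segment("我爱你朋友", trie))  # ['我爱你', '朋友']
print(segment("男女朋友来", trie))  # ['男女朋友', '来']
```

The first result shows the catch with a greedy matcher: if the dictionary contains a phrase like 我爱你, it wins over the single words, so the quality of the word list matters more than the algorithm.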
I bought all the Mandarin Companion books as epubs.
Thanks, I’ll have a look at that dictionary, but I do think in the end it’s hard to avoid all the words you don’t want to study in Skritter, since this is also partly a matter of personal preference.
Oh yes I didn’t even think about using a Trie, that’s a good idea.
I have exported your list so I can process it myself. I was surprised to see that it actually contains many “words” not in SUBTLEX or CEDICT. Obviously they are known to Skritter. So jieba must have an even bigger word list than either of those.
Here are some examples:
刚到 你家 买不到 家来 多好 这是 男女朋友 很快 一个月 很晚 更好 要来 吃完饭 中有 看到 做过 我爱你 没面子 二十年 是因为
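A check like the one described above (which list entries are missing from the reference dictionaries) comes down to a set difference. Here is a small Python sketch. The helper name `unknown_words` is hypothetical, and the two reference sets are tiny stand-ins rather than real CEDICT or SUBTLEX contents.

```python
def unknown_words(list_words, *references):
    """Return list entries that appear in none of the reference word sets,
    keeping the original list order."""
    known = set().union(*references)
    return [w for w in list_words if w not in known]


# Tiny stand-in sets; real ones would be loaded from the dictionary files.
cedict_words = {"朋友", "面子"}
subtlex_words = {"朋友", "很"}

exported_list = ["朋友", "男女朋友", "没面子", "面子"]
print(unknown_words(exported_list, cedict_words, subtlex_words))
# ['男女朋友', '没面子']
```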
That’s indeed weird. I actually load a custom dictionary downloaded from mdbg.net (called cc-cedict, so maybe that’s why it’s different?). When using the default jieba dictionary I am getting fewer words, but also some different useless words.
It looks like CEDICT is just updated very frequently. Just now it has an update for today, and it’s a very slightly different length than the one I have from a few weeks ago. So maybe you have an older version with more/different words?
Just dropping in to provide a quick Mandarin Companion update. I got a chance to spend some time speaking with Jared and they’ll work on uploading missing MC books as official lists on Skritter. Like us, Mandarin Companion is a very small team so I’m sure they would appreciate your patience as they get things uploaded.
@jannesan thanks for taking the time to make and share your Emma list!
My talk with Jared wasn’t just about the missing book lists. We’re also discussing ways of providing some of the Mandarin Companion content free to all subscribers, and offering previews of their entire content line. We need to get our mobile apps rolled out first, but I just wanted to let you know that we hope to include a lot more MC goodness in the future!