Automating Anki card creation for Chinese

I've been studying Mandarin on and off for the past few years, and I now have a year-end goal of reaching HSK 4 (that is, a mostly conversational level). I've tried a number of apps (Duolingo, Drops, Hello Chinese), but ultimately spaced repetition suits my learning style best. (A good intro to that can be read here.)

That said, creating decks for Anki can be time-consuming, and the quality of prebuilt decks varies wildly. In the past, I've mixed prebuilt Anki decks with custom terms I care about, plus sound bites sourced from videos to accompany them.

This wasn't scaling, nor was it really ethical. However, when playing with Duolingo on the web today, I realized all the new female Mandarin voices were tagged with 'Zhiyu'. A quick Google search revealed this is a high-quality TTS voice for Amazon Polly. Embarrassingly, it took until now to realize Duolingo's voice was TTS at all, as I'd avoided TTS decks in the past since they sounded worse than Microsoft Sam. Google Translate, for example, gives really off-the-mark voices. Polly's Zhiyu voice, on the other hand, seems high enough quality to mix in with natural voices in my deck. Jackpot. (Hey, if it's good enough that I didn't realize Duolingo was TTS, that's good enough for me.)

Polly's pricing gives plenty of free tier room, so it's feasible to generate all the HSK decks and then some for free. I'm going to document the process I went through to create these, but if you just want the Anki deck, skip to the bottom.

Creating the Anki cards

The first step was to source HSK banks in a standardized format. Helpfully, CSVs of the published banks were created by Alan Davies, who allows non-commercial use with credit. The files are tab-separated, so transforming them into a word list and a set of Anki-importable TSVs was straightforward:

import csv
import hashlib

# Input file, output TSV for Anki, and tags:
file_list = [
        ('data/HSK Official With Definitions 2012 L1.txt', 'data/hsk1.tsv', 'HSK1'),
        ('data/HSK Official With Definitions 2012 L2.txt', 'data/hsk2.tsv', 'HSK2'),
        ('data/HSK Official With Definitions 2012 L3.txt', 'data/hsk3.tsv', 'HSK3'),
        ('data/HSK Official With Definitions 2012 L4.txt', 'data/hsk4.tsv', 'HSK4'),
]

# Create a TSV deck for Anki, plus a full word list for generating TTS.
with open('all-words.txt', 'w', encoding='utf-8') as all_words_file:
    for f in file_list:
        with open(f[0], encoding='utf-8') as csv_file:
            with open(f[1], 'w', encoding='utf-8') as csv_out_file:
                csv_reader = csv.reader(csv_file, delimiter="\t")
                for row in csv_reader:
                    simplified_characters = row[0]
                    pinyin_tone_chars = row[3]
                    english = row[4]
                    # Generate a consistent hash for our media.
                    sound_clip = hashlib.sha1(simplified_characters.encode('utf-8')).hexdigest() + ".mp3"

                    print(simplified_characters + "," + pinyin_tone_chars + "," + english)
                    all_words_file.write(simplified_characters + '\n')
                    # Columns: Hanzi, Pinyin, English, Audio, then the HSK level as a tag.
                    csv_out_file.write(simplified_characters + "\t" + pinyin_tone_chars + "\t" + english + "\t" + "[sound:" + sound_clip + "]\t" + f[2] + "\n")

This gives us a file we can import into Anki with fields [Hanzi, Pinyin, English, Audio], plus a fifth column holding the HSK level tag. Now we just need the audio files, which we can generate easily with the helpful reference code from AWS:

import boto3
import hashlib
import os

polly_client = boto3.Session(
    aws_access_key_id="...",
    aws_secret_access_key="...",
    region_name='us-east-2').client('polly')

def gen_for_word(input_text):
    # Make sure this matches the deck generator!
    hashstr = hashlib.sha1(input_text.encode('utf-8'))
    outputfile = 'output/' + hashstr.hexdigest() + '.mp3'

    # No reason to make a web request if we already have it on disk.
    if os.path.isfile(outputfile):
        print("File exists for " + input_text)
    else:
        response = polly_client.synthesize_speech(VoiceId='Zhiyu',
                                                  OutputFormat='mp3',
                                                  Text=input_text)

        # Create output/ on first use, and let the context manager close the file.
        os.makedirs(os.path.dirname(outputfile), exist_ok=True)
        with open(outputfile, 'wb') as file:
            file.write(response['AudioStream'].read())

with open('all-words.txt', encoding='utf-8') as fp:
    for line in fp:
        word = line.rstrip()
        # Skip blank lines so we never send Polly an empty string.
        if word:
            print(word)
            gen_for_word(word)

This is synchronous, but I was able to generate 1200 files in under a minute, which is decent enough. Now we just need to add them to Anki.
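Before moving on, it can be worth confirming that every word actually got a clip, since a dropped request would otherwise only show up later as a silent card. The following is a minimal sketch, not part of the original scripts: the helper names clip_name and missing_clips are my own, and it assumes the same SHA-1 naming scheme and the all-words.txt and output/ paths used above.

```python
import hashlib
from pathlib import Path


def clip_name(word):
    """Same naming scheme as the scripts above: SHA-1 of the UTF-8 word, plus .mp3."""
    return hashlib.sha1(word.encode('utf-8')).hexdigest() + '.mp3'


def missing_clips(words, audio_dir):
    """Return the words whose clip is absent or empty in audio_dir."""
    audio_dir = Path(audio_dir)
    missing = []
    for word in words:
        clip = audio_dir / clip_name(word)
        if not clip.is_file() or clip.stat().st_size == 0:
            missing.append(word)
    return missing


if __name__ == '__main__':
    word_list = Path('all-words.txt')
    # Only run the check when the word list from the first script is present.
    if word_list.is_file():
        lines = word_list.read_text(encoding='utf-8').splitlines()
        words = [line.rstrip() for line in lines if line.strip()]
        for word in missing_clips(words, 'output'):
            print('Missing or empty clip for ' + word)
```

Because the generator skips files that already exist, re-running it after this check should only request the clips that are still missing. An empty file would be skipped too, so delete any empty clips before re-running.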

Importing to Anki

Create a new deck (I did this in a new profile to make sure my testing didn't break anything). Make sure you have a card type that matches our fields (Hanzi, Pinyin, English, Audio, in that order) and use File->Import.
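Anki maps columns to fields by position, so a row with a stray tab would shift everything after it. As a rough pre-import check (a sketch of my own, with a hypothetical helper name check_rows), something like this lists rows that don't have the expected five columns: the four fields plus the tag written by the generator script.

```python
import csv


def check_rows(tsv_path, expected_fields=5):
    """Return (line_number, field_count) for every row with the wrong number of fields."""
    bad_rows = []
    with open(tsv_path, encoding='utf-8', newline='') as tsv_file:
        # QUOTE_NONE treats quote characters in definitions as ordinary text.
        reader = csv.reader(tsv_file, delimiter='\t', quoting=csv.QUOTE_NONE)
        for line_number, row in enumerate(reader, start=1):
            if len(row) != expected_fields:
                bad_rows.append((line_number, len(row)))
    return bad_rows
```

Calling check_rows('data/hsk1.tsv') and getting an empty list back suggests the columns line up; anything else points at the lines to inspect by hand.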

The Anki import menu for my Chinese deck.

The UI should look something like this, which shows that our fields are all lined up and we can safely import. However, our sound bites are still missing. We need to manually copy all the output files from our Python script to our collection folder. On Windows, this is %appdata%/Anki2/[profile name]/collection.media. Drop the files in there (not in a subdir) and hit Check Media in Anki. It should recognize your sound bites now, and they'll play automatically.
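If dragging a thousand-odd files around by hand is unappealing, the copy step can be scripted. This is a small sketch rather than anything Anki provides: copy_clips is a hypothetical helper, PROFILE_NAME is a placeholder you would replace with your own profile name, and the media path is simply the Windows location described above.

```python
import os
import shutil
from pathlib import Path

# Placeholder: replace with the name of your own Anki profile.
PROFILE_NAME = 'User 1'


def copy_clips(source_dir, media_dir):
    """Copy every .mp3 from source_dir into media_dir, skipping files already there."""
    source_dir, media_dir = Path(source_dir), Path(media_dir)
    if not media_dir.is_dir():
        raise FileNotFoundError('No such media folder: ' + str(media_dir))
    copied = 0
    for clip in sorted(source_dir.glob('*.mp3')):
        target = media_dir / clip.name  # flat copy, no subdirectories
        if not target.exists():
            shutil.copy2(clip, target)
            copied += 1
    return copied


if __name__ == '__main__':
    appdata = os.environ.get('APPDATA')  # normally only set on Windows
    if appdata:
        media = Path(appdata) / 'Anki2' / PROFILE_NAME / 'collection.media'
        print(copy_clips('output', media), 'clips copied')
```

The function returns the number of files it copied, so a second run that prints 0 is a quick sign that everything is already in place. Check Media still needs to be run in Anki afterwards.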

The final deck

You can grab the packaged up deck (HSK 1-4, plus example sentences) here.

Conclusion

This is very exciting -- Amazon Polly's Zhiyu voice is good enough that I can now quickly generate new cards for terms/sentences I want to hold on to. Hopefully the packaged deck is useful to somebody!