EuroPython 2016

(Machine-)Learning Chinese, with Python!

Speaker(s) Andreas Dewes

Learning a new language is hard work, especially if it is Chinese, a tonal language written using more than 10,000 different characters. Finding our way around in this linguistic labyrinth is a daunting task. But do not fear, for we have the power of Python at our side, and with its help we will machine-learn Chinese!

Mom scolds the horse

Chinese is a tonal language, which means that pronouncing a syllable differently will usually change its meaning. And while this can be very funny, it can also be rather embarrassing for language learners. So, to keep us from getting into linguistic trouble, we’ll write a little Python tool that helps us to improve our pronunciation.

Seeing the trees and the forest, but still being lost

Reading the morning newspaper while having a nice cup of tea doesn’t sound so complicated, does it? Well, if that newspaper is printed in Chinese, we will have to know about 2,500 characters just to make it through the first pages. Again, machine learning will come to our rescue!
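To make the character count quoted above concrete, here is a minimal sketch of how one could measure how many distinct characters are needed to cover a given share of a text. It is an illustration only, not the method used in the tutorial: the helper names `is_cjk` and `chars_needed` are hypothetical, the sample sentence is a toy example, and only the basic CJK Unified Ideographs block is considered.

```python
from collections import Counter


def is_cjk(ch: str) -> bool:
    """Return True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def chars_needed(text: str, coverage: float) -> int:
    """Count how many of the most frequent characters cover `coverage` of the text.

    `coverage` is a fraction between 0 and 1. Punctuation and any
    non-CJK characters are ignored (an assumption of this sketch).
    """
    counts = Counter(ch for ch in text if is_cjk(ch))
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    # Walk through characters from most to least frequent.
    for needed, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= coverage:
            return needed
    return len(counts)


# Toy sample: "Mom scolds the horse, the horse scolds Mom."
sample = "妈妈骂马，马骂妈妈。"
print(chars_needed(sample, 0.5))
print(chars_needed(sample, 1.0))
```

Run on a real newspaper corpus instead of the toy sentence, the same function would show how quickly coverage grows with the number of characters learned.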

Low-hanging fruits, high-flying dragons

Pronunciation and characters mastered, we’ll still have to learn a large number of words and phrases, so where to begin? To answer this, we’ll make use of Bayesian techniques to identify the low-hanging fruits of the Chinese language.
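The abstract does not say which Bayesian technique is meant, so the following is only one simple possibility: estimating word probabilities as the posterior mean under a symmetric Dirichlet prior and ranking words by that estimate. The function name `posterior_word_probs`, the prior strength `alpha`, and the tiny pre-segmented word list are all assumptions made for illustration.

```python
from collections import Counter


def posterior_word_probs(tokens, vocabulary, alpha=1.0):
    """Posterior mean word probabilities under a symmetric Dirichlet(alpha) prior.

    Only tokens that appear in `vocabulary` are counted. Words that were
    never observed still receive a small non-zero probability.
    """
    counts = Counter(t for t in tokens if t in vocabulary)
    n = sum(counts.values())
    denom = n + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denom for w in vocabulary}


# Hypothetical, already segmented toy corpus.
tokens = ["我", "喜欢", "茶", "我", "喜欢", "报纸", "我"]
vocabulary = ["我", "喜欢", "茶", "报纸", "龙"]

probs = posterior_word_probs(tokens, vocabulary)
# Words with the highest estimated probability are the ones worth learning first.
ranked = sorted(vocabulary, key=lambda w: probs[w], reverse=True)
print(ranked[:2])
```

The prior matters most for rare words: with little data, it keeps a word seen once from looking much more important than a word not yet seen at all.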

Congratulations, you should now be fluent in Chinese (or at least Machine-Learning).

On Thursday 21 July at 14:00


  1. Hi there.

    Very nice and appealing proposal, congrats!
    Nevertheless, I think it would be helpful to better clarify the contents and the expectations for this training directly in the abstract rather than briefly sketching them in the notes for reviewers.

    If you do so, you will get all of my attention (and votes) in return !-)

    Thanks a lot for your consideration.
    — Valerio Maggio,
  2. Hey everyone,

    to clarify the content:

    This tutorial is about using machine learning in language learning. We will explore various methods of machine learning and data analysis to help us get a better understanding of the Chinese language.

    I hope to help participants to learn more about the following things:

    * Using Python to work with unstructured data (speech, text, images)
    * Extracting meaningful features from this data for analysis
    * Applying various data analysis techniques to extract meaning from the data
    * Visualizing the results using e.g. Matplotlib

    In the tutorial we will use the "canonical" Python data analysis stack:

    * IPython notebook
    * numpy, scipy & (possibly) pandas
    * scikit-learn
    * matplotlib

    In addition, we are going to use a variety of libraries to help us to retrieve and process different types of input data.

    I will provide a repository with the code and data for the tutorial, and a VirtualBox image with all the necessary tools preinstalled (for people who don't want to set up everything themselves).
    — Andreas Dewes,