Machine Learning and Natural Language Understanding

A summary of various projects in these very interesting scientific fields. Are AI systems fit to replace physicians? Building reliable AI systems.

ETH Zürich offers strong classes, in particular in the field of machine learning. Many of them come with hands-on projects that allow learning not only the theory but also the practical application of current technology on GPU-accelerated computer clusters. TensorFlow runs very well on these high-performance machines, which also come with lots of RAM. That is beneficial: some time ago, I had to learn the hard way that typical machine-learning applications are not very well suited for a MacBook Air. :-)

After learning the basics in Learning and Intelligent Systems (excellent teaching by Prof. Krause), I took Computational Intelligence Lab (CIL) and Natural Language Understanding (NLU) (both taught by Prof. Hofmann and co-lecturers) in 2018 and learned more about machine-learning concepts. Both classes came with their own group projects, of which I summarize some impressions here. Maybe some of the code written in the groups will be published at some point in the future. Each project led to a report that explains everything in much more detail, of course.

CIL: Text sentiment classification

In this group project together with Aryaman Fasciati, Nikolas Göbel and Philip Junker (The Optimists), we worked on the task of classifying tweets into positive “:)” or negative “:(” sentiments as defined in the original data sets. Supervised learning could be used with 2.5 million labeled training tweets (50% per class), which we split (after randomization) into training and validation sets. Separate test sets (one open to us as score feedback on Kaggle, and one hidden until the final submission) were used to assess the accuracy of the predictions. 200-dimensional GloVe vectors (Stanford), pre-trained on 2 billion tweets, were used to create the word embeddings.
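
To illustrate this preprocessing step, here is a minimal sketch of building word vectors from the pre-trained GloVe embeddings and splitting the labeled tweets; the file name, tokenization and split fraction are illustrative assumptions, not our exact pipeline:

```python
# Minimal sketch (not the project code): load pre-trained GloVe twitter
# vectors and split the labeled tweets into training and validation sets.
import numpy as np
from sklearn.model_selection import train_test_split

def load_glove(path="glove.twitter.27B.200d.txt"):  # file name is an assumption
    """Map each word to its 200-dimensional GloVe vector."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

def make_split(tweets, labels, val_fraction=0.1, seed=42):
    """Randomize and split the labeled tweets (1 = ':)', 0 = ':(')."""
    return train_test_split(tweets, labels, test_size=val_fraction,
                            random_state=seed, shuffle=True)
```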

A first baseline with a random forest classifier (Python, scikit-learn) achieved an accuracy of 72 %. A second baseline with a very simple recurrent neural network (RNN) model (two layers with GRU cells; Python, TensorFlow) achieved a big improvement to 85.2 %. However, despite trying many additional options, our best final model extending the RNN (testing LSTM cells, increasing the hidden state size, adding more input from known sentiment dictionaries, adding a separate emoji analysis, and other things) could improve this accuracy only to 87.4 %. That was good in the ranking, but eye-opening as to how hard it can be to improve on already quite strong RNN models. RNNs are indeed used very successfully in seq2seq tasks such as text translation.
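
For readers who want a concrete picture, the following is a hedged sketch of a two-layer GRU classifier similar in spirit to our RNN baseline; the layer sizes, vocabulary size and training settings are illustrative guesses, not the project's exact configuration:

```python
# Hedged sketch of a two-layer GRU sentiment classifier (illustrative only).
import tensorflow as tf

VOCAB_SIZE = 20000   # assumption
EMBED_DIM = 200      # matches the 200-dimensional GloVe vectors

model = tf.keras.Sequential([
    # The embedding layer could be initialized with the pre-trained GloVe matrix.
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(positive sentiment)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=3)
```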

One limitation may also have come from the training data set. With irony, sarcasm and other playful forms in tweets, it was sometimes hard even for us to see the association of some tweets with the labeled :) or :(. Thus, improving the training set before training may be another option. However, this comes with the risk of introducing biases.

Of course, the project was only one small part of the entire classwork. Many more topics were covered.

NLU: Story Cloze Task

In this group project together with Simon Biland, Silvan Melchior and Rüdiger Birkner, we worked on two tasks. The first was the manual unrolling of a recurrent neural network (RNN), implementing perplexity calculations and predicting how partial sentences would continue (classic seq2seq). This gave us insights into what is going on behind the convenient automatic RNN models, including tweaking various of the background parameters.
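
As a small illustration of the perplexity part, here is a sketch of how per-sentence perplexity can be computed from the probabilities a language model assigns to the actual next words (the function name and example values are mine, not from the exercise code):

```python
# Perplexity = exp of the average negative log-likelihood of the true next words.
import numpy as np

def perplexity(next_word_probs):
    """next_word_probs: probability the model gave to each true next word."""
    log_probs = np.log(np.asarray(next_word_probs, dtype=np.float64))
    return float(np.exp(-np.mean(log_probs)))

# Example: a model that is always 25% sure of the right word has perplexity 4.
print(perplexity([0.25, 0.25, 0.25]))  # -> 4.0
```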

The second project addressed the Story Cloze Task, which assesses the ability of an artificial system to deeply understand natural language and its underlying meaning. It consists of five-sentence short stories created by humans. To master the task, the system needs to choose the correct story ending for the fifth sentence among two options, based on the first four sentences. To make the task even more difficult, the actual training set contains only the correct ending and lacks the second ending that would be slightly off and thus wrong.

Only the small validation set has both endings together with the label. When one part of this validation set is used to train a classifier and only the second part is used for actual validation, the resulting training set is very small for training a classifier that extrapolates well to the vast set of test stories (hidden from us). Therefore, one of the goals of this project was to generate the missing sixth sentence for the original training set: sentences that should be wrong but still close enough to be a training challenge for the classifier.

As a baseline, a discriminator (a two-layer RNN with GRU cells) was trained on half of the validation set and validated on the second half, reaching an accuracy of 66 % in predicting the correct ending (for comparison, 50 % reflects the expected accuracy of a coin flip).
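
To make the setup concrete, the following sketch shows how such a trained discriminator can be used to pick an ending and how the accuracy on the held-out half can be computed; `discriminator` and `encode` are placeholders for the trained model and the preprocessing, not our actual code:

```python
# Illustrative sketch: score each (context, candidate ending) pair and pick the
# ending the discriminator considers more plausible.
import numpy as np

def choose_ending(discriminator, encode, context, ending_a, ending_b):
    """Return 0 if ending_a looks more plausible, else 1."""
    scores = [float(discriminator(encode(context, ending_a))),
              float(discriminator(encode(context, ending_b)))]
    return int(np.argmax(scores))

def accuracy(discriminator, encode, stories):
    """stories: (context, ending_a, ending_b, correct_index) tuples,
    e.g. the held-out half of the validation set."""
    hits = sum(choose_ending(discriminator, encode, c, a, b) == y
               for c, a, b, y in stories)
    return hits / len(stories)
```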

We explored various paths to improve this accuracy, which would go too far for this short post. Our most ambitious approach combined two variants of Generative Adversarial Networks (GANs), building a model that consisted of a generator and a discriminator, each training on and improving the quality of the other.
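
The following is a very rough sketch of the alternating training idea behind such a GAN-like setup, with the generator and discriminator as placeholders; it also glosses over the fact that generated text is discrete, so in practice the generator output has to be kept differentiable (e.g. soft word distributions), which is part of what makes this hard:

```python
# Sketch of one alternating GAN-style training step (placeholders, not our model).
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def train_step(generator, discriminator, g_opt, d_opt, context, real_ending):
    # 1) Discriminator step: real endings -> 1, generated endings -> 0.
    fake_ending = generator(context)
    with tf.GradientTape() as tape:
        d_real = discriminator([context, real_ending])
        d_fake = discriminator([context, fake_ending])
        d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
    d_grads = tape.gradient(d_loss, discriminator.trainable_variables)
    d_opt.apply_gradients(zip(d_grads, discriminator.trainable_variables))

    # 2) Generator step: try to make the discriminator output 1 for fakes
    #    (assumes a differentiable generator output, e.g. soft word distributions).
    with tf.GradientTape() as tape:
        fake_ending = generator(context)
        d_fake = discriminator([context, fake_ending])
        g_loss = bce(tf.ones_like(d_fake), d_fake)
    g_grads = tape.gradient(g_loss, generator.trainable_variables)
    g_opt.apply_gradients(zip(g_grads, generator.trainable_variables))
    return d_loss, g_loss
```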

Indeed, some of the generated story endings look quite natural. An example:

  • given context: ron was watching a <unk> chase on tv . the cop was chasing the bad guy through his neighborhood . it was terrifying ! but then the cop caught the bad guy .
  • generated ending: the man was arrested for a few days .

Tweaking such complex networks properly, so that both parts train at a similar speed and actually reinforce each other, is quite tricky; the configuration space is huge. Thus, no improvement over the baseline above could be achieved within the limited time. Other alternative approaches (e.g. focusing on the sentiment trajectory to build an enriched discriminator without a generator) did not improve over the baseline either (published accuracies based on sentiment analysis alone for this task are 64 to 67 %).

It has been very interesting to learn while addressing such a challenging task.

AI replacing physicians?

Indeed, seeing the impressive results of modern systems (image classification often better than humans, playing chess and Go better than humans, learning arcade games autonomously without human assistance, predicting some medical outcomes better than humans, self-flying drones, driving cars better than humans? etc.) is eye-opening.

“Artificial intelligence” is almost everywhere these days. There are even claims that some of the systems are good enough to replace physicians soon (e.g. blog). I learned there that a robot seems to have passed China’s national medical exam (link). IBM’s Watson “went to Medical School” in Maryland around the time I – coincidentally and not involved with this project – was a research fellow at the NIH in Bethesda, Maryland. To be fair, the blog posting above also links a counter-opinion (blog).

I fully acknowledge and appreciate the advantages modern data science can provide to physicians and patients. There is still a large potential. However, based on my professional background (as physician and computer scientist), I am very skeptical towards such claims that computers could replace physicians – or should. There are various aspects to this topic, which I may discuss at some point in a separate blog post.

Maybe I have already learned too many ways to create adversarial examples in Reliable and Interpretable Artificial Intelligence (RIAI) (Prof. Vechev). For many of the most impressive deep-learning systems, counter-examples can be constructed quite easily that contradict the generally good quality of the predictions with absurdly wrong results. Additionally, many of the current systems are more or less black boxes.
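
As an illustration of how simple such attacks can be, here is a minimal sketch of the fast gradient sign method (FGSM), a textbook way to craft adversarial images: every pixel is nudged slightly in the direction that increases the classifier's loss. The model, epsilon and loss are illustrative choices, not tied to any specific system mentioned here:

```python
# FGSM sketch: a tiny, often imperceptible perturbation can flip the prediction.
import tensorflow as tf

def fgsm(model, image, true_label, epsilon=0.01):
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
    image = tf.convert_to_tensor(image)
    with tf.GradientTape() as tape:
        tape.watch(image)
        prediction = model(image)
        loss = loss_fn(true_label, prediction)
    gradient = tape.gradient(loss, image)
    # Step each pixel in the sign of the gradient and keep values in [0, 1].
    return tf.clip_by_value(image + epsilon * tf.sign(gradient), 0.0, 1.0)
```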

While these deep neural networks can map wide ranges of inputs to good outputs – given enough, and good enough, training data – there is no explanation or rule set that could be shown to humans to validate how the system came to its conclusions.

Manually tracking down such things can lead to interesting results, such as with a system aimed at detecting cancer in image analysis. It looks like the system learned during training to basically classify images as cancer whenever there was a ruler in the picture, because this was a common feature in those training images. Similarly, an image classifier learned to “detect” horses because many of the horse pictures carried some form of label. Thus, it learned label -> horse and not horse-shape -> horse.

This is also problematic in situations where the training set has biases, as seen in a machine-learning system planned by a company to predict good candidates for hiring. It was realized that the predictions were heavily biased because the training set already contained this bias. Such observations also lead to the philosophical question of what learning actually is, also for us humans. How often do we, too, just learn based on experienced biases? Probably a story for another post.

Of course, there are methods being developed that address such questions. Reliable or even certifiable neural networks should be an important goal in particular for life-critical applications.

AI replacing other professions?

I see the appeal for technology companies creating AI systems to enter the health-care market. However, there are plenty of things to consider that may not be obvious to laypeople without medical training. There are good reasons why medical school and post-graduate education take as long as they do before one becomes a physician.

Additionally, there is another aspect that is often not realized by people in management who propagate the replacement of physicians: if AI systems become good enough to replace physicians, then they are more than qualified to replace people in many white-collar jobs, including all management layers. If this is not done, it is only because of protectionism.

There have already been several revolutions in the job market, replacing monotonous and/or dangerous heavy-duty work with machines and creating new job opportunities that are more interesting for humans. The coming revolution is different: I am not the first person to realize that the current ongoing revolution in the job market will be able to replace many of the so far even academic white-collar professions with AI systems.

In contrast to earlier revolutions, it is much harder to learn enough to climb the ladder above the job type that was just replaced by AI. Additionally, software systems can be distributed for world-wide use without the effort of shipping machines, as in earlier revolutions. Thus, there will always be only a very limited demand for new AI system designers compared to the world-wide population.

Yes, new job types will emerge, as they have always emerged with such professional revolutions. The question is whether they emerge fast enough and whether the training is easy enough for people to maintain or increase their current income. It is a demanding task to make such changes in a socially responsible way. However, history has shown that such responsible behavior has not always been observed.

I have always been a fan of the very differentiated education system in Switzerland, which puts a lot of emphasis on the value of non-university professional education with its early combination of theoretical knowledge and practical skills in real life. Additionally, it comes with the associated flexibility to learn more, and to specialize or change direction, at a later time. It is a great system that allows mobility. Having seen other countries that focus on academic education only (all “real jobs” need at least a BSc or MSc), I got the impression that those systems work worse for society in general. Thus, I am optimistic that the coming changes in the job market can be handled in a responsible way. However, it does not just happen automatically. Good, responsible decisions must be made – as part of the society.