Binary classification asks a yes/no question. When the question has more than two possible answers, the task is called multi-class classification; our example above has three classes. When an entity may need to be assigned more than one class at once, the task is multi-label classification. The image above can be labeled as a dog image, a nature image, or a grass image, and since these labels do not contradict one another, the image can carry all of them. Multi-label classifiers handle such cases, and in practice most real-world classification problems are multi-label. The main objective of a multi-label classifier is to allow multiple labels for a single entity; however, most common algorithms are designed for multi-class classification, not multi-label classification.
In this article, you will see different implementation examples for solving multi-label classification problems. For this purpose, we use the multi-label algorithm adaptations in the scikit-multilearn library and deep learning implementations in the Keras library. In addition to the implementations you can build yourself, you will also see the multi-label classification capability of Artiwise Analytics.
There are several approaches to the multi-label classification problem:
- Problem Transformation Methods: divide the multi-label classification problem into multiple multi-class classification problems.
- Problem Adaptation Methods: generalize multi-class classifiers to handle multi-label problems directly. For example, the class probabilities of a multi-class classifier can be thresholded, and every class that exceeds the threshold is assigned as a label.
- Ensemble Methods: combine transformation and adaptation methods. For example, the final result can be obtained by merging the outputs of these methods with predefined rules.
In this blog post, we will use the Label Powerset approach, a problem transformation method. It maps every combination of labels that occurs together to a combination id, uses these ids as classes, and trains an ordinary multi-class classifier on them.
Using Label Powerset has pros and cons:
- Pro: combinations of labels are taken into account.
- Con: evaluation has high computational complexity, since the number of classes grows with the number of label combinations.
- Con: rare label combinations can lead to an imbalanced dataset.
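To make the transformation concrete, here is a minimal plain-Python sketch of how Label Powerset maps label sets to combination ids and back; the labels are illustrative, not from the article's dataset:

```python
# Label Powerset: each distinct combination of labels becomes one class id.
samples = [
    {"dog", "nature"},
    {"dog"},
    {"dog", "nature", "grass"},
    {"dog"},
]

# Assign a stable id to every distinct label combination.
combo_to_id = {}
ids = []
for labels in samples:
    key = frozenset(labels)
    if key not in combo_to_id:
        combo_to_id[key] = len(combo_to_id)
    ids.append(combo_to_id[key])

# A multi-class classifier is then trained on `ids`; at prediction time
# the predicted id is mapped back to its label set.
id_to_combo = {i: set(k) for k, i in combo_to_id.items()}

print(ids)  # [0, 1, 2, 1]
```

Note that the two `{"dog"}` samples share one id, while each new combination gets its own, which is exactly why rare combinations can leave the transformed dataset imbalanced.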
Natural Language Processing (NLP) is having its golden age thanks to deep learning approaches, which produce far better results than the previous state of the art. Deep learning brings the great advantage of learning different representations of natural language. There are several advantages of using deep learning for NLP problems:
- It can build a classifier directly from data; moreover, it can compensate for the weakness or over-specification of handcrafted features.
- Handcrafting features is usually time-consuming and non-reusable: features extracted for one domain typically do not generalize to others. Deep learning architectures, in contrast, automatically learn features at different levels of abstraction, and the lower levels carry information that can be adapted to other domains with minor changes.
Deep learning models do not treat labels as mutually exclusive classes, which lets them analyze classes more effectively than nearest-neighbor or clustering models. NLP models can be very fragile when tokens are evaluated independently; considering the similarity of tokens in a vector space makes an NLP system robust to token changes and better able to handle complex cases. The method explained in this post also uses such a system.
Supervised tasks such as classification can also benefit from unsupervised learning. Huge amounts of unlabeled data already exist, and outputs such as language models or semantic distances between words can be produced by training unsupervised learners on them. Feeding these outputs into the classifier alongside its inputs provides prior knowledge about the language or the domain that helps during prediction. For example, in a classifier built with classical machine learning methods, a word that does not appear in the training vocabulary cannot be taken into account at prediction time. With the prior linguistic knowledge that deep learning brings, however, even a word never seen in training can still contribute to the prediction, because its vector representation relates it to words that were seen in the training set.
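As an illustration of this idea, with pretrained word vectors an out-of-vocabulary word can still be related to known words through cosine similarity. The tiny 3-dimensional vectors below are made up for the sketch; real pretrained embeddings (word2vec, GloVe, etc.) have hundreds of dimensions:

```python
import numpy as np

# Toy "pretrained" embeddings standing in for a real embedding model.
vectors = {
    "dog":   np.array([0.9, 0.1, 0.0]),
    "puppy": np.array([0.8, 0.2, 0.1]),
    "grass": np.array([0.1, 0.9, 0.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "puppy" never appeared in the classifier's training set, but its vector
# is close to "dog", so the model can still treat it sensibly.
nearest = max(("dog", "grass"), key=lambda w: cosine(vectors["puppy"], w_vec := vectors[w]) if (w_vec := vectors[w]) is not None else 0)
print(nearest)  # dog
```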
This is one of the most important advantages of deep learning: the learned information is built up level by level through composition, and the lower levels of representation can often be shared across tasks. It also deals naturally with the recursive structure of human language. Sentences consist of words and phrases with a certain structure; when people construct a sentence, the sequence of words itself carries a great deal of information, and information flows through that sequence. Unfortunately, the bag-of-words representation, which is very popular in classical machine learning, cannot capture this flow. Deep learning, especially recurrent neural networks, captures sequence information much better.
We used a private dataset of our company to evaluate the model. You can also use your own dataset, or you may prefer one of the following public datasets:
Our dataset consists of 17,979 documents and 5 labels:
Machine Learning Architecture
Split data into train and test sets:
- Train set: data that will be fed to the model
- Test set: data that will be used to evaluate model performance
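Assuming the dataset is loaded into parallel lists of texts and label sets, the split might look like the following scikit-learn sketch (the toy data and variable names are illustrative):

```python
from sklearn.model_selection import train_test_split

# Illustrative toy data standing in for the real 17,979 documents.
texts = ["good dog", "green grass", "dog on grass", "nature walk"]
labels = [{"dog"}, {"grass"}, {"dog", "grass"}, {"nature"}]

# Hold out 25% of the data for evaluation; fix the seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42
)
print(len(X_train), len(X_test))  # 3 1
```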
Convert words into vectors:
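One common way to vectorize the texts is TF-IDF; a minimal scikit-learn sketch (toy corpus, not the article's data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the vectorizer on training texts only, then reuse it for test texts.
corpus = ["good dog", "green grass", "dog on grass"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: documents x vocabulary

print(X.shape)  # (3, 5) -- 3 documents, 5 distinct terms
```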
Train Logistic Regression with Label Powerset approach
Make predictions with your own examples once training is completed
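The scikit-multilearn library provides a ready-made `LabelPowerset` wrapper for this step; the sketch below reproduces the same idea with scikit-learn alone, so the transformation stays visible (toy data, illustrative names):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus; the article uses a private 17,979-document dataset.
train_texts = ["good dog", "green grass field", "dog runs on grass", "quiet nature walk"]
train_labels = [{"dog"}, {"grass"}, {"dog", "grass"}, {"nature"}]

# Label Powerset: encode each label combination as one class id.
combo_to_id, y = {}, []
for labels in train_labels:
    key = frozenset(labels)
    combo_to_id.setdefault(key, len(combo_to_id))
    y.append(combo_to_id[key])
id_to_combo = {i: set(k) for k, i in combo_to_id.items()}

# Vectorize and train an ordinary multi-class classifier on the ids.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_texts)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Predict on a new example and map the class id back to a label set.
pred_id = clf.predict(vectorizer.transform(["dog in the grass"]))[0]
print(id_to_combo[pred_id])
```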
Deep Learning Model Architecture
This architecture uses an attention mechanism that weights the hidden states by their importance for the predicted label. The attention weights can also be visualized, which makes the model's decisions more understandable.
For more detail please read: https://www.aclweb.org/anthology/P16-2034.pdf
The architecture of the constructed model is as follows:
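The attention layer in the referenced paper scores each hidden state, normalizes the scores with a softmax, and sums the states weighted by those scores. A minimal NumPy sketch of that computation (the shapes and random values are illustrative; in the real model the states come from a BiLSTM and `w` is learned):

```python
import numpy as np

rng = np.random.default_rng(0)

T, H = 5, 8                       # sequence length, hidden size
states = rng.normal(size=(T, H))  # hidden states, one per token
w = rng.normal(size=H)            # attention parameter vector

# Score each timestep, then normalize into a distribution over tokens.
scores = np.tanh(states) @ w
alpha = np.exp(scores) / np.exp(scores).sum()

# Sentence representation: attention-weighted sum of hidden states.
sentence = alpha @ states         # shape (H,)

print(alpha.round(3))             # these weights are what gets visualized
print(sentence.shape)
```

Because `alpha` sums to one over the tokens, plotting it per token is exactly the visualization mentioned above: high-weight tokens are the ones the model relied on for its prediction.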
Example prediction outputs:
These examples are the output of a model trained on Turkish; they were translated into English just to demonstrate how the attention layer can make the deep learning model more interpretable.
With Artiwise Analytics:
Create modules where you can configure the settings
For the module to be trained, select:
- Feature representation method (Existence, TF-IDF, etc.)
- Feature selector (Chi-Square, Mutual Information, etc.)
- Classification type (single, multi, etc.)
Upload your dataset, with the labels or tags you choose, as a file or through the API
After training, inspect the following metrics (overall or filtered by label):
- Precision, recall, accuracy
- Confusion Matrix
You can get a prediction for any text by sending it to the model you trained in the “Test” tab.
In addition, the interface lets you easily perform operations such as data analysis, adding rules, and editing categories.