Evolution Of Neural Networks for Sequence Tagging - Sustainable Customer Experience Platform

Sequence tagging is a broad research area in both traditional statistical linguistics and the machine learning era of NLP. The roots of current neural sequence tagging models lie in earlier probabilistic models.

Some of these probabilistic models use Maximum Likelihood Estimation or MAP (Maximum a posteriori) estimation: for example, Hidden Markov Models (HMMs), Maximum Entropy Markov Models (MEMMs), etc.

On the other hand, modern Sequence Tagging models use gradient-based approaches: for example, Recurrent Neural Networks (RNNs), Long Short-Term Memory Networks (LSTMs), etc.

Hidden Markov Models (HMMs)

A Hidden Markov Model (HMM) is a statistical Markov model built on the Markov assumption:

Markov Assumption.
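For reference, the first-order Markov assumption is usually written as follows. The symbols q_i (the state at position i) and a (a particular state value) follow the notation of Jurafsky and Martin and are an assumption here:

```latex
P(q_i = a \mid q_1, \ldots, q_{i-1}) = P(q_i = a \mid q_{i-1})
```

In words: the probability of the next state depends only on the current state, not on the full history.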

A Hidden Markov Model for tagging lets us talk about both observed events (the words) and hidden events (the part-of-speech tags) that we think of as causal factors in our probabilistic model. Formally, an HMM is defined as follows:

Definitions for HMMs.

For token classification, the transition probability is the probability of tag t_i given the previous tag t_{i-1}:

Transition probability.

and the emission probability is the probability of word w_i given its tag t_i:

Emission probability.
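The two probabilities above can be estimated by simple counting over a tagged corpus. The following is a minimal sketch, not x-tagger's implementation: the function name `estimate_hmm`, the start-of-sentence marker `<s>`, and the toy sentences are all hypothetical, and no smoothing is applied.

```python
from collections import Counter

def estimate_hmm(tagged_sents):
    """Count-based (maximum likelihood) estimates for a bigram HMM tagger."""
    tag_counts = Counter()         # C(t)
    transition_counts = Counter()  # C(t_{i-1}, t_i)
    emission_counts = Counter()    # C(t_i, w_i)
    for sent in tagged_sents:
        prev = "<s>"               # hypothetical start-of-sentence tag
        tag_counts[prev] += 1
        for word, tag in sent:
            transition_counts[(prev, tag)] += 1
            emission_counts[(tag, word)] += 1
            tag_counts[tag] += 1
            prev = tag
    # P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
    transition = {(p, t): c / tag_counts[p] for (p, t), c in transition_counts.items()}
    # P(w_i | t_i) = C(t_i, w_i) / C(t_i)
    emission = {(t, w): c / tag_counts[t] for (t, w), c in emission_counts.items()}
    return transition, emission

# toy corpus, for illustration only
toy = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]
transition, emission = estimate_hmm(toy)
```

On this toy corpus every DET is followed by a NOUN, so `transition[("DET", "NOUN")]` is 1.0, while "dog" accounts for half of the NOUN tokens, so `emission[("NOUN", "dog")]` is 0.5.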

“Training” an HMM amounts to calculating the emission and transition probabilities, which are estimated from counts in the tagged training dataset. The formulations above rely on the bigram Markov assumption, but HMMs can use longer n-gram histories (in practice, a trigram model is usually enough to reach good performance). Tagging unseen samples, or evaluation, is done by Viterbi decoding, a dynamic programming algorithm. Decoding chooses the tag sequence t that is most probable given the observation sequence w of n words, that is, the sequence t that maximizes P(t | w).


Using Bayes’ rule, we have:

Expanding objective.

The probabilities are obtained from the bigram assumption that was mentioned above.

Bigram assumption.

Plugging these approximations into the objective, we have:

Full form.
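Written as one chain, the decoding objective can be derived as follows. The subscript notation t_{1:n}, w_{1:n} follows Jurafsky and Martin and is an assumption here; the last step applies the two simplifications of a bigram tagger, namely that each word depends only on its own tag and each tag depends only on the previous tag:

```latex
\begin{aligned}
\hat{t}_{1:n}
  &= \arg\max_{t_{1:n}} P(t_{1:n} \mid w_{1:n}) \\
  &= \arg\max_{t_{1:n}} \frac{P(w_{1:n} \mid t_{1:n})\, P(t_{1:n})}{P(w_{1:n})} \\
  &= \arg\max_{t_{1:n}} P(w_{1:n} \mid t_{1:n})\, P(t_{1:n}) \\
  &\approx \arg\max_{t_{1:n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
\end{aligned}
```

The denominator P(w_{1:n}) is dropped in the third line because it is the same for every candidate tag sequence.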

The Viterbi decoder can be implemented with dynamic programming for tagging unseen samples or for evaluation:

Viterbi decoding pseudo code with dynamic programming.
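As a rough Python counterpart to the pseudo code, here is a minimal Viterbi sketch for a bigram HMM. It is not x-tagger's implementation: the function signature, the dictionary layout of the probability tables, and the toy numbers are all hypothetical. Probabilities missing from a table are treated as zero.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Most probable tag sequence under a bigram HMM (dynamic programming)."""
    # best[i][t]: probability of the best tag path for words[:i+1] that ends in tag t
    best = [{t: start_p.get(t, 0.0) * emit_p.get((t, words[0]), 0.0) for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] * trans_p.get((p, t), 0.0)) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score * emit_p.get((t, words[i]), 0.0)
            back[i][t] = prev
    # follow the back-pointers from the best final tag
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# toy model, for illustration only
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.8, "NOUN": 0.2}
trans_p = {("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.1,
           ("NOUN", "VERB"): 0.8, ("NOUN", "NOUN"): 0.2}
emit_p = {("DET", "the"): 1.0, ("NOUN", "dog"): 0.5,
          ("NOUN", "barks"): 0.1, ("VERB", "barks"): 0.5}
decoded = viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p)
```

The table has one cell per (position, tag) pair and each cell looks at every previous tag, so the cost grows with the sentence length times the square of the tagset size. Practical implementations usually work with log-probabilities to avoid numerical underflow on long sentences.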

Implementing HMM

Next, we work through an example with Hidden Markov Models using the x-tagger library. x-tagger is a Natural Language Processing toolkit for sequence labeling in its simplest form. For more information about x-tagger, you can visit its GitHub repository: https://github.com/safakkbilici/x-tagger

x-tagger provides a Hidden Markov Model with several extensions: bigram, trigram, deleted interpolation, morphological analyzer, and prior support. We use the CoNLL2000 dataset:

import nltk
from sklearn.model_selection import train_test_split

# requires the NLTK resources "conll2000" and "universal_tagset" (nltk.download)
data = list(nltk.corpus.conll2000.tagged_sents(tagset='universal'))
train_set, test_set = train_test_split(data, test_size=0.2)

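Then we create the HMM model and fit it:

from xtagger import HiddenMarkovModel

model = HiddenMarkovModel(
    extend_to = "bigram",
    language = "en",
)
model.fit(train_set)  # assumed call on the training split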

Now we can evaluate our model:

model.evaluate(  # assumed method name and test-set argument
    test_set,
    random_size = 20,
    seed = 15,
    eval_metrics = ['acc', 'classwise_f1'],
    result_type = "%",
)

The random_size parameter samples 20 test examples uniformly at random, which keeps evaluation tractable given the computational cost of the Viterbi decoder. The eval_metrics parameter accepts many built-in metrics; here we choose accuracy and class-wise F1. After evaluation, our performance metrics are:

{'acc': 89.08450704225352,
'classwise_f1': {'CONJ': 38.46,
'PRON': 100.0,
'VERB': 87.71,
'NOUN': 87.97,
'ADV': 72.72,
'.': 100.0,
'ADJ': 80.0,
'PRT': 100.0,
'ADP': 95.55,
'NUM': 100.0,
'DET': 98.87,
'X': 100.0}}

Now let’s predict!

s = ["There", "are", "no", "two", "words", "in", "the", "English",
     "language", "more", "harmful", "than", "good", "job"]
model.predict(s)

The output will be:

[('There', 'DET'),
('are', 'VERB'),
('no', 'DET'),
('two', 'NUM'),
('words', 'NOUN'),
('in', 'ADP'),
('the', 'DET'),
('English', 'ADJ'),
('language', 'NOUN'),
('more', 'ADV'),
('harmful', 'ADJ'),
('than', 'ADP'),
('good', 'ADJ'),
('job', 'NOUN')]

Recurrent Neural Network (RNN)

Recurrent Neural Networks have long been used for sequential tasks such as sequence tagging and sentence classification, and even for generative tasks such as summarization and autoregressive language modeling. Because the same parameters are shared across time steps, an RNN can take variable-length input [x1, x2, …, xn], provided the framework builds its computational graph dynamically.

RNNs with embedding layer.

We, Wh, and U are the trainable parameters of the RNN. Optionally, the word embeddings can be learned jointly with them. The forward propagation of an RNN is

RNN forward equation.

and the optimal parameters can be learned with backpropagation through time (BPTT).
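The forward pass can be sketched in NumPy as below. This assumes the common Elman form h_t = tanh(Wh h_{t-1} + We e_t + b_h) followed by a softmax over U h_t + b_y; the choice of tanh, the bias terms, the embedding matrix `E`, and all sizes are assumptions made for illustration and may differ from the equation in the figure.

```python
import numpy as np

def rnn_forward(token_ids, E, W_e, W_h, U, b_h, b_y):
    """Run a simple (Elman) RNN over one sequence; return per-step output probabilities."""
    h = np.zeros(W_h.shape[0])                  # h_0: initial hidden state
    probs = []
    for idx in token_ids:
        e = E[idx]                              # embedding lookup for the current token
        h = np.tanh(W_h @ h + W_e @ e + b_h)    # the same weights are reused at every step
        logits = U @ h + b_y
        exp = np.exp(logits - logits.max())     # numerically stable softmax
        probs.append(exp / exp.sum())
    return np.stack(probs)                      # shape: (sequence length, n_outputs)

rng = np.random.default_rng(0)
vocab, emb_dim, hidden_dim, n_tags = 10, 4, 5, 3
E = rng.normal(size=(vocab, emb_dim))
W_e = rng.normal(size=(hidden_dim, emb_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
U = rng.normal(size=(n_tags, hidden_dim))
b_h, b_y = np.zeros(hidden_dim), np.zeros(n_tags)

short = rnn_forward([1, 2, 3], E, W_e, W_h, U, b_h, b_y)
longer = rnn_forward([1, 2, 3, 4, 5, 6], E, W_e, W_h, U, b_h, b_y)
```

Because the loop reuses one set of weights, the same function handles sequences of length 3 and 6 without any change, which is the variable-length property described earlier.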

Sequence tagging with an RNN can be done by adding a feed-forward network on top of the RNN. The RNN outputs a tensor of shape (B x word_len x hidden_dim), where B is the batch size and word_len is the number of tokens per sequence. The feed-forward network then computes the tag probabilities for each word, giving a tensor of shape (B x word_len x n_tags).

Sequence tagging with RNNs.

Long Short-Term Memory

Long Short-Term Memory (LSTM) is a gated variant of the RNN. When an RNN is unrolled over many time steps, the gradient magnitude can shrink or grow as it is propagated back through them, which causes vanishing or exploding gradients. The LSTM was designed to mitigate these problems, the vanishing gradient in particular. Instead of a unit that simply applies an elementwise nonlinearity to the affine transformation of inputs and recurrent units, LSTM recurrent networks have “LSTM cells” that have an internal recurrence (a self-loop), in addition to the outer recurrence of the RNN. An LSTM cell has the following gates and states:

The forget gate controls what is kept vs. forgotten from the previous cell state:

Forget gate forward equation.

The input gate controls what parts of the new cell content are written to the cell:

Input gate forward equation.

The candidate cell state is the new cell content that may be written to the cell:

Update cell forward equation.

And the output gate controls what parts of the cell are output to a hidden state:

The output forward equation.
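Put together, one LSTM time step can be sketched in NumPy as follows. This uses the standard formulation; the per-gate dictionaries `W`, `U`, `b` and the sizes are hypothetical conventions chosen for readability. The sketch also spells out how the gates combine into the new cell state and hidden state.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b are dicts keyed by gate name: 'f', 'i', 'c', 'o'."""
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])        # forget gate
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])        # input gate
    c_tilde = np.tanh(W["c"] @ x + U["c"] @ h_prev + b["c"])  # candidate cell content
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])        # output gate
    c = f * c_prev + i * c_tilde   # keep part of the old cell, write part of the new content
    h = o * np.tanh(c)             # expose part of the cell as the hidden state
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
W = {g: rng.normal(size=(hidden_dim, input_dim)) for g in "fico"}
U = {g: rng.normal(size=(hidden_dim, hidden_dim)) for g in "fico"}
b = {g: np.zeros(hidden_dim) for g in "fico"}
h, c = lstm_step(rng.normal(size=input_dim), np.zeros(hidden_dim), np.zeros(hidden_dim), W, U, b)
```

The additive update of `c` is the internal self-loop mentioned above: when the forget gate is close to 1, the old cell content passes through almost unchanged, which helps gradients survive over long spans.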

The same sequence tagging approach used for RNNs applies to LSTMs: adding a classification layer (a feed-forward network) on top is enough.


Bidirectional LSTM

Language does not only carry left-to-right context. Both syntax and semantics depend on context in both directions. Consider an elementary sentence: “The rat ate cheese.” The words “rat” and “cheese” are related: rats generally eat cheese, and cheese is generally eaten by rats. A model therefore benefits from knowing both the past and the future words, and sequence labeling can exploit this relationship. A similar bidirectional relationship exists between an adjective and a verb. In general, any property of the current word can be predicted more effectively when the context on both sides is used. This matters all the more because word order differs across languages depending on grammatical structure.

Bidirectional LSTM (PyTorch implementation).

The forward equations are the same in each direction and can be computed concurrently. The outputs of the two directions are concatenated, and backpropagation is then applied through both.
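A minimal bidirectional LSTM tagger can be sketched with PyTorch's `nn.LSTM` as below. This is not x-tagger's implementation; the class name `BiLSTMTagger` and all sizes are hypothetical. It shows how the concatenated directions feed the classification layer and which tensor shapes result.

```python
import torch
from torch import nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, n_tags, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        # the two directions are concatenated, so the classifier sees 2 * hidden_dim features
        self.classifier = nn.Linear(2 * hidden_dim, n_tags)

    def forward(self, token_ids):                 # (B, word_len)
        embedded = self.embedding(token_ids)      # (B, word_len, emb_dim)
        outputs, _ = self.lstm(embedded)          # (B, word_len, 2 * hidden_dim)
        return self.classifier(outputs)           # (B, word_len, n_tags)

model = BiLSTMTagger(vocab_size=100, emb_dim=16, hidden_dim=32, n_tags=12)
batch = torch.randint(1, 100, (8, 20))            # 8 sentences of 20 token ids each
logits = model(batch)
```

The returned values are unnormalized scores per tag; applying a softmax over the last dimension (or training with a cross-entropy loss, which does so internally) turns them into tag probabilities.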

Most state-of-the-art models for sequence labeling include bidirectional models, and the Bi-LSTM is one of these models.

Next, we give an example with bidirectional LSTMs. x-tagger provides a built-in LSTM tagger module. We use the CoNLL2000 dataset and prepare it as a PyTorch dataset:

import nltk
import torch
from sklearn.model_selection import train_test_split
from xtagger import xtagger_dataset_to_df, df_to_torchtext_data

data = list(nltk.corpus.conll2000.tagged_sents(tagset='universal'))
train_set, test_set = train_test_split(data, test_size=0.2)

df_train = xtagger_dataset_to_df(train_set)
df_test = xtagger_dataset_to_df(test_set)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iter, _, test_iter, TEXT, TAGS = df_to_torchtext_data(
    df_train, df_test, device,  # assumed leading arguments
    batch_size = 32
)

Let’s create our bi-LSTM model.

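from xtagger import LSTMForTagging

input_dim = len(TEXT.vocab)
out_dim = len(TAGS.vocab)
pad_idx = TEXT.vocab.stoi[TEXT.pad_token]
tag_pad_idx = TAGS.vocab.stoi[TAGS.pad_token]

# same constructor arguments as the reload example further below
model = LSTMForTagging(input_dim, out_dim, TEXT, TAGS, cuda=True)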

For monitoring, saving, and loading the best model, x-tagger provides a checkpointing class:

from xtagger import Checkpointing

checkpointing = Checkpointing(
    model_path = "./",
    model_name = "lstm_tagger.pt",
    monitor = "eval_acc",
    mode = "maximize",
    verbose = 1
)

Then, we fit our model.

model.fit(
    train_iter, test_iter,  # assumed data arguments
    epochs = 3,
    eval_metrics=["acc", "avg_f1"],
    checkpointing = checkpointing
)

The best model is saved with metrics:

{'eval': {'acc': 96.8260961173789, 'avg_f1': {'weighted': 96.82928259419015, 'micro': 96.8260961173789, 'macro': 82.06780272336617}}, 
'train': {'acc': 99.08795197610405, 'avg_f1': {'weighted': 99.0761306178256, 'micro': 99.08795197610405, 'macro': 83.83178471296974}},
'eval_loss': 0.09836322031375291,
'train_loss': 0.029670139710099377}

The saved model can later be reloaded and tested on a different dataset:

model = LSTMForTagging(input_dim, out_dim, TEXT, TAGS, cuda=True)
model = checkpointing.load(model)

Now let’s predict:

s = ["The", "next", "Charlie", "Parker", "would", "never", "be", "discouraged"]
model.predict(s)  # assumed to mirror the HMM example above


Output is:

[('The', 'DET'),
('next', 'ADJ'),
('Charlie', 'NOUN'),
('Parker', 'CONJ'),
('would', 'VERB'),
('never', 'ADV'),
('be', 'VERB'),
('discouraged', 'VERB')]


BERT For Token Classification (Devlin et al., 2018).

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained bidirectional language model. It consists of a stack of encoder layers from the original Transformer architecture. During pre-training, BERT’s objective is masked language modeling (MLM), which does not require labeled data. To pre-train BERT, input segments are drawn from an unlabeled corpus, and tokens in each segment are randomly replaced with a special token [MASK]. The objective of the masked language model is to maximize the log-likelihood of the masked tokens x̃ given the masked input segment x̂ of sequence length T.
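One common way to write this objective is shown below. The symbols are assumptions chosen to match the description above: \hat{x} is the masked input segment, \tilde{x} the set of masked tokens, \theta the model parameters, and m_t an indicator that equals 1 when position t is masked and 0 otherwise.

```latex
\max_{\theta} \; \log p_{\theta}(\tilde{x} \mid \hat{x})
  \approx \sum_{t=1}^{T} m_t \, \log p_{\theta}(x_t \mid \hat{x})
```

The approximation reflects that the masked tokens are predicted independently of one another given the masked segment.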

Masking is applied at random to 15% of the tokens in the input segment. The masking is static: the mask for a given input segment is fixed once and does not change from epoch to epoch.
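The idea of static masking can be illustrated with a small sketch. It is a simplification, not BERT's preprocessing code: the function name `mask_segment` and the fixed seed are hypothetical, and BERT's additional rule of sometimes substituting a random token or keeping the original token at a selected position is left out.

```python
import random

MASK = "[MASK]"

def mask_segment(tokens, mask_prob=0.15, seed=0):
    """Replace about mask_prob of the tokens with [MASK]; return (masked tokens, targets)."""
    rng = random.Random(seed)                     # fixed seed: the same mask on every call
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}   # what the model must predict
    return masked, targets

tokens = [f"w{i}" for i in range(20)]             # placeholder segment of 20 tokens
epoch_1 = mask_segment(tokens)
epoch_2 = mask_segment(tokens)
```

Because the random generator is seeded per call, `epoch_1` and `epoch_2` are identical, which is what "static" means here: the segment is masked the same way every time it is seen. With 20 tokens and a rate of 15%, three positions are masked.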

BERT is fine-tuned by using the contextual representations of tokens. For sentence classification, the contextualized representation of [CLS], the special first token of the input segment, can be fed to a fully connected layer for fine-tuning.

For the sequence labeling task, passing the contextualized representations of all tokens from BERT to a classifier layer (e.g., a feed-forward network) gives us the syntactic tags of our tokens.

Next, we give an example with BERT. x-tagger provides a built-in BERT tagger module. We use the CoNLL2000 dataset and prepare it as a PyTorch dataset:

import nltk
import torch
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer
from xtagger import xtagger_dataset_to_df, df_to_torchtext_data

data = list(nltk.corpus.conll2000.tagged_sents(tagset='universal'))
train_set, test_set = train_test_split(data, test_size=0.2)

df_train = xtagger_dataset_to_df(train_set)
df_test = xtagger_dataset_to_df(test_set)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

train_iter, _, test_iter, TEXT, TAGS = df_to_torchtext_data(
    df_train, df_test, device,  # assumed leading arguments
    transformers = True,
    tokenizer = tokenizer,
)

Let’s create our BERT For Tagging model:

from xtagger import BERTForTagging

model = BERTForTagging(
    output_dim = len(TAGS),
    dropout = 0.2,
    device = device,
    cuda = True
)

And fit it:

model.fit(
    train_iter, test_iter,  # assumed data arguments
    eval_metrics = ["acc", "avg_f1"],
    epochs = 10
)

The same monitoring, evaluation, and prediction functions are valid for BERT as well.


  • Dan Jurafsky and James H. Martin, “Speech and Language Processing”, second edition.
  • Ian Goodfellow, Yoshua Bengio and Aaron Courville, “Deep Learning”.
