Language data is now everywhere on the Internet, and most of it is disorderly. Think of the tweets you scroll past in your feed, or the reviews of the tech gear you have wanted to buy for some time. They are usually full of punctuation errors. Even we humans sometimes have a hard time telling where one sentence ends and the next begins in this kind of content. Another example is “SEO-based” news articles: most of the content is filler such as “Find out more about that news in …” or “It is talked about a lot on the Internet and ..”, while the information you want is only a sentence long. Text segmentation addresses these problems by extracting the important subset of the content.
Text segmentation is the task of extracting relevant sub-units such as words, phrases, sentences and paragraphs from a text so that they can be used in downstream tasks. These tasks include:
- Word Segmentation
- Sentence Boundary Detection
- Text Summarization
- Sentiment Analysis
A useful way to look at text segmentation is as a sequence labeling task. Other sequence labeling models assign labels to a span of words that carries semantic information, such as the type of entity a token refers to or the kind of phrase it belongs to. Seen this way, whether we are dealing with a text with misused punctuation, a semantically dense part of a paragraph, or a sub-sentence that carries a sentiment, text segmentation is the task of labeling a subsequence. Having outlined how segmentation yields fine-grained, semantically dense sub-units for text analysis, we will now look in detail at how to adapt it to Aspect Based Sentiment Analysis (ABSA).
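To make the sequence labeling view concrete, here is a minimal sketch that turns a segmentation into per-token labels and back. The `B`/`I` tag names (begin / inside a segment) and both function names are assumptions chosen for illustration; they are not taken from a specific paper or library.

```python
def segments_to_labels(segments):
    """Flatten a list of segments (each a list of tokens) into tokens and B/I labels."""
    tokens, labels = [], []
    for segment in segments:
        for position, token in enumerate(segment):
            tokens.append(token)
            # The first token of every segment opens it; the rest continue it.
            labels.append("B" if position == 0 else "I")
    return tokens, labels


def labels_to_segments(tokens, labels):
    """Rebuild the segments from per-token B/I labels."""
    segments = []
    for token, label in zip(tokens, labels):
        if label == "B" or not segments:
            segments.append([token])
        else:
            segments[-1].append(token)
    return segments


segments = [["I", "like", "the", "ambiance"], ["but", "the", "food", "was", "terrible"]]
tokens, labels = segments_to_labels(segments)
print(list(zip(tokens, labels)))
```

A sequence labeling model then only has to predict one label per token, and the segments follow from the labels.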
Background
Research on text segmentation goes back to the 1990s. The literature starts with rule-based models, which try to capture segments through recurring patterns in texts. Probabilistic models followed. These usually rely on lexical cohesion: the observation that related words tend to occur within a short window of each other. Because topic segmentation is central to these models, Latent Dirichlet Allocation (LDA) appears in several of them, where it is used to extract the most probable set of words.
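Lexical cohesion can be illustrated with a small sketch: score every gap between sentences by the word overlap of the sentences on either side, and treat the weakest gap as a candidate boundary. This is a simplification in the spirit of the early cohesion-based methods, not an implementation of any of them; the stopword list, the window size and the function names are assumptions.

```python
import math
import re
from collections import Counter

# Tiny, hypothetical stopword list so function words do not count as cohesion.
STOPWORDS = {"the", "was", "and", "to", "a"}


def bag_of_words(sentences):
    """Count the content words in a block of sentences."""
    counts = Counter()
    for sentence in sentences:
        words = re.findall(r"[a-z']+", sentence.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cohesion_scores(sentences, window=2):
    """Similarity between the `window` sentences before and after each gap."""
    scores = []
    for gap in range(1, len(sentences)):
        left = bag_of_words(sentences[max(0, gap - window):gap])
        right = bag_of_words(sentences[gap:gap + window])
        scores.append(cosine(left, right))
    return scores


def weakest_gap(sentences, window=2):
    """Index of the sentence that starts after the least cohesive gap."""
    scores = cohesion_scores(sentences, window)
    return min(range(len(scores)), key=scores.__getitem__) + 1


sentences = [
    "The pasta was fresh and the pasta sauce was rich.",
    "The sauce and the pasta came quickly.",
    "The waiter was rude to our table.",
    "The waiter ignored our table twice.",
]
print(cohesion_scores(sentences), weakest_gap(sentences))
```

In this toy review the gap between the food sentences and the service sentences shares no content words, so it gets the lowest score.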
Popular Approaches in Early Work:
- TextTiling
- C99
- CVS
- Bayesseg
- PLDA
- Bayesseg-MD
- MultiSeg
- BeamSeg
These algorithms were published between 1997 and 2015. With the rise of neural networks, the field has shifted towards neural models over time. With the idea of sequence labeling in mind, we see architectures such as CRFs, RNNs, Transformers and even fine-tuned BERT models.
How to Adapt Segmentation Algorithms to ABSA?
Before we dive into the different types of segmentation models, we need to talk about ABSA. The ABSA literature makes scarce use of segmentation, so we need to adapt segmentation to this task, as discussed above. The Aspect Based Sentiment Analysis literature defines three subtasks:
- ATE (Aspect Term Extraction) is the task of identifying the target term(s) that the sentiment is expressed about.
- ACD (Aspect Category Detection) is the task of identifying the target category/topic that the sentiment is expressed about. This is usually for commercial usage.
- APC (Aspect Polarity Classification) is the task of identifying the sentiment (positive, negative or neutral) expressed about the target term or category.
The literature takes two separate approaches: either a joint model that carries out ATE and APC at the same time, or a pipeline that runs them in sequence (ATE, then APC). In ATE and ACD, we use knowledge of which topic the sentence is about. Here we can use the topic segmentation literature to extract the segments about a topic, or discourse segmentation to eliminate irrelevant sentences from the data. These segmentation types are explained later. In APC, however, we need sentiment-based segmentation.
These tasks are easier in the academic literature, since the datasets are usually orderly: separated and labeled sentence by sentence. Real-life data is more complex. Because human-generated data is full of mistakes, determining where one sentence ends and another begins is a task in itself. We therefore use Sentence Boundary Detection (SBD) on larger bodies of text that contain multiple sentences. The literature has SBD models for Chinese orthography or financial texts, but in real life the majority of data comes from social media.
Along with erroneous punctuation, we also see occurrences such as “!!!”, “:)))” or “ok…..”, which cannot be handled with regular expressions alone. Although, contrary to the shift towards neural models described above, the majority of the SBD literature still relies on rule-based algorithms, there are studies that use architectures such as CRF, BiLSTM and BERT.
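A rule-based baseline is still a useful starting point, if only to see where it breaks. The sketch below treats a run of terminal marks, optionally followed by a simple emoticon, as one boundary. The pattern, the emoticon character set and the function name are assumptions made for this example, not a published rule set.

```python
import re

# A run of terminal marks (".", "!", "?", "…"), optionally followed by a simple
# emoticon such as ":)))", or an emoticon on its own. The match must be followed
# by whitespace or the end of the text, so "3.5" is not split.
BOUNDARY = re.compile(r"(?:[.!?…]+(?:\s*[:;]-?[)(DP]+)?|[:;]-?[)(DP]+)(?=\s|$)")


def split_sentences(text):
    """Split text after every boundary match; keep the marks with the sentence."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        chunk = text[start:match.end()].strip()
        if chunk:
            sentences.append(chunk)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("ok..... the food was great!!! loved it :))) will come again"))
```

The same rules split wrongly after abbreviations such as "e.g." and find nothing when punctuation is missing altogether, which is where the learned models mentioned above come in.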
However, we cannot rely on sentence splitting alone for ABSA, since there can be segments with no aspects or segments containing multiple aspects at once. We therefore need segments that are filtered by relevance, or segments that are smaller than a sentence. This approach is referred to as the Compositional Approach to Sentiment Analysis [2]. For example:
I like the ambiance of this restaurant. However the food was not that good
- In this example, two aspects are expressed: the ambiance and the food quality. The spans containing each aspect and its sentiment are divided by a sentence boundary.
- Here, sentence boundary detection is helpful for aspect based sentiment analysis, and we separate the text as below:
“I like the ambiance of this restaurant.”
Aspect: ambiance
Polarity: positive
“However the food was not that good”
Aspect: food quality
Polarity: negative
- However, suppose we encounter a sentence like:
I like the ambiance but the food was terrible.
- There are two aspects expressed in one sentence. There is no sentence separator between them, so how can we separate them?
- Here we must extract sub-units that are smaller than sentences:
“I like the ambiance”
Aspect: ambiance
Polarity: positive
“but the food was terrible.”
Aspect: food quality
Polarity: negative
- This approach looks deeper and deeper into a body of text to find a unit composed of one aspect and one sentiment for the analysis. In short, it is a compositional approach to extracting aspect-sentiment pairs.
- To apply the compositional approach in ABSA, we need categorical information to process the texts. We can use the two main information sources of ABSA: aspect information and sentiment information. We will investigate how extraction can be done with each of them.
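As a first approximation of sub-sentence segmentation, the restaurant example above can be split at contrastive connectives. This is only a sketch: the connective list is a small hypothetical one, and real systems rely on discourse segmenters rather than a keyword list.

```python
import re

# Hypothetical, deliberately small list of contrastive connectives.
CONNECTIVES = ("but", "however", "although", "whereas")

# Split at the whitespace (and optional comma) right before a connective,
# so the connective stays at the start of the following unit.
SPLIT_BEFORE = re.compile(
    r",?\s+(?=(?:" + "|".join(CONNECTIVES) + r")\b)", re.IGNORECASE
)


def split_clauses(sentence):
    """Return the sub-sentence units of a sentence, in order."""
    return [part.strip() for part in SPLIT_BEFORE.split(sentence) if part.strip()]


print(split_clauses("I like the ambiance but the food was terrible."))
```

Each resulting unit can then be passed to aspect and polarity classification on its own. A sentence with two aspects and no connective is left whole, so this keyword rule is a baseline rather than a solution.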
Aspect Polarity Based Segmentation
Few studies apply this approach to ABSA. Their focus is to extract the segment to which a sentiment is assigned; in essence, they segment a body of text according to the polarity of its sub-units. For example:
[I don’t know.] → neutral
[I like the restaurant] → positive
[but not the food.] → negative
These studies rely heavily on the syntactic structure of sentences. There are rule-based studies that use heuristic rules to extract sentiment information, such as:
- If a “sentiment-denoting” adjective such as “good” comes before a noun such as “food”, then the segment is “good food”, or
- If there is a “sentiment-denoting” verb such as “hate” with objects such as “service” and “restaurant”, then the segment is “I hate the service and restaurant”.
Later deep learning studies use Recursive Neural Networks to learn such rules in an unsupervised manner. However, these studies have not produced promising results in the long term. We therefore rely mainly on the approach described next, which extracts segments by topic.
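The two rules above can be sketched with toy lexicons. Everything in this block is an assumption made for illustration: the word lists are invented, and a real system would use a POS tagger, a dependency parser and a sentiment lexicon instead of hard-coded sets.

```python
# Toy lexicons, invented for this example.
SENTIMENT_ADJECTIVES = {"good": "positive", "great": "positive",
                        "bad": "negative", "terrible": "negative"}
SENTIMENT_VERBS = {"like": "positive", "love": "positive", "hate": "negative"}
ASPECT_NOUNS = {"food", "service", "restaurant", "ambiance"}


def extract_segments(sentence):
    """Return (segment, polarity) pairs found by the two heuristic rules."""
    tokens = sentence.lower().replace(".", "").replace(",", "").split()
    segments = []
    for i, token in enumerate(tokens):
        # Rule 1: a sentiment adjective immediately before an aspect noun.
        if (token in SENTIMENT_ADJECTIVES and i + 1 < len(tokens)
                and tokens[i + 1] in ASPECT_NOUNS):
            segments.append((" ".join(tokens[i:i + 2]), SENTIMENT_ADJECTIVES[token]))
        # Rule 2: a sentiment verb with aspect nouns after it. The segment runs
        # from the token before the verb (the subject) to the last aspect noun.
        if token in SENTIMENT_VERBS:
            noun_positions = [j for j in range(i + 1, len(tokens))
                              if tokens[j] in ASPECT_NOUNS]
            if noun_positions:
                start = max(i - 1, 0)
                end = noun_positions[-1] + 1
                segments.append((" ".join(tokens[start:end]), SENTIMENT_VERBS[token]))
    return segments


print(extract_segments("I hate the service and restaurant"))
print(extract_segments("They serve good food."))
```

Rule 2 is greedy here: it reaches to the last aspect noun in the sentence, so it over-extends when a second clause follows. That kind of brittleness is typical of hand-written rules.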
Aspect Category Based Segmentation
This is the most prominent line of work for ABSA. As discussed before, Aspect Category Detection detects the category of the aspect(s) about which the sentiment is expressed. With this information, we want to extract segments that are topically coherent and discard those that are not. Two types of segmentation can be applied to this end:
- Topic Segmentation extracts segments that are topically coherent.
- Discourse Segmentation extracts Elementary Discourse Units (EDUs) and other discourse units.
Topic segmentation is traditionally applied to a large body of text, such as an article or a book chapter, to extract the parts with distinguishable differences in topic. These studies interact with discourse segmentation, because discourse segmentation extracts “category-independent” segments made of linguistic sub-units. To further demonstrate this:
- This is an output from the SEGBOT model which gives clause-like units that serve as building blocks for discourse parsing and topic segmentation.
- There we have the elementary units to evaluate the aspect information and sentiment information.
- You can further look at the demo from here.
The literature on this started with TextTiling (Hearst, 1997) and was followed by many probabilistic, heuristic, machine learning and deep learning approaches. The following studies may help you understand topic segmentation better:
- TextTiling is an unsupervised technique that makes use of patterns of lexical co-occurrence and distribution within texts.
- C99 is a method for linear text segmentation, which replaces inter-sentence similarity by rank in local context.
- TopSeg is based on probabilistic latent semantic analysis (PLSA) and exploits similarities in word meaning detected by PLSA.
- TopicTiling modifies TextTiling with topic IDs, obtained by an LDA model, instead of words.
- BiLSTM-CRF is a state-of-the-art neural architecture for sequence labeling. The sequence labeling approach to segmentation is implemented in this model.
The later studies consist of deep learning approaches:
- SEGBOT first encodes the input text sequence with a bidirectional recurrent neural network. It then uses another recurrent neural network, together with a pointer network, to select text boundaries in the input sequence.
- BiLSTM-CNN uses CNNs to learn sentence embeddings. The segments are then predicted from contextual information by an attention-based BiLSTM model.
There are a number of tools we can use for our task, but how are we going to train our models? Below are the main datasets used in topic segmentation.
Related Datasets
These datasets are used for topic and discourse segmentation tasks. However, we suggest that they can be adapted to other segmentation tasks too. While the Choi dataset and the Wiki-727K dataset are for topic segmentation, the RST-DT dataset is for discourse segmentation.
Choi Dataset: The most common dataset used to train segmentation models. It consists of 700 documents, each a concatenation of 10 segments. The corpus was generated by an automatic procedure: each segment is the first n sentences (3 ≤ n ≤ 11, 4 subsets in total) of a randomly selected document from the Brown corpus.
The WIKI-727K Dataset: A collection of 727,746 English Wikipedia documents and their hierarchical segmentation, as it appears in their tables of contents.
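The generation procedure described for the Choi dataset is simple enough to sketch. The function below is an illustration under stated assumptions, not the original generation script: `source_documents` stands in for the Brown corpus documents (already split into sentences), and the function name, the seed handling and the returned boundary format are hypothetical.

```python
import random


def make_choi_style_document(source_documents, n_segments=10,
                             min_len=3, max_len=11, seed=0):
    """Concatenate the first n sentences of randomly chosen documents.

    source_documents: list of documents, each a list of sentences.
    Returns the sentences and the indices at which a new segment starts.
    """
    rng = random.Random(seed)
    sentences, segment_ends = [], []
    for _ in range(n_segments):
        document = rng.choice(source_documents)
        n = rng.randint(min_len, max_len)  # inclusive on both ends
        sentences.extend(document[:n])     # the first n sentences
        segment_ends.append(len(sentences))
    # The end of the last segment is the end of the document, not a boundary.
    return sentences, segment_ends[:-1]


# Stand-in corpus: 5 fake documents of 12 sentences each.
corpus = [[f"doc{d} sentence {i}" for i in range(12)] for d in range(5)]
sentences, boundaries = make_choi_style_document(corpus)
print(len(sentences), boundaries)
```

Because the boundaries are known by construction, such synthetic documents come with gold labels for free. Narrowing `min_len` and `max_len` would give subsets with different segment lengths.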
RST-DT Dataset: This dataset is used in discourse segmentation models. The Rhetorical Structure Theory Discourse Treebank (RST-DT) is a publicly available corpus, manually annotated with Elementary Discourse Unit (EDU) segmentation and discourse relations according to Rhetorical Structure Theory. The RST-DT corpus is partitioned into a training set of 347 articles (6,132 sentences) and a test set of 38 articles (991 sentences), both from the Wall Street Journal.
Evaluation Metrics
Standard evaluation metrics that we often see in the machine learning literature (Precision, Recall and F1 score) are not well suited to this task, because they penalize a boundary that is off by one sentence as heavily as one that is missed entirely. Even so, part of the literature still reports these metrics in its evaluation sections.
Pk is the probability that, when a sliding window of size k is passed over the sentences, the sentences at the two ends of the window are incorrectly classified as belonging to the same segment (or vice versa).
$$ P_k = \sum_{1 \le s \le t \le T} \mathbb{1}\left(\delta_{\mathrm{tru}}(s, t) \ne \delta_{\mathrm{hyp}}(s, t)\right) $$
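The sketch below computes Pk in the windowed form the description suggests: only pairs (s, t) with t = s + k are compared, and the error count is divided by the number of windows. That restriction and normalization, the default choice of k (half the mean reference segment length) and the input format (one segment id per sentence) are assumptions of this example.

```python
def count_segments(segment_ids):
    """Number of segments in a sequence of per-sentence segment ids."""
    return 1 + sum(1 for a, b in zip(segment_ids, segment_ids[1:]) if a != b)


def pk(reference, hypothesis, k=None):
    """Pk error for two segmentations given as one segment id per sentence.

    Segment ids must be unique per segment, e.g. [0, 0, 0, 1, 1, 2].
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    n = len(reference)
    if k is None:
        # Assumed convention: half the mean reference segment length.
        k = max(1, round(n / (2 * count_segments(reference))))
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of sentences")
    errors = 0
    for s in range(n - k):
        t = s + k
        ref_same = reference[s] == reference[t]    # delta_tru(s, t)
        hyp_same = hypothesis[s] == hypothesis[t]  # delta_hyp(s, t)
        errors += ref_same != hyp_same
    return errors / (n - k)


reference = [0, 0, 0, 1, 1, 1]
print(pk(reference, reference, k=2))           # identical segmentations
print(pk(reference, [0, 0, 0, 0, 0, 0], k=2))  # boundary missed entirely
```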
WindowDiff moves a fixed-size window across the text and penalizes the algorithm whenever the number of boundaries within the window does not match the true number of boundaries for that window of text.
$$ \mathrm{WindowDiff} = \frac{1}{N-k}\sum_{i=0}^{N-k} \mathbb{1}\left(R_{i,i+k} \ne C_{i,i+k}\right) $$
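WindowDiff is equally short in code. This sketch assumes each segmentation is given as a 0/1 list with a 1 wherever a boundary follows the sentence, and it slides exactly N − k windows so that the score stays between 0 and 1; implementations differ slightly in how they count the last window. The function name and input format are choices made for this example.

```python
def window_diff(reference, hypothesis, k):
    """WindowDiff for two segmentations given as 0/1 boundary indicators.

    reference[i] == 1 means a boundary follows sentence i.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    n = len(reference)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of sentences")
    mismatches = 0
    for i in range(n - k):
        r = sum(reference[i:i + k])   # reference boundaries in the window
        c = sum(hypothesis[i:i + k])  # computed boundaries in the window
        if r != c:
            mismatches += 1
    return mismatches / (n - k)


reference = [0, 0, 1, 0, 0, 0]
print(window_diff(reference, reference, k=2))
print(window_diff(reference, [0, 0, 0, 0, 0, 0], k=2))
```

Unlike Pk, the comparison counts boundaries inside the window, so a hypothesis that places two boundaries where the reference has one is also penalized.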
For both metrics, a lower score means a more accurate segmentation.
The glossary for the formulas:
- δ(s, t): indicator of whether sentences s and t belong to the same segment, in the true (tru) or the hypothesized (hyp) segmentation
- k: the window size
- N: the total number of sentences
- R: the number of reference boundaries in the window from i to i+k
- C: the number of computed boundaries in the same window
Useful Links
- https://github.com/koomri/text-segmentation
- https://github.com/ldulcic/text-segmentation
- https://github.com/pinkeshbadjatiya/neuralTextSegmentation
- https://github.com/ReemHal/Semantic-Text-Segmentation-with-Embeddings
References
- Pak, Irina, and Phoey Lee Teh. “Text segmentation techniques: a critical review.” Innovative Computing, Optimization and Its Applications (2018): 167-181.
- Kaur, J., & Singh, J. (2019). Deep Neural Network Based Sentence Boundary Detection and End Marker Suggestion for Social Media Text. 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS).doi:10.1109/icccis48478.2019.8974495
- Du, Jinhua, Yan Huang, and Karo Moilanen. “AIG Investments. AI at the FinSBD task: Sentence boundary detection through sequence labelling and BERT fine-tuning.” Proceedings of the First Workshop on Financial Technology and Natural Language Processing. 2019.
- C. R. Aydin and T. Güngör, “Combination of Recursive and Recurrent Neural Networks for Aspect-Based Sentiment Analysis Using Inter-Aspect Relations,” in IEEE Access, vol. 8, pp. 77820-77832, 2020.
- Kayaalp, Naime F., et al. “Extracting customer opinions associated with an aspect by using a heuristic based sentence segmentation approach.” International Journal of Business Information Systems 26.2 (2017): 236-260.
- J. Li, B. Chiu, S. Shang and L. Shao, “Neural Text Segmentation and Its Application to Sentiment Analysis,” in IEEE Transactions on Knowledge and Data Engineering.
- Badjatiya, Pinkesh, et al. “Attention-based neural text segmentation.” European Conference on Information Retrieval. Springer, Cham, 2018.
- Pevzner, L., & Hearst, M. A. (2002). A Critique and Improvement of an Evaluation Metric for Text Segmentation. Computational Linguistics, 28(1), 19–36.