Overview of Unsupervised Methods for Extractive Summarization

I. Introduction

In short, extractive summarization selects sentences from the source text to construct a summary.


The challenge lies in capturing the context of each sentence and identifying the relationships between sentences and words.

II. Techniques

Algorithm: TextRank
Description: TextRank is a graph-based ranking algorithm inspired by PageRank. It builds a graph whose nodes are words or sentences: words are connected when they appear near each other in the text, and sentences are connected with a similarity based on the number of words they share.
Weakness: Relies on surface word overlap, so it may not capture complex relationships between sentences accurately.
Link: TextRank: A Graph-Based NLP Algorithm : Networks Course blog for INFO 2040/CS 2850/Econ 2040/SOC 2090

Algorithm: LexRank
Description: LexRank is similar to TextRank but measures sentence similarity with the cosine similarity of TF-IDF sentence vectors. It is more tailored to extracting information from multiple texts written about the same topic.
Weakness: May not perform well on a set of unclustered or unrelated documents.
Link: LexRank: Graph-based Lexical Centrality as Salience in Text Summarization

Algorithm: LSA
Description: LSA creates a term-sentence matrix (the frequency of each word within each sentence of the document), then applies SVD (Singular Value Decomposition) to learn about relationships between words and sentences.
Weakness: Struggles with polysemy, since each word gets a single representation regardless of its sense; synonyms are captured only to the extent that they appear in similar contexts.
Link: Latent Semantic Analysis

Algorithm: KMeans Clustering
Description: KMeans Clustering groups sentence vectors into clusters. The sentences whose vectors are closest to the cluster centroids are included in the summary.
Weakness: The number of clusters k must be chosen before training, and finding the best value is difficult.
Link: KMeans Clustering with TF-IDF and KMeans Clustering with word2vec

Algorithm: HDBSCAN
Description: work in progress
Weakness: work in progress
Link: work in progress

Algorithm: Mean shift
Description: work in progress
Weakness: work in progress
Link: work in progress
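
To make the graph-based idea concrete, here is a minimal sketch of TextRank-style sentence ranking using only the Python standard library. The function names (tokenize, overlap_similarity, textrank, summarize), the regex tokenizer, the use of unique words per sentence, and the default damping factor and iteration count are assumptions made for illustration, not a reference implementation.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of unique word tokens."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def overlap_similarity(a, b):
    """Shared-word count normalized by the log of each sentence's length."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # log(1) = 0 could make the denominator zero
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences with a PageRank-style iteration on the similarity graph."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Weighted adjacency matrix; a sentence is not linked to itself.
    weights = [[0.0 if i == j else overlap_similarity(tokens[i], tokens[j])
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbor j passes on a share of its score proportional
            # to the weight of the edge between j and i.
            rank = sum(weights[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) + damping * rank)
        scores = new_scores
    return scores


def summarize(sentences, k=2):
    """Return the k highest-scoring sentences in their original order."""
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

A sentence that shares no words with the others receives no score from its neighbors and ends up with the minimum score of 1 - damping, which illustrates the weakness noted for TextRank: a relevant sentence phrased with different vocabulary would be ranked low.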

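The clustering approach can be sketched in the same spirit. The example below builds simple TF-IDF sentence vectors, runs a plain KMeans loop, and picks the sentence closest to each centroid. Everything here is an illustrative assumption: the function names, the smoothed IDF formula, the Euclidean distance, and the choice to seed the centroids with the first k vectors (practical implementations typically use a better initialization). LexRank would start from the same kind of TF-IDF vectors but compare them with cosine similarity inside a graph instead of clustering them.

```python
import math
import re
from collections import Counter


def tfidf_vectors(sentences):
    """Return one TF-IDF vector (a list of floats) per sentence."""
    docs = [Counter(re.findall(r"[a-z0-9']+", s.lower())) for s in sentences]
    vocab = sorted({word for doc in docs for word in doc})
    n = len(docs)
    # Smoothed IDF so that words found in every sentence keep a positive weight.
    idf = {w: math.log((1 + n) / (1 + sum(w in d for d in docs))) + 1 for w in vocab}
    return [[doc[w] * idf[w] for w in vocab] for doc in docs]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, iterations=20):
    """Plain KMeans loop; the first k vectors seed the centroids (a simplification)."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: distance(v, centroids[c]))
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def summarize(sentences, k=2):
    """Return the sentence nearest to each centroid, in original order."""
    vectors = tfidf_vectors(sentences)
    centroids = kmeans(vectors, k)
    chosen = set()
    for centroid in centroids:
        chosen.add(min(range(len(vectors)),
                       key=lambda i: distance(vectors[i], centroid)))
    return [sentences[i] for i in sorted(chosen)]
```

Because k is fixed up front, the summary length is tied to the number of clusters, which is one reason choosing k well matters for this method.
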
III. Improvements

Co-reference resolution

Co-reference resolution is a technique that links pronouns in later sentences to the nouns they refer to in earlier sentences.

To give an example:

Spiderman is the coolest superhero ever! I love him.

If we apply co-reference resolution, the text becomes:

Spiderman is the coolest superhero ever! I love Spiderman.

This helps us create better summaries because it makes the relationships between sentences explicit, which gives unsupervised algorithms more context.

A more technical definition:

The task of locating all expressions that are coreferential with any of the entities identified in the text is known as coreference resolution, and it occurs when two or more expressions in the text relate to the same person or object. As a result, pronouns and other referring expressions must be resolved in order to infer the correct understanding of the text.
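
As a toy illustration of the Spiderman example, the sketch below replaces a few pronouns with the most recently mentioned entity. It is a deliberately naive heuristic, not Hobbs' algorithm and not how a real co-reference resolver works: the function name resolve_pronouns, the small pronoun set, and the requirement that the caller supplies the known entity names are all assumptions made for this example.

```python
import re

# A deliberately small pronoun set; possessive uses of "her" are not handled.
PRONOUNS = {"he", "him", "she", "her"}


def resolve_pronouns(text, entities):
    """Replace each listed pronoun with the most recently seen known entity."""
    # Split into words and the separators between them so the text can be rebuilt.
    tokens = re.findall(r"\w+|\W+", text)
    last_entity = None
    resolved = []
    for token in tokens:
        if token in entities:
            last_entity = token
        elif token.lower() in PRONOUNS and last_entity is not None:
            token = last_entity
        resolved.append(token)
    return "".join(resolved)


if __name__ == "__main__":
    text = "Spiderman is the coolest superhero ever! I love him."
    print(resolve_pronouns(text, entities={"Spiderman"}))
```

Running the resolved text through a summarizer means a sentence such as "I love Spiderman." now shares a word with the sentence it depends on, which overlap-based methods like TextRank can pick up.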

Hobbs' algorithm


Dimensionality reduction


For a quick search of relevant research papers with Q&A (question answering): https://typeset.io/

Quick Note

This page will be frequently updated. If you have a passion for or are an expert in the field of NLP, please reach out! I am happy to hear your input and advice!