Keyphrase Extraction from Documents Using RAKE and TextRank Algorithms

Nikita Mangwani
Dec 28, 2020
Keyword extraction

Abstract: Traditional approaches to extracting useful keyphrases from a sentence rely heavily on human effort. To overcome this challenge, this blog uses automatic keyphrase extraction algorithms, which reduce the scope for human error and save time. The algorithms detect the keyphrase in a sentence that the user provides as input, and the application sets a reminder using that keyphrase. RAKE and TextRank both extract keyphrases, or important terms, from a given text document; we apply and compare the two to find the most effective way of extracting keyphrases. With slight modifications to the code, the algorithms can serve other application domains, such as message or threat decoding for military purposes, and can be extended to speech-to-text conversion and sentiment analysis.

Keywords: Keyphrase Extraction, Approaches, Natural Language Processing, NLTK Tagging, RAKE, TextRank, Performance Analysis.

I. INTRODUCTION

Keyphrase extraction is a fundamental task in natural language processing that maps documents to a set of representative phrases. It gives a concise understanding of a text and its central theme, so readers can avoid spending a large amount of time reading, and information can be extracted more efficiently than with traditional extraction techniques.

extract data from pdf

Today, with a vast amount of textual information available on the internet, keyphrase generation has taken on much wider application and importance. As the volume of online resource material grows, information retrieval calls for automatic tagging of texts and documents so that relevant information can be extracted for a particular user query.

Keyword extraction using NLP modeling (python)

Manually tagging or summarizing such texts would be a herculean task. This calls for automation to reduce time and effort and to cope with the unprecedented volume of information exchanged today. Keyphrase extractors such as TextRank and RAKE are used for research documents.

II. VARIOUS APPROACHES TOWARDS PHRASE DETECTION

Natural language processing (NLP) is a widely used technique for extracting keyphrases from large chunks of data. NLP is the ability of a computer program to understand human language as it is spoken and written, and it is a component of artificial intelligence (AI). Natural language refers to the ways humans communicate with each other, namely speech and text.

  • Parsing of Text / Sentence Segmentation: Text parsing is a common programming task that splits a given sequence of characters or values (text) into smaller parts based on some rules.
  • Storing the Segmented Words/Sentences in a List: The segmented words are stored in a list. The sequence is then further analyzed and tokenized, and its grammar is determined.
  • Tokenization: “Tokens” are usually individual words, and “tokenization” is taking a text or set of texts and breaking it up into its individual words. These tokens are then used as the input for other types of analysis or tasks, such as parsing (automatically tagging the syntactic relationships between words).
  • Part-of-Speech (POS) Tagging: A part-of-speech tagger (POS tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word (and other token). Computational applications generally use more fine-grained POS tags such as ‘noun-plural’.
  • Listing the Candidate Keyphrases: The candidate keyphrases are listed based on their tags, and co-occurring keyphrases are identified.
Block diagram — Text Extraction

The pipeline first takes a sentence as raw text and performs sentence segmentation. It then tokenizes the text, splitting each sentence into tokens. Next, it recognizes and disambiguates entities, which here means removing stop words, frequently used words, and so on. It then finds the relations between the entities, and at the end it outputs the final event.
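The first stages of this pipeline can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for this post: it relies only on the standard library, the function names and the tiny `STOP_WORDS` set are hypothetical, and regular expressions stand in for NLTK's `sent_tokenize` and `word_tokenize`. POS tagging is left out because it needs a trained tagger such as `nltk.pos_tag`.

```python
import re

# Tiny illustrative stop word list (an assumption); NLTK ships a fuller one.
STOP_WORDS = {"a", "an", "and", "at", "for", "in", "is", "it",
              "of", "on", "the", "to", "will"}


def segment_sentences(text):
    """Split raw text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase a sentence and break it into word tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def remove_stop_words(tokens):
    """Drop frequently used words that carry little meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(text):
    """Run segmentation, tokenization and stop word removal, in that order."""
    return [remove_stop_words(tokenize(s)) for s in segment_sentences(text)]


print(preprocess("Remind me to call the doctor at noon. It is important!"))
# [['remind', 'me', 'call', 'doctor', 'noon'], ['important']]
```

The result keeps one token list per sentence, so later stages can still tell which words occurred together.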

III. KEYPHRASE EXTRACTION ALGORITHMS

The following subsections describe the workflow of Rapid Automatic Keyword Extraction (RAKE), an unsupervised, domain-independent, and language-independent method for extracting keyphrases from individual documents, and briefly discuss the workflow of the TextRank algorithm.

A. RAKE Algorithm: RAKE stands for Rapid Automatic Keyword Extraction. It is an efficient and increasingly widely used algorithm for keyword and keyphrase extraction. Candidates are extracted from the text by finding strings of words that do not include phrase delimiters or stop words (a, the, of, etc.). This produces the list of candidate keywords/phrases. Each candidate is then scored from the co-occurrence of its words: a word's score is its degree divided by its frequency, and a phrase's score is the sum of its word scores.
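To make the candidate step concrete, here is a minimal, standard-library sketch of RAKE. It is an illustration under assumptions rather than the implementation behind the results in this post: the function names and the short `STOP_WORDS` set are hypothetical, and scoring follows the degree/frequency ratio from the original RAKE paper.

```python
import re
from collections import defaultdict

# Short illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "are", "for", "from", "in", "is",
              "of", "on", "the", "to", "with"}


def candidate_phrases(text):
    """Split text at phrase delimiters and stop words; return lists of words."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", fragment):
            if word in STOP_WORDS:
                if current:              # a stop word closes the running phrase
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:                      # a delimiter closes the running phrase
            phrases.append(current)
    return phrases


def word_scores(phrases):
    """Score each word as degree / frequency."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    return {w: degree[w] / freq[w] for w in freq}


def rake(text, top_n=5):
    """Return the top_n (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text)
    scores = word_scores(phrases)
    ranked = {" ".join(p): sum(scores[w] for w in p) for p in phrases}
    return sorted(ranked.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


print(rake("Keyword extraction is the task of finding important phrases "
           "in a document.", top_n=2))
# [('finding important phrases', 9.0), ('keyword extraction', 4.0)]
```

Because each word's degree grows with the length of the phrases it appears in, longer candidates tend to score higher, which is why RAKE returns multi-word phrases rather than single keywords.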

B. TextRank Algorithm: In general, TextRank builds a graph of the words in a document and the relationships between them, then identifies the most important vertices of the graph (words) based on importance scores calculated recursively from the entire graph.

Candidates are extracted from the text by parsing it first into sentences and then into words, producing a list of words to be evaluated. Each word is then added to the graph, and relationships (edges) are added between the word and the other words in a sliding window around it.
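The graph construction and recursive scoring can be sketched as follows. This is a simplified illustration, not the implementation evaluated in this post: the function names, the `STOP_WORDS` set, and the default parameters are assumptions, the graph is unweighted, and stop word filtering stands in for the part-of-speech filter that TextRank usually applies. The scoring loop is a plain PageRank-style iteration.

```python
import re
from collections import defaultdict

# Short illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "are", "for", "from", "in", "is",
              "of", "on", "the", "to", "with"}


def build_graph(words, window=2):
    """Link each word to the words that follow it within the sliding window."""
    graph = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)   # undirected edge, stored both ways
                graph[other].add(word)
    return graph


def textrank(text, window=2, damping=0.85, iterations=50, top_n=5):
    """Return the top_n (word, score) pairs, highest score first."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOP_WORDS]
    graph = build_graph(words, window)
    scores = {w: 1.0 for w in graph}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbour's current score.
        scores = {
            w: (1 - damping)
            + damping * sum(scores[n] / len(graph[n]) for n in graph[w])
            for w in graph
        }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


for word, score in textrank("graph nodes graph edges graph scores graph ranking",
                            window=1):
    print(f"{word}: {score:.3f}")
```

In this toy input, "graph" is linked to every other word, so it ends up with the highest score. The ranked items are individual words, which matches the single-word output that TextRank tends to produce.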

IV. PERFORMANCE ANALYSIS

Performance analysis of Rake and Text Rank

From the table above, we can observe the results obtained by both algorithms, from which we can tell that the article is about Gate Array Implementation. The most important thing to notice is that TextRank gives us keywords (only one entry has two words; the rest have one), while RAKE gives us phrases. We conclude that TextRank suits a tagging system (particular keywords), while RAKE is more useful for a summarization task. Since our project is defined as keyword extraction and summarization, RAKE is the suitable algorithm. The reason for choosing RAKE is that the TextRank output alone does not let us conclude the topic of the text: some information is left out, and it also returns some inappropriate items, such as single characters.

Based on this result, we implemented the RAKE algorithm; the results are shown below.

Performance analysis of Rake algorithm

Here, the RAKE algorithm is run on ten different articles. The analysis is shown below, from which accuracy can be determined.

Performance Analysis of Rake algorithm of 10 articles
Performance Analysis of Rake Algorithm
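The post does not state how accuracy was computed. One common way to quantify it, shown here purely as an assumption, is to compare the extracted keyphrases with a hand-labelled reference set and report precision, recall, and F1. The function name and the sample phrases below are hypothetical and are not taken from the analysis above.

```python
def evaluate(extracted, reference):
    """Return (precision, recall, f1) using exact match after lowercasing."""
    extracted_set = {p.strip().lower() for p in extracted}
    reference_set = {p.strip().lower() for p in reference}
    matches = len(extracted_set & reference_set)
    precision = matches / len(extracted_set) if extracted_set else 0.0
    recall = matches / len(reference_set) if reference_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matches else 0.0
    return precision, recall, f1


# Made-up example data, for illustration only.
extracted = ["gate array", "logic design", "fpga", "signal"]
reference = ["Gate Array", "FPGA"]
precision, recall, f1 = evaluate(extracted, reference)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.50 recall=1.00 f1=0.67
```

Exact matching is strict; a partial-match or stemmed comparison would be more lenient, and the choice should be stated alongside any reported figure.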

V. CONCLUSION

The proposed approach was implemented in Python, using the NLTK toolkit to preprocess text. Keyphrase extraction techniques save time and resources by allowing large collections of data to be analyzed automatically in seconds. By automatically extracting and classifying information from documents, keyphrase extraction provides a robust way to process text at large scale and get fast, accurate results. In this paper we implemented the Rapid Automatic Keyword Extraction (RAKE) and TextRank algorithms on text data and analyzed their predictions and accuracy, with the results presented above. The top keywords from the content are displayed to the user. We conclude that, of the two, the RAKE algorithm gives the better results.

