[PYTHON] Analyze data using RegEx's 100x Flash Text

When it comes to data analysis, ** FlashText ** is a faster and more reliable tool for searching and replacing.

If you've done text or data analysis, you may already be familiar with regular expressions (RegEx). RegEx has evolved as the tool you need to perform text editing. If you're still using RegEx to handle text processing, there may be issues to address. Why is that? For large texts, RegEx's inefficiency can make data analysis unacceptably time consuming.

In this article, you can analyze data 100 times faster than RegEx [Python](https://www.alibabacloud.com/blog/how-to-write-a-headless-web-scraping-bot- in-python_594829? spm = a2c65.11461447.0.0.61739943rkkVz0) Introducing how to use the library "FlashText".

Comparison of RegEx and FlashText

Before proceeding with the analysis, even the simplest texts need to be cleaned of the source data. This often involves searching and replacing keywords. For example, search for the keyword "Python" in the corpus, or replace all "python" with "Python".

RegEx is an ideal tool if you need to find and replace hundreds of keywords. However, many of these tasks involve natural language processing (NLP). You may come across tens of thousands of such operations. Using RegEx to meet these requirements can take days.

Of course, you might think that parallelizing the process could solve this problem, but in reality this solution doesn't make a big difference.

Is there any other way to deal with this problem?

FlashText developers faced the same problem at the time. Some research did not produce any results, so he decided to write a new algorithm.

Before understanding the algorithm behind it, let's take a look at a comparison table showing the speed of FlashText in search and regular expressions in search.

image.png

Looking at the figure above, you can see that the processing time of RegEx increases almost linearly as the number of keywords increases. However, on the other hand, the increase in keywords does not affect FlashText.

Next, let's look at another chart for keyword substitution.

image.png

Similarly, as the number of keywords increases, the processing time of FlashText does not change much.

How to make data cleansing smarter and faster-FlashText

As the name implies, FlashText is one of the fastest ways to search and replace keywords. This is an open source Python library on GitHub.

When using FlashText, start by providing a list of keywords. FlashText uses this list to build an internal Trie dictionary. Then send a string of text depending on whether you want to find or replace it.

If you want to do a replacement, create a new string that contains the replacement keywords. To perform a search, it returns a list of keywords in the string. These tasks iterate over the string only once.

Why is FlashText so fast?

To really understand why FlashText is so fast, consider an example. Consider a sentence consisting of the three words "I like Python". Suppose you have a corpus of four words {Python, Java, J2ee, Ruby}.

If you select every word in the corpus and see if it appears in the sentence, you need to iterate the string four times.

image.png

It requires several iterations for the n words in the corpus. And each step (in the statement? This is the logic of RegEx matching. There is also another method that contradicts the first method. It's about checking for each word in a sentence to see if it exists in the corpus.

image.png

For m words in a sentence, you have m cycles. In this situation, the time spent depends only on the number of words in the sentence. You can use a dictionary to perform this step quickly.

The FlashText algorithm uses the second method. In addition, the Aho-Corasick algorithm and Trie data structures are the inspiration for that algorithm.

How FlashText works

First, create a Trie data structure from the corpus. It should look like the graph below.

image.png

Start and EOT (End of Term) indicate word boundaries. Either a space, a comma, or a line return value. You can match the keyword if it has boundaries on both sides. This will prevent cases such as pineapple apple matching.

Let's search character by character using the string "I like Python".

image.png

image.png

This algorithm searches character by character, so when searching for 1, it's easy to jump over because l isn't immediately after. This mechanism allows you to skip all non-existent words.

The FlashText algorithm inspects every character in the string "I like Python". Even if the dictionary contains 1 million keywords, it does not actually affect the operation.

When to use FlashText?

We recommend using FlashText when the number of keywords exceeds 500.

image.png

In terms of search, FlashText performs better than RegEx when the number of keywords is 500 or more.

In addition, RegEx can search for special characters like "^, $, *, d", but FlashText doesn't support them.

It does not match partial words (eg "worddvec"), but it does match full words ("word2vec").

Let's take a look at the basic usage of FlashText. try it. You'll find that it's much faster than RegEx.

Below is some code to help you use FlashText. Code: Search for keywords using FlashText.

image.png

Code: Search for keywords using FlashText.

image.png

Conclusion

Hopefully this article will convince you of the fact that FlashText is a better tool than RegEx. In particular, I've shown a graph showing the performance of RegEx and FlashText.

Recommended Posts

Analyze data using RegEx's 100x Flash Text
Select features using text data
[Python3] Let's analyze data using machine learning! (Regression)
Let's analyze Covid-19 (Corona) data using Python [For beginners]
[Memo] Text matching in pandas data frame using flashtext
Data analysis using xarray
Data analysis using Python 0
Data cleansing 2 Data cleansing using DataFrame
Data cleaning using Python
Inflating text data by retranslation using google translate in Python
Analyze stock prices using pandas data aggregation and group operations
I tried to analyze scRNA-seq data using Topological Data Analysis (TDA)