SoFunction
Updated on 2024-11-21

Stopword Filtering in Python Explained

I. What is a stop word

In Chinese, there is a class of words that carry little meaning on their own, such as the structural particle "的", the conjunction "以及", the adverb "甚至", and the modal particle "吧". These are called stop words. Removing stop words from a sentence does not affect comprehension, so in natural language processing we generally filter them out.
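As a toy illustration of the idea, filtering amounts to dropping every token that appears in a stopword list. The stopword set and token list below are made up for this example; they are not taken from HanLP's dictionary:

```python
# Hypothetical mini stopword list (not HanLP's dictionary).
stopwords = {"的", "以及", "甚至", "吧", "the", "a", "of"}

# A pre-tokenized sentence; keep only the tokens that are not stop words.
tokens = ["今天", "的", "天气", "以及", "心情", "都", "不错", "吧"]
filtered = [t for t in tokens if t not in stopwords]
print(filtered)  # → ['今天', '天气', '心情', '都', '不错']
```

The remaining tokens still convey the full meaning of the sentence, which is exactly why stop words can be removed safely.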

The HanLP library ships with a small stopword dictionary, located in the Lib\site-packages\pyhanlp\static\data\dictionary directory under the name stopwords.txt. This text file contains common Chinese and English stop words, one word per line. An example is shown below:

[Figure: sample entries from the stopword dictionary]

When doing natural language processing, we can store the dictionary in any of BinTrie, DoubleArrayTrie, or AhoCorasickDoubleArrayTrie. Considering that the dictionary consists of many short entries and is relatively large, the double-array trie offers the best trade-off: lower memory cost and faster lookups.
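HanLP's double-array trie is a Java data structure, but the prefix-matching idea behind it can be sketched in a few lines of plain Python using nested dictionaries. `build_trie` and `longest_match` below are illustrative helpers, not HanLP APIs:

```python
# A nested-dict trie: each node maps a character to a child node,
# and "$" marks the end of a dictionary word. This sketches the idea only;
# HanLP's double-array implementation is far more compact and faster.
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return root

def longest_match(trie, text, start):
    """Length of the longest dictionary word starting at `start`, or 0."""
    node, best, i = trie, 0, start
    while i < len(text) and text[i] in node:
        node = node[text[i]]
        i += 1
        if "$" in node:
            best = i - start
    return best

trie = build_trie(["以", "以及"])
print(longest_match(trie, "天气以及心情", 2))  # → 2 (matches "以及", not just "以")
```

Longest matching is what lets the trie treat "以及" as one stop word instead of stopping at the shorter entry "以".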

II. Loading the stopword dictionary

From the previous section, we know that the double-array trie is the most cost-effective structure for the stopword dictionary. Next, let's load the stop words and return them as a key-value structure. The code is as follows:

from pyhanlp import *


def load_dictionary(path):
    map = JClass('java.util.TreeMap')()          # ordered map of word -> word
    with open(path, encoding='utf-8') as src:
        for word in src:
            word = word.strip()                  # drop the trailing newline
            map[word] = word
    return JClass('com.hankcs.hanlp.collection.trie.DoubleArrayTrie')(map)
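The function above needs a running JVM. As a JVM-free sketch of the same loading step, one can read the file into a plain Python set; `load_stopwords` is a hypothetical stand-in, not part of the HanLP API:

```python
# JVM-free sketch of the loading step: one stopword per line into a set.
# load_stopwords is a hypothetical stand-in, not a HanLP function.
def load_stopwords(path):
    with open(path, encoding='utf-8') as src:
        return {line.strip() for line in src if line.strip()}

# Demonstrate with a throwaway file containing three entries.
import os
import tempfile

with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as f:
    f.write("的\n了\nthe\n")
    tmp_path = f.name
words = load_stopwords(tmp_path)
os.unlink(tmp_path)
print(words)  # a set containing 的, 了 and the
```

A set gives O(1) membership tests, which is all a whole-token filter needs; the trie earns its keep when matching stop words inside raw, unsegmented text.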

III. Deletion of stop words

Loading the stop words above gives us the vocabulary stored as a DoubleArrayTrie. To delete the stop words, we can filter the tokenizer's output directly against the trie. The filtering method is as follows:

def remove_stopwords(termlist, trie):
    return [term.word for term in termlist if not trie.containsKey(term.word)]
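Running this filter for real requires a JVM, HanLP's Term objects, and the loaded trie, but its behavior is easy to mimic in pure Python. `Term` and `FakeTrie` below are hypothetical stand-ins exposing only the two members the filter uses, `.word` and `containsKey()`:

```python
from collections import namedtuple

# Hypothetical stand-ins for HanLP's Term objects and the trie (not HanLP API).
Term = namedtuple("Term", "word")

class FakeTrie:
    def __init__(self, words):
        self._words = set(words)

    def containsKey(self, word):
        return word in self._words

def remove_stopwords(termlist, trie):
    # Same filter as in the article: keep terms absent from the stopword trie.
    return [term.word for term in termlist if not trie.containsKey(term.word)]

termlist = [Term("今天"), Term("的"), Term("天气"), Term("不错"), Term("吧")]
trie = FakeTrie(["的", "吧"])
print(remove_stopwords(termlist, trie))  # → ['今天', '天气', '不错']
```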

IV. Tokenizing and removing stop words

In previous blog posts we learned how to tokenize text, and now we know how to weed out stop words. Here, we combine the two steps. The code is as follows:

if __name__ == "__main__":
    HanLP.Config.ShowTermNature = False          # hide part-of-speech tags in the output
    trie = load_dictionary(HanLP.Config.CoreStopWordDictionaryPath)
    text = "That's it for today! Can we talk about it tomorrow?"
    DoubleArrayTrieSegment = JClass('com.hankcs.hanlp.seg.Other.DoubleArrayTrieSegment')
    segment = DoubleArrayTrieSegment()
    termlist = segment.seg(text)
    print("Segmentation results:", termlist)
    print("After removing stop words:", remove_stopwords(termlist, trie))

After running it, you get the following result:

[Figure: program output]

V. Removing stop words directly (without tokenization)

In the result above, we tokenized the text first and then deleted the stop words. Sometimes, however, we want to remove the stop words from the raw text without tokenizing it at all. Below, let's implement direct stopword removal.

The code is as follows:

# Direct filtering method
def direct_remove_stopwords(text, replacement, trie):
    JString = JClass('java.lang.String')
    searcher = trie.getLongestSearcher(JString(text), 0)  # longest-match scan from offset 0
    offset = 0
    result = ''
    while searcher.next():
        begin = searcher.begin
        end = begin + searcher.length
        if begin > offset:
            result += text[offset:begin]   # keep the text before the matched stop word
        result += replacement              # substitute the matched stop word
        offset = end
    if offset < len(text):
        result += text[offset:]            # keep any trailing text
    return result
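The same scan-and-replace control flow can be written in pure Python with a naive longest-match search over a set, which makes it easy to test without a JVM. `replace_stopwords_py` and its set-based matching are illustrative stand-ins, not HanLP's API:

```python
# Pure-Python sketch of direct stopword replacement (not HanLP's searcher).
def replace_stopwords_py(text, replacement, stopwords):
    result, offset, i = '', 0, 0
    maxlen = max((len(w) for w in stopwords), default=0)
    while i < len(text):
        # Find the longest stop word starting at position i, if any.
        match = 0
        for length in range(min(maxlen, len(text) - i), 0, -1):
            if text[i:i + length] in stopwords:
                match = length
                break
        if match:
            result += text[offset:i] + replacement  # copy prefix, substitute match
            i += match
            offset = i
        else:
            i += 1
    return result + text[offset:]                   # copy any trailing text

print(replace_stopwords_py("今天的天气不错吧", "**", {"的", "吧"}))
# → 今天**天气不错**
```

Like the trie-based version, it copies unmatched spans through unchanged and substitutes each matched stop word with the replacement string.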


if __name__ == "__main__":
    HanLP.Config.ShowTermNature = False
    trie = load_dictionary(HanLP.Config.CoreStopWordDictionaryPath)
    text = "That's it for today! Can we talk about it tomorrow?"
    DoubleArrayTrieSegment = JClass('com.hankcs.hanlp.seg.Other.DoubleArrayTrieSegment')
    segment = DoubleArrayTrieSegment()
    termlist = segment.seg(text)
    print("Segmentation results:", termlist)
    print("After removing stop words:", remove_stopwords(termlist, trie))
    print("Stop words removed without tokenization:", direct_remove_stopwords(text, "**", trie))

After running it, the result is as follows:

[Figure: program output]

This concludes this article on the basics of stopword filtering in Python. For more on stopword filtering in Python, please search my earlier articles or continue browsing the related articles below. I hope you will support me in the future!