SoFunction
Updated on 2024-11-15

How to remove various symbols from text with a single line of Python code for data cleansing

Preface

Once a large text corpus has been collected, a long and usually iterative data-cleaning process begins.

1. Description of the problem

Some text data will contain special symbols.

They were probably pasted directly into the page from a rich text editor.

To remove these special symbols, you need a suitable tool.

2. Relevant knowledge

The Unicode Standard divides symbols into four main categories, which are:

Abbreviation  Description
[Sc] Symbol, Currency
[Sk] Symbol, Modifier
[Sm] Symbol, Math
[So] Symbol, Other

The symbols that generally need to be cleaned up belong to the So category, but it is still important to analyze your own data before deciding what to remove.
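One way to analyze your data is to count how many characters fall into each symbol category before removing anything. The following is a minimal sketch, not part of the original article; the helper name symbol_category_counts and the sample string are made up for illustration.

```python
import unicodedata
from collections import Counter

def symbol_category_counts(text):
    """Count the characters of `text` that fall in a Unicode symbol category (Sc, Sk, Sm, So)."""
    return Counter(
        unicodedata.category(ch)
        for ch in text
        if unicodedata.category(ch).startswith("S")
    )

# Hypothetical sample: '$' is Sc, '→', '+' and '=' are Sm, '■' and '♥' are So.
sample = "Price: $5 → ■ ♥ 1 + 1 = 2"
counts = symbol_category_counts(sample)
print(counts)
```

If the counts show many Sm or Sc characters, removing every symbol category would also delete ordinary signs such as "+", "=" and "$", which may not be what you want.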

3. Solutions

Symbols encountered during data cleaning may include miscellaneous symbols, geometric shapes, arrows, hearts, stars, emoji, currency symbols, and so on.

If all of the symbols above are to be removed, the following code can be used.

import unicodedata
text = "".join(ch for ch in text if unicodedata.category(ch)[0] != 'S')
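As a self-contained illustration of the one-liner, here is a small sketch that wraps it in a function and runs it on sample strings. The function name strip_symbols and the sample inputs are assumptions made for this example only.

```python
import unicodedata

def strip_symbols(text):
    """Drop every character whose Unicode general category starts with 'S'."""
    return "".join(ch for ch in text if unicodedata.category(ch)[0] != "S")

# Arrows and geometric shapes are removed.
print(strip_symbols("a→b■c"))

# ASCII signs such as '+', '=' and '$' are symbols too (Sm, Sm, Sc),
# so they are removed as well and the surrounding spaces stay behind.
print(strip_symbols("1 + 1 = 2 costs $3"))
```

Note that the spaces around a removed symbol are kept, so a follow-up step to collapse repeated whitespace may be needed, depending on your data.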

If you need to remove only one category, or want to know which category a particular symbol belongs to, look it up on this site:

/charts/

and find the corresponding symbol type.

Take the arrow symbol as an example.

Start by searching the page above for "Arrows" to find the plain Arrows entry, which corresponds to the document: /charts/PDF/

Find the arrow you need and check the corresponding name.

Example: arrow

Take RIGHTWARDS ARROW, then use the unicodedata module from the Python standard library to look up the category of this symbol.

>>> import unicodedata
>>> unicodedata.lookup('RIGHTWARDS ARROW')
'→'
>>> unicodedata.category('→')
'Sm'

This shows that the arrow symbol we looked up belongs to the Sm category (math symbols).

Example: black square

BLACK SQUARE ■ U+25A0

>>> unicodedata.lookup('BLACK SQUARE')
'■'
>>> unicodedata.category('■')
'So'

Example: black heart

>>> unicodedata.lookup('BLACK HEART SUIT')
'♥'
>>> unicodedata.category('♥')
'So'

Example: black star

>>> unicodedata.lookup('BLACK FOUR POINTED STAR')
'✦'
>>> unicodedata.category('✦')
'So'
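When you already have the character itself rather than its name, the lookup can go in the other direction. The sketch below uses unicodedata.name and unicodedata.category to report the code point, name and category of each character; the helper name describe is hypothetical and not from the original article.

```python
import unicodedata

def describe(ch):
    """Return the code point, Unicode name and general category of one character."""
    name = unicodedata.name(ch, "<unnamed>")  # the default avoids ValueError for unnamed characters
    return f"U+{ord(ch):04X} {name} [{unicodedata.category(ch)}]"

# The four symbols used in the examples above.
for ch in "→■♥✦":
    print(describe(ch))
```

Running this over a sample of the characters found in your own corpus is a quick way to decide which categories to remove.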

If you only need to remove miscellaneous symbols (the So category), you can use the following Python code.

text = "".join(ch for ch in text if unicodedata.category(ch) != 'So')
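The same idea extends to any chosen set of categories. The following is a sketch under the assumption that you want the set to be configurable; the function name remove_categories and its categories parameter are invented for this example.

```python
import unicodedata

def remove_categories(text, categories=frozenset({"So"})):
    """Drop characters whose Unicode general category is in `categories`."""
    return "".join(ch for ch in text if unicodedata.category(ch) not in categories)

sample = "x → y ■ $5"

# Default: only So is removed, so the arrow (Sm) and the dollar sign (Sc) survive.
print(remove_categories(sample))

# Removing So and Sm together drops the arrow as well.
print(remove_categories(sample, {"So", "Sm"}))
```

Passing the categories explicitly keeps currency and math signs intact unless you decide otherwise.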

Another useful URL:

/info/unicode/category/
