SoFunction
Updated on 2024-11-18

Splitting a String Column into Multiple Columns in pandas Using Delimiters or Regular Expressions

How to split a pandas column whose elements are strings into multiple columns.

The following string methods are used.

  • str.split(): split by delimiter
  • str.extract(): split by regular expression

These string methods are methods of pandas.Series.

They apply to a pandas.Series or to a single column of a pandas.DataFrame.

str.split(): split by delimiter

To split by delimiter, use the string method str.split().

Take the following pandas.Series as an example.

import pandas as pd

s_org = pd.Series(['aaa@xxx.com', 'bbb@yyy.com', 'ccc@zzz.com', 'ddd'], index=['A', 'B', 'C', 'D'])
print(s_org)
print(type(s_org))
# A    aaa@xxx.com
# B    bbb@yyy.com
# C    ccc@zzz.com
# D            ddd
# dtype: object
# <class 'pandas.core.series.Series'>

Specify the delimiter as the first argument. The result is a pandas.Series in which each element is a list of the split strings.

s = s_org.str.split('@')
print(s)
print(type(s))
# A    [aaa, xxx.com]
# B    [bbb, yyy.com]
# C    [ccc, zzz.com]
# D             [ddd]
# dtype: object
# <class 'pandas.core.series.Series'>
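As a small sketch of how the list-valued result can be used without expanding it, the str accessor also supports indexing into each list. The variable names local and domain are chosen here for illustration only.

```python
import pandas as pd

s_org = pd.Series(['aaa@xxx.com', 'bbb@yyy.com', 'ccc@zzz.com', 'ddd'],
                  index=['A', 'B', 'C', 'D'])
s = s_org.str.split('@')

# .str[i] picks the i-th item of each list; rows whose list is
# too short get a missing value instead of raising an error.
local = s.str[0]
domain = s.str[1]
print(local)
print(domain)
```

This is convenient when only one of the pieces is needed; for several pieces at once, expand=True is usually simpler.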

Specifying expand=True splits the result into multiple columns and returns it as a pandas.DataFrame. The default is expand=False.

Rows that yield fewer pieces than the number of columns are padded with None.

df = s_org.str.split('@', expand=True)
print(df)
print(type(df))
#      0        1
# A  aaa  xxx.com
# B  bbb  yyy.com
# C  ccc  zzz.com
# D  ddd     None
# <class 'pandas.core.frame.DataFrame'>
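The number of columns produced by expand=True follows the row with the most pieces. A minimal sketch, using a made-up Series s_multi that is not part of the original example, shows this and how the n argument of str.split() can cap the number of splits:

```python
import pandas as pd

# Hypothetical data: rows contain two, one, and zero delimiters.
s_multi = pd.Series(['a@b@c', 'd@e', 'f'])

# Without a limit, the widest row decides the column count (3 here).
df_all = s_multi.str.split('@', expand=True)

# n=1 splits at most once, so there are never more than 2 columns.
df_first = s_multi.str.split('@', n=1, expand=True)

print(df_all)
print(df_first)
```

Limiting n keeps the column count predictable when the data may contain extra delimiters.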

You can set the column names of the resulting pandas.DataFrame through its columns attribute.

df.columns = ['local', 'domain']
print(df)
#   local   domain
# A   aaa  xxx.com
# B   bbb  yyy.com
# C   ccc  zzz.com
# D   ddd     None

Replacing a specific column of a pandas.DataFrame with the columns obtained by splitting it takes a few steps. There may be a better way.

Take the pandas.DataFrame created above as an example.

print(df)
#   local   domain
# A   aaa  xxx.com
# B   bbb  yyy.com
# C   ccc  zzz.com
# D   ddd     None

Use str.split() on a specific column to get the split result as a pandas.DataFrame.

print(df['domain'].str.split('.', expand=True))
#       0     1
# A   xxx   com
# B   yyy   com
# C   zzz   com
# D  None  None

Use pandas.concat() to concatenate (join) the result with the original pandas.DataFrame, and use the drop() method to delete the original column.

df2 = pd.concat([df, df['domain'].str.split('.', expand=True)], axis=1).drop('domain', axis=1)
print(df2)
#   local     0     1
# A   aaa   xxx   com
# B   bbb   yyy   com
# C   ccc   zzz   com
# D   ddd  None  None

If only a few of the original columns need to be kept, you can select just those columns when concatenating with pandas.concat().

df3 = pd.concat([df['local'], df['domain'].str.split('.', expand=True)], axis=1)
print(df3)
#   local     0     1
# A   aaa   xxx   com
# B   bbb   yyy   com
# C   ccc   zzz   com
# D   ddd  None  None

To rename a specific column, use the rename() method.

df3.rename(columns={0: 'second_LD', 1: 'TLD'}, inplace=True)
print(df3)
#   local second_LD   TLD
# A   aaa       xxx   com
# B   bbb       yyy   com
# C   ccc       zzz   com
# D   ddd      None  None
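One possible shortcut for the concat, drop, and rename steps is to assign the split result directly to new column names. This is only a sketch of an alternative, not the approach the walkthrough above uses; the DataFrame is rebuilt here so the example is self-contained, and df_alt is a name chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'local': ['aaa', 'bbb', 'ccc', 'ddd'],
     'domain': ['xxx.com', 'yyy.com', 'zzz.com', None]},
    index=['A', 'B', 'C', 'D'],
)

# Assign both split columns under their final names in one step,
# then drop the column that was split.
df[['second_LD', 'TLD']] = df['domain'].str.split('.', expand=True)
df_alt = df.drop(columns='domain')
print(df_alt)
```

Note that this assumes the split always yields exactly two columns; if a row could produce more pieces, the assignment would fail with a shape mismatch.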


str.extract(): split by regular expression

To split by regular expression, use the string method str.extract().

Take the following pandas.Series as an example.

import pandas as pd

s_org = pd.Series(['aaa@xxx.com', 'bbb@yyy.com', 'ccc@zzz.com', 'ddd'], index=['A', 'B', 'C', 'D'])
print(s_org)
# A    aaa@xxx.com
# B    bbb@yyy.com
# C    ccc@zzz.com
# D            ddd
# dtype: object

Specify the regular expression as the first argument. Each string is split into the substrings matched by the groups enclosed in () in the pattern.

When the pattern contains multiple groups, a pandas.DataFrame is returned regardless of the argument expand.

Rows that do not match are filled with NaN.

df = s_org.str.extract(r'(.+)@(.+)\.(.+)', expand=True)
print(df)
#      0    1    2
# A  aaa  xxx  com
# B  bbb  yyy  com
# C  ccc  zzz  com
# D  NaN  NaN  NaN

df = s_org.str.extract(r'(.+)@(.+)\.(.+)', expand=False)
print(df)
#      0    1    2
# A  aaa  xxx  com
# B  bbb  yyy  com
# C  ccc  zzz  com
# D  NaN  NaN  NaN
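Because .+ is greedy, the pattern above splits at the last dot when a domain contains more than one. The following sketch uses a made-up address (not from the original data) to show the difference between the greedy group and a lazy .+? variant:

```python
import pandas as pd

# Hypothetical address with two dots in the domain part.
s = pd.Series(['aaa@xxx.co.jp'])

# Greedy second group: consumes as much as possible, so the
# split happens at the LAST dot.
greedy = s.str.extract(r'(.+)@(.+)\.(.+)', expand=True)

# Lazy second group (.+?): consumes as little as possible, so the
# split happens at the FIRST dot after the @.
lazy = s.str.extract(r'(.+)@(.+?)\.(.+)', expand=True)

print(greedy)
print(lazy)
```

Which behavior is appropriate depends on how the pieces will be used, so it is worth checking the pattern against a few multi-dot values.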

If there is only one group, a pandas.DataFrame is returned when expand=True, and a pandas.Series when expand=False.

df_single = s_org.str.extract(r'(\w+)', expand=True)
print(df_single)
print(type(df_single))
#      0
# A  aaa
# B  bbb
# C  ccc
# D  ddd
# <class 'pandas.core.frame.DataFrame'>

s = s_org.str.extract(r'(\w+)', expand=False)
print(s)
print(type(s))
# A    aaa
# B    bbb
# C    ccc
# D    ddd
# dtype: object
# <class 'pandas.core.series.Series'>

In pandas 0.22.0, expand=False was the default, and omitting the argument produced the FutureWarning below. Since pandas 0.23.0, the default is expand=True.

FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame) 
but in a future version of pandas this will be changed to expand=True (return DataFrame)

If named groups (?P<name>...) are used in the regular expression pattern, the group names are used as the column names as is.

df_name = s_org.str.extract(r'(?P<local>.*)@(?P<second_LD>.*)\.(?P<TLD>.*)', expand=True)
print(df_name)
#   local second_LD  TLD
# A   aaa       xxx  com
# B   bbb       yyy  com
# C   ccc       zzz  com
# D   NaN       NaN  NaN

To replace a specific column with the columns obtained by splitting it, follow the str.split() example above: use pandas.concat() to join (concatenate) the result with the original pandas.DataFrame and use the drop() method to remove the original column.
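As a rough sketch of that workflow with str.extract(), the example below uses a hypothetical DataFrame with id and email columns (both names are assumptions for illustration). Named groups mean no separate rename step is needed.

```python
import pandas as pd

# Hypothetical DataFrame; 'id' and 'email' are illustrative names.
df = pd.DataFrame(
    {'id': [1, 2, 3, 4],
     'email': ['aaa@xxx.com', 'bbb@yyy.com', 'ccc@zzz.com', 'ddd']},
    index=['A', 'B', 'C', 'D'],
)

# Named groups become the column names of the extracted DataFrame.
parts = df['email'].str.extract(
    r'(?P<local>.*)@(?P<second_LD>.*)\.(?P<TLD>.*)', expand=True
)

# Join with the original and drop the column that was split.
df_new = pd.concat([df, parts], axis=1).drop('email', axis=1)
print(df_new)
```

Rows that do not match the pattern, such as 'ddd' here, end up with NaN in all extracted columns.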
