SoFunction
Updated on 2024-12-16

Python pandas column-to-row operations explained (similar to the explode method in Hive)

Recently at work I needed to process Excel files with Python's pandas library and ran into the problem of converting a column into rows. After some searching I got it working, so I am recording the solution here.

1. If there is only one column to be exploded:

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
df
Out[1]: 
   A       B
0  1  [1, 2]
1  2  [1, 2]

To explode column B, call the explode method directly (this requires pandas 0.25 or later):

df.explode('B')

   A  B
0  1  1
0  1  2
1  2  1
1  2  2
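
Because explode repeats the index label of the source row for every element it unpacks, getting a fresh 0..n-1 index takes one more step. The following is a minimal sketch of that step; the variable names `exploded` and `renumbered` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})

# explode() keeps the index label of the row each element came from,
# so the labels are repeated once per list element.
exploded = df.explode('B')

# Drop the repeated labels and renumber the rows from 0.
renumbered = exploded.reset_index(drop=True)
print(renumbered)
```

Keeping the repeated index can also be useful, since it records which original row each new row came from.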

2. If there are 2 or more columns to be exploded:

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]], 'C': [[1, 2], [3, 4]]})
df
Out[592]: 
   A       B       C
0  1  [1, 2]  [1, 2]
1  2  [3, 4]  [3, 4]

In that case you can write a helper function like the following:

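import numpy as np
import pandas as pd

def unnesting(df, explode):
    # Repeat each index label once per element of the first list column
    idx = df.index.repeat(df[explode[0]].str.len())
    # Flatten every list column and place the results side by side
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    # Join the remaining columns back on the repeated index
    return df1.join(df.drop(columns=explode), how='left')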
 
 
unnesting(df,['B','C'])
Out[2]: 
 B C A
0 1 1 1
0 2 2 1
1 3 3 2
1 4 4 2
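
On newer pandas versions the helper may not be needed: since pandas 1.3, explode also accepts a list of columns. The sketch below assumes pandas 1.3 or later and that the lists in B and C have the same length in every row (pandas raises an error otherwise).

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]], 'C': [[1, 2], [3, 4]]})

# Assumes pandas >= 1.3: explode several list columns in lockstep.
result = df.explode(['B', 'C'])
print(result)
```

Unlike the helper function, this keeps the original column order (A, B, C).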

Additional knowledge: splitting one column into multiple columns in pandas with str.split(',', expand=True), and doing the same in pyspark

Source data

 question_id       id
0   17576     70391,70394
1   17576  70391,70392,70393,70394
2   17576     70391,70392
3   40430   155032,155033,155034
4   40430 155032,155033,155034,155035
5   40430   155033,155034,155035
6   40430    155032,155035
7   40430    155034,155035
8   40430    155032,155034
9   40430   155032,155034,155035
10  40430    155033,155034
11  40430    155032,155033
12  40430    155033,155035
13  40430   155032,155033,155035

pandas solution

df['id'].str.split(',', expand=True)

result

   0  1  2  3
0 70391 70394 None None
1 70391 70392 70393 70394
2 70391 70392 None None
3 155032 155033 155034 None
4 155032 155033 155034 155035
5 155033 155034 155035 None
6 155032 155035 None None
7 155034 155035 None None
8 155032 155034 None None
9 155032 155034 155035 None
10 155033 155034 None None
11 155032 155033 None None
12 155033 155035 None None
13 155032 155033 155035 None

# Note expand=True: it makes str.split return a DataFrame, which can be joined back onto the original

df.join(df['id'].str.split(',', expand=True))

 question_id       id  0  1  2  3
0   17576     70391,70394 70391 70394 None None
1   17576  70391,70392,70393,70394 70391 70392 70393 70394
2   17576     70391,70392 70391 70392 None None
3   40430   155032,155033,155034 155032 155033 155034 None
4   40430 155032,155033,155034,155035 155032 155033 155034 155035
5   40430   155033,155034,155035 155033 155034 155035 None
6   40430    155032,155035 155032 155035 None None
7   40430    155034,155035 155034 155035 None None
8   40430    155032,155034 155032 155034 None None
9   40430   155032,155034,155035 155032 155034 155035 None
10  40430    155033,155034 155033 155034 None None
11  40430    155032,155033 155032 155033 None None
12  40430    155033,155035 155033 155035 None None
13  40430   155032,155033,155035 155032 155033 155035 None
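
The two techniques in this article can also be combined: split the comma-separated string into a list, then explode the list into rows. This is a minimal sketch using two rows taken from the sample data; the name `long_df` is only illustrative.

```python
import pandas as pd

# Two rows borrowed from the sample data above.
df = pd.DataFrame({
    'question_id': [17576, 17576],
    'id': ['70391,70394', '70391,70392'],
})

long_df = (
    df.assign(id=df['id'].str.split(','))  # string -> list of strings
      .explode('id')                       # one row per list element
      .reset_index(drop=True)              # renumber rows from 0
)
print(long_df)
```

The values in the id column stay strings after the split; convert them with astype(int) if numbers are needed.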
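
pyspark solution

from pyspark.sql import functions as F

# df is a Spark DataFrame here; count_num is a column that the sample data above does not show
tdf = df.select(F.split(df.id, ',').alias('ss'), 'question_id', 'count_num')
tdf.sort('question_id').show()
# explode turns each element of the array column ss into its own row
res = tdf.select(F.explode(tdf.ss).alias('new'), 'question_id', 'count_num')
res.sort('question_id').show()
res.groupBy('question_id', 'new').sum().sort('question_id').show()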

result

That is all I have to share about pandas column-to-row operations (similar to the explode method in Hive). I hope it serves as a useful reference, and I appreciate your support.