pandas hierarchical index implementation

Hierarchical indexing is an important feature of pandas that enables you to have multiple (more than two) index levels on a single axis.

Creates a Series and indexes it with a list or array of lists.

data=Series((10),
index=[['a','a','a','b','b','b','c','c','d','d'],
[1,2,3,1,2,3,1,2,2,3]])

data
Out[6]: 
a 1  -2.842857
  2  0.376199
  3  -0.512978
b 1  0.225243
  2  -1.242407
  3  -0.663188
c 1  -0.149269
  2  -1.079174
d 2  -0.952380
  3  -1.113689
dtype: float64

This is the formatted output of a Series with MultiIndex indexes. The "spacing" between indexes means "use the labels above directly".


Out[7]: 
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

For a hierarchically indexed object, the operation of picking a subset of data is simple:

data['b']
Out[8]: 
1  0.225243
2  -1.242407
3  -0.663188
dtype: float64


data['b':'c']
Out[10]: 
b 1  0.225243
  2  -1.242407
  3  -0.663188
c 1  -0.149269
  2  -1.079174
dtype: float64

[['b','d']]
__main__:1: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
/pandas-docs/stable/#ix-indexer-is-deprecated
Out[11]: 
b 1  0.225243
  2  -1.242407
  3  -0.663188
d 2  -0.952380
  3  -1.113689
dtype: float64

You can even select it from the "inner layer":

data[:,2]
Out[12]: 
a  0.376199
b  -1.242407
c  -1.079174
d  -0.952380
dtype: float64

Hierarchical indexes play an important role in data remodeling and grouping-based operations.

can be rearranged into a DataFrame via the unstack method:

()
Out[13]: 
     1     2     3
a -2.842857 0.376199 -0.512978
b 0.225243 -1.242407 -0.663188
c -0.149269 -1.079174    NaN
d    NaN -0.952380 -1.113689


The inverse of #unstack is stack.
().stack()
Out[14]: 
a 1  -2.842857
  2  0.376199
  3  -0.512978
b 1  0.225243
  2  -1.242407
  3  -0.663188
c 1  -0.149269
  2  -1.079174
d 2  -0.952380
  3  -1.113689
dtype: float64

For DataFrame, each axis can have hierarchical indexes:

frame=DataFrame((12).reshape((4,3)),
index=[['a','a','b','b'],[1,2,1,2]],
columns=[['Ohio','Ohio','Colorado'],
['Green','Red','Green']])

frame
Out[16]: 
   Ohio   Colorado
  Green Red  Green
a 1   0  1    2
 2   3  4    5
b 1   6  7    8
 2   9 10    11

The layers can have names. If names are specified, they will be displayed in the console (don't confuse index names with axis labels!)

=['key1','key2']
=['state','color']

frame
Out[22]: 
state   Ohio   Colorado
color   Green Red  Green
key1 key2          
a  1    0  1    2
   2    3  4    5
b  1    6  7    8
   2    9 10    11

Column grouping can be easily selected thanks to the divisional column index:

frame['Ohio']
Out[23]: 
color   Green Red
key1 key2      
a  1     0  1
   2     3  4
b  1     6  7
   2     9  10

Rearrangement of hierarchical ordering

Sometimes it is necessary to reorder the levels on an axis, or to sort the data according to the values on a specified level. swaplevel accepts two level numbers or names and returns a new object with the levels swapped (but the data is not changed):

('key1','key2')
Out[24]: 
state   Ohio   Colorado
color   Green Red  Green
key2 key1          
1  a    0  1    2
2  a    3  4    5
1  b    6  7    8
2  b    9 10    11

sortlevel sorts the data according to the values in the individual levels. When swapping levels, it is common to get a sortlevel so that the end result is also ordered:

(0,1)
Out[27]: 
state   Ohio   Colorado
color   Green Red  Green
key2 key1          
1  a    0  1    2
2  a    3  4    5
1  b    6  7    8
2  b    9 10    11

# Swap levels 0,1 (i.e. key1,key2)
# Then sort on axis=0
(0,1).sortlevel(0)
__main__:1: FutureWarning: sortlevel is deprecated, use sort_index(level= ...)
Out[28]: 
state   Ohio   Colorado
color   Green Red  Green
key2 key1          
1  a    0  1    2
   b    6  7    8
2  a    3  4    5
   b    9 10    11

Summary statistics by level

(level='key2')
Out[29]: 
state Ohio   Colorado
color Green Red  Green
key2          
1     6  8    10
2    12 14    16

(level='color',axis=1)
Out[30]: 
color   Green Red
key1 key2      
a  1     2  1
   2     8  4
b  1    14  7
   2    20  10

Columns using DataFrame

Use one or more columns of a DataFrame as a row index, or turn a row index into a column of a Dataframe.

frame=DataFrame({'a':range(7),'b':range(7,0,-1),
'c':['one','one','one','two','two','two','two'],
'd':[0,1,2,0,1,2,3]})

frame
Out[32]: 
  a b  c d
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 two 0
4 4 3 two 1
5 5 2 two 2
6 6 1 two 3

A DataFrame's set_index function converts one or more of its columns to a row index and creates a new DataFrame:

frame2=frame.set_index(['c','d'])

frame2
Out[34]: 
    a b
c  d   
one 0 0 7
  1 1 6
  2 2 5
two 0 3 4
  1 4 3
  2 5 2
  3 6 1

By default, those columns are removed from the DataFrame, but it is possible to keep them there:

frame.set_index(['c','d'],drop=False)
Out[35]: 
    a b  c d
c  d       
one 0 0 7 one 0
  1 1 6 one 1
  2 2 5 one 2
two 0 3 4 two 0
  1 4 3 two 1
  2 5 2 two 2
  3 6 1 two 3

The function of reset_index is just the opposite of set_index, where the level of the hierarchical index is moved inside the column:

frame2.reset_index()
Out[36]: 
   c d a b
0 one 0 0 7
1 one 1 1 6
2 one 2 2 5
3 two 0 3 4
4 two 1 4 3
5 two 2 5 2
6 two 3 6 1

This is the whole content of this article.