8 commonly used index settings in pandas

1. Converting an index from a groupby operation to a column

The groupby method is used often. For example, the following adds a grouping column, team, and then groups by it.

>>> df0["team"] = ["X", "X", "Y", "Y", "Y"]
>>> df0
          A         B         C team
0  0.548012  0.288583  0.734276    X
1  0.342895  0.207917  0.995485    X
2  0.378794  0.160913  0.971951    Y
3  0.039738  0.008414  0.226510    Y
4  0.581093  0.750331  0.133022    Y
>>> df0.groupby("team").mean()
             A         B         C
team                              
X     0.445453  0.248250  0.864881
Y     0.333208  0.306553  0.443828

By default, groupby turns the grouping column into the index. In many cases, though, we don't want the grouping column to become the index, because later calculations or conditional logic may still need it as a regular column. So we need a way to keep the grouping column as a column while still performing the grouping.

There are two ways to do this: the first is to call reset_index on the result, and the second is to pass as_index=False to the groupby method. Personally, I prefer the second, which needs only two method calls and is more concise.

>>> df0.groupby("team").mean().reset_index()
  team         A         B         C
0    X  0.445453  0.248250  0.864881
1    Y  0.333208  0.306553  0.443828
>>> df0.groupby("team", as_index=False).mean()
  team         A         B         C
0    X  0.445453  0.248250  0.864881
1    Y  0.333208  0.306553  0.443828
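
As a minimal sketch, the two approaches can be compared side by side on a small hand-made frame. The frame df and its score column are hypothetical and are not the article's df0.

```python
import pandas as pd

# Hypothetical toy data, not the article's df0
df = pd.DataFrame({"team": ["X", "X", "Y"], "score": [1.0, 3.0, 5.0]})

# Approach 1: group, aggregate, then move the index back into a column
via_reset = df.groupby("team").mean().reset_index()

# Approach 2: tell groupby not to use the grouping column as the index
via_flag = df.groupby("team", as_index=False).mean()

print(via_reset)
print(via_flag)
```

In both results, team is an ordinary column and the index is a plain 0-based range.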

2. Setting the index on an existing DataFrame

Of course, if the data has already been read, or has already gone through some processing steps, we can set the index manually with set_index.

>>> df = pd.read_csv("", parse_dates=["date"])
>>> df.set_index("date")
            temperature  humidity
date                             
2021-07-01           95        50
2021-07-02           94        55
2021-07-03           94        56

There are two things to keep in mind here:

By default, the set_index method returns a new DataFrame. To change the index of df in place, pass inplace=True.

>>> df.set_index("date", inplace=True)

If you want to keep the column that is being set as the index, pass drop=False.

>>> df.set_index("date", drop=False)
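
The following sketch illustrates both points. It assumes a small frame built by hand to mirror the temperature table above; the variable names indexed, kept and in_place are made up for the example.

```python
import pandas as pd

# Hypothetical data mirroring the article's temperature table
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-07-01", "2021-07-02", "2021-07-03"]),
    "temperature": [95, 94, 94],
    "humidity": [50, 55, 56],
})

indexed = df.set_index("date")            # returns a new DataFrame, df is unchanged
kept = df.set_index("date", drop=False)   # "date" is the index and stays a column

in_place = df.copy()
in_place.set_index("date", inplace=True)  # modifies in_place itself, returns None

print(df.columns.tolist())
print(indexed.columns.tolist())
print(kept.columns.tolist())
```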

3. Resetting indexes after some operations

When working with a DataFrame, certain operations (e.g., deleting rows or selecting by index) produce a subset of the original index, so the default numeric index is no longer consecutive. To regenerate a consecutive index, use the reset_index method.

>>> import numpy as np
>>> import pandas as pd
>>> df0 = pd.DataFrame(np.random.rand(5, 3), columns=list("ABC"))
>>> df0
          A         B         C
0  0.548012  0.288583  0.734276
1  0.342895  0.207917  0.995485
2  0.378794  0.160913  0.971951
3  0.039738  0.008414  0.226510
4  0.581093  0.750331  0.133022
>>> df1 = df0[df0.index % 2 == 0]
>>> df1
          A         B         C
0  0.548012  0.288583  0.734276
2  0.378794  0.160913  0.971951
4  0.581093  0.750331  0.133022
>>> df1.reset_index(drop=True)
          A         B         C
0  0.548012  0.288583  0.734276
1  0.378794  0.160913  0.971951
2  0.581093  0.750331  0.133022

Normally, we don't need to keep the old index, so the drop parameter can be set to True. Similarly, to reset the index in place, set the inplace parameter to True; otherwise a new DataFrame is returned.
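
To make the effect of drop concrete, here is a small sketch on a hypothetical single-column frame (not the article's df0). Without drop=True, the old index is preserved as a column named index.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"A": [10, 20, 30, 40, 50]})
subset = df[df.index % 2 == 0]             # keeps rows 0, 2, 4

dropped = subset.reset_index(drop=True)    # old index is discarded
kept = subset.reset_index()                # old index becomes a column named "index"

print(subset.index.tolist())
print(dropped.index.tolist())
print(kept.columns.tolist())
```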

4. Reset index after sorting

The same problem comes up when sorting with the sort_values method: by default, the index follows the sort order, so it ends up out of order. If we don't want the index to follow the sort order, we can again pass ignore_index=True to the sort_values method.

>>> df0.sort_values("A")
          A         B         C team
3  0.039738  0.008414  0.226510    Y
1  0.342895  0.207917  0.995485    X
2  0.378794  0.160913  0.971951    Y
0  0.548012  0.288583  0.734276    X
4  0.581093  0.750331  0.133022    Y
>>> df0.sort_values("A", ignore_index=True)
          A         B         C team
0  0.039738  0.008414  0.226510    Y
1  0.342895  0.207917  0.995485    X
2  0.378794  0.160913  0.971951    Y
3  0.548012  0.288583  0.734276    X
4  0.581093  0.750331  0.133022    Y
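
A compact way to see the difference is to compare the resulting index labels directly. This is a sketch on a hypothetical three-row frame; the ignore_index parameter of sort_values requires pandas 1.0 or later.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"A": [3, 1, 2]})

sorted_default = df.sort_values("A")                   # index labels travel with the rows
sorted_fresh = df.sort_values("A", ignore_index=True)  # index is renumbered from 0

print(sorted_default.index.tolist())
print(sorted_fresh.index.tolist())
```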

5. Reset the index after deleting duplicates

Removing duplicates, like sorting, leaves the index out of order by default. Similarly, setting the ignore_index parameter of the drop_duplicates method to True fixes this.

>>> df0
          A         B         C team
0  0.548012  0.288583  0.734276    X
1  0.342895  0.207917  0.995485    X
2  0.378794  0.160913  0.971951    Y
3  0.039738  0.008414  0.226510    Y
4  0.581093  0.750331  0.133022    Y
>>> df0.drop_duplicates("team", ignore_index=True)
          A         B         C team
0  0.548012  0.288583  0.734276    X
1  0.378794  0.160913  0.971951    Y
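
For contrast, the sketch below shows the index with and without ignore_index. The frame and its score column are hypothetical; as with sort_values, the parameter requires pandas 1.0 or later.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"team": ["X", "X", "Y", "Y"], "score": [1, 2, 3, 4]})

default = df.drop_duplicates(subset="team")                   # keeps original labels
fresh = df.drop_duplicates(subset="team", ignore_index=True)  # renumbers from 0

print(default.index.tolist())
print(fresh.index.tolist())
```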

6. Direct assignment of indexes

Sometimes we already have a DataFrame and want to assign its index from a different data source or a separate operation. In this case, the index can be assigned directly to the existing DataFrame.

>>> better_index = ["X1", "X2", "Y1", "Y2", "Y3"]
>>> df0.index = better_index
>>> df0
           A         B         C team
X1  0.548012  0.288583  0.734276    X
X2  0.342895  0.207917  0.995485    X
Y1  0.378794  0.160913  0.971951    Y
Y2  0.039738  0.008414  0.226510    Y
Y3  0.581093  0.750331  0.133022    Y
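
One thing worth knowing about direct assignment is that the new index must have the same length as the DataFrame. The sketch below, on a hypothetical three-row frame, shows a valid assignment followed by a rejected one.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"A": [1, 2, 3]})

df.index = ["a", "b", "c"]        # same length as the frame: accepted

length_error = None
try:
    df.index = ["only", "two"]    # wrong length: pandas raises ValueError
except ValueError as exc:
    length_error = exc

print(df.index.tolist())
print(length_error)
```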

7. Ignore indexes when writing CSV files

By default, a DataFrame has an index starting from 0, and that index is written out when exporting data to a CSV file. If we don't want it in the exported CSV file, we can pass index=False to the to_csv method.

>>> df0.to_csv("exported_file.csv", index=False)

The index column is then not included in the exported CSV file.
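
A quick way to check this without writing a file: when no path is given, to_csv returns the CSV text as a string. The sketch below uses a hypothetical two-column frame and compares the header line with and without the index.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

with_index = df.to_csv()                # no path given: returns the CSV text
without_index = df.to_csv(index=False)

# With the index, the header starts with an empty field for the index column
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```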

In fact, many pandas methods have index-related settings, but we usually care more about the data and overlook the index, which can lead to errors later in the code. The operations in this article are all frequently used index settings; making a habit of setting the index deliberately will save a lot of time.

8. Specify the index column when reading

In many cases, our data source is a CSV file. Suppose we have a file that contains the following data.

date,temperature,humidity
07/01/21,95,50
07/02/21,94,55
07/03/21,94,56

By default, pandas creates an index starting at 0, as follows:

>>> pd.read_csv("", parse_dates=["date"])
        date  temperature  humidity
0 2021-07-01           95        50
1 2021-07-02           94        55
2 2021-07-03           94        56

However, we can make the import more direct: setting the index_col parameter to a column name specifies the index column at read time.

>>> pd.read_csv("", parse_dates=["date"], index_col="date")
            temperature  humidity
date                             
2021-07-01           95        50
2021-07-02           94        55
2021-07-03           94        56
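
The same behaviour can be tried without a file on disk. The sketch below assumes the CSV content shown above and feeds it to read_csv through io.StringIO; the variable names are made up for the example.

```python
import io

import pandas as pd

# The sample CSV content from above, held in memory instead of a file
csv_text = (
    "date,temperature,humidity\n"
    "07/01/21,95,50\n"
    "07/02/21,94,55\n"
    "07/03/21,94,56\n"
)

default = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
indexed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

print(default.index.tolist())
print(indexed.index.name)
print(indexed.columns.tolist())
```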
