Using the pandas category data type

Create category

Created with Series

Passing dtype="category" when creating a Series produces categorical data. A categorical has two parts: its categories (the literal values) and whether those categories are ordered. The examples assume import pandas as pd and import numpy as np:

In [1]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [2]: s
Out[2]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
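To make the two parts concrete, here is a minimal sketch that inspects them through the .cat accessor. The integer codes shown are how pandas stores each value as a position in the categories; the variable names are only illustrative:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# The distinct labels, inferred from the data.
categories = list(s.cat.categories)

# Each value is stored as an integer position into `categories`.
codes = list(s.cat.codes)

# Whether the categories carry an order (not the case by default).
ordered = s.cat.ordered

print(categories, codes, ordered)
```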

A column of a DataFrame can be converted to category:

In [3]: df = pd.DataFrame({"A": ["a", "b", "c", "a"]})

In [4]: df["B"] = df["A"].astype("category")

In [5]: df["B"]
Out[5]: 
0    a
1    b
2    c
3    a
Name: B, dtype: category
Categories (3, object): ['a', 'b', 'c']

You can also create a pandas.Categorical first and pass it to Series:

In [10]: raw_cat = pd.Categorical(
   ....:     ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
   ....: )
   ....: 

In [11]: s = pd.Series(raw_cat)

In [12]: s
Out[12]: 
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b', 'c', 'd']
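The NaN entries above appear because "a" is not among the declared categories. As a small sketch of what happens underneath, the following checks which values went missing and how they are coded:

```python
import pandas as pd

raw_cat = pd.Categorical(
    ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
)
s = pd.Series(raw_cat)

# "a" is not a declared category, so both occurrences are missing.
missing_mask = s.isna().tolist()

# Missing values are stored with the sentinel code -1.
codes = s.cat.codes.tolist()

# "d" is a valid category even though no value uses it.
has_d = "d" in s.cat.categories

print(missing_mask, codes, has_d)
```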

Created with DF

You can also pass in dtype="category" when creating a DataFrame:

In [17]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")

In [18]: df.dtypes
Out[18]: 
A    category
B    category
dtype: object

Both columns A and B of the DataFrame are categorical, each with its own categories:

In [19]: df["A"]
Out[19]: 
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (3, object): ['a', 'b', 'c']

In [20]: df["B"]
Out[20]: 
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (3, object): ['b', 'c', 'd']

Or use DataFrame.astype("category") to convert all columns of a DataFrame to category:

In [21]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

In [22]: df_cat = df.astype("category")

In [23]: df_cat.dtypes
Out[23]: 
A    category
B    category
dtype: object

Controlling creation

By default, a categorical created by passing dtype="category" behaves as follows:

The categories are inferred from the data.

The categories are unordered.

You can explicitly create a CategoricalDtype to change these two defaults:

In [26]: from pandas.api.types import CategoricalDtype

In [27]: s = pd.Series(["a", "b", "c", "a"])

In [28]: cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)

In [29]: s_cat = s.astype(cat_type)

In [30]: s_cat
Out[30]: 
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b' < 'c' < 'd']

The same CategoricalDtype can also be applied to a DataFrame:

In [31]: from pandas.api.types import CategoricalDtype

In [32]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

In [33]: cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)

In [34]: df_cat = df.astype(cat_type)

In [35]: df_cat["A"]
Out[35]: 
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']

In [36]: df_cat["B"]
Out[36]: 
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']
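One consequence worth spelling out: with an explicit CategoricalDtype every column gets the same dtype, whereas "category" infers the categories per column. A short sketch comparing the two (the names shared and inferred are only illustrative):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)

shared = df.astype(cat_type)      # one explicit dtype for every column
inferred = df.astype("category")  # categories inferred column by column

# Columns share a dtype only when the categories and order match.
same_when_shared = shared["A"].dtype == shared["B"].dtype
same_when_inferred = inferred["A"].dtype == inferred["B"].dtype

print(same_when_shared, same_when_inferred)
```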

Convert to original type

Using Series.astype(original_dtype) or np.asarray(categorical), a categorical can be converted back to its original type:

In [39]: s = pd.Series(["a", "b", "c", "a"])

In [40]: s
Out[40]: 
0    a
1    b
2    c
3    a
dtype: object

In [41]: s2 = s.astype("category")

In [42]: s2
Out[42]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [43]: s2.astype(str)
Out[43]: 
0    a
1    b
2    c
3    a
dtype: object

In [44]: np.asarray(s2)
Out[44]: array(['a', 'b', 'c', 'a'], dtype=object)
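The same round trip works for non-string data. The sketch below, using made-up integer values, keeps the original dtype around so it can be restored after the conversion:

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30, 10])
s_cat = s.astype("category")

# Cast back to the dtype the Series had before the conversion.
restored = s_cat.astype(s.dtype)

# np.asarray yields a plain NumPy array of the underlying values.
arr = np.asarray(s_cat)

print(restored.dtype, arr)
```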

Operations on categories

Get attributes of category

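Categorical data has two properties, categories and ordered. They can be read through s.cat.categories and s.cat.ordered:

In [57]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [58]: s.cat.categories
Out[58]: Index(['a', 'b', 'c'], dtype='object')

In [59]: s.cat.ordered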
Out[59]: False

Rearranging the order of the categories:

In [60]: s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"]))

In [61]: s.cat.categories
Out[61]: Index(['c', 'b', 'a'], dtype='object')

In [62]: s.cat.ordered
Out[62]: False

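Renaming categories

You can rename categories by assigning new values to s.cat.categories. This setter was deprecated in pandas 1.5, so on recent versions use rename_categories instead.

In [67]: s = pd.Series(["a", "b", "c", "a"], dtype="category")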

In [68]: s
Out[68]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [69]: s.cat.categories = ["Group %s" % g for g in s.cat.categories]

In [70]: s
Out[70]: 
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): ['Group a', 'Group b', 'Group c']

The same effect can be achieved using rename_categories:

In [71]: s = s.cat.rename_categories([1, 2, 3])

In [72]: s
Out[72]: 
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [1, 2, 3]

Or use a dictionary object:

# You can also pass a dict-like object to map the renaming
In [73]: s = s.cat.rename_categories({1: "x", 2: "y", 3: "z"})

In [74]: s
Out[74]: 
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): ['x', 'y', 'z']
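The list and dict forms differ in one practical way, which this small sketch illustrates: a list renames by position, while a dict only touches the keys it mentions. The result is a new Series; the original is left as it was.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# A list renames by position and needs one entry per category.
by_position = s.cat.rename_categories(["x", "y", "z"])

# A dict renames only the keys it mentions; "b" and "c" stay unchanged.
by_mapping = s.cat.rename_categories({"a": "alpha"})

print(by_position.tolist())
print(list(by_mapping.cat.categories))
```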

Adding categories with add_categories

You can use add_categories to add categories.

In [77]: s = s.cat.add_categories([4])

In [78]: s.cat.categories
Out[78]: Index(['x', 'y', 'z', 4], dtype='object')

In [79]: s
Out[79]: 
0    x
1    y
2    z
3    x
dtype: category
Categories (4, object): ['x', 'y', 'z', 4]

Removing categories with remove_categories

In [80]: s = s.cat.remove_categories([4])

In [81]: s
Out[81]: 
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): ['x', 'y', 'z']
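In the example above the removed category was unused, so the values did not change. As a hedged illustration of the other case, the sketch below removes a category that is in use, which turns the affected values into NaN:

```python
import pandas as pd

s = pd.Series(["x", "y", "z", "x"], dtype="category")

# Adding a category changes the allowed labels, not the values.
extended = s.cat.add_categories(["w"])

# Removing a category that is in use turns its values into NaN.
reduced = s.cat.remove_categories(["x"])

print(list(extended.cat.categories))
print(reduced.isna().tolist())
```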

Removing unused categories

In [82]: s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c", "d"]))

In [83]: s
Out[83]: 
0    a
1    b
2    a
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [84]: s.cat.remove_unused_categories()
Out[84]: 
0    a
1    b
2    a
dtype: category
Categories (2, object): ['a', 'b']
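Unused categories often appear after filtering, because selecting rows does not shrink the set of categories. A minimal sketch of that situation:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Filtering rows keeps the full category list: "c" is still present.
filtered = s[s != "c"]
before = list(filtered.cat.categories)

# Dropping unused categories leaves only the labels that occur.
after = list(filtered.cat.remove_unused_categories().cat.categories)

print(before, after)
```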

Resetting categories

Using set_categories(), you can add and remove categories at the same time:

In [85]: s = pd.Series(["one", "two", "four", "-"], dtype="category")

In [86]: s
Out[86]: 
0     one
1     two
2    four
3       -
dtype: category
Categories (4, object): ['-', 'four', 'one', 'two']

In [87]: s = s.cat.set_categories(["one", "two", "three", "four"])

In [88]: s
Out[88]: 
0     one
1     two
2    four
3     NaN
dtype: category
Categories (4, object): ['one', 'two', 'three', 'four']
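set_categories is easy to confuse with rename_categories. The sketch below contrasts the two on the same input: renaming relabels the existing values, while setting replaces the allowed labels, so values that are no longer allowed become NaN.

```python
import pandas as pd

s = pd.Series(["a", "b", "a"], dtype="category")

# rename_categories relabels: "a" becomes "x", "b" becomes "y".
renamed = s.cat.rename_categories(["x", "y"])

# set_categories replaces the allowed labels: "a" and "b" are gone.
reset = s.cat.set_categories(["x", "y"])

print(renamed.tolist())
print(reset.isna().tolist())
```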

Sort by category

If a categorical is created with ordered=True, its values have an order, so sorting, min and max follow the order of the categories:

In [91]: s = pd.Series(["a", "b", "c", "a"]).astype(CategoricalDtype(ordered=True))

In [92]: s.sort_values(inplace=True)

In [93]: s
Out[93]: 
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

In [94]: s.min(), s.max()
Out[94]: ('a', 'c')

You can use as_ordered() or as_unordered() to mark a categorical as ordered or unordered:

In [95]: s.cat.as_ordered()
Out[95]: 
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

In [96]: s.cat.as_unordered()
Out[96]: 
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']
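The order matters most when it differs from alphabetical order. As an illustration with hypothetical size labels, the following sketch sorts by the declared category order rather than by the strings themselves:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical labels whose logical order is not alphabetical.
size_type = CategoricalDtype(["small", "medium", "large"], ordered=True)
sizes = pd.Series(["large", "small", "medium", "small"]).astype(size_type)

# Sorting and min/max follow the declared category order.
sorted_sizes = sizes.sort_values().tolist()
smallest, largest = sizes.min(), sizes.max()

# as_unordered() drops the order but keeps values and categories.
unordered = sizes.cat.as_unordered()

print(sorted_sizes, smallest, largest)
```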

Reordering categories

Existing categories can be reordered using Categorical.reorder_categories():

In [103]: s = pd.Series([1, 2, 3, 1], dtype="category")

In [104]: s = s.cat.reorder_categories([2, 3, 1], ordered=True)

In [105]: s
Out[105]: 
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]
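reorder_categories expects exactly the existing categories in a new order. The sketch below shows the effect on sorting and what happens when a category is left out:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 1], dtype="category")

reordered = s.cat.reorder_categories([2, 3, 1], ordered=True)

# Sorting follows the new category order 2 < 3 < 1.
sorted_values = reordered.sort_values().tolist()

# Leaving out a category is rejected with a ValueError.
try:
    s.cat.reorder_categories([2, 3])
    error = None
except ValueError as exc:
    error = exc

print(sorted_values, type(error).__name__)
```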

Multi-column sorting

sort_values supports multiple columns for sorting:

In [109]: dfs = pd.DataFrame(
   .....:     {
   .....:         "A": pd.Categorical(
   .....:             list("bbeebbaa"),
   .....:             categories=["e", "a", "b"],
   .....:             ordered=True,
   .....:         ),
   .....:         "B": [1, 2, 1, 2, 2, 1, 2, 1],
   .....:     }
   .....: )
   .....: 

In [110]: dfs.sort_values(by=["A", "B"])
Out[110]: 
   A  B
2  e  1
3  e  2
7  a  1
6  a  2
0  b  1
5  b  1
1  b  2
4  b  2

Comparison operations

If ordered=True is set at creation time, categoricals can be compared with each other. The operators ==, !=, >, >=, < and <= are supported.

In [113]: cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))

In [114]: cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))

In [115]: cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

In [119]: cat > cat_base
Out[119]: 
0     True
1    False
2    False
dtype: bool

In [120]: cat > 2
Out[120]: 
0     True
1    False
2    False
dtype: bool
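cat_base2 is defined above but never compared. As a sketch of why it is there: two categoricals can only be compared when their categories match, so comparing cat with cat_base2 raises a TypeError.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

dtype = CategoricalDtype([3, 2, 1], ordered=True)
cat = pd.Series([1, 2, 3]).astype(dtype)
cat_base = pd.Series([2, 2, 2]).astype(dtype)
# Categories are inferred here, so they differ from those of `cat`.
cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

# Same categories and order: element-wise comparison works.
same_categories = (cat > cat_base).tolist()

# Different categories: the comparison is rejected.
try:
    cat > cat_base2
    mismatch_error = None
except TypeError as exc:
    mismatch_error = exc

print(same_categories, type(mismatch_error).__name__)
```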

Other operations

Categorical data still lives in a Series, so most Series operations are available, such as Series.min(), Series.max() and Series.mode().

value_counts:

In [131]: s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))

In [132]: s.value_counts()
Out[132]: 
c    2
a    1
b    1
d    0
dtype: int64
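Note the zero count for "d" above: value_counts on categorical data reports every category, used or not. A short sketch comparing this with the same values stored as plain objects:

```python
import pandas as pd

s = pd.Series(
    pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"])
)

# Categorical: every declared category is counted, including unused ones.
cat_counts = s.value_counts().to_dict()

# Plain object dtype: only the values that occur are counted.
obj_counts = s.astype(object).value_counts().to_dict()

print(cat_counts)
print(obj_counts)
```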

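DataFrame.sum() (the level argument shown here was deprecated in pandas 1.3 in favor of groupby):

In [133]: columns = pd.Categorical(
   .....:     ["One", "One", "Two"], categories=["One", "Two", "Three"], ordered=True
   .....: )
   .....: 

In [134]: df = pd.DataFrame(
   .....:     data=[[1, 2, 3], [4, 5, 6]],
   .....:     columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
   .....: )
   .....: 

In [135]: df.sum(axis=1, level=1)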
Out[135]: 
   One  Two  Three
0    3    3      0
1    9    6      0

Groupby:

In [136]: cats = pd.Categorical(
   .....:     ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
   .....: )
   .....: 

In [137]: df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

In [138]: df.groupby("cats").mean()
Out[138]: 
      values
cats        
a        1.0
b        2.0
c        4.0
d        NaN

In [139]: cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

In [140]: df2 = pd.DataFrame(
   .....:     {
   .....:         "cats": cats2,
   .....:         "B": ["c", "d", "c", "d"],
   .....:         "values": [1, 2, 3, 4],
   .....:     }
   .....: )
   .....: 

In [141]: df2.groupby(["cats", "B"]).mean()
Out[141]: 
        values
cats B        
a    c     1.0
     d     2.0
b    c     3.0
     d     4.0
c    c     NaN
     d     NaN
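The NaN rows above come from categories that never occur in the data. Whether they are included depends on the observed argument of groupby, and its default has changed between pandas versions, so the sketch below passes it explicitly (the data is made up for illustration):

```python
import pandas as pd

cats = pd.Categorical(["a", "b", "b", "c"], categories=["a", "b", "c", "d"])
df = pd.DataFrame({"cats": cats, "values": [1, 2, 4, 5]})

# observed=False keeps a group for every category, used or not.
all_groups = df.groupby("cats", observed=False)["values"].mean()

# observed=True keeps only the categories that occur in the data.
seen_groups = df.groupby("cats", observed=True)["values"].mean()

print(list(all_groups.index), list(seen_groups.index))
```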

Pivot tables:

In [142]: raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

In [143]: df = pd.DataFrame({"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})

In [144]: pd.pivot_table(df, values="values", index=["A", "B"])
Out[144]: 
     values
A B        
a c       1
  d       2
b c       3
  d       4
