1 | factors = np.random.randn(9) |
pd.qcut()
qcut是根据这些值的频率来选择箱子的均匀间隔,即每个箱子中含有的数的数量是相同的。
传入q参数1
2
3
4
5
6
7
8
9
10
11
12pd.qcut(factors, 3) #返回每个数对应的分组
>>>[(-2.097, 0.00648], (0.6, 1.188], (0.6, 1.188], (0.00648, 0.6], (-2.097, 0.00648], (0.6, 1.188], (0.00648, 0.6], (-2.097, 0.00648], (0.00648, 0.6]]
Categories (3, interval[float64]): [(-2.097, 0.00648] < (0.00648, 0.6] < (0.6, 1.188]]
pd.qcut(factors, 3).value_counts() #计算每个分组中含有的数的数量
>>>
(-2.097, 0.00648] 3
(0.00648, 0.6] 3
(0.6, 1.188] 3
dtype: int64
传入label参数1
2
3
4
5
6
7
8
9pd.qcut(factors, 3, labels = ["a","b","c"]) #返回每个数对应的分组,但分组名称由label指示
[a, c, c, b, a, c, b, a, b]
Categories (3, object): [a < b < c]
pd.qcut(factors, 3, labels = False) #返回每个数对应的分组,但仅显示分组下标
0 2 2 1 0 2 1 0 1] [
传入retbins参数1
2
3
4pd.qcut(factors, 3, retbins=True) #返回每个数对应的分组,且额外返回bins,即每个边界值
-2.097, 0.00648], (0.6, 1.188], (0.6, 1.188], (0.00648, 0.6], (-2.097, 0.00648], (0.6, 1.188], (0.00648, 0.6], (-2.097, 0.00648], (0.00648, 0.6]] ([(
Categories (3, interval[float64]): [(-2.097, 0.00648] < (0.00648, 0.6] < (0.6, 1.188]], array([-2.0963765 , 0.00648121, 0.60009524, 1.1875407 ]))
参数 | 说明 |
---|---|
x | ndarray或Series |
q | integer,指示划分的组数 |
labels | array或bool,默认为None。当传入数组时,分组的名称由label指示;当传入False时,仅显示分组下标 |
retbins | bool,是否返回bins,默认为False。当传入True时,额外返回bins,即每个边界值。 |
precision | int,精度,默认为3 |
pd.cut()
cut将根据值本身来选择箱子均匀间隔,即每个箱子的间距都是相同的
传入bins参数1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18pd.cut(factors, 3) #返回每个数对应的数组
-1.002, 0.0929], (0.0929, 1.188], (0.0929, 1.188], (0.0929, 1.188], (-2.1, -1.002], (0.0929, 1.188], (0.0929, 1.188], (-2.1, -1.002], (-1.002, 0.0929]] [(
Categories (3, interval[float64]): [(-2.1, -1.002] < (-1.002, 0.0929] < (0.0929, 1.188]]
pd.cut(factors, bins=[-3, -2, -1, 0, 1, 2, 3])
-1, 0], (0, 1], (0, 1], (0, 1], (-3, -2], (1, 2], (0, 1], (-2, -1], (0, 1]] [(
Categories (6, interval[int64]): [(-3, -2] < (-2, -1] < (-1, 0] < (0, 1] < (1, 2] < (2, 3]]
pd.cut(factors, 3).value_counts() #计算每个分组中含有的数的数量
>>>
(-2.1, -1.002] 2
(-1.002, 0.0929] 2
(0.0929, 1.188] 5
dtype: int64
传入label参数1
2
3
4
5
6
7
8pd.cut(factors, 3, labels=["a", "b", "c"]) #返回每个数对应的分组,但分组名称由label指示
[b, c, c, c, a, c, c, a, b]
Categories (3, object): [a < b < c]
pd.cut(factors, 3, labels=False) #返回每个数对应的分组,但仅显示分组下标
1 2 2 2 0 2 2 0 1] [
传入retbins参数1
2
3
4pd.cut(factors, 3, retbins=True) #返回每个数对应的分组,且额外返回bins,即每个边界值
>>> ([(-1.002, 0.0929], (0.0929, 1.188], (0.0929, 1.188], (0.0929, 1.188], (-2.1, -1.002], (0.0929, 1.188], (0.0929, 1.188], (-2.1, -1.002], (-1.002, 0.0929]]
Categories (3, interval[float64]): [(-2.1, -1.002] < (-1.002, 0.0929] < (0.0929, 1.188]], array([-2.09966042, -1.00173743, 0.09290163, 1.1875407 ]))
参数 | 说明 |
---|---|
x | array,仅能使用一维数组 |
bins | Integer或sequence of scalars,指示划分的组数或指定组距 |
labels | array或bool,默认为None。当传入数组时,分组的名称由label指示;当传入False时,仅显示分组下标 |
retbins | bool,是否返回bins,默认为False。当传入True时,额外返回bins,即每个边界值 |
precision | int,精度,默认为3 |