Python数据分析，必须要求掌握Pandas大熊猫

2023-02-27

data out ohio

我写的pandas不是我国可爱的大熊猫国宝 pandas是基于NumPy的一种工具，该工具是为了解决数据分析任务而创建的。Pandas纳入了大量库和一些标准的数据模型，提供了高效地操作大型数据集所需的工具。pandas提供了大量能使我们快速便捷地处理数据的函数和方法。你很快就会发现，它是使

我写的pandas不是我国可爱的大熊猫国宝

pandas 是基于NumPy 的一种工具，该工具是为了解决数据分析任务而创建的。Pandas 纳入了大量库和一些标准的数据模型，提供了高效地操作大型数据集所需的工具。pandas提供了大量能使我们快速便捷地处理数据的函数和方法。你很快就会发现，它是使Python成为强大而高效的数据分析环境的重要因素之一。

1.pandas数据结构的介绍

Series：一维数组，与Numpy中的一维array类似。二者与Python基本的数据结构List也很相近。Series如今能保存不同种数据类型，字符串、boolean值、数字等都能保存在Series中。
Time- Series：以时间为索引的Series。
DataFrame：二维的表格型数据结构。很多功能与R中的data.frame类似。可以将DataFrame理解为Series的容器。
Panel ：三维的数组，可以理解为DataFrame的容器。

2.Series的操作

2.1 对象创建 2.1.1 直接创建2.1.2 字典创建

import pandas as pd 
import numpy as np 
# 直接创建 
s = pd.Series(np.random.randn(5), index=['a','b','c','d','e']) 
print(s)  
# 字典（dict）类型数据创建 
s = pd.Series( {'a':10, 'b':20, 'c':30}, index=['b', 'c', 'a', 'd']) 
 
OUT: 
a   -0.620323 
b   -0.189133 
c    1.677690 
d   -1.480348 
e   -0.539061 
dtype: float64 
 
OUT: 
a    10 
b    20 
c    30 
dtype: int64 
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.

2.2 查看数据切片、索引、dict操作 Series既然是一维数组类型的数据结构，那么它支持想数组那样去操作它。通过数组下标索引、切片都可以去操作他，且它的data可以是dict类型的，那么它肯定也就支持字典的索引方式。

import pandas as pd 
import numpy as np 
 
s = pd.Series(np.random.randn(5), index=['a','b','c','d','e']) 
 
print(s) 
 
# 下标索引 
print('下标索引方式s[0] = : %s' % s[0]) 
 
# 字典访问方式 
print('字典访问方式s[b] = ：%s' % s['b']) 
 
# 切片操作 
print('切片操作s[2:]\n:%s' % s[2:]) 
print('a' in s) 
print('k' in s) 
OUT: 
a   -0.799676 
b   -1.581704 
c   -1.240885 
d    0.623757 
e   -0.234417 
dtype: float64 
 
下标索引方式s[0] = : -0.799676067487 
字典访问方式s[b] = ：-1.58170351838 
切片操作s[2:]: 
c   -1.240885 
d    0.623757 
e   -0.234417 
True 
False 
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
27.
28.
29.
30.
31.
32.
33.

2.3 Series的算术操作

import pandas as pd 
import numpy as np 
 
s1 = pd.Series(np.random.randn(3), index=['a','b','c']) 
 
s2 = pd.Series(np.random.randn(3), index=['a','b','c']) 
print(s1+s2) 
print(s1-s2) 
print(s1*s2) 
print(s1/s2) 
OUT: 
a    0.236514 
b   -0.132153 
c    0.203186 
dtype: float64 
 
a    0.305397 
b   -1.474441 
c   -1.697982 
dtype: float64 
 
a   -0.009332 
b   -0.539128 
c   -0.710465 
dtype: float64 
 
a   -7.867120 
b   -1.196907 
c   -0.786252 
dtype: float64 
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
27.
28.
29.
30.

3.dataframe的操作

3.1 对象创建

In [70]:  data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],'year': [2000, 2001, 20 
    ...: 02, 2001, 2002],'pop': [1.5, 1.7, 3.6, 2.4, 2.9]} 
In [71]: data 
Out[71]:  
{'pop': [1.5, 1.7, 3.6, 2.4, 2.9], 
 'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 
 'year': [2000, 2001, 2002, 2001, 2002]} 
# 建立DataFrame对象 
In [72]: frame1 = DataFrame(data) 
# 红色部分为自动生成的索引 
In [73]: frame1 
Out[73]:  
   pop   state  year 
0  1.5    Ohio  2000 
1  1.7    Ohio  2001 
2  3.6    Ohio  2002 
3  2.4  Nevada  2001 
4  2.9  Nevada  2002 
 
>>> lista = [1,2,5,7] 
>>> listb = ['a','b','c','d'] 
>>> df = pd.DataFrame({'col1':lista,'col2':listb}) 
>>> df 
   col1 col2 
0     1    a 
1     2    b 
2     5    c 
3     7    d 
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
27.
28.

3.2 选择数据

In [1]: import numpy as np 
   ...: import pandas as 
   ...: df = pd.DataFrame 
 
In [2]: df 
Out[2]: 
    a   b   c 
0   0   2   4 
1   6   8  10 
2  12  14  16 
3  18  20  22 
4  24  26  28 
5  30  32  34 
6  36  38  40 
7  42  44  46 
8  48  50  52 
9  54  56  58 
 
In [3]: df.loc[0,'c'] 
Out[3]: 4 
 
In [4]: df.loc[1:4,['a','c']] 
Out[4]: 
    a   c 
1   6  10 
2  12  16 
3  18  22 
4  24  28 
In [5]: df.iloc[0,2] 
Out[5]: 4 
 
In [6]: df.iloc[1:4,[0,2]] 
Out[6]: 
    a   c 
1   6  10 
2  12  16 
3  18  22 
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
27.
28.
29.
30.
31.
32.
33.
34.
35.
36.
37.

3.3 函数应用

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), 
                     index=['Utah', 'Ohio', 'Texas', 'Oregon']) 
frame 
np.abs(frame) 
 
OUT： 
            b      d            e 
Utah 0.204708 0.478943 0.519439 
Ohio 0.555730 1.965781 1.393406 
Texas 0.092908 0.281746 0.769023 
Oregon 1.246435 1.007189 1.296221 
 
 
f = lambda x: x.max() - x.min() 
frame.apply(f) 
OUT： 
b    1.802165 
d    1.684034 
e    2.689627 
dtype: float64 
 
def f(x): 
    return pd.Series([x.min(), x.max()], index=['min', 'max']) 
frame.apply(f) 
 
            b d       e 
Utah -0.20 0.48 -0.52 
Ohio -0.56 1.97 1.39 
Texas 0.09 0.28 0.77 
Oregon 1.25 1.01 -1.30 
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
27.
28.
29.
30.

3.4 统计概述和计算

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], 
                   [np.nan, np.nan], [0.75, -1.3]], 
                  index=['a', 'b', 'c', 'd'], 
                  columns=['one', 'two']) 
df 
OUT: 
    one     two 
a 1.40 NaN 
b 7.10 -4.5 
c NaN     NaN 
d 0.75 -1.3 
 
 
df.info() 
df.describe() 
 
 
<class 'pandas.core.frame.DataFrame'> 
Index: 4 entries, a to d 
Data columns (total 2 columns): 
one    3 non-null float64 
two    2 non-null float64 
dtypes: float64(2) 
memory usage: 256.0+ bytes 
 
OUT： 
          one          two 
count 3.000000 2.000000 
mean 3.083333 -2.900000 
std 3.493685 2.262742 
min 0.750000 -4.500000 
25% 1.075000 -3.700000 
50% 1.400000 -2.900000 
75% 4.250000 -2.100000 
max 7.100000 -1.300000 
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
27.
28.
29.
30.
31.
32.
33.
34.
35.

3.5 数据读取

data = pd.read_csv('./dataset/HR.csv') 
data.info() 
 
out： 
<class 'pandas.core.frame.DataFrame'> 
RangeIndex: 14999 entries, 0 to 14998 
Data columns (total 10 columns): 
satisfaction_level       14999 non-null float64 
last_evaluation          14999 non-null float64 
number_project           14999 non-null int64 
average_montly_hours     14999 non-null int64 
time_spend_company       14999 non-null int64 
Work_accident            14999 non-null int64 
left                     14999 non-null int64 
promotion_last_5years    14999 non-null int64 
sales                    14999 non-null object 
salary                   14999 non-null object 
dtypes: float64(2), int64(6), object(2) 
memory usage: 1.1+ MB 
 
data = pd.read_csv('./dataset/movielens/movies.dat', header=None, names=['name', 'types'], sep='::', engine='python') 
data.head() 
OUT： 
                name types 
1 Toy Story (1995) Animation|Children's|Comedy 
2 Jumanji (1995) Adventure|Children's|Fantasy 
3 Grumpier Old Men (1995) Comedy|Romance 
4 Waiting to Exhale (1995) Comedy|Drama 
5 Father of the Bride Part II (1995) Comedy 
 
 
data = pd.read_excel('./dataset/my_excel.xlsx', sheet_name=1) 
data.head() 
ouput: 
        date H1 H2 H3 
0 2014-06-01 1 2 3 
1 2014-06-02 2 3 4 
2 2014-06-03 3 4 5 
3 2014-06-04 4 5 6 
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
27.
28.
29.
30.
31.
32.
33.
34.
35.
36.
37.
38.
39.

#4. Time- Series的操作

生成日期范围：

import pandas as pd 
pd.data_range('20190313',periods=10) 
 
OUT: 
DatetimeIndex(['2019-03-13', '2019-03-14', '2019-03-15', '2019-03-16', 
               '2019-03-17', '2019-03-18', '2019-03-19', '2019-03-20', 
               '2019-03-21', '2019-03-22'], 
              dtype='datetime64[ns]', freq='D') 
1.
2.
3.
4.
5.
6.
7.
8.

5. 绘图功能

ts = pd.DataFrame(np.random.randn(1000,4),index=pd.date_range('20180101',periods=1000),columns=list('abcd')) 
ts = ts.cumsum() 
ts.plot(figsize = (12,8)) 
plt.show() 
1.
2.
3.
4.