Pandas(Data Manipulation) in Machine Learning

12月 28, 2017

Pandas(Data Manipulation) in Machine Learning

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.


import pandas as pd
import numpy as np

The two primary data structures of pandas,

Series (1-dimensional)

one-dimensional labeled array

DataFrame (2-dimensional)

2-dimensional labeled data structure with columns of potentially different types

Create Series


s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s
Out[4]: 
a    0.4691
b   -0.2829
c   -1.5091
d   -1.1356
e    1.2121
dtype: float64

s.index
Out[5]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

pd.Series(np.random.randn(5))
Out[6]: 
0   -0.1732
1    0.1192
2   -1.0442
3   -0.8618
4   -2.1046
dtype: float64

Create DataFrame


columns = ['name', 'age', 'gender', 'job']
user1 = pd.DataFrame([['alice', 19, "F", "student"], ['john', 26, "M", "student"]], columns=columns)
user2 = pd.DataFrame([['eric', 22, "M", "student"], ['paul', 58, "F", "manager"]], columns=columns)
user3 = pd.DataFrame(dict(name=['peter', 'julie'], age=[33, 44], gender=['M', 'F'], job=['engineer', 'scientist']))
print(user3)

   age gender        job   name
0   33      M   engineer  peter
1   44      F  scientist  julie

Concatenate DataFrame


user1.append(user2)
Out[66]: 
    name  age gender      job
0  alice   19      F  student
1   john   26      M  student
0   eric   22      M  student
1   paul   58      F  manager

print(user1)
    name  age gender      job
0  alice   19      F  student
1   john   26      M  student

users = pd.concat([user1, user2, user3])

print(users)
   age gender        job   name
0   19      F    student  alice
1   26      M    student   john
0   22      M    student   eric
1   58      F    manager   paul
0   33      M   engineer  peter
1   44      F  scientist  julie

Join DataFrame


user4 = pd.DataFrame(dict(name=['alice', 'john', 'eric', 'julie'], height=[165, 180, 175, 171]))

print(user4)
   height   name
0     165  alice
1     180   john
2     175   eric
3     171  julie

Use intersection of keys from both frames


merge_inter = pd.merge(users, user4, on="name")

print(merge_inter)
   age gender        job   name  height
0   19      F    student  alice     165
1   26      M    student   john     180
2   22      M    student   eric     175
3   44      F  scientist  julie     171

Use union of keys from both frames


users = pd.merge(users, user4, on="name", how='outer')

 print(users)
   age gender        job   name  height
0   19      F    student  alice   165.0
1   26      M    student   john   180.0
2   22      M    student   eric   175.0
3   58      F    manager   paul     NaN
4   33      M   engineer  peter     NaN
5   44      F  scientist  julie   171.0

Drop DataFrame

pandas.DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise') return new object with labels in requested axis removed.
For examples,


 df = pd.DataFrame(np.arange(12).reshape(3,4), columns=['A', 'B', 'C', 'D'])
 df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

Drop columns


 df.drop(['B', 'C'], axis=1)
   A   D
0  0   3
1  4   7
2  8  11

 df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11

Drop a row by index


 df.drop([0, 1])
   A  B   C   D
2  8  9  10  11

Summarizing

Examine the users data


users # print the first 30 and last 30 rows
Out[77]: 
   age gender        job   name  height
0   19      F    student  alice   165.0
1   26      M    student   john   180.0
2   22      M    student   eric   175.0
3   58      F    manager   paul     NaN
4   33      M   engineer  peter     NaN
5   44      F  scientist  julie   171.0

type(users)
Out[78]: pandas.core.frame.DataFrame

users.head() # print the first 5 rows
Out[79]: 
   age gender       job   name  height
0   19      F   student  alice   165.0
1   26      M   student   john   180.0
2   22      M   student   eric   175.0
3   58      F   manager   paul     NaN
4   33      M  engineer  peter     NaN

users.tail() # print the last 5 rows
Out[80]: 
   age gender        job   name  height
1   26      M    student   john   180.0
2   22      M    student   eric   175.0
3   58      F    manager   paul     NaN
4   33      M   engineer  peter     NaN
5   44      F  scientist  julie   171.0

print(users.describe()) # summarize all numeric columns

             age      height
count   6.000000    4.000000
mean   33.666667  172.750000
std    14.895189    6.344289
min    19.000000  165.000000
25%    23.000000  169.500000
50%    29.500000  173.000000
75%    41.250000  176.250000
max    58.000000  180.000000


users.index # "the index" (aka "the labels")
Out[83]: Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')

users.columns # column names (which is "an index")
Out[84]: Index(['age', 'gender', 'job', 'name', 'height'], dtype='object')

users.dtypes # data types of each column
Out[85]: 
age         int64
gender     object
job        object
name       object
height    float64
dtype: object

users.shape # number of rows and columns
Out[86]: (6, 5)

users.values # underlying numpy array
Out[87]: 
array([[19, 'F', 'student', 'alice', 165.0],
       [26, 'M', 'student', 'john', 180.0],
       [22, 'M', 'student', 'eric', 175.0],
       [58, 'F', 'manager', 'paul', nan],
       [33, 'M', 'engineer', 'peter', nan],
       [44, 'F', 'scientist', 'julie', 171.0]], dtype=object)

users.info() # concise summary (includes memory usage as of pandas 0.15.0)

Int64Index: 6 entries, 0 to 5
Data columns (total 5 columns):
age       6 non-null int64
gender    6 non-null object
job       6 non-null object
name      6 non-null object
height    4 non-null float64
dtypes: float64(1), int64(1), object(3)
memory usage: 288.0+ bytes

summarize all columns


print(users.describe(include='all'))
              age gender      job   name      height
count    6.000000      6        6      6    4.000000
unique        NaN      2        4      6         NaN
top           NaN      F  student  peter         NaN
freq          NaN      3        3      1         NaN
mean    33.666667    NaN      NaN    NaN  172.750000
std     14.895189    NaN      NaN    NaN    6.344289
min     19.000000    NaN      NaN    NaN  165.000000
25%     23.000000    NaN      NaN    NaN  169.500000
50%     29.500000    NaN      NaN    NaN  173.000000
75%     41.250000    NaN      NaN    NaN  176.250000
max     58.000000    NaN      NaN    NaN  180.000000

print(users.describe(include=['object'])) # limit to one (or more) types
       gender      job   name
count       6        6      6
unique      2        4      6
top         F  student  peter
freq        3        3

Column Selection


users['gender'] # select one column
Out[91]: 
0    F
1    M
2    M
3    F
4    M
5    F
Name: gender, dtype: object

type(users['gender']) # Series
Out[92]: pandas.core.series.Series

users.gender # select one column using the DataFrame

# select multiple columns
users[['age', 'gender']] # select two columns
Out[94]: 
   age gender
0   19      F
1   26      M
2   22      M
3   58      F
4   33      M
5   44      F

my_cols = ['age', 'gender'] # or, create a list then use that list to select columns
users[my_cols]
type(users[my_cols]) # DataFrame

Rows Selection


# iloc is strictly integer position based
df = users.copy()
df.iloc[0] # first row
Out[99]: 
age            19
gender          F
job       student
name        alice
height        165
Name: 0, dtype: object

df.iloc[0, 0] # first item of first row 
Out[100]: 19

df.iloc[0, 0] = 55
users.shape
Out[101]: (6, 5)

for i in range(users.shape[0]):
    row = df.iloc[i]
    print(row.age)

Filtering


# simple logical filtering
users[users.age < 20] # only show users with age < 20
Out[111]: 
   age gender      job   name  height
0   19      F  student  alice   165.0

young_bool = users.age < 20 # or, create a Series of booleans...
Out[113]: 
0     True
1    False
2    False
3    False
4    False
5    False
Name: age, dtype: bool

young = users[young_bool] # ...and use that Series to filter rows 
Out[115]: 
   age gender      job   name  height
0   19      F  student  alice   165.0

users[users.age < 20].job # select one column from the filtered results print(young)
Out[116]: 
0    student
Name: job, dtype: object





Sorting

df = users.copy()
   age gender        job   name
0   19      F    student  alice
1   26      M    student   john
0   22      M    student   eric
1   58      F    manager   paul
0   33      M   engineer  peter
1   44      F  scientist  julie

df.age.sort_values() # only works for a Series 
Out[9]: 
0    19
0    22
1    26
0    33
1    44
1    58
Name: age, dtype: int64

df.sort_values(by='age') # sort rows by a specific column 
df.sort_values(by='age', ascending=False) # use descending order instead 
df.sort_values(by=['job', 'age']) # sort by multiple columns 
Out[13]: 
   age gender        job   name
0   33      M   engineer  peter
1   58      F    manager   paul
1   44      F  scientist  julie
0   19      F    student  alice
0   22      M    student   eric
1   26      M    student   john

df.sort_values(by=['job', 'age'], inplace=True) # modify df
print(df)


Reshaping Data
Converting or transforming data from one format to another.



The following can be thought as the same format:

stacked format, because the individual observations are stacked on top of each other.
record format, because each row is a single record, i.e. a single observation.
long format, because this format will be long in the vertical direction as opposed to wide in the horizontal direction.


The pivot method of the dataframe can be used to view interested comparison according to an index:


df.pivot(index='job', columns='name', values='age')
Out[19]: 
name       alice  eric  john  julie  paul  peter
job                                             
engineer     NaN   NaN   NaN    NaN   NaN   33.0
manager      NaN   NaN   NaN    NaN  58.0    NaN
scientist    NaN   NaN   NaN   44.0   NaN    NaN
student     19.0  22.0  26.0    NaN   NaN    NaN


The above format of this table can be referred to as:

wide format, because the table is now wider rather than longer.
unstacked format, because the individual observations (one person/one date) are no longer stacked on top of each other.


Quality Check


Rename Values


dealing with Outliers


Groupby



for grp, data in users.groupby("job"):
    print(grp, data)
    
engineer    age gender       job   name
0   33      M  engineer  peter
manager    age gender      job  name
1   58      F  manager  paul
scientist    age gender        job   name
1   44      F  scientist  julie
student    age gender      job   name
0   19      F  student  alice
1   26      M  student   john
0   22      M  student   eric


File I/O
csv
Save dataframe as csv in the tmp working directory


import tempfile, os.path
tmpdir = tempfile.gettempdir()
csv_filename = os.path.join(tmpdir, "users.csv") 
users.to_csv(csv_filename, index=False) 


$ cat /var/folders/v3/r84ymcjn5zd_vlhtdp6c7h940000gn/T/users.csv


age,gender,job,name
19,F,student,alice
26,M,student,john
22,M,student,eric
58,F,manager,paul
33,M,engineer,peter
44,F,scientist,julie
Load a csv


readin_csv = pd.read_csv(csv_filename)
readin_csv
Out[34]: 
   age gender        job   name
0   19      F    student  alice
1   26      M    student   john
2   22      M    student   eric
3   58      F    manager   paul
4   33      M   engineer  peter
5   44      F  scientist  julie
Load a csv with no headers


df = pd.read_csv(csv_filename, header=None)
df
Out[37]: 
     0       1          2      3
0  age  gender        job   name
1   19       F    student  alice
2   26       M    student   john
3   22       M    student   eric
4   58       F    manager   paul
5   33       M   engineer  peter
6   44       F  scientist  julie


Excel


Visualization 
Reference



Refer the matplotlib API:


 import matplotlib.pyplot as plt


The plot method on Series and DataFrame is just a simple wrapper around plt.plot():


ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()


On DataFrame, plot() is a convenience to plot all of the columns with labels:


df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list('ABCD'))
df = df.cumsum()
plt.figure(); df.plot();


You can plot one column versus another using the x and y keywords in plot():


df3 = pd.DataFrame(np.random.randn(1000, 2), columns=['B', 'C']).cumsum()
df3['A'] = pd.Series(list(range(len(df))))
df3.plot(x='A', y='B')




Univariate Statistics


Testing pairwise associations
In statistics, a categorical variable or factor is a variable that can take on one of a limited, and usually fixed, number of possible values.

For example, Sex (Female, Male), Hair color (blonde, brown, etc.).



pandas.DataFrame.corr(method='pearson', min_periods=1)


Compute pairwise correlation of columns, excluding NA/null values

Parameters
method : {‘pearson’, ‘kendall’, ‘spearman’} pearson : standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation

min_periods : int, optional Minimum number of observations required per pair of columns to have a valid result. Currently only available for pearson and spearman correlation
Returns
y : DataFrame 


Example:


  import pandas as pd
  df = pd.DataFrame({'A': range(4), 'B': [2*i for i in range(4)]})

   A  B
0  0  0
1  1  2
2  2  4
3  3  6
Then


  df['A'].corr(df['B'])
gives 1 as expected.

Now, if you change a value, e.g.


  df.loc[2, 'B'] = 4.5

   A    B
0  0  0.0
1  1  2.0
2  2  4.5
3  3  6.0
the command


  df['A'].corr(df['B'])


returns


  0.99586
which is still close to 1, as expected.



If you apply .corr directly to your dataframe, it will return all pairwise correlations between your columns



Multivariate Statistics












取得連結





Facebook





X





Pinterest





以電子郵件傳送





其他應用程式







電子郵件發布




標籤：
machine learning
python




留言




張貼留言



熱門文章











7月 28, 2015





A Tutorial on the Device Tree 










取得連結





Facebook





X





Pinterest





以電子郵件傳送





其他應用程式








1 則留言
















3月 31, 2021






Linux Modem Manager










取得連結





Facebook





X





Pinterest





以電子郵件傳送





其他應用程式








1 則留言

搜尋此網誌

I'm Jay's father

Pandas(Data Manipulation) in Machine Learning

Pandas(Data Manipulation) in Machine Learning

Create Series

Create DataFrame

Concatenate DataFrame

Join DataFrame

Drop DataFrame

Summarizing

Column Selection

Rows Selection

Filtering

Sorting

Reshaping Data

Quality Check

Rename Values

dealing with Outliers

Groupby

File I/O

csv

Excel

Visualization

Univariate Statistics

Testing pairwise associations

pandas.DataFrame.corr(method='pearson', min_periods=1)

Multivariate Statistics

留言

熱門文章

A Tutorial on the Device Tree

Linux Modem Manager