Pandas(Data Manipulation) in Machine Learning

Pandas(Data Manipulation) in Machine Learning


pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.


import pandas as pd
import numpy as np

The two primary data structures of pandas,
  • Series (1-dimensional)
  • Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call pd.Series
  • DataFrame (2-dimensional)
  • DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It stems from the R data.frame() object.

Create Series



s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s
Out[4]: 
a    0.4691
b   -0.2829
c   -1.5091
d   -1.1356
e    1.2121
dtype: float64

s.index
Out[5]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

pd.Series(np.random.randn(5))
Out[6]: 
0   -0.1732
1    0.1192
2   -1.0442
3   -0.8618
4   -2.1046
dtype: float64


Create DataFrame


columns = ['name', 'age', 'gender', 'job']
user1 = pd.DataFrame([['alice', 19, "F", "student"], ['john', 26, "M", "student"]], columns=columns)
user2 = pd.DataFrame([['eric', 22, "M", "student"], ['paul', 58, "F", "manager"]], columns=columns)
user3 = pd.DataFrame(dict(name=['peter', 'julie'], age=[33, 44], gender=['M', 'F'], job=['engineer', 'scientist']))
print(user3)

   age gender        job   name
0   33      M   engineer  peter
1   44      F  scientist  julie

Concatenate DataFrame



user1.append(user2)
Out[66]: 
    name  age gender      job
0  alice   19      F  student
1   john   26      M  student
0   eric   22      M  student
1   paul   58      F  manager

print(user1)
    name  age gender      job
0  alice   19      F  student
1   john   26      M  student

users = pd.concat([user1, user2, user3])

print(users)
   age gender        job   name
0   19      F    student  alice
1   26      M    student   john
0   22      M    student   eric
1   58      F    manager   paul
0   33      M   engineer  peter
1   44      F  scientist  julie

Join DataFrame


user4 = pd.DataFrame(dict(name=['alice', 'john', 'eric', 'julie'], height=[165, 180, 175, 171]))

print(user4)
   height   name
0     165  alice
1     180   john
2     175   eric
3     171  julie

Use intersection of keys from both frames

merge_inter = pd.merge(users, user4, on="name")

print(merge_inter)
   age gender        job   name  height
0   19      F    student  alice     165
1   26      M    student   john     180
2   22      M    student   eric     175
3   44      F  scientist  julie     171

Use union of keys from both frames

users = pd.merge(users, user4, on="name", how='outer')

 print(users)
   age gender        job   name  height
0   19      F    student  alice   165.0
1   26      M    student   john   180.0
2   22      M    student   eric   175.0
3   58      F    manager   paul     NaN
4   33      M   engineer  peter     NaN
5   44      F  scientist  julie   171.0

Drop DataFrame


pandas.DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise') return new object with labels in requested axis removed.
For examples,

 df = pd.DataFrame(np.arange(12).reshape(3,4), columns=['A', 'B', 'C', 'D'])
 df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

Drop columns

 df.drop(['B', 'C'], axis=1)
   A   D
0  0   3
1  4   7
2  8  11

 df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11

Drop a row by index

 df.drop([0, 1])
   A  B   C   D
2  8  9  10  11

Summarizing


Examine the users data

users # print the first 30 and last 30 rows
Out[77]: 
   age gender        job   name  height
0   19      F    student  alice   165.0
1   26      M    student   john   180.0
2   22      M    student   eric   175.0
3   58      F    manager   paul     NaN
4   33      M   engineer  peter     NaN
5   44      F  scientist  julie   171.0

type(users)
Out[78]: pandas.core.frame.DataFrame

users.head() # print the first 5 rows
Out[79]: 
   age gender       job   name  height
0   19      F   student  alice   165.0
1   26      M   student   john   180.0
2   22      M   student   eric   175.0
3   58      F   manager   paul     NaN
4   33      M  engineer  peter     NaN

users.tail() # print the last 5 rows
Out[80]: 
   age gender        job   name  height
1   26      M    student   john   180.0
2   22      M    student   eric   175.0
3   58      F    manager   paul     NaN
4   33      M   engineer  peter     NaN
5   44      F  scientist  julie   171.0

print(users.describe()) # summarize all numeric columns

             age      height
count   6.000000    4.000000
mean   33.666667  172.750000
std    14.895189    6.344289
min    19.000000  165.000000
25%    23.000000  169.500000
50%    29.500000  173.000000
75%    41.250000  176.250000
max    58.000000  180.000000


users.index # "the index" (aka "the labels")
Out[83]: Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')

users.columns # column names (which is "an index")
Out[84]: Index(['age', 'gender', 'job', 'name', 'height'], dtype='object')

users.dtypes # data types of each column
Out[85]: 
age         int64
gender     object
job        object
name       object
height    float64
dtype: object

users.shape # number of rows and columns
Out[86]: (6, 5)

users.values # underlying numpy array
Out[87]: 
array([[19, 'F', 'student', 'alice', 165.0],
       [26, 'M', 'student', 'john', 180.0],
       [22, 'M', 'student', 'eric', 175.0],
       [58, 'F', 'manager', 'paul', nan],
       [33, 'M', 'engineer', 'peter', nan],
       [44, 'F', 'scientist', 'julie', 171.0]], dtype=object)

users.info() # concise summary (includes memory usage as of pandas 0.15.0)

Int64Index: 6 entries, 0 to 5
Data columns (total 5 columns):
age       6 non-null int64
gender    6 non-null object
job       6 non-null object
name      6 non-null object
height    4 non-null float64
dtypes: float64(1), int64(1), object(3)
memory usage: 288.0+ bytes


summarize all columns

print(users.describe(include='all'))
              age gender      job   name      height
count    6.000000      6        6      6    4.000000
unique        NaN      2        4      6         NaN
top           NaN      F  student  peter         NaN
freq          NaN      3        3      1         NaN
mean    33.666667    NaN      NaN    NaN  172.750000
std     14.895189    NaN      NaN    NaN    6.344289
min     19.000000    NaN      NaN    NaN  165.000000
25%     23.000000    NaN      NaN    NaN  169.500000
50%     29.500000    NaN      NaN    NaN  173.000000
75%     41.250000    NaN      NaN    NaN  176.250000
max     58.000000    NaN      NaN    NaN  180.000000

print(users.describe(include=['object'])) # limit to one (or more) types
       gender      job   name
count       6        6      6
unique      2        4      6
top         F  student  peter
freq        3        3    

Column Selection


users['gender'] # select one column
Out[91]: 
0    F
1    M
2    M
3    F
4    M
5    F
Name: gender, dtype: object

type(users['gender']) # Series
Out[92]: pandas.core.series.Series

users.gender # select one column using the DataFrame

# select multiple columns
users[['age', 'gender']] # select two columns
Out[94]: 
   age gender
0   19      F
1   26      M
2   22      M
3   58      F
4   33      M
5   44      F

my_cols = ['age', 'gender'] # or, create a list then use that list to select columns
users[my_cols]
type(users[my_cols]) # DataFrame

Rows Selection



# iloc is strictly integer position based
df = users.copy()
df.iloc[0] # first row
Out[99]: 
age            19
gender          F
job       student
name        alice
height        165
Name: 0, dtype: object

df.iloc[0, 0] # first item of first row 
Out[100]: 19

df.iloc[0, 0] = 55
users.shape
Out[101]: (6, 5)

for i in range(users.shape[0]):
    row = df.iloc[i]
    print(row.age)


Filtering


# simple logical filtering
users[users.age < 20] # only show users with age < 20
Out[111]: 
   age gender      job   name  height
0   19      F  student  alice   165.0

young_bool = users.age < 20 # or, create a Series of booleans...
Out[113]: 
0     True
1    False
2    False
3    False
4    False
5    False
Name: age, dtype: bool

young = users[young_bool] # ...and use that Series to filter rows 
Out[115]: 
   age gender      job   name  height
0   19      F  student  alice   165.0

users[users.age < 20].job # select one column from the filtered results print(young)
Out[116]: 
0    student
Name: job, dtype: object


Sorting


df = users.copy()
   age gender        job   name
0   19      F    student  alice
1   26      M    student   john
0   22      M    student   eric
1   58      F    manager   paul
0   33      M   engineer  peter
1   44      F  scientist  julie

df.age.sort_values() # only works for a Series 
Out[9]: 
0    19
0    22
1    26
0    33
1    44
1    58
Name: age, dtype: int64

df.sort_values(by='age') # sort rows by a specific column 
df.sort_values(by='age', ascending=False) # use descending order instead 
df.sort_values(by=['job', 'age']) # sort by multiple columns 
Out[13]: 
   age gender        job   name
0   33      M   engineer  peter
1   58      F    manager   paul
1   44      F  scientist  julie
0   19      F    student  alice
0   22      M    student   eric
1   26      M    student   john

df.sort_values(by=['job', 'age'], inplace=True) # modify df
print(df)

Reshaping Data

Converting or transforming data from one format to another.

The following can be thought as the same format:
  • stacked format, because the individual observations are stacked on top of each other.
  • record format, because each row is a single record, i.e. a single observation.
  • long format, because this format will be long in the vertical direction as opposed to wide in the horizontal direction.

The pivot method of the dataframe can be used to view interested comparison according to an index:

df.pivot(index='job', columns='name', values='age')
Out[19]: 
name       alice  eric  john  julie  paul  peter
job                                             
engineer     NaN   NaN   NaN    NaN   NaN   33.0
manager      NaN   NaN   NaN    NaN  58.0    NaN
scientist    NaN   NaN   NaN   44.0   NaN    NaN
student     19.0  22.0  26.0    NaN   NaN    NaN

The above format of this table can be referred to as:
  • wide format, because the table is now wider rather than longer.
  • unstacked format, because the individual observations (one person/one date) are no longer stacked on top of each other.

Quality Check


Rename Values


dealing with Outliers


Groupby



for grp, data in users.groupby("job"):
    print(grp, data)
    
engineer    age gender       job   name
0   33      M  engineer  peter
manager    age gender      job  name
1   58      F  manager  paul
scientist    age gender        job   name
1   44      F  scientist  julie
student    age gender      job   name
0   19      F  student  alice
1   26      M  student   john
0   22      M  student   eric

File I/O

csv

Save dataframe as csv in the tmp working directory

import tempfile, os.path
tmpdir = tempfile.gettempdir()
csv_filename = os.path.join(tmpdir, "users.csv") 
users.to_csv(csv_filename, index=False) 

$ cat /var/folders/v3/r84ymcjn5zd_vlhtdp6c7h940000gn/T/users.csv

age,gender,job,name
19,F,student,alice
26,M,student,john
22,M,student,eric
58,F,manager,paul
33,M,engineer,peter
44,F,scientist,julie
Load a csv

readin_csv = pd.read_csv(csv_filename)
readin_csv
Out[34]: 
   age gender        job   name
0   19      F    student  alice
1   26      M    student   john
2   22      M    student   eric
3   58      F    manager   paul
4   33      M   engineer  peter
5   44      F  scientist  julie
Load a csv with no headers

df = pd.read_csv(csv_filename, header=None)
df
Out[37]: 
     0       1          2      3
0  age  gender        job   name
1   19       F    student  alice
2   26       M    student   john
3   22       M    student   eric
4   58       F    manager   paul
5   33       M   engineer  peter
6   44       F  scientist  julie

Excel


Visualization

Reference

Refer the matplotlib API:

 import matplotlib.pyplot as plt

The plot method on Series and DataFrame is just a simple wrapper around plt.plot():

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()

On DataFrame, plot() is a convenience to plot all of the columns with labels:

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list('ABCD'))
df = df.cumsum()
plt.figure(); df.plot();

You can plot one column versus another using the x and y keywords in plot():

df3 = pd.DataFrame(np.random.randn(1000, 2), columns=['B', 'C']).cumsum()
df3['A'] = pd.Series(list(range(len(df))))
df3.plot(x='A', y='B')


Univariate Statistics


Testing pairwise associations

In statistics, a categorical variable or factor is a variable that can take on one of a limited, and usually fixed, number of possible values.
For example, Sex (Female, Male), Hair color (blonde, brown, etc.).

pandas.DataFrame.corr(method='pearson', min_periods=1)


Compute pairwise correlation of columns, excluding NA/null values
  • Parameters
  • method : {‘pearson’, ‘kendall’, ‘spearman’}
    • pearson : standard correlation coefficient
    • kendall : Kendall Tau correlation coefficient
    • spearman : Spearman rank correlation
    min_periods : int, optional Minimum number of observations required per pair of columns to have a valid result. Currently only available for pearson and spearman correlation
  • Returns
  • y : DataFrame

Example:

  import pandas as pd
  df = pd.DataFrame({'A': range(4), 'B': [2*i for i in range(4)]})

   A  B
0  0  0
1  1  2
2  2  4
3  3  6
Then

  df['A'].corr(df['B'])
gives 1 as expected.
Now, if you change a value, e.g.

  df.loc[2, 'B'] = 4.5

   A    B
0  0  0.0
1  1  2.0
2  2  4.5
3  3  6.0
the command

  df['A'].corr(df['B'])

returns

  0.99586
which is still close to 1, as expected.

If you apply .corr directly to your dataframe, it will return all pairwise correlations between your columns

Multivariate Statistics




留言

熱門文章