Pandas(Data Manipulation) in Machine Learning
Pandas(Data Manipulation) in Machine Learning
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.
import pandas as pd
import numpy as np
The two primary data structures of pandas,
- Series (1-dimensional) Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call pd.Series
- DataFrame (2-dimensional) DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It stems from the R data.frame() object.
Create Series
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s
Out[4]:
a 0.4691
b -0.2829
c -1.5091
d -1.1356
e 1.2121
dtype: float64
s.index
Out[5]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
pd.Series(np.random.randn(5))
Out[6]:
0 -0.1732
1 0.1192
2 -1.0442
3 -0.8618
4 -2.1046
dtype: float64
Create DataFrame
columns = ['name', 'age', 'gender', 'job']
user1 = pd.DataFrame([['alice', 19, "F", "student"], ['john', 26, "M", "student"]], columns=columns)
user2 = pd.DataFrame([['eric', 22, "M", "student"], ['paul', 58, "F", "manager"]], columns=columns)
user3 = pd.DataFrame(dict(name=['peter', 'julie'], age=[33, 44], gender=['M', 'F'], job=['engineer', 'scientist']))
print(user3)
age gender job name
0 33 M engineer peter
1 44 F scientist julie
Concatenate DataFrame
user1.append(user2)
Out[66]:
name age gender job
0 alice 19 F student
1 john 26 M student
0 eric 22 M student
1 paul 58 F manager
print(user1)
name age gender job
0 alice 19 F student
1 john 26 M student
users = pd.concat([user1, user2, user3])
print(users)
age gender job name
0 19 F student alice
1 26 M student john
0 22 M student eric
1 58 F manager paul
0 33 M engineer peter
1 44 F scientist julie
Join DataFrame
user4 = pd.DataFrame(dict(name=['alice', 'john', 'eric', 'julie'], height=[165, 180, 175, 171]))
print(user4)
height name
0 165 alice
1 180 john
2 175 eric
3 171 julie
Use intersection of keys from both frames
merge_inter = pd.merge(users, user4, on="name")
print(merge_inter)
age gender job name height
0 19 F student alice 165
1 26 M student john 180
2 22 M student eric 175
3 44 F scientist julie 171
Use union of keys from both frames
users = pd.merge(users, user4, on="name", how='outer')
print(users)
age gender job name height
0 19 F student alice 165.0
1 26 M student john 180.0
2 22 M student eric 175.0
3 58 F manager paul NaN
4 33 M engineer peter NaN
5 44 F scientist julie 171.0
Drop DataFrame
pandas.DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise') return new object with labels in requested axis removed.
For examples,
df = pd.DataFrame(np.arange(12).reshape(3,4), columns=['A', 'B', 'C', 'D'])
df
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Drop columns
df.drop(['B', 'C'], axis=1)
A D
0 0 3
1 4 7
2 8 11
df.drop(columns=['B', 'C'])
A D
0 0 3
1 4 7
2 8 11
Drop a row by index
df.drop([0, 1])
A B C D
2 8 9 10 11
Summarizing
Examine the users data
users # print the first 30 and last 30 rows
Out[77]:
age gender job name height
0 19 F student alice 165.0
1 26 M student john 180.0
2 22 M student eric 175.0
3 58 F manager paul NaN
4 33 M engineer peter NaN
5 44 F scientist julie 171.0
type(users)
Out[78]: pandas.core.frame.DataFrame
users.head() # print the first 5 rows
Out[79]:
age gender job name height
0 19 F student alice 165.0
1 26 M student john 180.0
2 22 M student eric 175.0
3 58 F manager paul NaN
4 33 M engineer peter NaN
users.tail() # print the last 5 rows
Out[80]:
age gender job name height
1 26 M student john 180.0
2 22 M student eric 175.0
3 58 F manager paul NaN
4 33 M engineer peter NaN
5 44 F scientist julie 171.0
print(users.describe()) # summarize all numeric columns
age height
count 6.000000 4.000000
mean 33.666667 172.750000
std 14.895189 6.344289
min 19.000000 165.000000
25% 23.000000 169.500000
50% 29.500000 173.000000
75% 41.250000 176.250000
max 58.000000 180.000000
users.index # "the index" (aka "the labels")
Out[83]: Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
users.columns # column names (which is "an index")
Out[84]: Index(['age', 'gender', 'job', 'name', 'height'], dtype='object')
users.dtypes # data types of each column
Out[85]:
age int64
gender object
job object
name object
height float64
dtype: object
users.shape # number of rows and columns
Out[86]: (6, 5)
users.values # underlying numpy array
Out[87]:
array([[19, 'F', 'student', 'alice', 165.0],
[26, 'M', 'student', 'john', 180.0],
[22, 'M', 'student', 'eric', 175.0],
[58, 'F', 'manager', 'paul', nan],
[33, 'M', 'engineer', 'peter', nan],
[44, 'F', 'scientist', 'julie', 171.0]], dtype=object)
users.info() # concise summary (includes memory usage as of pandas 0.15.0)
Int64Index: 6 entries, 0 to 5
Data columns (total 5 columns):
age 6 non-null int64
gender 6 non-null object
job 6 non-null object
name 6 non-null object
height 4 non-null float64
dtypes: float64(1), int64(1), object(3)
memory usage: 288.0+ bytes
summarize all columns
print(users.describe(include='all'))
age gender job name height
count 6.000000 6 6 6 4.000000
unique NaN 2 4 6 NaN
top NaN F student peter NaN
freq NaN 3 3 1 NaN
mean 33.666667 NaN NaN NaN 172.750000
std 14.895189 NaN NaN NaN 6.344289
min 19.000000 NaN NaN NaN 165.000000
25% 23.000000 NaN NaN NaN 169.500000
50% 29.500000 NaN NaN NaN 173.000000
75% 41.250000 NaN NaN NaN 176.250000
max 58.000000 NaN NaN NaN 180.000000
print(users.describe(include=['object'])) # limit to one (or more) types
gender job name
count 6 6 6
unique 2 4 6
top F student peter
freq 3 3
Column Selection
users['gender'] # select one column
Out[91]:
0 F
1 M
2 M
3 F
4 M
5 F
Name: gender, dtype: object
type(users['gender']) # Series
Out[92]: pandas.core.series.Series
users.gender # select one column using the DataFrame
# select multiple columns
users[['age', 'gender']] # select two columns
Out[94]:
age gender
0 19 F
1 26 M
2 22 M
3 58 F
4 33 M
5 44 F
my_cols = ['age', 'gender'] # or, create a list then use that list to select columns
users[my_cols]
type(users[my_cols]) # DataFrame
Rows Selection
# iloc is strictly integer position based
df = users.copy()
df.iloc[0] # first row
Out[99]:
age 19
gender F
job student
name alice
height 165
Name: 0, dtype: object
df.iloc[0, 0] # first item of first row
Out[100]: 19
df.iloc[0, 0] = 55
users.shape
Out[101]: (6, 5)
for i in range(users.shape[0]):
row = df.iloc[i]
print(row.age)
Filtering
# simple logical filtering
users[users.age < 20] # only show users with age < 20
Out[111]:
age gender job name height
0 19 F student alice 165.0
young_bool = users.age < 20 # or, create a Series of booleans...
Out[113]:
0 True
1 False
2 False
3 False
4 False
5 False
Name: age, dtype: bool
young = users[young_bool] # ...and use that Series to filter rows
Out[115]:
age gender job name height
0 19 F student alice 165.0
users[users.age < 20].job # select one column from the filtered results print(young)
Out[116]:
0 student
Name: job, dtype: object
Sorting
df = users.copy()
age gender job name
0 19 F student alice
1 26 M student john
0 22 M student eric
1 58 F manager paul
0 33 M engineer peter
1 44 F scientist julie
df.age.sort_values() # only works for a Series
Out[9]:
0 19
0 22
1 26
0 33
1 44
1 58
Name: age, dtype: int64
df.sort_values(by='age') # sort rows by a specific column
df.sort_values(by='age', ascending=False) # use descending order instead
df.sort_values(by=['job', 'age']) # sort by multiple columns
Out[13]:
age gender job name
0 33 M engineer peter
1 58 F manager paul
1 44 F scientist julie
0 19 F student alice
0 22 M student eric
1 26 M student john
df.sort_values(by=['job', 'age'], inplace=True) # modify df
print(df)
Reshaping Data
Converting or transforming data from one format to another.The following can be thought as the same format:
- stacked format, because the individual observations are stacked on top of each other.
- record format, because each row is a single record, i.e. a single observation.
- long format, because this format will be long in the vertical direction as opposed to wide in the horizontal direction.
The pivot method of the dataframe can be used to view interested comparison according to an index:
df.pivot(index='job', columns='name', values='age')
Out[19]:
name alice eric john julie paul peter
job
engineer NaN NaN NaN NaN NaN 33.0
manager NaN NaN NaN NaN 58.0 NaN
scientist NaN NaN NaN 44.0 NaN NaN
student 19.0 22.0 26.0 NaN NaN NaN
The above format of this table can be referred to as:
- wide format, because the table is now wider rather than longer.
- unstacked format, because the individual observations (one person/one date) are no longer stacked on top of each other.
Quality Check
Rename Values
dealing with Outliers
Groupby

for grp, data in users.groupby("job"):
print(grp, data)
engineer age gender job name
0 33 M engineer peter
manager age gender job name
1 58 F manager paul
scientist age gender job name
1 44 F scientist julie
student age gender job name
0 19 F student alice
1 26 M student john
0 22 M student eric
File I/O
csv
Save dataframe as csv in the tmp working directory
import tempfile, os.path
tmpdir = tempfile.gettempdir()
csv_filename = os.path.join(tmpdir, "users.csv")
users.to_csv(csv_filename, index=False)
$ cat /var/folders/v3/r84ymcjn5zd_vlhtdp6c7h940000gn/T/users.csv
age,gender,job,name
19,F,student,alice
26,M,student,john
22,M,student,eric
58,F,manager,paul
33,M,engineer,peter
44,F,scientist,julie
Load a csv
readin_csv = pd.read_csv(csv_filename)
readin_csv
Out[34]:
age gender job name
0 19 F student alice
1 26 M student john
2 22 M student eric
3 58 F manager paul
4 33 M engineer peter
5 44 F scientist julie
Load a csv with no headers
df = pd.read_csv(csv_filename, header=None)
df
Out[37]:
0 1 2 3
0 age gender job name
1 19 F student alice
2 26 M student john
3 22 M student eric
4 58 F manager paul
5 33 M engineer peter
6 44 F scientist julie
Excel
Visualization
ReferenceRefer the matplotlib API:
import matplotlib.pyplot as plt
The plot method on Series and DataFrame is just a simple wrapper around plt.plot():
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
On DataFrame, plot() is a convenience to plot all of the columns with labels:
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list('ABCD'))
df = df.cumsum()
plt.figure(); df.plot();
You can plot one column versus another using the x and y keywords in plot():
df3 = pd.DataFrame(np.random.randn(1000, 2), columns=['B', 'C']).cumsum()
df3['A'] = pd.Series(list(range(len(df))))
df3.plot(x='A', y='B')
Univariate Statistics
Testing pairwise associations
In statistics, a categorical variable or factor is a variable that can take on one of a limited, and usually fixed, number of possible values.For example, Sex (Female, Male), Hair color (blonde, brown, etc.).
pandas.DataFrame.corr(method='pearson', min_periods=1)
Compute pairwise correlation of columns, excluding NA/null values
- Parameters method : {‘pearson’, ‘kendall’, ‘spearman’}
- pearson : standard correlation coefficient
- kendall : Kendall Tau correlation coefficient
- spearman : Spearman rank correlation
- Returns y : DataFrame
Example:
import pandas as pd
df = pd.DataFrame({'A': range(4), 'B': [2*i for i in range(4)]})
A B
0 0 0
1 1 2
2 2 4
3 3 6
Then
df['A'].corr(df['B'])
gives 1 as expected.Now, if you change a value, e.g.
df.loc[2, 'B'] = 4.5
A B
0 0 0.0
1 1 2.0
2 2 4.5
3 3 6.0
the command
df['A'].corr(df['B'])
returns
0.99586
which is still close to 1, as expected.If you apply .corr directly to your dataframe, it will return all pairwise correlations between your columns
留言