For example, I have a DataFrame like the following.
lineNum id name Cname score
1 001 Jack Math 99
2 002 Jack English 110
3 003 Jack Chinese 90
4 003 Jack Chinese 90
5 004 Tom Math Nan
6 005 Tom English 75
7 006 Tom Chinese 85
As you can see, I want to clean this data.
1) Delete the duplicate rows (lines 3 and 4).
2) Deal with the unreasonable value. In line 2, Jack's English score is 110, which is over the maximum of 100. I want to set his score to the mean of all students' English scores.
3) Deal with the NaN value. Tom's Math score is NaN. I want to change it to the mean of all students' Math scores.
I can do each requirement separately, but I don't know how to do all three together. Thanks!
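For reference, here is a minimal reconstruction of the sample data (a sketch; the dtypes are assumptions, and the ids are read as plain integers, matching the outputs shown in the answers below):
import numpy as np
import pandas as pd

# reconstruction of the sample table above
df = pd.DataFrame({
    'lineNum': [1, 2, 3, 4, 5, 6, 7],
    'id':      [1, 2, 3, 3, 4, 5, 6],
    'name':    ['Jack', 'Jack', 'Jack', 'Jack', 'Tom', 'Tom', 'Tom'],
    'Cname':   ['Math', 'English', 'Chinese', 'Chinese', 'Math', 'English', 'Chinese'],
    'score':   [99, 110, 90, 90, np.nan, 75, 85],
})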
Plan
Drop duplicates to start.
Use mask to null out scores greater than 100.
Filter the new dataframe and group by Cname to get the means.
Map the means and use them to fill the nulls.
d = df.drop_duplicates(['id', 'name', 'Cname'])
s0 = d.score
# null out scores greater than 100
s1 = s0.mask(s0 > 100)
# rows whose score is still valid after masking
m = s1.notnull()
# per-subject means of the valid scores, mapped back via Cname and used to fill the nulls
d.assign(score=s1.fillna(d.Cname.map(d[m].groupby('Cname').score.mean())))
lineNum id name Cname score
0 1 1 Jack Math 99.0
1 2 2 Jack English 75.0
2 3 3 Jack Chinese 90.0
4 5 4 Tom Math 99.0
5 6 5 Tom English 75.0
6 7 6 Tom Chinese 85.0
You can use:
import numpy as np

cols = ['id','name','Cname','score']
#remove duplicate rows by the listed columns
df = df.drop_duplicates(subset=cols)
#replace values > 100 with NaN
df.loc[df['score'] > 100, 'score'] = np.nan
#replace NaN with the mean for all students, per subject
df['score'] = df.groupby('Cname')['score'].transform(lambda x: x.fillna(x.mean()))
print (df)
lineNum id name Cname score
0 1 1 Jack Math 99.0
1 2 2 Jack English 75.0
2 3 3 Jack Chinese 90.0
4 5 4 Tom Math 99.0
5 6 5 Tom English 75.0
6 7 6 Tom Chinese 85.0
Alternative solution with mask for NaN:
cols = ['id','name','Cname','score']
df = df.drop_duplicates(subset=cols)
df['score'] = df['score'].mask(df['score'] > 100)
df['score'] = df.groupby('Cname')['score'].apply(lambda x: x.fillna(x.mean()))
print (df)
lineNum id name Cname score
0 1 1 Jack Math 99.0
1 2 2 Jack English 75.0
2 3 3 Jack Chinese 90.0
4 5 4 Tom Math 99.0
5 6 5 Tom English 75.0
6 7 6 Tom Chinese 85.0
You should consider `.apply(func)` if the data is not too big.
import pandas as pd

df = pd.read_table('sample.txt', delimiter=r'\s+', na_values='Nan')  # Your sample data
df = df.set_index('lineNum').drop_duplicates()

def deal_with(x):
    # replace out-of-range or missing scores with the subject mean of the other students
    if (x['score'] > 100.) or (pd.isnull(x['score'])):
        df_ = df[df['id'] != x['id']]
        x['score'] = df_.loc[df_['Cname'] == x['Cname'], 'score'].mean()
    return x

print(df.apply(deal_with, axis=1))
id name Cname score
lineNum
1 1 Jack Math 99.0
2 2 Jack English 75.0
3 3 Jack Chinese 90.0
5 4 Tom Math 99.0
6 5 Tom English 75.0
7 6 Tom Chinese 85.0
Related
I want to calculate the speed (m/s and km/h) using the Euclidean distance, based on positions (x, y in meters) and time (in seconds). I found a way to take into account the fact that each time a name appears for the first time in the dataframe, the speed is equal to NaN.
Problem: my dataframe is so large (> 1.5 million rows) that, when I run the code, it is still not done after more than 2 hours...
The code works with a shorter dataframe; the problem seems to be the length of the initial df.
Here is the simplified dataframe, followed by the code:
df
name time x y
0 Mary 0 17 15
1 Mary 1 18.5 16
2 Mary 2 21 18
3 Steve 0 12 16
4 Steve 1 10.5 14
5 Steve 2 8 13
6 Jane 0 15 16
7 Jane 1 17 17
8 Jane 2 18 19
from math import sqrt
import numpy as np

# calculating speeds:
for i in range(len(df)):
    if i >= 1:
        df.loc[i, 'speed (m/s)'] = sqrt((df.loc[i, 'x'] - df.loc[i-1, 'x'])**2
                                        + (df.loc[i, 'y'] - df.loc[i-1, 'y'])**2)
        df.loc[i, 'speed (km/h)'] = df.loc[i, 'speed (m/s)'] * 3.6

# each first time a name appears, speeds are equal to NaN:
first_indexes = []
names = df['name'].unique()
for j in names:
    a = df.index[df['name'] == j].tolist()
    if len(a) > 0:
        first_indexes.append(a[0])

for index in first_indexes:
    df.loc[index, 'speed (m/s)'] = np.nan
    df.loc[index, 'speed (km/h)'] = np.nan
Iterating over this dataframe takes way too long; I'm looking for a way to do this faster...
Thanks in advance for helping!
EDIT
df = pd.DataFrame([["Mary",0,17,15],
["Mary",1,18.5,16],
["Mary",2,21,18],
["Steve",0,12,16],
["Steve",1,10.5,14],
["Steve",2,8,13],
["Jane",0,15,16],
["Jane",1,17,17],
["Jane",2,18,19]],columns = [ "name","time","x","y" ])
You can apply the operations to all of the data at once, without loops, and then set missing values for the first row of each name (the data has to be sorted by name):
df['speed (m/s)'] = np.sqrt(df['x'].sub(df['x'].shift()).pow(2) +
                            df['y'].sub(df['y'].shift()).pow(2))
df['speed (km/h)'] = df['speed (m/s)']*3.6
cols = ['speed (m/s)','speed (km/h)']
df[cols] = df[cols].mask(~df['name'].duplicated())
print (df)
name time x y speed (m/s) speed (km/h)
0 Mary 0 17.0 15 NaN NaN
1 Mary 1 18.5 16 1.802776 6.489992
2 Mary 2 21.0 18 3.201562 11.525624
3 Steve 0 12.0 16 NaN NaN
4 Steve 1 10.5 14 2.500000 9.000000
5 Steve 2 8.0 13 2.692582 9.693297
6 Jane 0 15.0 16 NaN NaN
7 Jane 1 17.0 17 2.236068 8.049845
8 Jane 2 18.0 19 2.236068 8.049845
Try this:
from math import sqrt
import pandas as pd

df = pd.read_csv('data.csv')

def calculate_speed(s):
    return sqrt(s['dx']**2 + s['dy']**2)

df = df.join(df.groupby('name')[['x','y']].diff().rename({'x':'dx', 'y':'dy'}, axis=1))
df['speed (m/s)'] = df.apply(calculate_speed, axis=1)
df['speed (km/h)'] = df['speed (m/s)']*3.6
print(df)
If you want to work with a MultiIndex (which has many nice properties when working with dataframes with names and time indices), you could pivot your table to make name, x and y a column MultiIndex with time being the index:
dfp = df.pivot(index='time', columns=['name'])
Then you can easily calculate the speed for each name without having to check for np.NaN, duplicates or other invalid values:
speed_ms = np.sqrt((dfp['x'] - dfp['x'].shift(-1))**2 + (dfp['y'] - dfp['y'].shift(-1))**2).shift(1)
Now get the speed in km/h:
speed_kmh = speed_ms * 3.6
And turn both into a MultiIndex to make merging/concatenating the dataframes more explicit:
speed_ms.columns = pd.MultiIndex.from_product((['speed (m/s)'], speed_ms.columns))
speed_kmh.columns = pd.MultiIndex.from_product((['speed (km/h)'], speed_kmh.columns))
And finally concatenate the results to the dataframe. swaplevel makes all columns primarily indexable by the name, while sort_index sorts by the names:
dfp = pd.concat((dfp, speed_ms, speed_kmh), axis=1).swaplevel(1, 0, 1).sort_index(axis=1)
Now your dataframe looks like:
# Out[100]:
name Jane ... Steve
speed (km/h) speed (m/s) x y ... speed (km/h) speed (m/s) x y
time ...
0 NaN NaN 15.0 16 ... NaN NaN 12.0 16
1 8.049845 2.236068 17.0 17 ... 9.000000 2.500000 10.5 14
2 8.049845 2.236068 18.0 19 ... 9.693297 2.692582 8.0 13
[3 rows x 12 columns]
And you can easily index speeds and positions by names:
dfp['Mary']
#Out[107]:
speed (km/h) speed (m/s) x y
time
0 NaN NaN 17.0 15
1 6.489992 1.802776 18.5 16
2 11.525624 3.201562 21.0 18
With dfp.stack(0) you transform it back to your input-df style, while keeping the names as a second index level:
dfp.stack(0).sort_index(level=1)
# Out[104]:
speed (km/h) speed (m/s) x y
time name
0 Jane NaN NaN 15.0 16
Mary NaN NaN 17.0 15
Steve NaN NaN 12.0 16
1 Jane 8.049845 2.236068 17.0 17
Mary 6.489992 1.802776 18.5 16
Steve 9.000000 2.500000 10.5 14
2 Jane 8.049845 2.236068 18.0 19
Mary 11.525624 3.201562 21.0 18
Steve 9.693297 2.692582 8.0 13
dfp.stack(1), on the other hand, keeps the names as columns and moves the speeds, x and y into the index, as shown in the sketch below.
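A quick look at that orientation, reusing dfp from above:
# names stay as columns; the measures (speed, x, y) move into the index
print(dfp.stack(1).head())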
I have a dataset of random words and names and I am trying to group all of the similar words and names. So given the dataframe below:
Name ID Value
0 James 1 10
1 James 2 2 142
2 Bike 3 1
3 Bicycle 4 1197
4 James Marsh 5 12
5 Ants 6 54
6 Job 7 6
7 Michael 8 80007
8 Arm 9 47
9 Mike K 10 9
10 Michael k 11 1
My pseudocode would be something like:
import pandas as pd
from fuzzywuzzy import fuzz

minratio = 95
for idx1, name1 in df['Name'].iteritems():
    for idx2, name2 in df['Name'].iteritems():
        ratio = fuzz.WRatio(name1, name2)
        if ratio > minratio:
            grouped = df.groupby(['Name', 'ID'])['Value']\
                        .agg(Total_Value='sum', Group_Size='count')
This would then give me the desired output:
print(grouped)
Name ID Total_Value Group_Size
0 James 1 164 3 # All James' grouped
2 Bike 3 1198 2 # Bike's and Bicycles grouped
5 Ants 6 54 1
6 Job 7 6 1
7 Michael 8 80017 3 # Mike's and Michael's grouped
8 Arm 9 47 1
Obviously this doesn't work, and honestly, I am not sure if this is even possible, but this is what I'm trying to accomplish. Any advice that could get me on the right track would be useful.
Using affinity propagation clustering (not perfect but maybe a starting point):
import pandas as pd
import numpy as np
import io
from fuzzywuzzy import fuzz
from scipy import spatial
import sklearn.cluster
s="""Name ID Value
0 James 1 10
1 James 2 2 142
2 Bike 3 1
3 Bicycle 4 1197
4 James Marsh 5 12
5 Ants 6 54
6 Job 7 6
7 Michael 8 80007
8 Arm 9 47
9 Mike K 10 9
10 Michael k 11 1"""
df = pd.read_csv(io.StringIO(s), sep=r'\s\s+', engine='python')
names = df.Name.values
sim = spatial.distance.pdist(names.reshape((-1,1)), lambda x,y: fuzz.WRatio(x,y))
affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", random_state=None)
affprop.fit(spatial.distance.squareform(sim))
res = df.groupby(affprop.labels_).agg(
    Names=('Name', ','.join),
    First_ID=('ID', 'first'),
    Total_Value=('Value', 'sum'),
    Group_Size=('Value', 'count')
)
Result:
Names First_ID Total_Value Group_Size
0 James,James 2,James Marsh,Ants,Arm 1 265 5
1 Bike,Bicycle 3 1198 2
2 Job 7 6 1
3 Michael,Mike K,Michael k 8 80017 3
I have a DataFrame and I want to calculate the mean and the variance for each row, for each person. Moreover, there is a Date column, and the chronological order must be respected when calculating the mean and the variance; the dataframe is already sorted by date. The dates are just the number of days after the earliest date. The mean for a person's earliest date is simply the value in the Points column, and the variance should be NaN or 0. Then, for the second date, the mean should be the mean of the Points values for this date and the previous one. Here is my code to generate the dataframe:
import pandas as pd
import numpy as np
data=[["Al",0, 12],["Bob",2, 10],["Carl",5, 12],["Al",5, 5],["Bob",9, 2]
,["Al",22, 4],["Bob",22, 16],["Carl",33, 2],["Al",45, 7],["Bob",68, 4]
,["Al",72, 11],["Bob",79, 5]]
df= pd.DataFrame(data, columns=["Name", "Date", "Points"])
print(df)
Name Date Points
0 Al 0 12
1 Bob 2 10
2 Carl 5 12
3 Al 5 5
4 Bob 9 2
5 Al 22 4
6 Bob 22 16
7 Carl 33 2
8 Al 45 7
9 Bob 68 4
10 Al 72 11
11 Bob 79 5
Here is my code to obtain the mean and the variance:
df['Mean'] = df.apply(
lambda x: df[(df.Name == x.Name) & (df.Date < x.Date)].Points.mean(),
axis=1)
df['Variance'] = df.apply(
lambda x: df[(df.Name == x.Name)& (df.Date < x.Date)].Points.var(),
axis=1)
However, the mean is shifted by one row and the variance by two rows. The dataframe obtained when sorted by Name and Date is:
Name Date Points Mean Variance
0 Al 0 12 NaN NaN
3 Al 5 5 12.000000 NaN
5 Al 22 4 8.50000 24.500000
8 Al 45 7 7.000000 19.000000
10 Al 72 11 7.000000 12.666667
1 Bob 2 10 NaN NaN
4 Bob 9 2 10.000000 NaN
6 Bob 22 16 6.000000 32.000000
9 Bob 68 4 9.333333 49.333333
11 Bob 79 5 8.000000 40.000000
2 Carl 5 12 NaN NaN
7 Carl 33 2 12.000000 NaN
Instead, the dataframe should be as below:
Name Date Points Mean Variance
0 Al 0 12 12 NaN
3 Al 5 5 8.5 24.5
5 Al 22 4 7 19
8 Al 45 7 7 12.67
10 Al 72 11 7.8 ...
1 Bob 2 10 10 NaN
4 Bob 9 2 6 32
6 Bob 22 16 9.33 49.33
9 Bob 68 4 8 40
11 Bob 79 5 7.4 ...
2 Carl 5 12 12 NaN
7 Carl 33 2 7 50
What should I change?
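A minimal sketch of one possible fix, assuming the intent is an expanding (cumulative) mean and variance per name that includes the current row (reusing the df built above):
g = df.groupby('Name')['Points']
# expanding() includes the current row, unlike the Date < x.Date filter above
df['Mean'] = g.expanding().mean().reset_index(level=0, drop=True)
df['Variance'] = g.expanding().var().reset_index(level=0, drop=True)
print(df.sort_values(['Name', 'Date']))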
I'm trying to create a new variable as the mean of another numeric variable present in my database (mark1, type float).
Unfortunately, the result is a new column with all NaN values.
I still can't understand the reason why.
The code I wrote is the following:
df = pd.read_csv("students2.csv")
df.loc[:, 'mean_m1'] = pd.Series(np.mean(df['mark1']).mean(), index= df)
These are the first few rows after running the code:
df.head()
ID gender subject mark1 mark2 mark3 fres mean_m1
0 1 mm 1 17.0 20.0 15.0 neg NaN
1 2 f 2 24.0 330.0 23.0 pos NaN
2 3 FEMale 1 17.0 16.0 24.0 0 NaN
3 4 male 3 27.0 23.0 21.0 1 NaN
4 5 m 2 30.0 22.0 24.0 positive NaN
No error messages are printed.
Thanks so much!
You need GroupBy + transform with 'mean'.
For the data you have provided, this is trivially equal to mark1. You should probably map your genders to categories, e.g. M or F, as a preliminary step (see the sketch after the output below).
df['mean_m1'] = df.groupby('gender')['mark1'].transform('mean')
print(df)
ID gender subject mark1 mark2 mark3 fres mean_m1
0 1 mm 1 17.000 20.000 15.000 neg 17.000
1 2 f 2 24.000 330.000 23.000 pos 24.000
2 3 FEMale 1 17.000 16.000 24.000 0 17.000
3 4 male 3 27.000 23.000 21.000 1 27.000
4 5 m 2 30.000 22.000 24.000 positive 30.000
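For that gender normalization, here is a minimal self-contained sketch (the mapping dictionary is an assumption about what each raw label means):
import pandas as pd

# partial reconstruction of the sample rows shown above (only the columns needed here)
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'gender': ['mm', 'f', 'FEMale', 'male', 'm'],
    'mark1': [17.0, 24.0, 17.0, 27.0, 30.0],
})

# normalize the free-text gender labels to two categories before grouping
mapping = {'mm': 'M', 'm': 'M', 'male': 'M', 'f': 'F', 'female': 'F'}
df['gender'] = df['gender'].str.lower().map(mapping)

# per-gender mean of mark1, broadcast back to every row
df['mean_m1'] = df.groupby('gender')['mark1'].transform('mean')
print(df)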
I'll just start with the example and then break down what is happening.
This is a sample input:
DataFrame:
Name     No.       Test      Grade
Bob 2123320 Math Nan
Joe 2832883 English 90
John 2139300 Science 85
Bob 2123320 History 93
John 2234903 Math 99
Hopeful output:
Name        2139300                   2234903
Math English Science Math English Science
John 0 0 85 99 0 0
Like the title suggests, I am trying to apply multiple indexes. Basically, it starts by looking at each name and seeing how many distinct No.'s it has. In this case the threshold is at least 2 distinct No.'s (which is why only John is output and Joe/Bob are not).
Now, within each of these distinct No.'s, I have a specific subset of Tests I want to search for, in this case only {Math, English, Science}. For each of these tests, if the person in question took it under that No., there should be a grade. I would like that grade to be output for the test in question, and for the tests not taken by that person under that No., I would like some sort of simple marker (i.e., if the person only took Math on that day, output 0 for English and Science).
So in effect, it first indexes people by the number of distinct No.'s and groups them as such. It then indexes them by type of Test (of which I only want a subset). It finally assigns each person a value for the type of test they took, and for the ones they didn't, it simply outputs a 0.
It's similar to another problem I asked earlier:
Grouped Feature Matrix in Python #2- Follow Up
Except now instead of 1's and 0's I have another column with actual values that I would like to output.
Thank you.
EDIT: more sample data and expected output
Name     No.       Test      Grade
Bob 2123320 Math Nan
Joe 2832883 English 90
John 2139300 Science 85
Bob 2123320 History 93
John 2234903 Math 99
Bob 2932848 English 99
Name   2139300   2234903   2123320   2932848
M E S M E S M E S M E S
John 0 0 85 99 0 0 Nan Nan Nan Nan Nan Nan
Bob Nan Nan Nan Nan nan Nan 86 0 0 0 99 0
Let's first filter the dataframe to only those records you are concerned with:
df_out = df[df.groupby(['Name'])['No.'].transform(lambda x: x.nunique() > 1)]
Now, reshape the dataframe with set_index, unstack, and reindex:
df_out.set_index(['Name','No.','Test'])['Grade'].sum(level=[0,1,2])\
.unstack(-1, fill_value=0)\
.reindex(['Math','English','Science'], axis=1, fill_value=0)\
.unstack(-1, fill_value=0).swaplevel(0, 1, axis=1)\
.sort_index(1)
Output:
No. 2123320 2139300 2234903 2932848
Test English Math Science English Math Science English Math Science English Math Science
Name
Bob 0 0 0 0 0 0 0 0 0 99 0 0
John 0 0 0 0 0 85 0 99 0 0 0 0
You can use pivot_table:
In [11]: df.pivot_table(values="Grade", index=["Name"], columns=["No.", "Test"])
Out[11]:
No. 2123320 2139300 2234903 2832883
Test History Science Math English
Name
Bob 93.0 NaN NaN NaN
Joe NaN NaN NaN 90.0
John NaN 85.0 99.0 NaN
With the dropna flag to include all the NaN columns:
In [12]: df.pivot_table(values="Grade", index=["Name"], columns=["No.", "Test"], dropna=False)
Out[12]:
No. 2123320 2139300 2234903 2832883
Test English History Math Science English History Math Science English History Math Science English History Math Science
Name
Bob NaN 93.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Joe NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 90.0 NaN NaN NaN
John NaN NaN NaN NaN NaN NaN NaN 85.0 NaN NaN 99.0 NaN NaN NaN NaN NaN
And with fill_value=0:
In [13]: df.pivot_table(values="Grade", index=["Name"], columns=["No.", "Test"], dropna=False, fill_value=0)
Out[13]:
No. 2123320 2139300 2234903 2832883
Test English History Math Science English History Math Science English History Math Science English History Math Science
Name
Bob 0 93 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Joe 0 0 0 0 0 0 0 0 0 0 0 0 90 0 0 0
John 0 0 0 0 0 0 0 85 0 0 99 0 0 0 0 0
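To also enforce the question's requirement of at least two distinct No.'s per name, and to restrict the columns to Math/English/Science, the transform-based filter from the previous answer can be combined with pivot_table. A sketch using the sample data from the EDIT above:
import numpy as np
import pandas as pd

# sample data from the EDIT above
df = pd.DataFrame({
    'Name':  ['Bob', 'Joe', 'John', 'Bob', 'John', 'Bob'],
    'No.':   [2123320, 2832883, 2139300, 2123320, 2234903, 2932848],
    'Test':  ['Math', 'English', 'Science', 'History', 'Math', 'English'],
    'Grade': [np.nan, 90, 85, 93, 99, 99],
})

# keep only people with more than one distinct No.
filtered = df[df.groupby('Name')['No.'].transform('nunique') > 1]

# pivot, then keep only the subset of tests the question asks about
out = filtered.pivot_table(values='Grade', index='Name',
                           columns=['No.', 'Test'], dropna=False, fill_value=0)
out = out.loc[:, out.columns.get_level_values('Test').isin(['Math', 'English', 'Science'])]
print(out)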