I have a CSV file with some messy data, which I have read into the following pandas DataFrame:
     Name   Age     Sex  Salary   Status
0    John    32     Nan     NaN      NaN
1     Nan  Male    4000  Single      NaN
2     May    20  Female    5000  Married
3  teresa    45     NaN     NaN      NaN
Desired output:
     Name  Age     Sex  Salary   Status
0    John   32    Male    4000   Single
1     May   20  Female    5000  Married
2  teresa   45
Does anyone know how to do this with pandas?
You can use a bit of numpy magic to drop the NaNs and reshape the underlying array:
a = df.replace({'Nan': float('nan')}).values.flatten()
pd.DataFrame(a[~pd.isna(a)].reshape(-1, len(df.columns)),
             columns=df.columns)
Output:
   Name Age     Sex Salary   Status
0  John  32    Male   4000   Single
1   May  20  Female   5000  Married
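For reference, here is a self-contained sketch of the same approach (the inline frame is illustrative and uses real NaN values, so the replace step is not needed). Note that reshape raises a ValueError when the number of surviving values is not a multiple of the column count, as with the incomplete teresa row; in that case the groupby answer below is the safer route:

import numpy as np
import pandas as pd

# illustrative frame: each record's values are spread across two rows
df = pd.DataFrame({
    'Name':   ['John', np.nan, 'May'],
    'Age':    [32, np.nan, 20],
    'Sex':    [np.nan, 'Male', 'Female'],
    'Salary': [np.nan, 4000, 5000],
    'Status': [np.nan, 'Single', 'Married'],
})

a = df.values.flatten()              # 1-D array of all cells, row by row
out = pd.DataFrame(a[~pd.isna(a)]    # keep only the non-null cells
                   .reshape(-1, len(df.columns)),
                   columns=df.columns)
print(out)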
Try groupby:
>>> df.groupby(df['Name'].notna().cumsum()).apply(lambda x: x.apply(lambda x: next(iter(x.dropna()), np.nan))).reset_index(drop=True)
   Name Age     Sex  Salary   Status
0  John  32    4000  Single      NaN
1   May  20  Female    5000  Married
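Unpacked for readability, the one-liner does roughly the following (a sketch; first_valid and record_id are illustrative names, and df is the asker's frame):

import numpy as np

def first_valid(col):
    # first non-null value in a column, or NaN if the column is all-null
    return next(iter(col.dropna()), np.nan)

# every non-null Name starts a new logical record
record_id = df['Name'].notna().cumsum()
result = (df.groupby(record_id)
            .apply(lambda g: g.apply(first_valid))
            .reset_index(drop=True))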
Pandas DataFrame removing NaN rows based on condition.
I'm trying to remove the rows where gender == male and status == NaN.
Sample df:
   name  status  gender  leaves
0   tom     NaN    male       5
1   tom    True    male       6
2   tom    True    male       7
3  mary    True  female       1
4  mary     NaN  female      10
5  mary    True  female      15
6  john     NaN    male       2
7  mark    True    male       3
Expected Output:
   name  status  gender  leaves
0   tom    True    male       6
1   tom    True    male       7
2  mary    True  female       1
3  mary     NaN  female      10
4  mary    True  female      15
5  mark    True    male       3
You can use the isna (or isnull) function to get the rows where status is NaN. With this knowledge, you can filter your DataFrame using something like:

conditions = (df.gender == 'male') & (df.status.isna())
filtered_df = df[~conditions]
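If you also want the clean 0..5 index shown in the expected output, reset it afterwards (a small addition, assuming the sample df above):

filtered_df = df[~conditions].reset_index(drop=True)
print(filtered_df)   # matches the expected output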
Good one given by @Derlin. Another way I tried is to use fillna() to fill the NaNs with -1 and filter on that, just like below:
>>> df[~((df.fillna(-1)['status']==-1)&(df['gender']=='male'))]
Just for reference, the ~ operator is the same as numpy's np.logical_not(). So this (don't forget to import numpy as np) means the same:

df[np.logical_not((df.fillna(-1)['status'] == -1) & (df['gender'] == 'male'))]
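A quick way to convince yourself of that equivalence (a minimal check, assuming the sample df above):

import numpy as np

mask = (df.fillna(-1)['status'] == -1) & (df['gender'] == 'male')
# ~mask and np.logical_not(mask) produce the same boolean Series
assert (~mask).equals(np.logical_not(mask))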
Suppose I have two dataframes:
df1:
  Person  Number  Type
0   Kyle      12  Male
1  Jacob      15  Male
2  Jacob      15  Male
df2:
a much larger dataset in a similar format, except there is a Count column that needs to be incremented based on df1:
  Person  Number    Type  Count
0   Kyle      12    Male      0
1  Jacob      15    Male      0
3  Sally      43  Female      0
4   Mary      15  Female      5
What I am looking to do is increase the Count column based on the number of occurrences of the same person in df1. Expected output for this example:
  Person  Number    Type  Count
0   Kyle      12    Male      1
1  Jacob      15    Male      2
3  Sally      43  Female      0
4   Mary      15  Female      5
Increase the count to 1 for Kyle because there is one instance in df1, and to 2 for Jacob because there are two instances; leave the values for Sally and Mary unchanged.
How do I do this? I have tried using .loc, but I can't figure out how to account for two instances of the same row, meaning I can only get the count to increase by one for Jacob even though there are two Jacobs in df1.
I have tried
df2.loc[df2['Person'].values == df1['Person'].values, 'Count'] += 1
However, this does not account for duplicates.
df1 = df1.groupby(df1.columns.tolist()).size().to_frame('Count').reset_index()
df1 = df1.set_index(['Person','Number','Type'])
df2 = df2.set_index(['Person','Number','Type'])
df1.add(df2, fill_value=0).reset_index()
Or
df1 = df1.groupby(df1.columns.tolist()).size().to_frame('Count').reset_index()
(df2.merge(df1, on=['Person','Number','Type'], how='left')
    .set_index(['Person','Number','Type'])
    .sum(axis=1).to_frame('Count').reset_index())
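A note on how these work (my reading, not part of the original answer): groupby(...).size() collapses df1 to one Count per unique (Person, Number, Type), and add(..., fill_value=0) treats a key missing on either side as 0, so people absent from df1 keep their existing count. The counting step alone, run on the original df1, gives:

print (df1.groupby(df1.columns.tolist()).size().to_frame('Count').reset_index())
#   Person  Number  Type  Count
# 0  Jacob      15  Male      2
# 1   Kyle      12  Male      1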
value_counts + Index alignment.
u = df2.set_index("Person")
u.assign(Count=df1["Person"].value_counts().add(u["Count"], fill_value=0))
        Number    Type  Count
Person                       
Kyle        12    Male    1.0
Jacob       15    Male    2.0
Sally       43  Female    0.0
Mary        15  Female    5.0
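For comparison, a shorter map-based sketch of the same idea (my variant, not from the answers above; like value_counts, it assumes Person alone identifies a record):

counts = df1['Person'].value_counts()   # occurrences per person in df1
df2['Count'] = (df2['Count']
                + df2['Person'].map(counts).fillna(0).astype(int))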
I am trying to add a pandas.Series as a new row to a pandas.DataFrame. However, the Series is always added with its index entries appearing as individual rows.
How can we append it as a single row?
import pandas as pd

df = pd.DataFrame([
    ('Tom', 'male', 10),
    ('Jane', 'female', 7),
    ('Peter', 'male', 9),
], columns=['name', 'gender', 'age'])
df.set_index(['name'], inplace=True)
print(df)
       gender  age
name              
Tom      male   10
Jane   female    7
Peter    male    9
s = pd.Series(('Jon', 'male', 12), index=['name', 'gender', 'age'])
print(s)
name       Jon
gender    male
age         12
dtype: object
Expected Result
       gender  age
name              
Tom      male   10
Jane   female    7
Peter    male    9
Jon      male   12
Attempt #1
df2 = df.append(pd.DataFrame(s))
print(df2)
           0   age  gender
Tom      NaN  10.0    male
Jane     NaN   7.0  female
Peter    NaN   9.0    male
name     Jon   NaN     NaN
gender  male   NaN     NaN
age       12   NaN     NaN
Attempt #2
df2 = pd.concat([df, s], axis=0)
print(df2)
           0   age  gender
Tom      NaN  10.0    male
Jane     NaN   7.0  female
Peter    NaN   9.0    male
name     Jon   NaN     NaN
gender  male   NaN     NaN
age       12   NaN     NaN
Attempt #3
df2 = pd.concat([df, pd.DataFrame(s)], axis=0)
print(df2)
           0   age  gender
Tom      NaN  10.0    male
Jane     NaN   7.0  female
Peter    NaN   9.0    male
name     Jon   NaN     NaN
gender  male   NaN     NaN
age       12   NaN     NaN
This "works", but you may want to reconsider how you are building your dataframes in the first place. If you append data, do it all at once instead of row by row.
>>> pd.concat([df, s.to_frame().T.set_index('name')])
       gender  age
name              
Tom      male   10
Jane   female    7
Peter    male    9
Jon      male   12
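For clarity, that chain can be broken into its intermediate steps (a sketch; step and row are illustrative names):

step = s.to_frame()           # one column (named 0); index is name/gender/age
row = step.T                  # transpose: one row with columns name/gender/age
row = row.set_index('name')   # move 'name' into the index to match df
df2 = pd.concat([df, row])    # aligns on gender/age and appends a single row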
As a column of a dataframe, a Series is generally all the same data type (e.g. age). In this case, your series represents a single row of data for a given record, e.g. a row in a database with potentially mixed types. You may want to consider your series as a dataframe row instead.
row = pd.DataFrame({'gender': 'male', 'age': 12},
                   index=pd.Index(['Jon'], name='name'))
>>> pd.concat([df, row])
       gender  age
name              
Tom      male   10
Jane   female    7
Peter    male    9
Jon      male   12
Let's say I have a pandas DataFrame like this:
import pandas as pd
a = pd.Series({'Country': 'Italy', 'Name': 'Augustina', 'Gender': 'Female', 'Number': 1})
b = pd.Series({'Country': 'Italy', 'Name': 'Piero', 'Gender': 'Male', 'Number': 2})
c = pd.Series({'Country': 'Italy', 'Name': 'Carla', 'Gender': 'Female', 'Number': 3})
d = pd.Series({'Country': 'Italy', 'Name': 'Roma', 'Gender': 'Female', 'Number': 4})
e = pd.Series({'Country': 'Greece', 'Name': 'Sophia', 'Gender': 'Female', 'Number': 5})
f = pd.Series({'Country': 'Greece', 'Name': 'Zeus', 'Gender': 'Male', 'Number': 6})
df = pd.DataFrame([a, b, c, d, e, f])
Then I set a MultiIndex, like

df.set_index(['Country', 'Gender'], inplace=True)

Now I would like to know how to count how many people are from Italy, or how many Greek females I have in the dataframe.
I've tried
df['Italy'].count()
and
df['Greece']['Female'].count()
but neither of them works.
Thanks
I think you need groupby with the aggregation size:
What is the difference between size and count in pandas?
a = pd.DataFrame([{'Country': 'Italy', 'Name': 'Augustina', 'Gender': 'Female', 'Number': 1}])
b = pd.DataFrame([{'Country': 'Italy', 'Name': 'Piero', 'Gender': 'Male', 'Number': 2}])
c = pd.DataFrame([{'Country': 'Italy', 'Name': 'Carla', 'Gender': 'Female', 'Number': 3}])
d = pd.DataFrame([{'Country': 'Italy', 'Name': 'Roma', 'Gender': 'Female', 'Number': 4}])
e = pd.DataFrame([{'Country': 'Greece', 'Name': 'Sophia', 'Gender': 'Female', 'Number': 5}])
f = pd.DataFrame([{'Country': 'Greece', 'Name': 'Zeus', 'Gender': 'Male', 'Number': 6}])
df = pd.concat([a, b, c, d, e, f], ignore_index=True)
print (df)
  Country  Gender       Name  Number
0   Italy  Female  Augustina       1
1   Italy    Male      Piero       2
2   Italy  Female      Carla       3
3   Italy  Female       Roma       4
4  Greece  Female     Sophia       5
5  Greece    Male       Zeus       6
counts = df.groupby('Country').size()
print (counts)
Country
Greece    2
Italy     4
dtype: int64
counts = df.groupby(['Country', 'Gender']).size()
print (counts)
Country  Gender
Greece   Female    1
         Male      1
Italy    Female    3
         Male      1
dtype: int64
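Regarding the size vs count link above (a small aside, my addition): size counts rows, NaN included, while count counts non-null cells, so on this frame, which has no NaNs, they coincide:

print (df.groupby('Country')['Name'].count())
#Country
#Greece    2
#Italy     4
#Name: Name, dtype: int64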
If you need only some of the sizes, select by MultiIndex with xs or slicers:
df.set_index(['Country','Gender'],inplace=True)
print (df)
                     Name  Number
Country Gender                   
Italy   Female  Augustina       1
        Male        Piero       2
        Female      Carla       3
        Female       Roma       4
Greece  Female     Sophia       5
        Male         Zeus       6
print (df.xs('Italy', level='Country'))
             Name  Number
Gender                   
Female  Augustina       1
Male        Piero       2
Female      Carla       3
Female       Roma       4
print (len(df.xs('Italy', level='Country').index))
4
print (df.xs(('Greece', 'Female'), level=('Country', 'Gender')))
                   Name  Number
Country Gender                 
Greece  Female   Sophia       5
print (len(df.xs(('Greece', 'Female'), level=('Country', 'Gender')).index))
1
# without sort_index first, slicers raise:
# KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (0)'
df.sort_index(inplace=True)
idx = pd.IndexSlice
print (df.loc[idx['Italy', :],:])
                     Name  Number
Country Gender                   
Italy   Female  Augustina       1
        Female      Carla       3
        Female       Roma       4
        Male        Piero       2
print (len(df.loc[idx['Italy', :],:].index))
4
print (df.loc[idx['Greece', 'Female'],:])
                   Name  Number
Country Gender                 
Greece  Female   Sophia       5
print (len(df.loc[idx['Greece', 'Female'],:].index))
1
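As a footnote (my addition, not part of the original answer): once the MultiIndex is set, you can also count directly by index level:

print (df.groupby(level='Country').size())
print (df.groupby(level=['Country', 'Gender']).size())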
When I drop John as a duplicate, specifying 'name' as the column name:
import pandas as pd
data = {'name':['Bill','Steve','John','John','John'], 'age':[21,28,22,30,29]}
df = pd.DataFrame(data)
df = df.drop_duplicates('name')
pandas drops all matching rows, keeping only the left-most:
   age   name
0   21   Bill
1   28  Steve
2   22   John
Instead, I would like to keep the row where John's age is the highest (in this example, age 30). How can I achieve this?
Try this:
In [75]: df
Out[75]:
   age   name
0   21   Bill
1   28  Steve
2   22   John
3   30   John
4   29   John
In [76]: df.sort_values('age').drop_duplicates('name', keep='last')
Out[76]:
   age   name
0   21   Bill
1   28  Steve
3   30   John
or this, depending on your goals:
In [77]: df.drop_duplicates('name', keep='last')
Out[77]:
   age   name
0   21   Bill
1   28  Steve
4   29   John
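For reference, an equivalent way to keep each name's highest-age row is groupby + idxmax (my sketch, not part of the original answer):

# keep the row holding the maximum age for each name
df.loc[df.groupby('name')['age'].idxmax()].sort_index()

which, for the frame above, gives the same result as In [76]:

   age   name
0   21   Bill
1   28  Steve
3   30   John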