I am working on Python Pandas.
I have a pandas dataframe with columns like this:
ID  Cities
1   New York
1   ''
1   Atlanta
2   Tokyo
2   Kyoto
2   ''
3   Paris
3   Bordeaux
3   ''
4   Mumbai
4   ''
4   Bangalore
5   London
5   ''
5   Bermingham
Note that the empty cells in the column may be an empty string (''), NaN, or None. (For simplicity, let's just say they are empty strings ('').)
And I want the result to be like this:
ID  Cities
1   New York, Atlanta
2   Tokyo, Kyoto
3   Paris, Bordeaux
4   Mumbai, Bangalore
5   London, Bermingham
In short, I want to group by ID and then get the list of cities with the empty strings removed.
I have sample code for this, but the result it gives still contains the empty strings, which I want to remove.
(dataFrame.groupby(['ID'], as_index=False)
.agg({'Cities': lambda x: x.tolist()}))
It gives me result like this:
ID  Cities
1   New York, ,Atlanta
2   Tokyo, Kyoto,
3   Paris, Bordeaux,
4   Mumbai, , Bangalore
5   London, , Bermingham
But I don't want the empty strings.
Please help me here.
Thank you so much for your help.
You can try replacing the empty strings with NaN and then adding .dropna() to the aggregate lambda function, as follows:
import numpy as np
df['Cities'] = df['Cities'].replace('', np.nan)
(df.groupby('ID', as_index=False)
.agg({'Cities': lambda x: x.dropna().tolist()})
)
Result:
ID Cities
0 1 [New York, Atlanta]
1 2 [Tokyo, Kyoto]
2 3 [Paris, Bordeaux]
3 4 [Mumbai, Bangalore]
4 5 [London, Bermingham]
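The question shows comma-separated strings rather than lists and mentions that blanks may be '', NaN, or None. A minimal sketch under those assumptions follows; the small sample frame and the names `keep` and `result` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption): blanks appear as '', None, and NaN.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "Cities": ["New York", "", "Atlanta", "Tokyo", None, np.nan],
})

# Keep only rows whose city is neither missing nor an empty string.
cities = df["Cities"]
keep = cities.notna() & cities.ne("")

# Join the remaining cities of each ID into one comma-separated string.
result = (
    df.loc[keep]
    .groupby("ID")["Cities"]
    .agg(", ".join)
    .reset_index()
)
print(result)
```

Because the blank rows are dropped before grouping, an ID whose cities are all blank would not appear in the result at all.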
We can also perform the operations at the Series level: mask out the unneeded values such as empty strings (''), call dropna to remove the missing values, then group by ID and aggregate into whatever type is needed, such as a list:
new_df = (
df['Cities']
.mask(df['Cities'].eq("")) # Replace Empty String with NaN
.dropna() # Exclude NaN
.groupby(df['ID']) # Groupby ID
.aggregate(list) # Join Into List
.reset_index() # Convert Back to DataFrame
)
Or filter out unneeded rows by condition:
new_df = (
# Filter out by condition
df.loc[df['Cities'].ne("") & df['Cities'].notnull(), 'Cities']
.groupby(df['ID']) # Groupby ID
.aggregate(list) # Join Into List
.reset_index() # Convert Back to DataFrame
)
new_df:
ID Cities
0 1 [New York, Atlanta]
1 2 [Tokyo, Kyoto]
2 3 [Paris, Bordeaux]
3 4 [Mumbai, Bangalore]
4 5 [London, Bermingham]
Setup:
import pandas as pd
df = pd.DataFrame({
'ID': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5],
'Cities': ['New York', "", 'Atlanta', 'Tokyo', 'Kyoto', "", 'Paris',
'Bordeaux', "", 'Mumbai', "", 'Bangalore', 'London', "",
'Bermingham']
})
I have two dataframes of a format similar to below:
df1:
0 fname lname note
1 abby ross note1
2 rob william note2
3 abby ross note3
4 john doe note4
5 bob dole note5
df2:
0 fname lname note
1 abby ross note6
2 rob william note4
I want to merge them by finding matches on fname and lname, and then update the note column in the first DataFrame with the note column from the second DataFrame.
The result I am trying to achieve would be like this:
0 fname lname note
1 abby ross note6
2 rob william note4
3 abby ross note6
4 john doe note4
5 bob dole note5
This is the code I was working with so far:
pd.merge(df1, df2, on=['fname', 'lname'], how='left')
but it just creates a new column with _y appended to its name. How can I get it to just update the existing column?
Any help would be greatly appreciated, thanks!
You can merge and then correct the values, preferring the note from df2 wherever a match exists:
df_3 = pd.merge(df1, df2, on=['fname', 'lname'], how='left')
# note_x comes from df1, note_y from df2 (NaN where df2 has no match)
df_3['note'] = df_3['note_y'].fillna(df_3['note_x'])
df_3 = df_3.drop(['note_x', 'note_y'], axis=1)
Do what you are doing:
out_df = pd.merge(df1, df2, on=['fname', 'lname'], how='left')
then:
# fill NaN values in note_y with the original notes (assign the result back)
out_df['note_y'] = out_df['note_y'].fillna(out_df['note_x'])
# Keep cols you want
out_df = out_df[['fname', 'lname', 'note_y']]
# rename the columns
out_df.columns = ['fname', 'lname', 'note']
I don't like this approach a whole lot, as it isn't very scalable or generalizable. Waiting for a stellar answer to this question.
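One way to avoid the manual renaming is to control the suffixes during the merge. The sketch below rebuilds the question's two frames and assumes the key pairs in df2 are unique (otherwise the left merge would duplicate rows); the `_new` suffix and the names `merged` and `result` are illustrative choices.

```python
import pandas as pd

df1 = pd.DataFrame({
    "fname": ["abby", "rob", "abby", "john", "bob"],
    "lname": ["ross", "william", "ross", "doe", "dole"],
    "note": ["note1", "note2", "note3", "note4", "note5"],
})
df2 = pd.DataFrame({
    "fname": ["abby", "rob"],
    "lname": ["ross", "william"],
    "note": ["note6", "note4"],
})

keys = ["fname", "lname"]

# Keep df1's column name as "note"; df2's note arrives as "note_new".
merged = df1.merge(df2, on=keys, how="left", suffixes=("", "_new"))

# Prefer the new note where a match exists, otherwise keep the old one.
merged["note"] = merged["note_new"].fillna(merged["note"])
result = merged.drop(columns="note_new")
print(result)
```

The same pattern extends to several updated columns by looping over the columns that received the `_new` suffix.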
Try with update
df1=df1.set_index(['fname','lname'])
df1.update(df2.set_index(['fname','lname']))
df1=df1.reset_index()
df1
Out[55]:
fname lname 0 note
0 abby ross 1.0 note6
1 rob william 2.0 note4
2 john doe 3.0 note3
3 bob dole 4.0 note4
I have a df with US citizens state and I would like to use that as a lookup for world citizens
df1=
[Sam, New York;
Nick, California;
Sarah, Texas]
df2 =
[Sam;
Phillip;
Will;
Sam]
I would like to either df2.replace() with the states or create df3 where my output is:
[New York;
NaN;
NaN;
New York]
I have tried mapping with set_index and dict(zip()) but have had no luck so far.
Thank you.
How about this method:
import pandas as pd
df1 = pd.DataFrame([['Sam','New York'],['Nick','California'],['Sarah','Texas']],\
columns = ['name','state'])
display(df1)
df2 = pd.DataFrame(['Sam','Phillip','Will','Sam'],\
columns = ['name'])
display(df2)
df2.merge(right=df1,left_on='name',right_on='name',how='left')
resulting in
name state
0 Sam New York
1 Nick California
2 Sarah Texas
name
0 Sam
1 Phillip
2 Will
3 Sam
name state
0 Sam New York
1 Phillip NaN
2 Will NaN
3 Sam New York
You can then select just the state column from the merged DataFrame.
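Since the question mentions trying a mapping with set_index, here is a small sketch of that route using Series.map. It assumes the names in df1 are unique, so that they can serve as a lookup index; `lookup` is an illustrative name.

```python
import pandas as pd

df1 = pd.DataFrame({
    "name": ["Sam", "Nick", "Sarah"],
    "state": ["New York", "California", "Texas"],
})
df2 = pd.DataFrame({"name": ["Sam", "Phillip", "Will", "Sam"]})

# Build a name -> state lookup Series (requires unique names in df1).
lookup = df1.set_index("name")["state"]

# Names missing from the lookup become NaN.
df2["state"] = df2["name"].map(lookup)
print(df2)
```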
I have data like this:
Id Title Fname lname email
1 meeting with Jay, Aj Jay kay jk#something.com
1 meeting with Jay, Aj Aj xyz aj#something.com
2 call with Steve Steve Jack st#something.com
2 call with Steve Harvey Ray h#something.com
3 lunch Mike Mil Mike m#something.com
For each unique Id, I want to remove the first names and last names from Title.
I tried grouping by Id, which gives Series objects for Title, Fname, lname, etc.:
df.groupby('Id')
I concatenated Fname with .agg(lambda x: x.sum() if x.dtype == 'float64' else ','.join(x))
and kept the result in a DataFrame named concated.
All other columns get aggregated the same way. The question is how to replace values in Title based on these aggregated Series.
concated['newTitle'] = [ concated.Title.str.replace(e[0]).replace(e[1]).replace(e[1])
for e in
zip(concated.FName.str.split(','), concated.LName.str.split(','))
]
I want something like this, or some other way, to get a newTitle with the names removed for each Id.
The output should look like:
Id Title
1 Meeting with ,
2 call with
3 lunch
Create a mapper Series by joining Fname and lname into one regex pattern per Id, then use it with replace:
s = df.groupby('Id')[['Fname', 'lname']].apply(lambda x: '|'.join(x.stack()))
df.set_index('Id')['Title'].replace(s, '', regex = True).drop_duplicates()
Id
1 meeting with ,
2 call with
3 lunch
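For comparison, a more explicit sketch of the same idea builds one regular expression per Id and applies it with re.sub. The frame below is a reading of the question's table and is an assumption, in particular that the last row has Title "lunch Mike", Fname "Mil", and lname "Mike". Names are passed through re.escape so that any regex metacharacters in them are matched literally.

```python
import re
import pandas as pd

# Assumed reconstruction of the question's data (email column omitted).
df = pd.DataFrame({
    "Id": [1, 1, 2, 2, 3],
    "Title": ["meeting with Jay, Aj", "meeting with Jay, Aj",
              "call with Steve", "call with Steve", "lunch Mike"],
    "Fname": ["Jay", "Aj", "Steve", "Harvey", "Mil"],
    "lname": ["kay", "xyz", "Jack", "Ray", "Mike"],
})

new_titles = {}
for id_, group in df.groupby("Id"):
    # All first and last names that belong to this Id.
    names = pd.concat([group["Fname"], group["lname"]])
    pattern = "|".join(re.escape(name) for name in names)
    # The title is the same on every row of the group; take the first.
    title = group["Title"].iloc[0]
    new_titles[id_] = re.sub(pattern, "", title).strip()

result = pd.Series(new_titles, name="Title").rename_axis("Id").reset_index()
print(result)
```

The match is case-sensitive and not anchored to word boundaries, so a short name could also be removed from inside a longer word; wrapping the pattern in `\b` markers would be one way to tighten that.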
I'm starting with input data like this
df1 = pandas.DataFrame( {
"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
"City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )
Which when printed appears as this:
City Name
0 Seattle Alice
1 Seattle Bob
2 Portland Mallory
3 Seattle Mallory
4 Seattle Bob
5 Portland Mallory
Grouping is simple enough:
g1 = df1.groupby( [ "Name", "City"] ).count()
and printing yields a GroupBy object:
City Name
Name City
Alice Seattle 1 1
Bob Seattle 2 2
Mallory Portland 2 2
Seattle 1 1
But what I want eventually is another DataFrame object that contains all the rows in the GroupBy object. In other words I want to get the following result:
City Name
Name City
Alice Seattle 1 1
Bob Seattle 2 2
Mallory Portland 2 2
Mallory Seattle 1 1
I can't quite see how to accomplish this in the pandas documentation. Any hints would be welcome.
g1 here is a DataFrame. It has a hierarchical index, though:
In [19]: type(g1)
Out[19]: pandas.core.frame.DataFrame
In [20]: g1.index
Out[20]:
MultiIndex([('Alice', 'Seattle'), ('Bob', 'Seattle'), ('Mallory', 'Portland'),
('Mallory', 'Seattle')], dtype=object)
Perhaps you want something like this?
In [21]: g1.add_suffix('_Count').reset_index()
Out[21]:
Name City City_Count Name_Count
0 Alice Seattle 1 1
1 Bob Seattle 2 2
2 Mallory Portland 2 2
3 Mallory Seattle 1 1
Or something like:
In [36]: DataFrame({'count' : df1.groupby( [ "Name", "City"] ).size()}).reset_index()
Out[36]:
Name City count
0 Alice Seattle 1
1 Bob Seattle 2
2 Mallory Portland 2
3 Mallory Seattle 1
I want to slightly change the answer given by Wes, because version 0.16.2 requires as_index=False. If you don't set it, you get an empty dataframe.
Source:
Aggregation functions will not return the groups that you are aggregating over if they are named columns, when as_index=True, the default. The grouped columns will be the indices of the returned object.
Passing as_index=False will return the groups that you are aggregating over, if they are named columns.
Aggregating functions are ones that reduce the dimension of the returned objects, for example: mean, sum, size, count, std, var, sem, describe, first, last, nth, min, max. This is what happens when you do for example DataFrame.sum() and get back a Series.
nth can act as a reducer or a filter, see here.
import pandas as pd
df1 = pd.DataFrame({"Name":["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"],
"City":["Seattle","Seattle","Portland","Seattle","Seattle","Portland"]})
print(df1)
#
# City Name
#0 Seattle Alice
#1 Seattle Bob
#2 Portland Mallory
#3 Seattle Mallory
#4 Seattle Bob
#5 Portland Mallory
#
g1 = df1.groupby(["Name", "City"], as_index=False).count()
print(g1)
#
# City Name
#Name City
#Alice Seattle 1 1
#Bob Seattle 2 2
#Mallory Portland 2 2
# Seattle 1 1
#
EDIT:
In version 0.17.1 and later you can use subset in count and reset_index with parameter name in size:
print(df1.groupby(["Name", "City"], as_index=False).count())
#IndexError: list index out of range
print(df1.groupby(["Name", "City"]).count())
#Empty DataFrame
#Columns: []
#Index: [(Alice, Seattle), (Bob, Seattle), (Mallory, Portland), (Mallory, Seattle)]
print(df1.groupby(["Name", "City"])[['Name','City']].count())
# Name City
#Name City
#Alice Seattle 1 1
#Bob Seattle 2 2
#Mallory Portland 2 2
# Seattle 1 1
print(df1.groupby(["Name", "City"]).size().reset_index(name='count'))
# Name City count
#0 Alice Seattle 1
#1 Bob Seattle 2
#2 Mallory Portland 2
#3 Mallory Seattle 1
The difference between count and size is that size counts NaN values while count does not.
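A tiny sketch makes the size/count difference concrete. The data here is made up for illustration: one of Alice's two rows has a missing Score.

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption): Alice has one missing Score.
df = pd.DataFrame({
    "Name": ["Alice", "Alice", "Bob"],
    "Score": [1.0, np.nan, 3.0],
})
grouped = df.groupby("Name")

# size() counts rows per group, including rows with NaN.
sizes = grouped.size().reset_index(name="size")

# count() counts only the non-missing values of the selected column.
counts = grouped["Score"].count().reset_index(name="count")

print(sizes)
print(counts)
```

Here Alice has a size of 2 but a count of 1, while Bob has 1 for both.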
The key is to use the reset_index() method.
Use:
import pandas
df1 = pandas.DataFrame( {
"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
"City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )
g1 = df1.groupby( [ "Name", "City"] ).count().reset_index()
Now you have your new dataframe in g1.
Simply, this should do the task:
import pandas as pd
grouped_df = df1.groupby( [ "Name", "City"] )
pd.DataFrame(grouped_df.size().reset_index(name = "Group_Count"))
Here, grouped_df.size() returns the number of rows in each group, and reset_index(name=...) turns the group keys back into columns and gives the count column the name you want.
Finally, the pandas DataFrame() constructor is called to create a DataFrame object.
Maybe I misunderstand the question, but if you want to convert the groupby result back to a DataFrame you can use .to_frame(). I wanted to reset the index when I did this, so I included that part as well.
Example code (unrelated to the question's data):
df = df['TIME'].groupby(df['Name']).min()
df = df.to_frame()
df = df.reset_index()  # 'Name' is the only index level; 'TIME' is already a column
I found this worked for me.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({
"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
"City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"]})
df1['City_count'] = 1
df1['Name_count'] = 1
df1.groupby(['Name', 'City'], as_index=False).count()
Below solution may be simpler:
df1.reset_index().groupby( [ "Name", "City"],as_index=False ).count()
I aggregated the data by summing Qty per group and stored the result in a DataFrame:
almo_grp_data = pd.DataFrame({'Qty_cnt' :
almo_slt_models_data.groupby( ['orderDate','Item','State Abv']
)['Qty'].sum()}).reset_index()
This returns the ordinal levels/indices in the same order as a vanilla groupby() method. It's basically the same as the answer @NehalJWani posted in his comment, but stored in a variable with the reset_index() method called on it.
fare_class = df.groupby(['Satisfaction Rating','Fare Class']).size().to_frame(name = 'Count')
fare_class.reset_index()
This version not only returns the same data with percentages which is useful for stats, but also includes a lambda function.
fare_class_percent = df.groupby(['Satisfaction Rating', 'Fare Class']).size().to_frame(name = 'Percentage')
fare_class_percent.transform(lambda x: 100 * x/x.sum()).reset_index()
Satisfaction Rating Fare Class Percentage
0 Dissatisfied Business 14.624269
1 Dissatisfied Economy 36.469048
2 Satisfied Business 5.460425
3 Satisfied Economy 33.235294
Example:
These solutions only partially worked for me because I was doing multiple aggregations. Here is a sample output of my grouped by that I wanted to convert to a dataframe:
Because I wanted more than the count provided by reset_index(), I wrote a manual method for converting the image above into a dataframe. I understand this is not the most pythonic/pandas way of doing this as it is quite verbose and explicit, but it was all I needed. Basically, use the reset_index() method explained above to start a "scaffolding" dataframe, then loop through the group pairings in the grouped dataframe, retrieve the indices, perform your calculations against the ungrouped dataframe, and set the value in your new aggregated dataframe.
df_grouped = df[['Salary Basis', 'Job Title', 'Hourly Rate', 'Male Count', 'Female Count']]
df_grouped = df_grouped.groupby(['Salary Basis', 'Job Title'], as_index=False)
# Grouped gives us the indices we want for each grouping
# We cannot convert a groupedby object back to a dataframe, so we need to do it manually
# Create a new dataframe to work against
df_aggregated = df_grouped.size().to_frame('Total Count').reset_index()
df_aggregated['Male Count'] = 0
df_aggregated['Female Count'] = 0
df_aggregated['Job Rate'] = 0
def manualAggregations(indices_array):
temp_df = df.iloc[indices_array]
return {
'Male Count': temp_df['Male Count'].sum(),
'Female Count': temp_df['Female Count'].sum(),
'Job Rate': temp_df['Hourly Rate'].max()
}
for name, group in df_grouped:
ix = df_grouped.indices[name]
calcDict = manualAggregations(ix)
for key in calcDict:
#Salary Basis, Job Title
columns = list(name)
df_aggregated.loc[(df_aggregated['Salary Basis'] == columns[0]) &
(df_aggregated['Job Title'] == columns[1]), key] = calcDict[key]
If a dictionary isn't your thing, the calculations could be applied inline in the for loop (using a single .loc call rather than chained indexing, so the assignment reliably reaches df_aggregated):
df_aggregated.loc[(df_aggregated['Salary Basis'] == columns[0]) &
(df_aggregated['Job Title'] == columns[1]), 'Male Count'] = df['Male Count'].iloc[ix].sum()
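For reference, the same four aggregations can usually be expressed in one call with named aggregation in groupby(...).agg, which returns a regular DataFrame without a manual loop. The sketch below uses made-up rows with the column names from this answer; the values are assumptions for illustration only.

```python
import pandas as pd

# Illustrative data (an assumption) with the answer's column names.
df = pd.DataFrame({
    "Salary Basis": ["Hourly", "Hourly", "Salaried"],
    "Job Title": ["Clerk", "Clerk", "Manager"],
    "Hourly Rate": [15.0, 17.5, 40.0],
    "Male Count": [2, 1, 3],
    "Female Count": [1, 4, 2],
})

# Each output column maps to (input column, aggregation).
# Dict unpacking is used because the output names contain spaces.
df_aggregated = (
    df.groupby(["Salary Basis", "Job Title"], as_index=False)
    .agg(**{
        "Total Count": ("Hourly Rate", "size"),
        "Male Count": ("Male Count", "sum"),
        "Female Count": ("Female Count", "sum"),
        "Job Rate": ("Hourly Rate", "max"),
    })
)
print(df_aggregated)
```

With as_index=False the grouping columns stay as ordinary columns, so no reset_index() call is needed afterwards.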
grouped=df.groupby(['Team','Year'])['W'].count().reset_index()
team_wins_df=pd.DataFrame(grouped)
team_wins_df=team_wins_df.rename({'W':'Wins'},axis=1)
team_wins_df['Wins']=team_wins_df['Wins'].astype(np.int32)
team_wins_df.reset_index()
print(team_wins_df)
Try setting group_keys=False in the groupby method to prevent the group key from being added to the index.
Example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({
"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
"City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"]})
df1.groupby(["Name"], group_keys=False)