I have a dataset with three columns:
Name         Customer  Value
Johnny       Mike      1
Christopher  Luke      0
Christopher  Mike      0
Carl         Marilyn   1
Carl         Stephen   1
I need to create a new dataset with two columns: one with the unique values from the Name and Customer columns, and the Value column. Values in the Value column were assigned per Name (multiple rows with the same Name share the same value: Carl has value 1, Christopher has value 0, and Johnny has value 1), so Customer entries should have empty values in the Value column of the new dataset.
My expected output is
All          Value
Johnny       1
Christopher  0
Carl         1
Mike
Luke
Marilyn
Stephen
For the unique values in the All column I use unique().tolist() on both Name and Customer:
name = file['Name'].unique().tolist()
customer = file['Customer'].unique().tolist()
all_with_dupl = name + customer
customers = list(dict.fromkeys(all_with_dupl))
df = pd.DataFrame(columns=['All', 'Value'])
df['All'] = customers
I do not know how to assign the values in the new dataset after creating the list with all names and customers with no duplicates.
Any help would be great.
Split the columns, call .drop_duplicates() on each piece to remove duplicates, and then append them back together:
(df.drop('Customer', 1)
   .drop_duplicates()
   .rename(columns={'Name': 'All'})
   .append(
       df[['Customer']].rename(columns={'Customer': 'All'})
         .drop_duplicates(),
       ignore_index=True
   ))
           All  Value
0       Johnny    1.0
1  Christopher    0.0
2         Carl    1.0
3         Mike    NaN
4         Luke    NaN
5      Marilyn    NaN
6      Stephen    NaN
Or to split the steps up:
names = df.drop('Customer', 1).drop_duplicates().rename(columns={'Name': 'All'})
customers = df[['Customer']].drop_duplicates().rename(columns={'Customer': 'All'})
names.append(customers, ignore_index=True)
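Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, and the positional axis argument to drop is gone as well, so on recent versions the same idea can be written with pd.concat (a sketch of the equivalent chain, not the only way to do it):

import pandas as pd

# Keep each Name/Value pair once, then stack the unique customers underneath;
# the Customer rows get NaN in the Value column automatically.
names = df[['Name', 'Value']].drop_duplicates().rename(columns={'Name': 'All'})
customers = df[['Customer']].drop_duplicates().rename(columns={'Customer': 'All'})
result = pd.concat([names, customers], ignore_index=True)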
Another way (this assumes Name and Customer have been read into a single space-separated 'Name Customer' column):
d = dict(zip(df['Name Customer'].str.split(r'\s').str[0], df['Value']))  # create dict mapping Name -> Value
df['Name Customer'] = df['Name Customer'].str.split(r'\s')
df = df.explode('Name Customer').drop_duplicates(keep='first').assign(Value='')  # explode dataframe and drop duplicates
df['Value'] = df['Name Customer'].map(d).fillna('')  # map values back
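If Name and Customer are instead two separate columns, as in the sample at the top of the question, the same dict-and-map idea can be sketched with melt (the frame below is rebuilt from the question's sample data, so treat it as an illustration rather than the asker's actual file):

import pandas as pd

df = pd.DataFrame({
    'Name': ['Johnny', 'Christopher', 'Christopher', 'Carl', 'Carl'],
    'Customer': ['Mike', 'Luke', 'Mike', 'Marilyn', 'Stephen'],
    'Value': [1, 0, 0, 1, 1],
})

d = dict(zip(df['Name'], df['Value']))                     # Name -> Value lookup
out = (df.melt(value_vars=['Name', 'Customer'], value_name='All')[['All']]
         .drop_duplicates(ignore_index=True))
out['Value'] = out['All'].map(d)                           # customers come out as NaN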
So I have a use case where I have a few tables with different types of events in a time series, plus another table with base information. The events are of different types with different columns; for example, an event of "marriage" could have the columns "husband name" and "wife name", and a table of events on "jobs" can have columns "hired on" and "fired on" but can also have "husband name". The base info table is not time series data, and has things like "case ID" and "city of case".
The goal would be to: 1. have all the different time series tables in one table with all possible columns; wherever a column has no data, NaN is fine. And 2. every entry in the time series should have all available data from the base data table.
For example:
df = pd.DataFrame(np.array([['Dave', 1,'call'], ['Josh', 2, 'rejection'], ['Greg', 3,'call']]), columns=['husband name', 'casenum', 'event'])
df2 = pd.DataFrame(np.array([['Dave', 'Mona', 1, 'new lamp'], ['Max', 'Lisa',1, 'big increase'],['Pete', 'Esther',3,'call'], ['Josh', 'Moana', 2, 'delivery']]), columns=['husband name','wife name','casenum', 'event'])
df3 = pd.DataFrame(np.array([[1, 'new york'],[3,'old york'], [2, 'york']]), columns=['casenum','city'])
I'm trying a concat:
concat = pd.concat([df, df2, df3])
This doesn't work, because we already know that for casenum 1 the city is 'new york'.
I'm trying a join:
innerjoin = pd.merge(df, df2, on='casenum', how='inner')
innerjoin = pd.merge(innerjoin, df3, on='casenum', how='inner')
This also isn't right, as I want to have a record of all the events from both tables. Also, interestingly enough, the result is the same for both inner and outer joins on the dummy data; however, on my actual data an inner join results in more rows than the sum of both event tables, which I don't quite understand.
Basically, my desired outcome would be:
  husband name  casenum         event wife name      city
0         Dave        1          call       NaN  new york
1         Josh        2     rejection       NaN      york
2         Greg        3          call       NaN  old york
0         Dave        1      new lamp      Mona  new york
1          Max        1  big increase      Lisa  new york
2         Pete        3          call    Esther  old york
3         Josh        2      delivery     Moana      york
I've tried inner joins, outer joins, concats, none seem to work. Maybe I'm just too tired, but what do I need to do to get this output? Thank you!
I think you can merge twice, with the outer option:
(df.merge(df2,on=['husband name', 'casenum', 'event'], how='outer')
.merge(df3, on='casenum')
)
Output:
  husband name  casenum         event wife name      city
0         Dave        1          call       NaN  new york
1         Dave        1      new lamp      Mona  new york
2          Max        1  big increase      Lisa  new york
3         Josh        2     rejection       NaN      york
4         Josh        2      delivery     Moana      york
5         Greg        3          call       NaN  old york
6         Pete        3          call    Esther  old york
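Another option, if you would rather not list the shared event columns explicitly, is to stack the two event tables first and then attach the base information with a left merge (a sketch using the same sample frames; note that casenum comes out as strings here because of the np.array construction):

# Stack both event tables (missing columns such as 'wife name' become NaN),
# then pull the city in from the base table by casenum.
events = pd.concat([df, df2], ignore_index=True)
result = events.merge(df3, on='casenum', how='left')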
I want to de-dupe rows in pandas based off of multiple criteria.
I have 3 columns: name, id and nick_name.
First rule is look for duplicate id's. When id's match, only keep rows where name and nick_name are different as long as I am keeping at least one row.
In other words, if name and nick_name don't match, keep that row. If name and nick_name match, then get rid of that row, as long as it isn't the only row that would be left for that id.
Example data:
data = {"name": ["Sam", "Sam", "Joseph", "Joseph", "Joseph", "Philip", "Philip", "James"],
"id": [1,1,2,2,2,3,3,4],
"nick_name": ["Sammie", "Sam", "Joseph", "Joe", "Joey", "Philip", "Philip", "James"]}
df = pd.DataFrame(data)
df
Produces:
     name  id nick_name
0     Sam   1    Sammie
1     Sam   1       Sam
2  Joseph   2    Joseph
3  Joseph   2       Joe
4  Joseph   2      Joey
5  Philip   3    Philip
6  Philip   3    Philip
7   James   4     James
Based on my rules above, I want a resulting dataframe to produce the following:
     name  id nick_name
0     Sam   1    Sammie
3  Joseph   2       Joe
4  Joseph   2      Joey
5  Philip   3    Philip
7   James   4     James
We can split this into 3 boolean conditions to filter your initial dataframe by.
# Rows that duplicate an earlier (name, nick_name) pair; this keeps one row
# alive for ids where every row has name equal to nick_name.
con1 = df.duplicated(subset=['name', 'nick_name'], keep='first')
# Ids that are duplicated and where name is not equal to nick_name.
con2 = df.duplicated(subset=['id'], keep=False) & df['name'].ne(df['nick_name'])
# Ids for which no duplicate exists.
con3 = df.groupby('id')['id'].transform('size').eq(1)
print(df.loc[con1 | con2 | con3])
     name  id nick_name
0     Sam   1    Sammie
3  Joseph   2       Joe
4  Joseph   2      Joey
6  Philip   3    Philip
7   James   4     James
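An alternative sketch with groupby/transform; unlike the output above, it keeps the first matching row (index 5) for ids where every row has name equal to nick_name, which is what the expected output in the question shows:

# Rows where name and nick_name differ are always kept
diff = df['name'].ne(df['nick_name'])
# For ids where no row differs, keep one representative row so the id survives
all_same = ~diff.groupby(df['id']).transform('any')
keep_one = all_same & ~df.duplicated(subset='id')
print(df[diff | keep_one])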
I'm starting with input data like this
df1 = pandas.DataFrame( {
"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
"City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )
Which when printed appears as this:
       City     Name
0   Seattle    Alice
1   Seattle      Bob
2  Portland  Mallory
3   Seattle  Mallory
4   Seattle      Bob
5  Portland  Mallory
Grouping is simple enough:
g1 = df1.groupby( [ "Name", "City"] ).count()
and printing yields a GroupBy object:
                  City  Name
Name    City
Alice   Seattle      1     1
Bob     Seattle      2     2
Mallory Portland     2     2
        Seattle      1     1
But what I want eventually is another DataFrame object that contains all the rows in the GroupBy object. In other words I want to get the following result:
                  City  Name
Name    City
Alice   Seattle      1     1
Bob     Seattle      2     2
Mallory Portland     2     2
Mallory Seattle      1     1
I can't quite see how to accomplish this in the pandas documentation. Any hints would be welcome.
g1 here is a DataFrame. It has a hierarchical index, though:
In [19]: type(g1)
Out[19]: pandas.core.frame.DataFrame
In [20]: g1.index
Out[20]:
MultiIndex([('Alice', 'Seattle'), ('Bob', 'Seattle'), ('Mallory', 'Portland'),
('Mallory', 'Seattle')], dtype=object)
Perhaps you want something like this?
In [21]: g1.add_suffix('_Count').reset_index()
Out[21]:
      Name      City  City_Count  Name_Count
0    Alice   Seattle           1           1
1      Bob   Seattle           2           2
2  Mallory  Portland           2           2
3  Mallory   Seattle           1           1
Or something like:
In [36]: DataFrame({'count' : df1.groupby( [ "Name", "City"] ).size()}).reset_index()
Out[36]:
      Name      City  count
0    Alice   Seattle      1
1      Bob   Seattle      2
2  Mallory  Portland      2
3  Mallory   Seattle      1
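On reasonably recent pandas (roughly 1.1 onward, an assumption about the reader's version), the second form can be collapsed further, since size() with as_index=False already returns a DataFrame:

# One step: the per-group row count lands in a column named "size"
counts = df1.groupby(["Name", "City"], as_index=False).size()
counts = counts.rename(columns={"size": "count"})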
I want to slightly change the answer given by Wes, because version 0.16.2 requires as_index=False. If you don't set it, you get an empty dataframe.
Source:
Aggregation functions will not return the groups that you are aggregating over if they are named columns, when as_index=True, the default. The grouped columns will be the indices of the returned object.
Passing as_index=False will return the groups that you are aggregating over, if they are named columns.
Aggregating functions are ones that reduce the dimension of the returned objects, for example: mean, sum, size, count, std, var, sem, describe, first, last, nth, min, max. This is what happens when you do for example DataFrame.sum() and get back a Series.
nth can act as a reducer or a filter, see here.
import pandas as pd
df1 = pd.DataFrame({"Name":["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"],
"City":["Seattle","Seattle","Portland","Seattle","Seattle","Portland"]})
print(df1)
#
#        City     Name
# 0   Seattle    Alice
# 1   Seattle      Bob
# 2  Portland  Mallory
# 3   Seattle  Mallory
# 4   Seattle      Bob
# 5  Portland  Mallory
#
g1 = df1.groupby(["Name", "City"], as_index=False).count()
print(g1)
#
#                   City  Name
# Name    City
# Alice   Seattle      1     1
# Bob     Seattle      2     2
# Mallory Portland     2     2
#         Seattle      1     1
#
EDIT:
In version 0.17.1 and later you can use subset in count and reset_index with the name parameter in size:
print(df1.groupby(["Name", "City"], as_index=False).count())
# IndexError: list index out of range

print(df1.groupby(["Name", "City"]).count())
# Empty DataFrame
# Columns: []
# Index: [(Alice, Seattle), (Bob, Seattle), (Mallory, Portland), (Mallory, Seattle)]
print(df1.groupby(["Name", "City"])[['Name', 'City']].count())
#                   Name  City
# Name    City
# Alice   Seattle      1     1
# Bob     Seattle      2     2
# Mallory Portland     2     2
#         Seattle      1     1
print(df1.groupby(["Name", "City"]).size().reset_index(name='count'))
#       Name      City  count
# 0    Alice   Seattle      1
# 1      Bob   Seattle      2
# 2  Mallory  Portland      2
# 3  Mallory   Seattle      1
The difference between count and size is that size counts NaN values while count does not.
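A quick sketch of that difference on made-up data:

import numpy as np
import pandas as pd

tmp = pd.DataFrame({"Name": ["Alice", "Alice"], "City": ["Seattle", np.nan]})
print(tmp.groupby("Name").size())    # 2: every row in the group, NaN included
print(tmp.groupby("Name").count())   # City count is 1; the NaN row is not counted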
The key is to use the reset_index() method.
Use:
import pandas
df1 = pandas.DataFrame( {
"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
"City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )
g1 = df1.groupby( [ "Name", "City"] ).count().reset_index()
Now you have your new dataframe in g1.
Simply, this should do the task:
import pandas as pd
grouped_df = df1.groupby( [ "Name", "City"] )
pd.DataFrame(grouped_df.size().reset_index(name = "Group_Count"))
Here, grouped_df.size() pulls up the per-group counts, and reset_index(name="Group_Count") turns the result back into a DataFrame while setting the name you want for the count column.
Finally, the pandas DataFrame() constructor is called to create a DataFrame object.
Maybe I misunderstand the question, but if you want to convert the groupby result back to a dataframe you can use .to_frame(). I wanted to reset the index when I did this, so I included that part as well.
Example code (unrelated to the question):
df = df['TIME'].groupby(df['Name']).min()
df = df.to_frame()
df = df.reset_index()  # 'Name' moves from the index back to a regular column
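Applied to the question's data, the same pattern might look like this (a sketch):

g1 = df1.groupby(["Name", "City"]).size()   # Series with a (Name, City) MultiIndex
g1 = g1.to_frame("count")                   # back to a DataFrame
g1 = g1.reset_index()                       # Name and City become ordinary columns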
I found this worked for me.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({
"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
"City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"]})
df1['City_count'] = 1
df1['Name_count'] = 1
df1.groupby(['Name', 'City'], as_index=False).count()
The solution below may be simpler:
df1.reset_index().groupby(["Name", "City"], as_index=False).count()
I aggregated Qty-wise data and stored it in a dataframe:
almo_grp_data = pd.DataFrame(
    {'Qty_cnt': almo_slt_models_data.groupby(['orderDate', 'Item', 'State Abv'])['Qty'].sum()}
).reset_index()
This returns the group levels/indices in the same order as a vanilla groupby() call. It's basically the same as the answer NehalJWani posted in his comment, but stored in a variable with the reset_index() method called on it.
fare_class = df.groupby(['Satisfaction Rating','Fare Class']).size().to_frame(name = 'Count')
fare_class.reset_index()
This version returns the same data as percentages, which is useful for stats, and also shows how to include a lambda function:
fare_class_percent = df.groupby(['Satisfaction Rating', 'Fare Class']).size().to_frame(name = 'Percentage')
fare_class_percent.transform(lambda x: 100 * x/x.sum()).reset_index()
Example output:

  Satisfaction Rating Fare Class  Percentage
0        Dissatisfied   Business   14.624269
1        Dissatisfied    Economy   36.469048
2           Satisfied   Business    5.460425
3           Satisfied    Economy   33.235294
These solutions only partially worked for me because I was doing multiple aggregations. Here is a sample output of my groupby that I wanted to convert to a dataframe:
Because I wanted more than the count provided by reset_index(), I wrote a manual method for converting that grouped output into a dataframe. I understand this is not the most pythonic/pandas way of doing this, as it is quite verbose and explicit, but it was all I needed. Basically, use the reset_index() method explained above to start a "scaffolding" dataframe, then loop through the group pairings in the grouped dataframe, retrieve the indices, perform your calculations against the ungrouped dataframe, and set the values in your new aggregated dataframe.
df_grouped = df[['Salary Basis', 'Job Title', 'Hourly Rate', 'Male Count', 'Female Count']]
df_grouped = df_grouped.groupby(['Salary Basis', 'Job Title'], as_index=False)
# Grouped gives us the indices we want for each grouping
# We cannot convert a groupedby object back to a dataframe, so we need to do it manually
# Create a new dataframe to work against
df_aggregated = df_grouped.size().to_frame('Total Count').reset_index()
df_aggregated['Male Count'] = 0
df_aggregated['Female Count'] = 0
df_aggregated['Job Rate'] = 0
def manualAggregations(indices_array):
    temp_df = df.iloc[indices_array]
    return {
        'Male Count': temp_df['Male Count'].sum(),
        'Female Count': temp_df['Female Count'].sum(),
        'Job Rate': temp_df['Hourly Rate'].max()
    }

for name, group in df_grouped:
    ix = df_grouped.indices[name]
    calcDict = manualAggregations(ix)
    for key in calcDict:
        # name is the (Salary Basis, Job Title) tuple for this group
        columns = list(name)
        df_aggregated.loc[(df_aggregated['Salary Basis'] == columns[0]) &
                          (df_aggregated['Job Title'] == columns[1]), key] = calcDict[key]
If a dictionary isn't your thing, the calculations could be applied inline in the for loop:
df_aggregated['Male Count'].loc[(df_aggregated['Salary Basis'] == columns[0]) &
                                (df_aggregated['Job Title'] == columns[1])] = df['Male Count'].iloc[ix].sum()
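For what it's worth, on pandas 0.25 and later the same set of aggregations can usually be expressed directly with named aggregation, which avoids the manual loop entirely (a sketch assuming the same column names as above):

df_aggregated = (
    df.groupby(['Salary Basis', 'Job Title'], as_index=False)
      .agg(**{'Total Count': ('Job Title', 'size'),
              'Male Count': ('Male Count', 'sum'),
              'Female Count': ('Female Count', 'sum'),
              'Job Rate': ('Hourly Rate', 'max')})
)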
grouped = df.groupby(['Team', 'Year'])['W'].count().reset_index()
team_wins_df = pd.DataFrame(grouped)
team_wins_df = team_wins_df.rename({'W': 'Wins'}, axis=1)
team_wins_df['Wins'] = team_wins_df['Wins'].astype(np.int32)
print(team_wins_df)
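The same result can be reached in fewer steps (a sketch, assuming 'W' has no missing values so count equals the number of rows per group):

team_wins_df = (df.groupby(['Team', 'Year'], as_index=False)['W'].count()
                  .rename(columns={'W': 'Wins'}))
print(team_wins_df)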
Try setting group_keys=False in the groupby method to prevent adding the group key to the index.
Example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({
"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
"City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"]})
df1.groupby(["Name"], group_keys=False)