Splitting a DataFrame into several DataFrames - python

I have one DataFrame where different rows can have the same value for one column.
As an example:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "John", "Mark", "Emma", "Mary"],
    "City": ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"]})
City Name
0 Seattle Alice
1 Seattle Bob
2 Portland John
3 Seattle Mark
4 Seattle Emma
5 Portland Mary
Here, a given value for "City" (e.g. "Portland") is shared by several rows.
I want to split this data frame into several data frames, one for each distinct value of a column. For the example above, I want to get the following data frames:
City Name
0 Seattle Alice
1 Seattle Bob
3 Seattle Mark
4 Seattle Emma
and
City Name
2 Portland John
5 Portland Mary
From this answer, I am creating a mask that can be used to generate one data frame:
def mask_with_in1d(df, column, val):
    mask = np.in1d(df[column].values, [val])
    return df[mask]
# Return the last data frame above
mask_with_in1d(df, 'City', 'Portland')
The problem is how to efficiently create all of the data frames and assign a name to each one. I am doing it this way:
unique_values = np.sort(df['City'].unique())
for city_value in unique_values:
    exec("df_{0} = mask_with_in1d(df, 'City', '{0}')".format(city_value))
which gives me the data frames df_Seattle and df_Portland that I can further manipulate.
Is there a better way of doing this?

Do you have a fixed list of cities you want to do this for? The simplest solution is to group by city and then loop over the groups:
for city, names in df.groupby("City"):
    print(city)
    print(names)
Portland
City Name
2 Portland John
5 Portland Mary
Seattle
City Name
0 Seattle Alice
1 Seattle Bob
3 Seattle Mark
4 Seattle Emma
You could then assign each group to a dictionary or some such (df_city[city] = names) if you want df_city["Portland"] to work. It depends on what you want to do with the groups once they are split.
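The dictionary idea can be sketched as follows. This is a minimal example built on the question's sample data; the name dfs_by_city is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "John", "Mark", "Emma", "Mary"],
    "City": ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"],
})

# One entry per distinct city; each value is the sub-frame for that city.
# dfs_by_city is a hypothetical name, not part of pandas.
dfs_by_city = {city: group for city, group in df.groupby("City")}

print(dfs_by_city["Portland"])
```

Looking up dfs_by_city["Seattle"] then takes the place of the df_Seattle variable that exec would have created.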

You can use groupby for this:
dfs = [gb[1] for gb in df.groupby('City')]
This will construct a list of dataframes, one per value of the 'City' column.
If you want (value, DataFrame) tuples, where the value is the 'City' shared by each group, you can use:
dfs = list(df.groupby('City'))
Note that assigning by name is usually an anti-pattern, and exec and eval are definitely anti-patterns.

Related

Keep values assigned to one column in a new dataframe

I have a dataset with three columns:
Name Customer Value
Johnny Mike 1
Christopher Luke 0
Christopher Mike 0
Carl Marilyn 1
Carl Stephen 1
I need to create a new dataset where I have two columns: one with unique values from Name and Customer columns, and the Value column. Values in the Value column were assigned to Name (this means that multiple rows with same Name have the same value: Carl has value 1, Christopher has value 0, and Johnny has value 1), so Customer elements should have empty values in Value column in the new dataset.
My expected output is
All Value
Johnny 1
Christopher 0
Carl 1
Mike
Luke
Marilyn
Stephen
For the unique values in the All column I use unique().tolist() on both Name and Customer:
name = file['Name'].unique().tolist()
customer = file['Customer'].unique().tolist()
all_with_dupl = name + customer
customers=list(dict.fromkeys(all_with_dupl))
df= pd.DataFrame(columns=['All','Value'])
df['All']= customers
I do not know how to assign the values in the new dataset after creating the list with all names and customers with no duplicates.
Any help would be great.
Split the columns, call .drop_duplicates on each part to remove duplicates, and then append the parts back together:
(df.drop('Customer', axis=1)
   .drop_duplicates()
   .rename(columns={'Name': 'All'})
   .append(
       df[['Customer']].rename(columns={'Customer': 'All'})
                       .drop_duplicates(),
       ignore_index=True
   ))
All Value
0 Johnny 1.0
1 Christopher 0.0
2 Carl 1.0
3 Mike NaN
4 Luke NaN
5 Marilyn NaN
6 Stephen NaN
Or to split the steps up:
names = df.drop('Customer', axis=1).drop_duplicates().rename(columns={'Name': 'All'})
customers = df[['Customer']].drop_duplicates().rename(columns={'Customer': 'All'})
names.append(customers, ignore_index=True)
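DataFrame.append was removed in pandas 2.0, so on current versions the same steps can be written with pd.concat. A minimal sketch on the question's data (the variable name result is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Johnny", "Christopher", "Christopher", "Carl", "Carl"],
    "Customer": ["Mike", "Luke", "Mike", "Marilyn", "Stephen"],
    "Value": [1, 0, 0, 1, 1],
})

# Unique (Name, Value) pairs, with Name renamed to the shared column "All".
names = df.drop(columns="Customer").drop_duplicates().rename(columns={"Name": "All"})
# Unique customers; they carry no Value column, so concat fills it with NaN.
customers = df[["Customer"]].drop_duplicates().rename(columns={"Customer": "All"})

result = pd.concat([names, customers], ignore_index=True)
print(result)
```

As with append, the Value column becomes float because the customer rows hold NaN.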
Another way:
d = dict(zip(df['Name Customer'].str.split(r'\s').str[0], df['Value']))  # Create dict
df['Name Customer'] = df['Name Customer'].str.split(r'\s')
df = df.explode('Name Customer').drop_duplicates(keep='first').assign(Value='')  # Explode dataframe and drop duplicates
df['Value'] = df['Name Customer'].map(d).fillna('')  # Map values back

How to join tables of multiple events while preserving information?

So I have a use case where I have a few tables with different types of events in a time series, plus another table with base information. The events are of different types with different columns, for example an event of "marriage" could have the columns "husband name" and "wife name", and a table of events on "jobs" can have columns of "hired on" and "fired on" but can also have "husband name". The base info table is not time series data, and has stuff like "case ID" and "city of case".
The goal is twofold: (1) combine all the different time series tables into one table with all possible columns, where NaN is fine wherever a column has no data; and (2) give every entry in the time series all the available data from the base data table.
For example:
df = pd.DataFrame(np.array([['Dave', 1,'call'], ['Josh', 2, 'rejection'], ['Greg', 3,'call']]), columns=['husband name', 'casenum', 'event'])
df2 = pd.DataFrame(np.array([['Dave', 'Mona', 1, 'new lamp'], ['Max', 'Lisa',1, 'big increase'],['Pete', 'Esther',3,'call'], ['Josh', 'Moana', 2, 'delivery']]), columns=['husband name','wife name','casenum', 'event'])
df3 = pd.DataFrame(np.array([[1, 'new york'],[3,'old york'], [2, 'york']]), columns=['casenum','city'])
I'm trying a concat:
concat = pd.concat([df, df2, df3])
This doesn't work: the event rows end up without a city, even though we already know that for case num 1 the city is 'new york'.
I'm trying a join:
innerjoin = pd.merge(df, df2, on='casenum', how='inner')
innerjoin = pd.merge(innerjoin, df3, on='casenum', how='inner')
This also isn't right, as I want a record of all the events from both tables. Interestingly, the result is the same for inner and outer joins on the dummy data. On my actual data, however, an inner join produces more rows than the two event tables have combined, which I don't quite understand.
Basically, my desired outcome would be:
husband name casenum event wife name city
0 Dave 1 call NaN new york
1 Josh 2 rejection NaN york
2 Greg 3 call NaN old york
0 Dave 1 new lamp Mona new york
1 Max 1 big increase Lisa new york
2 Pete 3 call Esther old york
3 Josh 2 delivery Moana york
I've tried inner joins, outer joins, concats, none seem to work. Maybe I'm just too tired, but what do I need to do to get this output? Thank you!
I think you can merge twice, using how='outer' for the first merge:
(df.merge(df2, on=['husband name', 'casenum', 'event'], how='outer')
   .merge(df3, on='casenum')
)
Output:
husband name casenum event wife name city
0 Dave 1 call NaN new york
1 Dave 1 new lamp Mona new york
2 Max 1 big increase Lisa new york
3 Josh 2 rejection NaN york
4 Josh 2 delivery Moana york
5 Greg 3 call NaN old york
6 Pete 3 call Esther old york
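An outer merge on all three columns would fold an event that appears identically in both tables into a single row. If such events should be kept twice, one alternative is to stack the event tables first and then attach the base data. A sketch, assuming casenum has the same dtype in every table (plain integers here, unlike the NumPy string arrays in the question); the name events is only illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    [["Dave", 1, "call"], ["Josh", 2, "rejection"], ["Greg", 3, "call"]],
    columns=["husband name", "casenum", "event"],
)
df2 = pd.DataFrame(
    [["Dave", "Mona", 1, "new lamp"], ["Max", "Lisa", 1, "big increase"],
     ["Pete", "Esther", 3, "call"], ["Josh", "Moana", 2, "delivery"]],
    columns=["husband name", "wife name", "casenum", "event"],
)
df3 = pd.DataFrame(
    [[1, "new york"], [3, "old york"], [2, "york"]],
    columns=["casenum", "city"],
)

# Stack the event tables; columns missing from one table are filled with NaN.
events = pd.concat([df, df2], ignore_index=True)
# A left merge keeps every event row and looks up its city by casenum.
events = events.merge(df3, on="casenum", how="left")
print(events)
```

The left merge keeps the rows in their stacked order, so the result lines up with the desired output in the question apart from the renumbered index.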

De-Duplicate in Pandas based off of multiple rules

I want to de-dupe rows in pandas based off of multiple criteria.
I have 3 columns: name, id and nick_name.
First rule is look for duplicate id's. When id's match, only keep rows where name and nick_name are different as long as I am keeping at least one row.
In other words, if name and nick_name don't match, keep that row. If name and nick_name match, then get rid of that row, as long as it isn't the only row that would be left for that id.
Example data:
data = {"name": ["Sam", "Sam", "Joseph", "Joseph", "Joseph", "Philip", "Philip", "James"],
"id": [1,1,2,2,2,3,3,4],
"nick_name": ["Sammie", "Sam", "Joseph", "Joe", "Joey", "Philip", "Philip", "James"]}
df = pd.DataFrame(data)
df
Produces:
name id nick_name
0 Sam 1 Sammie
1 Sam 1 Sam
2 Joseph 2 Joseph
3 Joseph 2 Joe
4 Joseph 2 Joey
5 Philip 3 Philip
6 Philip 3 Philip
7 James 4 James
Based on my rules above, I want a resulting dataframe to produce the following:
name id nick_name
0 Sam 1 Sammie
3 Joseph 2 Joe
4 Joseph 2 Joey
5 Philip 3 Philip
7 James 4 James
We can split this into 3 boolean conditions to filter your initial dataframe by.
# Where a (name, nick_name) pair repeats, flag every occurrence after the first.
con1 = df.duplicated(subset=['name', 'nick_name'], keep='first')
# Where ids are duplicated and name is not equal to nick_name.
con2 = df.duplicated(subset=['id'], keep=False) & df['name'].ne(df['nick_name'])
# Where the id occurs only once, so no duplicate exists.
con3 = df.groupby('id')['id'].transform('size').eq(1)
print(df.loc[con1 | con2 | con3])
name id nick_name
0 Sam 1 Sammie
3 Joseph 2 Joe
4 Joseph 2 Joey
6 Philip 3 Philip
7 James 4 James
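This output keeps row 6 for id 3, whereas the question's expected output keeps row 5. If the first row of an id should be the fallback whenever none of its rows has a differing nick_name, one possible variation is sketched below (the names differs, any_differs and first_in_id are only illustrative):

```python
import pandas as pd

data = {"name": ["Sam", "Sam", "Joseph", "Joseph", "Joseph", "Philip", "Philip", "James"],
        "id": [1, 1, 2, 2, 2, 3, 3, 4],
        "nick_name": ["Sammie", "Sam", "Joseph", "Joe", "Joey", "Philip", "Philip", "James"]}
df = pd.DataFrame(data)

# True where the nickname is different from the name.
differs = df["name"].ne(df["nick_name"])
# True for every row of an id that has at least one differing nickname.
any_differs = differs.groupby(df["id"]).transform("any")
# True for the first row of each id.
first_in_id = ~df.duplicated(subset="id", keep="first")

# Keep differing rows; for ids with none, fall back to the first row.
result = df[differs | (~any_differs & first_in_id)]
print(result)
```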

Iterating through two pandas dataframes and appending data from one dataframe to the other

I have two pandas data-frames that look like this:
data_frame_1:
index un_id city
1 abc new york
2 def atlanta
3 gei toronto
4 lmn tampa
data_frame_2:
index name un_id
1 frank gei
2 john lmn
3 lisa abc
4 jessica def
I need to match names to cities via the un_id column either in a new data-frame or an existing data-frame. I am having trouble figuring out how to iterate through one column, grab the un_id, iterate through the other un_id column in the other data-frame with that un_id, and then append the information needed back to the original data-frame.
Use pandas merge (here df1 and df2 stand for data_frame_1 and data_frame_2):
In [14]: df2.merge(df1, on='un_id')
Out[14]:
name un_id city
0 frank gei toronto
1 john lmn tampa
2 lisa abc new york
3 jessica def atlanta
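For reference, a self-contained version of the merge above. The frames are rebuilt from the question's tables; the index column is omitted and the name matched is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "un_id": ["abc", "def", "gei", "lmn"],
    "city": ["new york", "atlanta", "toronto", "tampa"],
})
df2 = pd.DataFrame({
    "name": ["frank", "john", "lisa", "jessica"],
    "un_id": ["gei", "lmn", "abc", "def"],
})

# merge looks up each un_id of df2 in df1, so no explicit loop is needed.
matched = df2.merge(df1, on="un_id")
print(matched)
```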

Converting a Pandas GroupBy output from Series to DataFrame

I'm starting with input data like this
df1 = pandas.DataFrame( {
"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
"City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )
Which when printed appears as this:
City Name
0 Seattle Alice
1 Seattle Bob
2 Portland Mallory
3 Seattle Mallory
4 Seattle Bob
5 Portland Mallory
Grouping is simple enough:
g1 = df1.groupby( [ "Name", "City"] ).count()
and printing yields a GroupBy object:
City Name
Name City
Alice Seattle 1 1
Bob Seattle 2 2
Mallory Portland 2 2
Seattle 1 1
But what I want eventually is another DataFrame object that contains all the rows in the GroupBy object. In other words I want to get the following result:
City Name
Name City
Alice Seattle 1 1
Bob Seattle 2 2
Mallory Portland 2 2
Mallory Seattle 1 1
I can't quite see how to accomplish this in the pandas documentation. Any hints would be welcome.
g1 here is a DataFrame. It has a hierarchical index, though:
In [19]: type(g1)
Out[19]: pandas.core.frame.DataFrame
In [20]: g1.index
Out[20]:
MultiIndex([('Alice', 'Seattle'), ('Bob', 'Seattle'), ('Mallory', 'Portland'),
('Mallory', 'Seattle')], dtype=object)
Perhaps you want something like this?
In [21]: g1.add_suffix('_Count').reset_index()
Out[21]:
Name City City_Count Name_Count
0 Alice Seattle 1 1
1 Bob Seattle 2 2
2 Mallory Portland 2 2
3 Mallory Seattle 1 1
Or something like:
In [36]: DataFrame({'count' : df1.groupby( [ "Name", "City"] ).size()}).reset_index()
Out[36]:
Name City count
0 Alice Seattle 1
1 Bob Seattle 2
2 Mallory Portland 2
3 Mallory Seattle 1
I want to slightly change the answer given by Wes, because version 0.16.2 requires as_index=False. If you don't set it, you get an empty dataframe.
Source:
Aggregation functions will not return the groups that you are aggregating over if they are named columns, when as_index=True, the default. The grouped columns will be the indices of the returned object.
Passing as_index=False will return the groups that you are aggregating over, if they are named columns.
Aggregating functions are ones that reduce the dimension of the returned objects, for example: mean, sum, size, count, std, var, sem, describe, first, last, nth, min, max. This is what happens when you do for example DataFrame.sum() and get back a Series.
nth can act as a reducer or a filter, see here.
import pandas as pd
df1 = pd.DataFrame({"Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
                    "City": ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"]})
print(df1)
#
# City Name
#0 Seattle Alice
#1 Seattle Bob
#2 Portland Mallory
#3 Seattle Mallory
#4 Seattle Bob
#5 Portland Mallory
#
g1 = df1.groupby(["Name", "City"], as_index=False).count()
print(g1)
#
# City Name
#Name City
#Alice Seattle 1 1
#Bob Seattle 2 2
#Mallory Portland 2 2
# Seattle 1 1
#
EDIT:
In version 0.17.1 and later you can use subset in count and reset_index with parameter name in size:
print(df1.groupby(["Name", "City"], as_index=False).count())
#IndexError: list index out of range
print(df1.groupby(["Name", "City"]).count())
#Empty DataFrame
#Columns: []
#Index: [(Alice, Seattle), (Bob, Seattle), (Mallory, Portland), (Mallory, Seattle)]
print(df1.groupby(["Name", "City"])[['Name', 'City']].count())
# Name City
#Name City
#Alice Seattle 1 1
#Bob Seattle 2 2
#Mallory Portland 2 2
# Seattle 1 1
print(df1.groupby(["Name", "City"]).size().reset_index(name='count'))
# Name City count
#0 Alice Seattle 1
#1 Bob Seattle 2
#2 Mallory Portland 2
#3 Mallory Seattle 1
The difference between count and size is that size counts NaN values while count does not.
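A small illustration of that difference, using made-up data with one missing value (the Temp column is hypothetical and not part of the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "City": ["Seattle", "Seattle", "Portland"],
    "Temp": [10.0, np.nan, 12.0],  # hypothetical column with one NaN
})

# size() counts the rows in each group, NaN or not.
sizes = df.groupby("City").size()
# count() counts only the non-NaN values in the selected column.
counts = df.groupby("City")["Temp"].count()

print(sizes)
print(counts)
```

Here Seattle has a size of 2 but a count of 1, because one of its Temp values is missing.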
The key is to use the reset_index() method.
Use:
import pandas
df1 = pandas.DataFrame( {
"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
"City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )
g1 = df1.groupby( [ "Name", "City"] ).count().reset_index()
Now you have your new DataFrame in g1.
Simply, this should do the task:
import pandas as pd
grouped_df = df1.groupby( [ "Name", "City"] )
pd.DataFrame(grouped_df.size().reset_index(name = "Group_Count"))
Here, grouped_df.size() returns the number of rows in each group, and reset_index(name="Group_Count") turns the group keys back into columns and names the count column.
The result of reset_index() is already a DataFrame; wrapping it in the pd.DataFrame() constructor simply makes that explicit.
Maybe I misunderstand the question, but if you want to convert the groupby result back to a DataFrame you can use .to_frame(). I also wanted to reset the index, so I included that part as well.
Example code (unrelated to the question's data):
df = df['TIME'].groupby(df['Name']).min()
df = df.to_frame()
df = df.reset_index()  # 'Name' is the only index level at this point
I found this worked for me.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({
"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
"City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"]})
df1['City_count'] = 1
df1['Name_count'] = 1
df1.groupby(['Name', 'City'], as_index=False).count()
Below solution may be simpler:
df1.reset_index().groupby( [ "Name", "City"],as_index=False ).count()
I aggregated the data by Qty and stored the result in a DataFrame:
almo_grp_data = pd.DataFrame({
    'Qty_cnt': almo_slt_models_data.groupby(['orderDate', 'Item', 'State Abv'])['Qty'].sum()
}).reset_index()
This returns the ordinal levels/indices in the same order as a vanilla groupby() method. It's basically the same as the answer @NehalJWani posted in his comment, but stored in a variable with the reset_index() method called on it.
fare_class = df.groupby(['Satisfaction Rating','Fare Class']).size().to_frame(name = 'Count')
fare_class.reset_index()
This version returns the same grouping with percentages instead of counts, which is useful for stats. It uses a lambda function to divide each group size by the total:
fare_class_percent = df.groupby(['Satisfaction Rating', 'Fare Class']).size().to_frame(name = 'Percentage')
fare_class_percent.transform(lambda x: 100 * x/x.sum()).reset_index()
Satisfaction Rating Fare Class Percentage
0 Dissatisfied Business 14.624269
1 Dissatisfied Economy 36.469048
2 Satisfied Business 5.460425
3 Satisfied Economy 33.235294
These solutions only partially worked for me because I was doing multiple aggregations, and I wanted to convert the output of that groupby into a DataFrame.
Because I wanted more than the count provided by reset_index(), I wrote a manual method for converting the grouped output into a DataFrame. I understand this is not the most pythonic/pandas way of doing this, as it is quite verbose and explicit, but it was all I needed. Basically, use the reset_index() method explained above to start a "scaffolding" DataFrame, then loop through the group pairings in the grouped DataFrame, retrieve the indices, perform your calculations against the ungrouped DataFrame, and set the value in your new aggregated DataFrame.
df_grouped = df[['Salary Basis', 'Job Title', 'Hourly Rate', 'Male Count', 'Female Count']]
df_grouped = df_grouped.groupby(['Salary Basis', 'Job Title'], as_index=False)
# Grouped gives us the indices we want for each grouping
# We cannot convert a groupedby object back to a dataframe, so we need to do it manually
# Create a new dataframe to work against
df_aggregated = df_grouped.size().to_frame('Total Count').reset_index()
df_aggregated['Male Count'] = 0
df_aggregated['Female Count'] = 0
df_aggregated['Job Rate'] = 0
def manualAggregations(indices_array):
    temp_df = df.iloc[indices_array]
    return {
        'Male Count': temp_df['Male Count'].sum(),
        'Female Count': temp_df['Female Count'].sum(),
        'Job Rate': temp_df['Hourly Rate'].max()
    }
for name, group in df_grouped:
    ix = df_grouped.indices[name]
    calcDict = manualAggregations(ix)
    for key in calcDict:
        # Salary Basis, Job Title
        columns = list(name)
        df_aggregated.loc[(df_aggregated['Salary Basis'] == columns[0]) &
                          (df_aggregated['Job Title'] == columns[1]), key] = calcDict[key]
If a dictionary isn't your thing, the calculations could be applied inline in the for loop (selecting the rows and the column in a single .loc call avoids chained assignment):
    df_aggregated.loc[(df_aggregated['Salary Basis'] == columns[0]) &
                      (df_aggregated['Job Title'] == columns[1]), 'Male Count'] = df['Male Count'].iloc[ix].sum()
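For comparison, the same kind of multi-aggregation can usually be expressed with named aggregation in agg(), available since pandas 0.25, which returns a DataFrame directly. A sketch with made-up rows; the column names follow the answer above, but the data is hypothetical:

```python
import pandas as pd

# Hypothetical sample rows using the column names from the answer above.
df = pd.DataFrame({
    "Salary Basis": ["Hourly", "Hourly", "Salary"],
    "Job Title": ["Clerk", "Clerk", "Manager"],
    "Hourly Rate": [15.0, 17.0, 40.0],
    "Male Count": [1, 2, 3],
    "Female Count": [2, 1, 4],
})

# Each output column is given as (input column, aggregation function).
# The dict is unpacked with ** because the output names contain spaces.
df_aggregated = df.groupby(["Salary Basis", "Job Title"], as_index=False).agg(**{
    "Total Count": ("Hourly Rate", "size"),
    "Male Count": ("Male Count", "sum"),
    "Female Count": ("Female Count", "sum"),
    "Job Rate": ("Hourly Rate", "max"),
})
print(df_aggregated)
```

With as_index=False the group keys stay as ordinary columns, so no scaffolding frame or explicit loop is needed.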
grouped=df.groupby(['Team','Year'])['W'].count().reset_index()
team_wins_df=pd.DataFrame(grouped)
team_wins_df=team_wins_df.rename({'W':'Wins'},axis=1)
team_wins_df['Wins']=team_wins_df['Wins'].astype(np.int32)
team_wins_df.reset_index()
print(team_wins_df)
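Try setting group_keys=False in the groupby method to prevent the group key from being added to the index. Note that group_keys only affects the result of apply(); for aggregations such as count(), as_index=False is the relevant option.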
Example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({
"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
"City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"]})
df1.groupby(["Name"], group_keys=False)
