Use Groupby and Sum to reduce a dataframe - python

I have a dataframe with some recurring values in one column. I want to group by that column and sum the other columns.
Edit: here is the code to create the dataframe. Notice the column called 'Able', which is the index.
import pandas as pd

df = pd.DataFrame({'Able': ['Blue', 'Green', 'Red', 'Orange'],
                   'Baker': ['New York', 'New Jersey', 'New York', 'New Jersey'],
                   'Charlie': [3, 4, '', 7], 'Delta': ['', 5, 6, ''],
                   'Echo': [100, 200, 300, 400]}).set_index('Able')
The result should group on 'Baker' and sum the other three columns. I've tried various flavors of groupby and pivot_table. They return the correct two rows (New York and New Jersey), but they only return 'Baker' and the sum for the rightmost column, 'Echo'. The far-left column 'Able', which is the index of the source dataframe, should be ignored. My output should look like this (edited thanks to @corralien for spotting an error):
Baker Charlie Delta Echo
New Jersey 11 5 600
New York 3 6 400
How do I return all the columns, ideally without listing them by name in the code?

Replace the empty strings with 0 and agg sum. This will depend on what dtype the last three columns are. I reproduced the df for you; feel free to edit the question if I got the dtypes wrong. The forum will guide you.
Dataframe
df = pd.DataFrame({'Baker': ['New York', 'New Jersey', 'New York', 'New Jersey'],
                   'Charlie': [3, 4, '', 7], 'Delta': ['', 5, 6, ''],
                   'Echo': [100, 200, 300, 400]})
Code
df.replace('',0).groupby('Baker').agg('sum')
Output
Charlie Delta Echo
Baker
New Jersey 11 5 600
New York 3 6 400

Use pivot_table:
>>> df.pivot_table(index='Baker', values=['Charlie', 'Delta', 'Echo'],
...                aggfunc='sum').reset_index()
Baker Charlie Delta Echo
0 New Jersey 11.0 5.0 600
1 New York 3.0 6.0 400
Ensure your columns Charlie, Delta and Echo are numeric; try df.replace('', 0) or df.fillna(0) to fill your blank cells.
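To answer the "without listing them by name" part: if the value columns are coerced to numeric first, pivot_table aggregates every numeric column by default, so the values= list can be dropped. A minimal sketch, assuming the df from the question (with 'Able' as the index):
clean = df.replace('', 0)
value_cols = clean.columns.drop('Baker')          # everything except the key
clean[value_cols] = clean[value_cols].apply(pd.to_numeric)
out = clean.pivot_table(index='Baker', aggfunc='sum').reset_index()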


Adding a row for missing column value, with all other column values as NaNs

Let's say I have a dataframe with a column with categorical values, e.g. vegetable = ['carrot', 'eggplant', 'peas', 'tomato', 'zucchini']. Now, the dataframe also has another column with categorical values (e.g., location) and, if we group the dataframe by location, it often happens that a location does not have all the vegetables.
What I would like to do is to group the dataframe by location and, for each location dataframe, add one row for each missing vegetable, where all other cells are NaN.
The vegetables should be in alphabetical order, so I would like to add each new row in the right place. What would be the most pythonic way to do this?
This will work off the bat:
import pandas as pd
# specify the rows
rows = [
{'vegetable': 'carrot', 'location': 'london'},
{'vegetable': 'carrot', 'location': 'paris'},
{'vegetable': 'peas', 'location': 'new york'},
{'vegetable': 'tomato', 'location': 'new york'},
{'vegetable': 'zucchini', 'location': 'london'}
]
# create df
df = pd.DataFrame(rows)
# set as categorical
df['vegetable'] = df['vegetable'].astype('category')
df['location'] = df['location'].astype('category')
# group by and count how many we have; observed=False keeps the
# zero-count category combinations in the output
df.groupby(['vegetable', 'location'], observed=False).size()
Output:
vegetable location
carrot london 1
new york 0
paris 1
peas london 0
new york 1
paris 0
tomato london 0
new york 1
paris 0
zucchini london 1
new york 0
paris 0
dtype: int64
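To actually add the missing rows (with NaN in any other columns) rather than just count them, one option is to reindex against the full vegetable x location product. A sketch, assuming each (vegetable, location) pair occurs at most once in df; astype('category') sorts string categories, so the alphabetical-order requirement comes for free:
# full alphabetical product of all category combinations
full_index = pd.MultiIndex.from_product(
    [df['vegetable'].cat.categories, df['location'].cat.categories],
    names=['vegetable', 'location'])
out = (df.set_index(['vegetable', 'location'])
         .reindex(full_index)   # absent pairs become all-NaN rows
         .reset_index())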

How to join tables of multiple events while preserving information?

So I have a use case where I have a few tables with different types of events in a time series, plus another table with base information. The events are of different types with different columns, for example an event of "marriage" could have the columns "husband name" and "wife name", and a table of events on "jobs" can have columns of "hired on" and "fired on" but can also have "husband name". The base info table is not time series data, and has stuff like "case ID" and "city of case".
The goals are: 1. to have all the different time series tables in one table with all possible columns (wherever a column has no data, NaN is fine), and 2. for every entry in the time series to carry all available data from the base info table.
For example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['Dave', 1, 'call'], ['Josh', 2, 'rejection'], ['Greg', 3, 'call']]),
                  columns=['husband name', 'casenum', 'event'])
df2 = pd.DataFrame(np.array([['Dave', 'Mona', 1, 'new lamp'], ['Max', 'Lisa', 1, 'big increase'],
                             ['Pete', 'Esther', 3, 'call'], ['Josh', 'Moana', 2, 'delivery']]),
                   columns=['husband name', 'wife name', 'casenum', 'event'])
df3 = pd.DataFrame(np.array([[1, 'new york'], [3, 'old york'], [2, 'york']]),
                   columns=['casenum', 'city'])
I'm trying a concat:
concat = pd.concat([df, df2, df3])
This doesn't work, because we already know that for case num 1 the city is 'new york'
I'm trying a join:
innerjoin = pd.merge(df, df2, on='casenum', how='inner')
innerjoin = pd.merge(innerjoin, df3, on='casenum', how='inner')
This also isn't right, as I want to have a record of all the events from both tables. Interestingly enough, the result is the same for both inner and outer joins on the dummy data. On my actual data, however, an inner join results in more rows than the sum of both event tables, which I don't quite understand.
Basically, my desired outcome would be:
husband name casenum event wife name city
0 Dave 1 call NaN new york
1 Josh 2 rejection NaN york
2 Greg 3 call NaN old york
0 Dave 1 new lamp Mona new york
1 Max 1 big increase Lisa new york
2 Pete 3 call Esther old york
3 Josh 2 delivery Moana york
I've tried inner joins, outer joins, concats, none seem to work. Maybe I'm just too tired, but what do I need to do to get this output? Thank you!
I think you can merge twice, using the outer option for the first merge:
(df.merge(df2, on=['husband name', 'casenum', 'event'], how='outer')
   .merge(df3, on='casenum')
)
Output:
husband name casenum event wife name city
0 Dave 1 call NaN new york
1 Dave 1 new lamp Mona new york
2 Max 1 big increase Lisa new york
3 Josh 2 rejection NaN york
4 Josh 2 delivery Moana york
5 Greg 3 call NaN old york
6 Pete 3 call Esther old york
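As for why an inner join can return more rows than both event tables combined: merge matches every left row with a given key against every right row with the same key, so duplicated casenum values multiply out into a per-key Cartesian product. A small illustration with hypothetical frames (not from the question):
left = pd.DataFrame({'casenum': [1, 1, 1], 'event': ['a', 'b', 'c']})
right = pd.DataFrame({'casenum': [1, 1, 1, 1], 'note': ['w', 'x', 'y', 'z']})
# 3 left rows x 4 right rows sharing casenum 1 -> 12 rows, not 7
len(pd.merge(left, right, on='casenum', how='inner'))  # 12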

How to turn header inside rows into columns?

How do I turn the headers inside the rows into columns?
For example, I have the DataFrame below and would like to reshape it as shown in the expected output (the original screenshots are omitted; see the code and tables in the edit).
EDIT:
Code to produce current df example
import pandas as pd
df = pd.DataFrame({'Date': [2020, 2021, 2022], 'James': '', ' Sales': [3, 4, 5],
                   ' City': 'NY', ' DIV': 'a', 'KIM': '', ' Sales ': [3, 4, 5],
                   ' City ': 'SF', ' DIV ': 'b'}).T.reset_index()
index 0 1 2
0 Date 2020 2021 2022
1 James
2 Sales 3 4 5
3 City NY NY NY
4 DIV a a a
5 KIM
6 Sales 3 4 5
7 City SF SF SF
8 DIV b b b
looking to get
Name City DIV Account 2020 2021 2022
James NY a Sales 3 4 5
KIM SF b Sales 3 4 5
I think the best way is to iterate over the first column: if a name (e.g. James) has no indent, it is turned into a column, until another unindented value (KIM) comes up. So the idea is to categorize each unindented header into a new column, stopping when the next header (KIM) appears.
Edit 2: there are not only two names (KIM or James); there are around 20 names. Nor are there only the three second levels (Sales, City, DIV): different names have more than 3 second levels, some have 7. The only thing that is consistent is that the names are not indented but the second levels are.
Using a slightly simpler example, this works, but it sure ain't pretty:
df = pd.DataFrame({
    'date': ['James', 'Sales', 'City', 'Kim', 'Sales', 'City'],
    '2020': ['', '3', 'NY', '', '4', 'SF'],
    '2021': ['', '4', 'NY', '', '5', 'SF'],
})
def rows_to_columns(group):
    for value in group.date.values:
        if value != group.person.values[0] and value != 'Sales':
            temp_column = '_' + value
            group.loc[group['date'] == value, temp_column] = group['2020']
            group[value.lower()] = (
                group[temp_column]
                .fillna(method='ffill')
                .fillna(method='bfill')
            )
            group.drop([temp_column], axis=1, inplace=True)
    return group
df.loc[df['2020'] == '', 'person'] = df.date
df.person = df.person.fillna(method='ffill')
new_df = (df
          .groupby('person')
          .apply(lambda x: rows_to_columns(x))
          .drop(['date'], axis=1)
          .loc[df.date == 'Sales']
          )
The basic idea is to:
1. Copy the name into a separate column and fill that column using .fillna(method='ffill'). This works if the assumption holds that every person's block begins with the person's name. Otherwise it wreaks havoc.
2. Convert all other values, such as 'DIV' and 'City', with rows_to_columns(group). The function iterates over all rows in a group that are neither the person's name nor 'Sales', copies the value from the row into a temp column, creates a new column for that row, and uses ffill and bfill to fill it out. It then deletes the temp column and returns the group.
The resulting data frame is in the intended format once the column 'Sales' is dropped.
Note: This solution probably does not work well on larger datasets.
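For the more general case from Edit 2 (arbitrarily many names and varying second levels), the indentation convention itself can drive the reshape: unindented rows start a new name, indented rows belong to it. A sketch under that assumption, using the df built in the question (whose columns are 'index', 0, 1, 2):
import pandas as pd

df = pd.DataFrame({'Date': [2020, 2021, 2022], 'James': '', ' Sales': [3, 4, 5],
                   ' City': 'NY', ' DIV': 'a', 'KIM': '', ' Sales ': [3, 4, 5],
                   ' City ': 'SF', ' DIV ': 'b'}).T.reset_index()

labels = df['index'].str.strip()
years = df.loc[labels == 'Date'].iloc[0, 1:].tolist()      # [2020, 2021, 2022]

body = df[labels != 'Date'].copy()
body['Account'] = labels
# unindented rows carry the name; forward-fill it onto the indented rows
body['Name'] = body['index'].where(~body['index'].str.startswith(' ')).ffill()
body = body[body['index'].str.startswith(' ')]             # keep only second-level rows

sales = body.loc[body['Account'] == 'Sales', ['Name', 0, 1, 2]].set_index('Name')
sales.columns = years
meta = body[body['Account'] != 'Sales'].pivot(index='Name', columns='Account', values=0)
result = meta.join(sales).assign(Account='Sales').reset_index()
result = result[['Name', 'City', 'DIV', 'Account'] + years]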
You gave more details, and I see you are not working with multi-level indexes. In that case, the best way would be to create the DataFrame in the format you need from the start. The way you are creating the first DataFrame is not well structured: the information is not indexed by name (James/KIM), since those are columns of empty values with no link to the other values, and the stacking relies on blank spaces inside strings. Take a look at multi-indexing and generate a DataFrame you can work with, or create the DataFrame in the final format you need.
-- Answer considering multi-level indexes --
Using the little information provided, I see your DataFrame is stacked, meaning it has multiple indexes. The first level is the person (James/KIM) and the second level is Sales/City/DIV. So your DataFrame should be created like this:
import pandas

multi_index = pandas.MultiIndex.from_tuples([
    ('James', 'Sales'), ('James', 'City'), ('James', 'DIV'),
    ('KIM', 'Sales'), ('KIM', 'City'), ('KIM', 'DIV')])
year_2020 = pandas.Series([3, 'NY', 'a', 4, 'SF', 'b'], index=multi_index)
year_2021 = pandas.Series([4, 'NY', 'a', 5, 'SF', 'b'], index=multi_index)
year_2022 = pandas.Series([5, 'NY', 'a', 6, 'SF', 'b'], index=multi_index)
frame = {'2020': year_2020, '2021': year_2021, '2022': year_2022}
df = pandas.DataFrame(frame)
print(df)
2020 2021 2022
James Sales 3 4 5
City NY NY NY
DIV a a a
KIM Sales 4 5 6
City SF SF SF
DIV b b b
Now that you have the multi_level DataFrame, you have many ways to transform it. This is what we will do to make it one level:
sales_df = df.xs('Sales', axis=0, level=1).copy()
div_df = df.xs('DIV', axis=0, level=1).copy()
city_df = df.xs('City', axis=0, level=1).copy()
The results will be:
print(sales_df)
2020 2021 2022
James 3 4 5
KIM 4 5 6
print(div_df)
2020 2021 2022
James a a a
KIM b b b
print(city_df)
2020 2021 2022
James NY NY NY
KIM SF SF SF
You are discarding any information about DIV or City changes across years, so we can reduce the City and DIV dataframes to Series, taking the first year as reference:
div_series = div_df.iloc[:,0]
city_series = city_df.iloc[:,0]
Take the sales DF as reference, and add the City and DIV series:
sales_df['DIV'] = div_series
sales_df['City'] = city_series
sales_df['Account'] = 'Sales'
Now reorder the columns as you wish:
sales_df = sales_df[['City', 'DIV', 'Account', '2020', '2021', '2022']]
print(sales_df)
City DIV Account 2020 2021 2022
James NY a Sales 3 4 5
KIM SF b Sales 4 5 6

Merging Two Dataframes without a Key Column

I have a requirement where I want to merge two data frames without any key column.
From the input table, I am treating the first three columns as one data frame and the last column as another one. My plan is to sort the second data frame and then merge it to the first one without any key column, so that it looks like the desired output.
Is it possible to merge in this way or if there are any alternatives?
One way is to use pd.DataFrame.join after filtering out null values.
Data from @ALollz.
import pandas as pd
df1 = pd.DataFrame({'Country': ['USA', 'UK', 'Finland', 'Spain', 'Australia']})
df2 = pd.DataFrame({'Comments': ['X', None, 'Y', None, 'Z']})
res = df1.join(pd.DataFrame(list(filter(None, df2.values)), columns=['comments']))
Result:
Country comments
0 USA X
1 UK Y
2 Finland Z
3 Spain NaN
4 Australia NaN
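Note that filter(None, df2.values) keeps only the truthy rows; for a single-column frame, each row array's truthiness is that of its lone element, so the None rows are dropped. The trick does not generalize to frames with more than one column.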
If by "sort the second dataframe" you mean move the NULL values to the end of the list and keep the rest of the order in tact, then this will get the job done.
import pandas as pd

df1 = pd.DataFrame({'Country': ['USA', 'UK', 'Finland', 'Spain', 'Australia'],
                    'Name': ['Sam', 'Chris', 'Jeff', 'Kartik', 'Mavenn']})
df2 = pd.DataFrame({'Comments': ['X', None, 'Y', None, 'Z']})
df1['Comments'] = df2[df2.Comments.notnull()].reset_index(drop=True)
Country Name Comments
0 USA Sam X
1 UK Chris Y
2 Finland Jeff Z
3 Spain Kartik NaN
4 Australia Mavenn NaN
IIUC:
input['Comments'] = input.Comments.sort_values().values
Output:
Comments Country Name
1 X USA Sam
2 Y UK Chris
3 Z Finland Jeff
4 NaN Spain Kartik
5 NaN Australia Maven
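This works because Series.sort_values places None/NaN at the end by default (na_position='last'). Note that it also sorts the non-null values; in this example the sorted order happens to coincide with the original order, but for arbitrary data the notnull approach in the previous answer preserves order where this one does not.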

From tuples to multiple columns in pandas

How do I convert this dataframe
location value
0 (Richmond, Virginia, nan, USA) 100
1 (New York City, New York, nan, USA) 200
to this:
city state region country value
0 Richmond Virginia nan USA 100
1 New York City New York nan USA 200
Note that the location column in the first dataframe contains tuples. I want to create four columns out of the location column.
new_col_list = ['city', 'state', 'regions', 'country']
for n, col in enumerate(new_col_list):
    df[col] = df['location'].apply(lambda location: location[n])
df = df.drop('location', axis=1)
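A more compact variant of the same idea builds all four columns in one assignment by expanding the tuples into a 2-D block (a sketch; it assumes every tuple has exactly four elements):
import pandas as pd

df = pd.DataFrame({'location': [('Richmond', 'Virginia', float('nan'), 'USA'),
                                ('New York City', 'New York', float('nan'), 'USA')],
                   'value': [100, 200]})

new_col_list = ['city', 'state', 'regions', 'country']
# one row per tuple, aligned on the original index
df[new_col_list] = pd.DataFrame(df['location'].tolist(), index=df.index)
df = df.drop('location', axis=1)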
If you return a Series of the (split) location, you can merge (join merges on index) the resulting DataFrame directly with your value column:
addr = ['city', 'state', 'region', 'country']
df[['value']].join(df.location.apply(lambda loc: pd.Series(loc, index=addr)))
value city state region country
0 100 Richmond Virginia NaN USA
1 200 New York City New York NaN USA
I haven't timed this, but I would suggest this option:
df.loc[:, 'city'] = df.location.map(lambda x: x[0])
df.loc[:, 'state'] = df.location.map(lambda x: x[1])
df.loc[:, 'regions'] = df.location.map(lambda x: x[2])
df.loc[:, 'country'] = df.location.map(lambda x: x[3])
I'm guessing that avoiding an explicit for loop might lend itself to vectorized execution (numpy certainly looks for that, though perhaps other libraries don't).
I prefer to use pd.DataFrame.from_records to convert the tuples to Series. Then this can be joined to the previous dataset as described by meloncholy.
df = pd.DataFrame({"location":[("Richmond", "Virginia", pd.NA, "USA"),
("New York City", "New York", pd.NA, "USA")],
"value": [100,200]})
loc = pd.DataFrame.from_records(df.location, columns=['city','state','regions','country'])
df.drop("location", axis=1).join(loc)
from_records does assume a sequential index. If this is not the case you should pass the index to the new DataFrame:
loc = pd.DataFrame.from_records(df.location.reset_index(drop=True),
                                columns=['city', 'state', 'regions', 'country'],
                                index=df.index)
