Pandas - Sum of grouping and transferring results to another dataframe - python

I have multiple dataframes with the same columns. The first one is df1:
Name     I
Jack     1.0
Louis    1.0
Jack     2.0
Louis    5.0
Jack     4.0
Mark     2.0
-        -
Mark     3.0
df_2:
Name     I
Jack     3.0
Louis    3.0
Jack     2.0
Louis    1.0
Jack     6.0
Mark     7.0
-        -
Mark     3.0
The new dataframe ndf should look like this:
Name     res_df1    res_df2
Jack     7.0        11.0
Louis    6.0        4.0
Mark     5.0        10.0
res_df1 and res_df2 are the sums grouped by Name from the corresponding dataframes.
How do I get this result table? How do I match the grouped sums from the different dataframes and write each sum to the corresponding group row in the new dataframe? I have done it like this:
frames = [df1, df2, ... df9]
ndf = pd.concat(frames)
ndf = ndf.drop_duplicates('Name')
ndf['res_df1'] = df1.groupby('Name', sort=False)['I'].transform('sum').round(2)
ndf['res_df2'] = df2.groupby('Name', sort=False)['I'].transform('sum').round(2)
...
ndf['res_df9'] = df9.groupby('Name', sort=False)['I'].transform('sum').round(2)
But the problem is that I don't get the right sums.

Try:
frames = [df_1, df_2]
final_df = pd.DataFrame()
for index, df in enumerate(frames, start=1):   # start=1 so the columns are res_df1, res_df2, ...
    df_count = df.groupby('Name')['I'].sum().reset_index(name=f'res_df{index}')
    if index == 1:
        final_df = df_count.copy(deep=True)
    else:
        final_df = final_df.merge(df_count, how='outer', on='Name')
print(final_df)
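An alternative sketch (not from the answer above): since every per-dataframe result is just a grouped sum, the results can also be combined in a single concat along the columns. The frames below only reproduce the sample data from the question:
import pandas as pd

# sample data from the question (assumed; the real frames come from elsewhere)
df_1 = pd.DataFrame({'Name': ['Jack', 'Louis', 'Jack', 'Louis', 'Jack', 'Mark', 'Mark'],
                     'I': [1.0, 1.0, 2.0, 5.0, 4.0, 2.0, 3.0]})
df_2 = pd.DataFrame({'Name': ['Jack', 'Louis', 'Jack', 'Louis', 'Jack', 'Mark', 'Mark'],
                     'I': [3.0, 3.0, 2.0, 1.0, 6.0, 7.0, 3.0]})

frames = [df_1, df_2]  # extend with df_3 ... df_9 as needed
sums = [df.groupby('Name')['I'].sum().round(2).rename(f'res_df{i}')
        for i, df in enumerate(frames, start=1)]
ndf = pd.concat(sums, axis=1).reset_index()  # one column of sums per input frame
print(ndf)
#     Name  res_df1  res_df2
# 0   Jack      7.0     11.0
# 1  Louis      6.0      4.0
# 2   Mark      5.0     10.0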

Related

How can I 1) change a part of text in values (e.g., ', ' -> '__') and 2) give different values to missing values in Python dataframe?

I converted a JSON variable to multiple paired variables.
As a result, I have a dataset like
home_city_1 home_number_1 home_city_2 home_number_2 home_city_3 home_number_3 home_city_4 home_number_4
Coeur D Alene, ID 13.0 Hayden, ID 8.0 Renton, WA 2.0 NaN NaN
Spokane, WA 3.0 Amber, WA 2.0 NaN NaN NaN NaN
Sioux Falls, SD 9.0 Stone Mountain, GA 2.0 Watertown, SD 2.0 Dell Rapids, SD 2.0
Ludowici, GA 11.0 NaN NaN NaN NaN NaN NaN
This data set has 600 columns (300 * 2).
I want to convert the values under these conditions:
Change ' ' or ',' in the home_city_# column values to '_' (underscore). For example, 'Sioux Falls, SD' becomes 'Sioux_Falls__SD'.
Convert missing values to 'm' (missing in home_city_#) or -1 (missing in home_number_#).
I have tried
customer_home_city_json_2 = customer_home_city_json_1.replace(',', '_')
customer_home_city_json_2 = customer_home_city_json_2.apply(lambda x: x.replace('null', "-1"))
Try
citys = [col for col in df.columns if 'home_city_' in col]
numbers = [col for col in df.columns if 'home_number_' in col]
df[citys] = df[citys].replace(r"\s|,", "_", regex=True)
df[citys] = df[citys].fillna('m')
df[numbers] = df[numbers].fillna(-1)
To perform the correct tasks you have to get the column names for 'home_city_#' and 'home_number_#'. This is done in the first two lines.
For replacing " " and "," with "_" I call replace() with regex=True to use regular expressions. \s is a shortcut that matches any whitespace character; it could also be replaced by a literal space.
For filling the NaNs I use fillna and set the wanted value, -1 or 'm'. I suggest not mixing types in a column; therefore I use -1 for the numbers and 'm' for the cities.
Example
If this is your DataFrame:
home_city_1 home_number_1 home_city_2 home_number_2
0 Coeur D Alene, ID 13.0 Hayden, ID 8.0
1 Spokane, WA 3.0 Amber, WA 2.0
2 Sioux Falls, SD 9.0 Stone Mountain, GA 2.0
3 Ludowici, GA 11.0 NaN NaN
the output will be
home_city_1 home_number_1 home_city_2 home_number_2
0 Coeur_D_Alene__ID 13.0 Hayden__ID 8.0
1 Spokane__WA 3.0 Amber__WA 2.0
2 Sioux_Falls__SD 9.0 Stone_Mountain__GA 2.0
3 Ludowici__GA 11.0 m -1.0
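For reference, here is a self-contained run of the snippet above on a two-pair cut of the sample data (a sketch; the real frame has 600 columns):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'home_city_1': ['Coeur D Alene, ID', 'Spokane, WA', 'Ludowici, GA'],
    'home_number_1': [13.0, 3.0, 11.0],
    'home_city_2': ['Hayden, ID', 'Amber, WA', np.nan],
    'home_number_2': [8.0, 2.0, np.nan],
})

citys = [col for col in df.columns if 'home_city_' in col]
numbers = [col for col in df.columns if 'home_number_' in col]

df[citys] = df[citys].replace(r"\s|,", "_", regex=True)  # ' ' and ',' -> '_'
df[citys] = df[citys].fillna('m')                        # missing cities -> 'm'
df[numbers] = df[numbers].fillna(-1)                     # missing numbers -> -1
print(df)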
Considering that df is the name of your dataframe, you can try this:
city_cols = df.filter(regex='^home_city').columns
df[city_cols] = (df[city_cols]
                 .replace(' ', '_', regex=True)
                 .replace(',', '_', regex=True)
                 .fillna('m'))
number_cols = df.filter(regex='^home_number').columns
df[number_cols] = df[number_cols].fillna(-1)
By using pandas.DataFrame.filter and regex you can filter by columns that have the same prefix.

Bind one row cell with multiple row cells for an Excel sheet in pandas (Jupyter notebook)

I have an Excel sheet like this.
If I search using the method below, I get only 1 row:
df4 = df.loc[(df['NAME '] == 'HIR')]
df4
But I want to get all rows connected with this name (the same applies to birthdate and place).
expected output:
How can I achieve this? How can I bind these things together?
You need to forward fill the data with ffill():
import numpy as np
df = df.replace('', np.nan)  # in case you don't have null values, but you have empty strings
df['NAME '] = df['NAME '].ffill()
df4 = df.loc[(df['NAME '] == 'HIR')]
df4
That will then bring up all of the rows when you use loc. You can do this on other columns as well.
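For example, a small sketch building on the snippet above (the BIRTHDATE and PLACE column names are taken from the sample output further down and may differ in your sheet):
cols = ['NAME ', 'BIRTHDATE', 'PLACE']   # note the trailing space in 'NAME '
df[cols] = df[cols].ffill()              # forward fill several columns at once
df4 = df.loc[df['NAME '] == 'HIR']
df4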
First you need to remove those blank rows in your Excel data, then fill values from the previous row:
import pandas as pd
df = pd.read_excel('so.xlsx')
df = df[~df['HOBBY'].isna()]
df[['SNO','NAME']] = df[['SNO','NAME']].ffill()
df
SNO NAME HOBBY COURSE BIRTHDATE PLACE
0 1.0 HIR DANCING BTECH 1990.0 USA
1 1.0 HIR MUSIC MTECH NaN NaN
2 1.0 HIR TRAVELLING AI NaN NaN
4 2.0 BH GAMES BTECH 1992.0 INDIA
5 2.0 BH BOOKS AI NaN NaN
6 2.0 BH SWIMMING NaN NaN NaN
7 2.0 BH MUSIC NaN NaN NaN
8 2.0 BH DANCING NaN NaN NaN

Count weights above and below a given value and group by hospital name

I need some small help. I have data containing hospital names and birth weights in kilograms. I want to group and count weights below 1 kg and above 1 kg per individual hospital. Here is what my data looks like:
import pandas as pd

# initialise data of lists.
data = {'Hospital': ['Ruack', 'Ruack', 'Pens', 'Rick', 'Pens', 'Rick'],
        'Birth_weight': ['1.0', '0.1', '2.1', '0.9', '2.19', '0.88']}
# Create DataFrame
dfy = pd.DataFrame(data)
# Print the output.
print(dfy)
Here is what I tried
#weight below 1kg
weight_count=pd.DataFrame(dfy.groupby('Hospital')['Birth_weight'] < 1.value_counts())
weight_count = weight_count.rename({'Birth_weight': 'weight_count'}, axis='columns')
weight_final = weight_count.reset_index()
#weight above 1kg
weight_count=pd.DataFrame(dfy.groupby('Hospital')['Birth_weight'] > 1.value_counts())
weight_count = weight_count.rename({'Birth_weight': 'weight_count'}, axis='columns')
weight_final = weight_count.reset_index()
The expected result is a table with counts of birth weights under 1 kg and above 1 kg, grouped per hospital.
EXPECTED TABLE
# initialise data of lists.
data = {'Hospital':['Ruack' , 'Rick','pens'],'< 1kg_count':['1', '2' , 'NAN'], '>1kg_count':['1','NAN' ,'2']}
# Create DataFrame
df_final = pd.DataFrame(data)
# Print the output.
print(df_final)
Use numpy.where to categorize into a new column and then GroupBy.size with Series.unstack:
# if necessary, convert to floats
dfy['Birth_weight'] = dfy['Birth_weight'].astype(float)
dfy['group'] = np.where(dfy['Birth_weight'] < 1,'< 1kg_count','>1kg_count')
df = dfy.groupby(['Hospital', 'group']).size().unstack().reset_index()
print (df)
group Hospital < 1kg_count >1kg_count
0 Pens NaN 2.0
1 Rick 2.0 NaN
2 Ruack 1.0 1.0
Another idea with DataFrame.pivot_table:
dfy['Birth_weight'] = dfy['Birth_weight'].astype(float)
g = np.where(dfy['Birth_weight'] < 1,'< 1kg_count','>1kg_count')
df = dfy.pivot_table(index='Hospital', columns=g, aggfunc='size').reset_index()
print (df)
Hospital < 1kg_count >1kg_count
0 Pens NaN 2.0
1 Rick 2.0 NaN
2 Ruack 1.0 1.0
EDIT: If you want binning of the column, use cut:
dfy['Birth_weight'] = dfy['Birth_weight'].astype(float)
bins = np.arange(0, 5.5, 0.5)
labels = ['{}-{}kg_count'.format(i, j) for i, j in zip(bins[:-1], bins[1:])]
#print (bins)
#print (labels)
g = pd.cut(dfy['Birth_weight'], bins=bins, labels=labels)
df = dfy.pivot_table(index='Hospital', columns=g, aggfunc='size')
print (df)
Birth_weight 0.0-0.5kg_count 0.5-1.0kg_count 2.0-2.5kg_count
Hospital
Pens NaN NaN 2.0
Rick NaN 2.0 NaN
Ruack 1.0 1.0 NaN
Are you looking for something like this?
a=(dfy['Birth_weight'].astype(float)<1).map({True: 'Less than 1kg', False: 'More than 1kg'})
dfy.groupby(['Hospital',a])['Birth_weight'].count().reset_index(name='Count')
Output
Hospital Birth_weight Count
0 Pens More than 1kg 2
1 Rick Less than 1kg 2
2 Ruack Less than 1kg 1
3 Ruack More than 1kg 1
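If you want the wide layout from the question, the same counts can be reshaped with unstack() (a small sketch building on the snippet above):
wide = dfy.groupby(['Hospital', a])['Birth_weight'].count().unstack()
# one row per Hospital, one column per label ('Less than 1kg' / 'More than 1kg')
print(wide)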
import pandas as pd
import numpy as np
# initialise data of lists.
data = {'Hospital': ['Ruack', 'Ruack', 'Pens', 'Rick', 'Pens', 'Rick'],
        'Birth_weight': ['1.0', '0.1', '2.1', '0.9', '2.19', '0.88']}
# Create DataFrame
dfy = pd.DataFrame(data)
dfy['Birth_weight'] = dfy['Birth_weight'].astype(float)
df1 = dfy.groupby(['Hospital', 'Birth_weight'])
df1.filter(lambda x: (x['Birth_weight'] > 1).all())   # rows with weight above 1 kg
df1.filter(lambda x: (x['Birth_weight'] < 1).all())   # rows with weight below 1 kg

Python Pandas groupby, row value to column headers

I have a DataFrame which I want to transpose:
import pandas as pd
sid= '13HKQ0Ue1_YCP-pKUxFuqdiqgmW_AZeR7P3VsUwrCnZo' # spreadsheet id
gid = 0 # sheet unique id (0 equals sheet0)
url = 'https://docs.google.com/spreadsheets/d/{}/export?gid={}&format=csv'.format(sid,gid)
df = pd.read_csv(url)
What I want to do is get the StoreNAME and CAT (category) as column headers and have weights vs. prices for every category.
Desired output:
I have tried loops and Pandas but cannot figure it out.
I thought it could be done with df.groupby, but the returned object is not a DataFrame.
I get all this from a JSON output of an API:
API Link for 1STORE
import pandas as pd
import json, requests
from cytoolz.dicttoolz import merge
page = requests.get(mainurl)
dict_dta = json.loads(page.text) # load in Python DICT
list_columns = ['id', 'name', 'category_name', 'ounce', 'gram', 'two_grams', 'quarter', 'eighth','half_ounce','unit','half_gram'] # get the unformatted output
df = pd.io.json.json_normalize(dict_dta, ['categories', ['items']]).pipe(lambda x: x.drop('prices', 1).join(x.prices.apply(lambda y: pd.Series(merge(y)))))[list_columns]
df.to_csv('name')
I have tried tons of methods.
If someone could just point me in the right direction, it would be very helpful.
Is this in the right direction?
import pandas as pd
sid= '13HKQ0Ue1_YCP-pKUxFuqdiqgmW_AZeR7P3VsUwrCnZo' # spreadsheet id
gid = 0 # sheet unique id (0 equals sheet0)
url = 'https://docs.google.com/spreadsheets/d/{}/export?gid={}&format=csv'.format(sid,gid)
df = pd.read_csv(url)
for idx, dfx in df.groupby(df.CAT):
    if idx != 'Flower':
        continue
    df_test = dfx.drop(['CAT', 'NAME'], axis=1)
    df_test = df_test.rename(columns={'StoreNAME': idx}).set_index(idx).T
df_test
Returns:
Flower Pueblo West Organics - Adult Use Pueblo West Organics - Adult Use \
UNIT NaN NaN
HALFOUNCE 15.0 50.0
EIGHTH NaN 25.0
TWOGRAMS NaN NaN
QUARTER NaN 40.0
OUNCE 30.0 69.0
GRAM NaN 9.0
Flower Pueblo West Organics - Adult Use Three Rivers Dispensary - REC \
UNIT NaN NaN
HALFOUNCE 50.0 75.0
EIGHTH 25.0 20.0
TWOGRAMS NaN NaN
QUARTER 40.0 45.0
OUNCE 69.0 125.0
GRAM 9.0 8.0
Flower Three Rivers Dispensary - REC
UNIT NaN
HALFOUNCE 75.0
EIGHTH 20.0
TWOGRAMS NaN
QUARTER 40.0
OUNCE 125.0
GRAM 8.0
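If you need one table covering every category rather than just 'Flower', a possible extension of the loop above (a sketch, assuming the CAT, NAME and StoreNAME columns from the snippet above) collects each transposed block and concatenates them side by side:
pieces = {}
for idx, dfx in df.groupby(df.CAT):
    block = dfx.drop(['CAT', 'NAME'], axis=1)
    pieces[idx] = block.rename(columns={'StoreNAME': idx}).set_index(idx).T

result = pd.concat(pieces, axis=1)  # one group of columns per category, stores as the inner level
print(result)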

Dropping NaN rows, certain columns in specific excel files using glob/merge

I would like to drop NaN rows from the final file in a for loop that loads Excel files, and drop the duplicated Company, Emails, Created, etc. columns from all but the final loaded Excel file.
Here is my for loop (and subsequent merging into a single DF), currently:
for f in glob.glob("./gowall-users-export-*.xlsx"):
    df = pd.read_excel(f)
    all_users_sheets_hosts.append(df)
    j = re.search('(\d+)', f)
    df.columns = df.columns.str.replace('.*Hosted Meetings.*', 'Hosted Meetings' + ' ' + j.group(1))
all_users_sheets_hosts = reduce(lambda left, right: pd.merge(left, right, on=['First Name', 'Last Name'], how='outer'), all_users_sheets_hosts)
Here are the first few rows of the resulting DF:
Company_x First Name Last Name Emails_x Created_x Hosted Meetings 03112016 Facilitated Meetings_x Attended Meetings_x Company_y Emails_y ... Created_x Hosted Meetings 04122016 Facilitated Meetings_x Attended Meetings_x Company_y Emails_y Created_y Hosted Meetings 04212016 Facilitated Meetings_y Attended Meetings_y
0 TS X Y X#Y.com 03/10/2016 0.0 0.0 0.0 TS X#Y.com ... 03/10/2016 0.0 0.0 2.0 NaN NaN NaN NaN NaN NaN
1 TS X Y X#Y.com 03/10/2016 0.0 0.0 0.0 TS X#Y.com ... 01/25/2016 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN
2 TS X Y X#Y.com 03/10/2016 0.0 0.0 0.0 TS X#Y.com ... 04/06/2015 9.0 10.0 17.0 NaN NaN NaN NaN NaN NaN
To prevent multiple Company, Emails, Created, Facilitated Meetings and Attended Meetings columns, drop them from the right DataFrame. To remove rows with all NaN values, use result.dropna(how='all', axis=0):
import glob
import re
import functools
import pandas as pd

all_users_sheets_hosts = []   # collect one DataFrame per exported file
for f in glob.glob("./gowall-users-export-*.xlsx"):
    df = pd.read_excel(f)
    all_users_sheets_hosts.append(df)
    j = re.search(r'(\d+)', f)
    df.columns = df.columns.str.replace('.*Hosted Meetings.*',
                                        'Hosted Meetings' + ' ' + j.group(1),
                                        regex=True)

# Drop rows of all NaNs from the final DataFrame in `all_users_sheets_hosts`
all_users_sheets_hosts[-1] = all_users_sheets_hosts[-1].dropna(how='all', axis=0)

def mergefunc(left, right):
    cols = ['Company', 'Emails', 'Created', 'Facilitated Meetings', 'Attended Meetings']
    right = right.drop(cols, axis=1)
    result = pd.merge(left, right, on=['First Name', 'Last Name'], how='outer')
    return result

all_users_sheets_hosts = functools.reduce(mergefunc, all_users_sheets_hosts)
Since the Company et al. columns will only exist in the left DataFrame, there will be no proliferation of those columns. Note, however, that if the left and right DataFrames have different values in those columns, only the values from the first DataFrame in all_users_sheets_hosts will be kept.
Alternatively, if the left and right DataFrames have the same values for the Company et al. columns, another option would be to simply merge on those columns too:
def mergefunc(left, right):
    cols = ['First Name', 'Last Name', 'Company', 'Emails', 'Created',
            'Facilitated Meetings', 'Attended Meetings']
    result = pd.merge(left, right, on=cols, how='outer')
    return result

all_users_sheets_hosts = functools.reduce(mergefunc, all_users_sheets_hosts)
