Python Pandas groupby, row value to column headers

I have a DataFrame which I want to transpose:
import pandas as pd
sid= '13HKQ0Ue1_YCP-pKUxFuqdiqgmW_AZeR7P3VsUwrCnZo' # spreadsheet id
gid = 0 # sheet unique id (0 equals sheet0)
url = 'https://docs.google.com/spreadsheets/d/{}/export?gid={}&format=csv'.format(sid,gid)
df = pd.read_csv(url)
What I want is StoreNAME and CAT as the column headers, with weight vs. price for every category.
Desired output:
I have tried loops and various pandas approaches but cannot figure it out.
I thought it could be done with df.groupby, but the returned object is a DataFrameGroupBy, not a DataFrame.
I get all this from a JSON output of an API:
API Link for 1STORE
import pandas as pd
import json, requests
from cytoolz.dicttoolz import merge

page = requests.get(mainurl)  # mainurl is the API link above
dict_dta = json.loads(page.text)  # load into a Python dict
list_columns = ['id', 'name', 'category_name', 'ounce', 'gram', 'two_grams',
                'quarter', 'eighth', 'half_ounce', 'unit', 'half_gram']  # columns to keep
df = (pd.io.json.json_normalize(dict_dta, ['categories', ['items']])
        .pipe(lambda x: x.drop('prices', axis=1)
                         .join(x.prices.apply(lambda y: pd.Series(merge(y))))))[list_columns]
df.to_csv('name')
I have tried tons of methods.
If someone could just point me in the right direction, it would be very helpful.

Is this in the right direction?
import pandas as pd
sid= '13HKQ0Ue1_YCP-pKUxFuqdiqgmW_AZeR7P3VsUwrCnZo' # spreadsheet id
gid = 0 # sheet unique id (0 equals sheet0)
url = 'https://docs.google.com/spreadsheets/d/{}/export?gid={}&format=csv'.format(sid,gid)
df = pd.read_csv(url)
for idx, dfx in df.groupby(df.CAT):
    if idx != 'Flower':
        continue
    df_test = dfx.drop(['CAT', 'NAME'], axis=1)
    df_test = df_test.rename(columns={'StoreNAME': idx}).set_index(idx).T

df_test
Returns:
Flower Pueblo West Organics - Adult Use Pueblo West Organics - Adult Use \
UNIT NaN NaN
HALFOUNCE 15.0 50.0
EIGHTH NaN 25.0
TWOGRAMS NaN NaN
QUARTER NaN 40.0
OUNCE 30.0 69.0
GRAM NaN 9.0
Flower Pueblo West Organics - Adult Use Three Rivers Dispensary - REC \
UNIT NaN NaN
HALFOUNCE 50.0 75.0
EIGHTH 25.0 20.0
TWOGRAMS NaN NaN
QUARTER 40.0 45.0
OUNCE 69.0 125.0
GRAM 9.0 8.0
Flower Three Rivers Dispensary - REC
UNIT NaN
HALFOUNCE 75.0
EIGHTH 20.0
TWOGRAMS NaN
QUARTER 40.0
OUNCE 125.0
GRAM 8.0

Related

Pandas - Sum of grouping and transferring results to another dataframe

I have multiple dataframes with the same columns. The first one is df1:
Name      I
Jack    1.0
Louis   1.0
Jack    2.0
Louis   5.0
Jack    4.0
Mark    2.0
...     ...
Mark    3.0
df_2:
Name      I
Jack    3.0
Louis   3.0
Jack    2.0
Louis   1.0
Jack    6.0
Mark    7.0
...     ...
Mark    3.0
I should have a new dataframe ndf as:
Name   res_df1  res_df2
Jack       7.0     11.0
Louis      6.0      4.0
Mark       5.0     10.0
res_df1 and res_df2 are the sums grouped by Name from the corresponding dataframes.
How do I get this result table? How do I match the grouped sums from the different dataframes and write each sum to the corresponding group in the new dataframe? I have done it like this:
frames = [df1, df2, ... df9]
ndf = pd.concat(frames)
ndf = ndf.drop_duplicates('Name')
ndf['res_df1'] = df1.groupby('Name', sort=False)['I'].transform('sum').round(2)
ndf['res_df2'] = df2.groupby('Name', sort=False)['I'].transform('sum').round(2)
...
ndf['res_df9'] = df9.groupby('Name', sort=False)['I'].transform('sum').round(2)
But the problem is I can't get right sum.
Try:
frames = [df_1, df_2]
final_df = pd.DataFrame()
for index, df in enumerate(frames, start=1):
    df_count = df.groupby('Name')['I'].sum().reset_index(name=f'res_df{index}')
    if index == 1:
        final_df = df_count.copy(deep=True)
    else:
        final_df = final_df.merge(df_count, how='outer', on='Name')
print(final_df)
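A more compact variant of the same idea, as a sketch assuming every frame shares the Name and I columns: build one grouped sum per frame and let pd.concat align them on Name.
import pandas as pd

# hypothetical toy frames matching the question's layout
df_1 = pd.DataFrame({'Name': ['Jack', 'Louis', 'Jack', 'Louis', 'Jack', 'Mark', 'Mark'],
                     'I': [1.0, 1.0, 2.0, 5.0, 4.0, 2.0, 3.0]})
df_2 = pd.DataFrame({'Name': ['Jack', 'Louis', 'Jack', 'Louis', 'Jack', 'Mark', 'Mark'],
                     'I': [3.0, 3.0, 2.0, 1.0, 6.0, 7.0, 3.0]})

# one grouped-sum Series per frame; concat outer-joins them on the Name index
sums = {f'res_df{i}': frame.groupby('Name')['I'].sum()
        for i, frame in enumerate([df_1, df_2], start=1)}
ndf = pd.concat(sums, axis=1).reset_index()
print(ndf)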

Bind one row cell with multiple row cells in an Excel sheet using pandas in a Jupyter notebook

I have an excel sheet like this.
If I search using the method below, I get only one row.
df4 = df.loc[(df['NAME '] == 'HIR')]
df4
But I want to get all rows connected with this name (likewise for birthdate and place).
Expected output:
How can I achieve this? How can I bind these rows together?
You need to forward fill the data with ffill():
import numpy as np

df = df.replace('', np.nan)  # in case you don't have null values but do have empty strings
df['NAME '] = df['NAME '].ffill()
df4 = df.loc[df['NAME '] == 'HIR']
df4
That will then bring up all of the rows when you use loc. You can do this on other columns as well.
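If the birthdate and place columns follow the same blank pattern, the fill can be applied to all of them in one step. A sketch, assuming these column names (note the trailing space in 'NAME ' that the sheet apparently has):
cols = ['NAME ', 'BIRTHDATE', 'PLACE']
df[cols] = df[cols].ffill()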
First you need to remove the blank rows in your Excel sheet, then fill values from the previous row:
import pandas as pd

df = pd.read_excel('so.xlsx')
df = df[~df['HOBBY'].isna()]  # drop the fully blank spacer rows
df[['SNO', 'NAME']] = df[['SNO', 'NAME']].ffill()  # fill down from the previous row
df
SNO NAME HOBBY COURSE BIRTHDATE PLACE
0 1.0 HIR DANCING BTECH 1990.0 USA
1 1.0 HIR MUSIC MTECH NaN NaN
2 1.0 HIR TRAVELLING AI NaN NaN
4 2.0 BH GAMES BTECH 1992.0 INDIA
5 2.0 BH BOOKS AI NaN NaN
6 2.0 BH SWIMMING NaN NaN NaN
7 2.0 BH MUSIC NaN NaN NaN
8 2.0 BH DANCING NaN NaN NaN

Python pandas show repeated values

I'm trying to get data from a txt file with pandas.read_csv, but it doesn't show the repeated (same) values in the file: for example, 2043 appears in many rows but is shown only once, not in every row.
My file sample:
Result set:
All the circles I've drawn should also contain 2043, but they are empty.
My code is :
import pandas as pd

df = pd.read_csv('samplefile.txt', sep='\t', header=None,
                 names=["234", "235", "236"])
You get a MultiIndex, and repeated values in the first level are simply not displayed.
You can convert the MultiIndex to columns with reset_index:
df = df.reset_index()
Or specify every column in the names parameter to avoid the MultiIndex in the first place:
df = pd.read_csv('samplefile.txt', sep='\t',
                 names=["one", "two", "next", "234", "235", "236"])
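Here is a minimal sketch of the display effect with hypothetical data: repeated first-level values render as blanks in the MultiIndex repr, and reset_index brings them back as an ordinary column.
import pandas as pd

# hypothetical data mimicking the question: 2043 repeats in the first column
df = pd.DataFrame({'a': [2043, 2043, 2043, 2044],
                   'b': [1, 2, 3, 4],
                   'val': [10, 20, 30, 40]}).set_index(['a', 'b'])
print(df)                # 2043 is printed once; the repeats show as blanks
print(df.reset_index())  # every row shows its own value again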
A word of warning with MultiIndex, as I was bitten by this yesterday and wasted time trying to troubleshoot a non-existent problem.
If one of your index levels is of type float64, you may find that the indexes are not shown in full. I had a dataframe I was running df.groupby().describe() on, and the variable I was grouping on was originally a long int; at some point it was converted to a float, and when printed this index was rounded. There were a number of values very close to each other, so on printing it appeared that the groupby() had found multiple levels of the second index under a single first-level value.
That's not very clear, so here is an illustrative example...
import numpy as np
import pandas as pd

index = np.random.uniform(low=89908893132829,
                          high=89908893132929,
                          size=(50,))
df = pd.DataFrame({'obs': np.arange(100)},
                  index=np.append(index, index)).sort_index()
df.index.name = 'index1'
df['index2'] = [1, 2] * 50
df.reset_index(inplace=True)
df.set_index(['index1', 'index2'], inplace=True)
Look at the dataframe and it appears that there is only one level of index1...
df.head(10)
obs
index1 index2
8.990889e+13 1 4
2 54
1 61
2 11
1 89
2 39
1 65
2 15
1 60
2 10
Run df.groupby(['index1', 'index2']).describe() and it looks like there is only one level of index1...
summary = df.groupby(['index1', 'index2']).describe()
summary.head()
obs
count mean std min 25% 50% 75% max
index1 index2
8.990889e+13 1 1.0 4.0 NaN 4.0 4.0 4.0 4.0 4.0
2 1.0 54.0 NaN 54.0 54.0 54.0 54.0 54.0
1 1.0 61.0 NaN 61.0 61.0 61.0 61.0 61.0
2 1.0 11.0 NaN 11.0 11.0 11.0 11.0 11.0
1 1.0 89.0 NaN 89.0 89.0 89.0 89.0 89.0
But if you look at the actual values of index1 in either output, you see that there are many unique values. In the original dataframe...
df.index.get_level_values('index1')
Float64Index([89908893132833.12, 89908893132833.12, 89908893132834.08,
89908893132834.08, 89908893132835.05, 89908893132835.05,
89908893132836.3, 89908893132836.3, 89908893132837.95,
89908893132837.95, 89908893132838.1, 89908893132838.1,
89908893132838.6, 89908893132838.6, 89908893132841.89,
89908893132841.89, 89908893132841.95, 89908893132841.95,
89908893132845.81, 89908893132845.81, 89908893132845.83,
89908893132845.83, 89908893132845.88, 89908893132845.88,
89908893132846.02, 89908893132846.02, 89908893132847.2,
89908893132847.2, 89908893132847.67, 89908893132847.67,
89908893132848.5, 89908893132848.5, 89908893132848.5,
89908893132848.5, 89908893132855.17, 89908893132855.17,
89908893132855.45, 89908893132855.45, 89908893132864.62,
89908893132864.62, 89908893132868.61, 89908893132868.61,
89908893132873.16, 89908893132873.16, 89908893132875.6,
89908893132875.6, 89908893132875.83, 89908893132875.83,
89908893132878.73, 89908893132878.73, 89908893132879.9,
89908893132879.9, 89908893132880.67, 89908893132880.67,
89908893132880.69, 89908893132880.69, 89908893132881.31,
89908893132881.31, 89908893132881.69, 89908893132881.69,
89908893132884.45, 89908893132884.45, 89908893132887.27,
89908893132887.27, 89908893132887.83, 89908893132887.83,
89908893132892.8, 89908893132892.8, 89908893132894.34,
89908893132894.34, 89908893132894.5, 89908893132894.5,
89908893132901.88, 89908893132901.88, 89908893132903.27,
89908893132903.27, 89908893132904.53, 89908893132904.53,
89908893132909.27, 89908893132909.27, 89908893132910.38,
89908893132910.38, 89908893132911.86, 89908893132911.86,
89908893132913.4, 89908893132913.4, 89908893132915.73,
89908893132915.73, 89908893132916.06, 89908893132916.06,
89908893132922.48, 89908893132922.48, 89908893132923.44,
89908893132923.44, 89908893132924.66, 89908893132924.66,
89908893132925.14, 89908893132925.14, 89908893132928.28,
89908893132928.28],
dtype='float64', name='index1')
...and in the summarised dataframe...
summary.index.get_level_values('index1')
Float64Index([89908893132833.12, 89908893132833.12, 89908893132834.08,
89908893132834.08, 89908893132835.05, 89908893132835.05,
89908893132836.3, 89908893132836.3, 89908893132837.95,
89908893132837.95, 89908893132838.1, 89908893132838.1,
89908893132838.6, 89908893132838.6, 89908893132841.89,
89908893132841.89, 89908893132841.95, 89908893132841.95,
89908893132845.81, 89908893132845.81, 89908893132845.83,
89908893132845.83, 89908893132845.88, 89908893132845.88,
89908893132846.02, 89908893132846.02, 89908893132847.2,
89908893132847.2, 89908893132847.67, 89908893132847.67,
89908893132848.5, 89908893132848.5, 89908893132855.17,
89908893132855.17, 89908893132855.45, 89908893132855.45,
89908893132864.62, 89908893132864.62, 89908893132868.61,
89908893132868.61, 89908893132873.16, 89908893132873.16,
89908893132875.6, 89908893132875.6, 89908893132875.83,
89908893132875.83, 89908893132878.73, 89908893132878.73,
89908893132879.9, 89908893132879.9, 89908893132880.67,
89908893132880.67, 89908893132880.69, 89908893132880.69,
89908893132881.31, 89908893132881.31, 89908893132881.69,
89908893132881.69, 89908893132884.45, 89908893132884.45,
89908893132887.27, 89908893132887.27, 89908893132887.83,
89908893132887.83, 89908893132892.8, 89908893132892.8,
89908893132894.34, 89908893132894.34, 89908893132894.5,
89908893132894.5, 89908893132901.88, 89908893132901.88,
89908893132903.27, 89908893132903.27, 89908893132904.53,
89908893132904.53, 89908893132909.27, 89908893132909.27,
89908893132910.38, 89908893132910.38, 89908893132911.86,
89908893132911.86, 89908893132913.4, 89908893132913.4,
89908893132915.73, 89908893132915.73, 89908893132916.06,
89908893132916.06, 89908893132922.48, 89908893132922.48,
89908893132923.44, 89908893132923.44, 89908893132924.66,
89908893132924.66, 89908893132925.14, 89908893132925.14,
89908893132928.28, 89908893132928.28],
dtype='float64', name='index1')
I wasted time scratching my head wondering why my groupby(['index1', 'index2']) had produced only one level of index1!
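One way out, as a sketch assuming that (as in my original data) the keys were integers upcast to float: cast the level back to int64 before grouping, so the repr shows the exact values.
df = df.reset_index()
df['index1'] = df['index1'].round().astype('int64')  # restore exact integer keys
df = df.set_index(['index1', 'index2'])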

Multi values or multi index pivot table in pandas

I have a sample data table like:
import pandas as pd

companies = ['Microsoft', 'Google', 'Amazon', 'Microsoft', 'Facebook', 'Google']
products = ['OS', 'Search', 'E-comm', 'E-comm', 'Social Media', 'OS']
count = [5, 7, 3, 19, 23, 54]
average = [1.2, 3.4, 2.4, 5.2, 3.2, 4.4]
df = pd.DataFrame({'company': companies, 'product': products,
                   'count': count, 'average': average})
df
average company count product
0 1.2 Microsoft 5 OS
1 3.4 Google 7 Search
2 2.4 Amazon 3 E-comm
3 5.2 Microsoft 19 E-comm
4 3.2 Facebook 23 Social Media
5 4.4 Google 54 OS
Now I want to create a pivot view on both 'average' and 'count', but I am not able to pass both as values. Here is the sample code with just 'average':
df.pivot_table(index='company', columns='product', values='average', fill_value=0)
the output will be
but I need the data in the format below. Can someone please help? Meanwhile I tried stack and groupby, which create a MultiIndex dataframe but do not give the desired output; I will share that code if needed.
desired output, which I need to download to Excel:
Use set_index with stack and unstack:
df = (df.set_index(['company', 'product'])
        .stack()
        .unstack(level=1)
        .rename_axis([None, None])
        .rename_axis(None, axis=1))
print(df)
E-comm OS Search Social Media
Amazon count 3.0 NaN NaN NaN
average 2.4 NaN NaN NaN
Facebook count NaN NaN NaN 23.0
average NaN NaN NaN 3.2
Google count NaN 54.0 7.0 NaN
average NaN 4.4 3.4 NaN
Microsoft count 19.0 5.0 NaN NaN
average 5.2 1.2 NaN NaN
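For completeness, pivot_table itself accepts a list of values. A sketch of that route, starting again from the original df built in the question: it puts the stats in a column MultiIndex, which can then be stacked into rows to match the layout above (with fill_value=0 the missing cells come out as 0 rather than NaN).
pivot = df.pivot_table(index='company', columns='product',
                       values=['average', 'count'], fill_value=0)
# columns are now a (stat, product) MultiIndex; move the stat level into rows
print(pivot.stack(level=0))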

Dropping NaN rows, certain columns in specific excel files using glob/merge

I would like to drop the NaN rows from the final file in a for loop that loads in Excel files, and drop the duplicated Company, Emails, and Created columns from all but the last loaded Excel file.
Here is my for loop (and subsequent merging into a single DF), currently:
for f in glob.glob("./gowall-users-export-*.xlsx"):
    df = pd.read_excel(f)
    all_users_sheets_hosts.append(df)
    j = re.search(r'(\d+)', f)
    df.columns = df.columns.str.replace('.*Hosted Meetings.*',
                                        'Hosted Meetings' + ' ' + j.group(1))

all_users_sheets_hosts = reduce(lambda left, right: pd.merge(left, right, on=['First Name', 'Last Name'], how='outer'),
                                all_users_sheets_hosts)
Here are the first few rows of the resulting DF:
Company_x First Name Last Name Emails_x Created_x Hosted Meetings 03112016 Facilitated Meetings_x Attended Meetings_x Company_y Emails_y ... Created_x Hosted Meetings 04122016 Facilitated Meetings_x Attended Meetings_x Company_y Emails_y Created_y Hosted Meetings 04212016 Facilitated Meetings_y Attended Meetings_y
0 TS X Y X#Y.com 03/10/2016 0.0 0.0 0.0 TS X#Y.com ... 03/10/2016 0.0 0.0 2.0 NaN NaN NaN NaN NaN NaN
1 TS X Y X#Y.com 03/10/2016 0.0 0.0 0.0 TS X#Y.com ... 01/25/2016 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN
2 TS X Y X#Y.com 03/10/2016 0.0 0.0 0.0 TS X#Y.com ... 04/06/2015 9.0 10.0 17.0 NaN NaN NaN NaN NaN NaN
To prevent multiple Company, Emails, Created, Facilitated Meetings and Attended Meetings columns, drop them from the right DataFrame. To remove rows with all NaN values, use result.dropna(how='all', axis=0):
import pandas as pd
import functools
import glob
import re

for f in glob.glob("./gowall-users-export-*.xlsx"):
    df = pd.read_excel(f)
    all_users_sheets_hosts.append(df)
    j = re.search(r'(\d+)', f)
    df.columns = df.columns.str.replace('.*Hosted Meetings.*',
                                        'Hosted Meetings' + ' ' + j.group(1))

# Drop rows of all NaNs from the final DataFrame in `all_users_sheets_hosts`
all_users_sheets_hosts[-1] = all_users_sheets_hosts[-1].dropna(how='all', axis=0)

def mergefunc(left, right):
    cols = ['Company', 'Emails', 'Created', 'Facilitated Meetings', 'Attended Meetings']
    right = right.drop(cols, axis=1)
    result = pd.merge(left, right, on=['First Name', 'Last Name'], how='outer')
    return result

all_users_sheets_hosts = functools.reduce(mergefunc, all_users_sheets_hosts)
Since the Company et al. columns will only exist in the left DataFrame, there will be no proliferation of those columns. Note, however, that if the left and right DataFrames have different values in those columns, only the values from the first DataFrame in all_users_sheets_hosts will be kept.
Alternatively, if the left and right DataFrames have the same values for the Company et al. columns, another option would be to simply merge on those columns too:
def mergefunc(left, right):
    cols = ['First Name', 'Last Name', 'Company', 'Emails', 'Created',
            'Facilitated Meetings', 'Attended Meetings']
    result = pd.merge(left, right, on=cols, how='outer')
    return result

all_users_sheets_hosts = functools.reduce(mergefunc, all_users_sheets_hosts)
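A self-contained sketch of the drop-then-merge idea with two hypothetical toy frames, to make the column behaviour easy to verify:
import functools
import pandas as pd

# hypothetical toy frames standing in for two monthly exports
df_a = pd.DataFrame({'First Name': ['X'], 'Last Name': ['Y'],
                     'Company': ['TS'], 'Hosted Meetings 03112016': [0.0]})
df_b = pd.DataFrame({'First Name': ['X'], 'Last Name': ['Y'],
                     'Company': ['TS'], 'Hosted Meetings 04122016': [2.0]})

def mergefunc(left, right):
    right = right.drop(['Company'], axis=1)  # keep shared columns from the left only
    return pd.merge(left, right, on=['First Name', 'Last Name'], how='outer')

print(functools.reduce(mergefunc, [df_a, df_b]))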
