Python - multiplying dataframes of different size - python

I have two dataframes:
df1 - is a pivot table that has totals for both columns and rows, both with default names "All"
df2 - a df I created manually by specifying values and using the same index and column names as are used in the pivot table above. This table does not have totals.
I need to multiply the first dataframe by the values in the second. I expect the totals to come back as NaN, since totals don't exist in the second table.
When I perform multiplication, I get the following error:
ValueError: cannot join with no level specified and no overlapping names
When I try the same on dummy dataframes it works as expected:
import pandas as pd
import numpy as np
table1 = np.matrix([[10, 20, 30, 60],
[50, 60, 70, 180],
[90, 10, 10, 110],
[150, 90, 110, 350]])
df1 = pd.DataFrame(data = table1, index = ['One','Two','Three', 'All'], columns =['A', 'B','C', 'All'] )
print(df1)
table2 = np.matrix([[1.0, 2.0, 3.0],
[5.0, 6.0, 7.0],
[2.0, 1.0, 5.0]])
df2 = pd.DataFrame(data = table2, index = ['One','Two','Three'], columns =['A', 'B','C'] )
print(df2)
df3 = df1*df2
print(df3)
This gives me the following output:
A B C All
One 10 20 30 60
Two 50 60 70 180
Three 90 10 10 110
All 150 90 110 350
A B C
One 1.00 2.00 3.00
Two 5.00 6.00 7.00
Three 2.00 1.00 5.00
A All B C
All nan nan nan nan
One 10.00 nan 40.00 90.00
Three 180.00 nan 10.00 50.00
Two 250.00 nan 360.00 490.00
So, visually, the only difference between df1 and df2 is the presence/absence of the column and row "All".
And I think the only difference between my dummy dataframes and the real ones is that the real df1 was created with pd.pivot_table method:
df1_real = pd.pivot_table(PY, values = ['Annual Pay'], index = ['PAR Rating'],
columns = ['CR Range'], aggfunc = [np.sum], margins = True)
I do need to keep the total as I'm using them in other calculations.
I'm sure there is a workaround, but I really want to understand why the same code works on some dataframes of different sizes but not on others. Or maybe the issue is something completely different.
Thank you for reading. I realize it's a very long post..

IIUC,
My Preferred Approach
You can use the mul method, which accepts a fill_value argument. In this case you'll want a value of 1 (the multiplicative identity) so that wherever one dataframe has no value, the value from the other is preserved.
df1.mul(df2, fill_value=1)
A All B C
All 150.0 350.0 90.0 110.0
One 10.0 60.0 40.0 90.0
Three 180.0 110.0 10.0 50.0
Two 250.0 180.0 360.0 490.0
Alternate Approach
You can also embrace the np.nan and use a follow-up combine_first to fill back in the missing bits from df1
(df1 * df2).combine_first(df1)
A All B C
All 150.0 350.0 90.0 110.0
One 10.0 60.0 40.0 90.0
Three 180.0 110.0 10.0 50.0
Two 250.0 180.0 360.0 490.0

I really like Pir's approach, and here is mine :-)
df1.loc[df2.index,df2.columns]*=df2
df1
Out[293]:
A B C All
One 10.0 40.0 90.0 60
Two 250.0 360.0 490.0 180
Three 180.0 10.0 50.0 110
All 150.0 90.0 110.0 350

@Wen, @piRSquared, thank you for your help. This is what I ended up doing. There is probably a more elegant solution, but this worked for me.
Since I was able to multiply two dummy dataframes of different sizes, I reasoned the issue wasn't the size, but the fact that one of the dataframes was created as a pivot table. Somehow in this pivot table, the headers were not recognized, though visually they were there. So, I decided to convert the pivot table to a regular dataframe. Steps I took:
Converted the pivot table to records and then back to a dataframe using the solution from this thread: pandas pivot table to data frame.
Cleaned up the column headers using the solution from the same thread: pandas pivot table to data frame.
Set my first column as the index, following the suggestion in this thread: How to remove index from a created Dataframe in Python?
This gave me a dataframe that was visually identical to what I had before but was no longer a pivot table.
I was then able to multiply the two dataframes with no issues. I used the approach suggested by @Wen because I like that it preserves the structure.
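A likely explanation, offered as an assumption because the real data is not shown: passing values and aggfunc to pd.pivot_table as lists yields MultiIndex columns (aggregation name, value name, CR Range), so the labels no longer line up with the single-level columns of df2. The sketch below uses a made-up PY frame and a made-up df2 to show that scalar arguments keep the columns flat, after which the multiplication aligns as in the dummy example.

```python
import pandas as pd

# Hypothetical stand-in for the real PY data (values invented for illustration)
PY = pd.DataFrame({
    "PAR Rating": ["One", "One", "Two", "Two"],
    "CR Range": ["A", "B", "A", "B"],
    "Annual Pay": [10, 20, 50, 60],
})

# List arguments, as in the question: the columns come back as a MultiIndex
nested = pd.pivot_table(PY, values=["Annual Pay"], index=["PAR Rating"],
                        columns=["CR Range"], aggfunc=["sum"], margins=True)
print(type(nested.columns).__name__)

# Scalar arguments: the columns stay single-level (A, B, All)
flat = pd.pivot_table(PY, values="Annual Pay", index="PAR Rating",
                      columns="CR Range", aggfunc="sum", margins=True)
print(flat.columns.tolist())

# Hypothetical multipliers without totals
df2 = pd.DataFrame([[1.0, 2.0], [5.0, 6.0]],
                   index=["One", "Two"], columns=["A", "B"])

# Labels now align; fill_value=1 leaves the "All" totals unchanged
result = flat.mul(df2, fill_value=1)
print(result)
```

If the pivot table already exists with nested columns, dropping the outer column levels with DataFrame.droplevel should give the same flat layout.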

Related

Python>Pandas>Summing columns in different data frames which have the same column names and the same index values, but not the same length of index

I have two data frames which look like below. I want to sum df2 and df1 and overwrite df1 with the result. The column names match in both data frames and the index values overlap, but df2 is smaller and does not have all the rows (index values). How can I best do this? "Buckets" is the index on both data frames.
No need to merge; let's use pandas' intrinsic data alignment with indexes:
df1.set_index("Buckets")\
.add(df2.set_index("Buckets"), fill_value=0)\
.reset_index()
Output:
Buckets EUR
0 20Y 200.0
1 25Y 200.0
2 30Y 200.0
3 35Y 200.0
Note: You can leave out the set_index if Buckets is already in the index.
In that case, just do df1.add(df2, fill_value=0)
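Since the original input frames are not shown, here is a self-contained sketch of the index-alignment approach. The input values are an assumption, reconstructed from the EUR_x and EUR_y columns of the merged output that appears later in this thread.

```python
import pandas as pd

# Inputs reconstructed from the merged output in this thread (assumed values)
df1 = pd.DataFrame({"Buckets": ["20Y", "25Y", "30Y", "35Y"],
                    "EUR": [100, 200, 200, 400]})
df2 = pd.DataFrame({"Buckets": ["20Y", "35Y"],
                    "EUR": [100, -200]})

# Align on Buckets; rows missing from df2 are treated as 0 before adding
total = (
    df1.set_index("Buckets")
       .add(df2.set_index("Buckets"), fill_value=0)
       .reset_index()
)
print(total)
```

Without fill_value=0, the buckets that exist only in df1 would come back as NaN instead of keeping their original value.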
Try this (the join type left or outer, you can decide as per your data)
df1 = pd.merge(df1, df2, on=['Buckets'], how='left').set_index(['Buckets']).sum(axis=1).reset_index()
# .set_index(['Buckets']) this is optional for you, as it is already index(as mentioned by you)
# output, You may have to rename column 0 to EUR after that
Buckets 0
0 20Y 200.0
1 25Y 200.0
2 30Y 200.0
3 35Y 200.0
OR try this
df1 = pd.merge(df1, df2, on=['Buckets'], how='left')
# you will have 2 columns for EUR (as both df1 and df2 have it), suffixed _x and _y
df1['EUR_y'] = df1['EUR_y'].fillna(0) # as NaN will create issue
df1['EUR'] = df1['EUR_x'] +df1['EUR_y']
# o/p
>>> df1
Buckets EUR_x EUR_y EUR
0 20Y 100 100.0 200.0
1 25Y 200 0.0 200.0
2 30Y 200 0.0 200.0
3 35Y 400 -200.0 200.0

Python groupby nested dictionary is ambiguous in aggregation

I am currently working on my thesis and facing some problems in a groupby function I want to do. I am trying to find out someone's total purchase amount, average purchase amount, purchase count, how many products bought in total and the average value per product.
The data looks like this:
id purchase_amount price_products #_products
0 123 30 20.00 2
2 123 NaN 10.00 NaN
3 124 50.00 25.00 3
4 124 NaN 15.00 NaN
5 124 NaN 10.00 NaN
My code looks like this:
df.groupby('id')[['purchase_amount','price_products','#_products']].agg(total_purchase_amount=('purchase_amount','sum'),average_purchase_amount=('purchase_amount','mean'),times_purchased=('#_products','count'),total_amount_products_purchased=('price_products','count'),average_value_products=('price_products','mean'))
But I get the following error:
SpecificationError: nested dictionary is ambiguous in aggregation
I cannot seem to find what I am doing wrong, hopefully someone can help me!
Do it like this for each calculation:
df.groupby('id')['purchase_amount'].agg({'total_purchase_amount':'sum'})
Since you have several variables to aggregate, I would suggest using the following form of aggregation:
df.groupby('id')[<variables-list>].agg([<statistics-list>])
For example:
df_agg = df.groupby('id')[['purchase_amount','price_products','#_products']].agg(["count", "mean", "sum"])
This will create a column-wise multi-level output data frame df_agg that looks like:
purchase_amount price_products #_products
count mean sum count mean sum count mean sum
id
123 1 30.0 30.0 2 15 30 1 2.0 2.0
124 1 50.0 50.0 3 17 51 1 3.0 3.0
You can then refer to a particular entry in the output data frame using the multi-index as follows:
df_agg['purchase_amount']['mean']
id
123 30.0
124 50.0
Name: mean, dtype: float64
or if you want e.g. all the means, use the cross-sectional method xs():
df_agg.xs('mean', axis=1, level=1)
purchase_amount price_products #_products
id
123 30.0 15 2.0
124 50.0 17 3.0
Note: presumably, the above piece of code will make Python compute more statistics than needed, as is the case in your example. But this may not be an issue in certain contexts, and it has the advantage that the code is shorter and generalizable to any set and number of (numeric and float) variables to aggregate.
You can do this in an organized way using a dictionary for your aggregation.
df = pd.DataFrame([[123, 30, 20, 2],
[123, np.nan, 10, np.nan],
[124, 50, 25, 3],
[124, np.nan, 15, np.nan],
[124, np.nan, 10, np.nan]],
columns=['id', 'purchase_amount', 'price_products', 'num_products']
)
agg_dict = {
'purchase_amount': [np.sum, np.mean],
'num_products': [np.count_nonzero],
'price_products': [np.count_nonzero, np.mean],
}
print(df.groupby('id').agg(agg_dict))
output:
purchase_amount num_products price_products
sum mean count_nonzero count_nonzero mean
id
123 30.0 30.0 2.0 2 15.000000
124 50.0 50.0 3.0 3 16.666667
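For completeness, a sketch of the keyword form the question attempts (named aggregation), assuming pandas 0.25 or later, where it was introduced. Note that the dict-renaming form on a single column, such as .agg({'total_purchase_amount': 'sum'}), was removed in pandas 1.0, so the keyword form is the safer choice on current versions. The frame below is rebuilt from the sample rows in the question.

```python
import numpy as np
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "id": [123, 123, 124, 124, 124],
    "purchase_amount": [30, np.nan, 50, np.nan, np.nan],
    "price_products": [20.0, 10.0, 25.0, 15.0, 10.0],
    "#_products": [2, np.nan, 3, np.nan, np.nan],
})

# Named aggregation: output_column=(input_column, function)
summary = df.groupby("id").agg(
    total_purchase_amount=("purchase_amount", "sum"),
    average_purchase_amount=("purchase_amount", "mean"),
    times_purchased=("#_products", "count"),
    total_amount_products_purchased=("price_products", "count"),
    average_value_products=("price_products", "mean"),
)
print(summary)
```

Here "count" skips NaN, so times_purchased counts only rows with a recorded purchase, whereas np.count_nonzero treats NaN as nonzero and counts those rows too.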

Python pandas show repeated values

I'm trying to read data from a txt file with pandas.read_csv, but repeated (identical) values are not shown. For example, 2043 appears in every row of the file, but the output shows it only once rather than in every row.
My file sample
Result set
All the circles I've drawn should be 2043 also but they are empty.
My code is :
import pandas as pd
df= pd.read_csv('samplefile.txt', sep='\t', header=None,
names = ["234", "235", "236"])
You get a MultiIndex, so repeated values in the first level are simply not displayed.
You can convert the MultiIndex to columns with reset_index:
df = df.reset_index()
Or specify every column in the names parameter to avoid the MultiIndex:
df = pd.read_csv('samplefile.txt', sep='\t', names = ["one","two","next", "234", "235", "236"])
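Both options can be tried without the original file. The sketch below uses an in-memory, made-up sample with six tab-separated fields per row; the field values are assumptions, since the real file is only shown as an image.

```python
import io
import pandas as pd

# Made-up sample: six tab-separated fields, first field repeated on every row
raw = "2043\ta\tx\t1\t2\t3\n2043\tb\ty\t4\t5\t6\n"

# Three names for six fields: the three leading fields become index levels
df = pd.read_csv(io.StringIO(raw), sep="\t", header=None,
                 names=["234", "235", "236"])
print(df.index.nlevels)

# Option 1: turn the index levels back into ordinary columns
flat = df.reset_index()
print(flat)

# Option 2: name every field up front so no index is inferred
named = pd.read_csv(io.StringIO(raw), sep="\t", header=None,
                    names=["one", "two", "next", "234", "235", "236"])
print(named)
```

In both flat and named, 2043 is stored and printed on every row; the blank cells in the original output are only how a MultiIndex is displayed.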
A word of warning about MultiIndex, as I was bitten by this yesterday and wasted time trying to troubleshoot a non-existent problem.
If one of your index levels is of type float64, you may find that the index values are not shown in full. I had a dataframe on which I was running df.groupby().describe(), and the variable I was grouping on was originally a long int. At some point it was converted to a float, and when printed this index was rounded. There were a number of values very close to each other, so on printing it appeared that the groupby() had found multiple levels of the second index.
That's not very clear, so here is an illustrative example...
import numpy as np
import pandas as pd
index = np.random.uniform(low=89908893132829,
high=89908893132929,
size=(50,))
df = pd.DataFrame({'obs': np.arange(100)},
index=np.append(index, index)).sort_index()
df.index.name = 'index1'
df['index2'] = [1, 2] * 50
df.reset_index(inplace=True)
df.set_index(['index1', 'index2'], inplace=True)
Look at the dataframe and it appears that there is only one level of index1...
df.head(10)
obs
index1 index2
8.990889e+13 1 4
2 54
1 61
2 11
1 89
2 39
1 65
2 15
1 60
2 10
Run groupby(['index1', 'index2']).describe() and it looks like there is only one level of index1...
summary = df.groupby(['index1', 'index2']).describe()
summary.head()
obs
count mean std min 25% 50% 75% max
index1 index2
8.990889e+13 1 1.0 4.0 NaN 4.0 4.0 4.0 4.0 4.0
2 1.0 54.0 NaN 54.0 54.0 54.0 54.0 54.0
1 1.0 61.0 NaN 61.0 61.0 61.0 61.0 61.0
2 1.0 11.0 NaN 11.0 11.0 11.0 11.0 11.0
1 1.0 89.0 NaN 89.0 89.0 89.0 89.0 89.0
But if you look at the actual values of index1 in either you see that there are multiple unique values. In the original dataframe...
df.index.get_level_values('index1')
Float64Index([89908893132833.12, 89908893132833.12, 89908893132834.08,
89908893132834.08, 89908893132835.05, 89908893132835.05,
89908893132836.3, 89908893132836.3, 89908893132837.95,
89908893132837.95, 89908893132838.1, 89908893132838.1,
89908893132838.6, 89908893132838.6, 89908893132841.89,
89908893132841.89, 89908893132841.95, 89908893132841.95,
89908893132845.81, 89908893132845.81, 89908893132845.83,
89908893132845.83, 89908893132845.88, 89908893132845.88,
89908893132846.02, 89908893132846.02, 89908893132847.2,
89908893132847.2, 89908893132847.67, 89908893132847.67,
89908893132848.5, 89908893132848.5, 89908893132848.5,
89908893132848.5, 89908893132855.17, 89908893132855.17,
89908893132855.45, 89908893132855.45, 89908893132864.62,
89908893132864.62, 89908893132868.61, 89908893132868.61,
89908893132873.16, 89908893132873.16, 89908893132875.6,
89908893132875.6, 89908893132875.83, 89908893132875.83,
89908893132878.73, 89908893132878.73, 89908893132879.9,
89908893132879.9, 89908893132880.67, 89908893132880.67,
89908893132880.69, 89908893132880.69, 89908893132881.31,
89908893132881.31, 89908893132881.69, 89908893132881.69,
89908893132884.45, 89908893132884.45, 89908893132887.27,
89908893132887.27, 89908893132887.83, 89908893132887.83,
89908893132892.8, 89908893132892.8, 89908893132894.34,
89908893132894.34, 89908893132894.5, 89908893132894.5,
89908893132901.88, 89908893132901.88, 89908893132903.27,
89908893132903.27, 89908893132904.53, 89908893132904.53,
89908893132909.27, 89908893132909.27, 89908893132910.38,
89908893132910.38, 89908893132911.86, 89908893132911.86,
89908893132913.4, 89908893132913.4, 89908893132915.73,
89908893132915.73, 89908893132916.06, 89908893132916.06,
89908893132922.48, 89908893132922.48, 89908893132923.44,
89908893132923.44, 89908893132924.66, 89908893132924.66,
89908893132925.14, 89908893132925.14, 89908893132928.28,
89908893132928.28],
dtype='float64', name='index1')
...and in the summarised dataframe...
summary.index.get_level_values('index1')
Float64Index([89908893132833.12, 89908893132833.12, 89908893132834.08,
89908893132834.08, 89908893132835.05, 89908893132835.05,
89908893132836.3, 89908893132836.3, 89908893132837.95,
89908893132837.95, 89908893132838.1, 89908893132838.1,
89908893132838.6, 89908893132838.6, 89908893132841.89,
89908893132841.89, 89908893132841.95, 89908893132841.95,
89908893132845.81, 89908893132845.81, 89908893132845.83,
89908893132845.83, 89908893132845.88, 89908893132845.88,
89908893132846.02, 89908893132846.02, 89908893132847.2,
89908893132847.2, 89908893132847.67, 89908893132847.67,
89908893132848.5, 89908893132848.5, 89908893132855.17,
89908893132855.17, 89908893132855.45, 89908893132855.45,
89908893132864.62, 89908893132864.62, 89908893132868.61,
89908893132868.61, 89908893132873.16, 89908893132873.16,
89908893132875.6, 89908893132875.6, 89908893132875.83,
89908893132875.83, 89908893132878.73, 89908893132878.73,
89908893132879.9, 89908893132879.9, 89908893132880.67,
89908893132880.67, 89908893132880.69, 89908893132880.69,
89908893132881.31, 89908893132881.31, 89908893132881.69,
89908893132881.69, 89908893132884.45, 89908893132884.45,
89908893132887.27, 89908893132887.27, 89908893132887.83,
89908893132887.83, 89908893132892.8, 89908893132892.8,
89908893132894.34, 89908893132894.34, 89908893132894.5,
89908893132894.5, 89908893132901.88, 89908893132901.88,
89908893132903.27, 89908893132903.27, 89908893132904.53,
89908893132904.53, 89908893132909.27, 89908893132909.27,
89908893132910.38, 89908893132910.38, 89908893132911.86,
89908893132911.86, 89908893132913.4, 89908893132913.4,
89908893132915.73, 89908893132915.73, 89908893132916.06,
89908893132916.06, 89908893132922.48, 89908893132922.48,
89908893132923.44, 89908893132923.44, 89908893132924.66,
89908893132924.66, 89908893132925.14, 89908893132925.14,
89908893132928.28, 89908893132928.28],
dtype='float64', name='index1')
I wasted time scratching my head wondering why my groupby([index1, index2]) had produced only one level of index1!
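A quick way to avoid being misled by the rounded display is to count the distinct values in the level rather than reading the printed frame. A minimal sketch, using two of the float values from the output above in a small hand-built index:

```python
import pandas as pd

# Two distinct floats that print identically as 8.990889e+13
idx = pd.MultiIndex.from_arrays(
    [[89908893132833.12, 89908893132833.12,
      89908893132834.08, 89908893132834.08],
     [1, 2, 1, 2]],
    names=["index1", "index2"],
)
df = pd.DataFrame({"obs": range(4)}, index=idx)

# Count the distinct values actually stored in the first level
n_unique = df.index.get_level_values("index1").nunique()
print(n_unique)
```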

Create multiple dataframe columns containing calculated values from an existing column

I have a dataframe, sega_df:
Month 2016-11-01 2016-12-01
Character
Sonic 12.0 3.0
Shadow 5.0 23.0
I would like to create multiple new columns, by applying a formula for each already existing column within my dataframe (to put it shortly, pretty much double the number of columns). That formula is (100 - [5*eachcell])*0.2.
For example, for November for Sonic, (100-[5*12.0])*0.2 = 8.0, and December for Sonic, (100-[5*3.0])*0.2 = 17.0 My ideal output is:
Month 2016-11-01 2016-12-01 Weighted_2016-11-01 Weighted_2016-12-01
Character
Sonic 12.0 3.0 8.0 17.0
Shadow 5.0 23.0 15.0 -3.0
I know how to create a for loop to create one column. This is for if only one month was in consideration:
for w in range(1,len(sega_df.index)):
sega_df['Weighted'] = (100 - 5*sega_df)*0.2
sega_df[sega_df < 0] = 0
I haven't gotten the skills or experience yet to create multiple columns. I've looked for other questions that may answer what exactly I am doing but haven't gotten anything to work yet. Thanks in advance.
One vectorised approach is to drop down to NumPy:
A = sega_df.values
A = (100 - 5*A) * 0.2
res = pd.DataFrame(A, index=sega_df.index, columns=('Weighted_'+sega_df.columns))
Then join the result to your original dataframe:
sega_df = sega_df.join(res)
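The same result can be sketched without leaving pandas, using add_prefix to name the new columns. This assumes the column labels are plain strings; the frame is rebuilt from the sample in the question.

```python
import pandas as pd

# Sample frame from the question, with string column labels (assumed)
sega_df = pd.DataFrame(
    {"2016-11-01": [12.0, 5.0], "2016-12-01": [3.0, 23.0]},
    index=pd.Index(["Sonic", "Shadow"], name="Character"),
)

# Apply the formula to every cell at once, then prefix the column names
weighted = ((100 - 5 * sega_df) * 0.2).add_prefix("Weighted_")

# Attach the new columns next to the originals
result = sega_df.join(weighted)
print(result)
```

If negative scores should be floored at zero, as the loop in the question attempts, weighted.clip(lower=0) can be applied before the join.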

resample Pandas dataframe and merge strings in column

I want to resample a pandas dataframe and apply different functions to different columns. The problem is that I cannot properly process a column with strings. I would like to apply a function that merges the string with a delimiter such as " - ". This is a data example:
import pandas as pd
import numpy as np
idx = pd.date_range('2017-01-31', '2017-02-03')
data=list([[1,10,"ok"],[2,20,"merge"],[3,30,"us"]])
dates=pd.DatetimeIndex(['2017-01-31','2017-02-03','2017-02-03'])
d=pd.DataFrame(data, index=dates, columns=list('ABC'))
A B C
2017-01-31 1 10 ok
2017-02-03 2 20 merge
2017-02-03 3 30 us
Resampling the numeric columns A and B with a sum and mean aggregator works. Column C, however, only kind of works with sum (and it ends up in second position, which might mean that something fails).
d.resample('D').agg({'A': sum, 'B': np.mean, 'C': sum})
A C B
2017-01-31 1.0 a 10.0
2017-02-01 NaN 0 NaN
2017-02-02 NaN 0 NaN
2017-02-03 5.0 merge us 25.0
I would like to get this:
...
2017-02-03 5.0 merge - us 25.0
I tried using lambda in different ways but without success (not shown).
If I may ask a second related question: I can do some post processing for this, but how to fill missing cells in different columns with zeros or ""?
Your agg function for column 'C' should be a join:
d.resample('D').agg({'A': sum, 'B': np.mean, 'C': ' - '.join})
A B C
2017-01-31 1.0 10.0 ok
2017-02-01 NaN NaN
2017-02-02 NaN NaN
2017-02-03 5.0 25.0 merge - us
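The second question (filling the empty days with zeros or "") is not answered above. One possible sketch is to pass a per-column dict to fillna after resampling; the choice of 0 for the numeric columns and "" for the string column is an assumption about what is wanted.

```python
import pandas as pd

dates = pd.DatetimeIndex(["2017-01-31", "2017-02-03", "2017-02-03"])
d = pd.DataFrame([[1, 10, "ok"], [2, 20, "merge"], [3, 30, "us"]],
                 index=dates, columns=list("ABC"))

# Sum A, average B, and join the strings in C with " - "
out = d.resample("D").agg({"A": "sum", "B": "mean", "C": " - ".join})

# Fill empty days per column: zeros for numbers, empty string for text
out = out.fillna({"A": 0, "B": 0, "C": ""})
print(out)
```

Depending on the pandas version, some of these cells may already be 0 or "" before the fillna call, in which case the call simply leaves them untouched.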
