I have two data frames that look like the ones below. I want to add df2 to df1 and overwrite df1 with the result. The column names match in both data frames, and the indexes hold similar values, but df2 is smaller and does not contain all of df1's rows (index values). What is the best way to do this? "Buckets" is the index on both data frames.
No need to merge; use pandas' intrinsic data alignment on indexes:
df1.set_index("Buckets")\
   .add(df2.set_index("Buckets"), fill_value=0)\
   .reset_index()
Output:
Buckets EUR
0 20Y 200.0
1 25Y 200.0
2 30Y 200.0
3 35Y 200.0
Note: you can leave out the set_index calls if Buckets is already the index. Then simply do:
df1.add(df2, fill_value=0)
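Here is a minimal, self-contained sketch of that approach; the sample values are taken from the merge output further below and are only illustrative:
import pandas as pd

df1 = pd.DataFrame({"Buckets": ["20Y", "25Y", "30Y", "35Y"],
                    "EUR": [100, 200, 200, 400]})
df2 = pd.DataFrame({"Buckets": ["20Y", "35Y"],
                    "EUR": [100, -200]})

# rows missing from df2 count as 0, so df1's values pass through unchanged
df1 = (df1.set_index("Buckets")
          .add(df2.set_index("Buckets"), fill_value=0)
          .reset_index())
print(df1)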
Try this (whether the join type should be left or outer depends on your data):
df1 = pd.merge(df1, df2, on=['Buckets'], how='left').set_index(['Buckets']).sum(axis=1).reset_index()
# .set_index(['Buckets']) is optional for you, as it is already the index (as you mentioned)
# output; you may have to rename column 0 to EUR afterwards
Buckets 0
0 20Y 200.0
1 25Y 200.0
2 30Y 200.0
3 35Y 200.0
OR try this
df1 = pd.merge(df1, df2, on=['Buckets'], how='left')
# you will have 2 columns for EUR (as both df1 and df2 have it), suffixed with _x and _y
df1['EUR_y'] = df1['EUR_y'].fillna(0)  # NaN would otherwise make the sum NaN
df1['EUR'] = df1['EUR_x'] + df1['EUR_y']
# output:
>>> df1
Buckets EUR_x EUR_y EUR
0 20Y 100 100.0 200.0
1 25Y 200 0.0 200.0
2 30Y 200 0.0 200.0
3 35Y 400 -200.0 200.0
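If you then want df1 back to just Buckets and the summed EUR, you can drop the suffixed intermediate columns; a small follow-up sketch (the _x/_y names come from merge's default suffixes):
df1 = df1.drop(columns=['EUR_x', 'EUR_y'])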
I am currently working on my thesis and facing some problems with a groupby operation. For each id I am trying to find the total purchase amount, the average purchase amount, the purchase count, how many products were bought in total, and the average value per product.
The data looks like this:
id purchase_amount price_products #_products
0 123 30 20.00 2
2 123 NaN 10.00 NaN
3 124 50.00 25.00 3
4 124 NaN 15.00 NaN
5 124 NaN 10.00 NaN
My code looks like this:
df.groupby('id')[['purchase_amount', 'price_products', '#_products']].agg(
    total_purchase_amount=('purchase_amount', 'sum'),
    average_purchase_amount=('purchase_amount', 'mean'),
    times_purchased=('#_products', 'count'),
    total_amount_products_purchased=('price_products', 'count'),
    average_value_products=('price_products', 'mean'))
But I get the following error:
SpecificationError: nested dictionary is ambiguous in aggregation
I cannot seem to find what I am doing wrong, hopefully someone can help me!
Do it like this for each of the calculations. Note that on pandas 1.0+ the old dict-renaming form, agg({'total_purchase_amount': 'sum'}), was removed and itself raises a SpecificationError, so use named aggregation:
df.groupby('id').agg(total_purchase_amount=('purchase_amount', 'sum'))
Since you have several variables to aggregate, I would suggest using the following form of aggregation:
df.groupby('id')[<variables-list>].agg([<statistics-list>])
For example:
df_agg = df.groupby('id')[['purchase_amount','price_products','#_products']].agg(["count", "mean", "sum"])
This will create a column-wise multi-level output data frame df_agg that looks like:
    purchase_amount            price_products                  #_products
              count  mean   sum          count       mean   sum      count mean  sum
id
123               1  30.0  30.0              2  15.000000  30.0          1  2.0  2.0
124               1  50.0  50.0              3  16.666667  50.0          1  3.0  3.0
You can then refer to a particular entry in the output data frame using the multi-index as follows:
df_agg['purchase_amount']['mean']
id
123 30.0
124 50.0
Name: mean, dtype: float64
or if you want e.g. all the means, use the cross-sectional method xs():
df_agg.xs('mean', axis=1, level=1)
     purchase_amount  price_products  #_products
id
123             30.0       15.000000         2.0
124             50.0       16.666667         3.0
Note: this approach makes pandas compute more statistics than needed, as in your example. That may not be an issue in many contexts, and it has the advantage that the code is shorter and generalizes to any set and number of numeric variables to aggregate.
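For completeness: on pandas 0.25 or newer, the named-aggregation syntax you already used computes exactly the requested statistics and names the output columns directly, so upgrading may be the simplest fix. A sketch, assuming a recent pandas:
df_agg = df.groupby('id').agg(
    total_purchase_amount=('purchase_amount', 'sum'),
    average_purchase_amount=('purchase_amount', 'mean'),
    times_purchased=('#_products', 'count'),
    total_amount_products_purchased=('price_products', 'count'),
    average_value_products=('price_products', 'mean'))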
You can do this in an organized way using a dictionary for your aggregation.
import numpy as np
import pandas as pd

df = pd.DataFrame([[123, 30, 20, 2],
                   [123, np.nan, 10, np.nan],
                   [124, 50, 25, 3],
                   [124, np.nan, 15, np.nan],
                   [124, np.nan, 10, np.nan]],
                  columns=['id', 'purchase_amount', 'price_products', 'num_products'])

agg_dict = {
    'purchase_amount': [np.sum, np.mean],
    'num_products': [np.count_nonzero],
    'price_products': [np.count_nonzero, np.mean],
}
print(df.groupby('id').agg(agg_dict))
output:
purchase_amount num_products price_products
sum mean count_nonzero count_nonzero mean
id
123 30.0 30.0 2.0 2 15.000000
124 50.0 50.0 3.0 3 16.666667
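If you then want flat, descriptive column names like the ones in your named-aggregation attempt, you can flatten the resulting column MultiIndex afterwards; a small sketch (the joined names are just illustrative):
out = df.groupby('id').agg(agg_dict)
# join each (column, statistic) pair into one string, e.g. 'purchase_amount_sum'
out.columns = ['_'.join(col) for col in out.columns]
print(out.columns.tolist())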
I'm trying to read data from a txt file with pandas.read_csv, but it doesn't show repeated (same) values: for example, 2043 appears in several rows, but it is shown only once instead of in every row.
[Screenshots of the file sample and the result set omitted. In the result, the circled cells should all contain 2043 as well, but they are empty.]
My code is:
import pandas as pd

df = pd.read_csv('samplefile.txt', sep='\t', header=None,
                 names=["234", "235", "236"])
You get a MultiIndex, and repeated values in its outer level are shown only once, on the first occurrence.
You can convert MultiIndex to columns by reset_index:
df = df.reset_index()
Or specify each column in the names parameter to avoid the MultiIndex:
df = pd.read_csv('samplefile.txt', sep='\t',
                 names=["one", "two", "next", "234", "235", "236"])
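Why this happens: if names lists fewer names than the file has columns, read_csv uses the leftover leading columns as the (Multi)index, and repeated index values are blanked in the printed output. A self-contained demo of the effect (the data here is made up):
import io
import pandas as pd

data = "2043\tx\t1\t2\t3\n2043\ty\t4\t5\t6\n2043\tz\t7\t8\t9\n"

# three names for five columns -> the two leading columns become a MultiIndex
df = pd.read_csv(io.StringIO(data), sep='\t', header=None,
                 names=["234", "235", "236"])
print(df)                # 2043 is printed only on its first row
print(df.reset_index())  # reset_index turns the index back into regular columns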
A word of warning with MultiIndex, as I was bitten by this yesterday and wasted time trying to troubleshoot a non-existent problem.
If one of your index levels is of type float64, you may find that the indexes are not shown in full. I had a dataframe I was running df.groupby().describe() on, and the variable I was grouping by was originally a long int. At some point it was converted to a float, and when printed this index was rounded. There were a number of values very close to each other, so on printing it appeared that the groupby() had found multiple levels of the second index under a single value of the first.
That's not very clear, so here is an illustrative example...
import numpy as np
import pandas as pd

index = np.random.uniform(low=89908893132829,
                          high=89908893132929,
                          size=(50,))
df = pd.DataFrame({'obs': np.arange(100)},
                  index=np.append(index, index)).sort_index()
df.index.name = 'index1'
df['index2'] = [1, 2] * 50
df.reset_index(inplace=True)
df.set_index(['index1', 'index2'], inplace=True)
Look at the dataframe and it appears that there is only one value of index1...
df.head(10)
obs
index1 index2
8.990889e+13 1 4
2 54
1 61
2 11
1 89
2 39
1 65
2 15
1 60
2 10
Run groupby(['index1', 'index2']).describe() and it also looks like there is only one value of index1...
summary = df.groupby(['index1', 'index2']).describe()
summary.head()
obs
count mean std min 25% 50% 75% max
index1 index2
8.990889e+13 1 1.0 4.0 NaN 4.0 4.0 4.0 4.0 4.0
2 1.0 54.0 NaN 54.0 54.0 54.0 54.0 54.0
1 1.0 61.0 NaN 61.0 61.0 61.0 61.0 61.0
2 1.0 11.0 NaN 11.0 11.0 11.0 11.0 11.0
1 1.0 89.0 NaN 89.0 89.0 89.0 89.0 89.0
But if you look at the actual values of index1 in either you see that there are multiple unique values. In the original dataframe...
df.index.get_level_values('index1')
Float64Index([89908893132833.12, 89908893132833.12, 89908893132834.08,
              89908893132834.08, 89908893132835.05, 89908893132835.05,
              89908893132836.3,  89908893132836.3,  89908893132837.95,
              ...
              89908893132924.66, 89908893132924.66, 89908893132925.14,
              89908893132925.14, 89908893132928.28, 89908893132928.28],
             dtype='float64', name='index1')
...and in the summarised dataframe...
summary.index.get_level_values('index1')
Float64Index([89908893132833.12, 89908893132833.12, 89908893132834.08,
              89908893132834.08, 89908893132835.05, 89908893132835.05,
              89908893132836.3,  89908893132836.3,  89908893132837.95,
              ...
              89908893132924.66, 89908893132924.66, 89908893132925.14,
              89908893132925.14, 89908893132928.28, 89908893132928.28],
             dtype='float64', name='index1')
I wasted time scratching my head wondering why my groupby(['index1', 'index2']) had produced only one level of index1!
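One way to make the problem visible, continuing from the example above, is to raise the display precision, or to cast the level back to integers if the values really are integral; a minimal sketch, assuming the values fit in int64:
# show more significant digits so near-identical floats become distinguishable
pd.set_option('display.float_format', '{:.2f}'.format)

# or cast the index1 level back to int64
df.index = df.index.set_levels(df.index.levels[0].astype('int64'), level='index1')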
I want to resample a pandas dataframe and apply different functions to different columns. The problem is that I cannot properly process a column of strings. I would like to apply a function that merges the strings with a delimiter such as " - ". This is a data example:
import pandas as pd
import numpy as np

idx = pd.date_range('2017-01-31', '2017-02-03')
data = [[1, 10, "ok"], [2, 20, "merge"], [3, 30, "us"]]
dates = pd.DatetimeIndex(['2017-01-31', '2017-02-03', '2017-02-03'])
d = pd.DataFrame(data, index=dates, columns=list('ABC'))
A B C
2017-01-31 1 10 ok
2017-02-03 2 20 merge
2017-02-03 3 30 us
Resampling the numeric columns A and B with a sum and mean aggregator works. Column C, however, only kind of works with sum (it also ends up in the second position, which might mean that something fails).
d.resample('D').agg({'A': sum, 'B': np.mean, 'C': sum})
A C B
2017-01-31 1.0 ok 10.0
2017-02-01 NaN 0 NaN
2017-02-02 NaN 0 NaN
2017-02-03 5.0 merge us 25.0
I would like to get this:
...
2017-02-03 5.0 merge - us 25.0
I tried using lambda in different ways but without success (not shown).
If I may ask a second related question: I can do some post-processing for this, but how do I fill the missing cells in the different columns with zeros or ""?
Your agg function for column 'C' should be a string join:
d.resample('D').agg({'A': sum, 'B': np.mean, 'C': ' - '.join})
A B C
2017-01-31 1.0 10.0 ok
2017-02-01 NaN NaN
2017-02-02 NaN NaN
2017-02-03 5.0 25.0 merge - us
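For the second question: fillna accepts a per-column dict, so the numeric gaps can be filled with 0, while C needs nothing extra because ' - '.join of an empty group already yields ''. A short sketch continuing from the result above:
res = d.resample('D').agg({'A': sum, 'B': np.mean, 'C': ' - '.join})
res = res.fillna({'A': 0, 'B': 0})  # C is already '' on the empty days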