I need some help with aggregating and joining the output of a dataframe groupby.
Here is my dataframe:
df = pd.DataFrame({
'Date': ['2020/08/18','2020/08/18', '2020/08/18', '2020/08/18', '2020/08/18', '2020/08/18', '2020/08/18'],
'Time':['Val3',60,30,'Val2',60,60,'Val2'],
'Val1': [0, 53.5, 33.35, 0,53.5, 53.5,0],
'Val2':[0, 0, 0, 45, 0, 0, 35],
'Val3':[48.5,0,0,0,0,0,0],
'Place':['LOC_A','LOC_A','LOC_A','LOC_B','LOC_B','LOC_B','LOC_A']
})
I want the following result:
Place Total_sum Factor Val2_new
0 LOC_A 86.85 21.71 35
1 LOC_B 107.00 26.75 45
I have tried the following:
df_by_place = df.groupby('Place')['Val1'].sum().reset_index(name='Total_sum')
df_by_place['Factor'] = round(df_by_place['Total_sum']*0.25, 2)
df_by_place['Val2_new'] = df.groupby('Place')['Val2'].agg('sum')
print(df_by_place)
But I get the following result:
Place Total_sum Factor Val2_new
0 LOC_A 86.85 21.71 NaN
1 LOC_B 107.00 26.75 NaN
When I run the following operation by itself:
print(df.groupby('Place')['Val2'].agg('sum'))
The output is as desired:
Place
LOC_A 35
LOC_B 45
But when I assign it to a column, it gives NaN values.
Any help with this issue would be appreciated.
Thank you in advance.
Groupby in pandas >= 0.25 supports named aggregation, which lets you name the output columns and do what you want in one go.
df.groupby('Place').agg(Total_sum=('Val1', 'sum'),
                        Factor=('Val1', lambda x: round((x * 0.25).sum(), 2)),
                        Val2_new=('Val2', 'sum')).reset_index()
This provides your desired result.
Place Total_sum Factor Val2_new
0 LOC_A 86.85 21.71 35
1 LOC_B 107.00 26.75 45
Using lambda functions within groupby will make things a lot neater!
The answer given by sushanth seems to work well.
df_by_place['Val2_new'] = df.groupby('Place')['Val2'].agg('sum').reset_index(drop=True)
Passing drop=True to reset_index removes the Place index created by the groupby, so the resulting Series gets a default 0..n index that lines up with df_by_place and can be assigned directly.
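For context, the NaN in the original attempt comes from index alignment: the groupby result is indexed by Place, while df_by_place has a default 0..n index, so nothing lines up during assignment. A minimal sketch of two equivalent fixes, assuming the df and df_by_place built in the question:

# Option 1: drop the Place index so the values align by position
df_by_place['Val2_new'] = df.groupby('Place')['Val2'].sum().reset_index(drop=True)

# Option 2: map the per-Place sums onto the Place column (robust to row order)
val2_sums = df.groupby('Place')['Val2'].sum()
df_by_place['Val2_new'] = df_by_place['Place'].map(val2_sums)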
A slight variation on maishm's answer, but basically the same idea:
df.groupby('Place').agg(total_sum=pd.NamedAgg(column='Val1', aggfunc=sum),
                        factor=pd.NamedAgg(column='Val1', aggfunc=lambda x: round(sum(x) * 0.25, 2)),
                        val2_new=pd.NamedAgg(column='Val2', aggfunc=sum)).reset_index()
output:
Place total_sum factor val2_new
0 LOC_A 86.85 21.71 35
1 LOC_B 107.00 26.75 45
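One small note: on recent pandas versions, passing the Python builtin sum as aggfunc can emit a FutureWarning suggesting the string alias instead. The same aggregation with string aliases would look like this (a sketch, assuming the df from the question):

df.groupby('Place').agg(total_sum=pd.NamedAgg(column='Val1', aggfunc='sum'),
                        factor=pd.NamedAgg(column='Val1', aggfunc=lambda x: round(x.sum() * 0.25, 2)),
                        val2_new=pd.NamedAgg(column='Val2', aggfunc='sum')).reset_index()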
Related
I have a pandas DataFrame with values like those below, though in reality I am working with many more columns and historical data.
AUD USD JPY EUR
0 0.67 1 140 1.05
I want to iterate over the columns to create a DataFrame with columns AUDUSD, AUDJPY, AUDEUR, USDJPY, USDEUR and JPYEUR,
where, for example, AUDUSD is calculated as the product of the AUD column and the USD column.
I tried the following:
for col in df:
    for cols in df:
        cf[col + cols] = df[col] * df[cols]
But it generates a table with unnecessary values like AUDAUD and USDUSD, or duplicate values like AUDUSD and USDAUD. I think if I could somehow set "cols = col+1 till end of df" in the second for loop I should be able to resolve the issue, but I don't know how to do that.
The result I am looking for is a table with the below columns and their values:
AUDUSD, AUDJPY, AUDEUR, USDJPY, USDEUR, JPYEUR
You can use itertools.combinations with pandas.Series.mul and pandas.concat.
Try this:
from itertools import combinations
combos = list(combinations(df.columns, 2))
out = pd.concat([df[col[1]].mul(df[col[0]]) for col in combos], axis=1, keys=combos)
out.columns = out.columns.map("".join)
# Output :
print(out)
AUDUSD AUDJPY AUDEUR USDJPY USDEUR JPYEUR
0 0.67 93.8 0.7035 140 1.05 147.0
# Used input :
df = pd.DataFrame({'AUD': [0.67], 'USD': [1], 'JPY': [140], 'EUR': [1.05]})
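If you prefer to skip the concat/keys step, the same idea can be written as a dict comprehension. This is just a sketch of an equivalent construction under the same input:

from itertools import combinations
import pandas as pd

df = pd.DataFrame({'AUD': [0.67], 'USD': [1], 'JPY': [140], 'EUR': [1.05]})

# name each pair column by concatenating the two source column names
out = pd.DataFrame({f"{a}{b}": df[a] * df[b]
                    for a, b in combinations(df.columns, 2)})
print(out)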
Your first instinct was to use an inner/outer loop, and I think this solution works in the same spirit:
# Added a Second Row for testing
df = pd.DataFrame(
{'AUD': [0.67, 0.91], 'USD': [1, 1], 'JPY': [140, 130], 'EUR': [1.05, 1]},
)
# Instantiated the Second DataFrame
cf = pd.DataFrame()
# Call the index of the columns as an integer
for i in range(len(df.columns)):
    # Increment the index + 1, so you aren't looking at the same column twice
    # Also, limit the range to the length of your columns
    for j in range(i + 1, len(df.columns)):
        print(f'{df.columns[i]}' + f'{df.columns[j]}')  # VERIFY
        # Create a variable of the column names mashed together
        combine = f'{df.columns[i]}' + f'{df.columns[j]}'
        # Assign the rows to be a product of the mashed column series
        cf[combine] = df[df.columns[i]] * df[df.columns[j]]

print(cf)  # VERIFY
The console output looks like this:
AUDUSD
AUDJPY
AUDEUR
USDJPY
USDEUR
JPYEUR
AUDUSD AUDJPY AUDEUR USDJPY USDEUR JPYEUR
0 0.67 93.8 0.7035 140 1.05 147.0
1 0.91 118.3 0.9100 130 1.00 130.0
I would like to use the apply and lambda methods in Python to change the pricing in a column. The column name is Price. If the price is less than 20 I would like to keep it the same. If 20 < price < 30 I would like to add 1. If 30 < price < 40 I would like to add 1.50, and so on. I am trying to figure out a way to apply these adjustments over the column and then write it back out to an Excel file to update the pricing, but I am confused about how to do so. I have tried putting this logic in a function using if clauses, but it is not producing the results I need (k is the name of the dataframe):
def addition():
    if k[k['Price'] < 20]:
        pass
    if k[(k['Price'] > 20) & (k['Price'] < 30)]:
        return k + 1
    if k[(k['Price'] > 30.01) & (k['Price'] < 40)]:
        return k + 1.50
and so on. However, at the end, when I attempt to write out what I thought was the newly updated k['Price'] column to xlsx, it doesn't even show up. I have tried making the xlsx variable global as well, but still no luck. I think it is simpler to use a lambda function, but I am having trouble deciding how to separate and update the prices in that column based on the conditions. Any help would be appreciated.
This is the dataframe that I am trying to perform the different functions on:
0 23.198824
1 21.080706
2 15.810118
3 21.787059
4 18.821882
...
33525 20.347059
33526 25.665882
33527 33.077647
33528 21.803529
33529 23.043529
Name: Price, Length: 33530, dtype: float64
If k is the dataframe, then k + 1 won't work as written; it will cause an error. You can write a function to change the price and apply it to the column:
def update_price(price):
    if 20 < price < 30:
        price += 1
    elif 30 < price < 40:
        price += 1.5
    return price

df['Updated_Price'] = df['Price'].apply(lambda x: update_price(x))
In [39]: df
Out[39]:
Name Price
0 a 15
1 b 23
2 c 37
In [43]: df
Out[43]:
Name Price Updated_Price
0 a 15 15.0
1 b 23 24.0
2 c 37 38.5
You can use the apply method with a lambda and nested if..else expressions for this purpose.
import pandas as pd
df = pd.DataFrame({
'Price': [10.0, 23.0, 50.0, 32.0, 12.0, 50.0]
})
df = df['Price'].apply(lambda x: x if x < 20.0 else (x + 1.0 if 30.0 > x > 20.0 else x + 1.5))
print(df)
Output:
0 10.0
1 24.0
2 51.5
3 33.5
4 12.0
5 51.5
Name: Price, dtype: float64
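For a large Price column like the 33,530-row one in the question, a vectorized alternative with numpy.select avoids calling a Python function per row. This is only a sketch assuming the same price bands as above (extend conditions/adjustments for higher bands), and the output file name is purely illustrative:

import numpy as np
import pandas as pd

k = pd.DataFrame({'Price': [23.20, 21.08, 15.81, 33.08, 18.82]})

# np.select takes the first matching condition, so the bands can be written as upper bounds
conditions = [k['Price'] < 20, k['Price'] < 30, k['Price'] < 40]
adjustments = [0.0, 1.0, 1.5]
k['Price'] = k['Price'] + np.select(conditions, adjustments, default=0.0)

# write the updated pricing back out to Excel (file name is illustrative)
k.to_excel('updated_prices.xlsx', index=False)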
I am trying to extract all numbers, including decimals, dots and commas, from a string using pandas.
This is my DataFrame
rate_number
0 92 rate
0 33 rate
0 9.25 rate
0 (4,396 total
0 (2,620 total
I tried using df['rate_number'].str.extract('(\d+)', expand=False) but the results were not correct.
The DataFrame I need to extract should be the following:
rate_number
0 92
0 33
0 9.25
0 4,396
0 2,620
You can try this:
df['rate_number'] = df['rate_number'].replace('\(|[a-zA-Z]+', '', regex=True)
Better answer:
df['rate_number_2'] = df['rate_number'].str.extract('([0-9][,.]*[0-9]*)')
Output:
rate_number rate_number_2
0 92 92
1 33 33
2 9.25 9.25
3 4,396 4,396
4 2,620 2,620
Dan's comment above is easy to miss, but it worked for me:
for df in df_arr:
    df = df.astype(str)
    df_copy = df.copy()
    for i in range(1, len(df.columns)):
        df_copy[df.columns[i]] = df_copy[df.columns[i]].str.extract(r'(\d+[.]?\d*)', expand=False)  # replace(r'[^0-9]+','')
    new_df_arr.append(df_copy)
There is a small error with the asterisk's position:
df['rate_number_2'] = df['rate_number'].str.extract('([0-9]*[,.][0-9]*)')
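If the extracted strings ultimately need to be numeric, a small follow-on sketch (assuming the commas are thousands separators; the new column name is illustrative):

import pandas as pd

df = pd.DataFrame({'rate_number': ['92 rate', '33 rate', '9.25 rate',
                                   '(4,396 total', '(2,620 total']})

df['rate_value'] = (df['rate_number']
                    .str.extract(r'([\d.,]+)', expand=False)   # keep digits, dots and commas
                    .str.replace(',', '', regex=False)         # drop thousands separators
                    .astype(float))
print(df)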
How to perform a cumulative sum with a minimum value in python/pandas?
In the table below:
- the "change in inventory" column reflects the daily sales/new stock purchases.
- data entry/human errors mean that applying cumsum shows a negative inventory level of -5, which is not physically possible.
- as shown by the "inventory" column, the data entry errors continue to be a problem at the end (100 vs 95).
dataframe
change in inventory inventory cumsum
2015-01-01 100 100 100
2015-01-02 -20 80 80
2015-01-03 -30 50 50
2015-01-04 -40 10 10
2015-01-05 -15 0 -5
2015-01-06 100 100 95
One way to achieve this would be to use loops; however, it would be messy, and there is probably a more efficient way to do this.
Here is the code to generate the dataframe:
import pandas as pd
df = pd.DataFrame.from_dict({'change in inventory': {'2015-01-01': 100,
'2015-01-02': -20,
'2015-01-03': -30,
'2015-01-04': -40,
'2015-01-05': -15,
'2015-01-06': 100},
'inventory': {'2015-01-01': 100,
'2015-01-02': 80,
'2015-01-03': 50,
'2015-01-04': 10,
'2015-01-05': 0,
'2015-01-06': 100}})
df['cumsum'] = df['change in inventory'].cumsum()
df
How to apply a cumulative sum with a minimum value in python/pandas to produce the values shown in the "inventory" column?
Depending on the data, it can be far more efficient to loop over blocks with the same sign, e.g. with large running sub-blocks that are all positive or all negative. You only have to be careful going back to positive values after a run of negative values.
With the minimum limiting value minS, summing over vector:
import numpy as np
# indices where the sign flips from negative back to positive mark the end of a block
i_sign = np.append(np.where(np.diff(np.sign(vector)) > 0)[0], [len(vector)])
i0 = 1
csum = np.maximum(minS, vector[:1])
for i1 in i_sign:
    tmp_csum = np.maximum(minS, csum[-1] + np.cumsum(vector[i0:i1 + 1]))
    csum = np.append(csum, tmp_csum)
    i0 = i1 + 1
Final output in csum.
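For reference, here is the same block-wise idea packaged as a self-contained function and run against the question's data. This is a sketch assuming a floor of 0; the expected result mirrors the "inventory" column:

import numpy as np

def clamped_cumsum(vector, minS=0):
    # a block ends wherever the sign flips from negative back to positive
    i_sign = np.append(np.where(np.diff(np.sign(vector)) > 0)[0], [len(vector) - 1])
    csum = np.maximum(minS, vector[:1]).astype(float)
    i0 = 1
    for i1 in i_sign:
        csum = np.append(csum, np.maximum(minS, csum[-1] + np.cumsum(vector[i0:i1 + 1])))
        i0 = i1 + 1
    return csum

changes = np.array([100, -20, -30, -40, -15, 100])
print(clamped_cumsum(changes))  # [100.  80.  50.  10.   0. 100.]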
You can use looping, unfortunately:
lastvalue = 0
newcum = []
for row in df['change in inventory']:
    thisvalue = row + lastvalue
    if thisvalue < 0:
        thisvalue = 0
    newcum.append(thisvalue)
    lastvalue = thisvalue
print(pd.Series(newcum, index=df.index))
2015-01-01 100
2015-01-02 80
2015-01-03 50
2015-01-04 10
2015-01-05 0
2015-01-06 100
dtype: int64
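The same clamped running total can also be written with itertools.accumulate, using the df built in the question (the new column name is just illustrative):

from itertools import accumulate

# the first element passes through unchanged; every later step is clamped at 0
df['inventory_rebuilt'] = list(accumulate(
    df['change in inventory'],
    lambda total, change: max(total + change, 0)))
print(df[['change in inventory', 'inventory', 'inventory_rebuilt']])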
A very ugly solution:
start = df.index[0]
df['cumsum'] = [max(df['change in inventory'].loc[start:end].sum(), 0)
                for end in df.index]
I have a question similar to this and this. The difference is that I have to select row by position, as I do not know the index.
I want to do something like df.iloc[0, 'COL_NAME'] = x, but iloc does not allow this kind of access. If I do df.iloc[0]['COL_NAME'] = x, the warning about chained indexing appears.
For mixed position and label access, you can use .ix, BUT you need to make sure that your index is not integer-based, otherwise it will cause confusion. (Note that .ix has since been deprecated and removed from recent pandas versions.)
df.ix[0, 'COL_NAME'] = x
Update:
Alternatively, try
df.iloc[0, df.columns.get_loc('COL_NAME')] = x
Example:
import pandas as pd
import numpy as np
# your data
# ========================
np.random.seed(0)
df = pd.DataFrame(np.random.randn(10, 2), columns=['col1', 'col2'], index=np.random.randint(1,100,10)).sort_index()
print(df)
col1 col2
10 1.7641 0.4002
24 0.1440 1.4543
29 0.3131 -0.8541
32 0.9501 -0.1514
33 1.8676 -0.9773
36 0.7610 0.1217
56 1.4941 -0.2052
58 0.9787 2.2409
75 -0.1032 0.4106
76 0.4439 0.3337
# .iloc with get_loc
# ===================================
df.iloc[0, df.columns.get_loc('col2')] = 100
df
col1 col2
10 1.7641 100.0000
24 0.1440 1.4543
29 0.3131 -0.8541
32 0.9501 -0.1514
33 1.8676 -0.9773
36 0.7610 0.1217
56 1.4941 -0.2052
58 0.9787 2.2409
75 -0.1032 0.4106
76 0.4439 0.3337
One thing I would add here is that the at function on a dataframe is much faster, particularly if you are doing a lot of assignments of individual (not slice) values.
df.at[index, 'col_name'] = x
In my experience I have gotten a 20x speedup. Here is a write-up that is in Spanish but still gives an impression of what's going on.
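A rough sketch of how one might check that kind of claim; absolute timings vary by machine and pandas version, and the 20x figure above is an observation rather than something this snippet guarantees:

import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 4), columns=list('ABCD'))

def set_with_loc():
    df.loc[500, 'B'] = 1.0

def set_with_at():
    df.at[500, 'B'] = 1.0

print('loc:', timeit.timeit(set_with_loc, number=10_000))
print('at: ', timeit.timeit(set_with_at, number=10_000))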
If you know the position, why not just get the index from that?
Then use .loc:
df.loc[index, 'COL_NAME'] = x
You can use:
df.set_value('Row_index', 'Column_name', value)
set_value is ~100 times faster than the .ix method. It is also better than using df['Row_index']['Column_name'] = value.
But since set_value is now deprecated, .iat/.at are good replacements.
For example, if we have this dataframe:
A B C
0 1 8 4
1 3 9 6
2 22 33 52
If we want to modify the value of the cell [0, "A"], we can do:
df.iat[0,0] = 2
or df.at[0,'A'] = 2
Another way is to assign a column value for a given row based on the index position of the row; the index position always starts with zero, and the last index position is the length of the dataframe minus one:
df["COL_NAME"].iloc[0] = x
To modify the value in a cell at the intersection of row "r" (identified in column "A") and column "C":
retrieve the index of the row "r" from column "A":
i = df[ df['A']=='r' ].index.values[0]
then modify the value in the desired column "C":
df.loc[i,"C"]="newValue"
Note: beforehand, be sure to reset the row index so that you have a clean index list:
df=df.reset_index(drop=True)
Another way is to get the row index and then use df.loc or df.at.
# get row index 'label' from row number 'irow'
label = df.index.values[irow]
df.at[label, 'COL_NAME'] = x
Extending Jianxun's answer, using the set_value method in pandas. It sets the value for a column at a given index.
From the pandas documentation:
DataFrame.set_value(index, col, value)
To set the value at a particular index for a column, do:
df.set_value(index, 'COL_NAME', x)
Hope it helps.