I have a problem I've been trying to solve:
I want the code to take this DataFrame, group on whichever column holds the most frequent value, and sum the values in the last column, 'D'. For example:
import pandas as pd

df = pd.DataFrame({'A': [1000, 1000, 1000, 1000, 1000, 200, 200, 500, 500],
                   'B': [380, 380, 270, 270, 270, 45, 45, 45, 55],
                   'C': [380, 380, 270, 270, 270, 88, 88, 88, 88],
                   'D': [45, 32, 67, 89, 51, 90, 90, 90, 90]})
df
A B C D
0 1000 380 380 45
1 1000 380 380 32
2 1000 270 270 67
3 1000 270 270 89
4 1000 270 270 51
5 200 45 88 90
6 200 45 88 90
7 500 45 88 90
8 500 55 88 90
I would like the code to show the result below:
A B C D
0 1000 380 380 284
1 1000 380 380 284
2 1000 270 270 284
3 1000 270 270 284
4 1000 270 270 284
5 200 45 88 360
6 200 45 88 360
7 500 45 88 360
8 500 55 88 360
Notice that in the first five rows the most frequent value is 1000, in column 'A', so I group by column 'A' and get the sum 284 in column 'D'. In the last four rows, however, the most frequent value, 88, is not in column 'A' but in column 'C', so there I want to sum the values in column 'D' by grouping on column 'C' and get 360. I hope I have made myself clear.
I tried df['D'] = df.groupby(['A', 'B', 'C'])['D'].transform('sum'), but it does not produce the desired result shown above.
Is there any pandas-style way of resolving this? Thanks in advance!
Code
import numpy as np

def get_count_sum(col, func):
    # Group by the given column and apply func ('count' or 'sum') to D,
    # broadcasting the result back onto every row.
    return df.groupby(col).D.transform(func)

ga = get_count_sum('A', 'count')
gb = get_count_sum('B', 'count')
gc = get_count_sum('C', 'count')

conditions = [
    (ga > gb) & (ga > gc),   # column A holds the most frequent value
    (gb > ga) & (gb > gc),   # column B holds the most frequent value
    (gc > ga) & (gc > gb),   # column C holds the most frequent value
]

choices = [get_count_sum('A', 'sum'),
           get_count_sum('B', 'sum'),
           get_count_sum('C', 'sum')]

df['D'] = np.select(conditions, choices)
df
Output
A B C D
0 1000 380 380 284
1 1000 380 380 284
2 1000 270 270 284
3 1000 270 270 284
4 1000 270 270 284
5 200 45 88 360
6 200 45 88 360
7 500 45 88 360
8 500 55 88 360
Explanation
Since we need to group by whichever of columns 'A', 'B' or 'C' holds the most repeated value, we first compute the per-row group counts and store the groupby output in ga, gb and gc for columns A, B and C respectively.
conditions checks which column has the highest count for each row.
choices holds the corresponding group sums of column 'D'.
np.select works like an if-elif-else chain: for each row it takes the choice belonging to the first condition that is True.
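For reference, here is a minimal standalone sketch (with made-up values, not taken from the question) of how np.select behaves like an if/elif/else chain; the default argument is what a row gets when none of the conditions is True:
import numpy as np

x = np.array([1, 5, 10])
conditions = [x < 3, x < 7]   # checked in order, like if / elif
choices = [x * 10, x * 100]   # the value comes from the first True condition
print(np.select(conditions, choices, default=-1))
# [ 10 500  -1]
In the answer above no default is given, so any row where no condition holds (for example, a tie between two columns' counts) would fall back to 0.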
Related
I have data coming from the field and I want to categorize it into ranges of a fixed width.
I want to categorize it in steps of 100, that is 0-100, 100-200, 200-300, and so on.
My data:
df = pd.DataFrame([112, 341, 234, 78, 154], columns=['value'])
value
0 112
1 341
2 234
3 78
4 154
Expected answer:
value value_range
0 112 100-200
1 341 300-400
2 234 200-300
3 78 0-100
4 154 100-200
My code:
df['value_range'] = df['value'].apply(lambda x:[a,b] if x>a and x<b for a,b in zip([0,100,200,300,400],[100,200,300,400,500]))
Current result:
SyntaxError: invalid syntax
You can use pd.cut:
df["value_range"] = pd.cut(df["value"], [0, 100, 200, 300, 400], labels=['0-100', '100-200', '200-300', '300-400'])
print(df)
Prints:
value value_range
0 112 100-200
1 341 300-400
2 234 200-300
3 78 0-100
4 154 100-200
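If you don't want to type the labels by hand, here is a small sketch (assuming the bin edges are kept in a plain list; the names are only illustrative) that builds the '0-100'-style labels from the edges:
import pandas as pd

df = pd.DataFrame([112, 341, 234, 78, 154], columns=['value'])
edges = [0, 100, 200, 300, 400]
# One label per adjacent pair of edges: '0-100', '100-200', ...
labels = [f'{lo}-{hi}' for lo, hi in zip(edges[:-1], edges[1:])]
df['value_range'] = pd.cut(df['value'], bins=edges, labels=labels)
print(df)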
You can use pd.IntervalIndex.from_tuples. Just set the tuple values to the bin edges that fit your data and you should be good to go!
df = pd.DataFrame([112, 341, 234, 78, 154], columns=['value'])
bins = pd.IntervalIndex.from_tuples([(0, 100), (100, 200), (200, 300), (300, 400)])
df['value_range'] = pd.cut(df['value'], bins)
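One note on the result: with IntervalIndex bins, pd.cut stores Interval objects such as (100, 200] rather than strings. If the '100-200'-style labels from the question are wanted, a small follow-up sketch that derives them from the interval endpoints:
# Each cell is an Interval with .left and .right endpoints;
# turn them into '100-200'-style strings.
df['value_range'] = df['value_range'].map(lambda iv: f'{iv.left}-{iv.right}')
print(df)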
I have the following problem:
import pandas as pd
data = {
"ID": [420, 380, 390, 540, 520, 50, 22],
"duration": [50, 40, 45,33,19,1,3],
"next":["390;50","880;222" ,"520;50" ,"380;111" ,"810;111" ,"22;888" ,"11" ]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
As you can see I have
ID duration next
0 420 50 390;50
1 380 40 880;222
2 390 45 520;50
3 540 33 380;111
4 520 19 810;111
5 50 1 22;888
6 22 3 11
Things to notice:
ID type is int
next type is a string of numbers separated by ; when there is more than one number
I would like to keep only the rows where none of the next values appears in the ID column.
For example in this case
420 has a follow-up in both 390 and 50 (both appear in ID), so not this one
380 has next values 880 and 222, neither of which is in ID, so keep this one
540 has next values 380 and 111; 111 is not in ID but 380 is, so not this one
the same goes for 50
In the end I want to get
1 380 40 880;222
4 520 19 810;111
6 22 3 11
When next held only a single value I used print(df[~df.next.astype(int).isin(df.ID)]), but here isin cannot be applied so simply.
How can I do this?
Let us try split, then explode, then an isin check:
s = df.next.str.split(';').explode().astype(int)
out = df[~s.isin(df['ID']).groupby(level=0).any()]
Out[420]:
ID duration next
1 380 40 880;222
4 520 19 810;111
6 22 3 11
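To make the intermediate steps easier to follow, here is the same answer spelled out with a named mask (no new logic, just the pieces separated):
# Each split value keeps its original row label, e.g. row 0 yields 390 and 50.
s = df.next.str.split(';').explode().astype(int)

# Per original row: is ANY of its next values present in the ID column?
has_follow_up = s.isin(df['ID']).groupby(level=0).any()

# Keep only the rows with no follow-up at all.
out = df[~has_follow_up]
print(out)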
Use a regex with word boundaries for efficiency:
pattern = '|'.join(df['ID'].astype(str))
out = df[~df['next'].str.contains(fr'\b(?:{pattern})\b')]
Output:
ID duration next
1 380 40 880;222
4 520 19 810;111
6 22 3 11
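A side note on the regex approach: it assumes the IDs are plain digits. If the values could ever contain regex metacharacters, escaping them first keeps the pattern safe; a minimal sketch:
import re

# Escape every ID before joining, in case a value contains special characters.
pattern = '|'.join(map(re.escape, df['ID'].astype(str)))
out = df[~df['next'].str.contains(fr'\b(?:{pattern})\b')]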
This question already has answers here:
How do I create a new column from the output of pandas groupby().sum()?
I would like to ask how I can iterate through the DataFrame, check where the ID has the same value, and then sum the prices for those rows.
I tried it with the following code:
d = {'ID': [126, 126, 148, 148, 137, 137], 'price': [100, 50, 120, 40, 160, 30]}
df = pd.DataFrame(data=d)
So the DataFrame looks like this:
ID price
0 126 100
1 126 50
2 148 120
3 148 40
4 137 160
5 137 30
for index in df.index():
    if df.iloc[index, "ID"] == df.iloc[index+1, "ID"]:
        df.at[index, "price"] = df.iloc[index, "price"] + df.iloc[index+1, "price"]
        df.at[index+1, "price"] = df.iloc[index, "price"] + df.iloc[index+1, "price"]
I would like to have a result like this:
ID price
0 126 150
1 126 150
2 148 160
3 148 160
4 137 190
5 137 190
Please help if someone knows how to do it. :)
Try groupby + transform:
df['price'] = df.groupby('ID')['price'].transform('sum')
OUTPUT:
ID price
0 126 150
1 126 150
2 148 160
3 148 160
4 137 190
5 137 190
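For comparison, here is a short sketch of the difference between a plain aggregation and transform: the aggregation collapses to one row per ID, while transform broadcasts the per-ID sum back onto every original row, which is what the desired output needs:
import pandas as pd

d = {'ID': [126, 126, 148, 148, 137, 137], 'price': [100, 50, 120, 40, 160, 30]}
df = pd.DataFrame(data=d)

# Aggregated view: one row per ID.
print(df.groupby('ID', as_index=False)['price'].sum())

# Transform: the same sums, repeated on all six original rows.
df['price'] = df.groupby('ID')['price'].transform('sum')
print(df)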
import pandas as pd
import numpy as np
d = {'col1': [100, 198, 495, 600, 50], 'col2': [99, 200, 500, 594, 100], 'col3': [101, 202, 505, 606, 150]}
df = pd.DataFrame(data=d)
df
From this I get a simple table:
col1 col2 col3
0 100 99 101
1 198 200 202
2 495 500 505
3 600 594 606
4 50 100 150
From this I would like to take the %CV (coefficient of variation) of all values in the first row, then the second row, and so on.
I would like it to work regardless of how many columns the table has.
I could do this with a few lines of code:
df_shape = df.shape
CV_list = []
for i in range(df_shape[0]):
    CV = np.std(df.iloc[i, :], ddof=1) / np.mean(df.iloc[i, :]) * 100
    CV_list.append(str(round(CV, 3)) + ' %')
df["CV"] = CV_list
df
output:
   col1  col2  col3      CV
0   100    99   101   1.0 %
1   198   200   202   1.0 %
2   495   500   505   1.0 %
3   600   594   606   1.0 %
4    50   100   150  50.0 %
But I wonder if Pandas has a built in functions for this (that I could not find so far).
You can operate across an entire row by specifying axis=1. So get the Series of standard deviations and means (for each row) and divide.
df['CV'] = df.std(axis=1, ddof=1)/df.mean(axis=1)*100
col1 col2 col3 CV
0 100 99 101 1.0
1 198 200 202 1.0
2 495 500 505 1.0
3 600 594 606 1.0
4 50 100 150 50.0
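If the percent-sign strings from the original loop are still wanted, here is a small follow-up sketch that computes the numeric CV first and then formats it (the rounding to three decimals mirrors the loop above):
import pandas as pd

d = {'col1': [100, 198, 495, 600, 50], 'col2': [99, 200, 500, 594, 100], 'col3': [101, 202, 505, 606, 150]}
df = pd.DataFrame(data=d)

# Row-wise coefficient of variation, then formatted as a '1.0 %'-style string.
cv = df.std(axis=1, ddof=1) / df.mean(axis=1) * 100
df['CV'] = cv.round(3).astype(str) + ' %'
print(df)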
I've got a pandas data frame defined like this:
import datetime
import pandas

last_4_weeks_range = pandas.date_range(
    start=datetime.datetime(2001, 5, 4), periods=28)
last_4_weeks = pandas.DataFrame(
    [{'REST_KEY': 1, 'DLY_TRN_QT': 80, 'DLY_SLS_AMT': 90,
      'COOP_DLY_TRN_QT': 30, 'COOP_DLY_SLS_AMT': 20}] * 28 +
    [{'REST_KEY': 2, 'DLY_TRN_QT': 70, 'DLY_SLS_AMT': 10,
      'COOP_DLY_TRN_QT': 50, 'COOP_DLY_SLS_AMT': 20}] * 28,
    index=last_4_weeks_range.append(last_4_weeks_range))
last_4_weeks.sort(inplace=True)
and when I go to resample it:
In [265]: last_4_weeks.resample('7D', how='sum')
Out[265]:
COOP_DLY_SLS_AMT COOP_DLY_TRN_QT DLY_SLS_AMT DLY_TRN_QT REST_KEY
2001-05-04 280 560 700 1050 21
2001-05-11 280 560 700 1050 21
2001-05-18 280 560 700 1050 21
2001-05-25 280 560 700 1050 21
2001-06-01 0 0 0 0 0
I end up with an extra empty bin I wouldn't expect to see -- 2001-06-01. I wouldn't expect that bin to be there, as my 28 days are evenly divisible into the 7 day resample I'm performing. I've tried messing around with the closed kwarg, but I can't escape that extra bin. Why is that extra bin showing up when I've got nothing to put into it and how can I avoid generating it?
What I'm ultimately trying to do is get 7 day averages per REST_KEY, so doing a
In [266]: last_4_weeks.groupby('REST_KEY').resample('7D', how='sum').mean(level=0)
Out[266]:
COOP_DLY_SLS_AMT COOP_DLY_TRN_QT DLY_SLS_AMT DLY_TRN_QT REST_KEY
REST_KEY
1 112 168 504 448 5.6
2 112 280 56 392 11.2
but that extra empty bin is throwing off my mean (e.g. for COOP_DLY_SLS_AMT I get 112, which is (20 * 7 * 4) / 5, rather than the 140 I'd get from (20 * 7 * 4) / 4 if that extra bin weren't there). I also wouldn't expect REST_KEY to show up in the aggregation since it's part of the groupby, but that's a smaller problem.
P.S. I'm using pandas 0.11.0
I think it's a bug. The output with pandas 0.9.0dev on Mac is:
In [3]: pandas.__version__
Out[3]: '0.9.0.dev-1e68fd9'
In [6]: last_4_weeks.resample('7D', how='sum')
Out[6]:
COOP_DLY_SLS_AMT COOP_DLY_TRN_QT DLY_SLS_AMT DLY_TRN_QT REST_KEY
2001-05-04 40 80 100 150 3
2001-05-11 280 560 700 1050 21
2001-05-18 280 560 700 1050 21
2001-05-25 280 560 700 1050 21
2001-06-01 240 480 600 900 18
In [4]: last_4_weeks.groupby('REST_KEY').resample('7D', how='sum').mean(level=0)
Out[4]:
COOP_DLY_SLS_AMT COOP_DLY_TRN_QT DLY_SLS_AMT DLY_TRN_QT REST_KEY
REST_KEY
1 112 168 504 448 5.6
2 112 280 56 392 11.2
I'm using these versions (via pip freeze):
numpy==1.8.0.dev-9597b1f-20120920
pandas==0.9.0.dev-1e68fd9-20120920
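As a side note for anyone reading this on a current pandas release, where resample(how=...) no longer exists (and the old .sort(inplace=True) call would be sort_index(inplace=True)): here is a hedged sketch of one way to express the intended computation, 7-day sums per REST_KEY followed by their mean, assuming the same last_4_weeks frame; it has not been checked against the old 0.9/0.11 versions discussed above.
import pandas as pd

# Bin the DatetimeIndex into 7-day groups per REST_KEY and sum each bin,
# then average those weekly sums for every REST_KEY.
weekly = last_4_weeks.groupby(['REST_KEY', pd.Grouper(freq='7D')]).sum()
print(weekly.groupby(level='REST_KEY').mean())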