I have a test df like this:
import pandas as pd

df = pd.DataFrame({'A': ['Apple','Apple', 'Apple','Orange','Orange','Orange','Pears','Pears'],
'B': [1,2,9,6,4,3,2,1]
})
A B
0 Apple 1
1 Apple 2
2 Apple 9
3 Orange 6
4 Orange 4
5 Orange 3
6 Pears 2
7 Pears 1
Now I need to add a new column with the respective % differences in column 'B'. How can I do this? I cannot get it to work.
I have looked at
update column value of pandas groupby().last()
but I am not sure it is pertinent to my problem. And this, which looks promising:
Pandas Groupby and Sum Only One Column
I need to compute the maximum percentage change in column 'B' per group of column 'A' and insert it into a column maxpercchng for all rows in each group.
So I have come up with this code:
grouppercchng = ((df.groupby['A'].max() - df.groupby['A'].min())/df.groupby['A'].iloc[0])*100
and then try to add it to the group column 'maxpercchng' like so:
group['maxpercchng'] = grouppercchng
Or like so
df_kpi_hot.groupby(['A'], as_index=False)['maxpercchng'] = grouppercchng
Does anyone know how to add the maxpercchng value to all rows in each group?
I believe you need transform, which returns a Series the same size as the original DataFrame, filled with the aggregated values:
g = df.groupby('A')['B']
df['maxpercchng'] = (g.transform('max') - g.transform('min')) / g.transform('first') * 100
print (df)
A B maxpercchng
0 Apple 1 800.0
1 Apple 2 800.0
2 Apple 9 800.0
3 Orange 6 50.0
4 Orange 4 50.0
5 Orange 3 50.0
6 Pears 2 50.0
7 Pears 1 50.0
Or:
g = df.groupby('A')['B']
df1 = ((g.max() - g.min()) / g.first() * 100).reset_index()
print (df1)
A B
0 Apple 800.0
1 Orange 50.0
2 Pears 50.0
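If you go with the aggregated frame df1 instead, one way (a small sketch, not part of the original answer) to broadcast each group's value back onto every row is to map it through column 'A'; note that reset_index leaves the result column named 'B':
# look up each row's group value in the aggregated frame df1
df['maxpercchng'] = df['A'].map(df1.set_index('A')['B'])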
Related
I need to write Python code that, for a variable N, finds N consecutive rows in a DataFrame column with the same value (and different from NaN), like this.
I can't figure out how to do it with a for loop because I don't know which row I'm looking at in each case. Any idea how I can do it?
Fruit    2 matches  5 matches
Apple    No         No
NaN      No         No
Pear     No         No
Pear     Yes        No
Pear     Yes        No
Pear     Yes        No
Pear     Yes        Yes
NaN      No         No
NaN      No         No
NaN      No         No
NaN      No         No
NaN      No         No
Banana   No         No
Banana   Yes        No
Update: testing the solution by @Corralien
counts = (df.groupby(df['Fruit'].ne(df['Fruit'].shift()).cumsum()) # virtual groups
.transform('cumcount').add(1) # cumulative counter
.where(df['Fruit'].notna(), other=0)) # set NaN to 0
N = 2
df['Matches'] = df.where(counts >= N, other='No')
VSCode returns the 'Frame skipped from debugging during step-in.' message when executing the last line, and an exception is raised in the previous for loop.
Compute the count of consecutive values and set the NaN rows to 0. Once you have the cumulative counter, you just have to check whether it is greater than or equal to N:
counts = (df.groupby(df['Fruit'].ne(df['Fruit'].shift()).cumsum()) # virtual groups
.transform('cumcount').add(1) # cumulative counter
.where(df['Fruit'].notna(), other=0)) # set NaN to 0
N = 2
df['2 matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})
N = 5
df['5 matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})
Output:
>>> df
Fruit 2 matches 5 matches
0 Apple No No
1 NaN No No
2 Pear No No
3 Pear Yes No
4 Pear Yes No
5 Pear Yes No
6 Pear Yes Yes
7 NaN No No
8 NaN No No
9 NaN No No
10 NaN No No
11 NaN No No
12 Banana No No
13 Banana Yes No
>>> counts
0 1
1 0
2 1
3 2
4 3
5 4
6 5
7 0
8 0
9 0
10 0
11 0
12 1
13 2
dtype: int64
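The key piece is the groupby key itself: df['Fruit'].ne(df['Fruit'].shift()).cumsum() increments every time the value changes, so each run of identical values gets its own virtual group id and cumcount restarts there. A minimal sketch with made-up data (not the frame above) to show the intermediate ids:
import pandas as pd
import numpy as np

fruit = pd.Series(['Apple', np.nan, 'Pear', 'Pear', 'Pear'])
# True wherever the value differs from the previous row, then cumulative sum -> run ids
group_ids = fruit.ne(fruit.shift()).cumsum()
print(group_ids.tolist())  # [1, 2, 3, 3, 3] -> each run of equal values shares one id
Note that consecutive NaN rows each start a new id (NaN never compares equal to NaN), which is why the counter is explicitly zeroed with .where(df['Fruit'].notna(), other=0).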
Update
What if I need to replace "Yes" with the fruit name, for example?
N = 2
df['2 matches'] = df.where(counts >= N, other='No')
print(df)
# Output
Fruit 2 matches
0 Apple No
1 NaN No
2 Pear No
3 Pear Pear
4 Pear Pear
5 Pear Pear
6 Pear Pear
7 NaN No
8 NaN No
9 NaN No
10 NaN No
11 NaN No
12 Banana No
13 Banana Banana
data = {
"Food": ['apple', 'apple', 'apple','orange','apple','apple','orange','orange','orange'],
"Calorie": [50, 40, 50,30,'Nan','Nan',50,30,'Nan']
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
Having a data frame as above, I need to replace the missing values with the median per group. For example, if the food is apple, the Nan values need to be replaced by the median of the apple rows; orange works the same way. The output needs to be like this:
Food Calorie
0 apple 50
1 apple 40
2 apple 50
3 orange 30
4 apple 50
5 apple 50
6 orange 50
7 orange 30
8 orange 30
You could do
import numpy as np

df = df.replace('Nan', np.nan)
df.Calorie.fillna(df.groupby('Food')['Calorie'].transform('median'), inplace=True)
df
Out[170]:
Food Calorie
0 apple 50.0
1 apple 40.0
2 apple 50.0
3 orange 30.0
4 apple 50.0
5 apple 50.0
6 orange 50.0
7 orange 30.0
8 orange 30.0
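Note that the chained fillna(..., inplace=True) through attribute access may not modify df in recent pandas versions (copy-on-write). A minimal explicit-assignment sketch of the same idea, not part of the original answer, using pd.to_numeric to coerce the 'Nan' strings:
import pandas as pd

df = pd.DataFrame({
    "Food": ['apple', 'apple', 'apple', 'orange', 'apple', 'apple', 'orange', 'orange', 'orange'],
    "Calorie": [50, 40, 50, 30, 'Nan', 'Nan', 50, 30, 'Nan'],
})

# coerce the 'Nan' strings to real NaN, then fill each gap with its group's median
df['Calorie'] = pd.to_numeric(df['Calorie'], errors='coerce')
df['Calorie'] = df['Calorie'].fillna(df.groupby('Food')['Calorie'].transform('median'))
print(df)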
I want to change the first index column to integer type,
e.g. 0.0 -> 0, 1.0 -> 1, 2.0 -> 2, ...
However, I can't work out how to select that first column. As you can see, the frame uses a MultiIndex. Please help me.
I succeeded in accessing a single value using the pandas syntax below, but I don't know how to change all of the values in the first index column.
sum count
timestamp(hour) goods price price
0.0 1 1000 40
2 200 29
3 129 11
4 76 5
1.0 1 1000 40
2 200 29
3 129 11
4 76 5
...
In[61] pivot1.index[0][0]
Out[62] 0.0
You can use DataFrame.rename with level=0:
df = pd.DataFrame({
'col':[4,5,4,5,5,4],
'timestamp(hour)':[7,8.0,8,8.0,8,3],
'goods':list('aaabbb')
}).set_index(['timestamp(hour)','goods'])
print (df)
col
timestamp(hour) goods
7.0 a 4
8.0 a 5
a 4
b 5
b 5
3.0 b 4
df = df.rename(int, level=0)
print (df)
col
timestamp(hour) goods
7 a 4
8 a 5
a 4
b 5
b 5
3 b 4
You could:
df.index = df.index.set_levels([df.index.levels[0].astype(int), df.index.levels[1]])
But jezrael's answer is better, I guess.
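For reference, MultiIndex.set_levels also accepts a level argument, so only the level being changed needs to be passed (a sketch against the same example frame):
# cast only the first level ('timestamp(hour)') to int; the 'goods' level is untouched
df.index = df.index.set_levels(df.index.levels[0].astype(int), level=0)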
I have the following dataframe:
df=pd.DataFrame({'id':['A','A','B','C','D'],'Name':['apple','apricot','banana','orange','citrus'], 'count':[2,3,6,5,12]})
id Name count
0 A apple 2
1 A apricot 3
2 B banana 6
3 C orange 5
4 D citrus 12
I am trying to group the dataframe by the 'id' column, but also preserve the duplicated names as separate columns. Below is the expected output:
id sum(count) id1 id2
0 A 5 apple apricot
1 B 6 banana na
2 C 5 orange na
3 D 12 citrus na
I tried grouping by the id column using the following statement but that removes the name column completely.
df.groupby(['id'], as_index=False).sum()
I would appreciate any suggestions or help.
You can use DataFrame.pivot_table for this:
g = df.groupby('id')
# Generate the new columns of the pivoted dataframe
col = g.Name.cumcount()
# Sum of count grouped by id
sum_count = g['count'].sum()
(df.pivot_table(values='Name', index='id', columns = col, aggfunc='first')
.add_prefix('id')
.assign(sum_count = sum_count))
id0 id1 sum_count
id
A apple apricot 5
B banana NaN 6
C orange NaN 5
D citrus NaN 12
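If you prefer to stay with groupby, a rough alternative sketch (the sum_count/names labels are mine, not from the post) collects the names per group into lists and then expands them into id0, id1, ... columns:
import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'B', 'C', 'D'],
                   'Name': ['apple', 'apricot', 'banana', 'orange', 'citrus'],
                   'count': [2, 3, 6, 5, 12]})

out = (df.groupby('id')
         .agg(sum_count=('count', 'sum'), names=('Name', list))
         .reset_index())
# expand the per-group name lists into columns (NaN where a group has fewer names)
names = pd.DataFrame(out.pop('names').tolist(), index=out.index).add_prefix('id')
print(out.join(names))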
There are a plethora of questions on SO about how to select rows in a DataFrame and replace values in a column in those rows, but one use case is missing. To use the example DataFrame from this question,
In [1]: df
Out[1]:
apple banana cherry
0 0 3 good
1 1 4 bad
2 2 5 good
And this works if one wants to change a single column based on another:
df.loc[df.cherry == 'bad', 'apple'] = df.banana * 2
Or this sets the values in two columns:
df.loc[df.cherry == 'bad', ['apple', 'banana']] = np.nan
But this doesn't work:
df.loc[df.cherry == 'bad', ['apple', 'banana']] = [df.banana, df.apple]
because apparently the right side is 3x2 while the left side is 1x2, hence the error message:
ValueError: Must have equal len keys and value when setting with an ndarray
So I understand what the problem is, but what is the solution?
IIUC you can try:
df['a'] = df.apple * 3
df['b'] = df.banana * 2
print(df)
apple banana cherry a b
0 0 3 good 0 6
1 1 4 bad 3 8
2 2 5 good 6 10
df[['a', 'b']] = df.loc[df.cherry == 'bad', ['apple', 'banana']]
print(df)
apple banana cherry a b
0 0 3 good NaN NaN
1 1 4 bad 1.0 4.0
2 2 5 good NaN NaN
Or use conditions with values:
df['a'] = df.apple * 3
df['b'] = df.banana * 2
df.loc[df.cherry == 'bad', ['apple', 'banana']] = df.loc[df.cherry == 'bad', ['a', 'b']].values
print(df)
apple banana cherry a b
0 0 3 good 0 6
1 3 8 bad 3 8
2 2 5 good 6 10
Another option with the original columns:
print(df[['apple','banana']].shift() * 2)
apple banana
0 NaN NaN
1 12.0 6.0
2 2.0 8.0
df.loc[df.cherry == 'bad', ['apple', 'banana']] = df[['apple','banana']].shift() * 2
print(df)
apple banana cherry
0 6.0 3.0 good
1 12.0 6.0 bad
2 2.0 5.0 good
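For the literal swap the question asks about (exchanging apple and banana on the 'bad' rows), a minimal sketch, not taken from the answer above, that sidesteps column alignment by converting the right-hand side to a plain array:
import pandas as pd

df = pd.DataFrame({'apple': [0, 1, 2],
                   'banana': [3, 4, 5],
                   'cherry': ['good', 'bad', 'good']})

mask = df.cherry == 'bad'
# .to_numpy() (or .values on older pandas) drops the column labels,
# so pandas assigns positionally instead of realigning 'banana' onto 'banana'
df.loc[mask, ['apple', 'banana']] = df.loc[mask, ['banana', 'apple']].to_numpy()
print(df)
#    apple  banana cherry
# 0      0       3   good
# 1      4       1    bad
# 2      2       5   good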