I'm trying to compute a cumulative sum in Python based on two different conditions.
As you can see in the data below, the CALCULATION column takes the same value as the Number column as long as the Cat1 and Cat2 columns don't change.
Once the Cat1 column changes, the Number column (and with it the running total) resets.
While Cat1 and Cat2 stay the same, CALCULATION simply mirrors the Number column; once the Cat2 column changes while Cat1 keeps the same value, CALCULATION continues from its last value, adding each new Number to it.
Example of data below:
Cat1 Cat2 Number CALCULATION
a orange 1 1
a orange 2 2
a orange 3 3
a orange 4 4
a orange 5 5
a orange 6 6
a orange 7 7
a orange 8 8
a orange 9 9
a orange 10 10
a orange 11 11
a orange 12 12
a orange 13 13
b purple 1 1
b purple 2 2
b purple 3 3
b purple 4 4
b purple 5 5
b purple 6 6
b purple 7 7
b purple 8 8
b silver 1 9
b silver 2 10
b silver 3 11
b silver 4 12
b silver 5 13
b silver 6 14
b silver 7 15
Are you looking for:
import pandas as pd

df = pd.DataFrame({'Cat1': ['a'] * 13 + ['b'] * 15,
                   'Cat2': ['orange'] * 13 + ['purple'] * 8 + ['silver'] * 7})
df['Number'] = df.groupby(['Cat1', 'Cat2']).cumcount() + 1
df['CALCULATION'] = df.groupby('Cat1').cumcount() + 1
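The cumcount trick above works because Number is a plain 1..n counter. If Number were an arbitrary running value, a more general variant (my own sketch, not from the answer) would accumulate Number's per-row increment within each Cat1 group, so the total carries across a Cat2 change but resets on a Cat1 change:

```python
import pandas as pd

# Rebuild the sample data.
df = pd.DataFrame({'Cat1': ['a'] * 13 + ['b'] * 15,
                   'Cat2': ['orange'] * 13 + ['purple'] * 8 + ['silver'] * 7})
df['Number'] = df.groupby(['Cat1', 'Cat2']).cumcount() + 1

# Per-row step of Number within each (Cat1, Cat2) group; the first row
# of each group has no predecessor, so it falls back to its own Number.
step = df.groupby(['Cat1', 'Cat2'])['Number'].diff().fillna(df['Number'])

# Accumulate the steps per Cat1: the running total continues across a
# Cat2 change but starts over whenever Cat1 changes.
df['CALCULATION'] = step.groupby(df['Cat1']).cumsum().astype(int)
```

For the sample data this reproduces the expected column: 1..13 for `a`, then 1..8 for `b`/purple continuing to 9..15 for `b`/silver.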
I have a table as follows:
id   a  b  a  b  c  color
123  1  6  7  3  4  blue
456  2  8  9  7  5  yellow
As you can see, some of the columns have the same names. What I want to do is stack the columns with the same names on top of each other (make the table longer rather than wider). I have looked through the documentation of stack, melt and pivot, but I can't find a problem similar to mine. Can anyone help me achieve this?
FYI, here is how I need the table to be:
id   a  b  c  color
123  1  6  4  blue
123  7  3  4  blue
456  2  8  5  yellow
456  9  7  5  yellow
You can deduplicate with groupby.cumcount, then stack and groupby.ffill the missing values:
(df.set_axis(pd.MultiIndex.from_arrays([df.columns,
                                        df.groupby(level=0, axis=1).cumcount()]),
             axis=1)
   .stack()
   .groupby(level=0).ffill()
   .reset_index(drop=True)
   .convert_dtypes()                   # optional
   [list(dict.fromkeys(df.columns))]   # also optional, keeps the original column order
)
output:
id a b c color
0 123 1 6 4 blue
1 123 7 3 4 blue
2 456 2 8 5 yellow
3 456 9 7 5 yellow
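For reference, here is a self-contained sketch of the same idea that rebuilds the sample frame and numbers the duplicate labels without `groupby(axis=1)`, which newer pandas versions deprecate:

```python
import pandas as pd

# Sample frame from the question, with duplicate 'a' and 'b' labels.
df = pd.DataFrame(
    [[123, 1, 6, 7, 3, 4, 'blue'],
     [456, 2, 8, 9, 7, 5, 'yellow']],
    columns=['id', 'a', 'b', 'a', 'b', 'c', 'color'])

# Number each duplicate label (a -> 0, 1; b -> 0, 1; others -> 0).
dup = pd.Series(df.columns).groupby(df.columns).cumcount()

out = (df.set_axis(pd.MultiIndex.from_arrays([df.columns, dup]), axis=1)
         .stack()                    # one row per (original row, duplicate number)
         .groupby(level=0).ffill()   # copy id/c/color down into the extra rows
         .reset_index(drop=True)
         [['id', 'a', 'b', 'c', 'color']])   # restore the original column order
print(out)
```

This yields the four-row long-format table shown in the output above.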
# melt to turn the wide format into long format
df2 = df.melt(id_vars=['id'])
(df2.assign(seq=df2.groupby(['variable']).cumcount())  # assign a seq to create multiple rows per id
    .pivot(index=['id', 'seq'], columns='variable', values='value')  # pivot back to wide
    .reset_index()
    .drop(columns='seq')
    .rename_axis(columns=None)
).ffill()  # fill NaN with the previous value
id a b c color
0 123 1 6 4 blue
1 123 7 3 4 blue
2 456 2 8 5 yellow
3 456 9 7 5 yellow
One option is pivot_longer from pyjanitor; I added a temporary column c1 so that every label appears the same number of times across the columns:
(df
.assign(c1=df.c)
.pivot_longer(
index = ['id', 'color'],
names_to = '.value',
names_pattern = '(.)')
)
id color a b c
0 123 blue 1 6 4
1 456 yellow 2 8 5
2 123 blue 7 3 4
3 456 yellow 9 7 5
Sample dataset:
id fruit
0 7 NaN
1 7 apple
2 7 NaN
3 7 mango
4 7 apple
5 7 potato
6 3 berry
7 3 olive
8 3 olive
9 3 grape
10 3 NaN
11 3 mango
12 3 potato
In the fruit column, NaN and 'potato' count as 0; every other string counts as 1. I want to generate a new column sum_last_3 where each row is the sum over the previous 3 rows (inclusive) of the fruit column. When a new id appears, the count should start over from the beginning of that id.
Output I want:
id fruit sum_last3
0 7 NaN 0
1 7 apple 1
2 7 NaN 1
3 7 mango 2
4 7 apple 2
5 7 potato 2
6 3 berry 1
7 3 olive 2
8 3 olive 3
9 3 grape 3
10 3 NaN 2
11 3 mango 2
12 3 potato 1
My Code:
df['sum_last5'] = (df['fruit'].ne('potato') & df['fruit'].notna())
.groupby('id',sort=False, as_index=False)['fruit']
.rolling(min_periods=1, window=3).sum().astype(int).values
You can modify your code slightly, as follows:
df['sum_last3'] = ((df['fruit'].ne('potato') & df['fruit'].notna())
.groupby(df['id'],sort=False)
.rolling(min_periods=1, window=3).sum().astype(int)
.droplevel(0)
)
or use .values as in your code:
df['sum_last3'] = ((df['fruit'].ne('potato') & df['fruit'].notna())
.groupby(df['id'],sort=False)
.rolling(min_periods=1, window=3).sum().astype(int)
.values
)
Your code is close; you just need to change 'id' to df['id'] in the .groupby() call (since .groupby() is now being called on a boolean Series rather than on df itself, it cannot resolve the id column from the label 'id' alone and needs the column spelled out in full).
Also remove as_index=False, since that parameter applies to a DataFrame groupby rather than to the (boolean) Series groupby used here.
Result:
print(df)
id fruit sum_last3
0 7 NaN 0
1 7 apple 1
2 7 NaN 1
3 7 mango 2
4 7 apple 2
5 7 potato 2
6 3 berry 1
7 3 olive 2
8 3 olive 3
9 3 grape 3
10 3 NaN 2
11 3 mango 2
12 3 potato 1
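Putting the corrected version together as a runnable sketch (rebuilding the sample frame from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [7] * 6 + [3] * 7,
    'fruit': [np.nan, 'apple', np.nan, 'mango', 'apple', 'potato',
              'berry', 'olive', 'olive', 'grape', np.nan, 'mango', 'potato'],
})

# 1 for countable fruit, 0 for NaN or 'potato'.
valid = df['fruit'].ne('potato') & df['fruit'].notna()

df['sum_last3'] = (valid.groupby(df['id'], sort=False)
                        .rolling(window=3, min_periods=1).sum()
                        .droplevel(0)        # drop the id level added by groupby
                        .astype(int))
print(df)
```

The rolling window never crosses an id boundary because the groupby restarts it for each id.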
I am using this dataframe:
source fruit 2019 2020 2021
0 a apple 3 1 1
1 a banana 4 3 5
2 a orange 2 2 2
3 b apple 3 4 5
4 b banana 4 5 2
5 b orange 1 6 4
I want to refine it like this:
source fruit 2019 2020 2021
0 a total 9 6 8
1 a seeds 5 3 3
2 a banana 4 3 5
3 b total 8 15 11
4 b seeds 4 10 9
5 b banana 4 5 2
total is the sum of all fruits in that year for each source.
seeds is the sum of the fruits containing seeds for each year for each source.
I tried appending new empty rows (following "Insert a new row after every nth row" and "Insert row at any position"), but wasn't getting the expected result.
What would be the best way to get the desired output?
TRY:
df1 = df.groupby('source', as_index=False).sum().assign(fruit='total')
seeds = ['orange', 'apple']
df2 = df.loc[df['fruit'].isin(seeds)].groupby('source', as_index=False).sum().assign(fruit='seeds')
final_df = pd.concat([df.loc[~df['fruit'].isin(seeds)], df1, df2])
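As a runnable sketch (rebuilding the frame from the question; treating the year columns as string labels, and assuming apple and orange are the seeded fruits, with an explicit sort to match the desired row order):

```python
import pandas as pd

df = pd.DataFrame({
    'source': ['a'] * 3 + ['b'] * 3,
    'fruit':  ['apple', 'banana', 'orange'] * 2,
    '2019': [3, 4, 2, 3, 4, 1],
    '2020': [1, 3, 2, 4, 5, 6],
    '2021': [1, 5, 2, 5, 2, 4],
})

seeds = ['orange', 'apple']          # assumption: the fruits that contain seeds
years = ['2019', '2020', '2021']

totals = (df.groupby('source', as_index=False)[years].sum()
            .assign(fruit='total'))
seeded = (df[df['fruit'].isin(seeds)]
            .groupby('source', as_index=False)[years].sum()
            .assign(fruit='seeds'))

# Stack total/seeds rows on top of the remaining fruits, then use a
# stable sort so each source keeps the order total, seeds, banana.
out = (pd.concat([totals, seeded, df[~df['fruit'].isin(seeds)]])
         .sort_values('source', kind='stable', ignore_index=True))
print(out[['source', 'fruit'] + years])
```

Selecting `[years]` before `.sum()` keeps the string fruit column out of the aggregation.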
I have two dataframes - one large dataframe with multiple categorical columns and one column with missing values, and another that's sort of a dictionary with the same categorical columns and one column with a key value.
Essentially, I want to fill the missing values in the large dataframe with the key value in the second if all the categorical columns match.
Missing value df:
Color Number Letter Value
0 Red 2 B NaN
1 Green 2 A NaN
2 Red 2 B NaN
3 Red 1 B NaN
4 Green 1 A NaN
5 Red 2 B NaN
6 Green 1 B NaN
7 Green 2 A NaN
Dictionary df:
Color Number Letter Value
0 Red 1 A 10
1 Red 1 B 4
2 Red 2 A 3
3 Red 2 B 15
4 Green 1 A 21
5 Green 1 B 9
6 Green 2 A 22
7 Green 2 B 1
Desired df:
0 Red 2 B 15
1 Green 2 A 22
2 Red 2 B 15
3 Red 1 B 4
4 Green 1 A 21
5 Red 2 B 15
6 Green 1 B 9
7 Green 2 A 22
I'm not sure if I should have the 'dictionary df' as an actual dictionary, or keep it as a dataframe (it's pulled from a csv).
Is this possible to do cleanly without a myriad of if else statements?
Thanks!
Does this work?
>>> df_1[['Color', 'Number', 'Letter']].merge(df_2,
... on=('Color', 'Number', 'Letter'),
... how='left')
Color Number Letter Value
0 Red 2 B 15
1 Green 2 A 22
2 Red 2 B 15
3 Red 1 B 4
4 Green 1 A 21
5 Red 2 B 15
6 Green 1 B 9
7 Green 2 A 22
Thought it worth mentioning - a very simple way to convert examples from stackoverflow pandas questions into a dataframe, just cut and paste it into a string like this:
>>> from io import StringIO
>>> df_1 = pd.read_csv(StringIO("""
... Color Number Letter Value
... 0 Red 2 B NaN
... 1 Green 2 A NaN
... 2 Red 2 B NaN
... 3 Red 1 B NaN
... 4 Green 1 A NaN
... 5 Red 2 B NaN
... 6 Green 1 B NaN
... 7 Green 2 A NaN
... """), sep=r'\s+')
Try:
missing_df.reset_index()[['index', 'Color', 'Number', 'Letter']]\
.merge(dict_df, on = ['Color', 'Number', 'Letter'])\
.set_index('index').reindex(missing_df.index)
Output:
Color Number Letter Value
0 Red 2 B 15
1 Green 2 A 22
2 Red 2 B 15
3 Red 1 B 4
4 Green 1 A 21
5 Red 2 B 15
6 Green 1 B 9
7 Green 2 A 22
I will be calling the Missing value df: df, and the Dictionary df: ddf, considering both as dataframes.
Note that simply assigning df['Value'] = ddf['Value'] would copy values by row position rather than by matching the categorical columns, so the lookup has to go through the keys. Set the key columns as the index of ddf, then map each row of df onto it:
keys = ['Color', 'Number', 'Letter']
df['Value'] = df.set_index(keys).index.map(ddf.set_index(keys)['Value'])
I'm having issues pivoting the data below:
index column data
0 1 A cat
1 1 B blue
2 1 C seven
3 2 A dog
4 2 B green
5 2 B red
6 2 C eight
7 2 C five
8 3 A fish
9 3 B pink
10 3 C one
I am attempting to pivot it by using
df.pivot(index='index', columns='column', values="data")
But I receive the error "Index contains duplicate entries, cannot reshape"
I have looked through a large number of similar posts to this but none of the solutions I tried worked
My desired output is
index A B C
1 cat blue seven
2 dog green eight
2 dog green five
2 dog red eight
2 dog red five
3 fish pink one
What would be the best solution for this?
In this question, Pandas pivot warning about repeated entries on index, they state that duplicate pairs (i.e. a pair duplicated across the 'index' and 'column' columns) cannot be pivoted.
In your dataset, index 2 has the column values B and C twice each.
Can you change the 'index' column?
See my new dataframe as an example:
df = pd.DataFrame({'index': [1,1,1,2,2,3,2,4,3,4,3],
'column': ['A','B','C','A','B','B','C','C','A','B','C'],
'data':['cat','blue','seven', 'dog', 'green', 'red',
'eight','five', 'fish', 'pink', 'one']})
df
out:
index column data
0 1 A cat
1 1 B blue
2 1 C seven
3 2 A dog
4 2 B green
5 3 B red
6 2 C eight
7 4 C five
8 3 A fish
9 4 B pink
10 3 C one
df.pivot(index='index', columns='column', values='data')
out:
column A B C
index
1 cat blue seven
2 dog green eight
3 fish red one
4 NaN pink five
Option 2
If you append 'index' and 'column' to the index with set_index(..., append=True) and then unstack:
testing = df.set_index(['index', 'column'], append=True).unstack('column')
testing
testing
data
column A B C
index
0 1 cat NaN NaN
1 1 NaN blue NaN
2 1 NaN NaN seven
3 2 dog NaN NaN
4 2 NaN green NaN
5 2 NaN red NaN
6 2 NaN NaN eight
7 3 NaN NaN five
8 3 fish NaN NaN
9 3 NaN pink NaN
10 3 NaN NaN one
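Note that neither shape above reproduces the cartesian-product rows in the desired output (where index 2 yields four combinations). One hedged sketch for that, my own approach rather than one from the answers, splits the data into one small frame per column label and merges them back on 'index'; duplicate keys multiply under an inner merge, producing the cartesian product per index:

```python
from functools import reduce

import pandas as pd

df = pd.DataFrame({
    'index':  [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3],
    'column': ['A', 'B', 'C', 'A', 'B', 'B', 'C', 'C', 'A', 'B', 'C'],
    'data':   ['cat', 'blue', 'seven', 'dog', 'green', 'red',
               'eight', 'five', 'fish', 'pink', 'one'],
})

# One frame per column label: index plus that label's values.
parts = [g.drop(columns='column').rename(columns={'data': name})
         for name, g in df.groupby('column')]

# Successive inner merges on 'index' expand duplicate keys pairwise.
out = reduce(lambda left, right: left.merge(right, on='index'), parts)
print(out)
```

For the sample data this gives one row for index 1 and 3, and the four A/B/C combinations for index 2, matching the desired output.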