I have a data frame with the following structure:
code value
1 red
2 blue
3 yellow
1
4
4 pink
2 blue
Basically I want to update the value column so that the blank rows are filled in from other rows. Since I know that code 4 refers to the value pink, I want pink filled in on every row with code 4 where the value is missing.
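For reference, the example frame can be reproduced like this (a sketch; I'm assuming the blank cells are read in as NaN, which is how pandas represents missing strings):

```python
import numpy as np
import pandas as pd

# the question's frame, with the blank value cells as NaN
df = pd.DataFrame({
    'code': [1, 2, 3, 1, 4, 4, 2],
    'value': ['red', 'blue', 'yellow', np.nan, np.nan, 'pink', 'blue'],
})
```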
Using groupby with ffill and bfill:
df.groupby('code').value.ffill().bfill()
0 red
1 blue
2 yellow
3 red
4 pink
5 pink
6 blue
Name: value, dtype: object
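One caveat with the chain above: only the ffill runs per group; the trailing .bfill() operates on the whole column, so on other data a value could leak across code groups. A strictly per-group sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'code': [1, 2, 3, 1, 4, 4, 2],
    'value': ['red', 'blue', 'yellow', np.nan, np.nan, 'pink', 'blue'],
})

# fill forward and backward inside each code group only,
# so nothing can spill over from a neighbouring group
filled = df.groupby('code')['value'].transform(lambda s: s.ffill().bfill())
```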
You could use the first value of each code group:
In [379]: df.groupby('code')['value'].transform('first')
Out[379]:
0 red
1 blue
2 yellow
3 red
4 pink
5 pink
6 blue
Name: value, dtype: object
To assign back
In [380]: df.assign(value=df.groupby('code')['value'].transform('first'))
Out[380]:
code value
0 1 red
1 2 blue
2 3 yellow
3 1 red
4 4 pink
5 4 pink
6 2 blue
Or
df['value'] = df.groupby('code')['value'].transform('first')
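If some code never appears with a value anywhere, transform('first') leaves those rows as NaN; a trailing fillna can supply a default (a sketch; 'unknown' is an arbitrary placeholder, and the extra code 5 row is added only to show the case):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'code': [1, 2, 3, 1, 4, 4, 2, 5],
                   'value': ['red', 'blue', 'yellow', np.nan,
                             np.nan, 'pink', 'blue', np.nan]})

# code 5 has no known value anywhere, so 'first' leaves NaN there
df['value'] = df.groupby('code')['value'].transform('first').fillna('unknown')
```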
You can create a series of your code-value pairs, and use that to map (de-duplicating by code so the index is unique, even if two different codes happen to share the same value):
my_map = df[df['value'].notnull()].drop_duplicates('code').set_index('code')['value']
df['value'] = df['code'].map(my_map)
>>> df
code value
0 1 red
1 2 blue
2 3 yellow
3 1 red
4 4 pink
5 4 pink
6 2 blue
Just to see what is happening, you are passing the following series to map:
>>> my_map
code
1 red
2 blue
3 yellow
4 pink
Name: value, dtype: object
So it says: "Where you find 1, give the value red, where you find 2, give blue..."
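If you only want to fill the blanks and keep whatever non-null values are already there (in case any disagree with the map), combine map with fillna instead of overwriting the whole column (a sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'code': [1, 2, 3, 1, 4, 4, 2],
                   'value': ['red', 'blue', 'yellow', np.nan,
                             np.nan, 'pink', 'blue']})

my_map = df[df['value'].notnull()].drop_duplicates('code').set_index('code')['value']
# only the NaN slots are replaced; existing entries are left untouched
df['value'] = df['value'].fillna(df['code'].map(my_map))
```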
You can sort_values, ffill, and then sort_index. The last step isn't necessary if order doesn't matter; if it does, the double sort may be expensive on large frames. Note this also assumes every code has at least one non-null value, otherwise ffill can pull a value in from the preceding code group.
df = df.sort_values(['code', 'value']).ffill().sort_index()
print(df)
code value
0 1 red
1 2 blue
2 3 yellow
3 1 red
4 4 pink
5 4 pink
6 2 blue
Using reindex
df.dropna().drop_duplicates('code').set_index('code').reindex(df.code).reset_index()
Out[410]:
code value
0 1 red
1 2 blue
2 3 yellow
3 1 red
4 4 pink
5 4 pink
6 2 blue
Related
I'm trying to compute a cumulative sum in Python based on two different conditions.
As you can see in the data below, the Calculation column takes the same value as the Number column as long as the Cat1 and Cat2 columns don't change.
Once the Cat1 column changes, the Number column resets and the calculation starts over.
When only the Cat2 column changes (with the same Cat1 value), the Calculation column carries on from the last Number value and keeps adding from there.
Example of data below:
Cat1 Cat2 Number CALCULATION
a orange 1 1
a orange 2 2
a orange 3 3
a orange 4 4
a orange 5 5
a orange 6 6
a orange 7 7
a orange 8 8
a orange 9 9
a orange 10 10
a orange 11 11
a orange 12 12
a orange 13 13
b purple 1 1
b purple 2 2
b purple 3 3
b purple 4 4
b purple 5 5
b purple 6 6
b purple 7 7
b purple 8 8
b silver 1 9
b silver 2 10
b silver 3 11
b silver 4 12
b silver 5 13
b silver 6 14
b silver 7 15
Are you looking for:
import pandas as pd
df = pd.DataFrame({'Cat1': ['a'] * 13 + ['b'] * 15,
                   'Cat2': ['orange'] * 13 + ['purple'] * 8 + ['silver'] * 7})
df['Number'] = df.groupby(['Cat1', 'Cat2']).cumcount()+1
df['CALCULATION'] = df.groupby('Cat1').cumcount()+1
I have this dataframe:
name color number
0 john red 4
1 ana red 4
2 ana red 5
3 paul red 6
4 mark red 3
5 ana yellow 10
6 john yellow 11
7 john yellow 12
8 john red 13
If the value in the color column changes (per name), I want to create another column with the difference between the first value of the new color and the last value of the old color. If the color never changes for a name, return -999.
Ex:
Looking at ana, the last value for red is 5 and the first value for yellow is 10. So the new column will be 10 - 5 = 5 for ana.
Looking at john, the last value for red is 4 and the first value for yellow is 11. So the new column will be 11 - 4 = 7 for john. Do that just one time; if the color changes again, it doesn't matter.
I want this output:
name color number difference
0 john red 4 7
1 ana red 4 5
2 ana red 5 5
3 paul red 6 -999
4 mark red 3 -999
5 ana yellow 10 5
6 john yellow 11 7
7 john yellow 12 7
8 john red 13 7
please, somebody help me?
Try it this way:
import pandas as pd
import numpy as np

df = pd.DataFrame({'name':['john','ana','ana','paul','mark','ana','john','john','john'],
'color':['red','red','red','red','red','yellow','yellow','yellow','red'],
'number':[4,4,5,6,3,10,11,12,13]})
df['color_code'] = df['color'].factorize()[0]
partial_df = pd.DataFrame()
partial_df['difference'] = df.groupby('name')['number'].apply(lambda x: list(np.diff(x))).explode()
partial_df['change_status'] = df.groupby('name')['color_code'].apply(lambda x: list((np.diff(x)>0)+0)).explode()
map_difference = partial_df.loc[partial_df.change_status != 0].reset_index().drop_duplicates('name').set_index('name')['difference']
df['difference'] = df.name.copy().map(map_difference).fillna(-999)
df
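A different sketch of the same idea, for comparison: find each name's first row where the color differs from the previous row, and subtract the adjacent numbers directly with shift (column names as in the question; not claiming this is faster, just more direct):

```python
import pandas as pd

df = pd.DataFrame({'name': ['john', 'ana', 'ana', 'paul', 'mark',
                            'ana', 'john', 'john', 'john'],
                   'color': ['red', 'red', 'red', 'red', 'red',
                             'yellow', 'yellow', 'yellow', 'red'],
                   'number': [4, 4, 5, 6, 3, 10, 11, 12, 13]})

def first_change(g):
    # rows where the color differs from this name's previous row
    changed = g['color'].ne(g['color'].shift()) & g['color'].shift().notna()
    if not changed.any():
        return -999                              # color never changes for this name
    pos = g.index.get_loc(changed.idxmax())      # position of the first change
    return g['number'].iloc[pos] - g['number'].iloc[pos - 1]

per_name = df.groupby('name')[['color', 'number']].apply(first_change)
df['difference'] = df['name'].map(per_name)
```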
I have two dataframes - one large dataframe with multiple categorical columns and one column with missing values, and another that's sort of a dictionary with the same categorical columns and one column with a key value.
Essentially, I want to fill the missing values in the large dataframe with the key value in the second if all the categorical columns match.
Missing value df:
Color Number Letter Value
0 Red 2 B NaN
1 Green 2 A NaN
2 Red 2 B NaN
3 Red 1 B NaN
4 Green 1 A NaN
5 Red 2 B NaN
6 Green 1 B NaN
7 Green 2 A NaN
Dictionary df:
Color Number Letter Value
0 Red 1 A 10
1 Red 1 B 4
2 Red 2 A 3
3 Red 2 B 15
4 Green 1 A 21
5 Green 1 B 9
6 Green 2 A 22
7 Green 2 B 1
Desired df:
0 Red 2 B 15
1 Green 2 A 22
2 Red 2 B 15
3 Red 1 B 4
4 Green 1 A 21
5 Red 2 B 15
6 Green 1 B 9
7 Green 2 A 22
I'm not sure if I should have the 'dictionary df' as an actual dictionary, or keep it as a dataframe (it's pulled from a csv).
Is this possible to do cleanly without a myriad of if else statements?
Thanks!
Does this work?
>>> df_1[['Color', 'Number', 'Letter']].merge(df_2,
... on=('Color', 'Number', 'Letter'),
... how='left')
Color Number Letter Value
0 Red 2 B 15
1 Green 2 A 22
2 Red 2 B 15
3 Red 1 B 4
4 Green 1 A 21
5 Red 2 B 15
6 Green 1 B 9
7 Green 2 A 22
Thought it worth mentioning: a very simple way to convert examples from Stack Overflow pandas questions into a dataframe is to cut and paste them into a string like this:
>>> from io import StringIO
>>> df_1 = pd.read_csv(StringIO("""
... Color Number Letter Value
... 0 Red 2 B NaN
... 1 Green 2 A NaN
... 2 Red 2 B NaN
... 3 Red 1 B NaN
... 4 Green 1 A NaN
... 5 Red 2 B NaN
... 6 Green 1 B NaN
... 7 Green 2 A NaN
... """), sep=r'\s+')
Try:
missing_df.reset_index()[['index', 'Color', 'Number', 'Letter']]\
.merge(dict_df, on = ['Color', 'Number', 'Letter'])\
.set_index('index').reindex(missing_df.index)
Output:
Color Number Letter Value
0 Red 2 B 15
1 Green 2 A 22
2 Red 2 B 15
3 Red 1 B 4
4 Green 1 A 21
5 Red 2 B 15
6 Green 1 B 9
7 Green 2 A 22
I will be calling the missing-value df: df, and the dictionary df: ddf, keeping both as dataframes.
Drop the all-null Value column from df and merge ddf back in on the three categorical columns, which should do the task for you:
df.drop(['Value'], axis=1).merge(ddf, on=['Color', 'Number', 'Letter'], how='left')
I'm having issues with pivoting the below data
index column data
0 1 A cat
1 1 B blue
2 1 C seven
3 2 A dog
4 2 B green
5 2 B red
6 2 C eight
7 2 C five
8 3 A fish
9 3 B pink
10 3 C one
I am attempting to pivot it by using
df.pivot(index='index', columns='column', values="data")
But I receive the error "Index contains duplicate entries, cannot reshape"
I have looked through a large number of similar posts, but none of the solutions I tried worked.
My desired output is
index A B C
1 cat blue seven
2 dog green eight
2 dog green five
2 dog red eight
2 dog red five
3 fish pink one
What would be the best solution for this?
In the question Pandas pivot warning about repeated entries on index it is explained that duplicate pairs (i.e. duplicated combinations of the 'index' and 'column' columns) cannot be pivoted.
In your dataset, index 2 has the column values B and C twice each.
Can you change the 'index' column?
See my new dataframe as an example:
df = pd.DataFrame({'index': [1,1,1,2,2,3,2,4,3,4,3],
'column': ['A','B','C','A','B','B','C','C','A','B','C'],
'data':['cat','blue','seven', 'dog', 'green', 'red',
'eight','five', 'fish', 'pink', 'one']})
df
out:
index column data
0 1 A cat
1 1 B blue
2 1 C seven
3 2 A dog
4 2 B green
5 3 B red
6 2 C eight
7 4 C five
8 3 A fish
9 4 B pink
10 3 C one
df.pivot(index='index', columns='column', values='data')
out:
column A B C
index
1 cat blue seven
2 dog green eight
3 fish red one
4 NaN pink five
Option 2
If you use set_index with append=True and then unstack, the duplicates are no longer a problem, because the original row number stays part of the index:
testing = df.set_index(['index', 'column'],
append=True).unstack('column')
testing
data
column A B C
index
0 1 cat NaN NaN
1 1 NaN blue NaN
2 1 NaN NaN seven
3 2 dog NaN NaN
4 2 NaN green NaN
5 2 NaN red NaN
6 2 NaN NaN eight
7 2 NaN NaN five
8 3 fish NaN NaN
9 3 NaN pink NaN
10 3 NaN NaN one
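As for the cross-product output the question actually asks for (every combination of the A, B and C values within each index), one sketch is a chain of self-merges instead of a pivot (assuming every index has at least one row for each of A, B and C, otherwise that index drops out of the inner merge):

```python
from functools import reduce

import pandas as pd

df = pd.DataFrame({'index': [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3],
                   'column': ['A', 'B', 'C', 'A', 'B', 'B', 'C', 'C', 'A', 'B', 'C'],
                   'data': ['cat', 'blue', 'seven', 'dog', 'green', 'red',
                            'eight', 'five', 'fish', 'pink', 'one']})

# one small frame per column label, then inner merges on 'index'
# build the cartesian product of values within each index
parts = [df.loc[df['column'] == c, ['index', 'data']].rename(columns={'data': c})
         for c in ['A', 'B', 'C']]
out = reduce(lambda left, right: left.merge(right, on='index'), parts)
```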
How do I group by two columns in a dataframe and specify other columns for which I want an overall average?
Data
name team a b c d
Bob blue 2 4 3 5
Bob blue 2 4 3 4
Bob blue 1 5 3 4
Bob green 1 3 2 5
Bob green 1 2 1 1
Bob green 1 2 1 4
Bob green 5 2 2 1
Jane red 1 2 2 3
Jane red 3 3 3 4
Jane red 2 5 1 2
Jane red 4 5 5 3
Desired Output
name team avg
Bob blue 3.333333333
Bob green 2.125
Jane red 3
You can mean two times :-)
df.groupby(['name','team']).mean().mean(axis=1)
Out[1263]:
name team
Bob blue 3.333333
green 2.125000
Jane red 3.000000
dtype: float64
You need to set the index as the grouping columns and stack the remaining columns:
df.set_index(['name', 'team']).stack().groupby(level=[0, 1]).mean()
Out:
name team
Bob blue 3.333333
green 2.125000
Jane red 3.000000
dtype: float64
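If you want the exact three-column layout from the question (name, team, avg), either result can be turned back into a flat frame (a sketch, rebuilding the question's data first):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Bob'] * 7 + ['Jane'] * 4,
                   'team': ['blue'] * 3 + ['green'] * 4 + ['red'] * 4,
                   'a': [2, 2, 1, 1, 1, 1, 5, 1, 3, 2, 4],
                   'b': [4, 4, 5, 3, 2, 2, 2, 2, 3, 5, 5],
                   'c': [3, 3, 3, 2, 1, 1, 2, 2, 3, 1, 5],
                   'd': [5, 4, 4, 5, 1, 4, 1, 3, 4, 2, 3]})

# stack all value columns into one long series, average per (name, team),
# then name the result and flatten the index back into columns
out = (df.set_index(['name', 'team'])
         .stack()
         .groupby(level=[0, 1])
         .mean()
         .rename('avg')
         .reset_index())
```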