Python Average of Multiple Columns and Rows

How do I group by two columns in a dataframe and specify other columns for which I want an overall average?
Data
name team a b c d
Bob blue 2 4 3 5
Bob blue 2 4 3 4
Bob blue 1 5 3 4
Bob green 1 3 2 5
Bob green 1 2 1 1
Bob green 1 2 1 4
Bob green 5 2 2 1
Jane red 1 2 2 3
Jane red 3 3 3 4
Jane red 2 5 1 2
Jane red 4 5 5 3
Desired Output
name team avg
Bob blue 3.333333333
Bob green 2.125
Jane red 3
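For reference, a minimal construction of this sample data (values copied from the table above):
import pandas as pd

df = pd.DataFrame({'name': ['Bob'] * 7 + ['Jane'] * 4,
                   'team': ['blue'] * 3 + ['green'] * 4 + ['red'] * 4,
                   'a': [2, 2, 1, 1, 1, 1, 5, 1, 3, 2, 4],
                   'b': [4, 4, 5, 3, 2, 2, 2, 2, 3, 5, 5],
                   'c': [3, 3, 3, 2, 1, 1, 2, 2, 3, 1, 5],
                   'd': [5, 4, 4, 5, 1, 4, 1, 3, 4, 2, 3]})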

You can call mean twice :-)
df.groupby(['name','team']).mean().mean(axis=1)
Out[1263]:
name team
Bob blue 3.333333
green 2.125000
Jane red 3.000000
dtype: float64

You need to set the index as the grouping columns and stack the remaining columns:
df.set_index(['name', 'team']).stack().groupby(level=[0, 1]).mean()
Out:
name team
Bob blue 3.333333
green 2.125000
Jane red 3.000000
dtype: float64
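Both answers return a Series with a MultiIndex. To reproduce the desired output as a frame with an avg column (avg is just the label used in the question), one way is:
df.groupby(['name', 'team']).mean().mean(axis=1).rename('avg').reset_index()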

Related

Is there a way to subtract a column's values according to another column?

I have this dataframe:
name color number
0 john red 4
1 ana red 4
2 ana red 5
3 paul red 6
4 mark red 3
5 ana yellow 10
6 john yellow 11
7 john yellow 12
8 john red 13
If the value in the color column changes (within a given name), I want to create another column with the subtraction between the last value associated with the old color and the first value of the new color. If the value in the color column doesn't change, return -999.
Ex:
Looking at ana, the last value for red is 5 and the first value for yellow is 10. So the new column will be 10 - 5 = 5 for ana.
Looking at john, the last value for red is 4 and the first value for yellow is 11. So the new column will be 11 - 4 = 7 for john. Do this only once; if the color changes again, it doesn't matter.
I want this output:
name color number difference
0 john red 4 7
1 ana red 4 5
2 ana red 5 5
3 paul red 6 -999
4 mark red 3 -999
5 ana yellow 10 5
6 john yellow 11 7
7 john yellow 12 7
8 john red 13 7
Can somebody please help me?
Try it this way:
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['john', 'ana', 'ana', 'paul', 'mark', 'ana', 'john', 'john', 'john'],
                   'color': ['red', 'red', 'red', 'red', 'red', 'yellow', 'yellow', 'yellow', 'red'],
                   'number': [4, 4, 5, 6, 3, 10, 11, 12, 13]})
# encode colors as integers so a change shows up in np.diff
df['color_code'] = df['color'].factorize()[0]
partial_df = pd.DataFrame()
partial_df['difference'] = df.groupby('name')['number'].apply(lambda x: list(np.diff(x))).explode()
partial_df['change_status'] = df.groupby('name')['color_code'].apply(lambda x: list((np.diff(x) > 0) + 0)).explode()
# keep the first color change per name and map its difference back
map_difference = (partial_df.loc[partial_df.change_status != 0]
                  .reset_index().drop_duplicates('name')
                  .set_index('name')['difference'])
df['difference'] = df['name'].map(map_difference).fillna(-999)
df
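A shift-based sketch of the same idea, operating on the df built above and assuming rows are already in chronological order within each name (the helper names prev_color, prev_number, and first_diff are illustrative):
g = df.groupby('name')
prev_color = g['color'].shift()    # previous color within each name
prev_number = g['number'].shift()  # previous number within each name
change = prev_color.notna() & df['color'].ne(prev_color)
# difference at each color change; keep only the first change per name
first_diff = (df['number'] - prev_number)[change].groupby(df['name']).first()
df['difference'] = df['name'].map(first_diff).fillna(-999)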

Map multiple columns using a Series from another DataFrame

I have two DataFrames. What I need is to replace the text in columns B, C, and D of df1 with the values from df2['SC'], based on the value of df2['Title'].
df1
A B C D
Dave Green Blue Yellow
Pete Red
Phil Purple
df2
A ID N SC Title
Dave 1 5 2 Green
Dave 1 10 2 Blue
Dave 1 15 3 Yellow
Pete 2 100 3 Red
Phil 3 200 4 Purple
Desired output:
A B C D
Dave 2 2 3
Pete 3
Phil 4
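For reference, a minimal construction of the two frames (the blank cells in df1 read as NaN):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': ['Dave', 'Pete', 'Phil'],
                    'B': ['Green', 'Red', 'Purple'],
                    'C': ['Blue', np.nan, np.nan],
                    'D': ['Yellow', np.nan, np.nan]})
df2 = pd.DataFrame({'A': ['Dave', 'Dave', 'Dave', 'Pete', 'Phil'],
                    'ID': [1, 1, 1, 2, 3],
                    'N': [5, 10, 15, 100, 200],
                    'SC': [2, 2, 3, 3, 4],
                    'Title': ['Green', 'Blue', 'Yellow', 'Red', 'Purple']})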
Using stack + map + unstack
df1.set_index('A').stack().map(df2.set_index('Title')['SC']).unstack()
B C D
A
Dave 2.0 2.0 3.0
Pete 3.0 NaN NaN
Phil 4.0 NaN NaN
If a column contains all NaN it will be lost (stack drops NaN). To avoid this you could reindex the columns, excluding 'A' since it is now the index:
.reindex(df1.columns.drop('A'), axis=1) # append to previous command
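Putting it together, a sketch of the full pipeline, restoring A as a column at the end:
result = (df1.set_index('A')
             .stack()
             .map(df2.set_index('Title')['SC'])
             .unstack()
             .reindex(df1.columns.drop('A'), axis=1)
             .reset_index())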

Updating existing dataframe columns

I have a data frame with the following structure:
code value
1 red
2 blue
3 yellow
1
4
4 pink
2 blue
Basically, I want to update the value column so that the blank rows are filled with values from other rows. Since I know that code 4 refers to the value pink, I want pink filled in on every code-4 row where the value is missing.
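For reference, a minimal construction of the frame (the blank cells read as NaN):
import numpy as np
import pandas as pd

df = pd.DataFrame({'code': [1, 2, 3, 1, 4, 4, 2],
                   'value': ['red', 'blue', 'yellow', np.nan, np.nan, 'pink', 'blue']})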
Using groupby with ffill and then bfill:
df.groupby('code').value.ffill().bfill()
0 red
1 blue
2 yellow
3 red
4 pink
5 pink
6 blue
Name: value, dtype: object
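This returns a Series, so write it back with df['value'] = df.groupby('code').value.ffill().bfill(). One caveat: the trailing bfill runs over the whole Series rather than within each group, which happens to be safe for this data. A strictly per-group variant would be:
# forward then backward fill within each code group (same result here)
df['value'] = df.groupby('code')['value'].transform(lambda s: s.ffill().bfill())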
You could use the first value of each code group:
In [379]: df.groupby('code')['value'].transform('first')
Out[379]:
0 red
1 blue
2 yellow
3 red
4 pink
5 pink
6 blue
Name: value, dtype: object
To assign back
In [380]: df.assign(value=df.groupby('code')['value'].transform('first'))
Out[380]:
code value
0 1 red
1 2 blue
2 3 yellow
3 1 red
4 4 pink
5 4 pink
6 2 blue
Or
df['value'] = df.groupby('code')['value'].transform('first')
You can create a series of your code-value pairs, and use that to map:
my_map = df[df['value'].notnull()].set_index('code')['value'].drop_duplicates()
df['value'] = df['code'].map(my_map)
>>> df
code value
0 1 red
1 2 blue
2 3 yellow
3 1 red
4 4 pink
5 4 pink
6 2 blue
Just to see what is happening, you are passing the following series to map:
>>> my_map
code
1 red
2 blue
3 yellow
4 pink
Name: value, dtype: object
So it says: "Where you find 1, give the value red, where you find 2, give blue..."
You can sort_values, ffill and then sort_index. The last step may not be necessary if order is not important. If it is, then the double sort may be unreasonably expensive.
df = df.sort_values(['code', 'value']).ffill().sort_index()
print(df)
code value
0 1 red
1 2 blue
2 3 yellow
3 1 red
4 4 pink
5 4 pink
6 2 blue
Using reindex
df.dropna().drop_duplicates('code').set_index('code').reindex(df.code).reset_index()
Out[410]:
code value
0 1 red
1 2 blue
2 3 yellow
3 1 red
4 4 pink
5 4 pink
6 2 blue

Insert rows in pandas where one column misses some value in groupby

Here's my dataframe:
user1 user2 cat quantity + other quantities
----------------------------------------------------
Alice Bob 0 ....
Alice Bob 1 ....
Alice Bob 2 ....
Alice Carol 0 ....
Alice Carol 2 ....
I want to make sure that every user1-user2 pair has a row corresponding to each category (there are three: 0,1,2). If not, I want to insert a row, and set the other columns to zero.
user1 user2 cat quantity + other quantities
----------------------------------------------------
Alice Bob 0 ....
Alice Bob 1 ....
Alice Bob 2 ....
Alice Carol 0 ....
Alice Carol 1 <SET ALL TO ZERO>
Alice Carol 2 ....
What I have so far is the list of all user1-user2 pairs that have fewer than 3 values for cat:
df.groupby(['user1','user2']).agg({'cat':'count'}).reset_index()[['user1','user2']]
I could iterate over these pairs, but that would take a long time (there are >1M such pairs). I've looked at other solutions for inserting rows in pandas based on some condition (like Pandas/Python adding row based on condition and Insert row in Pandas Dataframe based on a condition), but they're not exactly the same.
Also, since this is a huge dataset, the solution has to be vectorized. How should I proceed?
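For reference, a minimal construction of the frame used in the answer below (the quantity and a values come from the answer's printout):
import pandas as pd

df = pd.DataFrame({'user1': ['Alice'] * 5,
                   'user2': ['Bob', 'Bob', 'Bob', 'Carol', 'Carol'],
                   'cat': [0, 1, 2, 0, 2],
                   'quantity': [2, 3, 4, 6, 3],
                   'a': [4, 4, 4, 4, 4]})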
Use set_index with reindex by MultiIndex.from_product:
print (df)
user1 user2 cat quantity a
0 Alice Bob 0 2 4
1 Alice Bob 1 3 4
2 Alice Bob 2 4 4
3 Alice Carol 0 6 4
4 Alice Carol 2 3 4
df = df.set_index(['user1','user2', 'cat'])
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux, fill_value=0).reset_index()
print (df)
user1 user2 cat quantity a
0 Alice Bob 0 2 4
1 Alice Bob 1 3 4
2 Alice Bob 2 4 4
3 Alice Carol 0 6 4
4 Alice Carol 1 0 0
5 Alice Carol 2 3 4
Another solution is to create a new DataFrame from all combinations of the unique column values and merge with a right join:
from itertools import product

df1 = pd.DataFrame(list(product(df['user1'].unique(),
                                df['user2'].unique(),
                                df['cat'].unique())),
                   columns=['user1', 'user2', 'cat'])
df = df.merge(df1, how='right').fillna(0)
print (df)
user1 user2 cat quantity a
0 Alice Bob 0 2.0 4.0
1 Alice Bob 1 3.0 4.0
2 Alice Bob 2 4.0 4.0
3 Alice Carol 0 6.0 4.0
4 Alice Carol 2 3.0 4.0
5 Alice Carol 1 0.0 0.0
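Note that the right-join route upcasts quantity and a to float, since NaN values exist before the fillna. If integer columns are needed, downcast afterwards:
df[['quantity', 'a']] = df[['quantity', 'a']].astype(int)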
EDIT2: To avoid building the product over every user1 × user2 combination (including pairs that never occur together), combine the two columns into a single key first:
df['user1'] = df['user1'] + '_' + df['user2']
df = df.set_index(['user1', 'cat']).drop(columns='user2')
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux, fill_value=0).reset_index()
df[['user1','user2']] = df['user1'].str.split('_', expand=True)
print (df)
user1 cat quantity a user2
0 Alice 0 2 4 Bob
1 Alice 1 3 4 Bob
2 Alice 2 4 4 Bob
3 Alice 0 6 4 Carol
4 Alice 1 0 0 Carol
5 Alice 2 3 4 Carol
EDIT3: Alternatively, reindex cat within each user1-user2 group, avoiding the cross product entirely:
cols = df.columns.difference(['user1','user2'])
df = (df.groupby(['user1','user2'])[cols]
        .apply(lambda x: x.set_index('cat').reindex(df['cat'].unique(), fill_value=0))
        .reset_index())
print (df)
user1 user2 cat a quantity
0 Alice Bob 0 4 2
1 Alice Bob 1 4 3
2 Alice Bob 2 4 4
3 Alice Carol 0 4 6
4 Alice Carol 1 0 0
5 Alice Carol 2 4 3

python: pandas: Merge multiple tables according to an index table

For example, I have three tables A, B, C
Table A:
id1 value1
1 23
2 34
3 2342
4 333
Table B:
id2 value2
1 apple
2 banana
3 berry
Table C:
id3 value3 value4
1 red batman
2 green superman
3 white wonder woman
4 gray aquaman
5 yellow flash
I want to merge these three tables according to an index table D
Table D:
Table_A  Table_B  Table_C
1        3        2
3                 4
2        2        3
4        1        1
                  5
And my resulting table should look like:
id1  value1  id2  value2  id3  value3  value4
1    23      3    berry   2    green   superman
3    2342                 4    gray    aquaman
2    34      2    banana  3    white   wonder woman
4    333     1    apple   1    red     batman
                          5    yellow  flash
Can I do it with Python pandas, or do I need to do it in Spark?
Let's try:
table_d['value1'] = table_d['Table_A'].map(table_a.set_index('id1')['value1'])
table_d['value2'] = table_d['Table_B'].map(table_b.set_index('id2')['value2'])
table_d.merge(table_c, left_on='Table_C', right_on='id3')
Output:
Table_A Table_B Table_C value1 value2 id3 value3 value4
0 1.0 3.0 2 23.0 berry 2 green superman
1 3.0 NaN 4 2342.0 NaN 4 gray aquaman
2 2.0 2.0 3 34.0 banana 3 white wonder woman
3 4.0 1.0 1 333.0 apple 1 red batman
4 NaN NaN 5 NaN NaN 5 yellow flash
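To match the desired column layout exactly, a possible finishing step (out is just a name for the merge result):
out = (table_d.merge(table_c, left_on='Table_C', right_on='id3')
              .rename(columns={'Table_A': 'id1', 'Table_B': 'id2'})
              [['id1', 'value1', 'id2', 'value2', 'id3', 'value3', 'value4']])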
