Pandas dataframe: merge rows into 1 row and sum a column - python

I have a pandas dataframe that contains a user id and the ad clicks (if any) by that user across several days:
df = pd.DataFrame([['A', 0], ['A', 1], ['A', 0], ['B', 0], ['B', 0], ['B', 0], ['B', 1], ['B', 1], ['B', 1]], columns=['user_id', 'click_count'])
Out[8]:
  user_id  click_count
0       A            0
1       A            1
2       A            0
3       B            0
4       B            0
5       B            0
6       B            1
7       B            1
8       B            1
I would like to convert this into a dataframe with one row per user, where 'click_cnt' is the sum of click_count across all of that user's rows in the original dataframe, i.e.
Out[18]:
  user_id  click_cnt
0       A          1
1       B          3

What you're after is the function groupby:
df = df.groupby('user_id', as_index=False).sum()
Adding the flag as_index=False will add the keys as a separate column instead of using them for the new index.
groupby is super useful - have a read through the documentation for more info.
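If you also want the summed column renamed to click_cnt as in the desired output, a minimal sketch building on the frame above:
# Group to one row per user, sum the clicks, and rename to match the target output
out = (df.groupby('user_id', as_index=False)['click_count'].sum()
         .rename(columns={'click_count': 'click_cnt'}))
print(out)
#   user_id  click_cnt
# 0       A          1
# 1       B          3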


pandas - creating new rows of combination with value of 0

I have a pandas dataframe like
user_id  music_id  has_rating
A        a         1
B        b         1
and I would like to automatically add new rows for each user_id & music_id pair the user hasn't rated, like
user_id  music_id  has_rating
A        a         1
A        b         0
B        a         0
B        b         1
covering every user_id/music_id combination that does not yet exist in my Pandas dataframe.
Is there any way to append such rows automatically?
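For reference, a minimal reconstruction of the input frame the answers below assume (column dtypes are assumed):
import pandas as pd

df = pd.DataFrame({'user_id': ['A', 'B'],
                   'music_id': ['a', 'b'],
                   'has_rating': [1, 1]})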
You can use a temporary reshape with pivot_table and fill_value=0 to fill the missing values with 0:
(df.pivot_table(index='user_id', columns='music_id',
                values='has_rating', fill_value=0)
   .stack().reset_index(name='has_rating')
)
Output:
  user_id music_id  has_rating
0       A        a           1
1       A        b           0
2       B        a           0
3       B        b           1
Try using pd.MultiIndex.from_product()
l = ['user_id', 'music_id']
(df.set_index(l)
   .reindex(pd.MultiIndex.from_product([df[l[0]].unique(), df[l[1]].unique()], names=l),
            fill_value=0)
   .reset_index())
Output:
  user_id music_id  has_rating
0       A        a           1
1       A        b           0
2       B        a           0
3       B        b           1

Selecting columns from two dataframes according to another column

I have 2 dataframes: one contains some general information about football players, and the second contains other information such as winning matches for each player. They both have an "id" column, but they are not the same length.
What I want to do is create a new dataframe with 2 columns: "x" from the first dataframe and "y" from the second, ONLY where the "id" column contains the same value in both dataframes. That way I can match the "x" and "y" columns that belong to the same person.
I tried to do it using the concat function:
pd.concat([firstdataframe['x'], seconddataframe['y']], axis=1, keys=['x', 'y'])
But I couldn't figure out how to apply the condition that "id" be equal in both dataframes.
It seems you need merge with the default inner join; note also that each value in the id columns has to be unique:
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id')
Sample:
df1 = pd.DataFrame({'id':[1,2,3],'x':[4,3,8]})
print (df1)
id x
0 1 4
1 2 3
2 3 8
df2 = pd.DataFrame({'id':[1,2],'y':[7,0]})
print (df2)
id y
0 1 7
1 2 0
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id')
print (df)
id x y
0 1 4 7
1 2 3 0
A solution with concat is possible, but a bit complicated, because it needs to join on indexes with an inner join:
df = (pd.concat([df1.set_index('id')['x'],
                 df2.set_index('id')['y']], axis=1, join='inner')
        .reset_index())
print (df)
id x y
0 1 4 7
1 2 3 0
EDIT:
If the ids are not unique, duplicates create all combinations and the output dataframe is expanded:
df1 = pd.DataFrame({'id':[1,2,3],'x':[4,3,8]})
print (df1)
id x
0 1 4
1 2 3
2 3 8
df2 = pd.DataFrame({'id':[1,2,1,1],'y':[7,0,4,2]})
print (df2)
id y
0 1 7
1 2 0
2 1 4
3 1 2
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id')
print (df)
id x y
0 1 4 7
1 1 4 4
2 1 4 2
3 2 3 0
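If you would rather have the merge fail loudly than silently expand on duplicate ids, merge accepts a validate argument (a sketch using the sample frames above):
# Raises MergeError here, because df2 contains duplicate ids
df = pd.merge(df1[['id', 'x']], df2[['id', 'y']], on='id', validate='one_to_one')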

copy row values based on result from isin() in pandas

I have 2 dataframes:
df_a = pd.DataFrame({'A':[1,2,3,4],'B':[4,5,6,7],'ID':['a','b','c','d']})
df_a
A B ID
0 1 4 a
1 2 5 b
2 3 6 c
3 4 7 d
df_b = pd.DataFrame({'A':[1,2,3],'ID':['b','a','c']})
df_b['CopyB'] = ""
A ID CopyB
0 1 b
1 2 a
2 3 c
Now I want to match the ID columns in both dataframes and, upon a successful match, copy the respective value of B from df_a into df_b['CopyB']. I tried df_b.loc[df_b['ID'].isin(df_a['ID']), 'CopyB'] = df_a['B'],
but that is not correct. Then I tried comparing the IDs using '==' but got an error, since the lengths of the ID series are not equal. Any help? Sorry if it is a very trivial query.
Use join:
df_b.join(df_a.set_index('ID').B, on='ID')
A ID B
0 1 b 5
1 2 a 4
2 3 c 6
join works on indices, so I set the index of df_a to be the ID column and access the B column; that way df_a.set_index('ID').B is a series whose values are the column I want to add and whose index is the merge column. Then I use join. I wouldn't have to specify an on parameter if ID were the index of df_b, but it isn't, so I do.
You could use merge
In [247]: df_b.merge(df_a, on=['ID'])
Out[247]:
A_x ID A_y B
0 1 b 2 5
1 2 a 1 4
2 3 c 3 6
In [248]: df_b.merge(df_a, on=['ID'])[['A_x', 'ID', 'B']].rename(columns={'A_x': 'A'})
Out[248]:
A ID B
0 1 b 5
1 2 a 4
2 3 c 6
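A small variation: merging only the columns you need from df_a avoids the A_x/A_y suffix collision in the first place.
# df_a is narrowed to ID and B, so column A never collides
df_b.merge(df_a[['ID', 'B']], on='ID')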
If you set the ID column as a common index for both dataframes you can easily add the column:
df_a = df_a.set_index('ID')
df_b = df_b.set_index('ID')
df_b['CopyB'] = df_a['B']
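Alternatively, keeping the original frames intact, map fills the existing CopyB column directly (a minimal sketch with the frames from the question):
# Look up each ID in df_b through an ID -> B mapping built from df_a
df_b['CopyB'] = df_b['ID'].map(df_a.set_index('ID')['B'])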

Copy pandas DataFrame row to multiple other rows

Simple and practical question, yet I can't find a solution.
The questions I took a look were the following:
Modifying a subset of rows in a pandas dataframe
Changing certain values in multiple columns of a pandas DataFrame at once
Fastest way to copy columns from one DataFrame to another using pandas?
Selecting with complex criteria from pandas.DataFrame
The key difference between those and mine is that I need to insert not a single value, but a whole row.
My problem is, I pick a row of a dataframe, say df1, so I have a series.
Now I have this other dataframe, df2, in which I have selected multiple rows according to a criterion, and I want to replicate that series to all those rows.
df1:
Index/Col A B C
1 0 0 0
2 0 0 0
3 1 2 3
4 0 0 0
df2:
Index/Col A B C
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
What I want to accomplish is inserting df1[3] into the rows df2[2] and df2[3], for example. So something like the non-working code:
series = df1[3]
df2[df2.index>=2 and df2.index<=3] = series
returning
df2:
Index/Col A B C
1 0 0 0
2 1 2 3
3 1 2 3
4 0 0 0
Use loc and pass a list of the index labels of interest; after the comma, the : indicates we want to set all column values. We then assign the series, but call the .values attribute so that it's a numpy array. Otherwise you will get a ValueError from the shape mismatch: you're overwriting 2 rows with a single row, and if it's a Series it won't align as you desire:
In [76]:
df2.loc[[2,3],:] = df1.loc[3].values
df2
Out[76]:
A B C
1 0 0 0
2 1 2 3
3 1 2 3
4 0 0 0
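As an aside, the original attempt also fails because Python's and cannot combine index comparisons elementwise; a boolean mask built with & works too (a sketch under the same frames):
# & combines the two conditions elementwise; .values avoids shape alignment issues
df2.loc[(df2.index >= 2) & (df2.index <= 3), :] = df1.loc[3].values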
If you have to copy certain rows and columns from one dataframe to another, you can do this:
# x:y are the row bounds and a:b are the column bounds to select
df2 = df.loc[x:y, a:b]

Drop Rows by Multiple Column Criteria in DataFrame

I have a pandas dataframe from which I'm trying to drop rows based on criteria across select columns. If the values in all of these select columns are zero, the row should be dropped. Here is an example.
import pandas as pd
t = pd.DataFrame({'a':[1,0,0,2],'b':[1,2,0,0],'c':[1,2,3,4]})
a b c
0 1 1 1
1 0 2 2
2 0 0 3
3 2 0 4
I would like to try something like:
cols_of_interest = ['a','b'] #Drop rows if zero in all these columns
t = t[t[cols_of_interest]!=0]
This doesn't drop the rows, so I tried:
t = t.drop(t[t[cols_of_interest]==0].index)
And all rows are dropped.
What I would like to end up with is:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
Where the 3rd row (index 2) was dropped because it took on value 0 in BOTH the columns of interest, not just one.
Your problem here is that you first assigned the result of your boolean condition, t = t[t[cols_of_interest]!=0], which overwrites your original df and puts NaN wherever the condition is not met.
What you want to do is generate the boolean mask, then drop the NaN rows, passing thresh=1 so that a row needs at least a single non-NaN value to survive; we can then use loc with the resulting index to get the desired df:
In [124]:
cols_of_interest = ['a','b']
t.loc[t[t[cols_of_interest]!=0].dropna(thresh=1).index]
Out[124]:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
EDIT
As pointed out by @DSM, you can achieve this simply by using any and passing axis=1 to test the condition, and use this to index into your df:
In [125]:
t[(t[cols_of_interest] != 0).any(axis=1)]
Out[125]:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
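An equivalent way to express this is to invert an all test: keep a row unless every column of interest is zero (a sketch with the same frame):
# all(axis=1) is True only where both a and b are 0; ~ keeps the rest
t[~(t[cols_of_interest] == 0).all(axis=1)]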
