In a Jupyter notebook, I have a DataFrame created by merging several datasets.
record_id | song_id | user_id   | number_times_listened
0         | ABC     | Shjkn4987 | 3
1         | ABC     | Dsfds2347 | 15
2         | ABC     | Fkjhh9849 | 7
3         | XYZ     | Shjkn4987 | 20
4         | XXX     | Shjkn4987 | 5
5         | XXX     | Swjdh0980 | 1
I would like to create a pivot table dataframe by song_id listing the number of user_ids and the sum of number_times_listened.
I know that I need to create a for loop with the count and sum functions, but I cannot make it work. I also tried the pandas module's pd.pivot_table.
df = pd.pivot_table(data, index='song_ID', columns='userID', values='number_times_listened', aggfunc='sum')
OR something like this?
total_user=[]
total_times_listened =[]
for x in data:
total_user.append(sum('user_id'))
total_times_listened.append(count('number_times_listened'))
return df('song_id','total_user','total_times_listened')
You can pass a dictionary of column names as keys and a list of functions as values:
funcs = {'number_times_listened':['sum'], 'user_id':['count']}
Then simply use df.groupby on column song_id:
df.groupby('song_id').agg(funcs)
The output:
        number_times_listened user_id
                          sum   count
song_id
ABC                        25       3
XXX                         6       2
XYZ                        20       1
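If flat column names are preferred over the two-level header above, named aggregation is one option. This is a minimal sketch on the sample data from the question; the output names total_users and total_times_listened are illustrative, and 'nunique' is used on the assumption that "number of user_ids" means distinct users per song.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "song_id": ["ABC", "ABC", "ABC", "XYZ", "XXX", "XXX"],
    "user_id": ["Shjkn4987", "Dsfds2347", "Fkjhh9849",
                "Shjkn4987", "Shjkn4987", "Swjdh0980"],
    "number_times_listened": [3, 15, 7, 20, 5, 1],
})

# Named aggregation: each keyword becomes one flat output column.
# The keyword names are illustrative, not required by pandas.
summary = (
    df.groupby("song_id")
      .agg(total_users=("user_id", "nunique"),
           total_times_listened=("number_times_listened", "sum"))
      .reset_index()
)
print(summary)
```

Swapping 'nunique' for 'count' would count rows instead of distinct users; on this sample both give the same numbers.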
Not sure if this is related but the column names and casing in your example don't match your Python code.
In any case, the following works for me on Python 2.7:
CSV File:
record_id song_id user_id number_times_listened
0 ABC Shjkn4987 3
1 ABC Dsfds2347 15
2 ABC Fkjhh9849 7
3 XYZ Shjkn4987 20
4 XXX Shjkn4987 5
5 XXX Swjdh0980 1
Python code:
csv_data = pd.read_csv('songs.csv')
df = pd.pivot_table(csv_data, index='song_id', columns='user_id', values='number_times_listened', aggfunc='sum').fillna(0)
The resulting pivot table looks like:
user_id  Dsfds2347  Fkjhh9849  Shjkn4987  Swjdh0980
song_id
ABC             15          7          3          0
XXX              0          0          5          1
XYZ              0          0         20          0
Is this what you're looking for? Keep in mind that song_id, user_id pairs are unique in your dataset, so the aggregate function isn't actually doing anything in this specific example since there's nothing to group by on these two columns.
Hello!
I have loaded a few datasets. The only thing they have in common is that they share some column names, but the number of columns and rows and the data itself are different, so it looks like I cannot use merge or concat because there is nothing in common to join on, such as an ID. I want to stack each df on top of the other and leave the "extra" columns filled with NaN values.
df1:
| Column A | Column B |
| -------- | -------- |
| ID 1 | Cell 2 |
| ID 2 | Cell 4 |
df2:
| Column A | Column B | ColumnC |
| -------- | -------- | ------- |
| ID 3 | Cell 2 | info |
| ID 4 | Cell 4 | info |
I want something like this:
df:
| Column A | Column B | ColumnC |
| -------- | -------- | ------- |
| ID 1 | Cell 2 | NaN |
| ID 2 | Cell 4 | NaN |
| ID 3 | Cell 2 | info |
| ID 4 | Cell 4 | info |
Thanks a lot for your time!
I have tried something like df = pd.concat(['df1','df2'], axis=1), and also merge.
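A minimal sketch of the stacking described above, assuming df1 and df2 hold the sample rows from the question. Two things differ from the attempt: pd.concat receives the DataFrame objects themselves rather than their names as strings, and the default axis=0 is used so the rows are stacked instead of placed side by side.

```python
import pandas as pd

# Sample frames rebuilt from the question.
df1 = pd.DataFrame({"Column A": ["ID 1", "ID 2"],
                    "Column B": ["Cell 2", "Cell 4"]})
df2 = pd.DataFrame({"Column A": ["ID 3", "ID 4"],
                    "Column B": ["Cell 2", "Cell 4"],
                    "ColumnC": ["info", "info"]})

# Stack row-wise; columns missing from one frame are filled with NaN.
# ignore_index=True renumbers the rows 0..n-1.
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked)
```

No shared ID is needed here, because concat aligns on column names only.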
I am matching two large datasets and trying to perform update, remove, and create operations on the original dataset by comparing it with the other one. How can I update 2 or 3 of the 10 columns in the original dataset and keep the other columns' values unchanged?
I tried merge, but to no avail.
Original data:
id | full_name | date
1 | John | 02-23-2006
2 | Paul Elbert | 09-29-2001
3 | Donag | 11-12-2013
4 | Tom Holland | 06-17-2016
other data:
id | full_name | date
1 | John | 02-25-2018
2 | Paul | 03-09-2001
3 | Donag | 07-09-2017
4 | Tom | 05-09-2016
After trying this and checking manually, I did not get the expected results:
original[['id']].merge(other[['id','date']],on='id')
Can I solve this problem with map? When the IDs match, I want to update all values in the date column without changing any value in the full_name column of the original dataset.
Use pandas.Series.map:
df['date'] = df['id'].map(other_df.set_index('id')['date'])
print(df)
id full_name date
0 1 John 02-25-2018
1 2 Paul Elbert 03-09-2001
2 3 Donag 07-09-2017
3 4 Tom Holland 05-09-2016
To apply the update only where another condition holds:
cond = df['status'].str.contains('new')  # assumes df has a 'status' column
df.loc[cond, 'date'] = df.loc[cond, 'id'].map(other_df.set_index('id')['date'])
Pandas' DataFrame.update does this, if you properly set id as your index on both the original and other:
original.update(other[["date"]])
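A self-contained sketch of the update approach, using the sample tables from the question. The set_index calls are the "properly set id as your index" step; selecting only the date column keeps full_name untouched.

```python
import pandas as pd

# Sample data copied from the question.
original = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "full_name": ["John", "Paul Elbert", "Donag", "Tom Holland"],
    "date": ["02-23-2006", "09-29-2001", "11-12-2013", "06-17-2016"],
})
other = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "full_name": ["John", "Paul", "Donag", "Tom"],
    "date": ["02-25-2018", "03-09-2001", "07-09-2017", "05-09-2016"],
})

# Align both frames on id so update() matches rows by id, not by position.
original = original.set_index("id")
other = other.set_index("id")

# update() modifies `original` in place and returns None.
original.update(other[["date"]])
print(original.reset_index())
```

Because update works in place, assigning its return value back to original would replace the frame with None.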
I have data in python about several individuals (Person 1,2,3,4,5) and several groups (groups A,B,C). I have a table (currently as a pandas dataframe) of the initial state (time == 0) of the individuals and groups:
Person | Group
-------|-------
1 | A
2 | A
3 | C
4 | B
5 | B
And a table (also a pandas DF) of people changing groups. The table includes the person, their new group, and the time of the change.
Person | New Group | Time
-------|-----------|------
1 | B | 10
1 | A | 12
3 | A | 13
4 | C | 13
1 | C | 22
5 | A | 30
I need to write a function that can return a list of the people in a group at a given time
people = people_in_group(group, time) # type(people) == list
and a function that can return the group a person is in at a given time
group = group_member(person, time)
What is the most appropriate data structure to build from these two tables that would make it easiest to query in both directions like this?
In your current df no two changes share both a group and a time, so each list of people in a group would contain a single person.
Given a dataframe like this with duplicate time values:
Person New Time
1 1 B 10
2 1 A 12
3 3 A 12
4 4 C 13
5 1 C 22
6 5 A 30
df.groupby(['New', 'Time']).Person.apply(list)
Gives
New  Time
A    12      [1, 3]
     30         [5]
B    10         [1]
C    13         [4]
     22         [1]
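For the two query functions the question asks for, one possible sketch is to combine both tables into a single event log and look up each person's latest event at or before the requested time. It uses the question's original tables and column names; the helper membership_at is a hypothetical name, and it assumes a change takes effect at its own Time value (inclusive).

```python
import pandas as pd

# Tables copied from the question.
initial = pd.DataFrame({"Person": [1, 2, 3, 4, 5],
                        "Group": ["A", "A", "C", "B", "B"]})
changes = pd.DataFrame({"Person": [1, 1, 3, 4, 1, 5],
                        "New Group": ["B", "A", "A", "C", "C", "A"],
                        "Time": [10, 12, 13, 13, 22, 30]})

# One long event log: the initial state at Time 0 plus every change,
# sorted so that each person's latest event comes last.
events = pd.concat(
    [initial.assign(Time=0),
     changes.rename(columns={"New Group": "Group"})],
    ignore_index=True,
).sort_values("Time", kind="stable")

def membership_at(time):
    """Series mapping each person to their group at the given time."""
    seen = events[events["Time"] <= time]
    return seen.groupby("Person")["Group"].last()

def group_member(person, time):
    """Group that `person` belongs to at `time`."""
    return membership_at(time).loc[person]

def people_in_group(group, time):
    """List of people who belong to `group` at `time`."""
    current = membership_at(time)
    return current[current == group].index.tolist()

print(people_in_group("A", 13))  # [1, 2, 3]
print(group_member(1, 11))       # B
```

Each call rescans the event log, which is simple but not the fastest choice for very large tables.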
I am stuck on the following problem and could not resolve it myself or with the similar questions I found on Stack Overflow.
To keep it simple, I'll give a short example of my problem:
I have a DataFrame with several columns, one of which holds the ID of a user. The same user may have several entries in this DataFrame:
| | userID | col2 | col3 |
+---+-----------+----------------+-------+
| 1 | 1 | a | b |
| 2 | 1 | c | d |
| 3 | 2 | a | a |
| 4 | 3 | d | e |
Something like this. Now I want to know the number of rows that belong to a certain userID. For this I tried df.groupby('userID').size(), whose result I want to use in another simple calculation, such as a division.
But when I try to save the result of the calculation in a separate column, I keep getting NaN values.
Is there a way to get the result of the calculation in a separate column?
Thanks for your help!
edit//
To clarify what my output should look like: the DataFrame above is my main DataFrame. Besides it, I have a second DataFrame that looks like this:
| | userID | value | value/appearances |
+---+-----------+----------------+-------+
| 1 | 1 | 10 | 10 / 2 = 5 |
| 3 | 2 | 20 | 20 / 1 = 20 |
| 4 | 3 | 30 | 30 / 1 = 30 |
So I basically want in the column 'value/appearances' to have the result of the number in the value column divided by the number of appearances of this certain user in the main dataframe. For user with ID=1 this would be 10/2, as this user has a value of 10 and has 2 rows in the main dataframe.
I hope this makes it a bit clearer.
IIUC you want to do the following, groupby on 'userID' and call transform on the grouped column and pass 'size' to identify the method to call:
In [54]:
df['size'] = df.groupby('userID')['userID'].transform('size')
df
Out[54]:
userID col2 col3 size
1 1 a b 2
2 1 c d 2
3 2 a a 1
4 3 d e 1
What you tried:
In [55]:
df.groupby('userID').size()
Out[55]:
userID
1 2
2 1
3 1
dtype: int64
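When this Series is assigned back to the df, pandas aligns it on the df's index (1 to 4) rather than on userID, so the counts land on the wrong rows and the last row, which has no matching label, becomes NaN: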
In [57]:
df['size'] = df.groupby('userID').size()
df
Out[57]:
userID col2 col3 size
1 1 a b 2
2 1 c d 1
3 2 a a 1
4 3 d e NaN
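For the second frame described in the edit to the question, the same alignment issue can be sidestepped by looking counts up by userID with map. A minimal sketch, assuming the second frame is called values (a hypothetical name) and has the userID and value columns shown in the question:

```python
import pandas as pd

# Main frame and second frame rebuilt from the question.
df = pd.DataFrame({"userID": [1, 1, 2, 3],
                   "col2": ["a", "c", "a", "d"],
                   "col3": ["b", "d", "a", "e"]},
                  index=[1, 2, 3, 4])
values = pd.DataFrame({"userID": [1, 2, 3], "value": [10, 20, 30]},
                      index=[1, 3, 4])

# size() is indexed by userID, so map() looks up each user's row count
# by ID instead of relying on the frames' row labels lining up.
appearances = df.groupby("userID").size()
values["value/appearances"] = values["value"] / values["userID"].map(appearances)
print(values)
```

On the sample data this yields 5.0, 20.0 and 30.0, matching the expected column in the question.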
I have a dataframe where I have transformed all NaN to 0 for a specific reason. In doing another calculation on the df, my group by is picking up a 0 and making it a value to perform the counts on. Any idea how to get python and pandas to exclude the 0 value? In this case the 0 represents a single row in the data. Is there a way to exclude all 0's from the groupby?
My groupby looks like this
+----------------+----------------+-------------+
| Team | Method | Count |
+----------------+----------------+-------------+
| Team 1 | Automated | 1 |
| Team 1 | Manual | 14 |
| Team 2 | Automated | 5 |
| Team 2 | Hybrid | 1 |
| Team 2 | Manual | 25 |
| Team 4 | 0 | 1 |
| Team 4 | Automated | 1 |
| Team 4 | Hybrid | 13 |
+----------------+----------------+-------------+
My code looks like this (after importing excel file)
df = df1.fillna(0)
a = df[['Team', 'Method']]
b = a.groupby(['Team', 'Method']).agg({'Method' : 'count'})
I'd filter the df prior to grouping:
In [8]:
a = df.loc[df['Method'] !=0, ['Team', 'Method']]
b = a.groupby(['Team', 'Method']).agg({'Method' : 'count'})
b
Out[8]:
                Method
Team Method
1    Automated       1
     Manual          1
2    Automated       1
     Hybrid          1
     Manual          1
4    Automated       1
     Hybrid          1
Here we select only the rows where Method is not equal to 0.
Compare this against the result without filtering:
In [9]:
a = df[['Team', 'Method']]
b = a.groupby(['Team', 'Method']).agg({'Method' : 'count'})
b
Out[9]:
                Method
Team Method
1    Automated       1
     Manual          1
2    Automated       1
     Hybrid          1
     Manual          1
4    0               1
     Automated       1
     Hybrid          1
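A self-contained version of the filter-then-group idea, on a few hypothetical rows (not the asker's real data). It counts with size() instead of aggregating the grouping column itself, since some newer pandas versions may reject agg on a column that is also a grouping key.

```python
import pandas as pd

# Hypothetical rows; 0 marks a Method that was NaN before fillna(0).
df = pd.DataFrame({
    "Team": ["Team 1", "Team 1", "Team 2", "Team 4", "Team 4"],
    "Method": ["Automated", "Manual", "Manual", 0, "Hybrid"],
})

# Drop the placeholder rows first, then count rows per (Team, Method).
filtered = df.loc[df["Method"] != 0, ["Team", "Method"]]
counts = (
    filtered.groupby(["Team", "Method"])
            .size()
            .reset_index(name="Count")
)
print(counts)
```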
You need the filter.
The filter method returns a subset of the original object. Suppose
we want to take only elements that belong to groups with a group sum
greater than 2.
Example:
In [94]: sf = pd.Series([1, 1, 2, 3, 3, 3])
In [95]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[95]:
3    3
4    3
5    3
dtype: int64
Source.