Group data by two columns and count it using pandas - python

I have the following data in two tables: songs and play_event.
In songs the data is as below:
song_id  total_plays
1        2000
2        4532
3        9999
4        2343
And in play_event the data is as below:
user_id  song_id
102      1
103      4
102      1
102      3
104      2
102      1
Each time a song is played there is a new entry, even if the same song was played before.
With this data I want to:
Get the total number of times each user played each song. For example, user_id 102 played song_id 1 three times in the data above. I want it grouped by user_id with a total count, something like below:
user_id  song_id  count
102      1        3
102      3        1
103      4        1
104      2        1
I am thinking of using pandas for this, but I want to know whether pandas is the right choice.
If it's not pandas, what should be my way forward?
If pandas is the right choice, then: the code below gets me the count grouped by either user_id or song_id, but how do I get the count grouped by user_id and song_id together? See the sample code I tried below:
import pandas as pd

# Load data from a csv file (DataFrame.from_csv is deprecated; use read_csv)
data = pd.read_csv('play_events.csv')

# Gives how many entries per user
data['user_id'].value_counts()

# Gives how many entries per song
data['song_id'].value_counts()

For your first problem, a simple groupby and value_counts does the trick. Note that everything after value_counts() in the code below just converts the result to an actual dataframe in the same format as your desired output.
counts = play_events.groupby('user_id')['song_id'].value_counts().to_frame('count').reset_index()
>>> counts
   user_id  song_id  count
0      102        1      3
1      102        3      1
2      103        4      1
3      104        2      1
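An equivalent formulation (a sketch, assuming the same play_events dataframe) groups by both columns at once and counts the rows in each group:
counts = (play_events.groupby(['user_id', 'song_id'])
                     .size()
                     .reset_index(name='count'))
Both produce the same dataframe; pick whichever reads more naturally to you.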
Then for your second problem (which you have deleted in your edited post, but I will leave here in case it is useful to you), you can loop through counts, grouping by user_id, and save each group as a CSV:
for user, data in counts.groupby('user_id', as_index=False):
    data.to_csv(str(user) + '_events.csv')
For your example dataframes, this gives you three CSVs: 102_events.csv, 103_events.csv, and 104_events.csv. The first looks like:
   user_id  song_id  count
0      102        1      3
1      102        3      1

Related

Pandas Dataframe : Using same category codes on different existing dataframes with same category

I have two pandas dataframes with some columns in common. These columns are of type category but unfortunately the category codes don't match for the two dataframes. For example I have:
>>> df1
artist song
0 The Killers Mr Brightside
1 David Guetta Memories
2 Estelle Come Over
3 The Killers Human
>>> df2
artist date
0 The Killers 2010
1 David Guetta 2012
2 Estelle 2005
3 The Killers 2006
But:
>>> df1['artist'].cat.codes
0 55
1 78
2 93
3 55
Whereas:
>>> df2['artist'].cat.codes
0 99
1 12
2 23
3 99
What I would like is for my second dataframe df2 to take the same category codes as the first one df1 without changing the category values. Is there any way to do this?
(Edit)
Here is a screenshot of my two dataframes. Essentially I want the song_tags to have the same cat codes for artist_name and track_name as the songs dataframe. Also song_tags is created from a merge between songs and another tag dataframe (which contains song data and their tags, without the user information) and then saved and loaded through pickle. Also it might be relevant to add that I had to cast artist_name and track_name in song_tags to type category from type object.
I think essentially my question is: how to modify category codes of an existing dataframe column?
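One way to approach this (a minimal sketch, assuming every artist in df2 also appears among df1's categories) is to tell df2's column to use df1's category list, so equal values map to equal codes:
df2['artist'] = df2['artist'].cat.set_categories(df1['artist'].cat.categories)
Values absent from df1's categories would become NaN, so check coverage first, or build a shared category list from the union of both columns before casting.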

Python: How to find the number of items in each point on scatterplot and produce list?

Right now I have a dataset of 1206 participants who have each endorsed a certain number of traumatic experiences and a number of symptoms associated with the trauma.
This is part of my dataframe (full dataframe is 1206 rows long):
SubjectID  PTSD_Symptom_Sum  PTSD_Trauma_Sum
1223       3                 5
1224       4                 2
1225       2                 6
1226       0                 3
I have two issues that I am trying to figure out:
I was able to create a scatter plot, but I can't tell from this plot how many participants are in each data point. Is there any easy way to see the number of subjects in each data point?
I used this code to create the scatterplot:
plt.scatter(PTSD['PTSD_Symptom_SUM'], PTSD['PTSD_Trauma_SUM'])
plt.title('Trauma Sum vs. Symptoms')
plt.xlabel('Symptoms')
plt.ylabel('Trauma Sum')
I haven't been able to successfully produce a list of the number of people endorsing each pair of items (symptoms and trauma number). I am able to run this code to create the counts for the number of people in each category:
count_sum= PTSD['PTSD_SUM'].value_counts()
count_symptom_sum= PTSD['PTSD_symptom_SUM'].value_counts()
print(count_sum)
print(count_symptom_sum)
Which produces this output:
0 379
1 371
2 248
3 130
4 47
5 17
6 11
8 2
7 1
Name: PTSD_SUM, dtype: int64
0 437
1 418
2 247
3 74
4 23
5 4
6 3
Name: PTSD_symptom_SUM, dtype: int64
Is it possible to alter the code to count the number of people endorsing each pair of items (symptom number and trauma number)? If not, are there any functions that would allow me to do this?
You could create a new dataset with the counts of each pair 'PTSD_SUM', 'PTSD_Symptom_SUM' with:
counts = PTSD.groupby(by=['PTSD_symptom_SUM', 'PTSD_SUM']).size().to_frame('size').reset_index()
and then use Seaborn like this:
import seaborn as sns
sns.scatterplot(data=counts, x="PTSD_symptom_SUM", y="PTSD_SUM", hue="size", size="size")
This produces a scatter plot in which each point's color and size reflect the number of participants at that (symptom, trauma) pair.
If I understood properly, your dataframe is:
SubjectID TraumaSum Symptoms
1 1 5
2 3 4
...
So you just need:
dataset.groupby(by=['PTSD_SUM', 'PTSD_Symptom_SUM']).count()
This line returns the count for each unique pair of values.
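If you would rather stay in plain matplotlib, a similar effect (a sketch, reusing the counts dataframe built in the first answer above) is to scale the marker size by the pair count:
import matplotlib.pyplot as plt

plt.scatter(counts['PTSD_symptom_SUM'], counts['PTSD_SUM'],
            s=counts['size'] * 20)  # marker area grows with the pair count
plt.xlabel('Symptom sum')
plt.ylabel('Trauma sum')
plt.show()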

Transform dataframe/pandas

I have a dataframe with tag as a column and movieId as the index. movieId represents movies, so there may be multiple tags for the same movieId.
I want to transform this dataframe so that there are as many columns as there are tags and the movieIds become the rows, with exactly one row per movieId. For each movieId, a tag's column should contain 1 if the movie has that tag and 0 otherwise.
When I try, the movieIds appear several times. Would someone help me?
Thank you so much.
We can use the pd.crosstab() function to get the required output.
I have created a sample dataframe as 'df':
movieId tag
260 Best movie ever
1240 scifi
2133 Best movie ever
1097 scifi
260 scifi
250 scifi
Using the .crosstab() function:
pd.crosstab(df.movieId, df.tag, dropna = False)
The output will be like this:
tag      Best movie ever  scifi
movieId
250                    0      1
260                    1      1
1097                   0      1
1240                   0      1
2133                   1      0
I hope this fixes the problem:
import pandas as pd
import numpy as np

df = pd.DataFrame([[260, "best"], [520, "sci"], [260, "sci"]], columns=['movieId', 'tag'])
print("Dummy DataFrame: \n", df)

movieIds, tags = list(df['movieId'].unique()), list(df['tag'].unique())
dfmatrix = pd.DataFrame(np.zeros((len(movieIds), len(tags) + 1), dtype=int),
                        columns=['movieID'] + tags)

for i, movie in enumerate(movieIds):
    listoftags = df.tag[df['movieId'] == movie]
    # .loc avoids chained-indexing assignment warnings
    dfmatrix.loc[i, 'movieID'] = movie
    for tag in listoftags:
        dfmatrix.loc[i, tag] = 1

print("\n\ndfmatrix\n", dfmatrix)
The output is:
Dummy DataFrame:
   movieId   tag
0      260  best
1      520   sci
2      260   sci

dfmatrix
   movieID  best  sci
0      260     1    1
1      520     0    1
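As a side note, the same matrix can be built without explicit loops (a sketch using the same df as above): one-hot encode the tag column, then collapse to one row per movieId:
onehot = pd.get_dummies(df['tag']).groupby(df['movieId']).max().astype(int).reset_index()
print(onehot)
Here .max() marks a tag with 1 if any row for that movieId carries it.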

Pair of employees that worked together for the longest period of time - Python/Pandas

I recently had to do a code, which returns the pair of employees that have worked the most together on a common project. This is the code I came up with:
Note 1: Null is read by the program as "Today"
Note 2: The data comes from a .txt file in this form:
EmpID,ProjectID,DateFrom,DateTo
1,101,2014-11-01,2015-05-01
1,103,2013-11-01,2016-05-01
2,101,2013-12-06,2014-10-06
2,103,2014-06-05,2015-05-14
3,100,2016-03-01,2018-07-03
3,102,2015-06-04,2017-09-04
3,103,2015-06-04,2017-09-04
4,102,2013-11-13,2014-03-13
4,103,2016-02-14,2017-03-15
4,104,2014-10-01,2015-12-01
5,100,2013-03-07,2015-11-07
5,101,2015-07-09,2019-01-19
5,102,2014-03-15,NULL
6,101,2014-03-15,2014-03-16
The problem I currently have is that I need to adapt/change the code to return the pair of employees that have worked together the longest (not on a single project, but across all projects combined). I am having trouble adapting my current code, which runs perfectly fine for what it does, and I am wondering if I should just scrap all of this and start from the beginning (but that would cost me a lot of time, which I don't have currently). I am having difficulty obtaining the combinations of employees that have worked together on projects.
I would very much appreciate it if anyone can give me any tips! Thanks!
Edit 1: A person in the comments reminded me to mention that overlapping days should only be counted once. For example: persons A and B work on two projects for the entirety of June. This should be counted as 30 days of common work in total (for the two projects), not 60 days from adding both project durations together.
Here's one of the more straightforward ways I can think of doing this:
1. Expand the timespans to a single row per date.
2. Merge all days on the same project (to get all combinations of people who worked together).
3. Remove duplicated rows of people who work together on the same day, but on different projects.
4. Count how many rows fall within each worker pairing.
Code:
import pandas as pd
import numpy as np

def expand_period_daily(df, start, stop):
    # Allows it to work for one-day spans.
    df.loc[df[stop].notnull(), stop] = (df.loc[df[stop].notnull(), stop]
                                        + pd.Timedelta(hours=1))
    real_span = df[[start, stop]].notnull().all(1)

    # Resample timespans to daily fields.
    df['temp_id'] = range(len(df))
    dailydf = (df.loc[real_span, ['temp_id', start, stop]].set_index('temp_id').stack()
                 .reset_index(level=-1, drop=True).rename('period').to_frame())
    dailydf = (dailydf.groupby('temp_id').apply(lambda x: x.set_index('period')
                                                           .resample('d').asfreq())
                      .reset_index())

    # Merge back other information
    dailydf = (dailydf.merge(df, on=['temp_id'])
                      .drop(columns=['temp_id', start, stop]))
    return dailydf

# Make dates, fill missings.
df[['DateFrom', 'DateTo']] = df[['DateFrom', 'DateTo']].apply(pd.to_datetime, errors='coerce')
df[['DateFrom', 'DateTo']] = df[['DateFrom', 'DateTo']].fillna(pd.to_datetime('today').normalize())

dailydf = expand_period_daily(df.copy(), start='DateFrom', stop='DateTo')

# Merge, then remove rows of an employee paired with him/herself.
m = (dailydf.merge(dailydf, on=['period', 'ProjectID'])
            .loc[lambda x: x.EmpID_x != x.EmpID_y])

# Ensure A-B and B-A are grouped the same
m[['EmpID_x', 'EmpID_y']] = np.sort(m[['EmpID_x', 'EmpID_y']].to_numpy(), axis=1)

# Remove duplicated projects on the same date between employee pairs
m = m.drop_duplicates(['period', 'EmpID_x', 'EmpID_y'])

m.groupby(['EmpID_x', 'EmpID_y']).size().to_frame('Days_Together')
Output:
                 Days_Together
EmpID_x EmpID_y
1       2                  344
        3                  333
        4                   78
2       6                    2
3       4                  396
        5                  824
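For completeness: the code assumes df has already been loaded. A minimal sketch, assuming the .txt from the question is saved under the hypothetical name employees.txt:
df = pd.read_csv('employees.txt')
The pd.to_datetime(..., errors='coerce') call then turns the NULL strings in DateTo into NaT, which fillna replaces with today's date (matching Note 1 above).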
Test Case
To give a bit more clarity on how it handles overlaps, and combines different projects, Here's the following test case:
EmpID ProjectID DateFrom DateTo
0 1 101 2014-11-01 2014-11-15
1 1 103 2014-11-01 2014-11-15
2 1 105 2015-11-02 2015-11-03
3 2 101 2014-11-01 2014-11-15
4 2 103 2014-11-01 2014-11-15
5 2 105 2015-10-02 2015-11-05
6 3 101 2014-11-01 2014-11-15
Employees 1 and 2 perfectly overlap for 15 days on 2 projects in Nov 2014. They then work together for 2 additional days on another project in 2015. 1, 2 and 3 all work together for 15 days on a single Project.
Running with this test case we obtain:
                 Days_Together
EmpID_x EmpID_y
1       2                   17
        3                   15
2       3                   15

Counting the number of customers by values in a second series

I have imported a list of customers into python to run some RFM analysis, this adds a new field to the data for the RFM Class, so now my data looks like this:
customer RFMClass
0 0001914f-4655-4148-a1dc-1f25ca6d1f15 343
1 0002e50a-5551-4d9a-8734-76307dfe2131 341
2 00039977-512e-47ad-b929-170f18a1b14a 442
3 000693ff-2c61-425c-97c1-0286c874dd2f 443
4 00095dc2-7f37-48b0-894f-910d90cbbee2 142
5 000b748b-7ea0-48f2-a875-5f6cb95561d9 141
...
I'd like to plot a histogram showing the number of customers in each RFM class. How can I get a count of the number of distinct customer IDs per class?
I tried adding a 1 to every row with summary['number'] = 1, thinking it might be easier to count these rather than the customer IDs, as those have already been de-duplicated in my code, but I can't figure out how to sum them per RFM class either.
Any thoughts on how I could do this?
I worked this out by using .groupby on the class column and summing the 'number' I assigned to each row; the snippet below shows the same pattern on an analogous Hour/Orders dataset:
byhour = df.groupby(['Hour']).agg({'Orders': 'sum'})
print(byhour)
This then produces the desired output:
      Orders
Hour
0        902
1        438
2        307
3        162
4        149
5        233
6        721
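Applied to the RFM data itself, the same idea looks like this (a minimal sketch, assuming the dataframe from the question is named summary):
class_counts = summary.groupby('RFMClass')['customer'].nunique()
class_counts.plot(kind='bar')
Using nunique() counts distinct customer IDs directly, so the helper 'number' column isn't strictly needed.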
