Pandas data frame sum of column and combining with other data - python

This is a continuation of a question I asked before and got a suitable answer for at the time. Now, however, my problem is different and the given answers no longer (completely) apply.
I have a big collection of Twitter messages and I want to do some statistical analysis on it. Part of the data frame looks as follows:
user.id user.screen_name user.followers_count text
Jim JimTHEbest 14 blahbla
Jim JIMisCOOL 15 blebla
Sarah Sarah123 33 blaat
Sarah Sarah123 33 bla
Peter PeterOnline 9 blabla
user.id never changes and is an identifier of a Twitter account.
user.screen_name The name given to the Twitter account; it can change over time.
user.followers_count How many followers the account has; it can change over time.
text The Twitter message; each row represents one tweet and its metadata.
What I would like to do is count the number of tweets by each Twitter user in my data frame and combine it with the data that I already have, so that I get something like this:
user.id user.screen_name user.followers_count count
Jim JIMisCOOL 15 2
Sarah Sarah123 33 2
Peter PeterOnline 9 1
A data frame with 1 row for each twitter user in my data set that shows their tweet count and the last screen_name and followers_count.
What I think I should do is first do the 'count' operation and then pd.merge that outcome with a part of my original data frame. Trying merge with the help of the pandas documentation didn't get me very far; I mostly ended up with endlessly repeating rows of duplicate data. Any help would be greatly appreciated!
The count part I do as follows:
df[['user.id', 'text']].groupby(['user.id']).size().reset_index(name='count')
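For reference, one way the "count then merge" idea could be spelled out (a sketch, not part of the original question; it assumes df is the data frame shown above):
# count tweets per user, keep the last row per user for the metadata, then merge the two
counts = df.groupby('user.id').size().reset_index(name='count')
latest = df.drop_duplicates('user.id', keep='last')[['user.id', 'user.screen_name', 'user.followers_count']]
merged = pd.merge(latest, counts, on='user.id')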

# df being the original dataframe, taking the last row of each unique user.id and ignoring the 'text' column
output_df = df.drop_duplicates(subset='user.id', keep='last')[['user.id', 'user.screen_name', 'user.followers_count']]
# adding the 'count' column
output_df['count'] = df['user.id'].apply(lambda x: len(df[df['user.id'] == x]))
output_df.reset_index(inplace=True, drop=True)
print(output_df)
>> user.id user.screen_name user.followers_count count
0 Jim JIMisCOOL 15 2
1 Sarah Sarah123 33 2
2 Peter PeterOnline 9 1
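As a side note (an addition, not part of the original answer): the per-row apply above rescans the whole frame once per tweet. A vectorized sketch that fills the same 'count' column would be:
# count each user.id once, then map the counts back onto the deduplicated rows
output_df['count'] = output_df['user.id'].map(df['user.id'].value_counts())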

You group on user.id, and then use agg to apply a custom aggregation function to each column. In this case, we use a lambda expression and then use iloc to take the last member of each group. We then use count on the text column.
result = df.groupby('user.id').agg({'user.screen_name': lambda group: group.iloc[-1],
                                    'user.followers_count': lambda group: group.iloc[-1],
                                    'text': 'count'})
result.rename(columns={'text': 'count'}, inplace=True)
>>> result[['user.screen_name', 'user.followers_count', 'count']]
user.screen_name user.followers_count count
user.id
Jim JIMisCOOL 15 2
Peter PeterOnline 9 1
Sarah Sarah123 33 2
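On newer pandas (0.25+), the same result can be written with named aggregation; here is a sketch (not from the original answer), where the dict unpacking works around the dotted column names not being valid keyword identifiers:
result = df.groupby('user.id').agg(
    **{'user.screen_name': ('user.screen_name', 'last'),
       'user.followers_count': ('user.followers_count', 'last'),
       'count': ('text', 'count')})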

This is how I did it myself, but I'm also going to look into the other answers; they are probably different for a reason :).
df2 = df[['user.id', 'text']].groupby(['user.id']).size().reset_index(name='count')
df = df.set_index('user.id')
df2 = df2.set_index('user.id')
frames = [df2, df]
result = pd.concat(frames, axis=1, join_axes=[df.index])
result = result.reset_index()
result = result.drop_duplicates(['user.id'], keep='last')
result = result[['user.id', 'user.screen_name', 'user.followers_count', 'count']]
result
user.id user.screen_name user.followers_count count
1 Jim JIMisCOOL 15 2
3 Sarah Sarah123 33 2
4 Peter PeterOnline 9 1
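Note that join_axes was removed from pd.concat in pandas 1.0; a roughly equivalent sketch for newer versions (an addition, not part of the original answer) joins the per-user counts onto df instead:
# count tweets per user, align the counts onto df's rows via a join on user.id,
# then keep the last row per user
counts = df.groupby('user.id').size().rename('count')
result = (df.join(counts, on='user.id')
            .drop_duplicates('user.id', keep='last')
            [['user.id', 'user.screen_name', 'user.followers_count', 'count']])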

Related

Pandas Dataframe : Using same category codes on different existing dataframes with same category

I have two pandas dataframes with some columns in common. These columns are of type category but unfortunately the category codes don't match for the two dataframes. For example I have:
>>> df1
artist song
0 The Killers Mr Brightside
1 David Guetta Memories
2 Estelle Come Over
3 The Killers Human
>>> df2
artist date
0 The Killers 2010
1 David Guetta 2012
2 Estelle 2005
3 The Killers 2006
But:
>>> df1['artist'].cat.codes
0 55
1 78
2 93
3 55
Whereas:
>>> df2['artist'].cat.codes
0 99
1 12
2 23
3 99
What I would like is for my second dataframe df2 to take the same category codes as the first one df1 without changing the category values. Is there any way to do this?
(Edit)
Here is a screenshot of my two dataframes. Essentially I want song_tags to have the same cat codes for artist_name and track_name as the songs dataframe. song_tags is created from a merge between songs and another tag dataframe (which contains song data and their tags, without the user information) and then saved and loaded through pickle. It might also be relevant that I had to cast artist_name and track_name in song_tags from type object to type category.
I think essentially my question is: how to modify category codes of an existing dataframe column?
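For what it's worth, one way to get matching codes (a sketch, not from the original thread) is to give df2's column the same categories, in the same order, as df1's, since the codes are just positions in the list of categories:
# reuse df1's categories (same values, same order) so identical artists get identical codes;
# any df2 artist not present in df1's categories would become NaN
df2['artist'] = df2['artist'].cat.set_categories(df1['artist'].cat.categories)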

How to count Pandas df elements with dynamic condition per row (=countif)

I am trying to do the equivalent of COUNTIF in Pandas. I have tried to get my head around doing it with groupby, but I am struggling because my logical grouping condition is dynamic.
Say I have a list of customers and the day on which they visited. I want to identify new customers based on 2 logical conditions:
They must be the same customer (same Guest ID)
They must have been there on the previous day
If both conditions are met, they are a returning customer. If not, they are new (hence newby = 1 - ... to identify new customers).
I managed to do this with a for loop, but obviously performance is terrible and this goes pretty much against the logic of Pandas.
How can I wrap the following code into something smarter than a loop?
for i in range(0, len(df)):
    newby = 1 - np.sum((df["Day"] == df.iloc[i]["Day"] - 1) & (df["Guest ID"] == df.iloc[i]["Guest ID"]))
This post does not help, as the condition there is static. I would like to avoid introducing "dummy columns", such as by transposing the df, because I will have many categories (many customer names) and would like to build more complex logical statements. I do not want to run the risk of ending up with many auxiliary columns.
I have the following input
df
Day Guest ID
0 3230 Tom
1 3230 Peter
2 3231 Tom
3 3232 Peter
4 3232 Peter
and expect this output
df
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1
Note that elements 3 and 4 are not necessarily duplicates, since there might be additional, varying columns (such as their order).
Do:
# ensure the df is sorted by date
df = df.sort_values('Day')
# group by customer and find the diff within each group
df['newby'] = (df.groupby('Guest ID')['Day'].transform('diff').fillna(2) > 1).astype(int)
print(df)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
UPDATE
If multiple visits are allowed per day, you could do:
# only keep unique visits per day
uniques = df.drop_duplicates()
# ensure the df is sorted by date
uniques = uniques.sort_values('Day')
# group by customer and find the diff within each group
uniques['newby'] = (uniques.groupby('Guest ID')['Day'].transform('diff').fillna(2) > 1).astype(int)
# merge the uniques visits back into the original df
res = df.merge(uniques, on=['Day', 'Guest ID'])
print(res)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1
As an alternative, without sorting or merging, you could build a lookup set of (day + 1, guest) pairs from all visits and flag a row as new when its own (day, guest) pair is not in that set:
lookup = {(day + 1, guest) for day, guest in df[['Day', 'Guest ID']].value_counts().to_dict()}
df['newby'] = (~pd.MultiIndex.from_arrays([df['Day'], df['Guest ID']]).isin(lookup)).astype(int)
print(df)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1

How to insert missing data into a pandas dataframe, using another dataframe that has the missing info?

Let's say I have a dataframe of leads as such:
import pandas as pd
leads = {'Unique Identifier': ['1','2','3','4','5','6','7','8'],
         'Name': ['brad','stacy','holly','mike','phil', 'chris','jane','glenn'],
         'Channel': [None,None,None,None,'facebook', 'facebook','google', 'facebook'],
         'Campaign': [None,None,None,None,'A', 'B','B', 'C'],
         'Gender': ['M','F','F','M','M', 'M','F','M'],
         'Signup Month': ['Mar','Mar','Apr','May','May','May','Jun','Jun']
         }
leads_df = pd.DataFrame(leads)
leads_df
which looks like the following. It has missing data for Channel and Campaign for the first 4 leads.
leads table
I have a separate dataframe with the missing data:
missing = {'Unique Identifier': ['1','2','3','4'],
           'Channel': ['google', 'email','facebook', 'google'],
           'Campaign': ['B', 'A','C', 'B']
           }
missing_df = pd.DataFrame(missing)
missing_df
table with missing data
Using the Unique Identifiers in both tables, how would I go about plugging in the missing data into the main leads table? For context there are about 6,000 leads with missing data.
You can merge the two dataframes together, update the columns using the results from the merge and then proceed to drop the merged columns.
data = leads_df.merge(missing_df, how='outer', on='Unique Identifier')
data['Channel'] = data['Channel_y'].fillna(data['Channel_x'])
data['Campaign'] = data['Campaign_y'].fillna(data['Campaign_x'])
data.drop(['Channel_x', 'Channel_y', 'Campaign_x', 'Campaign_y'], axis=1, inplace=True)
The result:
data
Unique Identifier Name Gender Signup Month Channel Campaign
0 1 brad M Mar google B
1 2 stacy F Mar email A
2 3 holly F Apr facebook C
3 4 mike M May google B
4 5 phil M May facebook A
5 6 chris M May facebook B
6 7 jane F Jun google B
7 8 glenn M Jun facebook C
You can set the index of both dataframes to Unique Identifier and use combine_first to fill the null values in leads_df:
(leads_df
.set_index("Unique Identifier")
.combine_first(missing_df.set_index("Unique Identifier"))
.reset_index()
)
The approach I use in this kind of case is similar to Excel's VLOOKUP function.
leads_df.loc[leads_df.Channel.isna(), 'Channel'] = pd.merge(
    leads_df.loc[leads_df.Channel.isna(), 'Unique Identifier'],
    missing_df,
    how='left')['Channel']
This code will result in:
Unique Identifier Name Channel Campaign Gender Signup Month
0 1 brad google None M Mar
1 2 stacy email None F Mar
2 3 holly facebook None F Apr
3 4 mike google None M May
4 5 phil facebook A M May
5 6 chris facebook B M May
6 7 jane google B F Jun
7 8 glenn facebook C M Jun
You can do the same for 'Campaign'.
You just need to fill it using fillna() ...
leads_df.fillna(missing_df, inplace=True)
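Note that fillna with a DataFrame argument aligns on the index and column labels, not on Unique Identifier; it happens to work here because identifiers 1-4 sit at index positions 0-3 in both frames. A variant that aligns on the identifier explicitly (a sketch, not part of the original answer):
leads_df = (leads_df.set_index('Unique Identifier')
                    .fillna(missing_df.set_index('Unique Identifier'))
                    .reset_index())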
There is a pandas DataFrame method for this called combine_first:
voltron = leads_df.combine_first(missing_df)

Update a value based on another dataframe pairing

I have a problem where I need to update a value if people were at the same table.
import pandas as pd
data = {"p1":['Jen','Mark','Carrie'],
"p2":['John','Jason','Rob'],
"value":[10,20,40]}
df = pd.DataFrame(data,columns=["p1",'p2','value'])
meeting = {'person': ['Jen','Mark','Carrie','John','Jason','Rob'],
           'table': [1,2,3,1,2,3]}
meeting = pd.DataFrame(meeting,columns=['person','table'])
df is a relationship table and value is the field I need to update. So if two people were at the same table in the meeting dataframe, then update the df row accordingly.
For example: Jen and John were both at table 1, so I need to update the row in df that has Jen and John and set their value to value + 100, i.e. 110.
I thought about maybe doing a self join on meeting to get the format to match that of df, but I'm not sure if this is the easiest or fastest approach.
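One way to express that lookup without an explicit self join (a sketch, not taken from the answers below) is to map each person to their table and compare the two mapped columns:
# map each person to the table they sat at; where both people map to the same table, add 100
table_of = meeting.set_index('person')['table']
same_table = df['p1'].map(table_of) == df['p2'].map(table_of)
df.loc[same_table, 'value'] += 100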
IIUC you could set the person as index in the meeting dataframe, and use its table values to replace the names in df. Then if both mappings have the same value (table), replace with df.value+100:
m = df[['p1','p2']].replace(meeting.set_index('person').table).eval('p1==p2')
df['value'] = df.value.mask(m, df.value+100)
print(df)
p1 p2 value
0 Jen John 110
1 Mark Jason 120
2 Carrie Rob 140
This could be an approach, using df.to_records():
groups = meeting.groupby('table').agg(set)['person'].to_list()
df['value'] = [row[-1] + 100 if set(list(row)[1:3]) in groups else row[-1] for row in df.to_records()]
Output:
df
p1 p2 value
0 Jen John 110
1 Mark Jason 120
2 Carrie Rob 140

Top 5 movies with most number of ratings

I'm currently facing a little problem. I'm working with the MovieLens 1M data and trying to get the top 5 movies with the most ratings.
movies = pandas.read_table('movies.dat', sep='::', header=None, names= ['movie_id', 'title', 'genre'])
users = pandas.read_table('users.dat', sep='::', header=None, names=['user_id', 'gender','age','occupation_code','zip'])
ratings = pandas.read_table('ratings.dat', sep='::', header=None, names=['user_id','movie_id','rating','timestamp'])
movie_data = pandas.merge(movies,pandas.merge(ratings,users))
The above code is what I have written to merge the .dat files into one DataFrame.
Then I need the top 5 from that movie_data dataframe, based on the ratings.
Here is what I have done:
print(movie_data.sort_values('rating', ascending=False).head(5))
This seems to find the top 5 based on the rating. However, the output is:
movie_id title genre user_id \
0 1 Toy Story (1995) Animation|Children's|Comedy 1
657724 2409 Rocky II (1979) Action|Drama 101
244214 1012 Old Yeller (1957) Children's|Drama 447
657745 2409 Rocky II (1979) Action|Drama 549
657752 2409 Rocky II (1979) Action|Drama 684
rating timestamp gender age occupation_code zip
0 5 978824268 F 1 10 48067
657724 5 977578472 F 18 3 33314
244214 5 976236279 F 45 11 55105
657745 5 976119207 M 25 6 53217
657752 5 975603281 M 25 4 27510
As you can see, Rocky II appears 3 times. I would like to know if I can somehow remove the duplicates quickly, other than going through the list again and removing them that way.
I have looked at pivot_table, but I'm not quite sure how it works, so if it can be done with such a table, I need some explanation of how it works.
EDIT: The first comment did indeed remove the duplicates.
movie_data.drop_duplicates(subset='movie_id').sort_values('rating', ascending=False).head(5)
Thank you :)
You can drop the duplicate entries by calling drop_duplicates and passing param subset='movie_id':
movie_data.drop_duplicates(subset='movie_id').sort_values('rating', ascending=False).head(5)
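If "most ratings" is meant as the number of ratings per title (as the question title suggests) rather than the highest rating value, a short sketch (an addition, not part of the original answer) that counts rows per movie:
# count how many ratings each title received and keep the five largest counts
top5 = movie_data.groupby('title').size().sort_values(ascending=False).head(5)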
