I have a dataframe with 2 columns, UserId (integer) and Actors (string), as shown below:
Userid Actors
u1 Tony Ward,Bruce LaBruce,Kevin P. Scott,Ivar Johnson, Naomi Watts, Tony Ward,.......
u2 Tony Ward,Bruce LaBruce,Kevin P. Scott, Luke Wilson, Owen Wilson, Lumi Cavazos,......
It represents the actors from all movies watched by a particular user of the platform.
I want an output with the count of each actor for each user, as shown below:
UserId Tony Ward Bruce LaBruce Kevin P. Scott Ivar Johnson Luke Wilson Owen Wilson Lumi Cavazos
u1 2 1 1 1 0 0 0
u2 1 1 1 0 1 1 1
It is something similar to CountVectorizer, I reckon, but I just have nouns here.
Kindly help
Assuming it's a pandas.DataFrame, try this: DataFrame.explode transforms each element of a list-like (the result of split) into a row, DataFrame.groupby aggregates the data, and unstack pivots the result into the required format.
df['Actors'] = df['Actors'].str.split(r',\s*')
out = (
    df.explode('Actors')
      .groupby(['Userid', 'Actors'])
      .size()
      .unstack(fill_value=0)
)
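Alternatively, a minimal sketch of the same wide counts via pd.crosstab (assuming Actors has already been split into lists as above):
# Explode the lists to one row per (user, actor), then cross-tabulate counts.
exploded = df.explode('Actors')
out = pd.crosstab(exploded['Userid'], exploded['Actors'])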
I have the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name': ['Steve Smith', 'Joe Nadal', 'Roger Federer'],
                   'birthdat/company': ['1995-01-26Sharp, Reed and Crane',
                                        '1955-08-14Price and Sons',
                                        '2000-06-28Pruitt, Bush and Mcguir']})
df[['data_time','full_company_name']] = df['birthdat/company'].str.split(r'[0-9]{4}-[0-9]{2}-[0-9]{2}', expand=True)
df
With my code I get the following:
     Name           birthdat/company                     data_time   full_company_name
0    Steve Smith    1995-01-26Sharp, Reed and Crane                  Sharp, Reed and Crane
1    Joe Nadal      1955-08-14Price and Sons                         Price and Sons
2    Roger Federer  2000-06-28Pruitt, Bush and Mcguir                Pruitt, Bush and Mcguir
What I want is to capture the date with the regex ('[0-9]{4}-[0-9]{2}-[0-9]{2}') in its own column, with the rest going to the column "full_company_name":
     Name           data_time    full_company_name
0    Steve Smith    1995-01-26   Sharp, Reed and Crane
1    Joe Nadal      1955-08-14   Price and Sons
2    Roger Federer  2000-06-28   Pruitt, Bush and Mcguir
Updated question:
How could I handle missing values for the birthdate or the company name? For example, birthdat/company = "NaApple" or birthdat/company = "2003-01-15Na"; the missing-value markers are not limited to "Na".
You may use
df[['data_time','full_company_name']] = df['birthdat/company'].str.extract(r'^([0-9]{4}-[0-9]{2}-[0-9]{2})(.*)')
>>> df
            Name                   birthdat/company   data_time        full_company_name
0    Steve Smith    1995-01-26Sharp, Reed and Crane  1995-01-26    Sharp, Reed and Crane
1      Joe Nadal           1955-08-14Price and Sons  1955-08-14           Price and Sons
2  Roger Federer  2000-06-28Pruitt, Bush and Mcguir  2000-06-28  Pruitt, Bush and Mcguir
The Series.str.extract is used here because you need to get two parts without losing the date.
The regex is
^ - start of string
([0-9]{4}-[0-9]{2}-[0-9]{2}) - your date pattern captured into Group 1
(.*) - the rest of the string captured into Group 2.
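For the updated question, a minimal sketch (assuming the missing parts show up as literal placeholder tokens such as 'Na'): make the date group optional, then map known placeholders to NA.
# The trailing ? makes the date group optional, so rows without a date still match.
parts = df['birthdat/company'].str.extract(r'^([0-9]{4}-[0-9]{2}-[0-9]{2})?(.*)')
parts.columns = ['data_time', 'full_company_name']
# Map known placeholder tokens to missing values (extend the list as needed).
parts['full_company_name'] = parts['full_company_name'].replace(['Na', ''], pd.NA)
df[['data_time', 'full_company_name']] = parts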
split removes the separators, so the date itself is lost. I think you want extract with two capture groups:
df[['data_time','full_company_name']] = \
    df['birthdat/company'].str.extract(r'^([0-9]{4}-[0-9]{2}-[0-9]{2})(.*)')
Output:
Name birthdat/company data_time full_company_name
-- ------------- --------------------------------- ----------- -----------------------
0 Steve Smith 1995-01-26Sharp, Reed and Crane 1995-01-26 Sharp, Reed and Crane
1 Joe Nadal 1955-08-14Price and Sons 1955-08-14 Price and Sons
2 Roger Federer 2000-06-28Pruitt, Bush and Mcguir 2000-06-28 Pruitt, Bush and Mcguir
I was trying to return only the rows where the value in the director column contains the separator '|'. But it is not filtering on the separator and instead shows all the rows. Please let me know the possible issue with this.
I tried the following:
hb_dctr = df_updated[df_updated['director'].str.contains('|')]
hb_dctr
But it shows the following:
id popularity budget Cast director
135397 32.985763 150000000 Chris Pratt|Irrfan Khan Colin Trevorrow
76341 28.419936 150000000 Tom Hardy|Charlize Theron George Miller
76757 6.189369 176000003 Mila Kunis|Channing Lana Wachowski|Lilly Wachowski
It should only show the rows with id 135397 and 76341.
You need to set regex=False, because '|' is otherwise interpreted as a regex:
df[df.director.str.contains("|", regex=False)]
id popularity budget Cast \
2 76757 6.189369 176000003 Mila Kunis|Channing
director
2 Lana Wachowski|Lilly Wachowski
If you want to exclude such rows, use the invert operator ~:
df[~df.director.str.contains("|", regex=False)]
id popularity budget Cast director
0 135397 32.985763 150000000 Chris Pratt|Irrfan Khan Colin Trevorrow
1 76341 28.419936 150000000 Tom Hardy|Charlize Theron George Miller
Escape | because it is a special regex character (alternation):
df1 = df[df.director.str.contains(r"\|")]
print (df1)
id popularity budget Cast \
2 76757 6.189369 176000003 Mila Kunis|Channing Lana
director
2 Wachowski|Lilly Wachowski
For 'not contains' use ~:
df2 = df[~df.director.str.contains(r"\|")]
print (df2)
id popularity budget Cast director
0 135397 32.985763 150000000 Chris Pratt|Irrfan Khan Colin Trevorrow
1 76341 28.419936 150000000 Tom Hardy|Charlize Theron George Miller
Details:
print (df.director.str.contains(r"\|"))
0 False
1 False
2 True
Name: director, dtype: bool
print (~df.director.str.contains(r"\|"))
0 True
1 True
2 False
Name: director, dtype: bool
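The root cause: in a regex, '|' means alternation, and an empty alternative matches every string, which is why every row came back. If the search string may contain regex metacharacters, re.escape is a safe way to neutralize them (a small sketch):
import re

pattern = re.escape("|")   # -> '\\|'
print(df[df.director.str.contains(pattern)])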
I have a large df called data which looks like:
Identifier Surname First names(s) Date change Work Pattern Region
0 12233.0 Smith Bob FT NW
1 54213.0 Jones Sally 15/04/15 FT NW
2 12237.0 Evans Steve 26/08/14 FT SE
3 10610.0 Cooper Amy 16/08/12 FT SE
I have another dataframe called updates. In this example it has updated information for a couple of the records in data and looks like:
Identifier Surname First names(s) Date change
0 12233.0 Smith Bob 05/09/14
1 10610.0 Cooper Amy 16/08/12
I'm trying to find a way to update data with the updates df so the resulting dataframe looks like:
Identifier Surname First names(s) Date change Work Pattern Region
0 12233.0 Smith Bob 15/09/14 FT NW
1 54213.0 Jones Sally 15/04/15 FT NW
2 12237.0 Evans Steve 26/08/14 FT SE
3 10610.0 Cooper Amy 16/08/12 FT SE
As you can see, the Date change field for Bob in the data df has been updated with the Date change from the updates df.
What can I try next?
A while back I was dealing with this too. The straight-up .update was giving me issues (sorry, I can't remember the exact issue; I think it was that .update relies on the indexes matching, and they didn't match in my two separate dataframes, so I wanted to use certain columns as my index to update on).
But I made a function to deal with it. This might be overkill for what's needed, but try it and see if it works.
I'm also assuming the date you want from the updates dataframe should be 15/09/14, not 05/09/14, so I have that different in my sample data below.
Also, I'm assuming Identifier is a unique key. If not, you'll need to include multiple columns as your unique key.
import pandas as pd

data = pd.DataFrame([[12233.0, 'Smith', 'Bob', '', 'FT', 'NW'],
                     [54213.0, 'Jones', 'Sally', '15/04/15', 'FT', 'NW'],
                     [12237.0, 'Evans', 'Steve', '26/08/14', 'FT', 'SE'],
                     [10610.0, 'Cooper', 'Amy', '16/08/12', 'FT', 'SE']],
                    columns=['Identifier', 'Surname', 'First names(s)',
                             'Date change', 'Work Pattern', 'Region'])
updates = pd.DataFrame([[12233.0, 'Smith', 'Bob', '15/09/14'],
                        [10610.0, 'Cooper', 'Amy', '16/08/12']],
                       columns=['Identifier', 'Surname', 'First names(s)',
                                'Date change'])

def update(df1, df2, keys_list):
    # Use the key columns as the index so .update aligns rows on them.
    df1 = df1.set_index(keys_list)
    df2 = df2.set_index(keys_list)
    # Warn about duplicate keys, which would make the update ambiguous.
    dup_idx1 = df1.index[df1.index.duplicated()]
    dup_idx2 = df2.index[df2.index.duplicated()]
    if len(dup_idx1) > 0 or len(dup_idx2) > 0:
        print('\n' + '#' * 50 + '\nError! Duplicate indices:')
        for element in dup_idx1:
            print('df1: %s' % (element,))
        for element in dup_idx2:
            print('df2: %s' % (element,))
        print('#' * 50 + '\n\n')
    df1.update(df2, overwrite=True)
    df1.reset_index(inplace=True)
    df2.reset_index(inplace=True)
    return df1

# The 3rd argument is a list, in case you need multiple columns as your unique key.
df = update(data, updates, ['Identifier'])
Output:
print (data)
Identifier Surname First names(s) Date change Work Pattern Region
0 12233.0 Smith Bob FT NW
1 54213.0 Jones Sally 15/04/15 FT NW
2 12237.0 Evans Steve 26/08/14 FT SE
3 10610.0 Cooper Amy 16/08/12 FT SE
print (updates)
Identifier Surname First names(s) Date change
0 12233.0 Smith Bob 15/09/14
1 10610.0 Cooper Amy 16/08/12
df = update(data, updates, ['Identifier'])
print (df)
Identifier Surname First names(s) Date change Work Pattern Region
0 12233.0 Smith Bob 15/09/14 FT NW
1 54213.0 Jones Sally 15/04/15 FT NW
2 12237.0 Evans Steve 26/08/14 FT SE
3 10610.0 Cooper Amy 16/08/12 FT SE
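Since the third argument is a list, a composite key is just a longer list, for example:
df = update(data, updates, ['Identifier', 'Surname'])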
Using DataFrame.update.
First set index:
data.set_index('Identifier', inplace=True)
updates.set_index('Identifier', inplace=True)
Then update:
data.update(updates)
print(data)
Surname First names(s) Date change Work Pattern Region
Identifier
12233.0 Smith Bob 15/09/14 FT NW
54213.0 Jones Sally 15/04/15 FT NW
12237.0 Evans Steve 26/08/14 FT SE
10610.0 Cooper Amy 16/08/12 FT SE
If you need multiple columns to create a unique index you can just set them with a list. For example:
data.set_index(['Identifier', 'Surname'], inplace=True)
updates.set_index(['Identifier', 'Surname'], inplace=True)
data.update(updates)
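One detail worth knowing: DataFrame.update only copies non-NA values from updates, so a missing value in updates will not erase existing data. A small sketch with hypothetical values:
import numpy as np
import pandas as pd

left = pd.DataFrame({'Date change': ['01/01/01', '02/02/02']}, index=[1.0, 2.0])
right = pd.DataFrame({'Date change': [np.nan, '03/03/03']}, index=[1.0, 2.0])
left.update(right)
print(left)   # index 1.0 keeps '01/01/01'; index 2.0 becomes '03/03/03'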
I have two columns in a DataFrame: crewname is a list of the crew members who worked on a film, and Director_loc is the position of the director within that list. I want to create a new column with the name of the director.
crewname Director_loc
[John Lasseter, Joss Whedon, Andrew Stanton, J... 0
[Larry J. Franco, Jonathan Hensleigh, James Ho... 3
[Howard Deutch, Mark Steven Johnson, Mark Stev... 0
[Forest Whitaker, Ronald Bass, Ronald Bass, Ez... 0
[Alan Silvestri, Elliot Davis, Nancy Meyers, N... 5
[Michael Mann, Michael Mann, Art Linson, Micha... 0
[Sydney Pollack, Barbara Benedek, Sydney Polla... 0
[David Loughery, Stephen Sommers, Peter Hewitt... 2
[Peter Hyams, Karen Elise Baldwin, Gene Quinta... 0
[Martin Campbell, Ian Fleming, Jeffrey Caine, ... 0
I've tried a number of approaches using list comprehensions, enumerate, etc. I'm a bit embarrassed to put them here.
Any help will be appreciated.
Use indexing with a list comprehension:
df['name'] = [a[b] for a , b in zip(df['crewname'], df['Director_loc'])]
print (df)
crewname Director_loc \
0 [John Lasseter, Joss Whedon, Andrew Stanton] 2
1 [Larry J. Franco, Jonathan Hensleigh] 1
name
0 Andrew Stanton
1 Jonathan Hensleigh
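An equivalent, though typically slower, alternative is a row-wise apply:
df['name'] = df.apply(lambda row: row['crewname'][row['Director_loc']], axis=1)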
I have two data frames
df1 =
actorID actorName
0 annie_potts Annie Potts
1 bill_farmer Bill Farmer
2 don_rickles Don Rickles
3 erik_von_detten Erik von Detten
4 greg-berg Greg Berg
df2 =
directorID directorName
0 john_lasseter John Lasseter
1 joe_johnston Joe Johnston
2 donald_petrie Donald Petrie
3 forest_whitaker Forest Whitaker
4 charles_shyer Charles Shyer
What I ideally want is a concatenation of these two dataframes, like pd.concat((df1, df2)):
actorID-directorID actorName-directorName
0 annie_potts Annie Potts
1 bill_farmer Bill Farmer
2 don_rickles Don Rickles
3 erik_von_detten Erik von Detten
4 greg-berg Greg Berg
5 john_lasseter John Lasseter
6 joe_johnston Joe Johnston
7 donald_petrie Donald Petrie
8 forest_whitaker Forest Whitaker
9 charles_shyer Charles Shyer
however, I want an easy way to specify that df1.actorName and df2.directorName should be joined into one column, and likewise actorID / directorID. How can I do this?
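A minimal sketch of one way to do this: rename both frames to a shared column schema, then concatenate (the combined column names are just illustrative):
mapping1 = {'actorID': 'actorID-directorID', 'actorName': 'actorName-directorName'}
mapping2 = {'directorID': 'actorID-directorID', 'directorName': 'actorName-directorName'}
combined = pd.concat([df1.rename(columns=mapping1), df2.rename(columns=mapping2)],
                     ignore_index=True)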