Extract certain elements based on element location from another column - python

I have two columns in a DataFrame: crewname is a list of crew members who worked on a film, and Director_loc is the position of the director within that list.
I want to create a new column which has the name of the director.
crewname Director_loc
[John Lasseter, Joss Whedon, Andrew Stanton, J... 0
[Larry J. Franco, Jonathan Hensleigh, James Ho... 3
[Howard Deutch, Mark Steven Johnson, Mark Stev... 0
[Forest Whitaker, Ronald Bass, Ronald Bass, Ez... 0
[Alan Silvestri, Elliot Davis, Nancy Meyers, N... 5
[Michael Mann, Michael Mann, Art Linson, Micha... 0
[Sydney Pollack, Barbara Benedek, Sydney Polla... 0
[David Loughery, Stephen Sommers, Peter Hewitt... 2
[Peter Hyams, Karen Elise Baldwin, Gene Quinta... 0
[Martin Campbell, Ian Fleming, Jeffrey Caine, ... 0
I've tried several approaches using list comprehensions, enumerate, etc., but I'm a bit embarrassed to post them here.
Any help will be appreciated.

Use indexing with list comprehension:
df['name'] = [a[b] for a, b in zip(df['crewname'], df['Director_loc'])]
print(df)
crewname Director_loc \
0 [John Lasseter, Joss Whedon, Andrew Stanton] 2
1 [Larry J. Franco, Jonathan Hensleigh] 1
name
0 Andrew Stanton
1 Jonathan Hensleigh
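For a self-contained check, the same zip-based lookup can be sketched as below. The two-row frame is sample data mirroring the printed output above, not the asker's full frame, and `pick` is a hypothetical helper added here to guard against a position that falls outside the list, which the one-liner does not handle.

```python
import pandas as pd

# Sample data mirroring the printed example (an assumption, not the full frame).
df = pd.DataFrame({
    "crewname": [
        ["John Lasseter", "Joss Whedon", "Andrew Stanton"],
        ["Larry J. Franco", "Jonathan Hensleigh"],
    ],
    "Director_loc": [2, 1],
})

def pick(crew, loc):
    """Return crew[loc], or None when loc is outside the list (hypothetical helper)."""
    return crew[loc] if 0 <= loc < len(crew) else None

# Pair each list with its position and index into it row by row.
df["name"] = [pick(crew, loc) for crew, loc in zip(df["crewname"], df["Director_loc"])]
print(df["name"].tolist())
```

Whether an out-of-range position should yield None or raise is a design choice; returning None keeps the column aligned with the frame.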


Matching two strings columns in "A vs B", then assign label into new column

I have a dataframe that looks like this:
Name F_Name L_Name Title
John Down John Down sth vs Down John
Dave Brown Dave Brown sth v Brown Dave
Mary Sith Mary Sith Sith Mary vs sth
Sam Walker Sam Walker sth vs Sam Walker
Chris Humpy Chris Humpy Humpy
John Hunter John Hunter John Hunter
Nola Smith Nola Smith Nola
Chuck Bass Chuck Bass Bass v sth
Rob Bank Rob Bank Rob v sth
Chris Ham Chris Ham Chris Ham
Angie Poppy Angie Poppy Poppy Angie
Joe Exhaust Joe Exhaust sth vs Joe
: : :
Tony Start Tony Start sth v Start
Tony Start Tony Start sth v james bb
Tony Start Tony Start Dave Sins
I would like to match the Name column against the Title column. If the name appears before v or vs, the new column Label should be first; if it appears after, Label should be second. If the Title column contains no v or vs, or the name is not found on either side, Label should be null.
Here is what the output dataframe would look like:
Name F_Name L_Name Title Label
John Down John Down sth vs Down John second
Dave Brown Dave Brown sth v Brown Dave second
Mary Sith Mary Sith Sith Mary vs sth first
Sam Walker Sam Walker sth vs Sam Walker second
Chris Humpy Chris Humpy Humpy null
John Hunter John Hunter John Hunter null
Nola Smith Nola Smith Nola null
Chuck Bass Chuck Bass Bass v sth first
Rob Bank Rob Bank Rob v sth first
Chris Ham Chris Ham Chris Ham null
Angie Poppy Angie Poppy Poppy Angie null
Joe Exhaust Joe Exhaust sth vs Joe second
: : : :
Tony Start Tony Start sth v Start second
Tony Start Tony Start sth v james bb null
Tony Start Tony Start Dave Sins null
I am thinking of splitting the Title column on v or vs into two new columns and then matching them against the Name column, but I do not know how to add the condition that checks whether the name appears before the v or vs. Is there a better way to do this without splitting the Title column?
The idea is to split Title on v or vs, split each part on whitespace and convert it to a set, then test whether each set is a subset of the words in Name. A second condition uses Series.str.contains to flag titles with no v or vs, and all conditions are passed to numpy.select:
import numpy as np

# convert the Name column, split by spaces, to sets
names = df['Name'].str.split().apply(set)
# split Title on "v" or "vs" and convert both parts to sets; use an empty set for missing values
df1 = (df['Title'].str.split(r'\s+vs?\s+', expand=True)
         .apply(lambda x: x.str.split())
         .applymap(lambda x: set(x) if isinstance(x, list) else set()))
# test whether each part is a subset of the Name words
m11 = [label.issubset(name) for label, name in zip(df1[0], names)]
m12 = [label.issubset(name) for label, name in zip(df1[1], names)]
# test for titles with no "v" or "vs"
m2 = ~df['Title'].str.contains(r'\s+vs?\s+')
# set values
df['Label'] = np.select([m2, m11, m12], [None, 'first', 'second'], None)
print (df)
Name F_Name L_Name Title Label
0 John Down John Down sth vs Down John second
1 Dave Brown Dave Brown sth v Brown Dave second
2 Mary Sith Mary Sith Sith Mary vs sth first
3 Sam Walker Sam Walker sth vs Sam Walker second
4 Chris Humpy Chris Humpy Humpy None
5 John Hunter John Hunter John Hunter None
6 Nola Smith Nola Smith Nola None
7 Chuck Bass Chuck Bass Bass v sth first
8 Rob Bank Rob Bank Rob v sth first
9 Chris Ham Chris Ham Chris Ham None
10 Angie Poppy Angie Poppy Poppy Angie None
11 Joe Exhaust Joe Exhaust sth vs Joe second
12 Tony Start Tony Start sth v Start second
13 Tony Start Tony Start sth v james bb None
14 Tony Start Tony Start Dave Sins None
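The same rule can also be written as a plain row-wise function, which may be easier to read and test than the mask-based version. This is a sketch of a different technique (a Python helper applied per row, not numpy.select); the function name `label` and the five-row sample frame are assumptions chosen for illustration.

```python
import re
import pandas as pd

def label(name, title):
    """Return 'first'/'second' if the name words sit before/after v or vs, else None."""
    parts = re.split(r"\s+vs?\s+", title, maxsplit=1)
    if len(parts) != 2:              # no "v"/"vs" separator in the title
        return None
    name_words = set(name.split())
    before, after = (set(part.split()) for part in parts)
    if before and before <= name_words:
        return "first"
    if after and after <= name_words:
        return "second"
    return None

# A few rows taken from the question's sample data.
df = pd.DataFrame({
    "Name": ["John Down", "Mary Sith", "Chris Humpy", "Rob Bank", "Tony Start"],
    "Title": ["sth vs Down John", "Sith Mary vs sth", "Humpy", "Rob v sth", "sth v james bb"],
})
df["Label"] = [label(n, t) for n, t in zip(df["Name"], df["Title"])]
print(df)
```

The set comparison ignores word order, so "Down John" still matches "John Down", as in the answer above.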

Relationship based on time

I am trying to create a relationship between two data frames that are related, but there is no key that creates a relationship. Here is the layout of my problem:
The first data frame that I am using is information about when people entered an amusement park. In this amusement park, people can stay at the park for multiple days. So the structure of this data frame is
id  name            date
0   John Smith      07-01-2020 10:13:24
1   John Smith      07-22-2020 09:47:04
4   Jane Doe        07-22-2020 09:47:04
2   Jane Doe        06-13-2020 13:27:53
3   Thomas Wallace  07-08-2020 11:15:28
So people may visit the park once, or multiple times (assume that name is a unique identifier for people). For the other data frame, the data is what rides they went on during their time at the park. So the structure of this data frame is
name            ride          date
John Smith      Insanity      07-01-2020 13:53:07
John Smith      Bumper Cars   07-01-2020 16:37:29
John Smith      Tilt-A-Whirl  07-02-2020 08:21:18
John Smith      Insanity      07-22-2020 11:44:32
Jane Doe        Bumper Cars   06-13-2020 14:14:41
Jane Doe        Teacups       06-13-2020 17:31:56
Thomas Wallace  Insanity      07-08-2020 13:20:23
With these two data frames, I want to get the id of the visit associated with the rides that they went on during that visit. So the desired output in this example would be
id  name            ride          date
0   John Smith      Insanity      07-01-2020 13:53:07
0   John Smith      Bumper Cars   07-01-2020 16:37:29
0   John Smith      Tilt-A-Whirl  07-02-2020 08:21:18
1   John Smith      Insanity      07-22-2020 11:44:32
2   Jane Doe        Bumper Cars   06-13-2020 14:14:41
2   Jane Doe        Teacups       06-13-2020 17:31:56
3   Thomas Wallace  Insanity      07-08-2020 13:20:23
My approach was to iterate over the visits and assign the visit id to a ride if the name matched, the ride occurred at or after the start of the visit, and the time delta was the smallest seen so far (starting from a large initial time delta and updating it whenever a smaller difference is found). If those conditions are not met, the existing value is kept. With this process in mind, here is my attempt in code:
rides['min_diff'] = pd.to_timedelta(365, unit='day')
rides['id'] = -1
for index, row in visits.iterrows():
    rides['id'], rides['min_diff'] = np.where((rides['name'] == row['name']) & (
        rides['date'] >= visits['date']) & (
        (rides['date'] - row['date']) < rides['min_diff']),
        (row['id'], rides['date'] - row['date']),
        (rides['id'], rides['min_diff']))
This unfortunately does not execute because of the shapes not matching (as well as trying to assign values across multiple columns, which I am not sure how to do), but this is the general idea. I am not sure how this could be accomplished exactly, so if anyone has a solution, I would appreciate it.
Try apply() with asof(), where df1 is the visits frame and df2 is the rides frame:
df1 = df1.set_index("date").sort_index() #asof requires a sorted index
df2["id"] = df2.apply(lambda x: df1[df1["Name"]==x["Name"]]["id"].asof(x["date"]), axis=1)
>>> df2
Name ride date id
0 John Smith Insanity 2020-07-01 13:53:07 0
1 John Smith Bumper Cars 2020-07-01 16:37:29 0
2 John Smith Tilt-A-Whirl 2020-07-02 08:21:18 0
3 John Smith Insanity 2020-07-22 11:44:32 1
4 Jane Doe Bumper Cars 2020-06-13 14:14:41 2
5 Jane Doe Teacups 2020-06-13 17:31:56 2
6 Thomas Wallace Insanity 2020-07-08 13:20:23 3
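A vectorized alternative to the row-wise apply is pandas.merge_asof, which matches each ride to the most recent visit at or before it within the same name. This is a sketch of that different technique, not the answerer's method; the frame names `visits` and `rides` and the lowercase column names follow the question, and the date format string is an assumption based on the sample values.

```python
import pandas as pd

fmt = "%m-%d-%Y %H:%M:%S"  # assumed from the sample dates in the question

visits = pd.DataFrame({
    "id": [0, 1, 4, 2, 3],
    "name": ["John Smith", "John Smith", "Jane Doe", "Jane Doe", "Thomas Wallace"],
    "date": pd.to_datetime(
        ["07-01-2020 10:13:24", "07-22-2020 09:47:04", "07-22-2020 09:47:04",
         "06-13-2020 13:27:53", "07-08-2020 11:15:28"], format=fmt),
})
rides = pd.DataFrame({
    "name": ["John Smith"] * 4 + ["Jane Doe"] * 2 + ["Thomas Wallace"],
    "ride": ["Insanity", "Bumper Cars", "Tilt-A-Whirl", "Insanity",
             "Bumper Cars", "Teacups", "Insanity"],
    "date": pd.to_datetime(
        ["07-01-2020 13:53:07", "07-01-2020 16:37:29", "07-02-2020 08:21:18",
         "07-22-2020 11:44:32", "06-13-2020 14:14:41", "06-13-2020 17:31:56",
         "07-08-2020 13:20:23"], format=fmt),
})

# merge_asof requires both frames to be sorted by the "on" key.
result = pd.merge_asof(
    rides.sort_values("date"),
    visits.sort_values("date"),
    on="date",             # match on time...
    by="name",             # ...but only within the same person
    direction="backward",  # take the latest visit at or before the ride
)
print(result[["id", "name", "ride", "date"]])
```

The result is ordered by date rather than by the original row order of rides; a ride with no earlier visit for that name would get a missing id.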
This gets close to what you need. The ids aren't in the order you specified: they number each (name, calendar day) group, so a visit that spans two days receives two ids, as John Smith's Tilt-A-Whirl row in the output shows.
merged = pd.merge(df1, df2, how="right", left_on=['date', 'name'], right_on=['name', 'ride'])[['name_y', 'ride', 'date_y']]
merged['ymd'] = pd.to_datetime(merged.date_y).apply(lambda x: x.strftime('%Y-%m-%d'))
merged['id'] = merged.groupby(['name_y', 'ymd']).ngroup()
merged.drop('ymd', axis=1, inplace=True)
merged.columns = ['name', 'ride', 'date', 'id']
merged.sort_values(by='id', inplace=True)
print(merged)
OUT:
name ride date id
4 Jane Doe Bumper Cars 06-13-2020 14:14:41 0
5 Jane Doe Teacups 06-13-2020 17:31:56 0
0 John Smith Insanity 07-01-2020 13:53:07 1
1 John Smith Bumper Cars 07-01-2020 16:37:29 1
2 John Smith Tilt-A-Whirl 07-02-2020 08:21:18 2
3 John Smith Insanity 07-22-2020 11:44:32 3
6 Thomas Wallace Insanity 07-08-2020 13:20:23 4

How to slice pandas column with index list?

I'm trying to extract the first two words from a string column in a DataFrame:
df["Name"]
Name
Anthony Frank Hawk
John Rodney Mullen
Robert Dean Silva Burnquis
Geoffrey Joseph Rowley
To get the index of the second space (" "), I tried this, but find returns NaN instead of the number of characters up to the second space.
df["temp"] = df["Name"].str.find(" ")+1
df["temp"] = df["Status"].str.find(" ", start=df["Status"], end=None)
df["temp"]
0 NaN
1 NaN
2 NaN
3 NaN
The last step is to slice those names. I tried this code, but it doesn't work either.
df["Status"] = df["Status"].str.slice(0,df["temp"])
df["Status"]
0 NaN
1 NaN
2 NaN
3 NaN
expected return
0 Anthony Frank
1 John Rodney
2 Robert Dean
3 Geoffrey Joseph
If you have a more efficient way to do this, please let me know.
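str.rpartition splits on the last space by default, so column 0 holds everything before the last word:
df['temp'] = df.Name.str.rpartition().get(0)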
df
Output
Name temp
0 Anthony Frank Hawk Anthony Frank
1 John Rodney Mullen John Rodney
2 Robert Dean Silva Burnquis Robert Dean Silva
3 Geoffrey Joseph Rowley Geoffrey Joseph
EDIT
If only the first two words are required in the output:
df['temp'] = df.Name.str.split().str[:2].str.join(' ')
df
OR
df['temp'] = df.Name.str.split().apply(lambda x:' '.join(x[:2]))
df
OR
df['temp'] = df.Name.str.split().apply(lambda x:' '.join([x[0], x[1]]))
df
Output
Name temp
0 Anthony Frank Hawk Anthony Frank
1 John Rodney Mullen John Rodney
2 Robert Dean Silva Burnquis Robert Dean
3 Geoffrey Joseph Rowley Geoffrey Joseph
You can use str.index(substring) instead of str.find; it returns the lowest index at which the substring (such as " ", a space) is found in the string. You can then split the string at that index and repeat the search on the second part of the result.
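The index-based idea can be sketched in plain Python as follows. The helper name `first_two_words` is hypothetical, and the sketch uses str.find rather than str.index because find returns -1 when there is no space, whereas index raises ValueError. It assumes words are separated by single spaces.

```python
def first_two_words(text):
    """Return the text up to (not including) the second space, if there is one."""
    first = text.find(" ")
    if first == -1:                      # single word: nothing to cut
        return text
    second = text.find(" ", first + 1)   # resume the search after the first space
    return text if second == -1 else text[:second]

names = [
    "Anthony Frank Hawk",
    "John Rodney Mullen",
    "Robert Dean Silva Burnquis",
    "Geoffrey Joseph Rowley",
]
print([first_two_words(n) for n in names])
```

On a DataFrame this could be applied with df["Name"].apply(first_two_words), though the split-based one-liners above are shorter.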

Create a count matrix of actor names from movies

I have a dataframe with 2 columns i.e. UserId in integer format and Actors in string format as shown below:
Userid Actors
u1 Tony Ward,Bruce LaBruce,Kevin P. Scott,Ivar Johnson, Naomi Watts, Tony Ward,.......
u2 Tony Ward,Bruce LaBruce,Kevin P. Scott, Luke Wilson, Owen Wilson, Lumi Cavazos,......
It represents actors from all movies watched by a particular user of the platform
I want an output where we have the count of each actor for each user as shown below:
UserId Tony Ward Bruce LaBruce Kevin P. Scott Ivar Johnson Luke Wilson Owen Wilson Lumi Cavazos
u1 2 1 1 1 0 0 0
u2 1 1 1 0 1 1 1
I reckon it is something similar to CountVectorizer, but I just have nouns here.
Any help would be appreciated.
Assuming it's a pandas.DataFrame, try this: DataFrame.explode transforms each element of a list-like (the result of the split) into a row, groupby with size counts each (Userid, Actors) pair, and unstack reshapes the counts into the required format.
# normalise ", " to "," and split each string into a list of names
df['Actors'] = df['Actors'].str.replace(r",\s+", ",", regex=True).str.split(",")
(
    df.explode('Actors')
      .groupby(['Userid', 'Actors']).size()
      .unstack(fill_value=0)
)
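As a runnable illustration, the count matrix can also be built with pandas.crosstab after exploding the names. This is a sketch of an alternative technique under stated assumptions: the two rows contain only the names visible in the question (the elided trailing entries are left out), and the variable names `exploded` and `counts` are chosen here for illustration.

```python
import pandas as pd

# Only the names visible in the question; the elided entries are omitted.
df = pd.DataFrame({
    "Userid": ["u1", "u2"],
    "Actors": [
        "Tony Ward,Bruce LaBruce,Kevin P. Scott,Ivar Johnson, Naomi Watts, Tony Ward",
        "Tony Ward,Bruce LaBruce,Kevin P. Scott, Luke Wilson, Owen Wilson, Lumi Cavazos",
    ],
})

# One row per (user, actor) occurrence, with stray spaces removed.
exploded = df.assign(Actors=df["Actors"].str.split(",")).explode("Actors", ignore_index=True)
exploded["Actors"] = exploded["Actors"].str.strip()

# Rows are users, columns are actors, cells are occurrence counts (0 when absent).
counts = pd.crosstab(exploded["Userid"], exploded["Actors"])
print(counts)
```

crosstab fills missing combinations with 0 by itself, so no separate fillna step is needed.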

Is there a more computationally efficient way to find the first occurrence matching a regular expression using Pandas?

Is there a more computationally efficient way in Pandas to get to the final output below? I only want the first occurrence, and it seems computationally inefficient to findall and then get the 0th element of the list, as below:
Input:
s= pd.Series(["David Matt Juan Peter David James",
"Scott David Peter Sam David Ron",
"Dan Phil David Sam Pedro David Mani"])
s_find= s.str.findall(r'David [A-Za-z]*')
print(s_find)
Output:
0 [David Matt, David James]
1 [David Peter, David Ron]
2 [David Sam, David Mani]
Input:
s_find= s_find.str[0]
print(s_find)
Output:
0 David Matt
1 David Peter
2 David Sam
You can use str.extract to only take the first match:
s.str.extract(r'(David [A-Za-z]*)', expand=False)
This returns:
0 David Matt
1 David Peter
2 David Sam
dtype: object
Or, avoiding pandas str methods, you can use a list comprehension:
import re
pd.Series([re.search(r'(David [A-Za-z]*)', i).group() for i in s.values])
0 David Matt
1 David Peter
2 David Sam
dtype: object
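Note that re.search returns None when a string has no match, so calling .group() directly would raise AttributeError on such rows. A minimal sketch of a guarded version follows; the helper name `first_match` and the extra "No match here" row are assumptions added to show the no-match case.

```python
import re
import pandas as pd

s = pd.Series([
    "David Matt Juan Peter David James",
    "Scott David Peter Sam David Ron",
    "Dan Phil David Sam Pedro David Mani",
    "No match here",  # extra row added to illustrate a miss
])

# Compile once so the pattern is not re-parsed for every row.
pattern = re.compile(r"David [A-Za-z]*")

def first_match(text):
    """Return the first match of the pattern in text, or None if there is none."""
    m = pattern.search(text)
    return m.group() if m else None

first = pd.Series([first_match(t) for t in s], index=s.index)
print(first)
```

Because search stops at the first hit, this avoids building the full list that findall produces for each row.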
