Parse movie text data into a dataframe - python

I have some .txt data from a movie script that looks like this.
JOHN
Hi man. How are you?
TOM
A little hungry but okay.
JOHN
Let's get breakfast then
I'd like to parse out the text and create a dataframe with 2 columns: one for the person, e.g. JOHN and TOM, and a second column for the lines (the block of text below each name). The result would be like:
index | person | lines
0 | JOHN | "Hi man. How are you?"
1 | TOM | "A little hungry but okay."
2 | JOHN | "Let's get breakfast then"

I know I'm late to this party, but this will parse an entire script into a dictionary with character names as keys and their dialogue as values; all you need to do then is df = pd.DataFrame(final_dict.values(), columns = final_dict.keys())
import re

# Grouped regex pattern to capture character and dialogue in a tuple
char_dialogue = re.compile(r"(?m)^\s*\b([A-Z]+)\b\s*\n(.*(?:\n.+)*)")
extract_dialogue = char_dialogue.findall(script)

final_dict = {}
for element in extract_dialogue:
    # Separating the character and dialogue from the tuple
    char = element[0]
    line = element[1]
    # If the char is already a key in the dictionary
    # and line is not empty, append the dialogue to the value list
    if char in final_dict:
        if line != '':
            final_dict[char].append(line)
    else:
        # Else add the character name to the dictionary keys with their first line
        # Drop any lower-case matches from group 0
        # Can adjust the len here if you have characters with fewer letters
        if char.isupper() and len(char) > 2:
            final_dict[char] = [line]

# Some final cleaning to drop empty dialogue
final_dict = {k: v for k, v in final_dict.items() if v != ['']}
# More filtering to return only main characters with more than 50
# lines of dialogue
final_dict = {k: v for k, v in final_dict.items() if len(v) > 50}
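If you want the row-per-line layout from the question instead of one list per character, the same idea can be adapted. A sketch, using a slightly different pattern than the one above (an assumption on my part: since the question's sample has no blank lines between blocks, dialogue here is taken to be any line that is not itself an all-caps name):

```python
import re
import pandas as pd

script = """JOHN
Hi man. How are you?
TOM
A little hungry but okay.
JOHN
Let's get breakfast then"""

# A name is a line of 2+ capital letters; its dialogue is every
# following line that is not itself an all-caps name
pattern = re.compile(r"(?m)^([A-Z]{2,})\s*\n((?:(?![A-Z]{2,}\s*$).*\n?)+)")

rows = [(char, block.strip()) for char, block in pattern.findall(script)]
df = pd.DataFrame(rows, columns=["person", "lines"])
print(df)
```

This keeps each (name, dialogue) pair as one row, so repeated speakers like JOHN appear once per block, matching the desired output.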

If each piece of text is on a single line, then you can split into lines
lines = text.split('\n')
remove surrounding spaces
lines = [x.strip() for x in lines]
and use a slice [start:end:step] to create the dataframe
df = pd.DataFrame({
    'person': lines[0::2],
    'lines': lines[1::2]
})
Example:
text = ''' JOHN
Hi man. How are you?
TOM
A little hungry but okay.
JOHN
Let's get breakfast then'''
lines = text.split('\n')
lines = [x.strip() for x in lines]
import pandas as pd
df = pd.DataFrame({
    'person': lines[0::2],
    'lines': lines[1::2]
})
print(df)
Result:
person lines
0 JOHN Hi man. How are you?
1 TOM A little hungry but okay.
2 JOHN Let's get breakfast then
If a person may have text spanning many lines, i.e.
JOHN
Hi man.
How are you?
then it needs more splitting and stripping.
You can do it before creating the DataFrame.
text = ''' JOHN
Hi man.
How are you?

TOM
A little hungry but okay.

JOHN
Let's get breakfast then'''

data = []
parts = text.split('\n\n')
for part in parts:
    person, lines = part.split('\n', 1)
    person = person.strip()
    lines = "\n".join(x.strip() for x in lines.split('\n'))
    data.append([person, lines])
import pandas as pd
df = pd.DataFrame(data)
df.columns = ['person', 'lines']
print(df)
Or you can try to do it after creating the DataFrame
text = ''' JOHN
Hi man.
How are you?

TOM
A little hungry but okay.

JOHN
Let's get breakfast then'''

lines = text.split('\n\n')
lines = [x.split('\n', 1) for x in lines]
import pandas as pd
df = pd.DataFrame(lines)
df.columns = ['person', 'lines']
df['person'] = df['person'].str.strip()
df['lines'] = df['lines'].apply(lambda txt: "\n".join(x.strip() for x in txt.split('\n')))
print(df)
Result:
person lines
0 JOHN Hi man.\nHow are you?
1 TOM A little hungry but okay.
2 JOHN Let's get breakfast then

Search DataFrame column for words in list

I am trying to create a new DataFrame column that contains words that match between a list of keywords and strings in a df column...
data = {
'Sandwich Opinions':['Roast beef is overrated','Toasted bread is always best','Hot sandwiches are better than cold']
}
df = pd.DataFrame(data)
keywords = ['bread', 'bologna', 'toast', 'sandwich']
df['Matches'] = df.apply(lambda x: ' '.join([i for i in df['Sandwich Opinions'].str.split() if i in keywords]), axis=1)
This seems like it should do the job but it's getting stuck in endless processing.
import numpy as np

for kw in keywords:
    df[kw] = np.where(df['Sandwich Opinions'].str.contains(kw), 1, 0)

def add_contain_row(row):
    contains = []
    for kw in keywords:
        if row[kw] == 1:
            contains.append(kw)
    return contains

df['contains'] = df.apply(add_contain_row, axis=1)
# if you want to drop the temp columns
df.drop(columns=keywords, inplace=True)
Create a regex pattern from your list of words:
import re
pattern = fr"\b({'|'.join(re.escape(k) for k in keywords)})\b"
df['contains'] = df['Sandwich Opinions'].str.extract(pattern, re.IGNORECASE)
Output:
>>> df
Sandwich Opinions contains
0 Roast beef is overrated NaN
1 Toasted bread is always best bread
2 Hot sandwiches are better than cold NaN
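str.extract only returns the first keyword per row. If you want every matching keyword joined into one string, as the question's ' '.join attempt suggests, one option is str.extractall with a groupby (a sketch on the same toy data):

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Sandwich Opinions': ['Roast beef is overrated',
                          'Toasted bread is always best',
                          'Hot sandwiches are better than cold']
})
keywords = ['bread', 'bologna', 'toast', 'sandwich']
pattern = fr"\b({'|'.join(re.escape(k) for k in keywords)})\b"

# extractall returns one row per match; join the matches back per original row
matches = df['Sandwich Opinions'].str.extractall(pattern, flags=re.IGNORECASE)[0]
df['Matches'] = matches.groupby(level=0).agg(' '.join)
print(df)
```

Rows with no match stay NaN, same as with str.extract; the \b word boundaries mean 'Toasted' and 'sandwiches' do not count as 'toast' or 'sandwich'.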

Duplicates in a cell str in data frame

I have a sample df
id | Another header
1 | JohnWalter walter
2 | AdamSmith Smith
3 | Steve Rogers rogers
How can I find whether it is duplicated in every row and pop it out?
id | Name | poped_out_string | corrected_name
1 | JohnWalter walter | walter | John walter
2 | AdamSmith Smith | Smith | Adam Smith
3 | Steve Rogers rogers | rogers | Steve Rogers
You could try something like below:
import re  # Import to help efficiently find duplicates

# Get unique items from a list, and the duplicates
def uniqueItems(input):
    seen = set()
    uniq = []
    dups = []
    for x in input:
        xCapitalize = x.capitalize()
        if x in seen and x not in dups:
            dups.append(x)
        if x not in seen:
            uniq.append(xCapitalize)
            seen.add(x)
    return {"unique": uniq, "duplicates": dups}

# Split our strings
def splitString(inputString):
    stringProcess = re.sub(r"([A-Z])", r" \1", inputString).split()
    if len(stringProcess) > 1:  # If there are multiple items in a cell after splitting
        convertToLower = [x.lower() for x in stringProcess]  # convert all strings to lower case for easy comparison
        result = uniqueItems(convertToLower)
    else:
        result = {"unique": [inputString], "duplicates": []}
    return result

# Iterate over rows in the data frame
for i, row in df.iterrows():
    split_result = splitString(row['Name'])
    if len(split_result["duplicates"]) > 0:  # If there are duplicates
        df.at[i, "poped_out_string"] = " ".join(split_result["duplicates"])
    if len(split_result["unique"]) > 1:
        df.at[i, "corrected_name"] = " ".join(split_result["unique"])
The general idea is to iterate over each row, split the string in the "Name" column, check for duplicates, and then write those values back into the data frame.
import re
import pandas as pd
df = pd.DataFrame(['JohnWalter walter brown', 'winter AdamSmith Smith', 'Steve Rogers rogers'], columns=['Name'])
df
Name
0 JohnWalter walter brown
1 winter AdamSmith Smith
2 Steve Rogers rogers
def remove_dups(string):
    # first find names that start with a simple/capital letter, having one or
    # more characters excluding spaces and upper-case letters
    names = re.findall('[a-zA-Z][^A-Z ]*', string)
    # then use a new list to collect non-duplicates (a set can't be used as it
    # doesn't preserve the order of the names)
    new_names = []
    # capitalize and append names if they are not already added
    for name in names:
        if name.capitalize() not in new_names:
            new_names.append(name.capitalize())
    # finally construct the full name and return
    return ' '.join(new_names)
df.Name.apply(remove_dups)
0 John Walter Brown
1 Winter Adam Smith
2 Steve Rogers
Name: Name, dtype: object
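The question also asks for the popped-out duplicate in its own column. A sketch extending the same splitting idea to return both columns at once (note this capitalizes every part, so it yields 'John Walter' rather than the question's 'John walter'; split_and_dedup is a hypothetical helper name):

```python
import re
import pandas as pd

df = pd.DataFrame(['JohnWalter walter', 'AdamSmith Smith', 'Steve Rogers rogers'],
                  columns=['Name'])

def split_and_dedup(string):
    # Split on capital letters, then separate unique parts from duplicates
    names = re.findall('[a-zA-Z][^A-Z ]*', string)
    uniq, dups = [], []
    for name in names:
        cap = name.capitalize()
        (dups if cap in uniq else uniq).append(cap)
    return pd.Series({'poped_out_string': ' '.join(dups),
                      'corrected_name': ' '.join(uniq)})

# apply returns a two-column frame, joined back onto the original by index
df = df.join(df['Name'].apply(split_and_dedup))
print(df)
```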

If text is contained in another dataframe then flag row with a binary designation

I'm working on mining survey data. I was able to flag the rows for certain keywords:
survey['Rude'] = survey['Comment Text'].str.contains('rude', na=False, regex=True).astype(int)
Now, I want to flag any rows containing names. I have another dataframe that contains common US names.
Here's what I thought would work, but it is not flagging any rows, and I have validated that names do exist in the 'Comment Text'
for row in survey:
    for word in survey['Comment Text']:
        survey['Name'] = 0
        if word in names['Name']:
            survey['Name'] = 1
You are not looping through the series correctly. for row in survey: loops through the column names in survey. for word in survey['Comment Text']: loops through the comment strings. survey['Name'] = 0 creates a column of all 0s.
You could use set intersections and apply(), to avoid all the looping through rows:
survey = pd.DataFrame({'Comment_Text': ['Hi rcriii',
                                        'Hi yourself stranger',
                                        'say hi to Justin for me']})
names = pd.DataFrame({'Name': ['rcriii', 'Justin', 'Susan', 'murgatroyd']})

s2 = set(names['Name'])

def is_there_a_name(s):
    s1 = set(s.split())
    if len(s1.intersection(s2)) > 0:
        return 1
    else:
        return 0

survey['Name'] = survey['Comment_Text'].apply(is_there_a_name)
print(names)
print(survey)
print(names)
print(survey)
Name
0 rcriii
1 Justin
2 Susan
3 murgatroyd
Comment_Text Name
0 Hi rcriii 1
1 Hi yourself stranger 0
2 say hi to Justin for me 1
As a bonus, return len(s1.intersection(s2)) to get the number of matches per line.
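Following that bonus note, the count version is a one-line change (a sketch; the third comment is altered here so a row with two name matches appears):

```python
import pandas as pd

survey = pd.DataFrame({'Comment_Text': ['Hi rcriii',
                                        'Hi yourself stranger',
                                        'say hi to Justin and Susan']})
names = pd.DataFrame({'Name': ['rcriii', 'Justin', 'Susan', 'murgatroyd']})
name_set = set(names['Name'])

# Count how many known names appear in each comment
survey['name_count'] = survey['Comment_Text'].apply(
    lambda s: len(set(s.split()) & name_set))
print(survey)
```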

str.contains only and exact value

I have the following list :
personnages = ['Stanley','Kevin', 'Franck']
I want to use str.contains function to create a new pandas dataframe df3 :
df3 = df2[df2['speaker'].str.contains('|'.join(personnages))]
However, if the row of the column speaker contains 'Stanley & Kevin', I don't want it in df3.
How can I improve my code to do this?
Here what I would do:
# toy data
df = pd.DataFrame({'speaker':['Stanley & Kevin', 'Everybody',
'Kevin speaks', 'The speaker is Franck', 'Nobody']})
personnages = ['Stanley','Kevin', 'Franck']
pattern = '|'.join(personnages)
s = (df['speaker'].str
.extractall(f'({pattern})') # extract all personnages
.groupby(level=0)[0] # group by df's row
.nunique().eq(1) # count the unique number
)
df.loc[s.index[s]]
Output:
speaker
2 Kevin speaks
3 The speaker is Franck
You'll want to denote line start and end in your regex, that way it only contains the single name:
import pandas as pd
speakers = ['Stanley', 'Kevin', 'Frank', 'Kevin & Frank']
df = pd.DataFrame([{'speaker': speaker} for speaker in speakers])
speaker
0 Stanley
1 Kevin
2 Frank
3 Kevin & Frank
r = '|'.join(speakers[:-1]) # gets all but the last one for the sake of example
# the ^ marks start of string, and $ is the end
df[df['speaker'].str.contains(f'^({r})$')]
speaker
0 Stanley
1 Kevin
2 Frank
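If "contains" really means the cell is exactly one of the names, with no extra words at all, Series.isin is simpler than any regex. Note this is stricter than the extractall answer above, which also keeps rows like 'Kevin speaks':

```python
import pandas as pd

df = pd.DataFrame({'speaker': ['Stanley & Kevin', 'Everybody',
                               'Kevin', 'Franck', 'Nobody']})
personnages = ['Stanley', 'Kevin', 'Franck']

# Exact equality against the list: no escaping or anchors needed
df3 = df[df['speaker'].isin(personnages)]
print(df3)
```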

i am trying to split a full name to first middle and last name in pandas but i am stuck at replace

I am trying to break the name into parts, keeping the first name and last name, and finally replacing the common part in all of them, so that the first name comes first, then the last name, and then if a middle name remains it is added to its own column.
df['owner1_first_name'] = df['owner1_name'].str.split().str[0].astype(str, errors='ignore')
df['owner1_last_name'] = df['owner1_name'].str.split().str[-1].str.replace(df['owner1_first_name'], "").astype(str, errors='ignore')
df['owner1_middle_name'] = df['owner1_name'].str.replace(df['owner1_first_name'], "").str.replace(df['owner1_last_name'], "").astype(str, errors='ignore')
the problem is I am not able to use
.str.replace(df['owner1_name'], "")
as i am getting an error
"TypeError: 'Series' objects are mutable, thus they cannot be hashed"
Is there any replacement syntax in pandas for what I am trying to achieve?
my desired output is
full name = THOMAS MARY D which is in column owner1_name
I want
owner1_first_name = THOMAS
owner1_middle_name = MARY
owner1_last_name = D
I think you need mask, which replaces values with empty strings if the same values appear in both columns:
df = pd.DataFrame({'owner1_name':['THOMAS MARY D', 'JOE Long', 'MARY Small']})
splitted = df['owner1_name'].str.split()
df['owner1_first_name'] = splitted.str[0]
df['owner1_last_name'] = splitted.str[-1]
df['owner1_middle_name'] = splitted.str[1]
df['owner1_middle_name'] = df['owner1_middle_name'].mask(df['owner1_middle_name'] == df['owner1_last_name'], '')
print (df)
owner1_name owner1_first_name owner1_last_name owner1_middle_name
0 THOMAS MARY D THOMAS D MARY
1 JOE Long JOE Long
2 MARY Small MARY Small
Which is the same as:
splitted = df['owner1_name'].str.split()
df['owner1_first_name'] = splitted.str[0]
df['owner1_last_name'] = splitted.str[-1]
middle = splitted.str[1]
df['owner1_middle_name'] = middle.mask(middle == df['owner1_last_name'], '')
print (df)
owner1_name owner1_first_name owner1_last_name owner1_middle_name
0 THOMAS MARY D THOMAS D MARY
1 JOE Long JOE Long
2 MARY Small MARY Small
EDIT:
To replace by rows, it is possible to use apply with axis=1:
df = pd.DataFrame({'owner1_name':['THOMAS MARY-THOMAS', 'JOE LongJOE', 'MARY Small']})
splitted = df['owner1_name'].str.split()
df['a'] = splitted.str[0]
df['b'] = splitted.str[-1]
df['c'] = df.apply(lambda x: x['b'].replace(x['a'], ''), axis=1)
print (df)
owner1_name a b c
0 THOMAS MARY-THOMAS THOMAS MARY-THOMAS MARY-
1 JOE LongJOE JOE LongJOE Long
2 MARY Small MARY Small Small
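A hedged alternative to the mask approach above: str.split(expand=True) spreads the parts into columns, and the missing middle names can be filled in from there (a sketch on the same data; it assumes at most three name parts):

```python
import pandas as pd

df = pd.DataFrame({'owner1_name': ['THOMAS MARY D', 'JOE Long', 'MARY Small']})

# expand=True spreads the split parts into columns 0, 1, 2;
# two-word names get None in column 2
parts = df['owner1_name'].str.split(expand=True)
df['owner1_first_name'] = parts[0]
# middle name exists only when there is a third part
df['owner1_middle_name'] = parts[1].where(parts[2].notna(), '')
df['owner1_last_name'] = parts[2].fillna(parts[1])
print(df)
```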
The exact code, in three lines, to achieve what I wanted in my question is
df['owner1_first_name'] = df['owner1_name'].str.split().str[0]
df['owner1_last_name'] = df.apply(lambda x: x['owner1_name'].split()[-1].replace(x['owner1_first_name'], ''), axis=1)
df['owner1_middle_name'] = df.apply(lambda x: x['owner1_name'].replace(x['owner1_first_name'], '').replace(x['owner1_last_name'], ''), axis=1)
Just change your assignment and use another variable:
split = df['owner1_name'].str.split()
df['owner1_first_name'] = split.str[0]
df['owner1_middle_name'] = split.str[1]
df['owner1_last_name'] = split.str[-1]
splitted = df['Contact_Name'].str.split()
df['First_Name'] = splitted.str[0]
df['Last_Name'] = splitted.str[-1]
df['Middle_Name'] = df['Contact_Name'].loc[df['Contact_Name'].str.split().str.len() == 3].str.split(expand=True)[1]
This might help! The key part here is rightly inserting the middle name, which the last line does by splitting only the names that have three parts.
I like to use the expand parameter. It will return a new dataframe with columns named 0, 1, 2. You can rename them in one line:
col_names = ['owner1_first_name', 'owner1_middle_name', 'owner1_last_name']
df['owner1_name'].str.split(expand=True).rename(columns=dict(enumerate(col_names)))
Beware that this code breaks if someone has four names. Better to do it in 2 steps: split(n=1, expand=True) and then rsplit(n=1, expand=True).
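A runnable sketch of that two-step idea (in current pandas the splitting keyword is expand), which tolerates rows with only two names:

```python
import pandas as pd

df = pd.DataFrame({'owner1_name': ['THOMAS MARY D', 'JOE Long']})

# First split off the first name, then rsplit off the last name
first_rest = df['owner1_name'].str.split(n=1, expand=True)
df['owner1_first_name'] = first_rest[0]
rest = first_rest[1].str.rsplit(n=1, expand=True)
# Two-name rows leave only one part after the rsplit: a last name, no middle
df['owner1_middle_name'] = rest[0].where(rest[1].notna(), '')
df['owner1_last_name'] = rest[1].fillna(rest[0])
print(df)
```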
