How do I extract characters from a string in Python? - python

I need to make some name formats match for merging later on in my script. My column 'Name' is imported from a csv and contains names like the following:
Antonio Brown
LeSean McCoy
Le'Veon Bell
For my script, I would like to get the first letter of the first name and combine it with the last name, like so:
A.Brown
L.McCoy
L.Bell
Here's what I have right now, which returns NaN every time:
ff['AbbrName'] = ff['Name'].str.extract('([A-Z]\s[a-zA-Z]+)', expand=True)
Thanks!

Another option is the str.replace method with the pattern ^([A-Z]).*?([a-zA-Z]+)$. Here ^([A-Z]) captures the first letter at the beginning of the string and ([a-zA-Z]+)$ captures the last word; the name is then rebuilt by putting a . between the two captured groups. Pass regex=True explicitly, since pandas 2.0 and later treat the pattern as a literal string by default:
df['Name'].str.replace(r'^([A-Z]).*?([a-zA-Z]+)$', r'\1.\2', regex=True)
#0    A.Brown
#1    L.McCoy
#2     L.Bell
#Name: Name, dtype: object

Alternatively, you could apply() a function that splits on the first space and joins the first character of the first word to the rest:
import pandas as pd

def abbreviate(row):
    # split only on the first space: "Antonio Brown" -> "Antonio", "Brown"
    first_word, rest = row['Name'].split(" ", 1)
    return first_word[0] + "." + rest

df = pd.DataFrame({'Name': ['Antonio Brown', 'LeSean McCoy', "Le'Veon Bell"]})
df['AbbrName'] = df.apply(abbreviate, axis=1)
print(df)
Prints:
            Name  AbbrName
0  Antonio Brown   A.Brown
1   LeSean McCoy   L.McCoy
2   Le'Veon Bell    L.Bell

This should be simple enough to do, even without regex. Use a combination of string splitting and concatenation.
df.Name.str[0] + '.' + df.Name.str.split().str[-1]
0 A.Brown
1 L.McCoy
2 L.Bell
Name: Name, dtype: object
If the Name column may contain leading spaces, replace df.Name.str[0] with df.Name.str.strip().str[0].
Caveat: each value must contain at least two names; a single-word name such as "Cher" would come out as "C.Cher".
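The caveat above can be handled in a plain function, shown here as a minimal sketch. The name abbreviate_name and the choice to return single-word names unchanged are assumptions for illustration, not part of the original answer.

```python
def abbreviate_name(name):
    """Return 'F.Last' for a full name; hypothetical helper for illustration."""
    parts = name.split()  # splits on any whitespace and ignores leading/trailing spaces
    if not parts:
        return ""  # empty or whitespace-only input
    if len(parts) == 1:
        return parts[0]  # assumption: leave single-word names unchanged
    return parts[0][0] + "." + parts[-1]


for example in ["Antonio Brown", "  Le'Veon Bell", "Cher"]:
    print(repr(example), "->", abbreviate_name(example))
```

Assuming the column holds only strings, it could be applied with df['Name'].map(abbreviate_name).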

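You get NaN because your regular expression cannot match the names: [A-Z]\s requires a capital letter immediately followed by whitespace, which never occurs in them.
Instead, I'd try the following:
parts = ff['Name'].str.split(' ')
ff['AbbrName'] = parts.str[0].str[0] + '.' + parts.str[1]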
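To see why the pattern from the question never matches, it can be tried on the sample names with the standard re module. This is only a sketch; the second pattern (two_groups) is a hypothetical alternative, not something from the thread.

```python
import re

names = ["Antonio Brown", "LeSean McCoy", "Le'Veon Bell"]

# Pattern from the question: a capital letter directly followed by whitespace.
original = re.compile(r"([A-Z]\s[a-zA-Z]+)")
original_hits = [original.search(name) for name in names]
print(original_hits)  # no name has a capital right before a space

# Hypothetical alternative: capture the initial and the last word separately.
two_groups = re.compile(r"^([A-Z])\S*\s+(\S+)$")
abbreviated = [".".join(two_groups.match(name).groups()) for name in names]
print(abbreviated)
```

When str.extract finds no match for a row, pandas fills that row with NaN, which is what the question observed.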

Related

Function to remove a part of a string before a capital letter in Pandas Series

I have a dataframe that includes a column ['locality_name'] with names of villages, towns, cities. Some names are written like "town of Hamilton", some like "Hamilton", some like "city of Hamilton" etc. As such, it's hard to count unique values etc. My goal is to leave the names only.
I want to write a function that removes the part of a string till the capital letter and then apply it to my dataframe.
That's what I tried:
import re
def my_slicer(row):
    """
    Returns a string with the name of locality
    """
    return re.sub('ABCDEFGHIKLMNOPQRSTVXYZ','', row['locality_name'])
raw_data['locality_name_only'] = raw_data.apply(my_slicer, axis=1)
I expected it to return a new column with the names of places. Instead, nothing changed: ['locality_name_only'] has the same values as ['locality_name'].
You can use pandas.Series.str.extract. For example:
import pandas as pd
ser = pd.Series(["town of Hamilton", "Hamilton", "city of Hamilton"])
ser_2 = ser.str.extract(r"([A-Z][a-z]+-?\w+)")
In your case, use:
raw_data['locality_name_only'] = raw_data['locality_name'].str.extract(r"([A-Z][a-z]+-?\w+)")
# Output:
print(ser_2)
          0
0  Hamilton
1  Hamilton
2  Hamilton
I would use str.replace and phrase the problem as removing all words that do not start with an uppercase letter:
raw_data["locality_name_only"] = raw_data["locality_name"].str.replace(r'\s*\b[a-z]\w*\s*', ' ', regex=True).str.strip()
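The attempt in the question changes nothing because re.sub treats 'ABCDEFGHIKLMNOPQRSTVXYZ' as one literal 23-character string, which never appears in the data. A minimal sketch of the stated goal (remove everything before the first capital letter) could look like this; strip_prefix is a hypothetical name.

```python
import re


def strip_prefix(locality):
    """Drop every character that precedes the first capital letter."""
    # [^A-Z]* anchored at the start consumes the lowercase prefix and spaces.
    # Note: a value with no capital letter at all becomes an empty string.
    return re.sub(r"^[^A-Z]*", "", locality)


for raw in ["town of Hamilton", "Hamilton", "city of New York"]:
    print(raw, "->", strip_prefix(raw))
```

Unlike the extract-based answer, this keeps multi-word names such as "New York" intact. Assuming the column contains only strings, it could be applied with raw_data['locality_name'].map(strip_prefix).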

Looping through values in a specific column and changing values Python and Pandas

I have a dataframe as follows (example is simplified):
id prediction1 prediction2
1234 Cocker_spaniel german_Shepard
5678 rhodesian_ridgeback australian_shepard
I need to remove the underscores and make sure the string is in lower case so I can search it easier later.
I am not quite sure how to loop through this. My initial student thought is something like what follows:
for row in image_predictions['p1']:
    image_predictions['p1'] = image_predictions['p1'].replace('_', ' ')
The above code is for replacing the underscore with a space and I believe the code would be similar for lowercase using the .lower() method.
Any advice to point me in the right direction?
For in-place modification you can use:
df.update(df[['prediction1', 'prediction2']]
          .apply(lambda c: c.str.lower()
                            .str.replace('_', ' ', regex=False))
          )
Output:
     id          prediction1         prediction2
0  1234       cocker spaniel      german shepard
1  5678  rhodesian ridgeback  australian shepard
You can use image_predictions['p1'].apply() to apply a function to each cell of the p1 column:
def myFunction(x):
    return x.replace('_', ' ')
image_predictions['p1'] = image_predictions['p1'].apply(myFunction)
I wanted to see whether it was possible to avoid specifying the columns for replacement. This approach builds a dict that maps A -> a, B -> b, etc., and _ -> space, then passes it to replace with regex=True:
import string
replace_dict = dict(zip(string.ascii_uppercase,string.ascii_lowercase))
replace_dict['_'] = ' '
df.replace(replace_dict, regex=True, inplace=True)
print(df)
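Whichever pandas call is used, the per-value transformation is just two string methods, so no explicit row loop is needed. A minimal sketch in plain Python follows; normalize_label is a hypothetical name.

```python
def normalize_label(label):
    """Replace underscores with spaces and lowercase the result."""
    return label.replace("_", " ").lower()


predictions = ["Cocker_spaniel", "german_Shepard", "rhodesian_ridgeback"]
normalized = [normalize_label(p) for p in predictions]
print(normalized)
```

The same function could be passed to Series.apply or Series.map for each prediction column, assuming the values are all strings.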

Python: Replace string in one column from list in other column

I need some help please.
I have a dataframe with multiple columns where 2 are:
Content_Clean = Column filled with Content - String
Removals: list of strings to be removed from Content_Clean Column
Problem: I am trying to replace words in Content_Clean with spaces if in Removals Column:
Example:
Content Clean: 'Johnny and Mary went to the store'
Removals: ['Johnny','Mary']
Output: 'and went to the store'
Example Code:
for i in data_eng['Removals']:
    for u in i:
        data_eng['Content_Clean_II'] = data_eng['Content_Clean'].str.replace(u,' ')
This does not work as Removals columns contain lists per row.
Another Example:
data_eng['Content_Clean_II'] = data_eng['Content_Clean'].apply(lambda x: re.sub(data_eng.loc[data_eng['Content_Clean'] == x, 'Removals'].values[0], '', x))
Does not work as this code is only looking for one string.
The problem is that Removals column is a list that I want use to remove/ replace with spaces in the Content_Clean column on a per row basis.
Here you go. This worked on my test data. Let me know if it works for you.
def repl(row):
    # remove every word listed in this row's Removals from its Content_Clean
    for word in row['Removals']:
        row['Content_Clean'] = row['Content_Clean'].replace(word, '')
    return row
data_eng = data_eng.apply(repl, axis=1)
You can call the str.replace(old, new) method to remove unwanted words from a string.
Here is one small example I have done.
a_string = "I do not like to eat apples and watermelons"
stripped_string = a_string.replace(" do not", "")
print(stripped_string)
This will remove "do not" from the sentence
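Plain str.replace also removes matches inside longer words and leaves doubled spaces behind. A hedged sketch of a per-row helper that matches whole words only and tidies the whitespace is shown below; remove_words is a hypothetical name, and whole-word matching is an assumption about the desired behaviour.

```python
import re


def remove_words(text, removals):
    """Remove each whole word in `removals` from `text` and collapse spaces."""
    if not removals:
        return text
    # re.escape protects any regex metacharacters inside the words.
    # \b assumes each entry starts and ends with a letter, digit or underscore.
    pattern = r"\b(?:" + "|".join(re.escape(word) for word in removals) + r")\b"
    cleaned = re.sub(pattern, " ", text)
    return " ".join(cleaned.split())


print(remove_words("Johnny and Mary went to the store", ["Johnny", "Mary"]))
```

With a DataFrame like the one in the question, it could be called once per row, for example inside a function passed to data_eng.apply(..., axis=1).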

Python find text in string

I have the following string for which I want to extract data:
text_example = '\nExample text \nTECHNICAL PARTICULARS\nLength oa: ...............189.9m\nLength bp: ........176m\nBreadth moulded: .......26.4m\nDepth moulded to main deck: ....9.2m\n'
Every variable I want to extract starts with \n.
The value I want to get follows a colon ':' and more than one dot.
When a line doesn't contain a colon followed by dots, I don't want to extract that value.
For example my preferred output looks like:
LOA = 189.9
LBP = 176.0
BM = 26.4
DM = 9.2
import re
text_example = '\nExample text \nTECHNICAL PARTICULARS\nLength oa: ...............189.9m\nLength bp: ........176m\nBreadth moulded: .......26.4m\nDepth moulded to main deck: ....9.2m\n'
# capture all the characters BEFORE the ':' character
variables = re.findall(r'(.*?):', text_example)
# matches all floats and integers (does not account for minus signs)
values = re.findall(r'(\d+(?:\.\d+)?)', text_example)
# zip into a dictionary (this assumes both regexes return the same number of results)
result = dict(zip(variables, values))
print(result)
--> {'Length oa': '189.9', 'Length bp': '176', 'Breadth moulded': '26.4', 'Depth moulded to main deck': '9.2'}
You can also build a single regex that captures the label and the value together; each match is a tuple, and index [2] below picks the third one:
re.findall(r'(\\n|\n)([A-Za-z\s]*)(?:(\:\s*\.+))(\d*\.*\d*)',text_example)[2]
('\n', 'Breadth moulded', ': .......', '26.4')
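Putting the stated rules together (a label at the start of a line, a colon, at least two dots, then a number), one possible sketch is shown below. The ABBREVIATIONS mapping is an assumption inferred from the preferred output in the question, not something the thread defines.

```python
import re

text_example = '\nExample text \nTECHNICAL PARTICULARS\nLength oa: ...............189.9m\nLength bp: ........176m\nBreadth moulded: .......26.4m\nDepth moulded to main deck: ....9.2m\n'

# Label up to the colon, then two or more dots, then an integer or decimal.
pattern = re.compile(r"^([^:\n]+):\s*\.{2,}\s*(\d+(?:\.\d+)?)", re.MULTILINE)
particulars = {label.strip(): float(value)
               for label, value in pattern.findall(text_example)}

# Hypothetical short names, guessed from the question's desired output.
ABBREVIATIONS = {
    "Length oa": "LOA",
    "Length bp": "LBP",
    "Breadth moulded": "BM",
    "Depth moulded to main deck": "DM",
}

for label, value in particulars.items():
    print(f"{ABBREVIATIONS.get(label, label)} = {value}")
```

Because label and value are captured by the same match, a line without the colon-and-dots marker is skipped entirely rather than shifting the pairing, which is the risk when two independent findall results are zipped.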

Regex: combining two groups

Test string:
First
Name:
MICKEY
One to
four lines
of cruft go here
Last
Name:
MOUSE
More cruft
goes here
I want to return a single group "MICKEY MOUSE"
I have:
(?:First\WName:)\W((.+)\W(?:((.+\W){1,4})(?:Last\WName:\W))(.+))
Group 2 returns MICKEY and group 5 returns MOUSE.
I thought that enclosing them in a single group and making the middle cruft and Last name segments non-capturing groups with ?: would prevent them from appearing. But Group 1 returns
MICKEY
One to
four lines
of cruft go here
Last
Name:
MOUSE
How can I get it to remove the middle stuff from what's returned (or alternately combine groups 2 and group 5 into a single named or numbered group)?
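To solve this you could make use of non-capturing groups in regex, which are declared with (?:...). A single group cannot skip over text in the middle of its match, so capture the two names separately and join them in Python.
After modifying the regex to:
(?:First\WName:)\W((.+)\W(?:(?:(?:.+\W){1,4})(?:Last\WName:\W))(.+))
you can do the following in Python: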
import re
inp = """
First
Name:
MICKEY
One to
four lines
of cruft go here
Last
Name:
MOUSE
More cruft
goes here
"""
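query = r'(?:First\WName:)\W((.+)\W(?:(?:(?:.+\W){1,4})(?:Last\WName:\W))(.+))'
# use re.search, not re.match: inp starts with a newline, so nothing matches at position 0
match = re.search(query, inp)
# group 1 still spans the whole block; groups 2 and 3 hold the two names
output = ' '.join(match.group(2, 3))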
With re.search() function and specific regex pattern:
import re
s = '''
First
Name:
MICKEY
One to
four lines
of cruft go here
Last
Name:
MOUSE
More cruft
goes here'''
result = re.search(r'Name:\n(?P<firstname>\S+)[\s\S]*Name:\n(?P<lastname>\S+)', s).groupdict()
print(result)
The output:
{'firstname': 'MICKEY', 'lastname': 'MOUSE'}
----------
Or even simpler with re.findall() function:
result = re.findall(r'(?<=Name:\n)(\S+)', s)
print(result)
The output:
['MICKEY', 'MOUSE']
You can split the string and check if all characters are uppercase:
import re
s = """
First
Name:
MICKEY
One to
four lines
of cruft go here
Last
Name:
MOUSE
More cruft
goes here
"""
final_data = ' '.join(i for i in s.split('\n') if re.findall('^[A-Z]+$', i))
Output:
'MICKEY MOUSE'
Or, a pure regex solution that keeps only all-uppercase runs sitting on a line of their own:
new_data = ' '.join(re.findall(r'(?<=\n)[A-Z]+(?=\n)', s))
Output:
'MICKEY MOUSE'
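The answers above share one idea: a regex group always covers a contiguous stretch of text, so the two names have to be captured separately and combined afterwards. A minimal sketch with named groups that also enforces the "one to four lines of cruft" limit follows; full_name and NAME_PATTERN are hypothetical names.

```python
import re

NAME_PATTERN = re.compile(
    r"First\s+Name:\s+(?P<first>\S+)"   # first name right after the label
    r"(?:\n.*){1,4}?"                   # one to four cruft lines, as few as possible
    r"\nLast\s+Name:\s+(?P<last>\S+)"   # last name right after its label
)


def full_name(text):
    """Return 'FIRST LAST', or None when the layout does not match."""
    match = NAME_PATTERN.search(text)
    if match is None:
        return None
    return f"{match['first']} {match['last']}"


sample = """
First
Name:
MICKEY
One to
four lines
of cruft go here
Last
Name:
MOUSE
More cruft
goes here
"""
print(full_name(sample))
```

Returning None instead of raising keeps the helper usable on inputs where the labels are missing.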
