Regex: combining two groups - python

Test string:
First
Name:
MICKEY
One to
four lines
of cruft go here
Last
Name:
MOUSE
More cruft
goes here
I want to return a single group "MICKEY MOUSE"
I have:
(?:First\WName:)\W((.+)\W(?:((.+\W){1,4})(?:Last\WName:\W))(.+))
Group 2 returns MICKEY and group 5 returns MOUSE.
I thought that enclosing them in a single group and making the middle cruft and Last name segments non-capturing groups with ?: would prevent them from appearing. But Group 1 returns
MICKEY
One to
four lines
of cruft go here
Last
Name:
MOUSE
How can I get it to remove the middle stuff from what's returned (or alternately combine groups 2 and group 5 into a single named or numbered group)?

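A capturing group always returns one contiguous slice of the input, so group 1 cannot skip the cruft between the two names. Non-capturing groups, declared with (?:...), only stop the inner segments from being numbered; the two name groups still have to be joined afterwards.
After modifying the regex to:
(?:First\WName:)\W((.+)\W(?:(?:(?:.+\W){1,4})(?:Last\WName:\W))(.+))
you can do the following in Python: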
import re
inp = """
First
Name:
MICKEY
One to
four lines
of cruft go here
Last
Name:
MOUSE
More cruft
goes here
"""
query = r'(?:First\WName:)\W((.+)\W(?:(?:(?:.+\W){1,4})(?:Last\WName:\W))(.+))'
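# inp starts with a newline, so re.match (anchored at position 0) would return None; use re.search.
# Group 1 still spans the cruft, so join only groups 2 and 3.
output = ' '.join(re.search(query, inp).group(2, 3))
print(output)  # MICKEY MOUSE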
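If a single combined string is the goal, one possible sketch is to name the two groups and let Match.expand() build the result. The group names first and last are illustrative and do not appear in the original pattern; the pattern is otherwise a trimmed-down form of the one above.

```python
import re

text = """First
Name:
MICKEY
One to
four lines
of cruft go here
Last
Name:
MOUSE
More cruft
goes here
"""

# "first" and "last" are hypothetical group names chosen for this sketch.
pattern = re.compile(
    r'First\WName:\W(?P<first>.+)\W(?:.+\W){1,4}Last\WName:\W(?P<last>.+)'
)

m = pattern.search(text)
# expand() fills a template from the captured groups, skipping the cruft.
full_name = m.expand(r'\g<first> \g<last>') if m else None
print(full_name)
```

The match itself still covers the cruft; only the template decides which captured pieces end up in the final string.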

With re.search() function and specific regex pattern:
import re
s = '''
First
Name:
MICKEY
One to
four lines
of cruft go here
Last
Name:
MOUSE
More cruft
goes here'''
result = re.search(r'Name:\n(?P<firstname>\S+)[\s\S]*Name:\n(?P<lastname>\S+)', s).groupdict()
print(result)
The output:
{'firstname': 'MICKEY', 'lastname': 'MOUSE'}
----------
Or even simpler with re.findall() function:
result = re.findall(r'(?<=Name:\n)(\S+)', s)
print(result)
The output:
['MICKEY', 'MOUSE']

You can split the string and check if all characters are uppercase:
import re
s = """
First
Name:
MICKEY
One to
four lines
of cruft go here
Last
Name:
MOUSE
More cruft
goes here
"""
final_data = ' '.join(i for i in s.split('\n') if re.findall('^[A-Z]+$', i))
Output:
'MICKEY MOUSE'
Or, a pure regex solution:
new_data = ' '.join(re.findall(r'^[A-Z]+$', s, flags=re.M))
Output:
'MICKEY MOUSE'

Related

Delete words with regex patterns in Python from a dataframe

I'm playing around with regular expression in Python for the below data.
Random
0 helloooo
1 hahaha
2 kebab
3 shsh
4 title
5 miss
6 were
7 laptop
8 welcome
9 pencil
I would like to delete the words which have patterns of repeated letters (e.g. blaaaa), repeated pair of letters (e.g. hahaha) and any words which have the same adjacent letters around one letter (e.g.title, kebab, were).
Here is the code:
import pandas as pd
data = {'Random' : ['helloooo', 'hahaha', 'kebab', 'shsh', 'title', 'miss', 'were', 'laptop', 'welcome', 'pencil']}
df = pd.DataFrame(data)
df = df.loc[~df.agg(lambda x: x.str.contains(r"([a-z])+\1{1,}\b"), axis=1).any(1)].reset_index(drop=True)
print(df)
Below is the output for the above with a Warning message:
UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
Random
0 hahaha
1 kebab
2 shsh
3 title
4 were
5 laptop
6 welcome
7 pencil
However, I expect to see this:
Random
0 laptop
1 welcome
2 pencil
You can use Series.str.contains directly to create a mask and disable the user warning before and enable it after:
import pandas as pd
import warnings
data = {'Random' : ['helloooo', 'hahaha', 'kebab', 'shsh', 'title', 'miss', 'were', 'laptop', 'welcome', 'pencil']}
df = pd.DataFrame(data)
warnings.filterwarnings("ignore", 'This pattern has match groups') # Disable the warning
df = df[~df['Random'].str.contains(r"([a-z]+)[a-z]?\1")]  # keep only the rows without a repeated pattern
warnings.filterwarnings("always", 'This pattern has match groups') # Enable the warning
Output:
>>> df['Random'][~df['Random'].str.contains(r"([a-z]+)[a-z]?\1")]
# =>
7 laptop
8 welcome
9 pencil
Name: Random, dtype: object
The regex you have contains an issue: the quantifier is placed outside the group, so \1 looks for the wrong repeated string. Also, the \b word boundary is superfluous. The ([a-z]+)[a-z]?\1 pattern matches one or more letters, then one optional letter, and then the same substring that the group captured.
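To see what the backreference does without pandas in the way, here is a small sketch that applies the same pattern to the word list from the question using only the re module. The variable names are illustrative.

```python
import re

words = ['helloooo', 'hahaha', 'kebab', 'shsh', 'title',
         'miss', 'were', 'laptop', 'welcome', 'pencil']

# Group 1 captures one or more letters, [a-z]? allows a single letter
# in between, and \1 requires the captured text to appear again.
repeated = re.compile(r'([a-z]+)[a-z]?\1')

# search() looks anywhere in the word, like Series.str.contains does.
kept = [w for w in words if not repeated.search(w)]
print(kept)
```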
We can safely disable the user warning because we deliberately use the capturing group here, as we need to use a backreference in this regex pattern. The warning needs re-enabling to avoid using capturing groups in other parts of our code where it is not necessary.
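An alternative to switching the filter off and on by hand is to scope the suppression with warnings.catch_warnings(), which restores the previous filters when the block exits. The sketch below uses a hypothetical stand-in function, noisy_contains, instead of a real pandas call, and the message pattern is an assumption about the warning text.

```python
import warnings

def noisy_contains():
    # Hypothetical stand-in for a call that emits the pandas UserWarning.
    warnings.warn("This pattern has match groups.", UserWarning)
    return True

def call_quietly(func):
    # Filters changed inside catch_warnings() are restored when the block exits.
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", message=".*match groups",
                                category=UserWarning)
        return func()

result = call_quietly(noisy_contains)
print(result)
```

With pandas, the str.contains call would go inside the with block in place of func().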
IIUC, you can use something like the pattern r'(\w+)(\w)?\1', i.e., one or more letters, an optional letter, and the letters from the first match. This gives the right result:
df[~df.Random.str.contains(r'(\w+)(\w)?\1')]

I am confused on how to replace a sentence with 're.sub' for this particular problem

I have trouble with changing this particular string with re.sub:
string = "Name: Carolyn\r\nAge : 20\r\nHobby: skiing, diving\r\n"
Is there a way to easily replace for example from Hobby: skiing, diving\r\n to Hobby: swimming, reading\r\n?
This assumes you want to replace anything after Hobby:, not just skiing and diving specifically. One option is to match the whole line, capture Hobby: in a capture group, and replace the line with the capture plus the replacement text. You can use re.M to switch to multiline mode, which lets $ match at the end of each line rather than only at the end of the string.
import re
string = '''
Name: Carolyn
Age : 20
Hobby: skiing, diving
'''
print(re.sub(r'(Hobby: ).*$', r'\1swimming, reading', string, flags=re.M))
Result:
Name: Carolyn
Age : 20
Hobby: swimming, reading
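Note that the original string uses \r\n line endings, and . also matches \r, so .*$ would swallow the carriage return. A minimal sketch that leaves the line terminator alone, assuming the field is always introduced by "Hobby: " (the function name replace_hobby is made up for this example):

```python
import re

def replace_hobby(text, new_hobbies):
    # [^\r\n]* stops before the line terminator, so "\r\n" is left untouched.
    # A function replacement avoids backslash surprises if new_hobbies
    # happens to contain characters like "\1".
    return re.sub(r'(Hobby: )[^\r\n]*',
                  lambda m: m.group(1) + new_hobbies,
                  text)

string = "Name: Carolyn\r\nAge : 20\r\nHobby: skiing, diving\r\n"
updated = replace_hobby(string, "swimming, reading")
print(repr(updated))
```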

How do I extract characters from a string in Python?

I need to make some name formats match for merging later on in my script. My column 'Name' is imported from a csv and contains names like the following:
Antonio Brown
LeSean McCoy
Le'Veon Bell
For my script, I would like to get the first letter of the first name and combine it with the last name as such....
A.Brown
L.McCoy
L.Bell
Here's what I have right now, which returns NaN every time:
ff['AbbrName'] = ff['Name'].str.extract('([A-Z]\s[a-zA-Z]+)', expand=True)
Thanks!
Another option is the str.replace method with the pattern ^([A-Z]).*?([a-zA-Z]+)$. Here ^([A-Z]) captures the first letter at the beginning of the string and ([a-zA-Z]+)$ captures the last word; the replacement then rebuilds the name by putting a . between the two captured groups. Pass regex=True, because pandas 2.0 and later treat the pattern as a literal string by default:
df['Name'].str.replace(r'^([A-Z]).*?([a-zA-Z]+)$', r'\1.\2', regex=True)
#0 A.Brown
#1 L.McCoy
#2 L.Bell
#Name: Name, dtype: object
What if you would just apply() a function that would split by the first space and get the first character of the first word adding the rest:
import pandas as pd
def abbreviate(row):
    first_word, rest = row['Name'].split(" ", 1)
    return first_word[0] + ". " + rest
df = pd.DataFrame({'Name': ['Antonio Brown', 'LeSean McCoy', "Le'Veon Bell"]})
df['AbbrName'] = df.apply(abbreviate, axis=1)
print(df)
Prints:
            Name  AbbrName
0  Antonio Brown  A. Brown
1   LeSean McCoy  L. McCoy
2   Le'Veon Bell   L. Bell
This should be simple enough to do, even without regex. Use a combination of string splitting and concatenation.
df.Name.str[0] + '.' + df.Name.str.split().str[-1]
0 A.Brown
1 L.McCoy
2 L.Bell
Name: Name, dtype: object
If there is a possibility of the Name column having leading spaces, replace df.Name.str[0] with df.Name.str.strip().str[0].
Caveat: Columns must have two names at the very least.
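The caveat above can be handled explicitly in a plain Python helper. This is only a sketch: the function name abbreviate and the choice to return single-word names unchanged are assumptions, not something the question specifies.

```python
def abbreviate(name):
    """Return 'F.Last' for 'First ... Last'; single-word names come back unchanged."""
    parts = name.split()  # split() also ignores leading/trailing whitespace
    if len(parts) < 2:
        return name.strip()
    return parts[0][0] + '.' + parts[-1]

names = ['Antonio Brown', 'LeSean McCoy', "Le'Veon Bell", '  Madonna ']
abbreviated = [abbreviate(n) for n in names]
print(abbreviated)
```

On a DataFrame column, the same helper could be applied with df['Name'].apply(abbreviate).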
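You get NaN because your regular expression cannot match the names: [A-Z]\s requires a capital letter immediately followed by whitespace, and none of the names contain one.
Instead I'd try the following:
parts = ff['Name'].str.split(' ')
ff['AbbrName'] = parts.str[0].str[0] + '.' + parts.str[1]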

How to capture all regex groups in one regex?

Given a file like this:
# For more information about CC-CEDICT see:
# http://cc-cedict.org/wiki/
A A [A] /(slang) (Tw) to steal/
AA制 AA制 [A A zhi4] /to split the bill/to go Dutch/
AB制 AB制 [A B zhi4] /to split the bill (where the male counterpart foots the larger portion of the sum)/(theater) a system where two actors take turns in acting the main role, with one actor replacing the other if either is unavailable/
A咖 A咖 [A ka1] /class "A"/top grade/
A圈兒 A圈儿 [A quan1 r5] /at symbol, #/
A片 A片 [A pian4] /adult movie/pornography/
I want to build a JSON object that:
skips lines that start with #
breaks each remaining line into 4 parts:
traditional characters (from the start ^ until the first space)
simplified characters (from the first space to the second)
pinyin (between the square brackets [...])
the gloss, from the first / to the last / (note that there can be slashes within the gloss, e.g. /adult movie/pornography/)
I am currently doing it as such:
>>> for line in text.split('\n'):
...     if line.startswith('#'):
...         continue
...     line = line.strip()
...     simple, _, line = line.partition(' ')
...     trad, _, line = line.partition(' ')
...     print(simple, trad)
...
A A
AA制 AA制
AB制 AB制
A咖 A咖
A圈兒 A圈儿
A片 A片
To get the [...], I had to do:
>>> import re
>>> line = "A片 A片 [A pian4] /adult movie/pornography/"
>>> simple, _, line = line.partition(' ')
>>> trad, _, line = line.partition(' ')
>>> re.findall(r'\[.*\]', line)[0].strip('[]')
'A pian4'
And to find the /.../, I had to do:
>>> line = "A片 A片 [A pian4] /adult movie/pornography/"
>>> re.findall(r'\/.*\/$', line)[0].strip('/')
'adult movie/pornography'
How do I use regex groups to capture all of them at once, without doing multiple partitions/splits/findall calls?
I would extract the info using a single regular expression instead. This way, you can capture the blocks in groups and then handle them as desired:
import re
with open("myfile") as f:
    data = f.read().split('\n')
for line in data:
    if line.startswith('#'):
        continue
    m = re.search(r"^([^ ]*) ([^ ]*) \[([^]]*)\] \/(.*)\/$", line)
    if m:
        print(m.groups())
The regular expression splits the string into the following groups:
^([^ ]*) ([^ ]*) \[([^]]*)\] \/(.*)\/$
  ^^^^^   ^^^^^     ^^^^^       ^^
  1)      2)        3)          4)
That is:
1) the first word.
2) the second word.
3) the text within [ and ].
4) the text from the first / up to the / at the end of the line.
It returns:
('A', 'A', 'A', '(slang) (Tw) to steal')
('AA制', 'AA制', 'A A zhi4', 'to split the bill/to go Dutch')
('AB制', 'AB制', 'A B zhi4', 'to split the bill (where the male counterpart foots the larger portion of the sum)/(theater) a system where two actors take turns in acting the main role, with one actor replacing the other if either is unavailable')
('A咖', 'A咖', 'A ka1', 'class "A"/top grade')
('A圈兒', 'A圈儿', 'A quan1 r5', 'at symbol, #')
('A片', 'A片', 'A pian4', 'adult movie/pornography')
import re
# A plain raw string is enough; the ru prefix is not valid Python syntax.
p = re.compile(r"(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/$")
m = p.match(line)
if m:
    simple, trad, pinyin, gloss = m.groups()
See https://docs.python.org/2/howto/regex.html#grouping for more details.
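Since the question asks for a JSON object, here is one way the captured groups might be turned into one. It is a sketch under assumptions: the dictionary keys and the function name parse_cedict are made up for illustration, and splitting the gloss on / is an extra design choice rather than something the question requires.

```python
import json
import re

# Same four fields as the patterns above; trailing whitespace is tolerated.
ENTRY = re.compile(r'^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$')

def parse_cedict(text):
    entries = []
    for line in text.splitlines():
        if line.startswith('#'):
            continue  # skip comment lines
        m = ENTRY.match(line)
        if not m:
            continue  # skip blank or malformed lines
        traditional, simplified, pinyin, gloss = m.groups()
        entries.append({
            'traditional': traditional,
            'simplified': simplified,
            'pinyin': pinyin,
            'gloss': gloss.split('/'),  # slashes separate alternative glosses
        })
    return entries

sample = """# For more information about CC-CEDICT see:
A片 A片 [A pian4] /adult movie/pornography/
A圈兒 A圈儿 [A quan1 r5] /at symbol, #/
"""
entries = parse_cedict(sample)
print(json.dumps(entries, ensure_ascii=False, indent=2))
```

ensure_ascii=False keeps the Chinese characters readable in the output instead of escaping them.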
This might help:
import re
preg = re.compile(r'^(?<!#)(\w+)\s(\w+)\s(\[.*?\])\s/(.+)/$',
                  re.MULTILINE | re.UNICODE)
with open('your_file') as f:
    for line in f:
        match = preg.match(line)
        if match:
            print(match.groups())
I created following regex to match all the four groups:
^(.*)\s(.*)\s(\[.*\])\s(\/.*\/)
This assumes that there is exactly one whitespace character between the groups; if there can be more, add a quantifier (e.g. \s+).

Search in a string and obtain the 2 words before and after the match in Python

I'm using Python to search some words (also multi-token) in a description (string).
To do that I'm using a regex like this
result = re.search(word, description, re.IGNORECASE)
if result:
    print("Trovato: " + result.group())
But what I need is to obtain the 2 words before and after the match. For example, if I have something like this:
Parking here is horrible, this shop sucks.
"here is" is the phrase that I'm looking for. So after I match it with my regex, I need the 2 words (if they exist) before and after the match.
In the example:
Parking here is horrible, this
"Parking" and horrible, this are the words that I need.
ATTENTION
The description can be very long, and the pattern "here is" can appear multiple times.
How about string operations?
line = 'Parking here is horrible, this shop sucks.'
before, term, after = line.partition('here is')
before = before.rsplit(maxsplit=2)[-2:]
after = after.split(maxsplit=2)[:2]
Result:
>>> before
['Parking']
>>> after
['horrible,', 'this']
Try this regex: ((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})
with re.findall and re.IGNORECASE set
I would do it like this (edit: added anchors to cover most cases):
(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)
Like this you will always have 4 groups (might have to be trimmed) with the following behavior:
If group 1 is empty, there was no word before (group 2 is empty too)
If group 2 is empty, there was only one word before (group 1)
If group 1 and 2 are not empty, they are the words before in order
If group 3 is empty, there was no word after
If group 4 is empty, there was only one word after
If group 3 and 4 are not empty, they are the words after in order
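The group rules listed above can be checked with a short sketch. The helper name neighbours is hypothetical; it trims the four groups and drops the empty ones, and it only looks at the first occurrence of the phrase.

```python
import re

PATTERN = re.compile(r'(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)',
                     re.IGNORECASE)

def neighbours(text):
    m = PATTERN.search(text)
    if m is None:
        return None
    # Every group takes part in the match, so none of them is None;
    # strip the whitespace that was captured along with each word.
    g1, g2, g3, g4 = (g.strip() for g in m.groups())
    before = [w for w in (g1, g2) if w]
    after = [w for w in (g3, g4) if w]
    return before, after

print(neighbours('Parking here is horrible, this shop sucks.'))
print(neighbours('a b here is c'))
```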
Based on your clarification, this becomes a bit more complicated. The solution below deals with scenarios where the searched pattern may in fact also be in the two preceding or two subsequent words.
import re
line = "Parking here is horrible, here is great here is mediocre here is here is "
print(line)
pattern = "here is"
r = re.search(pattern, line, re.IGNORECASE)
output = []
if r:
    while line:
        # Split off the text before the next occurrence; the rest stays in line
        before, match, line = line.partition(pattern)
        if match:
            if not output:
                before = before.split()[-2:]
            else:
                # The previous match itself may supply the preceding words
                before = ' '.join([pattern, before]).split()[-2:]
            after = line.split()[:2]
            output.append((before, after))
print(output)
Output from my example would be:
[(['Parking'], ['horrible,', 'here']), (['is', 'horrible,'], ['great', 'here']), (['is', 'great'], ['mediocre', 'here']), (['is', 'mediocre'], ['here', 'is']), (['here', 'is'], [])]
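The same result can be reached with re.finditer, which also makes the search case-insensitive as in the question. This is a sketch, not a tuned implementation: the function name context_words and the parameter n are made up, and the \b anchors assume the phrase starts and ends with a word character.

```python
import re

def context_words(text, phrase, n=2):
    """Return a (before, after) pair of word lists for every occurrence of phrase."""
    # \b keeps "here is" from matching inside "there is".
    pattern = re.compile(r'\b' + re.escape(phrase) + r'\b', re.IGNORECASE)
    results = []
    for m in pattern.finditer(text):
        before = text[:m.start()].split()[-n:]  # last n words before the match
        after = text[m.end():].split()[:n]      # first n words after the match
        results.append((before, after))
    return results

print(context_words('Parking here is horrible, this shop sucks.', 'here is'))
```

Each occurrence re-splits the text around it, which is fine for short descriptions; for very long ones it may be worth limiting the slices to a window around the match.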
