Regex with grouping, how to terminate the group? - python

I need to match the text below with a regexp and want to access the resulting group.
String to be searched:
Products in these categories Nr 24432 in Kitchen ( Bestsellers ) Nr 11 in Home Improvement > Garden Nr 25 in Hobby > Gärtnerei
Expected Results:
"Kitchen","Home Improvement > Garden", "Hobby > Gärtnerei"
This is the regexp that I came up with so far, but it only matches the first occurrence.
Any ideas?
Nr [0-9]{1,} in ([0-9A-z >&äÄüÜöÖ]{1,})

Not sure how you're currently trying to match them, but this should work:
import re
text = "Products in these categories Nr 24432 in Kitchen ( Bestsellers ) Nr 11 in Home Improvement > Garden Nr 25 in Hobby > Gärtnerei "
# finditer yields one match object per occurrence; group(1) is the captured category
for m in re.finditer(r"Nr [0-9]{1,} in ([0-9A-z >&äÄüÜöÖ]{1,})", text):
    print(m.group(1))
Reference.
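Also, with your pattern the second match runs on to the end of the string, because the spaces, letters and digits of the following "Nr 25 in ..." item are all allowed by the character class.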
I suggest changing it to something like:
Nr [0-9]+ in (.+?)(?=[^0-9A-z >&äÄüÜöÖ]|$| Nr )
+ means the same as {1,}
.+? means one or more of any character, matched non-greedily (as few as possible)
(?=...) is a look-ahead, so it checks whether what follows is an invalid character, the end of the line or " Nr " (the start of the next match) without consuming it.
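As a quick check, the suggested pattern can be run over the sample string with re.finditer. This is only a minimal sketch: the variable names are illustrative, and the strip() call is an addition made here, because the first capture ends with the space that precedes the opening parenthesis.

```python
import re

text = ("Products in these categories Nr 24432 in Kitchen ( Bestsellers ) "
        "Nr 11 in Home Improvement > Garden Nr 25 in Hobby > Gärtnerei")

# Lazy capture that stops before an invalid character, the end of the
# string, or the " Nr " that starts the next item.
pattern = re.compile(r"Nr [0-9]+ in (.+?)(?=[^0-9A-z >&äÄüÜöÖ]|$| Nr )")

# strip() removes the trailing space left on "Kitchen " before the "(".
categories = [m.group(1).strip() for m in pattern.finditer(text)]
print(categories)
```

Note that the range A-z also covers a few punctuation characters that sit between Z and a in ASCII; writing A-Za-z would be stricter if that matters for your data.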

Related

Need some help on extracting particular string using string manipulations with/without regex

I have an OCR program (not so accurate though) that outputs a string. I append it to a list. So, my ss list looks like this:
ss = [
'성 벼 | 5 번YAO LIAO거 CHINA P R체류자격 결혼이민F-1)말급일자', # 'YAO LIAO'
'성 별 F 등록번호명 JAO HALJUNGCHINA P R격 결혼이민(F-6)밥급인자', # 'JAO HALJUNG'
'성 별 F명 CHENG HAIJING국 가 CHINA P R 역체 가차격 결혼이민(C-4) 박급인자', # 'CHENG HAIJING'
'KOa MDOVUD TAREEQ SAID HAFIZULLAH TURKIYE움첫;자격 거주(F-2)발급일자', # 'DOVUD TAREEQ SAID HAFIZULLAH'
'KOn 별 MDOVUD TAREEQ SAID- IIAFIZULLAH 감 TURKIYE동체나자격 거주F-2) 발급일자', # 'DOVUD TAREEQ SAID- IIAFIZULLAH'
'등록번호IN" 성 별 M명 TAREEQ SAD IIAFIZULLAH 값 TURKIYE8체주자격 거주-2)발급일자' # 'TAREEQ SAD IIAFIZULLAH'
]
I need to find some way to at least remove country names, or even better solution would be to extract clean full names as shown as comments above.
Here, the ss list stores the worst outputs, so if I can handle all 6 strings here with one universal solution, I hope the rest will be easier.
So far, the best I could think of is to loop through each element, keep only the uppercase English letters, and filter out empty strings and any string whose length is less than 2, because I am assuming each part of a name consists of at least 2 letters:
new_string_list = []
for s in ss:
    # replace every character that is not A-Z with a space
    eng_parts = ''.join([i if 64 < ord(i) < 91 else ' ' for i in s])
    #print("English-only strings: {}".format(eng_parts))
    new_string = ''
    spaced_string_list = eng_parts.split(" ")
    for spaced_string in spaced_string_list:
        # keep only fragments of at least two letters
        if len(spaced_string) >= 2:
            new_string += spaced_string + " "
    new_string_list.append(new_string)
where new_string_list is ['YAO LIAO CHINA ', 'JAO HALJUNGCHINA ', 'CHENG HAIJING CHINA ', 'KO MDOVUD TAREEQ SAID HAFIZULLAH TURKIYE ', 'KO MDOVUD TAREEQ SAID IIAFIZULLAH TURKIYE ', 'IN TAREEQ SAD IIAFIZULLAH TURKIYE ']
Could this result be improved further?
EDIT:
The desired name string could be of up to 5 space-separated substrings. Also, a part of the name string is at least two English-only upper letters. In some cases, a name substring could be separated by a - (refer to SAID- case) if it reaches the end of the ID card, where initially the whole string got extracted from.
It is a good idea to postulate that a name is always built of at least two upper-case words of Latin characters separated by one or more spaces.
So you can loop through the elements and look for that pattern. The re module is the tool to use:
import re
for el in ss:
    # first run of two or more upper-case words separated by whitespace
    m = re.search(r'[A-Z]{2,}(\s+[A-Z\-]{2,})+', el)
    if m:
        print(m.group())
Output:
YAO LIAO
JAO HALJUNGCHINA
CHENG HAIJING
MDOVUD TAREEQ SAID HAFIZULLAH TURKIYE
MDOVUD TAREEQ SAID- IIAFIZULLAH
TAREEQ SAD IIAFIZULLAH
Let's examine the pattern in detail:
[A-Z]{2,} matches a run of two or more upper-case Latin letters. The square brackets define a character class and the curly brackets a repetition count.
\s+ matches one or more (+) whitespace characters (\s)
Add special characters to the character class if necessary. The dash is escaped as \- here because an unescaped dash between two characters in a class denotes a range.
( )+ groups part of the pattern so that it can be repeated one or more times.
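The matches above still carry a trailing country name in two cases. One possible follow-up, sketched here under the assumption that the set of country names is known in advance, is to trim such a name from the end of the match. COUNTRIES and extract_name are hypothetical names introduced for this illustration only, and the list would need to be extended for real data.

```python
import re

# Same pattern as above, with a non-capturing group for the repeated part.
NAME_RE = re.compile(r'[A-Z]{2,}(?:\s+[A-Z\-]{2,})+')

# Hypothetical list of country names to trim (an assumption, not from the OCR data).
COUNTRIES = ("CHINA", "TURKIYE")

def extract_name(s):
    """Return the first run of upper-case words, minus a trailing country name."""
    m = NAME_RE.search(s)
    if m is None:
        return None
    name = m.group()
    for country in COUNTRIES:
        if name.endswith(country):
            name = name[:-len(country)]
            break
    return name.strip()

samples = [
    '성 벼 | 5 번YAO LIAO거 CHINA P R체류자격 결혼이민F-1)말급일자',
    '성 별 F 등록번호명 JAO HALJUNGCHINA P R격 결혼이민(F-6)밥급인자',
    'KOa MDOVUD TAREEQ SAID HAFIZULLAH TURKIYE움첫;자격 거주(F-2)발급일자',
]
for s in samples:
    print(extract_name(s))
```

This does not remove the stray leading M in MDOVUD, which comes from the gender field being fused to the name, and it would wrongly shorten a name that genuinely ends in one of the listed words.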

What Python RegEx can I use to indicate a pattern only in the end of an Excel cell

I am working with a dataset where I am separating the contents of one Excel column into 3 separate columns. A mock version of the data is as follows:
Movie Titles/Category/Rating
Wolf of Wall Street A-13 x 9
Django Unchained IMDB x 8
The EXPL Haunted House FEAR x 7
Silver Lining DC-23 x 8
This is what I want the results to look like:
Title                   Category  Rating
Wolf of Wall Street     A-13      9
Django Unchained        IMDB      8
The EXPL Haunted House  FEAR      7
Silver Lining           DC-23     8
Here is what I have used so far to separate the cells.
For Rating, this split worked:
data[['Movie Titles/Category/Rating', 'Rating']] = data['Movie Titles/Category/Rating'].str.split(' x ', expand=True)
However, to separate Category from movie titles, this RegEx doesn't work:
data['Category']=data['Movie Titles/Category/Rating'].str.extract('((\s[A-Z]{1,2}-\d{1,2})|(\s[A-Z]{4}$))', expand = True)
Since the uppercase letter pattern is also present in the middle of the third cell (EXPL, whereas I only want to move FEAR into a separate column), the regex pattern '\s[A-Z]{4}$' is not working. Is there a way to indicate in the RegEx pattern that I only want to separate the uppercase text at the end of the table cell (FEAR) and not the text in the middle (EXPL)?
You can use
import pandas as pd
df = pd.DataFrame({'Movie Titles/Category/Rating':['Wolf of Wall Street A-13 x 9','Django Unchained IMDB x 8','The EXPL Haunted House FEAR x 7','Silver Lining DC-23 x 8']})
df2 = df['Movie Titles/Category/Rating'].str.extract(r'^(?P<Movie>.*?)\s+(?P<Category>\S+)\s+x\s+(?P<Rating>\d+)$', expand=True)
See the regex demo.
Details:
^ - start of string
(?P<Movie>.*?) - Group (Column) "Movie": any zero or more chars other than line break chars, as few as possible
\s+ - one or more whitespaces
(?P<Category>\S+) - Group "Category": one or more non-whitespace chars
\s+x\s+ - x enclosed with one or more whitespaces
(?P<Rating>\d+) - Group "Rating": one or more digits
$ - end of string.
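To see how the lazy Movie group and the \S+ Category group divide each row, the same pattern can be tried with the standard re module, outside pandas. This is a minimal sketch; rows, pattern and parsed are illustrative names.

```python
import re

rows = [
    "Wolf of Wall Street A-13 x 9",
    "Django Unchained IMDB x 8",
    "The EXPL Haunted House FEAR x 7",
    "Silver Lining DC-23 x 8",
]

# Same named-group pattern as in the str.extract call above.
pattern = re.compile(
    r'^(?P<Movie>.*?)\s+(?P<Category>\S+)\s+x\s+(?P<Rating>\d+)$'
)

# groupdict() maps each group name to the text it captured.
parsed = [pattern.match(row).groupdict() for row in rows]
for p in parsed:
    print(p)
```

Because Category must be immediately followed by whitespace, x and the rating, EXPL in the middle of the third title cannot be taken as the category; only FEAR satisfies the whole pattern.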
Assuming there is always x between Category and Rating, and the Category has no spaces in it, then the following should get what you want:
(.*) (.*) x (\d+)
I think
'((\s[A-Z]{1,2}-\d{1,2})|(\s[A-Z]{4})) x'
would work for you - to indicate that you want the part of the string that comes right before the x. (Assuming that pattern is always true for your data.)

Getting quantity and unit

I want to get the quantity and unit from each of the sentences below (these parts were shown in bold), for example 450 gr in the first one.
Examples:
SmellNice Coffee 450 gr
Clean 2 k Rice
LukaLuka 1,5lt cold drink
Jumbo 7 gutgut eggs 12'li
Espresso 5 Klasik 10 Ad
The expression below works well except for the last two.
\d+[.,]?\d*\s*[’']?\s*(gr|g|kg|k|adet|ad|lı|li|lu|lü|cc|cl|ml|lt|l|mm|cm|mt|m)
I added \s|$ to the end of the expression, thinking that if the unit is not the last word then there should be a space after it, but it didn't work. In short, how can I capture all the bold expressions?
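It works if you put the added part in parentheses. Without them, the | in \s|$ splits the whole expression into two alternatives (everything up to \s, or just $):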
\d+[.,]?\d*\s*[’']?\s*(gr|g|kg|k|adet|ad|lı|li|lu|lü|cc|cl|ml|lt|l|mm|cm|mt|m)(\s+|$)
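A short way to try the boundary idea on the five examples is sketched below. Two things here are assumptions added for the illustration rather than part of the answer: re.IGNORECASE is used so that "Ad" matches the lower-case "ad" alternative, and the boundary is written as a lookahead (?=\s|$) so that the trailing space is not included in the match.

```python
import re

UNITS = "gr|g|kg|k|adet|ad|lı|li|lu|lü|cc|cl|ml|lt|l|mm|cm|mt|m"

# Number, optional apostrophe, unit, then whitespace or end of string (not consumed).
pattern = re.compile(
    r"\d+[.,]?\d*\s*[’']?\s*(?:" + UNITS + r")(?=\s|$)",
    re.IGNORECASE,
)

samples = [
    "SmellNice Coffee 450 gr",
    "Clean 2 k Rice",
    "LukaLuka 1,5lt cold drink",
    "Jumbo 7 gutgut eggs 12'li",
    "Espresso 5 Klasik 10 Ad",
]

found = [pattern.search(s).group() for s in samples]
print(found)
```

The boundary is what rejects "7 g" in "7 gutgut" and "5 K" in "5 Klasik": in both cases the unit letter is followed by another letter rather than by whitespace or the end of the string.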
x2 = (
    r"\d+"      # one or more digits
    r"[,'\s]"   # a comma, an apostrophe or a whitespace character
    r"[\d\s]?"  # optionally one more digit or whitespace character
    r"((gr)|g|(kg)|k|(adet)|([Aa]d)|(lı)|(li)|(lu)|(lü)|(cc)|(cl)|(ml)|(lt)|l|(mm)|(cm)|(mt)|m)"  # all the units to look for
    r"(\s+|$)"  # must be followed by whitespace or by the end of the line
)

Regex to find name in sentence

I have some sentence like
1:
"RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held
ball is correctly called."
2:
"Nurkic (POR) maintains legal
guarding position and makes incidental contact with Wall (WAS) that
does not affect his driving shot attempt."
I need to use a Python regex to find the names "Oubre Jr." and "Nurkic" in sentence 1, and "Nurkic" and "Wall" in sentence 2.
I tried this pattern:
p = r'\s*(\w+?)\s[(]'
With it I get ['Nurkic', 'Wall'] for sentence 2, but for sentence 1 I only get ['Nurkic'] and miss "Oubre Jr."
Can anyone help?
You can use the following regex:
(?:[A-Z][a-z][\s\.a-z]*)+(?=\s\()
|-----Main Pattern-----|
Details:
(?:) - Creates a non-capturing group
[A-Z] - Matches 1 uppercase letter
[a-z] - Matches 1 lowercase letter
[\s\.a-z]* - Matches whitespace, periods ('.') or lowercase letters 0+ times
(?=\s\() - Positive lookahead: the main pattern matches only if it is followed by a whitespace character and '(', which are not included in the match
import re
# named text rather than str, so the built-in str type is not shadowed
text = '''RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held ball is correctly called.
Nurkic (POR) maintains legal guarding position and makes incidental contact with Wall (WAS) that does not affect his driving shot attempt.'''
res = re.findall(r'(?:[A-Z][a-z][\s\.a-z]*)+(?=\s\()', text)
print(res)
Demo: https://repl.it/#RahulVerma8/OvalRequiredAdvance?language=python3
Match: https://regex101.com/r/OsLTrY/1
Here is one approach:
import re
line = "RLB shows Oubre Jr (WAS) legally ties up Nurkic (POR), and a held ball is correctly called."
# the pattern is double-quoted because the character class [\w'] contains an apostrophe
results = re.findall(r"([A-Z][\w']+(?: [JS][r][.]?)?)(?= \([A-Z]+\))", line, re.M|re.I)
print(results)
['Oubre Jr', 'Nurkic']
The above logic will attempt to match one name, beginning with a capital letter, which is possibly followed by either the suffix Jr. or Sr., which in turn is followed by a ([A-Z]+) term.
You need a pattern that you can match. For your sentences you could try to match whatever comes before (XXX) and include a list of possible suffixes as well; you would need to collect them from your sources.
import re
suffs = ["Jr."]  # append more suffixes to this list
# optional suffix followed by an optional space; re.escape keeps the "." literal
rsu = r"(?:" + "|".join(re.escape(s) for s in suffs) + r")? ?"
# combine with suffixes
regex = r"(\w+ " + rsu + r")\(\w{3}\)"
test_str = "RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held ball is correctly called. Nurkic (POR) maintains legal guarding position and makes incidental contact with Wall (WAS) that does not affect his driving shot attempt."
# group 1 ends with the space before "(", so strip it off
names = [m.group(1).strip() for m in re.finditer(regex, test_str)]
print(names)
Output:
['Oubre Jr.', 'Nurkic', 'Nurkic', 'Wall']
This should work as long as you do not have names with non-\w characters in them. If you need to adapt the regex, use https://regex101.com/r/pRr9ZU/1 as a starting point.
Explanation:
r"(?:" + "|".join(re.escape(s) for s in suffs) + r")? ?" --> all items in the list suffs are escaped and strung together via | (OR) inside a non-capturing group (?:...), which is made optional and followed by an optional space.
r"(\w+ " + rsu + r")\(\w{3}\)" --> the regex looks for word characters followed by a space and the optional suffix group we just built, followed by a literal ( then three word characters followed by a literal )

Regex, how to remove all non-alphanumeric except colon in a 12/24 hour timestamp?

I have a string like:
Today, 3:30pm - Group Meeting to discuss "big idea"
How do you construct a regex such that after parsing it would return:
Today 3:30pm Group Meeting to discuss big idea
I would like it to remove all non-alphanumeric characters except for those that appear in a 12 or 24 hour time stamp.
# this: D:DD, DD:DDam/pm 12/24 hr
pattern = r':(?=..(?<!\d:\d\d))|[^a-zA-Z0-9 ](?<!:)'
A colon must be preceded by at least one digit and followed by at least two digits: then it's a time. All other colons will be considered textual colons.
How it works
: // match a colon
(?=.. // look ahead (without consuming) past the next two chars
(?<! // start a negative look-behind group (if it matches, this alternative fails)
\d:\d\d // digit, colon, two digits: a time stamp
) // end neg. look behind
) // end look-ahead
| // or
[^a-zA-Z0-9 ] // match anything that is not a letter, a digit or a space
(?<!:) // that isn't a colon
Then when applied to this silly text:
Today, 3:30pm - Group 1,2,3 Meeting to di4sc::uss3: 2:3:4 "big idea" on 03:33pm or 16:47 is also good
...changes it into:
Today 3:30pm  Group 123 Meeting to di4scuss3 234 big idea on 03:33pm or 16:47 is also good
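For readers who want to try this in Python, a minimal sketch follows. The pattern is the one from this answer; the final re.sub that collapses runs of spaces is an extra step added here and is not part of the answer itself.

```python
import re

# Alternative 1: a colon that is not part of digit:digit-digit.
# Alternative 2: any other character that is not a letter, digit or space.
pattern = re.compile(r':(?=..(?<!\d:\d\d))|[^a-zA-Z0-9 ](?<!:)')

s = ('Today, 3:30pm - Group 1,2,3 Meeting to di4sc::uss3: 2:3:4 '
     '"big idea" on 03:33pm or 16:47 is also good')

cleaned = pattern.sub('', s)
# Collapse the runs of spaces left behind where punctuation was removed.
cleaned = re.sub(r' +', ' ', cleaned)
print(cleaned)
```

One limitation worth knowing: because the look-ahead needs two characters after the colon, a colon in the last two positions of the string is never matched by the first alternative and is therefore left in place.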
Python.
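import string
import time
punct = string.punctuation
s = 'Today, 3:30pm - Group Meeting:am to discuss "big idea" by our madam'
for item in s.split():
    try:
        # keep the token untouched if it parses as a time such as 3:30pm
        time.strptime(item, "%H:%M%p")
    except ValueError:
        # otherwise drop every punctuation character from the token
        item = ''.join(i for i in item if i not in punct)
    print(item, end=' ')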
output
$ ./python.py
Today 3:30pm Group Meetingam to discuss big idea by our madam
# change to s='Today, 15:30pm - Group 1,2,3 Meeting to di4sc::uss3: 2:3:4 "big idea" on 03:33pm or 16:47 is also good'
$ ./python.py
Today 15:30pm Group 123 Meeting to di4scuss3 234 big idea on 03:33pm or 1647 is also good
NB: The method should be improved to check for a valid time only when necessary (by imposing conditions), but I will leave it as it is for now.
I assume you'd like to keep spaces as well. This implementation is in Python, but the pattern uses only basic character-class syntax, so it should be portable to PCRE and other engines.
import re
x = u'Today, 3:30pm - Group Meeting to discuss "big idea"'
re.sub(r'[^a-zA-Z0-9: ]', '', x)
Output: 'Today 3:30pm  Group Meeting to discuss big idea'
For a slightly cleaner answer (no double spaces):
import re
x = u'Today, 3:30pm - Group Meeting to discuss "big idea"'
tmp = re.sub(r'[^a-zA-Z0-9: ]', '', x)
re.sub(r'[ ]+', ' ', tmp)
Output: 'Today 3:30pm Group Meeting to discuss big idea'
You can try, in Javascript:
var re = /(\W+(?!\d{2}[ap]m))/gi;
var input = 'Today, 3:30pm - Group Meeting to discuss "big idea"';
alert(input.replace(re, " "))
The correct regexp to do that would be:
'(?<!\d):|:(?!\d\d)|[^a-zA-Z0-9 :]'
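A small sketch of how this pattern might be applied with re.sub is shown below. The helper name clean and the second sample sentence are made up for the illustration, and collapsing repeated spaces is an extra step added here.

```python
import re

# Remove a colon unless it has a digit before it and two digits after it,
# and remove every other character that is not a letter, digit, space or colon.
pattern = re.compile(r'(?<!\d):|:(?!\d\d)|[^a-zA-Z0-9 :]')

def clean(text):
    """Strip punctuation except time-stamp colons, then collapse spaces."""
    stripped = pattern.sub('', text)
    return re.sub(r' +', ' ', stripped)

print(clean('Today, 3:30pm - Group Meeting to discuss "big idea"'))
print(clean('Meeting: call at 16:47, agenda 2: budget'))
```

The colons in 3:30pm and 16:47 survive because none of the three alternatives can match them, while the colons after "Meeting" and "2" are removed by the first and second alternative respectively.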
s="Call me, my dear, at 3:30"
re.sub(r'[^\w :]','',s)
'Call me my dear at 3:30'
