How to extract specific strings using Python Regex

How to extract specific strings using Python Regex - python

I have very challenging strings that I have been struggling.
For example,
str1 = '95% for Pikachu, 92% for Sandshrew'
str2 = '70% for Paras & 100% Arcanine'
str3 = '99% Diglett, 40% Dugtrio'
str4 = '10% Squirtle, 100% for Alakazam'
str5 = '30% Metopod & 99% Dewgong'
String starts with % integer and may have for or not, then following with name of pokemon. There might be comma(,) or & sign then new % integer. Finally there is another name of pokemon.(All start with capital case alphabet) I want to extract two pokemons, for example result,
['Pikachu', 'Sandshrew']
['Paras', 'Arcanine']
['Diglett', 'Dugtrio']
['Squirtle', 'Alakazam']
['Metopod', 'Dewgong']
I can create a list of all pokemen then using in syntax, but it is not the best way(In case they add more pokemon). Is it possible to extract using Regex?
Thanks in advance!
EDIT
As requested, I am adding my code,
str_list = [str1, str2, str3, str4, str5]
for x in str_list:
temp_list = []
if 'for' in x:
temp = x.split('% for', 1)[1].strip()
temp_list.append(temp)
else:
temp = x.split(" ", 1)[1]
temp_list.append(temp)
print(temp_list)
I know it is not regex express. The expression I tried is, \d+ to
extract integer to start... but have no idea how to start.
EDIT2
#b_c has good edge case so, I am adding it here
edge_str = '100% for Pikachu, 29% Pika Pika Pikachu'
result
['Pikachu', 'Pika Pika Pikachu']

Hopefully I didn't over engineer this, but I wanted to cover the edge cases of the slightly-more-complicated named pokemon, such as "Mr. Mime", "Farfetch'd", and/or "Nidoran♂" (only looking at the first 151).
The pattern I used is (?:(?:\d+%(?: |for)+([A-Z](?:[\w\.♀♂']|(?: (?=[A-Z])))+))+)[, &]*, which looks to be working in my testing (here's the regex101 link for a breakdown).
For a general summary, I'm looking for:
1+ digits followed by a %
A space or the word "for" at least once
(To start the capture) A starting capital letter
At least one of (ending the capture group):
a word character, a period, the male/female symbols, or an apostrophe
Note: If you want to catch additional "weird" pokemon characters, like numbers, colon, etc., add them to this portion (the [\w\.♀♂'] bit).
OR a space, but only if followed by a capital letter
A comma, space, or ampersand, any number of times
Unless it's changed, Python's builtin re module doesn't support repeated capture groups (which I believe I did correctly), so I just used re.findall and organized them into pairs (I replaced a couple names from your input with the complicated ones):
import re
str1 = '95% for Pikachu, 92% for Mr. Mime'
str2 = '70% for Paras & 100% Arcanine'
str3 = '99% Diglett, 40% Dugtrio'
str4 = "10% Squirtle, 100% for Farfetch'd"
str5 = '30% Metopod & 99% Nidoran♂'
pattern = r"(?:(?:\d+%(?: |for)+([A-Z](?:[\w\.♀♂']|(?: (?=[A-Z])))+))+)[, &]*"
# Find matches in each string, then unpack each list of
# matches into a flat list
all_matches = [match
for s in [str1, str2, str3, str4, str5]
for match in re.findall(pattern, s)]
# Pair up the matches
pairs = zip(all_matches[::2], all_matches[1::2])
for pair in pairs:
print(pair)
This then prints out:
('Pikachu', 'Mr. Mime')
('Paras', 'Arcanine')
('Diglett', 'Dugtrio')
('Squirtle', "Farfetch'd")
('Metopod', 'Nidoran♂')
Also, as was already mentioned, you do have a few typos in the pokemon names, but a regex isn't the right fix for that unfortunately :)

Since there seems to be no other upper-case letter in your strings you can simply use [A-Z]\w+ as regex.
See regex101
Code:
import re
str1 = '95% for Pikachu, 92% for Sandsherew'
str2 = '70% for Paras & 100% Arcanine'
str3 = '99% Diglett, 40% Dugtrio'
str4 = '10% Squirtle, 100% for Alakazam'
str5 = '30% Metopod & 99% Dewgong'
str_list = [str1, str2, str3, str4, str5]
regex = re.compile('[A-Z]\w+')
pokemon_list = []
for x in str_list:
pokemon_list.append(re.findall(regex, x))
print(pokemon_list)
Output:
[['Pikachu', 'Sandsherew'], ['Paras', 'Arcanine'], ['Diglett', 'Dugtrio'], ['Squirtle', 'Alakazam'], ['Metopod', 'Dewgong']]

An alternate method if you dont want to use regex and you don't want to rely on capitalization
def pokeFinder(strng):
wordList = strng.split()
pokeList = []
for word in wordList:
if not set('[~!##$%^&*()_+{}":;\']+$').intersection(word) and 'for' not in word:
pokeList.append(word.replace(',', ''))
return pokeList
This won't add words with special chars. It also won't add words that are for. Then it removes commas from the found words.
A print of str2 returns ['Diglett', 'Dugtrio']
EDIT
In light of the fact that there are apparently Pokemon with two words and special chars, I made this slightly more convoluted version of the above code
def pokeFinder(strng):
wordList = strng.split()
pokeList = []
prevWasWord = False
for word in wordList:
if not set('%&').intersection(word) and 'for' not in word:
clnWord = word.replace(',', '')
if prevWasWord is True: # 2 poke in a row means same poke
pokeList[-1] = pokeList[-1] + ' ' + clnWord
else:
pokeList.append(clnWord)
prevWasWord = True
else:
prevWasWord = False
return pokeList
If there's no "three word" pokemon, and the rules OP set remain constant, this should always work. 2 poke matches in a row adds to the previous pokemon.
So printing a string of '30% for Mr. Mime & 20% for Type: Null' gets
['Mr. Mime', 'Type: Null']

Use a positive lookbehind, this will work regardless of capitalization.
(?<=\d\d% for )[A-Za-z]+|(?<=\d% for )[A-Za-z]+
EDIT: Changed it to work in Python.

Related

Easy way of converting a string to lowercase in python

I have a text as follows.
mytext = "This is AVGs_ABB and NMN_ABB and most importantly GFD_ABB This is so important that you have to CLEAN the lab everyday"
I want to convert it to lowercase, except the words that has _ABB in it.
So, my output should look as follows.
mytext = "this is AVGs_ABB and NMN_ABB and most importantly GFD_ABB this is so important that you have to clean the lab everyday"
My current code is as follows.
splits = mytext.split()
newtext = []
for item in splits:
if not '_ABB' in item:
item = item.lower()
newtext.append(item)
else:
newtext.append(item)
However, I want to know if there is any easy way of doing this, possibly in one line?

You can use a one liner splitting the string into words, check the words with str.endswith() and then join the words back together:
' '.join(w if w.endswith('_ABB') else w.lower() for w in mytext.split())
# 'this is AVGs_ABB and NMN_ABB and most importantly GFD_ABB this is so important that you have to clean the lab everyday'
Of course use the in operator rather than str.endswith() if '_ABB' can actually occur anywhere in the word and not just at the end.

Extended regex approach:
import re
mytext = "This is AVGs_ABB and NMN_ABB and most importantly GFD_ABB This is so important that you have to CLEAN the lab everyday"
result = re.sub(r'\b((?!_ABB)\S)+\b', lambda m: m.group().lower(), mytext)
print(result)
The output:
this is AVGs_ABB and NMN_ABB and most importantly GFD_ABB this is so important that you have to clean the lab everyday
Details:
\b - word boundary
(?!_ABB) - lookahead negative assertion, ensures that the given pattern will not match
\S - non-whitespace character
\b((?!_ABB)\S)+\b - the whole pattern matches a word NOT containing substring _ABB

Here is another possible(not elegant) one-liner:
mytext = "This is AVGs_ABB and NMN_ABB and most importantly GFD_ABB This is so important that you have to CLEAN the lab everyday"
print(' '.join(map(lambda x : x if '_ABB' in x else x.lower(), mytext.split())))
Which Outputs:
this is AVGs_ABB and NMN_ABB and most importantly GFD_ABB this is so important that you have to clean the lab everyday
Note: This assumes that your text will only seperate the words by spaces, so split() suffices here. If your text includes punctuation such as",!.", you will need to use regex instead to split up the words.

Splitting a sentence by ending characters

A recent project has me needing to split incoming phrases (as strings) into their component sentences. For instance, this string:
"Your mother was a hamster, and your father smelt of elderberries! Now go away, or I shall taunt you a second time. You know what, never mind. This entire sentence is far too silly. Wouldn't you agree? I think it is."
Would need to be turned into a list composed of the following elements:
["Your mother was a hamster, and your father smelt of elderberries",
"Now go away, or I shall taunt you a second time",
"You know what, never mind",
"This entire sentence is far too silly",
"Wouldn't you agree",
"I think it is"]
For the purposes of this function, a "sentence" is a string terminated by !, ?, or . Note that punctuation should be removed from the output as shown above.
I've got a working version, but it's quite ugly, leaves leading and trailing spaces, and I can't help but think there's a better way:
from functools import reduce
def split_sentences(st):
if type(st) is not str:
raise TypeError("Cannot split non-strings")
sl = st.split('.')
sl = [s.split('?') for s in sl]
sl = reduce(lambda x, y: x+y, sl) #Flatten the list
sl = [s.split('!') for s in sl]
return reduce(lambda x, y: x+y, sl)

Use re.split instead to specify a regular expression matching any sentence-ending character (and any following whitespace).
def split_sentences(st):
sentences = re.split(r'[.?!]\s*', st)
if sentences[-1]:
return sentences
else:
return sentences[:-1]

You can also do this without regexes:
result = [s.strip() for s in String.replace('!', '.').replace('?', '.').split('.')]
Or, you could've written a bleeding-edge algorithm that doesn't copy data around so much:
String = list(String)
for i in range(len(String)):
if (String[i] == '?') or (String[i] == '!'):
String[i] = '.'
String = [s.strip() for s in String.split('.')]

import re
st1 = " Another example!! Let me contribute 0.50 cents here?? \
How about pointer '.' character inside the sentence? \
Uni Mechanical Pencil Kurutoga, Blue, 0.3mm (M310121P.33). \
Maybe there could be a multipoint delimeter?.. Just maybe... "
st2 = "One word"
def split_sentences(st):
st = st.strip() + '. '
sentences = re.split(r'[.?!][.?!\s]+', st)
return sentences[:-1]
print(split_sentences(st1))
print(split_sentences(st2))

You can use regex split to split them at specific special characters.
import re
str = "Your mother was a hamster, and your father smelt of elderberries! Now go away, or I shall taunt you a second time. You know what, never mind. This entire sentence is far too silly. Wouldn't you agree? I think it is."
re.compile(r'[?.!]\s+').split(str)

How to replace a word which occurs before another word in python

I want to replace(re-spell) a word A in a text string with another word B if the word A occurs before an operator. Word A can be any word.
E.G:
Hi I am Not == you
Since "Not" occurs before operator "==", I want to replace it with alist["Not"]
So, above sentence should changed to
Hi I am alist["Not"] == you
Another example
My height > your height
should become
My alist["height"] > your height
Edit:
On #Paul's suggestion, I am putting the code which I wrote myself.
It works but its too bulky and I am not happy with it.
operators = ["==", ">", "<", "!="]
text_list = text.split(" ")
for index in range(len(text_list)):
if text_list[index] in operators:
prev = text_list[index - 1]
if "." in prev:
tokens = prev.split(".")
prev = "alist"
for token in tokens:
prev = "%s[\"%s\"]" % (prev, token)
else:
prev = "alist[\"%s\"]" % prev
text_list[index - 1] = prev
text = " ".join(text_list)

This can be done using regular expressions
import re
...
def replacement(match):
return "alist[\"{}\"]".format(match.group(0))
...
re.sub(r"[^ ]+(?= +==)", replacement, s)
If the space between the word and the "==" in your case is not needed, the last line becomes:
re.sub(r"[^ ]+(?= *==)", replacement, s)
I'd highly recommend you to look into regular expressions, and the python implementation of them, as they are really useful.
Explanation for my solution:
re.sub(pattern, replacement, s) replaces occurences of patterns, that are given as regular expressions, with a given string or the output of a function.
I use the output of a function, that puts the whole matched object into the 'alist["..."]' construct. (match.group(0) returns the whole match)
[^ ] match anything but space.
+ match the last subpattern as often as possible, but at least once.
* match the last subpattern as often as possible, but it is optional.
(?=...) is a lookahead. It checks if the stuff after the current cursor position matches the pattern inside the parentheses, but doesn't include them in the final match (at least not in .group(0), if you have groups inside a lookahead, those are retrievable by .group(index)).

str = "Hi I am Not == you"
s = str.split()
y = ''
str2 = ''
for x in s:
if x in "==":
str2 = str.replace(y, 'alist["'+y+'"]')
break
y = x
print(str2)

You could try using the regular expression library I was able to create a simple solution to your problem as shown here.
import re
data = "Hi I am Not == You"
x = re.search(r'(\w+) ==', data)
print(x.groups())
In this code, re.search looks for the pattern of (1 or more) alphanumeric characters followed by operator (" ==") and stores the result ("Hi I am Not ==") in variable x.
Then for swaping you could use the re.sub() method which CodenameLambda suggested.
I'd also recommend learning how to use regular expressions, as they are useful for solving many different problems and are similar between different programming languages

How can I use regex to search inside sentence -not a case sensitive

I'm a newbie to Regular expression in Python :
I have a list that i want to search if it's contain a employee name.
The employee name can be :
it can be at the beginning followed by space.
followed by Â®
OR followed by space
OR Can be at the end and space before it
not a case sensitive
ListSentence = ["SteveÂ®", "steveHotel", "Rob spring", "Car Daniel", "CarDaniel","Done daniel"]
ListEmployee = ["Steve", "Rob", "daniel"]
The output from the ListSentence is:
["SteveÂ®", "Rob spring", "Car Daniel", "Done daniel"]

First take all your employee names and join them with a | character and wrap the string so it looks like:
(?:^|\s)((?:Steve|Rob|Daniel)(?:Â®)?)(?=\s|$)
By first joining all the names together you avoid the performance overhead of using a nested set of for next loops.
I don't know python well enough to offer a python example, however in powershell I'd write it like this
[array]$names = #("Steve", "Rob", "daniel")
[array]$ListSentence = #("SteveÂ®", "steveHotel", "Rob spring", "Car Daniel", "CarDaniel","Done daniel")
# build the regex, and insert the names as a "|" delimited string
$Regex = "(?:^|\s)((?:" + $($names -join "|") + ")(?:Â®)?)(?=\s|$)"
# use case insensitive match to find any matching array values
$ListSentence -imatch $Regex
Yields
SteveÂ®
Rob spring
Car Daniel
Done daniel

Why do you want to use regular expressions? I'd generally recommend avoiding them in Python - you can use string methods instead.
For example:
def string_has_employee_name_in_it(test_string):
test_string = test_string.lower() # case insensitive
for name in ListEmployee:
name = name.lower()
if name == test_string:
return True
elif name + 'Â®' == test_string:
return True
elif test_string.endswith(' ' + name):
return True
elif test_string.startswith(name + ' '):
return True
elif (' ' + name + ' ') in test_string:
return True
return False
final_list = []
for string in ListSentence:
if string_has_employee_name_in_it(string):
final_list.append(string)
final_list is the list you want. This is longer than a regex, but it's also a lot easier to parse and maintain. You can make it a lot shorter in various ways (e.g. combining the tests in the function, and using a list comprehension instead of a loop), but as you're starting out with Python it's a good idea to be clear with what's going on.

I don't think you need to check for all of those scenarios. I think all you need to do is check for word breaks.
You can join the ListEmployee list with | to make an either or regex (also, lowercase it for case-insensitivity), surrounded by \b for word breaks, and that should work:
regex = '|'.join(ListEmployee).lower()
import re
[l for l in ListSentence if re.search(r'\b(%s)\b' % regex, l.lower())]
Should output:
['Steve\xb6\xa9', 'Rob spring', 'Car Daniel', 'Done daniel']

If you're just looking for strings containing a space, as your example indicates, it should be something like this:
[i for i in ListSentence if i.endswith('Â®') or (' ' in i)]

A possible solution:
import re
ListSentence = ["SteveÂ®", "steveHotel", "Rob spring", "Car Daniel", "CarDaniel","Done daniel"]
ListEmployee = ["Steve", "Rob", "daniel"]
def findEmployees(employees, sentence):
retval = []
for employee in employees:
expr = re.compile(r'(^%(employee)s(Â®)?(\s|$))|((^|\s)%(employee)s(Â®)?(\s|$))|((^|\s)%(employee)s(Â®)?$)'
% {'employee': employee},
re.IGNORECASE)
for part in sentence:
if expr.search(part):
retval.append(part)
return retval
findEmployees(ListEmployee, ListSentence)
>> Returns ['Steve\xc3\x82\xc2\xae', 'Rob spring', 'Car Daniel', 'Done daniel']

string to list conversion in python?

I have a string like :
searchString = "u:sads asdas asdsad n:sadasda as:adds sdasd dasd a:sed eee"
what I want is list :
["u:sads asdas asdsad","n:sadasda","as:adds sdasd dasd","a:sed eee"]
What I have done is :
values = re.split('\s', searchString)
mylist = []
word = ''
for elem in values:
if ':' in elem:
if word:
mylist.append(word)
word = elem
else:
word = word + ' ' + elem
list.append(word)
return mylist
But I want an optimized code in python 2.6 .
Thanks

Use regular expressions:
import re
mylist= re.split('\s+(?=\w+:)', searchString)
This splits the string everywhere there's a space followed by one or more letters and a colon. The look-ahead ((?= part) makes it split on the whitespace while keeping the \w+: parts

You can use "look ahead" feature offered by a lot of regular expression engines. Basically, the regex engines checks for a pattern without consuming it when it comes to look ahead.
import re
s = "u:sads asdas asdsad n:sadasda as:adds sdasd dasd a:sed eee"
re.split(r'\s(?=[a-z]:)', s)
This means, split only when we have a \s followed by any letter and a colon but don't consume those tokens.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to extract specific strings using Python Regex - python

Use a positive lookbehind, this will work regardless of capitalization. (?<=\d\d% for )[A-Za-z]+|(?<=\d% for )[A-Za-z]+ EDIT: Changed it to work in Python.

Related

Easy way of converting a string to lowercase in python

Splitting a sentence by ending characters

How to replace a word which occurs before another word in python

How can I use regex to search inside sentence -not a case sensitive

string to list conversion in python?

Categories

Resources