string to list conversion in python? - python

I have a string like :
searchString = "u:sads asdas asdsad n:sadasda as:adds sdasd dasd a:sed eee"
what I want is list :
["u:sads asdas asdsad","n:sadasda","as:adds sdasd dasd","a:sed eee"]
What I have done is :
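import re
values = re.split(r'\s', searchString)
mylist = []
word = ''
for elem in values:
    if ':' in elem:
        # a new "key:" token starts, so store the entry collected so far
        if word:
            mylist.append(word)
        word = elem
    else:
        word = word + ' ' + elem
mylist.append(word)  # store the last entry (was list.append, which fails)
return mylist  # this fragment sits inside a function
But I want more optimized code for Python 2.6.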
Thanks

Use regular expressions:
import re
mylist = re.split(r'\s+(?=\w+:)', searchString)
This splits the string at every run of whitespace that is followed by one or more word characters and a colon. The look-ahead (the (?=...) part) makes it split on the whitespace only, so the \w+: parts are kept in the results.
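To make the effect of the look-ahead concrete, here is a minimal sketch that splits the question's string with and without it. The variable names are illustrative only and do not come from the answer.

```python
import re

search_string = "u:sads asdas asdsad n:sadasda as:adds sdasd dasd a:sed eee"

# With the look-ahead, only the whitespace is consumed by the split.
with_lookahead = re.split(r'\s+(?=\w+:)', search_string)

# Without it, the "key:" prefix is part of the separator and is lost.
without_lookahead = re.split(r'\s+\w+:', search_string)

print(with_lookahead)
print(without_lookahead)
```

The second list loses the n:, as: and a: prefixes, which is why the look-ahead is needed here.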

You can use the "look ahead" feature offered by many regular expression engines. With a look-ahead, the regex engine checks for a pattern without consuming it.
import re
s = "u:sads asdas asdsad n:sadasda as:adds sdasd dasd a:sed eee"
re.split(r'\s(?=[a-z]+:)', s)
This means: split only when we have a \s followed by one or more letters and a colon, but don't consume those tokens. (The + matters: with a single [a-z] the multi-letter key as: would not be recognized.)
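As a quick check of how the letter class inside the look-ahead affects the result, the sketch below compares a single-letter class with a one-or-more class on the same string. The names single_letter and multi_letter are made up for this illustration.

```python
import re

s = "u:sads asdas asdsad n:sadasda as:adds sdasd dasd a:sed eee"

# A single letter before the colon misses the two-letter key "as:".
single_letter = re.split(r'\s(?=[a-z]:)', s)

# One or more letters before the colon handles keys of any length.
multi_letter = re.split(r'\s(?=[a-z]+:)', s)

print(single_letter)
print(multi_letter)
```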

Related

Python Regex, how to substitute multiple occurrences with a single pattern?

I'm trying to make a fuzzy autocomplete suggestion box that highlights searched characters with HTML tags <b></b>
For example, if the user types 'ldi' and one of the suggestions is "Leonardo DiCaprio" then the desired outcome is "<b>L</b>eonar<b>d</b>o D<b>i</b>Caprio". The first occurrence of each character is highlighted in order of appearance.
What I'm doing right now is:
import re

def prototype_finding_chars_in_string():
    test_string_list = ["Leonardo DiCaprio", "Brad Pitt", "Claire Danes", "Tobey Maguire"]
    comp_string = "ldi"  # chars to highlight
    regex = ".*?" + ".*?".join([f"({x})" for x in comp_string]) + ".*?"  # results in .*?(l).*?(d).*?(i).*?
    regex_compiled = re.compile(regex, re.IGNORECASE)
    for x in test_string_list:
        re_search_result = re.search(regex_compiled, x)  # correctly filters the test list to include only entries that feature the search chars in order
        if re_search_result:
            print(f"char combination {comp_string} are in {x} result group: {re_search_result.groups()}")
results in
char combination ldi are in Leonardo DiCaprio result group: ('L', 'D', 'i')
Now I want to replace each occurrence in the result groups with <b>[whatever in the result]</b> and I'm not sure how to do it.
What I'm currently doing is looping over the result and using the built-in str.replace method to replace the occurrences:
def replace_with_bold(result_groups, original_string):
    output_string: str = original_string
    for result in result_groups:
        output_string = output_string.replace(result, f"<b>{result}</b>", 1)
    return output_string
This results in:
Highlighted string: <b>L</b>eonar<b>d</b>o D<b>i</b>Caprio
But I think looping like this over the results when I already have the match groups is wasteful. Furthermore, it's not even correct, because it searches the string from the beginning on each iteration. So for the input 'ooo' this is the result:
char combination ooo are in Leonardo DiCaprio result group: ('o', 'o', 'o')
Highlighted string: Le<b><b><b>o</b></b></b>nardo DiCaprio
When it should be Le<b>o</b>nard<b>o</b> DiCapri<b>o</b>
Is there a way to simplify this? Maybe regex here is overkill?
A way using re.split:
import re

test_string_list = ["Leonardo DiCaprio", "Brad Pitt", "Claire Danes", "Tobey Maguire"]

def filter_and_highlight(strings, letters):
    # e.g. 'lir' becomes (l)(.*?)(i)(.*?)(r)
    pat = re.compile('(' + (')(.*?)('.join(letters)) + ')', re.I)
    results = []
    for s in strings:
        parts = pat.split(s, 1)
        if len(parts) == 1:  # no match: the string is returned unsplit
            continue
        res = ''
        for i, p in enumerate(parts):
            if i & 1:  # odd index: one of the searched letters
                p = '<b>' + p + '</b>'
            res += p
        results.append(res)
    return results

filter_and_highlight(test_string_list, 'lir')
A particularity of re.split is that captured groups are included in the result list. Also, even if the first capture matches at the start of the string, an empty part is returned before it, which means the searched letters are always at odd indexes in the list of substrings.
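The odd-index behaviour can be seen directly on a small case. This is only an illustrative sketch: the two-letter pattern and the variable names are chosen for the example and are not part of the answer's code.

```python
import re

# Two searched letters, "l" and "d", with a lazy gap captured between them.
pat = re.compile('(l)(.*?)(d)', re.I)

# The match starts at index 0, so an empty string comes first.
parts = pat.split('Leonardo DiCaprio', maxsplit=1)

# Every second element, starting at index 1, is a searched letter.
letters = parts[1::2]

# A string without a match comes back as a single-element list.
no_match = pat.split('Brad Pitt', maxsplit=1)

print(parts)
print(letters)
print(no_match)
```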
This should work:
for result in result_groups:
    output_string = re.sub(fr'(.*?(?!<b>))({result})((?!</b>).*)',
                           r'\1<b>\2</b>\3',
                           output_string,
                           flags=re.IGNORECASE)
On each iteration, the first occurrence of result is replaced by <b>result</b>, provided it is not already enclosed in a tag. The ? makes .* lazy, which is what selects the first occurrence; (?!<b>) and (?!</b>) skip occurrences that are already wrapped; \1, \2 and \3 refer to the first, second and third groups. The IGNORECASE flag makes the match case-insensitive.
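Another option, not taken from either answer, is to reuse the positions of the groups from the single search instead of substituting repeatedly. The sketch below assumes a hypothetical helper named highlight and uses Match.span to rebuild the string in one pass; it also escapes the letters with re.escape, which the question's pattern does not do.

```python
import re

def highlight(text, letters):
    """Wrap the first in-order occurrence of each letter in <b> tags.

    Returns None when the letters do not occur in order.
    """
    pattern = re.compile(
        ".*?".join(f"({re.escape(c)})" for c in letters),
        re.IGNORECASE,
    )
    match = pattern.search(text)
    if match is None:
        return None
    out = []
    pos = 0
    for i in range(1, len(letters) + 1):
        start, end = match.span(i)          # position of the i-th letter
        out.append(text[pos:start])         # untouched text before it
        out.append(f"<b>{text[start:end]}</b>")
        pos = end
    out.append(text[pos:])                  # remainder after the last letter
    return "".join(out)

print(highlight("Leonardo DiCaprio", "ldi"))
print(highlight("Leonardo DiCaprio", "ooo"))
```

Because each group's span is used exactly once, a repeated letter such as 'ooo' is not wrapped several times in the same place.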

How to extract specific strings using Python Regex

I have some very challenging strings that I have been struggling with.
For example,
str1 = '95% for Pikachu, 92% for Sandshrew'
str2 = '70% for Paras & 100% Arcanine'
str3 = '99% Diglett, 40% Dugtrio'
str4 = '10% Squirtle, 100% for Alakazam'
str5 = '30% Metopod & 99% Dewgong'
Each string starts with an integer percentage, optionally followed by "for", then the name of a pokemon. Next comes a comma (,) or an & sign, then another integer percentage and another pokemon name. (All names start with a capital letter.) I want to extract the two pokemon names, for example:
['Pikachu', 'Sandshrew']
['Paras', 'Arcanine']
['Diglett', 'Dugtrio']
['Squirtle', 'Alakazam']
['Metopod', 'Dewgong']
I could create a list of all pokemon and then use the in operator, but that is not the best way (in case they add more pokemon). Is it possible to extract them using regex?
Thanks in advance!
EDIT
As requested, I am adding my code,
str_list = [str1, str2, str3, str4, str5]
for x in str_list:
    temp_list = []
    if 'for' in x:
        temp = x.split('% for', 1)[1].strip()
        temp_list.append(temp)
    else:
        temp = x.split(" ", 1)[1]
        temp_list.append(temp)
    print(temp_list)
I know it is not a regular expression. The expression I tried is \d+ to extract the integer at the start, but I have no idea how to continue.
EDIT2
@b_c has a good edge case, so I am adding it here
edge_str = '100% for Pikachu, 29% Pika Pika Pikachu'
result
['Pikachu', 'Pika Pika Pikachu']
Hopefully I didn't over engineer this, but I wanted to cover the edge cases of the slightly-more-complicated named pokemon, such as "Mr. Mime", "Farfetch'd", and/or "Nidoran♂" (only looking at the first 151).
The pattern I used is (?:(?:\d+%(?: |for)+([A-Z](?:[\w\.♀♂']|(?: (?=[A-Z])))+))+)[, &]*, which looks to be working in my testing (here's the regex101 link for a breakdown).
For a general summary, I'm looking for:
    1+ digits followed by a %
    A space or the word "for" at least once
    (To start the capture) A starting capital letter
    At least one of (ending the capture group):
        a word character, a period, the male/female symbols, or an apostrophe
            Note: If you want to catch additional "weird" pokemon characters, like numbers, colon, etc., add them to this portion (the [\w\.♀♂'] bit).
        OR a space, but only if followed by a capital letter
    A comma, space, or ampersand, any number of times
Unless it's changed, Python's builtin re module doesn't support repeated capture groups (which I believe I did correctly), so I just used re.findall and organized them into pairs (I replaced a couple names from your input with the complicated ones):
import re
str1 = '95% for Pikachu, 92% for Mr. Mime'
str2 = '70% for Paras & 100% Arcanine'
str3 = '99% Diglett, 40% Dugtrio'
str4 = "10% Squirtle, 100% for Farfetch'd"
str5 = '30% Metopod & 99% Nidoran♂'
pattern = r"(?:(?:\d+%(?: |for)+([A-Z](?:[\w\.♀♂']|(?: (?=[A-Z])))+))+)[, &]*"
# Find matches in each string, then unpack each list of
# matches into a flat list
all_matches = [match
               for s in [str1, str2, str3, str4, str5]
               for match in re.findall(pattern, s)]
# Pair up the matches
pairs = zip(all_matches[::2], all_matches[1::2])
for pair in pairs:
    print(pair)
This then prints out:
('Pikachu', 'Mr. Mime')
('Paras', 'Arcanine')
('Diglett', 'Dugtrio')
('Squirtle', "Farfetch'd")
('Metopod', 'Nidoran♂')
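If one list per input string is preferred over flattening and re-pairing, a variant is to call re.findall once per string. This is a sketch that reuses the answer's pattern unchanged; the helper name extract_names is hypothetical, and the second test string is the edge case from the question's EDIT2.

```python
import re

# Pattern copied from the answer above.
pattern = r"(?:(?:\d+%(?: |for)+([A-Z](?:[\w\.♀♂']|(?: (?=[A-Z])))+))+)[, &]*"

def extract_names(s):
    """Return the captured pokemon names found in one input string."""
    return re.findall(pattern, s)

first = extract_names('95% for Pikachu, 92% for Mr. Mime')
edge = extract_names('100% for Pikachu, 29% Pika Pika Pikachu')

print(first)
print(edge)
```

This keeps each string's names together even if a string happens to contain a number of names other than two.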
Also, as was already mentioned, you do have a few typos in the pokemon names, but a regex isn't the right fix for that unfortunately :)
Since there seems to be no other upper-case letter in your strings you can simply use [A-Z]\w+ as regex.
See regex101
Code:
import re
str1 = '95% for Pikachu, 92% for Sandsherew'
str2 = '70% for Paras & 100% Arcanine'
str3 = '99% Diglett, 40% Dugtrio'
str4 = '10% Squirtle, 100% for Alakazam'
str5 = '30% Metopod & 99% Dewgong'
str_list = [str1, str2, str3, str4, str5]
regex = re.compile(r'[A-Z]\w+')  # raw string so \w is not treated as a string escape
pokemon_list = []
for x in str_list:
    pokemon_list.append(re.findall(regex, x))
print(pokemon_list)
Output:
[['Pikachu', 'Sandsherew'], ['Paras', 'Arcanine'], ['Diglett', 'Dugtrio'], ['Squirtle', 'Alakazam'], ['Metopod', 'Dewgong']]
An alternate method if you don't want to use regex and you don't want to rely on capitalization:
def pokeFinder(strng):
    wordList = strng.split()
    pokeList = []
    for word in wordList:
        # skip words containing special characters and the word "for"
        if not set('[~!@#$%^&*()_+{}":;\']+$').intersection(word) and 'for' not in word:
            pokeList.append(word.replace(',', ''))
    return pokeList
This won't add words with special chars. It also won't add words that contain "for". Then it removes commas from the found words.
A print of the result for str3 returns ['Diglett', 'Dugtrio']
EDIT
In light of the fact that there are apparently Pokemon with two words and special chars, I made this slightly more convoluted version of the above code
def pokeFinder(strng):
    wordList = strng.split()
    pokeList = []
    prevWasWord = False
    for word in wordList:
        if not set('%&').intersection(word) and 'for' not in word:
            clnWord = word.replace(',', '')
            if prevWasWord is True:  # 2 poke in a row means same poke
                pokeList[-1] = pokeList[-1] + ' ' + clnWord
            else:
                pokeList.append(clnWord)
            prevWasWord = True
        else:
            prevWasWord = False
    return pokeList
If there's no "three word" pokemon, and the rules OP set remain constant, this should always work. 2 poke matches in a row adds to the previous pokemon.
So printing a string of '30% for Mr. Mime & 20% for Type: Null' gets
['Mr. Mime', 'Type: Null']
Use a positive lookbehind; this works regardless of capitalization.
(?<=\d\d% for )[A-Za-z]+|(?<=\d% for )[A-Za-z]+
EDIT: Changed it to work in Python.
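A short sketch of how this lookbehind pattern behaves with re.findall on the question's strings. Python's re module only accepts fixed-width lookbehinds, which is presumably why the answer spells out the one-digit and two-digit cases instead of writing \d+. Note that, as written, the pattern only finds names that are preceded by "for".

```python
import re

# Pattern copied from the answer above.
lookbehind = re.compile(r'(?<=\d\d% for )[A-Za-z]+|(?<=\d% for )[A-Za-z]+')

both_with_for = lookbehind.findall('95% for Pikachu, 92% for Sandshrew')
one_without_for = lookbehind.findall('70% for Paras & 100% Arcanine')

# A variable-width lookbehind is rejected by the re module.
try:
    re.compile(r'(?<=\d+% for )[A-Za-z]+')
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(both_with_for)
print(one_without_for)
print(variable_width_ok)
```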

How to replace a word which occurs before another word in python

I want to replace(re-spell) a word A in a text string with another word B if the word A occurs before an operator. Word A can be any word.
E.G:
Hi I am Not == you
Since "Not" occurs before operator "==", I want to replace it with alist["Not"]
So, the above sentence should be changed to
Hi I am alist["Not"] == you
Another example
My height > your height
should become
My alist["height"] > your height
Edit:
On @Paul's suggestion, I am adding the code which I wrote myself.
It works, but it's too bulky and I am not happy with it.
operators = ["==", ">", "<", "!="]
text_list = text.split(" ")
for index in range(len(text_list)):
    if text_list[index] in operators:
        prev = text_list[index - 1]
        if "." in prev:
            tokens = prev.split(".")
            prev = "alist"
            for token in tokens:
                prev = "%s[\"%s\"]" % (prev, token)
        else:
            prev = "alist[\"%s\"]" % prev
        text_list[index - 1] = prev
text = " ".join(text_list)
This can be done using regular expressions
import re
...
def replacement(match):
    return "alist[\"{}\"]".format(match.group(0))
...
re.sub(r"[^ ]+(?= +==)", replacement, s)
If the space between the word and the "==" in your case is not needed, the last line becomes:
re.sub(r"[^ ]+(?= *==)", replacement, s)
I'd highly recommend you to look into regular expressions, and the python implementation of them, as they are really useful.
Explanation for my solution:
re.sub(pattern, replacement, s) replaces occurrences of patterns, given as regular expressions, with a given string or the output of a function.
I use the output of a function, that puts the whole matched object into the 'alist["..."]' construct. (match.group(0) returns the whole match)
[^ ] match anything but space.
+ match the last subpattern as often as possible, but at least once.
* match the last subpattern as often as possible, but it is optional.
(?=...) is a lookahead. It checks if the stuff after the current cursor position matches the pattern inside the parentheses, but doesn't include them in the final match (at least not in .group(0), if you have groups inside a lookahead, those are retrievable by .group(index)).
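Putting these pieces together, here is a sketch that extends the same idea to all four operators listed in the question and to dotted names such as a.b. The function names wrap and rewrite, and the requirement that the operator be followed by a space or the end of the string, are assumptions made for this example rather than part of the answer.

```python
import re

# Operators from the question's list; the word must be followed by one of
# them as a separate token (assumption: tokens are space-separated).
PATTERN = re.compile(r'[^ ]+(?= +(?:==|!=|>|<)(?: |$))')

def wrap(match):
    """Turn Not into alist["Not"] and a.b into alist["a"]["b"]."""
    keys = match.group(0).split(".")
    return "alist" + "".join('["{}"]'.format(k) for k in keys)

def rewrite(text):
    return PATTERN.sub(wrap, text)

print(rewrite("Hi I am Not == you"))
print(rewrite("My height > your height"))
print(rewrite("a.b == c"))
```

Only the word directly before the operator is rewritten; the second "height" in the second example is left alone because nothing after it satisfies the lookahead.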
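text = "Hi I am Not == you"  # renamed from str to avoid shadowing the built-in
s = text.split()
y = ''
str2 = ''
for x in s:
    if x == "==":  # exact comparison; `x in "=="` would also match a lone "="
        str2 = text.replace(y, 'alist["' + y + '"]')
        break
    y = x  # remember the previous word
print(str2)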
You could try using the regular expression library. I was able to create a simple solution to your problem, as shown here.
import re
data = "Hi I am Not == You"
x = re.search(r'(\w+) ==', data)
print(x.groups())
In this code, re.search looks for one or more word characters followed by a space and the operator "==". The match object is stored in x; the full match is "Not ==", and x.groups() returns the captured word, ('Not',).
Then for swapping you could use the re.sub() method which CodenameLambda suggested.
I'd also recommend learning how to use regular expressions, as they are useful for solving many different problems and are similar between different programming languages.
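For illustration, the sketch below shows what the match object from that search contains and how the same capture could feed re.sub. The replacement string with \1 is one possible way to do the swap, offered as an assumption rather than as this answer's method.

```python
import re

data = "Hi I am Not == You"

m = re.search(r'(\w+) ==', data)
full_match = m.group(0)   # the whole matched text
captured = m.groups()     # only the parenthesised word

# Replace the word before "==" while leaving the operator in place.
replaced = re.sub(r'(\w+)(?= ==)', r'alist["\1"]', data)

print(full_match)
print(captured)
print(replaced)
```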

Splitting a string using re module of python

I have a string
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
#I have to capture only the field 'count_EVENT_GENRE'
field = re.split(r'[(==)(>=)(<=)(in)(like)]', s)[0].strip()
#o/p is 'cou'
# for s = 'sum_EVENT_GENRE in [1,2,3,4,5]' o/p = 'sum_EVENT_GENRE'
which is fine
My problem is that the string s is split at any single character from (in)(like), and I get the first slice (after "cou" it finds a matching character, i.e. n). This happens for any string that contains any character from (in)(like).
Ex: 'percentage_AMOUNT' o/p = 'p'
as it finds the matching character 'e' after 'p'.
So I want some advice on how to treat (in)(like) as words, not as characters, when splitting.
Please suggest a syntax.
Answering your question, the [(==)(>=)(<=)(in)(like)] is a character class matching single characters you defined inside the class. To match sequences of characters, you need to remove [ and ] and use alternation:
r'==?|>=?|<=?|\b(?:in|like)\b'
or better:
r'[=><]=?|\b(?:in|like)\b'
Your code would look like:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
    field = re.split(r'[=><]=?|\b(?:in|like)\b', s)[0].strip()
    print(field)
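To see the difference between the character class and the alternation side by side, here is a small sketch. The helper names first_field_class and first_field_alt are made up for the comparison.

```python
import re

def first_field_class(s):
    # Character class: splits at ANY of the single characters ( = ) > < i n l k e
    return re.split(r'[(==)(>=)(<=)(in)(like)]', s)[0].strip()

def first_field_alt(s):
    # Alternation: splits only at whole operators or the words "in" / "like"
    return re.split(r'[=><]=?|\b(?:in|like)\b', s)[0].strip()

s1 = 'count_EVENT_GENRE in [1,2,3,4,5]'
s2 = 'percentage_AMOUNT'

print(first_field_class(s1), first_field_alt(s1))
print(first_field_class(s2), first_field_alt(s2))
```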
However, there might be other (easier, or safer - depending on the actual specifications) ways to get what you want (splitting with space and getting the first item, use re.match with r'\w+' or r'[a-z]+(?:_[A-Z]+)+', etc.)
If your value is at the start of the string and starts with lowercase ASCII letters, and then can have any amount of sequences of _ followed with uppercase ASCII letters, use:
re.match(r'[a-z]+(?:_[A-Z]+)*', s)
Full demo code:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
    fieldObj = re.match(r'[a-z]+(?:_[A-Z]+)*', s)
    if fieldObj:
        print(fieldObj.group())
If you want only the first word of your string, then this should do the job:
import re
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
field = re.split(r'\W', s)[0]
# count_EVENT_GENRE
Is there anything wrong with using split?
>>> s = 'count_EVENT_GENRE in [1,2,3,4,5]'
>>> s.split(' ')[0]
'count_EVENT_GENRE'
>>> s = 'coint_EVENT_GENRE = "ROMANCE"'
>>> s.split(' ')[0]
'coint_EVENT_GENRE'
>>>

Breaking up substrings in Python based on characters

I am trying to write code that will take a string and remove specific data from it. I know that the data will look like the line below, and I only need the data within the " " marks, not the marks themselves.
inputString = 'type="NN" span="123..145" confidence="1.0" '
Is there a way to take a substring of a string that lies between two given characters, using them as the start and stop points?
You can extract all the text between pairs of " characters using regular expressions:
import re
inputString = 'type="NN" span="123..145" confidence="1.0" '
pat = re.compile('"([^"]*)"')
strings = []  # collected values (was never initialised)
while True:
    mat = pat.search(inputString)
    if mat is None:
        break
    strings.append(mat.group(1))
    inputString = inputString[mat.end():]  # continue after this match
print(strings)
or, easier:
import re
inputString='type="NN" span="123..145" confidence="1.0" '
strings=re.findall('"([^"]*)"', inputString)
print(strings)
Output for both versions:
['NN', '123..145', '1.0']
fields = inputString.split('"')
print(fields[1], fields[3], fields[5])
You could split the string at each space to get a list of 'key="value"' substrings and then use regular expressions to parse the substrings.
Using your input string:
>>> input_string = 'type="NN" span="123..145" confidence="1.0" '
>>> input_string_split = input_string.split()
>>> print(input_string_split)
['type="NN"', 'span="123..145"', 'confidence="1.0"']
Then use regular expressions:
>>> import re
>>> pattern = r'"([^"]+)"'
>>> for substring in input_string_split:
...     match_obj = re.search(pattern, substring)
...     print(match_obj.group(1))
...
NN
123..145
1.0
The regular expression '"([^"]+)"' matches anything within quotation marks (provided there is at least one character). The round brackets indicate the bit of the regular expression that you are interested in.
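If the attribute names are wanted as well, the same idea extends to two capture groups. This is a sketch that assumes every entry has the form name="value" with a word-character name; parse_attributes is a hypothetical helper, not part of any answer above.

```python
import re

def parse_attributes(s):
    """Return a dict mapping each attribute name to its quoted value."""
    # Group 1: the name before "="; group 2: the text between the quotes.
    return dict(re.findall(r'(\w+)="([^"]*)"', s))

input_string = 'type="NN" span="123..145" confidence="1.0" '
attributes = parse_attributes(input_string)

print(attributes)
```

With two groups, re.findall returns a list of (name, value) tuples, which dict turns into a lookup table.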
