Split string with multiple possible delimiters to get substring - python

I am trying to make a simple Discord bot to respond to some user input and having difficulty trying to parse the response for the info I need. I am trying to get their "gamertag"/username but the format is a little different sometimes.
So, my idea was to make a list of delimiter words I am looking for (different versions of the word gamertag such as Gamertag:, Gamertag -, username, etc.)
Then, look line by line for one that contains any of those delimiters.
Split the string on first matching delim, strip non alphanumeric characters
I had it kinda working for a single line, then realized some people don't put it on the first line, so I added a line-by-line check and broke it (on line 19, I just realized). I also suspect there must be a better way than this. Please advise; my partly working code is copied below:
testString = """Application
Gamertag : testGamertag
Discord - testDiscord
Age - 25"""
applicationString = testString
gamertagSplitList = [ "gamertag", "Gamertag","Gamertag:", "gamertag:"]
#splWord = 'Gamertag'
lineNum = 0
for line in applicationString.partition('\n'):
    print(line)
    if line in gamertagSplitList:
        applicationString = line
        break
#get first line
#applicationString = applicationString.partition('\n')[0]
res = ""
#split on word, want to split on first occurrence of list of words
for splitWord in gamertagSplitList:
    if splitWord in applicationString:
        res = applicationString.split(splitWord)
        break
splitString = res[1]
#res = test_string.split(spl_word, 1)
#splitString = res[1]
#get rid of non alphaNum characters
finalString = "" #define string for output
for character in splitString:
    if(character.isalnum()):
        # if character is alphanumeric concat to finalString
        finalString = finalString + character
print(finalString)

Don't know if this will work with all your different inputs, but you can tweak it to get what you want :
import re
gamertagSplitList = ["gamertag", "Gamertag", "Gamertag:", "gamertag:"]
applicationString = """Application
Gamertag : testGamertag
Discord - testDiscord
Age - 25"""
for line in applicationString.split('\n'):
    line = line.replace(' ', '')
    for tag in gamertagSplitList:
        if tag in line:
            gamer_tag = line.replace(tag, '', 1)
            break
print(re.sub(r'\W+', '', gamer_tag))
Output :
testGamertag

You can do it without any loops with a single regex:
import re
gamertagSplitList = ["gamertag", "Gamertag"]
applicationString = """Application
Gamertag : testGamertag
Discord - testDiscord
Age - 25"""
print(re.search(r'(' + '|'.join(gamertagSplitList) + r')\s*[:-]?\s*(\w+)\s*', applicationString)[2])
If all values in gamertagSplitList differ just by casing, you can simplify that even further:
print(re.search(r'gamertag\s*[:-]?\s*(\w+)\s*', applicationString, re.IGNORECASE)[1])
Let's take a closer look at this regex:
gamertag will match a string 'gamertag'
\s* will match any (including none) whitespace characters (space, newline, tab, etc.)
[:-]? will match either none or a single character which is either : or -
(\w+) will match 1 or more word characters (letters, digits, or underscore). The parentheses here denote a group: a specific substring that we can extract later from the match.
By using re.IGNORECASE we make matching case insensitive, so that separator GaMeRtAg will also be recognised by this pattern.
The indexing part [1] means that we're interested in the first group in our pattern (remember the parentheses). The group with index 0 is always the full match, and groups from index 1 upwards represent substrings that match the subexpressions in parentheses (ordered by where their ( appears in the regex).
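As a minimal sketch, the single-regex idea can be wrapped in a helper that also copes with messages containing no keyword at all. The function name extract_gamertag and the KEYWORDS list are hypothetical; "username" is taken from the question's description of possible delimiters.

```python
import re

# Keywords that may introduce the tag (assumed list; extend as needed).
KEYWORDS = ["gamertag", "username"]

# keyword, optional whitespace, optional ':' or '-', optional whitespace, then the tag.
PATTERN = re.compile(
    r"(?:" + "|".join(map(re.escape, KEYWORDS)) + r")\s*[:-]?\s*(\w+)",
    re.IGNORECASE,
)

def extract_gamertag(text):
    """Return the first word that follows a keyword, or None if no keyword is found."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

application = """Application
Gamertag : testGamertag
Discord - testDiscord
Age - 25"""
print(extract_gamertag(application))
```

Checking for None before indexing avoids the TypeError that re.search(...)[1] raises when nothing matches.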

Related

Implement regular expression in Python to replace every occurence of "meshname = x" in a text file

I want to replace with " " every line in a text file that starts with "meshname = " and ends with any combination of letters, digits and underscores. I used regexes in CS but I never really understood the different notations in Python. Can you help me with that?
Is this the right regex for my problem, and how would I transform it into a Python regex?
m.e.s.h.n.a.m.e.' '.=.' '.{{_}*,{0,...,9}*,{a,...,z}*,{A,...,Z}*}*
x.y = Concatenation of x and y
' ' = whitespace
{x} = set containing x
x* = x.x.x. ... .x or empty word
What would the script look like in order to replace every string/line in a file containing meshname = ... with the Python regex? Something like this?
fin = open("test.txt", 'r')
data = fin.read()
data = data.replace("^meshname = [[a-z]*[A-Z]*[0-9]*[_]*]+", "")
fin.close()
fin = open("test.txt", 'w')
fin.write(data)
fin.close()
or is this completely wrong? I've tried to get it working with this approach, but somehow it never matched the right string: How to input a regex in string.replace?
Following the current code logic, you can use
data = re.sub(r'^meshname = .*\w$', ' ', data, flags=re.M)
The re.sub will replace with a space any line that matches
^ - line start (note the flags=re.M argument that makes sure the multiline mode is on)
meshname - the literal word meshname
 = - a = sign with a single space on each side
.* - any zero or more chars other than line break chars as many as possible
\w - a letter/digit/_
$ - line end.
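Put together with the file handling from the question, the substitution might look like the sketch below. The helper names blank_meshname_lines and rewrite_file are hypothetical, and the sample text is made up for illustration.

```python
import re
from pathlib import Path

# Same pattern as above; re.M lets ^ and $ match at every line start/end.
MESHNAME_LINE = re.compile(r'^meshname = .*\w$', flags=re.M)

def blank_meshname_lines(text):
    """Replace each matching 'meshname = ...' line with a single space."""
    return MESHNAME_LINE.sub(' ', text)

def rewrite_file(path):
    """Read the file, blank the matching lines, and write the result back."""
    path = Path(path)
    path.write_text(blank_meshname_lines(path.read_text()))

sample = "meshname = wall_01\nvertices = 8\nmeshname = Floor2\n"
print(repr(blank_meshname_lines(sample)))
```

Keeping the substitution in a pure function makes it easy to try on a string before pointing rewrite_file at test.txt.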

how to get a pattern repeating multiple times in a string using regular expression

I am still new to regular expressions, as in the Python library re.
I want to extract all the proper nouns as a whole word if they are separated by space.
I tried
result = re.findall(r'(\w+)\w*/NNP (\w+)\w*/NNP', tagged_sent_str)
Input: I have a string like
tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
Output expected:
[('European Community'), ('European')]
Current output:
[('European','Community')]
But this only gives the pairs, not the single ones. I want both kinds.
IIUC, itertools.groupby is more suited for this kind of job:
from itertools import groupby
def join_token(string_, type_ = 'NNP'):
    res = []
    for k, g in groupby([i.split('/') for i in string_.split()], key=lambda x:x[1]):
        if k == type_:
            res.append(' '.join(i[0] for i in g))
    return res
join_token(tagged_sent_str)
Output:
['European Community', 'European']
and it doesn't require a modification if you expect three or more consecutive types:
str2 = "European/NNP Community/NNP Union/NNP French/JJ European/NNP export/VB"
join_token(str2)
Output:
['European Community Union', 'European']
Interesting requirement. Code is explained in the comments, a very fast solution using only REGEX:
import re
# make it more complex
text = "export1/VB European0/NNP export/VB European1/NNP Community1/NNP Community2/NNP French/JJ European2/NNP export/VB European2/NNP"
# 1: First clean up the target words: word/NNP to word,
# you can use str.replace but just to show you a technique
# how to use a back reference to a group: use \index_of_group
# re.sub(r'/NNP', '', text)
# text.replace('/NNP', '')
_text = re.sub(r'(\w+)/NNP', r'\1', text)
# this pattern strips the leading and trailing spaces
RE_FIND_ALL = r'(?:\s+|^)((?:(?:\s|^)?\w+(?=\s+|$)?)+)(?:\s+|$)'
print('RESULT : ', re.findall(RE_FIND_ALL, _text))
OUTPUT:
RESULT : ['European0', 'European1 Community1 Community2', 'European2', 'European2']
Explaining REGEX:
(?:\s+|^) : skip leading spaces
((?:(?:\s)?\w+(?=\s+|$))+): a capturing group wrapped around a non-capturing subgroup. The subgroup (?:(?:\s)?\w+(?=\s+|$)) matches each word that is followed by spaces or the end of the line, and the outer group captures the whole sequence. Without the outer group, the match would return only the first word.
(?:\s+|$) : remove trailing space of the sequence
I needed to remove /NNP from the target words because you want to keep a sequence of word/NNP tokens in a single group. Something like (word)/NNP (word)/NNP would return two elements in one group rather than a single text. After removing the tag, the text is just word word, so a regex like ((?:\w+\s)+) would capture the sequence of words. It is not quite that simple, though, because we must capture only the words that do not have /sequence_of_letters at the end. This way there is no need to loop over the matched groups and concatenate elements to build the text.
NOTE: both solutions work fine if all words are in the format word/sequence_of_letters; if you have words that are not in this format
you need to fix those. If you want to keep them, add /NNP at the end of each word; otherwise add /DUMMY to remove them.
Using re.split, but slower because I'm using a list comprehension to fix the result:
import re
# make it more complex
text = "export1/VB Europian0/NNP export/VB Europian1/NNP Community1/NNP Community2/NNP French/JJ Europian2/NNP export/VB Europian2/NNP export/VB export/VB"
RE_SPLIT = r'\w+/[^N]\w+'
result = [x.replace('/NNP', '').strip() for x in re.split(RE_SPLIT, text) if x.strip()]
print('RESULT: ', result)
You'd like to get a pattern but with some parts deleted from it.
You can get it with two successive regexes:
tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
[ re.sub(r"/NNP","",s) for s in re.findall(r"\w+/NNP(?:\s+\w+/NNP)*",tagged_sent_str) ]
['European Community', 'European']
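The two-step approach generalises to any tag. Below is a small sketch; the function name extract_runs and its tag parameter are hypothetical, and the \b after the tag is an added safeguard so that a tag such as NN does not match the start of NNP.

```python
import re

def extract_runs(tagged, tag="NNP"):
    """Return each run of consecutive word/TAG tokens as one space-joined string."""
    suffix = "/" + re.escape(tag) + r"\b"
    # One tagged word, followed by any number of further tagged words.
    run = re.compile(r"\w+" + suffix + r"(?:\s+\w+" + suffix + r")*")
    # Second pass: strip the tag from every matched run.
    return [re.sub(suffix, "", match) for match in run.findall(tagged)]

tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
print(extract_runs(tagged_sent_str))
```

Because the repetition is inside the pattern, runs of three or more tagged words need no extra handling.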

Search file for exact match of word list

There are many, many questions surrounding this, some using regex, some using with open, and others, but I have found none that suitably fits my requirements.
I am opening an XML file which contains strings, 1 per line, e.g.
<string name="AutoConf_5">setup is in progress…</string>
I want to iterate over each line in the file and search each line for exact matches of words in a list. The current code seems to work and prints out matches, but it doesn't do exact matches, e.g. 'pass' finds 'passed', 'pro' finds 'provide', 'process', 'proceed', etc.
def stringRun(self,file):
    str_file = ['admin','premium','pro','paid','pass','password','api']
    with open(file, 'r') as sf:
        for s in sf:
            if any(x in str(s) for x in str_file):
                self.progressBox.AppendText(s)
Instead of using the "in" operator, which matches any substring in the line, you should use the regex function re.search.
I haven't checked it with Python, so minor syntax errors might have slipped in, but this is the general idea. Replace the if in your code with this:
if any(re.search(x, str(s)) for x in str_file):
Then you can use the power of regex to search for the words in the list with word boundaries. You need to add '\b' to the beginning and end of each search string, or add to all in the condition:
if any(re.search(r'\b' + x + r'\b', str(s)) for x in str_file):
If you want an exact match, IMO, the best way is to prepare the strings to match and then search for each string in each line.
For instance, you can prepare a mapping between the tagged strings and the strings you want to match:
tagged = {'<string name="AutoConf_5">{0}</string>'.format(s): s
          for s in str_file}
This dict is an association between the tagged string you want to match and the actual string.
You can use it like this:
for line in sf:
    line = line.strip()
    if line in tagged:
        self.progressBox.AppendText(tagged[line])
Note: if any of your string contains "&", "<" or ">", you need to escape those characters, like this:
from xml.sax.saxutils import escape
tagged = {'<string name="AutoConf_5">{0}</string>'.format(escape(s)): s
          for s in str_file}
Another solution is to use lxml to parse your XML tree and find nodes which match a given xpath expression.
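A minimal sketch of that parsing idea, using the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages. The <resources> root element, the sample entries, and the names AutoConf_6 and AutoConf_7 are assumptions for illustration; the question only shows a single <string> line.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real file's root element is not shown in the question.
XML_TEXT = """<resources>
<string name="AutoConf_5">setup is in progress…</string>
<string name="AutoConf_6">it has passed</string>
<string name="AutoConf_7">I pass my exam, I am a pro</string>
</resources>"""

str_file = ['admin', 'premium', 'pro', 'paid', 'pass', 'password', 'api']
# \b on both sides restricts matches to whole words.
word_re = re.compile(r"\b(?:" + "|".join(map(re.escape, str_file)) + r")\b")

root = ET.fromstring(XML_TEXT)
matches = {}
for node in root.iter("string"):
    # node.text is already unescaped by the parser; it is None for empty elements.
    found = word_re.findall(node.text or "")
    if found:
        matches[node.get("name")] = found
print(matches)
```

With a real parser there is no need to unescape entities by hand, and elements that span several lines are handled as well.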
EDIT: match at least one word (from a word list)
You have a list of words. To match XML content that contains at least one word from this list, you can use a regular expression.
You may encounter 2 difficulties:
XML content, read as a text file, can contain escaped forms of "&", "<" or ">", so you need to unescape the XML content.
some words from your word list may contain RegEx special characters (like "[" or "(") which must be escaped.
First, you can prepare a RegEx (and a function) to find all occurrences of a word in a string. To do that, you can use "\b" to match the empty string, but only at the beginning or end of a word:
str_file = ['admin', 'premium', 'pro', 'paid', 'pass', 'password', 'api']
re_any_word = r"\b(?:" + r"|".join(re.escape(e) for e in str_file) + r")\b"
find_any_word = re.compile(re_any_word, flags=re.DOTALL).findall
For instance:
>>> find_any_word("Time has passed")
[]
>>> find_any_word("I pass my exam, I'm a pro")
['pass', 'pro']
To extract the content of an XML fragment, you can also use a RegEx (even if it is not recommended in the general case, it is worth it here):
The following RegEx (and function) matches a "<string>...</string>" fragment and selects the content in the first group:
re_string = r'<string[^>]*>(.*?)</string>'
match_string = re.compile(re_string, flags=re.DOTALL).match
For instance:
>>> match_string('<string name="AutoConf_5">setup is in progress…</string>').group(1)
setup is in progress…
Now, all you have to do is to parse your file, line by line.
For the demo, I used a list of strings:
from xml.sax import saxutils

lines = [
    '<string name="AutoConf_5">setup is in progress…</string>\n',
    '<string name="AutoConf_5">it has passed</string>\n',
    '<string name="AutoConf_5">I pass my exam, I am a pro</string>\n',
]
for line in lines:
    line = line.strip()
    mo = match_string(line)
    if mo:
        # undo XML escaping such as &amp; before searching for words
        content = saxutils.unescape(mo.group(1))
        words = find_any_word(content)
        if words:
            print(line + " => " + ", ".join(words))
You get:
<string name="AutoConf_5">I pass my exam, I am a pro</string> => pass, pro

Splitting a string using re module of python

I have a string
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
#I have to capture only the field 'count_EVENT_GENRE'
field = re.split(r'[(==)(>=)(<=)(in)(like)]', s)[0].strip()
#o/p is 'cou'
# for s = 'sum_EVENT_GENRE in [1,2,3,4,5]' o/p = 'sum_EVENT_GENRE'
which is fine
My problem is that it splits the string s at any single character from (in)(like) and gives me the first slice (after "cou" it finds a matching character, i.e. n). This happens for any string that contains any character from (in)(like).
Ex : 'percentage_AMOUNT' o/p = 'p'
as it finds a matching char 'e' after p.
So I would like some advice on how to treat (in)(like) as words, not as characters, when splitting.
Please suggest a syntax.
Answering your question, the [(==)(>=)(<=)(in)(like)] is a character class matching single characters you defined inside the class. To match sequences of characters, you need to remove [ and ] and use alternation:
r'==?|>=?|<=?|\b(?:in|like)\b'
or better:
r'[=><]=?|\b(?:in|like)\b'
Your code would look like:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
    field = re.split(r'[=><]=?|\b(?:in|like)\b', s)[0].strip()
    print(field)
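To make the difference between the character class and the alternation concrete, here is a small side-by-side sketch. The helper name first_field is hypothetical; the two patterns are the one from the question and the one suggested above.

```python
import re

CHAR_CLASS = r'[(==)(>=)(<=)(in)(like)]'     # any ONE of: ( ) = > < i n l k e
ALTERNATION = r'[=><]=?|\b(?:in|like)\b'     # an operator, or the whole word in/like

def first_field(pattern, s):
    """Split s on pattern and return the stripped first piece."""
    return re.split(pattern, s)[0].strip()

for s in ['count_EVENT_GENRE in [1,2,3,4,5]', 'percentage_AMOUNT']:
    print(repr(first_field(CHAR_CLASS, s)), repr(first_field(ALTERNATION, s)))
# 'cou' 'count_EVENT_GENRE'
# 'p' 'percentage_AMOUNT'
```

The character class cuts at the first n or e it meets, while the alternation only splits on a complete operator or a standalone in/like.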
However, there might be other (easier, or safer - depending on the actual specifications) ways to get what you want (splitting with space and getting the first item, use re.match with r'\w+' or r'[a-z]+(?:_[A-Z]+)+', etc.)
If your value is at the start of the string and starts with lowercase ASCII letters, and then can have any amount of sequences of _ followed with uppercase ASCII letters, use:
re.match(r'[a-z]+(?:_[A-Z]+)*', s)
Full demo code:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
    fieldObj = re.match(r'[a-z]+(?:_[A-Z]+)*', s)
    if fieldObj:
        print(fieldObj.group())
If you want only the first word of your string, then this should do the job:
import re
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
field = re.split(r'\W', s)[0]
# count_EVENT_GENRE
Is there anything wrong with using split?
>>> s = 'count_EVENT_GENRE in [1,2,3,4,5]'
>>> s.split(' ')[0]
'count_EVENT_GENRE'
>>> s = 'coint_EVENT_GENRE = "ROMANCE"'
>>> s.split(' ')[0]
'coint_EVENT_GENRE'
>>>

Split group of special characters from string

In test.txt:
quiet confidence^_^
want:P
(:let's start
Codes:
import re
file = open('test.txt').read()
for line in file.split('\n'):
    line = re.findall(r"[^\w\s$]+|[a-zA-z]+|[^\w\s$]+", line)
    print(" ".join(line))
Results showed:
quiet confidence^_^
want : P
(: let ' s start
I tried to separate group of special characters from string but still incorrect.
Any suggestion?
Expected results:
quiet confidence ^_^
want :P
(: let's start
As @interjay said, you must define what you consider a word and what counts as "special characters". Still, I would use 2 separate regexes, one to find what is a word and one to find what is not.
word = re.compile(r"[a-zA-Z']+")
not_word = re.compile(r"[^a-zA-Z']+")
for line in file.split('\n'):
    matched_words = re.findall(word, line)
    non_matching_words = re.findall(not_word, line)
    print(" ".join(matched_words))
    print(" ".join(non_matching_words))
Keep in mind that runs of spaces (\s+) will be grouped as non-words.
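A runnable Python 3 sketch of the two-regex idea, with two deviations that are assumptions on my part: whitespace is excluded from the non-word class so that spaces are not reported, and the lines come from an inline list rather than test.txt. It also shows why ^_^ stayed glued to "confidence" in the question: the range A-z in [a-zA-z] covers the ASCII characters between Z and a, which include ^ and _.

```python
import re

word = re.compile(r"[a-zA-Z']+")
not_word = re.compile(r"[^a-zA-Z'\s]+")   # assumption: whitespace is left out of "non-words"

lines = ["quiet confidence^_^", "want:P", "(:let's start"]
for line in lines:
    print(word.findall(line), not_word.findall(line))

# The question's range A-z (lowercase z) also matches [ \ ] ^ _ and `
print(re.findall(r"[a-zA-z]+", "confidence^_^"))   # ['confidence^_^']
print(re.findall(r"[a-zA-Z]+", "confidence^_^"))   # ['confidence']
```

Note that this still splits :P into : and P; keeping such emoticons whole needs an explicit definition of which letter combinations count as "special".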
