Re.search for comma/delimiter-removed substring - python

I have a text and used a function to extract a part of it. However, in the returned value, delimiters (e.g. ',', '-') are removed. I need to find the extracted part in the original text, i.e. the matching substring and its position.
e.g:
original_text = "xyz, 19900 Praha 9, Letnany"
(or original_text = "xyz, 19900 Praha 9 - Letnany")
extracted_text = "praha 9 letnany" (lower case, delimiters are removed)
I expect the output to be the same as the output of re.search('praha 9, letnany', original_text), i.e. getting the substring 'Praha 9, Letnany' and the start of the match: 11.
Is there any regular expression to locate extracted text in the original text?
The output of the function can't be changed (at least for now).
I have looked for questions about ignoring certain characters while using regex, but those problems are different.

This will locate a span in the original text that matches the extracted text ignoring case & inserting delimiters at will (in this case, comma or dash):
import re
pat = ("[,-]*".join(list(extracted_text))).replace(" ","\\s")
mat = re.search( pat, original_text, re.I )
if mat:
    print(mat.span())
else:
    print("No match")

Same idea as @ScottHunter's answer, but processing at word level instead of character level:
import re
ori_txt = '19900, Praha 7, Letnany'
extr_txt = 'praha 7 letnany'
delimiters = [',', r'\s', '-']
deli = '|'.join([i for i in delimiters])
extr_arr = re.split(deli, extr_txt)
ins_c = ''.join([i for i in delimiters])
ins_c = ''.join(['[', ins_c, ']', '*'])
pat = ins_c.join(extr_arr)
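# pat is now 'praha[,\s-]*7[,\s-]*letnany'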
mat = re.search(pat, ori_txt, re.I)
if mat:
    print(mat.group())
else:
    print('not found')
I first wanted to find a regular expression to directly search for the extracted text in the original text, but there seems to be no such expression. This is another way to solve my problem. Thank you.

how to get a pattern repeating multiple times in a string using regular expression

I am still new to regular expressions, as in the Python re library.
I want to extract consecutive proper nouns as one whole phrase when they are separated by spaces.
I tried
result = re.findall(r'(\w+)\w*/NNP (\w+)\w*/NNP', tagged_sent_str)
Input: I have a string like
tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
Output expected:
[('European Community'), ('European')]
Current output:
[('European','Community')]
But this only gives the pairs, not the single ones. I want all of them.
IIUC, itertools.groupby is more suited for this kind of job:
from itertools import groupby
def join_token(string_, type_='NNP'):
    res = []
    for k, g in groupby([i.split('/') for i in string_.split()], key=lambda x: x[1]):
        if k == type_:
            res.append(' '.join(i[0] for i in g))
    return res

join_token(tagged_sent_str)
Output:
['European Community', 'European']
and it doesn't require any modification if you have three or more consecutive tokens of the same type:
str2 = "European/NNP Community/NNP Union/NNP French/JJ European/NNP export/VB"
join_token(str2)
Output:
['European Community Union', 'European']
Interesting requirement. The code is explained in the comments; a very fast solution using only regex:
import re
# make it more complex
text = "export1/VB European0/NNP export/VB European1/NNP Community1/NNP Community2/NNP French/JJ European2/NNP export/VB European2/NNP"
# 1: First clean up the target words, turning word/NNP into word.
#    You could use str.replace, but this shows the technique of
#    using a back reference to a group: \index_of_group
# re.sub(r'/NNP', '', text)
# text.replace('/NNP', '')
_text = re.sub(r'(\w+)/NNP', r'\1', text)
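# _text is now:
# "export1/VB European0 export/VB European1 Community1 Community2 French/JJ European2 export/VB European2"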
# this pattern strips the leading and trailing spaces
RE_FIND_ALL = r'(?:\s+|^)((?:(?:\s|^)?\w+(?=\s+|$)?)+)(?:\s+|$)'
print('RESULT : ', re.findall(RE_FIND_ALL, _text))
OUTPUT:
RESULT : ['European0', 'European1 Community1 Community2', 'European2', 'European2']
Explaining REGEX:
(?:\s+|^) : skip leading spaces
((?:(?:\s)?\w+(?=\s+|$))+) : a capturing group around the non-capturing subgroup (?:(?:\s)?\w+(?=\s+|$)). The subgroup matches a sequence of words followed by spaces or the end of the line, and the whole sequence is captured by the outer group. If we didn't do this, the match would return only the first word.
(?:\s+|$) : remove the trailing space of the sequence
I needed to remove /NNP from the target words because you want to keep a sequence of word/NNP tokens in a single group. Doing something like (word)/NNP (word)/NNP returns two elements in one group, but not as a single text. By removing the tag, the text becomes word word, so a regex like ((?:\w+\s)+) can capture the sequence of words. It is not quite that simple, because we must only capture words that don't have /sequence_of_letters at the end, but there is no need to loop over the matched groups to concatenate elements into a valid text.
NOTE: both solutions work fine only if all words are in the format word/sequence_of_letters; if you have words that are not in this format,
you need to fix them first: add /NNP at the end of a word to keep it, or another tag like /DUMMY to drop it.
Using re.split, but slower because I'm using a list comprehension to fix up the result:
import re
# make it more complex
text = "export1/VB Europian0/NNP export/VB Europian1/NNP Community1/NNP Community2/NNP French/JJ Europian2/NNP export/VB Europian2/NNP export/VB export/VB"
RE_SPLIT = r'\w+/[^N]\w+'
result = [x.replace('/NNP', '').strip() for x in re.split(RE_SPLIT, text) if x.strip()]
print('RESULT: ', result)
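OUTPUT:
RESULT:  ['Europian0', 'Europian1 Community1 Community2', 'Europian2', 'Europian2']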
You'd like to match a pattern but have some parts deleted from the result.
You can get it with two successive regexes:
tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
[ re.sub(r"/NNP","",s) for s in re.findall(r"\w+/NNP(?:\s+\w+/NNP)*",tagged_sent_str) ]
['European Community', 'European']

How to fix 'replace' keyword when it is not working in python

I am writing code that needs to get four individual values, and one of the values has a newline character in addition to an extra apostrophe and bracket, like so: 11\n']. I only need the 11; I have been able to strip the '], but I am unable to remove the newline character.
I have tried various combinations of strip and replace, but neither removes that part.
with open('gil200110raw.txt', 'r') as qcfile:
    txt = qcfile.readlines()
    line1 = txt[1:2]
    line2 = txt[2:3]
    line1 = str(line1)
    line2 = str(line2)
    sptline1 = line1.split(' ')
    sptline2 = line2.split(' ')
    totalobs = sptline1[39]
    qccalc1 = sptline2[2]
    qccalc2 = sptline2[9]
    qccalc3 = sptline2[16]
    qccalc4 = sptline2[22]
    qccalc4 = qccalc4.strip("\n']")
    qccalc4 = qccalc4.replace("\n", "")
I did not get an error, but the output of print(qccalc4) is 11\n. I expect the output to be 11.
Use rstrip instead!
>>> 'test string\n'.rstrip()
'test string'
You can use regex to match the outputs you're looking for.
From your description, I assume the values are all integers; consider the following snippet:
import re
p = re.compile('[0-9]+')
sample = '11\n\'] dwqed 12 444'
results = p.findall(sample)
results now contains the list ['11', '12', '444'].
re is the regex package for Python, and p is the pattern we would like to find in our text. The pattern [0-9]+ simply means: match one or more of the characters 0 to 9.
You can find more details in the re module documentation.
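Applied to the value from the question (a rough sketch; I'm assuming qccalc4 still holds the literal backslash-n left over from str() of the list slice), taking the first match gives just the number:
import re

qccalc4 = "11\\n']"  # the token as described in the question: 1, 1, backslash, n, quote, bracket
number = re.search(r'[0-9]+', qccalc4).group()
print(number)  # 11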

Search file for exact match of word list

There are many, many questions surrounding this, some using regex, some using with open, and others, but I have found none that suitably fits my requirements.
I am opening an XML file which contains strings, one per line, e.g.
<string name="AutoConf_5">setup is in progress…</string>
I want to iterate over each line in the file and search each line for exact matches of words in a list. The current code seems to work and prints out matches, but it doesn't do exact matches: e.g. 'pass' finds 'passed', and 'pro' finds 'provide', 'process', 'proceed', etc.
def stringRun(self, file):
    str_file = ['admin','premium','pro','paid','pass','password','api']
    with open(file, 'r') as sf:
        for s in sf:
            if any(x in str(s) for x in str_file):
                self.progressBox.AppendText(s)
Instead of using the "in" operator, which matches any substring of the line, you should use re.search.
I haven't checked this with Python, so minor syntax errors might have slipped in, but this is the general idea; replace the if in your code with this:
if any(re.search(x, str(s)) for x in str_file):
Then you can use the power of regex to search for the words in the list with word boundaries: add \b to the beginning and end of each search string, or add it for all of them directly in the condition:
if any(re.search(r'\b' + x + r'\b', str(s)) for x in str_file):
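Putting the question's method together with this suggestion, a standalone sketch might look like the following (string_run and path are hypothetical names; it prints matching lines instead of appending to progressBox, and adds re.escape as a precaution in case a search word ever contains a regex metacharacter):
import re

str_file = ['admin', 'premium', 'pro', 'paid', 'pass', 'password', 'api']

def string_run(path):
    # hypothetical standalone version of stringRun from the question
    with open(path, 'r') as sf:
        for s in sf:
            # \b ... \b restricts each search word to whole-word matches
            if any(re.search(r'\b' + re.escape(x) + r'\b', s) for x in str_file):
                print(s, end='')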
If you want an exact match, IMO, the best way is to prepare the strings to match and then search for each of them in each line.
For instance, you can prepare a mapping between the tagged strings and the strings you want to match:
tagged = {'<string name="AutoConf_5">{0}</string>'.format(s): s
          for s in str_file}
This dict is an association between the tagged string you want to match and the actual string.
You can use it like that:
for line in sf:
    line = line.strip()
    if line in tagged:
        self.progressBox.AppendText(tagged[line])
Note: if any of your strings contains "&", "<" or ">", you need to escape those characters, like this:
from xml.sax.saxutils import escape
tagged = {'<string name="AutoConf_5">{0}</string>'.format(escape(s)): s
          for s in str_file}
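A minimal self-contained sketch of this exact-line idea (the two sample lines below are invented for illustration):
from xml.sax.saxutils import escape

str_file = ['admin', 'premium', 'pro', 'paid', 'pass', 'password', 'api']
tagged = {'<string name="AutoConf_5">{0}</string>'.format(escape(s)): s
          for s in str_file}

sample_lines = [
    '<string name="AutoConf_5">pro</string>\n',
    '<string name="AutoConf_5">setup is in progress…</string>\n',
]
for line in sample_lines:
    line = line.strip()
    if line in tagged:
        print(tagged[line])  # only "pro" is printed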
Another solution is to use lxml to parse your XML tree and find nodes which match a given xpath expression.
EDIT: match at least one word (from a word list)
You have a list of strings containing words. To match XML content which contains at least one word from this list, you can use a regular expression.
You may encounter two difficulties:
XML content, parsed like a text file, can contain escaped "&", "<" or ">", so you need to unescape the XML content.
some words from your word list may contain regex special characters (like "[" or "(") which must be escaped.
First, you can prepare a regex (and a function) to find all occurrences of a word in a string. To do that, you can use "\b", which matches the empty string, but only at the beginning or end of a word:
str_file = ['admin', 'premium', 'pro', 'paid', 'pass', 'password', 'api']
re_any_word = r"\b(?:" + r"|".join(re.escape(e) for e in str_file) + r")\b"
find_any_word = re.compile(re_any_word, flags=re.DOTALL).findall
For instance:
>>> find_any_word("Time has passed")
[]
>>> find_any_word("I pass my exam, I'm a pro")
['pass', 'pro']
To extract the content of an XML fragment, you can also use a regex (even if it is not recommended in the general case, it is worth it here):
The following regex (and function) matches a "<string>...</string>" fragment and selects the content in the first group:
re_string = r'<string[^>]*>(.*?)</string>'
match_string = re.compile(re_string, flags=re.DOTALL).match
For instance:
>>> match_string('<string name="AutoConf_5">setup is in progress…</string>').group(1)
setup is in progress…
Now, all you have to do is to parse your file, line by line.
For the demo, I used a list of strings:
lines = [
    '<string name="AutoConf_5">setup is in progress…</string>\n',
    '<string name="AutoConf_5">it has passed</string>\n',
    '<string name="AutoConf_5">I pass my exam, I am a pro</string>\n',
]

from xml.sax import saxutils  # provides saxutils.unescape used below

for line in lines:
    line = line.strip()
    mo = match_string(line)
    if mo:
        content = saxutils.unescape(mo.group(1))
        words = find_any_word(content)
        if words:
            print(line + " => " + ", ".join(words))
You get:
<string name="AutoConf_5">I pass my exam, I am a pro</string> => pass, pro

Splitting a string using re module of python

I have a string
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
#I have to capture only the field 'count_EVENT_GENRE'
field = re.split(r'[(==)(>=)(<=)(in)(like)]', s)[0].strip()
#o/p is 'cou'
# for s = 'sum_EVENT_GENRE in [1,2,3,4,5]' o/p = 'sum_EVENT_GENRE'
which is fine
My doubt: for any character in (in)(like), it splits the string s at that character and gives me the first slice (as after "cou" it finds a matching character, i.e. 'n'). This happens for any string that contains any character from (in)(like).
Example: 'percentage_AMOUNT' gives o/p = 'p',
as it finds a matching character, 'e', after 'p'.
So I want some advice on how to treat in and like as words, not as characters, when the splitting happens.
Please suggest a syntax.
Answering your question, the [(==)(>=)(<=)(in)(like)] is a character class matching single characters you defined inside the class. To match sequences of characters, you need to remove [ and ] and use alternation:
r'==?|>=?|<=?|\b(?:in|like)\b'
or better:
r'[=><]=?|\b(?:in|like)\b'
Your code would look like:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
    field = re.split(r'[=><]=?|\b(?:in|like)\b', s)[0].strip()
    print(field)
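For the two sample strings this prints count_EVENT_GENRE and coint_EVENT_GENRE.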
However, there might be other (easier, or safer - depending on the actual specifications) ways to get what you want (splitting with space and getting the first item, use re.match with r'\w+' or r'[a-z]+(?:_[A-Z]+)+', etc.)
If your value is at the start of the string and starts with lowercase ASCII letters, and then can have any amount of sequences of _ followed with uppercase ASCII letters, use:
re.match(r'[a-z]+(?:_[A-Z]+)*', s)
Full demo code:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
    fieldObj = re.match(r'[a-z]+(?:_[A-Z]+)*', s)
    if fieldObj:
        print(fieldObj.group())
If you want only the first word of your string, then this should do the job:
import re
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
field = re.split(r'\W', s)[0]
# count_EVENT_GENRE
Is there anything wrong with using split?
>>> s = 'count_EVENT_GENRE in [1,2,3,4,5]'
>>> s.split(' ')[0]
'count_EVENT_GENRE'
>>> s = 'coint_EVENT_GENRE = "ROMANCE"'
>>> s.split(' ')[0]
'coint_EVENT_GENRE'
>>>

Breaking up substrings in Python based on characters

I am trying to write code that will take a string and pull specific data out of it. I know that the data will look like the line below, and I only need the data within the " " marks, not the marks themselves.
inputString = 'type="NN" span="123..145" confidence="1.0" '
Is there a way to take a substring of a string between two characters that mark the start and stop points?
You can extract all the text between pairs of " characters using regular expressions:
import re

inputString = 'type="NN" span="123..145" confidence="1.0" '
pat = re.compile('"([^"]*)"')
strings = []
while True:
    mat = pat.search(inputString)
    if mat is None:
        break
    strings.append(mat.group(1))
    inputString = inputString[mat.end():]
print(strings)
or, easier:
import re

inputString = 'type="NN" span="123..145" confidence="1.0" '
strings = re.findall('"([^"]*)"', inputString)
print(strings)
Output for both versions:
['NN', '123..145', '1.0']
fields = inputString.split('"')
print(fields[1], fields[3], fields[5])
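# prints: NN 123..145 1.0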
You could split the string at each space to get a list of 'key="value"' substrings and then use regular expressions to parse the substrings.
Using your input string:
>>> input_string = 'type="NN" span="123..145" confidence="1.0" '
>>> input_string_split = input_string.split()
>>> print(input_string_split)
['type="NN"', 'span="123..145"', 'confidence="1.0"']
Then use regular expressions:
>>> import re
>>> pattern = r'"([^"]+)"'
>>> for substring in input_string_split:
...     match_obj = re.search(pattern, substring)
...     print(match_obj.group(1))
NN
123..145
1.0
The regular expression '"([^"]+)"' matches anything within quotation marks (provided there is at least one character). The round brackets indicate the bit of the regular expression that you are interested in.
