How to replace a word which occurs before another word in python - python

I want to replace (re-spell) a word A in a text string with another word B if word A occurs before an operator. Word A can be any word.
E.G:
Hi I am Not == you
Since "Not" occurs before operator "==", I want to replace it with alist["Not"]
So the above sentence should be changed to
Hi I am alist["Not"] == you
Another example
My height > your height
should become
My alist["height"] > your height
Edit:
On #Paul's suggestion, I am adding the code which I wrote myself.
It works, but it's too bulky and I am not happy with it.
operators = ["==", ">", "<", "!="]
text_list = text.split(" ")
for index in range(len(text_list)):
    if text_list[index] in operators:
        prev = text_list[index - 1]
        if "." in prev:
            tokens = prev.split(".")
            prev = "alist"
            for token in tokens:
                prev = "%s[\"%s\"]" % (prev, token)
        else:
            prev = "alist[\"%s\"]" % prev
        text_list[index - 1] = prev
text = " ".join(text_list)

This can be done using regular expressions
import re
...
def replacement(match):
    return "alist[\"{}\"]".format(match.group(0))
...
re.sub(r"[^ ]+(?= +==)", replacement, s)
If the space between the word and the "==" in your case is not needed, the last line becomes:
re.sub(r"[^ ]+(?= *==)", replacement, s)
I'd highly recommend you to look into regular expressions, and the python implementation of them, as they are really useful.
Explanation for my solution:
re.sub(pattern, replacement, s) replaces occurrences of the pattern, given as a regular expression, with a given string or with the output of a function.
I use the output of a function that puts the whole match into the 'alist["..."]' construct (match.group(0) returns the whole match).
[^ ] match anything but space.
+ match the last subpattern as often as possible, but at least once.
* match the last subpattern as often as possible, but it is optional.
(?=...) is a lookahead. It checks if the stuff after the current cursor position matches the pattern inside the parentheses, but doesn't include them in the final match (at least not in .group(0), if you have groups inside a lookahead, those are retrievable by .group(index)).
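To tie the pieces of this explanation together, here is a minimal sketch that applies the same lookahead idea to all four operators from the question. The names OPERATORS and wrap_word_before_operator are made up for this illustration, and it assumes the word and the operator are separated by at least one space.

```python
import re

# Operator list taken from the question; adjust as needed (assumption).
OPERATORS = ["==", "!=", ">", "<"]

def wrap_word_before_operator(text):
    # Build "==|!=|>|<", escaping each operator for use inside a regex.
    ops = "|".join(re.escape(op) for op in OPERATORS)
    # A run of non-space characters, followed (lookahead only) by
    # one or more spaces and then any of the operators.
    pattern = r"[^ ]+(?= +(?:" + ops + r"))"
    return re.sub(pattern, lambda m: 'alist["{}"]'.format(m.group(0)), text)

print(wrap_word_before_operator("Hi I am Not == you"))
# Hi I am alist["Not"] == you
print(wrap_word_before_operator("My height > your height"))
# My alist["height"] > your height
```

Because the operator sits inside the lookahead, it is checked but never consumed, so it stays in the output untouched.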

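text = "Hi I am Not == you"   # renamed from str to avoid shadowing the built-in
words = text.split()
prev = ''
result = ''
for word in words:
    if word == "==":          # exact comparison; `in "=="` also matches "=" and ""
        # note: replace() rewrites every occurrence of prev in the text
        result = text.replace(prev, 'alist["' + prev + '"]')
        break
    prev = word
print(result)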

You could try using the regular expression library. I was able to create a simple solution to your problem, as shown here.
import re
data = "Hi I am Not == You"
x = re.search(r'(\w+) ==', data)
print(x.groups())
In this code, re.search looks for one or more word characters followed by the operator (" ==") and stores the resulting match object in the variable x. The whole match is "Not ==", and x.groups() returns only the captured word, as the tuple ('Not',).
Then for swapping you could use the re.sub() method which CodenameLambda suggested.
I'd also recommend learning how to use regular expressions, as they are useful for solving many different problems and are similar between different programming languages.
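As a rough illustration of how the capture group from re.search carries over to re.sub, the sketch below reuses the same pattern and refers to the captured word as \1 in the replacement template. The variable names are only for this example.

```python
import re

data = "Hi I am Not == You"

match = re.search(r'(\w+) ==', data)
whole_match = match.group(0)   # the full matched text: 'Not =='
captured = match.group(1)      # only the parenthesised word: 'Not'

# \1 in the replacement template refers to the first capture group.
result = re.sub(r'(\w+) ==', r'alist["\1"] ==', data)
print(result)  # Hi I am alist["Not"] == You
```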

Related

Replace commas enclosed in curly braces

I try to replace commas with semicolons enclosed in curly braces.
Sample string:
text = "a,b,{'c','d','e','f'},g,h"
I am aware that it comes down to lookbehinds and lookaheads, but somehow it won't work like I want it to:
substr = re.sub(r"(?<=\{)(.+?)(,)(?=.+\})",r"\1;", text)
It returns:
a,b,{'c';'d','e','f'},g,h
However, I am aiming for the following:
a,b,{'c';'d';'e';'f'},g,h
Any idea how I can achieve this?
Any help much appreciated :)
You can match the whole block {...} (with {[^{}]+}) and replace commas inside it only with a lambda:
import re
text = "a,b,{'c','d','e','f'},g,h"
print(re.sub(r"{[^{}]+}", lambda x: x.group(0).replace(",", ";"), text))
See IDEONE demo
Output: a,b,{'c';'d';'e';'f'},g,h
By declaring lambda x we can get access to each match object, and get the whole match value using x.group(0). Then, all we need is replace a comma with a semi-colon.
This regex does not handle nested braces, because Python's built-in re module does not support recursive patterns. To use a recursive pattern, you need the PyPI regex module. Something like m = regex.sub(r"\{(?:[^{}]|(?R))*}", lambda x: x.group(0).replace(",", ";"), text) should work.
Below I have posted a solution that does not rely on a regular expression. It uses a stack (list) to determine whether a character is inside a curly bracket {. Regular expressions are more elegant; however, they can be harder to modify when requirements change. Please note that the example below also works for nested brackets.
text = "a,b,{'c','d','e','f'},g,h"
output=''
stack = []
for char in text:
    if char == '{':
        stack.append(char)
    elif char == '}':
        stack.pop()
    #Check if we are inside a curly bracket
    if len(stack)>0 and char==',':
        output += ';'
    else:
        output += char
print(output)
This gives:
a,b,{'c';'d';'e';'f'},g,h
You can also rewrite this as a map function if you use a global variable for the stack:
stack = []
def replace_comma_in_curly_brackets(char):
    if char == '{':
        stack.append(char)
    elif char == '}':
        stack.pop()
    #Check if we are inside a curly bracket
    if len(stack)>0 and char==',':
        return ';'
    return char
text = "a,b,{'c','d','e','f'},g,h"
print(''.join(map(str, map(replace_comma_in_curly_brackets,text))))
Regarding performance, when running the above two methods and the regular expression solution proposed by #stribizhev on the test string at the end of this post, I get the following timings:
Regular expression (#stribizhev): 0.38 seconds
Map function: 26.3 seconds
For loop: 251 seconds
This is the test string, which is 55,300,000 characters long:
text = "a,able,about,across,after,all,almost,{also,am,among,an,and,any,are,as,at,be,because},been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your" * 100000
If you don't have nested braces, it might be enough to look ahead at each comma and check whether a closing } follows without any opening { in between. Search for
,(?=[^{]*})
and replace with ;
, match a comma literally
(?=...) the lookahead to check
if there's ahead [^{]* any amount of characters, that are not {
followed by a closing curly brace }
See demo at regex101
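In Python, the search-and-replace described above could look like the following minimal sketch. It assumes, as the answer does, that braces are balanced and never nested.

```python
import re

text = "a,b,{'c','d','e','f'},g,h"

# A comma qualifies only if a closing } can be reached from it
# without passing an opening { first.
result = re.sub(r",(?=[^{]*})", ";", text)
print(result)  # a,b,{'c';'d';'e';'f'},g,h
```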

pythonic string syntax corrector

I wrote a script to catch and correct commands before they are read by a parser. The parser requires equal, not equal, greater, etc, entries to be separated by commas, such as:
'test(a>=b)' is wrong
'test(a,>=,b)' is correct
The script I wrote works fine, but I would love to know if there's a more efficient way to do this.
Here's my script:
# Correction routine
def corrector(exp):
    def rep(exp,a,b):
        foo = ''
        while(True):
            foo = exp.replace(a,b)
            if foo == exp:
                return exp
            exp = foo
    # Replace all instances with a unique identifier. Do it in a specific order
    # so for example we catch an instance of '>=' before we get to '='
    items = ['>=','<=','!=','==','>','<','=']
    for i in range(len(items)):
        exp = rep(exp,items[i],'###%s###'%i)
    # Re-add items with commas
    for i in range(len(items)):
        exp = exp.replace('###%s###'%i,',%s,'%items[i])
    # Remove accidental double commas we may have added
    return exp.replace(',,',',')
print(corrector('wrong_syntax(b>=c) correct_syntax(b,>=,c)'))
# RESULT: wrong_syntax(b,>=,c) correct_syntax(b,>=,c)
thanks!
As mentioned in the comments, one approach would be to use a regular expression. The following regex matches any of your operators when they are not surrounded by commas, and replaces them with the same string with the commas inserted:
import re
inputstring = 'wrong_syntax(b>=c) correct_syntax(b,>=,c)'
regex = r"([^,])(>=|<=|!=|==|>|<|=)([^,])"
replace = r"\1,\2,\3"
result = re.sub(regex, replace, inputstring)
print(result)
Simple regexes are relatively easy, but they can get complicated quickly. Check out the docs for more info:
http://docs.python.org/2/library/re.html
Here is a regex that will do what you asked:
import re
regex = re.compile(r'''
    (?<!,)          # Negative lookbehind
    (!=|[><=]=?)
    (?!,)           # Negative lookahead
''', re.VERBOSE)
print(regex.sub(r',\1,', 'wrong_expression(b>=c) or right_expression(b,>=,c)'))
outputs
wrong_expression(b,>=,c) or right_expression(b,>=,c)

Python regex: re.search() is extremely slow on large text files

My code does the following:
Take a large text file (i.e. a legal document that is 300 pages as a PDF).
Find a certain keyword (e.g. "small").
Return n words to the left and n words to the right of the keyword.
NOTE: In this context, a "word" is any string of non-space characters. "$cow123" would be a word, but "health care" would be two words.
Here is my problem:
The code takes an extremely long time to run on the 300 pages, and that time tends to increase very quickly as n increases.
Here is my code:
import re
fileHandle = open('test_pdf.txt', mode='r')
document = fileHandle.read()
def search(searchText, doc, n):
    #Searches for text, and retrieves n words either side of the text, which are returned separately
    surround = r"\s*(\S*)\s*"
    groups = re.search(r'{}{}{}'.format(surround*n, searchText, surround*n), doc).groups()
    return groups[:n],groups[n:]
Here is the nasty culprit:
print search("\$27.5 million", document, 10)
Here's how you can test this code:
Copy the function definition from the code block above and run the following:
t = "The world is a small place, we $.205% try to take care of it."
print search("\$.205", t, 3)
I suspect that I have a nasty case of catastrophic backtracking, but I'm too new to regex to put my finger on the problem.
How do I speed up my code?
How about using re.search (or even str.find if you're only searching for fixed strings) to find the string, without any surrounding capturing groups? Then use the position and length of the match (.start and .end on a re match object, or the return value of find plus the length of the search string). Get the substring before the match and apply \s*(\S*)\s*\Z etc. to it, and get the substring after the match and apply \A\s*(\S*)\s* etc. to it.
Also, for help with your backtracking: you can use a pattern like \s+\S+\s+ instead of \s*\S*\s* (two chunks of whitespace have to be separated by a non-zero amount of non-whitespace, or else they wouldn't be two chunks), and you shouldn't butt up two consecutive \s*s like you do. I think r'(\S+)'.join([r'\s+'] * (n + 1)) would give the right pattern for capturing n previous words (but my Python is rusty, so check that).
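The "find first, then look at the text on either side" idea can be sketched roughly as follows. This is only an illustration under some assumptions: the search text is a fixed string (so str.find is enough), a word is any run of non-whitespace, and the function name words_around is invented here.

```python
def words_around(search_text, doc, n):
    """Return (words_before, words_after) for the first occurrence of a
    fixed string, with at most n whitespace-separated words on each side."""
    pos = doc.find(search_text)
    if pos == -1:
        return None  # search text not present
    if n <= 0:
        return [], []
    # maxsplit=n stops after n splits instead of splitting everything.
    before = doc[:pos].rsplit(None, n)[-n:]
    after = doc[pos + len(search_text):].split(None, n)[:n]
    return before, after

t = "The world is a small place, we $.205% try to take care of it."
print(words_around("$.205", t, 3))
# (['small', 'place,', 'we'], ['%', 'try', 'to'])
```

No regex is involved at all here, so there is nothing to backtrack; the cost is one substring search plus two bounded splits.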
I see several problems here. The first, and probably worst, is that everything in your "surround" regex is not just optional but independently optional. Given this string:
"Lorem ipsum tritani impedit civibus ei pri"
...when searchText = "tritani" and n = 1, this is what it has to go through before it finds the first match:
regex:      \s*   \S*       \s*   tritani
offset 0:   ''    'Lorem'   ' '   FAIL
            ''    'Lorem'   ''    FAIL
            ''    'Lore'    ''    FAIL
            ''    'Lor'     ''    FAIL
            ''    'Lo'      ''    FAIL
            ''    'L'       ''    FAIL
            ''    ''        ''    FAIL
...then it bumps ahead one position and starts over:
offset 1:   ''    'orem'    ' '   FAIL
            ''    'orem'    ''    FAIL
            ''    'ore'     ''    FAIL
            ''    'or'      ''    FAIL
            ''    'o'       ''    FAIL
            ''    ''        ''    FAIL
... and so on. According to RegexBuddy's debugger, it takes almost 150 steps to reach the offset where it can make the first match:
position 5: ' ' 'ipsum' ' ' 'tritani'
And that's with just one word to skip over, and with n=1. If you set n=2 you end up with this:
\s*(\S*)\s*\s*(\S*)\s*tritani\s*(\S*)\s*\s*(\S*)\s*
I'm sure you can see where this is going. Note especially that when I change it to this:
(?:\s+)(\S+)(?:\s+)(\S+)(?:\s+)tritani(?:\s+)(\S+)(?:\s+)(\S+)(?:\s+)
...it finds the first match in a little over 20 steps. This is one of the most common regex anti-patterns: using * when you should be using +. In other words, if it's not optional, don't treat it as optional.
Finally, you may have noticed the \s*\s* in the auto-generated regex.
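To make the "use + instead of *" advice concrete, here is a small sketch that generates the pattern for any n. The helper name build_context_pattern is hypothetical, and unlike the question's code it passes the search text through re.escape so that it is matched literally. Since nothing is optional, it only matches where n words really exist on both sides.

```python
import re

def build_context_pattern(search_text, n):
    # Every piece is mandatory: n captured words, each separated
    # by at least one whitespace character.
    word = r"(\S+)"
    before = (word + r"\s+") * n
    after = (r"\s+" + word) * n
    return re.compile(before + re.escape(search_text) + after)

sample = "Lorem ipsum tritani impedit civibus ei pri"
m = build_context_pattern("tritani", 2).search(sample)
print(m.groups())  # ('Lorem', 'ipsum', 'impedit', 'civibus')
```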
You could try using mmap and appropriate regex flags, eg (untested):
import re
import mmap
with open('your file') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    for match in re.finditer(your_re, mf, flags=re.DOTALL):
        print match.group() # do something with your match
This'll only keep memory usage lower though...
The alternative is to have a sliding window of words (simple example of just single word before and after)...:
import re
import mmap
from itertools import islice, tee, izip_longest
with open('testingdata.txt') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    words = (m.group() for m in re.finditer('\w+', mf, flags=re.DOTALL))
    grouped = [islice(el, idx, None) for idx, el in enumerate(tee(words, 3))]
    for group in izip_longest(*grouped, fillvalue=''):
        if group[1] == 'something': # check criteria for group
            print group
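For n words on each side rather than one, the sliding-window idea might be sketched like this. It is written for Python 3 and works on an in-memory string rather than an mmap, purely to keep the example self-contained; the name keyword_contexts is invented, and it assumes the keyword is a single whitespace-delimited word.

```python
import re
from collections import deque
from itertools import chain

def keyword_contexts(text, keyword, n):
    """Yield (before, after) word lists for every word equal to keyword."""
    words = (m.group() for m in re.finditer(r"\S+", text))
    # Pad both ends so keywords near the edges still reach the window centre.
    padded = chain([None] * n, words, [None] * n)
    window = deque(maxlen=2 * n + 1)
    for word in padded:
        window.append(word)
        if len(window) == window.maxlen and window[n] == keyword:
            items = list(window)
            before = [w for w in items[:n] if w is not None]
            after = [w for w in items[n + 1:] if w is not None]
            yield before, after

sample = "The world is a small place, we try"
print(list(keyword_contexts(sample, "small", 2)))
# [(['is', 'a'], ['place,', 'we'])]
```

The deque never holds more than 2n + 1 words, so the text is scanned once regardless of n.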
I think you are going about this completely backwards (I'm a little confused as to what you are doing in the first place!)
I would recommend checking out the re_search function I developed in the textools module of my cloud toolbox.
With re_search you could solve this problem with something like:
from cloudtb import textools
data_list = textools.re_search('my match', pdf_text_str) # search for character objects
# you now have a list of strings and RegPart objects. Parse through them:
for i, regpart in enumerate(data_list):
    if isinstance(regpart, basestring):
        words = textools.re_search('\w+', regpart)
        # do stuff with words
    else:
        # I think you are ignoring these? Not totally sure
        pass
Here is a link on how to use and how it works:
http://cloudformdesign.com/?p=183
In addition to this, your regular expressions would also be printed out in more readable format.
You might also want to check out my tool Search The Sky or the similar tool Kiki to help you build and understand your regular expressions.

replacing all regex matches in single line

I have a dynamic regexp, and I don't know in advance how many groups it has.
I would like to replace all matches with xml tags.
Example:
re.sub("(this).*(string)", '<markup>\anygroup</markup>', "this is my string")
>> "<markup>this</markup> is my <markup>string</markup>"
Is that even possible in a single line?
For a constant regexp like in your example, do
re.sub("(this)(.*)(string)",
       r'<markup>\1</markup>\2<markup>\3</markup>',
       text)
Note that you need to enclose .* in parentheses as well if you don't want to lose it.
Now if you don't know what the regexp looks like, it's more difficult, but should be doable.
pattern = "(this)(.*)(string)"
re.sub(pattern,
       lambda m: ''.join('<markup>%s</markup>' % s if n % 2 == 0
                         else s for n, s in enumerate(m.groups())),
       text)
If the first thing matched by your pattern doesn't necessarily have to be marked up, use this instead, with the first group optionally matching some prefix text that should be left alone:
pattern = "()(this)(.*)(string)"
re.sub(pattern,
       lambda m: ''.join('<markup>%s</markup>' % s if n % 2 == 1
                         else s for n, s in enumerate(m.groups())),
       text)
You get the idea.
If your regexps are complicated and you're not sure you can make everything part of a group, where only every second group needs to be marked up, you might do something smarter with a more complicated function:
pattern = "(this).*(string)"
def replacement(m):
    s = m.group()
    offset = m.start()  # spans index the whole input string, s starts at the match
    n_groups = len(m.groups())
    # assume groups do not overlap and are listed left-to-right
    for i in range(n_groups, 0, -1):
        lo, hi = m.span(i)
        lo, hi = lo - offset, hi - offset
        s = s[:lo] + '<markup>' + s[lo:hi] + '</markup>' + s[hi:]
    return s
re.sub(pattern, replacement, text)
If you need to handle overlapping groups, you're on your own, but it should be doable.
re.sub() will replace everything it can. If you pass it a function for repl then you can do even more.
Yes, this can be done in a single line.
>>> re.sub(r"\b(this|string)\b", r"<markup>\1</markup>", "this is my string")
'<markup>this</markup> is my <markup>string</markup>'
\b ensures that only complete words are matched.
So if you have a list of words that you need to mark up, you could do the following:
>>> mywords = ["this", "string", "words"]
>>> myre = r"\b(" + "|".join(mywords) + r")\b"
>>> re.sub(myre, r"<markup>\1</markup>", "this is my string with many words!")
'<markup>this</markup> is my <markup>string</markup> with many <markup>words</markup>!'
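If the word list comes from user input, the words may contain characters that are special in a regex. A hedged variant of the snippet above escapes each word with re.escape before joining; the function name mark_up is made up for this sketch, and \b still assumes each word begins and ends with a word character.

```python
import re

def mark_up(words, text):
    # re.escape makes characters such as "." match literally.
    alternatives = "|".join(re.escape(w) for w in words)
    pattern = r"\b(" + alternatives + r")\b"
    return re.sub(pattern, r"<markup>\1</markup>", text)

print(mark_up(["this", "string", "words"], "this is my string with many words!"))
# <markup>this</markup> is my <markup>string</markup> with many <markup>words</markup>!
```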

Python Regular expression must strip whitespace except between quotes

I need a way to remove all whitespace from a string, except when that whitespace is between quotes.
result = re.sub('".*?"', "", content)
This will match anything between quotes, but now it needs to ignore that match and add matches for whitespace.
I don't think you're going to be able to do that with a single regex. One way to do it is to split the string on quotes, apply the whitespace-stripping regex to every other item of the resulting list, and then re-join the list.
import re
def stripwhite(text):
    lst = text.split('"')
    for i, item in enumerate(lst):
        if not i % 2:
            lst[i] = re.sub(r"\s+", "", item)
    return '"'.join(lst)
print(stripwhite('This is a string with some "text in quotes."'))
Here is a one-liner version, based on #kindall's idea - yet it does not use regex at all! First split on ", then split() every other item and re-join them, that takes care of whitespaces:
stripWS = lambda txt:'"'.join( it if i%2 else ''.join(it.split())
    for i,it in enumerate(txt.split('"')) )
Usage example:
>>> stripWS('This is a string with some "text in quotes."')
'Thisisastringwithsome"text in quotes."'
You can use shlex.split for a quotation-aware split, and join the result using " ".join. E.g.
print " ".join(shlex.split('Hello "world this is" a test'))
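Note that in its default POSIX mode shlex.split drops the quote characters, and joining with " " keeps one space between tokens. A possible variation, sketched below, uses posix=False so that each quoted chunk comes back as a single token with its quotes intact, and joins with an empty string. It assumes double quotes are balanced and set off by whitespace, and that the text contains no apostrophes (shlex treats ' as a quote character too). The function name is invented for this example.

```python
import shlex

def strip_outside_quotes(text):
    # posix=False keeps each quoted chunk as one token, quotes included.
    tokens = shlex.split(text, posix=False)
    # Joining with "" removes all whitespace that was outside the quotes.
    return "".join(tokens)

print(strip_outside_quotes('Hello "world this is" a test'))
# Hello"world this is"atest
```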
Oli, resurrecting this question because it had a simple regex solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
Here's the small regex:
"[^"]*"|(\s+)
The left side of the alternation matches complete "quoted strings". We will ignore these matches. The right side matches and captures spaces to Group 1, and we know they are the right spaces because they were not matched by the expression on the left.
Here is working code (and an online demo):
import re
subject = 'Remove Spaces Here "But Not Here" Thank You'
regex = re.compile(r'"[^"]*"|(\s+)')
def myreplacement(m):
    if m.group(1):
        return ""
    else:
        return m.group(0)
replaced = regex.sub(myreplacement, subject)
print(replaced)
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
Here is a slightly longer version that checks for a quote without a pair. It only deals with one style of start and end string (adaptable, for example, to start,end='()').
start, end = '"', '"'
for test in ('Hello "world this is" atest',
             'This is a string with some " text inside in quotes."',
             'This is without quote.',
             'This is sentence with bad "quote'):
    result = ''
    while start in test :
        clean, _, test = test.partition(start)
        clean = clean.replace(' ','') + start
        inside, tag, test = test.partition(end)
        if not tag:
            raise SyntaxError, 'Missing end quote %s' % end
        else:
            clean += inside + tag # inside not removing of white space
        result += clean
    result += test.replace(' ','')
    print result
