Replace a string with a regular expression in Python

I have been learning regular expressions for a while but still find them confusing sometimes.
I am trying to replace every
self.assertRaisesRegexp(SomeError,'somestring'):
with
self.assertRaisesRegexp(SomeError,somemethod('somestring')):
How can I do it? I am assuming the first step is to fetch 'somestring', modify it to somemethod('somestring'), and then replace the original 'somestring'.

Here is a regular expression that does it:
#f is going to be your file contents as a single string
re.sub(r'(?m)self\.assertRaisesRegexp\((.+?),((?P<quote>[\'"]).*?(?P=quote))\)',r'self.assertRaisesRegexp(\1,somemethod(\2))',f)
This grabs anything that matches and replaces it accordingly. It also makes sure the quotation marks line up correctly, by capturing the opening quote in the named group quote and backreferencing it with (?P=quote).
There is no need to iterate over the file line by line, either: re.sub already scans the whole string. (The leading (?m) only changes how ^ and $ anchor, so it is harmless here.) I have tested this expression and it works as expected!
Test:
>>> print f
this is some
multi line example that self.assertRaisesRegexp(SomeError,'somestring'):
and so on. there self.assertRaisesRegexp(SomeError,'somestring'): will be many
of these in the file and I am just ranting for example
here is the last one self.assertRaisesRegexp(SomeError,'somestring'): okay
im done now
>>> print re.sub(r'(?m)self\.assertRaisesRegexp\((.+?),((?P<quote>[\'"]).*?(?P=quote))\)',r'self.assertRaisesRegexp(\1,somemethod(\2))',f)
this is some
multi line example that self.assertRaisesRegexp(SomeError,somemethod('somestring')):
and so on. there self.assertRaisesRegexp(SomeError,somemethod('somestring')): will be many
of these in the file and I am just ranting for example
here is the last one self.assertRaisesRegexp(SomeError,somemethod('somestring')): okay
im done now

A better tool for this particular task is sed:
$ sed -i 's/\(self.assertRaisesRegexp\)(\(.*\),\(.*\))/\1(\2,somemethod(\3))/' *.py
sed will take care of the file I/O, renaming files, etc.
If you already know how to do the file manipulation and iterate over the lines in each file, then the Python re.sub line will look like:
new_line = re.sub(r"(self.assertRaisesRegexp)\((.*),(.*)\)",
                  r"\1(\2,somemethod(\3))",
                  old_line)
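In case it helps, a minimal sketch of that surrounding loop using fileinput (the file name here is made up), where inplace=True redirects print() back into the file:

import fileinput
import re

for old_line in fileinput.input('test_example.py', inplace=True):
    new_line = re.sub(r"(self.assertRaisesRegexp)\((.*),(.*)\)",
                      r"\1(\2,somemethod(\3))",
                      old_line.rstrip('\n'))
    print(new_line)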

Related

re.sub doesn't replace the string when I execute the file

I am trying to write a Python script to practice the re.sub method. But when I run the script with python3, the string in the file doesn't change.
Here is my location.txt file,
34.3416,108.9398
and this is what regex.py contains:
import re
with open ('location.txt','r+') as second:
    content = second.read()
    content = re.sub('([-+]?\d{2}\.\d{4},[-+]?\d{2}\.\d{4})','44.9740,-93.2277',content)
    print (content)
I set up a print statement to test the output, and it gives me
34.3416,108.9398
which is not what I want.
Then I changed the "r+" to "w+", and that completely wiped the contents of location.txt. Can anyone tell me the reason?
Your regexp has a problem, as pointed out by Andrej Kesely in the other answer: \d{2} should be \d{2,3}:
content = re.sub(r'([-+]?\d{2,3}\.\d{4},[-+]?\d{2,3}\.\d{4})', '44.9740,-93.2277', content)
After fixing that, you have changed the string, but you haven't written it back to the file; you're only changing the variable in memory. Add, inside the with block:
second.seek(0)         # return to the beginning of the file
second.write(content)  # write the data back to the file
second.truncate()      # remove extraneous bytes (in case the content shrank)
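Putting it together, a minimal sketch of the corrected regex.py (the file name follows the question):

import re

with open('location.txt', 'r+') as second:
    content = second.read()
    # \d{2,3} so that 108.9398 (three digits before the dot) also matches
    content = re.sub(r'[-+]?\d{2,3}\.\d{4},[-+]?\d{2,3}\.\d{4}',
                     '44.9740,-93.2277', content)
    second.seek(0)        # rewind before overwriting
    second.write(content)
    second.truncate()     # drop leftovers if the new content is shorter
    print(content)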
The second number in your location.txt is 108.9398, which has 3 digits before the dot, so it doesn't match your regexp. Change your regexp to:
([-+]?\d{2,3}\.\d{4},[-+]?\d{2,3}\.\d{4})
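A quick check of the widened quantifier against the value from the question:

import re

print(re.sub(r'[-+]?\d{2,3}\.\d{4},[-+]?\d{2,3}\.\d{4}',
             '44.9740,-93.2277', '34.3416,108.9398'))
# prints: 44.9740,-93.2277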

sed not working on large file [Looking for other options]

I have a gigantic JSON file that was accidentally output without newline characters between the JSON entries, so it is being treated as one giant single line. I tried to do a find-and-replace with sed to insert a newline:
sed 's/{"seq_id"/\n{"seq_id"/g' my_giant_json.json
It doesn't output anything.
However, I know my sed expression works, because it runs fine if I operate on just a small part of the file:
head -c 1000000 my_giant_json.json | sed 's/{"seq_id"/\n{"seq_id"/g'
I have also tried using Python with this gnarly one-liner:
'\n{"seq_id'.join(open(json_file,'r').readlines()[0].split('{"seq_id')).lstrip()
But this loads the whole file into memory because of the readlines() call, and I don't know how to iterate through a giant single line of characters in chunks and do a find-and-replace.
Any thoughts?
Perl will let you change the input separator ($/) from newline to another character. You could take advantage of this to get some convenient chunking.
perl -pe'BEGIN{$/="}"}s/^({"seq_id")/\n$1/' my_giant_json.json
That sets the input record separator to "}". Then it looks for chunks that start with {"seq_id" and prefixes them with a newline.
Note that it puts an unnecessary empty line at the beginning. You could complicate the program to eliminate that, or just delete it manually afterwards.
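If you would rather stay in Python, here is a minimal sketch of chunked reading with a small carry-over buffer (the file names and chunk size are assumptions on my part), which never loads more than one chunk into memory:

PATTERN = b'{"seq_id"'
CHUNK_SIZE = 1 << 20  # 1 MiB, an arbitrary choice

def insert_newlines(src_path, dst_path):
    tail = b''
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        while True:
            chunk = src.read(CHUNK_SIZE)
            if not chunk:
                dst.write(tail)  # too short to contain PATTERN
                break
            out = (tail + chunk).replace(PATTERN, b'\n' + PATTERN)
            # Hold back len(PATTERN)-1 bytes so an occurrence split
            # across two chunks is still seen on the next iteration.
            # (Safe here because no suffix of PATTERN is also a prefix.)
            keep = len(PATTERN) - 1
            dst.write(out[:-keep])
            tail = out[-keep:]

insert_newlines('my_giant_json.json', 'fixed.json')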

Reading regexes from file, in Python

I am trying to read a bunch of regexes from a file, using Python.
The regexes come in a file regexes.csv, one pair per line, with the pair separated by a comma, e.g.:
<\? xml([^>]*?)>,<\? XML$1>
peter,Peter
I am doing
detergent = []
infile = open('regexes.csv', 'r')
for line in infile:
    line = line.strip()
    [search_term, replace_term] = line.split(',', 1)
    detergent += [[search_term,replace_term]]
This is not producing the right result. If I print detergent I get
[['<\\? xml([^>]*?)>', '<\\? XML$1>'], ['peter', 'Peter']]
It seems that it is escaping the backslashes.
Moreover, in a file containing, say,
<? xml ........>
a re.sub(search_term,replace_term,file_content) call further down in the script replaces it with
<\? XML$1>
So the $1 is not recovering the first capture group of the first regex in the pair.
What is the proper way to read regexes from a file for later use in re.sub? When I have had the regexes inside the script I would write them as r'...' raw strings, but I am not sure what the issues are when reading from a file.
There are no issues or special requirements for reading regexes from a file. The escaping of backslashes is simply how Python represents a string containing them. For example, suppose you had defined a regex as rgx = r"\?" directly in your code. Try evaluating it, you'll see it is displayed the same way ...
>>> r"\?"
'\\?'
The reason your $1 is not being replaced is that this is not Python's syntax for group references. The correct syntax is \1.
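For completeness, a minimal sketch of the whole flow, assuming the replacement strings in regexes.csv use \1-style backreferences (the sample input string below is made up):

import re

# Read search/replace pairs, one comma-separated pair per line.
detergent = []
with open('regexes.csv') as infile:
    for line in infile:
        line = line.strip()
        if line:
            search_term, replace_term = line.split(',', 1)
            detergent.append((search_term, replace_term))

# Apply every pair in order. Backreferences in replace_term must be
# written as \1, \2, ... in the file, not $1.
def clean(text):
    for search_term, replace_term in detergent:
        text = re.sub(search_term, replace_term, text)
    return text

print(clean('peter piper'))  # -> 'Peter piper' with the example file above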

Python Script using Fileinput Module and with Regex substitution containing \n (multiple lines)

All,
I am relatively new to Python but have used other scripting languages with regex extensively. I need a script that will open a file, look for a regex pattern, replace the pattern, and close the file. I have found that the script below works great; however, I don't know whether the "for line in fileinput.input" command can accommodate a regex pattern that exceeds a single line (i.e. a regex that includes a carriage return). In my case it covers 2 lines. My test file read_it.txt looks like this:
read_it.txt (contains just 3 lines)
ABA
CDC
EFE
The script is designed to open the file, recognize the pattern ABA\nCDC that spans 2 lines, then replace it with the word TEST.
If the pattern replacement is successful, the file should read as follows and now contain only 2 lines:
TEST
EFE
Knowing the answer to this will help greatly in using Python scripts to parse text files and modify them on the fly. I believe, but am not sure, that there may be a better Python construct that still allows for regex searches. So the questions are:
1) Do I need to change something in the existing script to make the "for line" command match a multi-line regex pattern?
2) Or do I need a different Python script that is better suited to a multi-line search?
Some things that may help, but which I currently don't know how to write, are:
1) the fileinput "readline" option
2) adding (?m) to the expression for multiline matching
Please help!
Brent
SCRIPT
import sys
import fileinput
import re

for line in fileinput.input('C:\\Python34\\read_it.txt', inplace=1):
    line = re.sub(r'A(B)A$\nCDC', r'TEST', line.rstrip())
    print(line)
2) adding (?m) to the expression for multiline matching
You can do this by passing flags=re.M (or, equivalently, flags=re.MULTILINE) to re.sub. Note that the flags must be passed by keyword: the fourth positional argument of re.sub is count, not flags.
Example:
re.sub(r'A(B)A$\nCDC', r'TEST', line.rstrip(), flags=re.M)
or
re.sub(r'A(B)A$\nCDC', r'TEST', line.rstrip(), flags=re.MULTILINE)
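Note, though, that in the loop above each line is matched separately, so a pattern containing \n can never match. A minimal sketch that reads the whole file at once instead (the path is taken from the question):

import re

path = 'C:\\Python34\\read_it.txt'

# Read the whole file so the regex can see across line boundaries.
with open(path) as fh:
    text = fh.read()

text = re.sub(r'A(B)A$\nCDC', 'TEST', text, flags=re.MULTILINE)

with open(path, 'w') as fh:
    fh.write(text)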

How to search for string in Python by removing line breaks but return the exact line where the string was found?

I have a bunch of PDF files that I have to search against a set of keywords, and I have to extract the exact line where each keyword was found. I first used xpdf's pdftotext to convert the files to text. (I tried Solr but had a tough time tailoring the output/schema to suit my requirement.)
import sys
file_name = sys.argv[1]
searched_string = sys.argv[2]
result = [(line_number+1, line) for line_number, line in enumerate(open(file_name)) if searched_string.lower() in line.lower()]
#print result
for each in result:
    print each[0], each[1]
ThinkCode:~$ python find_string.py sample.txt "String Extraction"
The problem I have with this is in cases where the search string is broken across the end of a line:
If you are going to index large binary files, remember to change the
size limits. String
Extraction is a common problem
If I am searching for 'String Extraction', I will miss this keyword with the code presented above. What is the most efficient way of catching this without making 2 copies of the text file (one for searching the keyword to extract the line number, and the other with line breaks removed, to catch a keyword that spans 2 lines)?
Much appreciated guys!
Note: some considerations without any code, but I think they belong in an answer rather than a comment.
My idea would be to search only for the first keyword; if a match is found, search for the second. This allows you, when the match is found at the end of a line, to take the next line into consideration, and to do the line concatenation only if a match is found in the first place.*
Edit:
Coded a simple example and ended up using a different algorithm; the basic idea behind it is this code snippet:
def iterwords(fh):
    for number, line in enumerate(fh):
        for word in re.split(r'\s+', line.strip()):
            yield number, word
It iterates over the file handler and produces a (line_number, word) tuple for each word in the file.
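For instance, feeding iterwords a file-like object (the sample text is made up):

import io
import re

sample = io.StringIO("size limits. String\nExtraction is a common problem\n")
print(list(iterwords(sample))[:4])
# [(0, 'size'), (0, 'limits.'), (0, 'String'), (1, 'Extraction')]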
The matching afterwards becomes pretty easy; you can find my implementation as a gist on GitHub. It can be run as follows:
python search.py 'multi word search string' file.txt
There is one main concern with the linked code that I didn't work around, for both performance and complexity reasons. Can you figure it out? (Spoiler: try searching for a sentence whose first word appears two times in a row in the file.)
* I didn't perform any testing on my own, but this article and the Python wiki suggest that string concatenation is not that efficient in Python (I don't know how current that information is).
There may be a better way of doing it, but my suggestion would be to start by taking in two lines (call them line1 and line2), concatenating them into line3 or something similar, and then searching that resultant line.
Then you'd assign line2 to line1, get a new line2, and repeat the process, as in the sketch below.
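A minimal sketch of that sliding window; the helper name find_spanning and the whitespace-joining pattern are my own assumptions:

import re

def find_spanning(path, phrase):
    # Allow any whitespace (including one line break) between the words.
    pattern = re.compile(r'\s+'.join(map(re.escape, phrase.split())),
                         re.IGNORECASE)
    prev_num, prev_line = 0, ''
    with open(path) as fh:
        for num, line in enumerate(fh, 1):
            window = prev_line.rstrip('\n') + ' ' + line.rstrip('\n')
            # Caveat: a match wholly inside one line shows up in two
            # consecutive windows; deduplicate if that matters.
            if pattern.search(window):
                yield (prev_num or num), window.strip()
            prev_num, prev_line = num, line

for lineno, text in find_spanning('sample.txt', 'String Extraction'):
    print(lineno, text)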
Use the flag re.MULTILINE when compiling your expressions: http://docs.python.org/library/re.html#re.MULTILINE
Then use \s to represent all white space (including new lines).
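A quick illustration (the sample text is made up; note that \s matches a newline with or without the flag, which only changes how ^ and $ anchor):

import re

text = "size limits. String\nExtraction is a common problem"
m = re.search(r'String\s+Extraction', text, re.MULTILINE)
print(repr(m.group()))  # 'String\nExtraction'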
