Reading regexes from file, in Python

I am trying to read a bunch of regexes from a file, using Python.
The regexes come in a file regexes.csv, one pair per line, with the two regexes of a pair separated by a comma, e.g.
<\? xml([^>]*?)>,<\? XML$1>
peter,Peter
I am doing
detergent = []
infile = open('regexes.csv', 'r')
for line in infile:
    line = line.strip()
    [search_term, replace_term] = line.split(',', 1)
    detergent += [[search_term,replace_term]]
This is not producing the right input. If I print the detergent I get
[['<\\?xml([^>]*?)>', '<\\?HEYXML$1>'],['peter','Peter']]
It seems to be escaping the backslashes.
Moreover, in a file containing, say
<? xml ........>
a command re.sub(search_term,replace_term,file_content) written further below in the content is replacing it to be
<\? XML$1>
So, the $1 is not recovering the first capture group in the first regex of the pair.
What is the proper way to input regexes from a file to be later used in re.sub?
When I've had the regexes inside the script I would write them inside r'...', but I am not sure what the issues are when reading from a file.

There are no issues or special requirements for reading regexes from a file. The escaping of backslashes is simply how Python displays (via repr) a string containing them; the string itself still holds a single backslash. For example, suppose you had defined a regex as rgx = r"\?" directly in your code. Try evaluating it in the interpreter and you'll see it is displayed the same way:
>>> r"\?"
'\\?'
The reason your $1 is not being replaced is that this is not Python's syntax for group references. The correct syntax in the replacement string is \1 (or \g<1>).
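As a minimal sketch of the whole round trip, the snippet below parses "pattern,replacement" lines and applies them with re.sub. The helper names load_pairs and apply_pairs and the sample lines are made up for illustration; the list stands in for the lines of regexes.csv, with the replacement rewritten to use \1.

```python
import re

def load_pairs(lines):
    """Parse 'pattern,replacement' lines into (pattern, replacement) tuples."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first comma only, as in the question
        search_term, replace_term = line.split(",", 1)
        pairs.append((search_term, replace_term))
    return pairs

def apply_pairs(pairs, text):
    """Apply every (pattern, replacement) pair to text, in order."""
    for search_term, replace_term in pairs:
        text = re.sub(search_term, replace_term, text)
    return text

# Stand-in for the lines of regexes.csv (hypothetical content)
sample_lines = [r"<\?xml([^>]*?)>,<?XML\1>" + "\n", "peter,Peter\n"]
pairs = load_pairs(sample_lines)
result = apply_pairs(pairs, '<?xml version="1.0"> hello peter')
print(result)  # <?XML version="1.0"> hello Peter
```

Splitting on the first comma means a search pattern that itself contains a comma would be cut in the wrong place; a tab or another separator that cannot occur in the patterns may be a safer choice.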

Related

Python script to duplicate .tex files with small changes

I have a letter in LaTeX format. I'd like to write a short script in python that takes one argument (the addressee) and creates a .tex file with the general letter format and the addressee.
from sys import argv
script, addressee = argv
file = open('newletter.tex', 'w')
file.write("\begin{Document} Dear " + addressee + ", \n Greetings, how are you? Sincerely, Me \end{Document}")
file.close()
Is there a better function to write out large blocks of text? Also, you can see that the .tex file will contain programming syntax - will python disregard this as long as it is coerced to a string? Do I need to coerce a large block to string? Thanks in advance!
If you directly enter print "\begin..." into your interpreter, you will notice the result omits the \b at the front of the string. This is because, in an ordinary string literal, \b is an escape sequence that Python converts to a single backspace character before print ever sees it.
To avoid this confusion, you can use a "raw string", which in Python is denoted by prefixing the literal with r:
>>> a = "\begin"
>>> b = r"\begin"
>>> print a
egin
>>> print b
\begin
>>>
Typically, when working with strings to represent file paths, or anything else which may contain a \ character, you should use a raw string.
As far as inserting information into a template, I would recommend using the format() method rather than string concatenation. To do this, your string would look like this:
(r"\begin{{Document}} Dear {}," "\n" r" Greetings, how are you? Sincerely, Me \end{{Document}}").format(addressee)
The argument of format() (in this case addressee) is inserted at the {} placeholder in the string. For this reason, curly brackets that should be interpreted literally must be escaped by doubling them. The newline is written as a separate, non-raw "\n" because inside a raw string \n would remain the two characters backslash and n.
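For a larger block of text, a triple-quoted raw string may be easier to read, since real line breaks replace \n and the backslashes stay literal. The following is only a sketch: the names LETTER_TEMPLATE and render_letter are invented here, and the letter text mirrors the question's example.

```python
# Hypothetical template; doubled braces survive format() as single braces.
LETTER_TEMPLATE = r"""\begin{{Document}}
Dear {addressee},
Greetings, how are you?
Sincerely, Me
\end{{Document}}
"""

def render_letter(addressee):
    """Fill the addressee into the raw-string template."""
    return LETTER_TEMPLATE.format(addressee=addressee)

letter = render_letter("Alice")
print(letter)
```

The named placeholder {addressee} can appear several times in the template and is filled from the single keyword argument.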
I'd take the approach of creating the tex file first as letter.tex, with the addressee set to a placeholder such as QXQ_ADDRESSEE_QXQ.
Then in the Python script I'd read the entire file into memory. Text read from a file is taken literally: backslashes are ordinary characters, because escape sequences are only processed in string literals in source code.
with open('letter.tex', 'r') as f:
    raw_letter = f.read()
Then just do a substitution and write the resulting string to a file.
new_letter = raw_letter.replace("QXQ_ADDRESSEE_QXQ", newname)
with open('newletter.tex', 'w') as f:
    f.write(new_letter)

Retain only specified content in a string

I have data in the following form in a file:
<string1> abc:string2 <http://yago-knowledge.org/resource/wikicategory_Sports_clubs_established</text\u003e\n______<sha1\u003eqwjfowt5my8t6yuszdb88k2ehskjuh0</sha1\u003e\n____</revision\u003e\n__</page\u003e\n__<page\u003e\n____<title\u003ePortal:Tropical_cyclones/Anniversaries/August_22</title\u003e\n____<ns\u003e100</ns\u003e\n____<id\u003e7957689</id\u003e\n____<revision\u003e\n______<id\u003e446349886</id\u003e\n______<timestamp\u003e2011-08-23T17:38:19Z</timestamp\u003e\n______<contributor\u003e\n________<username\u003eLightbot</username\u003e\n________<id\u003e7178666</id\u003e\n______</contributor\u003e\n______<comment\u003eDelink_non-obscure_units._Conversions._Report_bugs_to_[[User_talk:Lightmouse>.
The delimiter in the above file is a tab (\t), i.e. string1 is separated from abc:string2 by \t. Similarly for the rest of the strings.
Now I want to retain just letters, digits, /, :, . and _ within the strings that are enclosed in <>. Every other character inside those strings should go.
Is there some way by which I may achieve this using Linux commands or Python? I want to replace all the unwanted characters by an underscore, so that the line above becomes:
<string1> abc:string2 <http://yago-knowledge.org/resource/wikicategory_Sports_clubs_established_text_u003e_n_______sha1_u003eqwjfowt5my8t6yuszdb88k2ehskjuh0_sha1_u003e_n_____revision_u003e_n___/page_u003e_n___page_u003e_n_____title_u003ePortal:Tropical_cyclones/Anniversaries/August_22_/title_u003e_n_____ns_u003e100_/ns_u003e_n_____id_u003e7957689_/id_u003e_n_____revision_u003e_n_______id_u003e446349886_/id_u003e_n_______timestamp_u003e2011-08-23T17:38:19Z_/timestamp_u003e_n_______contributor_u003e_n_________username_u003eLightbot_/username_u003e_n_________id_u003e7178666_/id_u003e_n_______/contributor_u003e_n_______comment_u003eDelink_non-obscure_units._Conversions._Report_bugs_to___User_talk:Lightmouse>.
Is there some way by which I may achieve this?
You can probably achieve this just with UNIX tools and some crazy regular expression, but I would write a small Python script for this:
Open two files (input and output) with open()
Iterate over the input file line by line: for line in input_file:
Split the line at tab: for part in line.split('\t'):
Check if a part is enclosed in <>: if part.startswith('<') and part.endswith('>'):
Replace the unwanted characters between the brackets with a regular expression: filtered_part = '<' + re.sub(r'[^a-zA-Z0-9/:._]', '_', part[1:-1]) + '>'
Join the filtered parts back together: filtered_line = '\t'.join(filtered_parts)
Write the filtered line to the output file: output_file.write(filtered_line + '\n')
Following this outline, it should be easy for you to write a working script.
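The outline above could be sketched roughly as follows. The helper names clean_part and clean_line and the sample line are invented for illustration. One assumption differs from the outline: since the sample data ends in ">.", the sketch treats everything between the first '<' and the last '>' of a field as the bracketed text and leaves any trailing characters alone.

```python
import re

# Anything that is not a letter, digit, '/', ':', '.' or '_'
UNWANTED = re.compile(r"[^a-zA-Z0-9/:._]")

def clean_part(part):
    """Replace unwanted characters between the first '<' and the last '>'."""
    end = part.rfind(">")
    if not part.startswith("<") or end == -1:
        return part  # not a bracketed field: leave untouched
    inner = UNWANTED.sub("_", part[1:end])
    return "<" + inner + part[end:]

def clean_line(line):
    """Clean every tab-separated field of one line (without its newline)."""
    return "\t".join(clean_part(p) for p in line.split("\t"))

# Hypothetical, shortened sample in the shape of the question's data
sample = "<string1>\tabc:string2\t<http://example.org/a</text\\u003e\\n__<b>."
print(clean_line(sample))
```

To process a file, each line would be passed through clean_line after stripping its trailing newline, e.g. output_file.write(clean_line(line.rstrip('\n')) + '\n').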

replace a string with regular expression in python

I have been learning regular expressions for a while but still find them confusing sometimes.
I am trying to replace every occurrence of
self.assertRaisesRegexp(SomeError,'somestring'):
with
self.assertRaisesRegexp(SomeError,somemethod('somestring')):
How can I do it? I assume the first step is to fetch 'somestring', modify it to somemethod('somestring'), and then replace the original 'somestring'.
Here is your regular expression:
# f is going to be your file in string form
re.sub(r'(?m)self\.assertRaisesRegexp\((.+?),((?P<quote>[\'"]).*?(?P=quote))\)',r'self.assertRaisesRegexp(\1,somemethod(\2))',f)
This will grab everything that matches and replace it accordingly. It also makes sure that the opening and closing quotation marks agree, by way of the named group quote.
There is no need to iterate over the file line by line either: re.sub replaces every non-overlapping match in the whole string. (The leading "(?m)" enables multiline mode, which only changes how ^ and $ behave, so it is not strictly needed for this pattern.) I have tested this expression and it works as expected!
Test:
>>> print f
this is some
multi line example that self.assertRaisesRegexp(SomeError,'somestring'):
and so on. there self.assertRaisesRegexp(SomeError,'somestring'): will be many
of these in the file and I am just ranting for example
here is the last one self.assertRaisesRegexp(SomeError,'somestring'): okay
im done now
>>> print re.sub(r'(?m)self\.assertRaisesRegexp\((.+?),((?P<quote>[\'"]).*?(?P=quote))\)',r'self.assertRaisesRegexp(\1,somemethod(\2))',f)
this is some
multi line example that self.assertRaisesRegexp(SomeError,somemethod('somestring')):
and so on. there self.assertRaisesRegexp(SomeError,somemethod('somestring')): will be many
of these in the file and I am just ranting for example
here is the last one self.assertRaisesRegexp(SomeError,somemethod('somestring')): okay
im done now
A better tool for this particular task is sed:
$ sed -i 's/\(self.assertRaisesRegexp\)(\(.*\),\(.*\))/\1(\2,somemethod(\3))/' *.py
sed will take care of the file I/O, renaming files, etc.
If you already know how to do the file manipulation, and iterating over lines in each file, then the Python re.sub line will look like:
new_line = re.sub(r"(self\.assertRaisesRegexp)\((.*),(.*)\)",
                  r"\1(\2,somemethod(\3))",
                  old_line)
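For reference, here is a small self-contained sketch of the substitution using Python 3 print syntax. It reuses the quote-matching pattern from the first answer; the function name wrap_message, its wrapper parameter, and the sample source text are made up for illustration.

```python
import re

# Group 1: exception class, group 2: quoted string, group "quote": the quote mark
PATTERN = re.compile(
    r'self\.assertRaisesRegexp\((.+?),((?P<quote>[\'"]).*?(?P=quote))\)'
)

def wrap_message(source, wrapper="somemethod"):
    """Wrap the quoted second argument of each call in wrapper(...)."""
    replacement = r"self.assertRaisesRegexp(\1," + wrapper + r"(\2))"
    return PATTERN.sub(replacement, source)

source = (
    "with self.assertRaisesRegexp(SomeError,'somestring'):\n"
    '    with self.assertRaisesRegexp(OtherError,"other text"):\n'
)
converted = wrap_message(source)
print(converted)
```

Because the pattern requires a quoted second argument, calls that have already been converted are not matched again, so running the substitution twice leaves the text unchanged.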

How to prevent Python from escaping special characters when reading a regex from a text file?

I am reading a text file in Python that, among other things, contains pre-written regexes that will be used for matching later on. The text file is of the following format:
...
--> Task 2
Concatenate and print the strings "Hello, " and "world!" to the screen.
--> Answer
Hello, world!
print(\"Hello,\s\"\s*+\s*\"world!\")
--> Hint 1
You can concatenate two strings with the + operator
...
User input is being accepted based on tasks and either executed in a subprocess to see a return value or matched against a regex. The issue, though, is that python's file.readline() will escape all special characters in the regex string (i.e. backslashes), giving me something that isn't useful.
I tried to read in the file as bytes and decode the lines using the 'raw_unicode_escape' argument (described as producing "a string that is suitable as raw Unicode literal in Python source code"), but no dice:
file.open(filename, 'rb')
for line in file:
    line = line.decode('raw_unicode_escape')
...
Am I going about this the completely wrong way?
Thanks for any and all help.
p.s. I found this question as well: Issue while reading special characters from file. However, I still have the same trouble when I use file.open(filename, 'r', encoding='utf-8').
Python regex patterns are just plain old strings. There should be no problem with storing them in a file. Perhaps when you use file.readline() you are seeing escaped characters because you are looking at the repr of the line? That should not be an issue when you actually use the pattern as a regex however:
import re
filename = '/tmp/test.txt'
with open(filename, 'w') as f:
    f.write(r'\"Hello,\s\"\s*\+\s*\"world!\"')
with open(filename, 'r') as f:
    pat = f.readline()
print(pat)
# \"Hello,\s\"\s*\+\s*\"world!\"
print(repr(pat))
# '\\"Hello,\\s\\"\\s*\\+\\s*\\"world!\\"'
assert re.search(pat, ' "Hello, " + "world!"')  # Shows match was found
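One practical detail when the patterns sit on their own lines: readline() and iteration over a file keep the trailing newline, which would then become part of the regex. The sketch below strips it and compiles each pattern up front so that a malformed one is reported with its line number. The function name load_patterns and the sample line are assumptions for illustration, not part of the question's file format.

```python
import re

def load_patterns(lines):
    """Compile one regex per non-empty line, dropping the line ending."""
    compiled = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if not text:
            continue  # skip blank lines
        try:
            compiled.append(re.compile(text))
        except re.error as exc:
            raise ValueError(
                "line {}: bad regex {!r}: {}".format(number, text, exc)
            )
    return compiled

# Stand-in for a line read from the task file (hypothetical content)
lines = [r'print\(\"Hello,\s\"\s*\+\s*\"world!\"\)' + "\n"]
patterns = load_patterns(lines)
matched = patterns[0].fullmatch('print("Hello, " + "world!")') is not None
print(matched)  # True
```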

Why does my regex not work on input from file.read()?

I have a section of code that I need to remove from multiple files that starts like this:
<?php
//{{56541616
and ends like this:
//}}18420732
?>
where both strings of numbers can be any sequence of letters and numbers (not the same).
I wrote a Python program that will return the entire input string except for this problem string:
def removeInsert(text):
    m = re.search(r"<\?php\n\/\/\{\{[a-zA-Z0-9]{8}.*\/\/\}\}[a-zA-Z0-9]{8}\n\?>", text, re.DOTALL)
    return text[:m.start()] + text[m.end():]
This program works great when I call it with removeInsert("""[file text]""") -- the triple quotes allow it to be read in as multiline.
I attempted to extend this to open a file and pass the string contents of the file to removeInsert() with this:
def fileRW(filename):
    input_file = open(filename, 'r')
    text = input_file.read()
    newText = removeInsert(text)
    ...
However, when I run fileRW([input-file]), I get this error:
return text[:m.start()] + text[m.end():]
AttributeError: 'NoneType' object has no attribute 'start'
I can confirm that "text" in that last code is actually a string, and does contain the problem code, but it seems that the removeInsert() code doesn't work on this string. My best guess is that it's related to the triple quoting I do when inputting the string manually into removeInsert(). Perhaps the text that fileRW() passes to removeInsert() is not triple-quoted (I've tried different ways of forcing it to have triple quotes ("\"\"\"" added), but that doesn't work). I have no idea how to fix this, though, and can't find any information about it in my google searching. Any suggestions?
Your regex only uses \n for line breaks. Your text editor may insert a carriage return and newline combination: \r\n. Try changing \n in your regex to (\r\n|\r|\n).
Alternatively, keep the \n in your regular expressions and open the file as:
input_file = open(filename, 'rU')
Note the extra U in the mode. This will allow your code to work even if used on other operating systems, or given files having "foreign" end-of-line. This applies to Python 2; in Python 3, text mode already translates line endings by default, and the 'U' flag was removed in Python 3.11.
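Both suggestions can be combined in a version of the function that tolerates either line ending and does not crash when nothing matches. This is a sketch rather than the asker's code: the names INSERT_RE and remove_insert are invented, \r?\n is used as a shorter alternative to (\r\n|\r|\n), and the match in the middle is made non-greedy (.*?) so that it stops at the first closing marker.

```python
import re

# \r?\n accepts both Unix (\n) and Windows (\r\n) line endings
INSERT_RE = re.compile(
    r"<\?php\r?\n//\{\{[a-zA-Z0-9]{8}.*?//\}\}[a-zA-Z0-9]{8}\r?\n\?>",
    re.DOTALL,
)

def remove_insert(text):
    """Return text without the injected block; unchanged if none is found."""
    m = INSERT_RE.search(text)
    if m is None:
        return text  # avoids the AttributeError on a failed search
    return text[:m.start()] + text[m.end():]

# Hypothetical file content with Windows line endings
sample = "before\r\n<?php\r\n//{{56541616\r\nevil();\r\n//}}18420732\r\n?>\r\nafter"
cleaned = remove_insert(sample)
print(repr(cleaned))  # 'before\r\n\r\nafter'
```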
