Python script to duplicate .tex files with small changes - python

I have a letter in LaTeX format. I'd like to write a short script in python that takes one argument (the addressee) and creates a .tex file with the general letter format and the addressee.
from sys import argv
script, addressee = argv
file = open('newletter.tex', 'w')
file.write("\begin{Document} Dear " + addressee + ", \n Greetings, how are you? Sincerely, Me \end{Document}")
file.close()
Is there a better function to write out large blocks of text? Also, you can see that the .tex file will contain programming syntax - will python disregard this as long as it is coerced to a string? Do I need to coerce a large block to string? Thanks in advance!

If you directly enter print "\begin..." into your interpreter, you will notice the result will omit the \b on the front of the string. This is because \b is a character that the print statement (or function if you're in 3.x) recognizes (it happens to be a backspace).
To avoid this confusion, you can use a "raw string", which in python is denoted by pre-pending an 'r':
>>> a = "\begin"
>>> b = r"\begin"
>>> print a
egin
>>> print b
\begin
>>>
Typically, when working with strings to represent file paths, or anything else which may contain a \ character, you should use a raw string.
As far as inserting information into a template, I would recommend using the format() function rather than string concatenation. To do this, your string would look like this:
r"\begin{{Document}} Dear {} \n Greetings, how are you? Sincerely, Me \end{{Document}}".format(addressee)
The argument of the function (in this case addressee) will be inserted into each {} within the string. For this reason, curly brackets which should be interpreted literally must be escaped by included them in duplicate.

I'd take the approach of creating the tex files first as letter.tex with the addressee set to something like QXQ_ADDRESSEE_QXQ.
The in the python script I'd read the entire file into memory. When you read from a file, it gets treated as a raw string with proper escaping.
with open('letter.tex', 'r') as f:
raw_letter = f.readlines()
Then just do a substitution and write the string to a file.
raw_letter.replace("QXQ_ADDRESSEE_QXQ", newname)
with open('newletter.tex', 'w') as f:
f.write(raw_letter)

Related

Reading regexes from file, in Python

I am trying to read a bunch of regexes from a file, using python.
The regexes come in a file regexes.csv, a pair in each line, and the pair is separated by commas. e.g.
<\? xml([^>]*?)>,<\? XML$1>
peter,Peter
I am doing
detergent = []
infile = open('regexes.csv', 'r')
for line in infile:
line = line.strip()
[search_term, replace_term] = line.split(',', 1)
detergent += [[search_term,replace_term]]
This is not producing the right input. If I print the detergent I get
['<\\?xml([^>]*?)>', '<\\?HEYXML$1>'],['peter','Peter']]
It seems to be that it is escaping the backslashes.
Moreover, in a file containing, say
<? xml ........>
a command re.sub(search_term,replace_term,file_content) written further below in the content is replacing it to be
<\? XML$1>
So, the $1 is not recovering the first capture group in the first regex of the pair.
What is the proper way to input regexes from a file to be later used in re.sub?
When I've had the regexes inside the script I would write them inside the r'...', but I am not sure what are the issues at hand when reading form a file.
There are no issues or special requirements for reading regex's from a file. The escaping of backslashes is simply how python represents a string containing them. For example, suppose you had defined a regex as rgx = r"\?" directly in your code. Try printing it, you'll see it is displayed the same way ...
>>> r"\?"
>>> '\\?'
The reason you $1 is not being replaced is because this is not the syntax for group references. The correct syntax is \1.

replace some special character with newline \n

I have a text file with numbers and symbols, i want to delete some character of them and to put new line.
for example the text file is like that:
00004430474314-3","100004430474314-3","1779803519-3","100003004929477-3","100006224433874-3","1512754498-3","100003323786067
i want the output to be like that:
00004430474314
100004430474314
100003004929477
1779803519
100006224433874
1512754498
100003323786067
i tred to replace -3"," with \n by this code but it does not work. any help?
import re
import collections
s = re.findall('\w+', open('text.txt').read().lower())
print(s.replace("-3","",">\n"))
The re.findall is useless here.
with open('path/to/file') as infile:
contents = infile.read()
contents = contents.replace('-3","', '\n')
print(contents)
Another problem with your code is that you seem to think that "-3","" is a string containing -3",". This is not the case. Python sees a second " and interprets that as the end of the string. You have a comma right afterward, which makes python consider the second bit as the second parameter to s.replace().
What you really want to do is to tell python that those double quotes are part of the string. You can do this by manually escaping them as follows:
some_string_with_double_quotes = "this is a \"double quote\" within a string"
You can also accomplish the same thing by defining the string with single quotes:
some_string_with_double_quotes = 'this is a "double quote" within a string'
Both types of quotes are equivalent in python and can be used to define strings. This may be weird to you if you come from a language like C++, where single quotes are used for characters, and double quotes are used for strings.
First I think that the s object is not a string but a list and if you try to make is a string (s=''.join(s) for example) you are going to end with something like this:
0000443047431431000044304743143177980351931000030049294773100006224433874315127544983100003323786067
Where replace() is useless.
I would change your code to the following (tested in python 3.2)
lines = [line.strip() for line in open('text.txt')]
line=''.join(lines)
cl=line.replace("-3\",\"","\n")
print(cl)

Python 3 HTML parser

I'm sure everyone will groan, and tell me to look at the documentation (which I have) but I just don't understand how to achieve the same as the following:
curl -s http://www.maxmind.com/app/locate_my_ip | awk '/align="center">/{getline;print}'
All I have in python3 so far is:
import urllib.request
f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
for lines in f.readlines():
print(lines)
f.close()
Seriously, any suggestions (please don't tell me to read http://docs.python.org/release/3.0.1/library/html.parser.html as I have been learning python for 1 day, and get easily confused) a simple example would be amazing!!!
This is based off of larsmans's answer, above.
f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
for line in f:
if b'align="center">' in line:
print(next(f).decode().rstrip())
f.close()
Explanation:
for line in f iterates over the lines in the file-like object, f. Python let's you iterate over lines in a file like you would items in a list.
if b'align="center">' in line looks for the string 'align="center">' in the current line. The b indicates that this is a buffer of bytes, rather than a string. It appears that urllib.reqquest.urlopen interpets the results as binary data, rather than unicode strings, and an unadorned 'align="center">' would be interpreted as a unicode string. (That was the source of the TypeError above.)
next(f) takes the next line of the file, because your original awk script printed the line after 'align="center">' rather than the current line. The decode method (strings have methods in Python) takes the binary data and converts it to a printable unicode object. The rstrip() method strips any trailing whitespace (namely, the newline at the end of each line.
# no need for .readlines here
for ln in f:
if 'align="center">' in ln:
print(ln)
But be sure to read the Python tutorial.
I would probably use regular expressions to get the ip itself:
import re
import urllib
f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
html_text=f.read()
re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}',html_text)[0]
which will print the first string of the format: 1-3digits, period, 1-3digits,...
I take it you were looking for the line, you could simply extend the string in the findall() expression to take care of that. (see the python docs for re for more details).
By the way, the r in front of the match string makes it a raw string so you wouldn't need to escape python escape characters inside of it (but you still need to escape RE escape characters).
Hope that helps

How to prevent Python from escaping special characters when reading a regex from a text file?

I am reading a text file in Python that, among other things, contains pre-written regexes that will be used for matching later on. The text file is of the following format:
...
--> Task 2
Concatenate and print the strings "Hello, " and "world!" to the screen.
--> Answer
Hello, world!
print(\"Hello,\s\"\s*+\s*\"world!\")
--> Hint 1
You can concatenate two strings with the + operator
...
User input is being accepted based on tasks and either executed in a subprocess to see a return value or matched against a regex. The issue, though, is that python's file.readline() will escape all special characters in the regex string (i.e. backslashes), giving me something that isn't useful.
I tried to read in the file as bytes and decode the lines using the 'raw_unicode_escape' argument (described as producing "a string that is suitable as raw Unicode literal in Python source code"), but no dice:
file.open(filename, 'rb')
for line in file:
line = line.decode('raw_unicode_escape')
...
Am I going about this the completely wrong way?
Thanks for any and all help.
p.s. I found this question as well: Issue while reading special characters from file. However, I still have the same trouble when I use file.open(filename, 'r', encoding='utf-8').
Python regex patterns are just plain old strings. There should be no problem with storing them in a file. Perhaps when you use file.readline() you are seeing escaped characters because you are looking at the repr of the line? That should not be an issue when you actually use the pattern as a regex however:
import re
filename='/tmp/test.txt'
with open(filename,'w') as f:
f.write(r'\"Hello,\s\"\s*\+\s*\"world!\"')
with open(filename,'r') as f:
pat = f.readline()
print(pat)
# \"Hello,\s\"\s*\+\s*\"world!\"
print(repr(pat))
# '\\"Hello,\\s\\"\\s*\\+\\s*\\"world!\\"'
assert re.search(pat,' "Hello, " + "world!"') # Shows match was found

Why does my regex not work on input from file.read()?

I have a section of code that I need to remove from multiple files that starts like this:
<?php
//{{56541616
and ends like this:
//}}18420732
?>
where both strings of numbers can be any sequence of letters and numbers (not the same).
I wrote a Python program that will return the entire input string except for this problem string:
def removeInsert(text):
m = re.search(r"<\?php\n\/\/\{\{[a-zA-Z0-9]{8}.*\/\/\}\}[a-zA-Z0-9]{8}\n\?>", text, re.DOTALL)
return text[:m.start()] + text[m.end():]
This program works great when I call it with removeInsert("""[file text]""") -- the triple quotes allow it to be read in as multiline.
I attempted to extend this to open a file and pass the string contents of the file to removeInsert() with this:
def fileRW(filename):
input_file = open(filename, 'r')
text = input_file.read()
newText = removeInsert(text)
...
However, when I run fileRW([input-file]), I get this error:
return text[:m.start()] + text[m.end():]
AttributeError: 'NoneType' object has no attribute 'start'
I can confirm that "text" in that last code is actually a string, and does contain the problem code, but it seems that the removeInsert() code doesn't work on this string. My best guess is that it's related to the triple quoting I do when inputting the string manually into removeInsert(). Perhaps the text that fileRW() passes to removeInsert() is not triple-quoted (I've tried different ways of forcing it to have triple quotes ("\"\"\"" added), but that doesn't work). I have no idea how to fix this, though, and can't find any information about it in my google searching. Any suggestions?
Your regex only uses \n for lines. Your text editor may insert a carriage return and newline combination: \r\n. Try changing \n in your regex to (\r\n|\r|\n).
Keep the \n in your regular expressions and open the file as:
input_file= open(filename, 'rU')
Note the extra U in the mode. This will allow your code to work even if used on other operating systems, or given files having “foreign” end-of-line.

Categories