re.sub doesn't replace the string when I execute the file - python

I am trying to write a Python script to practice the re.sub method. But when I run the script with python3, I find that the string in the file doesn't change.
Here is my location.txt file,
34.3416,108.9398
this is what regex.py contains,
import re
with open('location.txt', 'r+') as second:
    content = second.read()
    content = re.sub(r'([-+]?\d{2}\.\d{4},[-+]?\d{2}\.\d{4})', '44.9740,-93.2277', content)
    print(content)
I set up a print statement to test the output, and it gives me
34.3416,108.9398
which is not what I want.
Then, when I change "r+" to "w+", it completely removes the content of location.txt. Can anyone tell me the reason?

Your regexp has a problem, as pointed out by Andrej Kesely in the other answer: \d{2} should be \d{2,3}:
content = re.sub(r'([-+]?\d{2,3}\.\d{4},[-+]?\d{2,3}\.\d{4})', '44.9740,-93.2277', content)
After fixing that, the string changes, but you never write it back to the file; you are only changing the variable in memory. As for "w+": that mode truncates the file as soon as it is opened, so there is nothing left to read. Keep "r+" and, still inside the with block, add:
    second.seek(0) # return to beginning of file
    second.write(content) # write the data back to the file
    second.truncate() # remove extraneous bytes (in case the content shrank)
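Putting the pieces together, a minimal sketch of the whole script might look like the following. The helper name replace_coordinates and its parameters are hypothetical, introduced only for illustration; the pattern and the seek/write/truncate sequence are the ones from the answer above.

```python
import re

# Pattern from the answer: optional sign, 2 or 3 digits, a dot, 4 digits (twice).
COORD_RE = re.compile(r'[-+]?\d{2,3}\.\d{4},[-+]?\d{2,3}\.\d{4}')

def replace_coordinates(path, new_coords):
    """Rewrite the coordinate pair stored in `path` in place (hypothetical helper)."""
    with open(path, 'r+') as handle:
        content = handle.read()
        content = COORD_RE.sub(new_coords, content)
        handle.seek(0)         # go back to the start of the file
        handle.write(content)  # overwrite with the substituted text
        handle.truncate()      # drop leftover bytes if the text got shorter
    return content
```

It would be called as replace_coordinates('location.txt', '44.9740,-93.2277').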

The second number in your location.txt is 108.9398, which has three digits before the dot, so it doesn't match your regexp. Change your regexp to:
([-+]?\d{2,3}\.\d{4},[-+]?\d{2,3}\.\d{4})
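A quick way to see the difference is to try both patterns on the line from location.txt. This is only a sketch; the variable names are made up for the example.

```python
import re

line = '34.3416,108.9398'
old_pattern = r'([-+]?\d{2}\.\d{4},[-+]?\d{2}\.\d{4})'      # exactly 2 digits before each dot
new_pattern = r'([-+]?\d{2,3}\.\d{4},[-+]?\d{2,3}\.\d{4})'  # 2 or 3 digits before each dot

old_match = re.search(old_pattern, line)
new_match = re.search(new_pattern, line)

print(old_match)           # None, because "108" has three digits before the dot
print(new_match.group(1))  # 34.3416,108.9398
```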

Related

Python: how to get rid of non-ascii characters being read from a file

I am processing, with python, a long list of data that looks like this
The digraphs are probably due to encoding problems. (I am not sure whether these characters will be preserved in this site)
29/07/2016 04:00:12 0.125143
Now, when I read such file into a script using something like open and readlines, there is an error, reading
SyntaxError: EOL while scanning string literal
I know (or can look up the usage of) replace and regex functions, but I cannot use them in my script. The biggest problem is that wherever I include or read such a strange character, an error occurs, pointing at the very line where it is read. So I cannot do anything with them.
Are you reading a file? If so, try to extract values using regexps, not to remove extra characters:
re.search(r'^([\d/: ]{19})', line).group(1)
re.search(r'([\d.]{7})', line).group(1)
I found that re.findall works. (I am sorry I do not have time to test all the other methods, since the significance of this job has vanished, and I had even forgotten this question itself.)
import re

def extract_numbers(str_i):
    # raw string, so \d and \D reach the regex engine unchanged
    pat = r"(\d+)/(\d+)/(\d+)\D*(\d+):(\d+):(\d+)\D*(\d+)\.(\d+)"
    match_h = re.findall(pat, str_i)
    return match_h[0]
# ....
# `f` is the handle of the file in question
lines = f.readlines()
for l in lines:
    ls_f = extract_numbers(l)
    # process them....
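As an illustration of why this pattern tolerates the stray characters, here is a small sketch run against a made-up line. The sample string, including the two non-ASCII characters placed between the fields, is an assumption; the real file contents are not shown in the question.

```python
import re

# Same pattern as in extract_numbers above.
PAT = re.compile(r"(\d+)/(\d+)/(\d+)\D*(\d+):(\d+):(\d+)\D*(\d+)\.(\d+)")

# Hypothetical sample line with stray non-ASCII characters between the fields.
sample = "29/07/2016\u00c2\u00a0 04:00:12 \u00c2\u00a0 0.125143"

# \D* swallows any run of non-digit characters, whatever they are.
fields = PAT.findall(sample)[0]
print(fields)
```

Because \D matches any non-digit, the junk between the date, the time and the value never needs to be named or removed explicitly.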

Python script to duplicate .tex files with small changes

I have a letter in LaTeX format. I'd like to write a short script in python that takes one argument (the addressee) and creates a .tex file with the general letter format and the addressee.
from sys import argv
script, addressee = argv
file = open('newletter.tex', 'w')
file.write("\begin{Document} Dear " + addressee + ", \n Greetings, how are you? Sincerely, Me \end{Document}")
file.close()
Is there a better function to write out large blocks of text? Also, you can see that the .tex file will contain programming syntax - will python disregard this as long as it is coerced to a string? Do I need to coerce a large block to string? Thanks in advance!
If you directly enter print "\begin..." into your interpreter, you will notice the result omits the \b at the front of the string. This is because \b is an escape sequence that Python interprets inside an ordinary string literal (it stands for a backspace character), so what gets printed is a backspace followed by "egin".
To avoid this confusion, you can use a "raw string", which in Python is denoted by prepending an r:
>>> a = "\begin"
>>> b = r"\begin"
>>> print a
egin
>>> print b
\begin
>>>
Typically, when working with strings to represent file paths, or anything else which may contain a \ character, you should use a raw string.
As for inserting information into a template, I would recommend using the string's format() method rather than string concatenation. To do this, your string would look like this:
r"\begin{{Document}} Dear {} \n Greetings, how are you? Sincerely, Me \end{{Document}}".format(addressee)
The arguments of format() (in this case addressee) are inserted into the {} placeholders within the string, one argument per placeholder. For this reason, curly brackets that should be interpreted literally must be escaped by doubling them. Note that inside a raw string \n is also taken literally (a backslash followed by n); if you need an actual line break, put the newline in an ordinary, non-raw piece of the string.
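One way to combine a raw string with a real newline is adjacent string literals, which Python joins at compile time. This is only a sketch; the addressee value is a placeholder for whatever the script receives from argv.

```python
addressee = "Alice"  # hypothetical value; the original script takes it from argv

# The raw pieces keep \begin and \end intact; the ordinary "\n" piece is a real newline.
template = (
    r"\begin{{Document}} Dear {}," "\n"
    r" Greetings, how are you? Sincerely, Me \end{{Document}}"
)
letter = template.format(addressee)
print(letter)
```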
I'd take the approach of creating the tex file first as letter.tex, with the addressee set to something like QXQ_ADDRESSEE_QXQ.
Then in the Python script I'd read the entire file into memory. Text read from a file is never processed for escape sequences, so the backslashes arrive exactly as written, just as they would in a raw string.
with open('letter.tex', 'r') as f:
    raw_letter = f.read()  # read() returns one string; readlines() would return a list
Then just do a substitution and write the string to a file.
raw_letter = raw_letter.replace("QXQ_ADDRESSEE_QXQ", newname)  # replace() returns a new string
with open('newletter.tex', 'w') as f:
    f.write(raw_letter)
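The same steps can be wrapped in a small function. The name fill_template and its parameters are hypothetical and only there to make the sketch self-contained.

```python
def fill_template(template_path, output_path, newname,
                  placeholder="QXQ_ADDRESSEE_QXQ"):
    """Copy a template file, replacing the placeholder with newname (hypothetical helper)."""
    with open(template_path, 'r') as f:
        raw_letter = f.read()                     # whole file as one string
    filled = raw_letter.replace(placeholder, newname)
    with open(output_path, 'w') as f:
        f.write(filled)
    return filled
```

For example, fill_template('letter.tex', 'newletter.tex', 'Alice') would leave letter.tex untouched and write the personalised copy to newletter.tex.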

Reading regexes from file, in Python

I am trying to read a bunch of regexes from a file, using python.
The regexes come in a file regexes.csv, a pair in each line, and the pair is separated by commas. e.g.
<\? xml([^>]*?)>,<\? XML$1>
peter,Peter
I am doing
detergent = []
infile = open('regexes.csv', 'r')
for line in infile:
    line = line.strip()
    [search_term, replace_term] = line.split(',', 1)
    detergent += [[search_term, replace_term]]
This is not producing the right input. If I print the detergent I get
[['<\\?xml([^>]*?)>', '<\\?HEYXML$1>'], ['peter', 'Peter']]
It seems to be escaping the backslashes.
Moreover, in a file containing, say
<? xml ........>
a call to re.sub(search_term,replace_term,file_content) further down in the script replaces it with
<\? XML$1>
So, the $1 is not recovering the first capture group in the first regex of the pair.
What is the proper way to input regexes from a file to be later used in re.sub?
When I've had the regexes inside the script I would write them inside r'...', but I am not sure what the issues are when reading from a file.
There are no issues or special requirements for reading regexes from a file. The doubled backslashes are simply how Python displays a string that contains backslashes. For example, suppose you had defined a regex as rgx = r"\?" directly in your code. Evaluate it in the interpreter and you'll see it displayed the same way:
>>> r"\?"
'\\?'
The reason your $1 is not being replaced is that this is not Python's syntax for group references. The correct syntax is \1 (or \g<1>).
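A short sketch of the whole round trip, with the replacement rewritten to use \1. The list csv_lines stands in for the lines of regexes.csv and the sample file_content is made up for the example.

```python
import re

# Hypothetical contents of regexes.csv, using \1 instead of $1.
csv_lines = [
    r"<\?xml([^>]*?)>,<?XML\1>",
    "peter,Peter",
]
# Split each line on the first comma only, as in the question.
detergent = [line.strip().split(',', 1) for line in csv_lines]

file_content = '<?xml version="1.0"> hello peter'
for search_term, replace_term in detergent:
    file_content = re.sub(search_term, replace_term, file_content)
print(file_content)
```

The raw-string prefix on the first entry only matters because the line is written in source code here; text read from an actual file needs no such treatment.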

Python 3 HTML parser

I'm sure everyone will groan, and tell me to look at the documentation (which I have) but I just don't understand how to achieve the same as the following:
curl -s http://www.maxmind.com/app/locate_my_ip | awk '/align="center">/{getline;print}'
All I have in python3 so far is:
import urllib.request
f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
for lines in f.readlines():
    print(lines)
f.close()
Seriously, any suggestions (please don't tell me to read http://docs.python.org/release/3.0.1/library/html.parser.html as I have been learning python for 1 day, and get easily confused) a simple example would be amazing!!!
This is based off of larsmans's answer, above.
f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
for line in f:
    if b'align="center">' in line:
        print(next(f).decode().rstrip())
f.close()
Explanation:
for line in f iterates over the lines in the file-like object, f. Python lets you iterate over the lines of a file just as you would over the items in a list.
if b'align="center">' in line looks for the byte sequence 'align="center">' in the current line. The b prefix marks a bytes literal rather than a string. urllib.request.urlopen returns the response as binary data rather than Unicode strings, and an unadorned 'align="center">' would be a Unicode string. (That was the source of the TypeError above.)
next(f) takes the next line of the file, because your original awk script printed the line after 'align="center">' rather than the current line. The decode method (bytes objects have methods in Python, just as strings do) converts the binary data to a printable Unicode string. The rstrip() method strips any trailing whitespace (namely, the newline at the end of each line).
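The same loop can be tried without a network connection by substituting an in-memory bytes stream for the response. In this sketch io.BytesIO and the sample markup are stand-ins, not the real page; the default passed to next() is an addition so that a match on the very last line does not raise StopIteration.

```python
import io

# Stand-in for the HTTP response: a file-like object that yields lines of bytes.
f = io.BytesIO(
    b'<td align="center">\n'
    b'203.0.113.7\n'
    b'</td>\n'
)

found = []
for line in f:
    if b'align="center">' in line:
        # next(f, b'') advances the same iterator, like awk's getline
        found.append(next(f, b'').decode().rstrip())
print(found)
```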
# no need for .readlines here
for ln in f:
    # under Python 3 the lines are bytes, so this str test raises TypeError; use b'...' there
    if 'align="center">' in ln:
        print(ln)
But be sure to read the Python tutorial.
I would probably use regular expressions to get the ip itself:
import re
import urllib.request
f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
html_text = f.read().decode()  # read() returns bytes; decode before matching a str pattern
re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', html_text)[0]
which returns the first string of the format: 1-3 digits, period, 1-3 digits, ...
If you were looking for the whole line instead, you could simply extend the pattern passed to findall() to cover it (see the Python docs for re for more details).
By the way, the r in front of the match string makes it a raw string so you wouldn't need to escape python escape characters inside of it (but you still need to escape RE escape characters).
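To try the pattern without fetching the page, it can be run against a small sample. The html_bytes value below is invented for the example and only imitates the shape of the data that read() returns.

```python
import re

# Hypothetical page fragment; bytes, as returned by the response's read().
html_bytes = b'<td align="center">\nYour IP is 203.0.113.7\n</td>'
html_text = html_bytes.decode()

# Raw string: the backslashes go to the regex engine, where \. means a literal dot.
ip_pattern = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
matches = re.findall(ip_pattern, html_text)
print(matches[0] if matches else None)
```

Checking that matches is non-empty before indexing avoids an IndexError when the page contains no address.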
Hope that helps

Why does my regex not work on input from file.read()?

I have a section of code that I need to remove from multiple files that starts like this:
<?php
//{{56541616
and ends like this:
//}}18420732
?>
where both strings of numbers can be any sequence of letters and numbers (not the same).
I wrote a Python program that will return the entire input string except for this problem string:
def removeInsert(text):
    m = re.search(r"<\?php\n\/\/\{\{[a-zA-Z0-9]{8}.*\/\/\}\}[a-zA-Z0-9]{8}\n\?>", text, re.DOTALL)
    return text[:m.start()] + text[m.end():]
This program works great when I call it with removeInsert("""[file text]""") -- the triple quotes allow it to be read in as multiline.
I attempted to extend this to open a file and pass the string contents of the file to removeInsert() with this:
def fileRW(filename):
    input_file = open(filename, 'r')
    text = input_file.read()
    newText = removeInsert(text)
    ...
However, when I run fileRW([input-file]), I get this error:
return text[:m.start()] + text[m.end():]
AttributeError: 'NoneType' object has no attribute 'start'
I can confirm that "text" in that last code is actually a string, and does contain the problem code, but it seems that the removeInsert() code doesn't work on this string. My best guess is that it's related to the triple quoting I do when inputting the string manually into removeInsert(). Perhaps the text that fileRW() passes to removeInsert() is not triple-quoted (I've tried different ways of forcing it to have triple quotes ("\"\"\"" added), but that doesn't work). I have no idea how to fix this, though, and can't find any information about it in my google searching. Any suggestions?
Your regex only uses \n for lines. Your text editor may insert a carriage return and newline combination: \r\n. Try changing \n in your regex to (\r\n|\r|\n).
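Keep the \n in your regular expressions and open the file as:
input_file = open(filename, 'rU')
Note the extra U in the mode. This enables universal newlines, so your code works even on other operating systems or on files with “foreign” line endings. (In Python 3, text mode translates line endings by default, and the U flag was removed in Python 3.11, so a plain 'r' is enough there.)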
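The effect of the line endings can be reproduced with plain strings. In this sketch the sample text is made up, and the tolerant pattern uses the (\r\n|\r|\n) alternation from the first answer, written as a non-capturing group.

```python
import re

unix_text = "<?php\n//{{56541616\ncode\n//}}18420732\n?>"
dos_text = unix_text.replace("\n", "\r\n")  # same text with Windows line endings

NL = r"(?:\r\n|\r|\n)"  # any of the three line-ending conventions
strict = re.compile(
    r"<\?php\n//\{\{[a-zA-Z0-9]{8}.*//\}\}[a-zA-Z0-9]{8}\n\?>", re.DOTALL)
tolerant = re.compile(
    r"<\?php" + NL + r"//\{\{[a-zA-Z0-9]{8}.*//\}\}[a-zA-Z0-9]{8}" + NL + r"\?>",
    re.DOTALL)

print(strict.search(unix_text) is not None)    # True
print(strict.search(dos_text))                 # None, the \r gets in the way
print(tolerant.search(dos_text) is not None)   # True
```

Either fix works: make the pattern tolerant, or have the file opened so that every line ending reaches the regex as \n.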
