Python 3 HTML parser - python

I'm sure everyone will groan, and tell me to look at the documentation (which I have) but I just don't understand how to achieve the same as the following:
curl -s http://www.maxmind.com/app/locate_my_ip | awk '/align="center">/{getline;print}'
All I have in python3 so far is:
import urllib.request
f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
for lines in f.readlines():
print(lines)
f.close()
Seriously, any suggestions (please don't tell me to read http://docs.python.org/release/3.0.1/library/html.parser.html as I have been learning python for 1 day, and get easily confused) a simple example would be amazing!!!

This is based off of larsmans's answer, above.
f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
for line in f:
if b'align="center">' in line:
print(next(f).decode().rstrip())
f.close()
Explanation:
for line in f iterates over the lines in the file-like object, f. Python let's you iterate over lines in a file like you would items in a list.
if b'align="center">' in line looks for the string 'align="center">' in the current line. The b indicates that this is a buffer of bytes, rather than a string. It appears that urllib.reqquest.urlopen interpets the results as binary data, rather than unicode strings, and an unadorned 'align="center">' would be interpreted as a unicode string. (That was the source of the TypeError above.)
next(f) takes the next line of the file, because your original awk script printed the line after 'align="center">' rather than the current line. The decode method (strings have methods in Python) takes the binary data and converts it to a printable unicode object. The rstrip() method strips any trailing whitespace (namely, the newline at the end of each line.

# no need for .readlines here
for ln in f:
if 'align="center">' in ln:
print(ln)
But be sure to read the Python tutorial.

I would probably use regular expressions to get the ip itself:
import re
import urllib
f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
html_text=f.read()
re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}',html_text)[0]
which will print the first string of the format: 1-3digits, period, 1-3digits,...
I take it you were looking for the line, you could simply extend the string in the findall() expression to take care of that. (see the python docs for re for more details).
By the way, the r in front of the match string makes it a raw string so you wouldn't need to escape python escape characters inside of it (but you still need to escape RE escape characters).
Hope that helps

Related

re.sub doesn't replace the string when I execute the file

I am trying to write a python script to practice the re.sub method. But when I use python3 to run the script, I figure out that the string in the file doesn't change.
Here is my location.txt file,
34.3416,108.9398
this is what regex.py contains,
import re
with open ('location.txt','r+') as second:
content = second.read()
content = re.sub('([-+]?\d{2}\.\d{4},[-+]?\d{2}\.\d{4})','44.9740,-93.2277',content)
print (content)
I set up a print statement to test the output, and it gives me
34.3416,108.9398
which is not what I want.
Then I change the "r+" to "w+", it completely removes the location.txt content. Can anyone tell me the reason?
Your regexp has a problem as pointed by Andrej Kesely in the other answer. \d{2} should be \d{2,3}:
content = re.sub(r'([-+]?\d{2,3}\.\d{4},[-+]?\d{2,3}\.\d{4})', ,'44.9740,-93.2277',content)
After fixing that, you changed the string, but you didn't write it back to the file, you're only changing the variable in memory.
second.seek(0) # return to beginning of file
second.write(content) # write the data back to the file
second.truncate() # remove extraneous bytes (in case the content shrinked)
The second number in your location.txt is 108.9398, which has 3 digits before dot and it doesn't match to your regexp. Change your regexp to:
([-+]?\d{2,3}\.\d{4},[-+]?\d{2,3}\.\d{4})
Online regexp here.

Python script to duplicate .tex files with small changes

I have a letter in LaTeX format. I'd like to write a short script in python that takes one argument (the addressee) and creates a .tex file with the general letter format and the addressee.
from sys import argv
script, addressee = argv
file = open('newletter.tex', 'w')
file.write("\begin{Document} Dear " + addressee + ", \n Greetings, how are you? Sincerely, Me \end{Document}")
file.close()
Is there a better function to write out large blocks of text? Also, you can see that the .tex file will contain programming syntax - will python disregard this as long as it is coerced to a string? Do I need to coerce a large block to string? Thanks in advance!
If you directly enter print "\begin..." into your interpreter, you will notice the result will omit the \b on the front of the string. This is because \b is a character that the print statement (or function if you're in 3.x) recognizes (it happens to be a backspace).
To avoid this confusion, you can use a "raw string", which in python is denoted by pre-pending an 'r':
>>> a = "\begin"
>>> b = r"\begin"
>>> print a
egin
>>> print b
\begin
>>>
Typically, when working with strings to represent file paths, or anything else which may contain a \ character, you should use a raw string.
As far as inserting information into a template, I would recommend using the format() function rather than string concatenation. To do this, your string would look like this:
r"\begin{{Document}} Dear {} \n Greetings, how are you? Sincerely, Me \end{{Document}}".format(addressee)
The argument of the function (in this case addressee) will be inserted into each {} within the string. For this reason, curly brackets which should be interpreted literally must be escaped by included them in duplicate.
I'd take the approach of creating the tex files first as letter.tex with the addressee set to something like QXQ_ADDRESSEE_QXQ.
The in the python script I'd read the entire file into memory. When you read from a file, it gets treated as a raw string with proper escaping.
with open('letter.tex', 'r') as f:
raw_letter = f.readlines()
Then just do a substitution and write the string to a file.
raw_letter.replace("QXQ_ADDRESSEE_QXQ", newname)
with open('newletter.tex', 'w') as f:
f.write(raw_letter)

Reading regexes from file, in Python

I am trying to read a bunch of regexes from a file, using python.
The regexes come in a file regexes.csv, a pair in each line, and the pair is separated by commas. e.g.
<\? xml([^>]*?)>,<\? XML$1>
peter,Peter
I am doing
detergent = []
infile = open('regexes.csv', 'r')
for line in infile:
line = line.strip()
[search_term, replace_term] = line.split(',', 1)
detergent += [[search_term,replace_term]]
This is not producing the right input. If I print the detergent I get
['<\\?xml([^>]*?)>', '<\\?HEYXML$1>'],['peter','Peter']]
It seems to be that it is escaping the backslashes.
Moreover, in a file containing, say
<? xml ........>
a command re.sub(search_term,replace_term,file_content) written further below in the content is replacing it to be
<\? XML$1>
So, the $1 is not recovering the first capture group in the first regex of the pair.
What is the proper way to input regexes from a file to be later used in re.sub?
When I've had the regexes inside the script I would write them inside the r'...', but I am not sure what are the issues at hand when reading form a file.
There are no issues or special requirements for reading regex's from a file. The escaping of backslashes is simply how python represents a string containing them. For example, suppose you had defined a regex as rgx = r"\?" directly in your code. Try printing it, you'll see it is displayed the same way ...
>>> r"\?"
>>> '\\?'
The reason you $1 is not being replaced is because this is not the syntax for group references. The correct syntax is \1.

replace some special character with newline \n

I have a text file with numbers and symbols, i want to delete some character of them and to put new line.
for example the text file is like that:
00004430474314-3","100004430474314-3","1779803519-3","100003004929477-3","100006224433874-3","1512754498-3","100003323786067
i want the output to be like that:
00004430474314
100004430474314
100003004929477
1779803519
100006224433874
1512754498
100003323786067
i tred to replace -3"," with \n by this code but it does not work. any help?
import re
import collections
s = re.findall('\w+', open('text.txt').read().lower())
print(s.replace("-3","",">\n"))
The re.findall is useless here.
with open('path/to/file') as infile:
contents = infile.read()
contents = contents.replace('-3","', '\n')
print(contents)
Another problem with your code is that you seem to think that "-3","" is a string containing -3",". This is not the case. Python sees a second " and interprets that as the end of the string. You have a comma right afterward, which makes python consider the second bit as the second parameter to s.replace().
What you really want to do is to tell python that those double quotes are part of the string. You can do this by manually escaping them as follows:
some_string_with_double_quotes = "this is a \"double quote\" within a string"
You can also accomplish the same thing by defining the string with single quotes:
some_string_with_double_quotes = 'this is a "double quote" within a string'
Both types of quotes are equivalent in python and can be used to define strings. This may be weird to you if you come from a language like C++, where single quotes are used for characters, and double quotes are used for strings.
First I think that the s object is not a string but a list and if you try to make is a string (s=''.join(s) for example) you are going to end with something like this:
0000443047431431000044304743143177980351931000030049294773100006224433874315127544983100003323786067
Where replace() is useless.
I would change your code to the following (tested in python 3.2)
lines = [line.strip() for line in open('text.txt')]
line=''.join(lines)
cl=line.replace("-3\",\"","\n")
print(cl)

How to prevent Python from escaping special characters when reading a regex from a text file?

I am reading a text file in Python that, among other things, contains pre-written regexes that will be used for matching later on. The text file is of the following format:
...
--> Task 2
Concatenate and print the strings "Hello, " and "world!" to the screen.
--> Answer
Hello, world!
print(\"Hello,\s\"\s*+\s*\"world!\")
--> Hint 1
You can concatenate two strings with the + operator
...
User input is being accepted based on tasks and either executed in a subprocess to see a return value or matched against a regex. The issue, though, is that python's file.readline() will escape all special characters in the regex string (i.e. backslashes), giving me something that isn't useful.
I tried to read in the file as bytes and decode the lines using the 'raw_unicode_escape' argument (described as producing "a string that is suitable as raw Unicode literal in Python source code"), but no dice:
file.open(filename, 'rb')
for line in file:
line = line.decode('raw_unicode_escape')
...
Am I going about this the completely wrong way?
Thanks for any and all help.
p.s. I found this question as well: Issue while reading special characters from file. However, I still have the same trouble when I use file.open(filename, 'r', encoding='utf-8').
Python regex patterns are just plain old strings. There should be no problem with storing them in a file. Perhaps when you use file.readline() you are seeing escaped characters because you are looking at the repr of the line? That should not be an issue when you actually use the pattern as a regex however:
import re
filename='/tmp/test.txt'
with open(filename,'w') as f:
f.write(r'\"Hello,\s\"\s*\+\s*\"world!\"')
with open(filename,'r') as f:
pat = f.readline()
print(pat)
# \"Hello,\s\"\s*\+\s*\"world!\"
print(repr(pat))
# '\\"Hello,\\s\\"\\s*\\+\\s*\\"world!\\"'
assert re.search(pat,' "Hello, " + "world!"') # Shows match was found

Categories