How to parse HTML and then write it to a .py file - python

I am trying to parse some HTML and then have that HTML written to a .py file. Here is the code I am using:
from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        print(data)
        f = open('/Users/austinhitt/Desktop/Test.py', 'w')
        f = open('/Users/austinhitt/Desktop/Test.py', 'r')
        t = f.read()
        f = open('/Users/austinhitt/Desktop/Test.py', 'w')
        f.write(t + '\n' + data)
        f.close()
parser = MyHTMLParser()
parser.feed('<html>'
            '<body>'
            '<p>import time as t</p>'
            '<p>from os import path</p>'
            '<p>import os</p>'
            '</body>'
            '</html>')
I am not getting any error; however, only the contents of the last p tag are written to the file. I only want what is inside the p tags to be added to the file, not the p tags themselves. I need the content of every p tag added to the file, and I don't want to use BeautifulSoup or other non-built-in modules. I am using Python 3.5.1.

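It looks like you open Test.py in 'w' mode before reading it. Opening a file in 'w' mode truncates it, so the earlier contents are already gone by the time you call read(), and each call to handle_data leaves only the most recent data in the file.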
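A minimal sketch of one way around this, using only the standard library: collect the text of every p tag while parsing, then write the file once at the end. The class name ParagraphCollector and the temporary output path are assumptions made for illustration; the question writes to a Desktop path instead.

```python
import os
import tempfile
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text found inside <p> tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_p = False

    def handle_data(self, data):
        # Keep only text that appears between <p> and </p>.
        if self.in_p:
            self.lines.append(data)


parser = ParagraphCollector()
parser.feed('<html><body>'
            '<p>import time as t</p>'
            '<p>from os import path</p>'
            '<p>import os</p>'
            '</body></html>')

# Assumed output location; replace with the real target path.
out_path = os.path.join(tempfile.gettempdir(), 'Test.py')
with open(out_path, 'w') as f:
    f.write('\n'.join(parser.lines) + '\n')
```

Opening the file a single time avoids the repeated truncation. Opening it in 'a' (append) mode inside handle_data would also keep earlier lines, at the cost of reopening the file for every chunk of text.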

Related

how to replace HTML codes in HTML file using python?

I'm trying to replace all HTML codes in my HTML file in a for Loop (not sure if this is the easiest approach) without changing the formatting of the original file. When I run the code below I don't get the codes replaced. Does anyone know what could be wrong?
import re
tex=open('ALICE.per-txt.txt', 'r')
tex=tex.read()
for i in tex:
    if i =='&otilde;':
        i=='õ'
    elif i == '&ccedil;':
        i=='ç'
with open('Alice1.replaced.txt', "w") as f:
    f.write(tex)
    f.close()
You can use html.unescape.
>>> import html
>>> html.unescape('&otilde;')
'õ'
With your code:
import html
with open('ALICE.per-txt.txt', 'r') as f:
    html_text = f.read()
html_text = html.unescape(html_text)
# write to the output file named in the question so the source file is not overwritten
with open('Alice1.replaced.txt', 'w') as f:
    f.write(html_text)
Please note that I opened the files with a with statement. This takes care of closing the file after the with block - something you forgot to do when reading the file.
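As a rough illustration of why the loop in the question has no effect: iterating over a string yields single characters, so a multi-character entity never matches, and changing the loop variable only rebinds a local name (and `==` compares rather than assigns). The sample text below is made up for the demonstration.

```python
import html

# Made-up sample containing two named character references.
text = 'A&ccedil;&otilde;es'

for ch in text:
    if ch == '&':
        ch = '?'  # rebinds the local name only; `text` is untouched

unchanged = (text == 'A&ccedil;&otilde;es')

# html.unescape converts every character reference in one pass.
converted = html.unescape(text)
print(unchanged, converted)
```

Strings are immutable in Python, so any replacement has to build a new string, which is what html.unescape returns.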

Parsing an xml file and creating another from the parsed object

I am trying to parse an XML file (containing bad characters) using the lxml module with recover=True.
Below is the code snippet:
from lxml import etree
f=open('test.xml')
data=f.read()
f.close()
parser = etree.XMLParser(recover=True)
x = etree.fromstring(data, parser=parser)
Now I want to create another xml file (test1.xml) from the above object (x)
Could anyone please help in this matter.
Thanks
I think this is what you are searching for
from lxml import etree
# open the source file in binary mode so lxml handles the decoding itself
with open('test.xml','rb') as f:
    # read the whole file
    data=f.read()
parser = etree.XMLParser(recover=True)
# fromstring() parses XML from a string directly into an Element
x = etree.fromstring(data, parser=parser)
# serialize the recovered tree back to text
y = etree.tostring(x, pretty_print=True).decode("utf-8")
# write the content to the output file
with open('test1.xml','w') as f:
    f.write(y)

Setting HTML source for QtWebKit to a string value vs. file.read(), encoding issue?

I have a script that reads a bunch of JavaScript files into a variable, and then places the contents of those files into placeholders in a Python template. This results in the value of the variable src (described below) being a valid HTML document including scripts.
import re
from string import Template
# Open the source HTML file to get the paths to the JavaScript files
f = open('srcfile.html', 'rU')
src = f.read()
f.close()
js_scripts = re.findall(r'script\ssrc="(.*)"', src)
# Put all of the scripts in a variable
js = ''
for script in js_scripts:
    f = open(script, 'rU')
    js = js + f.read() + '\n'
    f.close()
# Open/read the template
template = open('template.html')
templateSrc = Template(template.read())
template.close()
# Substitute the scripts for the placeholder variable
src = str(templateSrc.safe_substitute(javascript_content=js))
# Write a Python file containing the string
with open('htmlSource.py', 'w') as f:
    f.write('#-*- coding: utf-8 -*-\n\nhtmlSrc = """' + src + '"""')
If I try to open it up via PyQt5/QtWebKit in Python...
from htmlSource import htmlSrc
webWidget.setHtml(htmlSrc)
...it doesn't load the JS files in the web widget. I just end up with a blank page.
But if I get rid of everything else, and just write to file '"""src"""', when I open the file up in Chrome, it loads everything as expected. Likewise, it'll also load correctly in the web widget if I read from the file itself:
f = open('htmlSource.py', 'r')
htmlSrc = f.read()
webWidget.setHtml(htmlSrc)
In other words, when I run this script, it produces the Python output file with the variable; then I try to import that variable and pass it to webWidget.setHtml(); but the page doesn't render. But if I use open() and read it as a file, it does.
I suspect there's an encoding issue going on here. But I've tried several variations of encode and decode without any luck. The scripts are all UTF-8.
Any suggestions? Many thanks!
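One possible explanation, offered as an assumption rather than a confirmed diagnosis: when the generated file is imported, Python interprets backslash escapes inside the triple-quoted literal, so a JavaScript sequence such as \n inside a string becomes a real newline and the script may no longer parse. Reading the file with open() leaves the backslashes alone. The sketch below uses a made-up JavaScript line to show the difference, and shows that writing repr() of the string round-trips it unchanged.

```python
import ast

# Made-up JavaScript line containing a backslash escape.
js = 'var s = "line1\\nline2";'

# Approach from the question: paste the text between triple quotes.
naive_literal = '"""' + js + '"""'
# Alternative: let Python generate a correctly escaped literal.
safe_literal = repr(js)

# literal_eval evaluates each literal the way an import would.
naive_value = ast.literal_eval(naive_literal)
safe_value = ast.literal_eval(safe_literal)

print(naive_value == js)  # the pasted text no longer matches
print(safe_value == js)   # the repr() literal round-trips
```

Under this assumption, writing 'htmlSrc = ' + repr(src) to htmlSource.py instead of wrapping src in triple quotes would keep the imported value identical to the original text.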

Saving an image from text file providing image url's in python

import urllib2
import urllib
import json
import urlparse
def main():
    f = open("C:\Users\Stern Marketing\Desktop\dumpaday.txt","r")
    if f.mode == 'r':
        item = f.read()
        for x in item:
            urlParts = urlparse.urlsplit(x)
            filename = urlParts.path.split('/')[-1]
            urllib.urlretrieve(item.strip(), filename)
if __name__ == "__main__":
    main()
Looks like script still not working properly, I'm really not sure why... :S
Getting lots of errors...
urllib.urlretrieve("x", "0001.jpg")
This will try to download from the (static) URL "x".
The URL you actually want to download from is within the variable x, so you should write your line to reference that variable:
urllib.urlretrieve(x, "0001.jpg")
Also, you probably want to change the target filename for each download, so you don’t keep on overwriting it.
Regarding your filename update:
urlparse.urlsplit is a function that takes a URL and splits it into multiple parts. Those parts are returned from the function, so you need to save them in a variable.
One part is the path, which is what contains the file name. The path itself is a string on which you can call the split method to separate it by the / character. As you are interested in only the last part—the filename—you can discard everything else:
url = 'http://www.dumpaday.com/wp-content/uploads/2013/12/funny-160.jpg'
urlParts = urlparse.urlsplit(url)
print(urlParts.path) # /wp-content/uploads/2013/12/funny-160.jpg
filename = urlParts.path.split('/')[-1]
print(filename) # funny-160.jpg
It should work like this:
import urllib2
import urllib
import json
import urlparse
def main():
    # raw string so the backslashes in the Windows path are taken literally
    with open(r"C:\Users\Stern Marketing\Desktop\dumpaday.txt","r") as f:
        # iterating over the file yields one URL per line
        for x in f:
            urlParts = urlparse.urlsplit(x.strip())
            filename = urlParts.path.split('/')[-1]
            urllib.urlretrieve(x.strip(), filename)
if __name__ == "__main__":
    main()
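The code in this thread targets Python 2. For readers on Python 3, where urlparse became urllib.parse and urlretrieve moved to urllib.request, the filename step can be sketched as follows. The helper name filename_from_url is an assumption made for illustration, and the sketch deliberately stops short of downloading anything.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring surrounding whitespace."""
    path = urlsplit(url.strip()).path
    return posixpath.basename(path)


# A line read from a text file usually still carries its trailing newline.
line = 'http://www.dumpaday.com/wp-content/uploads/2013/12/funny-160.jpg\n'
name = filename_from_url(line)
print(name)
```

The resulting name could then be passed as the second argument to urllib.request.urlretrieve, with the stripped URL as the first.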
The readlines method of file objects returns lines with a trailing newline character (\n).
Change your loop to the following:
# By the way, you don't need readlines at all. Iterating over a file yields its lines.
for x in fl:
    urllib.urlretrieve(x.strip(), "0001.jpg")
Here is a solution that loops over images indexed 160 to 169. You can adjust the range as needed. It builds each URL from the base, opens it via urllib2, and saves the response as a binary file.
import urllib2
base_url = "http://www.dumpaday.com/wp-content/uploads/2013/12/funny-{}.jpg"
# xrange(160, 170) covers 160 through 169; the end value is exclusive
for n in xrange(160, 170):
    url = base_url.format(n)
    f_save = "{}.jpg".format(n)
    req = urllib2.urlopen(url)
    with open(f_save,'wb') as FOUT:
        FOUT.write(req.read())

How to pass xml as a parameter in python script?

I have the parameters to be passed to my python code saved in an xml file.
How to pass this xml as a parameter to my python code?
Can someone please help on this?
Thanks in advance!
You can pass it as a command line parameter when executing the script. Use sys.argv, the list that stores all the arguments passed to the script, or the argparse module, which handles customisable command line parameters.
Assuming you have "file.xml" as:
<?xml version="1.0"?>
<address>
<name>John Doe</name>
<position>CEO</position>
</address>
You can either:
Pass the XML file as a command line parameter to your python script.
Usage: script.py path/to/file.xml
import sys
from xml.dom.minidom import parseString
def read_xml(xml_file):
    # read the whole file and parse it into a DOM document
    with open(xml_file, 'r') as f:
        data = f.read()
    return parseString(data)
if len(sys.argv) < 2:
    print("Error: Missing parameter.")
else:
    dom = read_xml(sys.argv[1])
    tag = dom.getElementsByTagName('name')[0].toxml()
    print(tag)
It would be better to use the argparse module instead of sys.argv.
Or just open and read the XML file and parse it.
from xml.dom.minidom import parseString
def read_xml(xml_file):
    # read the whole file and parse it into a DOM document
    with open(xml_file, 'r') as f:
        data = f.read()
    return parseString(data)
dom = read_xml("file.xml")
tag = dom.getElementsByTagName('name')[0].toxml()
print(tag)
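The argparse suggestion above could look roughly like this. It is a sketch rather than the answerer's code: the names read_name and main and the positional argument xml_file are assumptions, and the demonstration writes the sample XML to a temporary file so the block runs on its own.

```python
import argparse
import os
import tempfile
from xml.dom.minidom import parse


def read_name(xml_path):
    """Return the text of the first <name> element in the XML file."""
    dom = parse(xml_path)
    node = dom.getElementsByTagName('name')[0]
    return node.firstChild.data


def main(argv=None):
    # argv=None makes argparse fall back to sys.argv[1:].
    parser = argparse.ArgumentParser(description='Read parameters from an XML file.')
    parser.add_argument('xml_file', help='path to the XML parameter file')
    args = parser.parse_args(argv)
    return read_name(args.xml_file)


# Demonstration: write the sample XML to a temporary file and pass its
# path the same way the command line would.
sample = ('<?xml version="1.0"?>'
          '<address><name>John Doe</name><position>CEO</position></address>')
with tempfile.NamedTemporaryFile('w', suffix='.xml', delete=False) as tmp:
    tmp.write(sample)
result = main([tmp.name])
os.remove(tmp.name)
print(result)
```

In a real script, calling main() with no arguments would read the path from the command line, and argparse would print a usage message if the path were missing.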
