Python urllib HTML parsing

Question about parsing a website:
My code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import os
import urllib2
import re
# Parse Web
from lxml import html
import requests

def parse():
    try:
        output = open('proba.xml', 'w')
        page = requests.get('http://www.rts.rs/page/tv/sr/broadcast/22/RTS+1.html')
        tree = html.fromstring(page.text)
        parse = tree.xpath('//div[@class="ProgramTime"]/text()|//div[@class="ProgramName"]/text()|//a[@class="recnik"]/text()')
        for line in parse:
            clean = line.strip()
            if clean:
                print clean
    except:
        pass

parse()
My question is: how can I write this result to a file? When I try with this:
print >> output, line
I get only the first 6 lines in the file.
With this code:
output.write(line)
the same thing happens, so can you help me with this issue?
What I want is to output the parsed content.

I am having trouble replicating the problem. Here is what I did...
import sys
import os
import urllib2
import re
from lxml import html
import requests

def parse():
    output = open('proba.xml', 'w')
    page = requests.get('http://www.rts.rs/page/tv/sr/broadcast/22/RTS+1.html')
    tree = html.fromstring(page.text)
    p = tree.xpath('//div[@class="ProgramTime"]/text()|//div[@class="ProgramName"]/text()|//a[@class="recnik"]/text()')
    for line in p:
        clean = line.strip()
        if clean:
            output.write(line.encode('utf-8') + '\n')  # the \n adds a line break
    output.close()

parse()
I think you are getting a Unicode-related error when writing to the file, but because you put everything in a try block and let the error pass silently, you aren't getting any feedback!
Try typing import this in a terminal. You will get the Zen of Python. One aphorism is "Errors should never pass silently."
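If you want to keep a try block while debugging, at least report the exception instead of swallowing it. A minimal sketch of that pattern (the parsing body is elided):

import traceback

try:
    output = open('proba.xml', 'w')
    # ... the parsing and writing code from above ...
except Exception:
    traceback.print_exc()  # print the full traceback instead of hiding the error
    raise                  # re-raise so the script stops with a real error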

Try this instead:
with file('proba.xml', 'w') as f:
    f.writelines([line.strip() + '\n' for line in parse])
Put this in place of the for line in parse: loop, and remove the output = open(...) declaration above; there is no need for output.write anymore. Sorry if I am not clearer, I am typing this on a mobile phone.
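Putting the pieces together, the whole function would look roughly like this. This is a sketch, not the original poster's exact code; it uses the built-in open() instead of the Python 2-only file(), and re-adds the line breaks that strip() removes:

from lxml import html
import requests

def parse():
    page = requests.get('http://www.rts.rs/page/tv/sr/broadcast/22/RTS+1.html')
    tree = html.fromstring(page.text)
    lines = tree.xpath('//div[@class="ProgramTime"]/text()'
                       '|//div[@class="ProgramName"]/text()'
                       '|//a[@class="recnik"]/text()')
    with open('proba.xml', 'w') as f:
        # keep only non-empty lines, encode, and terminate each with \n
        f.writelines(line.strip().encode('utf-8') + '\n'
                     for line in lines if line.strip())

parse()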

Related

Reading URLs from a file in Python

I cannot read the URLs in the txt file.
I want to read and open the URL addresses in the txt file one by one, and get each page's title with a regex from the source of the URL addresses.
Error messages:
Traceback (most recent call last):
  File "Mypy.py", line 14, in <module>
    UrlsOpen = urllib2.urlopen(listSplit)
  File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 420, in open
    req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
Mypy.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import requests
import urllib2
import threading
UrlListFile = open("Url.txt","r")
UrlListRead = UrlListFile.read()
UrlListFile.close()
listSplit = UrlListRead.split('\r\n')
UrlsOpen = urllib2.urlopen(listSplit)
ReadSource = UrlsOpen.read().decode('utf-8')
regex = '<title.*?>(.+?)</title>'
comp = re.compile(regex)
links = re.findall(comp,ReadSource)
for i in links:
    SaveDataFiles = open("SaveDataMyFile.txt", "w")
    SaveDataFiles.write(i)
    SaveDataFiles.close()
When you call urllib2.urlopen(listSplit), listSplit is a list, but urlopen() needs a string or a Request object. It's a simple fix: iterate over listSplit and pass each URL to urlopen() instead of passing the entire list.
Also, re.findall() returns a list for each ReadSource searched. You can handle this a couple of ways:
I chose to handle it by just making a list of lists:
websites = [[link, link], [link], [link, link, link]]
and iterating over both levels. This lets you do something specific with each website's list of URLs (put them in different files, etc.).
You could also flatten the websites list so it just contains the links, instead of lists that then contain the links:
links = [link, link, link, link]
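If you prefer the flattened form, itertools.chain.from_iterable collapses the nested lists in one pass; a small sketch with made-up values:

from itertools import chain

websites = [['link1', 'link2'], ['link3'], ['link4', 'link5']]
links = list(chain.from_iterable(websites))
print links  # ['link1', 'link2', 'link3', 'link4', 'link5']

Here is the full script using the list-of-lists approach: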
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import urllib2
from pprint import pprint

UrlListFile = open("Url.txt", "r")
UrlListRead = UrlListFile.read()
UrlListFile.close()
listSplit = UrlListRead.splitlines()
pprint(listSplit)

regex = '<title.*?>(.+?)</title>'
comp = re.compile(regex)

websites = []
for url in listSplit:
    UrlsOpen = urllib2.urlopen(url)
    ReadSource = UrlsOpen.read().decode('utf-8')
    websites.append(re.findall(comp, ReadSource))

# the with statement closes the file for us, so no explicit close() is needed
with open("SaveDataMyFile.txt", "w") as SaveDataFiles:
    for website in websites:
        for link in website:
            pprint(link)
            SaveDataFiles.write(link.encode('utf-8'))

What is wrong with my code's urlopen function?

Below is my Python code to check for curse words in a file.
But I am unable to find out why the interpreter shows the error: module 'urllib' has no attribute 'urlopen'.
import urllib

def read_txt():
    quote = open("c:\\read.txt")  # for opening the file
    content = quote.read()  # for reading content in a file
    print(content)
    quote.close()  # for closing the file
    check_profanity(content)

def check_profanity(text):
    connection = urllib.urlopen("https://www.wdylike.appspot.com/?q=" + text)
    output = connection.read()
    print(output)
    connection.close()

read_txt()
In Python 3, urllib is now a package that collects multiple modules. urlopen() is now part of the urllib.request module:
from urllib.request import urlopen
Then using it:
connection = urlopen("https://www.wdylike.appspot.com/?q=" + text)
Well, that's because in Python 3 urllib does not have a urlopen attribute.
In Python 2 you should use urllib2, while in Python 3 you should use urllib.request.
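If the same script has to run under both versions, a common pattern is to try the Python 3 import first and fall back to Python 2; a sketch:

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen        # Python 2

connection = urlopen("https://www.wdylike.appspot.com/?q=" + "test")
output = connection.read()
print(output)
connection.close()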

Python Special Characters Encoding

I have a Python script that reads a CSV file and writes to an XML file. I have been hitting a wall trying to find out how to read special characters such as ç, á, é, í, etc. The script runs perfectly fine without special characters. Here is the script header:
# coding=utf-8
'''
#modified by: Julierme Pinheiro
'''
import os
import sys
import unittest
from unittest import skip
import csv
import uuid
import xml
import xml.dom.minidom as minidom
import owslib
from owslib.iso import *
import pyproj
from decimal import *
import logging
The way I retrieve information from the csv file is shown below:
# add the title
title = data[1]
titleElement = identificationInfo[0].getElementsByTagName('gmd:title')[0]
titleNode = record.createTextNode(title)
titleElement.childNodes[1].appendChild(titleNode)
print "Title:" + title
Note: if data[1], the second column in the csv file, contains a special character, as found in "Navegação", the script fails (it does not write anything to the xml file).
The way a new XML file is created, based on a blank template XML, is shown below:
        # write out the gemini record
        filename = '../output/%s.xml' % fileId
        with open(filename, 'w') as test_xml:
            test_xml.write(record.toprettyxml(newl="", encoding="utf-8"))
    except:
        e = sys.exc_info()[1]
        logging.debug("Import failed for entry %s" % data[0])
        logging.debug("Specific error: %s" % e)

    #skip('')
    def testOWSMetadataImport(self):
        raw_data = []
        with open('../input/metadata_cartapapel.csv') as csvfile:
            reader = csv.reader(csvfile, dialect='excel')
            for columns in reader:
                raw_data.append(columns)
        md = MD_Metadata(etree.parse('gemini-template.xml'))
        md.identification.topiccategory = ['farming', 'environment']
        print md.identification.topiccategory
        outfile = open('mdtest.xml', 'w')
        # crap, can't update the model and write back out - this is badly needed!!
        outfile.write(md.xml)

if __name__ == "__main__":
    unittest.main()
Could someone help to solve this issue, please?
Thank you in advance for your time.
That's Unicode. The csv module can't read Unicode if you are on Python 2.7. On Python 3.x you can pass the encoding='utf-8' option when opening the file.
In Python 2 you can decode data[1] from utf-8 like below:
title = data[1].decode('utf-8')
Some legacy Windows components in English might require 'cp1252' instead. If the above decoding fails, try this:
title = data[1].decode('cp1252')
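In Python 3 the decode step disappears, because you can give open() the encoding up front. A minimal sketch (the filename and column index are placeholders):

import csv

with open('metadata.csv', encoding='utf-8', newline='') as csvfile:
    reader = csv.reader(csvfile, dialect='excel')
    for columns in reader:
        title = columns[1]  # already a str; no .decode() needed
        print(title)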

Stanford POS tagger not displaying the output elements in Python (Mac)

#-*- coding:Utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf8')
import os
java_path = "/usr/libexec/java_home" # replace this
os.environ['JAVAHOME'] = java_path
from nltk.tag.stanford import POSTagger
french_postagger = POSTagger("stanford-postagger-full-2014-10-26/models/french.tagger", "stanford-postagger-full-2014-10-26/stanford-postagger.jar", encoding="utf-8")
english_postagger = POSTagger("stanford-postagger-full-2014-10-26/models/english-bidirectional-distsim.tagger", "stanford-postagger-full-2014-10-26/stanford-postagger.jar", encoding="utf-8")
print french_postagger.tag("siddhartha is a good boy".split())
the result is as follows:
[('', u'/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home')]
Instead, I need to see the words and their tags.
The problem is this part of your code:
java_path = "/usr/libexec/java_home" # replace this
os.environ['JAVAHOME'] = java_path
Where did that code come from? It looks like you should replace it. If your setup is like mine, changing that first line to java_path = "/usr/bin/java" fixes the problem. Actually, if your setup is like mine, just deleting those two lines completely fixes the problem (while including them reproduces it):
from nltk.tag.stanford import POSTagger
french_postagger = POSTagger("models/french.tagger", "stanford-postagger.jar", encoding="utf-8")
english_postagger = POSTagger("models/english-bidirectional-distsim.tagger", "stanford-postagger.jar", encoding="utf-8")
print french_postagger.tag("siddhartha is a good boy".split())
> [[(u'siddhartha', u'ADV'), (u'is', u'VPP'), (u'a', u'V'), (u'good', u'ET'), (u'boy', u'ET')]]

How to make each tweet on its own line?

I want each tweet to be on its own line.
Currently, this breaks at each response (I listed response_1... I am using these through response_10).
Any ideas?
#!/usr/bin/env python
import urllib
import json
response_1 = urllib.urlopen("http://search.twitter.com/search.json?q=microsoft&page=1")
for i in response_1:
    print (i, "\n")
You have to parse the JSON into a Python object first; only then can you iterate over it.
#!/usr/bin/env python
import urllib
import json
response_1 = json.loads(urllib.urlopen("http://search.twitter.com/search.json?q=microsoft&page=1").read())
for i in response_1['results']:
    print (i, "\n")
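If you want only the tweet text on each line rather than the whole result object, each item in results is a dict; in the old v1 search payload the text lives under the 'text' key (a sketch, assuming that payload shape):

import urllib
import json

response_1 = json.loads(urllib.urlopen("http://search.twitter.com/search.json?q=microsoft&page=1").read())
for result in response_1.get('results', []):
    print(result.get('text', ''))  # one tweet per line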
