I want to get the value of the 'latest' version tag from here: https://papermc.io/repo/repository/maven-public/com/destroystokyo/paper/paper-api/maven-metadata.xml
I tried using this python:
import urllib.request
from xml.etree import ElementTree
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
data = opener.open('https://papermc.io/repo/repository/maven-public/com/destroystokyo/paper/paper-api/maven-metadata.xml').
root = ElementTree.fromstring(data)
versioning = root.find("versioning")
latest = versioning.find("latest")
snip.rv = latest.text
The problem is, using this inside of vim (I'm trying to make UltiSnips snippets with it) makes the whole of vim extremely slow after the code has finished running.
What's causing my program to slow down just when I add that ^^ code?
I don't know if this will solve the performance issue in vim, but the code was not running for me because of errors in it.
opener.open returns a file-like object, so you should read it with ElementTree.parse rather than ElementTree.fromstring (although there is a trailing dot after opener.open(...), so perhaps you dropped a read() call there; in that case the return value would indeed be a string and fromstring would work).
Apart from that, you could try closing the response to see if that frees up some resources (or use a with statement, as below).
Here is an example of the improved code:
import urllib.request
from xml.etree import ElementTree
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
with opener.open('https://papermc.io/repo/repository/maven-public/com/destroystokyo/paper/paper-api/maven-metadata.xml') as data:
    root = ElementTree.parse(data)
    latest = root.find("./versioning/latest")
    snip.rv = latest.text
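For completeness, the fromstring route mentioned above also works if you call read() on the response first; a sketch kept close to the original code:
import urllib.request
from xml.etree import ElementTree

opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
with opener.open('https://papermc.io/repo/repository/maven-public/com/destroystokyo/paper/paper-api/maven-metadata.xml') as response:
    data = response.read()              # bytes, which fromstring accepts
root = ElementTree.fromstring(data)     # returns the root Element directly
snip.rv = root.find("./versioning/latest").text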
Related
I have written a script that scrapes a URL. It works fine on Linux, but I am getting an HTTP 503 error when running it on Windows 7, as if there were some issue with the URL. I am using Python 2.7.11. Please help.
Below is the script:
import sys # Used to add the BeautifulSoup folder to the import path
import urllib2 # Used to read the html document
if __name__ == "__main__":
    ### Import Beautiful Soup
    ### Here, I have the BeautifulSoup folder in the level of this Python script
    ### So I need to tell Python where to look.
    sys.path.append("./BeautifulSoup")
    from bs4 import BeautifulSoup

    ### Create opener with Google-friendly user agent
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]

    ### Open page & generate soup
    ### the "start" variable will be used to iterate through 10 pages.
    for start in range(0, 1000):
        url = "http://www.google.com/search?q=site:theknot.com/us/&start=" + str(start*10)
        page = opener.open(url)
        soup = BeautifulSoup(page)

        ### Parse and find
        ### Looks like google contains URLs in <cite> tags.
        ### So for each cite tag on each page (10), print its contents (url)
        file = open("parseddata.txt", "wb")
        for cite in soup.findAll('cite'):
            print cite.text
            file.write(cite.text+"\n")
        # file.flush()
        # file.close()
When I run it on Windows 7, cmd throws an HTTP 503 error stating the issue is with the URL. The URL works fine on Linux. If the URL is actually wrong, please suggest alternatives.
Apparently with Python 2.7.2 on Windows, any time you send a custom User-agent header, urllib2 doesn't send that header. (source: https://stackoverflow.com/a/8994498/6479294).
So you might want to consider using requests instead of urllib2 in Windows:
import requests
# ...
page = requests.get(url)
soup = BeautifulSoup(page.text)
# etc...
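For reference, a rough sketch of how the loop from the question might look with requests (Python 2, like the rest of the question; the User-agent header and the search URL are carried over from the question, and Google may still rate-limit it):
import requests
from bs4 import BeautifulSoup

headers = {'User-agent': 'Mozilla/5.0'}  # same header the question tried to send

with open("parseddata.txt", "w") as out:
    for start in range(0, 1000):
        url = "http://www.google.com/search?q=site:theknot.com/us/&start=" + str(start * 10)
        page = requests.get(url, headers=headers)
        soup = BeautifulSoup(page.text)
        # Google puts result URLs inside <cite> tags
        for cite in soup.findAll('cite'):
            out.write(cite.text.encode('utf-8') + "\n")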
EDIT: Also a very good point to be made is that Google may be blocking your IP - they don't really like bots making 100 odd requests sequentially.
According to the answer by @Jens Timmerman on this post: Extract the first paragraph from a Wikipedia article (Python)
I did this:
import urllib2

def getPage(url):
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')] # wikipedia needs this
    resource = opener.open("http://en.wikipedia.org/wiki/" + url)
    data = resource.read()
    resource.close()
    return data

print getPage('Steve_Jobs')
Technically it should run properly and give me the source of the page, but instead I get a screen full of strange-looking characters.
Any help would be appreciated.
After checking with wget and curl, I saw that it wasn't a problem specific to Python - they too got "strange" characters; a quick check with file told me that the response is simply gzip-compressed, so it seems that Wikipedia just sends gzipped data by default without checking whether the client actually claims to support it in the request.
Fortunately, Python is capable of decompressing gzipped data: integrating your code with this answer you get:
import urllib2
from StringIO import StringIO
import gzip

def getPage(url):
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'MyTestScript/1.0 (contact at myscript#mysite.com)'), ('Accept-encoding', 'gzip')]
    resource = opener.open("http://en.wikipedia.org/wiki/" + url)
    if resource.info().get('Content-Encoding') == 'gzip':
        buf = StringIO(resource.read())
        f = gzip.GzipFile(fileobj=buf)
        return f.read()
    else:
        return resource.read()

print getPage('Steve_Jobs')
which works just fine on my machine.
Still, as already pointed out in the comments, you should probably avoid "brutal crawling"; if you want to access Wikipedia content programmatically, use their APIs.
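For instance, here is a minimal sketch of that API route using the MediaWiki action API; the extracts/exintro/explaintext parameters belong to the TextExtracts extension, which Wikipedia runs, and the User-agent string is just the one from the snippet above:
import urllib
import urllib2
import json

def getExtract(title):
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'MyTestScript/1.0 (contact at myscript#mysite.com)')]
    params = urllib.urlencode({
        'action': 'query',
        'prop': 'extracts',   # TextExtracts extension
        'exintro': '',        # only the lead section
        'explaintext': '',    # plain text instead of HTML
        'titles': title,
        'format': 'json',
    })
    resource = opener.open('https://en.wikipedia.org/w/api.php?' + params)
    data = json.load(resource)
    resource.close()
    # the result is keyed by page id, so just take the first (and only) page
    page = data['query']['pages'].values()[0]
    return page.get('extract', '')

print getExtract('Steve_Jobs')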
I am new to Python. I am trying to convert Python 2 code to Python 3. In my old code I have the following lines:
# Create a cookiejar to store cookie
cj = cookielib.CookieJar()
# Create opener
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
I have converted these lines to:
# Create a cookiejar to store cookie
cj = cookielib.CookieJar()
# Create opener
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
The issue I have is that I keep getting the following error:
NameError: global name 'cookielib' is not defined
I am not sure what I am doing wrong and how to fix this. Can someone please help me? Thank you very much.
Did you use the 2to3 tool? Also, from the Python docs:
Note: The cookielib module has been renamed to http.cookiejar in Python 3. The 2to3 tool will automatically adapt imports when converting your sources to Python 3.
After seeing your comment, the problem is that it's http.cookiejar and not http.cookieJar.
Notice the uncapitalised J.
I think I have the solution. The following seems to work:
cj = http.cookiejar.CookieJar()
The cookielib module has been renamed to http.cookiejar in Python 3; see https://docs.python.org/2/library/cookielib.html.
You may use the 2to3 tool to convert your source code to Python 3; see https://docs.python.org/2/library/2to3.html
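Putting it together, a minimal Python 3 version of the original snippet would look like this (only the two module names change):
import http.cookiejar
import urllib.request

# Create a cookie jar to store cookies
cj = http.cookiejar.CookieJar()
# Create an opener that sends the stored cookies with each request
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]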
I'm working on a screen scraper for what.cd, using BeautifulSoup in Python. I came across this script while working and decided to look at it, since it seems to be similar to what I'm working on. However, every time I run the script I get a message that my credentials are wrong, even though they are not.
As far as I can tell, I'm getting this message because when the script tries to log into what.cd, what.cd is supposed to return a cookie containing the information that lets me request pages later in the script. So where the script is failing is:
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username' : username,
                               'password' : password})
check = opener.open('http://what.cd/login.php', login_data)
soup = BeautifulSoup(check.read())
warning = soup.find('span', 'warning')
if warning:
    exit(str(warning)+'\n\nprobably means username or pw is wrong')
I've tried multiple methods of authenticating with the site including using CookieFileJar, the script located here, and the Requests module. I've gotten the same HTML message with each one. It says, in short, that "Javascript is disabled", and "Cookies are disabled", and also provides a login box in HTML.
I don't really want to mess around with Mechanize, but I don't see any other way to do it at the moment. If anyone can provide any help, it would be greatly appreciated.
After a few more hours of searching, I found the solution to my problem. I'm still not sure why this code works as opposed to the version above, but it does. Here is the code I'm using now:
import urllib
import urllib2
import cookielib

# Use a cookie jar so cookies set by the site are kept between requests
cj = cookielib.LWPCookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

# Hit the index page first so the server can set its session cookie in the jar
request = urllib2.Request("http://what.cd/index.php", None)
f = urllib2.urlopen(request)
f.close()

# Then post the login form; the cookies collected above are sent automatically
data = urllib.urlencode({"username": "your-login", "password" : "your-password"})
request = urllib2.Request("http://what.cd/login.php", data)
f = urllib2.urlopen(request)
html = f.read()
f.close()
Credit goes to carl.waldbieser from linuxquestions.org. Thanks for everyone who gave input.
I'm using mechanize/cookiejar/lxml to read a page, and it works for some pages but not others. The error I'm getting on those is the one in the title. I can't post the pages here because they aren't SFW, but is there a way to fix it? Basically, this is what I do:
import mechanize, cookielib
from lxml import etree
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(False)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 maverick Firefox/3.6.13')]
response = br.open('...')
tree = etree.parse(response) #error
After that I get the root and search the document for the values I want. Apparently iterparse doesn't crash, but at the moment I'm assuming that's only because I haven't actually processed anything with it; plus, I haven't figured out yet how to search for things with it.
I've tried disabling gzip and enabling sending the referer as well but neither solves the problem. I also tried saving the sourcecode to the disk and creating the tree from there just for the sake of it and I get the same error.
edit
The response I get seems to be fine; using print repr(response) as suggested, I get a <response_seek_wrapper at 0xa4a160c whose wrapped object = <stupid_gzip_wrapper at 0xa49acec whose fp = <socket._fileobject object at 0xa49c32c>>>. I can also save the response using the read() method and check that the saved .xml works in the browser and everything.
Also, in one of the pages there is an &rsquo; that gives me the following error: "lxml.etree.XMLSyntaxError: Entity 'rsquo' not defined, line 17, column 7054". So far I've replaced it with a regex, but is there a parser that can handle this? I've gotten this error even with the lxml.html.parse suggested below.
Regarding the file being highlighted, I meant that when I open it with gEdit it looks like this: http://img34.imageshack.us/img34/9574/gedit.jpg
Use lxml.html.parse for HTML; it can handle even very broken HTML. Do you still get an error then?
What is the nature of response? According to the help, etree.parse is expecting one of:
- a file name/path
- a file object
- a file-like object
- a URL using the HTTP or FTP protocol
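For reference, a minimal sketch of the lxml.html route suggested above, continuing from the question's snippet (the XPath at the end is only a placeholder for whatever you actually need to find):
import lxml.html

# lxml.html.parse accepts a file-like object, which the mechanize response is
tree = lxml.html.parse(response)
root = tree.getroot()

# placeholder query: collect every link target on the page
for href in root.xpath('//a/@href'):
    print href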