Parsing an XML file with Python, extracting attributes and children

I'm trying to read an XML file in Python whose general format is as follows:
<item id="1149" num="1" type="topic">
  <title>Afghanistan</title>
  <additionalInfo>Afghanistan</additionalInfo>
</item>
(This snippet repeats many times.)
I'm trying to get the id value and the title value to be printed into a file.
I'm having trouble getting the XML file into Python. Currently, I'm doing this to fetch it:
import xml.etree.ElementTree as ET
from urllib2 import urlopen
url = 'http://api.npr.org/list?id=3002' #1007 is science
response = urlopen(url)
f = open('out.xml', 'w')
f.write(response)
However, whenever I run this code, I get this error:
Traceback (most recent call last):
  File "python", line 9, in <module>
TypeError: expected a character buffer object
which makes me think that I'm not using something that can handle XML.
Is there any way that I can save the XML file to a file, then extract the title of each section, as well as the id attribute associated with that title?
Thanks for the help.

You can read the content of the response with this code:
import urllib2
opener = urllib2.build_opener(urllib2.HTTPRedirectHandler(),urllib2.HTTPCookieProcessor())
response= opener.open("http://api.npr.org/list?id=3002").read()
opener.close()
and then write it to file :
f = open('out.xml', 'w')
f.write(response)
f.close()

What you want is response.read(), not response. The response variable is a response object, not the XML string. Calling response.read() reads the XML out of the response instance.
You can then write it directly to a file like so:
url = 'http://api.npr.org/list?id=3002' #1007 is science
response = urlopen(url)
f = open('out.xml', 'w')
f.write(response.read())
Alternatively, you could parse it directly with ElementTree like so:
url = 'http://api.npr.org/list?id=3002' #1007 is science
response = urlopen(url)
tree = ET.fromstring(response.read())
To extract all of the id/title pairs you could do the following as well:
url = 'http://api.npr.org/list?id=3002' #1007 is science
response = urlopen(url)
tree = ET.fromstring(response.read())
for item in tree.findall("item"):
    print item.get("id")
    print item.find("title").text
From there you can decide where to store or output the values.
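For example, a minimal sketch that writes each id/title pair to a tab-separated file (the output name titles.txt is just a placeholder):
import xml.etree.ElementTree as ET
from urllib2 import urlopen

url = 'http://api.npr.org/list?id=3002'
response = urlopen(url)
tree = ET.fromstring(response.read())

# one "id<TAB>title" line per item element
with open('titles.txt', 'w') as out:
    for item in tree.findall('item'):
        title = item.find('title')
        if title is not None and title.text:
            out.write('%s\t%s\n' % (item.get('id'), title.text.encode('utf-8')))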

Related

Write request.get response to json file in python

I am using requests.get in Python to get data from a URL, as below:
import requests
username = 'ROAND'
password = dbutils.secrets.get("lab-secrets","ROSecret")
response = requests.get('https://pit.service.com/api/table', auth=(username,password))
The count reported in the response headers is:
print(response.headers)
'X-Total-Count': '799434'
I'm trying to load this into a JSON file as below:
data = response.content
with open('/path/file.json', 'wb') as f:
    f.write(data)
But the file contains only 1439 records.
I've tried multiple ways, but without success.
I just want to exactly bring all my contents from requests.get into a json file.
Kindly help.
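The X-Total-Count header (799434) being far larger than the 1439 records written suggests the API pages its results, so a single GET returns only the first page. A sketch that walks the pages and writes everything out, assuming hypothetical page and size query parameters and a JSON list per page (check the actual API's pagination scheme):
import json
import requests

username = 'ROAND'
password = dbutils.secrets.get("lab-secrets", "ROSecret")  # as in the question

records = []
page = 1
while True:
    # 'page' and 'size' are assumed parameter names -- consult the API docs
    response = requests.get('https://pit.service.com/api/table',
                            auth=(username, password),
                            params={'page': page, 'size': 1000})
    response.raise_for_status()
    chunk = response.json()
    if not chunk:  # an empty page means everything has been read
        break
    records.extend(chunk)
    page += 1

with open('/path/file.json', 'w') as f:
    json.dump(records, f)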

Using a text file with JSON data in Tkinter

I have used Python to call the "NewsAPI" API to get all the latest news I need, and I have saved the result into a text file called "NewsAPI.txt".
My code is:
import json
import requests
def newsAPI():
    url = ('https://newsapi.org/v2/everything?' #API URL
           'q=procurement AND tender&' #keywords on procurement AND tender
           'sortBy=popularity&' #Sort them by popularity
           'apiKey=***') #Personal API key
    # GET
    response = requests.get(url)
    #storing the output into variable "results"
    results = response.json()
    # save the JSON output into a txt file for future usage
    with open("NewsAPI.txt", "w") as text_file:
        json.dump(results, text_file)
After calling json.dump, the output gets saved into the "NewsAPI.txt" file as mentioned. But I'm having trouble putting it into a Treeview in Tkinter, or am I using the wrong widget to display it?
Output data: (screenshot of the saved JSON omitted)
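One option is ttk.Treeview, with one row per article. A minimal sketch, assuming the saved response has NewsAPI's usual articles list with source and title fields (adjust the keys to whatever your file actually contains):
import json
import Tkinter as tk  # 'tkinter' on Python 3
import ttk            # 'tkinter.ttk' on Python 3

with open("NewsAPI.txt") as text_file:
    results = json.load(text_file)

root = tk.Tk()
tree = ttk.Treeview(root, columns=("source", "title"), show="headings")
tree.heading("source", text="Source")
tree.heading("title", text="Title")

# one row per article in the saved response
for article in results.get("articles", []):
    tree.insert("", "end", values=(article["source"]["name"], article["title"]))

tree.pack(fill="both", expand=True)
root.mainloop()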

Can anyone tell me what the error msg "line 1182 in parse" means when I'm trying to parse an XML in Python

This is the code that results in an error message:
import urllib
import xml.etree.ElementTree as ET
url = raw_input('Enter URL:')
urlhandle = urllib.urlopen(url)
data = urlhandle.read()
tree = ET.parse(data)
The error (traceback screenshot omitted) is an IOError raised from "line 1182 in parse" inside ElementTree.
I'm new to Python. I did read documentation and a couple of tutorials, but clearly I still have done something wrong. I don't believe it is the XML file itself, because it does this to two different XML files.
Consider using ElementTree's fromstring():
import urllib
import xml.etree.ElementTree as ET
url = raw_input('Enter URL:')
# http://feeds.bbci.co.uk/news/rss.xml?edition=int
urlhandle = urllib.urlopen(url)
data = urlhandle.read()
tree = ET.fromstring(data)
print ET.tostring(tree, encoding='utf8', method='xml')
data is a reference to the XML content as a string, but the parse() function expects a filename or file object as its argument. That's why there is an error.
urlhandle is a file object, so tree = ET.parse(urlhandle) should work for you.
The error message indicates that your code is trying to open a file whose name is stored in the variable source.
It's failing to open that file (IOError) because the variable source contains a bunch of XML, not a file name.
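Both fixes together, as a sketch:
import urllib
import xml.etree.ElementTree as ET

url = raw_input('Enter URL:')

# Option 1: hand parse() the file-like object
tree = ET.parse(urllib.urlopen(url))
print tree.getroot().tag

# Option 2: read the content first, then use fromstring()
# (re-open the URL, since option 1 consumed the stream)
data = urllib.urlopen(url).read()
root = ET.fromstring(data)
print root.tag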

XML in Python and lxml

I am using the Pinnacle (betting) API, which returns an XML file. At the moment, I save it to a .xml file as below:
req = urllib2.Request(url, headers=headers)
responseData = urllib2.urlopen(req).read()
ofn = 'pinnacle_feed_basketball.xml'
with open(ofn, 'w') as ofile:
    ofile.write(responseData)
parse_xml()
and then open it in the parse_xml function
tree = etree.parse("pinnacle_feed_basketball.xml")
fdtime = tree.xpath('//rsp/fd/fdTime/text()')
I am presuming that saving it as an XML file and then reading the file back in is unnecessary, but I cannot get it to work without doing this.
I tried passing responseData to the parse_xml() function:
parse_xml(responseData)
and then in the function
tree = etree.parse(responseData)
fdtime = tree.xpath('//rsp/fd/fdTime/text()')
But it doesn't work.
If you want to parse an in-memory object (in your case, a string), use etree.fromstring(<obj>); etree.parse expects a file-like object or filename (see the lxml docs).
For example:
import urllib2, lxml.etree as etree
url = 'http://www.xmlfiles.com/examples/note.xml'
headers = {}
req = urllib2.Request(url, headers=headers)
responseData = urllib2.urlopen(req).read()
element = etree.fromstring(responseData)
print(element)
print(etree.tostring(element, pretty_print=True))
Output:
<Element note at 0x2c29dc8>
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>
parse() reads from file-like objects, and also accepts a filename or URL. The first call works because the string "pinnacle_feed_basketball.xml" is interpreted as a filename; you could equivalently pass a file object:
with open("pinnacle_feed_basketball.xml") as f:
    tree = etree.parse(f)
But responseData is the XML content itself, not a filename, so in the second case use:
root = etree.fromstring(responseData) # note that you are not getting an "ElementTree" object here
FYI, urllib2.urlopen(req) is also a file-like object:
tree = etree.parse(urllib2.urlopen(req))
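So the original code can become a sketch like this, with no intermediate file (url, headers and the //rsp/fd/fdTime path are taken from the question):
import urllib2
import lxml.etree as etree

req = urllib2.Request(url, headers=headers)  # url and headers as defined in the question

# parse straight from the response object
tree = etree.parse(urllib2.urlopen(req))
fdtime = tree.xpath('//rsp/fd/fdTime/text()')
print(fdtime)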

Extracting blog data in python

We have to extract a specified number of blogs (n) by reading them from a text file containing a list of blogs.
Then I extract the blog data and append it to a file.
This is just a part of the main assignment of applying nlp to the data.
So far I've done this:
import urllib2
from bs4 import BeautifulSoup
def create_data(n):
    blogs=open("blog.txt","r") #opening the file containing list of blogs
    f=file("data.txt","wt") #Create a file data.txt
    with open("blog.txt") as blogs:
        head = [blogs.next() for x in xrange(n)]
        page = urllib2.urlopen(head['href'])
        soup = BeautifulSoup(page)
        link = soup.find('link', type='application/rss+xml')
        print link['href']
        rss = urllib2.urlopen(link['href']).read()
        souprss = BeautifulSoup(rss)
        description_tag = souprss.find('description')
        f = open("data.txt","a") #data file created for applying nlp
        f.write(description_tag)
This code doesn't work. It worked when I gave the link directly, like:
page = urllib2.urlopen("http://www.frugalrules.com")
I call this function from a different script where user gives the input n.
What am I doing wrong?
Traceback:
Traceback (most recent call last):
File "C:/beautifulsoup4-4.3.2/main.py", line 4, in <module>
create_data(2)#calls create_data(n) function from create_data
File "C:/beautifulsoup4-4.3.2\create_data.py", line 14, in create_data
page=urllib2.urlopen(head)
File "C:\Python27\lib\urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 395, in open
req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
head is a list:
head = [blogs.next() for x in xrange(n)]
A list is indexed by integer indices (or slices). You cannot use head['href'] when head is a list:
page = urllib2.urlopen(head['href'])
It's hard to say how to fix this without knowing what the contents of blog.txt look like. If each line of blog.txt contains a URL, then you could use:
with open("blog.txt") as blogs:
for url in list(blogs)[:n]:
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
...
with open('data.txt', 'a') as f:
f.write(...)
Note that file is a deprecated form of open (it was removed in Python 3). Instead of using f=file("data.txt","wt"), use the more modern with-statement syntax (as shown above).
For example,
import urllib2
import bs4 as bs
def create_data(n):
    with open("data.txt", "wt") as f:
        pass
    with open("blog.txt") as blogs:
        for url in list(blogs)[:n]:
            page = urllib2.urlopen(url)
            soup = bs.BeautifulSoup(page.read())
            link = soup.find('link', type='application/rss+xml')
            print(link['href'])
            rss = urllib2.urlopen(link['href']).read()
            souprss = bs.BeautifulSoup(rss)
            description_tag = souprss.find('description')
            with open('data.txt', 'a') as f:
                f.write('{}\n'.format(description_tag))
create_data(2)
I'm assuming that you are opening, writing to and closing data.txt with each pass through the loop because you want to save partial results -- maybe in case the program is forced to terminate prematurely.
Otherwise, it would be easier to just open the file once at the very beginning:
with open("blog.txt") as blogs, open("data.txt", "wt") as f:
