I am trying to parse data from a page using Python, which is usually pretty straightforward, but all the data is hidden under jQuery elements and such, which makes it harder to grab. Please forgive me, as I am a newbie to Python and programming as a whole and am still getting familiar with it. The website I am getting the data from is http://www.asusparts.eu/partfinder/Asus/All In One/E Series, so I just need all the data from the E Series. This is the code I have so far:
import string, urllib2, csv, urlparse, sys
from bs4 import BeautifulSoup
changable_url = 'http://www.asusparts.eu/partfinder/Asus/All%20In%20One/E%20Series'
page = urllib2.urlopen(changable_url)
base_url = 'http://www.asusparts.eu'
soup = BeautifulSoup(page)
redirects = []
model_info = []
select = soup.find(id='myselectListModel')
print select.get_text()
options = select.findAll('option')
for option in options:
    if(option.has_attr('redirectvalue')):
        redirects.append(option['redirectvalue'])
for r in redirects:
    rpage = urllib2.urlopen(base_url + r.replace(' ', '%20'))
    s = BeautifulSoup(rpage)
    print s
    sys.exit()
However, the only problem is that it just prints out the data for the first model, which is
Asus->All In One->E Series->ET10B->AC Adapter. The actual HTML page prints out like the following... (output was too long - just pasted the main output needed)
I am unsure how I would grab the data for all the E Series parts, as I assumed this would grab everything. Also, I would appreciate it if any answers relate to the current method I am using, as this is the way the person in charge would like it done. Thanks.
[EDIT]
This is how I am trying to parse the HTML:
for r in redirects:
    rpage = urllib2.urlopen(urljoin(base_url, quote(r)))
    s = BeautifulSoup(rpage)
    print s
    data = soup.find(id='accordion')
    selection = data.findAll('td')
    for s in selections:
        if(selection.has_attr('class', 'ProduktLista')):
            redirects.append(td['class', 'ProduktLista'])
This is the error I get:
Traceback (most recent call last):
File "C:\asus.py", line 31, in <module>
selection = data.findAll('td')
AttributeError: 'NoneType' object has no attribute 'findAll'
You need to remove the sys.exit() call you have in your loop:
for r in redirects:
    rpage = urllib2.urlopen(base_url + r.replace(' ', '%20'))
    s = BeautifulSoup(rpage)
    print s
    # sys.exit()  # remove this line, no need to exit your program
You also may want to use urllib.quote to properly quote the URLs you get from the option dropdown; this removes the need to manually replace spaces with '%20'. Use urlparse.urljoin() to construct the final URL:
from urllib import quote
from urlparse import urljoin

for r in redirects:
    rpage = urllib2.urlopen(urljoin(base_url, quote(r)))
    s = BeautifulSoup(rpage)
    print s
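For readers on Python 3, where urllib2 and urlparse no longer exist, the same quoting and joining can be sketched with urllib.parse. This is only an illustration, not part of the original answer; the redirect value below is an assumed example of what an option's redirectvalue attribute might contain.

```python
from urllib.parse import quote, urljoin

base_url = 'http://www.asusparts.eu'
# Assumed example of a 'redirectvalue' taken from the dropdown
redirect = '/partfinder/Asus/All In One/E Series/ET10B'

# quote() percent-encodes the spaces but leaves '/' untouched by default
full_url = urljoin(base_url, quote(redirect))
print(full_url)
```

Because quote() handles every character that needs escaping, not just spaces, this is a bit more robust than calling replace(' ', '%20') by hand.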
I built a simple RSS reader in Python and it is not working.
In addition, I want to get the featured image source link of every post, and I haven't found a way to do so.
It shows me this error:
Traceback (most recent call last): File "RSS_reader.py", line 7, in <module>
feed_title = feed['feed']['title']
Some other RSS feeds work fine, so I don't understand why some feeds work and others don't.
I would like to understand why the code doesn't work, and also how to get the featured image source link of a post.
I attached the code; it is written for Python 3.7.
import feedparser
import webbrowser
feed = feedparser.parse("https://finance.yahoo.com/rss/")
feed_title = feed['feed']['title']
feed_entries = feed.entries
for entry in feed.entries:
    article_title = entry.title
    article_link = entry.link
    article_published_at = entry.published  # Unicode string
    article_published_at_parsed = entry.published_parsed  # Time object
    article_author = entry.author
    content = entry.summary
    article_tags = entry.tags
    print("{}[{}]".format(article_title, article_link))
    print("Published at {}".format(article_published_at))
    print("Published by {}".format(article_author))
    print("Content {}".format(content))
    print("catagory{}".format(article_tags))
A few things.
1) First, feed['feed']['title'] does not exist.
2) At least for this site, entry.author and entry.tags do not exist.
3) It seems feedparser is not compatible with Python 3.7 (it gives me a KeyError: "object doesn't have key 'category'").
So as a starting point, try running the following code in Python 3.6 and go from there.
import feedparser
import webbrowser
feed = feedparser.parse("https://finance.yahoo.com/rss/")
# feed_title = feed['feed']['title'] # NOT VALID
feed_entries = feed.entries
for entry in feed.entries:
    article_title = entry.title
    article_link = entry.link
    article_published_at = entry.published  # Unicode string
    article_published_at_parsed = entry.published_parsed  # Time object
    # article_author = entry.author  DOES NOT EXIST
    content = entry.summary
    # article_tags = entry.tags  DOES NOT EXIST
    print("{}[{}]".format(article_title, article_link))
    print("Published at {}".format(article_published_at))
    # print("Published by {}".format(article_author))
    print("Content {}".format(content))
    # print("catagory{}".format(article_tags))
Good luck.
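Instead of commenting out fields that a feed may not provide, the lookup itself can supply a default. Below is a minimal sketch of that idea using the standard library's xml.etree.ElementTree rather than feedparser (so it is a swapped-in technique, not the library from the question). The inline feed and the read_items helper are made up for illustration; this is not Yahoo's actual feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed; the second item has no <author> element
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
      <author>alice@example.com</author>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def read_items(xml_text):
    """Return a list of dicts, using 'unknown' when an optional field is missing."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter('item'):
        items.append({
            'title': item.findtext('title', default='unknown'),
            'link': item.findtext('link', default='unknown'),
            'author': item.findtext('author', default='unknown'),
        })
    return items


for entry in read_items(SAMPLE_RSS):
    print("{}[{}] by {}".format(entry['title'], entry['link'], entry['author']))
```

With this approach the same loop works for feeds that include an author and for feeds that do not, which is the situation described in point 2.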
You can also use XML parser libraries like BeautifulSoup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) and create custom parsers. A sample custom parser can be found here (https://github.com/vintageplayer/RSS-Parser). A walkthrough of the same can be read here (https://towardsdatascience.com/rss-feed-parser-in-python-553b1857055c)
Though other libraries can be useful, BeautifulSoup is an extremely handy library to try out.
I have used BeautifulSoup for a beginner RSS feed reader project (You need to install lxml for it to work since we are dealing with xml):
from bs4 import BeautifulSoup
import requests
url = requests.get('https://realpython.com/atom.xml')
soup = BeautifulSoup(url.content, 'xml')
entries = soup.find_all('entry')
for i in entries:
    title = i.title.text
    link = i.link['href']
    summary = i.summary.text
    print(f'Title: {title}\n\nSummary: {summary}\n\nLink: {link}\n\n------------------------\n')
You can find the Youtube video here:
https://www.youtube.com/watch?v=8HbqO-TfjlI
I am new to Python and am trying to scrape data from the web to (eventually) feed a small database.
My code is generating a NoneType error. Could you assist?
import urllib2
from bs4 import BeautifulSoup
#1- create files to Leagues, stock data and error
FLeague= open("C:\Python\+exercice\SoccerLeague.txt","w")
FData=open("C:\Python\+exercice\FootballDump.txt","w")
ErrorFile=open("C:\Python\+exercice\ErrorFootballScrap.txt","w")
#Open the website
# 1- grab the data and get the error too
soup = BeautifulSoup(urllib2.urlopen("http://www.soccerstats.com/leagues.asp").read(),"html")
TableLeague = soup.find("table", {"class" : "trow8"})
print TableLeague
#\here I just want to grab country name
for row in TableLeague("tr")[2:]:
    col = row.findAll("td")
    # I try to identify errors
    try:
        country = col[1].a.string.stip()
        FLeague.write(country+"\n")
    except Exception as e:
        ErrorFile.write (country + ";" + str(e)+";"+str(col)+"\n")
        pass
#close the files
FLeague.close
FData.close
ErrorFile.close
The first problem is coming from:
TableLeague("tr")[2:]
TableLeague is None here since there is no table element with trow8 class. Instead use the id attribute to find the desired table element:
TableLeague = soup.find("table", {"id": "btable"})
Also, you probably meant strip() and not stip() here: col[1].a.string.stip().
And, in order to close the files, call the close() method. Replace:
FLeague.close
FData.close
ErrorFile.close
with:
FLeague.close()
FData.close()
ErrorFile.close()
Or, even better, use the with statement (a context manager) to work with files; then you do not need to close a file explicitly.
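A minimal sketch of the context-manager approach, written for Python 3. The country list is placeholder data standing in for the scraped values, and a temporary directory is used instead of the C:\Python path so the snippet runs anywhere.

```python
import os
import tempfile

# Hypothetical data standing in for the scraped country names
countries = ["England", "Spain", "Italy"]

# A temporary directory is used here so the sketch runs anywhere
out_dir = tempfile.mkdtemp()
league_path = os.path.join(out_dir, "SoccerLeague.txt")

# The file is closed automatically when the with block ends,
# even if an exception is raised inside it
with open(league_path, "w") as league_file:
    for country in countries:
        league_file.write(country + "\n")

print(league_file.closed)
```

Several files can be opened in the same with statement by separating them with commas, which would cover the three files in the question.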
Okay guys, I'm new to parsing XML and Python, and am trying to get this to work. If someone could help me with this it would be greatly appreciated. If you can help me (educate me) on how to figure it out for myself, that would be even better!
I am having trouble trying to figure out the range to reference for an XML document as I can't find any documentation on it. Here is my code and I'll include the entire Traceback after.
#import library to do http requests:
import urllib.request
#import easy to use xml parser called minidom:
from xml.dom.minidom import parseString
#all these imports are standard on most modern python implementations
#download the file:
file = urllib.request.urlopen('http://www.wizards.com/dndinsider/compendium/CompendiumSearch.asmx/KeywordSearch?Keywords=healing%20%word&nameOnly=True&tab=')
#convert to string:
data = file.read()
#close file because we dont need it anymore:
file.close()
#parse the xml you downloaded
dom = parseString(data)
#retrieve the first xml tag (<tag>data</tag>) that the parser finds with name tagName:
xmlTag = dom.getElementsByTagName('Data.Results.Power.ID')[0].toxml()
#strip off the tag (<tag>data</tag> ---> data):
xmlData=xmlTag.replace('<id>','').replace('</id>','')
#print out the xml tag and data in this format: <tag>data</tag>
print(xmlTag)
#just print the data
print(xmlData)
Traceback
/usr/bin/python3.4 /home/mint/PycharmProjects/DnD_Project/Power_Name.py
Traceback (most recent call last):
File "/home/mint/PycharmProjects/DnD_Project/Power_Name.py", line 14, in <module>
xmlTag = dom.getElementsByTagName('id')[0].toxml()
IndexError: list index out of range
Process finished with exit code 1
print(len(dom.getElementsByTagName('id')))
EDIT:
ids = dom.getElementsByTagName('id')
if len( ids ) > 0 :
    xmlTag = ids[0].toxml()
    # rest of code
EDIT: I added an example because I saw in another comment that you don't know how to use it.
BTW: I added some comments in the code about the file/connection.
import urllib.request
from xml.dom.minidom import parseString
# create connection to data/file on server
connection = urllib.request.urlopen('http://www.wizards.com/dndinsider/compendium/CompendiumSearch.asmx/KeywordSearch?Keywords=healing%20%word&nameOnly=True&tab=')
# read from server as string (not "convert" to string):
data = connection.read()
#close connection because we dont need it anymore:
connection.close()
dom = parseString(data)
# get tags from dom
ids = dom.getElementsByTagName('Data.Results.Power.ID')
# check if there are any data
if len( ids ) > 0 :
    xmlTag = ids[0].toxml()
    xmlData=xmlTag.replace('<id>','').replace('</id>','')
    print(xmlTag)
    print(xmlData)
else:
    print("Sorry, there was no data")
Or you can use a for loop if there are more tags:
dom = parseString(data)
# get tags from dom
ids = dom.getElementsByTagName('Data.Results.Power.ID')
# get all tags - one by one
for one_tag in ids:
    xmlTag = one_tag.toxml()
    xmlData = xmlTag.replace('<id>','').replace('</id>','')
    print(xmlTag)
    print(xmlData)
BTW:
getElementsByTagName() expects a tag name (ID), not a path (Data.Results.Power.ID).
The tag name is ID, so you have to replace <ID>, not <id>.
For this tag you can even use one_tag.firstChild.nodeValue in place of xmlTag.replace:
dom = parseString(data)
# get tags from dom
ids = dom.getElementsByTagName('ID') # tagname
# get all tags - one by one
for one_tag in ids:
    xmlTag = one_tag.toxml()
    #xmlData = xmlTag.replace('<ID>','').replace('</ID>','')
    xmlData = one_tag.firstChild.nodeValue
    print(xmlTag)
    print(xmlData)
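To try the points above without hitting the server, here is a small self-contained sketch. The XML string is a made-up stand-in for the service's response (the real document may be shaped differently); it only illustrates that tag names are case-sensitive and that firstChild.nodeValue returns the text directly.

```python
from xml.dom.minidom import parseString

# Hypothetical response; the real service's XML may be shaped differently
SAMPLE_XML = """<Data>
  <Results>
    <Power><ID>101</ID><Name>Healing Word</Name></Power>
    <Power><ID>202</ID><Name>Healing Strike</Name></Power>
  </Results>
</Data>"""

dom = parseString(SAMPLE_XML)

# Tag names are case-sensitive: 'ID' matches, 'id' finds nothing
print(len(dom.getElementsByTagName('id')))

# Collect the text of every <ID> element
power_ids = [node.firstChild.nodeValue
             for node in dom.getElementsByTagName('ID')]
print(power_ids)
```

An empty result for 'id' is exactly what leads to the IndexError in the question once [0] is applied to it.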
I haven't used the built in xml library in a while, but it's covered in Mark Pilgrim's great Dive into Python book.
-- I see as I'm typing this that your question has already been answered but since you mention being new to Python I think you will find the text useful for xml parsing and as an excellent introduction to the language.
If you would like to try another approach to parsing xml and html, I highly recommend lxml.
I am trying to collect data from a webpage which has a bunch of select lists I need to fetch
data from. Here is the page: http://www.asusparts.eu/partfinder/Asus/All In One/E Series/
And this is what I have so far:
import glob, string
from bs4 import BeautifulSoup
import urllib2, csv
for file in glob.glob("http://www.asusparts.eu/partfinder/*"):
##-page to show all selections for the E-series-##
selected_list = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/'
##-
page = urllib2.urlopen(selected_list)
soup = BeautifulSoup(page)
##-page which shows results after selecting one option-##
url = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'
##-identify the id of select list which contains the E-series-##
select = soup.find('select', id="myselectListModel")
option_tags = select.findAll('option')
##-omit first item in list as isn't part of the option-##
option_tags = option_tags[1:]
for option in option_tags:
    open(url + option['value'])
html = urllib2.urlopen("http://www.asusparts.eu/partfinder/")
soup = BeautifulSoup(html)
all = soup.find('div', id="accordion")
I am not sure if I am going about this the right way, as all the select menus make it confusing. Basically I need to grab all the data from the selected results, such as images, price, description, etc. They are all contained within one div tag, named 'accordion', which holds all the results. Would this still gather all the data, or would I need to dig deeper and search through the tags inside this div? Also, I would have preferred to search by id rather than class, as I could fetch all the data in one go. How would I do this from what I have above? Thanks. I am also unsure whether I am using the glob function correctly.
EDIT
Here is my edited code. No errors are returned; however, I am not sure if it returns all the models for the E Series.
import string, urllib2, urllib, csv, urlparse
from bs4 import BeautifulSoup
##-page which shows results after selecting one option-##
url = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'
base_url = 'http://www.asusparts.eu/' + url
print base_url
##-page to show all selections for the E-series-##
selected_list = urllib.quote(base_url + '/Asus/All In One/E Series/ET10B')
print urllib.quote(base_url + '/Asus/All In One/E Series/ET10B')
#selected_list = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'
##-
page = urllib2.urlopen('http://www.asusparts.eu/partfinder/Asus/All%20In%20One/E%20Series')
soup = BeautifulSoup(page)
print soup
##-identify the id of select list which contains the E-series-##
select = soup.find('select', id="myselectListModel")
option_tags = select.findAll('option')
print option_tags
##-omit first item in list as isn't part of the option-##
option_tags = option_tags[1:]
print option_tags
for option in option_tags:
    url + option['redirectvalue']
    print " " + url + option['redirectvalue']
First of all, I'd like to point out a couple of problems in the code you posted. First, the glob module is not typically used for making HTTP requests. It is useful for iterating through a subset of files on a specified path; you can read more about it in its docs.
The second issue is that in the line:
for file in glob.glob("http://www.asusparts.eu/partfinder/*"):
you have an indentation error, because there is no indented code that follows. This will raise an error and prevent the rest of the code from being executed.
Another problem is that you are using the names of some of Python's built-ins for your variables. Names such as all and file are not truly reserved words, but assigning to them hides the built-in of the same name, so you should avoid them as variable names.
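To see the difference between a true keyword and a built-in name, here is a small sketch using the standard keyword and builtins modules. It is written for Python 3, where file is no longer a built-in (it was one in Python 2, which the question uses). The classify helper is made up for illustration.

```python
import builtins
import keyword


def classify(name):
    """Say whether a name is a true keyword, a built-in, or neither."""
    if keyword.iskeyword(name):
        return 'keyword'
    if hasattr(builtins, name):
        return 'builtin'
    return 'ordinary name'


for name in ('for', 'all', 'file'):
    print(name, '->', classify(name))
```

Python refuses to assign to a keyword at all, while assigning to a built-in name is allowed and silently replaces it in that scope, which is why the latter is easy to do by accident.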
Finally when you are looping through option_tags:
for option in option_tags:
open(url + option['value'])
The open statement will try and open a local file whose path is url + option['value']. This will likely raise an error, as I doubt you'll have a file at that location. In addition, you should be aware that you aren't doing anything with this open file.
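A quick way to convince yourself of this, sketched in Python 3 and without touching the network: pass the URL to open() and look at the exception. The exact exception subclass can vary by operating system, so the sketch catches the general OSError.

```python
url = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'

try:
    # open() looks for a file on the local disk with this literal name
    handle = open(url)
except OSError as exc:
    # FileNotFoundError (or a similar OSError) is the expected outcome
    error_name = type(exc).__name__
    print('open() failed:', error_name)
else:
    # Only reached if such a local file really exists
    handle.close()
    error_name = None
```

Fetching the page over HTTP is the job of urllib2.urlopen (urllib.request.urlopen in Python 3), which the question already uses elsewhere.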
Okay, so enough with the critique. I've taken a look at the asus page and I think I have an idea of what you want to accomplish. From what I understand, you want to scrape a list of parts (images, text, price, etc..) for each computer model on the asus page. Each model has its list of parts located at a unique URL (for example: http://www.asusparts.eu/partfinder/Asus/Desktop/B%20Series/BM2220). This means that you need to be able to create this unique URL for each model. To make matters more complicated, each parts category is loaded dynamically, so for example the parts for the "Cooling" section are not loaded until you click on the link for "Cooling". This means we have a two part problem: 1) Get all of the valid (brand, type, family, model) combinations and 2) Figure out how to load all the parts for a given model.
I was kind of bored and decided to write up a simple program that will take care of most of the heavy lifting. It isn't the most elegant thing out there, but it'll get the job done. Step 1) is accomplished in get_model_information(). Step 2) is taken care of in parse_models() but is a little less obvious. Taking a look at the Asus website, whenever you click on a parts subsection the JavaScript function getProductsBasedOnCategoryID() is run, which makes an AJAX call to a formatted PRODUCTS_URL (see below). The response is some JSON information that is used to populate the section you clicked on.
import urllib2
import json
import urlparse
from bs4 import BeautifulSoup
BASE_URL = 'http://www.asusparts.eu/partfinder/'
PRODUCTS_URL = 'http://json.zandparts.com/api/category/GetCategories/'\
               '44/EUR/{model}/{family}/{accessory}/{brand}/null/'
ACCESSORIES = ['Cable', 'Cooling', 'Cover', 'HDD', 'Keyboard', 'Memory',
               'Miscellaneous', 'Mouse', 'ODD', 'PS', 'Screw']

def get_options(url, select_id):
    """
    Gets all the options from a select element.
    """
    r = urllib2.urlopen(url)
    soup = BeautifulSoup(r)
    select = soup.find('select', id=select_id)
    try:
        options = [option for option in select.strings]
    except AttributeError:
        print url, select_id, select
        raise
    return options[1:]  # The first option is the menu text

def get_model_information():
    """
    Finds all the models for each family, all the families and models for each
    type, and all the types, families, and models for each brand.
    These are all added as tuples (brand, type, family, model) to the list
    models.
    """
    model_info = []
    print "Getting brands"
    brand_options = get_options(BASE_URL, 'mySelectList')
    for brand in brand_options:
        print "Getting types for {0}".format(brand)
        # brand = brand.replace(' ', '%20')  # URL encode spaces
        brand_url = urlparse.urljoin(BASE_URL, brand.replace(' ', '%20'))
        types = get_options(brand_url, 'mySelectListType')
        for _type in types:
            print "Getting families for {0}->{1}".format(brand, _type)
            bt = '{0}/{1}'.format(brand, _type)
            type_url = urlparse.urljoin(BASE_URL, bt.replace(' ', '%20'))
            families = get_options(type_url, 'myselectListFamily')
            for family in families:
                print "Getting models for {0}->{1}->{2}".format(brand,
                                                                _type, family)
                btf = '{0}/{1}'.format(bt, family)
                fam_url = urlparse.urljoin(BASE_URL, btf.replace(' ', '%20'))
                models = get_options(fam_url, 'myselectListModel')
                model_info.extend((brand, _type, family, m) for m in models)
    return model_info

def parse_models(model_information):
    """
    Get all the information for each accessory type for every
    (brand, type, family, model). accessory_info will be the python formatted
    json results. You can parse, filter, and save this information or use
    it however suits your needs.
    """
    for brand, _type, family, model in model_information:
        for accessory in ACCESSORIES:
            r = urllib2.urlopen(PRODUCTS_URL.format(model=model, family=family,
                                                    accessory=accessory,
                                                    brand=brand,))
            accessory_info = json.load(r)
            # Do something with accessory_info
            # ...

def main():
    models = get_model_information()
    parse_models(models)

if __name__ == '__main__':
    main()
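One detail worth checking in parse_models() is that family names such as "E Series" contain spaces, which end up unescaped in the formatted URL. Below is a possible sketch, in Python 3, of percent-encoding each segment before filling in PRODUCTS_URL. The build_products_url helper is hypothetical, and whether the JSON endpoint expects encoded segments is an assumption that would need testing against the real service.

```python
from urllib.parse import quote

PRODUCTS_URL = ('http://json.zandparts.com/api/category/GetCategories/'
                '44/EUR/{model}/{family}/{accessory}/{brand}/null/')


def build_products_url(brand, family, model, accessory):
    """Fill PRODUCTS_URL, percent-encoding each path segment (hypothetical helper)."""
    return PRODUCTS_URL.format(
        model=quote(model, safe=''),
        family=quote(family, safe=''),
        accessory=quote(accessory, safe=''),
        brand=quote(brand, safe=''),
    )


print(build_products_url('Asus', 'E Series', 'ET10B', 'Cooling'))
```

Passing safe='' makes quote() escape a '/' inside a name as well, so a value can never be mistaken for two path segments.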
Finally, one side note. I have dropped urllib2 in favor of the requests library. I personally think it provides much more functionality and has better semantics, but you can use whatever you would like.
I am using Python3 and the package requests to fetch HTML data.
I have tried running the line
r = requests.get('https://github.com/timeline.json')
, which is the example on their tutorial, to no avail. However, when I run
request = requests.get('http://www.math.ksu.edu/events/grad_conf_2013/')
it works fine. I am getting errors such as
AttributeError: 'MockRequest' object has no attribute 'unverifiable'
Error in sys.excepthook:
I am thinking the errors have something to do with the type of webpage I am attempting to get, since the html page that is working is just basic html that I wrote.
I am very new to requests and Python in general. I am also new to stackoverflow.
As a small example, here is a little tool I developed to fetch data from a website, in this case my IP address, and show it:
# Import the requests module
# TODO: Make sure to install it first
import requests
# Get the raw information from the website
r = requests.get('http://whatismyipaddress.com')
raw_page_source_list = r.text
text = ''
# Join the whole list into a single string in order
# to simplify things
text = text.join(raw_page_source_list)
# Get the exact starting position of the IP address string
ip_text_pos = text.find('IP Information') + 62
# Now extract the IP address and store it
ip_address = text[ip_text_pos : ip_text_pos + 12]
# Print the result (Python 3 syntax; in Python 2 use:
# print 'Your IP address is: %s' % ip_address)
print('Your IP address is: %s' % ip_address)
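The fixed offsets above (+ 62 and a length of 12) only work while the page layout and the length of the address stay the same. A less brittle alternative is to search for something that looks like an IPv4 address. This is only a sketch: the sample text is made up and the real page's markup will differ.

```python
import re

# Hypothetical snippet of page text; the real page's markup will differ
sample_text = '<h2>IP Information</h2><span>Your IP: 203.0.113.42</span>'

# Four dot-separated groups of 1-3 digits (a loose IPv4 pattern; it does
# not check that each group is <= 255)
ipv4_pattern = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')

match = ipv4_pattern.search(sample_text)
ip_address = match.group(0) if match else None
print('Your IP address is: %s' % ip_address)
```

In the tool above, sample_text would be replaced by r.text from the requests call.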