I'm trying to scrape some course data from the site https://bulletins.psu.edu/university-course-descriptions/undergraduate/ for a project.
# -*- coding: utf-8 -*-
"""
Created on Mon Nov 5 20:37:33 2018
#author: DazedFury
"""
# Here, we're just importing both Beautiful Soup and the Requests library
from bs4 import BeautifulSoup
import requests
# returns a CloudflareScraper instance
#scraper = cfscrape.create_scraper()
#URL and textfile
text_file = open("Output.txt", "w", encoding='UTF-8')
page_link = 'https://bulletins.psu.edu/university-course-descriptions/undergraduate/acctg/'
page_response = requests.get(page_link)
page_content = BeautifulSoup(page_response.content, "html.parser")
#Array for storing URL's
URLArray = []
#Find links
for link in page_content.find_all('a'):
    if('/university-course-descriptions/undergraduate' in link.get('href')):
        URLArray.append(link.get('href'))
k = 1
#Parse Loop
while(k != 242):
    print("Writing " + str(k))
    completeURL = 'https://bulletins.psu.edu' + URLArray[k]
    # this is the url that we've already determined is safe and legal to scrape from.
    page_link = completeURL
    # here, we fetch the content from the url, using the requests library
    page_response = requests.get(page_link)
    #we use the html parser to parse the url content and store it in a variable.
    page_content = BeautifulSoup(page_response.content, "html.parser")
    page_content.prettify
    #Find and print all text with tag p
    paragraphs = page_content.find_all('div', {'class' : 'course_codetitle'})
    paragraphs2 = page_content.find_all('div', {'class' : 'courseblockdesc'})
    j = 0
    for i in range(len(paragraphs)):
        if i % 2 == 0:
            text_file.write(paragraphs[i].get_text())
            text_file.write("\n")
            if j < len(paragraphs2):
                text_file.write(" ".join(paragraphs2[j].get_text().split()))
                text_file.write("\n")
                text_file.write("\n")
                if(paragraphs2[j].get_text() != ""):
                    j += 1
    k += 1
#FORMAT
#text_file.write("<p style=\"page-break-after: always;\"> </p>")
#text_file.write("\n\n")
#Close Text File
text_file.close()
The specific info I need is the course title and the description. The problem is that some of the courses have blank descriptions, which messes up the order and gives bad data.
I thought about just checking whether the course description is blank, but on the site the 'courseblockdesc' div doesn't exist at all if the course has no description. So when I find_all on courseblockdesc, no element is added to the list for those courses, and the order ends up shifted. There are too many of these to fix manually, so I was hoping someone could help me find a solution.
The simplest solution would be to go through each item in one find_all for the parents of the items you are looking for.
for block in page_content.find_all('div', class_="courseblock"):
    title = block.find('div', {'class' : 'course_codetitle'})
    description = block.find('div', {'class' : 'courseblockdesc'})
    # do what you need with the navigable strings here.
    print(title.get_text())
    if description:
        print(description.get_text())
You may be over-complicating the procedure somewhat, but you're certainly on the right track. Instead of storing the information in an array and relying on all of the indexes to line up, write the text file as you traverse the courses, pulling title and description dynamically from each course block. If a block doesn't have a description, you can handle that on the spot. Here's a working example:
from bs4 import BeautifulSoup
import requests

url = "https://bulletins.psu.edu/university-course-descriptions/undergraduate/acctg/"

with open("out.txt", "w", encoding="UTF-8") as f:
    for link in BeautifulSoup(requests.get(url).content, "html.parser").find_all("a"):
        if "/university-course-descriptions/undergraduate" in link["href"]:
            soup = BeautifulSoup(requests.get("https://bulletins.psu.edu" + link["href"]).content, "html.parser")
            for course in soup.find_all("div", {"class": "courseblock"}):
                title = course.find("div", {"class" : "course_title"}).get_text().strip()
                try:
                    desc = course.find("div", {"class" : "courseblockdesc"}).get_text().strip()
                except AttributeError:
                    desc = "No description available"
                f.write(title + "\n" + desc + "\n\n")
Output snippet (from end of text file to validate alignment):
WLED 495: SPECIAL TOPICS
No description available
WLED 495B: Field Experience for World Languages Teacher Preparation in Grades 1-5
WL ED 495B Field Experience for World Languages Teacher Preparation in Grades 1-5 (3) Practicum situation where Prospective World Language teachers will demonstrate acquired knowledge on second language learning/teaching and educational theories. Prospective World Language teachers will have assigned school placements and will attend a weekly seminar where issues in World Language learning and teaching will be discussed. At their assigned school placement, prospective World Language teachers will have many opportunities to observe/work with children in grades 1-5 (1) focusing on second language learning/teaching and the socio/cultural issues associated to classroom practices while implementing and self-evaluated own designed activities and lessons; (2) weekly seminars will engage students in reflective activities that will enable them to analyze each week's events; (3) inquiry projects on teaching and learning of World Languages.
WLED 495C: Field Experience for World Languages Teacher Preparation in Grades 6-12
WL ED 495C Field Experience for World Languages Teacher Preparation in Grades 6-12 (3) Practicum situation where prospective World Language teachers will demonstrate acquired knowledge on second language learning/teaching and educational theories. Prospective World Language teachers will have assigned school placements in grades 6-12 and will attend a weekly seminar where issues in World Language learning and teaching will be discussed. At their assigned school placement, prospective World Language teachers will have many opportunities to observe/work with students in grades 6-12 (1) focusing on second language learning/teaching and the socio/cultural issues associated to classroom practices while implementing and self-evaluating their own designed activities and lessons, (2) weekly seminars will engage students in reflective activities that will enable them to analyze each week's events, and (3) inquiry projects on teaching and learning of World Languages.
Additional minor remarks:
It's a good idea to use the with keyword for file I/O. This will automatically close the file handle when done.
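For instance, the explicit open()/close() pair from the question could be written along these lines (just a sketch of the same pattern; the string written is a placeholder):

with open("Output.txt", "w", encoding="UTF-8") as text_file:
    text_file.write("some course data\n")  # file is closed automatically when the block exits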
Verbose intermediate variables and comments that add noise like:
# Here, we're just importing both Beautiful Soup and the Requests library
from bs4 import BeautifulSoup
or
#Close Text File
text_file.close()
can always be removed, making the program logic easier to follow.
Related
I have a txt file with the following info:
545524---Python foundation---Course---https://img-c.udemycdn.com/course/100x100/647442_5c1f.jpg---Outsourcing Development Work: Learn My Proven System To Hire Freelance Developers
Another line with the same format but different info (it might have the same id), and so on....
Here on line 1, Python foundation is the course title. If a user inputs the id 545524, how do I print out the course title Python foundation? It's basically printing the whole title of a course based on the given input id. I tried using the following but got stuck:
import re

input = ''
with open(r"sample.txt") as data:
    read_data = data.read()
id_search = re.findall(r'regex', read_data)
title_search = re.findall(r'regex', read_data)
for id_input in id_search:
    if input in id_input:
        # Then I got stuck
I need to print all the titles based on that id and finally add them to a list. Any help is appreciated.
Kemmisch is on the money here: you can just split the whole string on the --- and assign each separate element to its own variable.
somefile.txt
545524---Python foundation---Course---https://img-c.udemycdn.com/course/100x100/647442_5c1f.jpg---Outsourcing Development Work: Learn My Proven System To Hire Freelance Developers
12345---Not Python foundation---Ofcourse---https://some.url/here---This is something else
main.py
with open('somefile.txt') as infile:
    data = infile.read().splitlines() # os agnostic, and removes end of line characters

for line in data:
    idnr, title, dunno_what_this_is, url, description = line.split('---')
    print('--------------------')
    print(idnr)
    print(title)
    print(dunno_what_this_is)
    print(url)
    print(description)
output
--------------------
545524
Python foundation
Course
https://img-c.udemycdn.com/course/100x100/647442_5c1f.jpg
Outsourcing Development Work: Learn My Proven System To Hire Freelance Developers
--------------------
12345
Not Python foundation
Ofcourse
https://some.url/here
This is something else
EDIT
Now if you want to search for a specific ID you can just use an if statement. Sample below uses the same somefile.txt.
id_input = input("Provide an ID to search for: ")

with open('somefile.txt') as infile:
    data = infile.read().splitlines() # os agnostic, and removes end of line characters

for line in data:
    idnr, title, dunno_what_this_is, url, description = line.split('---')
    if idnr == id_input:
        print('--------------------')
        print(idnr)
        print(title)
        print(dunno_what_this_is)
        print(url)
        print(description)
output
Provide an ID to search for: 12345
--------------------
12345
Not Python foundation
Ofcourse
https://some.url/here
This is something else
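If you also want to collect the matching titles into a list, as the question mentions, a small variation of the same loop could look like this (untested sketch reusing the fields above):

id_input = input("Provide an ID to search for: ")
matching_titles = []

with open('somefile.txt') as infile:
    for line in infile.read().splitlines():
        idnr, title, dunno_what_this_is, url, description = line.split('---')
        if idnr == id_input:
            matching_titles.append(title)  # keep every title whose id matches

print(matching_titles)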
I'm using bs4 to scrape Product Hunt.
Taking this post as an example, when I scrape it using the below code, the "discussion" section is entirely absent.
import bs4
import pprint
import requests

res = requests.get('https://producthunt.com/posts/weights-biases')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
pprint.pprint(soup.prettify())
I suspect this has something to do with lazy loading (when you open the page, the "discussion" section takes an extra second or two to appear).
How do I scrape lazy loaded components? Or is this something else entirely?
It seems that some elements of the page are dynamically loaded through Javascript queries.
The requests library allows you to send queries manually and then you parse the content of the updated page with bs4.
However, from my experience with dynamic webpages, this approach can be really annoying if you have a lot of queries to send.
Generally in these cases it's preferable to use a library that integrates real-time browser simulation. This way, the simulator itself will handle client-server communication and update the page; you will just have to wait for the elements to be loaded and then analyze them safely.
So I suggest you take a look at selenium or even selenium-requests if you prefer to keep the requests 'philosophy'.
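For illustration, a minimal selenium sketch might look like the following. It assumes Chrome with a matching chromedriver on your PATH, and the CSS selector is only a placeholder for whatever element marks the discussion section, not the real Product Hunt markup:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.producthunt.com/posts/weights-biases')

# Wait (up to 10 seconds) for the lazily loaded section to appear before grabbing the HTML.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div[class*="comment"]'))  # placeholder selector
)

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()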
This is how you can get the comments under the discussion. You can always adapt the script to also pull the replies under each thread (see the sketch after the output below).
import json
import requests
from pprint import pprint
url = 'https://www.producthunt.com/frontend/graphql'
payload = {"operationName":"PostPageCommentsSection","variables":{"commentsListSubjectThreadsCursor":"","commentsThreadRepliesCursor":"","slug":"weights-biases","includeThreadForCommentId":None,"commentsListSubjectThreadsLimit":10},"query":"query PostPageCommentsSection($slug:String$commentsListSubjectThreadsCursor:String=\"\"$commentsListSubjectThreadsLimit:Int!$commentsThreadRepliesCursor:String=\"\"$commentsListSubjectFilter:ThreadFilter$includeThreadForCommentId:ID$excludeThreadForCommentId:ID){post(slug:$slug){id canManage ...PostPageComments __typename}}fragment PostPageComments on Post{_id id slug name ...on Commentable{_id id canComment __typename}...CommentsSubject ...PostReviewable ...UserSubscribed meta{canonicalUrl __typename}__typename}fragment PostReviewable on Post{id slug name canManage featuredAt createdAt disabledWhenScheduled ...on Reviewable{_id id reviewsCount reviewsRating isHunter isMaker viewerReview{_id id sentiment comment{id body __typename}__typename}...on Commentable{canComment commentsCount __typename}__typename}meta{canonicalUrl __typename}__typename}fragment CommentsSubject on Commentable{_id id ...CommentsListSubject __typename}fragment CommentsListSubject on Commentable{_id id threads(first:$commentsListSubjectThreadsLimit after:$commentsListSubjectThreadsCursor filter:$commentsListSubjectFilter include_comment_id:$includeThreadForCommentId exclude_comment_id:$excludeThreadForCommentId){edges{node{_id id ...CommentThread __typename}__typename}pageInfo{endCursor hasNextPage __typename}__typename}__typename}fragment CommentThread on Comment{_id id isSticky replies(first:5 after:$commentsThreadRepliesCursor allForCommentId:$includeThreadForCommentId){edges{node{_id id ...Comment __typename}__typename}pageInfo{endCursor hasNextPage __typename}__typename}...Comment __typename}fragment Comment on Comment{_id id badges body bodyHtml canEdit canReply canDestroy createdAt isHidden path repliesCount subject{_id id ...on Commentable{_id id __typename}__typename}user{_id id headline name firstName username headline ...UserSpotlight __typename}poll{...PollFragment __typename}review{id sentiment __typename}...CommentVote ...FacebookShareButtonFragment __typename}fragment CommentVote on Comment{_id id ...on Votable{_id id hasVoted votesCount __typename}__typename}fragment FacebookShareButtonFragment on Shareable{id url __typename}fragment UserSpotlight on User{_id id headline name username ...UserImage __typename}fragment UserImage on User{_id id name username avatar headline isViewer ...KarmaBadge __typename}fragment KarmaBadge on User{karmaBadge{kind score __typename}__typename}fragment PollFragment on Poll{id answersCount hasAnswered options{id text imageUuid answersCount answersPercent hasAnswered __typename}__typename}fragment UserSubscribed on Subscribable{_id id isSubscribed __typename}"}
r = requests.post(url,json=payload)
for item in r.json()['data']['post']['threads']['edges']:
    pprint(item['node']['body'])
Output at this moment:
('Looks like such a powerful tool for extracting performance insights! '
'Absolutely love the documentation feature, awesome work!')
('This is awesome and so Any discounts or special pricing for '
'researchers/students/non-professionals?')
'Amazing. I think this is very helpful tools for us. Keep it up & go ahead.'
('<p>This simple system of record automatically saves logs from every '
'experiment, making it easy to look over the history of your progress and '
'compare new models with existing baselines.</p>\n'
'Pros: <p>Easy, fast, and lightweight experiment tracking</p>\n'
'Cons: <p>Only available for Python projects</p>')
('Very cool! I hacked together something similar but much more basic for '
"personal use and always wondered why TensorBoard didn't solve this problem. "
'I just wish this was open source! :) P.S. awesome use of the parallel '
'co-ordinates d3.js chart - great idea to apply it to experiment '
'configurations!')
I'm trying to download all the last statements from the Death Row website. The basic outline is like this:
1. The info from the site gets imported into an sqlite database, prison.sqlite.
2. Based on the names in the table, I generate a unique URL for each name to get their last statement.
3. The program checks each generated URL; if the URL is OK, it fetches the last statement. This statement gets downloaded to the database prison.sqlite (still to do).
This is my code:
import sqlite3
import csv
import re
import urllib2
from urllib2 import Request, urlopen, URLError
from BeautifulSoup import BeautifulSoup
import requests
import string
URLS = ["http://www.tdcj.state.tx.us/death_row/dr_info/hernandezramontorreslast.html",
"http://www.tdcj.state.tx.us/death_row/dr_info/garciafrankmlast.html",
"http://www.tdcj.state.tx.us/death_row/dr_info/martinezdavidlast999173.html",
"http://www.tdcj.state.tx.us/death_row/dr_info/moselydaroycelast.html",
"http://www.tdcj.state.tx.us/death_row/dr_info/martinezdavidlast999288.html",
"http://www.tdcj.state.tx.us/death_row/dr_info/hernandezadophlast.html",
"http://www.tdcj.state.tx.us/death_row/dr_info/carterrobertanthonylast.html",
"http://www.tdcj.state.tx.us/death_row/dr_info/livingstoncharleslast.html",
"http://www.tdcj.state.tx.us/death_row/dr_info/gentrykennethlast.html",
"http://www.tdcj.state.tx.us/death_row/dr_info/gentrykennethlast.html",
"http://www.tdcj.state.tx.us/death_row/dr_info/wilkersonrichardlast.html",
"http://www.tdcj.state.tx.us/death_row/dr_info/hererraleonellast.html",]
conn = sqlite3.connect('prison.sqlite')
conn.text_factory = str
cur = conn.cursor()
cur.execute("DROP TABLE IF EXISTS prison")
cur.execute("CREATE TABLE Prison ( Execution text, link1 text, Statements text, LastName text, Firstname text, TDCJNumber text, Age integer, date text, race text, county text)")
conn.commit()
csvfile = open("prisonfile.csv","rb")
creader = csv.reader(csvfile, delimiter = ",")
for t in creader:
    cur.execute('INSERT INTO Prison VALUES (?,?,?,?,?,?,?,?,?,?)', t, )

for column in cur.execute("SELECT LastName, Firstname FROM prison"):
    lastname = column[0]
    firstname = column[1]
    name = lastname+firstname
    CleanName = name.translate(None, ",.!-#'#$" "")
    CleanName = CleanName.replace(" ", "")
    CleanName = CleanName.replace("III","")
    CleanName = re.sub("Sr","",CleanName)
    CleanName = re.sub("Jr","",CleanName)
    CleanName = CleanName.lower()
    Baseurl = "http://www.tdcj.state.tx.us/death_row/dr_info/"
    Link = Baseurl+CleanName+"last.html"
    URLS.append(Link)
for Link in URLS:
    try:
        r = requests.get(Link)
        r.raise_for_status()
        print "URL OK", Link
        document = urllib2.urlopen(Link)
        html = document.read()
        soup = BeautifulSoup(html)
        Statement = soup.find(text="Last Statement:").findNext('p').contents[0]
        print Statement
        continue
    except requests.exceptions.HTTPError as err:
        print err
        print "Offender has made no statement.", Link
    #cur.execute("INSERT OR IGNORE INTO prison(Statements) VALUES(?)"), (Statement, )
csvfile.close()
conn.commit()
conn.close()
When running the program I get:
C:\python>prison.py
URL OK http://www.tdcj.state.tx.us/death_row/dr_info/hernandezramontorreslast.html
Can you hear me? Did I ever tell you, you have dad's eyes? I've noticed that in the last couple of days. I'm sorry for putting you through all this. Tell everyone I love them. It was good seeing the kids. I love them all; tell mom, everybody. I am very sorry for all of the pain. Tell Brenda I love her. To everybody back on the row, I know you're going through a lot over there. Keep fighting, don't give up everybody.
URL OK http://www.tdcj.state.tx.us/death_row/dr_info/garciafrankmlast.html
Thank you, Jesus Christ. Thank you for your blessing. You are above the president. And know it is you, Jesus Christ, that is performing this miracle in my life. Hallelujah, Holy, Holy, Holy. For this reason I was born and raised. Thank you for this, my God is a God of Salvation. Only through you, Jesus Christ, people will see that you're still on the throne. Hallelujah, Holy, Holy, Holy. I invoke Your name. Thank you, Yahweh, thank you Jesus Christ. Hallelujah, Amen. Thank you, Warden.
URL OK http://www.tdcj.state.tx.us/death_row/dr_info/martinezdavidlast999173.html
Traceback (most recent call last):
  File "C:\python\prison.py", line 60, in <module>
    Statement = soup.find(text="Last Statement:").findNext('p').contents[0]
AttributeError: 'NoneType' object has no attribute 'findNext'
The first two statements are fine, but after that the program crashes. Looking at the page source of the URL where the error occurs, I see:
(only relevant data)
<div class="return_to_div"></div>
<h1>Offender Information</h1>
<h2>Last Statement</h2>
<p class="text_bold">Date of Execution:</p>
<p> February 4, 2009</p>
<p class="text_bold"> Offender:</p>
<p> Martinez, David</p>
<p class="text_bold"> Last Statement:</p>
<p> Yes, nothing I can say can change the past. I am asking for forgiveness. Saying sorry is not going to change anything. I hope one day you can find peace. I am sorry for all of the pain that I have caused you for all those years. There is nothing else I can say, that can help you. Mija, I love you. Sis, Cynthia, and Sandy, keep on going and it will be O.K. I am sorry to put you through this as well. I can't change the past. I hope you find peace and know that I love you. I am sorry. I am sorry and I can't change it. </p>
What could be causing this issue? Do I have to change something on this line?
Statement = soup.find(text="Last Statement:").findNext('p').contents[0]
Feel free to share improvements to my code. Right now I want to get everything working before I will make it more robust.
For the people wondering about the list with URL's in it: It is due to some bugs on the death row site. Sometimes the URL differs from [lastname][firstname]last.html. I added them manually for now.
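One hedged, untested sketch that would make that line more forgiving: match the label with a regular expression instead of an exact string, and guard against pages where it is missing. Note the leading space inside <p class="text_bold"> Last Statement:</p> in the source pasted above, which an exact text="Last Statement:" lookup will not match.

import re

# look for the label loosely, so stray whitespace does not break the match
label = soup.find(text=re.compile("Last Statement"))
if label is not None:
    Statement = label.findNext('p').contents[0]
    print Statement
else:
    print "No last statement found on", Link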
I am trying to search the Protein Data Bank by author name, but the only option is to use the full last name and initials, so there are some false hits. Is there a way to do this with Python? Below is the code I used:
import urllib2
#http://www.rcsb.org/pdb/software/rest.do#search
url = 'http://www.rcsb.org/pdb/rest/search'
queryText = """
<?xml version="1.0" encoding="UTF-8"?>
<orgPdbQuery>
<queryType>org.pdb.query.simple.AdvancedAuthorQuery</queryType>
<description>Author Name: Search type is All Authors and Author is Wang, R. and Exact match is true</description>
<searchType>All Authors</searchType>
<audit_author.name>Wang, R. </audit_author.name>
<exactMatch>true</exactMatch>
</orgPdbQuery>
"""
print "query:\n", queryText
print "querying PDB...\n"
req = urllib2.Request(url, data=queryText)
f = urllib2.urlopen(req)
result = f.read()
if result:
print "Found number of PDB entries:", result.count('\n')
print result
else:
print "Failed to retrieve results"
PyPDB can perform an advanced search of the RCSB Protein Data Bank by author, keyword, or subject area. The repository is here but it can also be found on PyPI:
pip install pypdb
For your application, I'd suggest first doing a general keyword search for PDB IDs with the author's name, and then searching the resulting list of PDBs for entries containing the author's name among the metadata.
First, the keyword search, here using the author's name as the query:
from pypdb import *
author_name = 'J.A. Doudna'
search_dict = make_query(author_name)
found_pdbs = do_search(search_dict)
Now iterate through the results looking for the author's name
matching_results = list()
for pdb_id in found_pdbs:
    desc_pdb = describe_pdb(pdb_id)
    if author_name in desc_pdb['citation_authors']:
        matching_results.append(pdb_id)
You can imagine using fancier regular expressions to handle slight variations in the way an author's name or initials are written. There might also be a nicer way to write this code that batches the requests.
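As a rough illustration (hypothetical pattern, adjust to your author; it relies on the same describe_pdb dictionary used above), a more tolerant match accepting listings such as "Doudna, J.A.", "J.A. Doudna", or "Doudna J A" could look like this:

import re

# match surname and initials in either order, with optional dots and commas
pattern = re.compile(r'(Doudna,?\s*J\.?\s*A\.?)|(J\.?\s*A\.?\s*Doudna)', re.IGNORECASE)
matching_results = [pdb_id for pdb_id in found_pdbs
                    if pattern.search(describe_pdb(pdb_id).get('citation_authors') or '')]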
I am trying to collect data from a webpage which has a bunch of select lists I need to fetch data from. Here is the page: http://www.asusparts.eu/partfinder/Asus/All In One/E Series/
And this is what I have so far:
import glob, string
from bs4 import BeautifulSoup
import urllib2, csv
for file in glob.glob("http://www.asusparts.eu/partfinder/*"):
##-page to show all selections for the E-series-##
selected_list = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/'
##-
page = urllib2.urlopen(selected_list)
soup = BeautifulSoup(page)
##-page which shows results after selecting one option-##
url = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'
##-identify the id of select list which contains the E-series-##
select = soup.find('select', id="myselectListModel")
option_tags = select.findAll('option')
##-omit first item in list as isn't part of the option-##
option_tags = option_tags[1:]
for option in option_tags:
    open(url + option['value'])
html = urllib2.urlopen("http://www.asusparts.eu/partfinder/")
soup = BeautifulSoup(html)
all = soup.find('div', id="accordion")
I am not sure if I am going about this the right way, as all the select menus make it confusing. Basically I need to grab all the data from the selected results, such as images, price, description, etc. They are all contained within one div tag named 'accordion' which holds all the results, so would this still gather all the data? Or would I need to dig deeper and search through the tags inside this div? Also, I would have preferred to search by id rather than class, as I could fetch all the data in one go. How would I do this from what I have above? Thanks. I am also unsure whether I am using the glob function correctly or not.
EDIT
Here is my edited code. No errors are returned, however I am not sure if it returns all the models for the E-series:
import string, urllib2, urllib, csv, urlparse
from bs4 import BeautifulSoup
##-page which shows results after selecting one option-##
url = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'
base_url = 'http://www.asusparts.eu/' + url
print base_url
##-page to show all selections for the E-series-##
selected_list = urllib.quote(base_url + '/Asus/All In One/E Series/ET10B')
print urllib.quote(base_url + '/Asus/All In One/E Series/ET10B')
#selected_list = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'
##-
page = urllib2.urlopen('http://www.asusparts.eu/partfinder/Asus/All%20In%20One/E%20Series')
soup = BeautifulSoup(page)
print soup
##-identify the id of select list which contains the E-series-##
select = soup.find('select', id="myselectListModel")
option_tags = select.findAll('option')
print option_tags
##-omit first item in list as isn't part of the option-##
option_tags = option_tags[1:]
print option_tags
for option in option_tags:
    url + option['redirectvalue']
    print " " + url + option['redirectvalue']
I'd like to point out a couple of problems in the code you posted. First of all, the glob module is not typically used for making HTTP requests; it is useful for iterating through a subset of files on a specified path, and you can read more about it in its docs.
The second issue is that in the line:
for file in glob.glob("http://www.asusparts.eu/partfinder/*"):
you have an indentation error, because there is no indented code that follows. This will raise an error and prevent the rest of the code from being executed.
Another problem is that you are using some of python's "reserved" names for your variables. You should never use words such as all or file for variable names.
Finally when you are looping through option_tags:
for option in option_tags:
    open(url + option['value'])
The open statement will try and open a local file whose path is url + option['value']. This will likely raise an error, as I doubt you'll have a file at that location. In addition, you should be aware that you aren't doing anything with this open file.
Okay, so enough with the critique. I've taken a look at the asus page and I think I have an idea of what you want to accomplish. From what I understand, you want to scrape a list of parts (images, text, price, etc..) for each computer model on the asus page. Each model has its list of parts located at a unique URL (for example: http://www.asusparts.eu/partfinder/Asus/Desktop/B%20Series/BM2220). This means that you need to be able to create this unique URL for each model. To make matters more complicated, each parts category is loaded dynamically, so for example the parts for the "Cooling" section are not loaded until you click on the link for "Cooling". This means we have a two part problem: 1) Get all of the valid (brand, type, family, model) combinations and 2) Figure out how to load all the parts for a given model.
I was kind of bored and decided to write up a simple program that will take care of most of the heavy lifting. It isn't the most elegant thing out there, but it'll get the job done. Step 1) is accomplished in get_model_information(). Step 2) is taken care of in parse_models() but is a little less obvious. Taking a look at the asus website, whenever you click on a parts subsection the JavaScript function getProductsBasedOnCategoryID() is run, which makes an ajax call to a formatted PRODUCT_URL (see below). The response is some JSON information that is used to populate the section you clicked on.
import urllib2
import json
import urlparse
from bs4 import BeautifulSoup

BASE_URL = 'http://www.asusparts.eu/partfinder/'

PRODUCTS_URL = 'http://json.zandparts.com/api/category/GetCategories/'\
               '44/EUR/{model}/{family}/{accessory}/{brand}/null/'

ACCESSORIES = ['Cable', 'Cooling', 'Cover', 'HDD', 'Keyboard', 'Memory',
               'Miscellaneous', 'Mouse', 'ODD', 'PS', 'Screw']


def get_options(url, select_id):
    """
    Gets all the options from a select element.
    """
    r = urllib2.urlopen(url)
    soup = BeautifulSoup(r)
    select = soup.find('select', id=select_id)
    try:
        options = [option for option in select.strings]
    except AttributeError:
        print url, select_id, select
        raise
    return options[1:]  # The first option is the menu text


def get_model_information():
    """
    Finds all the models for each family, all the families and models for each
    type, and all the types, families, and models for each brand.

    These are all added as tuples (brand, type, family, model) to the list
    models.
    """
    model_info = []

    print "Getting brands"
    brand_options = get_options(BASE_URL, 'mySelectList')

    for brand in brand_options:
        print "Getting types for {0}".format(brand)
        # brand = brand.replace(' ', '%20')  # URL encode spaces
        brand_url = urlparse.urljoin(BASE_URL, brand.replace(' ', '%20'))
        types = get_options(brand_url, 'mySelectListType')

        for _type in types:
            print "Getting families for {0}->{1}".format(brand, _type)
            bt = '{0}/{1}'.format(brand, _type)
            type_url = urlparse.urljoin(BASE_URL, bt.replace(' ', '%20'))
            families = get_options(type_url, 'myselectListFamily')

            for family in families:
                print "Getting models for {0}->{1}->{2}".format(brand,
                                                                _type, family)
                btf = '{0}/{1}'.format(bt, family)
                fam_url = urlparse.urljoin(BASE_URL, btf.replace(' ', '%20'))
                models = get_options(fam_url, 'myselectListModel')
                model_info.extend((brand, _type, family, m) for m in models)

    return model_info


def parse_models(model_information):
    """
    Get all the information for each accessory type for every
    (brand, type, family, model). accessory_info will be the python formatted
    json results. You can parse, filter, and save this information or use
    it however suits your needs.
    """
    for brand, _type, family, model in model_information:
        for accessory in ACCESSORIES:
            r = urllib2.urlopen(PRODUCTS_URL.format(model=model, family=family,
                                                    accessory=accessory,
                                                    brand=brand,))
            accessory_info = json.load(r)
            # Do something with accessory_info
            # ...


def main():
    models = get_model_information()
    parse_models(models)


if __name__ == '__main__':
    main()
Finally, one side note: the code above sticks with urllib2, but you may want to look at the requests library instead. I personally think it provides much more functionality and has better semantics, but you can use whatever you would like.
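For example, the urlopen call in parse_models() could be swapped for requests along these lines. This is only a sketch: requests.get, raise_for_status, and response.json() are the standard requests API, PRODUCTS_URL is the template defined above, and the helper name fetch_accessory_info is mine.

import requests

def fetch_accessory_info(model, family, accessory, brand):
    url = PRODUCTS_URL.format(model=model, family=family,
                              accessory=accessory, brand=brand)
    response = requests.get(url)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()       # same JSON payload that json.load(r) produced above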