I've been working on a Python 3 script to generate BibTeX entries, and I have ISSNs that I would like to use to look up information about the associated journal.
For instance, I would like to take the ISSN 0897-4756 and find that this is the journal Chemistry of Materials, which is published by ACS Publications.
I can do this manually using this site, where the info I am looking for is stored in the table //table[@id="journal-search-results-table"], or more specifically, in the cells of its table body.
I have, however, not been able to automate this successfully using Python 3.x.
I have attempted to access the data using the httplib2, requests, urllib2, and lxml.html packages, with no success thus far.
What I have so far is shown below:
import certifi
import lxml.html
import urllib.request
ISSN = "0897-4756"
address = "https://www.journalguide.com/journals/search?type=journal-name&journal-name={}".format(ISSN)
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'}
request = urllib.request.Request(address,None,hdr) #The assembled request
response = urllib.request.urlopen(request)
html = response.read()
tree = lxml.html.fromstring(html)
print(tree.xpath('//table[@id="journal-search-results-table"]/text()'))
# >> ['\n', '\n']
# Shows that I am connecting to the table
print(tree.xpath('//table[@id="journal-search-results-table"]//td/text()'))
# >> []
# Should???? hold the data segments that I am looking for?
Exact page being queried by the code above.
From what I can tell, it appears that the table's tbody element, and thus the tr and td elements it contains, are not yet loaded at the time Python parses the HTML, which is preventing me from reading the data.
How do I make it so that I can read out the Journal Name and Publisher from the specified table above?
As you mentioned in your question, this table is populated dynamically by JavaScript. To get around this you actually have to render the JavaScript, using either:
A web driver like Selenium, which renders the page the same way it would look to the user (JavaScript included); a sketch of this is shown after the requests-html example below
requests-html, a relatively new module that lets you render the JavaScript on a web page and has a lot of other useful features for web scraping
This is one way to solve your problem using requests-html:
from requests_html import HTMLSession
ISSN = "0897-4756"
address = "https://www.journalguide.com/journals/search?type=journal-name&journal-name={}".format(ISSN)
hdr = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'
}
ses = HTMLSession()
response = ses.get(address, headers=hdr)
response.html.render() # render the javascript to load the elements in the table
tree = response.html.lxml # no need to import lxml.html because requests-html can do this for you
print(tree.xpath('//table[@id="journal-search-results-table"]/text()'))
# >> ['\n', '\n']
print(tree.xpath('//table[@id="journal-search-results-table"]//td/text()'))
# >> ['ACS Publications', '1.905', 'No', '\n', '\n', '\n']
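For the Selenium option mentioned above, a minimal sketch might look like the following. This is an assumption-laden illustration rather than tested code: it assumes Chrome plus a chromedriver on PATH, that the rendered page keeps the same table id, and the 10-second wait is likewise an assumption:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
ISSN = "0897-4756"
address = "https://www.journalguide.com/journals/search?type=journal-name&journal-name={}".format(ISSN)
options = Options()
options.add_argument("--headless")  # no visible browser window
driver = webdriver.Chrome(options=options)  # assumes chromedriver is on PATH
try:
    driver.get(address)
    # wait until the JavaScript has filled in the result rows
    WebDriverWait(driver, 10).until(EC.presence_of_element_located(
        (By.XPATH, '//table[@id="journal-search-results-table"]//td')))
    cells = driver.find_elements(By.XPATH, '//table[@id="journal-search-results-table"]//td')
    print([cell.text for cell in cells])
finally:
    driver.quit()
Either way, once the JavaScript has run, the same //td XPath from your question should return the journal name and publisher cells.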
Related
I'm trying to write a script to scrape some data off a Zillow page (https://www.zillow.com/homes/for_rent/38.358484,-77.27869,38.218627,-77.498417_rect/X1-SSc26hnbsm6l2u1000000000_8e08t_sse/). I'm trying to gather data from every listing, but my search only finds 9 instances of the class I'm looking for ('list-card-addr'), even though I've checked the HTML of listings it does not find and the class exists there. Does anyone have any ideas why this is? Here's my simple code:
from bs4 import BeautifulSoup
import requests
req_headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.8',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
url="https://www.zillow.com/homes/for_rent/38.358484,-77.27869,38.218627,-77.498417_rect/X1-SSc26hnbsm6l2u1000000000_8e08t_sse/"
response = requests.get(url, headers=req_headers)
data = response.text
soup = BeautifulSoup(data,'html.parser')
address = soup.find_all(class_='list-card-addr')
print(len(address))
The data is stored within an HTML comment. You can regex it out easily as a string defining a JavaScript object, which you can then handle with json:
import requests, re, json
r = requests.get('https://www.zillow.com/homes/for_rent/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22savedSearchEnrollmentId%22%3A%22X1-SSc26hnbsm6l2u1000000000_8e08t%22%2C%22mapBounds%22%3A%7B%22west%22%3A-77.65840518457031%2C%22east%22%3A-77.11870181542969%2C%22south%22%3A38.13250414385234%2C%22north%22%3A38.444339281260426%7D%2C%22isMapVisible%22%3Afalse%2C%22filterState%22%3A%7B%22sort%22%3A%7B%22value%22%3A%22mostrecentchange%22%7D%2C%22fsba%22%3A%7B%22value%22%3Afalse%7D%2C%22fsbo%22%3A%7B%22value%22%3Afalse%7D%2C%22nc%22%3A%7B%22value%22%3Afalse%7D%2C%22fore%22%3A%7B%22value%22%3Afalse%7D%2C%22cmsn%22%3A%7B%22value%22%3Afalse%7D%2C%22auc%22%3A%7B%22value%22%3Afalse%7D%2C%22fr%22%3A%7B%22value%22%3Atrue%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A11%7D',
headers = {'User-Agent':'Mozilla/5.0'})
data = json.loads(re.search(r'!--(\{"queryState".*?)-->', r.text).group(1))
print(data['cat1'])
print(data['cat1']['searchList'].keys())
Within this are details on pagination and the next URL, if applicable, so you can get all the results; you have only asked for page 1 here (see the sketch after the address example below).
For example, to print the addresses:
for i in data['cat1']['searchResults']['listResults']:
    print(i['address'])
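To see what pagination information is actually present, a quick inspection like the one below might help; note that the key names under searchList are assumptions on my part, so print the dict first and adjust to what the payload really contains:
search_list = data['cat1']['searchList']
print(search_list.get('totalResultCount'))  # total number of matching listings, if the key exists
print(search_list.get('totalPages'))        # number of result pages, if the key exists
print(search_list.get('pagination'))        # next/previous page details, if the key exists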
I am trying to create a simple weather script. The idea is to look at this website and extract the current temperature and the icon to its left.
I know how to extract the HTML using requests and extract the temperature, but I have no clue how to get the icon. I am not sure how to get the icon/image path (not even sure if it is possible).
Here is my code:
import requests
import urllib.request
from bs4 import BeautifulSoup
url = "https://www.meteored.com.ar/tiempo-en_La+Plata-America+Sur-Argentina-Provincia+de+Buenos+Aires-SADL-1-16930.html"
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'}
a=urllib.request.Request(url,headers = hdr)
response = urllib.request.urlopen(a).read()
soup = BeautifulSoup(response,"lxml")
data = soup.find_all(class_ = "dato-temperatura changeUnitT")
temperature = data[0].text
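One possible way to get at the icon, building on the code above, is to look for an <img> tag in the same container as the temperature and read its src attribute. The markup assumptions here (that the icon is an <img> inside the temperature element's parent) are mine and need checking against the actual page:
from urllib.parse import urljoin
container = data[0].find_parent()  # element that holds the temperature value
icon = container.find("img") if container is not None else None
if icon is not None and icon.get("src"):
    print(urljoin(url, icon["src"]))  # resolve a relative path against the page URL
else:
    print("No <img> found next to the temperature element; inspect the page markup")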
As an exercise, I am trying to scrape data from a dynamic graph using Python. The graph can be found at this link (let's say I want the data from the first one).
Now, I was thinking of doing something like:
src = 'https://marketchameleon.com/Overview/WFT/IV/#_ABSTRACT_RENDERER_ID_11'
import json
import urllib.request
with urllib.request.urlopen(src) as url:
    data = url.read()
reply = json.loads(data)
However, I receive an error message on the last line of the code, saying:
JSONDecodeError: Expecting value
"data" is not empty, so I believe there is a problem with the format of the information within it. Does someone have an idea to solve this issue? Thanks!
I opened that link and saw that the site loads data from another URL - https://marketchameleon.com/charts/histStockChartData?p=747&m=12&_=1534060722519
You can use the json.loads() function twice and do some tweaking of the headers (urllib2.Request is your friend in the case of Python 2), since the server returns HTTP 500 when you don't imitate a browser:
src = 'https://marketchameleon.com/charts/histStockChartData?p=747&m=12'
import json
import urllib.request
user_agent = {
'Host': 'marketchameleon.com',
'Connection': 'keep-alive',
'Pragma': 'no-cache',
'Cache-Control': 'no-cache',
'Upgrade-Insecure-Requests': 1,
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Language': 'ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7,kk;q=0.6'
}
request = urllib.request.Request(src, headers=user_agent)
data = urllib.request.urlopen(request).read()
print(data)
reply = json.loads(data)
table = json.loads(reply['GTable'])
print(table)
I am trying to read all the comments for a given product. This is both to learn Python and for a project; to simplify my task I chose a product at random to code against.
The link I want to read is Amazon's, and I used urllib to open the link:
amazon = urllib.request.urlopen('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
After reading the link into the "amazon" variable, when I display amazon I get the message below:
print(amazon)
<http.client.HTTPResponse object at 0x000000DDB3796A20>
So I read online and found that I need to use the read command to read the source, but sometimes it gives me a web-page-like result and other times it does not:
print(amazon.read())
b''
How do I read the page, and pass it to beautiful soup ?
Edit 1
I did use requests.get, and when I check what is present in the text of the retrieved page, I find the content below, which does not match the website:
print(a2)
<html>
<head>
<title>503 - Service Unavailable Error</title>
</head>
<body bgcolor="#FFFFFF" text="#000000">
<!--
To discuss automated access to Amazon data please contact api-services-support@amazon.com.
For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.in/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.in/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
-->
<center>
<a href="http://www.amazon.in/ref=cs_503_logo/">
<img src="https://images-eu.ssl-images-amazon.com/images/G/31/x-locale/communities/people/logo.gif" width=200 height=45 alt="Amazon.in" border=0></a>
<p align=center>
<font face="Verdana,Arial,Helvetica">
<font size="+2" color="#CC6600"><b>Oops!</b></font><br>
<b>It's rush hour and traffic is piling up on that page. Please try again in a short while.<br>If you were trying to place an order, it will not have been processed at this time.</b><p>
<img src="https://images-eu.ssl-images-amazon.com/images/G/02/x-locale/common/orange-arrow.gif" width=10 height=9 border=0 alt="*">
<b>Go to the Amazon.in home page to continue shopping</b>
</font>
</center>
</body>
</html>
Using your current library, urllib, this is what you could do: use .read() to get the HTML, then pass it into BeautifulSoup as shown below. Keep in mind that Amazon is a heavily anti-scraping website; the reason you get different results is likely that the content is wrapped inside JavaScript, in which case you might have to use Selenium or Dryscrape. You may also need to pass headers/cookies and extra attributes in your request.
import urllib.request
from bs4 import BeautifulSoup
amazon = urllib.request.urlopen('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
html = amazon.read()
soup = BeautifulSoup(html, 'html.parser')
EDIT: Turns out you're using requests now. I could get a 200 response using requests, passing in my headers like this:
import requests
from bs4 import BeautifulSoup
headers = {
'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'
}
response = requests.get('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1',headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(response)  # <Response [200]>
Using Dryscrape:
import dryscrape
from bs4 import BeautifulSoup
sess = dryscrape.Session(base_url='http://www.amazon.in')
sess.set_header('user-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36')  # set headers before visiting
sess.visit('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
html = sess.body()
soup = BeautifulSoup(html, 'html.parser')
print(soup)
This should give you all the Amazon HTML now. Keep in mind that I haven't tested this code. Please refer to the dryscrape documentation for installation: https://dryscrape.readthedocs.io/en/latest/apidoc.html
I personally would use the requests library for this and not urllib; requests has more features.
import requests
from bs4 import BeautifulSoup
From there, something like:
resp = requests.get(url)  # You can break up your parameters and pass base_url & params to this as well if you have multiple products to deal with
soup = BeautifulSoup(resp.text, 'html.parser')
That should do the job for this one, as it is a rather simple HTTP request.
Edit:
Based on your error, you are going to have to research which parameters to pass to make your request look correct. In general with requests it will look something like the code below (obviously with the values you discover; check your browser's debug/developer tools to inspect the network traffic and see what you are sending to Amazon when using a browser):
url = "https://www.base.url.here"
params = {
    'param1': 'value1'
    .....
}
resp = requests.get(url, params=params)
For web scraping, use the requests and BeautifulSoup modules in Python 3.
Installing BeautifulSoup:
pip install beautifulsoup4
Use appropriate headers while sending the request.
headers = {
'authority': 'www.amazon.com',
'pragma': 'no-cache',
'cache-control': 'no-cache',
'dnt': '1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-dest': 'document',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}
Scrape.py
from bs4 import BeautifulSoup
import requests
url = "https://www.amazon.in/s/ref=mega_elec_s23_2_3_1_1?rh=i%3Acomputers%2Cn%3A3011505031&ie=UTF8&bbn=976392031"
headers = {
'authority': 'www.amazon.com',
'pragma': 'no-cache',
'cache-control': 'no-cache',
'dnt': '1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-dest': 'document',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}
response = requests.get(url, headers=headers)
with open("webpg.html", "w", encoding="utf-8") as file:  # saving the html file to disk
    file.write(response.text)
bs = BeautifulSoup(response.text, "html.parser")
print(bs)  # display the html; use bs.prettify() to make the document more readable
I'm trying to use urllib + bs4 to scrape naturebox's website.
I want to create a voting webapp so my office can vote for new snacks from naturebox each month.
Ideally I'd like to pull the snacks from naturebox's website as they change frequently.
Here's my code:
import urllib2,cookielib
site = 'http://www.naturebox.com/browse/'
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'}
req = urllib2.Request(site, headers=hdr)
try:
    page = urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print e.fp.read()
content = page.read()
print content
But the output is just JavaScript and none of the data for the actual snacks (picture, name, description, etc.).
Obviously I'm a noob - how can I get at this data?
You can use the Selenium WebDriver. This will allow you to load the site, including the JavaScript/AJAX, and access the HTML from there. Also, try using PhantomJS so that it can run invisibly in the background; a minimal sketch follows.
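A minimal sketch of that approach, assuming Selenium with headless Chrome (recent Selenium releases no longer ship a PhantomJS driver, so Chrome's headless mode plays the invisible-browser role here); the URL comes from the question, and the wait time is an assumption:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
options = Options()
options.add_argument("--headless")  # run the browser invisibly in the background
driver = webdriver.Chrome(options=options)  # assumes chromedriver is on PATH
try:
    driver.get("http://www.naturebox.com/browse/")
    driver.implicitly_wait(10)  # give the JavaScript/AJAX time to render the snack listings
    soup = BeautifulSoup(driver.page_source, "html.parser")
    # The selectors for each snack's picture, name and description depend on the
    # rendered markup -- inspect the page in your browser's dev tools and adjust.
    print(soup.prettify()[:2000])
finally:
    driver.quit()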