scrape google resultstats with python [closed] - python

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 7 years ago.
I would like to get the estimated number of results from Google for a keyword. I'm using Python 3.3 and trying to accomplish this with BeautifulSoup and urllib.request. This is my code so far:
def numResults():
    try:
        page_google = '''http://www.google.de/#output=search&sclient=psy-ab&q=pokerbonus&oq=pokerbonus&gs_l=hp.3..0i10l2j0i10i30l2.16503.18949.0.20819.10.9.0.1.1.0.413.2110.2-6j1j1.8.0....0...1c.1.19.psy-ab.FEBvxrgi0KU&pbx=1&bav=on.2,or.r_qf.&bvm=bv.48705608,d.Yms&'''
        req_google = Request(page_google)
        req_google.add_header('User Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')
        html_google = urlopen(req_google).read()
        soup = BeautifulSoup(html_google)
        scounttext = soup.find('div', id='resultStats')
    except URLError as e:
        print(e)
    return scounttext
My problem is that the soup variable seems to be encoded somehow, so I can't get any information out of it. soup.find finds nothing, and I get back None.
What am I doing wrong, and how can I extract the resultStats?
Many thanks!

If you haven't solved this yet: BeautifulSoup fails to find anything because resultStats never appears in the soup. Your Request(page_google) returns only JavaScript, not the search results that the JavaScript loads dynamically. You can verify this by adding
print(soup)
to your code; the resultStats div does not appear in the output.
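One likely reason the original request never reaches the search results: in the question's URL, everything after `#` is a fragment, which the client keeps to itself and does not send to the server. A minimal sketch using only the standard library (the URL is shortened from the question's, and nothing here contacts Google):

```python
from urllib.parse import urlsplit, urlencode

# Shortened version of the URL from the question: the search terms sit after '#'.
browser_url = "http://www.google.de/#output=search&q=pokerbonus"
parts = urlsplit(browser_url)

print(repr(parts.path))      # the only path the server sees
print(repr(parts.query))     # empty: no query string is sent
print(repr(parts.fragment))  # the search terms, which stay on the client

# Putting the search term into a real query string instead.
search_url = "http://www.google.de/search?" + urlencode({"q": "pokerbonus"})
print(search_url)
```

Because the query string is empty, the server is only asked for the home page, which fits the JavaScript-only response described above.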
The following code:
from urllib.parse import quote_plus
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
query = 'pokerbonus'
# Query the /search endpoint directly, with the keyword URL-encoded
url = "http://www.google.de/search?q=%s" % quote_plus(query)
req_google = Request(url)
req_google.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
html_google = urlopen(req_google).read()
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html_google, 'html.parser')
scounttext = soup.find('div', id='resultStats')
print(scounttext)
printed the following when this answer was written:
<div class="sd" id="resultStats">Ungefähr 1.060.000 Ergebnisse</div>
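To turn that text into a number, one option is a small regular expression. This is only a sketch: the helper name `parse_result_count` is made up for illustration, and it assumes the count is the first number in the string, with thousands separated by dots, commas, or spaces.

```python
import re

def parse_result_count(text):
    """Return the first number in a result-stats string as an int, or None."""
    # Digits optionally grouped in threes by '.', ',' or whitespace,
    # e.g. "1.060.000" (German) or "1,060,000" (English).
    match = re.search(r"\d+(?:[.,\s]\d{3})*", text)
    if match is None:
        return None
    # Drop the separators and convert what is left.
    return int(re.sub(r"\D", "", match.group()))

print(parse_result_count("Ungefähr 1.060.000 Ergebnisse"))  # 1060000
```

With the soup from the answer, the input would be `scounttext.get_text()`, after checking that `scounttext` is not None.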
Lastly, a tool like Selenium WebDriver might be a better way to solve this, since Google does not allow bots to scrape its search results.

Related

Cannot scrape Amazon data. BeautifulSoup returns empty HTML element [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed yesterday.
I am trying to scrape data from Amazon's deals page, but the code below returns an empty element: <div class="" id="slot-4"> </div>.
Why is this code not working on the Amazon website?
from bs4 import BeautifulSoup
import requests
url = "https://www.amazon.in/gp/goldbox/all-deals/?ie=UTF8&ref_=sv_gb_1"
HEADERS = ({'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36'})
def getdata(url, HEADERS):
    webContent = requests.get(url, headers=HEADERS)
    htmlContent = webContent.content
    soup = BeautifulSoup(htmlContent, 'html.parser')
    # print(soup.prettify())
    return soup
def getdeals(soup):
    data = soup.find_all("div", {"id": 'slot-4'})
    print(data)
soup = getdata(url, HEADERS)
getdeals(soup)
This is my output:
[<div class="" id="slot-4"> </div>]
The HTML element <div class="" id="slot-4"> is empty in the response.
You may want to target the deal-card attribute instead.
You should also check whether the content is being loaded dynamically with JavaScript. If it is, you will need a library like Selenium rather than BeautifulSoup.

Why does html from requests response deviate from dev tools? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 1 year ago.
I am trying to scrape the Houzz website.
The browser dev tools show the HTML content, but when I scrape the page with BeautifulSoup I get something else mixed in with some of the HTML. I do not have much knowledge of this.
A small part of what I get is as follows.
</div><style data-styled="true" data-styled-version="5.2.1">.fzynIk.fzynIk{box-sizing:border-box;margin:0;overflow:hidden;}/*!sc*/
.eiQuKK.eiQuKK{box-sizing:border-box;margin:0;margin-bottom:4px;}/*!sc*/
.chJVzi.chJVzi{box-sizing:border-box;margin:0;margin-left:8px;}/*!sc*/
.kCIqph.kCIqph{box-sizing:border-box;margin:0;padding-top:32px;padding-bottom:32px;border-top:1px solid;border-color:#E6E6E6;}/*!sc*/
.dIRCmF.dIRCmF{box-sizing:border-box;margin:0;display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-box-pack:justify;-webkit-justify-content:space-between;-ms-flex-pack:justify;justify-content:space-between;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;margin-bottom:16px;}/*!sc*/
.kmAORk.kmAORk{box-sizing:border-box;margin:0;margin-bottom:24px;}/*!sc*/
.bPERLb.bPERLb{box-sizing:border-box;margin:0;margin-bottom:-8px;}/*!sc*/
What should I do with this? Is this not achievable with BeautifulSoup?
Developer tools operate on the live browser DOM. What you see when inspecting the page is not the original HTML but a modified version, after the browser has cleaned it up and executed JavaScript.
Requests does not execute JavaScript, so the content can deviate slightly, but you can still scrape it. Just take a closer look at your soup.
Example (project titles)
from bs4 import BeautifulSoup
import requests
url_news = "https://www.houzz.com.au/professionals/home-builders/turrell-building-pty-ltd-pfvwau-pf~1099128087"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
response = requests.get(url_news, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
[title.text for title in soup.select('#projects h3')]
Output
['Major Renovation & Master Wing',
'"The Italian Village" Private Residence',
'Country Classic',
'Residential Resort',
'Resort Style Extension, Stone and Timber',
'Old Northern Rd Estate']

Is stockx.com blocking webscraping? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 1 year ago.
I'm trying to web scrape from "https://stockx.com/" using bs4 but I get:
urllib.error.HTTPError: HTTP Error 403: Forbidden.
Is there any way I can fix this?
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = "https://stockx.com/"
uClient = uReq(my_url)
Passing a User-Agent header seems to solve the issue.
Try something like this:
from urllib.request import urlopen as uReq, Request
from bs4 import BeautifulSoup as soup
my_url = "https://stockx.com/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3"}
uClient = uReq(Request(url=my_url, headers=headers))
Be aware, though, that if the data you're trying to scrape is loaded dynamically, bs4 won't be of much help. Consider using pyppeteer, Selenium, or a similar tool for that.
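A slightly fuller sketch of the same idea, which also reports a 403 instead of raising. The helper names `build_request` and `fetch` and the `BROWSER_UA` string are illustrative assumptions, and a browser-like header is not guaranteed to get past every site's bot protection.

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

# Example browser string only; any realistic value could be used here.
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

def build_request(url):
    """Create a Request carrying a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": BROWSER_UA})

def fetch(url):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urlopen(build_request(url), timeout=10) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 403.
        print(f"HTTP {err.code} for {url}")
    except URLError as err:
        # No usable answer at all (DNS failure, refused connection, ...).
        print(f"Could not reach {url}: {err.reason}")
    return None

req = build_request("https://stockx.com/")
# urllib stores header names capitalized, hence "User-agent" in the lookup.
print(req.get_header("User-agent"))
```

Calling `fetch("https://stockx.com/")` would perform the actual request. The header name needs the hyphen: a name such as 'User Agent' is a different header and does not replace urllib's default User-Agent.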
Use Scrapy; it will retry the request and follow redirects.

Scraping Pantip Forum using BeautifulSoup

I'm trying to scrape some forum posts from http://pantip.com/tag/Isuzu
One such page is http://pantip.com/topic/35647305
I want to get each post text along with its author and timestamp into a csv file.
I'm using Beautiful Soup, but admittedly I'm a complete beginner at python and web scraping. The code that I have right now gets the required fields, but only for the first post. I need information for all posts on that thread. I tried soup.find_all() and soup.select(), but I'm not getting the desired results.
Here's the code I'm using:
from bs4 import BeautifulSoup
import urllib2
print "Reading URL..."
url = urllib2.urlopen("http://pantip.com/topic/35647305")
content = url.read()
soup = BeautifulSoup(content, "html.parser")
print "Finding desired HTML..."
table = soup.select("abbr.timeago")
print "\nScraped HTML is:"
print table
text = BeautifulSoup(str(table).strip(),"html.parser").get_text().encode("utf-8").replace("\n", "")
print "\nScraped text is:\n" + text
Any clues as to what I'm doing wrong would be deeply appreciated. Also, any suggestions as to how this could be done in a better, cleaner way are welcome.
As mentioned, I'm a beginner, so please don't mind any stupid mistakes. :-)
Thanks!
The comments are rendered using an Ajax request:
import requests
from bs4 import BeautifulSoup
params = {"tid": "35647305",  # in the url
          "type": "3"}
with requests.Session() as s:
    s.headers.update({"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36",
                      "X-Requested-With": "XMLHttpRequest"})
    r = s.get("http://pantip.com/forum/topic/render_comments", params=params)
    data = r.json()  # data["comments"] contains what you want
That gives you all the data. All you need to do is take the tid from each topic URL and update the tid in the params dict.
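Taking the tid from each URL could look roughly like this. The helper names `tid_from_url` and `comment_params` are hypothetical, and the sketch assumes topic URLs always have the form `.../topic/<digits>`, as in the question.

```python
import re

def tid_from_url(topic_url):
    """Extract the numeric topic id from a Pantip topic URL, or None."""
    match = re.search(r"/topic/(\d+)", topic_url)
    return match.group(1) if match else None

def comment_params(topic_url):
    """Build the query parameters used in the answer for one topic."""
    tid = tid_from_url(topic_url)
    if tid is None:
        raise ValueError(f"no topic id found in {topic_url!r}")
    return {"tid": tid, "type": "3"}

topic_urls = ["http://pantip.com/topic/35647305"]
for topic_url in topic_urls:
    print(comment_params(topic_url))  # {'tid': '35647305', 'type': '3'}
```

Each resulting dict can then be passed as `params` to the `s.get(...)` call from the answer.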

Python Mechanize for form with options populated by Javascript [duplicate]

This question already has an answer here:
Python mechanize javascript
(1 answer)
Closed 8 years ago.
I'm trying to use mechanize to grab prices for New York's Metro-North Railroad from this site: http://as0.mta.info/mnr/fares/choosestation.cfm
The problem is that when you select the first option, the site uses JavaScript to populate the list of possible destinations. I have written equivalent code in Python, but I can't get it all working. Here's what I have so far:
import mechanize
import cookielib
from bs4 import BeautifulSoup
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.open("http://as0.mta.info/mnr/fares/choosestation.cfm")
br.select_form(name="form1")
br.form.set_all_readonly(False)
origin_control = br.form.find_control("orig_stat", type="select")
origin_control_list = origin_control.items
origin_control.value = [origin_control.items[0].name]
destination_control_list = reFillList(0, origin_control_list)
destination_control = br.form.find_control("dest_stat", type="select")
destination_control.items = destination_control_list
destination_control.value = [destination_control.items[0].name]
response = br.submit()
response_text = response.read()
print response_text
I know I didn't give you the code for the reFillList() method, because it's long, but assume it correctly creates a list of mechanize.option objects. Python doesn't complain about anything, but on submit I get the HTML for this alert:
"Fare information for travel between two lines is not available on-line. Please contact our Customer Information Center at 511 and ask to speak to a representative for further information."
Am I missing something here? Thanks for all the help!
It can't really be done without trying to make sense of the crazy logic in that function. I suggest a JavaScript solution or a full browser driven by something like Selenium.
