Scraping Google News headlines - Python

Google News is searchable by keyword, and the search can be narrowed down to a certain time period.
I tried doing the search on the website and then using the URL of the results page to reverse-engineer the search in Python, like this:
import urllib2
url = 'https://www.google.com/search?hl=en&gl=uk&tbm=nws&authuser=0&q=apple&oq=apple&gs_l=news-cc.3..43j0l9j43i53.5710.6848.0.7058.5.4.0.1.1.0.66.230.4.4.0...0.0...1ac.1.SRcIeXL5d48'
handler = urllib2.urlopen(url)
html = handler.read()
However, I get a 403 error. This method works with other websites, such as bbc.co.uk, so evidently Google does not want me to scrape its site with Python.
So I have two questions:
1) Is it possible to bypass this restriction Google has placed? If so, how?
2) Are there any other scrapeable news sites where I can search for news on a keyword for a given period?
For either option, I don't mind using a paid service, so such suggestions are welcome too.
Thanks in advance,
K.

Try setting a User-Agent header:
req = urllib2.Request(url)  # url from the question above
req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
response = urllib2.urlopen(req)
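If you are on Python 3, urllib2 has been split into urllib.request; a minimal equivalent sketch (the User-Agent string is just an example of a current browser UA):
import urllib.request

url = 'https://www.google.com/search?hl=en&gl=uk&tbm=nws&q=apple'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0'})
with urllib.request.urlopen(req) as response:
    html = response.read().decode('utf-8', errors='replace')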

Related

AWS Lambda - Python web scraping - unable to bypass Cloudflare anti-bot protection from an AWS IP but working from a local IP

I've built a simple Python web scraper that works as expected locally but does not work on AWS Lambda -- specifically and only for the website I would like to scrape. I've tested just the scraping portion of the code and can confirm that it is a Cloudflare anti-bot issue.
I've combed through relevant SO and Medium articles and tried:
adding the appropriate headers
specifying a user agent
using different libraries (urllib, cloudscraper, selenium)
using a virtual display (pyvirtualdisplay with xvfb), as described in this post: How to bypass Cloudflare bot protection in selenium
Example code of the urllib version to illustrate the question:
import json
import urllib.request

def lambda_handler(event, context):
    url = 'https://disboard.org/servers/tag/python/15'
    headers = {}
    headers['User-Agent'] = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"
    req = urllib.request.Request(url, headers=headers)
    resp = urllib.request.urlopen(req)
    respData = resp.read()
    return respData
The above code returns a 403 status + reCAPTCHA.
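For reference, the cloudscraper variant mentioned above looked roughly like this (a sketch, not the exact code that was tried):
import cloudscraper

def lambda_handler(event, context):
    # cloudscraper attempts to pass Cloudflare's browser checks before returning the page
    scraper = cloudscraper.create_scraper()
    resp = scraper.get('https://disboard.org/servers/tag/python/15')
    return {'status': resp.status_code, 'body': resp.text[:500]}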
I understand that data center IP ranges get handled more carefully by antispam than residential IPs -- is there any workaround for this?
Thank you in advance.

Python requests.get only responds if I don't specify page number

I am scraping web data with Python using requests and Beautiful Soup. I have found that two of the websites I am scraping from only respond if I do not specify the page number.
The following code works and allows me to extract the data needed:
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'}
r = requests.get('https://www.milkround.com/jobs/graduate-software-engineer', headers = headers)
soup = BeautifulSoup(r.content, 'html5lib')
table = soup.find('div', attrs = {'class':'col-xs-12 job-results clearfix'})
If however I change the link to specify a page number, such as:
r = requests.get('https://www.milkround.com/jobs/graduate-software-engineer?page=2', headers = headers)
Then the request never responds. There is no error code; the console just waits indefinitely. What is causing this and how do I resolve it?
EDIT: I opened the site in Incognito manually. It seems that when opening with the page number I get an "access denied" response, but if I refresh the page it lets me in?
That's because, as you have seen, you cannot access the page numbers on the website from outside. So if you are logged in and have some sort of cookie, add it to your headers.
What I just checked on the website is that you may be requesting the wrong URI. There are no page numbers. Did you add ?page= yourself?
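A rough sketch of passing a session cookie along with the headers (the cookie name and value below are placeholders; copy the real ones from your logged-in browser's developer tools):
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)',
    # placeholder - replace with whatever cookie your logged-in browser session sends
    'cookie': 'sessionid=YOUR_SESSION_COOKIE'
}
r = requests.get('https://www.milkround.com/jobs/graduate-software-engineer?page=2', headers=headers)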
The problem you're tackling is about web scraping. In your case, the web page blocks you because your header declaration lacks a proper user-agent definition.
To get it to work you need to include a user-agent declaration like this:
headers = {'user-agent': 'Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3'}
You can dive more deeply into the problem of writing good web scrapers here:
https://towardsdatascience.com/5-strategies-to-write-unblock-able-web-scrapers-in-python-5e40c147bdaf
A list of proper user-agents can be found here:
https://webscraping.com/blog/User-agents/
Hope this gets you going with your problem.

Using linkGrabber to get 'href' from a Google search in Python

OK, so all I want to do is get the very first link from the first Google search result. I tried to use BeautifulSoup but it didn't work out at all; I couldn't find a way to get the link. I tried using linkGrabber, so now I get all the URLs in the Google search (I have limited the results to only 1 per page). My code is:
import re
import linkGrabber
import urllib
input = str(input('Give movie name: '))
input = urllib.parse.quote_plus(input)
imdb_s = '+imdb+review'
n = 1
g_s = 'https://www.google.com/search?q='+ input + imdb_s +'&num=' + str(n)
links = linkGrabber.Links(g_s)
gb = links.find(pretty=True)
print(gb)
However, when I print, I get about 15 links that are from Google and which I do not want to use; I want to focus on only one specific href and get that. Can anyone please help me?
You can use the Google search library - I think pip install google. This library also relies on Beautiful Soup, but is built to return only search results. The problem is that the page Google returns when you search has ads and a bunch of other links that aren't the actual search results.
You can also add "site:imdb.com" to your query to search only on IMDb.
That said, I've stopped using that for my googling needs because it's against Google's terms of service. I'm not moralizing anything, but the reality is that I can't get much reliability, because Google keeps sniffing out bots and reCAPTCHA-ing them.
The correct way to do it would be to use Google's Custom Search API, which is also good for returning only the info you need, and it's free for 100 searches per day.
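A minimal sketch of the Custom Search JSON API approach (the API key and search engine ID are placeholders you create in the Google developer console; the query string is just an example):
import requests

API_KEY = 'YOUR_API_KEY'      # placeholder - from the Google Cloud console
CX = 'YOUR_SEARCH_ENGINE_ID'  # placeholder - from the Programmable Search Engine console

params = {'key': API_KEY, 'cx': CX, 'q': 'inception imdb review', 'num': 1}
resp = requests.get('https://www.googleapis.com/customsearch/v1', params=params)
items = resp.json().get('items', [])
if items:
    print(items[0]['link'])  # URL of the first result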
To get the very first link you can use the select_one() bs4 method.
It didn't work because you didn't specify a user-agent (headers), which is what fakes a real user visit; without it Google treats your request as coming from the default python-requests user-agent.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
Code and example in the online IDE:
import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=minecraft', headers=headers).text
soup = BeautifulSoup(html, 'lxml')

for container in soup.findAll('div', class_='tF2Cxc'):
    title = container.select_one('.DKV0Md').text
    link = container.find('a')['href']
    print(f'{title}\n{link}')

# part of the output:
'''
Minecraft Official Site | Minecraft
https://www.minecraft.net/en-us/
Minecraft Classic
https://classic.minecraft.net/
'''
Alternatively, you can do it with the Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.
The main difference is that you don't have to figure out why Google blocks you, or why a certain selector gives the wrong output even though it shouldn't. That's already handled for the end user, with JSON output.
Check out the Playground.
Code to integrate:
from serpapi import GoogleSearch
import os

params = {
    "api_key": os.getenv("API_KEY"),  # API key read from the environment
    "engine": "google",
    "q": "minecraft",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    title = result['title']
    link = result['link']
    print(f'{title}\n{link}')

# part of the output:
'''
Minecraft Official Site | Minecraft
https://www.minecraft.net/en-us/
Minecraft Classic
https://classic.minecraft.net/
'''
Disclaimer, I work for SerpApi.

How to scrape the Amazon deals page in Python

I want to scrape the Amazon deals page with Python and Beautiful Soup, but when I run the code I don't get any results; when I try the same code on any other Amazon page, I do get results.
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

url = 'https://www.amazon.com/international-sales-offers/b/?ie=UTF8&node=15529609011&ref_=nav_navm_intl_deal_btn'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
           'referer': 'https://www.amazon.com/'}

s = requests.session()
s.headers.update(headers)
r = s.get(url)
soup = BeautifulSoup(r.content, "lxml")

for x in soup.find_all('span', {'class': 'a-declarative'}):
    print(x.text + "\n")
When you visit that page in your browser, the page makes additional requests to get more information, it then updates the first page with that information. In your case, the url https://www.amazon.com/international-sales-offers/b/?ie=UTF8&node=15529609011&ref_=nav_navm_intl_deal_btn is just a template, and when loaded it makes additional requests to get the deal information to populate the template.
Amazon is a popular site and people have made many web scrapers for it. Check this one out. If it doesn't do what you need, just google "github amazon scraper" and you will get many options.
If you still want to code a scraper yourself, start reading up on Selenium. It is a Python package that automates a web browser, allowing you to load a web page and all its additional requests before scraping.
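A minimal Selenium sketch along those lines (assumes Chrome plus a matching chromedriver are installed; the selectors for the deal widgets still need to be worked out from the rendered page):
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument('--headless')          # run without opening a visible browser window
driver = webdriver.Chrome(options=options)  # assumes chromedriver is on PATH
driver.get('https://www.amazon.com/international-sales-offers/b/?ie=UTF8&node=15529609011')
time.sleep(5)                               # crude wait for the deal widgets to finish loading
html = driver.page_source                   # the page after JavaScript has populated the deals
driver.quit()

soup = BeautifulSoup(html, 'lxml')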

Unable to get Google search results in Python

I'm building a script to scrape Google search results. I've got this far:
import urllib
keyword = "google"
print urllib.urlopen("https://www.google.co.in/search?q=" + keyword).read()
But it gives me a reply as follows:
<!DOCTYPE html><html lang=en><meta charset=utf-8><meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width"><title>Error 403 (Forbidden)!!1</title><style>*{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}#media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/errors/logo_sm_2.png) no-repeat}#media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/errors/logo_sm_2_hr.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/errors/logo_sm_2_hr.png) 0}}#media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/errors/logo_sm_2_hr.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:55px;width:150px}</style><a href=//www.google.com/><span id=logo aria-label=Google></span></a><p><b>403.</b> <ins>That’s an error.</ins><p>Your client does not have permission to get URL <code>/search?q=google</code> from this server. (Client IP address: 117.196.168.89)<br><br>
Please see Google's Terms of Service posted at http://www.google.com/terms_of_service.html
<BR><BR><P>If you believe that you have received this response in error, please report your problem. However, please make sure to take a look at our Terms of Service (http://www.google.com/terms_of_service.html). In your email, please send us the <b>entire</b> code displayed below. Please also send us any information you may know about how you are performing your Google searches-- for example, "I'm using the Opera browser on Linux to do searches from home. My Internet access is through a dial-up account I have with the FooCorp ISP." or "I'm using the Konqueror browser on Linux to search from my job at myFoo.com. My machine's IP address is 10.20.30.40, but all of myFoo's web traffic goes through some kind of proxy server whose IP address is 10.11.12.13." (If you don't know any information like this, that's OK. But this kind of information can help us track down problems, so please tell us what you can.)</P><P>We will use all this information to diagnose the problem, and we'll hopefully have you back up and searching with Google again quickly!</P>
<P>Please note that although we read all the email we receive, we are not always able to send a personal response to each and every email. So don't despair if you don't hear back from us!</P>
<P>Also note that if you do not send us the <b>entire</b> code below, <i>we will not be able to help you</i>.</P><P>Best wishes,<BR>The Google Team</BR></P><BLOCKQUOTE>[long block of opaque diagnostic code omitted]</BLOCKQUOTE>
Doesn't Google allow its pages to be scraped?
Actually, Google doesn't, in the sense that it blocks bots. But you can use mechanize to fake a browser and get the results.
import mechanize

keyword = "google"  # the search term from the question
chrome = mechanize.Browser()
chrome.set_handle_robots(False)
chrome.addheaders = [('User-agent',
                      'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36')]

base_url = 'https://www.google.co.in/search?q='
search_url = base_url + keyword.replace(' ', '+')
htmltext = chrome.open(search_url).read()
Try this. I hope it helps.
You could also fake the headers with urllib2 to get the results.
Something like:
import urllib2
keyword = "google"
url = "https://www.google.co.in/search?q=" + keyword
# Build an opener
opener = urllib2.build_opener()
# If you are behind a proxy, build a ProxyHandler opener instead:
#opener = urllib2.build_opener(urllib2.ProxyHandler(proxies={"http": "http://proxy.corp.ads:8080"}))
# To fake the browser
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
print opener.open(url).read()
Google sees your script under a different user-agent (if you're using requests, it will be python-requests).
All you need is to specify a browser user-agent (Chrome, Mozilla, Edge, IE, Safari, ...) so Google will treat it as a "user", i.e. fake a real browser visit.
If you're using the requests library, you can specify it this way (lists of user-agents can be found on various websites):
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get('https://www.google.com/search?q=pizza is awesome', headers=headers).text
I answered the question on how to scrape Google Search result titles, summary and links with example code here.
Alternatively, you can use a third-party service such as the Google Search Engine Results API or Google Organic Results API from SerpApi. It's a paid API with a free trial.
Check out Playground to test and see the output.
Code to get raw HTML response:
import os
import urllib.request
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "london",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

html = results['search_metadata']['raw_html_file']
print(urllib.request.urlopen(html).read())
Disclaimer, I work for SerpApi.
