I'm new to Python and am working on a small script that requests and reads the HTML of a URL.
For information, the web page I'm working on is http://bitcoinity.org/markets
I would like my script to fetch the current price of the market.
I checked the HTML code and found that the price is in a tag:
<span id="last_price" value="447.77"></span>
Here is the code of my Python script:
import urllib2
import urllib
from bs4 import BeautifulSoup
url = "http://bitcoinity.org/markets"
values = {'name' : 'Michael Foord',
'location' : 'Northampton',
'language' : 'Python' }
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent' : user_agent }
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
HTML = urllib2.urlopen(req)
soup = BeautifulSoup(HTML)
HTMLText = HTML.read()
HTML.close()
#print soup.prettify()
#print HTMLText
The problem is that the output of this script (with both methods, BeautifulSoup and read()) looks like this:
</span>
<span id="last_price">
</span>
The "value=" attribute is missing and the markup has changed, so I don't know whether the server doesn't allow me to request this value or whether there is a problem with my code.
All help is welcome! :)
(Sorry for my bad English, I'm not a native speaker.)
The price is calculated by a set of JavaScript functions, so the urllib2 + BeautifulSoup approach will not work in this case: the HTML the server sends does not yet contain the value.
Consider using a tool that drives a real browser, like selenium:
>>> from selenium import webdriver
>>> driver = webdriver.Firefox()
>>> driver.get('http://bitcoinity.org/markets')
>>> driver.find_element_by_id('last_price').text
u'0.448'
I'm not sure BeautifulSoup or selenium are the right tools for this task. They're actually a very poor solution.
Since we're talking about "stock" prices (bitcoin in this case), it is much better to feed your app/script with real-time market data. Bitcoinity's default "current price" is actually Bitstamp's price, and you can get it directly from Bitstamp's API in two ways.
HTTP API
Here's the ticker you need to feed your app with: https://www.bitstamp.net/api/ticker/ and here is how you can get the last price (the 'last' value of that JSON is what you are looking for):
import urllib2
import json
req = urllib2.Request("https://www.bitstamp.net/api/ticker/")
opener = urllib2.build_opener()
f = opener.open(req)
# use a separate name so the json module is not shadowed
ticker = json.loads(f.read())
print 'Bitcoin last price is = '+ticker['last']
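For readers on Python 3, the same lookup can be sketched with the standard library only. This is a minimal sketch rather than the answer's original code: the function names `last_price_from_ticker` and `fetch_ticker` are made up for illustration, and it assumes the endpoint still returns a JSON object with a 'last' key, as described above.

```python
import json
import urllib.request

TICKER_URL = "https://www.bitstamp.net/api/ticker/"  # URL taken from the answer above


def last_price_from_ticker(payload):
    """Return the 'last' field of a ticker JSON document (str or bytes)."""
    ticker = json.loads(payload)
    return ticker["last"]


def fetch_ticker(url=TICKER_URL, timeout=10):
    """Download the raw ticker body; parsing is left to the caller."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

Passing the result of `fetch_ticker()` to `last_price_from_ticker()` would mirror the Python 2 snippet. Keeping the download and the parsing apart makes the parsing step easy to check against a saved response without touching the network.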
Websockets API
This is how bitcoinity, bitcoinwisdom, etc. grab prices and market info in order to show them to you in real time. For this you'll need a Pusher client package for Python, since Bitstamp uses Pusher for websockets.
I have a simple script where I want to scrape a menu from a url:
https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822
When I inspect the page using dev tools, I see that the menu is contained in the menu section <div class="menu-area" id="section_1026228">
So my script is fairly simple as follows:
import requests
from bs4 import BeautifulSoup
venue_url = 'https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822'
response = requests.get(venue_url, headers = {'User-agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.text, 'html.parser')
menu = soup.find('div', {'class': 'menu-area'})
print(menu.text)
I have tried this on a locally saved copy of the page and it works. But when I run it against the live URL using the requests library, it does not work: it cannot find the div and throws this error:
print(menu.text)
AttributeError: 'NoneType' object has no attribute 'text'
which basically means it cannot find the div. Does anyone know why this is happening and how to fix it?
I just logged out in my browser and it showed me a different page. However, my script has no login part at all, and I'm not even sure how that would work.
You can log in with requests.Session. [It doesn't work with all sites, but it seems to be enough for this site so far.]
# import requests
sess = requests.Session()
headers = {'user-agent': 'Mozilla/5.0'}
data = {'username': 'YOUR_EMAIL/USERNAME', 'password': 'YOUR_PASSWORD'}
loginResp = sess.post('https://untappd.com/login', headers=headers, data=data)
print(loginResp.status_code, loginResp.reason, 'from', loginResp.url) ## should print 200 OK...
response = sess.get(venue_url, headers = {'User-agent': 'Mozilla/5.0'})
## CAN CONTINUE AS BEFORE ##
I've edited my solution to one of your previous questions about this site to include cookies so that the site will treat you as logged in. For example:
# venue_url = 'https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822'
gloryMenu = scrape_untappd_menu(venue_url, cookies=sess.cookies)
will collect the menu data for that venue.
Note: They have a captcha when logging in, so I was worried it would be too hard to automate. If it becomes an issue, you can [probably] still log in with your browser before going to the page and then paste the request from your network log into curlconverter to get the cookies as a dictionary. Of course, the process is then no longer fully automated, since you'll have to repeat this manual login every time the cookies expire (which could be as soon as a few hours). If you wanted to automate the login at that point, you might have to use some kind of browser automation, such as selenium.
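If you would rather not go through curlconverter, the cookie dictionary can also be built by hand from the Cookie header copied out of the browser's network log. The following is a small sketch using only the standard library; the helper name `cookie_header_to_dict` and the cookie names and values are hypothetical, not Untappd's real ones.

```python
from http.cookies import SimpleCookie


def cookie_header_to_dict(header):
    """Turn a copied 'Cookie:' header value into a plain dict."""
    jar = SimpleCookie()
    jar.load(header)
    return {name: morsel.value for name, morsel in jar.items()}


# Hypothetical cookie names and values, for illustration only.
copied = "session_id=abc123; remember=1"
cookies = cookie_header_to_dict(copied)
print(cookies)
```

The resulting dict can be passed as the `cookies=` argument of `requests.get` or `sess.get`. Note that `SimpleCookie` is fairly strict about cookie syntax, so an unusual header may come out incomplete; it is worth comparing the dict against what the browser shows.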
I've written a script that looks up a few blogs and sees if a new post has been added. However, when I try to do this on Patreon I cannot find the right element with bs4.
Let's take https://www.patreon.com/cubecoders for example.
Say I want to get the number of exclusive posts under the 'Become a patron to' section, which would be 25 as of now.
This code works just fine:
import requests
from bs4 import BeautifulSoup
plain_html = requests.get("https://www.patreon.com/cubecoders").text
full_html = BeautifulSoup(plain_html, "html.parser")
text_of_newest_post = full_html.find("div", class_="sc-AxjAm fXpRSH").text
print(text_of_newest_post)
Output: 25
Now, I want to get the title of the newest post, which would be 'New in AMP 2.0.2 - Integrated SCP/SFTP server!' as of now.
I inspect the title in my browser and see that it is contained by a span tag with the class 'sc-1di2uql-1 vYcWR'.
However, when I try to run this code I cannot fetch the element:
import requests
from bs4 import BeautifulSoup
plain_html = requests.get("https://www.patreon.com/cubecoders").text
full_html = BeautifulSoup(plain_html, "html.parser")
text_of_newest_post = full_html.find("span", class_="sc-1di2uql-1 vYcWR")
print(text_of_newest_post)
Output: None
I've already tried to fetch the element with XPath or CSS selector but couldn't do it. I thought it might be because the site is rendered first with JavaScript and thus I cannot access the elements before they are rendered correctly.
When I use Selenium to render the site first I can see the title when printing out all div tags on the page but when I want to get only the very first title I can't access it.
Do you guys know a workaround maybe?
Thanks in advance!
EDIT:
In Selenium I can do this:
from selenium import webdriver
browser = webdriver.Chrome(r"C:\webdrivers\chromedriver.exe")
browser.get("https://www.patreon.com/cubecoders")
divs = browser.find_elements_by_tag_name("div")
def find_text(divs):
    for div in divs:
        for span in div.find_elements_by_tag_name("span"):
            if span.get_attribute("class") == "sc-1di2uql-1 vYcWR":
                return span.text
print(find_text(divs))
browser.close()
Output: New in AMP 2.0.2 - Integrated SCP/SFTP server!
When I search directly for the spans with class 'sc-1di2uql-1 vYcWR' from the start, it won't give me the result though. Could it be that the find_elements method does not look deeper inside for nested tags?
The data you see is loaded via Ajax from their API. You can use the requests module to load the data.
For example:
import re
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.patreon.com/cubecoders'
api_url = 'https://www.patreon.com/api/posts'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': url
}
with requests.session() as s:
    html_text = s.get(url, headers=headers).text
    campaign_id = re.search(r'https://www\.patreon\.com/api/campaigns/(\d+)', html_text).group(1)
    data = s.get(api_url, headers=headers, params={'filter[campaign_id]': campaign_id, 'filter[contains_exclusive_posts]': 'true', 'sort': '-published_at'}).json()
    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))
    # print some information to screen:
    for d in data['data']:
        print('{:<70} {}'.format(d['attributes']['title'], d['attributes']['published_at']))
Prints:
New in AMP 2.0.2 - Integrated SCP/SFTP server! 2020-07-17T13:28:49.000+00:00
AMP Enterprise Pricing Reveal! 2020-07-07T10:02:02.000+00:00
AMP Enterprise Edition Waiting List 2020-07-03T13:25:35.000+00:00
Upcoming changes to the user system 2020-05-29T10:53:43.000+00:00
More video tutorials! What do you want to see? 2020-05-21T12:20:53.000+00:00
Third AMP tutorial - Windows installation! 2020-05-21T12:19:23.000+00:00
Another day, another video tutorial! 2020-05-08T22:56:45.000+00:00
AMP Video Tutorial - Out takes! 2020-05-05T23:01:57.000+00:00
AMP Video Tutorials - Installing AMP on Linux 2020-05-05T23:01:46.000+00:00
What is the AMP Console Assistant (AMPCA), and why does it exist? 2020-05-04T01:14:39.000+00:00
Well that was unexpected... 2020-05-01T11:21:09.000+00:00
New Goal - MariaDB/MySQL Support! 2020-04-22T13:41:51.000+00:00
Testing out AMP Enterprise Features 2020-03-31T18:55:42.000+00:00
Temporary feature unlock for all Patreon backers! 2020-03-11T14:53:31.000+00:00
Preparing for Enterprise 2020-03-11T13:09:40.000+00:00
Aarch64/ARM64 and Raspberry Pi is here! 2020-03-06T19:07:09.000+00:00
Aarch64/ARM64 and Raspberry Pi progress! 2020-02-26T17:53:53.000+00:00
Wallpaper! 2020-02-13T11:04:39.000+00:00
Instance Templating - Make once, deploy many. 2020-02-06T15:26:09.000+00:00
Time for a new module! 2020-01-07T13:41:17.000+00:00
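The campaign id lookup is the step most likely to break if the page markup changes, because `re.search(...)` returns None when nothing matches and `.group(1)` then raises AttributeError. A slightly more defensive version might look like the sketch below; the helper name `extract_campaign_id` and the sample id are placeholders, not values from the real page.

```python
import re

# Same pattern as in the answer above.
CAMPAIGN_RE = re.compile(r"https://www\.patreon\.com/api/campaigns/(\d+)")


def extract_campaign_id(html_text):
    """Return the campaign id found in the page source, or None if absent."""
    match = CAMPAIGN_RE.search(html_text)
    return match.group(1) if match else None


# Made-up fragment; 1234567 is a placeholder, not the real campaign id.
sample = '<link href="https://www.patreon.com/api/campaigns/1234567">'
print(extract_campaign_id(sample))
print(extract_campaign_id("<html></html>"))
```

Checking for None before calling the API gives a clearer failure than a traceback from deep inside the regex call.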
I have a page with a table (table id="ctl00_ContentPlaceHolder_ctl00_ctl00_GV" class="GridListings") that I need to scrape.
I usually use BeautifulSoup and urllib for this, but in this case the table takes some time to load, so it isn't captured when I try to fetch it using BS.
I cannot use PyQt4, dryscrape or windmill because of some installation issues, so the only possible way is to use Selenium/PhantomJS.
I tried the following, still with no success:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.PhantomJS()
driver.get(url)
wait = WebDriverWait(driver, 10)
# presence_of_element_located expects a single (by, value) locator tuple
table = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'table#ctl00_ContentPlaceHolder_ctl00_ctl00_GV')))
The above code doesn't give me the desired contents of the table.
How do I go about achieving this?
You can get the data using requests and bs4. With almost all, if not all, ASP.NET sites there are a few POST params that always need to be provided, like __EVENTTARGET, __EVENTVALIDATION etc.:
from bs4 import BeautifulSoup
import requests
data = {"__EVENTTARGET": "ctl00$ContentPlaceHolder$ctl00$ctl00$RadAjaxPanel_GV",
"__EVENTARGUMENT": "LISTINGS;0",
"ctl00$ContentPlaceHolder$ctl00$ctl00$ctl00$hdnProductID": "139",
"ctl00$ContentPlaceHolder$ctl00$ctl00$hdnProductID": "139",
"ctl00$ContentPlaceHolder$ctl00$ctl00$drpSortField": "Listing Number",
"ctl00$ContentPlaceHolder$ctl00$ctl00$drpSortDirection": "A-Z, Low-High",
"__ASYNCPOST": "true"}
And for the actual post, we need to add a few more values to our post data:
post = "https://seahawks.strmarketplace.com/Charter-Seat-Licenses/Charter-Seat-Licenses.aspx"
with requests.Session() as s:
    s.headers.update({"User-Agent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"})
    soup = BeautifulSoup(s.get(post).content)
    data["__VIEWSTATEGENERATOR"] = soup.select_one("#__VIEWSTATEGENERATOR")["value"]
    data["__EVENTVALIDATION"] = soup.select_one("#__EVENTVALIDATION")["value"]
    data["__VIEWSTATE"] = soup.select_one("#__VIEWSTATE")["value"]
    r = s.post(post, data=data)
    soup2 = BeautifulSoup(r.content)
    table = soup2.select_one("div.GridListings")
    print(table)
You will see the table printed when you run the code.
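To see what the hidden-field step is doing without touching the live site, here is a small offline sketch that collects every hidden input from a page using only the standard library's HTML parser. It swaps BeautifulSoup for `html.parser` purely so it has no dependencies; the class and function names and the shortened field values are made up for illustration.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            # Inputs without a value attribute are posted as empty strings.
            self.fields[attr["name"]] = attr.get("value") or ""


def collect_hidden_inputs(html):
    parser = HiddenInputCollector()
    parser.feed(html)
    return parser.fields


# Shortened, made-up form; real __VIEWSTATE values are much longer.
sample = '''
<form name="aspnetForm">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz" />
  <input type="text" name="search" value="ignored" />
</form>
'''
print(collect_hidden_inputs(sample))
```

The returned dict could be merged into the fixed parameters (for example with `data.update(...)`) before posting, so that the per-request tokens always come from the page that was just fetched.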
If you want to scrape something, it helps to first install a web debugger (Firebug for Mozilla Firefox, for example) to watch how the website you want to scrape works.
Next, you need to replicate how the website talks to its back end.
As you said, the content that you want to scrape is loaded asynchronously (only once the document is ready).
Assuming the debugger is running and you have refreshed the page, you will see the following request on the network tab:
POST https://seahawks.strmarketplace.com/Charter-Seat-Licenses/Charter-Seat-Licenses.aspx
The final process flow to reach your goal will be:
1/ Use the requests Python module
2/ Open a requests session to the site's index page (with cookie handling)
3/ Scrape all the inputs of the specific POST form request
4/ Build a POST parameter dict containing all the input names and values scraped in the previous step, plus some specific fixed params
5/ POST the request (with the required data)
6/ Finally, use the BS4 module (as usual) to parse the returned HTML and scrape your data
Please see the working code below:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
from bs4 import BeautifulSoup
import requests
base_url="https://seahawks.strmarketplace.com/Charter-Seat-Licenses/Charter-Seat-Licenses.aspx"
#create requests session
s = requests.session()
#get index page
r=s.get(base_url)
#soup page
bs=BeautifulSoup(r.text)
#extract FORM html
form_soup= bs.find('form',{'name':'aspnetForm'})
#extracting all inputs
input_div = form_soup.findAll("input")
#build the data parameters for POST request
#we add some required <fixed> data parameters for post
data={
'__EVENTARGUMENT':'LISTINGS;0',
'__EVENTTARGET':'ctl00$ContentPlaceHolder$ctl00$ctl00$RadAjaxPanel_GV',
'__EVENTVALIDATION':'/wEWGwKis6fzCQLDnJnSDwLq4+CbDwK9jryHBQLrmcucCgL56enHAwLRrPHhCgKDk6P+CwL1/aWtDQLm0q+gCALRvI2QDAKch7HjBAKWqJHWBAKil5XsDQK58IbPAwLO3dKwCwL6uJOtBgLYnd3qBgKyp7zmBAKQyTBQK9qYAXAoieq54JAuG/rDkC1djKyQMC1qnUtgoC0OjaygUCv4b7sAhfkEODRvsa3noPfz2kMsxhAwlX3Q=='
}
#we add some <dynamic> data parameters
for input_d in input_div:
    try:
        data[ input_d['name'] ] =input_d['value']
    except KeyError:
        pass #skip input fields without a name or value
#post request
r2=s.post(base_url,data=data)
#write the result (binary mode, since we write encoded bytes)
with open("post_result.html","wb") as f:
    f.write(r2.text.encode('utf8'))
Now take a look at the content of "post_result.html" and you will find the data!
Regards
I am trying to automatically download PDFs from URLs like this to make a library of UN resolutions.
If I use beautiful soup or mechanize to open that URL, I get "Your browser does not support frames" -- and I get the same thing if I use the copy as curl feature in chrome dev tools.
The standard advice for the "Your browser does not support frames" when using mechanize or beautiful soup is to follow the source of each individual frame and load that frame. But if I do so, I get to an error message that the page is not authorized.
How can I proceed? I guess I could try this in zombie or phantom but I would prefer to not use those tools as I am not that familiar with them.
Okay, this was an interesting task to do with requests and BeautifulSoup.
There is a set of underlying calls to un.org and daccess-ods.un.org that are important and set relevant cookies. This is why you need to maintain requests.Session() and visit several urls before getting access to the pdf.
Here's the complete code:
import re
from urlparse import urljoin
from bs4 import BeautifulSoup
import requests
BASE_URL = 'http://www.un.org/en/ga/search/'
URL = "http://www.un.org/en/ga/search/view_doc.asp?symbol=A/RES/68/278"
BASE_ACCESS_URL = 'http://daccess-ods.un.org'
# start session
session = requests.Session()
response = session.get(URL, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'})
# get frame links
soup = BeautifulSoup(response.text)
frames = soup.find_all('frame')
header_link, document_link = [urljoin(BASE_URL, frame.get('src')) for frame in frames]
# get header
session.get(header_link, headers={'Referer': URL})
# get document html url
response = session.get(document_link, headers={'Referer': URL})
soup = BeautifulSoup(response.text)
content = soup.find('meta', content=re.compile('URL='))['content']
document_html_link = re.search('URL=(.*)', content).group(1)
document_html_link = urljoin(BASE_ACCESS_URL, document_html_link)
# follow html link and get the pdf link
response = session.get(document_html_link)
soup = BeautifulSoup(response.text)
# get the real document link
content = soup.find('meta', content=re.compile('URL='))['content']
document_link = re.search('URL=(.*)', content).group(1)
document_link = urljoin(BASE_ACCESS_URL, document_link)
print document_link
# follow the frame link with login and password first - would set the important cookie
auth_link = soup.find('frame', {'name': 'footer'})['src']
session.get(auth_link)
# download file
with open('document.pdf', 'wb') as handle:
    response = session.get(document_link, stream=True)
    for block in response.iter_content(1024):
        if not block:
            break
        handle.write(block)
You should probably extract separate blocks of code into functions to make it more readable and reusable.
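As one example of such a function, the meta-refresh step, which appears twice in the script, could be pulled out roughly as follows. This is a Python 3 sketch (`urljoin` lives in `urllib.parse` there); the helper name `meta_refresh_target` and the sample path are hypothetical.

```python
import re
from urllib.parse import urljoin  # Python 3 home of urlparse.urljoin

BASE_ACCESS_URL = "http://daccess-ods.un.org"


def meta_refresh_target(content, base=BASE_ACCESS_URL):
    """Resolve the URL=... part of a meta refresh 'content' value against base."""
    match = re.search(r"URL=(.*)", content)
    if match is None:
        return None
    return urljoin(base, match.group(1).strip())


# Placeholder path, not a real document location.
print(meta_refresh_target("0; URL=/TMP/example.html"))
```

Both places that currently repeat the `re.search` and `urljoin` pair could then call this helper with the `content` attribute of the meta tag.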
FYI, all of this could be done more easily through a real browser with the help of selenium or Ghost.py.
Hope that helps.
I asked a question about a general approach to crawling and saving webpages.
Part of the original question was how to crawl and save a lot of "About" pages from the Internet.
After some further research, I found some options for both scraping and parsing (listed at the bottom).
Today, I ran into another Ruby discussion about how to scrape Google search results. This provides a great alternative for my problem, since it saves all the effort on the crawling part.
The new question is: in Python, how do I scrape Google search results for a given keyword, in this case "About", and get the links for further parsing?
What are the best methods and libraries to use (measured by how easy they are to learn and implement)?
P.S. This website implements exactly the same thing, but it is closed source and asks for money for more results. I'd prefer to do it myself if no open-source option is available, and learn more Python in the meantime.
Oh, by the way, advice on parsing the links from search results would be nice, if any. Again, easy to learn and easy to implement. I just started learning Python. :P
Final update: problem solved. The code below uses xgoogle; please read the note in the section below in order to make xgoogle work.
import time, random
from xgoogle.search import GoogleSearch, SearchError
f = open('a.txt','wb')
for i in range(0,2):
    wt = random.uniform(2, 5)
    gs = GoogleSearch("about")
    gs.results_per_page = 10
    gs.page = i
    results = gs.get_results()
    #Try not to annoy Google, with a random short wait
    time.sleep(wt)
    print 'This is the %dth iteration and waited %f seconds' % (i, wt)
    for res in results:
        f.write(res.url.encode("utf8"))
        f.write("\n")
print "Done"
f.close()
Note on xgoogle (suggested in the answer below by Mike Pennington):
The latest version from its GitHub repository does not work out of the box, probably due to changes in Google's search results. These two replies (a b) on the home page of the tool give a solution, and it currently still works with this tweak. But it may stop working again some day due to changes or blocking on Google's side.
Resources known so far:
For scraping, Scrapy seems to be a popular choice, a web app called ScraperWiki is very interesting, and there is another project that extracts its library for offline/local usage. Mechanize was brought up several times in different discussions too.
For parsing HTML, BeautifulSoup seems to be one of the most popular choices. lxml too, of course.
You may find xgoogle useful... much of what you seem to be asking for is there...
There is the twill library for emulating a browser. I used it when I needed to log in with a Google email account. While it's a great tool with a great idea, it's pretty old and seems to lack support nowadays (the latest version was released in 2007).
It might be useful if you want to retrieve results that require cookie handling or authentication. twill is likely one of the best choices for that purpose.
BTW, it's based on mechanize.
As for parsing, you are right, BeautifulSoup and Scrapy are great. One of the cool things about BeautifulSoup is that it can handle invalid HTML (unlike Genshi, for example).
Have a look at this awesome urllib wrapper for web scraping https://github.com/mattseh/python-web/blob/master/web.py
Another option to scrape Google search results using Python is the one by ZenSERP.
I like the API-first approach which is easy to use and the JSON results are easily integrated into our solution.
Here is an example for a curl request:
curl "https://app.zenserp.com/api/search" -F "q=Pied Piper" -F "location=United States" -F "search_engine=google.com" -F "language=English" -H "apikey: APIKEY"
And the response:
{
    "q": "Pied Piper",
    "domain": "google.com",
    "location": "United States",
    "language": "English",
    "url": "https://www.google.com/search?q=Pied%20Piper&num=100&hl=en&gl=US&gws_rd=cr&ie=UTF-8&oe=UTF-8&uule=w+CAIQIFISCQs2MuSEtepUEUK33kOSuTsc",
    "total_results": 17100000,
    "auto_correct": "",
    "auto_correct_type": "",
    "results": []
}
A Python example:
import requests
headers = {
'apikey': 'APIKEY',
}
params = (
('q', 'Pied Piper'),
('location', 'United States'),
('search_engine', 'google.com'),
('language', 'English'),
)
response = requests.get('https://app.zenserp.com/api/search', headers=headers, params=params)
To extract links from multiple pages of Google Search results you can use SerpApi. It's a paid API with a free trial.
Full example
import os
# Python package: https://pypi.org/project/google-search-results
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "about",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
pages = search.pagination()
for result in pages:
    print(f"Current page: {result['serpapi_pagination']['current']}\n")
    for organic_result in result["organic_results"]:
        print(
            f"Title: {organic_result['title']}\nLink: {organic_result['link']}\n"
        )
Output
Current page: 12
URL: https://fi.google.com/
URL: https://www.mayoclinic.org/about-mayo-clinic
...
Current page: 18
URL: https://igem.org/About
URL: https://www.ieee.org/
URL: https://www.cancer.org/
...
Disclaimer: I work at SerpApi.
This one works well at the moment. For any search, the scraper keeps grabbing titles and their links, traversing all subsequent pages until no next page is left or your IP address is banned. Make sure your bs4 version is >= 4.7.0, as I've used a pseudo CSS selector (:has) in the script.
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
base = "https://www.google.de"
link = "https://www.google.de/search?q={}"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}
def grab_content(link):
    res = requests.get(link,headers=headers)
    soup = BeautifulSoup(res.text,"lxml")
    for container in soup.select("[class='g'] a[href^='http'][data-ved]:has(h3)"):
        post_title = container.select_one("h3").get_text(strip=True)
        post_link = container.get('href')
        yield post_title,post_link
    next_page = soup.select_one("a[href][id='pnnext']")
    if next_page:
        next_page_link = urljoin(base,next_page.get("href"))
        yield from grab_content(next_page_link)
if __name__ == '__main__':
    search_keyword = "python"
    qualified_link = link.format(search_keyword.replace(" ","+"))
    for item in grab_content(qualified_link):
        print(item)
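One small caveat about the script above: `search_keyword.replace(" ","+")` only handles spaces, so keywords containing characters such as `&` or `+` would produce a malformed query string. A hedged alternative is to let `urllib.parse.urlencode` do the escaping; the helper name `build_search_url` below is made up for illustration.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.google.de/search"  # same host as in the script above


def build_search_url(keyword):
    """Build the search URL with the keyword safely percent-encoded."""
    return SEARCH_URL + "?" + urlencode({"q": keyword})


print(build_search_url("web scraping"))
print(build_search_url("c++ & python"))
```

The first call yields the same URL the original `replace` approach would build; the second shows the reserved characters being escaped instead of being sent through unchanged.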
This can be done using the google and beautifulsoup4 packages. Install them from the command line with:
pip install google beautifulsoup4
Then run the simplified code below:
import webbrowser, googlesearch as gs
def direct(txt):
    print(f"sure, searching '{txt}'...")
    results=gs.search(txt,num=1,stop=1,pause=0)
    #num, stop denotes number of search results you want
    for link in results:
        print(link)
        webbrowser.open_new_tab(link)#to open the results in browser
direct('cheap thrills on Youtube') #this will play the song on YouTube
#(for this, keep num=1,stop=1)
TIP: Using this, you can also make a small Virtual Assistant that will open the top search result in browser for your given query(txt) in natural language.
Feel free to comment in case of difficulty while running this code:)
from urllib.request import urlopen
from bs4 import BeautifulSoup
import urllib.request
import re
query=input("query>>")
query=query.strip().split()
query="+".join(query)
html = "https://www.google.co.in/search?site=&source=hp&q="+query+"&gws_rd=ssl"
req = urllib.request.Request(html, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(urlopen(req).read(),"html.parser")
#Regex
reg=re.compile(".*&sa=")
links = []
#Parsing web urls
for item in soup.find_all('h3', attrs={'class' : 'r'}):
    line = (reg.match(item.a['href'][7:]).group())
    links.append(line[:-4])
print(links)
This should be handy. For more, see:
https://github.com/goyal15rajat/Crawl-google-search.git
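The slicing in the script above (`[7:]` to drop the `/url?q=` prefix and `[:-4]` to drop the trailing `&sa=`) depends on the exact shape of the link. Assuming the hrefs really do look like `/url?q=<target>&sa=...`, a less brittle way to get the target is to parse the query string; the helper name `target_from_redirect` and the sample href are made up for illustration.

```python
from urllib.parse import parse_qs, urlparse


def target_from_redirect(href):
    """Return the q= parameter of a '/url?q=...' link, or None if missing."""
    query = parse_qs(urlparse(href).query)
    values = query.get("q")
    return values[0] if values else None


# Made-up href in the shape the script above expects.
href = "/url?q=https://example.com/about&sa=U&ved=abc"
print(target_from_redirect(href))
```

Unlike the slicing approach, `parse_qs` also percent-decodes the value and does not care about the order of the other parameters.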
Here is a Python script using requests and BeautifulSoup to scrape Google results.
import urllib
import requests
from bs4 import BeautifulSoup
# desktop user-agent
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
# mobile user-agent
MOBILE_USER_AGENT = "Mozilla/5.0 (Linux; Android 7.0; SM-G930V Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36"
query = "hackernoon How To Scrape Google With Python"
query = query.replace(' ', '+')
URL = f"https://google.com/search?q={query}"
headers = {"user-agent": USER_AGENT}
resp = requests.get(URL, headers=headers)
if resp.status_code == 200:
    soup = BeautifulSoup(resp.content, "html.parser")
    results = []
    for g in soup.find_all('div', class_='r'):
        anchors = g.find_all('a')
        if anchors:
            link = anchors[0]['href']
            title = g.find('h3').text
            item = {
                "title": title,
                "link": link
            }
            results.append(item)
    print(results)