I've written a script that looks up a few blogs and sees if a new post has been added. However, when I try to do this on Patreon I cannot find the right element with bs4.
Let's take https://www.patreon.com/cubecoders for example.
Say I want to get the number of exclusive posts under the 'Become a patron to' section, which would be 25 as of now.
This code works just fine:
import requests
from bs4 import BeautifulSoup
plain_html = requests.get("https://www.patreon.com/cubecoders").text
full_html = BeautifulSoup(plain_html, "html.parser")
text_of_newest_post = full_html.find("div", class_="sc-AxjAm fXpRSH").text
print(text_of_newest_post)
Output: 25
Now, I want to get the title of the newest post, which would be 'New in AMP 2.0.2 - Integrated SCP/SFTP server!' as of now.
I inspect the title in my browser and see that it is contained by a span tag with the class 'sc-1di2uql-1 vYcWR'.
However, when I try to run this code I cannot fetch the element:
import requests
from bs4 import BeautifulSoup
plain_html = requests.get("https://www.patreon.com/cubecoders").text
full_html = BeautifulSoup(plain_html, "html.parser")
text_of_newest_post = full_html.find("span", class_="sc-1di2uql-1 vYcWR")
print(text_of_newest_post)
Output: None
I've already tried to fetch the element with XPath or CSS selector but couldn't do it. I thought it might be because the site is rendered first with JavaScript and thus I cannot access the elements before they are rendered correctly.
When I use Selenium to render the site first I can see the title when printing out all div tags on the page but when I want to get only the very first title I can't access it.
Do you guys know a workaround maybe?
Thanks in advance!
EDIT:
In Selenium I can do this:
from selenium import webdriver
browser = webdriver.Chrome("C:\webdrivers\chromedriver.exe")
browser.get("https://www.patreon.com/cubecoders")
divs = browser.find_elements_by_tag_name("div")
def find_text(divs):
for div in divs:
for span in div.find_elements_by_tag_name("span"):
if span.get_attribute("class") == "sc-1di2uql-1 vYcWR":
return span.text
print(find_text(divs))
browser.close()
Output: New in AMP 2.0.2 - Integrated SCP/SFTP server!
When I just try to search for the spans with class 'sc-1di2uql-1 vYcWR' from the start it won't give me the result though. Could it be that the find_elements method does not look deeper inside for nestled tags?
The data you see is loaded via Ajax from their API. You can use requests module to load the data.
For example:
import re
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.patreon.com/cubecoders'
api_url = 'https://www.patreon.com/api/posts'
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',
'Accept-Language': 'en-US,en;q=0.5',
'Referer': url
}
with requests.session() as s:
html_text = s.get(url, headers=headers).text
campaign_id = re.search(r'https://www\.patreon\.com/api/campaigns/(\d+)', html_text).group(1)
data = s.get(api_url, headers=headers, params={'filter[campaign_id]': campaign_id, 'filter[contains_exclusive_posts]': 'true', 'sort': '-published_at'}).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
# print some information to screen:
for d in data['data']:
print('{:<70} {}'.format(d['attributes']['title'], d['attributes']['published_at']))
Prints:
New in AMP 2.0.2 - Integrated SCP/SFTP server! 2020-07-17T13:28:49.000+00:00
AMP Enterprise Pricing Reveal! 2020-07-07T10:02:02.000+00:00
AMP Enterprise Edition Waiting List 2020-07-03T13:25:35.000+00:00
Upcoming changes to the user system 2020-05-29T10:53:43.000+00:00
More video tutorials! What do you want to see? 2020-05-21T12:20:53.000+00:00
Third AMP tutorial - Windows installation! 2020-05-21T12:19:23.000+00:00
Another day, another video tutorial! 2020-05-08T22:56:45.000+00:00
AMP Video Tutorial - Out takes! 2020-05-05T23:01:57.000+00:00
AMP Video Tutorials - Installing AMP on Linux 2020-05-05T23:01:46.000+00:00
What is the AMP Console Assistant (AMPCA), and why does it exist? 2020-05-04T01:14:39.000+00:00
Well that was unexpected... 2020-05-01T11:21:09.000+00:00
New Goal - MariaDB/MySQL Support! 2020-04-22T13:41:51.000+00:00
Testing out AMP Enterprise Features 2020-03-31T18:55:42.000+00:00
Temporary feature unlock for all Patreon backers! 2020-03-11T14:53:31.000+00:00
Preparing for Enterprise 2020-03-11T13:09:40.000+00:00
Aarch64/ARM64 and Raspberry Pi is here! 2020-03-06T19:07:09.000+00:00
Aarch64/ARM64 and Raspberry Pi progress! 2020-02-26T17:53:53.000+00:00
Wallpaper! 2020-02-13T11:04:39.000+00:00
Instance Templating - Make once, deploy many. 2020-02-06T15:26:09.000+00:00
Time for a new module! 2020-01-07T13:41:17.000+00:00
i'm trying to get the shipping price from this link:
https://www.banggood.com/Xiaomi-Mi-Air-Laptop-2019-13_3-inch-Intel-Core-i7-8550U-8GB-RAM-512GB-PCle-SSD-Win-10-NVIDIA-GeForce-MX250-Fingerprint-Sensor-Notebook-p-1535887.html?rmmds=search&cur_warehouse=CN
but it seems that the "strong" is empty.
i've tried few solutions but all of them gave me an empty "strong"
i'm using beautifulsoup in python 3.
for example this code led me to an empty "strong":
client = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(client.content, 'lxml')
for child in soup.find("span", class_="free_ship").children:
print(child)
The issue is the 'Free Shipping' is generated by JavaScript after the page loads, rather than being sent in the webpage.
It might obtain the Shipping price by performing a HTTP request after the page has loaded or it may be hidden within the page
You might be able to try to find the XHR Request to pulls the Shipping price using DevTools in Firefox or chrome using the 'networking' tab and using that to get the price.
Using the XHR, you can find that data:
import requests
from bs4 import BeautifulSoup
import json
url = 'https://m.banggood.com/ajax/product/dynamicPro/index.html'
payload = {
'c': 'api',
'sq': 'IY38TmCNgDhATYCmIDGxYisATHA7ANn2HwX2RNwEYrcAGAVgDNxawIQFhLpFhkOCuZFFxA'}
response = requests.get(url, params=payload).json()
data = response['result']
shipping = data['shipment']
for each in shipping.items():
print (each)
print (shipping['shipCost'])
Output:
print (shipping['shipCost'])
<b>Free Shipping</b>
I have a URL:
http://www.goudengids.be/qn/business/advanced/where/Provincie%20Antwerpen/what/restaurant
On that page there is a "next results" button which loads another 20 data point while still showing first dataset, without updating the URL. I wrote a script to scrape this page in python but it only scrapes the first 22 data point even though the "next results" button is clicked and shows about 40 data.
How can I scrape these types of website that dynamically load content
My script is
import csv
import requests
from bs4 import BeautifulSoup
url = "http://www.goudengids.be/qn/business/advanced/where/Provincie%20Antwerpen/what/restaurant/"
r = requests.get(url)
r.content
soup = BeautifulSoup(r.content)
print (soup.prettify())
g_data2 = soup.find_all("a", {"class": "heading"})
for item in g_data2:
try:
name = item.text
print name
except IndexError:
name = ''
print "No Name found!"
If you were to solve it with requests, you need to mimic what browser does when you click the "Load More" button - it sends an XHR request to the http://www.goudengids.be/q/ajax/business/results.json endpoint, simulate it in your code maintaining the web-scraping session. The XHR responses are in JSON format - no need for BeautifulSoup in this case at all:
import requests
main_url = "http://www.goudengids.be/qn/business/advanced/where/Provincie%20Antwerpen/what/restaurant/"
xhr_url = "http://www.goudengids.be/q/ajax/business/results.json"
with requests.Session() as session:
session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}
# visit main URL
session.get(main_url)
# load more listings - follow the pagination
page = 1
listings = []
while True:
params = {
"input": "restaurant Provincie Antwerpen",
"what": "restaurant",
"where": "Provincie Antwerpen",
"type": "DOUBLE",
"resultlisttype": "A_AND_B",
"page": str(page),
"offset": "2",
"excludelistingids": "nl_BE_YP_FREE_11336647_0000_1746702_6165_20130000, nl_BE_YP_PAID_11336647_0000_1746702_7575_20139729427, nl_BE_YP_PAID_720348_0000_187688_7575_20139392980",
"context": "SRP * A_LIST"
}
response = requests.get(xhr_url, params=params, headers={
"X-Requested-With": "XMLHttpRequest",
"Referer": main_url
})
data = response.json()
# collect listing names in a list (for example purposes)
listings.extend([item["bn"] for item in data["overallResult"]["searchResults"]])
page += 1
# TODO: figure out exit condition for the while True loop
print(listings)
I've left an important TODO for you - figure out an exit condition - when to stop collecting listings.
Instead of focusing on scraping HTML I think you should look at the JSON that is retrieved via AJAX. I think the JSON is less likely to be changed in the future as opposed to the page's markup. And on top of that, it's way easier to traverse a JSON structure than it is to scrape a DOM.
For instance, when you load the page you provided it hits a url to get JSON at http://www.goudengids.be/q/ajax/business/results.json.
Then it provides some url parameters to query the businesses. I think you should look more into using this to get your data as opposed to scraping the page and simulating button clicks, and etc.
Edit:
And it looks like it's using the headers set from visiting the site initially to ensure that you have a valid session. So you may have to hit the site initially to get the cookie headers and set that for subsequent requests to get the JSON from the endpoint above. I still think this will be easier and more predictable than trying to scrape HTML.
I'm a newbie to Python and I'm actually working on a little Python script that request and read the HTML of an URL.
For Information the web page that i'm working on is http://bitcoinity.org/markets ,
I would like with my script to fetch the Current Price of the market.
I checked the HTML code and i found that the Price was in a balise :
<span id="last_price" value="447.77"</span>
Here is the code of my Python script :
import urllib2
import urllib
from bs4 import BeautifulSoup
url = "http://bitcoinity.org/markets"
values = {'name' : 'Michael Foord',
'location' : 'Northampton',
'language' : 'Python' }
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent' : user_agent }
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
HTML = urllib2.urlopen(req)
soup = BeautifulSoup(HTML)
HTMLText = HTML.read()
HTML.close()
#print soup.prettify()
#print HTMLText
So the problem is that the output of this script ( with the 2 methods BeautifulSoup and read() ) is like this :
</span>
<span id="last_price">
</span>
The "value=" attribute is missing and the syntax changed , so I don't know if the server doesn't allow me to make a request of this value or if there is a problem with my code.
All Help is welcome ! :)
( Sorry for my bad english , i'm not a native )
The price is calculated via a set of javascript functions, urllib2+BeautifulSoup approach would not work in this case.
Consider using a tool that utilizes a real browser, like selenium:
>>> from selenium import webdriver
>>> driver = webdriver.Firefox()
>>> driver.get('http://bitcoinity.org/markets')
>>> driver.find_element_by_id('last_price').text
u'0.448'
I'm not sure beautifulsoup or selenium are the tools for this task. They're actually a very poor solution.
Since we're talking about "stock" prices (bitcoin in this case), it is much better if you feed your app/script with real-time market data. Bitcoinity's default "current price" is actually Bitstamp's price... You can also get it directly from the Bitstamp's API via 2 ways.
HTTP API
Here's the ticker you need to feed your app with: https://www.bitstamp.net/api/ticker/ and here how you can get the last price (It is the 'last' value of that JSON what you really are looking for)
import urllib2
import json
req = urllib2.Request("https://www.bitstamp.net/api/ticker/")
opener = urllib2.build_opener()
f = opener.open(req)
json = json.loads(f.read())
print 'Bitcoin last price is = '+json['last']
Websockets API
This is how bitcoinity, bitcoinwisdom, etc grab the prices and market info in order to show it to you in real-time. For this you'll need pusher package for python, since Bitstamp uses pusher for websockets.
I asked a question on realizing a general idea to crawl and save webpages.
Part of the original question is: how to crawl and save a lot of "About" pages from the Internet.
With some further research, I got some choices to go ahead with both on scraping and parsing (listed at the bottom).
Today, I ran into another Ruby discussion about how to scrape from Google search results. This provides a great alternative for my problem which will save all the effort on the crawling part.
The new question are: in Python, to scrape Google search results for a given keyword, in this case "About", and finally get the links for further parsing.
What are the best choices of methods and libraries to go ahead with? (in measure of easy-to-learn and easy-to-implement).
p.s. in this website, the exactly same thing is implemented, but closed and ask for money for more results. I'd prefer to do it myself if no open-source available and learn more Python in the meanwhile.
Oh, btw, advices for parsing the links from search results would be nice, if any. Still, easy-to-learn and easy-to-implement. Just started learning Python. :P
Final update, problem solved. Code using xgoogle, please read note in the section below in order to make xgoogle working.
import time, random
from xgoogle.search import GoogleSearch, SearchError
f = open('a.txt','wb')
for i in range(0,2):
wt = random.uniform(2, 5)
gs = GoogleSearch("about")
gs.results_per_page = 10
gs.page = i
results = gs.get_results()
#Try not to annnoy Google, with a random short wait
time.sleep(wt)
print 'This is the %dth iteration and waited %f seconds' % (i, wt)
for res in results:
f.write(res.url.encode("utf8"))
f.write("\n")
print "Done"
f.close()
Note on xgoogle (below answered by Mike Pennington):
The latest version from it's Github does not work by default already, due to changes in Google search results probably. These two replies (a b) on the home page of the tool give a solution and it is currently still working with this tweak. But maybe some other day it may stop working again due to Google's change/block.
Resources known so far:
For scraping, Scrapy seems to be a popular choice and a webapp called ScraperWiki is very interesting and there is another project extract it's library for offline/local usage. Mechanize was brought up quite several times in different discussions too.
For parsing HTML, BeautifulSoup seems to be the one of the most
popular choices. Of course. lxml too.
You may find xgoogle useful... much of what you seem to be asking for is there...
There is a twill lib for emulating browser. I used it when had a necessity to login with google email account. While it's a great tool with a great idea, it's pretty old and seems to have a lack of support nowadays (the latest version is released in 2007).
It might be useful if you want to retrieve results that require cookie-handling or authentication. Likely that twill is one of the best choices for that purposes.
BTW, it's based on mechanize.
As for parsing, you are right, BeautifulSoup and Scrapy are great. One of the cool things behind BeautifulSoup is that it can handle invalid HTML (unlike Genshi, for example.)
Have a look at this awesome urllib wrapper for web scraping https://github.com/mattseh/python-web/blob/master/web.py
Another option to scrape Google search results using Python is the one by ZenSERP.
I like the API-first approach which is easy to use and the JSON results are easily integrated into our solution.
Here is an example for a curl request:
curl "https://app.zenserp.com/api/search" -F "q=Pied Piper" -F "location=United States" -F "search_engine=google.com" -F "language=English" -H "apikey: APIKEY"
And the response:
{
"q": "Pied Piper",
"domain": "google.com",
"location": "United States",
"language": "English",
"url": "https://www.google.com/search?q=Pied%20Piper&num=100&hl=en&gl=US&gws_rd=cr&ie=UTF-8&oe=UTF-8&uule=w+CAIQIFISCQs2MuSEtepUEUK33kOSuTsc",
"total_results": 17100000,
"auto_correct": "",
"auto_correct_type": "",
"results": []
}
A Python code for example:
import requests
headers = {
'apikey': 'APIKEY',
}
params = (
('q', 'Pied Piper'),
('location', 'United States'),
('search_engine', 'google.com'),
('language', 'English'),
)
response = requests.get('https://app.zenserp.com/api/search', headers=headers, params=params)
To extract links from multiple pages of Google Search results you can use SerpApi. It's a paid API with a free trial.
Full example
import os
# Python package: https://pypi.org/project/google-search-results
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "about",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
pages = search.pagination()
for result in pages:
print(f"Current page: {result['serpapi_pagination']['current']}\n")
for organic_result in result["organic_results"]:
print(
f"Title: {organic_result['title']}\nLink: {organic_result['link']}\n"
)
Output
Current page: 12
URL: https://fi.google.com/
URL: https://www.mayoclinic.org/about-mayo-clinic
...
Current page: 18
URL: https://igem.org/About
URL: https://www.ieee.org/
URL: https://www.cancer.org/
...
Disclaimer: I work at SerpApi.
This one works good for this moment. If any search is made, the scraper keeps grabbing titles and their links traversing all next pages until there is no more next page is left or your ip address is banned. Make sure your bs4 version is >= 4.7.0 as I've used pseudo css selector within the script.
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
base = "https://www.google.de"
link = "https://www.google.de/search?q={}"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}
def grab_content(link):
res = requests.get(link,headers=headers)
soup = BeautifulSoup(res.text,"lxml")
for container in soup.select("[class='g'] a[href^='http'][data-ved]:has(h3)"):
post_title = container.select_one("h3").get_text(strip=True)
post_link = container.get('href')
yield post_title,post_link
next_page = soup.select_one("a[href][id='pnnext']")
if next_page:
next_page_link = urljoin(base,next_page.get("href"))
yield from grab_content(next_page_link)
if __name__ == '__main__':
search_keyword = "python"
qualified_link = link.format(search_keyword.replace(" ","+"))
for item in grab_content(qualified_link):
print(item)
This can be done using google and beautifulsoup module, install it in CMD using command given below:
pip install google beautifulsoup4
Thereafter, run this simplified code given below
import webbrowser, googlesearch as gs
def direct(txt):
print(f"sure, searching '{txt}'...")
results=gs.search(txt,num=1,stop=1,pause=0)
#num, stop denotes number of search results you want
for link in results:
print(link)
webbrowser.open_new_tab(link)#to open the results in browser
direct('cheap thrills on Youtube') #this will play the song on YouTube
#(for this, keep num=1,stop=1)
Output:
TIP: Using this, you can also make a small Virtual Assistant that will open the top search result in browser for your given query(txt) in natural language.
Feel free to comment in case of difficulty while running this code:)
from urllib.request import urlopen
from bs4 import BeautifulSoup
import urllib.request
import re
import numpy as np
count=0
query=input("query>>")
query=query.strip().split()
query="+".join(query)
html = "https://www.google.co.in/search?site=&source=hp&q="+query+"&gws_rd=ssl"
req = urllib.request.Request(html, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(urlopen(req).read(),"html.parser")
#Regex
reg=re.compile(".*&sa=")
links = []
#Parsing web urls
for item in soup.find_all('h3', attrs={'class' : 'r'}):
line = (reg.match(item.a['href'][7:]).group())
links.append(line[:-4])
print(links)
this should be handy....for more go to -
https://github.com/goyal15rajat/Crawl-google-search.git
Here is a Python script using requests and BeautifulSoup to scrape Google results.
import urllib
import requests
from bs4 import BeautifulSoup
# desktop user-agent
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
# mobile user-agent
MOBILE_USER_AGENT = "Mozilla/5.0 (Linux; Android 7.0; SM-G930V Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36"
query = "hackernoon How To Scrape Google With Python"
query = query.replace(' ', '+')
URL = f"https://google.com/search?q={query}"
headers = {"user-agent": USER_AGENT}
resp = requests.get(URL, headers=headers)
if resp.status_code == 200:
soup = BeautifulSoup(resp.content, "html.parser")
results = []
for g in soup.find_all('div', class_='r'):
anchors = g.find_all('a')
if anchors:
link = anchors[0]['href']
title = g.find('h3').text
item = {
"title": title,
"link": link
}
results.append(item)
print(results)