Getting HTML of dynamic websites with BeautifulSoup and requests - python

Code:
from bs4 import BeautifulSoup as bs
import requests
keyword = "dog"
url = f"https://www.pinterest.de/search/pins/?q={keyword}&rs=typed&term_meta[]={keyword}%7Ctyped"
r = requests.get(url)
soup = bs(r.content, "html.parser")
print(r)
Output: <Response [200]>
If I change the print(r) to print(soup), I get a lot of text that doesn't look like HTML.
Something like: ..."marketing_brand_charlotte_phoenix":"control","marketing_brand_houston_chicago":"enabled","marketing_brand_houston_miami_business":"enabled", [...] ,"mweb_auth_android_lite_low_res_limit_width":"enabled_736x","mweb_auth_low_res_limit_width":"enabled_736x","mweb_auth_no_client_context":"enabled"...
How can I get the HTML so I can eventually scrape the pictures of a page for a given keyword? And how do I handle endless (infinitely scrolling) pages when scraping?

You can get the right page if you request it properly. To prevent scraping, servers run some basic checks to filter out automated requests, so the problem concerns only the request, not the scraping part. The first and cheapest attempt is to pass a "fake" user agent with your request:
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0"  # any current Firefox, Chrome, Safari, Opera, ... UA string
response = requests.get(url, headers={"user-agent": user_agent})
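Putting it together with your original code, a minimal sketch (whether Pinterest returns scrapeable HTML for a plain request even with a browser-like User-Agent is not guaranteed; heavily script-driven pages may still need a real browser):

import requests
from bs4 import BeautifulSoup

keyword = "dog"
url = f"https://www.pinterest.de/search/pins/?q={keyword}&rs=typed&term_meta[]={keyword}%7Ctyped"
# Example desktop Firefox UA; any current browser UA string should do.
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0"
response = requests.get(url, headers={"user-agent": user_agent})
soup = BeautifulSoup(response.content, "html.parser")
# Collect image URLs, if the returned markup actually contains <img> tags.
image_urls = [img["src"] for img in soup.find_all("img", src=True)]
print(image_urls[:10])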

Related

div not showing up in html from url using requests library and bs4

I have a simple script where I want to scrape a menu from a URL:
https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822
When I inspect the page using dev tools, I can see that the menu is contained in the section <div class="menu-area" id="section_1026228">.
So my script is fairly simple as follows:
import requests
from bs4 import BeautifulSoup
venue_url = 'https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822'
response = requests.get(venue_url, headers = {'User-agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.text, 'html.parser')
menu = soup.find('div', {'class': 'menu-area'})
print(menu.text)
I have tried this on a locally saved copy of the page and it works. But when I run it against the live URL using the requests library, it does not work and throws this error:
print(menu.text)
AttributeError: 'NoneType' object has no attribute 'text'
which basically means it cannot find the div. Does anyone know why this is happening and how to fix it?
I just logged out from my browser and it showed me a different page. However, my script has no login part at all, and I'm not even sure how that would work.
[It doesn't work with all sites, but it seems to be enough for this site so far.] You can log in with requests.Session:
import requests

sess = requests.Session()
headers = {'user-agent': 'Mozilla/5.0'}
data = {'username': 'YOUR_EMAIL/USERNAME', 'password': 'YOUR_PASSWORD'}
loginResp = sess.post('https://untappd.com/login', headers=headers, data=data)
print(loginResp.status_code, loginResp.reason, 'from', loginResp.url)  # should print 200 OK...
response = sess.get(venue_url, headers=headers)  # venue_url as defined in the question
## CAN CONTINUE AS BEFORE ##
I've edited my solution to one of your previous questions about this site to include cookies so that the site will treat you as logged in. For example:
# venue_url = 'https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822'
gloryMenu = scrape_untappd_menu(venue_url, cookies=sess.cookies)
will collect the menu data while the site treats you as logged in.
Note: they have a captcha at login, so I was worried it would be too hard to automate. If it becomes an issue, you can [probably] still log in in your browser before going to the page, then paste the request from your network log into curlconverter to get the cookies as a dictionary. Of course, the process is then no longer fully automated, since you'll have to repeat this manual login every time the cookies expire (which could be within a few hours). If you wanted to automate the login at that point, you might have to use some kind of browser automation, such as Selenium.
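For illustration, once you have a cookies dictionary from curlconverter, it can be passed straight to requests. The cookie name below is made up; use whatever your own network log shows:

import requests

# Hypothetical cookie name; paste the real ones from curlconverter.
cookies = {'untappd_session': 'PASTE_VALUE_FROM_BROWSER'}
venue_url = 'https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822'
response = requests.get(venue_url, headers={'user-agent': 'Mozilla/5.0'}, cookies=cookies)
print(response.status_code, response.reason)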

How to scrape an API response that is requested by the target website?

I'd like to scrape content of a website that is requested asynchronously and is not visible in the page source.
How can I wait for the website's request? I'd need to sniff its traffic somehow, but I couldn't find anything yet.
I'm looking something like that (pseudo code):
import requests
from bs4 import BeautifulSoup
page = requests.get("http://target.tld")
traffic = page.sniff_traffic(seconds=10)
for req in traffic:
    print(req)  # http://api.target.tld
soup = BeautifulSoup(page.content, "html.parser")
Any ideas?
You can't do that with BeautifulSoup; you need something that mimics a web browser, such as Selenium with geckodriver.
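If you specifically want to observe which background requests the page fires (the "sniffing" in your pseudocode), one option is the third-party selenium-wire package, which drives a real browser and records its network traffic. A rough sketch, assuming geckodriver is installed (the .requests attribute comes from selenium-wire, not plain Selenium):

import time
from seleniumwire import webdriver  # pip install selenium-wire

driver = webdriver.Firefox()
driver.get("http://target.tld")
time.sleep(10)  # give the page time to fire its asynchronous requests

# Every request the browser made while loading the page.
for request in driver.requests:
    if request.response:
        print(request.url, request.response.status_code)

driver.quit()

Once you've identified the API endpoint this way, you can often call it directly with requests and skip the browser entirely.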

API - Web Scrape

How do I get access to this API?
import requests
url = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
print(requests.get(url))
I'm trying to retrieve data from this site via its API. I found the URL above and I can see its data, but I can't seem to get it right because I keep running into a 403 status code.
This is the website url:
https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos
I'm trying to retrieve the item categories; they are visible to me in the browser, but I'm unable to fetch them.
Later I'll use these categories to iterate over the products API.
[screenshot: category data returned by the API]
Note: please be gentle, it's my first post here =]
To get the data shown in your screenshot, the following headers and endpoint are needed:
import requests

headers = {
    'sm-token': '{"IdLoja":2691,"IdRede":884}',
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos',
}
params = {
    'id_loja': '2691',
}
r = requests.get('https://www.nagumo.com.br/api/b2c/page/menu', params=params, headers=headers)
r.json()
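As a follow-up, one way to find where the category names live before iterating over the products API is to pretty-print the decoded JSON; this just dumps whatever the endpoint returns and assumes nothing about its structure:

import json

data = r.json()
# Pretty-print the first part of the response to locate the category fields.
print(json.dumps(data, indent=2, ensure_ascii=False)[:1000])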
Not sure exactly what your issue is here. But if you want to see the content of the response and not just the 200/400 status, you need to add '.content' to your print. E.g.:
# Create a session
s = requests.Session()
# Example connection variables; probably not required for your use case.
setCookieUrl = 'https://www...'
headersJson = {'Accept-Language': 'en-us'}
bodyJson = {"__type": "xxx", "applicationName": "xxx", "userID": "User01", "password": "password2021"}
# GET request (otherUrl, otherBodyJson, otherHeadersJson stand in for your own values)
p = s.get(otherUrl, json=otherBodyJson, headers=otherHeadersJson)
print(p)            # prints just the status, e.g. <Response [200]>
#print(p.headers)
#print(p.content)   # prints the content (body) of the response
#print(s.cookies)
I'm also new here haha, but besides the requests library, you'll also need another one like Beautiful Soup for what you're trying to do.
bs4 installation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
Once you install and import it, you just continue from what you were already doing to get your data:
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
This gets the entire HTML content of the page, and from it you can pull your data using CSS selectors, like this:
site_data = soup.select('selector')
site_data is a list of all elements matching that selector, so a simple for loop with a list to collect your items would suffice (as an example, getting the links for each book on a bookstore site). For example, if I were trying to get links from a site:
import requests
from bs4 import BeautifulSoup

sites = []
url = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
links = soup.select("a")  # list of all items with this selector
for link in links:
    sites.append(link)
Also, a helpful tip: when you inspect the page (right-click and choose 'Inspect'), you can see the page's code. Go to the HTML, find the data you want, then right-click it and select Copy -> Copy selector. This makes it really easy to grab the data you want on that site.
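For example, a copied selector can be dropped straight into soup.select; the selector below is hypothetical and just illustrates the shape DevTools produces:

# Continuing from the soup object above; paste your own copied selector here.
items = soup.select("#main > div.product-list > a")
for item in items:
    print(item.get("href"))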
helpful sites:
https://oxylabs.io/blog/python-web-scraping
https://realpython.com/beautiful-soup-web-scraper-python/

Problems scraping dynamic content with requests and BeautifulSoup

I have tried to scrape the response of the form on https://www.languagesandnumbers.com/how-to-count-in-german/en/deu/ by filling out the form and submitting it with requests and BeautifulSoup. After inspecting the network traffic of the submit, I found out that the POST params are "numberz" and "lang". That's why I tried to post the following:
import requests
from bs4 import BeautifulSoup

with requests.Session() as session:
    response = session.post('https://www.languagesandnumbers.com/how-to-count-in-german/en/deu/', data={
        "numberz": "23",
        "lang": "deu"
    })
    soup = BeautifulSoup(response.content, "lxml")
    print(soup.find(id='words').get_text())
Unfortunately, the response is dynamic and not visible, so after submitting the form I always get the main page back without any text in the particular div that actually carries the response. Is there another way to scrape the response using requests and BeautifulSoup, without using Selenium?
You do not need BeautifulSoup, just the correct URL, which returns only the written-out number:
https://www.languagesandnumbers.com/ajax/en/
Because it responds in the form ack:::dreiundzwanzig, you have to extract the string:
response.text.split(':')[-1]
Example
import requests

with requests.Session() as session:
    response = session.post('https://www.languagesandnumbers.com/ajax/en/', data={
        "numberz": "23",
        "lang": "deu"
    })
print(response.text.split(':')[-1])
Output
dreiundzwanzig
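If you need this for more than one number, a small wrapper is convenient; this is a sketch under the assumption that the endpoint keeps answering in the ack:::... format shown above:

import requests

def number_to_words(number, lang="deu"):
    """Return the number written out in the given language."""
    resp = requests.post(
        "https://www.languagesandnumbers.com/ajax/en/",
        data={"numberz": str(number), "lang": lang},
    )
    # The endpoint answers in the form "ack:::dreiundzwanzig".
    return resp.text.split(":")[-1]

print(number_to_words(42))  # zweiundvierzig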

Scraping subreddit top posts of all time using requests is returning the wrong result

I would like to scrape a subreddit for their top posts of all time. I know there is a PRAW module that may work better, but I would prefer to scrape using requests only for now.
import requests
url = "https://www.reddit.com/r/shittysuperpowers/top/?t=all.html"
headers = {"User-agent": "bot_0.1"}
res = requests.get(url, headers=headers)
res.status_code returned 200, and the scrape was successful. But closer inspection of res.text revealed that the HTML scraped is not from the desired page. In fact, what was scraped came from the top posts of today rather than of all time, i.e. from this URL: https://www.reddit.com/r/shittysuperpowers/top/?t=day.html. Is there any reason why I am unable to scrape the top posts of all time? I have tried this with other subreddits too, and they all run into the same problem.
Use the .json modifier before the query string to get the data in JSON format:
import requests

url = 'https://www.reddit.com/r/shittysuperpowers/top/.json?sort=top&t=all'
resp = requests.get(url, headers={'User-agent': 'bot_0.1'})
if resp.ok:
    data = resp.json()
The .json modifier also works in the browser.
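From there, the posts live under data['data']['children'] in the standard Reddit listing shape; assuming that shape, a quick sketch to print the top posts:

# Continuing from the data dict above.
for post in data['data']['children']:
    info = post['data']
    print(info['score'], info['title'])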
