I'm trying to get the shipping price from this link:
https://www.banggood.com/Xiaomi-Mi-Air-Laptop-2019-13_3-inch-Intel-Core-i7-8550U-8GB-RAM-512GB-PCle-SSD-Win-10-NVIDIA-GeForce-MX250-Fingerprint-Sensor-Notebook-p-1535887.html?rmmds=search&cur_warehouse=CN
but it seems that the "strong" element is empty.
I've tried a few solutions, but all of them gave me an empty "strong".
I'm using BeautifulSoup in Python 3.
For example, this code led me to an empty "strong":
import requests
from bs4 import BeautifulSoup

# url is the product link above
client = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(client.content, 'lxml')
for child in soup.find("span", class_="free_ship").children:
    print(child)
The issue is that the 'Free Shipping' text is generated by JavaScript after the page loads, rather than being present in the initial HTML.
The page might obtain the shipping price by performing an HTTP request after it has loaded, or the value may be hidden elsewhere within the page.
You can try to find the XHR request that pulls the shipping price using DevTools in Firefox or Chrome (the Network tab) and then call that endpoint directly to get the price.
Using that XHR endpoint, you can get the data:
import requests
from bs4 import BeautifulSoup
import json

url = 'https://m.banggood.com/ajax/product/dynamicPro/index.html'
payload = {
    'c': 'api',
    'sq': 'IY38TmCNgDhATYCmIDGxYisATHA7ANn2HwX2RNwEYrcAGAVgDNxawIQFhLpFhkOCuZFFxA'
}

response = requests.get(url, params=payload).json()
data = response['result']
shipping = data['shipment']

for each in shipping.items():
    print(each)

print(shipping['shipCost'])
Output:
<b>Free Shipping</b>
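If you want just the text rather than the HTML snippet, a minimal follow-up sketch (assuming the value always comes back as a small HTML fragment like the one above):
from bs4 import BeautifulSoup

# shipping['shipCost'] is an HTML fragment such as '<b>Free Shipping</b>';
# parse it and keep only the text.
ship_cost_text = BeautifulSoup(shipping['shipCost'], 'lxml').get_text(strip=True)
print(ship_cost_text)  # Free Shipping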
Looking to pass Python a list and then, using a combination of Beautiful Soup and requests, pull the corresponding piece of information for each web page.
I have a list of around 7000 barcodes that I want to pass to this site, 'https://www.barcodelookup.com/' (you just add the barcode after the slash), then pull back the manufacturer of that product, which is in the span "product-text". I'm currently trying to get it to run with the code below:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.barcodelookup.com/194398882321')
soup = BeautifulSoup(source, 'lxml')
#print(soup.prettify())
price = soup.find('span', {'class' : 'product-text'})
print(price.text)
This gives the error below:
TypeError: object of type 'Response' has no len()
Any help would be greatly appreciated, thanks
If you inspect the response, you will see that the status code is 403 and that source.text reveals the website is protected by Cloudflare. This means that plain requests will not get you the page; you need a way to get past Cloudflare's anti-bot protection. Here are two options:
1. Use a third party service
I am an engineer at WebScrapingAPI, and I can recommend our web scraping API. We prevent detection by using various proxies, IP rotation, captcha solvers and other advanced features. A basic example of using our API for your scenario is:
import requests

API_KEY = '<YOUR_API_KEY>'
SCRAPER_URL = 'https://api.webscrapingapi.com/v1'
TARGET_URL = 'https://www.barcodelookup.com/194398882321'

PARAMS = {
    "api_key": API_KEY,
    "url": TARGET_URL,
    "render_js": 1,
    "timeout": "20000",
    "proxy_type": "residential",
    "extract_rules": '{"elements":{"selector":"span.product-text","output":"text"}}',
}

response = requests.get(SCRAPER_URL, params=PARAMS)
print(response.text)
Response:
{"elements":["\nUPC-A 194398882321, EAN-13 0194398882321\n","Media ","Sony Uk ","\n1-2-3: The 80s CD.\n"]}
2. Build an undetectable web scraper
You can also try building a more 'undetectable' web scraper on your end. For example, try using a real browser for your scraper, instead of requests. Selenium would be a good place to start. Here is an implementation example:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

BASE_URL = 'https://www.barcodelookup.com/194398882321'

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 5000)  # available for explicit waits (timeout in seconds)

driver.get(BASE_URL)
html = driver.page_source  # HTML after JavaScript has run

soup = BeautifulSoup(html, 'html.parser')
price = soup.find('span', {'class': 'product-text'})
print(price)

driver.quit()
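Since the wait object above is created but never used, here is a hedged variant showing how it is typically wired up with expected_conditions, so the HTML is only read once the target span has actually rendered:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 15)  # timeout in seconds

driver.get('https://www.barcodelookup.com/194398882321')
# Block until the element we care about is present in the DOM.
element = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'span.product-text')))
print(element.text)

driver.quit()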
In time though, Cloudflare might flag your 'fingerprint' and block your requests. Some more things you could add to your project are:
Residential proxies (a minimal proxy-configuration sketch follows below)
Advanced Selenium evasions
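For the proxy idea, a minimal sketch of routing Chrome through a proxy; the address below is a placeholder you would replace with an endpoint from your residential proxy provider:
from selenium import webdriver

# Placeholder; substitute a real residential proxy endpoint from your provider.
# Note: Chrome's --proxy-server flag does not handle username/password auth by itself.
PROXY = "proxy.example.com:8000"

options = webdriver.ChromeOptions()
options.add_argument(f"--proxy-server={PROXY}")

driver = webdriver.Chrome(options=options)
driver.get("https://www.barcodelookup.com/194398882321")
print(driver.page_source[:500])  # quick sanity check that a page came back
driver.quit()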
How to get access to this API:
import requests
url = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
print(requests.get(url))
I'm trying to retrieve data from this site via its API. I found the URL above and I can see its data in the browser, but I can't seem to get it right because I'm running into a 403 status code.
This is the website URL:
https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos
I'm trying to retrieve the item categories; they are visible to me in the browser, but I'm unable to fetch them.
Later I'll use these categories to iterate over the products API.
[screenshot: API category response]
Obs: please be gentle, it's my first post here =]
To get the data shown in your image, the following headers and endpoint are needed:
import requests

headers = {
    'sm-token': '{"IdLoja":2691,"IdRede":884}',
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos',
}
params = {
    'id_loja': '2691',
}

r = requests.get('https://www.nagumo.com.br/api/b2c/page/menu', params=params, headers=headers)
r.json()
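The exact shape of the returned JSON isn't shown in the question, so here is a small exploratory sketch to see what the menu endpoint gives back before iterating over categories (no field names are assumed):
data = r.json()

# Look at the top-level structure before trying to loop over categories.
print(type(data))
if isinstance(data, dict):
    print(list(data.keys()))
elif isinstance(data, list):
    print(len(data), "entries")
    print(data[0])  # inspect one entry to find the category-name field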
Not sure exactly what your issue is here.
But if you want to see the content of the response, and not just the 200/400 status, you need to add '.content' to your print.
E.g.:
import requests

# Create session
s = requests.Session()

# Example connection variables, probably not required for your use case.
setCookieUrl = 'https://www...'
HeadersJson = {'Accept-Language': 'en-us'}
bodyJson = {"__type": "xxx", "applicationName": "xxx", "userID": "User01", "password": "password2021"}

# GET request (otherUrl etc. stand in for your own URL, body and headers)
p = s.get(otherUrl, json=otherBodyJson, headers=otherHeadersJson)
print(p)            # prints the response status, e.g. <Response [200]>
#print(p.headers)
#print(p.content)   # prints the content of the response
#print(s.cookies)
I'm also new here haha, but besides the requests library, you'll also need another one like Beautiful Soup for what you're trying to do.
bs4 installation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
Once you install and import it, you just continue what you were doing to actually get your data.
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
This gets the entire HTML content of the page, so you can then pull your data out of it based on CSS selectors, like this:
site_data = soup.select('selector')
site_data is a list of all the elements matching that selector, so a simple for loop and a list to collect your items in would suffice (as an example, getting the links for each book on a bookstore site).
For example, if I was trying to get links from a site:
import requests
from bs4 import BeautifulSoup

sites = []
URL = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'

response = requests.get(URL)
soup = BeautifulSoup(response.text, "html.parser")

links = soup.select("a")  # list of all elements matching this selector
for link in links:
    sites.append(link)
Also, a helpful tip: when you inspect the page (right-click and choose 'Inspect'), you can see the code for the page. Go to the HTML, find the data you want, right-click it, and select Copy -> Copy selector. This makes it really easy to get the data you want on that site.
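For instance, a tiny sketch of plugging a copied selector into Beautiful Soup, reusing the soup object from above; the selector string here is made up, so paste in whatever DevTools gives you:
# Hypothetical selector obtained via Copy -> Copy selector in DevTools.
copied_selector = "body > div.container > span.product-text"

element = soup.select_one(copied_selector)
if element is not None:
    print(element.text.strip())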
helpful sites:
https://oxylabs.io/blog/python-web-scraping
https://realpython.com/beautiful-soup-web-scraper-python/
I am using the following URL to extract the JSON file for the price history:
https://steamcommunity.com/market/pricehistory/?appid=730&market_hash_name=P90%20|%20Blind%20Spot%20(Field-Tested)
The Python code I am using:
# URL, steamid, currRun and allItemNames are defined earlier in the notebook
item = requests.get(URL, cookies={'steamLogin': steamid})  # get item data
print(str(currRun), ' out of ', str(len(allItemNames)) + ' code: ' + str(item.status_code))
item = item.content
item = json.loads(item)
I have gone through almost all the solutions that were posted in this community, but I am still getting a 400 status code and the items as [].
When I copy-paste the URL and open it in a browser, I am able to see the JSON file with the required data, but somehow the Jupyter notebook is unable to retrieve the content.
I also tried Beautiful Soup to read the content with the following code:
r = requests.get(url)
# the code below extracts the whole HTML of the above URL
soup = BeautifulSoup(r.content, 'html5lib')
table = soup.find_all('pre')
print(table)
Output: []
You are getting [] because you are not authorized, so you receive an empty JSON array. You can check this by opening the link in incognito mode (Ctrl+Shift+N).
To authorize, you need to set the Cookie header on your request, so your code will look like this:
import requests

url = "https://steamcommunity.com/market/pricehistory/?appid=730&market_hash_name=P90%20%7C%20Blind%20Spot%20(Field-Tested)"
headers = {
    "Cookie": "Your cookie"
}

json_text = requests.get(url, headers=headers).text  # renamed so it doesn't shadow the json module
...
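Once the request is authorized, the body is JSON you can parse; a minimal sketch (the 'prices' key is what this endpoint returned at the time of writing, so double-check it against the JSON you see in your browser):
import json

data = json.loads(json_text)

# Each entry is expected to look like [date, median_price, volume].
for entry in data.get("prices", [])[:5]:
    print(entry)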
How to find the Cookie (Chrome):
1. Go to the link with the JSON.
2. Press F12 to open Chrome DevTools.
3. Open the Network tab.
4. Reload the page.
5. Double-click on the first request sent.
6. Open the Headers subtab.
7. Scroll to Request Headers.
8. Find the Cookie header.
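Alternatively, instead of sending a raw Cookie header, you can pass individual cookie values through requests' cookies parameter, which is what the original snippet was attempting; the cookie name below mirrors that snippet, so use whichever names and values you actually see in DevTools:
import requests

url = "https://steamcommunity.com/market/pricehistory/?appid=730&market_hash_name=P90%20%7C%20Blind%20Spot%20(Field-Tested)"

# 'steamLogin' is the cookie name from the question's code; the exact
# name/value must match what your logged-in browser sends.
cookies = {"steamLogin": "your cookie value"}

r = requests.get(url, cookies=cookies)
print(r.status_code)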
Code:
from bs4 import BeautifulSoup as bs
import requests
keyword = "dog"
url = f"https://www.pinterest.de/search/pins/?q={keyword}&rs=typed&term_meta[]={keyword}%7Ctyped"
r = requests.get(url)
soup = bs(r.content, "html.parser")
print(r)
Output: <Response [200]>
If I change print(r) to print(soup), I get a lot of text that doesn't look like HTML.
Something like: ..."marketing_brand_charlotte_phoenix":"control","marketing_brand_houston_chicago":"enabled","marketing_brand_houston_miami_business":"enabled","marketing_brand_northnhinenestphalia_bavaria":"enabled","marketing_brand_northrhinewestphalia_bavaria_business":"enabled","marketing_brand_seattle_dallas_business":"control","marketing_brand_seattle_orlando":"control","merchant_discovery_shopify_boosting":"control","merchant_storefront_mojito_migration":"enabled","merchant_success_activation_banner_collapse_over_dismiss":"enabled","merchant_success_auto_enroll_approved_merchant":"enabled","merchant_success_catalog_activation_card_copy_update":"control","merchant_success_claim_your_website_copy_update":"control","merchant_success_i18n_umr_review_flag":"enabled","merchant_success_product_tagging":"enabled","merchant_success_switch_merchant_review_queues":"enabled","merchant_success_tag_activation_card_copy_update":"control","merchant_success_tag_installation_redirect":"enabled","merchant_success_unified_merchant_review":"enabled","mini_renux_homefeed_refresh":"enabled","more_ideas_email_notifications":"enabled","more_ideas_newshub_notifications":"enabled","more_ideas_push_notifications":"enabled","msft_pwa_announcement_email":"control","multi_format_ad_group":"enabled","mweb_account_switcher_v2":"enabled","mweb_advertiser_growth_add_biz_create_entrypoint":"enabled","mweb_all_profiles_follow_parity":"enabled","mweb_auth_android_lite_low_res_limit_width":"enabled_736x","mweb_auth_low_res_limit_width":"enabled_736x","mweb_auth_no_client_context":"enabled"...
How can I get the HTML so I can eventually scrape the pictures of a page depending on the keyword? And how do I handle endless (infinite-scroll) pages when scraping?
You can get the right page if you request it properly. To prevent scraping, servers use some basic checks to filter out such requests, so the problem concerns only the request, not the scraping part. The first and cheapest thing to try is passing a "fake" user agent with your request.
user_agent = "..."  # a user-agent string of Firefox, Chrome, Safari, Opera, ...
response = requests.get(url, headers={"user-agent": user_agent})
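From there, a minimal sketch of pulling picture URLs out of the returned HTML; it assumes the images you want appear as plain img tags in the server-rendered page, and note that Pinterest loads further results via JavaScript as you scroll, so plain requests will only ever see the first batch:
import requests
from bs4 import BeautifulSoup

keyword = "dog"
url = f"https://www.pinterest.de/search/pins/?q={keyword}&rs=typed&term_meta[]={keyword}%7Ctyped"
# Example user-agent string; any common browser UA should do.
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"

r = requests.get(url, headers={"user-agent": user_agent})
soup = BeautifulSoup(r.content, "html.parser")

# Collect the src of every <img> tag that made it into the initial HTML.
image_urls = [img["src"] for img in soup.find_all("img") if img.get("src")]
for src in image_urls:
    print(src)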
I'm currently doing web scraping for the first time, trying to grab and compile a list of completed katas from my Codewars profile. You can view the completed problems without being logged in, but it does not display your solutions unless you are logged in to that specific account.
Here is an inspect preview of the page display when logged in and the relevant divs I'm trying to scrape:
The url for that page is https://www.codewars.com/users/User_Name/completed_solutions
with User_Name replaced by an actual username.
The log-in page is: https://www.codewars.com/users/sign_in
I have attempted to get the divs with the class "list-item solutions" in two different ways, shown below:
# attempt 1
import requests
from bs4 import BeautifulSoup

login_url = "https://www.codewars.com/users/sign_in"
end_url = "https://www.codewars.com/users/Ash-Ozen/completed_solutions"

with requests.session() as sesh:
    result = sesh.get(login_url)
    soup = BeautifulSoup(result.content, "html.parser")
    token = soup.find("input", {"name": "authenticity_token"})["value"]
    payload = {
        "user[email]": "ph#gmail.com",
        "user[password]": "phpass>",
        "authenticity_token": str(token),
    }
    result = sesh.post(login_url, data=payload)  # this logs me in?
    page = sesh.get(end_url)  # this navigates me to the target page?
    soup = BeautifulSoup(page.content, "html.parser")
    print(soup.prettify())  # some debugging
    # Examining the print output shows that "list-item solutions" is not
    # there. Checking page.url shows the correct URL
    # (https://www.codewars.com/users/Ash-Ozen/completed_solutions).
    solutions = soup.findAll("div", class_="list-item solutions")
    # solutions yields an empty list.
and
# attempt 2
from robobrowser import RoboBrowser
from bs4 import BeautifulSoup

browser = RoboBrowser(history=True)
browser.open("https://www.codewars.com/users/sign_in")

form = browser.get_form()
form["user[email]"].value = "phmail#gmail.com"
form["user[password]"].value = "phpass"
browser.submit_form(form)  # I think robobrowser handles the CSRF token for me?

browser.open("https://www.codewars.com/users/Ash-Ozen/completed_solutions")
r = browser.parsed()
soup = BeautifulSoup(str(r[0]), "html.parser")
solutions = soup.find_all("div", class_="list-item solutions")
print(solutions)  # returns empty list
No idea how or what to debug from here to get it working.
Edit: My initial thought about what is going wrong is that, after performing either post, I get redirected to the dashboard (the behavior after logging in successfully), but when requesting the final URL I seem to end up with the non-logged-in version of the page.
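One way to narrow this down is to check, inside the same session and before fetching the target page, whether the login POST actually stuck; a small debugging sketch reusing sesh, login_url, end_url and payload from attempt 1 (run it inside the with block):
result = sesh.post(login_url, data=payload)
print(result.status_code, result.url)     # did the POST redirect somewhere sensible?
print(sesh.cookies.get_dict())            # did the server set any session cookies?

page = sesh.get(end_url)
print("list-item solutions" in page.text)  # is the logged-in markup there at all?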