Python requests through API with variable URL in json to scrape content - python

I'm currently working on a scraper with Python 2.7, BeautifulSoup, Requests, json, etc. to analyze data from a website and build charts.
I want to run a search with specific keywords and then scrape the prices of the different items to compute an average value.
So I tried to scrape the JSON response the way I usually do, but the response it gives me is:
{"data":{"uuid":"YNp-EuXHrw","index_name":"Listing","default_name":null,"query":"supreme box logo","filters":{"strata":["basic","grailed","hype"]}}}
My request goes to https://www.grailed.com/api/searches, a URL I found on the index page while making a search.
I figured out that "uuid":"YNp-EuXHrw" (a different value each time) defines the URL that will show the item data, as: https://www.grailed.com/feed/YNp-EuXHrw
So I'm making a request to scrape the uuid from the api with
response = s.post(url, headers=headers, json=payload)
res_json = json.loads(response.text)
print response
id = res_json['data']['uuid']
But the problem is that when I make a request to
https://www.grailed.com/feed/YNp-EuXHrw
(or whatever the uuid is), I'm getting <Response [500]>.
My whole code is:
import BeautifulSoup,requests,re,string,time,datetime,sys,json
s = requests.session()
url = "https://www.grailed.com/api/searches"
payload = {
    "index_name": "Listing_production",
    "query": "supreme box logo sweatshirts",
    "filters": {
        "strata": ["grailed", "hype", "basic"],
        "category_paths": [],
        "sizes": [],
        "locations": [],
        "designers": [],
        "min_price": "null",
        "max_price": "null"
    }
}
headers = {
"Host": "www.grailed.com",
"Connection":"keep-alive",
"Content-Length": "217",
"Origin": "null",
"x-api-version": "application/grailed.api.v1",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
"content-type": "application/json",
"accept": "application/json",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4",
}
response = s.post(url, headers=headers, json=payload)
res_json = json.loads(response.text)
print response
id = res_json['data']['uuid']
urlID = "https://www.grailed.com/feed/" + str(id)
print urlID
response = s.get(urlID, headers=headers, json=res_json)
print response
As you can see, when you make the search through Chrome (or any other browser), the URL quickly changes from grailed.com to grailed.com/feed/uuid.
So I've tried to make a GET request to this URL, but I just get a Response 500.
What can I do to scrape the data shown at the uuid URL, since it doesn't even appear in the Network requests?
I hope I was pretty clear, sorry for my English.

Install PhantomJS.
http://phantomjs.org/
Not a full solution, but I hope this helps.
pip install selenium
npm install phantomjs
test.py
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')  # path to the phantomjs binary
driver.set_window_size(1120, 550)
driver.get("https://www.grailed.com/")
try:
    # wait until the page is rendered
    element = WebDriverWait(driver, 1).until(
        EC.presence_of_all_elements_located((By.XPATH, '//*[@id="homepage"]/div/div[3]/div[1]/div/form/label/input'))
    )
    element = driver.find_element_by_xpath('//*[@id="homepage"]/div/div[3]/div[1]/div/form/label/input')
    if element.is_displayed():
        element.send_keys('search this')
    else:
        print('no element')
except Exception as e:
    print(e)
print(driver.current_url)
driver.quit()
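To push this a little further toward the original goal (reaching the /feed/<uuid> page), you could submit the search and wait for the URL to change. This is only a rough sketch continuing the session above; the idea that the result URL contains /feed/ comes from the question, and the wait details are assumptions, not something I have tested:
from selenium.webdriver.common.keys import Keys

# Sketch only: continues from the `element` found above.
element.send_keys(Keys.RETURN)  # submit the search form
WebDriverWait(driver, 10).until(EC.url_contains('/feed/'))  # assumption: result URL contains /feed/
print(driver.current_url)  # e.g. https://www.grailed.com/feed/<uuid>
html = driver.page_source  # rendered HTML, ready to feed into BeautifulSoup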

Related

Python cloudscraper requests slow, with 403 responses

I am using the Cloudscraper Python library in order to obtain a JSON response from a URL.
The problem is that I have to retry the same request 2-3 times before I get the correct output. The first responses have a 403 HTTP status code.
Here is my code:
import json
from time import sleep

import cloudscraper

url = "https://www.endpoint.com/api/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0",
    "Accept": "*/*",
    "Content-Type": "application/json"
}

json_response = 0
while json_response == 0:
    try:
        scraper = cloudscraper.create_scraper()
        r = scraper.get(url, headers=headers)
        json_response = json.loads(r.text)
    except:
        print(r.status_code)
        sleep(2)
return json_response
What can I do in order to optimize my code and prevent the 403 responses?
You could use a real browser to get around some of the bot detection; here is an example with Playwright:
import json
from playwright.sync_api import sync_playwright

API_URL = 'https://www.soraredata.com/api/players/info/29301348132354218386476497174231278066977835432352170109275714645119105189666'

with sync_playwright() as p:
    # Webkit is fastest to start and hardest to detect
    browser = p.webkit.launch(headless=True)
    page = browser.new_page()
    page.goto(API_URL)
    # Use evaluate instead of `content` not to import bs4 or lxml
    html = page.evaluate('document.querySelector("pre").innerText')
    try:
        data = json.loads(html)
    except:
        # Still might fail sometimes
        data = None

print(data)
The HTTP 403 Forbidden response status code indicates that the server understands the request but refuses to authorize it.
If you have no authorization issue, I would suggest first of all checking whether the URL you are sending the request to needs any sort of permission to authorize the request.
However, since you do get a response on the 2nd or 3rd try, what is happening is that some servers take a few seconds before returning the answer, so they require the client to wait ~5 seconds before the response is delivered.
I would suggest adding a delay, which can be passed as an argument to create_scraper():
scraper = cloudscraper.create_scraper(delay=10)
If it is successful, then reduce the delay until it can no longer be reduced.
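Putting both suggestions together, here is a minimal sketch of the loop from the question with the delay and a bounded number of retries; the endpoint URL is the placeholder from the question and the retry count is an arbitrary choice:
from time import sleep

import cloudscraper

url = "https://www.endpoint.com/api/"  # placeholder endpoint from the question
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0",
    "Accept": "*/*",
}

def fetch_json(max_retries=5):
    # Reuse one scraper so solved challenges/cookies persist between attempts
    scraper = cloudscraper.create_scraper(delay=10)
    for attempt in range(max_retries):
        r = scraper.get(url, headers=headers)
        if r.status_code == 200:
            return r.json()
        print(f"attempt {attempt + 1} failed with status {r.status_code}")
        sleep(2)
    return None

print(fetch_json())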

How to extract a specific string with BeautifulSoup

So I'm trying to retrieve Bitcoin prices from CoinMarketCap.com.
I'm using Python along with requests and bs4.
import requests
from bs4 import BeautifulSoup
link = "https://coinmarketcap.com/currencies/bitcoin/"
header = {
"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0'}
data = requests.get(headers = header, url = link)
soup = BeautifulSoup(data.content, 'html.parser')
bitcoinPrice = soup.find(id="quote_price")
print(bitcoinPrice)
So when I run the script, I get the following result, with some additional markup that I don't want. I just want the Bitcoin price.
<span data-currency-price="" data-usd="9806.68980398" id="quote_price">
<span class="h2 text-semi-bold details-panel-item--price__value" data-currency-value="">9806.69</span>
<span class="text-large" data-currency-code="">USD</span>
</span>
How do I extract the Bitcoin price from that chunk of data?
I believe this should give you what you want:
bitcoinPrice.span.contents[0]
contains
'9808.16'
bitcoinPrice = soup.find("span", class_="details-panel-item--price__value").text
This is another way, using a CSS selector.
print(soup.select_one('.details-panel-item--price__value').text)
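For completeness, a small usage sketch that reuses the soup object from the question and converts the scraped text to a number (assuming the markup shown above is still what the server returns):
price_text = soup.select_one('.details-panel-item--price__value').text
price = float(price_text.replace(',', ''))  # e.g. 9806.69
print(price)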
You can also use the official API under the basic (free) plan; simply add your own API key in place of the 'api_key' placeholder below (id 1 is Bitcoin). The code example is adapted from the official API documentation.
from requests import Request, Session
from requests.exceptions import ConnectionError, Timeout, TooManyRedirects
import json

url = 'https://pro-api.coinmarketcap.com/v1/cryptocurrency/quotes/latest'
parameters = {
    'id': '1'
}
headers = {
    'Accepts': 'application/json',
    'X-CMC_PRO_API_KEY': 'api_key',
}

session = Session()
session.headers.update(headers)

try:
    response = session.get(url, params=parameters)
    data = json.loads(response.text)
    # print(data)
    print(data['data']['1']['quote']['USD']['price'])
except (ConnectionError, Timeout, TooManyRedirects) as e:
    print(e)

requests-html not finding page element

So I'm trying to navigate to this url: https://www.instacart.com/store/wegmans/search_v3/horizon%201%25
and scrape data from the div with the class item-name item-row. There are two main problems though, the first is that instacart.com requires a login before you can get to that url, and the second is that most of the page is generated with javascript.
I believe I've solved the first problem because my session.post(...) gets a 200 response code. I'm also pretty sure that r.html.render() is supposed to solve the second problem by rendering the JavaScript-generated HTML before I scrape it. Unfortunately, the last line in my code only returns an empty list, despite the fact that Selenium had no problem getting this element. Does anyone know why this isn't working?
from requests_html import HTMLSession
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}

session = HTMLSession()
res1 = session.get('http://www.instacart.com', headers=headers)
soup = BeautifulSoup(res1.content, 'html.parser')
token = soup.find('meta', {'name': 'csrf-token'}).get('content')
data = {"user": {"email": "alexanderjbusch@gmail.com", "password": "password"},
        "authenticity_token": token}
response = session.post('https://www.instacart.com/accounts/login', headers=headers, data=data)
print(response)

r = session.get("https://www.instacart.com/store/wegmans/search_v3/horizon%201%25", headers=headers)
r.html.render()
print(r.html.xpath("//div[@class='item-name item-row']"))
After logging in with the requests module and BeautifulSoup, you can make use of the link I've already suggested in the comment to parse the required data, which is available as JSON. The following script should get you the name, quantity, price and a link to the corresponding product. You can only get 21 products with the script below; there is a pagination option within this JSON content, and you can get all of the products by playing around with that pagination (see the sketch after the sample output below).
import json
import requests
from bs4 import BeautifulSoup

baseurl = 'https://www.instacart.com/store/'
data_url = "https://www.instacart.com/v3/retailers/159/module_data/dynamic_item_lists/cart_starters/storefront_canonical?origin_source_type=store_root_department&tracking.page_view_id=b974d56d-eaa4-4ce2-9474-ada4723fc7dc&source=web&cache_key=df535d-6863-f-1cd&per=30"

data = {"user": {"email": "alexanderjbusch@gmail.com", "password": "password"},
        "authenticity_token": ""}

headers = {
    'user-agent': 'Mozilla/5.0',
    'x-requested-with': 'XMLHttpRequest'
}

with requests.Session() as s:
    res = s.get('https://www.instacart.com/', headers={'user-agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(res.text, 'lxml')
    token = soup.select_one("[name='csrf-token']").get('content')
    data["authenticity_token"] = token
    s.post("https://www.instacart.com/accounts/login", json=data, headers=headers)
    resp = s.get(data_url, headers=headers)
    for item in resp.json()['module_data']['items']:
        name = item['name']
        quantity = item['size']
        price = item['pricing']['price']
        product_page = baseurl + item['click_action']['data']['container']['path']
        print(f'{name}\n{quantity}\n{price}\n{product_page}\n')
Partial output:
SB Whole Milk
1 gal
$3.90
https://www.instacart.com/store/items/item_147511418
Banana
At $0.69/lb
$0.26
https://www.instacart.com/store/items/item_147559922
Yellow Onion
At $1.14/lb
$0.82
https://www.instacart.com/store/items/item_147560764
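As a rough illustration of the pagination idea mentioned above: the data_url already carries a per=30 query parameter, so one simple experiment is to request a larger page size. Whether the endpoint honours larger values, or exposes an explicit page/offset parameter, is an assumption you would need to verify against the JSON it returns:
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def with_page_size(url, per):
    # Return `url` with its `per` query parameter replaced; other parameters are kept as-is.
    parts = urlparse(url)
    query = parse_qs(parts.query)
    query['per'] = [str(per)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

# Inside the `with requests.Session() as s:` block above you would then request, e.g.:
#     resp = s.get(with_page_size(data_url, 100), headers=headers)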

Requests login into website only getting 403 error

I am trying to log in to www.ebay-kleinanzeigen.de using the requests library, but every time I try to post my data (on the register page it's the same as on the login page) I am getting a 403 error.
Here is the code for the register function:
import requests
from bs4 import BeautifulSoup
session = requests.Session()
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
headers = {'user-agent': user_agent, 'Referer': 'https://www.ebay-kleinanzeigen.de'}
with requests.Session() as c:
url = 'https://www.ebay-kleinanzeigen.de/m-benutzer-anmeldung.html'
c.headers = headers
hp = c.get(url, headers=headers)
soup = BeautifulSoup(hp.content, 'html.parser')
crsf = soup.find('input', {'name': '_csrf'})['value']
print(crsf)
payload = dict(email='test.email#emailzz1.de', password='test123', passwordConfirmation='test123',
_marketingOptIn='on', _crsf=crsf)
page = c.post(url, data=payload, headers=headers)
print(page.text)
print(page.url)
print(page.status_code)
Is the problem that I need some more headers? Isn't a user-agent and a referrer enough?
I have tried adding all requested headers, but then I am getting no response.
I have managed to create a script that successfully completes the register form you're trying to fill in, using the mechanicalsoup library. Note that you will have to check your email account manually for the email they send you to complete registration.
I realise this doesn't actually answer the question of why your requests-based approach returned a 403 Forbidden error, but it does complete your task without running into that error.
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("https://www.ebay-kleinanzeigen.de/m-benutzer-anmeldung.html")
browser.select_form('#registration-form')
browser.get_current_form().print_summary()

browser["email"] = "mailuser@emailprovider.com"
browser["password"] = "testSO12345"
browser["passwordConfirmation"] = "testSO12345"

response = browser.submit_selected()
rsp_code = response.status_code
# print(response.text)
print("Response code:", rsp_code)

if rsp_code == 200:
    print("Success! Opening a local debug copy of the page... (no CSS formatting)")
    browser.launch_browser()
else:
    print("Failure!")

Can't parse a website link from a webpage

I've created a script in Python with Selenium to scrape the website address located within the "Contact details" section of a site. The problem, however, is that there is no URL associated with that link (I can click on the link, though).
How can I parse the website link located within the contact details?
from selenium import webdriver

URL = 'https://www.truelocal.com.au/business/vitfit/sydney'

def get_website_link(driver, link):
    driver.get(link)
    website = driver.find_element_by_css_selector("[ng-class*='getHaveSecondaryWebsites'] > span").text
    print(website)

if __name__ == '__main__':
    driver = webdriver.Chrome()
    try:
        get_website_link(driver, URL)
    finally:
        driver.quit()
When I run the script, I get the visible text associated with that link, which is "Visit website".
The element with the "Visit website" text is a span that has the JavaScript handler vm.openLink(vm.getReadableUrl(vm.getPrimaryWebsite()),'_blank') and no actual href.
My suggestion: if your goal is scraping rather than testing, you can use the solution below with the requests package to get the data as JSON and extract any information you need.
The other option is to actually click the link, as you did.
import requests
import re
headers = {
'Referer': 'https://www.truelocal.com.au/business/vitfit/sydney',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/73.0.3683.75 Safari/537.36',
'DNT': '1',
}
response = requests.get('https://www.truelocal.com.au/www-js/configuration.constant.js?v=1552032205066',
headers=headers)
assert response.ok
# extract token from response text
token = re.search("token:\\s'(.*)'", response.text)[1]
headers['Accept'] = 'application/json, text/plain, */*'
headers['Origin'] = 'https://www.truelocal.com.au'
response = requests.get(f'https://api.truelocal.com.au/rest/listings/vitfit/sydney?&passToken={token}', headers=headers)
assert response.ok
# use response.text to get full json as text and see what information can be extracted.
contact = response.json()["data"]["listing"][0]["contacts"]["contact"]
website = list(filter(lambda x: x["type"] == "website", contact))[0]["value"]
print(website)
print("the end")
