Website always hangs using python requests library - python

I am trying to use the Python requests library to get the HTML from this URL: https://www.adidas.com/api/products/EF2302/availability?sitePath=us
However, every time I run my code it hangs when making the GET request:
header = BASE_REQUEST_HEADER
url = 'https://www.adidas.com/api/products/EF2302/availability?sitePath=us'
r = requests.get(url, headers = header)
I checked the Network tab in Chrome and copied all the headers used, including the user agent, so that is not the issue. I was also able to load the page in Chrome with JavaScript and cookies disabled.
This code works fine with other websites. I simply can't get a response from any of the Adidas pages (including https://www.adidas.com/us).
Any suggestions are greatly appreciated.

This site doesn't like the default User-Agent field supplied by requests. Change it to a Firefox or Chrome one (I chose Firefox in my example), and you can read the data successfully:
import requests
import json

# A browser-like User-Agent - the default python-requests one gets dropped by the server
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'}
url = 'https://www.adidas.com/api/products/EF2302/availability?sitePath=us'

r = requests.get(url, headers=headers)
json_data = json.loads(r.text)
print(json.dumps(json_data, indent=4))
Prints:
{
    "id": "EF2302",
    "availability_status": "PREORDER",
    "variation_list": [
        {
            "sku": "EF2302_530",
            "availability": 15,
            "availability_status": "PREORDER",
            "size": "4",
            "instock_date": "2018-08-16T00:00:00.000Z"
        },
        {
            "sku": "EF2302_550",
            "availability": 15,
            "availability_status": "PREORDER",
            "size": "5",
            "instock_date": "2018-08-16T00:00:00.000Z"
        },
        {
            "sku": "EF2302_570",
            "availability": 15,
            "availability_status": "PREORDER",
            "size": "6",
            "instock_date": "2018-08-16T00:00:00.000Z"
        },
        {
            "sku": "EF2302_590",
            "availability": 15,
            "availability_status": "PREORDER",
            "size": "7",
            "instock_date": "2018-08-16T00:00:00.000Z"
        },
        {
            "sku": "EF2302_610",
            "availability": 15,
            "availability_status": "PREORDER",
            "size": "8",
            "instock_date": "2018-08-16T00:00:00.000Z"
        },
        {
            "sku": "EF2302_630",
            "availability": 15,
            "availability_status": "PREORDER",
            "size": "9",
            "instock_date": "2018-08-16T00:00:00.000Z"
        },
        {
            "sku": "EF2302_650",
            "availability": 15,
            "availability_status": "PREORDER",
            "size": "10",
            "instock_date": "2018-08-16T00:00:00.000Z"
        },
        {
            "sku": "EF2302_670",
            "availability": 15,
            "availability_status": "PREORDER",
            "size": "11",
            "instock_date": "2018-08-16T00:00:00.000Z"
        },
        {
            "sku": "EF2302_690",
            "availability": 15,
            "availability_status": "PREORDER",
            "size": "12",
            "instock_date": "2018-08-16T00:00:00.000Z"
        },
        {
            "sku": "EF2302_710",
            "availability": 15,
            "availability_status": "PREORDER",
            "size": "13",
            "instock_date": "2018-08-16T00:00:00.000Z"
        }
    ]
}

One difference is the User-Agent field, which requests sets as
User-Agent: python-requests/2.18.4
Adidas may just be dropping these HTTP requests to stop people from abusing their system.
(By the way, it also happens for just www.adidas.com.)
I reproduced the issue and took a look with the Wireshark packet sniffer. The HTTP request looks fine and there is a TCP acknowledgement, but no HTTP reply.
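As a side note for anyone else hitting this: here is a minimal sketch of guarding against the hang itself with requests' timeout parameter (the 10-second value is an arbitrary choice, and the headers are the ones from the answer above):
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'}
url = 'https://www.adidas.com/api/products/EF2302/availability?sitePath=us'

try:
    # Without a timeout, requests waits indefinitely on a silently dropped request,
    # which is exactly what the "hang" looks like; 10 seconds is an arbitrary choice.
    r = requests.get(url, headers=headers, timeout=10)
    r.raise_for_status()
    print(r.json())
except requests.exceptions.Timeout:
    print("Request timed out - the server is probably dropping it")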

Related

Web scraping using Python

I'm trying to get data for a list of companies (currently testing with only one) from a website. I am not sure how to get the score that I want, because I can only find the formatting part instead of the actual data. Could someone please help?
from selenium import webdriver
import time

driver = webdriver.Chrome(executable_path=r'C:\webdrivers\chromedriver.exe')
driver.get('https://www.refinitiv.com/en/sustainable-finance/esg-scores')
driver.maximize_window()
time.sleep(1)
cookie = driver.find_element("xpath", '//button[@id="onetrust-accept-btn-handler"]')
try:
    cookie.click()
except:
    pass
company_name = driver.find_element("id", 'searchInput-1')
company_name.click()
company_name.send_keys('Jumbo SA')
time.sleep(1)
search = driver.find_element("xpath", '//button[@class="SearchInput-searchButton"]')
search.click()
time.sleep(2)
company_score = driver.find_elements("xpath", '//div[@class="fiscal-year"]')
print(company_score)
That's what I have so far. I want the number "42" to come back in my results, but instead I get the below:
[<selenium.webdriver.remote.webelement.WebElement (session="bffa2fe80dd3785618b5c52d7087096d", element="62eaf2a8-d1a2-4741-8374-c0f970dfcbfe")>]
My issue is that the locator is not working.
//div[@class="fiscal-year"] = I think this part is wrong, but I am not sure what I need to pick from the website.
Website Screenshot
Please use requests instead; look at this example:
import requests

# This endpoint returns the list of companies and their RIC codes
url = "https://www.refinitiv.com/bin/esg/esgsearchsuggestions"
response = requests.get(url)
print(response.text)
This returns something like the following:
[
    {
        "companyName": "GEK TERNA Holdings Real Estate Construction SA",
        "ricCode": "HRMr.AT"
    },
    {
        "companyName": "Mytilineos SA",
        "ricCode": "MYTr.AT"
    },
    {
        "companyName": "Hellenic Telecommunications Organization SA",
        "ricCode": "OTEr.AT"
    },
    {
        "companyName": "Jumbo SA",
        "ricCode": "BABr.AT"
    },
    {
        "companyName": "Folli Follie Commercial Manufacturing and Technical SA",
        "ricCode": "HDFr.AT"
    },
    ...
]
Here we can see the company name and the code behind it, so for Jumbo SA it's BABr.AT. Now with this info let's get the data:
import requests

url = "https://www.refinitiv.com/bin/esg/esgsearchresult"
querystring = {"ricCode": "BABr.AT"}  # supply the company code
headers = {"cookie": "encaddr=NeVecfNa7%2FR1rLeYOqY57g%3D%3D"}

response = requests.get(url, headers=headers, params=querystring)
print(response.text)
Now we see the response is JSON:
{
    "industryComparison": {
        "industryType": "Specialty Retailers",
        "scoreYear": "2020",
        "rank": "162",
        "totalIndustries": "281"
    },
    "esgScore": {
        "TR.TRESGCommunity": {
            "score": 24,
            "weight": 0.13
        },
        "TR.TRESGInnovation": {
            "score": 9,
            "weight": 0.05
        },
        "TR.TRESGHumanRights": {
            "score": 31,
            "weight": 0.08
        },
        "TR.TRESGShareholders": {
            "score": 98,
            "weight": 0.08
        },
        "TR.SocialPillar": {
            "score": 43,
            "weight": 0.42999998
        },
        "TR.TRESGEmissions": {
            "score": 19,
            "weight": 0.08
        },
        "TR.TRESGManagement": {
            "score": 47,
            "weight": 0.26
        },
        "TR.GovernancePillar": {
            "score": 53,
            "weight": 0.38999998569488525
        },
        "TR.TRESG": {
            "score": 42,
            "weight": 1
        },
        "TR.TRESGWorkforce": {
            "score": 52,
            "weight": 0.1
        },
        "TR.EnvironmentPillar": {
            "score": 20,
            "weight": 0.19
        },
        "TR.TRESGResourceUse": {
            "score": 30,
            "weight": 0.06
        },
        "TR.TRESGProductResponsibility": {
            "score": 62,
            "weight": 0.12
        },
        "TR.TRESGCSRStrategy": {
            "score": 17,
            "weight": 0.05
        }
    }
}
Now you can get the data you want without using Selenium; this way it's faster and easier.
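Putting the two requests together, here is a minimal sketch (not from the original answer) that looks up the RIC code for a company name and then pulls out the overall TR.TRESG score; the cookie value is copied from the snippet above and may need refreshing, and get_esg_score is just an illustrative helper name:
import requests

def get_esg_score(company_name):
    # Step 1: map the company name to its RIC code via the suggestions endpoint
    suggestions = requests.get(
        "https://www.refinitiv.com/bin/esg/esgsearchsuggestions", timeout=30
    ).json()
    ric = next((item["ricCode"] for item in suggestions if item["companyName"] == company_name), None)
    if ric is None:
        return None
    # Step 2: fetch the ESG scores for that RIC code and pick out the overall score
    result = requests.get(
        "https://www.refinitiv.com/bin/esg/esgsearchresult",
        params={"ricCode": ric},
        headers={"cookie": "encaddr=NeVecfNa7%2FR1rLeYOqY57g%3D%3D"},
        timeout=30,
    ).json()
    return result["esgScore"]["TR.TRESG"]["score"]

print(get_esg_score("Jumbo SA"))  # prints 42 for the response shown above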

API Call using request module in python

I am not very familiar with API calls or the requests module. I am trying to get the about information (details) for each DAO. I correctly get the names of the DAOs, but I get a KeyError when I try to get the details. Any help would be greatly appreciated.
import pandas as pd
import requests

payload = {"requests": [{"indexName": "governance_production", "params": "highlightPreTag=%3Cais-highlight-0000000000%3E&highlightPostTag=%3C%2Fais-highlight-0000000000%3E&hitsPerPage=855&attributesToRetrieve=%5B%22id%22%5D&maxValuesPerFacet=100&query=&page=0&facets=%5B%22types%22%2C%22tags%22%5D&tagFilters="}]}
url = 'https://3b439zgym3-2.algolianet.com/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(3.35.1)%3B%20Browser%20(lite)&x-algolia-application-id=3B439ZGYM3&x-algolia-api-key=14a0c8d17665d52e61167cc1b2ae9ff1'
headers = {"content-type": "application/x-www-form-urlencoded"}
req = requests.post(url, headers=headers, json=payload).json()
data = []
for item in req['results'][0]['hits']:
    data.append({
        "name": item['_highlightResult']['name']['value'],
        "details": item['_highlightResult']['details']['value'],
    })
print(data)
df = pd.DataFrame(data)
print(df)
The error occurs because there is no key named details in the resulting JSON.
Here is a sample from the request you made above.
Some hits include a tags key along with name and types:
{
    "_highlightResult": {
        "assetSlug": {
            "matchLevel": "none",
            "matchedWords": [],
            "value": "tribe"
        },
        "name": {
            "matchLevel": "none",
            "matchedWords": [],
            "value": "Fei"
        },
        "tags": [
            {
                "matchLevel": "none",
                "matchedWords": [],
                "value": "DeFi"
            }
        ],
        "types": [
            {
                "matchLevel": "none",
                "matchedWords": [],
                "value": "Protocol"
            }
        ]
    },
    "id": "f9779bc3-4eb4-4830-982b-fc981762dbd8",
    "objectID": "f9779bc3-4eb4-4830-982b-fc981762dbd8"
}
while others do not include a tags key:
{
    "_highlightResult": {
        "assetSlug": {
            "matchLevel": "none",
            "matchedWords": [],
            "value": "aave"
        },
        "name": {
            "matchLevel": "none",
            "matchedWords": [],
            "value": "Aave Grants DAO"
        },
        "types": [
            {
                "matchLevel": "none",
                "matchedWords": [],
                "value": "Grants"
            }
        ]
    },
    "id": "b3a88880-b343-4eba-955e-dd0c4970291a",
    "objectID": "b3a88880-b343-4eba-955e-dd0c4970291a"
}
Here is the full body of the JSON data: JSON data
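If you still want details where it exists, a minimal sketch (assuming an empty string is an acceptable fallback) is to use dict.get() instead of indexing directly:
data = []
for item in req['results'][0]['hits']:
    highlight = item['_highlightResult']
    data.append({
        "name": highlight['name']['value'],
        # 'details' is missing from some hits, so fall back to an empty value instead of raising KeyError
        "details": highlight.get('details', {}).get('value', ''),
    })
print(data)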

Build a JSON with multiple arrays in Python

I'm calling an API in Python and getting a JSON response. I'm filtering that response for the values I need. Then I want to build a JSON from those values again and print it.
Here's my code:
import requests
import json

url = "https://my.api"
payload = {}
headers = {
    'Cookie': 'PVEAuthCookie=123'
}
response = requests.request("GET", url, headers=headers, data=payload)
json_object = json.loads(response.text)
json_formatted_str = json.dumps(json_object, indent=2)
vmid_count = json_formatted_str.count("vmid")
i = 0
for i in range(vmid_count):
    vmid = json_object['data'][i]['vmid']
    cpu = json_object['data'][i]['cpu']
    mem = json_object['data'][i]['mem']
    # (name, type, maxcpu, maxmem, maxdisk, disk, status, uptime and node are read the same way)
    data = {"data": [{"vmid": vmid, "name": name, "type": type, "maxcpu": maxcpu, "maxmem": maxmem, "maxdisk": maxdisk, "cpu": cpu, "mem": mem, "disk": disk, "status": status, "uptime": uptime, "node": node}]}
    json_dump = json.dumps(data, indent=2)
    print(json_dump)
json_formatted_str contains the JSON I receive from the API and looks like this:
{
  "data": [
    {
      "status": "running",
      "netin": 44452797,
      "maxdisk": 16729894912,
      "diskwrite": 649285632,
      "node": "pve",
      "uptime": 76654,
      "vmid": 108,
      "id": "lxc/108",
      "type": "lxc",
      "mem": 111636480,
      "cpu": 0.000327262867680765,
      "diskread": 456568832,
      "name": "container108",
      "disk": 2121224192,
      "maxmem": 2147483648,
      "netout": 25054481,
      "maxcpu": 1,
      "template": 0
    },
    ... (a lot more entries)
  ]
}
json_dump looks like this:
{
  "data": [
    {
      "vmid": 108,
      "name": "container108",
      "type": "lxc",
      "maxcpu": 1,
      "maxmem": 2147483648,
      "maxdisk": 16729894912,
      "cpu": 0.0123243844774696,
      "mem": 111116288,
      "disk": 2121342976,
      "status": "running",
      "uptime": 76825,
      "node": "pve"
    }
  ]
}
{
  "data": [
    {
      "vmid": 1007,
      ... more entries
It starts a whole new object every time it runs through the for loop. If I remove the print(json_dump) from the loop, I only get the last array.
"data": [ should only appear once, at the beginning, and the commas between the array elements are missing too.
I want the output to look like this:
{
  "data": [
    {
      "vmid": "100",
      "cpu": "4",
      "mem": "16384" (more keys and values...)
    },
    {
      "vmid": "101",
      "cpu": "2",
      "mem": "4096"
    },
    {
      "vmid": "102",
      "cpu": "6",
      "mem": "32768"
    }
  ]
}
I tried to find examples online and here on Stack Overflow, but I couldn't find anything, so I thought I'd ask here.
You have to append the new data each time, not create the dictionary again.
Like this:
data["data"].append({"vmid": vmid, "name": name, "type": type, "maxcpu": maxcpu, "maxmem": maxmem, "maxdisk": maxdisk, "cpu": cpu, "mem": mem, "disk": disk, "status": status, "uptime": uptime, "node": node})
And then you can dump and print outside of the loop.
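Put together, a minimal sketch of the corrected loop; only vmid, cpu and mem are filled in, since those are the only assignments shown in the question, and the other fields would be added the same way:
data = {"data": []}                      # create the wrapper dict once, before the loop
for entry in json_object['data']:        # iterate over the entries directly
    data["data"].append({
        "vmid": entry['vmid'],
        "cpu": entry['cpu'],
        "mem": entry['mem'],
    })
json_dump = json.dumps(data, indent=2)   # dump and print once, after the loop
print(json_dump)
Iterating over json_object['data'] directly also avoids counting "vmid" occurrences in the formatted string.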

I can't get values page by page with for-in-loop

As the title says, I can get the values on the first page, but I can't get the values page by page with a for-in loop.
I've checked my code, but I'm still confused by it. How can I get those values on every page?
# Imports Required
!pip install selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import requests
from bs4 import BeautifulSoup

browser = webdriver.Chrome(executable_path='./chromedriver.exe')
wait = WebDriverWait(browser, 5)
output = list()

for i in range(1, 2):
    browser.get("https://www.rakuten.com.tw/shop/watsons/product/?l-id=tw_shop_inshop_cat&p={}".format(i))
    # Wait until the products appear
    wait.until(EC.presence_of_element_located((By.XPATH, "//div[@class='b-content b-fix-2lines']")))
    # Get the product links
    product_links = browser.find_elements(By.XPATH, "//div[@class='b-content b-fix-2lines']/b/a")
    # Iterate over 'product_links' to get all the 'href' values
    for link in product_links:
        print(link.get_attribute('href'))
        browser.get(link.get_attribute('href'))
        soup = BeautifulSoup(browser.page_source, "html.parser")
        products = []
        product = {}
        product['商品名稱'] = soup.find('div', class_="b-subarea b-layout-right shop-item ng-scope").h1.text.replace('\n', '')
        product['價錢'] = soup.find('strong', class_="b-text-xlarge qa-product-actualPrice").text.replace('\n', '')
        all_data = soup.find_all("div", class_="b-container-child")[2]
        main_data = all_data.find_all("span")[-1]
        product['購買次數'] = main_data.text
        products.append(product)
        print(products)
You can scrape this website using the BeautifulSoup web scraping library without the need for Selenium; it will be much faster than launching a whole browser.
Problems with parsing may arise because when you request a site, it may decide that you are a bot. To prevent this, send headers that contain a user-agent with the request; the site will then assume that you're a real user and display the information.
Otherwise the request might be blocked, since the default user-agent in the requests library is python-requests.
An additional step could be to rotate the user-agent, for example to switch between PC, mobile, and tablet, as well as between browsers, e.g. Chrome, Firefox, Safari, Edge and so on, as in the sketch below.
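A minimal sketch of such rotation (the user-agent strings here are only examples):
import random
import requests

# A small pool of browser User-Agent strings to rotate through (examples only)
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0",
]

# Pick a different User-Agent for each request
headers = {"User-Agent": random.choice(user_agents)}
html = requests.get("https://www.rakuten.com.tw/shop/watsons/product/?p=1", headers=headers, timeout=30)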
Here is code that extracts data from all pages without hardcoded page numbers (a full example is available in an online IDE):
from bs4 import BeautifulSoup
import requests, json, lxml

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

data = []
page_num = 1

while True:
    html = requests.get(f"https://www.rakuten.com.tw/shop/watsons/product/?p={page_num}", headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    print(f"Extracting page: {page_num}")
    print("-" * 10)

    for result in soup.select(".b-item"):
        title = result.select_one(".product-name").text.strip()
        price = result.select_one(".b-underline").text.strip()

        data.append({
            "title": title,
            "price": price
        })

    print(json.dumps(data, indent=2, ensure_ascii=False))

    if soup.select_one(".arrow-right-icon"):
        page_num += 1
    else:
        break
Example output
Extracting page: 1
----------
[
  {
    "title": "桂格無糖養氣人蔘盒裝19瓶",
    "price": "989 元"
  },
  {
    "title": "DR.WU杏仁酸溫和煥膚精華15ML",
    "price": "800 元"
  },
  {
    "title": "桂格養氣人蔘盒裝19瓶",
    "price": "989 元"
  },
  {
    "title": "天地合補高單位葡萄糖胺飲60mlx18入",
    "price": "939 元"
  },
  {
    "title": "幫寶適超薄乾爽XL號紙尿褲尿布136片裝(68片/包)",
    "price": "1,189 元"
  },
  {
    "title": "耶歐雙氧保養液360ml*3網路獨家品",
    "price": "699 元"
  },
  {
    "title": "得意抽取式花紋衛生紙100抽10包7串(箱)",
    "price": "689 元"
  },
  {
    "title": "老協珍熬雞精14入",
    "price": "1,588 元"
  },
  {
    "title": "桂格活靈芝盒裝19瓶",
    "price": "989 元"
  },
  {
    "title": "善存葉黃素20mg 60錠",
    "price": "689 元"
  },
  {
    "title": "桂格養氣人蔘雞精-雙效滋補盒裝18瓶",
    "price": "799 元"
  },
  {
    "title": "天地合補含鐵玫瑰四物飲12入",
    "price": "585 元"
  },
  {
    "title": "好立善葉黃素軟膠囊30粒",
    "price": "199 元"
  },
  {
    "title": "全久榮75度防疫酒精350ml",
    "price": "45 元"
  },
  {
    "title": "白蘭氏雙認證雞精12入",
    "price": "699 元"
  },
  {
    "title": "保麗淨-假牙黏著劑 無味70g",
    "price": "296 元"
  },
  {
    "title": "義美生醫常順軍益生菌-30入",
    "price": "680 元"
  },
  {
    "title": "克補+鋅加強錠-禮盒(60+30錠) 2入組合",
    "price": "1,249 元"
  },
  {
    "title": "康乃馨寶寶潔膚濕巾超厚型80片2包(屈臣氏獨家)",
    "price": "69 元"
  },
  {
    "title": "天地合補青木瓜四物飲120ml*12瓶入",
    "price": "579 元"
  }
]
Extracting page: 2
----------
[
  {
    "title": "桂格無糖養氣人蔘盒裝19瓶",
    "price": "989 元"
  },
  {
    "title": "DR.WU杏仁酸溫和煥膚精華15ML",
    "price": "800 元"
  },
  {
    "title": "桂格養氣人蔘盒裝19瓶",
    "price": "989 元"
  },
  {
    "title": "天地合補高單位葡萄糖胺飲60mlx18入",
    "price": "939 元"
  },
  {
    "title": "幫寶適超薄乾爽XL號紙尿褲尿布136片裝(68片/包)",
    "price": "1,189 元"
  },
  # ...
]
product_links = browser.find_elements(By.XPATH, "//div[@class='b-content b-fix-2lines']/b/a")
# Iterate over 'product_links' to get all the 'href' values
for link in product_links:
    print(link.get_attribute('href'))
    browser.get(link.get_attribute('href'))
The problem is that when you do browser.get(), it invalidates the HTML elements referred to by product_links, because they no longer exist in the current page. You should collect all of the 'href' attributes into an array first. One way is with a list comprehension:
links = [link.get_attribute('href') for link in product_links]
Now you can loop over the strings in links to load the product pages.
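A minimal sketch of that loop, reusing the parsing calls from the question (browser and product_links are assumed to exist from the original code, and the class names are taken from it as well):
from bs4 import BeautifulSoup

# Collect the href strings first, before navigating away from the listing page
links = [link.get_attribute('href') for link in product_links]

products = []
for href in links:
    browser.get(href)
    soup = BeautifulSoup(browser.page_source, "html.parser")
    product = {}
    product['商品名稱'] = soup.find('div', class_="b-subarea b-layout-right shop-item ng-scope").h1.text.replace('\n', '')
    product['價錢'] = soup.find('strong', class_="b-text-xlarge qa-product-actualPrice").text.replace('\n', '')
    products.append(product)

print(products)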
With that said, you should look at the Scrapy library, which can do a lot of this heavy lifting for you.

Getting a request error 422 upon get request

I have seen some posts about this, but most of them used requests.post, and I want a GET request.
I am making a program that simply goes to a URL and gets a list of orders.
This is the example response:
{
    "page": 1,
    "pageCount": 22,
    "orderCount": 214,
    "orders": [
        {
            "id": "c1497823-370c-4c7a-82cd-dacddb36fc30",
            "productId": "1a641ba5-38df-4acb-86f7-f5c031e538a0",
            "email": "demoemail@autobuy.io",
            "ipAddress": "127.0.0.1",
            "total": 18,
            "currency": "USD",
            "gateway": "Stripe",
            "isComplete": true,
            "createdAtUtc": "2019-10-15T01:44:10.5446599+00:00"
        },
        {
            "id": "228f4ca4-5001-4c19-8350-f960f13d35a7",
            "productId": "a0041cc0-2bc6-40a2-9084-5880bae5ecec",
            "email": "demoemail@autobuy.io",
            "ipAddress": "127.0.0.1",
            "total": 50,
            "currency": "USD",
            "gateway": "Stripe",
            "isComplete": true,
            "createdAtUtc": "2019-10-15T01:43:17.8322919+00:00"
        },
        {
            "id": "71aed9b2-4bd2-4a49-9e6a-82119e6e05bf",
            "productId": "2e0ac75b-bfea-42f1-ad60-ead17825162a",
            "email": "demoemail@autobuy.io",
            "ipAddress": "127.0.0.1",
            "total": 6,
            "currency": "USD",
            "gateway": "Stripe",
            "isComplete": true,
            "createdAtUtc": "2019-10-14T23:54:44.6217478+00:00"
        }
    ]
}
This is what I get:
<Response [422]>
My code:
import requests
url = "https://autobuy.io/api/Orders?page=1"
headers = {
"Cache-Control": "no-cache",
"Pragma": "no-cache",
"APIKey": "<APIKEY>"
}
req = requests.get(url, headers=headers)
print(req)
Documentation for the API I'm trying to use: https://api.autobuy.io/?version=latest
Edit: on their site it says
HEADERS
APIKey: 2cdbdc48-b297-41ad-a234-329db0d2dbea
AutoBuy API Key found in your shop settings
but when I remove my Cache-Control and Pragma headers, I get an error that has to do with my headers being cached (because the site is behind a CDN?).
It ended up being a bug in the site; my dashboard was showing an invalid key.
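For future readers hitting a 422 like this, here is a small sketch of how to inspect what the server is actually rejecting; the response body of a 422 usually says which field or header was unprocessable:
import requests

url = "https://autobuy.io/api/Orders?page=1"
headers = {"APIKey": "<APIKEY>"}

req = requests.get(url, headers=headers, timeout=30)
print(req.status_code)   # e.g. 422
print(req.text)          # the body normally describes which part was unprocessable
req.raise_for_status()   # raises an HTTPError for 4xx/5xx once the body has been inspected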
