I wanted to get the Fear and Greed Index from 'https://alternative.me/crypto/' and I found the URL the XHR data comes from using the Chrome network tool, but when I try to scrape it I get a 404 or 500 error page.
import requests

url = 'https://alternative.me/api/crypto/fear-and-greed-index/history'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
    'Referer': 'https://alternative.me/crypto/fear-and-greed-index/',
    'Accept-Language': 'ko-KR,ko;q=0.9,en-US;q=0.8,en;q=0.7',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept': 'application/json, text/plain, */*'
}
res = requests.post(url, headers=headers)
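For what it's worth, Alternative.me also exposes a documented public JSON API for this index, which sidesteps the browser-only XHR endpoint entirely. A minimal sketch, assuming the https://api.alternative.me/fng/ endpoint and its limit parameter from their public API docs (neither appears in the question itself):

import requests

# Public Fear & Greed Index API endpoint (assumed from Alternative.me's docs);
# 'limit' controls how many days of history are returned.
res = requests.get('https://api.alternative.me/fng/', params={'limit': 10})
res.raise_for_status()
for entry in res.json()['data']:
    print(entry['timestamp'], entry['value'], entry['value_classification'])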
I'm facing an issue accessing this URL via Python code:
https://www1.nseindia.com/content/historical/EQUITIES/2021/JAN/cm01JAN2021bhav.csv.zip
This was working for the last 3 years, until 31-Dec-2020. It seems the site has implemented some restrictions.
There's a solution for a similar problem in VB here: NSE ACCESS DENIED. The addition made there is:
"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11"
"Referer": "https://www1.nseindia.com/products/content/equities/equities/archieve_eq.htm"
The original code is here:
https://github.com/naveen7v/Bhavcopy/blob/master/Bhavcopy.py
It's not working even after adding the following in the requests section:
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11'}
# print(Path)
a = requests.get(Path, headers=headers)  # headers must be a keyword argument; passed positionally it becomes params
Can someone help?
import io
import zipfile
import requests

# month_dict is referenced but never defined in the original post; this is an
# assumed mapping from the MM part of a DD-MM-YYYY date to NSE's month code.
month_dict = {'01': 'JAN', '02': 'FEB', '03': 'MAR', '04': 'APR',
              '05': 'MAY', '06': 'JUN', '07': 'JUL', '08': 'AUG',
              '09': 'SEP', '10': 'OCT', '11': 'NOV', '12': 'DEC'}

def download_bhavcopy(formated_date):
    # formated_date is expected as DD-MM-YYYY, e.g. '01-01-2021'
    url = "https://www1.nseindia.com/content/historical/DERIVATIVES/{0}/{1}/fo{2}{1}{0}bhav.csv.zip".format(
        formated_date.split('-')[2],
        month_dict[formated_date.split('-')[1]],
        formated_date.split('-')[0])
    print(url)
    hdr = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
        'Accept-Language': 'en-IN,en;q=0.9,en-GB;q=0.8,en-US;q=0.7,hi;q=0.6',
        'Connection': 'keep-alive',
        'Host': 'www1.nseindia.com',
        'Cache-Control': 'max-age=0',
        'Referer': 'https://www1.nseindia.com/products/content/derivatives/equities/fo.htm',
    }
    # Session-specific Akamai cookie copied from a browser session; it expires.
    cookie_dict = {'bm_sv': 'E2109FAE3F0EA09C38163BBF24DD9A7E~t53LAJFVQDcB/+q14T3amyom/sJ5dm1gV7z2R0E3DKg6WiKBpLgF0t1Mv32gad4CqvL3DIswsfAKTAHD16vNlona86iCn3267hHmZU/O7DrKPY73XE6C4p5geps7yRwXxoUOlsqqPtbPsWsxE7cyDxr6R+RFqYMoDc9XuhS7e18='}
    session = requests.session()
    for name, value in cookie_dict.items():
        session.cookies.set(name, value)
    response = session.get(url, headers=hdr)
    if response.status_code == 200:
        print('Success!')
    elif response.status_code == 404:
        print('Not Found.')
    else:
        print('response.status_code ', response.status_code)
    file_name = "none"
    try:
        zipT = zipfile.ZipFile(io.BytesIO(response.content))
        zipT.extractall()
        file_name = zipT.filelist[0].filename
        print('file name ' + file_name)
    except zipfile.BadZipFile:
        print('Error: Zip file is corrupted')
    except zipfile.LargeZipFile:  # raised when Zip64 support is needed but not enabled
        print('Error: File size is too large')
    print(file_name)
    return file_name
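A quick usage sketch (the DD-MM-YYYY input format is inferred from how the function splits the date; nothing else here comes from the original post):

csv_name = download_bhavcopy('01-01-2021')  # fetches fo01JAN2021bhav.csv.zip
print('Extracted:', csv_name)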
Inspect the link in your web browser's developer tools and find the GET request for the download link. Go to Headers and check User-Agent, e.g.:
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0
Now modify your code as:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0'
}
result = requests.get(URL, headers=headers)  # URL is the download link found above
I'm trying to submit info to this site: https://cxkes.me/xbox/xuid
The info: e = {'gamertag': "Xi Fall iX"}
Every time I try, I get WinError 10054. I can't seem to find a fix for this.
My code:
import urllib.parse
import urllib.request
import json
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
url = "https://cxkes.me/xbox/xuid"
e = {'gamertag' : "Xi Fall iX"}
f = {'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
     'accept-encoding': "gzip, deflate, br",
     'accept-language': "en-GB,en-US;q=0.9,en;q=0.8",
     'cache-control': "max-age=0",
     'content-length': "76",
     'content-type': "application/x-www-form-urlencoded",
     'cookie': "__cfduid=d2f371d250727dc4858ad1417bdbcfba71593253872; XSRF-TOKEN=eyJpdiI6IjVcL2dHMGlYSGYwd3ZPVEpTRGlsMnFBPT0iLCJ2YWx1ZSI6InA4bDJ6cEtNdzVOT3UxOXN4c2lcLzlKRTlYaVNvZjdpMkhqcmllSWN3eFdYTUxDVHd4Y2NiS0VqN3lDSll4UDhVMHM1TXY4cm9lNzlYVGE0dkRpVWVEZz09IiwibWFjIjoiYjdlNjU3ZDg3M2Y0MDBlZDY3OWE5YTdkMWUwNGRiZTVkMTc5OWE1MmY1MWQ5OTQ2ODEzNzlhNGFmZGNkZTA1YyJ9; laravel_session=eyJpdiI6IjJTdlFhK0dacFZ4cFI5RFFxMHgySEE9PSIsInZhbHVlIjoia2F6UTJXVmNSTEt1M3lqekRuNVFqVE5ZQkpDang4WWhraEVuNm0zRmlVSjVTellNTDRUb1wvd1BaKzNmV2lISGNUQ0l6Z21jeFU3VlpiZzY0TzFCOHZ3PT0iLCJtYWMiOiIwODU3YzMxYzg2N2UzMjdkYjcxY2QyM2Y4OTVmMTY1YTcxZTAxZWI0YTExZDE0ZjFhYWI2NzRlODcyOTg3MjIzIn0%3D",
     'origin': "https://cxkes.me",
     'referer': "https://cxkes.me/xbox/xuid",
     'sec-fetch-dest': "document",
     'sec-fetch-mode': "navigate",
     'sec-fetch-site': "same-origin",
     'sec-fetch-user': "?1",
     'upgrade-insecure-requests': "1",
     'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"}
data = urllib.parse.urlencode(e)
data = json.dumps(data)  # note: this wraps the already form-encoded string in JSON quotes
data = str(data)
data = data.encode('ascii')
req = urllib.request.Request(url, data, f)
with urllib.request.urlopen(req) as response:
    the_page = response.read()
print(the_page)
Having run the code, I get the following error:
[WinError 10054] An existing connection was forcibly closed by the remote host
That could be caused by any of the following:
- the network link between server and client temporarily going down;
- running out of system resources;
- sending malformed data.
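On the malformed-data point: the code above form-encodes the body and then runs it through json.dumps, so the server receives a JSON-quoted string while Content-Type claims application/x-www-form-urlencoded. A minimal sketch of the same request without the double encoding (header values trimmed to the essentials; whether the site also demands a valid CSRF token is not something the question settles):

import urllib.parse
import urllib.request

url = "https://cxkes.me/xbox/xuid"
e = {'gamertag': "Xi Fall iX"}

# Form-encode exactly once; do not json.dumps() an already urlencoded string.
data = urllib.parse.urlencode(e).encode('ascii')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded',
}
req = urllib.request.Request(url, data=data, headers=headers)
with urllib.request.urlopen(req) as response:
    print(response.read())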
I am not sure what you are trying to achieve here entirely, but if your aim is simply to read the XUID of a gamertag, use a web automator like Selenium to retrieve that value.
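A minimal Selenium sketch of that idea; the form field name and submit-button selector are assumptions about the page's markup, not taken from the question:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://cxkes.me/xbox/xuid")

# Assumed field name "gamertag" and a submit button; adjust to the real markup.
driver.find_element(By.NAME, "gamertag").send_keys("Xi Fall iX")
driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()

print(driver.page_source)  # the XUID should appear in the rendered result page
driver.quit()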
I've seen some similar threads, but none gave me the answer. I simply need to get the HTML content from one website. I send a POST request with the data for a particular case, and then with a GET request I want to scrape the text from the HTML. The problem is that I always receive the first page's content. Not sure what I am doing wrong.
import requests

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'pl-PL,pl;q=0.9,en-US;q=0.8,en;q=0.7',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Origin': 'https://przegladarka-ekw.ms.gov.pl',
    'Referer': 'https://przegladarka-ekw.ms.gov.pl/eukw_prz/KsiegiWieczyste/wyszukiwanieKW',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
}
data = {
    'kodWydzialu': 'PT1R',
    'nrKw': '00037314',
    'cyfraK': '9',
}
url = 'https://przegladarka-ekw.ms.gov.pl/eukw_prz/KsiegiWieczyste/wyszukiwanieKW'
r = requests.session()
r.post(url, data=data, headers=headers)  # the POST response is discarded here
final_content = r.get(url, headers=headers)
print(final_content.text)
The GET requests go to ("https://przegladarka-ekw.ms.gov.pl/eukw_prz/eukw201906070952/js/jquery-1.11.0_min.js"), but that returns a wall of code. My goal is to scrape the page which appears after submitting the data above through the search menu.
try this
import urllib.parse
import urllib.request

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'pl-PL,pl;q=0.9,en-US;q=0.8,en;q=0.7',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Origin': 'https://przegladarka-ekw.ms.gov.pl',
    'Referer': 'https://przegladarka-ekw.ms.gov.pl/eukw_prz/KsiegiWieczyste/wyszukiwanieKW',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
}
data = {
    'kodWydzialu': 'PT1R',
    'nrKw': '00037314',
    'cyfraK': '9',
}
url = 'https://przegladarka-ekw.ms.gov.pl/eukw_prz/KsiegiWieczyste/wyszukiwanieKW'
# The endpoint expects form-encoded data (see the Content-Type header above),
# so urlencode the payload rather than JSON-encoding it, and pass the headers
# through a Request object (urlopen on its own would ignore the headers dict).
body = urllib.parse.urlencode(data).encode('utf-8')
req = urllib.request.Request(url, data=body, headers=headers)
with urllib.request.urlopen(req) as final_content:
    for line in final_content:
        print(line)
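Alternatively, sticking with the requests session from the question: a bare GET after the POST simply re-fetches the empty search form, since nothing links the second request to the search. A minimal sketch, assuming the result page is rendered directly in the POST response:

import requests

url = 'https://przegladarka-ekw.ms.gov.pl/eukw_prz/KsiegiWieczyste/wyszukiwanieKW'
data = {'kodWydzialu': 'PT1R', 'nrKw': '00037314', 'cyfraK': '9'}

session = requests.session()
# Keep the POST response instead of discarding it.
result = session.post(url, data=data)
print(result.text)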
This is about scraping a hidden table with BeautifulSoup. As you can see on this website, there is a button "choisissez votre séance", and when you click on it a table is shown. When I inspect the table element I can see the tag that contains attributes like price; however, when I view the website's source code, I can't find this information. The table's code contains 'display: none', which I think is related, but I can't find a solution.
It would appear the page uses AJAX and loads the pricing data in the background. Using Chrome, I pressed F12 and had a look under the Network tab. When I clicked the "choisissez votre séance" button, I noticed a POST to this address:
'https://www.ticketmaster.fr/fr/manifestation/holiday-on-ice-billet/idmanif/446304'
This is great news for you, as you do not need to scrape the HTML data; you simply need to provide the ID (found in the page source) to the API.
In the code below I am:
- requesting the initial page,
- collecting the cookie,
- posting the ID (data) and the cookie we collected,
- returning the JSON data you need for further processing (variable j).
Hope the below helps out!
Cheers,
Adam
import requests
from bs4 import BeautifulSoup

h = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}

s = requests.session()
initial_page_request = s.get('https://www.ticketmaster.fr/fr/manifestation/holiday-on-ice-billet/idmanif/446304', headers=h)
soup = BeautifulSoup(initial_page_request.text, 'html.parser')

# The first <option> of the sessions dropdown carries "..._<idseanc>" in its value.
idseanc = soup.find("select", {"id": "sessionsSelect"})("option")[0]['value'].split("_")[1]

cookies = initial_page_request.cookies.get_dict()
headers = {
    'Origin': 'https://www.ticketmaster.fr',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    'Content-Type': 'application/json; charset=UTF-8',
    'Accept': '*/*',
    'Referer': 'https://www.ticketmaster.fr/fr/manifestation/holiday-on-ice-billet/idmanif/446304',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive',
}
data = {'idseanc': str(idseanc)}
response = s.post('https://www.ticketmaster.fr/planPlacement/FindPrices/connected/false/idseance/2870471', headers=headers, cookies=cookies, data=data)
j = response.json()
Is there a possible way to retrieve the request headers sent for a page such as this:
Accept:*/*
Accept-Encoding:gzip, deflate, br
Accept-Language:en-US,en;q=0.8,id;q=0.6
Connection:keep-alive
Cookie:SPC_IA=-1; SPC_EC=-; SPC_F=oTIBQWRUjdH7sWSqJu1UBBA1o3zy5j1C; REC_T_ID=9af9f032-8d77-11e7-b124-1866da5681e2; SPC_T_ID="VkC0m8L3ZwixZk8y836Lhq4XucdTWJQtxOT+CCjn+u7HhYZ0zEcK/BI8L3dT2/em76AgwKj3p9ysfh7yUnOrq9CqS4lRPFaqLTpEuecgX8U="; SPC_U=-; SPC_T_IV="PA3yjLFENXXf8Tzq685zSg=="; csrftoken=SiNmh7GZo00aZ3a0gxIqEaNjB38zhCQI; bannerShown=true; django_language=id; sessionid=15fgkr8ohrult2zkmgu2xyiwwnm4ejcx; SPC_SC_TK=; UYOMAPJWEMDGJ=; SPC_SC_UD=; SPC_SI=i94582s7ffe99b47y3qomp1siqy4adz5
Host:shopee.co.id
If-None-Match:"75c23fc0e3e55d18c21158ab8a335ab4;gzip"
If-None-Match-:55b03-c56af1c195a559f1680c15f63d56f07a
Referer:https://shopee.co.id/SHARP-LED-TV-24INCH-LC24LE175ITT_sby-Area-Surabaya-i.24413460.298360054
User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36
X-API-SOURCE:pc
X-Requested-With:XMLHttpRequest
I want to store the value of "If-None-Match-: xxxx" in my code without having to copy it manually, so that for every page I open in my for loop, that page's "If-None-Match-: xxxx" value is saved in a variable. I tried using
r = requests.get(url)
r.headers
but that only prints the HTTP response headers. I was wondering if there's a way.
To get the headers of the client-side request, you need res.request.headers:
import requests

res = requests.request('GET', "https://www.google.com")
print(res.request.headers)
Output:
{'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'User-Agent': 'python-requests/2.13.0'}
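Note that requests only reports headers it actually sent; the If-None-Match value a browser sends is just the ETag it remembered from an earlier response for the same URL. A minimal sketch of reproducing that by hand, assuming the server returns an ETag header at all (the URL is the one from the question's Referer):

import requests

url = 'https://shopee.co.id/SHARP-LED-TV-24INCH-LC24LE175ITT_sby-Area-Surabaya-i.24413460.298360054'

first = requests.get(url)
etag = first.headers.get('ETag')  # the value a browser would echo back later

if etag:
    # Send it back as a conditional request; a 304 response means "not modified".
    second = requests.get(url, headers={'If-None-Match': etag})
    print(second.status_code, second.request.headers.get('If-None-Match'))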