I want to scrape Facebook company pages for their data (if they have any).
The problem is that when I try to retrieve the HTML, I get the Hebrew version of it (I'm located in Israel).
This is part of the result:
�1u�9X�/.������~�O+$B\^����y�����e�;�+
Code:
import requests
from bs4 import BeautifulSoup

headers = {
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en;q=0.9,en-US;q=0.8,hi;q=0.7,la;q=0.6',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'referer': 'https',
    'sec-fetch-mode': 'no-cors',
    'sec-fetch-site': 'cross-site',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
}

url = 'https://www.facebook.com/pg/google/about/'

def fetch(URL):
    try:
        response = requests.get(url=URL, headers=headers).text
        print(response)
    except requests.RequestException:
        print('Could not retrieve data, or connect')

fetch(url)
Is there a way to request the English version of the site? Is there a subdomain for it, or should I use a proxy in the request?
What you are seeing isn't the Hebrew version of the site, but a compressed response from the server. As a quick solution, you can remove the accept-encoding header from the request:
import requests
from bs4 import BeautifulSoup

headers = {
    'accept': '*/*',
    # 'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en;q=0.9,en-US;q=0.8,hi;q=0.7,la;q=0.6',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'referer': 'https',
    'sec-fetch-mode': 'no-cors',
    'sec-fetch-site': 'cross-site',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
}

url = 'https://www.facebook.com/pg/google/about/'

def fetch(URL):
    try:
        response = requests.get(url=URL, headers=headers).text
        print(response)
    except requests.RequestException:
        print('Could not retrieve data, or connect')

fetch(url)
Prints the uncompressed page:
<!DOCTYPE html>
<html lang="en" id="facebook" class="no_js">
<head><meta charset="utf-8" /><meta name="referrer" content="origin-when-crossorigin" id="meta_referrer" /><script>window._cstart=+new Date();</script><script>function envFlush(a){function b(b){for(var c in a)b[
...and so on.
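A note on why this works: requests decompresses gzip and deflate responses transparently, but it only decodes br (Brotli) bodies when a Brotli library is available, so advertising 'br' without one yields binary garbage like the output above. An alternative sketch, assuming the brotli package has been installed (pip install brotli), keeps the header as it was:

# pip install brotli
# With a Brotli decoder available, requests (via urllib3) should decode
# 'Content-Encoding: br' responses transparently.
import requests

resp = requests.get(url, headers=headers)  # headers dict from above, 'br' included
print(resp.headers.get('content-encoding'))  # e.g. 'br' or 'gzip'
print(resp.text[:200])                       # readable HTML, not compressed bytes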
I want to use a proxy with Python web requests. To test whether my request is working, I send a request to jsonip.com. The response returns my real IP instead of the proxy's. The website providing the proxy also says "no activity". Am I connecting to the proxy correctly? Here is the code:
import time, requests, random
from requests.auth import HTTPProxyAuth

auth = HTTPProxyAuth("muyjgovw", "mtpysgrb3nkj")

def reqs():
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        # 'Accept-Encoding': 'gzip, deflate, br',
        'Referer': 'https://www.google.com/',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'cross-site',
        'Sec-Fetch-User': '?1',
    }
    prox = [{"http": "http://64.137.58.19:6265"}]
    proxies = random.choice(prox)
    response = requests.get('https://jsonip.com/', headers=headers, proxies=proxies)
    print(response.status_code)
    print(response.json())

reqs()
[Screenshot of the proxy provider's dashboard showing no activity]
You have to add an "https" entry as well. Your proxy dict only has an "http" key, and requests picks the proxy by URL scheme, so the request to the https URL bypasses the proxy entirely:
import time, requests, random
from requests.auth import HTTPProxyAuth

auth = HTTPProxyAuth("muyjgovw", "mtpysgrb3nkj")

def reqs():
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        # 'Accept-Encoding': 'gzip, deflate, br',
        'Referer': 'https://www.google.com/',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'cross-site',
        'Sec-Fetch-User': '?1',
    }
    # Route both http and https traffic through the proxy.
    prox = [{"http": "http://64.137.58.19:6265",
             "https": "http://64.137.58.19:6265"}]
    proxies = random.choice(prox)
    response = requests.get('https://jsonip.com/', headers=headers, proxies=proxies)
    print(response.status_code)
    print(response.json())

reqs()
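Note also that the HTTPProxyAuth object above is created but never passed to the request, so the proxy credentials are never sent. A minimal sketch of one way to fix that, assuming the username and password shown belong to this proxy: requests accepts credentials embedded directly in the proxy URL.

# Proxy credentials embedded in the URL; no separate auth object needed.
proxies = {
    "http":  "http://muyjgovw:mtpysgrb3nkj@64.137.58.19:6265",
    "https": "http://muyjgovw:mtpysgrb3nkj@64.137.58.19:6265",
}
response = requests.get('https://jsonip.com/', proxies=proxies)
print(response.json())  # should now report the proxy's IP, not yours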
Scraping an AJAX web page using python and requests
I used the script in the above link to get a table on the Barchart website, and it somehow stopped working recently with the error message {'error': {'message': 'The payload is invalid.', 'code': 400}}. I guess some of the field names have changed, but I am pretty new to web scraping and couldn't figure out how to fix it. Any suggestions?
import requests

geturl = r'https://www.barchart.com/futures/quotes/CLJ19/all-futures'
apiurl = r'https://www.barchart.com/proxies/core-api/v1/quotes/get'

getheaders = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}

getpay = {
    'page': 'all'
}

s = requests.Session()
r = s.get(geturl, params=getpay, headers=getheaders)

headers = {
    'accept': 'application/json',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'referer': 'https://www.barchart.com/futures/quotes/CLJ19/all-futures?page=all',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    'x-xsrf-token': s.cookies.get_dict()['XSRF-TOKEN']
}

payload = {
    'fields': 'symbol,contractSymbol,lastPrice,priceChange,openPrice,highPrice,lowPrice,previousPrice,volume,openInterest,tradeTime,symbolCode,symbolType,hasOptions',
    'list': 'futures.contractInRoot',
    'root': 'CL',
    'meta': 'field.shortName,field.type,field.description',
    'hasOptions': 'true',
    'raw': '1'
}

r = s.get(apiurl, params=payload, headers=headers)
j = r.json()
print(j)
OUT: {'error': {'message': 'The payload is invalid.', 'code': 400}}
This happened to me too. The website serves the table from an internal API, and the XSRF-TOKEN cookie has to be decoded to avoid this error.
Try this solution:
1- import the unquote function at the beginning of your code
from urllib.parse import unquote
2- Change this line:
'x-xsrf-token': s.cookies.get_dict()['XSRF-TOKEN']
to this:
'x-xsrf-token': unquote(unquote(s.cookies.get_dict()['XSRF-TOKEN']))
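The double unquote is needed because the XSRF-TOKEN cookie arrives percent-encoded twice: '=' becomes '%3D' on the first pass and '%253D' on the second. A quick illustration (the token value here is made up):

from urllib.parse import unquote

cookie_value = 'eyJpdiI6%253D%253D'   # hypothetical doubly-encoded token
once = unquote(cookie_value)          # 'eyJpdiI6%3D%3D'
twice = unquote(once)                 # 'eyJpdiI6=='
print(twice)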
I have a script like this (url.py):
import requests

headers = {
    'authority': 'www.spain.com',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-US,en;q=0.9,pt;q=0.8',
}

links = ['https://www.spain.com']

for url in links:
    page = requests.get(url, headers=headers)
    print(page)
Output:
ubuntu#OS-Ubuntu:/mnt/$ python3 url.py
<Response [200]>
I need the links list to be filled in automatically, because I will receive a text file (domain.txt) with domains like this:
www.spain.com
www.uk.com
www.italy.com
I want the Python script to be generic and reusable: I would just add more domains to domain.txt, then run url.py, and it would automatically make the request for every domain in domain.txt.
Can you help me with that?
Assuming url.py is located in the same directory as domain.txt, you can open the file and read each link into a list using:

with open('domain.txt', 'r') as f:
    links = f.read().splitlines()
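One more detail: the lines in domain.txt carry no scheme, and requests needs full URLs, so it helps to prepend https:// while building the list. A sketch of the complete loop under that assumption, reusing the headers dict from the question:

import requests

with open('domain.txt', 'r') as f:
    # Skip blank lines and prepend the scheme requests expects.
    links = ['https://' + line.strip() for line in f if line.strip()]

for url in links:
    page = requests.get(url, headers=headers)  # headers dict from above
    print(url, page.status_code)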
import requests
import json

api_url = "https://en.coinjinja.com/api/events/search"

headers = {
    'origin': 'https://en.coinjinja.com',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'ko-KR,ko;q=0.9,en-US;q=0.8,en;q=0.7',
    'authority': 'en.coinjinja.com',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36',
    'content-type': 'application/json',
    'content-length': '126',
    'accept': '*/*',
    'referer': 'https://en.coinjinja.com/events/time/next_week/tags/hardfork+airdrop+burn+exchange+partnership'
}

data = {"start": "2020-02-17", "end": "2020-02-24", "symbol": "",
        "types": ["hardfork", "airdrop", "burn", "exchange", "partnership"]}

api_request = requests.post(api_url, headers=headers, data=json.dumps(data))
print(api_request.headers)
print(api_request.encoding)
print(api_request.content.decode('utf-8', 'ignore'))
You need to install the brotli package to work with 'Content-Encoding': 'br'. This is a duplicate of unable to decode Python web request.
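A minimal sketch of the fix, assuming pip install brotli has been run. With a Brotli decoder available, requests (via urllib3) should decode the 'br' body transparently; the package also exposes a decompress function if you are handling raw compressed bytes yourself:

# Option 1: with the brotli package installed, the response body is
# decoded transparently and .text just works:
api_request = requests.post(api_url, headers=headers, data=json.dumps(data))
print(api_request.text)

# Option 2: decompress a still-compressed body manually:
import brotli
print(brotli.decompress(api_request.content).decode('utf-8'))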
Completely new to Python and trying to get stuck in, but I'm struggling with requests. I run a node for a small cryptocurrency project and am trying to create a Python script that scrapes my wallet value and telegrams it to me once a day. I've managed the Telegram bot, and I've practiced with BeautifulSoup to pull values out of a page source fine; it's just getting a response that contains my balance that's frustrating me.
Here's the URL with my balance on: https://www.hpbscan.org/address/0x7EC332476fCA4Bcd20176eE06F16960b5D49333e/
The value obviously changes, so I don't think I can just do a GET request for the above page and parse it with Beautiful Soup. So I loaded up Developer Tools and saw that there was a POST request:
METHOD: POST
URL: https://www.hpbscan.org/HpbScan/addrs/getAddressDetailInfo
Request Headers:
Host: www.hpbscan.org
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0
Accept: */*
Accept-Language: en-GB,en;q=0.5
Accept-Encoding: gzip, deflate, br
X-Requested-With: XMLHttpRequest
Content-Type: application/json;charset=utf-8
Content-Length: 46
DNT: 1
Connection: keep-alive
Referer: https://www.hpbscan.org/address/0x7EC332476fCA4Bcd20176eE06F16960b5D49333e/
Pragma: no-cache
Cache-Control: no-cache
Request Body:
["0x7EC332476fCA4Bcd20176eE06F16960b5D49333e"]
The response (at least in a browser) is JSON formatted data that does indeed contain the balance I need.
Here's where I got to so far trying to recreate the above request:
import requests
import json
url = "https://www.hpbscan.org/HpbScan/addrs/getAddressDetailInfo"
payload = '["0x7EC332476fCA4Bcd20176eE06F16960b5D49333e"]'
headers = """
'Host': 'www.hpbscan.org'
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'
'Accept': '*/*'
'Accept-Language': 'en-GB,en;q=0.5'
'Accept-Encoding': 'gzip, deflate, br'
'X-Requested-With': 'XMLHttpRequest'
'Content-Type': 'application/json;charset=utf-8'
'Content-Length': '46'
'DNT': '1'
'Connection': 'keep-alive'
'Referer': 'https://www.hpbscan.org/address/0x7EC332476fCA4Bcd20176eE06F16960b5D49333e/'
'Pragma': 'no-cache'
'Cache-Control': 'no-cache'
"""
data = requests.post(url, data=payload, headers=headers)
print(data.text)
I've never used requests before, so I'm a bit in the dark. I've tried fiddling with things based on what I can see other people doing, but it's no use; currently I'm getting "AttributeError: 'str' object has no attribute 'items'".
I'd imagine it's something along the lines of me not specifying the request headers and body correctly, or maybe because the response is in JSON format, which my code can't understand?
Any help would be massively appreciated :)
You should change "headers" from a string to a dict. Here is your final code:
import requests
import json

url = "https://www.hpbscan.org/HpbScan/addrs/getAddressDetailInfo"
payload = '["0x7EC332476fCA4Bcd20176eE06F16960b5D49333e"]'

headers = {
    'Host': 'www.hpbscan.org',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0',
    'Accept': '*/*',
    'Accept-Language': 'en-GB,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'X-Requested-With': 'XMLHttpRequest',
    'Content-Type': 'application/json;charset=utf-8',
    'Content-Length': '46',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Referer': 'https://www.hpbscan.org/address/0x7EC332476fCA4Bcd20176eE06F16960b5D49333e/',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}

data = requests.post(url, data=payload, headers=headers)
print(data.text)
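One caveat worth adding: requests computes Content-Length from the request body automatically, so the hard-coded 'Content-Length': '46' entry is redundant and can break the call if the payload ever changes length. It is generally safer to drop it, and the same goes for 'Host' and 'Connection', which requests manages itself.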
The headers should be a dictionary:
import requests
import json

url = "https://www.hpbscan.org/HpbScan/addrs/getAddressDetailInfo"
payload = '["0x7EC332476fCA4Bcd20176eE06F16960b5D49333e"]'

headers = {
    'Host': 'www.hpbscan.org',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0',
    'Accept': '*/*',
    'Accept-Language': 'en-GB,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'X-Requested-With': 'XMLHttpRequest',
    'Content-Type': 'application/json;charset=utf-8',
    'Content-Length': '46',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Referer': 'https://www.hpbscan.org/address/0x7EC332476fCA4Bcd20176eE06F16960b5D49333e/',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}

data = requests.post(url, data=payload, headers=headers)
print(json.loads(data.text))  # parse the JSON body, not the Response object itself
The final bit converts the JSON response you get from the server into a Python dictionary, so you can continue to make use of it within your code.
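Equivalently, requests can do the parsing for you; Response.json() is the idiomatic shorthand for json.loads(response.text):

response = requests.post(url, data=payload, headers=headers)
print(response.json())  # same dictionary as json.loads(response.text)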