trouble logging in to LinkedIn with requests_html - python

I'm trying to scrape data from linkedin. to do that I'll need to login, the issue is that I get an error in the html tag. here is the code.
url = 'https://www.linkedin.com/checkpoint/lg/login-submit'
s = AsyncHTMLSession()
headers = {
'User_Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36'
}
payload = {
'session_key': key,
'session_password': password
}
r = await s.post(url, data = payload, headers = headers)
print(r.html)
traceback:
<html url='https://www.linkedin.com/checkpoint/lg/login?errorKey=unexpected_error'>
EDIT:
I realized that it was probably a rendered page or some kind of trickery, so I used selenium. can't be stubborn with code.

Related

python-requests CSRF verification failed

I'm trying to use a python script to access on a website
But it tells me CSRF verification failed. Request aborted. The python-requests code is as followed:
import requests
payload = {
'inUserName': 'NAN',
'inUserPass': 'NAN'
}
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like
Gecko) Chrome/103.0.0.0 Safari/537.36',
'From': 'https://pruebita.gob.pe/accounts/login/',
}
with requests.Session() as s:
p = s.post('https://pruebita.gob.pe/accounts/login/', data=payload, headers=headers)
print (p.text)
r = s.get('A protected web page url')
print (r.text)

Can't fetch tabular content from a webpage using requests

I would like to scrape tabular content from the landing page of this website. There are 100 rows in it's first page. When I observe network activity in dev tools, I could notice that some get requests is being issued to this url https://io6.dexscreener.io/u/ws3/screener3/ with appropriate parameters which ends up producing json content.
However, when I try to mimic that requests through my following efforts:
import requests
url = 'https://io6.dexscreener.io/u/ws3/screener3/'
params = {
'EIO': '4',
'transport': 'polling',
't': 'NwYSrFK',
'sid': 'ztAOHWOb-1ulTq-0AQwi',
}
headers = {
'accept': '*/*',
'referer': 'https://dexscreener.com/',
'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
}
with requests.Session() as s:
s.headers.update(headers)
res = s.get(url,params=params)
print(res.content)
I get this response:
`{"code":3,"message":"Bad request"}`
How can I get response having tabular content from that webpage?
Here is a very quick and dirty piece of python code that does the initial handshake and sets up the websocket connection and downloads the data in json format infinitely. I haven't tested this code extensively and I am not sure exactly what is necessary or not (in terms of the steps in the handshake) but I have mimicked the browser behaviour and it seems to work fine:
import requests
from websocket import create_connection
import json
s = requests.Session()
headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
url = 'https://dexscreener.com/ethereum'
resp = s.get(url,headers=headers)
print(resp)
step1 = s.get('https://io3.dexscreener.io/u/ws3/screener3/?EIO=4&transport=polling&t=Nwof-Os')
step2 = s.get('https://io4.dexscreener.io/u/ws3/screener3/?EIO=4&transport=polling&t=Nwof-S5')
obj = json.loads(step2.text[1:])
code = obj['sid']
payload = '40/u/ws/screener/consolidated/platform/ethereum/h1/top/1,'
step3 = s.post(f'https://io4.dexscreener.io/u/ws3/screener3/?EIO=4&transport=polling&t=Nwof-Xt&sid={code}',data=payload)
step4 = s.get(f'https://io4.dexscreener.io/u/ws3/screener3/?EIO=4&transport=polling&t=Nwof-Xu&sid={code}')
d = step4.text.replace('','').replace('42/u/ws/screener/consolidated/platform/ethereum/h1/top/1,','').replace(payload,'')
start = '["screener",'
end = ']["latestBlock",'
dirty = d[d.find(start)+len(start):d.rfind(end)].strip()
clean = json.loads(dirty)
print(clean)
# Initialize the headers needed for the websocket connection
headers = json.dumps({
'Accept-Encoding':'gzip, deflate, br',
'Accept-Language':'en-ZA,en;q=0.9,en-GB;q=0.8,en-US;q=0.7,de;q=0.6',
'Cache-Control':'no-cache',
'Connection':'Upgrade',
'Host':'io3.dexscreener.io',
'Origin':'https://dexscreener.com',
'Pragma':'no-cache',
'Sec-WebSocket-Extensions':'permessage-deflate; client_max_window_bits',
'Sec-WebSocket-Key':'ssklBDKxAOUt3D47SoEttQ==',
'Sec-WebSocket-Version':'13',
'Upgrade':'websocket',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
})
# Then create a connection to the tunnel
ws = create_connection(f"wss://io4.dexscreener.io/u/ws3/screener3/?EIO=4&transport=websocket&sid={code}",headers=headers)
# Then send the initial messages through the tunnel
ws.send('2probe')
ws.send('5')
# Here you will view the message return from the tunnel
while True:
try:
json_data = json.loads(ws.recv().replace('42/u/ws/screener/consolidated/platform/ethereum/h1/top/1,',''))
print(json_data)
except:
pass

How do you grab tokens in request headers using python requests

In the request headers when logging in, there's a header called "cookie" that changes every time, how would I grab that each time and put it in the headers using python requests?
screenshot of network tab in chrome
Heres my code:
import requests
import time
proxies = {
"http": "http://us.proxiware.com:2000"
}
login_data = {'op':'login-main', 'user':'UpbeatPark', 'passwd':'Testingreddit123', 'api_type':'json'}
comment_data = {'thing_id':'t3_gluktj', 'text':'epical. redditor', 'id':'#form-t3_gluktjbx2', 'r':'gaming','renderstyle':'html'}
s = requests.Session()
s.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/82.0.4085.6 Safari/537.36'})
r = s.get('https://old.reddit.com/', proxies=proxies)
time.sleep(2)
r = s.post('https://old.reddit.com/api/login/UpbeatPark', proxies=proxies, data=login_data)
print(r.text)
here's the output (I know for a fact it is the correct password):
{"json": {"errors": [["WRONG_PASSWORD", "wrong password", "passwd"]]}}
This worked for me:
import requests
login_data = {
"op": "login-main",
"user": "USER",
"passwd": "PASS",
"api_type": "json",
}
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/82.0.4085.6 Safari/537.36",
}
s = requests.Session()
r = s.post("https://old.reddit.com/api/login/USER", headers=headers, data=login_data)
print(r.text)
It seems exactly like the code you are using but without proxy. Can you try to turn it off? The proxy might block cookies.

How can I check that I logged in correctly with python requests?

I'm pretty new in python requests, and I have made a simple program, just to login, in netflix.
Here's my code.
url = 'https://www.netflix.com/login'
headers = {
'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/80.0.3987.149 Safari/537.36'
}
r = requests.post(url, data={'userLoginId':'my-email', 'password': 'my-password'}, headers=headers)
print(r.status_code)
The output of status code is 200, so it's right
You could use a Session to check the results of your login.
s = requests.Session()
r = s.post(url, data={'userLoginId':'my-email', 'password': 'my-password'}, headers=headers)
print(r.status_code)
Then you could check the cookies for your session with:
s.cookies.getdict()
Another example in this question:
Using Python Requests: Sessions, Cookies, and POST

WEB SCRAPING - python requests session not able to gather data

I've seen some similar threads but neither gave me the answer. I simply need to get html content from one website. I'm sending the POST request with data for particular case and then using GET requests I want to scrape the text from html. The problem is that I always receive the first page's content. Not sure what I am doing wrong.
import requests
headers = {
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
'Accept-Encoding':'gzip, deflate, br',
'Accept-Language':'pl-PL,pl;q=0.9,en-US;q=0.8,en;q=0.7',
'Connection':'keep-alive',
'Content-Type':'application/x-www-form-urlencoded',
'Origin':'https://przegladarka-ekw.ms.gov.pl',
'Referer':'https://przegladarka-ekw.ms.gov.pl/eukw_prz/KsiegiWieczyste/wyszukiwanieKW',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
}
data = {
'kodWydzialu':'PT1R',
'nrKw':'00037314',
'cyfraK':'9',
}
url = 'https://przegladarka-ekw.ms.gov.pl/eukw_prz/KsiegiWieczyste/wyszukiwanieKW'
r = requests.session()
r.post(url, data=data, headers=headers)
final_content = r.get(url, headers=headers)
print(final_content.text)
The GET requests come from ("https://przegladarka-ekw.ms.gov.pl/eukw_prz/eukw201906070952/js/jquery-1.11.0_min.js
") but it returns a wall of code. My goal is to scrape the page which appears after providing the data from above to search menu.
try this
import json
import urllib.request
headers = {
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
'Accept-Encoding':'gzip, deflate, br',
'Accept-Language':'pl-PL,pl;q=0.9,en-US;q=0.8,en;q=0.7',
'Connection':'keep-alive',
'Content-Type':'application/x-www-form-urlencoded',
'Origin':'https://przegladarka-ekw.ms.gov.pl',
'Referer':'https://przegladarka-ekw.ms.gov.pl/eukw_prz/KsiegiWieczyste/wyszukiwanieKW',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
}
data = {
'kodWydzialu':'PT1R',
'nrKw':'00037314',
'cyfraK':'9',
}
url = 'https://przegladarka-ekw.ms.gov.pl/eukw_prz/KsiegiWieczyste/wyszukiwanieKW'
r=urllib.request.urlopen(url, data=bytes(json.dumps(data), encoding="utf-8"))
final_content = r
for i in r:
print(i)

Categories