I just want to get my location with Python's requests module by retrieving the content of 'http://ipinfo.io/json'.
I planned to fetch the page while the Hola VPN Chrome extension is active, reusing the browser's cookies. But I can't get the content of this website while the VPN extension is enabled, for a reason I don't understand.
code:
import requests
import browser_cookie3
cookies = browser_cookie3.chrome(cookie_file="C:\\Users\\USERNAME\\AppData\\Local\\Google\\Chrome\\User Data\\Profile 1\\Network\\Cookies")
response = requests.get('http://ipinfo.io/json', cookies=cookies)
print(response.content)
Note: I can do it with Selenium, but I'm looking for another way.
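One thing worth knowing: a browser VPN extension only tunnels the browser's own traffic, so a plain requests call will still go out from your real IP no matter which cookies you attach. If you have access to an actual proxy endpoint, requests can be routed through it explicitly. A minimal sketch, where the proxy address is a placeholder you would replace with one you control:

```python
import requests

# Hypothetical proxy address -- replace with a proxy you actually run.
PROXIES = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}

def get_location(proxies=None):
    """Fetch ipinfo.io/json, optionally through an explicit proxy."""
    response = requests.get("http://ipinfo.io/json", proxies=proxies, timeout=10)
    response.raise_for_status()
    return response.json()

# Usage (requires a proxy actually listening on the address above):
# print(get_location(PROXIES))
```

Without the `proxies` argument the call simply reports the machine's own public IP.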
On this page you will find a link to download an xls file (under "Adjuntos", i.e. attachments): https://www.banrep.gov.co/es/emisiones-vigentes-el-dcv
The link to download the xls file is: https://www.banrep.gov.co/sites/default/files/paginas/emisiones/EMISIONES.xls
I was using this code to automatically download that file:
import requests
import os
path = os.path.abspath(os.getcwd()) # where the file will be downloaded
path = path.replace("\\", '/') + '/'
url = 'https://www.banrep.gov.co/sites/default/files/paginas/emisiones/EMISIONES.xls'
myfile = requests.get(url, verify=False)
with open(path + 'EMISIONES.xls', 'wb') as f:
    f.write(myfile.content)
This code was working well, but suddenly the downloaded file started coming back corrupted.
If I run the code, it raises this warning:
InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.banrep.gov.co'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
warnings.warn(
The error is related to how your request is being built: the status code it returns is 403 (Forbidden). You can check it by typing
myfile.status_code
I suspect the security issue involves the cookies and headers of your GET request, so I suggest you look at how the webpage builds the headers of the request it sends for that URL.
TIP: open your web browser's developer tools and, in the Network tab, try to identify the headers.
To solve the cookie issue, look at how to pick up cookies naturally by first requesting a previous page on www.banrep.gov.co through requests.Session:
session_ = requests.Session()
Before coding, you could test your requests with Postman or another REST API testing tool.
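Putting those pieces together, a session-based download might look like the sketch below. The header values are hand-written examples, not values the site is known to require; copy the real ones from your browser's Network tab:

```python
import requests

PAGE_URL = "https://www.banrep.gov.co/es/emisiones-vigentes-el-dcv"
FILE_URL = "https://www.banrep.gov.co/sites/default/files/paginas/emisiones/EMISIONES.xls"

# Example headers -- replace with those your browser actually sends.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": PAGE_URL,
}

def download_xls(path="EMISIONES.xls"):
    with requests.Session() as session:
        session.headers.update(HEADERS)
        # Visit the page first so the server can set any cookies it expects.
        session.get(PAGE_URL, timeout=30)
        response = session.get(FILE_URL, timeout=30)  # verify=True is the default
        response.raise_for_status()  # raises on 403 instead of saving a bad file
        with open(path, "wb") as f:
            f.write(response.content)
    return path

# download_xls()
```

Keeping certificate verification on (the default) also makes the InsecureRequestWarning go away.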
I have to scrape an internal web page of my organization. If I use Beautiful Soup I get
"Unauthorized access"
I don't want to put my username/password in the source code because it will be shared across colleagues.
If I open the same URL in Firefox it doesn't ask me to log in; the problem only appears when I make the same request from a Python script.
Is there a way to share the session used by Firefox with a Python script?
I think my authentication is tied to my PC, because if I log off and delete all cookies, I am logged in again automatically when I return. Do you know why this doesn't happen with my Python script?
When you use the browser to log in to your organization, you provide your credentials and the server returns a cookie tied to your organization's domain. This cookie has an expiration date and lets you navigate your organization's site without having to log in again, as long as it remains valid.
You can read about cookies here:
https://en.wikipedia.org/wiki/HTTP_cookie
Your website scraper does not need to store your credentials. First delete your cookies; then, using your browser's developer tools (look at the Network tab), you can:
- Figure out whether your organization uses a separate auth endpoint
- If it's not evident, ask the IT department
- Use the auth endpoint to get a cookie from credentials passed in
- See how this cookie is used by the system (look at the HTTP request/response headers)
- Use this cookie to scrape the website
- Share your code freely: if someone else needs to scrape the website, they can either pass in their own credentials or use a curl command to get/set a valid cookie header
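The steps above can be sketched as follows; both URLs and the form field names are hypothetical placeholders, to be replaced with whatever the Network tab shows for your intranet:

```python
import requests

# Placeholder endpoints -- substitute the ones found in the Network tab.
AUTH_URL = "https://intranet.example.org/api/login"
PAGE_URL = "https://intranet.example.org/reports"

def scrape_with_login(username, password):
    with requests.Session() as session:
        # The session stores whatever cookie the auth endpoint returns...
        resp = session.post(AUTH_URL, data={"username": username, "password": password})
        resp.raise_for_status()
        # ...and sends it automatically on every later request.
        return session.get(PAGE_URL).text
```

Each colleague calls it with their own credentials, so nothing sensitive lives in the shared source.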
1) After authenticating in your Firefox browser, make sure to get the cookie key/value.
2) Use that data in the code below:
from bs4 import BeautifulSoup
import requests
browser_cookies = {'your_cookie_key':'your_cookie_value'}
s = requests.Session()
r = s.get(your_url, cookies=browser_cookies)
bsoup = BeautifulSoup(r.text, 'lxml')
requests.Session() is there for persistence.
One more tip: you could also invoke your script like this:
python3 /path/to/script/script.py cookies_key cookies_value
Then read the two values with the sys module. The code becomes:
import sys
browser_cookies = {sys.argv[1]:sys.argv[2]}
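A full sketch of that pattern, with the target URL left as a placeholder; html.parser is used here so no extra parser library needs to be installed:

```python
import sys

import requests
from bs4 import BeautifulSoup

PAGE_URL = "https://intranet.example.org/page"  # placeholder

def cookie_from_argv(argv):
    """Turn `script.py cookie_key cookie_value` into a cookie dict."""
    if len(argv) != 3:
        raise SystemExit("usage: script.py cookie_key cookie_value")
    return {argv[1]: argv[2]}

def main():
    browser_cookies = cookie_from_argv(sys.argv)
    with requests.Session() as s:
        r = s.get(PAGE_URL, cookies=browser_cookies)
        print(BeautifulSoup(r.text, "html.parser").get_text()[:200])

# main()  # run as: python3 script.py cookie_key cookie_value
```

The cookie value never touches the source file or the shell history of anyone who reads the script, only the command line of whoever runs it.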
I am writing a script to download files from a website.
import requests
import bs4 as bs
import urllib.request
import re

with requests.session() as c:  # c is a Session object that keeps cookies across requests
    link = "https://gpldl.com/wp-login.php"  # login link
    initial = c.get(link)  # fetch the login page first
    headers = {
        'User-agent': 'Mozilla/5.0'
    }
    login_data = {"log": "****", "pwd": "****", "redirect_to": "https://gpldl.com/my-gpldl-account/", "redirect_to_automatic": 1, "rememberme": "forever"}  # login data for logging in
    page_int = c.post(link, data=login_data, headers=headers)  # post the login data to the login link
    prefinal_link = "https://gpldl.com"  # site root, used to build absolute links later
    page = c.get("https://gpldl.com/repository/", headers=headers)  # fetch the repository page
    good_data = bs.BeautifulSoup(page.content, "lxml")  # parse the page with lxml via BS4
    # loop for finding all required links
    for category in good_data.find_all("a", {"class": "dt-btn-m"}):
        inner_link = str(prefinal_link) + str(category.get("href"))
        my_var_2 = requests.get(inner_link)
        good_data_2 = bs.BeautifulSoup(my_var_2.content, "lxml")  # parse each link with lxml
        for each in good_data_2.find_all("tr", {"class": "row-2"}):
            for down_link_pre in each.find_all("td", {"class": "column-4"}):  # get the addresses of all files to download
                for down_link in down_link_pre.find_all("a"):
                    link_var = down_link.get("href")
                    file_name = link_var.split('/')[-1]
                    urllib.request.urlretrieve(str(down_link), str(file_name))
                    my_var.write("\n")
Using my code, when I access the website to download the files, the login keeps failing. Can anyone help me find what's wrong with my code?
Edit: I think the error lies in maintaining the logged-in state, since when I request pages one at a time I can reach links that are accessible only when logged in. But when I navigate on from there, the bot seems to get logged out and can no longer retrieve the download links or download the files.
Websites use cookies to check login status in every request, to tell whether it comes from a logged-in user or not, and modern browsers (Chrome, Firefox, etc.) manage your cookies automatically. requests.session() supports cookies and handles them by default, so in your code with requests.session() as c, c is like a miniature browser: a cookie travels with every request made through c, and once you log in with c you can use c.get() to browse all those login-accessible-only pages.
But urllib.request.urlretrieve(str(down_link), str(file_name)), which you use for downloading, knows nothing about the previous login state; that's why you can't download those files.
Instead, you should keep using c, which has the login state, to download all those files:
response = c.get(down_link)
with open(str(file_name), 'wb') as download:  # 'wb': response.content is bytes
    download.write(response.content)
I'm interested in using Python to retrieve a file that exists at an HTTPS url.
I have credentials for the site, and when I access it in my browser I'm able to download the file. How do I use those credentials in my Python script to do the same thing?
So far I have:
import urllib.request

response = urllib.request.urlopen('https:// (some url with an XML file)')
html = response.read()
with open('C:/Users/mhurley/Portable_Python/notebooks/XMLOut.xml', 'wb') as out:
    out.write(html)
This code works for non-secured pages, but (understandably) returns 401 Unauthorized for the HTTPS address. I don't understand how urllib handles credentials, and the docs aren't as helpful as I'd like.
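If the server uses HTTP Basic authentication (a 401 suggests it might; the WWW-Authenticate response header will say for sure), urllib can be given the credentials through an auth handler. The URL and credentials below are placeholders, and a form-based or digest login would need a different setup:

```python
import urllib.request

url = "https://example.com/data.xml"  # placeholder

# Assumes HTTP Basic auth; check the WWW-Authenticate header first.
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, url, "my_username", "my_password")
opener = urllib.request.build_opener(
    urllib.request.HTTPBasicAuthHandler(password_mgr)
)

# response = opener.open(url)
# with open("XMLOut.xml", "wb") as f:
#     f.write(response.read())
```

Passing None as the realm lets the handler answer whatever realm the server names in its challenge.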
I'm trying to log in to Instagram with the Python requests module. Checking the site with Firefox Developer Tools, I saw that whenever I click the login button a request is sent to instagram.com/accounts/login/ajax/.
So I wrote this piece of code:
import json, requests

ses = requests.session()
url = "https://www.instagram.com/accounts/login/ajax/"
payload = {'username': '****', 'password': '****'}
req = ses.post(url, data=json.dumps(payload))
But the response object (req) contains an HTML page with this error:
"This page could not be loaded. If you have cookies disabled in your browser, or you are browsing in Private Mode, please try enabling cookies or turning off Private Mode, and then retrying your action"
What should I do?
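One likely cause, hedged because Instagram changes this endpoint often: the ajax login has historically expected form-encoded data (not a json.dumps string) together with a csrftoken cookie echoed back in an X-CSRFToken header, which also explains the "cookies disabled" error when no cookies are present. A sketch of that request shape:

```python
import requests

LOGIN_URL = "https://www.instagram.com/accounts/login/ajax/"

def try_login(username, password):
    ses = requests.Session()
    ses.headers.update({
        "User-Agent": "Mozilla/5.0",
        "Referer": "https://www.instagram.com/",
    })
    # GET the site first so the server sets a csrftoken cookie on the session.
    ses.get("https://www.instagram.com/")
    csrf = ses.cookies.get("csrftoken", "")
    payload = {"username": username, "password": password}
    # data= sends the payload form-encoded; the CSRF token rides in a header.
    return ses.post(LOGIN_URL, data=payload, headers={"X-CSRFToken": csrf})
```

Compare the exact field names and headers against what the Network tab shows for a real browser login, since the current endpoint may require more than this.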