How do I handle a '403 Forbidden' response with Python scraping? - python

I have scraped the URL of the picture I want, but when I use the requests module to download it, the server responds with 403 Forbidden.
I have tried capturing the traffic with Chrome's F12 developer tools: the main page triggers many JS responses, while the request for the picture URL itself only shows up with type "Doc".
import requests
lines =[
'https://i.hamreus.com/ps4/0-9/9%E5%8F%B7%E6%9D%80%E6%89%8B%E6%B9%9B%E8%93%9D%E4%BB%BB%E5%8A%A1[%E9%AB%98%E6%A1%A5%E7%BE%8E%E7%94%B1%E7%BA%AA]/vol_02/seemh-001-a5f6.jpg.webp?cid=121333&md5=7dHbKv51JwzRC6jjd7p3oQ',
'https://i.hamreus.com/ps4/0-9/9%E5%8F%B7%E6%9D%80%E6%89%8B%E6%B9%9B%E8%93%9D%E4%BB%BB%E5%8A%A1[%E9%AB%98%E6%A1%A5%E7%BE%8E%E7%94%B1%E7%BA%AA]/vol_02/seemh-002-c60d.jpg.webp?cid=121333&md5=7dHbKv51JwzRC6jjd7p3oQ',
'https://i.hamreus.com/ps4/0-9/9%E5%8F%B7%E6%9D%80%E6%89%8B%E6%B9%9B%E8%93%9D%E4%BB%BB%E5%8A%A1[%E9%AB%98%E6%A1%A5%E7%BE%8E%E7%94%B1%E7%BA%AA]/vol_02/seemh-003-4b8a.jpg.webp?cid=121333&md5=7dHbKv51JwzRC6jjd7p3oQ',
'https://i.hamreus.com/ps4/0-9/9%E5%8F%B7%E6%9D%80%E6%89%8B%E6%B9%9B%E8%93%9D%E4%BB%BB%E5%8A%A1[%E9%AB%98%E6%A1%A5%E7%BE%8E%E7%94%B1%E7%BA%AA]/vol_02/seemh-004-87ac.jpg.webp?cid=121333&md5=7dHbKv51JwzRC6jjd7p3oQ',
]
def download_pic(url, s):
    # NOTE: 'headers' is assumed to be defined elsewhere in the original script
    # (e.g. a dict carrying a browser User-Agent).
    r = s.get(url, headers=headers)
    with open(url.split('/')[-1].split('.')[0] + '.jpg', 'wb') as fp:
        fp.write(r.content)  # r.content is a bytes attribute, not a method

def main():
    s = requests.Session()
    main_url = 'https://www.manhuagui.com/comic/12087/121333.html'
    r = s.get(main_url, headers=headers)
    for each_url in lines:
        download_pic(each_url.strip('\n'), s)

if __name__ == '__main__':
    main()
I can't download the picture I want

Some websites have a security provision against requests coming from external sources, particularly Python scripts; that is why you are getting the 403 error. In that case you will not be able to use either the urllib or the requests module on its own.
My workaround was to call a shell script from Python and pass it the URL of the image. Inside the shell script, $1 holds the URL that was passed in, and wget downloads the image, like so:
Python:
import subprocess

# 'url' is the image URL; 'filename' is the path to the shell script below
# (e.g. './download.sh', a hypothetical name).
subprocess.call([filename, url])
Script (.sh):
wget "$1"
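If wget is installed and on the PATH, the intermediate shell script is not strictly required; a minimal sketch of invoking wget directly from Python (the URL list is a placeholder for the ones in the question):
import subprocess

urls = [
    # the image URLs from the question go here
]
for url in urls:
    subprocess.call(['wget', url])  # same effect as the shell script: wget "$1"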

Related

Download text data from HTML link in Python

Hi, I want to download delimited text that is hosted at an HTML link. (The link is accessible on a private network only, so I can't share it here.)
In R, the following code solves the purpose (every other function gave an "Unauthorized access" or "401" error):
url = 'https://dw-results.ansms.com/dw-platform/servlet/results? job_id=13802737&encoding=UTF8&mimeType=plain'
download.file(url, "~/insights_dashboard/testing_file.tsv")
a = read.csv("~/insights_dashboard/testing_file.tsv",header = T,stringsAsFactors = F,sep='\t')
I want to do the same thing in Python, for which I used:
(A) urllib and requests.get():
import urllib.request
url_get = requests.get(url, verify=False)
urllib.request.urlretrieve(url_get, 'C:\\Users\\cssaxena\\Desktop\\24.tsv')
(B) requests.get() and pd.read_html():
url='https://dw-results.ansms.com/dw-platform/servlet/results? job_id=13802737&encoding=UTF8&mimeType=plain'
s = requests.get(url, verify=False)
a = pd.read_html(io.StringIO(s.decode('utf-8')))
(C) Using wget:
import wget
url = 'https://dw-results.ansms.com/dw-platform/servlet/results? job_id=13802737&encoding=UTF8&mimeType=plain'
wget.download(url,--auth-no-challenge, 'C:\\Users\\cssaxena\\Desktop\\24.tsv')
OR
wget --server-response -owget.log "https://dw-results.ansms.com/dw-platform/servlet/results? job_id=13802737&encoding=UTF8&mimeType=plain"
NOTE: The URL doesn't ask for any credentials; it is accessible from a browser and can be downloaded in R with download.file. I am looking for an equivalent solution in Python.
def geturls(path):
    # crude scan of a saved HTML file for href="..." values
    yy = open(path, 'rb').read()
    yy = "".join(str(yy))  # work on the string representation of the bytes
    yy = yy.split('<a')
    out = []
    for d in yy:
        z = d.find('href="')
        if z > -1:
            x = d[z + 6:len(d)]
            r = x.find('"')
            x = x[:r]
            x = x.strip(' ./')
            if (len(x) > 2) and (x.find(";") == -1):
                out.append(x.strip(" /"))
    out = set(out)
    return(out)

pg = "./test.html"  # your html
url = geturls(pg)
print(url)
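As for the download step itself (what R's download.file does), here is a minimal requests-only sketch, assuming, as the note in the question says, that no credentials are required; verify=False simply mirrors the earlier attempts:
import requests

url = 'https://dw-results.ansms.com/dw-platform/servlet/results? job_id=13802737&encoding=UTF8&mimeType=plain'
r = requests.get(url, verify=False)
r.raise_for_status()  # turns a 401/403 into an explicit error instead of silently saving an error page
with open(r'C:\Users\cssaxena\Desktop\24.tsv', 'wb') as fp:
    fp.write(r.content)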

Python - Requests pulling HTML instead of JSON

I'm building a Python web scraper (personal use) and am running into some trouble retrieving a JSON file. I was able to find the request URL I need, but when I run my script (I'm using Requests) the URL returns HTML instead of the JSON shown in the Chrome Developer Tools console. Here's my current script:
import requests
import json
url = 'https://nytimes.wd5.myworkdayjobs.com/Video?clientRequestID=1f1a6071627946499b4b09fd0f668ef0'
r = requests.get(url)
print(r.text)
Completely new to Python, so any push in the right direction is greatly appreciated. Thanks!
Looks like that website returns the response depending on the Accept header provided with the request. So try:
import requests
import json
url = 'https://nytimes.wd5.myworkdayjobs.com/Video?clientRequestID=1f1a6071627946499b4b09fd0f668ef0'
r = requests.get(url, headers={'accept': 'application/json'})
print(r.json())
You can have a look at the full api for further reference: http://docs.python-requests.org/en/latest/api/.
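If it helps to confirm that the server really is negotiating on the Accept header, a quick hedged check is to request both types and compare the Content-Type the server reports:
import requests

url = 'https://nytimes.wd5.myworkdayjobs.com/Video?clientRequestID=1f1a6071627946499b4b09fd0f668ef0'
for accept in ('text/html', 'application/json'):
    r = requests.get(url, headers={'accept': accept})
    print(accept, '->', r.headers.get('Content-Type'))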

Python Requests module error while logging into a WordPress site

I am writing a script to download files from a website.
import requests
import bs4 as bs
import urllib.request
import re
with requests.session() as c:  # making c denote the requests.session() object
    link = "https://gpldl.com/wp-login.php"  # login link
    initial = c.get(link)  # passing link through .get()
    headers = {
        'User-agent': 'Mozilla/5.0'
    }
    login_data = {"log": "****", "pwd": "****", "redirect_to": "https://gpldl.com/my-gpldl-account/", "redirect_to_automatic": 1, "rememberme": "forever"}  # login data for logging in
    page_int = c.post(link, data=login_data, headers=headers)  # posting the login data to the login link
    prefinal_link = "https://gpldl.com"  # initializing a part of the link to be used later
    page = c.get("https://gpldl.com/repository/", headers=headers)  # passing the given URL through .get() to be used later
    good_data = bs.BeautifulSoup(page.content, "lxml")  # parsing the data from the previous statement into lxml form by BS4
    # loop for finding all required links
    for category in good_data.find_all("a", {"class": "dt-btn-m"}):
        inner_link = str(prefinal_link) + str(category.get("href"))
        my_var_2 = requests.get(inner_link)
        good_data_2 = bs.BeautifulSoup(my_var_2.content, "lxml")  # parsing each link with lxml
        for each in good_data_2.find_all("tr", {"class": "row-2"}):
            for down_link_pre in each.find_all("td", {"class": "column-4"}):  # downloading all files and getting their addresses to be entered into the .csv file
                for down_link in down_link_pre.find_all("a"):
                    link_var = down_link.get("href")
                    file_name = link_var.split('/')[-1]
                    urllib.request.urlretrieve(str(down_link), str(file_name))
                    my_var.write("\n")  # my_var (presumably the .csv file handle) is assumed to be defined elsewhere in the original script
Using my code, when I access the website to download the files, the login keeps failing. Can anyone help me find what's wrong with my code?
Edit: I think the error lies in maintaining the logged-in state: when I access one page at a time I can reach the links that are available only while logged in, but as soon as the script navigates further the bot seems to get logged out, so it cannot retrieve the download links or download the files.
Websites use cookies to check the login status of every request, to tell whether it comes from a logged-in user, and modern browsers (Chrome/Firefox etc.) manage those cookies automatically. requests.session() also supports cookies and handles them by default, so in your code with requests.session() as c, c is like a miniature browser: a cookie is involved in every request made by c, and once you log in with c you are able to use c.get() to browse all of those login-accessible-only pages.
In your code, however, urllib.request.urlretrieve(str(down_link), str(file_name)) is used for downloading; it has no idea of the previous login state, and that's why you're not able to download those files.
Instead, you should keep using c, which has the login state, to download all those files:
with open(str(file_name), 'wb') as download:  # 'wb' because response.content is bytes
    response = c.get(link_var)  # link_var is the href string extracted in the loop above
    download.write(response.content)

Displaying cookies through Python

I wanted to print the cookies into a text file, printcooki.txt, and then open the webpage https://www.google.co.in. But at the end I am getting a blank text file and the webpage does not open. What changes should be made in my program? Please help me out.
from urllib2 import Request, build_opener, HTTPCookieProcessor, HTTPHandler
import cookielib
import io
object = cookielib.CookieJar()  # jar that will collect the cookies
opener = build_opener(HTTPCookieProcessor(object), HTTPHandler())
webreq = Request("https://www.google.co.in/")
f = opener.open(webreq)
html = f.read()
print html[:10]
print "the webpage has following cookies "
for cookie in object:
    print cookie
createtext = open("C:\Users\****\Desktop\printcooki.txt", "w")
for cookie in object:
    print >> createtext, cookie  # to save cookies into printcooki.txt
createtext.close()
opener.open('https://www.google.co.in')  # to open a webpage
It is likely that your company makes use of a corporate firewall.
In such a case - and if you have the needed credentials - you can set a couple of environment variables to instruct urllib2 to use your corporate proxy.
For example in Bash you can run the following commands:
export HTTP_PROXY="http://<user_name>:<user_password>@<proxy_ip_address_or_name>:<proxy_port>"
export HTTPS_PROXY="$HTTP_PROXY"
before running your Python script.
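Alternatively, under the same assumption of a corporate proxy (same placeholders as above), the proxy can be configured from inside the script with urllib2's ProxyHandler:
import urllib2

proxy = urllib2.ProxyHandler({
    'http':  'http://<user_name>:<user_password>@<proxy_ip_address_or_name>:<proxy_port>',
    'https': 'http://<user_name>:<user_password>@<proxy_ip_address_or_name>:<proxy_port>',
})
opener = urllib2.build_opener(proxy, urllib2.HTTPCookieProcessor())
urllib2.install_opener(opener)  # every later urllib2.urlopen() call now goes through the proxy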

Website scraping script works in Linux but not in Windows 7?

I have written a script that scrapes a URL. It works fine on Linux, but I am getting an HTTP 503 error when running it on Windows 7, as if the URL has some issue.
I am using Python 2.7.11.
Please help.
Below is the script:
import sys      # Used to add the BeautifulSoup folder to the import path
import urllib2  # Used to read the html document

if __name__ == "__main__":
    ### Import Beautiful Soup
    ### Here, I have the BeautifulSoup folder in the level of this Python script
    ### So I need to tell Python where to look.
    sys.path.append("./BeautifulSoup")
    from bs4 import BeautifulSoup

    ### Create opener with Google-friendly user agent
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]

    ### Open page & generate soup
    ### the "start" variable will be used to iterate through 10 pages.
    for start in range(0, 1000):
        url = "http://www.google.com/search?q=site:theknot.com/us/&start=" + str(start * 10)
        page = opener.open(url)
        soup = BeautifulSoup(page)

        ### Parse and find
        ### Looks like google contains URLs in <cite> tags.
        ### So for each cite tag on each page (10), print its contents (url)
        file = open("parseddata.txt", "wb")
        for cite in soup.findAll('cite'):
            print cite.text
            file.write(cite.text + "\n")
        # file.flush()
        # file.close()
When I run it on Windows 7, the cmd throws an HTTP 503 error stating that the issue is with the URL.
The same URL works fine on Linux. In case the URL is actually wrong, please suggest alternatives.
Apparently with Python 2.7.2 on Windows, any time you send a custom User-agent header, urllib2 doesn't send that header. (source: https://stackoverflow.com/a/8994498/6479294).
So you might want to consider using requests instead of urllib2 on Windows:
import requests
# ...
page = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})  # requests does send the custom header
soup = BeautifulSoup(page.text)
# etc...
EDIT: Also, a very good point to be made is that Google may be blocking your IP, since they don't really like bots making a hundred-odd requests in quick succession.
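If that is what is happening, spacing the requests out may help; a minimal sketch (the delay is an arbitrary choice, and the header mirrors the original script):
import time
import requests

for start in range(0, 10):  # keep the number of result pages small
    url = "http://www.google.com/search?q=site:theknot.com/us/&start=" + str(start * 10)
    page = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
    print page.status_code  # 200 while Google is happy, often 503 once it rate-limits
    time.sleep(10)          # pause between requests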
