Website keeps blocking web scraping requests even after specifying a proxy server - Python

I am scraping the website craigslist.com, but after a certain number of requests it keeps blocking my device. I tried the solution in "Proxies with Python 'Requests' module" but didn't understand how to specify the headers every time. Here's the code:
from bs4 import BeautifulSoup
import requests, json

list_of_tuples_with_given_zipcodes = []
id_of_apartments = []

params = {
    'sort': 'dd',
    'filter': 'reviews-dd',
    'res_id': 18439027
}

http_proxy = "http://10.10.1.10:3128"
https_proxy = "https://10.10.1.11:1080"
ftp_proxy = "ftp://10.10.1.10:3128"

proxies = {
    "http": http_proxy,
    "https": https_proxy,
    "ftp": ftp_proxy
}

for i in range(1, 30):
    content = requests.get('https://losangeles.craigslist.org/search/apa?s = ' + str(i), params=params)  # https://losangeles.craigslist.org/search/apa?s=120
    # content = requests.get('https://www.zillow.com/homes/for_rent/')
    soup = BeautifulSoup(content.content, 'html.parser')
    my_anchors = list(soup.find_all("a", {"class": "result-image gallery"}))
    for index, each_anchor_tag in enumerate(my_anchors):
        URL_to_look_for_zipcode = soup.find_all("a", {"class": "result-title"})  # taking set so that a page is not visited twice.
        for each_href in URL_to_look_for_zipcode:
            content_href = requests.get(each_href['href'])  # script id="ld_posting_data" type="application/ld+json">
            # print(each_href['href'])
            soup_href = BeautifulSoup(content_href.content, 'html.parser')
            my_script_tags = soup_href.find("script", {"id": "ld_posting_data"})
            # for each_tag in my_script_tags:
            if my_script_tags:
                res = json.loads(str(list(my_script_tags)[0]))
                if res and 'address' in list(res.keys()):
                    if res['address']['postalCode'] == "90012":  # use the input zipcode entered by the user.
                        list_of_tuples_with_given_zipcodes.append(each_href['href'])
I am still not sure about the value of the http_proxy variable. I specified it as the value that was given, but should it be the IP address of my device mapped to the localhost port number? The site still keeps blocking my requests.
Please help.

requests' get method lets you specify the proxy to use on a per-call basis:
r = requests.get(url, headers=headers, proxies=proxies)
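For example, a minimal sketch of the search loop passing the same proxies and headers dicts on every call (the 10.10.1.x addresses are placeholder values from the requests documentation and the User-Agent string is just an example, so both have to be replaced with values that actually work for you):

import requests
from bs4 import BeautifulSoup

proxies = {
    "http": "http://10.10.1.10:3128",    # placeholder - point this at a proxy you can actually reach
    "https": "https://10.10.1.11:1080",  # placeholder
}
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}

for i in range(1, 30):
    content = requests.get(
        'https://losangeles.craigslist.org/search/apa?s=' + str(i),
        headers=headers,
        proxies=proxies,
    )
    soup = BeautifulSoup(content.content, 'html.parser')
    # ... parse the page as before ...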

Related

Unable to receive desired results from POST in Python

I am attempting to get data from ITC TradeMap (I selected the page at random, so don't give that too much thought) using Requests, and then clean it (I have not done this yet) and export it using Pandas. However, I am facing difficulties getting the full dataset.
import pandas as pd
import requests as rq

# Pandas Settings
pd.set_option("display.max_rows", 999)
pd.set_option("display.max_columns", 999)

# Request Settings
url = 'https://www.trademap.org/Country_SelProductCountry_TS.aspx?nvpm=1%7c643%7c%7c%7c%7c36%7c%7c%7c2%7c1%7c1%7c2%7c2%7c1%7c2%7c1%7c1%7c1'
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}
payload = {
    'ctl00$PageContent$GridViewPanelControl$HiddenField_Current_OutputMode': 'T',
    'ctl00$PageContent$GridViewPanelControl$HiddenField_Current_PageSizeTab': '300',
    'ctl00$PageContent$GridViewPanelControl$HiddenField_Current_TS_NumTimePeriod': '10',
    'ctl00$PageContent$GridViewPanelControl$HiddenField_Current_TS_ReferencePeriod': '2019'
}

# Output Settings
output_file = 'ITC_Test.xlsx'

# Work
request = rq.post(url, verify=False, headers=headers, data=payload)
table = pd.read_html(request.content)
table[8].to_excel(output_file)
print(table[8])
So far I am in the testing stage and solving issues as they arise (e.g. if requested without verify=False, it throws ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)), but that's beside the point.
The real problem I am facing is that while most of the data needed for successful queries is contained in the URL itself (and I will simply loop through it when the time comes), the view settings are not, and without them I am limited to retrieving only 25 rows and 5 columns of data (top 25 trade partners over the last 5 years).
Those settings are located in dropdown menus which seem to be fed into an aspnetForm. I have tried to use the data parameter of post to feed it those values:
payload = {
    'ctl00$PageContent$GridViewPanelControl$HiddenField_Current_OutputMode': 'T',
    'ctl00$PageContent$GridViewPanelControl$HiddenField_Current_PageSizeTab': '300',
    'ctl00$PageContent$GridViewPanelControl$HiddenField_Current_TS_NumTimePeriod': '10',
    'ctl00$PageContent$GridViewPanelControl$HiddenField_Current_TS_ReferencePeriod': '2019'
}
However, the output does not seem to be affected: it still returns only 25 rows and 5 columns of data instead of the 300 rows and 10 columns I would expect.
I have seen some questions here which seemed similar and tried to implement those ideas, but most likely because I haven't worked with these libraries and my knowledge of Python in general is rather basic, I was unable to resolve the issue, so any help would be much appreciated.
Thanks!
I found 3 problems:
it has to use Session() to send cookies
it has to send all values in the payload - so first I GET the page to get all the values from <input> and <select>
it has to send the new values in different variables.
You used variables which describe the current state, but the new values have to be sent in variables with DropDownList in the name:
payload['ctl00$PageContent$GridViewPanelControl$DropDownList_NumTimePeriod'] = '20'
payload['ctl00$PageContent$GridViewPanelControl$DropDownList_PageSize'] = '300'
EDIT:
I saved the payload generated by the code and the payload from the browser, and used the program Meld to compare the files and see the differences.
I had to correct the code which gets values from <select>, because it needs to search for the <option> with selected.
For some addresses it needed some values to be set manually, because normally JavaScript adds these values.
And it needed to skip DropDownList_Product_Group.
Full working code:
import pandas as pd
import requests as rq
from bs4 import BeautifulSoup as BS

# hide SSL warnings
from requests.packages.urllib3.exceptions import InsecureRequestWarning
rq.packages.urllib3.disable_warnings(InsecureRequestWarning)

# request settings
#url = 'https://www.trademap.org/Country_SelProduct_TS.aspx?nvpm=1%7c%7c%7c%7c%7c88%7c%7c%7c2%7c1%7c1%7c1%7c2%7c1%7c2%7c1%7c1%7c1'
url = 'https://www.trademap.org/Country_SelProductCountry_TS.aspx?nvpm=1%7c616%7c%7c%7c%7cTOTAL%7c%7c%7c2%7c1%7c1%7c1%7c2%7c1%7c2%7c1%7c%7c1'
print('url:', url.replace('%7c', '|'))

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}

s = rq.Session()

# --- GET ---

print('sending GET ...')
response = s.get(url, verify=False, headers=headers)

soup = BS(response.content, 'html.parser')
form = soup.find('form')  #.find('div', {'id': 'div_nav_combo'})

payload = {}

#print('--- inputs ---')
inputs = form.find_all('input')
for item in inputs:
    name = item.get('name', '')
    value = item.get('value', '')
    if name:  #and name != 'pg_goal' and 'button' not in name.lower():
        payload[name] = value
        #print(name, '=', value)

#print('--- selects ---')
selects = form.find_all('select')
for item in selects:
    name = item.get('name', '')
    if name:
        value = item.find('option', {'selected': True}) or ""
        if value:
            value = value['value']
        payload[name] = value
        #print(name, '=', value)

#print('--- textareas ---')
#textareas = form.find_all('textarea')
#for item in textareas:
#    name = item.get('name', '')
#    print('name:', name)
#    value = item.get('value', '')
#    print('value:', value)

# --- POST ---

print('sending POST ...')

#payload['ctl00$PageContent$GridViewPanelControl$DropDownList_NumTimePeriod'] = '20'
#payload['__EVENTTARGET'] = 'ctl00$PageContent$GridViewPanelControl$DropDownList_NumTimePeriod'
payload['ctl00$PageContent$GridViewPanelControl$DropDownList_PageSize'] = '300'
#payload['__EVENTTARGET'] = 'ctl00$PageContent$GridViewPanelControl$DropDownList_PageSize'
#payload['ctl00$MenuControl$DDL_Language'] = 'en'

# added by JavaScript
payload['ctl00$NavigationControl$DropDownList_Country_Group'] = '-2'
payload['ctl00$NavigationControl$DropDownList_Partner'] = '-2'
payload['ctl00$NavigationControl$DropDownList_Partner_Group'] = '-2'

# has to remove it for `PageSize` (at least for some addresses)
del payload['ctl00$NavigationControl$DropDownList_Product_Group']

response = s.post(url, verify=False, headers=headers, data=payload)
#print(response.content[:1000])

# --- rest ---

# pandas settings
pd.set_option("display.max_rows", 999)
pd.set_option("display.max_columns", 999)

# output settings
output_file = 'ITC_Test.xlsx'

all_tables = pd.read_html(response.content)
table = all_tables[8]
table.to_excel(output_file)

print('len(table):', len(table))
#print(table)
Result: table with ~230 rows and ~20 columns

BeautifulSoup issues logging in to a website with a requests session

I am using the code below to log in to a website so that I can scrape data from my own profile page.
However, even after I make a GET request to the profile URL, the selector (soup) only returns data from the login page.
I still haven't been able to find a reason for that.
import requests
from requests import session
from bs4 import BeautifulSoup

login_url = 'https://caicara.pizzanapoles.com.br/Account/Login'
url_perfil = 'https://caicara.pizzanapoles.com.br/AdminCliente'

payload = {
    'username': 'MY_USERNAME',
    'password': 'MY_PASSWORD'
}

with requests.session() as s:
    s.post(login_url, data=payload)
    r = requests.get(url_perfil)

soup = BeautifulSoup(r.content, 'html.parser')
print(soup.title)
Firstly you need to use your session object s for all the requests.
r = requests.get(url_perfil)
changes to
r = s.get(url_perfil)
A __RequestVerificationToken is sent in the POST data when you try to log in - you may need to send it too.
It is present inside the HTML of the login_url:
<input name="__RequestVerificationToken" value="..."
This means you .get() the login page - extract the token - then send your .post()
r = s.get(login_url)
soup = BeautifulSoup(r.content, 'html.parser')
token = soup.find('input', {'name': '__RequestVerificationToken'})['value']
payload['__RequestVerificationToken'] = token
r1 = s.post(login_url, data=payload)
r2 = s.get(url_perfil)
You may want to save each request into its own variable for further debugging.
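Putting the pieces together, a minimal sketch (still assuming the form only needs username, password and the token, as in your payload) could look like this:

import requests
from bs4 import BeautifulSoup

login_url = 'https://caicara.pizzanapoles.com.br/Account/Login'
url_perfil = 'https://caicara.pizzanapoles.com.br/AdminCliente'
payload = {'username': 'MY_USERNAME', 'password': 'MY_PASSWORD'}

with requests.Session() as s:
    r = s.get(login_url)
    soup = BeautifulSoup(r.content, 'html.parser')
    token = soup.find('input', {'name': '__RequestVerificationToken'})['value']
    payload['__RequestVerificationToken'] = token

    r1 = s.post(login_url, data=payload)  # log in
    r2 = s.get(url_perfil)                # fetch the profile page with the same session
    print(BeautifulSoup(r2.content, 'html.parser').title)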
Thank you, Karl, for your reply, but it didn't work.
I changed my code using the tips you mentioned above.
import requests
from bs4 import BeautifulSoup

login_url = 'https://caicara.pizzanapoles.com.br/Account/Login'
url = 'https://caicara.pizzanapoles.com.br/AdminCliente'

data = {
    'username': 'myuser',
    'password': 'mypass',
}

with requests.session() as s:
    r = s.get(login_url)
    soup = BeautifulSoup(r.content, 'html.parser')
    token = soup.find('input', name='__RequestVerificationToken')['value_of_my_token']
    payload['__RequestVerificationToken'] = token
    r1 = s.post(login_url, data=payload)
    r2 = s.get(url_perfil)
However, it returns the error below.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-140-760e35f7b327> in <module>
13
14 soup = BeautifulSoup(r.content, 'html.parser')
---> 15 token = soup.find('input', name='__RequestVerificationToken')['QHlUQaro9sNo4lefL59lQRtbuziHnHtolV7Xm_Et_3tvnZKZnS4gjBBJZakw7crW0dyXy_lok44RozrMAvWm61XXGla5tC3AuZlgXC4GukA1']
16
17 payload['__RequestVerificationToken'] = token
TypeError: find() got multiple values for argument 'name'
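For reference, BeautifulSoup's find() already uses name as its first positional argument (the tag name), so passing name= as a keyword collides with it; the attribute filter has to go in a dictionary (or via the attrs= keyword), and the subscript should be the string 'value', not the token value itself. Using the variable names from your snippet (data for the payload and url for the profile page), the corrected lines would be roughly:

token = soup.find('input', {'name': '__RequestVerificationToken'})['value']
data['__RequestVerificationToken'] = token

r1 = s.post(login_url, data=data)
r2 = s.get(url)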

Python : How to stay logged in while scraping?

Just to clarify from the beginning: I'm a total beginner (I wrote something in Python for the first time today). This was more about following a guide and trying to remember what I did 7 years ago when I tried learning Java than anything else.
I wanted to scrape the image tags from a website (to plot them later) but have to stay logged in to view all images. After I got the scraping working I noticed that some tags were blocked, so the issue with the login came up. I have now managed to log in, but it doesn't work outside of the session itself, which makes the rest of my code useless. Can I get this to work, or do I have to give up?
This is the working login:
import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup

login_data = {
    'user': 'theusername',
    'pass': 'thepassword',
    'op': 'Log in'
}

with requests.Session() as s:
    url = "https://thatwebsite.com/index.php?page=account&s=login&code=00"
    r = s.get(url)
    r = s.post(url, data=login_data)
And what I had working before to scrape the website but with the login missing:
filename = "taglist.txt"
f = open(filename, "w", encoding="utf-8")
headers = "tags\n"
f.write(headers)

pid = 0
actual_page = 1

while pid < 150:
    url = "https://thatwebsite.com/index.php?page=post&s=list&tags=absurdres&pid=" + str(pid)
    print(url)
    client = urlopen(url)
    page_html = client.read()
    client.close()
    page_soup = soup(page_html, "html.parser")
    containers = page_soup.findAll("div", {"class": "thumbnail-preview"})
    print("Current pid: " + str(pid))
    for container in containers:
        tags = container.span.a.img["title"]
        f.write(tags.replace(" ", "\n") + "\n")
    pid = pid + 42
    print("Current page: " + str(actual_page))
    actual_page += 1

print("Done.")
f.close()
Out comes a list of every tag used by high res images.
I hope I don't offend anyone with this.
Edit: The code is working now, had a cookie typo:
import requests
from bs4 import BeautifulSoup as soup

login_data = {
    'user': 'myusername',
    'pass': 'mypassword',
    'op': 'Log in'
}

s = requests.Session()

print("\n\n\n\n\n")

filename = "taglist.txt"
f = open(filename, "w", encoding="utf-8")
headers = "tags\n"
f.write(headers)

pid = 0
actual_page = 1

while pid < 42:
    url2 = "https://thiswebsite.com/index.php?page=post&s=list&tags=rating:questionable&pid=" + str(pid)
    r = s.get(url2, cookies={'duid': 'somehash', 'user_id': 'my userid', 'pass_hash': 'somehash'})
    page_html = str(r.content)
    page_soup = soup(page_html, "html.parser")
    containers = page_soup.findAll("div", {"class": "thumbnail-preview"})
    for container in containers:
        tags = container.span.a.img["title"]
        f.write(tags.replace(" ", "\n") + "\n")
    print("\nCurrent page: " + str(actual_page) + " Current pid: " + str(pid) + "\nDone.")
    actual_page += 1
    pid = pid + 42

f.close()
You use two different libraries for making web requests right now, requests and urllib. I would opt for using only requests.
Also, don't use the Session() context manager. Context managers are used to do some cleanup after leaving the indented block; that is the with ... as x syntax you use on the requests.Session() object. In the context of requests this will clear the cookies as you leave the session (I assume login is managed by cookies on this site).
Keep the session in a variable instead, which you can use for subsequent requests, as it stores your cookies from the login. You need them for subsequent requests.
s = requests.Session()
url = "https://thatwebsite.com/index.php?page=account&s=login&code=00"
r = s.get(url) # do you need this request?
r = s.post(url, data=login_data)
Also make the subsequent call in the loop with requests:
client = s.get(url)
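Putting this together, a minimal sketch of the loop reusing one session for the login and every page request (URLs, form fields, and the pid step are taken from your snippets) might look like this:

import requests
from bs4 import BeautifulSoup as soup

login_data = {'user': 'theusername', 'pass': 'thepassword', 'op': 'Log in'}

s = requests.Session()
login_url = "https://thatwebsite.com/index.php?page=account&s=login&code=00"
s.post(login_url, data=login_data)  # the login cookies are stored on the session

pid = 0
while pid < 150:
    url = "https://thatwebsite.com/index.php?page=post&s=list&tags=absurdres&pid=" + str(pid)
    r = s.get(url)  # reuse the logged-in session instead of urlopen()
    page_soup = soup(r.content, "html.parser")
    for container in page_soup.find_all("div", {"class": "thumbnail-preview"}):
        print(container.span.a.img["title"])
    pid += 42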

Unable to successfully make the second call using the first call response as parameters

I am a beginner in Python and I am trying to access the following data using Python:
1) Go to https://www.nseindia.com/corporates/corporateHome.html and click on 'Corporate Announcements' under 'Corporate Information' on the left pane.
2) Enter the company symbol (KSCL, for example) and select the announcement period.
3) Click on any individual row subject to get additional detail.
The first two steps translate to the URL 'https://www.nseindia.com/corporates/corpInfo/equities/getAnnouncements.jsp?period=More%20than%203%20Months&symbol=kscl&industry=&subject='. This is working fine in my Python requests code.
However, I am not able to replicate the third step: the request is successful, but I am not getting the data. Following is the code that I am using. I am stuck, please help.
I compared all the request headers sent when I tried this from the browser with what I am sending from Python, and they match. I tried sending the cookie too, but that didn't work. I think the cookie might not be required, as the website works in the browser after disabling cookies too. I am running this on Python 3.5.
import requests as rq
from requests.utils import requote_uri
from requests_html import HTMLSession
import demjson as dj
from urllib.parse import quote


class BuyBack:

    def start(self):
        # Define headers used across all requests
        self.req_headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
        self.req_headers['Accept'] = '*/*'
        self.req_headers['Accept-Encoding'] = 'gzip, deflate, br'
        self.getAllSymbols()

    def readAnnouncement(self, pyAnnouncement):
        # This is done using requests_html
        symbol = pyAnnouncement['sym']
        desc = pyAnnouncement['desc']
        tstamp = pyAnnouncement['date']
        seqId = pyAnnouncement['seqId']
        payload = {'symbol': symbol, 'desc': desc, 'tstamp': tstamp, 'seqId': seqId}
        quote_payload = {}
        params_string = '?'
        # formats as required with '%20' for spaces
        for (k, v) in payload.items():
            quote_payload[quote(k)] = quote(v)
            params_string += quote(k)
            params_string += '='
            params_string += quote(v)
            params_string += '&'
        params_string = params_string[:-1]
        announDetail_Url = 'https://nseindia.com/corporates/corpInfo/equities/AnnouncementDetail.jsp'
        self.req_headers['Referer'] = 'https://www.nseindia.com/corporates/corpInfo/equities/Announcements.html'
        self.req_headers['X-Requested-With'] = 'XMLHttpRequest'
        self.req_headers['Host'] = 'www.nseindia.com'
        annReqUrl = announDetail_Url + params_string
        session = HTMLSession()
        r = session.get(annReqUrl, headers=self.req_headers)
        print(r.url)
        # I am not getting the proper data in the response
        print(r.content)
        print(r.request.headers)

    def getAllSymbols(self):
        # To get the list of symbols to run the rest of the process, for now just run with one
        symbol = 'KSCL'
        self.getAnnouncements(symbol)

    def getAnnouncements(self, symbol):
        # To get a list of all announcements so far in the last few months
        # This is done by using requests and demjson because the request returns a js object
        # Open request to get everything
        payload = {'symbol': symbol, 'Industry': '', 'ExDt': '', 'subject': ''}
        corporateActions_url = 'https://www.nseindia.com/corporates/corpInfo/equities/getAnnouncements.jsp'
        r = rq.get(corporateActions_url, headers=self.req_headers, params=payload)
        for line in r.iter_lines():
            lineAscii = line.decode("ascii")
            if len(lineAscii) > 5:
                pyAnnouncements = dj.decode(lineAscii)
        # Tried setting the cookie but no use
        # cookie = r.headers['Set-Cookie']
        # self.req_headers['Cookie'] = cookie
        # read from the announcements
        if pyAnnouncements['success']:
            # for x in pyAnnouncements['rows']:
            for i in range(0, 1):
                self.readAnnouncement(pyAnnouncements['rows'][i])


BuyBack_inst = BuyBack()
BuyBack_inst.start()
When I try this flow from the browser, the response to the second call contains an href link to a PDF. But I am not getting that href link in my Python response.
The following works for me to get all PDF hrefs given a symbol and announcement period:
import demjson
import requests
from bs4 import BeautifulSoup

symbol = 'KSCL'

s = requests.Session()
r = s.get("https://www.nseindia.com/corporates/corpInfo/equities/getAnnouncements.jsp"
          f"?period=Last%201%20Month&symbol={symbol}&industry=&subject=")

for ann in demjson.decode(r.text.strip())['rows']:
    url = (
        "https://www.nseindia.com/corporates/corpInfo/equities/AnnouncementDetail.jsp?"
        f"symbol={ann['sym']}"
        f"&desc={ann['desc']}"
        f"&tstamp={int(ann['date']) // 100}"
        f"&seqId={ann['seqId']}"
    )
    soup = BeautifulSoup(s.get(url).content, 'html.parser')
    print(soup.select_one('.t1 a[href$=".pdf"]')['href'])
Result:
/corporate/KSCL_20122018134432_Outcome_046.pdf
/corporate/KSCL_20122018133033_Outcome_043.pdf
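If you then want to download the PDFs as well, a short sketch (assuming the hrefs resolve under https://www.nseindia.com and reusing the same session s from above) could be:

from pathlib import Path

href = '/corporate/KSCL_20122018134432_Outcome_046.pdf'  # one of the hrefs printed above
pdf = s.get('https://www.nseindia.com' + href)
Path(href.rsplit('/', 1)[-1]).write_bytes(pdf.content)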

Equivalent Python code for the following Java http get requests

I am trying to convert the following Java code to Python. Not sure what I am doing wrong, but I end up with an internal server error 500 with python.
Is the "body" in httplib.httpConnection method equivalent to Java httpentity?
Any other thoughts on what could be wrong?
The input information I collect is correct for sure.
Any help will be really appreciated. I have tried several things, but end up with the same internal server error.
Java Code:
HttpEntity reqEntitiy = new StringEntity("loginTicket="+ticket);
HttpRequestBase request = reMethod.getRequest(uri, reqEntitiy);
request.addHeader("ticket", ticket);
HttpResponse response = httpclient.execute(request);
HttpEntity responseEntity = response.getEntity();
StatusLine responseStatus = response.getStatusLine();
Python code:
url = serverURL + "resources/slmservices/templates/"+templateId+"/options"
#Create the request
ticket = ticket.replace("'",'"')
headers = {"ticket":ticket}
print "ticket",ticket
reqEntity = "loginTicket="+ticket
body = "loginTicket="+ticket
url2 = urlparse.urlparse(serverURL)
h1 = httplib.HTTPConnection(url2.hostname,8580)
print "h1",h1
url3 = urlparse.urlparse(url)
print "url path",url3.path
ubody = {"loginTicket":ticket}
data = urllib.urlencode(ubody)
conn = h1.request("GET",url3.path,data,headers)
#conn = h1.request("GET",url3.path)
response = h1.getresponse()
lines = response.read()
print "response.status",response.status
print "response.reason",response.reason
You don't need to go this low-level. Use urllib2 instead:
import urllib2
from urllib import urlencode

url = "{}resources/slmservices/templates/{}/options".format(
    serverURL, templateId)
headers = {"ticket": ticket}
params = {"loginTicket": ticket}

url = '{}?{}'.format(url, urlencode(params))
request = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(request)

print 'Status', response.getcode()
print 'Response data', response.read()
Note that the parameters are added to the URL to form URL query parameters.
You can make this simpler still by installing the requests library:
import requests

url = "{}resources/slmservices/templates/{}/options".format(
    serverURL, templateId)
headers = {"ticket": ticket}
params = {"loginTicket": ticket}

response = requests.get(url, params=params, headers=headers)
print 'Status', response.status_code
print 'Response data', response.content  # or response.text for Unicode
Here requests takes care of URL-encoding the URL query string parameters and adding it to the URL for you, just like Java does.
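If you are on Python 3 (where urllib2 no longer exists and print is a function), a roughly equivalent standard-library sketch would be:

from urllib.parse import urlencode
from urllib.request import Request, urlopen

url = "{}resources/slmservices/templates/{}/options".format(serverURL, templateId)
url = '{}?{}'.format(url, urlencode({"loginTicket": ticket}))

request = Request(url, headers={"ticket": ticket})
response = urlopen(request)
print('Status', response.getcode())
print('Response data', response.read())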
