I'm working on a project that accesses a specific site, runs a search, and then filters and returns the value. The program logs in, saves the cookie with a cookie jar to authenticate the connection, and then runs the search. However, when I run the program it returns no results, and the packet header looks completely different from the browser's. What am I doing wrong that makes the search always return no results?
Here is my code:
import cookielib, urllib, urllib2
file= open('results.txt', 'wb')
cj=cookielib.CookieJar()
opener=urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders=[('Referer', 'http:// site that runs the search/psc/p01ps1/EMPLOYEE/CRM/c/BANNER_TAP.SRCH_ATDO_TAP.GBL')]
opener.addheaders=[('User-Agent', 'Mozilla/5.0 (Windows NT 5.1; rv:22.0) Gecko/20100101 Firefox/22.0')]
posts={'timezoneOffset':'180', 'userid':'user', 'pwd':'password', 'Submit':'Signon'}
data = urllib.urlencode(posts)
opens=opener.open('loginpage.com', data)
print cj
file.write(opens.read())
cjs=str(cj)
posts2 = urllib.urlencode({'ICType':'Panel', 'ICElementNum':0, 'ICStateNum':1, 'ICAction':'SRCH_ATD_TAP_WK_SRCH_PB', 'ICXPos':0, 'ICYPos':0, 'ICFocus':'', 'ICChanged':1, 'ICResubmit':0, 'ICFind':'', 'SRCH_ATD_TAP_WK_MSISDN_TAP':'', 'SRCH_ATD_TAP_WK_CNPJ_TAP':'', 'SRCH_ATD_TAP_WK_STATUS_RA_TAP':'', 'SRCH_ATD_TAP_WK_INTERACTION_ID':'', 'SRCH_ATD_TAP_WK_CASE_ID':48373914, 'SRCH_ATD_TAP_WK_PROTOCOLO_TAP':'', 'SRCH_ATD_TAP_WK_DATA_INI_BAN_TAP':'', 'SRCH_ATD_TAP_WK_HORA_INI_RA_TAP':'', 'SRCH_ATD_TAP_WK_DATA_FIM_BAN_TAP':'', 'SRCH_ATD_TAP_WK_HORA_FIM_BAN_TAP':'', 'SRCH_ATD_TAP_WK_MOTIVO_ID1_TAP':0, 'SRCH_ATD_TAP_WK_MOTIVO_ID2_TAP':0, 'SRCH_ATD_TAP_WK_MOTIVO_ID3_TAP':0, 'SRCH_ATD_TAP_WK_MOTIVO_ID4_TAP':0, 'SRCH_ATD_TAP_WK_MOTIVO_ID5_TAP':0, 'SRCH_ATD_TAP_WK_COMPANY_TYPE_TAP':'','SRCH_ATD_TAP_WK_SUBTIPO_CLI_TAP':''})
url2='searchpage.com'
opens2 = opener.open(url2, posts2)
str=opens2.read()
print cj
file.write(str + cjs)
file.close()
It connects first to the login page to save the cookie and then connects to the search page. Again, this is just to be used on one site, so the connections and post data are very specific.
Again, this code doesn't return any results (after searching the str variable, which holds the entire unfiltered page).
Here are the results I get when scanning the requests with Wireshark. The first one is the search run in Firefox in a normal browser (including the post data sent), and the second one is my program automating the search for me.
POST /psc/p01ps1/EMPLOYEE/CRM/c/BANNER_TAP.SRCH_ATDO_TAP.GBL HTTP/1.1
Host: siteroot
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:22.0) Gecko/20100101 Firefox/22.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: pt-BR,pt;q=0.8,en-US;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Referer: site that runs the search/BANNER_TAP.SRCH_ATDO_TAP.GBL #note I wasn't able to create this header.
Cookie: SignOnDefault=my login id; PS_LOGINLIST=http:// siteroot; brux0128-claro-com-br-7090-PORTAL-PSJSESSIONID=dpLmTCpY8vTmj4nMHbpyptPMdvphpRLR!841308261; ExpirePage=http:// siteroot/psp/p01ps1/; PS_TOKEN=AAAAogECAwQAAQAAAAACvAAAAAAAAAAsAARTaGRyAgBOcQgAOAAuADEAMBSfJDUA/BR2T3ekF0/cVhdJ7uJlpgAAAGIABVNkYXRhVnicHYpBCoAgFESfFi2jixRqYrgO2hbWvjN0vw7X5B94bxg+8BjbtBh09v05kJlxpGq1joOd0ksnGxc3KyUS9OSJjHIQPUtlYNLqK52Ya5Li+ABuIwtr; http%3a%2f%2fsiteroot%2fpsp%2fp01ps1%2femployee%2fcrm%2frefresh=list:||||||; PS_360=PS_360_BO_ID_CUST!0!PS_360_CUST_SETID!!PS_360_BO_ID_CONT!0!PS_360_BO_ID_SITE!0!PS_360_CUST_ROLE!0!PS_360_CONT_ROLE!0!PS_360_BO_ID!0!PS_360_VIEW_OPTION!False; PS_TOKENEXPIRE=18_Feb_2014_00:04:41_GMT; HPTabName=DEFAULT
Connection: keep-alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 683
POST DATA: ICType=Panel&ICElementNum=0&ICStateNum=17&ICAction=SRCH_ATD_TAP_WK_SRCH_PB&ICXPos=0&ICYPos=84&ICFocus=&ICChanged=1&ICResubmit=0&ICFind=&SRCH_ATD_TAP_WK_MSISDN_TAP=&SRCH_ATD_TAP_WK_CNPJ_TAP=&SRCH_ATD_TAP_WK_STATUS_RA_TAP=&SRCH_ATD_TAP_WK_INTERACTION_ID=&SRCH_ATD_TAP_WK_CASE_ID=48373914&SRCH_ATD_TAP_WK_PROTOCOLO_TAP=&SRCH_ATD_TAP_WK_DATA_INI_BAN_TAP=&SRCH_ATD_TAP_WK_HORA_INI_RA_TAP=&SRCH_ATD_TAP_WK_DATA_FIM_BAN_TAP=&SRCH_ATD_TAP_WK_HORA_FIM_BAN_TAP=&SRCH_ATD_TAP_WK_MOTIVO_ID1_TAP=0&SRCH_ATD_TAP_WK_MOTIVO_ID2_TAP=0&SRCH_ATD_TAP_WK_MOTIVO_ID3_TAP=0&SRCH_ATD_TAP_WK_MOTIVO_ID4_TAP=0&SRCH_ATD_TAP_WK_MOTIVO_ID5_TAP=0&SRCH_ATD_TAP_WK_COMPANY_TYPE_TAP=&SRCH_ATD_TAP_WK_SUBTIPO_CLI_TAP=
POST /psc/p01ps1/EMPLOYEE/CRM/c/BANNER_TAP.SRCH_ATDO_TAP.GBL HTTP/1.1
Accept-Encoding: identity
Content-Length: 681
Host: siteroot
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:22.0) Gecko/20100101 Firefox/22.0
Connection: close
Cookie: PS_TOKEN=AAAAogECAwQAAQAAAAACvAAAAAAAAAAsAARTaGRyAgBOcQgAOAAuADEAMBSX+ZILWKx7oU/VKvJbVT8LbueJtwAAAGIABVNkYXRhVnicJYpLCoAwDAWnVVyKF1Hsh2rXgluluvcM3s/DGWNCZh6PALexVY1Bxj4fOzKBkaSW1LCzUVrRwcrJxUKJeHlyRHqxFzomZWCQZlYm5b9Z7gVtawtT; ExpirePage=siteroot; PS_LOGINLIST=siteroot; PS_TOKENEXPIRE=18_Feb_2014_00:08:09_GMT; brux0128-claro-com-br-7090-PORTAL-PSJSESSIONID=QG14TCkJK7PpfRtNH0CSCw9S1m6jtRR9!841308261; SignOnDefault=my login id; http%3a%2f%2fsiteroot%2fpsp%2fp01ps1%2femployee%2fcrm%2frefresh=list:
Content-Type: application/x-www-form-urlencoded
POST DATA: SRCH_ATD_TAP_WK_DATA_INI_BAN_TAP=&SRCH_ATD_TAP_WK_MOTIVO_ID4_TAP=0&ICResubmit=0&ICXPos=0&SRCH_ATD_TAP_WK_DATA_FIM_BAN_TAP=&SRCH_ATD_TAP_WK_PROTOCOLO_TAP=&SRCH_ATD_TAP_WK_SUBTIPO_CLI_TAP=&SRCH_ATD_TAP_WK_MOTIVO_ID3_TAP=0&ICAction=SRCH_ATD_TAP_WK_SRCH_PB&SRCH_ATD_TAP_WK_MOTIVO_ID5_TAP=0&ICElementNum=0&SRCH_ATD_TAP_WK_INTERACTION_ID=&ICType=Panel&SRCH_ATD_TAP_WK_STATUS_RA_TAP=&SRCH_ATD_TAP_WK_COMPANY_TYPE_TAP=&SRCH_ATD_TAP_WK_HORA_FIM_BAN_TAP=&SRCH_ATD_TAP_WK_MOTIVO_ID2_TAP=0&ICFind=&SRCH_ATD_TAP_WK_MOTIVO_ID1_TAP=0&SRCH_ATD_TAP_WK_HORA_INI_RA_TAP=&ICChanged=1&ICStateNum=1&ICYPos=0&ICFocus=&SRCH_ATD_TAP_WK_CASE_ID=48373914&SRCH_ATD_TAP_WK_MSISDN_TAP=&SRCH_ATD_TAP_WK_CNPJ_TAP=
(This is for personal use at the company I work at, to simplify a task that currently needs to be done around 500 times manually. It is a site that registers protocols, and we need to search the protocols to check whether each one is closed or not; later I will import a list from Excel.)
Note that I don't send the additional headers, but if that could solve the problem I can add them. Also, for some reason my post data comes out in a different order (though from what I understand about post data that shouldn't make a difference), and the cookie fields are also in a somewhat different order, but I assume that shouldn't matter either, since cookie data is handled much like a Python dictionary.
So I've been breaking my head over this little piece of code, rewriting it several times over the past two weeks, and I still can't get it to return the search results.
It's also important to note that I won't be able to install a browser engine to execute the JavaScript, but I also don't think that's necessary, because the results from the search done in Firefox show up in Wireshark, so the page is downloaded with the result already in it. I was able to get mechanize running, but I haven't been able to try it yet. If there is a way to automate Firefox (I don't remember which version at the moment) with Python, that is an option I'm open to.
One more thing: because I'm working on this project at work, I'm not able to use any Python package that has to be installed. I got mechanize to work because I opened and copied the files over without running setup.py. So, just to keep things simple, I have no way to install libraries.
You don't have PS_360 set in your cookie. I'm not sure how essential it is, but the best strategy for working through these issues is to make your request identical to the browser's, step by step. Probably the first request that sets your cookie was already different, or your browser has cookie data from previous requests that you would need to create manually for your request.
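If you stay with urllib2, one way to compare step by step is to turn on the handlers' debug output, so the exact request and response headers (including the Cookie header the opener actually sends) get printed and can be diffed against the Wireshark capture. This is only a minimal sketch; the handler setup mirrors your code but isn't taken from it:
import cookielib, urllib2
# Sketch: debuglevel=1 makes urllib2 print the raw request/response headers,
# so they can be compared line by line with the browser's request.
cj = cookielib.CookieJar()
opener = urllib2.build_opener(
    urllib2.HTTPHandler(debuglevel=1),
    urllib2.HTTPSHandler(debuglevel=1),
    urllib2.HTTPCookieProcessor(cj))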
Related
I'm trying to scrape a website that uses DataDome, and after some requests I have to complete a GeeTest (slider captcha puzzle).
Here is a sample link to it:
captcha link
I've decided not to use Selenium (at least for now), and I'm trying to solve my problem with the Python requests module.
My idea was to complete the GeeTest myself and then send, from my program, the same request that my web browser sends after completing the slider.
To start, I scraped the HTML I got from the website after the captcha prompt:
<head><title>allegro.pl</title><style>#cmsg{animation: A 1.5s;}#keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style></head><body style="margin:0"><p id="cmsg">Please enable JS and disable any ad blocker</p><script>var dd={'cid':'AHrlqAAAAAMAsB0jkXrMsMAsis8SQ==','hsh':'77DC0FFBAA0B77570F6B414F8E5BDB','t':'fe','s':29701,'host':'geo.captcha-delivery.com'}</script><script src="https://ct.captcha-delivery.com/c.js"></script></body></html>
I couldn't access the iframe where the most important info is, but I found out that the link to that iframe can be built with info from the HTML code above. As you can see in the link above:
cid is initialCid, hsh is hash, etc.; one part of the link, cid, is a cookie that I got at the moment the captcha appeared.
I've seen there are services available that can solve the captcha for you, but I decided to complete the captcha myself, copy the exact request (including cookies and headers) into my program, and then send that request with requests. For now I'm doing it by hand, but it doesn't work: the response is 403, whereas manually it's 200 and a redirect.
Here is a sample request that my browser sends after completing the captcha:
sample request
I'm sending it in my program with:
s = requests.Session()
s.headers = headers
s.cookies.set(cookie_from_web_browser)
captcha = s.get(request)
The response is 403 and I have no idea how to make it work. Please help.
Captchas are really tricky in the web scraping world. Most of the time you can bypass this by solving the captcha and then manually taking the returned cookie and plugging it into your script. Depending on the website, the cookie could hold for 15 minutes, a day, or even longer.
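As a rough illustration of that approach, here is a minimal sketch with requests; the cookie name datadome and all the values are placeholders (assumptions, not taken from the question), so copy the real names and values from your browser's dev tools after solving the captcha:
import requests
# Sketch only: reuse a cookie captured from the browser after solving the
# captcha manually. Every value below is a placeholder.
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',  # match the browser that solved it
    'Referer': 'https://allegro.pl/',
})
session.cookies.set('datadome', '<value copied from browser>', domain='.allegro.pl', path='/')
response = session.get('https://allegro.pl/')
print(response.status_code)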
The other alternative is to use captcha-solving services such as https://www.scraperapi.com/, where you pay a fee for x amount of requests but don't run into the captcha issue, as they solve the captchas for you.
Use a headers parameter to solve this problem, like so:
import requests

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'referer': 'https://www.google.com/'
}
r = requests.get("http://webcache.googleusercontent.com/search?q=cache:www.naukri.com/jobs-in-andhra-pradesh", headers=header)
Test it with the web cache before running it against the real URL.
I want to use cookies copied from my Chrome browser, but I keep getting errors.
import urllib.request
import re

def open_url(url):
    header = {"User-Agent": r'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
    Cookies = {'Cookie': r"xxxxx"}
    Request = urllib.request.Request(url=url, headers=Cookies)
    response = urllib.request.urlopen(Request, timeout=100)
    return response.read().decode("utf-8")
Where does my code go wrong? Is it the headers=Cookies part?
The correct way when using urllib.request is to use an OpenerDirector populated with an HTTPCookieProcessor:
import urllib.request

cookieProcessor = urllib.request.HTTPCookieProcessor()
opener = urllib.request.build_opener(cookieProcessor)
Then you use the opener, and it will automagically process the cookies:
response = opener.open(request, timeout=100)
By default, the CookieJar used (http.cookiejar.CookieJar) is a simple in-memory store, but you can use a FileCookieJar if you need long-term storage of persistent cookies, or even an http.cookiejar.MozillaCookieJar if you want persistent cookies stored in the (now legacy) Mozilla cookies.txt format.
If you want to use cookies that already exist in your web browser, you must first store them in a cookies.txt-compatible file and load them into a MozillaCookieJar (see the sketch after the field list below). For Mozilla, you can find an add-on, Cookie Exporter. For other browsers, you must manually create a cookies.txt file by reading the content of the cookies you need in your browser. The format can be found in The Unofficial Cookie FAQ. Extracts:
... each line contains one name-value pair. An example cookies.txt file may have an entry that looks like this:
.netscape.com TRUE / FALSE 946684799 NETSCAPE_ID 100103
Each line represents a single piece of stored information. A tab is inserted between each of the fields.
From left-to-right, here is what each field represents:
domain - The domain that created AND that can read the variable.
flag - A TRUE/FALSE value indicating if all machines within a given domain can access the variable. This value is set automatically by the browser, depending on the value you set for domain.
path - The path within the domain that the variable is valid for.
secure - A TRUE/FALSE value indicating if a secure connection with the domain is needed to access the variable.
expiration - The UNIX time that the variable will expire on. UNIX time is defined as the number of seconds since Jan 1, 1970 00:00:00 GMT.
name - The name of the variable.
value - The value of the variable.
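A minimal sketch of that approach (the cookies.txt file name and the URL are placeholders, not from the answer above):
import urllib.request
import http.cookiejar
# Sketch: load cookies exported from the browser in Mozilla cookies.txt format,
# then build an opener that sends them with every request.
cj = http.cookiejar.MozillaCookieJar('cookies.txt')
cj.load(ignore_discard=True, ignore_expires=True)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
response = opener.open('http://example.com/', timeout=100)
print(response.read().decode('utf-8'))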
But the normal way is to mimic a full session and extract the cookies automatically from the responses.
"When receiving an HTTP request, a server can send a Set-Cookie header with the response. The cookie is usually stored by the browser and, afterwards, the cookie value is sent along with every request made to the same server as the content of a Cookie HTTP header" extracted from mozilla site.
This link will help give you some knowledge about headers and HTTP requests. Please go through it; it might answer a lot of your questions.
You can use a better library (IMHO): requests.
import requests
headers = {
'User-Agent' : 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}
cookies = dict(c1="cookie_numba_one")
r = requests.get('http://example.com', headers = headers, cookies = cookies)
print(r.text)
I am using urlfetch.fetch in App Engine with Python 2.7.
I tried fetching 2 URLs belonging to 2 different domains. For the first one, the result of urlfetch.fetch includes results after resolving XHR queries that are made for getting recommended products.
However for the other page belonging to another domain, the XHR queries are not resolved and I just get the plain HTML for the most part. The XHR queries for this page are also made for purposes of getting recommended products to show, etc.
Here is how I use urlfetch:
fetch_result = urlfetch.fetch(url, deadline=5, validate_certificate=True)
URL 1 (the one where XHR is resolved and the response is complete)
https://www.walmart.com/ip/HP-15-f222wm-ndash-15.6-Laptop-Touchscreen-Windows-10-Home-Intel-Pentium-Quad-Core-Processor-4GB-Memory-500GB-Hard-Drive/53853531
URL 2 (the one where I just get the plain HTML for the most part)
https://www.flipkart.com/oricum-blue-486-loafers/p/itmezfrvwtwsug9w?pid=SHOEHZWJUMMTEYRU
Can someone please advise what I might be missing with regard to the inconsistency?
The server is serving different output based on the user-agent string supplied in the request headers.
By default, urlfetch.fetch will send requests with the user agent header set to something like AppEngine-Google; (+http://code.google.com/appengine; appid: myapp.appspot.com).
A browser will send a user agent header like this: Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0
If you override the default headers for urlfetch.fetch
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0'}
urlfetch.fetch(url, headers=headers)
you will find that the html that you receive is almost identical to that served to the browser.
I've written a simple HTTP web server using Python, but I've noticed that when I connect to it, the HTML page appears in the browser window, yet the loading indicator in the Chrome tab continues to spin and the server keeps receiving empty strings. This continues until I click the 'X' to stop loading the page. Could someone please explain why this is happening and how to fix it? Also, if the HTTP headers are wrong or I'm missing important ones, please tell me. I found it very difficult to find information on HTTP headers and commands.
You find the code here.
Link to image of network tab
Console output:
Socket created
Bound socket
Socket now listening
Connected with 127.0.0.1:55146
Connected with 127.0.0.1:55147
Received data: GET / HTTP/1.1
Host: localhost
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36
DNT: 1
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-GB,en-US;q=0.8,en;q=0.6
Parsing GET command
Client requested directory /index.html with HTTP version 1.1
html
/index.html
Reply headers:
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Server: Max's Python Web Server/1.0
Cache-Control: max-age=600, public
Connected with 127.0.0.1:55148
Received data:
Received data:
Received data:
Received data:
Your problem is how you think about sockets:
socket.recv will wait forever for data from clients
You don't need a loop here.
However, the size of a request you can read will be limited by the recv buffer-size parameter.
If you want to allow requests of any size, you should detect the end of the data according to the HTTP specification.
For example, if you expect headers only, a double line break (a blank line) marks where they end.
And the size of the body (for the POST method, for example) should be given by the Content-Length header, as far as I know; see the sketch below.
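A minimal sketch of that idea (reading until the blank line that ends the headers, then reading Content-Length bytes of body); the socket handling around it is assumed, not taken from your server:
# Sketch only: read one HTTP request from an already accepted socket `conn`.
def read_http_request(conn):
    data = b''
    # Read until the blank line (CRLF CRLF) that terminates the headers.
    while b'\r\n\r\n' not in data:
        chunk = conn.recv(4096)
        if not chunk:  # client closed the connection
            return data
        data += chunk
    head, _, body = data.partition(b'\r\n\r\n')
    # If a Content-Length header is present, keep reading until the body is complete.
    length = 0
    for line in head.split(b'\r\n')[1:]:
        name, _, value = line.partition(b':')
        if name.strip().lower() == b'content-length':
            length = int(value.decode('ascii').strip())
    while len(body) < length:
        chunk = conn.recv(4096)
        if not chunk:
            break
        body += chunk
    return head + b'\r\n\r\n' + body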
Your issue is the same as in this question: link
Google the HTTP specification if you want to build a correct HTTP server.
Some Code
import urllib.request
from urllib import error

headers = {}
headers['user-agent'] = 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0'
headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
headers['Accept-Language'] = 'en-gb,en;q=0.5'
#headers['Accept-Encoding'] = 'gzip, deflate'

request = urllib.request.Request(sURL, headers=headers)
try:
    response = urllib.request.urlopen(request)
except error.HTTPError as e:
    print('The server couldn\'t fulfill the request.')
    print('Error code: {0}'.format(e.code))
except error.URLError as e:
    print('We failed to reach a server.')
    print('Reason: {0}'.format(e.reason))
else:
    f = open('output/{0}.html'.format(sFileName), 'w')
    f.write(response.read().decode('utf-8'))
A url
http://groupon.cl/descuentos/santiago-centro
The situation
Here's what I did:
enable javascript in browser
open url above and keep an eye on the console
disable javascript
repeat step 2 (for those of you who have just tuned in, javascript has now been disabled)
use urllib2 to grab the webpage and save it to a file
enable javascript
open the file with browser and observe console
repeat 7 with javascript off
results
In step 2 I saw that a whole lot of the page content was loaded dynamically using ajax. So the HTML that arrived was a sort of skeleton, and ajax was used to fill in the gaps. This is fine and not at all surprising.
Since the page should be SEO-friendly, it should work fine without JS. In step 4 nothing happens in the console, and the skeleton page loads pre-populated, rendering the ajax unnecessary. This is also not at all confusing.
In step 7 the ajax calls are made but fail. This is also OK, since the URLs they use are not local, so the calls are broken. The page looks like the skeleton. This is also great and expected.
In step 8: no ajax calls are made and the skeleton is just a skeleton. I would have thought that this should behave very much like step 4.
question
What I want to do is use urllib2 to grab the HTML from step 4, but I can't figure out how.
What am I missing and how could I pull this off?
To paraphrase
If I were writing a spider, I would want to be able to grab plain ol' HTML (as in what resulted in step 4). I don't want to execute ajax stuff or any JavaScript at all. I don't want to populate anything dynamically. I just want HTML.
The SEO-friendly site wants me to get what I want, because that's what SEO is all about.
How would one go about getting plain HTML content given the situation I outlined?
To do it manually I would turn off JS, navigate to the page, view source, ctrl-a, ctrl-c, ctrl-v (somewhere useful).
To get a script to do it for me I would...?
stuff I've tried
I used Wireshark to look at packet headers, and the GETs sent off from my PC in steps 2 and 4 have the same headers. Reading about SEO makes me think this is pretty normal; otherwise techniques such as Hijax wouldn't be used.
Here are the headers my browser sends:
Host: groupon.cl
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Here are the headers my script sends:
Accept-Encoding: identity
Host: groupon.cl
Accept-Language: en-gb,en;q=0.5
Connection: close
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0
The differences are:
my script has Connection = close instead of keep-alive. I can't see how this would cause a problem
my script has Accept-Encoding = identity. This might be the cause of the problem. I can't really see why the host would use this field to determine the user agent, though. If I change the encoding to match the browser's request headers, then I have trouble decoding the response. I'm working on this now (see the sketch below)...
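A minimal sketch of handling a gzip response with urllib2, in case the decoding trouble is just the compressed body (this is an assumption about the problem, not something taken from the capture), using the URL from the question:
import gzip
import urllib2
from StringIO import StringIO
# Sketch: ask for gzip like the browser does, then decompress the body
# manually, since urllib2 does not do that for you.
request = urllib2.Request('http://groupon.cl/descuentos/santiago-centro')
request.add_header('Accept-Encoding', 'gzip, deflate')
response = urllib2.urlopen(request)
body = response.read()
if response.info().get('Content-Encoding') == 'gzip':
    body = gzip.GzipFile(fileobj=StringIO(body)).read()
print body[:200]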
watch this space, I'll update the question as new info comes up