I have a situation where I get Chrome's performance logs after running some automated tests, and then parse them with a script I made to produce a HAR file (http://www.softwareishard.com/blog/har-12-spec/).
Everything works correctly most of the time, but every now and then I get a request that has no response (whether because I stopped the page loading before it completed, or for some other reason). I tried leaving the "response" object blank for such a request, but the spec requires it to be populated. After that, I tried removing the object entirely, but it is required too.
What should I do in those cases? Can anyone advise?
Well, after some debugging I found out that requests that got no response simply do not go into the HAR. At least, that is what WebPageTest does, and it is what I did too.
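In other words, when building the HAR, the script now just skips any request that never received a response. A minimal sketch (the parsed-log structure here is a made-up stand-in for whatever your own parser produces):

parsed_requests = {
    '1000.1': {'request': {'method': 'GET', 'url': 'http://example.com/'},
               'response': {'status': 200, 'statusText': 'OK'}},
    # No 'response' key: the page was stopped before this one completed.
    '1000.2': {'request': {'method': 'GET', 'url': 'http://example.com/a.png'}},
}

# Only requests that actually received a response become HAR entries.
entries = [data for data in parsed_requests.values() if 'response' in data]
har = {'log': {'version': '1.2',
               'creator': {'name': 'my-log-parser', 'version': '1.0'},
               'entries': entries}}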
I'm currently trying to design a filter with which I can block certain URLs and also block based on keywords that may appear in the data that comes back in the HTTP response.
Just for clarification, I'm working on a Windows 10 x64 machine for this project.
In order to be able to do such a thing, I quickly understood that I would need a web proxy. I checked out about six proxies written in Python that I found on GitHub.
These are the projects I tried to use (some are Python 3, some Python 2):
https://github.com/abhinavsingh/proxy.py/blob/develop/proxy.py
https://github.com/inaz2/proxy2/blob/master/proxy2.py
https://github.com/inaz2/SimpleHTTPProxy - an earlier version of the one above
https://github.com/FeeiCN/WebProxy
Abhinavsingh's Proxy (first in the list):
what I want to happen
I want the proxy to be able to block sites based both on the request and on the content that comes back. I also need the filter to be in a separate file and to be generic, so I can apply it to every site and every request/response. I'd like to understand where the correct place to put a filter is in this proxy, and how to redirect (or just send a block page back) when the client tries to access sites with specific URLs, or when the response is a page that contains certain keywords.
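To make it concrete, the kind of separate, generic filter module I have in mind looks roughly like this (every name below is a placeholder of mine, not part of any of the proxies above):

# filter.py - generic rules, importable by whichever proxy I end up using.
BLOCKED_HOSTS = {'ads.example.com', 'tracker.example.net'}
KEYWORDS = (b'casino', b'malware')

def block_request(host, url):
    """Decide from the request alone: known-bad host or keyword in the URL."""
    return host in BLOCKED_HOSTS or any(k in url.encode() for k in KEYWORDS)

def block_response(body):
    """Decide from the response body (bytes): block pages with a keyword."""
    return any(k in body for k in KEYWORDS)

The proxy would call block_request wherever it parses the incoming request, and block_response once it has the body, sending back a block page instead when either returns True.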
what I tried
I enabled the proxy in Google Chrome's 'Open proxy settings' and executed the script. It looked pretty promising, and I noticed that I can insert a filter function call at line 383, in the _process_request function, so that it can return another host to redirect to, or just block the request. It worked partially for me.
The problems
First of all, I couldn't fully redirect/block the sites. Sometimes it worked, sometimes it didn't.
Another problem I ran into was that I can't access the content of the returned site if it is served over HTTPS.
Also, how to filter the response was unfortunately not clear to me.
I also noticed that proxy2 (the second in the list) can solve the problem I had of filtering an HTTPS page's content, but I couldn't figure out how to make that feature work (and I also think it requires Linux utilities anyhow).
The process I described above is pretty much what I tried with every proxy on the list. With some of them, like proxy2.py, I couldn't understand at all what I needed to do.
If anybody has managed to build a filter on top of this proxy, or any other from this list, and can help me understand how to do it, I'd be grateful if you'd comment below.
Thank you.
I use scripting extensively, and use Python to hop around from an executed script back to the page that launched it or, sometimes, to other pages entirely.
Say I'm running a script, which we will call "thisplace.cgi", which does its thing, whatever that is, then executes the following:
sys.stdout.write('Location: otherplace.cgi\n\n')
raise SystemExit
It works -- I get there, the page generated by otherplace.cgi displays properly, functionally speaking, all is good.
HOWEVER. In the browser's location bar, "thisplace.cgi" is still displayed. Not "otherplace.cgi", which is where the user really is now.
I did some Googling, and from what I gathered (not much... seems curiously mysterious), I ended up sending this...
sys.stdout.write('HTTP/1.1 300 Multiple Choices\n')
sys.stdout.write('Location: otherplace.cgi\n\n')
raise SystemExit
...but the browser just treats it as invalid and gives me an error page. The CGI environment is some shared-host Debian machine out on the Intertubes. I'm seeing the problem in Firefox under OS X.
Am I misunderstanding what the 300 is for, or where to use it, or what? Or am I completely confused? :)
I just want the browser to show the right thing in the URL bar. Doesn't seem like it's a lot to ask for. No?
To redirect a browser to a different URL you will want to use a 301 Moved Permanently. Note that in a (non-NPH) CGI script you don't emit the raw HTTP status line yourself; you set the code with a Status: header and the server builds the response line, which is why your hand-written HTTP/1.1 300 line was treated as invalid. The 301 will cause the browser to fetch the second page and show the new URL.
E.g.
sys.stdout.write('Status: 301 Moved Permanently\n')
sys.stdout.write('Location: otherplace.cgi\n\n')
raise SystemExit
I've been trying to get my Python code to POST. I tried using the Postman plugin to test the POST method and I got a 405 error. I am planning to have the user post some information and have it displayed.
Currently, if I press submit I get an error loading the page; changing the form to GET results in the submit button working and returning to the previous page. If I change the handler to post, the screen instantly displays '405 Method Not Allowed'. I've looked through the Google App Engine logs and there are no errors. Can someone help me with what I've done wrong and advise me on how to get the POST method functioning?
Thanks for your time.
You're getting '405 Method Not Allowed' because the POST is going to the same URL that served up the page, but the handler for that path (MainPage) does not have a post method.
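For illustration, a minimal handler with both methods looks something like this. I'm assuming webapp2 (the usual GAE Python framework); MainPage and the form field name are guesses, so adapt them to your actual code:

import webapp2

class MainPage(webapp2.RequestHandler):
    def get(self):
        # Serve the form; note that it posts back to the same path.
        self.response.write('<form method="post">'
                            '<input name="message">'
                            '<input type="submit">'
                            '</form>')

    def post(self):
        # Without this method, a POST to '/' returns 405 Method Not Allowed.
        self.response.write('You posted: %s' % self.request.get('message'))

app = webapp2.WSGIApplication([('/', MainPage)], debug=True)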
That's the same diagnosis that you were given when you asked this question two hours earlier under a different user id.
Stepping back further to address the "what have I done wrong" question: It seems to me that you've gotten quite far along before discovering that what you have doesn't work. So far along that the example is cluttered with irrelevant code that's unrelated to the failure. That makes it harder for you to figure this out for yourself, and it makes it more difficult for people here to help you.
In this situation, you help yourself (and help others help you) by seeing how much code you can eliminate while still demonstrating the failure.
I'm a little new to web crawlers and such, though I've been programming for a year already. So please bear with me as I try to explain my problem here.
I'm parsing info from Yahoo! News, and I've managed to get most of what I want, but there's a little portion that has stumped me.
For example: http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html
I want to get the numbers beside the thumbs-up and thumbs-down icons in the comments. When I use "Inspect Element" in my Chrome browser, I can clearly see the thing I have to look for - namely, an em tag under a div of class 'ugccmt-rate'. However, I'm not able to find it in my Python program. In trying to track down the root of the problem, I clicked to view the source of the page, and it seems that this tag is not there at all. Do you guys know how I should approach this problem? Does this have something to do with the JavaScript on the page that displays the info only after it runs? I'd appreciate some pointers in the right direction.
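For reference, this is roughly what I'm doing (BeautifulSoup here is just for illustration; the point is that the tag isn't in the raw HTML at all):

import urllib2
from bs4 import BeautifulSoup

url = 'http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)

# Comes back empty, even though Inspect Element clearly shows the tags:
print(soup.find_all('div', 'ugccmt-rate'))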
Thanks.
The page is being generated via JavaScript.
Check if there is a mobile version of the website first. If not, check for any APIs or RSS/Atom feeds. If there's nothing else, you'll either have to figure out manually what the JavaScript is loading and from where, or use Selenium to automate a browser that renders the JavaScript for you so you can parse the result.
Using the Web Console in Firefox you can pretty easily see what requests the page is actually making as it runs its scripts, and figure out what URI returns the data you want. Then you can request that URI directly in your Python script and tease the data out of it. It is probably in a format that Python already has a library to parse, such as JSON.
Yahoo! may have some stuff on their server side to try to prevent you from accessing these data files from a script, such as checking the browser (User-Agent header), cookies, or the referrer. These can all be faked with enough perseverance, but you should take their existence as a sign that you should tread lightly. (They may also limit the number of requests you can make in a given time period, which is much harder to get around.)
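As a sketch, once you have found the endpoint in the Web Console, fetching and parsing it might look like this. The URI and the JSON field names below are invented; the real ones come from watching the page's network traffic:

import json
import urllib2

# Invented endpoint: substitute the URI you saw in the Web Console.
uri = 'http://news.yahoo.com/comments-api?article=record-nm-blaze&count=20'
req = urllib2.Request(uri, headers={
    'User-Agent': 'Mozilla/5.0',           # mimic a browser
    'Referer': 'http://news.yahoo.com/',   # some endpoints check this
})
data = json.load(urllib2.urlopen(req))

# Field names are guesses; inspect the real response to find them.
for comment in data.get('comments', []):
    print('%s up, %s down' % (comment.get('thumbsUp'), comment.get('thumbsDown')))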
I asked a question about this a month ago; it's here: "post" method to communicate directly with a server.
I still haven't figured out why I sometimes get a 404 error and sometimes everything works fine. I've tried the same code with several different WordPress blogs. Using Firefox or IE, you can post a comment without any problem on any of these blogs, but using Python and the "post" method to communicate directly with the server, I get a 404 from several of them. I've tried spoofing the headers and adding cookies in the code, but the result remains the same. It's been bugging me for quite a while... Does anybody know the reason? Or what code should I add to make the program work just like a browser such as Firefox or IE? Hopefully you guys can help me out!
You should use something like mechanize.
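For example (the blog URL is a placeholder, and the field names are the standard WordPress comment-form ones, so check the actual page). Unlike a hand-built POST, mechanize first fetches the page and fills in the form it finds there, much like a real browser:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-Agent', 'Mozilla/5.0')]
br.open('http://someblog.wordpress.com/2012/01/01/some-post/')

# Pick the WordPress comment form by its id and fill the usual fields.
br.select_form(predicate=lambda f: f.attrs.get('id') == 'commentform')
br['author'] = 'My Name'
br['email'] = 'me@example.com'
br['comment'] = 'Nice post!'
br.submit()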
The blog may have some spam protection against this kind of posting. (A programmatic POST made without accessing/reading the page first can easily be detected by JavaScript-based protection.)
But if that's the case, I'm surprised you receive a 404...
Anyway, if you want to simulate a real browser, the best way is to use a real browser remote-controlled from Python.
Check out WebDriver (http://seleniumhq.org/docs/09_webdriver.html). It has a Python implementation and can drive the HtmlUnit, Chrome, IE, and Firefox browsers.
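A minimal sketch with the Python bindings (the URL is a placeholder, and the element ids are the usual WordPress comment-form ones, so verify them on the actual page):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://someblog.wordpress.com/2012/01/01/some-post/')

# Fill in and submit the comment form exactly as a user would.
driver.find_element_by_id('author').send_keys('My Name')
driver.find_element_by_id('email').send_keys('me@example.com')
driver.find_element_by_id('comment').send_keys('Nice post!')
driver.find_element_by_id('submit').click()

driver.quit()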