How to get the console output of a URL in Python?

I searched for a solution to my problem but couldn't find any.
Is there a way to open a URL in Python and get the console output from that URL?
I'm trying to check a list of URLs for JavaScript errors.
The only solution I found was using Selenium, but it's incredibly slow.

For Chrome errors you might find the Node module of Lighthouse suiting your needs.
For Firefox there is the Browser Console.
At first look, the Chrome solution seems easier.
Last but not least, I stumbled over the Python module Selenium, which can do live-server testing.
For reference, see this Django use case; there is also another post on how to catch JS errors.
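Since Selenium came up anyway, here is a minimal sketch of that route (assuming Chrome and Selenium 4): enabling the browser log capability lets you read each page's console output, where JavaScript errors show up with level SEVERE.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.set_capability('goog:loggingPrefs', {'browser': 'ALL'})  # capture console output
driver = webdriver.Chrome(options=options)

for url in ['https://example.com']:  # your list of URLs
    driver.get(url)
    for entry in driver.get_log('browser'):  # console messages for this page
        if entry['level'] == 'SEVERE':       # JS errors are logged as SEVERE
            print(url, entry['message'])
driver.quit()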

Related

Selenium cannot submit an answer on zhihu.com

URL: https://www.zhihu.com/question/305744720/answer/557418746
Using Selenium I cannot post a reply to the answer; only a human can. Here is my code:
import random
import time

from selenium import webdriver

# Attach to a Chrome instance already started with --remote-debugging-port=9222
options = webdriver.ChromeOptions()
options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
browser = webdriver.Chrome(executable_path=r'.\chromedriver.exe', options=options)

# Guard so that indexing button_li[4] cannot raise IndexError
button_li = browser.find_elements_by_class_name('Button--blue')
if len(button_li) > 4:
    print(len(button_li))
    button_ele = button_li[4]
    button_ele.click()

time.sleep(random.uniform(0.5, 3))
browser.find_element_by_css_selector('div.AnswerForm-editor').click()
time.sleep(random.uniform(0.5, 2))

# Insert the text through the DOM of the Draft.js editor
js = """
var div = document.getElementsByClassName('public-DraftStyleDefault-block');
var text = document.createTextNode("君");
div[0].firstChild.appendChild(text);
"""
browser.execute_script(js)
browser.find_element_by_css_selector('Button.Button.AnswerForm-submit').click()
Summary of problem:
My problem is that I wrote the content into the answer box successfully, but I was identified as a machine, and after that my actions on the page seemed to stop working. How can I avoid being identified as a machine so that I can still use Selenium to select my elements?
I'm not sure where you are trying to select the submit button, but the following selector worked for me:
browser.find_element_by_css_selector('Button.Button.AnswerForm-submit').click()
With respect to being detected as a 'machine', it's not easy to avoid that.
There are several different ways they can detect you.
That said, here's one thing I found that can avoid some of the Selenium detection attempts. One thing they look for is document variables called $cdc_ and $wdc_ that Selenium uses; for Chrome it would be $cdc_. Here is what I suggest you try:
Download a hex editor if you don't already have one. I used one from here.
Open your chromedriver.exe in the hex editor.
Use the Search functionality to find any instance of $cdc_ or $wdc_ and replace it with basically any other string ending in an underscore. Myself, I found just one instance of $cdc_, and it looked like this:
'$cdc_asdjflasutopfhvcZLmcfl_'
I simply replaced it with 'Random_'
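If you would rather not use a hex editor, here is a hedged sketch of the same patch done in Python; the replacement name is arbitrary, and keeping its length identical to the original avoids corrupting the binary:

import re

with open('chromedriver.exe', 'rb') as f:
    data = f.read()

# Overwrite the $cdc_..._ identifier with an equal-length dummy string
patched = re.sub(
    rb'\$cdc_[a-zA-Z0-9]+_',
    lambda m: b'$xyz' + b'x' * (len(m.group()) - 5) + b'_',
    data,
)

with open('chromedriver.exe', 'wb') as f:
    f.write(patched)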
Hopefully this works and you can now traverse the site unimpeded. If not, try the following; the only problem is that it might break your chromedriver file so that your tests no longer work. But it could be worth a try, and if it does break, you can easily download a fresh one.
Search the document for any usage of the words 'selenium', 'webdriver', or 'chromedriver' and delete them. This is another way that a site can tell you are using Selenium.
Let me know if any of this helps or if you have any questions. It's hard for me to give a concrete answer because I don't know exactly how the site is detecting Selenium.

Scraping a webpage generated by JavaScript

I have a problem getting JavaScript content into HTML to use it for scraping. I used multiple methods, such as PhantomJS or Python's Qt library, and they all get most of the content in nicely, but the problem is that there are JavaScript buttons inside the page like this:
Please see the screenshot here.
Now when I load this page from a script, these buttons won't default to any value, so I am getting back 0 for all the SELL/NEUTRAL/BUY values below. Is there a way to set these values when loading the page from a script?
Example page with all the values is: https://www.tradingview.com/symbols/NEBLBTC/technicals/
Any help would be greatly appreciated.
If you are trying to achieve this with Scrapy or with a derivative of cURL or urllib, I am afraid you can't. Python has other external packages, such as Selenium, that allow you to interact with the JavaScript of the page, but the problem with Selenium is that it is too slow. If you want something similar to Scrapy, you could check how the site works (as far as I can see, it works through ajax or websockets) and fetch the info you want through urllib, like you would do with an API.
Please let me know if this makes sense or if I have misunderstood your question.
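As a hedged sketch of that idea, with a hypothetical endpoint standing in for whatever the browser's network tab actually shows for the TradingView page:

import json
import urllib.request

# Hypothetical ajax endpoint; copy the real one from the network tab
req = urllib.request.Request(
    'https://www.tradingview.com/some-ajax-endpoint/?symbol=NEBLBTC',
    headers={'User-Agent': 'Mozilla/5.0'},  # some servers reject the default urllib UA
)
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)  # such endpoints usually answer with JSON
print(data)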
I used Selenium, which was perfect for this job; it is indeed slow, but it fits my purpose. I also used the Selenium Firefox plugin to generate the Python script, as it was very challenging to find exactly where in the code the button I had to press was.

How to read an HTML page that takes some time to load? [duplicate]

I am trying to scrape a web site using Python and Beautiful Soup. I have found that on some sites the image links, although visible in the browser, cannot be seen in the source code. However, using Chrome Inspect or Fiddler, we can see the corresponding code.
What I see in the source code is:
<div id="cntnt"></div>
But in Chrome Inspect, I can see a whole bunch of HTML/CSS code generated within this div. Is there a way to load the generated content in Python as well? I am using the regular urllib in Python, and I am able to get the source, but without the generated part.
I am not a web developer, hence I may not be able to express the behaviour in better terms. Please feel free to ask for clarification if my question seems vague!
You need a JavaScript engine to parse and run the JavaScript code inside the page.
There are a bunch of headless browsers that can help you (a short sketch follows the list):
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
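Selenium, discussed elsewhere on this page, can play the same role as the headless browsers above; a minimal sketch that renders the page in headless Chrome and hands the finished DOM to Beautiful Soup (using the div from the question):

from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

driver.get('https://example.com/page')  # hypothetical page with JS-generated content
soup = BeautifulSoup(driver.page_source, 'html.parser')  # DOM after the scripts ran
print(soup.select_one('#cntnt'))  # the div is now populated
driver.quit()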
The content of the website may be generated after load via JavaScript. In order to obtain the generated content via Python, refer to this answer.
A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you instead need a headless browser that also generates the DOM and loads and runs the scripts, like a regular browser would. The Wikipedia article and some other pages on the Net have lists of those and their capabilities.
Keep in mind when choosing that some formerly major products among them are now abandoned.
TRY THIS FIRST!
Perhaps the data technically could be in the JavaScript itself, and all this JavaScript-engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an ajax request. If you can get your program to simulate that, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work, though. I suggest turning on your network traffic logger (such as the "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention on any and all XmlHTTPRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)
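A hedged sketch of that replay step, with the endpoint and parameter as hypothetical stand-ins for whatever the network logger shows:

import requests

resp = requests.get(
    'https://example.com/ajax/comments',    # hypothetical XHR endpoint from the logger
    params={'article_id': '12345'},         # hypothetical query parameter
    headers={'User-Agent': 'Mozilla/5.0'},  # so the server thinks you're a "real" browser
)
data = resp.json()  # the JSON text mentioned above
print(data)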

Parsing from a website -- source code does not contain the info I need

I'm a little new to web crawlers and such, though I've been programming for a year already. So please bear with me as I try to explain my problem here.
I'm parsing info from Yahoo! News, and I've managed to get most of what I want, but there's a little portion that has stumped me.
For example: http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html
I want to get the numbers beside the thumbs up and thumbs down icons in the comments. When I use "Inspect Element" in my Chrome browser, I can clearly see the things that I have to look for - namely, an em tag under the div class 'ugccmt-rate'. However, I'm not able to find this in my python program. In trying to track down the root of the problem, I clicked to view source of the page, and it seems that this tag is not there. Do you guys know how I should approach this problem? Does this have something to do with the javascript on the page that displays the info only after it runs? I'd appreciate some pointers in the right direction.
Thanks.
The page is being generated via JavaScript.
Check if there is a mobile version of the website first. If not, check for any APIs or RSS/Atom feeds. If there's nothing else, you'll either have to manually figure out what the JavaScript is loading and from where, or use Selenium to automate a browser that renders the JavaScript for you for parsing.
Using the Web Console in Firefox you can pretty easily see what requests the page is actually making as it runs its scripts, and figure out what URI returns the data you want. Then you can request that URI directly in your Python script and tease the data out of it. It is probably in a format that Python already has a library to parse, such as JSON.
Yahoo! may have some stuff on their server side to try to prevent you from accessing these data files in a script, such as checking the browser (user-agent header), cookies, or referrer. These can all be faked with enough perseverance, but you should take their existence as a sign that you should tread lightly. (They may also limit the number of requests you can make in a given time period, which is impossible to get around.)
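As a hedged illustration of faking those checks with urllib; the data URI and cookie value are placeholders you would copy from the Web Console:

import urllib.request

req = urllib.request.Request(
    'https://news.yahoo.com/some-ratings-endpoint',  # hypothetical data URI
    headers={
        'User-Agent': 'Mozilla/5.0',  # pose as a real browser
        'Referer': 'http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html',
        'Cookie': 'B=placeholder',    # copy a real session cookie if one is required
    },
)
print(urllib.request.urlopen(req).read()[:200])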

What's the difference between "browser posting" and "program posting"?

I asked a question about this a month ago; it's here: "post" method to communicate directly with a server.
And I still haven't found the reason why sometimes I get a 404 error and sometimes everything works fine. I mean, I've tried those codes with several different WordPress blogs. Using Firefox or IE, you can post the comment without any problem on whatever WordPress blog it is, but using Python and the "post" method to communicate directly with a server, I got a 404 on several blogs. I've tried to spoof the headers and add cookies in the code, but the result remains the same. It's been bugging me for quite a while... Does anybody know the reason? Or what code should I add to make the program work just like a browser such as Firefox or IE? Hopefully you guys will help me out!
You should use something like mechanize.
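A hedged sketch with mechanize; the post URL is hypothetical, and the form index and field names follow the usual WordPress defaults, so they may differ per theme:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)                      # skip robots.txt for this test
br.addheaders = [('User-Agent', 'Mozilla/5.0')]  # look like a normal browser
br.open('http://example.wordpress.com/?p=1')     # hypothetical blog post URL
br.select_form(nr=0)         # the comment form; the index may differ per theme
br['author'] = 'me'
br['email'] = 'me@example.com'
br['comment'] = 'Nice post!'
br.submit()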
The blog may have some spam protection against this kind of posting. (A programmatic post made without accessing/reading the page first can easily be detected using JavaScript-based protection.)
But if that's the case, I'm surprised you receive a 404...
Anyway, if you want to simulate a real browser, the best way is to use a real browser remote-controlled by Python.
Check out WebDriver (http://seleniumhq.org/docs/09_webdriver.html). It has a Python implementation and can run the HtmlUnit, Chrome, IE and Firefox browsers.
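A minimal sketch of that WebDriver route, assuming the default WordPress comment form (field id 'comment', submit button id 'submit'):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('http://example.wordpress.com/?p=1')  # hypothetical blog post URL
driver.find_element(By.ID, 'comment').send_keys('Nice post!')
driver.find_element(By.ID, 'submit').click()     # posts exactly like a real browser
driver.quit()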
