Strange behaviour of XPath inside Selenium with Python

I'm using Selenium with python to extract comments from a website.
Eventually I end up with a list of WebElement objects, each corresponding to a single comment. I then use element.find_element_by_xpath(XPATH) to locate different pieces of information inside the comment object, such as the name of the commenter, the number of likes, etc.
The comments are all structured exactly the same; I've checked this with element.get_attribute('outerHTML').
But still, the XPath expressions only capture the relevant information every tenth time or so. The comments that are captured correctly don't seem to differ in any way from the others.
Has anyone experienced a similar problem, and maybe found a solution?
Edit: I found that the problem wasn't the XPath expressions but the way I tried to get the data from the elements (I used the text attribute). This post has the answer to the question I was actually trying to ask: getText() method of selenium chrome driver sometimes returns an empty string
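For reference, a minimal sketch of the usual workaround (reading the textContent attribute instead of relying on .text, which is empty for elements Selenium considers not displayed), assuming the newer find_element(By.XPATH, ...) API; the URL and XPath strings are placeholders:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://example.com/comments")  # placeholder URL

    comments = driver.find_elements(By.XPATH, "//div[@class='comment']")  # placeholder XPath
    for comment in comments:
        author = comment.find_element(By.XPATH, ".//span[@class='author']")  # placeholder XPath
        # .text is empty when Selenium considers the element not displayed;
        # the textContent attribute returns the text regardless of visibility.
        name = author.text or author.get_attribute("textContent")
        print(name.strip())

    driver.quit()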

Related

Extract heading and content from an HTML page using a visual approach in Python

I'm looking for a way to extract the heading and content from raw HTML. There are a couple of Python packages out there which do this (Newspaper3k, python-readability, python-goose), but I'm looking to do something closer to how the human eye sees a page. My idea is to use the visual placement of a div on a page to determine whether it's part of the main content or not. How can I extract the placement of a div using Python? Any other ideas on how to approach this problem?
To the best of my understanding, you want to locate and extract HTML from certain divs on a website, but on screen, with a cursor and a keyboard (like a human would do). For that purpose, you could go with PyAutoGUI.
You can use pyautogui.locateOnScreen() with a parameter of your choice, and then carry on with scraping tools.
With PyAutoGUI, you can automate click events as well.
For further research, you can check the docs.
Hope this answers your question; if you have doubts, please feel free to ask!
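A minimal sketch of that idea, assuming you have saved a screenshot of the target element as element.png (the filename and the click handling are illustrative):

    import pyautogui

    # Look for a previously saved screenshot of the element on the current screen.
    # locateOnScreen returns a Box(left, top, width, height); depending on the
    # PyAutoGUI version it returns None or raises ImageNotFoundException on a miss.
    box = pyautogui.locateOnScreen("element.png")  # placeholder image file
    if box is not None:
        pyautogui.click(pyautogui.center(box))  # automate a click on the matched region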
As you mentioned, the worst part of the Python packages you listed is the required knowledge of HTML and DOM structure. Nevertheless, some of that knowledge is necessary for scraping, so I can share a hybrid approach.
First step: I use the WebScraper.io Chrome extension to visually select items on the page (like on the image) and save them.
Second step: once I have DOM selectors like p a.cta (on the image), I use them with a Python scraping package.
I use this approach for almost any scraping project. I hope it helps.
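For illustration, a selector copied from the extension can be reused more or less as-is; a small sketch with requests and BeautifulSoup, where the URL and the p a.cta selector are placeholders:

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com").text  # placeholder URL
    soup = BeautifulSoup(html, "html.parser")

    # Reuse the CSS selector picked visually in WebScraper.io.
    for link in soup.select("p a.cta"):
        print(link.get_text(strip=True), link.get("href"))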

Python: Getting all the URLs to a website that has a format

This may not be the right way to phrase this question, but is there a fast way to get the URLs of a website that follow a format? What I mean is, let's say the URL has the format www.example.com/stuff/number=0123456789, where the number at the end is always 10 digits long.
Right now I am using scrapy to go through each URL from 0000000000 to 9999999999, which is 10 billion different combinations, to see if there is a webpage located there. Although I am running multiple instances and it is going pretty fast, it will still take forever, and there has to be a better way to do it. Any suggestions?
Scrapy itself is pretty fast, configurable and scalable. I would stick to it, try to optimize the current approach and scale it. For instance:
use HEAD requests instead of GET (and see this thread also); see the sketch after this list
distribute the work across multiple scrapyd instances. You can also use libraries like scrapy-redis to keep the queue of URLs to check and the scraped items (if there are any)
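A rough sketch of the HEAD-request idea as a Scrapy spider; the domain, URL pattern and number range are placeholders, and scanning the full 10-billion space would of course still be impractical:

    import scrapy

    class NumberSpider(scrapy.Spider):
        name = "numbers"

        def start_requests(self):
            # Placeholder range; the real 0000000000-9999999999 space is huge.
            for n in range(1000):
                url = "http://www.example.com/stuff/number=%010d" % n
                # A HEAD request only fetches the headers, so existing pages can be
                # detected without downloading their bodies.
                yield scrapy.Request(url, method="HEAD", callback=self.check)

        def check(self, response):
            # Only called for responses Scrapy considers successful (2xx by default).
            self.logger.info("Found: %s", response.url)
            yield {"url": response.url}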
But, be sure you are staying on the legal side and not violating the Terms of Use of the website.
As a side note, and to resolve the confusion: BeautifulSoup is an HTML parser and it is good at what it does, but it cannot make HTTP requests itself; it needs HTML to be passed into it.
As another side note, in general it doesn't sound quite right to generate all of the 10-digit combinations and check whether there is a webpage corresponding to each number. If you elaborate more on the motivation behind the problem, we can come up with more options or an alternative approach.

Documentation/Examples of using Selenium & Python to navigate a website

I've just started using Python and Selenium today, so I'm in at the deep end a little.
So far I've used the documentation to get a python script to load google, search for something and then take a screenshot of the results.
What I want is to be able to load a website, navigate to certain elements and take screenshots of various pages. I'm struggling to find documentation for navigation however.
Could someone point me towards (or post an answer with) examples/explanations of find_element, what you can actually find with it, and how to open elements once found? The documentation for a lot of what I wanted is still under development :(
I've been looking through the WebDriver docs on googlecode at the kind of methods I thought I needed, but it seems they are all part of the private API, so what alternatives are there?
I keep seeing this on everything:
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Found a great example of ActionChains on here: https://stackoverflow.com/a/8261754/1199464
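A minimal sketch of the kind of navigation being asked about, using the current find_element(By, ...) API; the URL, locator and filename are placeholders:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    driver.get("https://example.com")  # placeholder URL

    # find_element supports By.ID, By.NAME, By.CSS_SELECTOR, By.XPATH, By.LINK_TEXT, ...
    link = driver.find_element(By.LINK_TEXT, "About")  # placeholder locator
    link.click()  # "open" the element by clicking it

    driver.save_screenshot("about.png")  # screenshot of the page after navigating
    driver.quit()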
While the selenium documentation is not in a particularly good order, I feel like everything is there.
You could e.g. start here: http://code.google.com/p/selenium/wiki/FurtherResources
xpath seems a good choice for finding elements.
Also this page seems to contain what you need: http://seleniumhq.org/docs/03_webdriver.html#commands-and-operation
edit: I found this and it should contain what you need: http://selenium.googlecode.com/svn/trunk/docs/api/py/api.html
(sorry p0deje, I didn't see that you had already posted that last link...)
You can take a look at the basics there:
http://code.google.com/p/selenium/wiki/PythonBindings
http://pypi.python.org/pypi/selenium
The full and up-to-date documentation:
http://selenium.googlecode.com/svn/trunk/docs/api/py/index.html
Good links, but using XPath for locators is strongly discouraged (too brittle). Use ID or name, or CSS if you cannot; see the sketch after the links below.
Few links to best practices:
Selenium Best Practices (page objects; preferred selector order: id > name > css > xpath)
Slideshow - more advanced
Compare locators pro/con: XPath is slow and brittle, especially in IE.
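A small illustration of that preference order with the Python bindings; all the locator values are made up:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    driver.get("https://example.com")  # placeholder URL

    # Preferred order: id > name > css > xpath (all locator values are placeholders).
    by_id = driver.find_element(By.ID, "login")
    by_name = driver.find_element(By.NAME, "username")
    by_css = driver.find_element(By.CSS_SELECTOR, "form#login input[type='submit']")
    by_xpath = driver.find_element(By.XPATH, "//form[@id='login']//input[@type='submit']")  # last resort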

What's the best way to get a description of the website, in Python?

Suppose I downloaded the HTML code, and I can parse it.
How do I get the "best" description of that website if it does not have a meta-description tag?
You could get the first few sentence returned from something like Readability.
Safari 5 uses it, so it must be alright :)
To follow up on the "Readability" suggestion above (which itself is inspired by the website Instapaper), they have released the JavaScript: http://code.google.com/p/arc90labs-readability/. What's more, someone has ported it to Python: http://github.com/gfxmonk/python-readability. Rejoice!
It's very hard to come up with a rule that works 100% of the time, obviously, but my suggestion as a starting point would be to look for the first <h1> tag (or <h2>, <h3>, etc - the highest one you can find) then the bit of text after that can be used as the description. As long as the site is semantically marked-up, that should give you a good description (I guess you could also take the contents of the <h1> itself, but that's more like the "title").
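A rough sketch of that heuristic with BeautifulSoup, assuming html already holds the downloaded page; it is only a starting point, not a general solution:

    from bs4 import BeautifulSoup

    def guess_description(html):
        soup = BeautifulSoup(html, "html.parser")
        # Find the highest-level heading present on the page.
        for level in ("h1", "h2", "h3", "h4", "h5", "h6"):
            heading = soup.find(level)
            if heading:
                # Prefer the first non-empty paragraph after the heading;
                # fall back to the heading text itself.
                paragraph = heading.find_next("p")
                if paragraph and paragraph.get_text(strip=True):
                    return paragraph.get_text(strip=True)
                return heading.get_text(strip=True)
        return None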
It's interesting to note that Google (for example) uses a keyword-specific extract of the page contents to display as the description, rather than a static description. Not sure if that'll work for your situation, though.

How can I make HTML safe for web browser with python?

How can I make HTML from email safe to display in a web browser with Python?
Any external references shouldn't be followed when displayed. In other words, all displayed content should come from the email and nothing from internet.
Other than spam, emails should be displayed as closely as possible to the way the writer intended.
I would like to avoid coding this myself.
Solutions requiring latest browser (firefox) version are also acceptable.
html5lib contains an HTML+CSS sanitizer. It allows too much currently, but it shouldn't be too hard to modify it to match the use case.
Found it from here.
I'm not quite clear on what exactly you mean by "safe". It's a pretty big topic... but, for what it's worth:
In my opinion, the stripping parser from the ActiveState Cookbook is one of the easiest solutions. You can pretty much copy/paste the class and start using it.
Have a look at the comments as well. The last one states that it doesn't work anymore, but I also have this running in an application somewhere and it works fine. From work, I don't have access to that box, so I'll have to look it up over the weekend.
Use the HTMLParser module, or install BeautifulSoup, and use those to parse the HTML and disable or remove the tags. This will leave whatever link text was there, but it will not be highlighted and it will not be clickable, since you are displaying it with a web browser component.
You could make it clearer what was done by replacing the <A></A> with a <SPAN></SPAN> and changing the text decoration to show where the link used to be. Maybe a different shade of blue than normal and a dashed underscore to indicate brokenness. That way you are a little closer to displaying it as intended without actually misleading people into clicking on something that is not clickable. You could even add a hover in Javascript or pure CSS that pops up a tooltip explaining that links have been disabled for security reasons.
Similar things could be done with <IMG></IMG> tags including replacing them with a blank rectangle to ensure that the page layout is close to the original.
I've done stuff like this with Beautiful Soup, but HTMLParser is included with Python. In older Python distributions there was an htmllib module, which is now deprecated. Since the HTML in an email message might not be fully correct, use Beautiful Soup 3.0.7a, which is better at making sense of broken HTML.
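A minimal sketch of the link- and image-neutralising idea with BeautifulSoup; the class names are illustrative, and a real sanitiser would also need to handle scripts, styles and event attributes:

    from bs4 import BeautifulSoup

    def neutralise(html):
        soup = BeautifulSoup(html, "html.parser")

        # Replace each <a> with a <span> so the link text stays visible
        # but nothing is clickable or followed.
        for a in soup.find_all("a"):
            span = soup.new_tag("span")
            span.string = a.get_text()
            span["class"] = "disabled-link"  # style this to hint at the removed link
            a.replace_with(span)

        # Replace external images with an empty placeholder so nothing is
        # fetched from the internet while the layout stays roughly intact.
        for img in soup.find_all("img"):
            placeholder = soup.new_tag("span")
            placeholder["class"] = "blocked-image"
            img.replace_with(placeholder)

        return str(soup)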
