Automatically logging advertising data from Ghostery plugin with Selenium? - python

I'm interested in keeping an eye on which advertising networks are running on a variety of websites. The Ghostery browser plugin does a great job of showing me which ad networks are used on any website. For example, on StackOverflow, Ghostery says we're being monitored by DoubleClick, Google Analytics, Quantcast, and ScoreCard.
On a weekly basis, I'd like to use Selenium to automatically browse a few hundred websites and save the Ghostery data associated with them. Using the Python bindings for Selenium, I wrote out some rough pseudocode:
import selenium.webdriver as webdriver

urls = ['https://www.stackoverflow.com', 'https://www.amazon.com', ...]

driver = webdriver.Firefox()
for url in urls:
    driver.get(url)
    # now, how do I access Ghostery's analysis of this URL?
I suppose the broader question is "from Selenium, how do I connect to other browser plugins?"
For fun, I posted an example of what Ghostery's UI looks like (which I'd like to access programmatically):

Selenium is used to access and interact with a browser's DOM. Selenium is not able to access a browser's controls; it is a completely inappropriate tool for what you want to accomplish.

In general, it's not possible for Selenium to access extensions directly. If you want to do that, you will have to build a bridge.
For Ghostery specifically, what you are looking for exists as an open source project here: https://github.com/ghostery/areweprivateyet
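As a first step, you can at least load the Ghostery extension into the Selenium-driven Firefox instance; reading its findings back out still requires a bridge of some kind (for example, an extension build that writes its results into the page or to disk). A minimal sketch, assuming you have the Ghostery .xpi saved locally (the path below is hypothetical):

import selenium.webdriver as webdriver

driver = webdriver.Firefox()
# Load the extension for this browsing session only (the .xpi path is a placeholder).
driver.install_addon('/path/to/ghostery.xpi', temporary=True)
driver.get('https://www.stackoverflow.com')
# Ghostery now runs in this session, but Selenium still cannot read its panel;
# the extension's output has to be exposed to the page (or a file) before it can be collected.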

There appears to be a limited Ghostery API described at https://purplebox.ghostery.com/post/1016023438#more-1016023438

Related

Google returning different layouts for pagination

I am using Selenium and Chrome to search on Google, but it is returning different layouts for pagination. I am using different proxies and different user agents via the fake_useragent library.
I only want the second image's layout. Does anybody know how I can get it every time?
First Image
Second Image
The issue was that the fake_useragent library sometimes returned old user agents, even after I updated its database. I tried this library (https://pypi.org/project/latest-user-agents/) instead, and it returns newer user agents.
Here is the working code.
from latest_user_agents import get_latest_user_agents
import random
from selenium import webdriver

# Use a raw string so the backslashes in the Windows path are not treated as escapes.
PATH = r'C:\Program Files (x86)\chromedriver.exe'
proxy = ''
url = ''

# Pick a random, recent user agent.
user_agent = random.choice(get_latest_user_agents())

options = webdriver.ChromeOptions()
options.add_argument(f'--proxy-server={proxy}')
options.add_argument(f'user-agent={user_agent}')

driver = webdriver.Chrome(PATH, options=options)
driver.get(url)
The difference between the two layouts is that when JavaScript is disabled, Google shows the pagination in the first image's layout.
To ensure that you get the second layout every time, you need to make sure JavaScript is enabled.
If you have a Chrome driver from Selenium, e.g. options = webdriver.ChromeOptions(), the following would make sure JavaScript is always enabled:
options.add_argument("--enable-javascript")
Edit based on OP's comment
I got it working by using the latest_user_agents library. The fake_useragent library was sometimes returning old user agents; that's why it was showing the old layout.
Installing the latest_user_agents library: https://pypi.org/project/latest-user-agents/
Hey, don't try to automate Google and Google products with automation tools, because Google changes the web elements and look of their pages every day.
For multiple reasons, logging into sites like Gmail and Facebook using WebDriver is not recommended. Aside from being against the usage terms for these sites (where you risk having the account shut down), it is slow and unreliable.
The ideal practice is to use the APIs that email providers offer, or in the case of Facebook the developer tools service which exposes an API for creating test accounts, friends, and so forth. Although using an API might seem like a bit of extra hard work, you will be paid back in speed, reliability, and stability. The API is also unlikely to change, whereas webpages and HTML locators change often and require you to update your test framework.
Logging in to third-party sites using WebDriver at any point of your test increases the risk of your test failing because it makes your test longer. A general rule of thumb is that longer tests are more fragile and unreliable.
WebDriver implementations that are W3C conformant also annotate the navigator object with a WebDriver property so that Denial of Service attacks can be mitigated.

How do I hide the fact I'm using a bot?

For my Python Selenium script I have to complete a lot of CAPTCHAs. I noticed that when I get the CAPTCHAs in my regular browser they're much easier and quicker. Is there a way for me to hide the fact that I'm using a web automation bot so I get the easier CAPTCHAs?
I already tried randomizing the user agent, but with no success.
You can go to your website and inspect the page. Open the developer tools and select the Network tab. Reload the page, then select the request for the web page you are accessing from the list. If you scroll down, you can see the user agent that your browser is using to access the page. Use that user agent in your scraper to exactly mimic your browser.
From a generic perspective, there are no proven ways to hide the fact that you are using a Selenium-driven web automation bot.
You can find a relevant detailed discussion in Can a website detect when you are using Selenium with chromedriver?
However, in certain cases modifying the navigator.webdriver flag helps to prevent detection.
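As a rough illustration (not a guarantee against detection), the flag can be masked in Chrome through a combination of ChromeOptions and a Chrome DevTools Protocol call; which of these options are still needed varies with the Chrome and ChromeDriver versions in use:

from selenium import webdriver

options = webdriver.ChromeOptions()
# Ask Chromium not to expose the usual automation hints.
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_experimental_option('excludeSwitches', ['enable-automation'])
options.add_experimental_option('useAutomationExtension', False)

driver = webdriver.Chrome(options=options)

# Overwrite navigator.webdriver before any page script runs.
driver.execute_cdp_cmd(
    'Page.addScriptToEvaluateOnNewDocument',
    {'source': "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"},
)

driver.get('https://www.example.com')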
References
You can find a couple of relevant detailed discussions in:
Is there a way to use Selenium WebDriver without informing the document that it is controlled by WebDriver?
Selenium Chrome gets detected
How does recaptcha 3 know I'm using selenium/chromedriver?

How to approach web-scraping in python

I am new to Python and have just started with web scraping. I have to scrape data from this realtor site.
I need to scrape all the details of real-estate agents according to their real-estate agency.
For this, in the web browser I have to follow these steps:
Go to this site.
Click on the agency offices button, enter the 4000 pin in the search box, and then submit.
Then we get a list of the agencies.
Go to the our team tab, and then we get the agents there.
Then we have to go to each agent's page and record their information.
Can anyone tell me how to approach this? What's the best way to build this type of scraper?
Do I have to use Selenium for the interaction with the pages?
I have worked with Requests, BeautifulSoup, and simple form submission using mechanize.
For a search-driven site like this, I would recommend either Selenium or Requests with sessions. The advantage of Selenium is that it will probably work; however, it will be slow. For Selenium, you can just use the Selenium IDE (a Firefox add-on) to record what you do, then get the HTML from the web page and use BeautifulSoup to parse the data.
If you want to scrape the data quickly and without using many resources, I usually use Requests with sessions. To scrape a website like this, open up a modern web browser (Firefox, Chrome) and use that browser's network tools (usually located in the developer tools, or via right click > Inspect Element). Once you are recording the network traffic, you can interact with the web page to see the connections made to the server. In an example search they may use suggestions, e.g.
https://suggest.example.com.au/smart-suggest?query=4000&n=7&regions=false
The response will then probably be JSON containing the suggested results. Once you select a suggestion, you can just submit a request with those search parameters, e.g.
https://www.example.com.au/find-agent/agents/petrie-terrace-qld-4000
The URLs for the agents will then be in that HTML page; you then just need to send a separate request to each page and extract the information using BeautifulSoup.
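A minimal sketch of that flow, assuming the suggest endpoint and listing URL shown above (both are placeholders from this answer, not the real site) and a hypothetical CSS selector for the agent profile links:

import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Hypothetical suggest endpoint from the example above; assumed to return JSON suggestions.
suggestions = session.get(
    'https://suggest.example.com.au/smart-suggest',
    params={'query': '4000', 'n': 7, 'regions': 'false'},
).json()

# Fetch the listing page for the chosen suggestion and parse it.
listing = session.get('https://www.example.com.au/find-agent/agents/petrie-terrace-qld-4000')
soup = BeautifulSoup(listing.text, 'html.parser')

# 'a.agent-profile' is a hypothetical selector; inspect the real page to find the right one.
agent_urls = [a['href'] for a in soup.select('a.agent-profile')]

for agent_url in agent_urls:
    agent_page = session.get(agent_url)
    agent_soup = BeautifulSoup(agent_page.text, 'html.parser')
    # ... extract the agent details you need from agent_soup here ...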
You might want to give Node and jQuery a try. I used to use Python all the time, but it gets messy and hard to maintain after a while.
Using Node, you can turn the page HTML into a DOM object and then scrape all the data very easily using jQuery. I have done this for IMDb here: "Using jQuery & NodeJS to scrape the web" by @asimmittal, https://medium.com/@asimmittal/using-jquery-nodejs-to-scrape-the-web-9bb5d439413b
You can modify this to scrape Yelp.

How can I input data into a webpage to scrape the resulting output using Python?

I am familiar with using BeautifulSoup and urllib2 to scrape data from a web page. However, what if a parameter needs to be entered into the page before the result that I want to scrape is returned?
I'm trying to obtain the geographic distance between two addresses using this website: http://www.freemaptools.com/how-far-is-it-between.htm
I want to be able to go to the page, enter two addresses, click "Show", and then extract the "Distance as the Crow Flies" and "Distance by Land Transport" values and save them to a dictionary.
Is there any way to input data into a webpage using Python?
Take a look at tools like mechanize or scrape:
http://pypi.python.org/pypi/mechanize
http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/
http://www.ibm.com/developerworks/linux/library/l-python-mechanize-beautiful-soup/
http://zesty.ca/scrape/
Packt Publishing has an article on that matter, too:
http://www.packtpub.com/article/web-scraping-with-python
Yes! Try mechanize for this kind of Web screen-scraping task.
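A minimal mechanize sketch for a plain HTML form (the URL, form index, and field names below are hypothetical; this approach will not see results that are generated by JavaScript):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # some sites block mechanize via robots.txt

# Hypothetical page containing a plain HTML form.
br.open('http://example.com/distance-form')

# Select the first form on the page and fill in its fields (names are hypothetical).
br.select_form(nr=0)
br['from_address'] = 'Florida, USA'
br['to_address'] = 'New York, USA'

response = br.submit()
html = response.read()
# Parse `html` (e.g. with BeautifulSoup) to extract the values you need.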
I think you can also use PySide/PyQt, because they include a QtWebKit browser core: you can control the browser to open pages, simulate human actions (filling, clicking, ...), and then scrape data from the pages. FMiner works this way; it's web scraping software I developed with PySide.
Or you can try PhantomJS. It's an easy way to control a browser, but note that it's JavaScript, not Python.
In addition to the answers already given, you could simply make a request to that page directly. Using your browser, you can always inspect the network behaviour (under Tools / Web Developer Tools) and the requests made when you interact with the page. E.g. http://www.freemaptools.com/ajax/getaandb.php?a=Florida_Usa&b=New%20York_Usa&c=6052 is the request that returns the results page you are expecting. Request that URL and scrape the fields you want. IMHO, direct page requests are much faster than screen scraping (on a case-to-case basis).
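For instance, a minimal Requests sketch that reproduces the request above; the exact parameters and the format of the response (plain text, XML, or JSON) should first be checked in the browser's network tab:

import requests

# Endpoint and parameters taken from the example request observed in the network tab.
response = requests.get(
    'http://www.freemaptools.com/ajax/getaandb.php',
    params={'a': 'Florida_Usa', 'b': 'New York_Usa', 'c': '6052'},
)
print(response.status_code)
print(response.text)  # inspect this to see how the distance values are encoded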
But of course, you could also always do screen scraping / browser simulation (mechanize, Splinter) and use headless browsers (PhantomJS, etc.) or the driver for the browser you want to use.
The query may already have been resolved.
You can use Selenium WebDriver for this purpose. A web page can be interacted with programmatically, and all the operations can be performed as if a human user were accessing the web page.
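A rough sketch of what that could look like for this page; the element locators below are hypothetical placeholders and would need to be replaced with the IDs from the page's actual HTML:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('http://www.freemaptools.com/how-far-is-it-between.htm')

# Fill in both addresses and click Show (all three IDs are hypothetical).
driver.find_element(By.ID, 'fromInput').send_keys('Florida, USA')
driver.find_element(By.ID, 'toInput').send_keys('New York, USA')
driver.find_element(By.ID, 'showButton').click()

time.sleep(5)  # crude wait for the result to render; a WebDriverWait would be cleaner

distances = {
    'crow_flies': driver.find_element(By.ID, 'crowFliesResult').text,
    'land_transport': driver.find_element(By.ID, 'landTransportResult').text,
}
print(distances)
driver.quit()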

Python and webbrowser form fill

Hello, how can I make changes in my web browser with Python, like filling in forms and pressing Submit?
What libraries should I use? And maybe some of you have examples?
Using urllib does not make any changes in the opened browser for me.
urllib is not intended to do anything with your browser, but rather to get contents from URLs.
To fill in forms and that kind of thing, have a look at mechanize; to scrape the web pages, consider using pyquery.
Selenium is great for this. It's a browser automation tool that you can use to launch a browser (any major browser or a 'headless' one), navigate to a URL, and interact with the page.
It's used primarily for testing web code against multiple browsers, but it is also very useful for 'scraping' pages and automating mundane tasks.
Here are the Python docs: http://selenium-python.readthedocs.org/en/latest/index.html
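A minimal form-filling sketch using those bindings; the URL and field name are hypothetical, so swap in the form you actually want to drive:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('https://example.com/search')  # hypothetical page containing a form

# Locate the input by its name attribute (hypothetical), type into it, and submit the form.
search_box = driver.find_element(By.NAME, 'q')
search_box.send_keys('selenium python')
search_box.submit()

print(driver.title)  # the browser is now on the results page
driver.quit()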
