Best way to automate creation routine using Python

Usually, every day I have to create emails, save them into a file, then use them to register accounts, and so on. It's really boring and takes a good amount of time that I don't want to waste.
I want to automate this process. As I understand it, this is called a "bot". It should go through a few websites, click some buttons, scrape the needed information, store what it collects, and fill in some forms. Is it possible to do this with Python? If so, what's the most compact way to do it?

Python's Selenium bindings are a great way to automate browser sessions, scrape page data, fill out forms, click buttons, and so on. Selenium lets the page's JavaScript run, then parses the DOM for you and makes it available as Python objects through a well-documented API:
http://selenium-python.readthedocs.org/en/latest/
Much, much easier than trying to parse it yourself.
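To make that concrete, here is a minimal sketch of the kind of session Selenium can drive; the URL, field names, and selectors are placeholders, not anything from the question:

```python
# Minimal Selenium sketch: open a page, fill a form, click a button, read some text.
# The URL and element names are hypothetical placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes Chrome and a matching driver are available
try:
    driver.get("https://example.com/signup")

    driver.find_element(By.NAME, "email").send_keys("me@example.com")
    driver.find_element(By.NAME, "password").send_keys("s3cret")
    driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

    # Scrape something from the resulting page.
    message = driver.find_element(By.ID, "status").text
    print(message)
finally:
    driver.quit()
```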

The Scrapy Python framework should meet your needs:
http://scrapy.org/
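For comparison, a bare-bones Scrapy spider looks roughly like this; the site and CSS selectors are purely illustrative:

```python
# A bare-bones Scrapy spider; run with: scrapy runspider quotes_spider.py -o out.json
# The start URL and CSS selectors are illustrative placeholders.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # Follow the pagination link, if there is one.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```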

Related

Scraping PDFs from a password-protected website

I work in tech support and currently have to keep our product manuals updated manually by periodically checking whether each one has an update and, if it does, replacing the current copy saved on our network.
I was wondering whether it would be possible to build a small program to download all the files on a supplier's website and have them automatically sorted into the right folders for those products, replacing the current PDFs in each folder. I should also note that the website is password protected and is organised into folders.
Is this possible with Python? I figured a small program I could run once a week or so to automatically update our manuals would be super useful (and a learning experience).
Apologies if I haven't explained the requirement well; if you have any questions, let me know.
It's certainly possible. As the other answer suggests, you will want to use libraries like Requests (to handle HTTP requests) or Selenium (to automate a browser) to get through the login.
You'll then need to sort through the links on a given page, which is ideally done with BeautifulSoup (an HTML parser) but could also be done with Selenium. You'll want Requests for downloading the PDFs, and the os module for sorting the downloads into specific folders and replacing files.
I strongly urge you to think through the steps, but I hope that gives you an idea of the libraries you'll need to learn a bit about. The most challenging thing to learn will be Selenium, so if you can use Requests to do the login, that is much better.
If you've got a basic grasp of Python, the Requests, os, and BeautifulSoup libraries are not difficult to pick up.
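As a rough illustration of the Requests-based route, assuming the site uses a plain HTML login form (all URLs, form field names, and folder paths below are invented for the example):

```python
# Rough sketch: log in with a session, then download PDFs into per-product folders.
# All URLs, form fields, and paths are hypothetical placeholders.
import os
import requests

BASE = "https://supplier.example.com"
MANUALS = {
    # product folder on the network share -> manual URL on the supplier site
    r"\\share\manuals\widget-a": f"{BASE}/downloads/widget-a/manual.pdf",
    r"\\share\manuals\widget-b": f"{BASE}/downloads/widget-b/manual.pdf",
}

with requests.Session() as session:
    # Log in once; the session keeps the auth cookie for the later requests.
    session.post(f"{BASE}/login", data={"username": "me", "password": "secret"})

    for folder, url in MANUALS.items():
        response = session.get(url)
        response.raise_for_status()
        os.makedirs(folder, exist_ok=True)
        target = os.path.join(folder, os.path.basename(url))
        with open(target, "wb") as fh:
            fh.write(response.content)  # overwrite the old PDF
        print("updated", target)
```

If the login turns out to involve CSRF tokens or JavaScript, that is the point where Selenium becomes the easier option.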
You can use Selenium for browser automation. It can enter the password for you (although any "are you a robot" checks might stop you), and then you can download the PDFs simply by setting a default download location and clicking the download button. The browser will save the files to that location.
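If you do go the Selenium route, Chrome's download directory can be set through its preferences; a hedged sketch, with the path, URL, and link text as placeholders:

```python
# Sketch: point Chrome's downloads at a known folder, then click a download link.
# The download path, URL, and link text are hypothetical placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {
    "download.default_directory": r"C:\manuals\incoming",
    "download.prompt_for_download": False,
    "plugins.always_open_pdf_externally": True,  # download PDFs instead of previewing
})

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://supplier.example.com/downloads")
    driver.find_element(By.LINK_TEXT, "Widget A manual").click()
finally:
    driver.quit()
```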

Are Selenium or other web-scraper tools mandatory for scraping data from Chrome into a Python script?

I wanted to scrape a website's data, and I have used Selenium in my Python script to do it. But I noticed that in the Network section of Google Chrome's Inspect tool, Chrome records the XmlHttpRequests and shows the JSON/XML files the site loads. So I was wondering: can I use this data directly in my Python script, since Selenium is quite heavyweight and needs more bandwidth? Do Selenium or other web-scraper tools have to be used as a medium to communicate with the browser? If not, please share some information about scraping data for my Python script using only what Chrome itself shows.
Definitely! Check out the requests module.
From there you can access the page source, and using data from it you can access the different aspects separately. Here are the things to consider though:
Pros:
Faster, with less to download; for things like AJAX requests it is far more efficient.
Does not require a graphical UI the way Selenium does.
More precise; you get exactly what you need.
You can set headers/cookies/etc. before making requests.
Images may be downloaded separately, with no obligation to download any of them.
Allows as many sessions as you want to be open in parallel, each with different options (proxies, no cookies, consistent cookies, custom headers, blocked redirects, etc.) without affecting the others.
Cons:
Much harder to get into than Selenium; it requires at least minimal knowledge of HTTP GET and POST, and a library like re or BeautifulSoup to extract the data.
For pages with JavaScript-generated data, depending on how the JavaScript is implemented (or obfuscated), extracting the wanted data is always possible but can be extremely difficult.
Conclusion:
I suggest you definitely learn requests and use it for most cases; however, if the JavaScript gets too complicated, switch to Selenium for an easier solution. Look for some tutorials online, and then check the official documentation for an overview of what you've learned.
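To make the requests suggestion concrete, here is a sketch of replaying one of those XHR calls directly; the endpoint, parameters, and headers are invented stand-ins for whatever the Network tab actually shows:

```python
# Sketch: replay an XHR seen in Chrome's Network tab directly with requests.
# The endpoint, parameters, and headers are hypothetical placeholders; copy the
# real ones from the request Chrome recorded.
import requests

url = "https://example.com/api/items"
params = {"page": 1, "per_page": 50}
headers = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",  # some endpoints expect this header
}

response = requests.get(url, params=params, headers=headers, timeout=10)
response.raise_for_status()

data = response.json()  # the same JSON the browser received
for item in data.get("items", []):
    print(item.get("name"))
```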

How to fill textareas, select an option (select tag), and hit submit (input tag) via Python?

I work with Python and data-mine some content, which I categorize into different categories.
Then I go to a specific webpage and submit the results manually.
Is there a way to automate the process? I guess this is a "form-submit" question, but I haven't seen any relevant module in Python. Can you suggest something?
Selenium WebDriver is the most popular way to drive web pages from Python. BeautifulSoup is also worth knowing, but it only parses HTML; on its own it cannot fill in or submit forms, so you would pair it with something like Requests.
If you want to make this automatic you have two options: look at which parameters the form sends and make a request with those parameters to the endpoint directly from your Python app, or use a package that simulates a browser and fills in the form. I think the correct way is to make the request directly from your app, as sketched below.
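A minimal sketch of that direct-request approach; the form's action URL and field names below are made up, and in practice you would copy them from the page's HTML or from the browser's developer tools:

```python
# Sketch: submit the form directly by POSTing the same fields the browser would send.
# The URL and field names are hypothetical placeholders.
import requests

payload = {
    "title": "My categorized result",
    "body": "Long text that would normally go into the textarea...",
    "category": "news",  # the value attribute of the chosen <option>
}

response = requests.post("https://example.com/submit", data=payload, timeout=10)
response.raise_for_status()
print(response.status_code, response.url)
```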

Efficient way to scrape images from website in Django/Python

First I guess I should say I am still a bit of a Django/Python noob. I am in the midst of a project that allows users to enter a URL, the site scrapes the content from that page and returns images over a certain size and the page title tag so the user can then pick which image they want to use on their profile. A pretty standard scenario I assume. I have this working by using Selenium (headless Chrome browser) to grab the destination page content, some python to determine the file size and then my Django view spits it all out into a template. I then have it coded in such a way that the image the user selects will be downloaded and stored locally.
However, I seriously doubt the scalability of this. It's currently just running locally, and I am very concerned about how it would cope if there were lots of users all running it at the same time. I am firing up that headless Chrome browser every time a request is made, which doesn't sound efficient, and I am having to download each image to determine its size so I can decide whether it's large enough. One example took 12 seconds to get from me submitting the URL to displaying the results to the user, whereas the same destination URL put through www.kit.com (which has very similar web-scraping functionality) took 3 seconds.
I have not provided any code, as the code I have does what it should; I think the approach, however, is incorrect. To summarise, what I want is:
To allow a user to enter a URL and for it to return all images (or just the URLs to those images) from that page over a certain size (width/height), and the page title.
For this to be the most efficient solution, taking into account it would be run concurrently between many users at once.
For it to work in a Django (2.0) / Python (3+) environment.
I am not completely against using the API from a 3rd party service if one exists, but it would be my least preferred option.
Any help/pointers would be much appreciated.
You can use two Python solutions in your case:
1) BeautifulSoup; there are good answers elsewhere on how to download images with it. You just have to make it a separate function and pass the site in as an argument. It is also very easy to parse only the image links, as you said, depending on the speed you need (obviously scraping the files themselves, especially when there are a lot of them, will be much slower than scraping links). This tool is just for parsing and scraping the content of a page.
2) Scrapy; this is a much more powerful tool, a framework. With it you can connect your spider to Django models and handle images much more efficiently using its built-in image pipelines. It is much more flexible, with a lot of features for working with scraped data. I am not sure whether you need it in your project or whether it would be overkill in your case.
My advice is also to run the spider in some background task queue such as Celery and fetch the result via AJAX, because it may take some time to parse the content, so don't make the user wait for the response.
P.S. You can even combine those two tools in some cases :)
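For the lighter-weight BeautifulSoup route, here is a hedged sketch of collecting image URLs with Requests and filtering by pixel size with Pillow (the URL and size threshold are placeholders). Note that it still downloads each image to measure it, which is one more reason to run it in a background task as suggested above:

```python
# Sketch: fetch a page, collect <img> URLs, and keep only images over a minimum size.
# Uses requests + BeautifulSoup + Pillow; the URL and threshold are placeholders.
from io import BytesIO
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from PIL import Image

MIN_WIDTH, MIN_HEIGHT = 300, 300

def large_images(page_url):
    page = requests.get(page_url, timeout=10)
    soup = BeautifulSoup(page.text, "html.parser")
    title = soup.title.string if soup.title else ""

    results = []
    for img in soup.find_all("img", src=True):
        img_url = urljoin(page_url, img["src"])
        data = requests.get(img_url, timeout=10).content
        width, height = Image.open(BytesIO(data)).size
        if width >= MIN_WIDTH and height >= MIN_HEIGHT:
            results.append(img_url)
    return title, results

print(large_images("https://example.com/some-article"))
```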

Scraping a web page as you manually navigate

Is there a way, using some library or method, to scrape a webpage in real time as a user navigates it manually? Most scrapers I know of, such as Python's mechanize, create a browser object that emulates a browser; of course this is not what I am looking for, since if I have a browser open, it will be different from the one mechanize creates.
If there is no solution: my problem is that I want to scrape elements from an HTML5 game to make an intelligent agent of sorts. I won't go into more detail, but I suspect that if others try to do the same in the future (or any real-time scraping with a real user), a solution to this could be useful for them as well.
Thanks in advance!
Depending on what your use-case is, you could set up a SOCKS proxy or some other form of proxy and configure it to log all traffic, then instruct your browser to use it. You'd then scrape that log somehow.
Similarly, if you have control over your router, you could configure capture and logging there, e.g. using tcpdump. This wouldn't decrypt encrypted traffic, of course.
If you are working with just one browser, there may be a way to instruct it to do something at each action via a custom browser plugin, but I'd guess you'd run into security-model issues a lot.
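One concrete way to do the proxy-logging idea is an addon script for a proxy such as mitmproxy (my suggestion, not something named in the answer); the hostname filter and log format here are illustrative:

```python
# Sketch of a mitmproxy addon that logs traffic as the user browses normally.
# Run with: mitmproxy -s log_traffic.py   (then point the browser at the proxy)
# mitmproxy is one possible proxy choice here, not the only one.
from mitmproxy import http


class LogTraffic:
    def response(self, flow: http.HTTPFlow) -> None:
        # Only record responses from the game's domain (placeholder hostname).
        if "game.example.com" in flow.request.pretty_host:
            with open("traffic.log", "a", encoding="utf-8") as fh:
                fh.write(f"{flow.request.method} {flow.request.pretty_url}\n")
                fh.write((flow.response.get_text() or "")[:500] + "\n\n")


addons = [LogTraffic()]
```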
The problem with an HTML5 game is that typically most of its "navigation" is done with a lot of JavaScript, which is doing a lot of work: manipulating the DOM, triggering requests for new content to fit into the DOM, and so on.
Because of this you might be better off looking into OS-level or browser-level scripting services that can "drive" keyboard and mouse events, take screenshots, or possibly even take a snapshot of the current page DOM and query it.
You might investigate browser automation and testing frameworks like Selenium for this.
I am not sure whether this would work in your situation, but it is possible to create a simple web browser using PyQt, which will work with HTML5, and from this it might be possible to capture what is going on while a live user plays the game.
I have used PyQt for a simple browser window (for a completely different application) and it seems to handle simple, sample HTML5 games. How one would delve into the details of what is going on in the game is a question for PyQt experts, not me.
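For what it's worth, a minimal PyQt5 browser window along those lines might look like this (it assumes the PyQt5 and PyQtWebEngine packages are installed; the game URL is a placeholder):

```python
# Sketch: a minimal PyQt5 browser window that dumps the current DOM after each load.
# The URL is a hypothetical placeholder.
import sys

from PyQt5.QtCore import QUrl
from PyQt5.QtWebEngineWidgets import QWebEngineView
from PyQt5.QtWidgets import QApplication

app = QApplication(sys.argv)

view = QWebEngineView()
view.load(QUrl("https://example.com/game"))
view.show()

# toHtml() is asynchronous and hands the DOM string to a callback; here we just
# print its length once the page has finished loading.
view.loadFinished.connect(
    lambda ok: view.page().toHtml(lambda html: print(len(html), "characters of DOM"))
)

sys.exit(app.exec_())
```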
