I work in tech support and currently keep our product manuals up to date by hand: I periodically check whether each manual has an update and, if it does, replace the copy saved on our network.
I was wondering if it would be possible to build a small program to download all the files on a supplier's website and have them automatically sorted into the given folders for those products, replacing the current PDFs in each folder. I must also note that the website is password protected and is organised into folders.
Is this possible with Python? I figured a small program I could perhaps run once a week or something to automatically update our manuals would be super useful (and a learning experience).
Apologies if I haven't explained the requirement well, any questions let me know.
It's certainly possible. As the other answer suggests, you will want to use libraries like Requests (to handle HTTP requests) or Selenium (to automate browser activity) to navigate through the login.
You'll need to sort through the links on a given page, which is ideally done with BeautifulSoup (an HTML parser) but could also be done with Selenium. You'll also want Requests for downloading the PDFs, and the os module for sorting the downloads into the right folders and replacing the existing files.
I strongly urge you to think through the steps, but I hope that gives you an idea of the libraries you'll need to learn a bit about. The most challenging thing to learn will be Selenium, so if you can use Requests to do the login, that is much better.
If you've got a basic grasp of Python, the Requests, os and BeautifulSoup libraries are not difficult to pick up.
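To make that concrete, the whole pipeline might look roughly like the sketch below. It is only a sketch: the login URL, form field names, product folder and network path are all assumptions you would replace with the supplier's real ones.

```python
# A minimal sketch, assuming the supplier site uses a simple form login.
# Every URL, field name and path here is a placeholder.
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://supplier.example.com"      # hypothetical supplier site
LOCAL_ROOT = r"\\network\share\manuals"        # hypothetical network path

session = requests.Session()

# Log in once; the form URL and field names depend on the actual site.
session.post(urljoin(BASE_URL, "/login"),
             data={"username": "me", "password": "secret"})

# Fetch a product folder page and collect every link ending in .pdf.
page = session.get(urljoin(BASE_URL, "/manuals/product-a/"))
soup = BeautifulSoup(page.text, "html.parser")
pdf_links = [urljoin(page.url, a["href"])
             for a in soup.find_all("a", href=True)
             if a["href"].lower().endswith(".pdf")]

# Download each PDF into the matching local folder, overwriting old copies.
target_dir = os.path.join(LOCAL_ROOT, "product-a")
os.makedirs(target_dir, exist_ok=True)
for url in pdf_links:
    filename = os.path.basename(url)
    response = session.get(url)
    with open(os.path.join(target_dir, filename), "wb") as f:
        f.write(response.content)
```

Once something like that works for one product folder, you can loop over a list of products and schedule the script to run weekly (e.g. with Task Scheduler or cron).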
You can use Selenium for browser automation. It can enter the password for you (although any "are you a robot" check might stop you), and then you can download the PDFs simply by setting a default download location and clicking the download button. The browser will save the files to that default download location.
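For reference, a minimal sketch of that approach might look like the following; Chrome and chromedriver are assumed to be installed, and the URLs, element names and download path are placeholders.

```python
# Minimal sketch of the Selenium route. All URLs, selectors and the download
# path are hypothetical and depend on the actual supplier site.
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {
    "download.default_directory": r"C:\manuals\incoming",  # hypothetical path
    "download.prompt_for_download": False,
    "plugins.always_open_pdf_externally": True,  # save PDFs instead of previewing them
})

driver = webdriver.Chrome(options=options)

# Log in; the field names and selectors depend on the actual login page.
driver.get("https://supplier.example.com/login")            # hypothetical URL
driver.find_element(By.NAME, "username").send_keys("me")
driver.find_element(By.NAME, "password").send_keys("secret")
driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

# Click a manual's download link; the file lands in download.default_directory.
driver.get("https://supplier.example.com/manuals/product-a/")  # hypothetical URL
driver.find_element(By.PARTIAL_LINK_TEXT, "Manual").click()

driver.quit()
```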
I would like to scrape a website that does not have an API and is an "infinite scroller". I have been using Selenium for this, but now I need to scrape a lot more pages and do it all at once. The problem is that Selenium is very resource-intensive, since I am running a full (headless) Chrome browser in each instance, and it is also not stable at all (probably because of the limited resources, but still). I know that there is a way to look for the AJAX requests that the site uses and access them with the requests library, but I have two issues:
I can't seem to find the desired request
The ones that I do try to use with the requests library require the user to be logged in, and I have no idea how to do that (maybe pass cookies and whatnot; I am not a web developer).
Let me take Twitter as an example, since it is exactly the same as what I am describing (except that it has an API). You have to log in, and then the feed is loaded infinitely. So the goal is to "scroll" and take the content of each tweet. How can this be done? If you can, please provide a working example.
Thank you.
So I wanted to scrape a website's data. I have used Selenium in my Python script to scrape the data. But I have noticed that in the Network section of Google Chrome's Inspect tool, Chrome can record the XmlHttpRequests and show the JSON/XML files the website loads. So I was wondering: can I use this data directly in my Python script, since Selenium is quite heavyweight and needs more bandwidth? Do Selenium or other web-scraper tools have to be used as a medium to communicate with the browser? If not, please give me some information about getting that data for my Python script using only Chrome itself.
Definitely! Check out the requests module.
From there you can access the page source, and using data from it you can access the different aspects separately. Here are the things to consider though:
Pros:
Faster, less to download. For things like AJAX requests it is far more efficient.
Does not require a graphical browser the way Selenium does.
More precise; you get exactly what you need.
The ability to set headers/cookies/etc. before making requests.
Images may be downloaded separately, with no obligation to download any of them.
Allows as many sessions as you want to be opened in parallel, and each can have different options (proxies, no cookies, consistent cookies, custom headers, blocked redirects, etc.) without affecting the others.
Cons:
Much harder to get into than Selenium; it requires at least minimal knowledge of HTTP's GET and POST methods, and a library like re or BeautifulSoup to extract the data.
For pages with JavaScript-generated data, extracting the data you want is always possible but, depending on how the JavaScript is implemented (or obfuscated), can be extremely difficult.
Conclusion:
I suggest you definitely learn requests and use it for most cases; however, if the JavaScript gets too complicated, switch to Selenium for an easier solution. Look for some tutorials online, and then check the official documentation for an overview of what you've learned.
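As a concrete illustration of the requests-instead-of-Selenium idea, here is a minimal sketch. The endpoint URL, query parameters and cookie name are hypothetical; you would copy the real ones from the Network tab of Chrome's DevTools.

```python
# Minimal sketch: call the JSON endpoint the page's own JavaScript calls,
# instead of rendering the page in a browser. All names here are hypothetical.
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0",           # some sites check for a browser-like UA
    "X-Requested-With": "XMLHttpRequest",  # often sent by the page's own AJAX calls
})

# Paste cookies from a logged-in browser session if the endpoint requires login.
session.cookies.set("session_id", "value-copied-from-devtools")

# The endpoint and its paging parameter are whatever DevTools shows you.
response = session.get(
    "https://example.com/api/feed",
    params={"page": 1, "count": 20},
)
response.raise_for_status()

for item in response.json().get("items", []):
    print(item.get("id"), item.get("text"))
```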
Is there a way, using some library or method, to scrape a webpage in real time as a user navigates it manually? Most scrapers I know of, such as Python's mechanize, create a browser object that emulates a browser; of course this is not what I am looking for, since a browser I have open will be different from the one mechanize creates.
If there is no solution, my problem is that I want to scrape elements from an HTML5 game to make an intelligent agent of sorts. I won't go into more detail, but I suspect that if others try to do the same in the future (or any real-time scraping with a real user), a solution to this could be useful for them as well.
Thanks in advance!
Depending on what your use-case is, you could set up a SOCKS proxy or some other form of proxy and configure it to log all traffic, then instruct your browser to use it. You'd then scrape that log somehow.
Similarly, if you have control over your router, you could configure capture and logging there, e.g. using tcpdump. This wouldn't decrypt encrypted traffic, of course.
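For the proxy-logging route, one concrete option (my suggestion, it isn't mentioned above) is mitmproxy, an HTTP(S) proxy scriptable from Python; if you install its CA certificate in the browser it can also log decrypted HTTPS traffic. A minimal logging addon might look like this:

```python
# log_traffic.py -- a minimal mitmproxy addon sketch.
# Run with:  mitmdump -s log_traffic.py
# then point the browser's proxy settings at localhost:8080.
from mitmproxy import http

def response(flow: http.HTTPFlow) -> None:
    # Called once per completed request/response pair passing through the proxy.
    with open("traffic.log", "a", encoding="utf-8") as f:
        f.write(f"{flow.request.method} {flow.request.pretty_url} "
                f"-> {flow.response.status_code}\n")
```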
If you are working with just one browser, there may be a way to instruct it to do something at each action via a custom browser plugin, but I'd have to guess you'd be running into security model issues a lot.
The problem with an HTML5 game is that typically most of its "navigation" is done with a lot of JavaScript. That JavaScript is typically doing a lot: manipulating the DOM, triggering requests for new content to fit into the DOM, and so on.
Because of this you might be better off looking into OS-level or browser-level scripting services that can "drive" keyboard and mouse events, take screenshots, or possibly even take a snapshot of the current page DOM and query it.
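One possibility along those lines (again my suggestion, not something named above) is PyAutoGUI, which drives the keyboard and mouse and takes screenshots at the OS level. A rough sketch, with placeholder coordinates and keys:

```python
# Minimal sketch: OS-level input and screenshots with PyAutoGUI (an assumed
# choice of library). Coordinates and keys are placeholders for the real game.
import time
import pyautogui

time.sleep(3)  # give yourself a moment to focus the game window

for _ in range(10):
    frame = pyautogui.screenshot()   # grab the whole screen as an image
    frame.save("frame.png")          # or feed it straight to your agent

    pyautogui.click(400, 300)        # placeholder click position
    pyautogui.press("space")         # placeholder key press
    time.sleep(0.5)
```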
You might investigate browser automation and testing frameworks like Selenium for this.
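A hedged sketch of the Selenium route: drive the browser from Python, let the user (or the agent) play in that window, and periodically snapshot the DOM and the screen. The URL and CSS selector below are placeholders.

```python
# Minimal sketch: poll the live DOM of a page driven by Selenium.
# The URL and the selector are placeholders for the actual game.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/game")  # hypothetical game URL

try:
    while True:
        # Take a snapshot of the rendered DOM and of any elements of interest.
        html = driver.page_source
        scores = driver.find_elements(By.CSS_SELECTOR, ".score")  # placeholder selector
        print(len(html), [s.text for s in scores])

        # Screenshots can feed an image-based agent instead of (or as well as) the DOM.
        driver.save_screenshot("frame.png")

        time.sleep(0.5)  # poll twice a second
finally:
    driver.quit()
```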
I am not sure whether this would work in your situation, but it is possible to create a simple web browser using PyQt that will work with HTML5, and from this it might be possible to capture what is going on when a live user plays the game.
I have used PyQt for a simple browser window (for a completely different application) and it seems to handle simple, sample HTML5 games. How one would delve into the details of what is going on the game is a question for PyQt experts, not me.
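For what it's worth, a bare-bones PyQt browser window that dumps the page HTML once it has loaded is only a few lines. This assumes PyQt5 with the QtWebEngine module installed; going deeper than that is, as said, a question for PyQt experts.

```python
# Minimal sketch of a PyQt5/QtWebEngine browser window that prints how much
# HTML the page produced once it has loaded. Assumes PyQt5 and PyQtWebEngine.
import sys
from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEngineView

app = QApplication(sys.argv)
view = QWebEngineView()

def dump_html(ok):
    # Called when the page finishes loading; toHtml delivers the HTML asynchronously.
    if ok:
        view.page().toHtml(lambda html: print(len(html), "characters of HTML"))

view.loadFinished.connect(dump_html)
view.load(QUrl("https://example.com/game"))  # hypothetical game URL
view.show()

sys.exit(app.exec_())
```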
I'm a little new to web crawlers and such, though I've been programming for a year already. So please bear with me as I try to explain my problem here.
I'm parsing info from Yahoo! News, and I've managed to get most of what I want, but there's a little portion that has stumped me.
For example: http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html
I want to get the numbers beside the thumbs-up and thumbs-down icons in the comments. When I use "Inspect Element" in my Chrome browser, I can clearly see the things that I have to look for - namely, an em tag under the div class 'ugccmt-rate'. However, I'm not able to find this in my Python program. In trying to track down the root of the problem, I clicked to view the source of the page, and it seems that this tag is not there. Do you guys know how I should approach this problem? Does this have something to do with the JavaScript on the page that displays the info only after it runs? I'd appreciate some pointers in the right direction.
Thanks.
The page is being generated via JavaScript.
Check if there is a mobile version of the website first. If not, check for any APIs or RSS/Atom feeds. If there's nothing else, you'll either have to manually figure out what the JavaScript is loading and from where, or use Selenium to automate a browser that renders the JavaScript for you for parsing.
Using the Web Console in Firefox you can pretty easily see what requests the page is actually making as it runs its scripts, and figure out what URI returns the data you want. Then you can request that URI directly in your Python script and tease the data out of it. It is probably in a format that Python already has a library to parse, such as JSON.
Yahoo! may have some things on their server side to try to prevent you from accessing these data files from a script, such as checking the browser (User-Agent header), cookies, or referrer. These can all be faked with enough perseverance, but you should take their existence as a sign that you should tread lightly. (They may also limit the number of requests you can make in a given time period, which is much harder to get around.)
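To make those last two paragraphs concrete, here is a hedged sketch; the endpoint URL and parameter are placeholders for whatever the Web Console actually shows for the comment-rating requests.

```python
# Minimal sketch: request the data URI found via the browser's network console
# directly, sending browser-like headers. The endpoint and params are placeholders.
import requests

headers = {
    "User-Agent": "Mozilla/5.0",  # pretend to be a normal browser
    "Referer": "http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html",
}

resp = requests.get(
    "http://news.yahoo.com/some/ratings/endpoint",  # hypothetical: copy the real URI from the console
    params={"article_id": "225730172"},             # hypothetical parameter
    headers=headers,
)
resp.raise_for_status()

data = resp.json()  # the payload is typically JSON, as noted above
print(data)
```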
I want to be able to download a page and all of its associated resources (images, style sheets, script files, etc) using Python. I am (somewhat) familiar with urllib2 and know how to download individual urls, but before I go and start hacking at BeautifulSoup + urllib2 I wanted to be sure that there wasn't already a Python equivalent to "wget --page-requisites http://www.google.com".
Specifically I am interested in gathering statistical information about how long it takes to download an entire web page, including all resources.
Thanks
Mark
Websucker? See http://effbot.org/zone/websucker.htm
websucker.py doesn't follow CSS @import links. HTTrack.com is not Python, it's C/C++, but it is a good, maintained utility for downloading a website for offline browsing.
http://www.mail-archive.com/python-bugs-list#python.org/msg13523.html
[issue1124] Webchecker not parsing css "@import url"
Guido> This is essentially unsupported and unmaintained example code. Feel free
to submit a patch though!
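Since there doesn't seem to be a maintained pure-Python equivalent of "wget --page-requisites", here is a small sketch of the BeautifulSoup-based approach the question mentions, using requests rather than urllib2 (my substitution), including the timing the questioner is after. CSS @import links are deliberately left out, which is exactly the gap noted above.

```python
# Minimal sketch: download a page plus its images, stylesheets and scripts,
# and report how long the whole thing took. requests and BeautifulSoup are
# assumed; error handling and CSS @import following are omitted.
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def download_page_with_requisites(url):
    start = time.time()
    total_bytes = 0

    page = requests.get(url)
    total_bytes += len(page.content)
    soup = BeautifulSoup(page.text, "html.parser")

    # Collect the URLs of images, stylesheets and scripts referenced by the page.
    resources = set()
    for img in soup.find_all("img", src=True):
        resources.add(urljoin(url, img["src"]))
    for link in soup.find_all("link", rel="stylesheet", href=True):
        resources.add(urljoin(url, link["href"]))
    for script in soup.find_all("script", src=True):
        resources.add(urljoin(url, script["src"]))

    for resource_url in resources:
        total_bytes += len(requests.get(resource_url).content)

    elapsed = time.time() - start
    return elapsed, total_bytes, len(resources)

elapsed, total_bytes, count = download_page_with_requisites("http://www.google.com")
print(f"{count} resources, {total_bytes} bytes in {elapsed:.2f} s")
```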