I want to download all the files that are publicly accessible on this site:
https://www.duo.uio.no/
This is the site for the University of Oslo, where we can find every paper/thesis that is publicly available from the university's archives. I tried a crawler, but the website has some mechanism in place to stop crawlers from accessing the documents. Are there any other ways of doing this?
I did not mention this in the original question, but what I want is all the PDF files on the server. I tried SiteSucker, but that seems to just download the site itself.
wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=unix,ascii --domains your-site.com --no-parent http://your-site.com
Try it.
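If the recursive crawl is still blocked and you only want the PDFs, a small Python sketch along these lines might also work. It is only a sketch under assumptions: the listing URL is a placeholder, the browser-like User-Agent and the delay are guesses at what the site tolerates, and you should check the repository's robots.txt and terms before bulk downloading.

import os
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

LISTING_URL = 'https://www.duo.uio.no/some-listing-page'  # placeholder
HEADERS = {'User-Agent': 'Mozilla/5.0'}  # some sites reject the default client

html = requests.get(LISTING_URL, headers=HEADERS).text
soup = BeautifulSoup(html, 'html.parser')

for link in soup.find_all('a', href=True):
    href = link['href']
    if href.lower().endswith('.pdf'):
        pdf_url = urljoin(LISTING_URL, href)
        filename = os.path.basename(pdf_url)
        response = requests.get(pdf_url, headers=HEADERS)
        with open(filename, 'wb') as f:
            f.write(response.content)
        time.sleep(1)  # be polite between requests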
You could try using SiteSucker, which allows you to download the contents of a website while ignoring any rules they may have in place.
I have a dataset containing hundreds of numpy arrays.
I am trying to save them to an online drive so that I can run the code with this dataset remotely from a server. I cannot access the server's drive; I can only run code scripts and use the terminal. I have tried Google Drive and OneDrive, and looked up how to generate a direct download link from those drives, but it did not work.
In short, I need to be able to get those files from my Python scripts. Could anyone give me some hints?
You can get the download URLs very easily from Drive. I assume that you have already uploaded the files into a Drive folder. Then you can easily set up a way to download the files with Python. First you need a Python environment that can connect to Drive. If you don't currently have one, you can follow this guide. That guide will install the required libraries and credentials and run a sample script. Once you can run the sample script, you can make minor modifications to reach your goal.
To download the files you are going to need their IDs. I am assuming that you already know them, but if you don't, you can retrieve them by doing a Files.list on the folder where you keep the files. To do so, use '{ FOLDER ID }' in parents as the q parameter.
To download a file you only have to run a Files.get request with the file ID. You will find the download URL in the webContentLink property. Feel free to leave a comment if you need further clarification.
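A rough sketch of those two steps with the Drive API v3 and google-api-python-client is shown below. The credentials file, the scope, and FOLDER_ID are placeholders that correspond to whatever the quickstart guide set up for you.

from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

SCOPES = ['https://www.googleapis.com/auth/drive.readonly']

# Credentials file and auth flow as set up by the quickstart guide
flow = InstalledAppFlow.from_client_secrets_file('credentials.json', SCOPES)
creds = flow.run_local_server(port=0)
service = build('drive', 'v3', credentials=creds)

FOLDER_ID = 'your-folder-id'  # placeholder for your own folder's ID

# Step 1: Files.list on the folder to retrieve the file IDs
results = service.files().list(
    q="'{}' in parents".format(FOLDER_ID),
    fields='files(id, name)').execute()

# Step 2: Files.get on each file to read its webContentLink (download URL)
for item in results.get('files', []):
    meta = service.files().get(
        fileId=item['id'], fields='webContentLink').execute()
    print(item['name'], meta.get('webContentLink'))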
MyBinder and Colaboratory make it possible for people to run our examples directly in their browser, without any download required.
When I work on Binder, loading our data takes a huge amount of time. So I need to run the Python code on the website directly.
I'm not sure whether I fully understand the question. If you want to avoid having to download the data from another source, you can add the data to the git repo you use to start Binder. It should look something like this: https://github.com/lschmiddey/book_recommender_voila
However, if your dataset is too big to be uploaded to your git repo, you have to get the data onto the provided Binder server somehow. So you usually have to download the data onto your Binder server so that other users can work with your notebook.
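One common pattern for that is to fetch the data once at the top of the notebook and cache it on the Binder server. Here is a small sketch; the dataset URL is a placeholder for wherever the data is actually hosted.

import os
import urllib.request

import numpy as np

DATA_URL = 'https://example.com/my-dataset.npz'  # placeholder for your hosting
LOCAL_PATH = 'my-dataset.npz'

# Download the data onto the Binder server only on the first run
if not os.path.exists(LOCAL_PATH):
    urllib.request.urlretrieve(DATA_URL, LOCAL_PATH)

arrays = np.load(LOCAL_PATH)  # then work with the cached copy as usual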
I cannot download the complete HTML code of a Google Drive folder in order to find the ID needed for downloading public files from that folder. If I open the page in the Mozilla Firefox browser and save it, everything is in the HTML code. The link to the Google Drive folder is in the example code below. Everything is done as an unregistered Google user; these are public files and public folders.
The file, which I can find in the HTML saved by Mozilla Firefox but not in the HTML fetched by wget or Python, has the name:
piconwhite-220x132-freeSAT..........(insignificant remaining part of file name)
Here is an example of the Python code I use (the urllib2 module), where the file does not show up:
import urllib2

# Fetch the public Google Drive folder page and save the HTML for inspection
u_handle = urllib2.urlopen('https://drive.google.com/drive/folders/0Bwz6mBA7lUOKZi1nbGdlbzFDZ0U')
htmlPage = u_handle.read()
with open('/tmp/test.html', 'w') as f:
    f.write(htmlPage)
If I download the HTML page with a web browser, the file is about 500 kB and it contains the above-mentioned file name, from which I can work out the download code. If I download the page with wget or with the Python urllib2 module, the HTML is only about 213 kB and does not contain that file.
By the way, I tried several wget approaches (from the Linux command line), but the situation is always the same: the downloaded HTML only ever contains a limited maximum number of files from the folder, unfortunately not all of them.
Thank you for all the advice.
P.S.
I'm not a good web developer and I'm looking for a solution to the problem. I'm a developer in other languages and on other platforms.
So, I solved my own problem by downloading a different Google Drive page that gives a shortened form of the directory/file list. I use this new URL:
'https://drive.google.com/embeddedfolderview?id=0Bwz6mBA7lUOKZi1nbGdlbzFDZ0U#list'
Instead of the previous URL:
'https://drive.google.com/drive/folders/0Bwz6mBA7lUOKZi1nbGdlbzFDZ0U'
The source code of the "list" page is slightly different, but it contains all of the records (all of the directories and files in the Google Drive folder), so I can see every file and every directory on the required Google Drive page.
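For completeness, here is a small Python 3 sketch of that approach: fetch the embeddedfolderview "list" page instead of the regular folder page, save the HTML, and check that the previously missing file name is now present. The output path is arbitrary, and any further parsing depends on Google's undocumented markup, so treat this only as a starting point.

import urllib.request

FOLDER_ID = '0Bwz6mBA7lUOKZi1nbGdlbzFDZ0U'
LIST_URL = 'https://drive.google.com/embeddedfolderview?id=' + FOLDER_ID + '#list'

with urllib.request.urlopen(LIST_URL) as response:
    html_page = response.read().decode('utf-8', errors='replace')

# Save the HTML so the file names can be inspected or parsed further
with open('/tmp/folder_list.html', 'w', encoding='utf-8') as f:
    f.write(html_page)

# The file that was missing from the regular folder page should now be listed
print('piconwhite-220x132-freeSAT' in html_page)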
Thank you all for helping me or for reading my problem.
I am looking for a web server where I can upload and download files from Ubuntu to Windows and vice versa. I've built a web server with Python, I share a folder on Ubuntu, and I download the files in this folder on Windows. Now I want to check every millisecond whether there is a new file and download new files automatically. Is there any script or anything else that would help me?
Is a Python web server a good solution?
There are many ways to synchronise folders, even remote.
If you need to stick with the Python server approach for some reason, look for file system event libraries to trigger your upload code (for example watchdog; see the sketch after this answer).
But if not, it may be simpler to use tools like rsync + inotify, or simply lsync.
Good luck!
Edit: I just realized you want Linux -> Windows sync, not the other way around. Since you don't have an SSH server on the target (Windows), rsync and lsync will not work for you; you probably need smbclient. In Python, consider pysmbc or PySmbClient.
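Here is a minimal sketch of the watchdog idea mentioned above: react to new files as they appear instead of polling every millisecond. WATCH_DIR and upload_to_windows() are hypothetical placeholders for your setup; the actual transfer would go through smbclient/pysmbc as suggested in the edit.

import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

WATCH_DIR = '/srv/shared'  # hypothetical shared folder on the Ubuntu side

def upload_to_windows(path):
    # Placeholder: push the new file to the Windows machine,
    # e.g. via smbclient or pysmbc as suggested in the edit above.
    print('new file:', path)

class NewFileHandler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            upload_to_windows(event.src_path)

observer = Observer()
observer.schedule(NewFileHandler(), WATCH_DIR, recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()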
I am trying to download PDFs from my school server, but the way the IT department has set it up, we have to click each link one by one, and there are hundreds of PDF links on the same page.
How can I download files like "2015-0001.pdf", "2015-0002.pdf", "2015-0003.pdf" and so on using Python or wget?
I have tried wget --accept pdf,zip,7z,doc --recursive but it only grabs the index.html file of the site and no actual files.
Use Scrapy: http://scrapy.org/
An open source and collaborative framework for extracting the data you need from websites, in a fast, simple, yet extensible way.
The Scrapy tutorial shows how to get started with website scraping.
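As a starting point, a minimal Scrapy spider for this kind of page might look like the sketch below. The start URL is a placeholder for the index page with all the PDF links; run it with "scrapy runspider pdf_spider.py".

import scrapy

class PdfSpider(scrapy.Spider):
    name = 'pdf_spider'
    # Placeholder: the index page that lists all the PDF links
    start_urls = ['http://example.com/pdf-index.html']

    def parse(self, response):
        # Follow every link that points to a PDF
        for href in response.css('a::attr(href)').getall():
            if href.lower().endswith('.pdf'):
                yield response.follow(href, callback=self.save_pdf)

    def save_pdf(self, response):
        # Save each PDF under its original file name
        filename = response.url.split('/')[-1]
        with open(filename, 'wb') as f:
            f.write(response.body)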