How can I download files from a website using wildacrds in Python? I have a site that I need to download file from periodically. The problem is the filenames change each time. A portion of the file stays the same though. How can I use a wildcard to specify the unknown portion of the file in a URL?
If the filename changes, there must still be a link to the file somewhere (otherwise nobody would ever guess the filename). A typical approach is to get the HTML page that contains a link to the file, search through that looking for the link target, and then send a second request to get the actual file you're after.
Web servers do not generally implement such a "wildcard" facility as you describe, so you must use other techniques.
You could try logging into the ftp server using ftplib.
From the python docs:
from ftplib import FTP
ftp = FTP('ftp.cwi.nl') # connect to host, default port
ftp.login() # user anonymous, passwd anonymous#
The ftp object has a dir method that lists the contents of a directory.
You could use this listing to find the name of the file you want.
Related
I am working on a Python tool to synchronize files between my local machines and a remote server. When I upload a file to the server the modification time property of that file on the server is set to the time of the upload process and not to the mtime of the source file, which I want to preserve. I am using FTP.storbinary() from the Python ftplib to perform the upload. My question: Is there a simple way to preserve the mtime when uploading or to set it after the upload? Thanks.
Short answer: no. The Python ftplib module offers no option to transport the time of the file. Furthermore, the FTP protocol as defined by rfc-959 has no provision to directly get not set the mtime of a file. It may be possible on some servers through SITE commands, but this is server dependant.
If it is possible for you, you should be able to pass a site command with the sendcmd method of a connection object. For example if the server accepts a special SITE SETDATE filename iso-8601-date-string you could use:
resp = ftp.sendcmd(f'SITE SETDATE {file_name} {date-string}')
My application is keeping watch on a set of folders where users can upload files. When a file upload is finished I have to apply a treatment, but I don't know how to detect that a file has not finish to upload.
Any way to detect if a file is not released yet by the FTP server?
There's no generic solution to this problem.
Some FTP servers lock the file being uploaded, preventing you from accessing it, while the file is still being uploaded. For example IIS FTP server does that. Most other FTP servers do not. See my answer at Prevent file from being accessed as it's being uploaded.
There are some common workarounds to the problem (originally posted in SFTP file lock mechanism, but relevant for the FTP too):
You can have the client upload a "done" file once the upload finishes. Make your automated system wait for the "done" file to appear.
You can have a dedicated "upload" folder and have the client (atomically) move the uploaded file to a "done" folder. Make your automated system look to the "done" folder only.
Have a file naming convention for files being uploaded (".filepart") and have the client (atomically) rename the file after upload to its final name. Make your automated system ignore the ".filepart" files.
See (my) article Locking files while uploading / Upload to temporary file name for an example of implementing this approach.
Also, some FTP servers have this functionality built-in. For example ProFTPD with its HiddenStores directive.
A gross hack is to periodically check for file attributes (size and time) and consider the upload finished, if the attributes have not changed for some time interval.
You can also make use of the fact that some file formats have clear end-of-the-file marker (like XML or ZIP). So you know, that the file is incomplete.
Some FTP servers allow you to configure a hook to be called, when an upload is finished. You can make use of that. For example ProFTPD has a mod_exec module (see the ExecOnCommand directive).
I use ftputil to implement this work-around:
connect to ftp server
list all files of the directory
call stat() on each file
wait N seconds
For each file: call stat() again. If result is different, then skip this file, since it was modified during the last seconds.
If stat() result is not different, then download the file.
This whole ftp-fetching is old and obsolete technology. I hope that the customer will use a modern http API the next time :-)
If you are reading files of particular extensions, then use WINSCP for File Transfer. It will create a temporary file with extension .filepart and it will turn to the actual file extension once it fully transfer the file.
I hope, it will help someone.
This is a classic problem with FTP transfers. The only mostly reliable method I've found is to send a file, then send a second short "marker" file just to tell the recipient the transfer of the first is complete. You can use a file naming convention and just check for existence of the second file.
You might get fancy and make the content of the second file a checksum of the first file. Then you could verify the first file. (You don't have the problem with the second file because you just wait until file size = checksum size).
And of course this only works if you can get the sender to send a second file.
Scenario: Website x has a form in which a file can be uploaded from your local machine.
Question: Is it possible to pass in a remote url (e.g. http://blah.whatever/file.ext) instead of a local file, thus causing the remote server to download the file found at http://blah.whatever/file.ext instead of having to download the file to my local machine and then upload it?
I'm most familiar with Python. So, if this is possible through something like the requests module, that would be fantastic.
I want to deploy my scrapy project on a ip that is not listed in the scrapy.cfg file , because the ip can change and i want to automate the process of deploying. i tried giving the ip of the server directly in the deploy command but it did not work. any suggestion to do this?
First, you should consider assigning a domain to the server, so you can always get to it regardless of its dynamic IP. DynDNS comes handy at times.
Second, you probably won't do the first, because you haven't got access to the server, or for whatever other reason. In that case, I suggest mimicking above behavior by using your system's hosts file. As described at wikipedia article:
The hosts file is a computer file used by an operating system to map hostnames to IP addresses.
For example, lets say you set your url to remotemachine in your scrapy.cfg. You can write a script that would edit the hosts file with the latest IP address, and execute it before deploying your spider. This approach has a benefit of having a system-wide effect, so if you are deploying multiple spiders, or using the same server for some other purpose, you don't have to update multiple configuration files.
This script could look something like this:
import fileinput
import sys
def update_hosts(hostname, ip):
if 'linux' in sys.platform:
hosts_path = '/etc/hosts'
else:
hosts_path = 'c:\windows\system32\drivers\etc\hosts'
for line in fileinput.input(hosts_path, inplace=True):
if hostname in line:
print "{0}\t{1}".format(hostname, ip)
else:
print line.strip()
if __name__ == '__main__':
hostname = sys.argv[1]
ip = sys.argv[2]
update_hosts(hostname, ip)
print "Done!"
Ofcourse,you should do additional argument checks, etc., this is just a quick example.
You can then run it prior deploying like this:
python updatehosts.py remotemachine <remote_ip_here>
If you want to take it a step further and add this functionality as a simple argument to scrapyd-deploy, you can go ahead and edit your scrapyd-deploy file (its just a Python script) to add the additional parameter and update the hosts file from within. But I'm not sure this is the best thing to do, since leaving this implementation separate and more explicit would probably be a better choice.
This is not something you can solve on the scrapyd side.
According to the source code of scrapyd-deploy, it requires the url to be defined in the [deploy] section of the scrapy.cfg.
One of the possible workarounds could be having a placeholder in scrapy.cfg which you would replace with a real IP address of the target server, before starting scrapyd-deploy.
Suppose I have one server called server 1. On this server there's a directory called dir1. dir1 has 3 files in them called neh_iu.dat_1, neh_hj.dat_2, jen_ak.dat_1.
I need to get ONLY the 'neh' files from server 1 to another server called server 2. server 2 is where I will be performing certain modifications on these files.
How do I get ONLY the 'neh' files in Python? I'm new to python. I'm aware of a module called paramiko which allows for file transfers but assuming that there are millions of 'neh' files in dir1, and that I don't know the full names of all of them, how can I get an automated process for it in Python?
If you really need to use python instead of bash (assuming you're on unix).
>>> import subprocess
>>> subprocess.call("tar cvzf /path/to/ftp-or-static-http/foo.tgz /path/to/dir/neh*")
This will create a tar file with all the neh*s files. Easy to be transferred between servers (it's only one file instead of millions).
Use FTP, SFTP, HTTP or any transfer protocol supported by your server and perform a request from the other server (curl or ftp).