Python + Paramiko - Checking whether two files are identical without downloading

I have a script that downloads a lot of fairly large (20 MB+) files. I would like to be able to check whether the copy I have locally is identical to the remote version. I realize I can just use a combination of modification date and length, but is there something even more accurate (and also available via Paramiko) that I can use to ensure this? Ideally some sort of checksum?
I should add that the remote system is Windows and I have SFTP access only, no shell access.

I came across a similar scenario. The solution I currently use is to compare the remote file's size, obtained via item.st_size for item in sftp.listdir_attr(remote_dir), with the local file's size from os.path.getsize(local_file). When the files are around 1 MB or smaller, this works fine. However, a strange thing can happen with files around 10 MB or larger: the two sizes may differ slightly, e.g. one is 10000 bytes and the other 10003 bytes (possibly a sign of line-ending conversion somewhere in the transfer).
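For illustration, a minimal sketch of that size check; the host, credentials, and directory paths below are hypothetical and would need adjusting:

import os
import paramiko

# Hypothetical connection details and paths.
transport = paramiko.Transport(("sftp.example.com", 22))
transport.connect(username="user", password="secret")
sftp = paramiko.SFTPClient.from_transport(transport)

remote_dir = "/outgoing"
local_dir = r"C:\downloads"

# Map each remote file name to the size reported by the server.
remote_sizes = {item.filename: item.st_size for item in sftp.listdir_attr(remote_dir)}

for name, remote_size in remote_sizes.items():
    local_path = os.path.join(local_dir, name)
    if not os.path.isfile(local_path):
        print(name, "missing locally")
    elif os.path.getsize(local_path) != remote_size:
        print(name, "size differs: local", os.path.getsize(local_path), "remote", remote_size)

sftp.close()
transport.close()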

Related

naming and storing file information for comparison

I am currently working on a script that automatically syncs files from the Documents and Pictures directories with a USB stick that I use as a sort of "essentials backup". In practice, it should identify the filenames and some information about them (like the last time they were edited) in the directories that I choose to sync.
If a file exists in one directory but not in the other (i.e. it's on my computer but not on my USB drive), it should automatically copy that file to the USB drive as well. Likewise, if a file exists in both directories but has different modification times, it should replace the older one with the newer one.
However, I have some issues with storing that information for the purpose of comparing those files. I initially thought about a file class that stores all that information and through which I can compare objects with the same name.
Problem 1 with that approach is: if I create an object, how do I name it? Do I name it after the file? I would then have to strip the file extension like .txt or .py, because otherwise I'd run into trouble with my code. But I might have a notes.odt and a notes.jpg, which would be problem 2.
I am pretty new to Python, so my imagination is probably limited by my lack of knowledge. Any pointers on how I could make that work?
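For illustration only, a minimal sketch of one way to sidestep the naming problem: instead of a per-file class, key a plain dictionary by each file's path relative to the synced root, so extensions and duplicate base names stop mattering. The directory paths below are hypothetical.

import os
import shutil

def snapshot(root):
    # Map each file's path relative to `root` to its modification time.
    files = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            files[os.path.relpath(full, root)] = os.path.getmtime(full)
    return files

def sync(src, dst):
    src_files = snapshot(src)
    dst_files = snapshot(dst)
    for rel, mtime in src_files.items():
        target = os.path.join(dst, rel)
        # Copy if missing on the destination, or if the source copy is newer.
        if rel not in dst_files or mtime > dst_files[rel]:
            os.makedirs(os.path.dirname(target), exist_ok=True)
            shutil.copy2(os.path.join(src, rel), target)

# Hypothetical usage; run it in both directions for a two-way sync.
sync(os.path.expanduser("~/Documents"), "E:/backup/Documents")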

Debugging a Python script which first needs to read large files. Do I have to load them anew every time?

I have a Python script which starts by reading a few large files and then does something else. Since I want to run this script multiple times and change some of the code until I am happy with the result, it would be nice if the script did not have to read the files anew every time, because they will not change. So I mainly want to use this for debugging.
It happens too often that I run scripts with bugs in them, but I only see the error message after minutes, because the reading took so long.
Are there any tricks to do something like this?
(If it is feasible, I will create smaller test files.)
I'm not good at Python, but it seems that it can dynamically reload code from a changed module: How to re import an updated package while in Python Interpreter?
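For example, a minimal sketch of that reload idea: keep the slow loading in one module and reload only the module you are editing. Both module names below are hypothetical.

import importlib

import slow_data    # hypothetical module that reads the large files once on import
import analysis     # hypothetical module containing the code being debugged

data = slow_data.load()        # the expensive read, done once per interpreter session

# ... edit analysis.py, then, in the same interpreter session:
importlib.reload(analysis)
analysis.run(data)             # re-run the changed code against the already-loaded data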
Some other suggestions not directly related to Python.
Firstly, try to create a smaller test file. Is the whole file required to demonstrate the bug you are observing? Most probably only a small part of your input file is relevant.
Secondly, are these particular files required, or would the problem show up with any large amount of data? If it shows up only with particular files, then most probably it is related to some feature of those files and will also show up with a smaller file that has the same feature. If the main reason is just the large amount of data, you might be able to avoid reading it by generating some random data directly in the script.
Thirdly, what is the bottleneck when reading the file? Is it just hard-drive performance, or do you do some heavy processing of the read data before actually getting to the part that causes problems? In the latter case, you might be able to do that processing once, write the results to a new file, and then have your script load the processed data instead of redoing the processing every time (see the sketch below).
If the hard drive performance is the issue, consider a faster filesystem. On Linux, for example, you might be able to use /dev/shm.
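A rough sketch of the "process once, then load the processed result" idea, using pickle; parse_big_file and the cache file name are hypothetical placeholders:

import os
import pickle

CACHE = "parsed_input.pkl"   # hypothetical cache file

def load_data(path):
    # Reuse the cached result if it exists and is newer than the raw input file.
    if os.path.exists(CACHE) and os.path.getmtime(CACHE) >= os.path.getmtime(path):
        with open(CACHE, "rb") as f:
            return pickle.load(f)
    data = parse_big_file(path)   # hypothetical slow reading/processing step
    with open(CACHE, "wb") as f:
        pickle.dump(data, f)
    return data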

How can I reduce the access time on large Excel files?

I would like to process a large data set from a mechanical testing device with Python. The software of this device only allows exporting the data as an Excel file. Therefore, I use the xlrd package, which works fine for small *.xlsx files.
The problem I have is, that when I want to open a common data set (3-5 MB) by
xlrd.open_workbook(path_wb)
the access time is about 30 to 60 seconds. Is there any more effective and faster way to access Excel files?
You could access the file as a database via PyPyODBC instead, which may (or may not) be faster - you'd have to try it out and compare the results.
This method should work for both .xls and .xlsx files. Unfortunately, it comes with a couple of caveats:
As far as I am aware, this will only work on Windows machines, since you're relying on the Microsoft Jet database driver.
The Microsoft Jet database driver can be rather buggy, especially with dates.
It's not possible to create or modify Excel files (a note in the PyPyODBC exceltests.py file says: I have not been able to successfully create or modify Excel files.). Your question seems to indicate that you're only interested in reading files, though, so hopefully this will not be a problem.
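A rough sketch of that ODBC route, assuming the Microsoft Excel ODBC driver is installed; the driver string, workbook path, and sheet name below are placeholders you would need to adjust:

import pypyodbc

# Placeholder path and sheet name; the Driver= string must match a driver installed on the machine.
conn = pypyodbc.connect(
    "Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};"
    "DBQ=C:\\data\\test_results.xlsx;"
)
cursor = conn.cursor()
# Each worksheet is exposed as a table named "<sheet name>$".
cursor.execute("SELECT * FROM [Sheet1$]")
rows = cursor.fetchall()
conn.close()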
I just figured out that it wasn't actually a problem with the access time; I was also creating an object in the same step. Now, by creating the object separately, everything works fast and nicely.

Is there a standard way, across operating systems, of adding "tags" to files

I'm writing a script to make backups of various different files. What I'd like to do is store meta information about the backup. Currently I'm using the file name, so for example:
backups/cool_file_bkp_c20120119_104955_d20120102
Where c represents the file-creation datetime, and d represents the "data time", which describes what cool_file actually contains. The reason I currently use "data time" is that a later backup may be made of the same file, in which case I know I can safely replace the previous backup with the same "data time" without losing any information.
It seems like an awful way to do things, but it does seem to have the benefit of not being OS-dependent. Is there a better way?
FYI: I am using Python to script my backup creation, and currently need this to work on Windows XP, 2003, and Red Hat Linux.
EDIT: Solution:
From the answers below, I've inferred that file metadata is not widely supported in a standard way. Given that my goal was to tightly couple the metadata with the file, archiving the file alongside a metadata text file seems to be the way to go.
I'd take one of two approaches there:
Create a standalone file in the backup dir that contains the desired metadata - this could be something in human-readable form, just to make life easier, such as a JSON data structure or an "ini"-like file.
The other is to archive the copied files - possibly using "zip" - and bundle along with them a text file with the desired metadata.
The idea of creating zip archives to group files that belong together is used in several places, like Java .jar files, the Open Document Format (office files created by several office suites), Office Open XML (Microsoft-specific office files), and even Python's own eggs.
The zipfile module in Python's standard library has all the tools necessary to accomplish this - you can just bundle a dictionary's representation in a file alongside the original one to hold as much metadata as you need (see the sketch below).
In any of these approaches you will also need a helper script to let you see and filter the metadata on the files, of course.
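A minimal sketch of that zip-plus-metadata idea, using the standard library's zipfile and json modules; the file names and metadata fields are just examples:

import json
import zipfile

def backup_with_metadata(source_path, archive_path, metadata):
    # Bundle the backed-up file and a JSON description of it into one archive.
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(source_path, arcname="cool_file")
        zf.writestr("metadata.json", json.dumps(metadata, indent=2))

# Example usage with made-up metadata values.
backup_with_metadata(
    "cool_file",
    "backups/cool_file_bkp.zip",
    {"created": "2012-01-19T10:49:55", "data_time": "2012-01-02"},
)

Reading the metadata back (zipfile.ZipFile(archive_path).read("metadata.json")) is what the helper script mentioned above would use to list and filter the backups.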
Different file systems (not different operating systems) have different capabilities for storing metadata. NTFS has plenty of possibilities, while FAT is very limited, and ext* are somewhere in between. None of the widespread (subjective term, yes) filesystems supports custom tags that you could use. Consequently, there is no standard way to work with such tags.
On Windows there was an attempt to introduce Extended Attributes, but these were implemented in such a tricky way that were almost unusable.
So putting whatever you can into the filename remains the only working approach. Remember that filesystems have limitations on file name and file path length, and with this approach you can exceed the limit, so be careful.

Limitations of TEMP directory in Windows?

I have an application written in Python that's writing large amounts of data to the %TEMP% folder. Oddly, every once in a while, it dies, returning IOError: [Errno 28] No space left on device. The drive has plenty of free space, %TEMP% is not its own partition, I'm an administrator, and the system has no quotas.
Does Windows artificially put some types of limits on the data in %TEMP%? If not, any ideas on what could be causing this issue?
EDIT: Following discussions below, I clarified the question to better explain what's going on.
What is the exact error you encounter?
Are you creating too many temp files?
The GetTempFileName method will raise an IOException if it is used to create more than 65535 files without deleting previous temporary files.
The GetTempFileName method will raise an IOException if no unique temporary file name is available. To resolve this error, delete all unneeded temporary files.
One thing to note is that if you're (indirectly) using the Win32 API just to get temp file names, calling it:
Creates a uniquely named, zero-byte temporary file on disk and returns the full path of that file.
If you're using that path but also changing the returned name, be aware that you might actually be creating a 0-byte file plus an additional file on top of it (e.g. My_App_tmpXXXX.tmp and tmpXXXX.tmp).
As Nestor suggested below, consider deleting your temp files after you're done using them.
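On the Python side, the tempfile module can handle that cleanup automatically; a small sketch, with an arbitrary prefix and payload:

import tempfile

# The file is created in %TEMP% and removed automatically when the block exits,
# so temporary files cannot pile up. (Note: on Windows, a NamedTemporaryFile opened
# with delete=True cannot be opened a second time by name while it is still open.)
with tempfile.NamedTemporaryFile(prefix="My_App_", suffix=".tmp", delete=True) as tmp:
    tmp.write(b"large amounts of data...")
    tmp.flush()
    print("working with", tmp.name)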
Using a FAT32 filesystem I can imagine this happening when:
Writing a lot of data to one file, and you reach the 4GB file size cap.
Or when you are creating a lot of small files and reaching the 2^16-2 files per directory cap.
Apart from this, I don't know of any limitations the system can impose on the temp folder, other than the physical partition actually being full.
Another limitation, as Mike Atlas suggested, is the GetTempFileName() function, which creates files of the form tmpXXXX.tmp. Although you might not be using it directly, verify that the %TEMP% folder does not contain too many of them (the limit is 2^16).
And maybe the obvious, have you tried emptying the %TEMP% folder before running the utility?
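A quick way to check that from Python, counting leftover *.tmp files in the temp directory:

import glob
import os
import tempfile

temp_dir = tempfile.gettempdir()   # honours %TEMP% on Windows
tmp_files = glob.glob(os.path.join(temp_dir, "*.tmp"))
print(len(tmp_files), ".tmp files in", temp_dir)
if len(tmp_files) > 60000:
    print("Getting close to the 65535 GetTempFileName limit - consider cleaning up.")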
There shouldn't be such a space limitation in Temp. If you wrote the app, I would recommend creating your files in ProgramData...
There should be no trouble whatsoever with regard to your %TEMP% directory.
What is the disk quota set to for the volume hosting %TEMP%? Depending in part on what the apps themselves are doing, one of them may be throwing an error because the disk quota has been reached, which is a pain if that quota is set unreasonably low. If the quota is very low, try raising it, which you can do as Administrator.
