Python/Django: how to get files fastest (based on path and name)

My website users can upload image files, which then need to be found whenever they are to be displayed on a page (using src = ""). Currently, I put all images into one directory. What if there are many files - is it slow to find the right file? Are they indexed? Should I create subdirectories instead?
I use Python/Django. Everything is on webfaction.

The access time for an individual file is not affected by the quantity of files in the same directory.
Running ls -l on a directory with more files in it will of course take longer, as will viewing that directory in a file browser. It might be easier to work with these images if you store them in a subdirectory named after the user, but that just depends on what you are going to do with them. There is no technical reason to do so.
Think about it like this. The full path to the image file (/srv/site/images/my_pony.jpg) is the actual address of the file. Your web server process looks there, and returns any data it finds or a 404 if there is nothing. What it doesn't do is list all the files in /srv/site/images and look through that list to see if it contains an item called my_pony.jpg.
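If you do decide to group uploads into per-user subdirectories, a minimal Django sketch could use an upload_to callable; the model and field names here are illustrative, not taken from the question:

import os
from django.db import models

def user_directory_path(instance, filename):
    # Files land under MEDIA_ROOT/<username>/<filename>.
    return os.path.join(instance.owner.username, filename)

class UserImage(models.Model):
    owner = models.ForeignKey('auth.User', on_delete=models.CASCADE)
    image = models.ImageField(upload_to=user_directory_path)

The URL you put into src="" is then just settings.MEDIA_URL plus the stored relative path; the web server still resolves it with a single path lookup.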

If only for organizational purposes and to help with system maintenance, you should create subdirectories. Beyond that, there is very little chance you'll run into the maximum number of files a directory can hold.
The performance implication for the web is negligible. For other uses, though (file listing, FTP, backup, etc.), there may be consequences, but only once you reach a very large number of files.

Related

Performance of walking directory structure vs reading file with contents of ls (or similar command) for performing search

Is it better to walk a directory structure when performing multiple searches or is it a good idea to catalog the directory structure (in a file or memory) and then operate on that catalog? Or are there other methods which are better suited which I haven't hit upon?
I have a 3.5TB external HDD with thousands of files.
I have a set of files which list the contents of a directory. These listing files hold a folder name, filenames and file sizes.
I want to search the external HDD for the files in these listing files. If a file is found I then want to check and see if the file size of the actual file matches that in the listing file.
This process will cover about 1000 listing files and probably tens of thousands of actual files.
A listing file would have contents like
folder: SummerPhotos
name: IMG0096.jpg, length: 6589
name: IMG0097.jpg, length: 6489
name: IMG0098.jpg, length: 6500
name: IMG0099.jpg, length: 6589
name: BeachPhotos/IMG0100.jpg, length: 34892
name: BeachPhotos/IMG0101.jpg, length: 34896
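For reference, a minimal sketch of a parser for this listing format (parse_listing and the regular expression are my own; they assume the folder:/name:/length: layout shown above):

import re

LINE_RE = re.compile(r'name:\s*(?P<name>.+?),\s*length[:,]\s*(?P<size>\d+)')

def parse_listing(path):
    # Returns (base folder name, {relative file name: expected size in bytes}).
    folder = None
    files = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith('folder:'):
                folder = line.split(':', 1)[1].strip()
            else:
                match = LINE_RE.match(line)
                if match:
                    files[match.group('name')] = int(match.group('size'))
    return folder, files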
I like the offline processing of the listing files against a file that lists the contents of the external HDD, because then I can perform this operation on a faster computer (the hard drive is on an old computer acting as a server) or split the listing files over several computers and divide up the work. Plus I think that continually walking the directory structure is about as inefficient as you can get and puts unnecessary wear on the hardware.
Walk approach, as a rough sketch (dosomething, somethingelse and not_found stand for whatever handling is needed):

import os

for listing_path in listing_files:  # paths to the ~1000 listing files
    # parse_listing() is the listing parser sketched above.
    base_foldername, filelist = parse_listing(listing_path)
    for root, dirs, files in os.walk('/path/to/3.5TBdrive'):
        if os.path.basename(root) != base_foldername:
            continue
        for name, expected_size in filelist.items():
            # os.path.join handles nested entries like BeachPhotos/IMG0100.jpg.
            full_path = os.path.join(root, name)
            if not os.path.isfile(full_path):
                not_found(full_path)
            elif os.path.getsize(full_path) == expected_size:
                dosomething(full_path)
            else:
                somethingelse(full_path)
For the catalog-file method I'm thinking of dumping a recursive 'ls' to a file and then pretty much doing a string search on that file. I'll extract the file size and perform the match there.
My 'ls -RlQ' dump file is 11MB in size with ~150k lines. If there is a better way to get the required data I'm open to suggestions. I'm thinking of using os.walk() to compile a list and create my own file in a format I like, versus trying to parse my ls output.
I feel like I should be doing something to make my college professors proud and building a hash table or balanced tree, but I feel like the effort to implement that will take longer than simply brute-forcing the solution with CPU cycles.
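A minimal sketch of that os.walk()-based catalog idea (the file name and the one-"size<TAB>path"-line-per-file format are my own choices):

import os

def build_catalog(top, catalog_path):
    # One line per file: "<size in bytes>\t<path relative to top>".
    with open(catalog_path, 'w') as out:
        for root, dirs, files in os.walk(top):
            for name in files:
                full = os.path.join(root, name)
                out.write('%d\t%s\n' % (os.path.getsize(full), os.path.relpath(full, top)))

def load_catalog(catalog_path):
    # Relative path -> size, so each check is a dict lookup instead of a disk walk.
    sizes = {}
    with open(catalog_path) as f:
        for line in f:
            size, path = line.rstrip('\n').split('\t', 1)
            sizes[path] = int(size)
    return sizes

build_catalog() is the only step that has to run on the old server; the resulting text file can be copied to a faster machine and checked against the listing files there.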
OS: Linux
preferred programming language: Python 2/3
Thanks!
Is it better to walk a directory structure when performing multiple searches or is it a good idea to catalog the directory structure (in a file or memory) and then operate on that catalog?
If you just want to check whether a file exists, or the directory structure is not too complex, I suggest you just use your filesystem. You're basically duplicating work it already does anyway, and that will lead to problems in the future, as complexity always does.
I don't see any point in using hash tables or balanced trees as in-program data structures; this is also what your filesystem already does. What you should do instead to speed up lookups is design a deep directory structure rather than a few single directories containing thousands of files. Some filesystems choke while trying to list directories with tens of thousands of files, so it is a better idea to limit yourself to a few thousand and create a new level of directory depth should you exceed it.
For example, if you keep logs of your internet-wide scanning research with a single file for each host you scanned, you don't want one scanning-logs directory with files such as 1.1.1.1.xml, 1.1.1.2.xml and so on. Instead, a naming scheme such as scanning-logs/1/1/1.1.1.1.xml is a better idea.
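A small sketch of such a scheme (the helper name and the two-octet split are just illustrative):

import os

def nested_log_path(base_dir, ip):
    # Use the first two octets as directory levels, e.g. scanning-logs/1/1/1.1.1.1.xml.
    first, second = ip.split('.')[:2]
    return os.path.join(base_dir, first, second, ip + '.xml')

path = nested_log_path('scanning-logs', '1.1.1.1')
os.makedirs(os.path.dirname(path), exist_ok=True)  # exist_ok needs Python 3.2+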
Also, watch out for the inode limit! I was once building a large file-based database on an EXT4 filesystem and one day started getting errors like "no space left on device" even though I clearly had plenty of space left. The real reason was that I had created too many inodes; the limit can be set manually when creating a volume.

How can I find if the contents in a Google Drive folder have changed

I am currently working on an app that syncs one specific folder in a user's Google Drive. I need to find out when any of the files/folders in that specific folder have changed. The actual syncing process is easy, but I don't want to do a full sync every few seconds.
I am considering one of these methods:
1) Monitor the changes feed and look for any file changes.
This method is easy, but it will cause a sync if ANY file in the drive changes.
2) Frequently request all files in the whole drive, e.g. service.files().list().execute(), and look for changes within the specific tree. This is a brute-force approach and will be too slow if the user has thousands of files in their drive.
3) Start at the specific folder, and move down the folder tree looking for changes.
This method will be fast if there are only a few directories in the specific tree, but it will still lead to numerous API requests.
Are there any better ways to find whether a specific folder and its contents have changed?
Are there any optimisations I could apply to methods 1, 2 or 3?
As you have correctly stated, you will need to keep (or work out) the file hierarchy for a changed file to know whether a file has changed within a folder tree.
There is no way of knowing directly from the changes feed whether a deeply nested file within a folder has been changed. Sorry.
There are a couple of tricks that might help.
Firstly, if your app is using drive.file scope, then it will only see its own files. Depending on your specific situation, this may equate to your folder hierarchy.
Secondly, files can have multiple parents. So when creating a file in folder-top/folder-1/folder-1a/folder-1ai, you could declare both folder-1ai and folder-top as parents. Then you simply need to check for folder-top.
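For example, a minimal sketch using the Drive v2 Python client (service, media and the folder ID constants are placeholders):

file_metadata = {
    'title': 'photo.jpg',
    # Declare both the immediate folder and the top folder as parents, so a
    # later check only has to look for FOLDER_TOP_ID among the file's parents.
    'parents': [{'id': FOLDER_1AI_ID}, {'id': FOLDER_TOP_ID}],
}
created = service.files().insert(body=file_metadata, media_body=media).execute()

# When a change comes in, fetch the file's parents and test for the top folder.
parents = service.files().get(fileId=created['id'], fields='parents').execute()['parents']
in_synced_tree = any(p['id'] == FOLDER_TOP_ID for p in parents)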

How do I upload a file to a subfolder even if it doesn't exist

I have a Django app that needs to create a file in google drive: in FolderB/Sub1/Sub2/file.pdf. I have the id for FolderB but I don't know if Sub1 or Sub2 even exist. If not it should be created and the file.pdf should be put in it.
I figure I can look at the children at each level and create the folder at each level if it's not there, but this seems like a lot of checks and API calls just to create one file. It's also harder to accommodate multiple folder structures (i.e. one Python function that can accept a path of any depth and upload a file there).
The solution you have presented is the correct one. As you have realized, the Drive file system is not exactly like a hierarchical file system, so you will have to perform these checks.
One optimization you could perform is to look for the grandchild folder (Sub2) first; if it already exists, you save a number of calls.
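A rough sketch of the check-and-create walk with the Drive v2 Python client (the function name and ID constants are mine; service is an authorized Drive service object):

FOLDER_MIME = 'application/vnd.google-apps.folder'

def find_or_create_folder(service, parent_id, title):
    # Look for an existing, non-trashed folder with this title under parent_id.
    query = ("'%s' in parents and title = '%s' and mimeType = '%s' "
             "and trashed = false" % (parent_id, title, FOLDER_MIME))
    items = service.files().list(q=query, maxResults=1).execute().get('items', [])
    if items:
        return items[0]['id']
    body = {'title': title, 'mimeType': FOLDER_MIME, 'parents': [{'id': parent_id}]}
    return service.files().insert(body=body).execute()['id']

# Walk down FolderB/Sub1/Sub2, creating whatever is missing, then upload into Sub2.
parent_id = FOLDER_B_ID
for name in ['Sub1', 'Sub2']:
    parent_id = find_or_create_folder(service, parent_id, name)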

How can I efficiently select 100 random JPG files from a directory (including subdirs) in Python?

I have a very large directory of files and folders. Currently, I scan the entire directory for JPGs and store them in a list. This is really slow due to the size of the directory. Is there a faster, more efficient way to do this? Perhaps without scanning everything?
My directory looks like this:
/library/Modified/2000/[FolderName]/Images.JPG
/library/Modified/2001/[FolderName]/Images.JPG
/library/Modified/2002/[FolderName]/Images.JPG
/library/Modified/2003/[FolderName]/Images.JPG
/library/Modified/2004/[FolderName]/Images.JPG
...
/library/Modified/2012/FolderName/Images.JPG
Thanks
See Generator Tricks for System Programmers for a bunch of neat stuff. But specifically, see the gen-find example. This is as efficient as you are going to get, without making a bunch of assumptions about your file structure layout.
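Roughly, the gen-find recipe from that presentation, combined here with random.sample for the 100 picks (the library path is taken from the question):

import fnmatch
import os
import random

def gen_find(filepat, top):
    # Lazily yield every path under `top` whose file name matches the pattern.
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)

all_jpgs = list(gen_find('*.JPG', '/library/Modified'))
picks = random.sample(all_jpgs, min(100, len(all_jpgs)))

Note that fnmatch is case-sensitive on Linux, so '*.JPG' matches only upper-case extensions as shown in the question's layout.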
Assuming that your application is the only one changing the directory, that you have control over the directory names/structure, and that you have to do the operation described in your question more than once:
Rename all the files once so you can access them in a predictable order. Say, give every file a numeric name from 1 to N (where N is the number of files in the directory) and keep a special file ".count" that holds N for each directory. Then access files directly, with names produced by a random number generator.
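A rough sketch of that scheme (it assumes the ".count" file and the 1..N names described above are already in place):

import os
import random

def pick_random_files(directory, k=100):
    # ".count" holds N, the number of files named 1..N in this directory.
    with open(os.path.join(directory, '.count')) as f:
        n = int(f.read().strip())
    return [os.path.join(directory, str(i))
            for i in random.sample(range(1, n + 1), min(k, n))]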
I don't know where the slowness occurs, but to scan directories and files I found it much faster to dump the directories/files into a text file first using a batch file and then have Python read that file. This worked well on our server system with 7 servers and many thousands of directories.
Python could, of course, run the batch file.
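On a Unix-like system the same idea might look like this (using find rather than a batch file; the library path is from the question):

import random
import subprocess

# Dump every .JPG path under the library into a text file once...
with open('jpg_list.txt', 'w') as out:
    subprocess.check_call(['find', '/library/Modified', '-iname', '*.jpg'], stdout=out)

# ...then later runs just read the list and sample from it.
with open('jpg_list.txt') as f:
    paths = [line.strip() for line in f if line.strip()]
picks = random.sample(paths, min(100, len(paths)))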

Get original path from django filefield

My Django app accepts two files (in this case a JAD and JAR combo). Is there a way I can preserve the folders they came from?
I need this so I can check later that they came from the same path.
(And later on accept a whole load of files and be able to work out which came from the same folder).
I think that is not possible; most browsers (at least Firefox 3.0) do not allow the full path to be seen, so you cannot get the full path even from the JavaScript side.
If you could get the full path you could send it to the server, but I think you will have to be satisfied with the file name.
