I am screenshotting a bunch of web pages, using Python with Selenium. I want to save the PNGs locally for reference. The list of URLs looks something like this:
www.mysite.com/dir1/pageA
www.mysite.com/dir1/pageB
My question is about what filenames to give the screenshotted PNGs.
If I call the image files e.g. www.mysite.com/dir1/pageA.png, the now-meaningless slashes will inevitably cause problems at some point.
I could replace all the / characters in the URL with _, but I suspect that might cause problems too, e.g. if there are already _ characters in the URL. (I don't strictly need to be able to work backwards from the filename to the URL, but it wouldn't be a bad thing.)
What's a sensible way to handle the naming?
The easiest way to represent what's almost certainly a directory structure on the server is to do what wget does and replicate that structure on your local machine.
Thus the / characters become directory delimiters, and your www.mysite.com/dir1/pageA.png becomes a PNG file called pageA.png in a directory called dir1, which in turn sits in a directory called www.mysite.com.
It's simple, guaranteed to be reversible, and doesn't risk ambiguous results.
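If it helps, here is a minimal sketch of that layout in Python. The base directory name and the scheme handling are my own assumptions; save_screenshot is Selenium's standard WebDriver method.
import os
from urllib.parse import urlparse

def local_path_for(url, base_dir="screenshots"):
    # urlparse needs a scheme marker ("//") to recognise the host part
    parsed = urlparse(url if "//" in url else "//" + url, scheme="http")
    relative = parsed.netloc + parsed.path + ".png"
    local = os.path.join(base_dir, *relative.split("/"))
    os.makedirs(os.path.dirname(local), exist_ok=True)
    return local

# local_path_for("www.mysite.com/dir1/pageA")
# -> screenshots/www.mysite.com/dir1/pageA.png
# driver.save_screenshot(local_path_for(url))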
What if you use '%2F'? It's '/', but URL-encoded.
Source: http://www.w3schools.com/tags/ref_urlencode.asp
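For what it's worth, a minimal Python sketch of that idea (the example URL is from the question; quote/unquote do the percent-encoding):
from urllib.parse import quote, unquote

# percent-encode every reserved character, including "/", so the whole URL
# becomes one flat, reversible filename
filename = quote("www.mysite.com/dir1/pageA", safe="") + ".png"
# -> "www.mysite.com%2Fdir1%2FpageA.png"
original = unquote(filename[:-len(".png")])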
I have a peculiar case where Python's ZipFile.extractall fails. I have a few zip files that seem to have been created in an uncommon way (I did not create them; I just need to open them). For example, let's discuss a zip file that contains the following files:
a/f1.txt
a/f2.txt
b/f3.txt
However, the zip file contains the following entries (as seen via ZipFile.filelist):
a/f1.txt, a/f2.txt, a, b/f3.txt, b
When trying to use extractall, I get an error, saying that the file a cannot be created, because it is already a directory (makes sense, as a/f1.txt was handled before).
Looking further, the external_attr of the proper files is 0x20. The external_attr of the extra files is 0x10. They are also zero-length.
Windows' built-in unzip handles these archives properly, as does 7-Zip, but Python's extractall fails.
Is this a bug in Python's ZipFile implementation? Or are these badly encoded zip files that Windows and 7zip just happen to understand? What is going on here?
Note: making the Python code work is simple: just pass only the entries whose external_attr is not 0x10 to extractall. But I still want to know what's going on.
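For reference, a minimal sketch of that workaround (the archive and output names are placeholders):
import zipfile

with zipfile.ZipFile("archive.zip") as zf:
    # skip entries whose MS-DOS directory bit (0x10) is set in external_attr
    members = [info for info in zf.infolist() if not info.external_attr & 0x10]
    zf.extractall("output_dir", members=members)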
So I'm working on a script which will go through a bunch of log files looking for strings and server names.
In my testing I was using glob() to create a list of files to trawl through.
However, to improve my testing I have copied a log directory from a live system (11 GB!), and things aren't as smooth as they were before. It looks like glob() returns the sub-directories along with the files, and readlines() struggles when it hits them.
I don't care about files in the sub-directories; I just want to scan the files in the top-level directory.
I think I can use os.walk() to achieve this, with something like:
logs = next(os.walk('/var/opt/server/log/current'))[2]
As opposed to:
logs = glob('/var/opt/server/log/current/*')
Because I'm learning Python, I want to make sure I learn things the correct way. So: am I correct in what I'm saying above, or should I use glob() in a slightly different way to achieve this goal?
Use glob and filter out all the dirs:
logs = [log for log in glob('/var/opt/server/log/current/*') if not os.path.isdir(log)]
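If you'd rather stick with the os.walk() version from your question, note that its third element holds bare filenames, so join them back onto the directory first. A small sketch, using the path from your question:
import os

top = '/var/opt/server/log/current'
dirpath, dirnames, filenames = next(os.walk(top))
# filenames are bare names, so rebuild the full paths before opening them
logs = [os.path.join(dirpath, name) for name in filenames]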
My website users can upload image files, which then need to be found whenever they are to be displayed on a page (using src = ""). Currently, I put all images into one directory. What if there are many files - is it slow to find the right file? Are they indexed? Should I create subdirectories instead?
I use Python/Django. Everything is on webfaction.
The access time for an individual file is not affected by the number of files in the same directory.
Running ls -l on a directory with more files in it will of course take longer, as will viewing that directory in a file browser. It might be easier to work with these images if you store them in a subdirectory named after the user, but that depends on what you are going to do with them; there is no technical reason to do so.
Think about it like this. The full path to the image file (/srv/site/images/my_pony.jpg) is the actual address of the file. Your web server process looks there, and returns any data it finds or a 404 if there is nothing. What it doesn't do is list all the files in /srv/site/images and look through that list to see if it contains an item called my_pony.jpg.
If only for organizational purposes and to help with system maintenance, you should create subdirectories. Beyond that, there is very little chance you'll run into the maximum number of files a directory can hold.
The performance implication for the web is negligible. Other operations (file listing, FTP, backup, etc.) may be affected, but only once you reach a very large number of files.
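If you do decide on per-user subdirectories for organizational reasons, here is a minimal Django sketch; the directory and field names are my own placeholders, not anything from your setup:
def user_directory_path(instance, filename):
    # e.g. MEDIA_ROOT/user_uploads/<username>/<filename>
    return f"user_uploads/{instance.owner.username}/{filename}"

# in the model:
# image = models.ImageField(upload_to=user_directory_path)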
I have a very large directory of files and folders. Currently, I scan the entire directory for JPGs and store them in a list. This is really slow due to the size of the directory. Is there a faster, more efficient way to do this? Perhaps without scanning everything?
My directory looks like this:
/library/Modified/2000/[FolderName]/Images.JPG
/library/Modified/2001/[FolderName]/Images.JPG
/library/Modified/2002/[FolderName]/Images.JPG
/library/Modified/2003/[FolderName]/Images.JPG
/library/Modified/2004/[FolderName]/Images.JPG
...
/library/Modified/2012/FolderName/Images.JPG
Thanks
See Generator Tricks for System Programmers for a bunch of neat stuff, and specifically the gen-find example. This is about as efficient as you are going to get without making a bunch of assumptions about your file structure layout.
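A minimal sketch in that spirit, assuming the /library/Modified layout from your question: walk the tree lazily and yield matches one at a time instead of building the whole list up front.
import os
import fnmatch

def gen_find(pattern, top):
    # lazily walk the tree and yield matching paths as they are found
    for dirpath, dirnames, filenames in os.walk(top):
        for name in fnmatch.filter(filenames, pattern):
            yield os.path.join(dirpath, name)

for jpg in gen_find("*.JPG", "/library/Modified"):
    print(jpg)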
Assuming that your application is the only one changing the directory, that you have control over the directory names/structure, and that you have to do the operation described in your question more than once:
Rename all the files once so you can access them in a predictable order. Say, give every file a numeric name from 1 to N (where N is the number of files in the directory) and keep a special file ".count" in each directory that holds its N. Then you can access files directly, generating their names with a random number generator.
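A rough sketch of the lookup side of that scheme, assuming the one-off rename has already been done (the ".count" name comes from the suggestion above; everything else is a placeholder):
import os
import random

def random_file(directory):
    # ".count" holds N, the number of numerically named files in the directory
    with open(os.path.join(directory, ".count")) as fh:
        count = int(fh.read().strip())
    return os.path.join(directory, str(random.randint(1, count)))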
I don't know where the slowness occurs, but to scan directories and files I found it much faster to dump the directory/file listing into a text file first using a batch file and then have Python read that file. This worked well on our system with 7 servers and many thousands of directories.
Python could, of course, run the batch file.
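A minimal sketch of that approach, assuming a Windows host; the listing file name and the scanned path are placeholders:
import subprocess

# let the shell write a recursive listing to a text file, then read it back
subprocess.run(r'dir /s /b "C:\library\Modified" > listing.txt', shell=True, check=True)
with open("listing.txt") as fh:
    jpgs = [line.strip() for line in fh if line.strip().lower().endswith(".jpg")]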
My django app accepts two files (in this case a jad and jar combo). Is there a way I can preserve the folders they came from?
I need this so I can check later that they came from the same path.
(And later on accept a whole load of files and be able to work out which came from the same folder).
I think that is not possible; most browsers (Firefox 3.0 at least) do not expose the full path, so even from the JavaScript side you cannot get it.
If you could get the full path you could send it to the server, but I think you will have to be satisfied with the file name.
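To illustrate, a minimal Django sketch (the field names "jad" and "jar" are assumed from the question's jad/jar combo); only the base file name ever reaches the server:
from django.http import HttpResponse

def handle_upload(request):
    jad = request.FILES["jad"]
    jar = request.FILES["jar"]
    # UploadedFile.name is just the file name (e.g. "app.jad"); the client's
    # folder is never sent by the browser
    return HttpResponse("%s, %s" % (jad.name, jar.name))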