glob() to exclude sub-directories - python

So I'm working on a script which will go through a bunch of log files looking for strings and server names.
In my testing I was using glob() to create a list of files to trawl through.
However, to improve my testing I have copied a log directory from a live system (11 GB!), and things aren't as smooth as they were before. It looks like glob returns the sub-directories alongside the files, and as such readlines() is struggling to read them.
I don't care about files in the sub-directories, I just want to scan through the files in the native directory.
I think I can use os.walk() to achieve this, with something like:
logs = next(os.walk('/var/opt/server/log/current'))[2]
As opposed to:
logs = glob('/var/opt/server/log/current/*')
Because I'm learning Python, I want to make sure I learn things the correct way. So am I correct in what I'm saying above? Or should I use glob() in a slightly different way to achieve this goal?

Use glob and filter out all the dirs:
logs = [log for log in glob('/var/opt/server/log/current/*') if not os.path.isdir(log)]
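A minimal sketch comparing the glob filter with the os.walk() idea from the question. The function names are hypothetical, and os.path.isfile is swapped in for `not os.path.isdir` as a slightly stricter test. Two differences are worth knowing: glob('*') skips names starting with a dot, while os.walk() does not, and os.walk() returns bare file names rather than full paths.

```python
import os
from glob import glob


def top_level_files(directory):
    """Return full paths of regular files directly inside `directory`."""
    pattern = os.path.join(directory, '*')
    # isfile() keeps only regular files (or symlinks to them), so
    # sub-directories are dropped. Note that '*' does not match dotfiles.
    return sorted(p for p in glob(pattern) if os.path.isfile(p))


def top_level_files_walk(directory):
    """Same idea via os.walk(): use only the first (top-level) result."""
    root, _dirs, names = next(os.walk(directory))
    # os.walk() yields bare names, so join them back onto the directory.
    return sorted(os.path.join(root, name) for name in names)
```

Either function can then be called as top_level_files('/var/opt/server/log/current'), and each returned path can be opened directly.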

Related

Python: iNotify_Simple getting files from other directories

I'm using inotify_simple to get notifications from a directory of directories. I'm accessing a directory that has multiple sub-directories, looping through those sub-directories and wanting to use inotify within each directory. My goal for this program is to be notified anytime anything happens (in particular, a creation of a file) in one of the sub-directories.
My file structure:
-mainDir
    -subDir1
        -file1
        -file2
    -subDir2
        -file3
        -file4
...etc.
I'm looping through the directories in mainDir, and setting that path as the path for inotify to search in:
for directory in os.listdir(self.path):
    new_path = os.path.join(self.path, directory)
    new_curr = self.inotify.new_curr_file_notification(new_path)
New path values are exactly what I expect:
.../mainDir/subDir1
.../mainDir/subDir2
When I pass new_path into my function (this is the path given to inotify), I expect inotify to look only in that directory. However, I'm getting notifications triggered by files in other directories.
path for inotify .../mainDir/subDir1
Event(wd=1, mask=256, cookie=0, name='someFileInSubDir2')
flags.CREATE
Does anyone know why this is happening? And, if anyone has any suggestions to make this process easier/better, I'm all ears! Thanks!
I'm the author of inotify_simple, and since it doesn't have a method called new_curr_file_notification, I'm guessing that's something you wrote. Without seeing that method, or some more code that demonstrates exactly how you're using the library, I unfortunately can't give you any advice.
If you post a complete example, I will probably be able to tell what's going wrong.
Feel free to post a bug report on the project's GitHub if it looks like there might be a bug.

Sensible way to create filenames for files based on URLs?

I am screenshotting a bunch of web pages, using Python with Selenium. I want to save the PNGs locally for reference. The list of URLs looks something like this:
www.mysite.com/dir1/pageA
www.mysite.com/dir1/pageB
My question is about what filenames to give the screenshotted PNGs.
If I call the image files e.g. www.mysite.com/dir1/pageA.png the meaningless slashes will inevitably cause problems at some point.
I could replace all the / characters in the URL with _, but I suspect that might cause problems too, e.g. if there are already _ characters in the URL. (I don't strictly need to be able to work backwards from the filename to the URL, but it wouldn't be a bad thing.)
What's a sensible way to handle the naming?
The easiest way to represent what's almost certainly a directory structure on the server is to do like wget does and replicate that structure on your local machine.
Thus the / characters become directory delimiters, and your www.mysite.com/dir1/pageA.png would become a PNG file called pageA.png in a directory called dir1, and dir1 is located in a directory called www.mysite.com.
It's simple, guaranteed to be reversible, and doesn't risk ambiguous results.
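A minimal sketch of that mapping, assuming URLs shaped like the ones in the question (no query strings or fragments). The function name and the `screenshots` root directory are hypothetical.

```python
from pathlib import PurePosixPath


def url_to_png_path(url, root='screenshots'):
    """Map 'www.mysite.com/dir1/pageA' to a nested path ending in '.png'."""
    # Drop a scheme such as 'https://' if one is present.
    without_scheme = url.split('://', 1)[-1]
    # Each non-empty URL segment becomes one directory level.
    parts = [part for part in without_scheme.split('/') if part]
    path = PurePosixPath(root, *parts)
    # Append the extension to the last segment instead of replacing a suffix.
    return path.with_name(path.name + '.png')
```

Before saving, the parent directories would need to exist, for example via pathlib.Path(target).parent.mkdir(parents=True, exist_ok=True); the resulting path can then be passed to whatever call writes the screenshot.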
What if you use '%2F'? It's the '/' character, but URL-encoded (percent-encoded).
source:
http://www.w3schools.com/tags/ref_urlencode.asp
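A minimal sketch of the percent-encoding idea using the standard library; the two helper names are hypothetical. Because '%' itself is also encoded, the mapping can be reversed, and underscores already present in the URL are left untouched.

```python
from urllib.parse import quote, unquote


def url_to_filename(url):
    """Percent-encode a URL so it is usable as a single file name."""
    # safe='' makes quote() encode '/' too; by default '/' is left as-is.
    return quote(url, safe='') + '.png'


def filename_to_url(filename):
    """Reverse url_to_filename(): strip '.png' and decode."""
    return unquote(filename[:-len('.png')])
```

For the first URL in the question, url_to_filename('www.mysite.com/dir1/pageA') gives 'www.mysite.com%2Fdir1%2FpageA.png'.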

Listing files in a sub folder using Glob

I saw this answer - How can I search sub-folders using glob.glob module in Python? but I didn't understand what os.walk() was doing (I read the docs but it didn't quite make sense).
I'm really new to pathing and still trying to make sense of it.
I have a script that lives in - /users/name/Desktop/folder/ and I want to access some files in /users/name/Desktop/folder/subfolder/*.html
I tried glob.glob('/users/name/Desktop/folder/subfolder/*.html') but it returned an empty list. I realize this is what the previous person did and it didn't work (I was just hoping that glob had been updated!)
Any thoughts on how to do this?
Without any further information it's hard to say what the issue is. I tested your syntax, it works fine for me. Are you sure the extension is .html not .htm in your /users/name/Desktop/folder/subfolder/ directory?
Also, to troubleshoot further, you can check what Python can see in your directory by running:
os.listdir('/users/name/Desktop/folder/subfolder/')
This should get you started.
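The checks above can be bundled into a small helper. This is only a sketch, and the function name `diagnose` is hypothetical. It is useful because glob.glob() returns an empty list, rather than raising an error, when the directory does not exist or nothing matches.

```python
import glob
import os


def diagnose(directory, pattern='*.html'):
    """Print what Python can see in `directory` and what `pattern` matches."""
    if not os.path.isdir(directory):
        # glob.glob() would just return [] here, hiding the real problem.
        print('Not a directory (check spelling and case):', directory)
        return []
    print('Entries:', sorted(os.listdir(directory)))
    matches = sorted(glob.glob(os.path.join(directory, pattern)))
    print('Matches for', pattern, ':', matches)
    return matches
```

Calling diagnose('/users/name/Desktop/folder/subfolder/') would show whether the path itself or the extension is the problem.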

Searching script in Python

I'm trying to write a script that searches for a file across the whole computer, similar to Search in Windows. I want to do it without any libraries.
I started by setting the directory to the root of the disk and then checking how many directories it has. I want to write a function that searches for new directories in the current directory, so that in the end I'll have a list of all directories on the disk. But when I try to find all files ending with ".txt", I get the error message WindowsError 5: Access Denied. What am I doing wrong? Thanks
import os
os.chdir("\\.")
dir = os.listdir()
dirs = []
for x in dir:
    if os.path.isdir(x):
        dirs.append(x)
for y in dirs:
    o = os.listdir(y)
    for p in o:
        if p.endswith(".txt"):
            print(p)
input()
First, if you want to walk a directory tree, use os.walk instead of trying to build it yourself. And if you're trying to learn how something like os.walk works, the source code should be right there in os.py.
Second, you probably don't have access to your entire filesystem unless you run as Administrator, so you're going to get a bunch of Access Denied errors as you try to step through the directories you don't have access to. You have to use try/except to deal with these errors in whatever way you find appropriate (e.g., print the error and move on to the next directory).
Third, this whole idea is probably misguided. Windows Desktop Search does not actually search your whole tree, it keeps a database and searches that, which is much faster (and also allows users to search into paths they couldn't reach directly—for example, you might have access to /Users/foo, but not to /Users, which means WDS can show you files in /Users/foo but your script cannot).
Fourth, this whole thing is much easier to do with the POSIX 'find' tool or… I forget the name, but there's a DOS-derived tool that also comes with Windows that does the same thing but not as flexibly. Either way, it's a one-liner shell/batch command instead of dozens of lines of Python.
Finally, the way you've written this, it's going to search the current drive, not all drives, which probably isn't what you want, is it?
You've hit a directory that you don't have permission to look in. So what. Catch the exception and continue the search.
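A minimal sketch combining the two main suggestions above: walk the tree with os.walk() and carry on past directories that cannot be read. The function name is hypothetical. By default os.walk() silently ignores errors raised while listing a directory; the onerror callback is used here only so that skipped directories are reported.

```python
import os


def find_files(root, suffix='.txt'):
    """Yield paths under `root` whose names end with `suffix`."""
    skipped = []

    def remember(error):
        # os.walk() passes the OSError here instead of raising it.
        skipped.append(error.filename)

    for dirpath, _dirnames, filenames in os.walk(root, onerror=remember):
        for name in filenames:
            # Compare in lower case so 'NOTES.TXT' is found as well.
            if name.lower().endswith(suffix):
                yield os.path.join(dirpath, name)

    if skipped:
        print('Skipped', len(skipped), 'unreadable directories')
```

To search one drive, loop over find_files('C:\\') and print each path; other drives would each need their own call.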

How can I efficiently select 100 random JPG files from a directory (including subdirs) in Python?

I have a very large directory of files and folders. Currently, I scan the entire directory for JPGs and store them in a list. This is really slow due to the size of the directory. Is there a faster, more efficient way to do this? Perhaps without scanning everything?
My directory looks like this:
/library/Modified/2000/[FolderName]/Images.JPG
/library/Modified/2001/[FolderName]/Images.JPG
/library/Modified/2002/[FolderName]/Images.JPG
/library/Modified/2003/[FolderName]/Images.JPG
/library/Modified/2004/[FolderName]/Images.JPG
...
/library/Modified/2012/FolderName/Images.JPG
Thanks
See Generator Tricks for System Programmers for a bunch of neat stuff. But specifically, see the gen-find example. This is as efficient as you are going to get, without making a bunch of assumptions about your file structure layout.
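A sketch in the spirit of the gen-find example: a generator that lazily yields matching paths. The sampling function is an addition not taken from that talk; it uses reservoir sampling so that 100 random paths can be chosen without first storing every path in a list. The whole tree is still scanned once.

```python
import fnmatch
import os
import random


def gen_find(pattern, top):
    """Lazily yield paths under `top` whose file name matches `pattern`."""
    for dirpath, _dirnames, filenames in os.walk(top):
        # fnmatch.filter() is case-sensitive on POSIX systems.
        for name in fnmatch.filter(filenames, pattern):
            yield os.path.join(dirpath, name)


def sample_paths(paths, k, rng=random):
    """Reservoir sampling: pick up to k items uniformly from an iterable."""
    reservoir = []
    for i, path in enumerate(paths):
        if i < k:
            reservoir.append(path)
        else:
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                reservoir[j] = path
    return reservoir
```

For the layout in the question this would be called as sample_paths(gen_find('*.JPG', '/library/Modified'), 100).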
Assuming that your application is the only one changing the directory, that you have control over the directory names/structure, and that you have to do the operation described in your question more than once:
Rename all the files once so you can access them in a predictable order. Say, give every file a numeric name from 1 to N (where N is the number of files in the directory) and keep a special file ".count" which holds N for each directory. Then access the files directly by names produced by a random number generator.
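A minimal sketch of the lookup step in that scheme. It assumes, hypothetically, that ".count" contains a single integer N and that the files are named 1.JPG through N.JPG; the function name is also an assumption.

```python
import os
import random


def random_numbered_file(directory, ext='.JPG'):
    """Pick one file at random without listing the directory."""
    # Assumed format: '.count' holds just the number of files, e.g. "250".
    with open(os.path.join(directory, '.count')) as handle:
        count = int(handle.read().strip())
    # randint() is inclusive on both ends, matching names 1..N.
    number = random.randint(1, count)
    return os.path.join(directory, '%d%s' % (number, ext))
```

Only the ".count" file is read, so the cost of one pick does not grow with the number of files in the directory.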
I don't know where the slowness occurs, but when scanning directories and files I found it much faster to dump the directory/file listing into a text file first using a batch file, and then have Python read that file. This worked well on our server system with 7 servers and many thousands of directories.
Python could, of course, run the batch file.
