Python2.7 search zipfiles for .kml containing string without unzipping

I am trying to write my first python script below. I want to search through a read only archive on an HPC to look in zipfiles contained within folders with a variety of other folder/file types. If the zip contains a .kml file I want to print the line in there starting with the string <coordinates>.
import zipfile as z
kfile = file('*.kml') #####breaks here#####
folderpath = '/neodc/sentinel1a/data/IW/L1_GRD/h/IPF_v2/2015/01/21' # folder with multiple folders and .zips
for zipfile in folderpath: # am only interested in the .kml files within the .zips
    if kfile in zipfile:
        with read(kfile) as k:
            for line in k:
                if '<coordinates>' in line: # only want the coordinate line
                    print line # print the coordinates
        k.close()
Eventually I want to loop this through multiple folders rather than pointing to the exact folder location, i.e. loop through every subfolder in /neodc/sentinel1a/data/IW/L1_GRD/h/IPF_v2/2015/, but this is a starting point for me to try to understand how Python works.
I am sure there are many problems with this script before it will run but the current one I have is
kfile = file('*.kml')
IOError: [Errno 22] invalid mode ('r') or filename: '*.kml'
Process finished with exit code 1
Any help appreciated to get this simple process script working.

When you run:
kfile = file('*.kml')
You are trying to open a single file named exactly *.kml, which is not what you want. If you want to process all *.kml files, you will need to (a) get a list of matching files and then (b) process those files in a list.
There are a number of ways to accomplish the above; the easiest is probably the glob module, which can be used something like this:
import glob
for kfilename in glob.glob('*.kml'):
    print kfilename
However, if you are trying to process a directory tree, rather than a single directory, you may instead want to investigate the os.walk function. From the docs:
Generate the file names in a directory tree by walking the tree either top-down or bottom-up. For each directory in the tree rooted at directory top (including top itself), it yields a 3-tuple (dirpath, dirnames, filenames).
A simple example might look something like this:
import os
for root, dirs, files in os.walk('topdir/'):
    kfilenames = [fn for fn in files if fn.endswith('.kml')]
    for kfilename in kfilenames:
        print kfilename
Additional commentary
Iterating over strings
Your script has:
for zipfile in folderpath:
That will simply iterate over the characters in the string folderpath. E.g., the output of:
folderpath = '/neodc/sentinel1a/data/IW/L1_GRD/h/IPF_v2/2015/01/21'
for zipfile in folderpath:
    print zipfile
Would be:
/
n
e
o
d
c
/
s
e
n
t
i
n
e
l
1
a
/
...and so forth.
read is not a context manager
Your code has:
with read(kfile) as k:
There is no read built-in, and the .read method on files cannot be used as a context manager.
KML is XML
You're looking for "lines beginning with <coordinates>", but KML files are not line based. An entire KML document could be a single line and it would still be valid.
You are much better off using an XML parser to parse XML.
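Putting the points above together, here is a minimal sketch of one way to do it. It is written for Python 3 and uses only the standard library (os.walk, zipfile, xml.etree.ElementTree). The function name find_kml_coordinates is made up for this example, and it assumes every .zip under the starting folder is a readable archive; a corrupt archive would raise zipfile.BadZipFile.

```python
import os
import zipfile
import xml.etree.ElementTree as ET


def find_kml_coordinates(top):
    """Yield (zip_path, kml_name, coordinates_text) for each <coordinates>
    element in any .kml member of any .zip file under `top`.

    Hypothetical helper for illustration, not part of any library.
    """
    for root, dirs, files in os.walk(top):
        for name in files:
            if not name.lower().endswith('.zip'):
                continue
            zip_path = os.path.join(root, name)
            # ZipFile reads members directly; nothing is extracted to disk.
            with zipfile.ZipFile(zip_path) as archive:
                for member in archive.namelist():
                    if not member.lower().endswith('.kml'):
                        continue
                    with archive.open(member) as kml:
                        tree = ET.parse(kml)
                    for elem in tree.iter():
                        # Compare the local tag name, ignoring any XML namespace.
                        if elem.tag.rsplit('}', 1)[-1] == 'coordinates':
                            yield zip_path, member, (elem.text or '').strip()
```

Looping over find_kml_coordinates('/neodc/sentinel1a/data/IW/L1_GRD/h/IPF_v2/2015') and printing each tuple would then cover every subfolder. Because the archives are only opened for reading, this should also work on a read-only file system.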

Related

Errors with Glob while outputting file names

I am combining two questions here because they are related to each other.
Question 1: I am trying to use glob to open all the files in a folder but it is giving me "Syntax Error". I am using Python 3.xx. Has the syntax changed for Python 3.xx?
Error Message:
File "multiple_files.py", line 29
files = glob.glob(/src/xyz/rte/folder/)
SyntaxError: invalid syntax
Code:
import csv
import os
import glob
from pandas import DataFrame, read_csv
#extracting
files = glob.glob(/src/xyz/rte/folder/)
for fle in files:
    with open (fle) as f:
        print("output" + fle)
        f_read.close()
Question 2: I want to read input files, append "output" to the names and print out the names of the files. How can I do that?
Example: Input file name would be - xyz.csv and the code should print output_xyz.csv .
Your help is appreciated.

Your first problem is that strings, including pathnames, need to be in quotes. This:
files = glob.glob(/src/xyz/rte/folder/)
… is trying to divide a bunch of variables together, but the leftmost and rightmost divisions are missing operands, so you've confused the parser. What you want is this:
files = glob.glob('/src/xyz/rte/folder/')
Your next problem is that this glob pattern doesn't have any globs in it, so the only thing it's going to match is the directory itself.
That's perfectly legal, but kind of useless.
And then you try to open each match as a text file. Which you can't do with a directory, hence the IsADirectoryError.
The answer here is less obvious, because it's not clear what you want.
Maybe you just wanted all of the files in that directory? In that case, you don't want glob.glob, you want listdir (or maybe scandir): os.listdir('/src/xyz/rte/folder/').
Maybe you wanted all of the files in that directory or any of its subdirectories? In that case, you could do it with pathlib's Path.rglob (or glob.glob with recursive=True and a ** pattern), but os.walk is probably clearer.
Maybe you did want all the files in that directory that match some pattern, so glob.glob is right—but in that case, you need to specify what that pattern is. For example, if you wanted all .csv files, that would be glob.glob('/src/xyz/rte/folder/*.csv').
Finally, you say "I want to read input files, append "output" to the names and print out the names of the files". Why do you want to read the files if you're not doing anything with the contents? You can do that, of course, but it seems pretty wasteful. If you just want to print out the filenames with output appended, that's easy:
for filename in os.listdir('/src/xyz/rte/folder/'):
    print('output_'+filename)

This works in http://pyfiddle.io:
Docs: https://docs.python.org/3/library/glob.html
import csv
import os
import glob
# create some files
for n in ["a","b","c","d"]:
    with open('{}.txt'.format(n),"w") as f:
        f.write(n)
print("\nFiles before")
# get all files
files = glob.glob("./*.*")
for fle in files:
    print(fle) # print file
    path,fileName = os.path.split(fle) # split name from path
    # open file for read and second one for write with modified name
    with open (fle) as f,open('{}{}output_{}'.format(path,os.sep, fileName),"w") as w:
        content = f.read() # read all
        w.write(content.upper()) # write all modified
# check files afterwards
print("\nFiles after")
files = glob.glob("./*.*") # pattern for all files
for fle in files:
    print(fle)
Output:
Files before
./d.txt
./main.py
./c.txt
./b.txt
./a.txt
Files after
./d.txt
./output_c.txt
./output_d.txt
./main.py
./output_main.py
./c.txt
./b.txt
./output_b.txt
./a.txt
./output_a.txt
I am on Windows and would use os.walk (docs) instead.
for d,subdirs,files in os.walk("./"): # unpack the returned current dir, its subdirs, and its files
    print("AktDir:", d)
    print("Subdirs:", subdirs)
    print("Files:", files)
Output:
AktDir: ./
Subdirs: []
Files: ['d.txt', 'output_c.txt', 'output_d.txt', 'main.py', 'output_main.py',
'c.txt', 'b.txt', 'output_b.txt', 'a.txt', 'output_a.txt']
It also recurses into subdirs.

WxPython - building a directory tree based on file availability

I do atomistic modelling, and use Python to analyze simulation results. To simplify work with a whole bunch of Python scripts used for different tasks, I decided to write simple GUI to run scripts from it.
I have a (rather complex) directory structure beginning from some root (say ~/calc), and I want to populate a wx.TreeCtrl control with the directories containing calculation results, preserving their structure. A folder contains results if it contains a file with the .EXT extension. What I am trying to do is walk through the directories from the root and check whether each one contains an .EXT file. When such a directory is reached, add it and its ancestors to the tree:
def buildTree(self, rootdir):
    root = rootdir
    r = len(rootdir.split('/'))
    ids = {root : self.CalcTree.AddRoot(root)}
    for (dirpath, dirnames, filenames) in os.walk(root):
        for dirname in dirnames:
            fullpath = os.path.join(dirpath, dirname)
            if sum([s.find('.EXT') for s in filenames]) > -1 * len(filenames):
                ancdirs = fullpath.split('/')[r:]
                ad = rootdir
                for ancdir in ancdirs:
                    d = os.path.join(ad, ancdir)
                    ids[d] = self.CalcTree.AppendItem(ids[ad], ancdir)
                    ad = d
But this code ends up with many second-level nodes with the same name, and that's definitely not what I want. So I somehow need to see if the node is already added to the tree, and in positive case add new node to the existing one, but I do not understand how this could be done. Could you please give me a hint?
Besides, the code contains 2 dirty hacks I'd like to get rid of:
I get the list of ancestor dirs by splitting the full path at '/' positions, and this is Linux-specific;
I find out whether an .EXT file is in the directory by searching for the extension in the strings from the filenames list, taking into account that s.find returns -1 if the substring is not found.
Is there a way to make these chunks of code more readable?

First of all the hacks:
To get the path separator for whatever OS you're using, you can use os.sep.
Use str.endswith() and the fact that in Python the empty list [] evaluates to False:
if [ file for file in filenames if file.endswith('.EXT') ]:
In terms of getting them all nicely nested you're best off doing it recursively. So the pseudocode would look something like the following. Please note this is just provided to give you an idea of how to do it, don't expect it to work as it is!
def buildTree(self, rootdir):
    rootId = self.CalcTree.AddRoot(rootdir)
    self.buildTreeRecursion(rootdir, rootId)

def buildTreeRecursion(self, dir, parentId):
    # Iterate over the entries (files and subdirectories) in dir
    for name in sorted(os.listdir(dir)):
        path = os.path.join(dir, name)
        id = self.CalcTree.AppendItem(parentId, name)
        # Recurse into subdirectories so their entries nest under this node
        if os.path.isdir(path):
            self.buildTreeRecursion(path, id)
Hope this helps!
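For the duplicate-node problem specifically, one option is to remember which nodes already exist and reuse them. The sketch below shows the idea in Python 3 without wxPython, using nested dicts as a stand-in for the tree control; build_result_tree is a made-up name, and the '.EXT' default comes from the question.

```python
import os


def build_result_tree(rootdir, ext='.EXT'):
    """Return a nested dict of the directories under `rootdir` that contain
    a file ending in `ext`, together with their ancestor directories.

    Hypothetical helper: the dicts stand in for wx.TreeCtrl items.
    """
    tree = {}
    for dirpath, dirnames, filenames in os.walk(rootdir):
        if not any(name.endswith(ext) for name in filenames):
            continue
        rel = os.path.relpath(dirpath, rootdir)
        node = tree
        if rel != os.curdir:
            # setdefault reuses an existing node, so a shared ancestor is
            # created once rather than once per result directory.
            for part in rel.split(os.sep):
                node = node.setdefault(part, {})
    return tree
```

With a real wx.TreeCtrl the same pattern would presumably apply: keep a dict mapping each directory path to its item id (like the ids dict in the question) and call AppendItem only when the path is not in the dict yet.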

python listing dirs in a different order based upon platform

I am writing and testing code on XPsp3 w/ python 2.7. I am running the code on 2003 server w/ python 2.7. My dir structure will look something like this
d:\ssptemp
d:\ssptemp\ssp9-1
d:\ssptemp\ssp9-2
d:\ssptemp\ssp9-3
d:\ssptemp\ssp9-4
d:\ssptemp\ssp10-1
d:\ssptemp\ssp10-2
d:\ssptemp\ssp10-3
d:\ssptemp\ssp10-4
Inside each directory there is one or more files that will have "IWPCPatch" as part of the filename.
Inside one of these files (one in each dir), there will be the line 'IWPCPatchFinal_a.wsf'
What I do is
1) os.walk across all dirs under d:\ssptemp
2) find all files with 'IWPCPatch' in the filename
3) check the contents of the file for 'IWPCPatchFinal_a.wsf'
4) If contents is true I add the path of that file to a list.
My problem is that on my XP machine it works fine. If I print out the results of the list I get several items in the order I listed above.
When I move it to the server 2003 machine I get the same contents in a different order. It comes ssp10-X, then ssp9-X. And this is causing me issues with a different area in the program.
I can see from my output that it begins the os.walk in the wrong order, but I don't know why that is occurring.
import os
import fileinput
print "--createChain--"
listOfFiles = []
for path, dirs, files in os.walk('d:\ssptemp'):
    print "parsing dir(s)"
    for file in files:
        newFile = os.path.join(path,file)
        if newFile.find('IWPCPatch') >= 0:
            for line in fileinput.FileInput(newFile):
                if "IWPCPatchFinal_a.wsf" in line:
                    listOfFiles.append(newFile)
                    print "Added", newFile
for item in listOfFiles:
    print "list item", item

for path, dirs, files in os.walk('d:\ssptemp'):
    # sort dirs and files
    dirs.sort()
    files.sort()
    print "parsing dir(s)"
    # ...

The order of directories within os.walk is not necessarily alphabetical (I think it actually depends on how they are stored within the dirent on the filesystem). It will likely be stable for the exact same directory (on the same filesystem) if you don't change the directory contents (i.e., repeated calls will return the same order), but the order is not necessarily alphabetical.
If you want to have an ordered list of filenames you will have to build the list and then sort it yourself.
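One caveat worth noting: a plain alphabetical sort compares strings character by character, so 'ssp10-1' sorts before 'ssp9-1', which looks like the very order the question wants to avoid. A "natural" sort key that compares the digit runs as numbers is one way around that. The following is a Python 3 sketch; natural_key and walk_sorted are names invented for this example.

```python
import os
import re


def natural_key(name):
    """Split a name into text and integer chunks so 'ssp9-1' < 'ssp10-1'."""
    parts = re.split(r'(\d+)', name)
    # re.split with a capturing group puts the digit runs at odd indexes.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def walk_sorted(top):
    """Like os.walk, but visits directories and files in natural order."""
    for path, dirs, files in os.walk(top):
        # Sorting dirs in place controls the order os.walk descends in
        # (this only works in the default top-down mode).
        dirs.sort(key=natural_key)
        files.sort(key=natural_key)
        yield path, dirs, files
```

Replacing os.walk(...) with walk_sorted(...) in the loop should then give the same order on both machines, regardless of how the file system happens to store the entries.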

How to write tag deleter script in python

I want to implement a file reader (folders and subfolders) script which detects some tags and delete those tags from the files.
The files are .cpp, .h, .txt and .xml, and there are hundreds of files under the same folder.
I have no idea about python, but people told me that I can do it easily.
EXAMPLE:
My main folder is A: C:\A
Inside A, I have folders (B,C,D) and some files A.cpp A.h A.txt and A.xml. In B i have folders B1, B2,B3 and some of them have more subfolders, and files .cpp, .xml and .h....
xml files, contains some tags like <!-- $Mytag: some text$ -->
.h and .cpp files contains another kind of tags like //$TAG some text$
.txt has different format tags: #$This is my tag$
It always starts and ends with a $ symbol, but it always has a comment character (//,
The idea is to run one script and delete all tags from all files so the script must:
Read folders and subfolders
Open files and find tags
If they are there, delete and save files with changes
WHAT I HAVE:
import os
for root, dirs, files in os.walk(os.curdir):
    if files.endswith('.cpp'):
        %Find //$ and delete until next $
    if files.endswith('.h'):
        %Find //$ and delete until next $
    if files.endswith('.txt'):
        %Find #$ and delete until next $
    if files.endswith('.xml'):
        %Find <!-- $ and delete until next $ and -->

The general solution would be to:
use the os.walk() function to traverse the directory tree.
Iterate over the filenames and use fn_name.endswith('.cpp') with if/elif to determine which kind of file you're working with
Use the re module to create a regular expression you can use to determine if a line contains your tag
Open the target file and a temporary file (use the tempfile module). Iterate over the source file line by line and output the filtered lines to your tempfile.
If any lines were replaced, use os.unlink() plus os.rename() to replace your original file
It's a trivial exercise for a Python adept, but for someone new to the language it'll probably take a few hours to get working. You probably couldn't ask for a better task to get introduced to the language, though. Good luck!
----- Update -----
The files attribute returned by os.walk is a list so you'll need to iterate over it as well. Also, the files attribute will only contain the base name of the file. You'll need to use the root value in conjunction with os.path.join() to convert this to a full path name. Try doing just this:
for root, d, files in os.walk('.'):
    for base_filename in files:
        full_name = os.path.join(root, base_filename)
        if full_name.endswith('.h'):
            print full_name, 'is a header!'
        elif full_name.endswith('.cpp'):
            print full_name, 'is a C++ source file!'
If you're using Python 3, the print statements will need to be function calls but the general idea remains the same.

Try something like this:
import os
import re
# Python's re module only allows fixed-width look-behind, so the comment
# prefix is captured as group 1 and put back by the substitution instead.
CPP_TAG_RE = re.compile(r'(// *)\$[^$]+\$')
tag_REs = {
    '.h': CPP_TAG_RE,
    '.cpp': CPP_TAG_RE,
    '.xml': re.compile(r'(<!-- *)\$[^$]+\$(?= *-->)'),
    '.txt': re.compile(r'(# *)\$[^$]+\$'),
}
def process_file(filename, regex):
    # Set up.
    tempfilename = filename + '.tmp'
    infile = open(filename, 'r')
    outfile = open(tempfilename, 'w')
    # Filter the file: keep the comment prefix (group 1), drop the tag.
    for line in infile:
        outfile.write(regex.sub(r'\1', line))
    # Clean up.
    infile.close()
    outfile.close()
    # Enable only one of the two following lines.
    os.rename(filename, filename + '.orig')
    #os.remove(filename)
    os.rename(tempfilename, filename)
def process_tree(starting_point=os.curdir):
    for root, d, files in os.walk(starting_point):
        for filename in files:
            # Get rid of `.lower()` in the following if case matters.
            ext = os.path.splitext(filename)[1].lower()
            if ext in tag_REs:
                process_file(os.path.join(root, filename), tag_REs[ext])
A nice thing about os.path.splitext is that it does the right thing for filenames that start with a dot.
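To illustrate that last point, here is a small Python 3 sketch. The helper name extension_of is invented for the example; the behaviour shown is that of the standard os.path.splitext.

```python
import os


def extension_of(filename):
    """Return the lower-cased extension of `filename`, or '' if it has none."""
    # splitext splits at the last dot, but a leading dot (as in '.hidden')
    # is treated as part of the name, not as an extension.
    return os.path.splitext(filename)[1].lower()


print(extension_of('A.CPP'))           # .cpp
print(repr(extension_of('.hidden')))   # ''
print(extension_of('archive.tar.gz'))  # .gz
```

So a file such as .hidden has no extension and is never looked up in the table of tag patterns, while only the final suffix of a multi-dot name is used.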

Using Python, how do I get an array of file info objects, based on a search of a file system?

Currently I have a bash script which runs the find command, like so:
find /storage/disk-1/Media/Video/TV -name *.avi -mtime -7
This gets a list of TV shows that were added to my system in the last 7 days. I then go on to create some symbolic links so I can get to my newest TV shows.
I'm looking to re-code this in Python, but I have a few questions I can't seem to find the answers to using Google (maybe I'm not searching for the right thing). I think the best way to sum this up is to ask the question:
How do I perform a search on my file system (should I call find?) which gives me an array of file info objects (containing the modify date, file name, etc) so that I may sort them based on date, and other such things?

import os, time
allfiles = []
now = time.time()
# walk will return triples (current dir, list of subdirs, list of regular files)
# file names are relative to dir at first
for dir, subdirs, files in os.walk("/storage/disk-1/Media/Video/TV"):
    for f in files:
        if not f.endswith(".avi"):
            continue
        # compute full path name
        f = os.path.join(dir, f)
        st = os.stat(f)
        if st.st_mtime < now - 3600*24*7:
            # too old
            continue
        allfiles.append((f, st))
This will return all files that find also returned, as a list of pairs (filename, stat result).
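Since the question also asks about sorting by date, the same loop can be wrapped in a function that returns the pairs newest first. This is a Python 3 sketch; the function name recent_files and its parameters are made up for the example, and "within 7 days" is taken as a plain mtime cutoff, which only approximates how find rounds -mtime -7.

```python
import os
import time


def recent_files(top, suffix='.avi', max_age_days=7, now=None):
    """Return [(path, os.stat_result), ...] for files under `top` whose name
    ends with `suffix` and whose mtime is within `max_age_days`, newest first.

    Hypothetical helper for illustration.
    """
    if now is None:
        now = time.time()
    cutoff = now - max_age_days * 24 * 3600
    found = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            if st.st_mtime >= cutoff:
                found.append((path, st))
    # Sort on the modification time stored in each stat result.
    found.sort(key=lambda item: item[1].st_mtime, reverse=True)
    return found
```

Each stat result also carries the size (st_size) and other fields, so the same list can be re-sorted on whatever attribute is needed before creating the symbolic links.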

Look into the os module: os.walk is the function that walks the file system, and os.path provides the file mtime (os.path.getmtime) and other file information. os.path also defines many functions for parsing and splitting file names.
Also of interest: the glob module defines functions for "globbing" (matching file names using Unix wildcard rules).
From this, building a list of files matching some criterion should be easy.

You can use "find" through the "subprocess" module.
Afterwards, use the "split" string function to dissect each line.
For each file, use the os module (e.g. os.path.getmtime) to get file information.
or
Use os.walk and the "glob" module to collect the file paths.