use search to get matching list of files - python

I am using the following method of a class to find out whether every .csv file has a corresponding .csv.meta in the given directory.
I am getting "None" for files that are just .csv, and a match object at a hexadecimal address for the .csv.meta files.
Result
None
<_sre.SRE_Match object at 0x1bb4300>
None
<_sre.SRE_Match object at 0xbd6378>
This is the code:
def validate_files(self, filelist):
    try:
        local_meta_file_list = []
        local_csv_file_list = []
        # Validate each file and see if they pair properly based on the pattern *.csv and *.csv.meta
        for tmp_file_str in filelist:
            csv_match = re.search(self.vprefix_pattern + '([0-9]+)' + self.vcsv_file_postfix_pattern + '$', tmp_file_str)
            if csv_match:
                local_csv_file_list.append(csv_match.group())
                meta_file_match_pattern = self.vprefix_pattern + csv_match.group(1) + self.vmeta_file_postfix_pattern
                tmp_meta_file = [os.path.basename(s) for s in filelist if meta_file_match_pattern in s]
                local_meta_file_list.extend(tmp_meta_file)
    except Exception, e:
        print e
        self.m_logger.error("Error: Validate File Process thrown exception " + str(e))
        sys.exit(1)
    return local_csv_file_list, local_meta_file_list
These are file names.
File Names
rp_package.1406728501.csv.meta
rp_package.1406728501.csv
rp_package.1402573701.csv.meta
rp_package.1402573701.csv
rp_package.1428870707.csv
rp_package.1428870707.meta
Thanks
Sandy

If all you need is to find .csv files which have corresponding .csv.meta files, then I don't think you need regular expressions at all. (Incidentally, the None and <_sre.SRE_Match ...> lines in your output are simply what re.search returns: None when there is no match, and a match object when there is one.) We can filter the file list for those with the .csv extension, then filter that list further for files whose name, plus .meta, appears in the file list.
Here’s a simple example:
myList = [
    'rp_package.1406728501.csv.meta',
    'rp_package.1406728501.csv',
    'rp_package.1402573701.csv.meta',
    'rp_package.1402573701.csv',
    'rp_package.1428870707.csv',
    'rp_package.1428870707.meta',
]

def validate_files(file_list):
    loc_csv_list = filter(lambda x: x[-3:].lower() == 'csv', file_list)
    loc_meta_list = filter(lambda c: '%s.meta' % c in file_list, loc_csv_list)
    return loc_csv_list, loc_meta_list

print validate_files(myList)
If there may be CSV files that don’t conform to the rp_package format, and need to be excluded, then we can initially filter the file list using the regex. Here’s an example (swap out the regex parameters as necessary):
import re

vprefix_pattern = r'rp_package\.'   # dots escaped so they match literal dots
vcsv_file_postfix_pattern = r'\.csv'
regex_str = vprefix_pattern + '[0-9]+' + vcsv_file_postfix_pattern

def validate_files(file_list):
    csv_list = filter(lambda x: re.search(regex_str, x), file_list)
    loc_csv_list = filter(lambda x: x[-3:].lower() == 'csv', csv_list)
    loc_meta_list = filter(lambda c: '%s.meta' % c in file_list, loc_csv_list)
    return loc_csv_list, loc_meta_list

print validate_files(myList)
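For completeness, a set-based variant (a minimal sketch reusing the myList above) makes it easy to also report the .csv files that are missing their .csv.meta partner:

def validate_files(file_list):
    names = set(file_list)
    csv_list = [f for f in names if f.endswith('.csv')]
    paired = [f for f in csv_list if f + '.meta' in names]
    unpaired = [f for f in csv_list if f + '.meta' not in names]
    return paired, unpaired

paired, unpaired = validate_files(myList)
print paired    # csv files that have a .csv.meta partner
print unpaired  # ['rp_package.1428870707.csv'] for the list above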

Related

python module ZipFile get base folder using regex

Assume the zip file "acme_example.zip" contains the following files/folders:
acme/one.txt
acme/one1.txt
acme/one2.txt
acme/one3.txt
acme/one4.txt
__MACOSX
.DS_Store
And I am using the script below:
output_var = []
skip_st = '__MACOSX'
with ZipFile('acme_example.zip', 'r') as ZipObj:
    listfFiles = ZipObj.namelist()
    for elm in listfFiles:
        p = Path(elm).parts[0]
        if p not in output_var:
            output_var.append(p)
return re.sub(skip_st, '', ''.join(str(item) for item in output_var))
The script above excludes "__MACOSX", but is there a way to also exclude ".DS_Store" so that we only return "acme" as the folder name?
Since you are already iterating over the values, it is better to exclude them at that point. Also, as they are already strings, you can simplify the join part:
skip_st = ['__MACOSX', '.DS_Store']
with ZipFile('acme_example.zip', 'r') as ZipObj:
    listfFiles = ZipObj.namelist()
    for elm in listfFiles:
        p = Path(elm).parts[0]
        if p not in output_var and p not in skip_st:
            output_var.append(p)
return ''.join(output_var)
For reference, here's how you can filter at the end instead.
With a list:
skip_st = ['__MACOSX', '.DS_Store']
# ...
return ''.join(item for item in output_var if item not in skip_st)
With a pattern:
skip_st = '__MACOSX|.DS_Store'
# ...
return re.sub(skip_st, '', ''.join(output_var))
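One caveat: . is a regex metacharacter, so the '.DS_Store' in that pattern would also match e.g. 'xDS_Store'. If the names to skip are plain strings, re.escape can build the pattern safely (a small sketch):

import re

skip = ['__MACOSX', '.DS_Store']
skip_st = '|'.join(re.escape(s) for s in skip)  # dots and other metacharacters get escaped
print(re.sub(skip_st, '', '__MACOSX.DS_Storeacme'))  # acme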

python regular expression to check start and end of a word in a string

I am working on a script to rename files. In this scenario there are three possibilities:
1. The file does not exist: create a new file.
2. The file exists: create a new file whose name includes the number of occurrences of the file, e.g. filename(1).
3. A duplicate of the file already exists: create a new file, e.g. filename(2).
I have the filename in a string. I can check the last character of the filename using a regex, but how do I check the last characters from '(' to ')' and get the number inside?
You just need something like this:
(?<=\()(\d+)(?=\)[^()]*$)
Explanation:
(?<=\() must be preceded by a literal (
(\d+) match and capture the digits
(?=\)[^()]*$) must be followed by ) and then no more ( or ) until the end of the string.
Example: if the file name is Foo (Bar) Baz (23).jpg, the regex above matches 23
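A quick check in the interpreter (same regex, same example name):

import re

m = re.search(r'(?<=\()(\d+)(?=\)[^()]*$)', 'Foo (Bar) Baz (23).jpg')
print m.group()  # 23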
Here is the code and tests to get a filename based on existing filenames:
import re

def get_name(filename, existing_names):
    exist = False
    index = 0
    p = re.compile(r"^%s(\((?P<idx>\d+)\))?$" % filename)
    for name in existing_names:
        m = p.match(name)
        if m:
            exist = True
            idx = m.group('idx')
            if idx and int(idx) > index:
                index = int(idx)
    if exist:
        return "%s(%d)" % (filename, index + 1)
    else:
        return filename

# test data
exists = ["abc(1)", "ab", "abc", "abc(2)", "ab(1)", "de", "ab(5)"]
tests = ["abc", "ab", "de", "xyz"]
expects = ["abc(3)", "ab(6)", "de(1)", "xyz"]
print exists
for name, exp in zip(tests, expects):
    new_name = get_name(name, exists)
    print "%s -> %s" % (name, new_name)
    assert new_name == exp
Look at this line for the regex to get the number in (*):
p = re.compile("^%s(\((?P<idx>\d+)\))?$" % filename)
Here it uses a named capture, ?P<idx>\d+, for the number \d+, and accesses the capture later with m.group('idx').
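For example, with a filename of 'abc' (made-up values, just to illustrate the named group):

import re

p = re.compile(r'^abc(\((?P<idx>\d+)\))?$')
print p.match('abc(2)').group('idx')  # 2
print p.match('abc').group('idx')     # None (the optional group is absent)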

Python Error: String indices must be integers, not str

OK, I have an obvious problem staring me in the face that I can't figure out. I am getting the output/results I need, but I get TypeError: "string indices must be integers, not str". The following is a sample of my code; it is because of the statement if f not in GetSquishySource(dirIn). Basically I am checking whether a specific file is in another list so that I don't end up adding it to a zip file I am creating. I just don't see the problem here or how to get around it. Any help would be appreciated.
def compressLists(z, dirIn, dirsIn, filesIn, encrypt=None):
    try:
        with zipfile.ZipFile(z, 'w', compression=zipfile.ZIP_DEFLATED) as zip:
            # Add files
            compressFileList(z, dirIn, dirIn, filesIn, zip, encrypt)
            # Add directories
            for dir in dirsIn:
                dirPath = os.path.join(dirIn, dir["name"])
                for root, dirs, files in os.walk(dirPath):
                    # Ignore hidden files and directories
                    files = [f for f in files if not f[0] == '.']
                    dirs[:] = [d for d in dirs if not d[0] == '.']
                    # Replace file entries with structure value entries
                    for i, f in enumerate(files):
                        del files[i]
                        if f not in GetSquishySource(dirIn):  # <-- the line in question
                            files.insert(i, {'zDir': dir["zDir"], 'name': f})
                    compressFileList(z, dirIn, root, files, zip, encryptedLua)
                    if dir["recurse"] == False:
                        break;
The following is the GetSquishySource function I created and call.
def GetSquishySource(srcDir):
    squishyLines = []
    srcToRemove = []
    if os.path.isfile(srcDir + os.path.sep + "squishy"):
        with open(srcDir + os.path.sep + "squishy") as squishyFile:
            squishyContent = squishyFile.readlines()
        for line in squishyContent:
            if line.startswith("Module") and line is not None:
                squishyLines.append(line.split(' '))
        for s in squishyLines:
            if len(s) == 3 and s is not None:
                # If the 3rd column in the squishy file contains data, use that.
                path = s[2].replace('Module "', '').replace('"', '').replace("\n", '')
                srcToRemove.append(os.path.basename(path))
            elif len(s) == 2 and s is not None:
                # If the 3rd column in the squishy file contains no data, then use the 2nd column.
                path = s[1].replace('Module "', '').replace('"', '').replace("\n", '').replace(".", os.path.sep) + ".lua"
                srcToRemove.append(os.path.basename(path))
    return srcToRemove
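For background, this TypeError is what Python raises when a string is indexed with a string key, which usually means something expected to be a dict is actually a str. A minimal reproduction:

d = {'name': 'acme'}
print d['name']   # fine: prints acme

s = 'acme'
print s['name']   # TypeError: string indices must be integers, not str

In the code above, files holds dicts after the insert() calls, so anything downstream that still treats those entries as plain file-name strings (or that indexes dir with "name" when dirsIn holds strings) would raise exactly this error.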

Search through directories for specific Excel files and compare data from these files with inputvalues

The task:
The firm I have gotten a summer job at has an expanding test database that consists of an increasing number of subfolders for each project, which include everything from .jpeg files to the .xlsx files I am interested in. As I am a bit used to Python from earlier, I decided to give this task a go. I want to search for Excel documents that have "test spreadsheet" as part of their title (for example "test spreadsheet model259"). All the docs I am interested in are built the same way (weight is always in "A3", etc.), looking somewhat like this:
Model: 259
Length: meters 27
Weight: kg 2500
Speed: m/s 25
I want the user of the finished program to be able to compare results from different tests with each other using my script. This means that the script must see if there is an x-value that fits both criteria at once:
inputlength = x*length of model 259
inputweight = x*weight of model 259
The program should loop through all the files in the main folder. If such an X exists for a model, I want the program to return it to a list of fitting models. The x-value will be a variable, different for each model.
As the result I want a list of all files that fit the input, their scale (x-value), and possibly a link to each file.
For example:
Model scale Link
ModelA 21.1 link_to_fileA
ModelB 0.78 link_to_fileB
The script
The script I have tried to get working so far is below, but if you have other suggestions for how to deal with the task I'll happily accept them. Don't be afraid to ask if I have not explained the task well enough. xlrd is already installed, and I use Eclipse as my IDE. I've been trying to get it to work in many ways now, so most of my script is purely for testing.
Edited:
# -*- coding: utf-8 -*-
# Accepts Norwegian letters
import xlrd, os, fnmatch

folder = r'C:\eclipse\TST-folder'

def excelfiles(pattern):
    file_list = []
    for root, dirs, files in os.walk(start_dir):
        for filename in files:
            if fnmatch.fnmatch(filename.lower(), pattern):
                if filename.endswith(".xls") or filename.endswith(".xlsx") or filename.endswith(".xlsm"):
                    file_list.append(os.path.join(root, filename))
    return file_list

file_list = excelfiles('*tst*')  # only accept docs whose title includes tst
print excelfiles()
How come I only get one result when I print excelfiles() after returning the values, but when I exchange return os.path.join(filename) with print os.path.join(filename) it shows all the .xls files? Does this mean that the results from the excelfiles function are not passed on? (Answered in the comments.)
''' Inputvals '''
inputweight = int(raw_input('legg inn vekt'))      # inputbox for weight
inputlength = int(raw_input('legg inn lengd'))     # inputbox for length
inputspeed = int(raw_input('legg inn hastighet'))  # inputbox for speed

'''Location of each val from the excel spreadsheet'''
def locate_vals():
    val_dict = {}
    for filename in file_list:
        wb = xlrd.open_workbook(os.path.join(start_dir, filename))
        sheet = wb.sheet_by_index(0)
        weightvalue = sheet.cell_value(1, 1)
        lenghtvalue = sheet.cell_value(1, 1)
        speedvalue = sheet.cell_value(1, 1)
        val_dict[filename] = [weightvalue, lenghtvalue, speedvalue]
    return val_dict

val_dict = locate_vals()
print locate_vals()
count = 0
count = 0
Any ideas how I can read from each of the documents found by the excelfiles function? "funcdox" does not seem to work. When I insert a print test, for example print weightvalue after weightvalue = sheet.cell(3,3).value, I get no feedback at all. (The error messages without the mentioned print test have been edited into the script above, which creates a list of the different values, plus minor changes that removed the error messages.)
The script works well until this point.
I made some minor changes to the next part. It is supposed to scale a value from the spreadsheet by multiplying it with a constant (x1). Then I want the user to be able to define another input value, which in turn defines another constant (x2) to make the spreadsheet value fit. Eventually, these constants will be compared to find which models will actually fit for the test.
'''Calculates vals from excel from the given dimensions'''
def dimension():  # Maybe exchange exec-statement with the function itself.
    if count == 0:
        if inputweight != 0:
            exec scale_weight()
        elif inputlenght != 0:
            exec scale_lenght()
        elif inputspeed != 0:
            exec scale_speed()

def scale_weight(x1, x2):  # Repeat for each value.
    for weightvalue in locate_vals():
        if count == 0:
            x1 * weightvalue == inputweight
            count += 1
            exec criteria2
            return weightvalue, x1
        elif count == 2:
            inputweight2 = int(raw_input('Insert weight'))  # inputbox for weight
            x2 * weightvalue == inputweight2
            return weightvalue, x2
The x1 and x2 are what I want to find with this function, so I want them to be totally "free". Is there any way I can test this function without having to insert values for x1 and x2 ?
def scale_lenght():  # Almost identical to scale_weight
    return

def scale_speed():  # Almost identical to scale_weight
    return

def criteria2(weight, lenght, speed):
    if count == 1:
        k2 = raw_input('Criteria two, write weight, length or speed.')
        if k2 == weight:
            count += 1
            exec scale_weight
        elif k2 == lenght:
            count += 1
            exec scale_lenght
        elif k2 == speed:
            count += 1
            exec scale_speed
        else:
            return
Do you see any easier way to deal with this problem? (I hope I managed to explain it well enough. The way I have written the code so far is quite messy, but since I'm not that experienced I'll just have to make it work first, and then clean it up if I have the time.)
Since probably none of the values will exactly fit for both x-constants, I thought I'd use approx_Equal to deal with it:
def approx_Equal(x1, x2, tolerance=int(raw_input('Insert tolerance for scaling difference')),
                 err_msg='Unacceptable tolerance', verbose=True):
    # Gives the approximation for how close the two values of x must be
    if x1 == x2:
        x = x1 + (x2 - x1) / 2
        return x
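(For what it's worth, a tolerance check normally compares the difference rather than requiring exact equality; a minimal sketch of that idea, keeping the midpoint return:)

def approx_equal(x1, x2, tolerance=0.01):
    # If the two scale factors are within tolerance, return their midpoint
    if abs(x1 - x2) <= tolerance:
        return x1 + (x2 - x1) / 2.0
    return None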
Eventually, I'd like a diagram of all the variables used + a link-to-file and name for each document.
Not sure how I will do this, so any tips are greatly appreciated.
Thanks!
In answer to the first question, "How come I only get one result when I am printing excelfiles()?": this is because your return statement is within the nested loop, so the function stops on the first iteration. I would build up a list instead and then return this list; you could also combine this with the issue of checking the name, e.g.:
import os, fnmatch

# globals
start_dir = os.getenv('md')

def excelfiles(pattern):
    file_list = []
    for root, dirs, files in os.walk(start_dir):
        for filename in files:
            if fnmatch.fnmatch(filename.lower(), pattern):
                if filename.endswith(".xls") or filename.endswith(".xlsx") or filename.endswith(".xlsm"):
                    file_list.append(os.path.join(root, filename))
    return file_list

file_list = excelfiles('*cd*')
for i in file_list: print i
Obviously, you'll need to replace the cd with your own search text (keeping the * on either side) and replace the start_dir with your own. I have matched on filename.lower() and entered the search text in lower case to make the matching case-insensitive; just remove the .lower() if you don't want this. I have also allowed for other types of Excel files.
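For example (made-up file names, just to show the effect of the .lower() call):

import fnmatch

print fnmatch.fnmatch('Test_CD_Results.XLSX'.lower(), '*cd*')  # True
print fnmatch.fnmatch('notes.txt'.lower(), '*cd*')             # False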
Regarding reading data from Excel files I have done this before to create an automated way of converting basic Excel files into csv format. You are welcome to have a look at the code below and see if there is anything you can use from this. The xl_to_csv function is where the data is read from the Excel file:
import os, csv, sys, traceback, Tkinter, tkFileDialog as fd, xlrd

# stop tkinter shell from opening as only needed for file dialog
root = Tkinter.Tk()
root.withdraw()

def format_date(dt):
    yyyy, mm, dd = str(dt[0]), str(dt[1]), str(dt[2])
    hh, mi, ss = str(dt[3]), str(dt[4]), str(dt[5])
    if len(mm) == 1:
        mm = '0'+mm
    if len(dd) == 1:
        dd = '0'+dd
    if hh == '0' and mi == '0' and ss == '0':
        datetime_str = dd+'/'+mm+'/'+yyyy
    else:
        if len(hh) == 1:
            hh = '0'+hh
        if len(mi) == 1:
            mi = '0'+mi
        if len(ss) == 1:
            ss = '0'+ss
        datetime_str = dd+'/'+mm+'/'+yyyy+' '+hh+':'+mi+':'+ss
    return datetime_str

def xl_to_csv(in_path, out_path):
    # set up vars to read file
    wb = xlrd.open_workbook(in_path)
    sh1 = wb.sheet_by_index(0)
    row_cnt, col_cnt = sh1.nrows, sh1.ncols
    # set up vars to write file
    fileout = open(out_path, 'wb')
    writer = csv.writer(fileout)
    # iterate through rows and cols
    for r in range(row_cnt):
        # make list from row data
        row = []
        for c in range(col_cnt):
            #print "...debug - sh1.cell(",r,c,").value set to:", sh1.cell(r,c).value
            #print "...debug - sh1.cell(",r,c,").ctype set to:", sh1.cell(r,c).ctype
            # check data type and make conversions
            val = sh1.cell(r,c).value
            if sh1.cell(r,c).ctype == 2:    # number data type
                if val == int(val):
                    val = int(val)          # convert to int if no decimal other than .0
                #print "...debug - res 1 (float to str), val set to:", val
            elif sh1.cell(r,c).ctype == 3:  # date fields
                dt = xlrd.xldate_as_tuple(val, 0)  # date no from excel to date obj
                val = format_date(dt)
                #print "...debug - res 2 (date to str), val set to:", val
            elif sh1.cell(r,c).ctype == 4:  # boolean data types
                val = str(bool(val))        # convert 1 or 0 to bool true / false, then string
                #print "...debug - res 3 (bool to str), val set to:", val
            else:
                val = str(val)
                #print "...debug - else, val set to:", val
            row.append(val)
        # write row to csv file
        try:
            writer.writerow(row)
        except:
            print '...row failed in write to file:', row
            exc_type, exc_value, exc_traceback = sys.exc_info()
            lines = traceback.format_exception(exc_type, exc_value, exc_traceback)
            for line in lines:
                print '!!', line
    print 'Data written to:', out_path, '\n'

def main():
    in_path, out_path = None, None
    # set current working directory to user's my documents folder
    os.chdir(os.path.join(os.getenv('userprofile'),'documents'))
    # ask user for path to Excel file...
    while not in_path:
        print "Please select the excel file to read data from ..."
        try:
            in_path = fd.askopenfilename()
        except:
            print 'Error selecting file, please try again.\n'
    # get dir for output...
    same = raw_input("Do you want to write the output to the same directory? (Y/N): ")
    if same.upper() == 'Y':
        out_path = os.path.dirname(in_path)
    else:
        while not out_path:
            print "Please select a directory to write the csv file to ..."
            try:
                out_path = fd.askdirectory()
            except:
                print 'Error selecting file, please try again.\n'
    # get file name and join to dir
    f_name = os.path.basename(in_path)
    f_name = f_name[:f_name.find('.')]+'.csv'
    out_path = os.path.join(out_path, f_name)
    # get data from file and write to csv...
    print 'Attempting read data from', in_path
    print ' and write csv data to', out_path, '...\n'
    xl_to_csv(in_path, out_path)
    v_open = raw_input("Open file (Y/N):").upper()
    if v_open == 'Y':
        os.startfile(out_path)
    sys.exit()

if __name__ == '__main__':
    main()
Let me know if you have any questions on this.
Finally, regarding the output, I would consider writing it out to an HTML file in a table format. Let me know if you want any help with this; I have some more sample code that you could use part of.
UPDATE
Here is some further advice on writing your output to an HTML file. Below is a function that I have written and used previously for this purpose. Let me know if you need any guidance on what you would need to change for your implementation (if anything). The function expects a nested object in the data argument, e.g. a list of lists or a list of tuples, but should work for any number of rows/columns:
def write_html_file(path, data, heads):
    html = []
    tab_attr = ' border="1" cellpadding="3" style="background-color:#FAFCFF; text-align:right"'
    head_attr = ' style="background-color:#C0CFE2"'
    # opening lines needed for html table
    try:
        html.append('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" ')
        html.append('"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> ')
        html.append('<html xmlns="http://www.w3.org/1999/xhtml">')
        html.append('<body>')
        html.append('  <table'+tab_attr+'>')
    except:
        print 'Error setting up html heading data'
    # html table headings (if required)
    if headings_on:
        try:
            html.append('  <tr'+head_attr+'>')
            for item in heads:
                html.append(' '*6+'<th>'+str(item)+'</th>')
            html.append('  </tr>')
        except:
            exc_type, exc_value, exc_traceback = sys.exc_info()
            lines = traceback.format_exception(exc_type, exc_value, exc_traceback)
            print 'Error writing html table headings:'
            print ''.join('!! ' + line for line in lines)
    # html table content
    try:
        for row in data:
            html.append('  <tr>')
            for item in row:
                html.append(' '*6+'<td>'+str(item)+'</td>')
            html.append('  </tr>')
    except:
        print 'Error writing body of html data'
    # closing lines needed
    try:
        html.append('  </table>')
        html.append('</body>')
        html.append('</html>')
    except:
        print 'Error closing html data'
    # write html data to file
    fileout = open(path, 'w')
    for line in html:
        fileout.write(line)
    print 'Data written to:', path, '\n'
    if sql_path:
        os.startfile(path)
    else:
        v_open = raw_input("Open file (Y/N):").upper()
        if v_open == 'Y':
            os.startfile(path)
headings_on is a global that I have set to True in my script, you will also need to import traceback for the error handling to work as it is currently specified.
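As a usage sketch (hypothetical rows shaped like your desired output; headings_on and sql_path are the globals mentioned above):

import os, sys, traceback

headings_on = True
sql_path = None  # falsy, so the function will ask before opening the file

data = [('ModelA', 21.1, 'link_to_fileA'),
        ('ModelB', 0.78, 'link_to_fileB')]
heads = ['Model', 'scale', 'Link']
write_html_file('models.html', data, heads)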

python read files from hard disk and return result depending on the pattern found

What I want to do is this:
- read files from the hard drive and see whether a file contains a given pattern string; if it does, return True, otherwise return False. On the function call it should print out nicely, saying e.g. 'the' found in file.txt.
This is what I have come up with so far:
import os

path = '../'
folder = os.listdir(path)
y = {}
n = {}

def bla(pattern):
    for book in folder:
        if book[-3:] == 'txt':
            data = open(path + book).read()
            if pattern in sanitize(data):
                y[pattern] = book + " contains " + pattern
                return True
            else:
                n[pattern] = book + " does not contain " + pattern
                return False

if bla('jane'):
    print(y['jane'])
print(n['jane'])
desired output is this;
1.txt contains 'the'
2.txt does not contain 'the'
3.txt contains 'the'
4.txt does not contain 'the'
This works, but without the return True / return False thingy that I wanted to have. Is there any better way than this?
import os

path = '../'
folder = os.listdir(path)

def bla(pattern):
    for book in folder:
        if book[-3:] == 'txt':
            data = open(path + book).read()
            if pattern in sanitize(data):
                print(book + " contains " + pattern)
            else:
                print(book + " does not contain " + pattern)
bla('the')
import os, re

m1 = re.compile(r'.*?\.txt$')
pattern = 'yourpattern'
m2 = re.compile(r'%s' % (pattern))

for file in filter(m1.search, os.listdir(somedir)):
    # open relative to somedir, not the current working directory
    if m2.search(open(os.path.join(somedir, file), 'r').read()):
        print file, 'contains', pattern
    else:
        print file, 'does not contain', pattern
modify output per your tastes
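If you also want the True/False values back for further use, one option (a sketch, assuming a plain substring test in place of your sanitize helper) is to return a dict mapping each file name to whether the pattern was found, and print from that:

import os

def bla(pattern, path='../'):
    results = {}
    for book in os.listdir(path):
        if book.endswith('.txt'):
            data = open(os.path.join(path, book)).read()
            results[book] = pattern in data  # True / False per file
    return results

for book, found in sorted(bla('the').items()):
    if found:
        print book + " contains 'the'"
    else:
        print book + " does not contain 'the'"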
