Search through directories for specific Excel files and compare data from these files with input values - Python

The task:
The firm I have a summer job with has an expanding test database consisting of an increasing number of subfolders for each project, containing everything from .jpeg files to the .xlsx files I am interested in. As I am somewhat used to Python from earlier, I decided to give this task a go with it. I want to search for Excel documents that have "test spreadsheet" as part of their title (for example "test spreadsheet model259"). All the docs I am interested in are built the same way (weight is always in cell "A3", etc.), looking somewhat like this:
Model: 259
Length: meters 27
Weight: kg 2500
Speed: m/s 25
I want the user of the finished program to be able to compare results from different tests with each other using my script. This means that the script must check whether there is an x-value that fits both criteria at once:
inputlength = x*length of model 259
inputweight = x*weight of model 259
The program should loop through all the files in the main folder. If such an x exists for a model, I want the program to add it to a list of fitting models. The x-value will be a variable, different for each model.
As the result I want a list of all files that fit the input, their scale (x-value), and possibly a link to each file.
For example:
Model scale Link
ModelA 21.1 link_to_fileA
ModelB 0.78 link_to_fileB
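In other words, since x is just the ratio between an input value and the model's stored value, the check I have in mind boils down to something like this (only a sketch of the idea; the real values still have to be read from each file first):
def fits(inputlength, inputweight, length, weight, tolerance=0.01):
    # x must satisfy inputlength == x*length and inputweight == x*weight,
    # so both ratios have to (nearly) agree on a single x.
    x_from_length = float(inputlength) / length
    x_from_weight = float(inputweight) / weight
    if abs(x_from_length - x_from_weight) <= tolerance:
        return x_from_length  # the common scale for this model
    return None  # no single x fits both criteria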
The script
The script I have tried to get working so far is below, but if you have other suggestions for how to deal with the task I'll happily accept them. Don't be afraid to ask if I have not explained the task well enough. xlrd is already installed, and I use Eclipse as my IDE. I've been trying to get it to work in many ways, so most of my script is purely for testing.
Edited:
#-*- coding: utf-8 -*-
# Accepts Norwegian letters
import xlrd, os, fnmatch

start_dir = r'C:\eclipse\TST-folder'

def excelfiles(pattern):
    file_list = []
    for root, dirs, files in os.walk(start_dir):
        for filename in files:
            if fnmatch.fnmatch(filename.lower(), pattern):
                if filename.endswith(".xls") or filename.endswith(".xlsx") or filename.endswith(".xlsm"):
                    file_list.append(os.path.join(root, filename))
    return file_list

file_list = excelfiles('*tst*')  # only accept docs whose title includes "tst"
print file_list
How come I only got one result when printing the function's result after returning the values, but when I exchanged "return os.path.join(filename)" with "print os.path.join(filename)" it showed all the .xls files? Does this mean that the results from the excelfiles function were not passed on? Answered in the comments.
''' Input values '''
inputweight = int(raw_input('Enter weight: '))  # input box for weight
inputlength = int(raw_input('Enter length: '))  # input box for length
inputspeed = int(raw_input('Enter speed: '))    # input box for speed

''' Location of each value in the Excel spreadsheet '''
def locate_vals():
    val_dict = {}
    for filename in file_list:
        wb = xlrd.open_workbook(filename)  # file_list already holds full paths
        sheet = wb.sheet_by_index(0)
        weightvalue = sheet.cell_value(2, 2)  # adjust (row, col) to your sheet layout
        lengthvalue = sheet.cell_value(1, 2)
        speedvalue = sheet.cell_value(3, 2)
        val_dict[filename] = [weightvalue, lengthvalue, speedvalue]
    return val_dict

val_dict = locate_vals()
print val_dict
count = 0
Any ideas how I can read from each of the documents found by the excelfiles function? "funcdox" does not seem to work. When I inserted a print test, for example print weightvalue after weightvalue = sheet.cell(3,3).value, I got no feedback at all. Error messages without the mentioned print test: edited into the script above, which creates a list of the different values, plus minor changes that removed the error messages.
The script works well up to this point.
I made some minor changes to the next part. It is supposed to scale a value from the spreadsheet by multiplying it with a constant (x1). Then I want the user to be able to define another input value, which in turn defines another constant (x2) to make the spreadsheet value fit. Eventually, these constants will be compared to find which models will actually fit for the test.
'''Calculates vals from excel from the given dimensions'''
def dimension():  # Maybe exchange the calls with the functions themselves.
    if count == 0:
        if inputweight != 0:
            scale_weight()
        elif inputlength != 0:
            scale_length()
        elif inputspeed != 0:
            scale_speed()

def scale_weight(x1, x2):  # Repeat for each value.
    for weightvalue in locate_vals():
        if count == 0:
            x1 * weightvalue == inputweight
            count += 1
            criteria2()
            return weightvalue, x1
        elif count == 2:
            inputweight2 = int(raw_input('Insert weight'))  # input box for weight
            x2 * weightvalue == inputweight2
            return weightvalue, x2
The x1 and x2 are what I want to find with this function, so I want them to be totally "free". Is there any way I can test this function without having to insert values for x1 and x2?
def scale_length():  # Almost identical to scale_weight
    return

def scale_speed():  # Almost identical to scale_weight
    return

def criteria2(weight, length, speed):
    if count == 1:
        k2 = raw_input('Criteria two, write weight, length or speed.')
        if k2 == weight:
            count += 1
            scale_weight()
        elif k2 == length:
            count += 1
            scale_length()
        elif k2 == speed:
            count += 1
            scale_speed()
        else:
            return
Do you see any easier way to deal with this problem? (I hope I managed to explain it well enough. The way I have written the code so far is quite messy, but since I'm not that experienced I'll just have to make it work first, and then clean it up if I have the time.)
Since probably none of the values will fit exactly for both x-constants, I thought I'd use an approx_Equal function to deal with it:
def approx_Equal(x1, x2, tolerance=int(raw_input('Insert tolerance for scaling difference')), err_msg='Unacceptable tolerance', verbose=True):
    # Gives the approximation for how close the two values of x must be
    if x1 == x2:
        x = x1 + (x2 - x1) / 2
        return x
Eventually, I'd like a diagram of all the variables used, plus a link to the file and a name for each document.
Not sure how I will do this, so any tips are greatly appreciated.
Thanks!

In answer to the first question, "How come I only get one result when I am printing excelfiles()": this is because your return statement is within the nested loop, so the function stops on the first iteration. I would build up a list instead and then return that list; you can also combine this with the check on the file name, e.g.:
import os, fnmatch

# globals
start_dir = os.getenv('md')

def excelfiles(pattern):
    file_list = []
    for root, dirs, files in os.walk(start_dir):
        for filename in files:
            if fnmatch.fnmatch(filename.lower(), pattern):
                if filename.endswith(".xls") or filename.endswith(".xlsx") or filename.endswith(".xlsm"):
                    file_list.append(os.path.join(root, filename))
    return file_list

file_list = excelfiles('*cd*')
for i in file_list: print i
Obviously, you'll need to replace the cd with your own search text, but keep the * on either side, and replace start_dir with your own. I have done the match on filename.lower() and entered the search text in lower case to make the matching case-insensitive; just remove the .lower() if you don't want this. I have also allowed for other types of Excel files.
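Regarding the comparison itself: rather than treating x as a free unknown, you can divide each input by the stored value and check that the two ratios agree. A rough sketch (I'm guessing the numbers sit in the third column of the relevant rows, so adjust the cell_value(row, col) indices to your actual layout; 54.0 and 5000.0 are just example inputs):
import os, xlrd

def find_matches(file_list, inputlength, inputweight, tolerance=0.01):
    matches = []
    for path in file_list:
        sheet = xlrd.open_workbook(path).sheet_by_index(0)
        length = sheet.cell_value(1, 2)  # e.g. "Length: meters 27" -> 27
        weight = sheet.cell_value(2, 2)  # e.g. "Weight: kg 2500"   -> 2500
        x_len = inputlength / float(length)
        x_wgt = inputweight / float(weight)
        if abs(x_len - x_wgt) <= tolerance:  # one x fits both criteria
            matches.append((os.path.basename(path), x_len, path))
    return matches

for name, scale, link in find_matches(file_list, 54.0, 5000.0):
    print name, scale, link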
Regarding reading data from Excel files, I have done this before to create an automated way of converting basic Excel files into csv format. You are welcome to have a look at the code below and see if there is anything you can use from it. The xl_to_csv function is where the data is read from the Excel file:
import os, csv, sys, traceback, Tkinter, tkFileDialog as fd, xlrd

# stop tkinter shell from opening as it's only needed for the file dialog
root = Tkinter.Tk()
root.withdraw()

def format_date(dt):
    yyyy, mm, dd = str(dt[0]), str(dt[1]), str(dt[2])
    hh, mi, ss = str(dt[3]), str(dt[4]), str(dt[5])
    if len(mm) == 1:
        mm = '0'+mm
    if len(dd) == 1:
        dd = '0'+dd
    if hh == '0' and mi == '0' and ss == '0':
        datetime_str = dd+'/'+mm+'/'+yyyy
    else:
        if len(hh) == 1:
            hh = '0'+hh
        if len(mi) == 1:
            mi = '0'+mi
        if len(ss) == 1:
            ss = '0'+ss
        datetime_str = dd+'/'+mm+'/'+yyyy+' '+hh+':'+mi+':'+ss
    return datetime_str

def xl_to_csv(in_path, out_path):
    # set up vars to read file
    wb = xlrd.open_workbook(in_path)
    sh1 = wb.sheet_by_index(0)
    row_cnt, col_cnt = sh1.nrows, sh1.ncols
    # set up vars to write file
    fileout = open(out_path, 'wb')
    writer = csv.writer(fileout)
    # iterate through rows and cols
    for r in range(row_cnt):
        # make list from row data
        row = []
        for c in range(col_cnt):
            #print "...debug - sh1.cell(",r,c,").value set to:", sh1.cell(r,c).value
            #print "...debug - sh1.cell(",r,c,").ctype set to:", sh1.cell(r,c).ctype
            # check data type and make conversions
            val = sh1.cell(r,c).value
            if sh1.cell(r,c).ctype == 2:  # number data type
                if val == int(val):
                    val = int(val)  # convert to int if no decimal other than .0
                #print "...debug - res 1 (float to str), val set to:", val
            elif sh1.cell(r,c).ctype == 3:  # date fields
                dt = xlrd.xldate_as_tuple(val, 0)  # date number from excel to date tuple
                val = format_date(dt)
                #print "...debug - res 2 (date to str), val set to:", val
            elif sh1.cell(r,c).ctype == 4:  # boolean data types
                val = str(bool(val))  # convert 1 or 0 to bool true / false, then string
                #print "...debug - res 3 (bool to str), val set to:", val
            else:
                val = str(val)
                #print "...debug - else, val set to:", val
            row.append(val)
            #print ""
        # write row to csv file
        try:
            writer.writerow(row)
        except:
            print '...row failed in write to file:', row
            exc_type, exc_value, exc_traceback = sys.exc_info()
            lines = traceback.format_exception(exc_type, exc_value, exc_traceback)
            for line in lines:
                print '!!', line
    print 'Data written to:', out_path, '\n'

def main():
    in_path, out_path = None, None
    # set current working directory to user's my documents folder
    os.chdir(os.path.join(os.getenv('userprofile'),'documents'))
    # ask user for path to Excel file...
    while not in_path:
        print "Please select the excel file to read data from ..."
        try:
            in_path = fd.askopenfilename()
        except:
            print 'Error selecting file, please try again.\n'
    # get dir for output...
    same = raw_input("Do you want to write the output to the same directory? (Y/N): ")
    if same.upper() == 'Y':
        out_path = os.path.dirname(in_path)
    else:
        while not out_path:
            print "Please select a directory to write the csv file to ..."
            try:
                out_path = fd.askdirectory()
            except:
                print 'Error selecting file, please try again.\n'
    # get file name and join to dir
    f_name = os.path.basename(in_path)
    f_name = f_name[:f_name.find('.')]+'.csv'
    out_path = os.path.join(out_path,f_name)
    # get data from file and write to csv...
    print 'Attempting to read data from', in_path
    print ' and write csv data to', out_path, '...\n'
    xl_to_csv(in_path, out_path)
    v_open = raw_input("Open file (Y/N):").upper()
    if v_open == 'Y':
        os.startfile(out_path)
    sys.exit()

if __name__ == '__main__':
    main()
Let me know if you have any questions on this.
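One more note, on the approx_Equal function in your question: as written it only returns a value when x1 == x2 exactly, so the tolerance never comes into play (and taking it from raw_input in the default argument means you are prompted when the function is defined). A relative-difference check is probably closer to what you want, something like:
def approx_equal(x1, x2, tolerance=0.05):
    # Accept the pair if the two scale factors differ by no more than
    # `tolerance` relative to their average, and return that average.
    mid = (x1 + x2) / 2.0
    if mid != 0 and abs(x1 - x2) / abs(mid) <= tolerance:
        return mid
    return None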
Finally, regarding the output, I would consider writing it out to an html file in a table format. Let me know if you want any help with this; I have some more sample code that you could use part of.
UPDATE
Here is some further advice on writing your output to an html file. Below is a function that I have written and used previously for this purpose. Let me know if you need any guidance on what you would need to change for your implementation (if anything). The function expects a nested object in the data argument, e.g. a list of lists or a list of tuples, but should work for any number of rows / columns:
def write_html_file(path, data, heads):
    html = []
    tab_attr = ' border="1" cellpadding="3" style="background-color:#FAFCFF; text-align:right"'
    head_attr = ' style="background-color:#C0CFE2"'
    # opening lines needed for html table
    try:
        html.append('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" ')
        html.append('"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> ')
        html.append('<html xmlns="http://www.w3.org/1999/xhtml">')
        html.append('<body>')
        html.append(' <table'+tab_attr+'>')
    except:
        print 'Error setting up html heading data'
    # html table headings (if required)
    if headings_on:
        try:
            html.append(' <tr'+head_attr+'>')
            for item in heads:
                html.append(' '*6+'<th>'+str(item)+'</th>')
            html.append(' </tr>')
        except:
            exc_type, exc_value, exc_traceback = sys.exc_info()
            lines = traceback.format_exception(exc_type, exc_value, exc_traceback)
            print 'Error writing html table headings:'
            print ''.join('!! ' + line for line in lines)
    # html table content
    try:
        for row in data:
            html.append(' <tr>')
            for item in row:
                html.append(' '*6+'<td>'+str(item)+'</td>')
            html.append(' </tr>')
    except:
        print 'Error writing body of html data'
    # closing lines needed
    try:
        html.append(' </table>')
        html.append('</body>')
        html.append('</html>')
    except:
        print 'Error closing html data'
    # write html data to file
    fileout = open(path, 'w')
    for line in html:
        fileout.write(line)
    fileout.close()  # close before offering to open the file
    print 'Data written to:', path, '\n'
    if sql_path:
        os.startfile(path)
    else:
        v_open = raw_input("Open file (Y/N):").upper()
        if v_open == 'Y':
            os.startfile(path)
headings_on is a global that I have set to True in my script; sql_path is another global checked at the end (set it to something falsy to get the "Open file" prompt instead). You will also need to import traceback for the error handling to work as it is currently specified.
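For your results, the call might look like this (scale_results.html is just a made-up output name, and the rows are the example values from your question):
import sys, traceback

headings_on = True  # include the <th> header row
sql_path = None     # falsy, so the function prompts before opening the file

results = [('ModelA', 21.1, 'link_to_fileA'),
           ('ModelB', 0.78, 'link_to_fileB')]
write_html_file('scale_results.html', results, ['Model', 'scale', 'Link'])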

Related

Closing Python files while using cv2

Apologies in advance for the probably easy fix; I am a college student learning C++ and am using Python for the first time on a personal project.
I am writing a program that extracts the title from a media file inside a directory or subdirectory, then looks to see if there are any strings that match. If there are, it compares their resolutions and deletes the lower-resolution file. If they are both the same resolution, it deletes the larger file. All of it is working, with the exception of deleting files. When I try to, it throws an error saying the files are in use. After doing some research, I learned that this is because I have the files open inside the code, preventing them from being deleted. My problem is that I don't know what variable I need to close, or the appropriate way and location to do so.
import os
import cv2
import PTN
import json

array1 = [os.path.join(r, file) for r, d, f in os.walk("E:\Python Test Environment") for file in f]
for x in range(0, len(array1)):
    print(array1[x])

array2 = array1[:]  # The colon tells it to directly copy rather than do a link
for x in range(0, len(array2)):
    array2[x] = (json.dumps(PTN.parse(array2[x])))
    array2[x] = json.loads(array2[x])['title']
    head, array2[x] = os.path.split(array2[x])
    del head

y = len(array2)
for x in range(0, len(array2)):
    if array2[x] == "":
        break
    for i in range(x, y-1):  # Set to x+1 so that it does not compare against the current file
        i = x + 1
        if array2[i] == "":
            break
        if array2[x] == array2[i]:
            print 'Match found!'
            print array1[i]
            print 'Matches: '
            print array1[x]
            with open(array1[x]) as f:  # tried to include this to prevent error, doesn't seem to stop it
                capture1 = cv2.VideoCapture(array1[x])  # Open the video
                ret, frame = capture1.read()  # Read the first frame
                resolution1 = frame.shape  # Get resolution
                f.close()
            with open(array1[i]) as f:  # tried to include this to prevent error, doesn't seem to stop it
                capture2 = cv2.VideoCapture(array1[i])  # Open the video
                ret, frame = capture2.read()  # Read the first frame
                resolution2 = frame.shape  # Get resolution
                f.close()
            if resolution1 > resolution2:
                print array1[x]
                print "Is higher resolution than"
                print array1[i]
                print "Would delete: "
                print array1[i]
                os.remove(array1[i])
                array1[i] = ""
                array2[i] = ""
            if resolution2 > resolution1:
                print array1[i]
                print "Is higher resolution than"
                print array1[x]
                print "Would delete: "
                print array1[i]
                os.remove(array1[x])
                array1[x] = ""
                array2[x] = ""
            if resolution1 == resolution2:
                print "equal"
                if os.path.getsize(array1[x]) <= os.path.getsize(array1[i]):
                    print "Would delete: "
                    print array1[i]
                    os.remove(array1[i])
                    array1[i] = ""
                    array2[i] = ""
                if os.path.getsize(array1[i]) < os.path.getsize(array1[x]):
                    print "Would delete: "
                    print array1[x]
                    os.remove(array1[x])
                    array1[x] = ""
                    array2[x] = ""
Add capture1.release() and capture2.release() to release the resources used by the VideoCapture instances.
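For example, right after grabbing the frame data (the with open(...) wrappers are already closed by the time the deletes run, so they can be dropped; it is the VideoCapture objects that still hold the files open):
capture1 = cv2.VideoCapture(array1[x])  # open the video
ret, frame = capture1.read()            # read the first frame
resolution1 = frame.shape               # get resolution
capture1.release()                      # release the handle so os.remove() can delete the file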

Quick and dirty duplicate finder based on size and last write time only [duplicate]

This question already has answers here:
Finding duplicate files and removing them
(10 answers)
Closed 5 years ago.
Is there simple and fast Python code to identify duplicate files in a directory tree based on file size and last write time only? (A couple of false positives are OK. Forget hashing; it's too slow, and it's not needed for an initial ID of potential real dups.)
S/O abounds with similar questions, but they tend to utilize md5 or byte-by-byte comparison.
Any suggestions? Or do I need to run the code below and compare to find dup lines in the first two columns (and maybe run a hash only on the ones with matching LWT and size)?
import os, time

def get_size(filename):
    st = os.stat(filename)
    return str(st.st_size)

def get_last_write_time(filename):
    st = os.stat(filename)
    convert_time_to_human_readable = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(st.st_mtime))
    return convert_time_to_human_readable
LOL! That's my code! :)))))))
Try this (LAST UPDATE):
import os, hashlib, time

your_target_folder = "."  # change to your target folder.

def size_check(get_path):
    try:
        st = os.stat(get_path)
    except:
        return "Error"
    else:
        return str(st.st_size)

def md5_check(get_path):
    try:
        hash_md5 = hashlib.md5()
        with open(get_path, "rb") as f:
            for chunk in iter(lambda: f.read(2 ** 20), b""):
                hash_md5.update(chunk)
    except:
        return "Error"
    else:
        return hash_md5.hexdigest()

def save_data(get_output):
    with open("./data.txt", 'a') as output_data:
        output_data.write(get_output)

print("Walking all files in your target directory and grabbing their hashes, please wait ...\n")
files_and_sizes = {}
for dirpath, _, filenames in os.walk(your_target_folder):
    for items in filenames:
        file_full_path = os.path.abspath(os.path.join(dirpath, items))
        get_size = size_check(file_full_path)
        if get_size in files_and_sizes:
            files_and_sizes[get_size].append(file_full_path)
        else:
            files_and_sizes[get_size] = [file_full_path]

new_dict = {}
error_box = []
for key, box_name in files_and_sizes.items():
    if not key == "Error" and len(box_name) > 1:
        for files in box_name:
            get_file_hash = md5_check(files)
            if not get_file_hash == "Error":
                if get_file_hash in new_dict:
                    new_dict[get_file_hash].append(files)
                else:
                    new_dict[get_file_hash] = [files]
            else:
                error_box.append(files)
    elif key == "Error" and len(box_name) > 0:
        do = [error_box.append(error_files) for error_files in box_name]
    else:
        pass

for hashes, names in new_dict.items():
    if len(names) > 1:
        for each_files in names:
            result = each_files + "\n"
            print(result)
            save_data(result)
    else:
        pass

if len(error_box) > 0:
    print("Something went wrong on these files (I could not access them): " + str(error_box) + "\n")
print("Good bye.")
Good Luck...
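If you really do want to skip hashing entirely and just flag potential duplicates from file size and last write time, as the question asks, grouping paths by the (size, mtime) pair is enough. A minimal sketch along the lines of the two helper functions above (false positives are possible, as noted):
import os, time
from collections import defaultdict

def potential_dups(root):
    groups = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # unreadable file, skip it
            mtime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(st.st_mtime))
            groups[(st.st_size, mtime)].append(path)
    # only (size, mtime) pairs shared by more than one file are candidates
    return {key: paths for key, paths in groups.items() if len(paths) > 1}

for key, paths in potential_dups(".").items():
    print(key, paths)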

How do I write a python macro in libreoffice calc to cope with merged cells when inserting external data

The premise: I am working in LibreOffice Calc and need to send an instruction to another program that I know to be listening on a TCP port, via a Python macro.
I am expecting a list of invoice line data from the listening program and want to insert the lines into the LibreOffice spreadsheet, which may or may not have merged cells.
Having been helped many times over by searching Stack Overflow, I thought that I would post a solution to a problem which took much effort to resolve.
The code splits the data into lines, and each line into data items delimited (by the sending program) with tabs. The data is inserted starting from the cell in which the cursor is presently positioned. Each subsequent data item is inserted into the next column, and each subsequent line of data increments the row for the next set of inserts.
Finding the merged cell "range" was a particularly difficult thing to discover how to do, and I have not found it documented elsewhere.
Finally, each data item is tested to see if it should be inserted as a numeric or as text; this is vital if you wish the spreadsheet to perform calculations on the inserted data.
The last line of data is marked with the word "END". This final line of data contains, in this example, an invoice number (at position 1) and the specific cell name (at position 4) into which it should be put. If there is an error, the data is written into the next row down as text so the user can cut and paste it.
ConfigObj is a package that reads parameters from a flat file. In this example, I am using that file to store the TCP port to be used. Both the listening program and this code read the port number from the same configuration file. It could have been hard-coded.
Here is a Python macro that works for me; I trust that it will point others in the right direction.
def fs2InvoiceLinesCalc(*args):
    desktop = XSCRIPTCONTEXT.getDesktop()
    model = desktop.getCurrentComponent()
    try:
        sheets = model.getSheets()
    except AttributeError:
        raise Exception("This script is for Calc Spreadsheets only")
    # sheet = sheets.getByName('Sheet1')
    sheet = model.CurrentController.getActiveSheet()
    oSelection = model.getCurrentSelection()
    oArea = oSelection.getRangeAddress()
    first_row = oArea.StartRow
    last_row = oArea.EndRow
    first_col = oArea.StartColumn
    last_col = oArea.EndColumn
    # get the string from Footswitch2 via a TCP port
    import os, socket, time
    from configobj import ConfigObj
    configuration_dir = os.environ["HOME"]
    config_filename = configuration_dir + "/fs2.cfg"
    if os.access(config_filename, os.R_OK):
        pass
    else:
        return None
    cfg = ConfigObj(config_filename)
    # define values to use from the configuration file
    tcp_port = int(cfg["control"]["TCP_PORT"])
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(0.5)
    try:
        sock.connect(("localhost", tcp_port))
    except:
        return None
    sock.settimeout(10)
    try:
        sock.send(bytes('invoice\n', 'UTF-8'))
    except:
        return None
    try:
        time.sleep(1.0)
        s_list = sock.recv(4096).decode('UTF-8')
        s_list = s_list.split("\n")
    except:
        return None
    lines_in_response = len(s_list)
    if lines_in_response is None:
        return None
    column = ['A','B','C','D','E','F','G','H','I','J','K','L','M',
              'N','O','P','Q','R','S','T','U','V','W','X','Y','Z']
    # merged rows are cumulative
    master_row_merge_adj = 0
    for x in range(0, lines_in_response):
        if s_list[x].startswith("END"):
            break
        row_merge_adj = master_row_merge_adj
        insert_table = s_list[x].split("\t")
        if s_list[x] == "":
            continue
        parts = len(insert_table)
        # merged columns are a simple adjustment for each item within x
        column_merge_adj = 0
        row_merge_done = 0
        for y in range(0, parts):
            it = insert_table[y]
            cell_name = column[first_col + y + column_merge_adj] + str(x + 1 + first_row + row_merge_adj)
            cell = sheet.getCellRangeByName(cell_name)
            if cell.getIsMerged():
                cellcursor = sheet.createCursorByRange(cell)
                cellcursor.collapseToMergedArea()
                try:
                    # format of AbsoluteName is $Sheet1.$A$1:$D$2 for a merged cell of A1:D2
                    a, b, cell_range = cellcursor.AbsoluteName.partition(".")
                    start_cell, end_cell = cell_range.split(":")
                    a, start_col, start_row = start_cell.split("$")
                    a, end_col, end_row = end_cell.split("$")
                    column_merge_adj = column_merge_adj + (int(column.index(end_col)) - int(column.index(start_col)))
                    # merged rows are cumulative over each x;
                    # the merged row increment should only occur once within each x
                    # or data will not be in the top left of the merged cell
                    if row_merge_done == 0:
                        master_row_merge_adj = row_merge_adj + (int(end_row) - int(start_row))
                        row_merge_done = 1
                except:
                    # unable to compute - insert data off to the right so it's available for cut and paste
                    column_merge_adj = 10
            try:
                float(it)
                ins_numeric = True
            except:
                ins_numeric = False
            if ins_numeric:
                cell.Value = it
            else:
                cell.String = it
    if s_list[x].startswith("END"):
        insert_table = s_list[x].split("\t")
        try:
            invno = int(insert_table[1])
            cell_name = insert_table[4]
        except:
            pass
        try:
            cell = sheet.getCellRangeByName(cell_name)
            cell.Value = invno
        except:
            # the cell_name passed for the invoice number is incorrect; attempt to
            # insert it in the next row, first selected column
            passed_cell_name = cell_name
            cell_name = column[first_col] + str(x + 2 + first_row + row_merge_adj)
            cell = sheet.getCellRangeByName(cell_name)
            insert_text = "Invoice Number " + str(invno) + " Pos " + passed_cell_name + " Incorrect"
            cell.String = insert_text
    sock.close()
    return None
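The part that was hard to find documented, detecting a merged cell and recovering its full range, can be isolated from the macro above into a few lines (assuming a sheet object as in the macro, with "A1" as an example cell):
cell = sheet.getCellRangeByName("A1")
if cell.getIsMerged():
    cursor = sheet.createCursorByRange(cell)
    cursor.collapseToMergedArea()
    # AbsoluteName is e.g. "$Sheet1.$A$1:$D$2" for a merged area A1:D2
    _, _, cell_range = cursor.AbsoluteName.partition(".")
    start_cell, end_cell = cell_range.split(":")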

Writing to file limitation on line length

I've been trying to write lines to a file based on specific file names from the same directory, a search for those file names in another log file (given as an input), and the modified date of the files.
The output is limited to under 80 characters per line.
import mmap, os, sys, time

def getFiles(flag, file):
    if flag == True:
        file_version = open(file)
        if file_version:
            s = mmap.mmap(file_version.fileno(), 0, access=mmap.ACCESS_READ)
            file_version.close()
    file = open('AllModules.txt', 'wb')
    for i, values in dict.items():
        # search keys in version file
        if flag == True:
            index = s.find(bytes(i))
            if index > 0:
                s.seek(index + len(i) + 1)
                m = s.readline()
                line_new = '{:>0} {:>12} {:>12}'.format(i, m, values)
                file.write(line_new)
                s.seek(0)
        else:
            file.write(i + '\n')
    file.close()

if __name__ == '__main__':
    dict = {}
    for file in os.listdir(os.getcwd()):
        if os.path.splitext(file)[1] == '.psw' or os.path.splitext(file)[1] == '.pkw':
            time.ctime(os.path.getmtime(file))
            dict.update({str(os.path.splitext(file)[0]).upper(): time.strftime('%d/%m/%y')})
    if len(sys.argv) > 1:
        if os.path.exists(sys.argv[1]):
            getFiles(True, sys.argv[1])
    else:
        getFiles(False, None)
The output is always like:
BW_LIB_INCL 13.1 rev. 259 [20140425 16:28]
16/05/14
The interpretation of the data is correct, but the formatting is not, as the time is put on the next line (not on the same one).
This is happening on all the lines of my new file.
Could someone give me a hint?
m = s.readline() has a \n at the end of the line. Then you're doing .format(i, m, values), which puts that newline in the middle of the string.
I leave it as an exercise to the reader to find out what happens when you write such a line to a file. :-)
(hint: m = s.readline().rstrip('\n'))
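Applied to the loop in the question, that becomes (with the newline moved to the end of the formatted line instead):
index = s.find(bytes(i))
if index > 0:
    s.seek(index + len(i) + 1)
    m = s.readline().rstrip('\n')  # drop the newline that mmap's readline keeps
    line_new = '{:>0} {:>12} {:>12}\n'.format(i, m, values)
    file.write(line_new)
    s.seek(0)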

Searching a file for matches between two values and outputting search hits in Python

I am (attempting) to write a program that searches through a hex file for instances of a hex string between two values, e.g. between D4135B and D414AC, incrementing from the first value until the second is reached: D4135B, D4135C, D4135D, etc.
I have managed to get it to increment, but it's the search part I am having trouble with.
This is the code I have so far. It's been cobbled together from other places, and I need to make it somehow output all search hits into the output file (file_out).
I have exceeded the limit of my Python understanding and I'm sure there's probably a much easier way of doing this. I would be very grateful for any help.
def search_process(hx):  # searching for two binary strings
    global FLAG
    while threeByteHexPlusOne != threeByteHex2:  # keep incrementing until second value reached
        if FLAG:
            if hx.find(threeByteHex2) != -1:
                FLAG = False  # if threeByteHex = threeByteHexPlusOne, end search
                print("Reached the end of the search", hx.find(threeByteHexPlusOne))
        else:
            if hx.find(threeByteHexPlusOne) != -1:
                FLAG = True
    return -1  # if no results found

if __name__ == '__main__':
    try:
        file_in = open(FILE_IN, "r")    # opening input file
        file_out = open(FILE_OUT, 'w')  # opening output file
        hx_read = file_in.read()        # read from input file
        tmp = ''
        found = ''
        while hx_read:  # reading from file till file is empty
            hx_read = tmp + hx_read
            pos = search_process(hx_read)
            while pos != -1:
                hx_read = hx_read[pos:]
                if FLAG:
                    found = found + hx_read
                pos = search_process(hx_read)
            tmp = bytes_read[]
            hx_read = file_in.read()
        file_out.write(found)  # writing to output file
    except IOError:
        print('FILE NOT FOUND!!! Check your filename or directory/PATH')
Here's a program that looks through a hex string from a file 3 bytes at a time and, if the 3-byte hex string is between the given hex bounds, writes it to another file. It makes use of generators to make getting the bytes from the hex string a little cleaner.
import base64
import sys

_usage_string = 'Usage: python {} <input_file> <output_file>'.format(sys.argv[0])

def _to_base_10_int(value):
    return int(value, 16)

def get_bytes(hex_str):
    # Two characters equals one byte
    for i in range(0, len(hex_str), 2):
        yield hex_str[i:i+2]

def get_three_byte_hexes(hex_str):
    bytes = get_bytes(hex_str)
    while True:
        try:
            three_byte_hex = next(bytes) + next(bytes) + next(bytes)
        except StopIteration:
            break
        yield three_byte_hex

def find_hexes_in_range(hex_str, lower_bound_hex, upper_bound_hex):
    lower_bound = _to_base_10_int(lower_bound_hex)
    upper_bound = _to_base_10_int(upper_bound_hex)
    found = []
    for three_byte_hex in get_three_byte_hexes(hex_str):
        hex_value = _to_base_10_int(three_byte_hex)
        if lower_bound <= hex_value < upper_bound:
            found.append(three_byte_hex)
    return found

if __name__ == "__main__":
    try:
        assert(len(sys.argv) == 3)
    except AssertionError:
        print _usage_string
        sys.exit(2)
    file_contents = open(sys.argv[1], 'rb').read()
    hex_str = base64.decodestring(file_contents).encode('hex')
    found = find_hexes_in_range(hex_str, 'D4135B', 'D414AC')
    print('Found:')
    print(found)
    if found:
        with open(sys.argv[2], 'wb') as fout:
            for _hex in found:
                fout.write(_hex)
Check out the Python documentation for more info on generators.
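To see what the helpers produce, here's a quick interactive check on a small hex string (note that the upper bound is exclusive as written):
>>> list(get_three_byte_hexes('D4135BD414AC'))
['D4135B', 'D414AC']
>>> find_hexes_in_range('D4135BD414AC', 'D4135B', 'D414AC')
['D4135B']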
