Version strings in python using patterns - python

I wrote some code to version names in Python. The idea is to append _v1, _v2, ... if a name already exists in a list. I tried the following code:
import pandas as pd
list_names = pd.Series(['name_1', 'name_1_v1'])
name = 'name_1'
new_name = name
i = 1
while list_names.str.contains(new_name).any() == True:
    new_name = f'{name}_v{i}'
    if list_names.str.contains(new_name).any() == False:
        break
    i = i + 1
It works fine when I input 'name_1' (output: 'name_1_v2'). However, when I enter 'name_1_v1', the output is 'name_1_v1_v1' (the correct output would be 'name_1_v2'). I thought of using a regex with the pattern _v[0-9]$, but I wasn't able to make it work.
<<< edit >>>
Output should be new_name = 'name_1_v2'. The idea is to find an adequate versioned name, not change the ones in the list.

Proposed code:
import pandas as pd
import re
basename = 'name_1'
def new_version(lnam, basename):
    lat_v = 0
    # look for the latest version suffix (_v<number> at the end of a name)
    for nam in lnam:
        m = re.search(r'_v(\d+)$', nam)
        if m is not None:
            lat_v = max(int(m.group(1)), lat_v)
    # with no versioned name in the list, lat_v stays 0 and the result is _v1
    return basename + '_v%s' % (lat_v + 1)
lnam = pd.Series(['name_1'])
new_name = new_version(lnam, basename)
print("new_name : ", new_name)
# new_name : name_1_v1
lnam = pd.Series(['name_1', 'name_1_v1'])
new_name = new_version(lnam, basename)
print("new_name : ", new_name)
# new_name : name_1_v2
Result :
new_name : name_1_v2
Let's try now with an unordered list of names (next version is 101) :
lnam = pd.Series(['name_1', 'name_1_v4', 'name_1_v100', 'name_1_v12', 'name_1_v17'])
new_name = new_version(lnam, basename)
print("new_name : ", new_name)
# new_name : name_1_v101
Automatic identification of the basename (as @FernandoQuintino suggests):
basename = re.sub(r'_v\d+$', '', basename)
# name_1
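The two ideas above (stripping an existing suffix, then looking for the highest version in use) can be combined in one function. This is only a sketch: the name next_versioned_name is made up for illustration, it works on any iterable of strings (a plain list or a pandas Series), and it assumes version suffixes always have the form _v followed by digits at the end of the name.

```python
import re

def next_versioned_name(existing, name):
    """Return '<base>_v<N>' with N one above the highest version in use."""
    # Strip a trailing _v<digits> so 'name_1_v1' and 'name_1' share one base.
    base = re.sub(r'_v\d+$', '', name)
    # Only names made of exactly this base plus a version suffix are counted.
    pattern = re.compile(re.escape(base) + r'_v(\d+)$')
    versions = [int(m.group(1)) for m in map(pattern.match, existing) if m]
    return f'{base}_v{max(versions, default=0) + 1}'

names = ['name_1', 'name_1_v1']
print(next_versioned_name(names, 'name_1'))     # name_1_v2
print(next_versioned_name(names, 'name_1_v1'))  # name_1_v2
```

Unlike str.contains, the compiled pattern is anchored on both sides, so a name such as 'name_10_v7' does not count as a version of 'name_1'.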

Related

Add leading zeros to filename

I have a folder with more than 1,000 files that's updated constantly. I'm using a script to add a random number to each file name, based on the total number of files, like this:
Before
file_a
file_b
After
1_file_a
2_file_b
I would like to add leading zeros so that the files are sorted correctly. Like this:
0001_file_a
0010_file_b
0100_file_c
Here's the random number script:
import os
import random
used_random = []
os.chdir('c:/test')
for filename in os.listdir():
    n = random.randint(1, len(os.listdir()))
    while n in used_random:
        n = random.randint(1, len(os.listdir()))
    used_random.append(n)
    os.rename(filename, f"{n}_{filename}")
I would suggest using f-strings to accomplish this.
>>> num = 2
>>> f"{num:04}_file"
'0002_file'
>>> num = 123
>>> f"{num:04}_file"
'0123_file'
I would also replace the following with a list comprehension.
cleaned_files = []
for item in folder_files:
    if item[0] == '.' or item[0] == '_':
        pass
    else:
        cleaned_files.append(item)
cleaned_files = [item for item in folder_files if item[0] not in ('.', '_')]
You should use the first element of the list obtained after the split:
def getFiles(files):
    for file in files:
        # split only on the first underscore, since the rest of the name may contain more
        file_number, file_end = file.split('_', 1)
        num = file_number.zfill(4) # num is 4 characters long with leading zeros
        new_file = "{}_{}".format(num, file_end)
        # rename or store the new file name for later rename
Something like this should work ... I hope this helps ...
import re
import glob
import os
import shutil
os.chdir('/tmp') # I played in the /tmp directory
for filename in glob.glob('[0-9]*_file_*'):
    m = re.match(r'(^[0-9]+)(_.*)$', filename)
    if m:
        num = f"{int(m.group(1)):04}" # e.g. 23: convert to int and then format
        name = m.group(2) # the rest of the name e.g. _file_a
        new_filename = num + name # 0023_file_a
        print(filename + " " + new_filename)
        # Not sure if you would like to rename the files; if yes:
        # shutil.move(filename, new_filename)
Thanks to user https://stackoverflow.com/users/15261315/chris I updated the random number script to add leading zeros:
import os
import random
used_random = []
os.chdir('c:/Test')
for filename in os.listdir():
    n = random.randint(1, len(os.listdir()))
    while n in used_random:
        n = random.randint(1, len(os.listdir()))
    used_random.append(n)
    os.rename(filename, f"{n:04}_{filename}")
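A fixed width of 4 stops sorting correctly once the folder holds 10,000 files or more, so the width can instead be derived from the file count. The sketch below is one possible variation, not the script from the question: numbered_names is a hypothetical helper that only builds the new names (it does not touch the file system), and it replaces the retry loop with a shuffled range so that every number is used exactly once.

```python
import random

def numbered_names(filenames, rng=random):
    """Pair each name with a unique random number, zero-padded to equal width."""
    # A shuffled range gives every file a distinct number without retrying.
    numbers = list(range(1, len(filenames) + 1))
    rng.shuffle(numbers)
    # Pad to the number of digits of the largest number (4 for 1000 to 9999 files).
    width = len(str(len(filenames)))
    return [f"{n:0{width}}_{name}" for n, name in zip(numbers, filenames)]

print(numbered_names(['file_a', 'file_b', 'file_c']))
```

The returned names are in the same order as the input, so they can be zipped with the original names and passed to os.rename.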

Comparing part of a string within a list

I have a list of strings:
fileList = ['YMML.2019.09.10-Run.1-Final.pdf',
'YMML.2019.09.10-Run.2-Initial.pdf',
'YMML.2019.09.11-Run.2-Initial.pdf',
'YMML.2019.09.11-Run.1-Final.pdf',
'YMML.2019.09.12-Run.2-Initial.pdf',
'YMML.2019.09.13-Run.2-Initial.pdf',
'YMML.2019.09.12-Run.1-Final.pdf',
'YMML.2019.09.13-Run.1-Final.pdf',
'YMML.2019.09.14-Run.1-Final.pdf',]
and I'd like to confirm that there is both a Run.1-Final and Run.2-Initial for each date.
I've tried something like:
for i in range(len(directoryList)):
    if directoryList[i][5:15] != directoryList[i + 1][5:15]:
        print(directoryList[i] + ' is missing.')
    i += 2
and I'd like the output to be
'YMML.2019.09.14-Run.2-Initial.pdf is missing'
Perhaps something like
dates = [directoryList[i][5:15] for i in range(len(directoryList))]
counter = collections.Counter(dates)
But then I'm having trouble extracting the dates from the counter.
To make it more readable, you could create a list of dates first, then loop over those.
file_list = ['YMML.2019.09.10-Run.1-Final.pdf',
'YMML.2019.09.10-Run.2-Initial.pdf',
'YMML.2019.09.11-Run.2-Initial.pdf',
'YMML.2019.09.11-Run.1-Final.pdf',
'YMML.2019.09.12-Run.2-Initial.pdf',
'YMML.2019.09.13-Run.2-Initial.pdf',
'YMML.2019.09.12-Run.1-Final.pdf',
'YMML.2019.09.13-Run.1-Final.pdf',
'YMML.2019.09.14-Run.1-Final.pdf',]
dates = set([item[5:15] for item in file_list])
for date in dates:
    if 'YMML.' + date + '-Run.1-Final.pdf' not in file_list:
        print('YMML.' + date + '-Run.1-Final.pdf is missing')
    if 'YMML.' + date + '-Run.2-Initial.pdf' not in file_list:
        print('YMML.' + date + '-Run.2-Initial.pdf is missing')
set() takes the unique values in the list to avoid looping through them all twice.
I'm kind of late, but here's what I found to be the simplest way, though maybe not the most efficient:
for file in fileList:
    if file[20:27] == "1-Final":
        if (file[0:20] + "2-Initial.pdf") not in fileList:
            print(file)
    # compare with ==, not "is", and slice from index 20 where the run number starts
    elif file[20:] == "2-Initial.pdf":
        if (file[0:20] + "1-Final.pdf") not in fileList:
            print(file)
Here's an O(n) solution which collects items into a defaultdict by date, then filters on quantity seen, restoring original names from the remaining value:
from collections import defaultdict
files = [
'YMML.2019.09.10-Run.1-Final.pdf',
'YMML.2019.09.10-Run.2-Initial.pdf',
'YMML.2019.09.11-Run.2-Initial.pdf',
'YMML.2019.09.11-Run.1-Final.pdf',
'YMML.2019.09.12-Run.2-Initial.pdf',
'YMML.2019.09.13-Run.2-Initial.pdf',
'YMML.2019.09.12-Run.1-Final.pdf',
'YMML.2019.09.13-Run.1-Final.pdf',
'YMML.2019.09.14-Run.1-Final.pdf',
]
seen = defaultdict(list)
for x in files:
    seen[x[5:15]].append(x)
missing = [v[0] for k, v in seen.items() if len(v) < 2]
print(missing) # => ['YMML.2019.09.14-Run.1-Final.pdf']
Getting names of partners can be done with a conditional:
names = [
    x[:20] + "2-Initial.pdf" if x[20] == "1" else
    x[:20] + "1-Final.pdf" for x in missing
]
print(names) # => ['YMML.2019.09.14-Run.2-Initial.pdf']
This works:
fileList = ['YMML.2019.09.10-Run.1-Final.pdf',
'YMML.2019.09.10-Run.2-Initial.pdf',
'YMML.2019.09.11-Run.2-Initial.pdf',
'YMML.2019.09.11-Run.1-Final.pdf',
'YMML.2019.09.12-Run.2-Initial.pdf',
'YMML.2019.09.13-Run.2-Initial.pdf',
'YMML.2019.09.12-Run.1-Final.pdf',
'YMML.2019.09.13-Run.1-Final.pdf',
'YMML.2019.09.14-Run.1-Final.pdf',]
initial_set = {filename[:15] for filename in fileList if 'Initial' in filename}
final_set = {filename[:15] for filename in fileList if 'Final' in filename}
for filename in final_set - initial_set:
    print(filename + '-Run.2-Initial.pdf is missing.')
for filename in initial_set - final_set:
    print(filename + '-Run.1-Final.pdf is missing.')
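The answers above hard-code the two run names. If more runs per date are ever expected, the same check can be driven by a tuple of expected suffixes. This is a sketch under assumptions: the names EXPECTED and missing_files are made up, and it relies on the first hyphen in each file name separating the 'YMML.<date>' prefix from the run part, as in the sample list.

```python
from collections import defaultdict

EXPECTED = ('Run.1-Final.pdf', 'Run.2-Initial.pdf')  # suffixes every date should have

def missing_files(file_list, expected=EXPECTED):
    """Return the expected file names that are absent, sorted by name."""
    found = defaultdict(set)
    for name in file_list:
        # 'YMML.2019.09.14-Run.1-Final.pdf' -> ('YMML.2019.09.14', 'Run.1-Final.pdf')
        prefix, _, suffix = name.partition('-')
        found[prefix].add(suffix)
    return sorted(f'{prefix}-{suffix}'
                  for prefix, suffixes in found.items()
                  for suffix in expected if suffix not in suffixes)

print(missing_files(['YMML.2019.09.14-Run.1-Final.pdf']))
# ['YMML.2019.09.14-Run.2-Initial.pdf']
```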

Python - excluding tags from a list of a different tags

I have a python script for Editorial on iOS that I've modified, and I would like help tweaking it further.
I have .taskpaper files in a Dropbox folder that Editorial is pointed at. When I run this workflow, the script searches all the files and returns a list of lines that include "#hardware". This is working well, but the final list includes items with #hardware that I've finished and marked with #done. How can I exclude #hardware lines that also contain #done?
There are seven files that run. These two seem to be the ones that need to be modified:
Generate the list of hashtags
import editor
import console
import os
import re
import sys
import codecs
import workflow
pattern = re.compile(r'\s#{1}(\w+)', re.I|re.U)
p = editor.get_path()
from urllib import quote
dir = os.path.split(p)[0]
valid_extensions = set(['.taskpaper'])
tags = ['#hardware']
for w in os.walk(dir):
    dir_path = w[0]
    filenames = w[2]
    for name in filenames:
        full_path = os.path.join(dir_path, name)
        ext = os.path.splitext(full_path)[1]
        if ext.lower() in valid_extensions:
            try:
                with codecs.open(full_path, 'r', 'utf-8') as f:
                    for line in f:
                        for match in re.finditer(pattern, line):
                            tags.append(match.group(1))
            except UnicodeDecodeError, e:
                pass
workflow.set_output('\n'.join(sorted(set(tags))))
and
Search documents with hashtags
import editor
import console
import os
import re
import sys
import codecs
import workflow
from StringIO import StringIO
theme = editor.get_theme()
workflow.set_variable('CSS', workflow.get_variable('CSS Dark' if theme == 'Dark' else 'CSS Light'))
p = editor.get_path()
searchterm = workflow.get_variable('Search Term')
term = '#' + searchterm
pattern = re.compile(re.escape(term), flags=re.IGNORECASE)
from urllib import quote
dir = os.path.split(p)[0]
valid_extensions = set(['.taskpaper'])
html = StringIO()
match_count = 0
for w in os.walk(dir):
    dir_path = w[0]
    filenames = w[2]
    for name in filenames:
        full_path = os.path.join(dir_path, name)
        ext = os.path.splitext(full_path)[1]
        if ext.lower() not in valid_extensions:
            continue
        found_snippets = []
        i = 0
        try:
            with codecs.open(full_path, 'r', 'utf-8') as f:
                for line in f:
                    for match in re.finditer(pattern, line):
                        start = max(0, match.start(0) - 100)
                        end = min(len(line)-1, match.end(0) + 100)
                        snippet = (line[start:match.start(0)],
                                   match.group(0),
                                   line[match.end(0):end],
                                   match.start(0) + i,
                                   match.end(0) + i)
                        found_snippets.append(snippet)
                    i += len(line)
        except UnicodeDecodeError, e:
            pass
        if len(found_snippets) > 0:
            match_count += 1
            root, rel_path = editor.to_relative_path(full_path)
            ed_url = 'editorial://open/' + quote(rel_path.encode('utf-8')) + '?root=' + root
            html.write('<h2>' + name + '</h2>')
            for snippet in found_snippets:
                start = snippet[3]
                end = snippet[4]
                select_url = 'editorial://open/' + quote(rel_path.encode('utf-8')) + '?root=' + root
                select_url += '&selection=' + str(start) + '-' + str(end)
                html.write('<a class="result-box" href="' + select_url + '">' + snippet[0] + '<span class="highlight">' + snippet[1] + '</span>' + snippet[2] + '</a>')
if match_count == 0:
    html.write('<p>No matches found.</p>')
workflow.set_output(html.getvalue())
Thank you.
Since the matching lines are stored in a list, you can use a list comprehension to exclude the ones you don't want. Something like this:
l = ['#hardware ttuff', 'stuff #hardware', 'things #hardware sett #done', '#hardware', '#hardware# #done']
print(l)
['#hardware ttuff', 'stuff #hardware', 'things #hardware sett #done', '#hardware', '#hardware# #done']
m = [ s for s in l if '#done' not in s]
print(m)
['#hardware ttuff', 'stuff #hardware', '#hardware']
A friend solved it for me.
We added:
if not "#done" in line:
in the "Search documents with hashtags" file after
for line in f:
Works great
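For anyone who wants to try the filter outside Editorial, the same condition can be checked on plain strings. This is a minimal sketch: open_tagged_lines and the sample lines are invented for illustration, and the comparison is case-insensitive to mirror the re.IGNORECASE flag in the search script.

```python
def open_tagged_lines(lines, tag='#hardware', done_tag='#done'):
    """Keep lines that contain `tag` but not `done_tag` (case-insensitive)."""
    kept = []
    for line in lines:
        lowered = line.lower()
        if tag.lower() in lowered and done_tag.lower() not in lowered:
            kept.append(line.rstrip('\n'))
    return kept

sample = ['- buy cable #hardware\n', '- fix fan #hardware #done\n', '- call Bob #phone\n']
print(open_tagged_lines(sample))  # ['- buy cable #hardware']
```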

How to read .evtx file using python?

Does anyone know how to read the event log files with the .evtx extension in C:\Windows\System32\winevt\Logs?
I have already tried to open one in Notepad and to read it using Python, but Notepad says access is denied.
Does anyone know how to do it? Thanks in advance.
This is how you would read the file "Forwarded Events" from the event viewer. You need admin access, so I would run it as admin, but it will prompt you for a password if you don't.
import win32evtlog
import xml.etree.ElementTree as ET
import ctypes
import sys
def is_admin():
    try:
        return ctypes.windll.shell32.IsUserAnAdmin()
    except Exception:
        return False
if is_admin():
    # open event file (raw string so the backslashes are not treated as escapes)
    query_handle = win32evtlog.EvtQuery(
        r'C:\Windows\System32\winevt\Logs\ForwardedEvents.evtx',
        win32evtlog.EvtQueryFilePath)
    read_count = 0
    a = 1
    # as written, the loop body runs once; use "while True:" to read every record
    while a == 1:
        a += 1
        # read 1 record(s)
        events = win32evtlog.EvtNext(query_handle, 1)
        read_count += len(events)
        # if there is no record break the loop
        if len(events) == 0:
            break
        for event in events:
            xml_content = win32evtlog.EvtRender(event, win32evtlog.EvtRenderEventXml)
            # parse xml content
            xml = ET.fromstring(xml_content)
            # xml namespace, root element has a xmlns definition, so we have to use the namespace
            ns = '{http://schemas.microsoft.com/win/2004/08/events/event}'
            # positional lookup: tenth element inside the second child of the root
            substatus = xml[1][9].text
            event_id = xml.find(f'.//{ns}EventID').text
            computer = xml.find(f'.//{ns}Computer').text
            channel = xml.find(f'.//{ns}Channel').text
            execution = xml.find(f'.//{ns}Execution')
            process_id = execution.get('ProcessID')
            thread_id = execution.get('ThreadID')
            time_created = xml.find(f'.//{ns}TimeCreated').get('SystemTime')
            #data_name = xml.findall('.//EventData')
            #substatus = data_name.get('Data')
            #print(substatus)
            event_data = f'Time: {time_created}, Computer: {computer}, Substatus: {substatus}, Event Id: {event_id}, Channel: {channel}, Process Id: {process_id}, Thread Id: {thread_id}'
            print(event_data)
            user_data = xml.find(f'.//{ns}UserData')
            # user_data can hold arbitrary data
else:
    ctypes.windll.shell32.ShellExecuteW(None, "runas", sys.executable, " ".join(sys.argv), None, 1)
input()
.evtx is the extension for Windows Eventlog files. It contains data in a special binary format designed by Microsoft so you cannot simply open it in a text editor.
There are open source tools to read .evtx, and NXLog EE can also read .evtx files. (Disclaimer: I'm affiliated with the latter.)
I modified the accepted answer a bit as following, so it becomes reusable:
import xml.etree.ElementTree as Et
import win32evtlog
from collections import namedtuple
class EventLogParser:
    def __init__(self, exported_log_file):
        self.exported_log_file = exported_log_file
    def get_all_events(self):
        windows_events = []
        query_handle = win32evtlog.EvtQuery(str(self.exported_log_file),
                                            win32evtlog.EvtQueryFilePath | win32evtlog.EvtQueryReverseDirection)
        while True:
            raw_event_collection = win32evtlog.EvtNext(query_handle, 1)
            if len(raw_event_collection) == 0:
                break
            for raw_event in raw_event_collection:
                windows_events.append(self.parse_raw_event(raw_event))
        return windows_events
    def parse_raw_event(self, raw_event):
        xml_content = win32evtlog.EvtRender(raw_event, win32evtlog.EvtRenderEventXml)
        root = Et.fromstring(xml_content)
        ns = "{" + root.tag.split('}')[0].strip('{') + "}"
        system = root.find(f'{ns}System')
        event_id = system.find(f'{ns}EventID').text
        level = system.find(f'{ns}Level').text
        time_created = system.find(f'{ns}TimeCreated').get('SystemTime')
        computer = system.find(f'{ns}Computer').text
        WindowsEvent = namedtuple('WindowsEvent',
                                  'event_id, level, time_created, computer')
        return WindowsEvent(event_id, level, time_created, computer)
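The namespace handling in both answers can be tried without win32evtlog or admin rights by parsing a string. The sketch below uses only the standard library; SAMPLE is a hand-written stand-in in the shape of a rendered event (its values are made up, and real events carry many more fields), and system_fields is a hypothetical helper. It passes a prefix map to find and findtext instead of building the '{namespace}' prefix by hand.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of an event's XML; all values are made up.
SAMPLE = (
    '<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">'
    '<System><EventID>4624</EventID><Level>0</Level>'
    '<TimeCreated SystemTime="2020-01-01T00:00:00Z"/>'
    '<Computer>HOST</Computer></System></Event>'
)

# Map a short prefix to the namespace used by the event XML.
NS = {'e': 'http://schemas.microsoft.com/win/2004/08/events/event'}

def system_fields(xml_text):
    """Return a few System fields of one rendered event as a dict."""
    system = ET.fromstring(xml_text).find('e:System', NS)
    return {
        'event_id': system.findtext('e:EventID', namespaces=NS),
        'level': system.findtext('e:Level', namespaces=NS),
        'time_created': system.find('e:TimeCreated', NS).get('SystemTime'),
        'computer': system.findtext('e:Computer', namespaces=NS),
    }

print(system_fields(SAMPLE))
```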
I use the "python-evtx" library, you can install it using this command:
pip install python-evtx
In my case, I'm not interested in reading records with the "Information" level.
import os
import codecs
from datetime import datetime
from lxml import etree
import Evtx.Evtx as evtx
# parseXMLtoString and createFolder are helper functions that are not shown in this answer
def evtxFile(absolutePath, filenameWithExt, ext, _fromDate, _toDate):
    print("Reading: " + filenameWithExt)
    outText = ""
    channel = ""
    #read the windows event viewer log and convert its contents to XML
    with codecs.open(tempFilePath, "a+", "utf-8", "ignore") as tempFile:
        with evtx.Evtx(absolutePath) as log:
            for record in log.records():
                xmlLine = record.xml()
                xmlLine = xmlLine.replace(" xmlns=\"http://schemas.microsoft.com/win/2004/08/events/event\"", "")
                xmlParse = etree.XML(xmlLine)
                level = parseXMLtoString(xmlParse, ".//Level/text()")
                if not level == "0" and not level == "4":
                    providerName = parseXMLtoString(xmlParse, ".//Provider/@Name")
                    qualifiers = parseXMLtoString(xmlParse, ".//EventID/@Qualifiers")
                    timestamp = parseXMLtoString(xmlParse, ".//TimeCreated/@SystemTime")
                    eventID = parseXMLtoString(xmlParse, ".//EventID/text()")
                    task = parseXMLtoString(xmlParse, ".//Task/text()")
                    keywords = parseXMLtoString(xmlParse, ".//Keywords/text()")
                    eventRecordID = parseXMLtoString(xmlParse, ".//EventRecordID/text()")
                    channel = parseXMLtoString(xmlParse, ".//Channel/text()")
                    computer = parseXMLtoString(xmlParse, ".//Computer/text()")
                    message = parseXMLtoString(xmlParse, ".//Data/text()")
                    if level == "1":
                        level = "Critical"
                    elif level == "2":
                        level = "Error"
                    elif level == "3":
                        level = "Warning"
                    date = timestamp[0:10]
                    time = timestamp[11:19]
                    time = time.replace(".", "")
                    _date = datetime.strptime(date, "%Y-%m-%d").date()
                    if _fromDate <= _date <= _toDate:
                        message = message.replace("<string>", "")
                        message = message.replace("</string>", "")
                        message = message.replace("\r\n", " ")
                        message = message.replace("\n\r", " ")
                        message = message.replace("\n", " ")
                        message = message.replace("\r", " ")
                        outText = date + " " + time + "|" + level + "|" + message.strip() + "|" + task + "|" + computer + "|" + providerName + "|" + qualifiers + "|" + eventID + "|" + eventRecordID + "|" + keywords + "\n"
                        tempFile.writelines(outText)
    with codecs.open(tempFilePath, "r", "utf-8", "ignore") as tempFile2:
        myLinesFromDateRange = tempFile2.readlines()
    #delete the temporary file that was created
    os.remove(tempFilePath)
    if len(myLinesFromDateRange) > 0:
        createFolder("\\filtered_data_files\\")
        outFilename = "windows_" + channel.lower() + "_event_viewer_logs" + ext
        myLinesFromDateRange.sort()
        #remove duplicate records from the list
        myFinalLinesFromDateRange = list(set(myLinesFromDateRange))
        myFinalLinesFromDateRange.sort()
        with codecs.open(os.getcwd() + "\\filtered_data_files\\" + outFilename, "a+", "utf-8", "ignore") as linesFromDateRange:
            linesFromDateRange.seek(0)
            if len(linesFromDateRange.read(100)) > 0:
                linesFromDateRange.writelines("\n")
            linesFromDateRange.writelines(myFinalLinesFromDateRange)
        del myLinesFromDateRange[:]
        del myFinalLinesFromDateRange[:]
    else:
        print("No data was found within the specified date range.")
    print("Closing: " + filenameWithExt)
I hope it helps you or someone else in the future.
EDIT:
The "tempFilePath" can be anything you want, for example:
tempFilePath = os.getcwd() + "\\tempFile.txt"
I collected some information first before calling the "evtxFile" function:
The "From" and the "To" dates are in the following format: YYYY-MM-DD
Converted the dates to "date" data type:
_fromDate = datetime.strptime(fromDate, "%Y-%m-%d").date()
_toDate = datetime.strptime(toDate, "%Y-%m-%d").date()
Divided the directory where the .evtx files are located into different parts:
def splitDirectory(root, file):
    absolutePathOfFile = os.path.join(root, file)
    filePathWithoutFilename = os.path.split(absolutePathOfFile)[0]
    filenameWithExt = os.path.split(absolutePathOfFile)[1]
    filenameWithoutExt = os.path.splitext(filenameWithExt)[0]
    extension = os.path.splitext(filenameWithExt)[1]
    return absolutePathOfFile, filePathWithoutFilename, filenameWithExt, filenameWithoutExt, extension
for root, subFolders, files in os.walk(directoryPath):
    for f in files:
        # parentheses let the unpacking target span two lines
        (absolutePathOfFile, filePathWithoutFilename, filenameWithExt,
         filenameWithoutExt, extension) = splitDirectory(root, f)
        if extension == ".evtx":
            evtxFile(absolutePathOfFile, filenameWithExt, ".txt", _fromDate, _toDate)

How to take care of duplicates while copying files to a folder in python

I am writing a script in python to consolidate images in different folders to a single folder. There is a possibility of multiple image files with same names. How to handle this in python? I need to rename those with "image_name_0001", "image_name_0002" like this.
You can maintain a dict with a count of the names that have been seen so far, and then use os.rename() to rename the file to the new name.
For example:
dic = {}
list_of_files = ["a","a","b","c","b","d","a"]
for f in list_of_files:
    if f in dic:
        # name seen before: bump its counter and append it as a suffix
        dic[f] += 1
        new_name = "{0}_{1:03d}".format(f,dic[f])
        print(new_name)
    else:
        dic[f] = 0
        print(f)
Output:
a
a_001
b
c
b_001
d
a_002
If you have the root filename, i.e. name = 'image_name', the extension, extension = '.jpg', and the path to the output folder, path, you can do:
# for each file:
moved = 0
num = 0
if os.path.exists(path + name + extension):
    while moved == 0:
        num += 1
        modifier = '_%04d' % num # _0001, _0002, ...
        if not os.path.exists(path + name + modifier + extension):
            # MOVE FILE HERE using (path + name + modifier + extension)
            moved = 1
else:
    # MOVE FILE HERE using (path + name + extension)
    pass
There are obviously a couple of bits of pseudocode in there but you should get the gist
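A runnable variant of the existence-check approach might look like the sketch below. The function name copy_without_overwrite is made up, it copies rather than moves (shutil.move could be swapped in), and it assumes nothing else writes to the destination folder at the same time, since the check and the copy are separate steps.

```python
import os
import shutil

def copy_without_overwrite(src, dest_dir):
    """Copy src into dest_dir, adding _0001, _0002, ... if the name is taken."""
    stem, ext = os.path.splitext(os.path.basename(src))
    target = os.path.join(dest_dir, stem + ext)
    num = 0
    # Try image.jpg first, then image_0001.jpg, image_0002.jpg, ...
    while os.path.exists(target):
        num += 1
        target = os.path.join(dest_dir, f'{stem}_{num:04d}{ext}')
    shutil.copy2(src, target)
    return target

# Example call with hypothetical paths:
# copy_without_overwrite('folder_a/image_name.jpg', 'consolidated')
```

Because the suffix goes before the extension, the copies keep their file type and sort next to the original name.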
