Comparing incrementing filenames in Python and checking which one is missing

I'm iterating through a folder of files and adding each file's path to a list. The folder contains files with incrementing file names, such as 00-0.txt, 00-1.txt, 00-2.txt, 01-0.txt, 01-1.txt, 01-2.txt and so on.
The number of files is not fixed and always varies. Also, sometimes a file could be missing. This means that I will sometimes get this list instead:
00-0.txt, 00-1.txt, 01-0.txt, 01-1.txt, 01-2.txt.
However, my final list should always contain groups of 9 (so 00-0, 00-1, 00-2 and so on up to 00-8 form one group). If a file is missing, I will append the string 'is missing' to the new list instead.
What I was thinking of doing is the following:
Get the last character of the filename (for example '3').
Check whether its value equals the previous entry's value + 1.
If it doesn't, append the 'is missing' string.
If it does, append the file name.
In pseudo-code (please don't mind the syntax errors, I'm mainly looking for high level advice), it would be something like this:
empty_list = []
list_with_paths = glob.glob("/path/to/dir*.txt")
for index, item in enumerate(list_with_paths):
    basename = os.path.basename(item)
    filename = os.path.splitext(basename)[0]
    if index == 0 and int(filename[-1]) != 0:
        empty_list.append('is missing')
    elif filename[-1] != empty_list[index - 1] + 1:
        empty_list.append('is missing')
    else:
        empty_list.append(filename)
I'm sure there is a more optimal solution in order to achieve this.

Once you have the set of actual paths, just iterate over the expected paths until you have accounted for all of the actual paths.
import glob
from itertools import count

directory = "/path/to/dir/"
remaining = set(glob.glob(directory + "*.txt"))
results = []
# Assumes every file follows the GG-N.txt pattern; otherwise the set never empties
for g in count():
    if not remaining:
        break
    for i in range(9):
        expected = "{:02}-{}.txt".format(g, i)
        path = directory + expected
        if path in remaining:
            # The set holds full paths, so remove the full path, not the bare name
            remaining.remove(path)
            results.append(expected)
        else:
            results.append("is missing")
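The same idea can also be written as a function that works on plain file names instead of the file system, which makes it easier to try out. This is only a sketch: the name fill_missing and the group_size parameter are made up for illustration, and it assumes every name follows the GG-N.txt pattern from the question. It takes the number of groups from the highest group number it sees, so it always terminates.

```python
import re

def fill_missing(filenames, group_size=9):
    """Return the expected names in order, with 'is missing' for absent files."""
    # Assumed pattern: two-digit group, dash, one-digit index, '.txt'
    pattern = re.compile(r"^(\d{2})-(\d)\.txt$")
    present = set()
    for name in filenames:
        match = pattern.match(name)
        if match:  # names that do not fit the pattern are ignored
            present.add((int(match.group(1)), int(match.group(2))))
    if not present:
        return []
    last_group = max(group for group, _ in present)
    results = []
    for group in range(last_group + 1):
        for index in range(group_size):
            if (group, index) in present:
                results.append("{:02}-{}.txt".format(group, index))
            else:
                results.append("is missing")
    return results

# Small group size only to keep the example short
example = fill_missing(["00-0.txt", "00-1.txt", "01-0.txt"], group_size=3)
print(example)
```

With real files, the input could be built with os.path.basename over the glob results.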

Related

Save parts of a file name in separate pandas DataFrame columns

I would like to save different positions of a file name in different pandas DataFrame columns.
For example my file names look like this:
001015io.png
position 0-2 in column 'y position' in this case '001'
position 3-5 in column 'x position' in this case '015'
position 6-7 in column 'status' in this case 'io'
My folder contains about 400 of these picture files. I'm a beginner in programming, so I don't know how I should start to solve this.
If the parts of the file names that you need are consistent (same position and length in all files), you can use string slicing to create new columns from the pieces of the file name like this:
import pandas as pd
df = pd.DataFrame({'file_name': ['001015io.png']})
df['y position'] = df['file_name'].str[0:3]
df['x position'] = df['file_name'].str[3:6]
df['status'] = df['file_name'].str[6:8]
This results in the dataframe:
      file_name y position x position status
0  001015io.png        001        015     io
Note that when you slice a string you give a start position and a stop position like [0:3]. The start position is inclusive, but the stop position is not, so [0:3] gives you the substring from 0-2.
You can do this with slicing. A string is basically a sequence of characters, so you can slice it into the parts you need. See the example below.
filename = '001015io.png'
y = filename[0:3]  # positions 0-2: y position
x = filename[3:6]  # positions 3-5: x position
status = filename[6:8]
print(y, x, status)
Output:
001 015 io
As for getting the list of files, there's an absurdly complete answer for that here.
I have this function below in my personal library which I reuse whenever I need to generate a list of files.
import os

def get_files_from_path(path: str = ".", ext=None) -> list:
    """Find files in path and return them as a list.

    Gets all files in folders and subfolders.
    See the answer at the link below for a ridiculously
    complete answer for this. I tend to use this one.
    Note that it also goes into subdirs of the path.
    https://stackoverflow.com/a/41447012/9267296

    Args:
        path (str, optional): Which path to start on.
            Defaults to '.'.
        ext (str/list, optional): Optional file extension.
            Defaults to None.

    Returns:
        list: list of full file paths
    """
    result = []
    for subdir, dirs, files in os.walk(path):
        for fname in files:
            filepath = f"{subdir}{os.sep}{fname}"
            if ext is None:
                result.append(filepath)
            elif isinstance(ext, str) and fname.lower().endswith(ext.lower()):
                result.append(filepath)
            elif isinstance(ext, list):
                # any() stops at the first match, so a file is added at most once
                if any(fname.lower().endswith(item.lower()) for item in ext):
                    result.append(filepath)
    return result
One thing to take into account: this function returns the full file path, for example path/to/file/001015io.png
You can use the code below to get just the filename:
import os
print(os.path.basename('path/to/file/001015io.png'))
Output:
001015io.png
Then use the approach from Bill the Lizard's answer to turn it into a DataFrame.
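Putting the two answers together could look roughly like the sketch below. The helper name split_file_name and the second sample path are made up for illustration, and the column names are the ones from the question. The sketch deliberately stops at a list of dicts so that it runs without pandas.

```python
import os

def split_file_name(path):
    """Split a name like '001015io.png' into its fixed-width parts."""
    file_name = os.path.basename(path)
    return {
        "file_name": file_name,
        "y position": file_name[0:3],   # positions 0-2
        "x position": file_name[3:6],   # positions 3-5
        "status": file_name[6:8],       # positions 6-7
    }

# Hypothetical paths; in practice these would come from the directory listing
paths = ["path/to/file/001015io.png", "path/to/file/002030ok.png"]
rows = [split_file_name(p) for p in paths]
for row in rows:
    print(row)
# With pandas installed, pd.DataFrame(rows) turns these rows into a DataFrame.
```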

How to add same number to same string in python?

I want to give the same number to every occurrence of the same string and save the result to a text file.
For example, if several filenames start with the string "Ball", I will give that string the number 0. Likewise, if several filenames start with "Square", I will give that string the number 1. And so on.
I tried using os.walk and splitting the text, but I still have no idea how to add the number and save it to a text file.
with open("check.txt", "w") as a:
    for path, subdirs, files in os.walk(path):
        for i, filename in enumerate(files):
            #the filename have underscore to separate the space
            #for example Ball_red_move
            mylist = filename.split("_")
            #I tried to take the first string name only after splitting, here
            #for example "Ball"
            k = mylist[0]
            #After this I don't have idea to add number when the string name
            #is same and also save it to txt file with the directory name
This is my expected result:
Check/Ball_red_move_01 0
Check/Ball_red_move_02 0
Check/Ball_red_move_03 0
Check/Square_jump_forward_01 1
Check/Square_jump_forward_02 1
Check/Square_jump_forward_03 1
You might like to do something like this:
Prepare a dictionary to map the string to some labeling numbers and check if the string is present.
object_map = {'Ball': 0, 'Square': 1}

def get_num_from_string(x):
    for i in object_map:
        if i in x:
            return object_map[i]

A = ['Check/Ball_red_move_01', 'Check/Square_jump_forward_01']
for i in A:
    print(i + ' ' + str(get_num_from_string(i)))
This produces
Check/Ball_red_move_01 0
Check/Square_jump_forward_01 1
A few things for you to consider: what do you want to do if none of the strings appears, and what do you want to do if several of them appear?
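If the names are not known in advance, the mapping could also be built on the fly. The sketch below is one possible way, not part of the answer above: the function name label_names is made up, and it assumes the label is always the first underscore-separated token of the file name.

```python
def label_names(names):
    """Give each distinct first token its own number, in order of first appearance."""
    labels = {}
    lines = []
    for name in names:
        base = name.split("/")[-1]   # drop the directory part
        key = base.split("_")[0]     # e.g. 'Ball' from 'Ball_red_move_01'
        # setdefault keeps an existing number or assigns the next free one
        number = labels.setdefault(key, len(labels))
        lines.append("{} {}".format(name, number))
    return lines

names = [
    "Check/Ball_red_move_01",
    "Check/Ball_red_move_02",
    "Check/Square_jump_forward_01",
]
for line in label_names(names):
    print(line)
```

The returned lines can then be joined with newlines and written to the text file in one call.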

Script for separating series of photos

I need to visually separate photos (JPEGs) in a folder by placing black placeholder pictures between series with identical file names (only the last two digits of the file names differ). The folder typically contains single (stand-alone) photos, named something like 03-12345-randomfilename.jpg, and series named 03-12345-file01.jpg, 03-12345-file02.jpg, ..03, ..04, etc.
The singles should be left alone, but I need to place a black picture before and after all series.
I have the following Python script (originally written by someone else) that is intermittently failing for no apparent reason. It usually works, but sometimes it will overwrite files in the middle of a series, or more typically, it will fail to place a black picture after the last photo in a series. I've spent hours trying to figure out what's going on, but I'm stuck.
Any suggestions most appreciated.
def blackJPG(directory):
    # iterate over every file name in the directory
    blackJPG = '/Users/username/black.jpg'
    filelist = {}
    for file_name in os.listdir(directory):
        filename, file_extension = os.path.splitext(file_name)
        stringmatch = re.compile(r'(\d{2})(.*?)(\d+)(.*?)(([A-Za-z]+))(.*?)(\d+)')
        m = stringmatch.search(file_name)
        #Create search table
        if m:
            sequence = int(m.group(8))
            filename_without_sequence = "{0}{1}{2}{3}{4}{5}".format(m.group(1),m.group(2),m.group(3),m.group(4),m.group(6),m.group(7))
            filelist.update({filename_without_sequence: (sequence)})
    for key, value in filelist.iteritems():
        if value > 1:
            newJPG = "{0}/{1}00.jpg".format(directory, key)
            if value >= 10:
                lastJPG = "{0}/{1}{2}.jpg".format(directory, key, value+1)
            else:
                lastJPG = "{0}/{1}0{2}.jpg".format(directory, key, value+1)
            #Create first blackJPG
            shutil.copyfile(blackJPG, newJPG)
            #Create last blackJPG
            shutil.copyfile(blackJPG, lastJPG)
    return "Done"
If the variation is always the last two characters, you can take the part that doesn't change (the prefix), count how many files share each prefix, and create a placeholder for every prefix that has more than one file:
import os
import shutil

def add_black_jpg(directory, black_jpg_location):
    # Count how many files share each prefix (the name minus its last two characters)
    series_count = {}
    for file in os.listdir(directory):
        name, ext = os.path.splitext(file)
        prefix = name[:-2]
        series_count[prefix] = series_count.get(prefix, 0) + 1
    for prefix, count in series_count.items():
        if count > 1:
            # Write the leading placeholder into the directory, not the working directory
            shutil.copyfile(black_jpg_location, os.path.join(directory, f"{prefix}00.jpg"))
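The question also asks for a placeholder after each series. One way to approach that is to work out the placeholder names first and copy the black picture afterwards. The sketch below only computes the names; placeholder_names is a made-up helper, and it assumes a series member always ends in exactly two digits before a .jpg or .jpeg extension.

```python
import re
from collections import defaultdict

def placeholder_names(filenames):
    """Return the placeholder names to create before and after each series."""
    series = defaultdict(list)
    for filename in filenames:
        # Prefix, then a two-digit sequence number, then the extension
        match = re.match(r"^(.*?)(\d{2})\.jpe?g$", filename, re.IGNORECASE)
        if match:
            series[match.group(1)].append(int(match.group(2)))
    names = []
    for prefix, numbers in sorted(series.items()):
        if len(numbers) > 1:  # a single file is not a series
            names.append("{}00.jpg".format(prefix))
            names.append("{}{:02}.jpg".format(prefix, max(numbers) + 1))
    return names

files = [
    "03-12345-file01.jpg",
    "03-12345-file02.jpg",
    "03-12345-file03.jpg",
    "03-12345-randomfilename.jpg",
]
print(placeholder_names(files))
```

Each returned name could then be joined with the directory and passed to shutil.copyfile as the destination. Using the highest sequence number rather than the file count keeps the trailing placeholder from landing on an existing photo when a series has gaps.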

python newbie - where is my if/else wrong?

Complete beginner so I'm sorry if this is obvious!
I have a file which is name | +/- or IG_name | 0 in a long list like so -
S1 +
IG_1 0
S2 -
IG_S3 0
S3 +
S4 -
dnaA +
IG_dnaA 0
Everything which starts with IG_ has a corresponding name. I want to add the + or - to the IG_name. e.g. IG_S3 is + like S3 is.
The information is gene names and strand information, IG = intergenic region. Basically I want to know which strand the intergenic region is on.
What I think I want:
open file
for every line, if the line starts with IG_*
find the line with *
print("IG_" and the line it found)
else
print line
What I have:
with open(sys.argv[2]) as geneInfo:
    with open(sys.argv[1]) as origin:
        for line in origin:
            if line.startswith("IG_"):
                name = line.split("_")[1]
                nname = name[:-3]
                for newline in geneInfo:
                    if re.match(nname, newline):
                        print("IG_"+newline)
            else:
                print(line)
where origin is the mixed list and geneInfo has only the names not IG_names.
With this code I end up with a list containing only the else statements.
S1 +
S2 -
S3 +
S4 -
dnaA +
My problem is that I don't know what is wrong to search so I can (attempt) to fix it!
Below is some step-by-step annotated code that hopefully does what you want (though instead of using print I have aggregated the results into a list so you can actually make use of it). I'm not quite sure what happened with your existing code (especially how you're processing two files?)
s_dict = {}
ig_list = []
with open('genes.txt', 'r') as infile:  # Simulating reading the file you pass in sys.argv
    for line in infile:
        if line.startswith('IG_'):
            ig_list.append(line.split()[0])  # Collect all our IG values for later
        else:
            s_name, value = line.split()  # Separate out the S value and its operator
            s_dict[s_name] = value.strip()  # Add to dictionary to map S to operator
# Now you can go back through your list of IG values and append the appropriate operator
pulled_together = []
for item in ig_list:
    s_value = item.split('_')[1]
    # The following will look for the operator mapped to the S value. If it is
    # not found, it will instead give you 'Not found'
    corresponding_operator = s_dict.get(s_value, 'Not found')
    pulled_together.append([item, corresponding_operator])
print('List structure')
print(pulled_together)
print('\n')
print('Printout of each item in list')
for item in pulled_together:
    print(item[0] + '\t' + item[1])
nname = name[:-3]
Python's slicing is very powerful, but it can be tricky to understand correctly.
When you write [:-3], you take everything except the last three items. If the sequence has fewer than three elements, you do not get an error; you get an empty result instead.
I think this may be where things go wrong: if there are not many characters left after the split, the slice can come back empty. If you could show exactly what you want it to return there, with an example, it would help a lot, as I don't really know what you're trying to get with your slicing.
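To see what the slice actually returns, it can help to print each step with repr(), which makes spaces and newlines visible. The sample line below is taken from the question's data. The whitespace-based alternative on the last line is only a suggestion, not something from the original code.

```python
line = "IG_S3 0\n"
name = line.split("_")[1]
print(repr(name))        # the part after the underscore, including ' 0' and the newline
print(repr(name[:-3]))   # drops the last three characters: space, '0', newline
print(repr("ab"[:-3]))   # fewer than three characters: empty string, no error

# Alternative that does not depend on the exact number of trailing characters
gene = line.split()[0].split("_", 1)[1]
print(repr(gene))
```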
Does this do what you want?
from __future__ import print_function
import sys
# Read and store all the gene info lines, keyed by name
gene_info = dict()
with open(sys.argv[2]) as gene_info_file:
    for line in gene_info_file:
        tokens = line.split()
        name = tokens[0].strip()
        gene_info[name] = line
# Read the other file and lookup the names
with open(sys.argv[1]) as origin_file:
    for line in origin_file:
        if line.startswith("IG_"):
            name = line.split("_")[1]
            nname = name[:-3].strip()
            if nname in gene_info:
                lookup_line = gene_info[nname]
                print("IG_" + lookup_line)
            else:
                pass  # what do you want to do in this case?
        else:
            print(line)
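If only the mixed list is available, the lookup can also be done from that one list in two passes. This is a rough sketch under assumptions: annotate_intergenic is a made-up name, each line is assumed to hold exactly a name and a strand separated by whitespace, and an IG_ line whose gene is not found is left unchanged.

```python
def annotate_intergenic(lines):
    """Replace the 0 on each IG_ line with the strand of the matching gene."""
    # First pass: remember the strand of every normal gene line
    strands = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and not parts[0].startswith("IG_"):
            strands[parts[0]] = parts[1]
    # Second pass: fill in the strand for the IG_ lines
    annotated = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[0].startswith("IG_"):
            gene = parts[0][len("IG_"):]
            annotated.append("{} {}".format(parts[0], strands.get(gene, parts[1])))
        else:
            annotated.append(line.strip())
    return annotated

sample = ["S2 -", "IG_S3 0", "S3 +", "dnaA +", "IG_dnaA 0", "IG_1 0"]
for line in annotate_intergenic(sample):
    print(line)
```

Two passes are used because an IG_ line can appear before its gene, as IG_S3 does in the question's data.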

IndexError: list index out of range - Can't get my txt file to save the output in Python

I'm writing a program that is referring to two text files. One stores the list of currencies in a sentence and one stores the current value. I want it to only store the value of the currency in the sentence once, but then stores each order so it can be replaced on screen. E.g. Sterling, Euro (in one file). 1,1,2 (in the order file). I keep getting the error message
IndexError: list index out of range
all_currency = []
value_of_currency = []
my_words_file = open("words.txt","r")
my_words_document = my_words_file.readline()
while my_words_document != "":
    all_currency.append(my_words_document.strip())
    my_words_document = my_words_file.readline()
print("output of the words in the saved list:",str(all_currency))
my_words_file.close()
positions = ""
my_numbers_file = open("numbers.txt","r")
value_of_currency = my_numbers_file.readline().strip().split()
my_numbers_file.close()
for mynumbers in value_of_currency:
    positions += all_currency[int(mynumbers)]+" "
print("output of the positions in the saved list:",positions)
You get the error because the numbers are used directly as indexes into the currency list. In your case there are only two currencies, so asking for the third element (for example arr[2]) raises this error.
def read_words(words_file):
    # Split every line on commas and flatten the pieces into one list
    with open(words_file, 'r') as f:
        return [word for line in f for word in line.strip().split(',')]

all_currency = read_words("a.txt")
print(all_currency)
value_of_currency = read_words("b.txt")
print(value_of_currency)
positions = ""
for mynumbers in value_of_currency:
    positions += all_currency[int(mynumbers)] + " "
print("output of the positions in the saved list:", positions)
# a.txt contains the currencies: "Sterling,Euro,Dollar"
# b.txt contains the positions: "1,1,2"
I didn't quite understand your desired output, but this fixes the error. As I said, you need to make sure the values never exceed the last valid index of the list.
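One way to guard against the error is to check each position before using it as an index. The sketch below is only an illustration: words_at_positions and its one_based flag are made-up names, and whether the stored positions start at 0 or at 1 is an assumption you would need to settle for your own files.

```python
def words_at_positions(words, positions, one_based=False):
    """Look up each position in words, skipping positions that are out of range."""
    found = []
    for raw in positions:
        index = int(raw) - 1 if one_based else int(raw)
        if 0 <= index < len(words):
            found.append(words[index])
        else:
            # Report the problem instead of raising IndexError
            print("position {} is out of range for {} words".format(raw, len(words)))
    return found

zero_based = words_at_positions(["Sterling", "Euro"], ["1", "1", "2"])
one_based = words_at_positions(["Sterling", "Euro"], ["1", "1", "2"], one_based=True)
print(zero_based)
print(one_based)
```

With two currencies, the position 2 only resolves when positions are treated as starting at 1, which may be what the original files intended.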
