I have a path like this one:
path = "./corpus_test/corpus_ix_test_FMC.xlsx"
I want to retrieve the name of the file without ".xlsx" and without the other parts of the path.
I know I could use a fixed index, but in some cases the file is different and the path is not the same, for example:
path2 = "./corpus_ix/corpus_polarity_test_FMC.xlsx"
I am looking for a regular expression or a method which will retrieve only the name in both cases. For example, if I read a full directory with lots of files, a fixed index won't help me.
Is there a way in Python to tell it to start slicing at the last "/", i.e. to retrieve the index of that "/" and add 1 to get the starting position?
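For what it's worth, `str.rfind` does exactly what is described here: it returns the index of the last occurrence, so slicing can start one character after it. A minimal sketch using the example path from the question:

```python
# Find the last "/" with rfind, start slicing one character after it,
# and drop the 5-character ".xlsx" suffix.
path = "./corpus_test/corpus_ix_test_FMC.xlsx"
start = path.rfind("/") + 1
name_of_file = path[start:-5]
print(name_of_file)  # corpus_ix_test_FMC
```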
What I tried (still thinking it through):
path = "./corpus_test/corpus_ix_test_FMC.xlsx"
list_of_index = []
for f in path:
    if f == "/":
        ind = path.index(f)
        list_of_index.append(ind)
ind_to_start_count = max(list_of_index) + 1
print("the index of the last '/' is", max(list_of_index))
name_of_file = path[ind_to_start_count:-5]  # -5 drops ".xlsx"
But the printing gives me 1 for each "/". Is there a way to get the index of each separate occurrence of a character in the string, rather than always the same position?
I wanted to split it into characters but get an error with:
path ="./corpus_test/corpus_ix_test_FMC.xlsx"
path_string = path.split("")
print(path_string)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-9-b8bdc29c19b1> in <module>
1 path ="./corpus_test/corpus_ix_test_FMC.xlsx"
2
----> 3 path_string = path.split("")
4 print(path_string)
ValueError: empty separator
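As the traceback says, `str.split` does not accept an empty separator. If the goal was a list of the individual characters, `list()` does that directly:

```python
# list() turns a string into a list of one-character strings,
# which split("") cannot do.
path = "./corpus_test/corpus_ix_test_FMC.xlsx"
path_chars = list(path)
print(path_chars[:4])  # ['.', '/', 'c', 'o']
```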
import os
fullpath = r"./corpus_test/corpus_ix_test_FMC.xlsx"
full_filename = os.path.basename(fullpath)
filename, ext = os.path.splitext(full_filename)
This gives you the base filename without the extension.
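Applied to both example paths from the question, this produces just the bare name in each case:

```python
import os

# Quick check with the two paths from the question:
for p in ["./corpus_test/corpus_ix_test_FMC.xlsx",
          "./corpus_ix/corpus_polarity_test_FMC.xlsx"]:
    filename, ext = os.path.splitext(os.path.basename(p))
    print(filename)  # corpus_ix_test_FMC, then corpus_polarity_test_FMC
```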
This is what I've been using:
def get_suffix(word, delimiter):
    """Return the part of word after the last instance of delimiter."""
    while delimiter in word:
        word = word.partition(delimiter)[2]
    return word

def get_terminal_path(path):
    """
    Return the last step of a path.
    Edge cases:
    - Delimiters: / or \\ or mixed
    - Ends with a delimiter or not
    """
    # Convert "\\" to "/"
    while "\\" in path:
        part = path.partition("\\")
        path = part[0] + "/" + part[2]
    # Check if it ends with a delimiter
    if path[-1] == "/":
        path = path[0:-1]
    # Get the terminal path component
    out = get_suffix(path, "/")
    return out.partition(".")[0]
I am trying to sort DICOMs of multiple subjects into their respective folders based on their PatientID. The current directory has all the DICOMs for all subjects, unsorted. I am able to go through a DICOM directory, group subjects by their PatientID, and count how many DICOMs each subject has. Is it possible to copy or move the DICOMs to another directory and sort them into folders based on their PatientID?
code:
import os
from collections import Counter

import pandas as pd
import pydicom
import torch
import tqdm

os.listdir('\\dicoms')
device = torch.device("cuda")
print(device)
input_path = '\\dicoms\\'
ds_columns = ['ID', 'PatientID', 'Modality', 'StudyInstance',
              'SeriesInstance', 'PhotoInterpretation', 'Position0',
              'Position1', 'Position2', 'Orientation0', 'Orientation1',
              'Orientation2', 'Orientation3', 'Orientation4', 'Orientation5',
              'PixelSpacing0', 'PixelSpacing1']

def extract_dicom_features(ds):
    ds_items = [ds.SOPInstanceUID,
                ds.PatientID,
                ds.Modality,
                ds.StudyInstanceUID,
                ds.SeriesInstanceUID,
                ds.PhotometricInterpretation,
                ds.ImagePositionPatient,
                ds.ImageOrientationPatient,
                ds.PixelSpacing]
    line = []
    for item in ds_items:
        if type(item) is pydicom.multival.MultiValue:
            line += [float(x) for x in item]
        else:
            line.append(item)
    return line

list_img = os.listdir(input_path + 'imgs')
print(len(list_img))
df_features = []
for img in tqdm.tqdm(list_img):
    img_path = input_path + 'imgs/' + img
    ds = pydicom.read_file(img_path)
    df_features.append(extract_dicom_features(ds))
df_features = pd.DataFrame(df_features, columns=ds_columns)
df_features.head()
df_features.to_csv('\\meta.csv')
print(Counter(df_features['PatientID']))
example of metadata:
,ID,PatientID,Modality,StudyInstance,SeriesInstance,PhotoInterpretation,Position0,Position1,Position2,Orientation0,Orientation1,Orientation2,Orientation3,Orientation4,Orientation5,PixelSpacing0,PixelSpacing1
0,ID_000012eaf,ID_f15c0eee,CT,ID_30ea2b02d4,ID_0ab5820b2a,MONOCHROME2,-125.0,-115.89798,77.970825,1.0,0.0,0.0,0.0,0.927184,-0.374607,0.488281,0.488281
example of Counter output:
Counter({'ID_19702df6': 28, 'ID_b799ed34': 26, 'ID_e3523464': 26, 'ID_cd9169c2': 26, 'ID_e326a8a4': 24, 'ID_45da90cb': 24, 'ID_99e4f787': 24, 'ID_df751e93': 24, 'ID_929a5b39': 20})
I added the following code to try to sort the images into subdirectories but I run into an error:
dest_path = input_path + 'imageProcessDir'
counter = 0
for index, rows in df_features.iterrows():
    filename = basename(rows['ID'])
    image = cv2.imread(input_path + rows['ID'])
    counter = counter + 1
    fold = rows['PatientID'] + "/"
    dest_fold = dest_path + fold
    cv2.imwrite(dest_fold + "/" + filename + "_" + str(counter) + ".dcm", img)
error:
Traceback (most recent call last):
File "ct_move.py", line 77, in <module>
cv2.imwrite(dest_fold+"/"+filename+ "_" +str(counter)+".dcm", img)
TypeError: Expected cv::UMat for argument 'img'
I'd also second ditching OpenCV for this - it is overkill.
Try pydicom instead.
What I'd do for your problem (move all files with same patient ID into their own folder, and count how many for each) is:
get list of dicom files as a list (use glob.glob to search a directory and/or just pass in the full file list via argv)
load all those files into a list of pydicom dicom file objects (DataSets), so something like:
import glob
import sys

import pydicom

files = []
for fname in glob.glob(sys.argv[1], recursive=False):
    print("loading: {}".format(fname))
    files.append(pydicom.read_file(fname))
go through that list and move each file, creating the target directory if required. Something like (using os.path.isdir, os.makedirs, and shutil.move):
import os
import shutil
from collections import defaultdict

# dict for counting the number of files for each patient ID
patient_id_count = defaultdict(int)
for f in files:
    pid = f.PatientID  # this gets the patient ID from the current file
    if not os.path.isdir(pid):
        os.makedirs(pid)
    shutil.move(f.filename, pid)  # pydicom keeps the source path in .filename
    patient_id_count[pid] += 1
To address your issue, it seems like overkill to use opencv here at all. If all you want to do is to move the dicom images from one location into another on the filesystem, you could use os.rename or shutil.move if you are on a UNIX-like system. Unless you are modifying image content, these are cleaner and faster solutions.
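A minimal self-contained sketch of that move-into-a-patient-folder step (the directory and file names below are made up for the demo, not taken from the question's data, and a throwaway temp directory stands in for the real filesystem):

```python
import os
import shutil
import tempfile

# Throwaway demo directory; made-up names, not the question's data.
root = tempfile.mkdtemp()
src = os.path.join(root, "ID_000012eaf.dcm")
open(src, "w").close()                    # stand-in for a real DICOM file

dest_dir = os.path.join(root, "sorted", "ID_f15c0eee")
os.makedirs(dest_dir, exist_ok=True)      # per-patient folder
shutil.move(src, dest_dir)                # keeps the original filename
```

When the destination is an existing directory, `shutil.move` moves the file into it under its original basename, so no image decoding or re-encoding is involved.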
I noticed two little things in your last code block:
I think you want the fold variable to have the "/" prefixed instead of suffixed for the paths to work.
Also, the counter will keep incrementing across all DICOMs, where I think you want it to increment on a per-subject basis (I am assuming df_features is sorted on PatientID here; if it is not, maybe you could use the Counter class).
dest_path = input_path + 'imageProcessDir'
counter = 0
prev_fold = '/' + df_features.loc[0, 'PatientID']
for index, rows in df_features.iterrows():
    filename = basename(rows['ID'])
    counter = counter + 1
    fold = '/' + rows['PatientID']
    dest_fold = dest_path + fold
    out_file = dest_fold + "/" + filename + "_" + str(counter) + ".dcm"
    os.rename(input_path + rows['ID'], out_file)
    if fold != prev_fold:
        counter = 0  # reset when the PatientID changes
        prev_fold = fold
I would also use os.path.join to handle filesystem paths instead of adding "/" to everything:
fold = rows['PatientID']
dest_fold = os.path.join(dest_path, fold)
as I think there is also an issue with the input file path: input_path + rows['ID']
edit:
This is to get rid of the use of '/' and put in os.path.join
dest_path = os.path.join(input_path, 'imageProcessDir')
counter = 0
prev_fold = df_features.loc[0, 'PatientID']
for index, rows in df_features.iterrows():
    filename = basename(rows['ID'])
    counter = counter + 1
    fold = rows['PatientID']
    dest_fold = os.path.join(dest_path, fold)
    os.makedirs(dest_fold, exist_ok=True)  # make sure the target folder exists
    out_file = os.path.join(dest_fold, filename + "_" + str(counter) + ".dcm")
    os.rename(os.path.join(input_path, rows['ID']), out_file)
    if fold != prev_fold:
        counter = 0  # reset when the PatientID changes
        prev_fold = fold
Also, note that os.rename(os.path.join(input_path, rows['ID']), out_file) may need to be os.rename(os.path.join(input_path, rows['ID'] + '.dcm'), out_file)
If it's not too much trouble, you may want to make a backup of your files before attempting this, to make sure you get what you want out!
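One hedged way to take that backup is `shutil.copytree`, which copies a whole directory tree in one call (the paths below are throwaway demo names, not the question's real folders):

```python
import os
import shutil
import tempfile

# Throwaway demo directory standing in for the real DICOM folder:
root = tempfile.mkdtemp()
src = os.path.join(root, "dicoms")
os.makedirs(src)
open(os.path.join(src, "a.dcm"), "w").close()

backup = os.path.join(root, "dicoms_backup")
shutil.copytree(src, backup)  # recursive copy; backup must not exist yet
```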
Thank you I solved the problem with your help.
Solution:
import os
from collections import Counter
from os.path import basename

import pandas as pd
import pydicom
import torch
import tqdm

os.listdir('directory')
device = torch.device("cuda")
print(device)
input_path = 'directory\\'
ds_columns = ['ID', 'PatientID', 'Modality', 'StudyInstance',
              'SeriesInstance', 'PhotoInterpretation', 'Position0',
              'Position1', 'Position2', 'Orientation0', 'Orientation1',
              'Orientation2', 'Orientation3', 'Orientation4', 'Orientation5',
              'PixelSpacing0', 'PixelSpacing1']

def extract_dicom_features(ds):
    ds_items = [ds.SOPInstanceUID,
                ds.PatientID,
                ds.Modality,
                ds.StudyInstanceUID,
                ds.SeriesInstanceUID,
                ds.PhotometricInterpretation,
                ds.ImagePositionPatient,
                ds.ImageOrientationPatient,
                ds.PixelSpacing]
    line = []
    for item in ds_items:
        if type(item) is pydicom.multival.MultiValue:
            line += [float(x) for x in item]
        else:
            line.append(item)
    return line

list_img = os.listdir(input_path)
print(len(list_img))
print('***********')
print(list_img)
print('***********')
df_features = []
for img in tqdm.tqdm(list_img):
    img_path = input_path + img
    ds = pydicom.read_file(img_path)
    df_features.append(extract_dicom_features(ds))
df_features = pd.DataFrame(df_features, columns=ds_columns)
print(df_features)
print('***********')
df_features.head()
df_features.to_csv('\\test_meta.csv')
print(Counter(df_features['PatientID']))
print('***********')
df_features['ID'] = df_features['ID'].astype(str) + ".dcm"
print(df_features)
print('***********')
dest_path = '\\sorted'
counter = 0
prev_fold = '\\' + df_features.loc[0, 'PatientID']
for index, rows in df_features.iterrows():
    filename = basename(rows['ID'])
    counter = counter + 1
    fold = '\\' + rows['PatientID']
    dest_fold = dest_path + fold
    out_file = os.path.join(dest_fold, filename)
    print(out_file)
    print('-------------')
    if not os.path.exists(dest_fold):
        os.mkdir(dest_fold)
    os.rename(os.path.join(input_path, rows['ID']), out_file)
    if fold != prev_fold:
        counter = 0
        prev_fold = fold
I have a list of strings:
fileList = ['YMML.2019.09.10-Run.1-Final.pdf',
'YMML.2019.09.10-Run.2-Initial.pdf',
'YMML.2019.09.11-Run.2-Initial.pdf',
'YMML.2019.09.11-Run.1-Final.pdf',
'YMML.2019.09.12-Run.2-Initial.pdf',
'YMML.2019.09.13-Run.2-Initial.pdf',
'YMML.2019.09.12-Run.1-Final.pdf',
'YMML.2019.09.13-Run.1-Final.pdf',
'YMML.2019.09.14-Run.1-Final.pdf',]
and I'd like to confirm that there is both a Run.1-Final and Run.2-Initial for each date.
I've tried something like:
for i in range(len(directoryList)):
    if directoryList[i][5:15] != directoryList[i + 1][5:15]:
        print(directoryList[i] + ' is missing.')
        i += 2
and I'd like the output to be:
YMML.2019.09.14-Run.2-Initial.pdf is missing
Perhaps something like
dates = [directoryList[i][5:15] for i in range(len(directoryList))]
counter = collections.Counter(dates)
But then having trouble extracting from the dictionary.
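The Counter is already most of the way there: a date is incomplete exactly when it appears fewer than two times. A sketch of that extraction step, using a shortened copy of the list:

```python
import collections

# Two complete dates plus one date with only the Run.1 file:
directoryList = ['YMML.2019.09.10-Run.1-Final.pdf',
                 'YMML.2019.09.10-Run.2-Initial.pdf',
                 'YMML.2019.09.11-Run.2-Initial.pdf',
                 'YMML.2019.09.11-Run.1-Final.pdf',
                 'YMML.2019.09.14-Run.1-Final.pdf']
dates = [name[5:15] for name in directoryList]
counter = collections.Counter(dates)
incomplete = [d for d, n in counter.items() if n < 2]
print(incomplete)  # ['2019.09.14']
```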
To make it more readable, you could create a list of dates first, then loop over those.
file_list = ['YMML.2019.09.10-Run.1-Final.pdf',
'YMML.2019.09.10-Run.2-Initial.pdf',
'YMML.2019.09.11-Run.2-Initial.pdf',
'YMML.2019.09.11-Run.1-Final.pdf',
'YMML.2019.09.12-Run.2-Initial.pdf',
'YMML.2019.09.13-Run.2-Initial.pdf',
'YMML.2019.09.12-Run.1-Final.pdf',
'YMML.2019.09.13-Run.1-Final.pdf',
'YMML.2019.09.14-Run.1-Final.pdf',]
dates = set([item[5:15] for item in file_list])
for date in dates:
    if 'YMML.' + date + '-Run.1-Final.pdf' not in file_list:
        print('YMML.' + date + '-Run.1-Final.pdf is missing')
    if 'YMML.' + date + '-Run.2-Initial.pdf' not in file_list:
        print('YMML.' + date + '-Run.2-Initial.pdf is missing')
set() keeps only the unique dates, to avoid checking the same date more than once.
I'm kind of late, but here's what I found to be the simplest way, maybe not the most efficient:
for file in fileList:
    if file[20:27] == "1-Final":
        if (file[0:20] + "2-Initial.pdf") not in fileList:
            print(file)
    elif file[20:29] == "2-Initial":
        if (file[0:20] + "1-Final.pdf") not in fileList:
            print(file)
Here's an O(n) solution which collects items into a defaultdict by date, then filters on quantity seen, restoring original names from the remaining value:
from collections import defaultdict
files = [
'YMML.2019.09.10-Run.1-Final.pdf',
'YMML.2019.09.10-Run.2-Initial.pdf',
'YMML.2019.09.11-Run.2-Initial.pdf',
'YMML.2019.09.11-Run.1-Final.pdf',
'YMML.2019.09.12-Run.2-Initial.pdf',
'YMML.2019.09.13-Run.2-Initial.pdf',
'YMML.2019.09.12-Run.1-Final.pdf',
'YMML.2019.09.13-Run.1-Final.pdf',
'YMML.2019.09.14-Run.1-Final.pdf',
]
seen = defaultdict(list)
for x in files:
    seen[x[5:15]].append(x)
missing = [v[0] for k, v in seen.items() if len(v) < 2]
print(missing)  # => ['YMML.2019.09.14-Run.1-Final.pdf']
Getting names of partners can be done with a conditional:
names = [
    x[:20] + "2-Initial.pdf" if x[20] == "1" else
    x[:20] + "1-Final.pdf" for x in missing
]
print(names)  # => ['YMML.2019.09.14-Run.2-Initial.pdf']
This works:
fileList = ['YMML.2019.09.10-Run.1-Final.pdf',
'YMML.2019.09.10-Run.2-Initial.pdf',
'YMML.2019.09.11-Run.2-Initial.pdf',
'YMML.2019.09.11-Run.1-Final.pdf',
'YMML.2019.09.12-Run.2-Initial.pdf',
'YMML.2019.09.13-Run.2-Initial.pdf',
'YMML.2019.09.12-Run.1-Final.pdf',
'YMML.2019.09.13-Run.1-Final.pdf',
'YMML.2019.09.14-Run.1-Final.pdf',]
initial_set = {filename[:15] for filename in fileList if 'Initial' in filename}
final_set = {filename[:15] for filename in fileList if 'Final' in filename}
for filename in final_set - initial_set:
    print(filename + '-Run.2-Initial.pdf is missing.')
for filename in initial_set - final_set:
    print(filename + '-Run.1-Final.pdf is missing.')
I am using the following function of a class to find out whether every .csv has a corresponding .csv.meta in the given directory.
I am getting None for files which are just .csv, and a match object (printed as a hexadecimal address) for .csv.meta files.
Result
None
<_sre.SRE_Match object at 0x1bb4300>
None
<_sre.SRE_Match object at 0xbd6378>
This is the code:
def validate_files(self, filelist):
    try:
        local_meta_file_list = []
        local_csv_file_list = []
        # Validate each file and see if they pair properly based on
        # the patterns *.csv and *.csv.meta
        for tmp_file_str in filelist:
            csv_match = re.search(self.vprefix_pattern + '([0-9]+)' + self.vcsv_file_postfix_pattern + '$', tmp_file_str)
            if csv_match:
                local_csv_file_list.append(csv_match.group())
                meta_file_match_pattern = self.vprefix_pattern + csv_match.group(1) + self.vmeta_file_postfix_pattern
                tmp_meta_file = [os.path.basename(s) for s in filelist if meta_file_match_pattern in s]
                local_meta_file_list.extend(tmp_meta_file)
    except Exception, e:
        print e
        self.m_logger.error("Error: Validate File Process thrown exception " + str(e))
        sys.exit(1)
    return local_csv_file_list, local_meta_file_list
These are the file names:
rp_package.1406728501.csv.meta
rp_package.1406728501.csv
rp_package.1402573701.csv.meta
rp_package.1402573701.csv
rp_package.1428870707.csv
rp_package.1428870707.meta
If all you need is to find .csv files which have corresponding .csv.meta files, then I don’t think you need to use regular expressions for filtering them. We can filter the file list for those with the .csv extension, then filter that list further for files whose name, plus .meta, appears in the file list.
Here’s a simple example:
myList = [
'rp_package.1406728501.csv.meta',
'rp_package.1406728501.csv',
'rp_package.1402573701.csv.meta',
'rp_package.1402573701.csv',
'rp_package.1428870707.csv',
'rp_package.1428870707.meta',
]
def validate_files(file_list):
    loc_csv_list = filter(lambda x: x[-3:].lower() == 'csv', file_list)
    loc_meta_list = filter(lambda c: '%s.meta' % c in file_list, loc_csv_list)
    return loc_csv_list, loc_meta_list

print validate_files(myList)
If there may be CSV files that don’t conform to the rp_package format, and need to be excluded, then we can initially filter the file list using the regex. Here’s an example (swap out the regex parameters as necessary):
import re
vprefix_pattern = 'rp_package.'
vcsv_file_postfix_pattern = '.csv'
regex_str = vprefix_pattern + '[0-9]+' + vcsv_file_postfix_pattern
def validate_files(file_list):
    csv_list = filter(lambda x: re.search(regex_str, x), file_list)
    loc_csv_list = filter(lambda x: x[-3:].lower() == 'csv', csv_list)
    loc_meta_list = filter(lambda c: '%s.meta' % c in file_list, loc_csv_list)
    return loc_csv_list, loc_meta_list

print validate_files(myList)
I am writing a script in Python to consolidate images from different folders into a single folder. There is a possibility of multiple image files with the same name. How do I handle this in Python? I need to rename those to "image_name_0001", "image_name_0002", and so on.
You can maintain a dict with the count of each name seen so far and then use os.rename() to rename the file to the new name.
for example:
dic = {}
list_of_files = ["a", "a", "b", "c", "b", "d", "a"]
for f in list_of_files:
    if f in dic:
        dic[f] += 1
        new_name = "{0}_{1:03d}".format(f, dic[f])
        print new_name
    else:
        dic[f] = 0
        print f
Output:
a
a_001
b
c
b_001
d
a_002
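The counting demo only prints the new names; renaming real files also needs an existence check on disk and should preserve the extension. A hedged sketch (the helper name unique_name is my own, not from the question) that produces the "_0001"-style suffix asked for:

```python
import os
import tempfile

def unique_name(target_dir, filename):
    """Return a collision-free path in target_dir, preserving the
    extension and using an _0001-style numeric suffix."""
    name, ext = os.path.splitext(filename)
    candidate = filename
    count = 0
    while os.path.exists(os.path.join(target_dir, candidate)):
        count += 1
        candidate = "{0}_{1:04d}{2}".format(name, count, ext)
    return os.path.join(target_dir, candidate)

# Demo in a throwaway directory:
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "image_name.jpg"), "w").close()
print(unique_name(demo_dir, "image_name.jpg"))  # ends with image_name_0001.jpg
```

The returned path can then be passed straight to os.rename() or shutil.move().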
If you have the root filename, e.g. name = 'image_name', the extension, extension = '.jpg', and the path to the output folder, path, you can do:
import os
import shutil

# for each file (src_file is the file currently being moved):
moved = 0
num = 0
if os.path.exists(path + name + extension):
    while moved == 0:
        num += 1
        modifier = '_00' + str(num)
        if not os.path.exists(path + name + modifier + extension):
            shutil.move(src_file, path + name + modifier + extension)
            moved = 1
else:
    shutil.move(src_file, path + name + extension)
src_file and the loop over the files are left as placeholders, but you should get the gist.