I am trying to sort DICOMs of multiple subjects into their respective folders based on their PatientID. The current directory holds the DICOMs for all subjects, unsorted. I can already walk the DICOM directory, group subjects by PatientID, and count how many DICOMs each subject has. Is it possible to copy or move the DICOMs to another directory, sorted into one folder per PatientID?
code:
import os
from collections import Counter

import pandas as pd
import pydicom
import torch
import tqdm

os.listdir('\\dicoms')
device = torch.device("cuda")
print(device)

input_path = '\\dicoms\\'

ds_columns = ['ID', 'PatientID', 'Modality', 'StudyInstance',
              'SeriesInstance', 'PhotoInterpretation', 'Position0',
              'Position1', 'Position2', 'Orientation0', 'Orientation1',
              'Orientation2', 'Orientation3', 'Orientation4', 'Orientation5',
              'PixelSpacing0', 'PixelSpacing1']

def extract_dicom_features(ds):
    ds_items = [ds.SOPInstanceUID,
                ds.PatientID,
                ds.Modality,
                ds.StudyInstanceUID,
                ds.SeriesInstanceUID,
                ds.PhotometricInterpretation,
                ds.ImagePositionPatient,
                ds.ImageOrientationPatient,
                ds.PixelSpacing]
    line = []
    for item in ds_items:
        if type(item) is pydicom.multival.MultiValue:
            line += [float(x) for x in item]
        else:
            line.append(item)
    return line

list_img = os.listdir(input_path + 'imgs')
print(len(list_img))

df_features = []
for img in tqdm.tqdm(list_img):
    img_path = input_path + 'imgs/' + img
    ds = pydicom.read_file(img_path)
    df_features.append(extract_dicom_features(ds))

df_features = pd.DataFrame(df_features, columns=ds_columns)
df_features.head()
df_features.to_csv('\\meta.csv')
print(Counter(df_features['PatientID']))
print(Counter(df_features['PatientID']))
example of metadata:
,ID,PatientID,Modality,StudyInstance,SeriesInstance,PhotoInterpretation,Position0,Position1,Position2,Orientation0,Orientation1,Orientation2,Orientation3,Orientation4,Orientation5,PixelSpacing0,PixelSpacing1
0,ID_000012eaf,ID_f15c0eee,CT,ID_30ea2b02d4,ID_0ab5820b2a,MONOCHROME2,-125.0,-115.89798,77.970825,1.0,0.0,0.0,0.0,0.927184,-0.374607,0.488281,0.488281
example of Counter output:
Counter({'ID_19702df6': 28, 'ID_b799ed34': 26, 'ID_e3523464': 26, 'ID_cd9169c2': 26, 'ID_e326a8a4': 24, 'ID_45da90cb': 24, 'ID_99e4f787': 24, 'ID_df751e93': 24, 'ID_929a5b39': 20})
I added the following code to try to sort the images into subdirectories, but I run into an error:
dest_path = input_path+'imageProcessDir'
counter = 0
for index, rows in df_features.iterrows():
    filename = basename(rows['ID'])
    image = cv2.imread(input_path+rows['ID'])
    counter = counter + 1
    fold = rows['PatientID']+"/"
    dest_fold = dest_path+fold
    cv2.imwrite(dest_fold+"/"+filename+ "_" +str(counter)+".dcm", img)
error:
Traceback (most recent call last):
  File "ct_move.py", line 77, in <module>
    cv2.imwrite(dest_fold+"/"+filename+ "_" +str(counter)+".dcm", img)
TypeError: Expected cv::UMat for argument 'img'
I'd also second ditching CV for this - it is overkill.
Try pydicom instead.
What I'd do for your problem (move all files with the same patient ID into their own folder, and count how many there are for each) is:
get a list of dicom files (use glob.glob to search a directory and/or just pass in the full file list via argv)
load all those files into a list of pydicom dicom file objects (DataSets), so something like:
import glob
import sys

import pydicom

files = []
for fname in glob.glob(sys.argv[1], recursive=False):
    print("loading: {}".format(fname))
    files.append(pydicom.read_file(fname))
go through that list, create the patient's directory if required, and move each file into it. So, conceptually, something like:
import os
import shutil
from collections import defaultdict

# dict for counting number of files for each patient ID
patient_id_count = defaultdict(int)
for f in files:
    id = f.PatientID  # this gets the patient ID from the current file
    if not os.path.isdir(id):
        os.makedirs(id)
    shutil.move(f.filename, id)  # datasets read from disk keep their source path in .filename
    patient_id_count[id] += 1
To address your issue, it seems like overkill to use opencv here at all. If all you want to do is move the dicom images from one location to another on the filesystem, you can use os.rename or shutil.move. Unless you are modifying image content, these are cleaner and faster solutions.
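For instance, a minimal sketch of a single move (the paths here are purely illustrative, not from the question):
import shutil

# shutil.move will not create the destination folder, so it must already exist
shutil.move('dicoms/ID_000012eaf.dcm', 'sorted/ID_f15c0eee/ID_000012eaf.dcm')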
I noticed two little things in your last code block:
First, I think you want the fold variable to have the "/" prefixed instead of suffixed for the paths to work.
Also, the counter will continue to increment across all dicoms, where I think you want it to increment on a per-subject basis (I am assuming that df_features is sorted on PatientID here; if it is not, maybe you could use the Counter class).
dest_path = input_path + 'imageProcessDir'
counter = 0
prev_fold = '/' + df_features.loc[0, 'PatientID']
for index, rows in df_features.iterrows():
    filename = basename(rows['ID'])
    fold = '/' + rows['PatientID']
    if fold != prev_fold:
        counter = 0  # reset when the PatientID changes
        prev_fold = fold
    counter = counter + 1
    dest_fold = dest_path + fold
    out_file = dest_fold + "/" + filename + "_" + str(counter) + ".dcm"
    os.rename(input_path + rows['ID'], out_file)
I would also use os.path.join to handle filesystem paths instead of adding "/" to everything:
fold = rows['PatientID']
dest_fold = os.path.join(dest_path, fold)
as I think there is also an issue with the input file path: input_path + rows['ID']
edit:
This is to get rid of the use of '/' and put in os.path.join
dest_path = os.path.join(input_path, 'imageProcessDir')
counter = 0
prev_fold = df_features.loc[0, 'PatientID']
for index, rows in df_features.iterrows():
    filename = basename(rows['ID'])
    fold = rows['PatientID']
    if fold != prev_fold:
        counter = 0  # reset when the PatientID changes
        prev_fold = fold
    counter = counter + 1
    dest_fold = os.path.join(dest_path, fold)
    os.makedirs(dest_fold, exist_ok=True)  # make sure target folder exists
    out_file = os.path.join(dest_fold, filename + "_" + str(counter) + ".dcm")
    os.rename(os.path.join(input_path, rows['ID']), out_file)
Also, note that os.rename(os.path.join(input_path, rows['ID']), out_file) may need to be os.rename(os.path.join(input_path, rows['ID'] + '.dcm'), out_file)
If it's not too much trouble, you may want to make a backup of your files before attempting this, to make sure you get the result you want!
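For example, one quick way to snapshot the input folder first (the backup name here is just illustrative):
import shutil

# copy the whole input tree aside before moving anything around
shutil.copytree(input_path, 'dicoms_backup')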
Thank you, I solved the problem with your help.
Solution:
import os
from os.path import basename
from collections import Counter

import pandas as pd
import pydicom
import torch
import tqdm

os.listdir('directory')
device = torch.device("cuda")
print(device)

input_path = 'directory\\'

ds_columns = ['ID', 'PatientID', 'Modality', 'StudyInstance',
              'SeriesInstance', 'PhotoInterpretation', 'Position0',
              'Position1', 'Position2', 'Orientation0', 'Orientation1',
              'Orientation2', 'Orientation3', 'Orientation4', 'Orientation5',
              'PixelSpacing0', 'PixelSpacing1']

def extract_dicom_features(ds):
    ds_items = [ds.SOPInstanceUID,
                ds.PatientID,
                ds.Modality,
                ds.StudyInstanceUID,
                ds.SeriesInstanceUID,
                ds.PhotometricInterpretation,
                ds.ImagePositionPatient,
                ds.ImageOrientationPatient,
                ds.PixelSpacing]
    line = []
    for item in ds_items:
        if type(item) is pydicom.multival.MultiValue:
            line += [float(x) for x in item]
        else:
            line.append(item)
    return line

list_img = os.listdir(input_path)
print(len(list_img))
print('***********')
print(list_img)
print('***********')

df_features = []
for img in tqdm.tqdm(list_img):
    img_path = input_path + img
    ds = pydicom.read_file(img_path)
    df_features.append(extract_dicom_features(ds))

df_features = pd.DataFrame(df_features, columns=ds_columns)
print(df_features)
print('***********')
df_features.head()
df_features.to_csv('\\test_meta.csv')
print(Counter(df_features['PatientID']))
print('***********')

df_features['ID'] = df_features['ID'].astype(str) + ".dcm"
print(df_features)
print('***********')

dest_path = '\\sorted'
counter = 0
prev_fold = '\\' + df_features.loc[0, 'PatientID']
for index, rows in df_features.iterrows():
    filename = basename(rows['ID'])
    counter = counter + 1
    fold = '\\' + rows['PatientID']
    dest_fold = dest_path + fold
    out_file = os.path.join(dest_fold, filename)
    print(out_file)
    print('-------------')
    if not os.path.exists(dest_fold):
        os.mkdir(dest_fold)
    os.rename(os.path.join(input_path, rows['ID']), out_file)
    if fold != prev_fold:
        counter = 0
        prev_fold = fold
Related
I'm working on a project where the file names contain actual dates, but the data for each date is split across multiple files.
I developed the following program to count the number of files for each date (taken from the filename) and their total size.
Is there a better way to achieve the same?
import os
import glob
import collections

directory_name = "\\SpeicifDir\\"

# Get a list of files (file paths) in the given directory
list_of_files = filter(os.path.isfile,
                       glob.glob(directory_name + '*.txt'))

mapOfDateFileSize = collections.defaultdict(list)

# For all the files
for file_path in list_of_files:
    file_size = os.stat(file_path).st_size
    filename = os.path.basename(file_path)
    # Extract the date by splitting the file name using - as a separator
    splitFilename = filename.split('-')
    dtForFile = splitFilename[1] + "-" + splitFilename[2] + "-" + splitFilename[3]
    # Get the file name and size
    if dtForFile in mapOfDateFileSize:
        dataFromDictionary = mapOfDateFileSize[dtForFile]
        dataFromDictionary = dataFromDictionary[0]
        totalCount = dataFromDictionary[0]
        totalSize = dataFromDictionary[1]
        totalCount = totalCount + 1
        totalSize = totalSize + file_size
        # Update the file size and count
        mapOfDateFileSize[dtForFile] = [(totalCount, totalSize)]
    else:
        mapOfDateFileSize[dtForFile].append((1, file_size))

# For each date get the total size, total file count
for dt, elements in mapOfDateFileSize.items():
    dataFromDictionary = elements[0]
    totalCount = dataFromDictionary[0]
    totalSize = dataFromDictionary[1]
    print(dt, ",", totalCount, ",", totalSize)
I have a requirement to get the file details for certain locations (within the system and over SFTP) and to get the file size for some locations on SFTP, which I do with the code below.
def getFileDetails(location: str):
    filenames: list = []
    if location.find(":") != -1:
        for file in glob.glob(location):
            filenames.append(getFileNameFromFilePath(file))
    else:
        with pysftp.Connection(host=myHostname, username=myUsername, password=myPassword) as sftp:
            remote_files = [x.filename for x in sorted(sftp.listdir_attr(location), key=lambda f: f.st_mtime)]
            if location == LOCATION_SFTP_A:
                for filename in remote_files:
                    filenames.append(filename)
                    sftp_archive_d_size_mapping[filename] = sftp.stat(location + "/" + filename).st_size
            elif location == LOCATION_SFTP_B:
                for filename in remote_files:
                    filenames.append(filename)
                    sftp_archive_e_size_mapping[filename] = sftp.stat(location + "/" + filename).st_size
            else:
                for filename in remote_files:
                    filenames.append(filename)
            sftp.close()
    return filenames
There are more than 10,000 files in LOCATION_SFTP_A and LOCATION_SFTP_B. For each file, I need to get the file size. To get the size I am using
sftp_archive_d_size_mapping[filename] = sftp.stat(location + "/" + filename).st_size
sftp_archive_e_size_mapping[filename] = sftp.stat(location + "/" + filename).st_size
# Time Taken : 5 min+
sftp_archive_d_size_mapping[filename] = 1 #sftp.stat(location + "/" + filename).st_size
sftp_archive_e_size_mapping[filename] = 1 #sftp.stat(location + "/" + filename).st_size
# Time Taken : 20-30 s
If I comment out sftp.stat(location + "/" + filename).st_size and assign a static value, it takes only 20-30 seconds to run the entire code. I am looking for a way to optimize the time it takes to get the file size details.
The Connection.listdir_attr already gives you the file size in SFTPAttributes.st_size.
There's no need to call Connection.stat for each file to get the size (again).
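A sketch of that, reusing the names from the question (st_size and filename come straight off the SFTPAttributes objects the listing already returns):
with pysftp.Connection(host=myHostname, username=myUsername, password=myPassword) as sftp:
    for attr in sorted(sftp.listdir_attr(location), key=lambda f: f.st_mtime):
        filenames.append(attr.filename)
        # the size was fetched with the listing - no extra stat() round trip per file
        sftp_archive_d_size_mapping[attr.filename] = attr.st_size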
See also:
With pysftp or Paramiko, how can I get a directory listing complete with attributes?
How to fetch sizes of all SFTP files in a directory through Paramiko
I have a list of strings:
fileList = ['YMML.2019.09.10-Run.1-Final.pdf',
'YMML.2019.09.10-Run.2-Initial.pdf',
'YMML.2019.09.11-Run.2-Initial.pdf',
'YMML.2019.09.11-Run.1-Final.pdf',
'YMML.2019.09.12-Run.2-Initial.pdf',
'YMML.2019.09.13-Run.2-Initial.pdf',
'YMML.2019.09.12-Run.1-Final.pdf',
'YMML.2019.09.13-Run.1-Final.pdf',
'YMML.2019.09.14-Run.1-Final.pdf',]
and I'd like to confirm that there is both a Run.1-Final and Run.2-Initial for each date.
I've tried something like:
for i in range(len(directoryList)):
    if directoryList[i][5:15] != directoryList[i + 1][5:15]:
        print(directoryList[i] + ' is missing.')
        i += 2
and I'd like the output to be:
YMML.2019.09.14-Run.2-Initial.pdf is missing
Perhaps something like
dates = [directoryList[i][5:15] for i in range(len(directoryList))]
counter = collections.Counter(dates)
But then having trouble extracting from the dictionary.
To make it more readable, you could create a list of dates first, then loop over those.
file_list = ['YMML.2019.09.10-Run.1-Final.pdf',
             'YMML.2019.09.10-Run.2-Initial.pdf',
             'YMML.2019.09.11-Run.2-Initial.pdf',
             'YMML.2019.09.11-Run.1-Final.pdf',
             'YMML.2019.09.12-Run.2-Initial.pdf',
             'YMML.2019.09.13-Run.2-Initial.pdf',
             'YMML.2019.09.12-Run.1-Final.pdf',
             'YMML.2019.09.13-Run.1-Final.pdf',
             'YMML.2019.09.14-Run.1-Final.pdf',]

dates = set([item[5:15] for item in file_list])
for date in dates:
    if 'YMML.' + date + '-Run.1-Final.pdf' not in file_list:
        print('YMML.' + date + '-Run.1-Final.pdf is missing')
    if 'YMML.' + date + '-Run.2-Initial.pdf' not in file_list:
        print('YMML.' + date + '-Run.2-Initial.pdf is missing')
set() takes the unique values in the list to avoid looping through them all twice.
I'm kind of late, but here's what I found to be the simplest way, maybe not the most efficient:
for file in fileList:
    if file[20:27] == "1-Final":
        if (file[0:20] + "2-Initial.pdf") not in fileList:
            print(file)
    elif file[20:29] == "2-Initial":
        if (file[0:20] + "1-Final.pdf") not in fileList:
            print(file)
Here's an O(n) solution which collects items into a defaultdict by date, then filters on quantity seen, restoring original names from the remaining value:
from collections import defaultdict
files = [
'YMML.2019.09.10-Run.1-Final.pdf',
'YMML.2019.09.10-Run.2-Initial.pdf',
'YMML.2019.09.11-Run.2-Initial.pdf',
'YMML.2019.09.11-Run.1-Final.pdf',
'YMML.2019.09.12-Run.2-Initial.pdf',
'YMML.2019.09.13-Run.2-Initial.pdf',
'YMML.2019.09.12-Run.1-Final.pdf',
'YMML.2019.09.13-Run.1-Final.pdf',
'YMML.2019.09.14-Run.1-Final.pdf',
]
seen = defaultdict(list)
for x in files:
    seen[x[5:15]].append(x)

missing = [v[0] for k, v in seen.items() if len(v) < 2]
print(missing)  # => ['YMML.2019.09.14-Run.1-Final.pdf']
Getting names of partners can be done with a conditional:
names = [
    x[:20] + "2-Initial.pdf" if x[20] == "1" else
    x[:20] + "1-Final.pdf" for x in missing
]
print(names) # => ['YMML.2019.09.14-Run.2-Initial.pdf']
This works:
fileList = ['YMML.2019.09.10-Run.1-Final.pdf',
'YMML.2019.09.10-Run.2-Initial.pdf',
'YMML.2019.09.11-Run.2-Initial.pdf',
'YMML.2019.09.11-Run.1-Final.pdf',
'YMML.2019.09.12-Run.2-Initial.pdf',
'YMML.2019.09.13-Run.2-Initial.pdf',
'YMML.2019.09.12-Run.1-Final.pdf',
'YMML.2019.09.13-Run.1-Final.pdf',
'YMML.2019.09.14-Run.1-Final.pdf',]
initial_set = {filename[:15] for filename in fileList if 'Initial' in filename}
final_set = {filename[:15] for filename in fileList if 'Final' in filename}

for filename in final_set - initial_set:
    print(filename + '-Run.2-Initial.pdf is missing.')
for filename in initial_set - final_set:
    print(filename + '-Run.1-Final.pdf is missing.')
OK, I have an obvious problem staring me in the face that I can't figure out. I am getting the output/results I need, but I get TypeError: "string indices must be integers, not str". The following is a sample of my code. It is because of the statement "if f not in GetSquishySource(dirIn)". Basically, I am checking whether a specific file is in another list so that I don't end up adding it to a zip file I am creating. I just don't see the problem here or how to get around it. Any help would be appreciated.
def compressLists(z, dirIn, dirsIn, filesIn, encrypt=None):
    try:
        with zipfile.ZipFile(z, 'w', compression=zipfile.ZIP_DEFLATED) as zip:
            # Add files
            compressFileList(z, dirIn, dirIn, filesIn, zip, encrypt)
            # Add directories
            for dir in dirsIn:
                dirPath = os.path.join(dirIn, dir["name"])
                for root, dirs, files in os.walk(dirPath):
                    # Ignore hidden files and directories
                    files = [f for f in files if not f[0] == '.']
                    dirs[:] = [d for d in dirs if not d[0] == '.']
                    # Replace file entries with structure value entries
                    for i, f in enumerate(files):
                        del files[i]
                        if f not in GetSquishySource(dirIn):
                            files.insert(i, {'zDir': dir["zDir"], 'name': f})
                    compressFileList(z, dirIn, root, files, zip, encryptedLua)
                    if dir["recurse"] == False:
                        break
The following is the GetSquishySource function I created and call.
def GetSquishySource(srcDir):
    squishyLines = []
    srcToRemove = []
    if os.path.isfile(srcDir + os.path.sep + "squishy"):
        with open(srcDir + os.path.sep + "squishy") as squishyFile:
            squishyContent = squishyFile.readlines()
        for line in squishyContent:
            if line.startswith("Module") and line is not None:
                squishyLines.append(line.split(' '))
        for s in squishyLines:
            if len(s) == 3 and s is not None:
                # If the 3rd column in the squishy file contains data, use that.
                path = s[2].replace('Module "', '').replace('"', '').replace("\n", '')
                srcToRemove.append(os.path.basename(path))
            elif len(s) == 2 and s is not None:
                # If the 3rd column in the squishy file contains no data, then use the 2nd column.
                path = s[1].replace('Module "', '').replace('"', '').replace("\n", '').replace(".", os.path.sep) + ".lua"
                srcToRemove.append(os.path.basename(path))
    return srcToRemove
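Incidentally, the loop in compressLists calls GetSquishySource(dirIn) for every file, re-reading and re-parsing the squishy file each time. Hoisting the result into a set once before the walk keeps the membership test cheap; a sketch of that change:
squishy_names = set(GetSquishySource(dirIn))  # parse the squishy file once, up front
# ...then inside the walk, test against the precomputed set:
if f not in squishy_names:
    files.insert(i, {'zDir': dir["zDir"], 'name': f})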
I am writing a script in Python to consolidate images from different folders into a single folder. There is a possibility of multiple image files with the same name. How do I handle this in Python? I need to rename duplicates like "image_name_0001", "image_name_0002".
You can maintain a dict with the count of each name seen so far and then use os.rename() to rename the file to its new name.
for example:
dic = {}
list_of_files = ["a", "a", "b", "c", "b", "d", "a"]
for f in list_of_files:
    if f in dic:
        dic[f] += 1
        new_name = "{0}_{1:03d}".format(f, dic[f])
        print(new_name)
    else:
        dic[f] = 0
        print(f)
Output:
a
a_001
b
c
b_001
d
a_002
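To go from printing to actually moving files on disk, the same bookkeeping can drive the rename; a sketch, where list_of_file_paths (the gathered source paths) and dest_dir (the output folder) are hypothetical placeholders:
import os
import shutil

counts = {}
for src in list_of_file_paths:  # hypothetical: full paths collected from the source folders
    base, ext = os.path.splitext(os.path.basename(src))
    if base in counts:
        counts[base] += 1
        # four-digit suffix to match the image_name_0001 style from the question
        new_name = "{0}_{1:04d}{2}".format(base, counts[base], ext)
    else:
        counts[base] = 0
        new_name = base + ext
    shutil.move(src, os.path.join(dest_dir, new_name))  # dest_dir: hypothetical output folder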
If you have the root filename, e.g. name = 'image_name', the extension, e.g. extension = '.jpg', and the path to the output folder, path, you can do the following for each file (src_file stands for the full path of the file being moved):
import os
import shutil

moved = False
num = 0
if os.path.exists(os.path.join(path, name + extension)):
    while not moved:
        num += 1
        modifier = '_{:03d}'.format(num)
        if not os.path.exists(os.path.join(path, name + modifier + extension)):
            # found a free numbered slot - move the file there
            shutil.move(src_file, os.path.join(path, name + modifier + extension))
            moved = True
else:
    # no clash at all - move under the original name
    shutil.move(src_file, os.path.join(path, name + extension))
The loop simply walks the numbered suffixes until it finds a free name, but you should get the gist.