Searching for data in dataframe - python

Firstly, my apologies if this question is too simple / obvious.
My question is:
I am using nested loops to check whether certain images are listed in a dataframe ('old_df'). If they are present, I add them to an empty list ('new_list').
Is there a faster or more performant way to do this?
images = []
for root, dirs, files in os.walk('/gdrive/MyDrive/CNN_Tute/data/images/'):
    for file in files:
        images.append(file)

new_list = []
for i in range(len(images)):
    for j in range(len(old_df)):
        if images[i] == old_df.iloc[j, 0]:
            new_list.append(old_df.iloc[j, :])

If you want to test the first column by position:
images = [file for root, dirs, files in os.walk('/gdrive/MyDrive/CNN_Tute/data/images/')
          for file in files]
new_list = old_df.iloc[old_df.iloc[:, 0].isin(images).to_numpy(), 0].tolist()
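For comparison, here is a minimal runnable sketch of the same `isin` idea on a made-up DataFrame (the column names and values below are invented for the demo); the boolean mask can also keep whole rows, which is what the original nested loops collect:

```python
import pandas as pd

# Hypothetical stand-in for old_df: image names in the first column.
old_df = pd.DataFrame({"image": ["a.png", "b.png", "c.png"],
                       "label": [0, 1, 0]})
images = ["a.png", "c.png", "missing.png"]

# Boolean mask over the first column, selected by position.
mask = old_df.iloc[:, 0].isin(images)
matching_rows = old_df[mask]            # whole rows, like the nested loops collect
matching_names = matching_rows.iloc[:, 0].tolist()
print(matching_names)
```

This replaces the O(len(images) * len(old_df)) nested loops with a single vectorized membership test.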

You can achieve this in two lines:
images = [file for _, _, files in os.walk('/gdrive/MyDrive/CNN_Tute/data/images/') for file in files]
new_labels_df = xr_df[xr_df[[0]].isin(images)]


Unable to append in order

import os
import cv2

train_dir = "/Images/train/"
data = []
for i in os.listdir(train_dir):
    path = os.path.join(train_dir, i)
    img = cv2.imread(path)
    print(i)
    data.append(img)
My train directory has 49000 images in order img(1), img(2), ..., img(49000)
I want to append these images in this order only but they are getting appended in a different order (as shown in the image).
Any help?
I want to append them as img(1).png, img(2).png, img(3).png, and so on.
Using the built-in sorted function helped me.
import re

data = []
train_dir = "/Images/train/"
files = os.listdir(train_dir)
# For names like "img(1).png", pull the number out with a regex rather than
# calling int() on the whole stem, which would raise a ValueError.
files = sorted(files, key=lambda x: int(re.search(r"\d+", x).group()))
for i in files:
    path = os.path.join(train_dir, i)
    img = cv2.imread(path)
    data.append(img)
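As a self-contained sketch of the numeric-key idea (the filenames below are invented), extracting the digits avoids the lexicographic trap where img(10) would sort before img(2):

```python
import re

def numeric_key(name):
    # Pull the first run of digits out of names like "img(12).png".
    return int(re.search(r"\d+", name).group())

names = ["img(10).png", "img(2).png", "img(1).png"]
names.sort(key=numeric_key)
print(names)  # numeric order: img(1), img(2), img(10)
```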
So all you need to do is collect the image names in a Python list, which is `filenames` in my solution. Python lists have a `sort()` method that sorts the names in place, so the `filenames` list ends up in sorted order relative to the image names in your directory. Iterating through the list then yields the names in that sorted order.
train_dir = "/Images/train/"
filenames = [img for img in os.listdir(train_dir)]
filenames.sort()
data = []
for i in filenames:
    path = os.path.join(train_dir, i)
    img = cv2.imread(path)
    print(i)
    data.append(img)

Can I loop through directories and subdirectories and store certain files in an array?

I have one folder that contains many subfolders, and images within those subfolders. I have code that loops through the folders and subfolders and prints out the name of each image one at a time. I want all of these image names to be stored in a single array. How do I get my loop to append each image name to the same array?
I have only seen similar solutions for Linux or Matlab so far, but not in Python.
files = []
# r=root, d=directories, f=files
for r, d, f in os.walk(path):
    for face_image in f:
        if face_image.endswith("g"):  # to get all '.jpg' and all '.png' files
            print(face_image)
When I run the loop above, I get all ~1000 image names printed. But when I then try to print(face_image) outside of the loop, only the name of the final image in the loop is printed. I now know this is because I have not appended each name to a list, but I am not sure how to go about this. Any help would be massively appreciated!
Using pathlib and a recursive glob pattern:
from pathlib import Path

file_types = ("jpg", "png")
file_paths = []
for file_type in file_types:
    file_paths.extend(Path(".").glob(f"**/*.{file_type}"))
file_names = [file_path.name for file_path in file_paths]
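To see the recursive glob in action, here is a runnable sketch that first builds a small throwaway tree (all names are invented for the demo):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.jpg").touch()
    (root / "sub" / "b.png").touch()
    (root / "sub" / "c.txt").touch()   # should be ignored

    file_paths = []
    for file_type in ("jpg", "png"):
        # "**/" makes the glob descend into subdirectories.
        file_paths.extend(root.glob(f"**/*.{file_type}"))
    file_names = sorted(p.name for p in file_paths)

print(file_names)
```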
After your print statement, you can use files.append(face_image) to add the face image to your list. When the loops are done, all valid image names will be in the list for you to use.
I wasn't sure if this was a legit question or not. You need to append the files to the list.
files = []
# r=root, d=directories, f=files
for r, d, f in os.walk(path):
    for face_image in f:
        if face_image.endswith("g"):  # to get all '.jpg' and all '.png' files
            print(face_image)
            files.append(face_image)
You could try something like this:
files = []
for r, d, f in os.walk(path):
    # collect all images
    files += [os.path.join(r, file) for file in f]
# filter images
files = [ff for ff in files if ff.endswith('g')]
or a little more compact:
files = []
for r, d, f in os.walk(path):
    # collect all images that end with 'g'
    files += [os.path.join(r, file) for file in f if file.endswith('g')]
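One refinement worth noting: `str.endswith` also accepts a tuple of suffixes, which is stricter than `'g'` (that would match `.svg` too). A runnable sketch using a throwaway directory (file names invented for the demo):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as path:
    for name in ("x.jpg", "y.png", "z.svg"):
        open(os.path.join(path, name), "w").close()

    files = []
    for r, d, f in os.walk(path):
        # A tuple of suffixes only matches the extensions we actually want.
        files += [file for file in f if file.endswith((".jpg", ".png"))]

files.sort()
print(files)
```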

Python: os.walk to specific directory and file

I have a file structure that looks like:
|num_1
|----|dir1
|--------|dir2
|------------|dcm
|----------------\file_1
|----------------\file_2
|----------------\file_n
|num_2
|----|dir1
|--------|dcm
|------------\file_1
|------------\file_n
|num_n
I want to use os.walk (or something more appropriate?) to traverse the tree until it finds the directory "dcm". dcm can be at varying levels of the tree.
This is what I have. Thanks!
import dicom
import re
import os
from os.path import isfile, join

dcm = []
PATH = r"C:\foo"  # raw string so the backslash isn't treated as an escape
# find the directory we want to get to, save path
for path, dirs, files in os.walk(PATH):
    for dirname in dirs:
        fullpath = os.path.join(path, dirname)
        if "dcm" in dirname:
            # copied this first_file line - just want a fast and easy way to grab
            # ONE file in the dcm directory without reading any of the others
            # (for time reasons)
            first_file = next((f for f in os.listdir(fullpath)
                               if isfile(join(fullpath, f))), "none")
            dcm.append(os.path.join(fullpath, first_file))
I went ahead with the "lazy" way and used listdir to read out all of the files under the dcm directory - decided that the resource cost wasn't too high.
That being said, I think that pulling out a single random file from a directory without reading all of those files is an interesting query that someone more Python oriented than I should answer!
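On that point, one way to pull a single file without building the whole listing is `os.scandir`, which yields entries lazily, so `next` returns after the first match (the directory contents below are invented for the demo):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as dcm_dir:
    for name in ("file_1", "file_2", "file_3"):
        open(os.path.join(dcm_dir, name), "w").close()

    # scandir is a lazy iterator: next() stops at the first file entry
    # instead of materializing the full directory listing.
    with os.scandir(dcm_dir) as it:
        first_file = next((e.path for e in it if e.is_file()), "none")

print(first_file)
```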
For reference, my final solution was... do excuse the inefficiencies in iterator usage! I am new and needed a quick solution
for path, dirs, filenames in os.walk(rootDir):  # omit files, loop through later
    for dirname in dirs:
        fullpath = os.path.join(path, dirname)
        if "dcm" in dirname:
            dcm.append(fullpath)
final = []
uni = 0
final.append(dcm[0])
for i in range(len(dcm)):
    if len(os.listdir(dcm[i])) < 10:
        pass
    elif dcm[i][16:19] != final[uni][16:19]:
        final.append(dcm[i])
        uni += 1

tags = ((0x8, 0x70), (0x8, 0x1090), (0x18, 0x1020))
values = []
printout = []
for p in range(len(final)):
    file = os.path.join(final[p], os.listdir(final[p])[0])
    ds = dicom.read_file(file)
    printout.append([final[p]])
    for k in range(len(tags)):
        printout.append([ds[tags[k]]])

Python, appending to different lists while looping

Is it possible to append to different lists while looping through multiple directories simultaneously? My code:
def trav(dir_1, dir_2):
    data_0 = []
    data_1 = []
    for dir in [dir_1, dir_2]:
        for path, dirs, files in os.walk(dir):
            for file in files:
                with open(os.path.join(path, file)) as fh:
                    for line in fh:
                        data_0.append(line)
How do I append lines from dir_1 -> data_0 and dir_2 -> data_1 using one loop? I know I can write two separate methods, but I would like to know if there is a more efficient, simpler way of doing it. I tried using chain from itertools, but had no luck with that. Any suggestions?
If you do not want two loops, that's okay; you can simply use an if:
def trav(dir_1, dir_2):
    data_0 = []
    data_1 = []
    for dir in [dir_1, dir_2]:
        current_dir = dir
        for path, dirs, files in os.walk(dir):
            for file in files:
                with open(os.path.join(path, file)) as fh:
                    for line in fh:
                        if current_dir == dir_1:
                            data_0.append(line)
                        else:
                            data_1.append(line)
Another way could be:
def trav(dir_1, dir_2):
    data_0 = []
    data_1 = []
    for dir in [dir_1, dir_2]:
        if dir == dir_1:
            data = data_0
        else:
            data = data_1
        for path, dirs, files in os.walk(dir):
            for file in files:
                with open(os.path.join(path, file)) as fh:
                    for line in fh:
                        data.append(line)
The second one will run faster than the first, since fewer comparisons are needed.
Well, you could make data a dict:
def trav(dir_1, dir_2):
    data = {}
    data[dir_1] = []
    data[dir_2] = []
    for dir in [dir_1, dir_2]:
        for path, dirs, files in os.walk(dir):
            for file in files:
                with open(os.path.join(path, file)) as fh:
                    for line in fh:
                        data[dir].append(line)
Or you could make data a collections.defaultdict(list). Then you wouldn't have to initialize the entries to empty lists. Also, I would suggest you not use the name dir because of confusion with the built-in name. There's no harm done here though, because it's a local variable.
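A runnable sketch of the defaultdict variant (the directories and file contents below are created just for the demo, and each file is opened before iterating its lines):

```python
import os
import tempfile
from collections import defaultdict

def trav(*dirs):
    data = defaultdict(list)   # missing keys start out as empty lists
    for d in dirs:
        for path, subdirs, files in os.walk(d):
            for name in sorted(files):
                with open(os.path.join(path, name)) as fh:
                    data[d].extend(fh)   # append every line of this file
    return data

# Demo with two throwaway directories.
with tempfile.TemporaryDirectory() as d1, tempfile.TemporaryDirectory() as d2:
    with open(os.path.join(d1, "a.txt"), "w") as f:
        f.write("one\n")
    with open(os.path.join(d2, "b.txt"), "w") as f:
        f.write("two\nthree\n")
    result = trav(d1, d2)

print(result[d1], result[d2])
```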

How to compare all of the files in directory with each other two by two in Python?

I have a directory and I want to compare all of the files in it and get the percentage match between them. As a starting point, I decided to open one file and compare the other files with it:
import os
import fnmatch
import difflib

filelist = []
diff_list = []
f = open("D:/Desktop/sample/ff69.txt")
flines = f.readlines()
path = "D:/Desktop/sample"
for root, dirnames, filenames in os.walk(path):
    for filename in fnmatch.filter(filenames, '*.txt'):
        filelist.append(os.path.join(root, filename))
for m in filelist:
    g = open(m, 'r')
    glines = g.readlines()
    d = difflib.Differ()
    #print d
    diffl = diff_list.append(d.compare(flines, glines))
    print("".join(diff))  # n_adds, n_subs, n_eqs, n_wiered = 0, 0, 0, 0
But my code does not work: when I print the result I get "None". Does anyone have any idea why? Or a better idea for comparing all of the files in a directory two by two?
If you're attempting to compare files pairwise you probably want something closer to this:
files = os.listdir('root')
for idx, filename in enumerate(files):
    try:
        fcompare = files[idx + 1]
    except IndexError:
        # We've reached the last file.
        break
    # Actual diffing code.
    d = difflib.Differ()
    lines1 = open(os.path.join('root', filename)).readlines()
    lines2 = open(os.path.join('root', fcompare)).readlines()
    d.compare(lines1, lines2)
That will compare files 1-2, 2-3, 3-4, etc. It may be worth optimizing when you read the files in - file 2 is in use for loop iterations 1 and 2 - so shouldn't have its contents read twice if possible, but that may be premature optimization depending on the volume of files.
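For the original "percentage match between every pair" goal, `itertools.combinations` visits each unordered pair exactly once, and `difflib.SequenceMatcher.ratio` gives a similarity score. A sketch with invented file contents (on real files, swap the dict for `open(...).readlines()`):

```python
import difflib
from itertools import combinations

# Stand-in for each file's lines; in practice read these from disk.
texts = {
    "a.txt": ["hello\n", "world\n"],
    "b.txt": ["hello\n", "there\n"],
    "c.txt": ["totally\n", "different\n"],
}

scores = {}
for name1, name2 in combinations(texts, 2):   # each unordered pair once
    ratio = difflib.SequenceMatcher(None, texts[name1], texts[name2]).ratio()
    scores[(name1, name2)] = round(100 * ratio, 1)  # percentage match
print(scores)
```

Note `ratio()` compares the line sequences here; passing whole-file strings instead would score character-level similarity.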
