My directory has hundreds of images and text files (.png and .txt). What's special about them is that each image has its own matching txt file, for example im1.png has im1.txt, news_im2.png has news_im2.txt, etc. What I want is some way to give it a parameter or percentage, let's say 40, where it randomly copies 40% of the images along with their corresponding texts to a new directory, and the most important word here is randomly: if I run it again I shouldn't get the same results. Ideally it should take two kinds of parameters (reminder that the first would be the % of each sample), the second being the number of samples. For example, maybe I want my data in 3 different samples, not only 2; in that case it should accept a number of destination directory paths equal to the number of samples and spread the files accordingly, so for example I shouldn't find img_1 in 2 different samples.
What I have done so far is simply set up my method to copy them:
import glob, os, shutil
source_dir ='all_the_content/'
dest_dir = 'percentage_only/'
files = glob.iglob(os.path.join(source_dir, "*.png"))
for file in files:
    if os.path.isfile(file):
        shutil.copy2(file, dest_dir)
and the start of my code to set up the random splitting:
import os, shutil,random
my_pic_dict = {}
source_dir ='/home/michel/ubuntu/EAST/data_0.8/'
for element in os.listdir(source_dir):
    if element.endswith('.png'):
        my_pic_dict[element] = element.replace('.png', '.txt')
print(my_pic_dict)
print (len(my_pic_dict))
imgs_list = my_pic_dict.keys()
print(imgs_list)
How can I finalize it? I couldn't make random.sample work.
try this:
import random
import numpy as np
n_elem = 1000
n_samples = 4
percentages = [50,10,10,30]
indices = list(range(n_elem))
random.shuffle(indices)
elem_per_samples = [int(p*n_elem/100) for p in percentages]
limits = np.cumsum([0]+elem_per_samples)
samples = [indices[limits[i]:limits[i+1]] for i in range(n_samples)]
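To tie this back to the original question, here is a minimal sketch of how the shuffled split can drive the actual copying of .png/.txt pairs. It reuses the source_dir from the question; the destination folder names and the percentages are made-up placeholders:
import os, random, shutil
source_dir = '/home/michel/ubuntu/EAST/data_0.8/'    # from the question
dest_dirs = ['sample_a/', 'sample_b/', 'sample_c/']  # hypothetical destinations
percentages = [40, 30, 30]                           # one value per destination
# pair every .png with its matching .txt
pairs = [(f, f.replace('.png', '.txt'))
         for f in os.listdir(source_dir) if f.endswith('.png')]
random.shuffle(pairs)  # a fresh shuffle each run, so repeated runs give different splits
start = 0
for dest, pct in zip(dest_dirs, percentages):
    os.makedirs(dest, exist_ok=True)
    count = int(len(pairs) * pct / 100)
    for png, txt in pairs[start:start + count]:  # each pair lands in exactly one sample
        shutil.copy2(os.path.join(source_dir, png), dest)
        shutil.copy2(os.path.join(source_dir, txt), dest)
    start += count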
I have an image dataset that looks like this: Dataset
The timestep of each image is 15 minutes (as you can see, the timestamp is in the filename).
Now I would like to group those images in 3hrs long sequences and save those sequences inside subfolders that would contain respectively 12 images(=3hrs).
The result would ideally look like this:
Sequences
I have tried using os.walk to loop inside the folder where the image dataset is saved, then I created a dataframe using pandas because I thought I could handle the files more easily, but I think I am totally off target here.
Since you said you need groups of exactly 12 files (and considering that the timestamp format is the same for all of them), the following code can help you:
import os
import shutil
output_location = "location where you want to save them" # better not to be in the same location with the dataset
dataset_path = "your data set"
files = [os.path.join(path, file) for path, subdirs, files in os.walk(dataset_path) for file in files]
nr_of_files = 0
folder_name = ""
for index in range(len(files)):
    if nr_of_files == 0:
        folder_name = os.path.join(output_location, files[index].split("\\")[-1].split(".")[0])
        os.mkdir(folder_name)
        shutil.copy(files[index], files[index].replace(dataset_path, folder_name))
        nr_of_files += 1
    elif nr_of_files == 11:
        shutil.copy(files[index], files[index].replace(dataset_path, folder_name))
        nr_of_files = 0
    else:
        shutil.copy(files[index], files[index].replace(dataset_path, folder_name))
        nr_of_files += 1
Explaining the code:
files takes the value of all files in dataset_path. You set this variable and files will contain the entire path to every file.
The for loop iterates over the entire length of files.
nr_of_files is used to count off groups of 12 files. If it's 0, the code creates a folder in the output location named after files[index] and copies the file (replacing the input path with the output path).
If it's 11 (starting from 0, index == 11 means the 12th file), it will copy the file and set nr_of_files back to 0 so the next iteration creates another folder.
The last else will simply copy the file and increment nr_of_files.
The timestep of each image is 15 minutes (as you can see, the timestamp is in the filename).
Now I would like to group those images in 3hrs long sequences and save those sequences inside subfolders that would contain respectively 12 images (= 3hrs).
I suggest exploiting the built-in datetime library to get the desired result. For each file you have:
get the substring holding the timestamp
parse it into a datetime.datetime instance using datetime.datetime.strptime
convert said instance into seconds since the epoch using the .timestamp method
compute the integer division (//) of those seconds by 10800 (the number of seconds in 3 hours)
convert the value you got into a str and use it as the target subfolder name, as in the sketch below
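A minimal sketch of those steps, assuming hypothetical filenames such as img_202301011215.png with a YYYYMMDDHHMM timestamp and made-up dataset_path/output_location folders (adjust the substring extraction and the format string to your actual names):
import os, shutil
from datetime import datetime
dataset_path = "dataset"        # assumed source folder
output_location = "sequences"   # assumed output folder
for name in sorted(os.listdir(dataset_path)):
    stem = os.path.splitext(name)[0]
    stamp = stem.split("_")[-1]                          # 1. substring holding the timestamp
    dt = datetime.strptime(stamp, "%Y%m%d%H%M")          # 2. parse into a datetime instance
    seconds = dt.timestamp()                             # 3. seconds since the epoch
    bucket = int(seconds // 10800)                       # 4. integer-divide by 10800 (3 hours)
    target = os.path.join(output_location, str(bucket))  # 5. bucket number as the subfolder name
    os.makedirs(target, exist_ok=True)
    shutil.copy2(os.path.join(dataset_path, name), target)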
I have multiple files in a folder and I want to take the first four files, perform some operation, then take the next four files, perform some operation, and so on. But I am unable to iterate through the files, and in sorted order at that. I tried using glob.glob but I don't know how to iterate through each file by index in glob.
My files are 0.jpg 1.jpg 2.jpg 3.jpg 4.jpg......
for image in sorted(glob.glob(directory + '*.jpg'), key=os.path.getmtime):
    name = image.split('/')[-1]
    imgname = name.split('.')[0]
Here's a way of doing it. I see you have another question nearly the same, and you are likely going to need a "fill" value in case your directory has a number of images that is not an exact multiple of 4. I suggest you create a "fill" image (as a PNG so that it doesn't appear in your sorted list of JPEGs). Make the "fill" image the same colour as the background onto which you paste the other 4 images so that it doesn't even show up.
#!/usr/bin/env python3
import os, glob
from itertools import zip_longest
def grouper(iterable, n, fillvalue=None):
    """
    Group items of list in groups of "n" padding with "fillvalue"
    """
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
# Go to where the images are instead of managing a load of paths
os.chdir('images')
# Get list of filenames sorted by mtime
filenames = sorted(glob.glob('*.jpg'),key=os.path.getmtime)
# Iterate over the files in groups of 4
for f1, f2, f3, f4 in grouper(filenames, 4, 'fill.png'):
    print(f1, f2, f3, f4)
Sample Output
iphone.jpg door.jpg hands.jpg solar.jpg
test.jpg circuit_board.jpg r1.jpg r2.jpg
thing.jpg roadhog.jpg colorwheel.jpg hogtemplate.jpg
tiger.jpg bean.jpg image.jpg bottle.jpg
result.jpg fill.png fill.png fill.png
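If the "operation" on each group of 4 is, as suggested above, pasting them onto a common background, a rough sketch with Pillow could look like the following. It reuses grouper and filenames from the snippet above; the 200x200 tile size, the white background and the output naming are arbitrary assumptions, and fill.png is the blank fill image mentioned earlier:
from PIL import Image
TILE = 200  # assumed tile size
for f1, f2, f3, f4 in grouper(filenames, 4, 'fill.png'):
    sheet = Image.new('RGB', (TILE * 4, TILE), 'white')  # background the fill image blends into
    for i, name in enumerate((f1, f2, f3, f4)):
        img = Image.open(name).resize((TILE, TILE))
        sheet.paste(img, (i * TILE, 0))
    sheet.save('group_' + f1)  # name each sheet after its first image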
I am working on a sign language gesture classifier using pytorch. I have pictures representing each letter, residing in a folder titled with that specific letter. E.g. folder "A" has "1_A_1.jpg", "1_A_2.jpg", "21_A_3.jpg", etc.
I am trying to build a function that:
Iterates through the different folders
Splits the data into training, validation, and test sets
Labels those pictures with their respective folder name (i.e. letter label)
Returns 3 created folders that are train, test and validation
All the online code shows examples of splitting data coming from torchvision datasets (built-in datasets), nothing from scratch.
I found the following on stackoverflow:
import os
import shutil  # needed for shutil.move below (missing from the snippet as found)
import numpy as np
import argparse
def get_files_from_folder(path):
    files = os.listdir(path)
    return np.asarray(files)
def main(path_to_data, path_to_test_data, train_ratio):
    # get dirs
    _, dirs, _ = next(os.walk(path_to_data))
    # calculates how many train data per class
    data_counter_per_class = np.zeros((len(dirs)))
    for i in range(len(dirs)):
        path = os.path.join(path_to_data, dirs[i])
        files = get_files_from_folder(path)
        data_counter_per_class[i] = len(files)
    test_counter = np.round(data_counter_per_class * (1 - train_ratio))
    # transfers files
    for i in range(len(dirs)):
        path_to_original = os.path.join(path_to_data, dirs[i])
        path_to_save = os.path.join(path_to_test_data, dirs[i])
        # creates dir
        if not os.path.exists(path_to_save):
            os.makedirs(path_to_save)
        files = get_files_from_folder(path_to_original)
        # moves data
        for j in range(int(test_counter[i])):
            dst = os.path.join(path_to_save, files[j])
            src = os.path.join(path_to_original, files[j])
            shutil.move(src, dst)
and when I tried doing the following:
path_to_data= r'path\A'
path_to_test_data=r"path\test"
train_ratio=0.8
main(path_to_data,path_to_test_data,train_ratio)
Nothing really happened..
If I can get this working for train and test, I can easily extend it for validation.
Give this a go:
from pathlib import Path
def main(data_path, out_path, train_ratio):
    #1
    dir_paths = [child for child in Path(data_path).iterdir() if child.is_dir()]
    for i, dir_path in enumerate(dir_paths):
        #2
        files = list(dir_path.iterdir())
        train_len = int(len(files) * (1 - train_ratio))
        #3
        out_dir = Path(out_path).joinpath(dir_path.name)
        if not out_dir.exists():
            out_dir.mkdir(parents=True)
        #4
        for file_ in files[:train_len]:
            file_.replace(out_dir.joinpath(file_.name))
if __name__ == '__main__':
    main('data', 'test', 0.8)
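If that works for train and test, one way to carve out a validation split as well (just a sketch of the idea, reusing the same function) is to call it twice with adjusted ratios, for example for an 80/10/10 split:
main('data', 'test', 0.9)          # moves 10% of the original files to 'test'
main('data', 'validation', 8 / 9)  # moves 1/9 of the remaining 90% (another 10% of the original) to 'validation'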
I'm extremely new to Python (and software programming/development in general). I decided to use the scenario below as my first project. The project includes 5 main personal challenges. Some of the challenges I have been able to complete (although probably not in the most efficient way), and others I'm struggling with. Any feedback you have on my approach and recommendations for improvement is GREATLY appreciated.
Project Scenario = "If I doubled my money each day for 100 days, how much would I end up with at day #100? My starting amount on Day #1 is $1.00"
1.) Challenge 1 - What is the net TOTAL after day 100 - (COMPLETED, I think, please correct me if I'm wrong)
days = 100
compound_rate = 2
print(compound_rate ** days)  # 2 raised to the 100th
#==Result===
1267650600228229401496703205376
2.) Challenge 2 - Print to screen the DAYS in the first column, and corresponding Daily Total in the second column. - (COMPLETED, I think, please correct me if I'm wrong)
compound_rate = 2
days_range = list(range(101))
for x in days_range:
    print(str(x), (compound_rate ** int(x)))
# ===EXAMPLE Results
# 0 1
# 1 2
# 2 4
# 3 8
# 4 16
# 5 32
# 6 64
# 100 1267650600228229401496703205376
3.) Challenge 3 - Write TOTAL result (after the 100 days) to an external txt file - (COMPLETED, I think, please correct me if I'm wrong)
compound_rate = 2
days_range = list(range(101))
hundred_days = (compound_rate ** 100)
textFile = open("calctest.txt", "w")
textFile.write(str(hundred_days))
textFile.close()
#===Result====
string of 1267650600228229401496703205376 --> written to my file 'calctest.txt'
4.) Challenge 4 - Write the Calculated running DAILY Totals to an external txt file. Column 1 will be the Day, and Column 2 will be the Amount. So just like Challenge #2 but to an external file instead of screen
NEED HELP, I can't seem to figure this one out.
5.) Challenge 5 - Somehow plot or chart the Daily Results (based on #4) - NEED GUIDANCE.
I appreciate everyone's feedback as I start on my personal Python journey!
challenge 2
This will work fine, but there's no need to write list(range(101)), you can just write range(101). In fact, there's no need even to create a variable to store that, you can just do this:
for x in range(101):
    print("whatever you want to go here")
challenge 3
Again, this will work fine, but when writing to a file, it is normally best to use a with statement, this means that you don't need to close the file at the end, as python will take care of that. For example:
with open("calctest.txt", "w") as f:
write(str(hundred_days))
challenge 4
Use a for loop as you did with challenge 2. Use "\n" to write a new line. Again do everything inside a with statement. e.g.
with open("calctest.txt", "w") as f:
for x in range(101):
f.write("something here \n").
(would write a file with 'something here ' written 101 times)
challenge 5
There is a python library called matplotlib, which I have never used, but I would suggest that would be where to go to in order to solve this task.
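For example, a minimal sketch with matplotlib (assuming it is installed) that plots the day against the daily total:
import matplotlib.pyplot as plt
days = list(range(101))
amounts = [2 ** d for d in days]
plt.plot(days, amounts)
plt.yscale('log')  # the amounts explode, so a log scale keeps the curve readable
plt.xlabel('Day')
plt.ylabel('Amount ($)')
plt.title('Doubling $1 every day')
plt.show()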
I hope this is of some help :)
You can use what you did in challenge 3 to open and close the output file.
In between, you have to do what you did in challenge 2 to compute the data for each day.
Instead of writing the daily result to the stream, you will have to combine it into a string. After that, you can write that string to the file, exactly like you did in challenge 3.
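In code, that combination could look something like this sketch (open/close as in challenge 3, loop as in challenge 2, one combined string per line):
compound_rate = 2
textFile = open("calctest.txt", "w")
for day in range(101):
    line = str(day) + " " + str(compound_rate ** day) + "\n"  # combine day and amount into one string
    textFile.write(line)
textFile.close()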
Challenge One:
This is the correct way.
days = 100
compound_rate = 2
print("Result after 100 days" + (compound_rate ** days))
Challenge Two
This is corrected.
compound_rate = 2
days_range = list(range(101))
for x in days_range:
    print(x, (compound_rate ** x))
Challenge Three
This one was already essentially right. Note that file.write() only accepts strings, so the str() cast around hundred_days is in fact needed; writing the integer directly would raise a TypeError. (Alternatively, print(hundred_days, file=textFile) does the conversion for you.)
compound_rate = 2
days_range = list(range(101))
hundred_days = (compound_rate ** 100)
textFile = open("calctest.txt", "w")
textFile.write(str(hundred_days))
textFile.close()
Challenge Four
For this challenge, you will want to look into the python csv module. You can write the data in two columns separated by commas very simply with this module.
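For instance, a small sketch with the csv module, writing one row per day with the day in the first column and the amount in the second:
import csv
compound_rate = 2
with open("calctest.csv", "w", newline="") as csvFile:
    writer = csv.writer(csvFile)
    writer.writerow(["day", "amount"])  # header row
    for day in range(101):
        writer.writerow([day, compound_rate ** day])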
Challenge Five
For this challenge, you will want to look into the python library matplotlib. This library will give you tools to work with the data in a graphical way.
Answer for challenge 1 is as follows:
l = []
for a in range(0, 100):
    b = 2 ** a
    l.append(b)
print("Total after 100 days", sum(l))
import os, sys
import datetime
import time
#to get the current work directory, we use below os.getcwd()
print(os.getcwd())
#to get the list of files and folders in a path, we use os.listdir
print(os.listdir())
#to know the files inside a folder using path
spath = (r'C:\Users\7char')
l = spath
print(os.listdir(l))
#converting a file format to other, ex: txt to py
path = r'C:\Users\7char'
print(os.listdir(path))
# after looking at the list of files, we choose to change 'rough.py' to 'rough.txt'
os.chdir(path)
os.rename('rough.py','rough.txt')
#check whether the file has changed to new format
print(os.listdir(path))
#yes now the file is changed to new format
print(os.stat('rough.txt').st_size)
# by using the os.stat function we can see the size of a file (os.stat(file).st_size)
path = r"C:\Users\7char\rough.txt"
mtime = os.path.getmtime(path)  # avoid the name "datetime", which would shadow the datetime module
moddatetime = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(mtime))
print("Last Modified Time : ", moddatetime)
#splitting file names into name and extension using os.path.splitext
import os
path = r"C:\Users\7char\rough.txt"
dir(os.path)
files = os.listdir()
for file in files:
    print(os.path.splitext(file))
#moving a file from one folder to another (including moving within the folders of a path or moving into subfolders)
import os
char_7 = r"C:\Users\7char"
cleardata = r"C:\Users\clearadata"
operating = os.listdir(r"C:\Users\7char")
print(operating)
for i in operating:
    movefrom = os.path.join(char_7, i)
    moveto = os.path.join(cleardata, i)
    print(movefrom, moveto)
    os.rename(movefrom, moveto)
#now moving files based on the length of each file name (even / odd) to a specified path (even or odd).
import os
origin_path = r"C:\Users\movefilehere"
fivechar_path= r"C:\Users\5char"
sevenchar_path = r"C:\Users\7char"
origin_pathlist = os.listdir(origin_path)  # keep the listing in its own variable so origin_path stays a path
for file_name in origin_pathlist:
    l = len(file_name)
    if l % 2 == 0:
        evenfilepath = os.path.join(origin_path, file_name)
        newevenfilepath = os.path.join(fivechar_path, file_name)
        print(evenfilepath, newevenfilepath)
        os.rename(evenfilepath, newevenfilepath)
    else:
        oddfilepath = os.path.join(origin_path, file_name)
        newoddfilepath = os.path.join(sevenchar_path, file_name)
        print(oddfilepath, newoddfilepath)
        os.rename(oddfilepath, newoddfilepath)
#checking whether a path is a directory using os.path.isdir
import os
path = r"C:\Users\7char"
print(os.path.isdir(path))
#how many files of each extension (.py, .txt, etc.) are in a folder
import os
from os.path import join, splitext
from glob import glob
from collections import Counter
path = r"C:\Users\7char"
c = Counter([splitext(i)[1][1:] for i in glob(join(path, '*'))])
for ext, count in c.most_common():
    print(ext, count)
#looking at the files and extensions, including the total per extension.
import os
from os.path import join, splitext
from collections import defaultdict
path = r"C:\Users\7char"
c = defaultdict(int)
files = os.listdir(path)
for filenames in files:
    extension = os.path.splitext(filenames)[-1]
    c[extension] += 1
    print(os.path.splitext(filenames))
print(c, extension)
#getting list from range
list(range(4))
#break and continue statements and else clauses on loops
for n in range(2, 10):
    for x in range(2, n):
        if n % x == 0:
            print(n, 'equals', x, '*', n // x)
            break
    else:
        print(n, 'is a prime number')
#Dictionaries
#the dict() constructor builds dictionaries directly from sequences of key-value pairs
dict([('ad', 1212),('dasd', 2323),('grsfd',43324)])
#loop over two or more sequences at the same time, the entries can be paired with the zip() function.
questions = ['name', 'quest', 'favorite color']
answers = ['lancelot', 'the holy grail', 'blue']
for q, a in zip(questions, answers):
    print('What is your {0}? It is {1}.'.format(q, a))
#Using set()
basket = ['apple', 'orange', 'apple', 'pear', 'orange', 'banana']
for f in sorted(set(basket)):
    print(f)
I have the following working code to sort images according to a cluster list which is a list of tuples: (image_id, cluster_id).
One image can only be in one and only one cluster (there is never the same image in two clusters for example).
I wonder if there is a way to shorten the "for+for+if+if" loops at the end of the code, since currently, for each file name, I must check every pair in the cluster list, which makes it a little redundant.
import os
import re
import shutil
srcdir = '/home/username/pictures/' #
if not os.path.isdir(srcdir):
    print("Error, %s is not a valid directory!" % srcdir)
    return None
pts_cls  # is the list of pairs (image_id, cluster_id)
filelist = [(srcdir + fn) for fn in os.listdir(srcdir) if
            re.search(r'\.jpg$', fn, re.IGNORECASE)]
filelist.sort(key=lambda var: [int(x) if x.isdigit() else
                               x for x in re.findall(r'[^0-9]|[0-9]+', var)])
for f in filelist:
    fbname = os.path.splitext(os.path.basename(f))[0]
    for e, cls in enumerate(pts_cls):  # for each (img_id, clst_id) pair
        if str(cls[0]) == fbname:  # check if image_id corresponds to file basename on disk
            if cls[1] == -1:  # if cluster_id is -1 (-> noise)
                outdir = srcdir + 'cluster_' + 'Noise' + '/'
            else:
                outdir = srcdir + 'cluster_' + str(cls[1]) + '/'
    if not os.path.isdir(outdir):
        os.makedirs(outdir)
    dstf = outdir + os.path.basename(f)
    if os.path.isfile(dstf) == False:
        shutil.copy2(f, dstf)
Of course, as I am pretty new to Python, any other well explained improvements are welcome!
I think you're complicating this far more than needed. Since your image names are unique (there can only be one image_id) you can safely convert pts_cls into a dict and have fast lookups on the spot instead of looping through the list of pairs each and every time. You are also utilizing regex where it's not needed, and you're packing your paths only to unpack them later.
Also, your code would break if it happens that an image from your source directory is not in the pts_cls as its outdir would never be set (or worse, its outdir would be the one from the previous loop).
I'd streamline it like:
import os
import shutil
src_dir = "/home/username/pictures/"
if not os.path.isdir(src_dir):
    print("Error, %s is not a valid directory!" % src_dir)
    exit(1)  # return is expected only from functions
pts_cls = []  # is the list of pairs (image_id, cluster_id), load from wherever...
# convert your pts_cls into a dict - since there cannot be any images in multiple clusters
# base image name is perfectly ok to use as a key for blazingly fast lookups later
cluster_map = dict(pts_cls)
# get only `.jpg` files; store base name and file name, no need for a full path at this time
files = [(fn[:-4], fn) for fn in os.listdir(src_dir) if fn.lower()[-4:] == ".jpg"]
# no need for sorting based on your code
for name, file_name in files:  # loop through all files
    if name in cluster_map:  # proceed with the file only if in pts_cls
        cls = cluster_map[name]  # get our cluster value
        # get our `cluster_<cluster_id>` or `cluster_Noise` (if cluster == -1) target path
        target_dir = os.path.join(src_dir, "cluster_" + str(cls if cls != -1 else "Noise"))
        target_file = os.path.join(target_dir, file_name)  # get the final target path
        if not os.path.exists(target_file):  # if the target file doesn't exist
            if not os.path.isdir(target_dir):  # make sure our target path exists
                os.makedirs(target_dir, exist_ok=True)  # create a full path if it doesn't
            shutil.copy(os.path.join(src_dir, file_name), target_file)  # copy
UPDATE - If you have multiple 'special' folders for certain cluster IDs (like Noise is for -1) you can create a map like cluster_targets = {-1: "Noise"} where the keys are your cluster IDs and their values are, obviously, the special names. Then you can replace the target_dir generation with: target_dir = os.path.join(src_dir, "cluster_" + str(cluster_targets.get(cls,cls)))
UPDATE #2 - Since your image_id values appear to be integers while filenames are strings, I'd suggest you just build your cluster_map dict by converting your image_id parts to strings. That way you'd be comparing like with like without the danger of a type mismatch:
cluster_map = {str(k): v for k, v in pts_cls}
If you're sure that none of the *.jpg files in your src_dir will have a non-integer in their name you can instead convert the filename into an integer to begin with in the files list generation - just replace fn[:-4] with int(fn[:-4]). But I wouldn't advise that as, again, you never know how your files might be named.
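Putting the two updates together, a sketch of the adjusted lookup (same files, src_dir and copying logic as in the main snippet above):
cluster_targets = {-1: "Noise"}                # special folder names for certain cluster ids
cluster_map = {str(k): v for k, v in pts_cls}  # string keys, so lookups match the file basenames
for name, file_name in files:
    if name in cluster_map:
        cls = cluster_map[name]
        target_dir = os.path.join(src_dir, "cluster_" + str(cluster_targets.get(cls, cls)))
        target_file = os.path.join(target_dir, file_name)
        if not os.path.exists(target_file):
            os.makedirs(target_dir, exist_ok=True)
            shutil.copy(os.path.join(src_dir, file_name), target_file)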