Randomly splitting 1 file from many files based on ID - python

In my dataset, I have a large number of images in JPG format, named [ID]_[Cam]_[Frame].jpg. The dataset contains many IDs, and every ID has a different number of images. I want to randomly take one image from each ID and put it into a different set of images. The problem is that the IDs in the dataset aren't always consecutive (sometimes numbers are skipped). In the example below, the set of files has no IDs 2 and 3.
Is there any python code to do this?
Before
TrainSet
00000000_0001_00000000.jpg
00000000_0001_00000001.jpg
00000000_0002_00000001.jpg
00000001_0001_00000001.jpg
00000001_0002_00000001.jpg
00000001_0002_00000002.jpg
00000004_0001_00000001.jpg
00000004_0002_00000001.jpg
After
TrainSet
00000000_0001_00000000.jpg
00000000_0002_00000001.jpg
00000001_0002_00000001.jpg
00000001_0001_00000001.jpg
00000004_0001_00000001.jpg
ValidationSet
00000000_0001_00000001.jpg
00000001_0002_00000002.jpg
00000004_0002_00000001.jpg

In this case, I would use a dictionary with the ID as the key and a list of the file names with that ID as the value, then randomly pick one file from each list.
import os
from random import choice
from pathlib import Path
import shutil

source_folder = "SOURCE_FOLDER"
dest_folder = "DEST_FOLDER"

dir_list = os.listdir(source_folder)

# Group file names by their ID (the part before the first underscore)
ids = {}
for f in dir_list:
    f_id = f.split("_")[0]
    ids[f_id] = [f, *ids.get(f_id, [])]

Path(dest_folder).mkdir(parents=True, exist_ok=True)

# Move one randomly chosen file per ID into the destination folder
for files in ids.values():
    random_file = choice(files)
    shutil.move(
        os.path.join(source_folder, random_file), os.path.join(dest_folder, random_file)
    )
In your case, replace SOURCE_FOLDER with TrainSet and DEST_FOLDER with ValidationSet.
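If you want the split to be reproducible, or you prefer to copy the files instead of moving them, a minimal variation of the same idea could look like this (the folder names and the seed value below are only assumptions for illustration):

import os
import random
import shutil
from pathlib import Path

source_folder = "TrainSet"        # assumed folder names; adjust to your layout
dest_folder = "ValidationSet"

random.seed(42)  # fix the seed so the same files are picked on every run

# Group file names by the ID prefix
ids = {}
for f in os.listdir(source_folder):
    ids.setdefault(f.split("_")[0], []).append(f)

Path(dest_folder).mkdir(parents=True, exist_ok=True)

for files in ids.values():
    picked = random.choice(files)
    # copy2 keeps the original in TrainSet; use shutil.move to relocate instead
    shutil.copy2(os.path.join(source_folder, picked), os.path.join(dest_folder, picked))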

You could also use a sort together with a dictionary data structure. For example:
myDict = {'a': '00000000_0001_00000000.jpg', 'b': '00000000_0001_00000001.jpg'}
myKeys = list(myDict.keys())
myKeys.sort()
sorted_dict = {i: myDict[i] for i in myKeys}
print(sorted_dict)
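Applied to the file names from the question, a small sketch of that idea (grouping by the ID prefix, sorting the dictionary by its keys, then picking one file per ID; the short file list is just for illustration) might look like this:

from random import choice

filenames = [
    "00000000_0001_00000000.jpg",
    "00000000_0001_00000001.jpg",
    "00000001_0001_00000001.jpg",
    "00000004_0002_00000001.jpg",
]

# Group file names by ID, then sort the dictionary by its keys
groups = {}
for name in filenames:
    groups.setdefault(name.split("_")[0], []).append(name)
sorted_groups = {k: groups[k] for k in sorted(groups)}

# One random file per ID, regardless of gaps in the ID sequence
validation = [choice(files) for files in sorted_groups.values()]
print(validation)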

Here's a Pandas DataFrame solution that negates the need to move the files between folders. The str.extract method can extract the text matching a regex pattern as new columns in a DataFrame. The file names are grouped by the values in the newly created f_id column. The groupby.sample method returns a random sample from each group and the random_state parameter allows reproducibility.
import numpy as np
import pandas as pd
# Load file names into a data frame
data = [
{"fname": "00000000_0001_00000000.jpg"},
{"fname": "00000000_0001_00000001.jpg"},
{"fname": "00000000_0002_00000001.jpg"},
{"fname": "00000001_0001_00000001.jpg"},
{"fname": "00000001_0002_00000001.jpg"},
{"fname": "00000001_0002_00000002.jpg"},
{"fname": "00000004_0001_00000001.jpg"},
{"fname": "00000004_0002_00000001.jpg"},
]
df = pd.DataFrame(data)
# Extract 'f_id' from 'fname' string
df = df.join(df["fname"].str.extract(r'^(?P<f_id>\d+)_'))
sample_size = 1 # sample size
state_seed = 43 # reproducible
group_list = ["f_id"]
# Add 'validation' column
df["validation"] = 0
# Increment 'validation' by 1 for selected samples
df["validation"] = df.groupby(group_list).sample(n=sample_size, random_state=state_seed)["validation"].add(1)
# Reset 'NaN' values to 0
df["validation"] = df["validation"].fillna(0).astype(np.int8)
The result is a DataFrame with a value of 1 in the validation column for the selected file names.
                        fname      f_id  validation
0  00000000_0001_00000000.jpg  00000000           0
1  00000000_0001_00000001.jpg  00000000           1
2  00000000_0002_00000001.jpg  00000000           0
3  00000001_0001_00000001.jpg  00000001           1
4  00000001_0002_00000001.jpg  00000001           0
5  00000001_0002_00000002.jpg  00000001           0
6  00000004_0001_00000001.jpg  00000004           0
7  00000004_0002_00000001.jpg  00000004           1
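If you then need the actual file name lists (for example to feed them to a data loader), a short follow-up using the df built above could be:

# File names selected for validation vs. the remaining training files
val_files = df.loc[df["validation"] == 1, "fname"].tolist()
train_files = df.loc[df["validation"] == 0, "fname"].tolist()
print(val_files)
print(train_files)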

Related

Create a column for each first directory of a path and fill the column with each last directory of the same path

This dataset represents a collection of image information. Each image has some tags that are stored in a rather messy way.
In particular, I have a dataframe with a column ('tags_path') whose values are strings representing a list of several paths for each observation of my dataset, like this:
df['tags_path'][0]
',/SITUATION/Group Photo,/CONTENT YEAR/Years 2020/2022,/FRAMEWORK/Otherframeworks/Tracks,/PERSON/Editor/Mark,/PERSON/Co-Editor/Paul,PERSON/Protagonist/Cherles,/SITUATION/Victory,/SITUATION/Portrait,'
As you can see, there are several paths in this string; the first directory of each path represents the tag category, while the last directory represents the tag name. For example, in the above observation we have:
SITUATION->['Group Photo', 'Victory', 'Portrait']
CONTENT YEAR->['2022']
FRAMEWORK->['Tracks']
PERSON->['Mark', 'Paul', 'Charles']
I would like to create a column in the dataframe for each "tag category" (SITUATION, CONTENT YEAR, FRAMEWORK, etc.) which contains its own list of tags.
So far I have managed to create an empty column for every unique tag category of my dataset, like this:
df['tags_path'] = ','+df['tags_path']+','
tags = [re.findall(r',/[a-zA-Z .]+', str(df.loc[i, 'tags_path'])) for i in range(len(df))]
flat_tags_columns = [x[2:] for x in list(set([item for sublist in tags for item in sublist]))]
for i in flat_tags_columns:
    df[i] = 0
Now I need to fill the columns with the respective tags. Thanks.
With the following toy dataframe:
from pathlib import Path
import pandas as pd
df = pd.DataFrame(
    {
        "tags_path": [
            ",/SITUATION/Group Photo,/CONTENT YEAR/Years 2020/2022,/FRAMEWORK/Otherframeworks/Tracks,/PERSON/Editor/Mark,/PERSON/Co-Editor/Paul,/PERSON/Protagonist/Charles,/SITUATION/Victory,/SITUATION/Portrait,",
            ",/SITUATION/Group Photo,/CONTENT YEAR/Years 2020/2021,/FRAMEWORK/Otherframeworks/Tracks,/PERSON/Editor/Peter,/PERSON/Co-Editor/John,/PERSON/Protagonist/Charly,/SITUATION/Victory,/SITUATION/Portrait,",
        ]
    }
)
I suggest a different approach, taking advantage of the pathlib module from the Python standard library for dealing with path-like objects:
def process(tag):
    """Helper function which extracts column names and values as lists."""
    paths = [Path(item) for item in tag.split(",")]
    data = {str(path.parts[1]): [] for path in paths if path.parts}
    for path in paths:
        try:
            data[str(path.parts[1])].append(path.name)
        except IndexError:
            pass
    return [[col for col in data.keys()], [value for value in data.values()]]
# Temporary columns
df["temp"] = df["tags_path"].apply(process)
df[["columns", "values"]] = pd.DataFrame(df["temp"].tolist(), index=df.index)
# Add final columns
df[df["columns"][0]] = pd.DataFrame(df["values"].tolist(), index=df.index)
# Cleanup
df = df.drop(columns=["temp", "columns", "values"])
print(df)
# Output
                                           tags_path  \
0  ,/SITUATION/Group Photo,/CONTENT YEAR/Years 20...
1  ,/SITUATION/Group Photo,/CONTENT YEAR/Years 20...

                          SITUATION CONTENT YEAR FRAMEWORK  \
0  [Group Photo, Victory, Portrait]       [2022]  [Tracks]
1  [Group Photo, Victory, Portrait]       [2021]  [Tracks]

                  PERSON
0  [Mark, Paul, Charles]
1  [Peter, John, Charly]
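As a quick illustration of what the helper returns, process can also be called on a single (shortened) tag string; the comments show the expected output:

cols, values = process(",/SITUATION/Group Photo,/CONTENT YEAR/Years 2020/2022,/PERSON/Editor/Mark,")
print(cols)    # ['SITUATION', 'CONTENT YEAR', 'PERSON']
print(values)  # [['Group Photo'], ['2022'], ['Mark']]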

python: multi-column pandas data-file obtained in FOR loop

I am working on a Python script which loops over N .SDF files, creates their list using glob, performs some calculations for each file and then stores this information in a pandas DataFrame. Assuming that I calculate 4 different properties of each file, for 1000 files the expected output should be summarized in a DataFrame with 5 columns and 1000 rows. Here is the example of the code:
# make a list of all .sdf files present in the data folder:
dirlist = [os.path.basename(p) for p in glob.glob('data' + '/*.sdf')]

# create an empty data frame with 5 columns:
# name of the file, value of p, value of ac, value of don, value of wt
df = pd.DataFrame(columns=["key", "p", "ac", "don", "wt"])

# for each sdf file get its name and calculate 4 different properties: p, ac, don, wt
for sdf in dirlist:
    sdf_name = sdf.rsplit(".", 1)[0]
    # set the name of the file
    key = f'{sdf_name}'
    mol = open(sdf, 'rb')
    # --- do some specific calculations ---
    p = MolLogP(mol)  # coeff conc-perm
    ac = CalcNumLipinskiHBA(mol)
    don = CalcNumLipinskiHBD(mol)
    wt = MolWt(mol)
    # add one row to DF in the following order: ["key", "p", "ac", "don", "wt"]
    df[key] = [p, ac, don, wt]
The problem is in the last line of the script, which is supposed to summarize all of the calculations in one row and append it to the DF together with the name of the processed file. Eventually, for 1000 processed SDF files, my DF should contain 5 columns and 1000 rows.
You should replace the troublesome line with something like
df.loc[len(df)] = [key, p, ac, don, wt]
This will append a new row at the end of the df.
Alternatively, on older pandas versions (DataFrame.append was deprecated in 1.4 and removed in 2.0), you can do
df = df.append(adict, ignore_index=True)
where adict is a dictionary of your values with the column names as keys:
adict = {'key': key, 'p': p, ...}
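Since growing a DataFrame row by row is slow anyway, another common pattern is to collect the rows in a plain list and build the DataFrame once at the end. A minimal sketch, with the per-file calculations replaced by placeholder values:

import pandas as pd

rows = []
for sdf in ["mol1.sdf", "mol2.sdf"]:   # stand-in for dirlist
    key = sdf.rsplit(".", 1)[0]
    p, ac, don, wt = 1.2, 3, 1, 180.2  # placeholders for the real calculations
    rows.append({"key": key, "p": p, "ac": ac, "don": don, "wt": wt})

# Build the DataFrame once instead of appending inside the loop
df = pd.DataFrame(rows, columns=["key", "p", "ac", "don", "wt"])
print(df)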

Replace Matching Strings with New Text in Pandas Dataframe

I am trying to replace names in the "Name" column with a generic ID stored in a new column "research_code"; the "Name" column will then be removed.
I do not want to remove duplicates, but I do want all instances of "Buzz Lightyear" to be replaced by the same integer (i.e. 1). So all "Buzz Lightyear"s become "1", all "Twighlight Sparkle"s become "2", etc.
When I run this, I get no errors, but the "research_code" does not persist for some reason.
full_set = pd.read_csv(filename, index_col=None, header=0)
grouped_set = full_set.groupby('Name')
names = grouped_set.groups.keys()
idx = 1
for c in names:
    set_index = str(idx + 1)
    idx = int(set_index) + 1
    replaceables = full_set[(full_set.Name == str(c))]
    for index, row in replaceables.iterrows():
        print(row['Name'])
        print(row['research_code'])
        row['research_code'] = set_index
        print(row['research_code'])
print(full_set.head)
Categories can be used. (As an aside, the original loop has no effect because iterrows yields copies of each row, so assigning to row['research_code'] never writes back into full_set.)
import pandas as pd
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
filename = StringIO("""Name
Rahul
Doug
Joe
Buzzlightyear
Twighlight Sparkle
Twighlight Sparkle
Liu
""")
full_set = pd.read_csv(filename, index_col=None, header=0)
full_set['research_code'] = full_set['Name'].astype('category')
full_set['research_code'] = full_set['research_code'].cat.rename_categories([i for i in range(full_set['research_code'].nunique())])
print(full_set.drop(['Name'], axis=1))
The list comprehension at the end is a bit gratuitous: just rename the categories by passing rename_categories() a list of new names (integers in the question above) that is as long as the number of unique values in the Name column.
   research_code
0              4
1              1
2              2
3              0
4              5
5              5
6              3
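If all you need is an integer per distinct name (rather than a categorical dtype), pd.factorize is a more direct alternative. A minimal sketch with made-up data:

import pandas as pd

full_set = pd.DataFrame({"Name": ["Rahul", "Doug", "Buzzlightyear", "Twighlight Sparkle", "Twighlight Sparkle"]})

# factorize assigns codes in order of first appearance; add 1 to start counting at 1
full_set["research_code"] = pd.factorize(full_set["Name"])[0] + 1
print(full_set.drop(columns=["Name"]))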

Check if a list contain element of a csv file

I have a CSV file with items like these: some,foo,bar, and I have a different list in Python with items like att1,some,bar,try,other. Is it possible, for every list, to create a row in the same CSV file and set 1 where the 'key' is present and 0 otherwise? So in this case the resulting CSV file would be:
some,foo,bar
1,0,1
Here's one approach, using Pandas.
Let's say the contents of example.csv are:
some,foo,bar
Then we can represent sample data with:
import pandas as pd
keys = ["att1","some","bar","try","other"]
data = pd.read_csv('~/Desktop/example.csv', header=None)
print(data)
0 1 2
0 some foo bar
matches = data.apply(lambda x: x.isin(keys).astype(int))
print(matches)
0 1 2
0 1 0 1
newdata = pd.concat([data, matches])
print(newdata)
0 1 2
0 some foo bar
0 1 0 1
Now write back to CSV:
newdata.to_csv('example.csv', index=False, header=False)
# example.csv contents
some,foo,bar
1,0,1
Given data and keys, we can condense it all into one chained command:
(pd.concat([data,
            data.apply(lambda x: x.isin(keys).astype(int))])
 .to_csv('example1.csv', index=False, header=False))
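The question mentions several lists; if each one should become its own 0/1 row, one way to extend the same approach (a sketch, assuming example.csv still contains the single header row some,foo,bar) is to build one indicator row per list before concatenating:

import pandas as pd

data = pd.read_csv('example.csv', header=None)  # single row: some,foo,bar
key_lists = [
    ["att1", "some", "bar", "try", "other"],
    ["foo", "nothing"],
]

# One 0/1 row per key list, aligned with the columns of `data`
rows = [data.iloc[0].isin(keys).astype(int) for keys in key_lists]
pd.concat([data, pd.DataFrame(rows)], ignore_index=True).to_csv('example.csv', index=False, header=False)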
You could create a dictionary with the keys as the column names.
csv = {
    "some": [],
    "foo": [],
    "bar": []
}
Now, for each list check if the values are in the dict's keys, and append as necessary.
for lst in lists:
    for key in csv.keys():
        if key in lst:
            csv[key].append(1)
        else:
            csv[key].append(0)
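That answer stops before writing the result back to the file; a possible completion with the standard csv module (assuming the dict filled by the loop above; the two example rows here are made up) could be:

import csv as csvlib  # renamed so it does not clash with the `csv` dict above

# Example result of the loop above for two input lists
csv_dict = {"some": [1, 0], "foo": [0, 1], "bar": [1, 0]}

with open("example.csv", "a", newline="") as fh:
    writer = csvlib.writer(fh)
    # Append one 0/1 row per processed list, in the column order some,foo,bar
    for row in zip(csv_dict["some"], csv_dict["foo"], csv_dict["bar"]):
        writer.writerow(row)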

How does one append large amounts of data to a Pandas HDFStore and get a natural unique index?

I'm importing large amounts of http logs (80GB+) into a Pandas HDFStore for statistical processing. Even within a single import file I need to batch the content as I load it. My tactic so far has been to read the parsed lines into a DataFrame and then store the DataFrame in the HDFStore. My goal is to have a unique index for a single key in the DataStore, but each DataFrame restarts its own index values again. I was anticipating that HDFStore.append() would have some mechanism to tell it to ignore the DataFrame's index values and just keep adding to my HDFStore key's existing index values, but I cannot seem to find it. How do I import DataFrames and ignore the index values contained therein while having the HDFStore increment its existing index values? Sample code below batches every 10 lines. Naturally the real thing would be larger.
if hd_file_name:
    """
    HDF5 output file specified.
    """
    hdf_output = pd.HDFStore(hd_file_name, complib='blosc')
    print(hdf_output)

    columns = ['source', 'ip', 'unknown', 'user', 'timestamp', 'http_verb', 'path', 'protocol', 'http_result',
               'response_size', 'referrer', 'user_agent', 'response_time']

    source_name = str(log_file.name.rsplit('/')[-1])  # HDF5 Tables don't play nice with unicode so explicit str(). :(

    batch = []
    for count, line in enumerate(log_file, 1):
        data = parse_line(line, rejected_output=reject_output)
        # Add our source file name to the beginning.
        data.insert(0, source_name)
        batch.append(data)
        if not (count % 10):
            df = pd.DataFrame(batch, columns=columns)
            hdf_output.append(KEY_NAME, df)
            batch = []

    if (count % 10):
        df = pd.DataFrame(batch, columns=columns)
        hdf_output.append(KEY_NAME, df)
You can do it like this. The only trick is that the first time, the store's table doesn't exist yet, so get_storer will raise.
import pandas as pd
import numpy as np
import os

files = ['test1.csv', 'test2.csv']
for f in files:
    pd.DataFrame(np.random.randn(10, 2), columns=list('AB')).to_csv(f)

path = 'test.h5'
if os.path.exists(path):
    os.remove(path)

# pd.get_store was removed in newer pandas; HDFStore itself works as a context manager
with pd.HDFStore(path) as store:
    for f in files:
        df = pd.read_csv(f, index_col=0)
        try:
            nrows = store.get_storer('foo').nrows
        except (AttributeError, KeyError):
            # first pass: the 'foo' table doesn't exist yet
            nrows = 0
        # Offset the incoming index by the number of rows already stored
        df.index = pd.Series(df.index) + nrows
        store.append('foo', df)
In [10]: pd.read_hdf('test.h5','foo')
Out[10]:
A B
0 0.772017 0.153381
1 0.304131 0.368573
2 0.995465 0.799655
3 -0.326959 0.923280
4 -0.808376 0.449645
5 -1.336166 0.236968
6 -0.593523 -0.359080
7 -0.098482 0.037183
8 0.315627 -1.027162
9 -1.084545 -1.922288
10 0.412407 -0.270916
11 1.835381 -0.737411
12 -0.607571 0.507790
13 0.043509 -0.294086
14 -0.465210 0.880798
15 1.181344 0.354411
16 0.501892 -0.358361
17 0.633256 0.419397
18 0.932354 -0.603932
19 -0.341135 2.453220
You actually don't necessarily need a globally unique index (unless you want one), as HDFStore (through PyTables) numbers rows uniquely for you. You can always add these selection parameters:
In [11]: pd.read_hdf('test.h5','foo',start=12,stop=15)
Out[11]:
A B
12 -0.607571 0.507790
13 0.043509 -0.294086
14 -0.465210 0.880798
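Building on that, here is a small sketch (assuming the test.h5 store created above) of paging through the appended table in fixed-size chunks using select with its start/stop parameters:

import pandas as pd

chunk_size = 5
with pd.HDFStore('test.h5') as store:
    nrows = store.get_storer('foo').nrows
    for start in range(0, nrows, chunk_size):
        chunk = store.select('foo', start=start, stop=start + chunk_size)
        print(chunk.shape)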
