In my dataset, I have a large number of images in jpg format, named [ID]_[Cam]_[Frame].jpg. The dataset contains many IDs, and every ID has a different number of images. I want to randomly take 1 image from each ID and put it into a different set of images. The problem is that the IDs in the dataset aren't always consecutive (sometimes numbers are skipped). In the example below, the set of files has no IDs 2 and 3.
Is there any python code to do this?
Before
TrainSet
00000000_0001_00000000.jpg
00000000_0001_00000001.jpg
00000000_0002_00000001.jpg
00000001_0001_00000001.jpg
00000001_0002_00000001.jpg
00000001_0002_00000002.jpg
00000004_0001_00000001.jpg
00000004_0002_00000001.jpg
After
TrainSet
00000000_0001_00000000.jpg
00000000_0002_00000001.jpg
00000001_0002_00000001.jpg
00000001_0001_00000001.jpg
00000004_0001_00000001.jpg
ValidationSet
00000000_0001_00000001.jpg
00000001_0002_00000002.jpg
00000004_0002_00000001.jpg
In this case, I would use a dictionary with the ID as the key and the list of file names with that ID as the value, then randomly pick one file from each list.
import os
import shutil
from pathlib import Path
from random import choice

source_folder = "SOURCE_FOLDER"
dest_folder = "DEST_FOLDER"

dir_list = os.listdir(source_folder)

ids = {}
for f in dir_list:
    f_id = f.split("_")[0]
    ids[f_id] = [f, *ids.get(f_id, [])]

Path(dest_folder).mkdir(parents=True, exist_ok=True)

for files in ids.values():
    random_file = choice(files)
    shutil.move(
        os.path.join(source_folder, random_file), os.path.join(dest_folder, random_file)
    )
In your case, replace SOURCE_FOLDER with TrainSet and DEST_FOLDER with ValidationSet.
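If you also want the split to be reproducible, one option (my own suggestion, not part of the code above) is to seed the random module before picking, for example:

import random

random.seed(42)  # fixed seed so the same file is chosen on every run
files = ["00000000_0001_00000000.jpg", "00000000_0001_00000001.jpg"]
print(random.choice(files))  # same result for a given seed and file list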
You need to use a sort along with a dictionary data structure.
For example:
myDict = {'a': '00000000_0001_00000000.jpg', 'b': '00000000_0001_00000001.jpg'}
myKeys = list(myDict.keys())
myKeys.sort()
sorted_dict = {i: myDict[i] for i in myKeys}
print(sorted_dict)
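If the intent is to group the sorted file names by their ID, a small sketch of that idea (my own illustration, using made-up file names) could be:

from itertools import groupby

files = [
    "00000001_0001_00000001.jpg",
    "00000000_0001_00000000.jpg",
    "00000000_0002_00000001.jpg",
]

# sort by the ID prefix, then group consecutive names that share it
files.sort(key=lambda name: name.split("_")[0])
grouped = {f_id: list(names) for f_id, names in groupby(files, key=lambda name: name.split("_")[0])}
print(grouped)  # {'00000000': [...], '00000001': [...]}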
Here's a pandas DataFrame solution that avoids the need to move the files between folders. The str.extract method can extract the text matching a regex pattern as new columns in a DataFrame. The file names are grouped by the values in the newly created f_id column. The groupby.sample method returns a random sample from each group, and the random_state parameter allows reproducibility.
import numpy as np
import pandas as pd
# Load file names into a data frame
data = [
    {"fname": "00000000_0001_00000000.jpg"},
    {"fname": "00000000_0001_00000001.jpg"},
    {"fname": "00000000_0002_00000001.jpg"},
    {"fname": "00000001_0001_00000001.jpg"},
    {"fname": "00000001_0002_00000001.jpg"},
    {"fname": "00000001_0002_00000002.jpg"},
    {"fname": "00000004_0001_00000001.jpg"},
    {"fname": "00000004_0002_00000001.jpg"},
]
df = pd.DataFrame(data)
# Extract 'f_id' from 'fname' string
df = df.join(df["fname"].str.extract(r'^(?P<f_id>\d+)_'))
sample_size = 1 # sample size
state_seed = 43 # reproducible
group_list = ["f_id"]
# Add 'validation' column
df["validation"] = 0
# Increment 'validation' by 1 for selected samples
df["validation"] = df.groupby(group_list).sample(n=sample_size, random_state=state_seed)["validation"].add(1)
# Reset 'NaN' values to 0
df["validation"] = df["validation"].fillna(0).astype(np.int8)
The result is a DataFrame with a value of 1 in the validation column for the selected file names.
                         fname      f_id  validation
0  00000000_0001_00000000.jpg  00000000           0
1  00000000_0001_00000001.jpg  00000000           1
2  00000000_0002_00000001.jpg  00000000           0
3  00000001_0001_00000001.jpg  00000001           1
4  00000001_0002_00000001.jpg  00000001           0
5  00000001_0002_00000002.jpg  00000001           0
6  00000004_0001_00000001.jpg  00000004           0
7  00000004_0002_00000001.jpg  00000004           1
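To turn this flag into an actual validation set on disk, the flagged file names can be filtered and moved; a minimal sketch, assuming the images live in TrainSet and should go to ValidationSet as in the question:

import shutil
from pathlib import Path

dest = Path("ValidationSet")
dest.mkdir(parents=True, exist_ok=True)

# file names whose sample was selected for validation
val_files = df.loc[df["validation"] == 1, "fname"]
for fname in val_files:
    shutil.move(str(Path("TrainSet") / fname), str(dest / fname))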
I want to parse an xlsx file. Some of the cells in the file are merged and act as a header for the values underneath.
But I don't know which approach I should choose to parse the file.
Should I parse the file from xlsx to json format and then perform the pivoting or transformation of the dataset,
OR
should I proceed with the xlsx format and try to read the specific cell values? I believe this approach will not make the code scalable and dynamic.
I tried to parse the file and convert it to json, but it did not load all the records. Unfortunately, it is not throwing any exception.
from json import dumps
from xlrd import open_workbook

# load excel file
wb = open_workbook('/dbfs/FileStore/tables/filename.xlsx')

# get sheet by using sheet name
sheet = wb.sheet_by_name('Input Format')

# get total rows
total_rows = sheet.nrows

# get total columns
total_columns = sheet.ncols

# convert each row of the sheet into a dictionary and append it to a list
lst = []
for i in range(0, total_rows):
    row = {}
    for j in range(0, total_columns):
        if i + 1 < total_rows:
            column_name = sheet.cell(rowx=0, colx=j)
            row_data = sheet.cell_value(rowx=i + 1, colx=j)
            row.update(
                {
                    column_name.value: row_data
                }
            )
    if len(row):
        lst.append(row)

# convert into json
json_data = dumps(lst)
print(json_data)
After executing the above code I received the following type of output:
{
    "Analysis": "M000000000000002001900000000000001562761",
    "KPI": "FELIX PARTY.MIX",
    "": 2.9969042460942
},
{
    "Analysis": "M000000000000002001900000000000001562761",
    "KPI": "FRISKIES ESTERILIZADOS",
    "": 2.0046260994622
},
Once the data is in good shape, Spark on Databricks will be used for the transformation.
I tried multiple approaches but failed :(
Hence I am seeking help from the community.
For more clarity on the question, I have added sample input/output screenshots below.
Input dataset:
Expected Output1:
You can download the actual dataset and expected output from the following link
Dataset
To get the month column as per the requirement, you can use the following code:
import pandas as pd

for_cols = pd.read_excel('/dbfs/FileStore/HHP.xlsx', engine='openpyxl', skiprows=2, nrows=1)
main_cols = [for_cols[req][0] for req in for_cols if type(for_cols[req][0]) == type('x')]  # getting main header column names
#print(main_cols)

for_dates = pd.read_excel('/dbfs/FileStore/HHP.xlsx', engine='openpyxl', skiprows=4, usecols="C:R")
dates = for_dates.columns.to_list()  # getting list of month names to be used
#print(dates)

pdf = pd.read_excel('/dbfs/FileStore/HHP.xlsx', engine='openpyxl', skiprows=4)  # reading the file without the main headers
#pdf
# All the repeated month columns (e.g., 2021 Jan) will be labeled differently, like 2021 Jan, 2021 Jan.1, 2021 Jan.2 and so on.
# The following code creates an array of arrays where each child array is used to create a new small dataframe.
# All these new dataframes will then be combined into a single dataframe (union).
req_cols = []
for i in range(len(main_cols)):
    current_dates = ['Market', 'Product']
    if i != 0:
        for d in dates:
            current_dates.append(d + f'.{i}')
    else:
        current_dates.extend(dates)
    req_cols.append(current_dates)
print(req_cols)
# The following code combines the dataframes to remove the multiple yyyy MMM columns. It also adds a column `stype`
# whose value identifies which main header column each month value belongs to for each product.
mydf = pdf[req_cols[0]]
mydf['stype'] = main_cols[0]
#display(mydf)
for i in range(1, len(req_cols)):
    temp = pdf[req_cols[i]]
    #print(temp.columns)
    temp['stype'] = main_cols[i]
    rename_cols = {'Market': 'Market', 'Product': 'Product', 'stype': 'stype'}  # renaming columns, i.e., changing 2021 Jan.1 and such to just 2021 Jan
    for j in req_cols[i][2:]:
        rename_cols[j] = j[:8]  # if j is 2021 Jan.3 then we only take j[:8] to get the actual name (2021 Jan)
    #print(rename_cols)
    temp.rename(columns=rename_cols, inplace=True)
    mydf = pd.concat([mydf, temp])  # combining the child dataframes into the main dataframe
mydf
tp = mydf[['Market', 'Product', '2021 Jan', 'stype']]
req_df = tp.pivot(index=['Product', 'Market'], columns='stype', values='2021 Jan')  # now pivoting the `stype` column
req_df['month'] = ['2021 Jan'] * len(req_df)  # initialising the month column
req_df.reset_index(inplace=True)  # converting index columns to actual columns
req_df  # required data format for 2021 Jan

# Use the following code to get the required result: do it separately for each of the dates and then combine into `req_df`.
for dt in dates[1:]:
    tp = mydf[['Market', 'Product', dt, 'stype']]
    tp1 = tp.pivot(index=['Product', 'Market'], columns='stype', values=dt)
    tp1['month'] = [dt] * len(tp1)
    tp1.reset_index(inplace=True)
    req_df = pd.concat([req_df, tp1])

display(req_df[(req_df['Product'] != 'Nestle Purina')])  # selecting only data where the product name is not Nestle Purina
To create a new column called Nestle Purina for one of the main columns (Penetration), you can use the following code:
nestle_purina = req_df[(req_df['Product'] == 'Nestle Purina')]  # where product name is Nestle Purina
b = req_df[(req_df['Product'] != 'Nestle Purina')]  # where product name is not Nestle Purina
a = b[['Product', 'Market', 'month', 'Penetration % (% of Households who bought a product atleast once in the given time period)']]  # selecting required columns along with the main column Penetration
n = nestle_purina[['month', 'Penetration % (% of Households who bought a product atleast once in the given time period)']]  # selecting only required columns from the nestle_purina df

import numpy as np

a['Nestle Purina'] = np.nan  # creating an empty column to populate using the code below
for dt in dates:
    val = [i for i in n[(n['month'] == dt)]['Penetration % (% of Households who bought a product atleast once in the given time period)']]  # getting the corresponding Nestle Purina value for the Penetration column
    a.loc[a['month'] == dt, 'Nestle Purina'] = val[0]  # updating the `Nestle Purina` column from NaN to the value extracted above
a
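As a side note, the wide-to-long step (turning the repeated month columns into a single month column) can also be sketched with pandas.melt; this is only an alternative idea with a toy frame and made-up values, not a drop-in replacement for the code above:

import pandas as pd

# toy frame standing in for one main-header block, with months as columns
wide = pd.DataFrame({
    'Market': ['Total', 'Total'],
    'Product': ['A', 'B'],
    '2021 Jan': [1.0, 2.0],
    '2021 Feb': [1.5, 2.5],
})

# melt collapses the month columns into 'month'/'value' pairs
long = wide.melt(id_vars=['Market', 'Product'], var_name='month', value_name='value')
print(long)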
I have GBs of data in this text format:
1,'Acct01','Freds Autoshop'
2,'3-way-Cntrl','Y'
1000,576,686,837
1001,683,170,775
1,'Acct02','Daves Tacos'
2,'centrifugal','N'
1000,334,787,143
1001,749,132,987
The first column indicates the row content and is an index series that repeats for each account (Acct01, Acct02, ...). Rows with index values 1 and 2 are one-to-one associated with each account (the parent). I would like to flatten this data into a dataframe that associates the account-level data (index = 1, 2) with its associated series data (1000, 1001, 1002, 1003, ...), i.e. the child data, in a flat df.
Desired df:
'Acct01','Freds Autoshop','3-way-Cntrl','Y',1000,576,686,837
'Acct01','Freds Autoshop','3-way-Cntrl','Y',1001,683,170,775
'Acct02','Daves Tacos','centrifugal','N',1000,334,787,143
'Acct02','Daves Tacos','centrifugal','N',1001,749,132,987
I've been able to do this in a very mechanical, very slow row-by-row process:
import pandas as pd
import numpy as np
import time

file = 'C:\\PythonData\\AcctData.txt'
t0 = time.time()
pdata = []  # Parse data
acct = []   # Account Data
row = {}    # Assembly Container

# Set dataframe columns
df = pd.DataFrame(columns=['Account', 'Name', 'Type', 'Flag', 'Counter', 'CNT01', 'CNT02', 'CNT03'])

# open the file and read through it line by line
with open(file, 'r') as f:
    for line in f:
        # Strip each field of the line
        pdata = [x.strip() for x in line.split(',')]
        # Use the index to parse data into acct[] (parent rows) for use on the rows with counter > 2
        indx = int(pdata[0])
        if indx == 1:
            acct.clear()
            acct.append(pdata[1])
            acct.append(pdata[2])
        elif indx == 2:
            acct.append(pdata[1])
            acct.append(pdata[2])
        else:
            row.clear()
            row['Account'] = acct[0]
            row['Name'] = acct[1]
            row['Type'] = acct[2]
            row['Flag'] = acct[3]
            row['Counter'] = pdata[0]
            row['CNT01'] = pdata[1]
            row['CNT02'] = pdata[2]
            row['CNT03'] = pdata[3]
        if indx > 2:
            #data.append(row)
            df = df.append(row, ignore_index=True)

t1 = time.time()
totalTimeDf = t1 - t0
TTDf = '%.3f' % (totalTimeDf)
print(TTDf + " Seconds to Complete df: " + file)
print(df)
Result:
0.018 Seconds to Complete df: C:\PythonData\AcctData.txt
Account Name Type Flag Counter CNT01 CNT02 CNT03
0 'Acct01' 'Freds Autoshop' '3-way-Cntrl' 'Y' 1000 576 686 837
1 'Acct01' 'Freds Autoshop' '3-way-Cntrl' 'Y' 1001 683 170 775
2 'Acct02' 'Daves Tacos' 'centrifugal' 'N' 1000 334 787 143
3 'Acct02' 'Daves Tacos' 'centrifugal' 'N' 1001 749 132 987
This works but is tragically slow. I suspect there is a very easy pythonic way to import and organize the data into a df. It appears an OrderedDict will properly organize the data as follows:
import csv
from collections import OrderedDict

od = OrderedDict()
file_name = 'C:\\PythonData\\AcctData.txt'
try:
    csvfile = open(file_name, 'rt')
except:
    print("File not found")

csvReader = csv.reader(csvfile, delimiter=",")
for row in csvReader:
    key = row[0]
    od.setdefault(key, []).append(row)
od
Result:
OrderedDict([('1',
[['1', "'Acct01'", "'Freds Autoshop'"],
['1', "'Acct02'", "'Daves Tacos'"]]),
('2',
[['2', "'3-way-Cntrl'", "'Y'"],
['2', "'centrifugal'", "'N'"]]),
('1000',
[['1000', '576', '686', '837'], ['1000', '334', '787', '143']]),
('1001',
[['1001', '683', '170', '775'], ['1001', '749', '132', '987']])])
From the OrderedDict I haven't been able to figure out how to combine keys 1 and 2 and associate them with the account-specific series keys (1000, 1001), then append them into a df. How do I go from the OrderedDict to a df while flattening the parent/child data? Or is there a better way to process this data?
I'm not sure if it's the fastest or most pythonic way, but I believe a pandas approach might do, since you need to iterate over every 4 rows in a weirdly specific way:
First, importing the libraries to work with:
import pandas as pd
import numpy as np
Since I didn't have a file to load, I just recreated it as an array (for this part you'll have to do some work yourself, or simply loading it into a pandas DataFrame with 4 columns will be fine [like the next step]):
data = [[1, 'Acct01', 'Freds Autoshop'],
        [2, '3-way-Cntrl', 'Y'],
        [1000, 576, 686, 837],
        [1001, 683, 170, 775],
        [1002, 333, 44, 885],
        [1003, 611183, 12, 1],
        [1, 'Acct02', 'Daves Tacos'],
        [2, 'centrifugal', 'N'],
        [1000, 334, 787, 143],
        [1001, 749, 132, 987],
        [1, 'Acct03', 'Norah Jones'],
        [2, 'undertaker', 'N'],
        [1000, 323, 1, 3],
        [1001, 311, 2, 111],
        [1002, 95, 112, 4]]
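(As an aside, loading the real file straight into a fixed-width frame is also possible; here is a rough, self-contained sketch using the sample rows from the question, where the 4-column width and the single-quote handling are my own assumptions. The walkthrough below continues with the recreated array.)

import io
import pandas as pd

sample = """1,'Acct01','Freds Autoshop'
2,'3-way-Cntrl','Y'
1000,576,686,837
1001,683,170,775"""

# names fixes the width at 4 columns so the shorter parent rows are padded with NaN;
# quotechar strips the single quotes around the text fields
df_raw = pd.read_csv(io.StringIO(sample), header=None, names=[0, 1, 2, 3], quotechar="'")
print(df_raw)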
Created a dataframe with the above data and created new columns filled with numpy's NaN (faster than pandas') as placeholders.
df = pd.DataFrame(data)
df['4']= np.nan
df['5']= np.nan
df['6']= np.nan
df['7']= np.nan
df['8']= np.nan
df.columns = ['idx','Account','Name','Type','Flag','Counter','CNT01','CNT02','CNT3']
Making a new df that records every time "AcctXXXX" appears and how many rows below until the next parent.
# Getting the unique "Acct" and their index position into an array
acct_idx_pos = np.array([df[df['Account'].str.contains('Acct').fillna(False)]['Account'].values, df[df['Account'].str.contains('Acct').fillna(False)].index.values])
# Making a df with the transposed array
df_pos = pd.DataFrame(acct_idx_pos.T, columns=['Acct', 'Position'])
# Shifting the values into a new column and filling the last value (nan) with the df length
df_pos['End_position'] = df_pos['Position'].shift(-1)
df_pos['End_position'][-1:] = len(df)
# Making the column we want, that is the number of loops we'll go
df_pos['Position_length'] = df_pos['End_position'] - df_pos['Position']
A custom function that uses a dummy DataFrame and concatenates temporary ones (will be used later):
def concatenate_loop_dfs(df_temp, df_full, axis=0):
    """
    to avoid retyping the same line of code for every df.
    the parameters should be the temporary df created at each loop and the concatenated DF that will contain all
    values, which must first be initialized (outside the loop) as df_name = pd.DataFrame().
    """
    if df_full.empty:
        df_full = df_temp
    else:
        df_full = pd.concat([df_full, df_temp], axis=axis)
    return df_full
Created a function that will loop to fill each row and drop duplicated rows:
# a complicated loop function
def shorthen_df(df, num_iterations):
    # to not delete original df
    dataframe = df.copy()
    # for the slicing, we need to start at the first row.
    curr_row = 1
    # fill current row's nan values with values from the next rows
    dataframe.iloc[curr_row-1:curr_row:, 3] = dataframe.iloc[curr_row:curr_row+1:, 1].values
    dataframe.iloc[curr_row-1:curr_row:, 4] = dataframe.iloc[curr_row:curr_row+1:, 2].values
    dataframe.iloc[curr_row-1:curr_row:, 5] = dataframe.iloc[curr_row+1:curr_row+2:, 0].values
    dataframe.iloc[curr_row-1:curr_row:, 6] = dataframe.iloc[curr_row+1:curr_row+2:, 1].values
    dataframe.iloc[curr_row-1:curr_row:, 7] = dataframe.iloc[curr_row+1:curr_row+2:, 2].values
    dataframe.iloc[curr_row-1:curr_row:, 8] = dataframe.iloc[curr_row+1:curr_row+2:, 3].values
    # the "num_iterations-2" is because the first two lines are filled and not replaced
    # as the next ones will be. So this will vary correctly for each "account"
    for i in range(1, num_iterations-2):
        # Replaces next row with values from previous row
        dataframe.iloc[curr_row+(i-1):curr_row+i:] = dataframe.iloc[curr_row+(i-2):curr_row+(i-1):].values
        dataframe.iloc[curr_row+(i-1):curr_row+i:, 5] = dataframe.iloc[curr_row+i+1:curr_row+i+2:, 0].values
        dataframe.iloc[curr_row+(i-1):curr_row+i:, 6] = dataframe.iloc[curr_row+i+1:curr_row+i+2:, 1].values
        dataframe.iloc[curr_row+(i-1):curr_row+i:, 7] = dataframe.iloc[curr_row+i+1:curr_row+i+2:, 2].values
        dataframe.iloc[curr_row+(i-1):curr_row+i:, 8] = dataframe.iloc[curr_row+i+1:curr_row+i+2:, 3].values
    # drop the last 2 rows of the df
    dataframe = dataframe[0:len(dataframe)-2]
    return dataframe
Finally, creating the dummy DF that will concat all "Acct" frames, looping over each one with its position, using both functions above.
df_final = pd.DataFrame()
for start, end, iterations in zip(df_pos.Position.values, df_pos.End_position.values, df_pos.Position_length.values):
    df2 = df[start:end]
    df_temp = shorthen_df(df2, iterations)
    df_final = concatenate_loop_dfs(df_temp, df_final)

# Dropping first/unnecessary columns
df_final.drop('idx', axis=1, inplace=True)
# resetting index
df_final.reset_index(inplace=True, drop=True)
df_final
returns
Account Name Type Flag Counter CNT01 CNT02 CNT3
0 Acct01 Freds Autoshop 3-way-Cntrl Y 1000.0 576 686 837
1 Acct01 Freds Autoshop 3-way-Cntrl Y 1001.0 683 170 775
2 Acct01 Freds Autoshop 3-way-Cntrl Y 1002.0 333 44 885
3 Acct01 Freds Autoshop 3-way-Cntrl Y 1003.0 611183 12 1
4 Acct02 Daves Tacos centrifugal N 1000.0 334 787 143
5 Acct02 Daves Tacos centrifugal N 1001.0 749 132 987
6 Acct03 Norah Jones undertaker N 1000.0 323 1 3
7 Acct03 Norah Jones undertaker N 1001.0 311 2 111
8 Acct03 Norah Jones undertaker N 1002.0 95 112 4
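For completeness, most of the slowness in the original attempt comes from calling df.append inside the loop; a hedged alternative sketch is to keep the question's own parsing logic, collect plain dicts in a list, and build the DataFrame once at the end (the file path is the one from the question):

import pandas as pd

rows = []
acct = []
with open('C:\\PythonData\\AcctData.txt', 'r') as f:
    for line in f:
        pdata = [x.strip() for x in line.split(',')]
        indx = int(pdata[0])
        if indx == 1:
            acct = [pdata[1], pdata[2]]    # start a new parent record
        elif indx == 2:
            acct += [pdata[1], pdata[2]]   # complete the parent record
        else:
            rows.append({'Account': acct[0], 'Name': acct[1], 'Type': acct[2], 'Flag': acct[3],
                         'Counter': pdata[0], 'CNT01': pdata[1], 'CNT02': pdata[2], 'CNT03': pdata[3]})

df_flat = pd.DataFrame(rows)  # one constructor call instead of repeated df.append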
I have a tsv file containing a network. Here's a snippet. Column 0 contains unique IDs, column 1 contains an alternative ID (not necessarily unique). Each pair of columns after that contains an 'interactor' and a score of interaction.
11746909_a_at A1CF SHPRH 0.11081568 TRIM10 0.11914056
11736238_a_at ABCA5 ANKS1A 0.1333185 CCDC90B 0.14495682
11724734_at ABCB8 HYKK 0.09577321 LDB3 0.09845833
11723976_at ABCC8 FAM161B 0.15087105 ID1 0.14801268
11718612_a_at ABCD4 HOXC6 0.23559235 LCMT2 0.12867001
11758217_s_at ABHD17C FZD7 0.46334574 HIVEP3 0.24272481
So for example, A1CF connects to SHPRH and TRIM10 with scores of 0.11081568 and 0.11914056 respectively. I'm trying to convert this data into a 'flat' format using pandas which would look like this:
11746909_a_at A1CF SHPRH 0.11081568
TRIM10 0.11914056
11736238_a_at ABCA5 ANKS1A 0.1333185
CCDC90B 0.14495682
...... and so on........ ........ ....
Note that each row can have an arbitrary number of (interactor, score) pairs.
I've tried setting columns 0 and 1 as indexes, then giving the columns names with df.columns = ['Interactor', 'Weight'] * int(df.shape[1] / 2), then using pandas.groupby, but so far my attempts have not been successful. Can anybody suggest a way to do this?
Producing an output dataframe like the one you specified above shouldn't be too hard:
from collections import OrderedDict

import pandas as pd


def open_network_tsv(filepath):
    """
    Read the tsv file, yielding every line split by tabs
    """
    with open(filepath) as network_file:
        for line in network_file.readlines():
            line_columns = line.strip().split('\t')
            yield line_columns


def get_connections(potential_conns):
    """
    Get the connections of a particular line, grouped
    in interactor:score pairs
    """
    for idx, val in enumerate(potential_conns):
        if not idx % 2:
            if len(potential_conns) >= idx + 2:
                yield val, potential_conns[idx+1]


def create_connections_df(filepath):
    """
    Build the desired dataframe
    """
    connections = OrderedDict({
        'uniq_id': [],
        'alias': [],
        'interactor': [],
        'score': []
    })
    for line in open_network_tsv(filepath):
        uniq_id, alias, *potential_conns = line
        for connection in get_connections(potential_conns):
            connections['uniq_id'].append(uniq_id)
            connections['alias'].append(alias)
            connections['interactor'].append(connection[0])
            connections['score'].append(connection[1])
    return pd.DataFrame(connections)
Maybe you can do a dataframe.set_index(['uniq_id', 'alias']) or dataframe.groupby(['uniq_id', 'alias']) on the output afterward
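For example, a quick usage sketch (the file name network.tsv is just a placeholder for your tsv path):

df = create_connections_df('network.tsv')
# group the flat rows back under their IDs, as suggested above
df = df.set_index(['uniq_id', 'alias'])
print(df)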