I'm trying to populate a dataframe based on a class label and images in a folder.
I have a folder with over 10,000 images that follow this naming structure: ['leaflet_10000_1.jpg', 'leaflet_10000_2.jpg', 'leaflet_10001_1.jpg', 'leaflet_10001_2.jpg', 'leaflet_10002_1.jpg', 'leaflet_10002_2.jpg', 'leaflet_10003_1.jpg', 'leaflet_10003_2.jpg']
And an accompanying csv file of the structure:
ID,Location,Party,Representative/Candidate,Date
1000,Glasgow North,Liberal Democrats,,02-Apr-10
1001,Erith and Thamesmead,Labour Party,,02-Apr-10
I want to create a new csv file which has the paths of all the images for a given Party. I can separate a certain party from the full csv file using the command:
df_ = df.loc[df["Party"] == "Labour Party"]
This gives me the party I am interested in, but how do I create a FULL list of all images associated with it? From the image list shared above, it can be seen that ID 1001 has 2 images associated with it; this is not a fixed number, and some IDs have 3 to 5 images associated with them.
How do I get this new dataframe populated with all the required paths?
My thought process is to apply str.split(name, '_') to each file name and then search every ID against the results, but where do I go from there?
You're on the right track!
If all IDs are unique and you want an output dataframe with just the party and its image paths, you can do something like:
from pathlib import Path
import numpy as np
import pandas as pd
partySer = df.loc[:, ['ID', 'Party']].set_index('ID')
# Get image names
imgFiles = list(Path('./<your-image-path>/').glob('*.jpg'))
imgFiles_str = np.array([str(f) for f in imgFiles])
# Grab just the integer ID from each image name
imgIds = np.array([int(f.stem.split('_')[1]) for f in imgFiles])
# Build dataframe with matching ids
outLst = []
for curId, row in partySer.iterrows():
    # Boolean mask of the images whose ID matches the current row's ID
    matchingImgIdxs = imgIds == curId
    matchingImgs = imgFiles_str[matchingImgIdxs]
    outLst.append({'Party': row['Party'], 'images': matchingImgs})
outDf = pd.DataFrame(outLst)
I haven't tested this code, but it should lead you on the right track.
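Also untested, but once outDf exists, getting the csv of paths for one party could look something like this (using the 'Party' and 'images' names from the code above; the output file name is just an example):
# Keep only the rows for the party of interest and write them out;
# each 'images' cell still holds the array of matching file paths.
labour_df = outDf[outDf['Party'] == 'Labour Party']
labour_df.to_csv('labour_image_paths.csv', index=False)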
Let's create a dataframe of your images and extract the ID.
from pathlib import Path
import pandas as pd

img_df = pd.DataFrame({'imgs': [i.name for i in Path(your_images).glob('*.jpg')]})
img_df['ID'] = img_df['imgs'].astype(str).str.split('_', expand=True)[1].astype(int)
img_dfg = img_df.groupby('ID', as_index=False).agg(list)
which gives:
ID imgs
0 10000 [leaflet_10000_1.jpg, leaflet_10000_2.jpg]
1 10001 [leaflet_10001_1.jpg, leaflet_10001_2.jpg]
2 10002 [leaflet_10002_1.jpg, leaflet_10002_2.jpg]
3 10003 [leaflet_10003_1.jpg, leaflet_10003_2.jpg]
then we just need to merge on the ID column.
df_merged = pd.merge(df,img_dfg,on='ID',how='left')
you can then do any further operations to group or list your images.
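For example, a minimal (untested) sketch of pulling every image for one party out of df_merged, using the 'Party' and 'imgs' columns from above:
# Rows belonging to one party; each 'imgs' cell is a list of image names
labour = df_merged.loc[df_merged['Party'] == 'Labour Party']

# Flatten the per-ID lists into one list of image names for that party
labour_images = [img for imgs in labour['imgs'].dropna() for img in imgs]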
What do you want in your DataFrame? You said that you wanted to populate your df with the required paths. If so, then using str.split(name, '_') would give you the following information for every file: its ID and its number.
You can then insert elements into your dataframe using both of these characteristics, adding any other characteristic obtained from the accompanying .csv file that you described. In the end, filtering the dataframe to get all elements that match a given criterion should give you what you are looking for.
You seem to think that one ID will mean one line inside the dataframe, but that is incorrect: each line is described by an (ID, number) pair in your case, so your approach would already give you the full list of all images associated with a party/ID/other characteristic.
If you want to reduce the size of your dataframe, since all images related to the same ID differ by only one characteristic, you could instead have a "Files" column that contains a list of all images related to each ID (and drop the "number" column), or just the numbers associated with them, since each path is composed of the main path followed by "_number.jpg". This solution would be a lot more efficient.
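An untested sketch of that "Files" column idea, assuming the leaflet file names from the question and the df with ID and Party columns (the folder name 'leaflet_images' is just a placeholder):
from pathlib import Path
import pandas as pd

# Map each ID to the list of its image paths (the "Files" column)
files = pd.Series([str(p) for p in Path('leaflet_images').glob('*.jpg')], name='Files')
ids = files.str.extract(r'leaflet_(\d+)_\d+\.jpg')[0].astype(int)
files_per_id = files.groupby(ids).agg(list).rename_axis('ID').reset_index()

# Attach the party information from the csv and filter on it
out = df.merge(files_per_id, on='ID', how='left')
out.loc[out['Party'] == 'Labour Party'].to_csv('labour_files.csv', index=False)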
So I'm trying to import a number of Excel files and create a list of all the data. Here is my code for it:
import os
import pandas as pd
cwd = os.path.abspath('')
files = os.listdir(cwd)
df = pd.DataFrame()
for file in files:
    if file.endswith('.XLSX'):
        df = df.append(pd.read_excel(file), ignore_index=True)
df = df.where(df.notnull(), None)
array = df.values.tolist()
print(array)
The Excel files, on the other hand, look something like this:
product cost used_by prime
name price gender yes or no
name price gender yes or no
... and so on
However, not all of them have the order product cost used_by prime (case one order). Some of them, for example, are in the format cost product prime used_by (case two order). Of course, pandas is able to match the data to the right header, but I run into an issue.
So basically, I run this code on two different devices using the same data and code, but the results are different: one of them is in case one order while the other is in case two order. I want a line of code that makes sure the data frame is always in the order product cost used_by prime, but I am not sure how.
Can you show me the python code for it? Thank you in advance.
You can try reordering the columns right after loading each Excel file:
df = df[['product', 'cost', 'used_by', 'prime']]
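Untested, but applied to the loop in the question, the reordering could look something like this (assuming the four column names above are present in every file):
import os
import pandas as pd

columns = ['product', 'cost', 'used_by', 'prime']
frames = []
for file in os.listdir(os.path.abspath('')):
    if file.endswith('.XLSX'):
        # Select the same column order for every file before combining
        frames.append(pd.read_excel(file)[columns])
df = pd.concat(frames, ignore_index=True)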
I'm incredibly new to coding, so please bear with me.
So basically, I have a folder of 4,229 .fits files (legac_spec) and a dataframe (legac_cat) with 1,989 rows, two of the columns being ID number and mask value. Each .fits file has a name along the lines of legac_M[mask value]_v3.6_spec1d_[id number].fits, but I'm not sure how to get the specific files I need, where each mask value and ID number corresponds to a specific file.
I know I need to use a for loop, but I'm not sure how to get it to do what I need, since I have to match the mask value and ID numbers separately.
import numpy as np
import pandas as pd
from astropy.io import fits
legac_cat = pd.read_csv('legac_file')
M = legac_cat['mask']
ID = legac_cat.id
directory_name = 'C:/Users/kfdhfs/Downloads/legac_spec'
for mask_val in legac_cat['mask']:
    for files in directory_name:
        hdu = fits.open(files)
I've barely ever used Pandas. I haven't tested this solution, so sorry if it doesn't work "straight out of the box". It seems to me that all you have to do is iterate over your dataframe's rows (where each row has a "mask" and "id" that correspond to one file), and then construct a filename from each row's "mask" and "id" - then open the file:
legac_cat = pd.read_csv("legac_file")
for index, row in legac_cat.iterrows():
    file_name = "legac_M{}_v3.6_spec1d_{}.fits".format(row["mask"], row["id"])
    # open the file using file_name
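To actually open each file, a rough, untested sketch, assuming the directory path and csv from the question and that every constructed name exists on disk:
import os
import pandas as pd
from astropy.io import fits

legac_cat = pd.read_csv('legac_file')
directory_name = 'C:/Users/kfdhfs/Downloads/legac_spec'

for index, row in legac_cat.iterrows():
    file_name = "legac_M{}_v3.6_spec1d_{}.fits".format(row["mask"], row["id"])
    file_path = os.path.join(directory_name, file_name)
    # Open the FITS file, read the primary HDU's data, then close it again
    with fits.open(file_path) as hdu_list:
        data = hdu_list[0].data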
I have many files in a folder that look like this one:
[screenshot of a sample file: header-less fixed-width rows with the site address, the plugin in the third field, and numeric metric columns]
and I'm trying to build a dictionary for the data. I want to create it with 2 keys (the first one is the http address and the second is the third field, the plugin used, like adblock). The values refer to different metrics, so my intention is to compute, for each site and plugin, the mean, median and variance of each metric once the dictionary has been built. For example, for the mean my intention is to consider all the 4th-field values in the file, and so on. I tried to write this code but, first of all, I'm not sure that it is correct.
[screenshot of the attempted dictionary code]
I read other posts but none solved my problem, since they treat only one key or they don't show how to access the different values inside the dictionary to compute the mean, median and variance.
The problem is simple: assuming the dictionary implementation is ok, how should I access the different values for key1:www.google.it -> key2:adblock?
Any kind of help is appreciated, and I can provide more details if needed.
You can do what you want using a dictionary, but you should really consider using the Pandas library. This library is centered around a tabular data structure called "DataFrame" that excels in column-wise and row-wise calculations such as the ones you seem to need.
To get you started, here is the Pandas code that reads one text file using the read_fwf() method. It also displays the mean and variance for the fourth column:
# import the Pandas library:
import pandas as pd
# Read the file 'table.txt' into a DataFrame object. Assume
# a header-less, fixed-width file like in your example:
df = pd.read_fwf("table.txt", header=None)
# Show the content of the DataFrame object:
print(df)
# Print the fourth column (zero-indexed):
print(df[3])
# Print the mean for the fourth column:
print(df[3].mean())
# Print the variance for the fourth column:
print(df[3].var())
There are different ways of selecting columns and rows from a DataFrame object. The square brackets [ ] in the previous examples selected a column in the data frame by column number. If you want to calculate the mean of the fourth column only from those rows that contain adblock in the third column, you can do it like so:
# Print those rows from the data frame that have the value 'adblock'
# in the third column (zero-indexed):
print(df[df[2] == "adblock"])
# Print only the fourth column (zero-indexed) from that data frame:
print(df[df[2] == "adblock"][3])
# Print the mean of the fourth column from that data frame:
print(df[df[2] == "adblock"][3].mean())
EDIT:
You can also calculate the mean or variance for more than one column at the same time:
# Use a list of column numbers to calculate the mean for all of them
# at the same time:
l = [3, 4, 5]
print(df[l].mean())
END EDIT
If you want to read the data from several files and do the calculations for the concatenated data, you can use the concat() method. This method takes a list of DataFrame objects and concatenates them (by default, row-wise). Use the following line to create a DataFrame from all *.txt files in your directory:
import glob

df = pd.concat([pd.read_fwf(file, header=None) for file in glob.glob("*.txt")],
               ignore_index=True)
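Since the end goal is the mean, median and variance per site and plugin, a hedged sketch of doing that on the combined data frame (assuming the site is column 0 and the plugin is column 2, as in your description; the metric columns 3 and 4 are illustrative):
# Group by site (column 0) and plugin (column 2), then aggregate the metric columns
stats = df.groupby([0, 2])[[3, 4]].agg(['mean', 'median', 'var'])
print(stats)

# Equivalent of looking up key1:www.google.it -> key2:adblock
print(stats.loc[('www.google.it', 'adblock')])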
I have a csv file imported to a pandas dataframe. It probably came from a database export that combined a one-to-many parent and detail table. The format of the csv file is as follows:
header1, header2, header3, header4, header5, header6
sample1, property1,,,average1,average2
,,detail1,detail2,,
,,detail1,detail2,,
,,detail1,detail2,,
sample2, ...
,,detail1,detail2,,
,,detail1,detail2,,
...
(i.e. line 0 is the header, line 1 is record 1, lines 2 through n are details, line n+1 is record 2 and so on...)
What is the best way to extricate (renormalize?) the details into separate DataFrames that can be referenced using values in the sample# records? The number of each subset of details are different for each sample.
I can use:
samplelist = df.header2[pd.notnull(df.header2)]
to get the starting index of each sample so that I can grab samplelist.index[0] to samplelist.index[1] and put it in a smaller dataframe. Detail records by themselves have no reference to which sample they came from, so that has to be inferred from the order of the csv file (notice that there is no intersection of filled/empty fields in my example).
Should I make a list of dataframes, a dict of dataframes, or a panel of dataframes?
Can I somehow create variables from the sample1 record fields and somehow attach them to each dataframe that has only detail records (like a collection of objects that have several scalar members and one dataframe each)?
Eventually I will create statistics on data from each detail record grouping and plot them against values in the sample records (e.g. sampletype, day or date, etc. vs. mystatistic). I will create intermediate Series to also be attached to the sample grouping like a kernel density estimation PDF or histogram.
Thanks.
You can use the fact that the first column seems to be empty unless it's a new sample record: .fillna(method='ffill') it and then .groupby('header1') to get all the separate groups. On these, you can calculate statistics right away or store them as separate DataFrames. A high-level sketch follows:
df.header1 = df.header1.fillna(method='ffill')
for sample, data in df.groupby('header1'):
    print(sample)  # access to sample name
    data = ...     # process sample records
The answer above got me going in the right direction. With further work, the following was used. It turns out I needed to use two columns as a compound key to uniquely identify samples.
df.header1 = df.header1.fillna(method='ffill')
df.header2 = df.header2.fillna(method='ffill')
grouped = df.groupby(['header1', 'header2'])
samplelist = []
dfParent = pd.DataFrame()
dfDetail = pd.DataFrame()
for sample, data in grouped:
    samplelist.append(sample)
    dfParent = dfParent.append(grouped.get_group(sample).head(n=1), ignore_index=True)
    dfDetail = dfDetail.append(data[1:], ignore_index=True)
# remove columns only used in detail records
dfParent = dfParent.drop(columns=['header3', 'header4', etc...])
# remove columns only used once per sample
dfDetail = dfDetail.drop(columns=['header5', 'header6', etc...])
# Now details can be extracted by sample number in the sample list
# (e.g. the first 10 for sample 0)
samplenumber = 0
dfDetail[
    (dfDetail['header1'] == samplelist[samplenumber][0]) &
    (dfDetail['header2'] == samplelist[samplenumber][1])
].header3[:10]
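From here, the per-sample statistics mentioned in the question can be computed directly on dfDetail; an untested sketch, assuming header3 holds numeric detail values:
# Mean, median and variance of header3 for every (header1, header2) sample
detail_stats = (dfDetail.groupby(['header1', 'header2'])['header3']
                .agg(['mean', 'median', 'var'])
                .reset_index())

# Join the statistics back onto the parent records for later plotting
dfParent = dfParent.merge(detail_stats, on=['header1', 'header2'], how='left')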
Useful links were:
Pandas groupby and get_group
Pandas append to DataFrame
I am currently trying to figure out a way to get information stored across multiple datasets as .csv files.
Context
For the purposes of this question, suppose I have 4 datasets: experiment_1.csv, experiment_2.csv, experiment_3.csv, and experiment_4.csv. In each dataset, there are 20,000+ rows with 80+ columns in each row. Each row represents an Animal, identified by an ID number, and each column represents various experimental data about that Animal. Assume each row's Animal ID number is unique within each dataset, but not across all datasets. For instance, ID#ABC123 can be found in experiment_1.csv and experiment_2.csv, but not in experiment_3.csv or experiment_4.csv.
Problem
Say a user wants to get info for ~100 Animals by looking up each Animal's ID # across all datasets. How would I go about doing this? I'm relatively new to programming, and I would like to improve. Here's what I have so far.
import csv

class Animal:
    def __init__(self, id_number, *other_parameters):
        self.animal_id = id_number
        self.animal_data = {}

    def store_info(self, csv_row, dataset):
        self.animal_data[dataset] = csv_row

# Main function
# ...
# Assume animal_queries = list of Animal objects
# Iterate through each dataset csv file
for dataset in all_datasets:
    # Make a copy of the list of queries
    animal_queries_copy = animal_queries[:]
    with open(dataset, 'r', newline='') as dataset_file:
        reader = csv.DictReader(dataset_file, delimiter=',')
        # Iterate through each row in the csv file
        for row in reader:
            # Check if the list is not empty
            if animal_queries_copy:
                # Get the current row's animal id number
                row_animal_id = row['ANIMAL ID']
                # Check if the animal id number matches a query for
                # every animal in the list
                for animal in animal_queries_copy[:]:
                    if animal.animal_id == row_animal_id:
                        # If a match is found, store the info, remove the
                        # query from the list, and stop iterating through
                        # the queries
                        animal.store_info(row, dataset)
                        animal_queries_copy.remove(animal)
                        break
            # If the list is empty, all queries were found for the current
            # dataset, so exit iterating through rows in reader
            else:
                break
Discussion
Is there a more obvious approach for this? Assume that I want to use .csv files for now, and I will consider converting these .csv files to an easier-to-use format like SQL Tables later down the line (I am an absolute beginner at databases and SQL, so I need to spend time learning this).
The one thing that sticks out to me is that I have to create multiple copies of animal_queries: one for each dataset, and one for each row in a dataset (in the for loop). Since one row only contains one ID, I can exit the loop early once I find a match to an ID from animal_queries. In addition, since that ID was already found, I no longer need to search for it in the rest of the current dataset, so I remove it from the list, but I need to keep the original copy of the queries since I also need it to search the remaining datasets. However, I can't remove an element from a list while iterating over it in a for loop, so I need to create another copy as well. This doesn't seem optimal to me and I'm wondering if I'm approaching this in the wrong direction. Any help would be appreciated, thanks!
Well, you could greatly speed this up by using the pandas library for one thing. Ignoring the class definition for now, you could do the following:
import pandas as pd
file_names = ['ex_1.csv', 'ex_2.csv']
animal_queries = ['foo', 'bar'] #input by user
#create list of data sets
data_sets = [pd.read_csv(_file) for _file in file_names]
#create store of retrieved data
retrieved_data = [d_s[d_s['ANIMAL ID'].isin(animal_queries)] for d_s in data_sets]
#concatenate the data
final_data = pd.concat(retrieved_data)
#export to csv
final_data.to_csv('your_data')
This simplifies things a lot. The isin method slices each data frame to the rows where ANIMAL ID is found in the list animal_queries. Incidentally, pandas will also help you cope with SQL tables, so it is probably a good route for you to go down.
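For the later move to a database, a rough sketch of writing the same combined data to a SQLite table with pandas (the database and table names are illustrative):
import sqlite3
import pandas as pd

# Store the combined results in a local SQLite database
with sqlite3.connect('animals.db') as conn:
    final_data.to_sql('animal_experiments', conn, if_exists='replace', index=False)

    # Individual animals can then be looked up with a SQL query
    result = pd.read_sql_query(
        "SELECT * FROM animal_experiments WHERE [ANIMAL ID] = ?",
        conn, params=('foo',))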