I am trying to create a list of unique IDs from multiple CSVs.
I have around 80 CSVs containing data, all in the same format and in the same directory. The files contain time series data from around 1,500 sites, but not all sites appear in every file. The column with the data I need is called 'Site Id'.
I can get unique values from the first csv by creating a dataframe, but I can't see how to loop through all the remaining files.
If it's not obvious by now I am a complete beginner and my tutors are on vacation!
I've tried creating a df for a single file, but I can't figure out the next step.
df = pd.read_csv(r'C:filepathhere.csv')
ids = df['Site Id'].unique().tolist()
You can do something like this. I used the os.listdir function to get all of the files, then list.extend to merge each file's site IDs into my siteIDs list. Finally, turning the list into a set and back into a list removes any duplicate entries.
import os
import pandas as pd

siteIDs = []
directoryToCSVs = r'c:\...'

for filename in os.listdir(directoryToCSVs):
    if filename.lower().endswith('.csv'):
        df = pd.read_csv(os.path.join(directoryToCSVs, filename))
        siteIDs.extend(df['Site Id'].tolist())

# remove duplicate site IDs
siteIDs = list(set(siteIDs))

# siteIDs now contains a list of the unique site IDs across all of your CSV files.
You could do something like this to iterate over all your CSVs and load them into dataframes:
from glob import glob
from os import path, walk

import pandas as pd

csv_dir = 'Path to CSV dir'

csv_paths = []
for root, dirs, files in walk(csv_dir):
    for c in glob(path.join(root, '*.csv')):
        csv_paths.append(c)

for file_path in csv_paths:
    df = pd.read_csv(filepath_or_buffer=file_path)
    # do something with df (append, export, etc.)
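To answer the original question about unique IDs with this approach, a minimal sketch of the "do something with df" step might look like this, reusing csv_paths and pandas from the snippet above and assuming every file has a 'Site Id' column:
all_site_ids = set()
for file_path in csv_paths:
    df = pd.read_csv(file_path)
    # a set collapses duplicates across files automatically
    all_site_ids.update(df['Site Id'].dropna())

unique_site_ids = sorted(all_site_ids)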
First you need to gather the files you will be getting data out of into a list. There are many ways to do this; assuming you know the directory they are all in, see this answer for more options.
from os import walk

f = []
for (dirpath, dirnames, filenames) in walk(mypath):
    f.extend(filenames)
    break
Then, from those files, gather the unique values you need. Here's a way without using Pandas, since it doesn't seem like you actually need your information in a dataframe:
import csv
import os

unique_data = {}
for file in f:
    with open(os.path.join(mypath, file), 'r', newline='') as infile:
        reader = csv.DictReader(infile)
        for row in reader:
            # add each row's Site Id to the dictionary; duplicate keys collapse automatically
            unique_data[row['Site Id']] = 0

# unique_data.keys() is now your collection of unique values; if you want a true list:
unique_data_list = list(unique_data.keys())
First time poster and fairly new to Python here. I have a collection of 1,700+ CSV files with 2 columns each. The number and labels of the rows are the same in every file. The files are named with a specific format. For example:
Species_1_OrderA_1.csv
Species_1_OrderA_2.csv
Species_1_OrderA_3.csv
Species_10_OrderB_1.csv
Species_10_OrderB_2.csv
Each imported dataframe is formatted like so:
TreeID Species_1_OrderA_2
0 Bu2_1201_1992 0
1 Bu3_1201_1998 0
2 Bu4_1201_2000 0
3 Bu5_1201_2002 0
4 Bu6_1201_2004 0
.. ... ...
307 Fi141_16101_2004 0
308 Fi142_16101_2006 0
309 Fi143_16101_2008 0
310 Fi144_16101_2010 0
311 Fi147_16101_2015 0
I would like to join the files that correspond to the same species, based on the first column. So, in the end, I would get the files Species_1_OrderA.csv and Species_10_OrderB.csv. Please note that all the species do not necessarily have the same number of files.
This is what I have tried so far.
import os
import glob
import pandas as pd

# Importing csv files from directory
path = '.'
extension = 'csv'
os.chdir(path)
files = glob.glob('*.{}'.format(extension))

# Create a dictionary to loop through each file to read its contents and create a dataframe
file_dict = {}
for file in files:
    key = file
    df = pd.read_csv(file)
    file_dict[key] = df

# Extract the name of each dataframe, convert to a list and extract the relevant
# information (before the 3rd underscore). Compare each of these values to the next and
# if they are the same, append them to a list. This list (in my head, at least) will help
# me merge them using pandas.concat
keys_list = list(file_dict.keys())
group = ''
for line in keys_list:
    type = "_".join(line.split("_")[:3])
    for i in range(len(type) - 1):
        if type[i] == type[i+1]:
            group.append(line[keys_list])
print(group)
However, the last bit is not even working, and at this point, I am not sure this is the best way to deal with my problem. Any pointers on how to solve this will be really appreciated.
--- EDIT:
This is the expected output for the files per species. Ideally, I would remove the rows that have zeros in them, but that can easily be done with awk; a pandas alternative is sketched after the sample output below.
TreeID,Species_1_OrderA_0,Species_1_OrderA_1,Species_1_OrderA_2
Bu2_1201_1992,0,0,0
Bu3_1201_1998,0,0,0
Bu4_1201_2000,0,0,0
Bu5_1201_2002,0,0,0
Bu6_1201_2004,0,0,0
Bu7_1201_2006,0,0,0
Bu8_1201_2008,0,0,0
Bu9_1201_2010,0,0,0
Bu10_1201_2012,0,0,0
Bu11_1201_2014,0,0,0
Bu14_1201_2016,0,0,0
Bu16_1201_2018,0,0,0
Bu18_3103_1989,0,0,0
Bu22_3103_1999,0,0,0
Bu23_3103_2001,0,0,0
Bu24_3103_2003,0,0,0
...
Fi141_16101_2004,0,0,10
Fi142_16101_2006,0,4,0
Fi143_16101_2008,0,0,0
Fi144_16101_2010,2,0,0
Fi147_16101_2015,0,7,0
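As a rough pandas alternative to that awk step (assuming the intent is to drop rows whose measurement columns are all zero in a merged per-species file), something like this could work:
import pandas as pd

# hypothetical: one of the merged per-species files
merged = pd.read_csv('Species_1_OrderA.csv')

# keep rows where at least one measurement column is non-zero
measurements = merged.drop(columns='TreeID')
merged = merged[(measurements != 0).any(axis=1)]
merged.to_csv('Species_1_OrderA.csv', index=False)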
Try it like this:
import os
import pandas as pd

path = "C:/Users/username"
files = [file for file in os.listdir(path) if file.endswith(".csv")]

dfs = dict()
for file in files:
    # everything before the final _ is the species name
    species = file.rsplit("_", maxsplit=1)[0]
    # read the csv to a dataframe
    df = pd.read_csv(os.path.join(path, file))
    # if you don't have a df for a species, create a new key
    if species not in dfs:
        dfs[species] = df
    # else, merge current df to existing df on the TreeID
    else:
        dfs[species] = pd.merge(dfs[species], df, on="TreeID", how="outer")

# write all dfs to their own csv files
for key in dfs:
    dfs[key].to_csv(f"{key}.csv", index=False)
If your goal is to concatenate all the CSVs for each species-order into a consolidated CSV, this is one approach. I haven't tested it, so there might be a few errors. The idea is to first use glob, as you're doing, to make a dict of file_paths so that all the file_paths of the same species-order are grouped together. Then, for each species-order, read all the data into a single table in memory and write it out to a consolidated file.
import glob

import pandas as pd

# Create a dictionary keyed by species_order, valued by a list of files
# i.e. file_paths_by_species_order['Species_10_OrderB'] = ['Species_10_OrderB_1.csv', 'Species_10_OrderB_2.csv']
file_paths_by_species_order = {}
for file_path in glob.glob('*.csv'):
    # everything before the final underscore, e.g. 'Species_10_OrderB'
    species_order = "_".join(file_path.split("_")[:3])
    if species_order not in file_paths_by_species_order:
        file_paths_by_species_order[species_order] = [file_path]
    else:
        file_paths_by_species_order[species_order].append(file_path)

# For each species_order, concat all files and save the info into a new csv
for species_order, file_paths in file_paths_by_species_order.items():
    df = pd.concat(pd.read_csv(file_path) for file_path in file_paths)
    df.to_csv('consolidated_{}.csv'.format(species_order))
There are definitely improvements that can be made, such as using collections.defaultdict to build the grouping, and writing one file at a time out to the consolidated file instead of reading them all into memory; a sketch of the defaultdict variant is below.
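For example, a minimal sketch of the grouping step with collections.defaultdict (same logic as above, just without the explicit key check):
import glob
from collections import defaultdict

file_paths_by_species_order = defaultdict(list)
for file_path in glob.glob('*.csv'):
    species_order = "_".join(file_path.split("_")[:3])
    # defaultdict creates the empty list the first time a key is seen
    file_paths_by_species_order[species_order].append(file_path)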
I am loading a csv file in pandas as
premier10 = pd.read_csv('./premier_league/pl_09_10.csv')
However, I have 20+ csv files, which I was hoping to load as separate dfs (one df per csv) using a loop and predefined names, something similar to:
import pandas as pd

file_names = ['pl_09_10.csv', 'pl_10_11.csv']
names = ['premier10', 'premier11']

for i in range(0, len(file_names)):
    names[i] = pd.read_csv('./premier_league/{}'.format(file_names[i]))
(Note: here I provide only two csv files as an example.) Unfortunately, this doesn't work (no error messages, but the pd dfs don't exist).
Any tips/links to previous questions would be greatly appreciated as I haven't found anything similar on Stackoverflow.
Use pathlib to set a Path, p, to the files
Use the .glob method to find the files matching the pattern
Create a dataframe with pandas.read_csv
Use a dict comprehension to create a dict of dataframes, where each file will have its own key-value pair.
Use the dict like any other dict; the keys are the file names and the values are the dataframes.
Alternatively, use a list comprehension with pandas.concat to create a single dataframe from all the files.
In the for-loop in the OP, names[i] = pd.read_csv(...) only replaces the strings inside the names list with dataframes; it does not create separate variables named premier10 and premier11, which is why those dataframes appear not to exist afterwards.
from pathlib import Path
import pandas as pd
# set the path to the files
p = Path('some_path/premier_league')
# create a list of the files matching the pattern
files = list(p.glob('pl_*.csv'))
# creates a dict of dataframes, where each file has a separate dataframe
df_dict = {f.stem: pd.read_csv(f) for f in files}
# alternative, creates 1 dataframe from all files
df = pd.concat([pd.read_csv(f) for f in files])
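The keys of df_dict are the file stems, so a single season can then be pulled back out like this:
premier10 = df_dict['pl_09_10']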
names = ['premier10','premier11'] creates a list, not a dictionary. Either replace it with names = dict() and assign each dataframe to a key, or start with an empty list and names.append(...) each dataframe as you read it; a minimal sketch of the dict variant follows.
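For instance, a rough sketch of that dict variant, reusing the OP's file_names list and directory (the key naming is just one choice):
import pandas as pd

file_names = ['pl_09_10.csv', 'pl_10_11.csv']

names = dict()
for file_name in file_names:
    # key each dataframe by a short name derived from the file name, e.g. 'pl_09_10'
    names[file_name.replace('.csv', '')] = pd.read_csv('./premier_league/{}'.format(file_name))

# access a single season's dataframe
premier10 = names['pl_09_10']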
This is what you want:
import os
import pandas as pd

# create a variable and look through contents of the directory
files = [f for f in os.listdir("./your_directory") if f.endswith('.csv')]

# Initialize an empty data frame
all_data = pd.DataFrame()

# iterate through files and their contents, then concatenate their data into the data frame initialized above
for file in files:
    df = pd.read_csv(os.path.join('./your_directory', file))
    all_data = pd.concat([all_data, df])

# Call the new data frame and verify that contents were transferred
all_data.head()
I have a folder with multiple CSV files named like this
CINinfo_2019-08-08_rev1, CINinfo_2019-08-08_rev2, CINinfo_2019-08-08_rev3, CINinfo_2019-08-08_rev4. I have about 70 files in one folder. My intention is to automate this process so that I can read them in pairs of two, compare each pair for differences, and have the result as one combined table. Currently, I am reading them manually and comparing differences; this is the code:
import pandas as pd
import numpy as np

df1 = pd.read_csv("CINinfo_2019-08-08_rev1.csv")
df2 = pd.read_csv("CINinfo_2019-08-08_rev2.csv")

# element-wise comparison of the two revisions
comparison_values = df1.values == df2.values

rows, cols = np.where(comparison_values == False)
for item in zip(rows, cols):
    df1.iloc[item[0], item[1]] = '{} --> {}'.format(df1.iloc[item[0], item[1]], df2.iloc[item[0], item[1]])
This process is so tedious, since I have other folders with CSV files that I need to read. Note how the CSV files are named: all have the same prefix (CINinfo_2019-08-08_), but the suffix (rev) has an incremental number from 1 to 70. I need the files read in consecutive pairs: 1 and 2, 2 and 3, 3 and 4, and so on. So I compare CINinfo_2019-08-08_rev1 with CINinfo_2019-08-08_rev2, then CINinfo_2019-08-08_rev2 with CINinfo_2019-08-08_rev3, and so on. How can I automate reading the files in pairs, compare each pair for differences, and end up with one joined table?
You could try something like this:
import os
import re

import pandas as pd
import numpy as np

# your directory path here
path = r'path'

# get all files
file_, pat = [], re.compile(r'\.csv$')
for root, dirs, files in os.walk(path):
    file_.extend(os.path.join(root, f) for f in files if pat.search(f))

# you may want to filter here, this line is just an example
# filter for all csv files containing 'rev'
file_ = [f for f in file_ if 'rev' in f]

# loop through the files of interest, comparing each file to the previous one
for (idx, ff) in enumerate(file_[1:]):
    df1 = pd.read_csv(ff)
    df2 = pd.read_csv(file_[idx])
    # element-wise comparison, as in the question
    comparison_values = df1.values == df2.values
    rows, cols = np.where(comparison_values == False)
    for item in zip(rows, cols):
        pass  # do calculation
This answer is not all inclusive, but hopefully it will give you a possible approach. You may need to adjust the filtering, or possibly sort the file list. I have not shown how to add the results to a final table, but the best thing to do is to create a temporary DataFrame, assign the values from each file pair to it, and then use pd.concat to add it to a final DataFrame that will contain all results (a rough sketch is below).
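A rough, untested sketch of that last idea, continuing from the file_ list built above and assuming the rev files should be sorted numerically by their rev suffix before pairing (the column names and rev_number helper are just for illustration):
import re

import numpy as np
import pandas as pd

def rev_number(file_path):
    # pull the number out of '...revN.csv' so that rev10 sorts after rev2
    return int(re.search(r'rev(\d+)', file_path).group(1))

file_ = sorted(file_, key=rev_number)

results = []
for idx, ff in enumerate(file_[1:]):
    df1 = pd.read_csv(ff)
    df2 = pd.read_csv(file_[idx])
    comparison_values = df1.values == df2.values
    rows, cols = np.where(comparison_values == False)
    # record each differing cell as one row in a temporary DataFrame
    temp = pd.DataFrame({
        'pair': ['{} vs {}'.format(file_[idx], ff)] * len(rows),
        'row': rows,
        'column': [df1.columns[c] for c in cols],
        'earlier_value': [df2.iat[r, c] for r, c in zip(rows, cols)],
        'later_value': [df1.iat[r, c] for r, c in zip(rows, cols)],
    })
    results.append(temp)

final = pd.concat(results, ignore_index=True)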
I have a column of values, which is part of a dataframe df.
Value
6.868061881
6.5903628020000005
6.472865833999999
6.427754219
6.40081742
6.336348032
6.277545389
6.250755132
These values have been put together from several CSV files. Now I'm trying to backtrack and find the original CSV file which contains the values. This is my code. The problem is that each row of the CSV files contains alphanumeric entries and I'm comparing only numeric ones (like the Values above), so the code isn't working.
for item in df['Value']:
    for file in dirs:
        csv_file = csv.reader(open(file))
        for row in csv_file:
            for column in row:
                if str(column) == str(item):
                    print(file)
Plus, I'm trying to optimize the number of loops. How do I approach this?
Assuming dirs is a list of file paths to CSV files:
csv_dfs = {file: pd.read_csv(file) for file in dirs}
csv_df = pd.concat(csv_dfs)
If you're just looking in the 'Value' column, this is pretty straightforward:
print(csv_df[csv_df['Value'].isin(df['Value'])])
Because we made the dataframe from a dictionary of the files, where the keys are filenames, the printed values will have the original filename in the index.
In a comment, you asked how to just get the filenames. Because of the way we constructed the dataframe's index, the following should work to get a series of the filenames:
csv_df[csv_df['Value'].isin(df['Value'])].reset_index()['level_0']
Note, if you're not sure what column in the CSVs you're matching, then you can loop it:
for col in csv_df.columns:
    print(csv_df[csv_df[col].isin(df['Value'])])
A few suggestions:
Make sure you're comparing like types, e.g.:
if str(column) == str(item):
Or, you could check that the types match before doing the comparison:
if type(column) == type(item) and column == item:
Or, dump your CSV into a DataFrame. This approach reduces the number of loops since you don't need to iterate the rows/lines in the file, just the columns:
from pandas import read_csv

for item in df['Value']:
    for file in dirs:
        csv_frame = read_csv(file)
        for column in csv_frame.columns:
            # 'in' on a Series checks the index, so compare against the values instead
            if item in csv_frame[column].values:
                print(file)
File I/O will generally take more time than processing data in memory. So, if you want to optimize your code, it is better to loop through the csv files once, instead of once for every item in your dataframe. I suggest the following:
import numpy as np
import pandas as pd

val_list = df['Value'].values
for file in dirs:
    csv_df = pd.read_csv(file)
    df_contains = csv_df.isin(val_list)
    if np.any(df_contains.values):
        print(file)
I have a long list of csv files that I want to read as dataframes and name them by their file name. For example, I want to read in the file status.csv and assign its dataframe the name status. Is there a way I can efficiently do this using Pandas?
Looking at this, I still have to write the name of each csv in my loop. I want to avoid that.
Looking at this, that approach reads multiple csvs into one dataframe instead of many separate ones.
You can list all the csv files under a directory using os.listdir(dirname) and combine that with os.path.basename / os.path.splitext to parse the file name.
import os

import pandas as pd

# current directory csv files
csvs = [x for x in os.listdir('.') if x.endswith('.csv')]
# status.csv -> status
fns = [os.path.splitext(os.path.basename(x))[0] for x in csvs]

d = {}
for i in range(len(fns)):
    d[fns[i]] = pd.read_csv(csvs[i])
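For example, assuming the directory contained the status.csv mentioned in the question, its dataframe is then:
status = d['status']
status.head()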
You could create a dictionary of DataFrames:
import pandas as pd

d = {}  # dictionary that will hold them

for file_name in list_of_csvs:  # loop over files
    # read csv into a dataframe and add it to dict with file_name as its key
    d[file_name] = pd.read_csv(file_name)