Iterating over .csv files and naming dataframes respectively - python

How can I iterate over .csv files in a folder, create a dataframe from each .csv, and name those dataframes after the respective .csv files? Any other name would actually be fine too.
My approach doesn't even create a single dataframe.
path = "/user/Home/Data/"
files = os.listdir(path)
os.chdir(path)
for file, j in zip(files, range(len(files))):
    if file.endswith('.csv'):
        files[j] = pd.read_csv(file)
Thanks!

You can use pathlib and a dictionary to do that (as already pointed out by jitusj in the comment).
from pathlib import Path
import pandas as pd
path = Path(".../user/Home/Data/") # complete path needed! Replace "..." with full path
dict_of_df = {}
for file in path.glob('*.csv'):
    dict_of_df[file.stem] = pd.read_csv(file)
Now you have a dictionary of dataframes, with the file names (without the .csv extension) as keys.
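As a minimal sketch of how such a dictionary might be built and used, the snippet below writes a few small files to a temporary folder and loads them. The helper name load_csv_folder and the file names sales.csv, costs.csv and notes.txt are made up for illustration; they are not part of the original answer.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_folder(folder):
    """Return {file stem: DataFrame} for every .csv file directly inside folder."""
    return {f.stem: pd.read_csv(f) for f in sorted(Path(folder).glob("*.csv"))}


# Build a throwaway folder with hypothetical files so the sketch is runnable.
tmp = Path(tempfile.mkdtemp())
(tmp / "sales.csv").write_text("a,b\n1,2\n3,4\n")
(tmp / "costs.csv").write_text("a,b\n5,6\n")
(tmp / "notes.txt").write_text("ignored\n")  # not a .csv, so it is skipped

dict_of_df = load_csv_folder(tmp)
print(sorted(dict_of_df))         # names of the loaded dataframes
print(dict_of_df["sales"].shape)  # access one dataframe by its file name
```

Each dataframe is then reached by key, for example dict_of_df["sales"], which avoids creating variable names dynamically.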

Related

Parsing through each folder to pull in information in python

I have a directory with a folder for each customer. In each customer folder there is a csv file named surveys.csv. I want to open each customer folder and then pull the data from the csv and concatenate. I also want to create a column with that customer id which is the name of the folder.
import os
rootdir = '../data/customer_data/'
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        csvfiles = glob.glob(os.path.join(mycsvdir, 'surveys.csv'))
        # loop through the files and read them in with pandas
        dataframes = [] # a list to hold all the individual pandas DataFrames
        for csvfile in csvfiles:
            df = pd.read_csv(csvfile)
            df['patient_id'] = os.path.dirname
            dataframes.append(df)
# concatenate them all together
result = pd.concat(dataframes, ignore_index=True)
result.head()
This code only gives me a dataframe with one customer's data. The directory '../data/customer_data/' should contain about 25 folders with customer data. I want to concatenate all 25 of the surveys.csv files into one dataframe. Please help.
Put this line:
dataframes = []
outside the outer for loop.
As written, it erases the list on every iteration.
Other issues:
In the line csvfiles = glob.glob(os.path.join(mycsvdir, 'surveys.csv')), use subdir to get the full path of the file (mycsvdir is never defined).
csvfiles contains only one file per folder, so why use a loop to read it?
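Putting these points together, one possible sketch is shown below. It assumes every customer folder sits directly under the root directory and holds a file named surveys.csv. The function name load_surveys, the column name customer_id, and the sample folders c001 and c002 are illustrative assumptions. It also fills the id column from the folder name, whereas the original code assigned the os.path.dirname function itself.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_surveys(rootdir):
    """Concatenate every <rootdir>/<customer>/surveys.csv, tagging rows with the folder name."""
    dataframes = []  # created once, before the loop, so it is never reset
    for csvfile in sorted(Path(rootdir).glob("*/surveys.csv")):
        df = pd.read_csv(csvfile)
        df["customer_id"] = csvfile.parent.name  # the folder name is the customer id
        dataframes.append(df)
    # Note: pd.concat raises ValueError if no surveys.csv was found.
    return pd.concat(dataframes, ignore_index=True)


# Throwaway directory tree with two hypothetical customers.
root = Path(tempfile.mkdtemp())
for customer, score in [("c001", 3), ("c002", 5)]:
    (root / customer).mkdir()
    (root / customer / "surveys.csv").write_text(f"question,score\nq1,{score}\n")

result = load_surveys(root)
print(result)
```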

Read multiple csv files starting with a string into separate data frames in python

I have about 500 '.csv' files starting with the letter 'T', e.g. 'T50, T51, T52 ..... T550', and there are some other '.csv' files with other random names in the folder. I want to read all csv files starting with "T" and store them in separate dataframes: 't50, t51, t52... etc.'
The code I have written just reads these files into a dataframe
import glob
import pandas as pd
for file in glob.glob("T*.csv"):
    print(file)
I want to have a different name for each dataframe - preferably, their own file names. How can I achieve this within its 'for loop'?
Totally agree with @Comos.
But if you still need individual variable names, I adapted the solution from here:
import pandas as pd
import os
folder = '/path/to/my/inputfolder'
filelist = [file for file in os.listdir(folder) if file.startswith('T') and file.endswith('.csv')]
for file in filelist:
    # exec creates a variable named after the file (e.g. T50); only use it with file names you trust
    exec("%s = pd.read_csv(%r)" % (file.split('.')[0], os.path.join(folder, file)))
In addition to ABotros's answer: to read all files into different dataframes, I would recommend adding them to a dictionary, which lets you store dataframes under different names in a loop:
filelist = [file for file in os.listdir(folder) if file.startswith('T')]
database = {}
for file in filelist:
    # join with folder so the read works from any working directory
    database[file] = pd.read_csv(os.path.join(folder, file))

Consolidate excel files from folders that are found in list of foldernames

I have a list of foldernames and I want to go through the folders according to this list and consolidate the excel files found in those folders.
Example:
Say I have the following directory:"C:/Users/XXX/Documents/File Tracking"
This includes the folders A, B, C, D, E, F
Now I have a list of folder names: lst=[A,B,D]
Now I want to go through the folders A, B, D and consolidate the excel files found in these folders into one, ignoring the folders not mentioned in this list.
This is some code that works if I want to consolidate the files from all subfolders
all_data = pd.DataFrame()
for f in glob.glob("C:/Users/XXX/Documents/File Tracking/*"):
    df = pd.read_excel(f)
    all_data = all_data.append(df,ignore_index=True)
If I understand correctly, this should work just fine. Check the comments in the code for more explanation.
import pandas as pd
import os
# assumes you have a list of the folder paths
def consolidate_excel_files(folder_paths: list) -> pd.DataFrame:
    # used to collect all dfs from folders
    df_collection = []
    for folder in folder_paths:
        # makes sure your path is right
        if os.path.isdir(folder):
            # list comprehension that reads every excel file into a data frame
            # will ignore any stray file that is not .xlsx or .xls
            all_files_as_df = [pd.read_excel(os.path.join(folder, file))
                               for file in os.listdir(folder)
                               if os.path.splitext(file)[1] in ['.xlsx', '.xls']]
            # we only want a 1d list when we use pd.concat, so we extend instead
            df_collection.extend(all_files_as_df)
    # assuming the index is not important
    return pd.concat(df_collection, ignore_index=True)
There is probably a less verbose way of doing this if you assume a few things, but this will work.
You can do it in the most straightforward way: get the list of directories in the chosen base directory, filter it, and look for the spreadsheets inside each of them. See the boilerplate below:
import glob
import os
path = "C:/Users/XXX/Documents/File Tracking/"
allowed = ["A", "B", "D"]
# list of first-level directories from allowed list
dirs = [name for name in os.listdir(path) if os.path.isdir(os.path.join(path, name)) and name in allowed]
for dirname in dirs:
    # iterate over all files that match pattern, for example, xlsx
    for file_name in glob.glob(os.path.join(path, dirname, "*.xlsx")):
        # process each file
        print(file_name)
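The same filtering idea can be sketched with pathlib. The helper name find_spreadsheets and the sample folder and file names below are assumptions made for illustration. The sketch only collects the matching paths, and the files it creates are empty placeholders rather than real workbooks.

```python
import tempfile
from pathlib import Path


def find_spreadsheets(base, allowed, suffixes=(".xlsx", ".xls")):
    """List spreadsheet files located directly inside the allowed subfolders of base."""
    found = []
    for name in allowed:
        folder = Path(base) / name
        if not folder.is_dir():
            continue  # skip names in the list that have no matching folder
        found.extend(sorted(p for p in folder.iterdir()
                            if p.is_file() and p.suffix.lower() in suffixes))
    return found


# Throwaway tree: folders A, B, C exist; D is listed as allowed but missing.
base = Path(tempfile.mkdtemp())
for folder, filename in [("A", "a1.xlsx"), ("B", "b1.xls"), ("C", "c1.xlsx"), ("A", "readme.txt")]:
    (base / folder).mkdir(exist_ok=True)
    (base / folder / filename).touch()

files = find_spreadsheets(base, ["A", "B", "D"])
print([f.name for f in files])
```

With real workbooks, each returned path could be passed to pd.read_excel and the resulting frames combined with pd.concat.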

Extract file name from read_csv - Python

I have a script that currently reads raw data from a .csv file and performs some pandas data analysis against the data. Currently the .csv file is hardcoded and is read in like this:
data = pd.read_csv('test.csv',sep="|", names=col)
I want to change 2 things:
I want to turn this into a loop so it loops through a directory of .csv files and executes the pandas analysis below on each one in the script.
I want to take each .csv file name, strip the '.csv', and store that in another list variable, let's call it 'new_table_list'.
I think I need something like below, at least for the 1st point (though I know this isn't completely correct). I am not sure how to address the 2nd point.
Any help is appreciated.
import os
path = '\test\test\csvfiles'
table_list = []
for filename in os.listdir(path):
    if filename.endswith('.csv'):
        table_list.append(file)
data = pd.read_csv(table_list,sep="|", names=col)
There are many ways to do it:
table_list = []
new_table_list = []
for filename in os.listdir(path):
    if filename.endswith('.csv'):
        table_list.append(pd.read_csv(os.path.join(path, filename), sep="|"))
        new_table_list.append(filename.split(".")[0])
One more:
for filename in os.listdir(path):
    if filename.endswith('.csv'):
        table_list.append(pd.read_csv(os.path.join(path, filename), sep="|"))
        new_table_list.append(filename[:-4])
And many more.
As @barmar pointed out, it is better to join path with each file name when reading (as above), to avoid problems when the script is not run from the folder that holds the files.
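The two ways of stripping the extension used above are not equivalent when a file name contains more than one dot. A small comparison, using a made-up file name, might look like this:

```python
import os
from pathlib import Path

filename = "sales.2020.csv"  # hypothetical name containing an extra dot

by_split = filename.split(".")[0]            # cuts at the first dot
by_slice = filename[:-4]                     # drops the last four characters
by_splitext = os.path.splitext(filename)[0]  # drops only the final extension
by_stem = Path(filename).stem                # pathlib equivalent of splitext

print(by_split, by_slice, by_splitext, by_stem)
```

Slicing assumes the extension is exactly four characters long, so os.path.splitext or Path.stem is usually the safer choice.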
You can try something like this:
import glob
import os
data = {}
for filename in glob.glob('/path/to/csvfiles/*.csv'):
    # key on the bare file name, without the directory or the extension
    name = os.path.splitext(os.path.basename(filename))[0]
    data[name] = pd.read_csv(filename, sep="|", names=col)
Then data.keys() gives the file names without the ".csv" part and data.values() gives one pandas dataframe for each file.
I'd start with using pathlib.
from pathlib import Path
And then leverage the stem attribute and glob method.
Let's make an import function.
def read_csv(f):
    return pd.read_csv(f, sep="|")
The most generic approach would be to store in a dictionary.
p = Path(r'\test\test\csvfiles')  # raw string, so \t is not read as a tab
dod = {f.stem: read_csv(f) for f in p.glob('*.csv')}
And you can also use pd.concat to turn that into a dataframe.
df = pd.concat(dod)
To get the list of CSV files in the directory, use glob; it is easier than os.
import os
from glob import glob
# csvs will contain the paths of all files ending with .csv, as a list
csvs = glob('your\\dir\\to\\csvs_folder\\*.csv')
# remove the folder and the trailing .csv from the CSV file names
new_table_list = [os.path.splitext(os.path.basename(csv))[0] for csv in csvs]
# read csvs as dataframes
dfs = [pd.read_csv(csv, sep="|", names=col) for csv in csvs]
# concatenate all dataframes into a single dataframe
df = pd.concat(dfs, ignore_index=True)
You can try this:
import os
path = 'your path'
all_csv_files = [f for f in os.listdir(path) if f.endswith('.csv')]
for f in all_csv_files:
    data = pd.read_csv(os.path.join(path, f), sep="|", names=col)
# list without .csv
files = [f[:-4] for f in all_csv_files]
You can (at the moment of opening) store the file name in the DataFrame's attrs as follows:
ds.attrs['filename']='filename.csv'
You can subsequently query the dataframe for the name:
ds.attrs['filename']
'filename.csv'

How to modify python code to move converted files to a separate folder?

I have Python code that converts multiple .xlsx files to .csv, but it puts them into the same folder.
How can I modify this code so that it puts the .csv files into a separate folder?
import pandas as pd
import glob
excel_files = glob.glob('C:/Users/username/Documents/TestFolder/JanuaryDataSentToResourcePro/*.xlsx') # assume the path
for excel in excel_files:
    out = excel.split('.')[0]+'.csv'
    df = pd.read_excel(excel, 'ResourceProDailyDataset') # if the sheet name is always the same.
    df.to_csv(out)
Split the directory from your file name and then give your out a new directory:
for excel in excel_files:
    # a raw string cannot end in a single backslash, so the backslashes are escaped instead
    folder = 'C:\\ASeparateFolder\\'
    out = folder + excel.split('\\')[-1].split('.')[0]+'.csv'
    df = pd.read_excel(excel, 'ResourceProDailyDataset') # if the sheet name is always the same.
    df.to_csv(out)
Note that I split on '\\': on Windows, glob seems to join the matched file name to its folder with a backslash whether you use '\' or '/' in your initial excel_files pattern.
First split the path, remove the folder that you don't want, and replace it with some other folder.
import pandas as pd
import os
import glob
excel_files = glob.glob('C:/Users/username/Documents/TestFolder/JanuaryDataSentToResourcePro/*.xlsx') # assume the path
# folder name for the converted csv files
different_folder = "csv_folder"
for excel in excel_files:
    # make a csv path
    csv_folder_path = "\\".join(excel.split('\\')[:-1])+"\\"+different_folder+"\\"
    if not os.path.exists(csv_folder_path):
        # create it if it doesn't exist
        os.makedirs(csv_folder_path)
    # full path of the csv file
    out = csv_folder_path+excel.split('\\')[-1].split('.')[0]+'.csv'
    df = pd.read_excel(excel, 'ResourceProDailyDataset') # if the sheet name is always the same.
    df.to_csv(out)
This will create a new folder csv_folder inside every folder that contains Excel files, and the converted csv files will be placed in that folder.
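You can change the variable that controls your output file. Try something like this:
out = 'some_directory/' + os.path.basename(excel).split('.')[0] + '.csv'
This needs import os (excel is a full path, so only its file name is kept), and the directory will have to exist for this to work.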
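A pathlib-based sketch of the same idea is shown below. The helper name csv_path_for and the sample paths are hypothetical. The helper creates the output folder if it is missing and keeps only the file name of the Excel file.

```python
import tempfile
from pathlib import Path


def csv_path_for(excel_path, out_dir):
    """Return out_dir/<excel file name with a .csv extension>, creating out_dir if needed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / (Path(excel_path).stem + ".csv")


# Hypothetical input path and a throwaway output folder.
out_dir = Path(tempfile.mkdtemp()) / "csv_folder"
out = csv_path_for("TestFolder/JanuaryData/report.v2.xlsx", out_dir)
print(out.name)
```

Inside the conversion loop, the result could be used as df.to_csv(csv_path_for(excel, out_dir)).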
