I have a script that currently reads raw data from a .csv file and performs some pandas data analysis on it. At the moment the .csv file is hardcoded and is read in like this:
data = pd.read_csv('test.csv', sep="|", names=col)
I want to change 2 things:
I want to turn this into a loop so it iterates over a directory of .csv files and runs the pandas analysis below on each one.
I want to take each .csv filename, strip the '.csv' extension, and store the result in another list variable, let's call it 'new_table_list'.
I think I need something like the code below, at least for the 1st point (though I know it isn't completely correct). I am not sure how to address the 2nd point.
Any help is appreciated.
import os

path = r'\test\test\csvfiles'
table_list = []

for filename in os.listdir(path):
    if filename.endswith('.csv'):
        table_list.append(filename)

data = pd.read_csv(table_list, sep="|", names=col)
There are many ways to do it.
for filename in os.listdir(path):
    if filename.endswith('.csv'):
        table_list.append(pd.read_csv(filename, sep="|"))
        new_table_list.append(filename.split(".")[0])
One more
for filename in os.listdir(path):
    if filename.endswith('.csv'):
        table_list.append(pd.read_csv(filename, sep="|"))
        new_table_list.append(filename[:-4])
and many more
As @barmar pointed out, it's better to join the path onto each filename when building table_list, to avoid any issues related to the location of the files relative to the script.
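A minimal sketch of that suggestion, using os.path.join to build the full path (variable names follow the snippets above):

import os
import pandas as pd

path = r'\test\test\csvfiles'
table_list = []
new_table_list = []

for filename in os.listdir(path):
    if filename.endswith('.csv'):
        # join the directory and file name so the script's location doesn't matter
        table_list.append(pd.read_csv(os.path.join(path, filename), sep="|"))
        new_table_list.append(filename[:-4])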
You can try something like this:
import glob

data = {}
for filename in glob.glob('/path/to/csvfiles/*.csv'):
    data[filename[:-4]] = pd.read_csv(filename, sep="|", names=col)
Then data.keys() gives the filenames without the ".csv" part (note they still carry the directory prefix returned by glob) and data.values() gives one pandas dataframe per file.
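For example, to see what was loaded:

for name, df in data.items():
    print(name, df.shape)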
I'd start with using pathlib.
from pathlib import Path
And then leverage the stem attribute and glob method.
Let's make an import function.
def read_csv(f):
    return pd.read_csv(f, sep="|")
The most generic approach would be to store in a dictionary.
p = Path(r'\test\test\csvfiles')
dod = {f.stem: read_csv(f) for f in p.glob('*.csv')}
And you can also use pd.concat to turn that into a dataframe.
df = pd.concat(dod)
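Concatenating a dict builds a row MultiIndex whose outer level is each file's stem. A small sketch of pulling that out into a regular column (assuming each dataframe keeps read_csv's default RangeIndex; 'source' is just an illustrative name):

df = pd.concat(dod)                                # outer index level holds the file stem
df.index = df.index.set_names('source', level=0)   # name that level
df = df.reset_index(level='source')                # turn the stem into a regular column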
To get the list of CSV files in the directory, use glob; it is easier than os.
from glob import glob

# csvs will contain all file names ending with .csv, as a list
csvs = glob('your\\dir\\to\\csvs_folder\\*.csv')

# remove the trailing .csv from the file names
new_table_list = [csv[:-4] for csv in csvs]

# read the csvs as dataframes
dfs = [pd.read_csv(csv, sep="|", names=col) for csv in csvs]

# concatenate all dataframes into a single dataframe
df = pd.concat(dfs, ignore_index=True)
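If you need to keep track of which rows came from which file, one variation is to pass the stripped names as keys instead of using ignore_index:

df = pd.concat(dfs, keys=new_table_list)  # outer index level = file name without .csv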
You can try this:
import os

path = 'your path'
all_csv_files = [f for f in os.listdir(path) if f.endswith('.csv')]

for f in all_csv_files:
    data = pd.read_csv(os.path.join(path, f), sep="|", names=col)

# list without .csv
files = [f[:-4] for f in all_csv_files]
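Note that data is overwritten on each pass of that loop; if you want to keep every dataframe, here is a minimal sketch keyed by the stripped name (reusing the variables above):

data = {f[:-4]: pd.read_csv(os.path.join(path, f), sep="|", names=col) for f in all_csv_files}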
You can (at the moment of opening) add the filename to a DataFrame attribute as follows:

ds.attrs['filename'] = 'filename.csv'

You can subsequently query the dataframe for the name:

ds.attrs['filename']
'filename.csv'
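A minimal sketch of how that fits into the loop from the original question (path and col as defined there; 'tables' is an illustrative name):

import os
import pandas as pd

tables = []
for filename in os.listdir(path):
    if filename.endswith('.csv'):
        ds = pd.read_csv(os.path.join(path, filename), sep="|", names=col)
        ds.attrs['filename'] = filename  # record the source file on the dataframe itself
        tables.append(ds)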
How can I iterate over .csv files in a folder, create a dataframe from each .csv, and name those dataframes after the respective .csv files (or actually any other name)?
My approach doesn't even create a single dataframe.
path = "/user/Home/Data/"
files = os.listdir(path)
os.chdir(path)
for file, j in zip(files, range(len(files))):
if file.endswith('.csv'):
files[j] = pd.read_csv(file)
Thanks!
You can use pathlib and a dictionary to do that (as already pointed out by jitusj in the comment).
from pathlib import Path

path = Path(".../user/Home/Data/")  # complete path needed! Replace "..." with the full path

dict_of_df = {}
for file in path.glob('*.csv'):
    dict_of_df[file.stem] = pd.read_csv(file)
Now you have a dictionary of dataframes, with the filenames as keys (without .csv extension).
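So if the folder contains, say, a hypothetical sales.csv, its dataframe is available directly by stem:

dict_of_df['sales'].head()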
I've been trying to create a script that runs through all the csv files inside a directory, applies the same operation to each of them, and afterwards saves the new csv files in another directory.
I've got two problems: first, the code only saves the last iteration, and second, how do I save the files with different names?
Here's my code so far:
from pathlib import Path
import pandas as pd

dir = r'C:\my\path\to\file'
csv_files = [f for f in Path(dir).glob('*.csv')]  # list all csv

for csv in csv_files:  # iterate list
    df = pd.read_csv(csv, encoding='ISO-8859-1', engine='python', delimiter=';')  # read csv
    df.drop(df.index[:-1], inplace=True)  # drop all but the last row
    df.to_csv("C:\new\path\to\file\variable name")  # save the file in a new dir
Rakesh's answer works perfectly for me. Thank you guys for your input! :)
In this case the best thing is probably to save each new file under its original name, either with a common suffix or in a new directory.
I've got two problems:
First, the code only saves the last iteration: that is because you are saving every file under the same name, so each iteration overwrites it and only the last file survives.
And second, how do I save the files with different names? Use the same name for each new file but save it in a new directory, or add a suffix like mycsv_modified.csv.
Below I created an example that saves into a new directory (I tested this code in a non-Windows environment, using a Jupyter notebook):
from pathlib import Path
import pandas as pd

dir_b = r'/Users/rakeshkumar/bigquery'
csv_files = [f for f in Path(dir_b).glob('*.csv')]  # list all csv

#!mkdir -p processed  # I created the new directory in the notebook itself; decide yourself how to create it

for csv in csv_files:  # iterate list
    df = pd.read_csv(csv, encoding='ISO-8859-1', engine='python', delimiter=';')  # read csv
    df.drop(df.index[:-1], inplace=True)  # drop all but the last row
    print(df)
    df.to_csv(dir_b + "/processed/" + csv.name)  # save the file in a new dir
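And a minimal sketch of the suffix alternative mentioned above, keeping the output next to the input (csv here is a pathlib.Path, so stem and with_name are available; the suffix is just an example):

df.to_csv(csv.with_name(csv.stem + "_modified.csv"))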
I have about 500 '.csv' files starting with the letter 'T', e.g. 'T50, T51, T52 ... T550', and there are some other '.csv' files with random names in the folder. I want to read all the csv files starting with "T" and store them in separate dataframes: 't50, t51, t52...' etc.
The code I have written so far just finds these files:
import glob
import pandas as pd

for file in glob.glob("T*.csv"):
    print(file)
I want to have a different name for each dataframe - preferably, their own file names. How can I achieve this within its 'for loop'?
Totally agree with @Comos.
But if you still need individual variable names, I adapted the solution from here!
import pandas as pd
import os

folder = '/path/to/my/inputfolder'
filelist = [file for file in os.listdir(folder) if file.startswith('T')]

for file in filelist:
    exec("%s = pd.read_csv('%s')" % (file.split('.')[0], os.path.join(folder, file)))
In addition to ABotros's answer: to read all the files into different dataframes, I would recommend adding them to a dictionary, which lets you save dataframes under different names in a loop:
filelist = [file for file in os.listdir(folder) if file.startswith('T')]

database = {}
for file in filelist:
    database[file] = pd.read_csv(os.path.join(folder, file))  # join with folder so the cwd doesn't matter
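Each dataframe is then available by its file name, e.g. (assuming a file T50.csv exists):

database['T50.csv'].head()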
I have a folder that contains a variable number of files, and each file has a variable string in the name. For example:
my_file V1.csv
my_file V2.csv
my_file something_else.csv
I would need to:
Load all the files which name start with "my_file"
Concatenate all of them in a single dataframe
Right now I am doing it with an individual pd.read_csv call for each file, and then merging them with a concat.
This is not optimal, as every time the files in the source folder change I need to modify the script.
Is it possible to automate this process, so that it works even if the source files change?
You can combine glob, pandas.concat and pandas.read_csv fairly easily. Assuming the CSV files are in the same folder as your script:
import glob
import pandas as pd
df = pd.concat([pd.read_csv(f) for f in glob.glob('my_file*.csv')])
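If you also need to know which file each row came from, one possible variation is to tag the rows while reading (the 'source' column name is just an illustration):

df = pd.concat([pd.read_csv(f).assign(source=f) for f in glob.glob('my_file*.csv')], ignore_index=True)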
for filename in os.listdir(directory):
    if filename.startswith("my_file") and filename.endswith(".csv"):
        # do some stuff here
        continue
    else:
        continue
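For this question, the "stuff" would be collecting and concatenating the matching files, e.g. (a sketch assuming the files sit in a single source folder):

import os
import pandas as pd

directory = '.'  # adjust to your source folder
frames = []
for filename in os.listdir(directory):
    if filename.startswith("my_file") and filename.endswith(".csv"):
        frames.append(pd.read_csv(os.path.join(directory, filename)))
df = pd.concat(frames, ignore_index=True)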
I have a folder with .exp files. They're basically .csv files but with a .exp extension (just the format of files exported from the instrument). I know because changing .exp to .csv still lets me open them in Excel as csv files. Example here: https://uowmailedu-my.sharepoint.com/personal/tonyd_uow_edu_au/Documents/LAB/MC-ICPMS%20solution/Dump%20data%20here?csf=1
In Python, I want to read the data from each file into dataframes (one for each file). I've tried the following code, but it only builds the list dfs with all the files, and:
(i) I don't know how to access the contents of the list dfs and turn them into several dataframes
(ii) it looks like the columns in the original .exp files were lost.
import os

# change directory
os.chdir('..\LAB\MC-ICPMS solution\Dump data here')
path = os.getcwd()

import glob
import pandas as pd

# get data file names
filenames = glob.glob(path + "/*.csv")

dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))
Do you guys have any ideas how I could read these files into dataframes, so I can easily access their content?
I found this post: Storing csv file's contents into data Frames [Python Pandas], but it wasn't too helpful in my case.
Thanks
I would recommend you switch to using an absolute path to your folder. Also it is safer to use os.path.join() when combining file parts (better than string concatenation).
To make things easier to understand, I suggest rather than just creating a list of dataframes, that you create a list of tuples containing the filename and the dataframe, that way you will know which is which.
In your code, you are currently searching for csv files not exp files.
The following creates the list of dataframes; each entry also stores the corresponding filename. At the end it cycles through all of the entries and displays the data.
Lastly, it shows how you would, for example, display just the first entry.
import pandas as pd
import glob
import os

# change directory
os.chdir(r'..\LAB\MC-ICPMS solution\Dump data here')
path = os.getcwd()

# get data file names
dfs = []
for filename in glob.glob(os.path.join(path, "*.exp")):
    dfs.append((filename, pd.read_csv(filename)))

print("Found {} exp files".format(len(dfs)))

# display each of your dataframes
for filename, df in dfs:
    print(filename)
    print(df)

# To display just the first entry:
print("Filename:", dfs[0][0])
print(dfs[0][1])
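If you later prefer key-based access, the (filename, dataframe) tuples convert directly into a dict:

data = dict(dfs)  # maps each filename to its dataframe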