Read several csv files into data frames in python

I have a folder with .exp files. They're basically .csv files but with a .exp extension (just the format of files exported from the instrument). I know because changing .exp to .csv still lets them open in Excel as csv files. Example here: https://uowmailedu-my.sharepoint.com/personal/tonyd_uow_edu_au/Documents/LAB/MC-ICPMS%20solution/Dump%20data%20here?csf=1
In Python, I want to read the data from each file into data frames (one for each file). I've tried the following code; it builds the list dfs from all the files, but:
(i) I don't know how to access the content of list dfs and turn it into several data frames
(ii) it looks like the columns in the original .exp files were lost.
import os

# change directory
os.chdir(r'..\LAB\MC-ICPMS solution\Dump data here')
path = os.getcwd()

import glob
import pandas as pd

# get data file names
filenames = glob.glob(path + "/*.csv")
dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))
Do you have any ideas how I could read these files into data frames so I can easily access the content?
I found this post: Storing csv file's contents into data Frames [Python Pandas], but it wasn't too helpful in my case.
Thanks

I would recommend you switch to using an absolute path to your folder. It is also safer to use os.path.join() when combining path parts (better than string concatenation).
To make things easier to understand, I suggest rather than just creating a list of dataframes, that you create a list of tuples containing the filename and the dataframe, that way you will know which is which.
In your code, you are currently searching for csv files not exp files.
The following creates the list of dataframes; each entry also stores the corresponding filename. At the end it cycles through all of the entries and displays the data.
Lastly, it shows you how you would for example display just the first entry.
import pandas as pd
import glob
import os

# change directory
os.chdir(r'..\LAB\MC-ICPMS solution\Dump data here')
path = os.getcwd()

# get data file names
dfs = []
for filename in glob.glob(os.path.join(path, "*.exp")):
    dfs.append((filename, pd.read_csv(filename)))

print("Found {} exp files".format(len(dfs)))

# display each of your dataframes
for filename, df in dfs:
    print(filename)
    print(df)

# To display just the first entry:
print("Filename:", dfs[0][0])
print(dfs[0][1])
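If the goal is eventually a single combined table, the same list of (filename, dataframe) tuples can feed pandas.concat directly. A minimal sketch with made-up data standing in for two parsed .exp files (the file names and the "source" column name are my own, not from the question):

```python
import pandas as pd

# Made-up data standing in for two parsed .exp files
dfs = [
    ("run1.exp", pd.DataFrame({"mass": [1, 2], "signal": [0.1, 0.2]})),
    ("run2.exp", pd.DataFrame({"mass": [3], "signal": [0.3]})),
]

# Tag each frame with its source file, then stack everything into one table
combined = pd.concat(
    [df.assign(source=name) for name, df in dfs],
    ignore_index=True,
)
print(combined)
```

This assumes all the files share the same columns; the source column lets you trace every row back to its file.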

Related

How to read csv files in Python with incremental names and create different objects?

I have lots of files named like "XXXXX_1.csv", "XXXX_2.csv", "XXXX_3.csv" ... "XXXX_n.csv".
I would like to read them and create df1, df2, df3... How should I do this? In R, I could write something like
for (i in 1:n) {
  fname <- filename[i]
  assign(paste0("dry_shell", i), fread(paste0("/mnt/Wendy/Data/", fname)))
}
But what about Python? I would like to have different dataframes like df1, df2, df3 assigned from dataframe 1, dataframe 2, etc.
Assuming they are all named nicely you can iterate over a range of numbers to get all the files:
# this will open and do something to all files named csv_0, csv_1, and csv_2 in the
# directory /path/to/files
for i in range(3):
    with open(f"/path/to/files/csv_{i}") as file:
        ...  # do something with the csv file
You could also provide a path to a directory and achieve the same goal by opening and processing all the files in the directory:
import os

for path in os.listdir():
    with open(path) as file:
        ...  # do something with the csv file
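For reference, the "do something" step with the stdlib csv module could look like the following. A self-contained sketch that writes a small throwaway file first so it actually runs; the file name csv_0.csv is just an example:

```python
import csv
import os
import tempfile

# Create a small throwaway CSV so the example is self-contained
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "csv_0.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerows([["a", "b"], ["1", "2"]])

# Read it back row by row with the stdlib csv reader
rows = []
with open(path, newline="") as f:
    for row in csv.reader(f):
        rows.append(row)
print(rows)  # [['a', 'b'], ['1', '2']]
```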
See here for the documentation for the python (v3.8.5) standard csv package.
From your question, I assume you are using data.table (I noticed the fread function). The equivalent, dataframe-wise, in Python is pandas. You can combine pathlib with pandas to create a dictionary of dataframes:
from pathlib import Path
import pandas as pd

directory = Path("the directory that contains XXXXX_1.csv, XXXX_2.csv, ..., XXXX_n.csv")
dfs = {f"df{n}": pd.read_csv(file) for n, file in enumerate(directory.iterdir(), 1)}

Read multiple csv files starting with a string into separate data frames in python

I have about 500 '.csv' files starting with the letter 'T', e.g. 'T50, T51, T52 ..... T550', and there are some other '.csv' files with random names in the folder. I want to read all csv files starting with "T" and store them in separate dataframes: 't50, t51, t52... etc.'
The code I have written just reads these files into a dataframe
import glob
import pandas as pd
for file in glob.glob("T*.csv"):
    print(file)
I want to have a different name for each dataframe - preferably, their own file names. How can I achieve this within its 'for loop'?
Totally agree with #Comos
But if you still need individual variable names, I adapted the solution from here!
import pandas as pd
import os
folder = '/path/to/my/inputfolder'
filelist = [file for file in os.listdir(folder) if file.startswith('T')]
for file in filelist:
    exec("%s = pd.read_csv('%s')" % (file.split('.')[0], os.path.join(folder, file)))
In addition to ABotros's answer, to read all files into different dataframes I would recommend adding the files to a dictionary, which allows you to store dataframes under different names in a loop:
filelist = [file for file in os.listdir(folder) if file.startswith('T')]
database = {}
for file in filelist:
    database[file] = pd.read_csv(os.path.join(folder, file))
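Accessing one frame from such a dictionary is then a plain key lookup. A sketch with made-up file names and data standing in for the real CSVs:

```python
import pandas as pd

# Stand-in for the loop above: keys are file names, values are dataframes
database = {
    "T50.csv": pd.DataFrame({"x": [1, 2]}),
    "T51.csv": pd.DataFrame({"x": [3]}),
}

df_t50 = database["T50.csv"]  # pick one frame by its file name
print(df_t50["x"].sum())      # 3
```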

Read multiple .csv files from folder with variable part of file name

I have a folder that contains a variable number of files, and each file has a variable string in the name. For example:
my_file V1.csv
my_file V2.csv
my_file something_else.csv
I would need to:
Load all the files which name start with "my_file"
Concatenate all of them in a single dataframe
Right now I am reading each file with an individual pd.read_csv call and then merging them with a concatenate.
This is not optimal, as every time the files in the source folder change I need to modify the script.
Is it possible to automate this process, so that it works even if the source files change?
You can combine glob, pandas.concat and pandas.read_csv fairly easily. Assuming the CSV files are in the same folder as your script:
import glob
import pandas as pd
df = pd.concat([pd.read_csv(f) for f in glob.glob('my_file*.csv')])
import os

for filename in os.listdir(directory):
    if filename.startswith("my_file") and filename.endswith(".csv"):
        # do some stuff here
        continue
    else:
        continue

Extract file name from read_csv - Python

I have a script that currently reads raw data from a .csv file and performs some pandas data analysis against the data. Currently the .csv file is hardcoded and is read in like this:
data = pd.read_csv('test.csv',sep="|", names=col)
I want to change 2 things:
I want to turn this into a loop so it loops through a directory of .csv files and executes the pandas analysis below each one in the script.
I want to take each .csv file and strip the '.csv' and store that in a another list variable, let's call it 'new_table_list'.
I think I need something like below, at least for the 1st point(though I know this isn't completely correct). I am not sure how to address the 2nd point
Any help is appreciated
import os

path = '\test\test\csvfiles'
table_list = []
for filename in os.listdir(path):
    if filename.endswith('.csv'):
        table_list.append(file)
data = pd.read_csv(table_list, sep="|", names=col)
Many ways to do it
for filename in os.listdir(path):
    if filename.endswith('.csv'):
        table_list.append(pd.read_csv(filename, sep="|"))
        new_table_list.append(filename.split(".")[0])
One more
for filename in os.listdir(path):
    if filename.endswith('.csv'):
        table_list.append(pd.read_csv(filename, sep="|"))
        new_table_list.append(filename[:-4])
and many more
As #barmar pointed out, it is better to include the path as well when appending to table_list, to avoid any issues related to the path and the location of files and script.
You can try something like this:
import glob

data = {}
for filename in glob.glob('/path/to/csvfiles/*.csv'):
    data[filename[:-4]] = pd.read_csv(filename, sep="|", names=col)
Then data.keys() is the list of filenames without the ".csv" part and data.values() is a list with one pandas dataframe for each file.
I'd start with using pathlib.
from pathlib import Path
And then leverage the stem attribute and glob method.
Let's make an import function.
def read_csv(f):
    return pd.read_csv(f, sep="|")
The most generic approach would be to store in a dictionary.
p = Path(r'\test\test\csvfiles')
dod = {f.stem: read_csv(f) for f in p.glob('*.csv')}
And you can also use pd.concat to turn that into a dataframe.
df = pd.concat(dod)
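Note that pd.concat on a dict uses the keys as the outer level of a MultiIndex, so every row remembers which file it came from. A sketch with dummy frames (the key names and the "source" column name are my own):

```python
import pandas as pd

# Dummy frames standing in for two parsed CSV files
dod = {
    "fileA": pd.DataFrame({"v": [1, 2]}),
    "fileB": pd.DataFrame({"v": [3]}),
}

df = pd.concat(dod)     # keys become the outer index level
print(df.loc["fileA"])  # just the rows that came from fileA

# Or turn the file name into a regular column instead
flat = df.reset_index(level=0).rename(columns={"level_0": "source"})
```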
To get the list of CSV files in the directory, use glob; it is easier than os.
from glob import glob

# csvs will contain all CSV file names ending with .csv, in a list
csvs = glob('your\\dir\\to\\csvs_folder\\*.csv')

# remove the trailing .csv from the CSV file names
new_table_list = [csv[:-4] for csv in csvs]
# read csvs as dataframes
dfs = [pd.read_csv(csv, sep="|", names=col) for csv in csvs]
# concatenate all dataframes into a single dataframe
df = pd.concat(dfs, ignore_index=True)
You can try this:
import os

path = 'your path'
all_csv_files = [f for f in os.listdir(path) if f.endswith('.csv')]
for f in all_csv_files:
    data = pd.read_csv(os.path.join(path, f), sep="|", names=col)

# list of names without .csv
files = [f[:-4] for f in all_csv_files]
You can (at the moment of opening) add the filename to a DataFrame attribute as follows:
df.attrs['filename'] = 'filename.csv'
You can subsequently query the dataframe for the name:
df.attrs['filename']
'filename.csv'
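A self-contained version of that idea (a sketch; DataFrame.attrs exists from pandas 1.0 onward and is documented as experimental):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
df.attrs["filename"] = "filename.csv"  # stash the origin on the frame itself
print(df.attrs["filename"])            # filename.csv
```

Be aware that many pandas operations return new frames that do not carry attrs along, so re-check after transformations.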

Write csv to excel with pandas

I have 13 csv files in a folder called data, and I want to export those csv files in numerical order (1.csv, 2.csv ... 13.csv) to an Excel file, with each sheet named (1, 2, 3, 4, 5 ... 13). I tried something like this:
from pandas.io.excel import ExcelWriter
import pandas
ordered_files = ['1.csv', '2.csv','3.csv','4.csv', '5.csv','6.csv','7.csv', '8.csv','9.csv','10.csv', '11.csv','12.csv','13.csv']
with ExcelWriter('my_excel.xlsx') as ew:
    for csv_file in ordered_files:
        pandas.read_csv(csv_file).to_excel(
            ew, index=False, sheet_name=csv_file, encoding='utf-8')
And I have two problems with this:
As you see in my list, I can't import the files directly from my folder data; if I try:
ordered_files = ['data/1.csv']
it won't find a valid csv.
If I use that list method, my sheet will be named 3.csv, for example, instead of just 3.
A side question: coming from csv, I saw some columns that should be int numbers formatted as strings with a ' in front.
Thank you so much for your time! I use python 3!
If all that concerns you is removing the last four characters from the sheet names, just use sheet_name=csv_file[:-4] in your call to to_excel. The comment from #pazqo shows you how to generate the correct path to find the CSV files in your data directory.
More generally, suppose you wanted to process all the CSV files on a given path, there are several ways to do this. Here's one straightforward way.
import os
from glob import glob

def process(path, ew):
    os.chdir(path)  # note this is a process-wide change
    for csv_file in glob('*.csv'):
        pandas.read_csv(csv_file).to_excel(ew,
                                           index=False,
                                           sheet_name=csv_file[:-4],
                                           encoding='utf-8')

with ExcelWriter('my_excel.xlsx') as ew:
    process("data", ew)
You might also consider generating the filenames using glob(os.path.join(path, "*.csv")) but that would also require you to remove the leading path from the sheet names - possibly worthwhile to avoid the os.chdir call, which is a bit ugly.
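A sketch of that alternative: keep the full paths from glob(os.path.join(path, "*.csv")) and strip the directory with os.path.basename when naming the sheets. The example paths below are made up, standing in for real glob results:

```python
import os

# Made-up full paths, as glob(os.path.join("data", "*.csv")) would return them
paths = [os.path.join("data", "1.csv"), os.path.join("data", "2.csv")]

# Strip the directory and the .csv extension to get clean sheet names
sheet_names = [os.path.splitext(os.path.basename(p))[0] for p in paths]
print(sheet_names)  # ['1', '2']
```

This keeps the working directory untouched, at the cost of one extra call per file name.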
Concerning your first question, you could write the relative path as follows:
"data/1.csv".
About your second point, your sheet is named like this because you are looping over your csv names; the values you are using for your sheet names are then '1.csv', ...
In my opinion you should have written this instead:
from pandas.io.excel import ExcelWriter
import pandas

ext = '.csv'
n_files = 13
with ExcelWriter('data/my_excel.xlsx') as ew:
    for i in range(1, n_files + 1):
        pandas.read_csv('data/' + str(i) + ext).to_excel(
            ew, index=False, sheet_name=str(i), encoding='utf-8')
Because you have 13 files, named from 1 to 13, you should loop over them with range(1, n_files + 1); the range function generates the integers from 1 through n_files.
Hope it helps.
For importing files, the path is relative to the current working directory, if you use the absolute path it should work (such as "C:\data\1.csv" in windows, "/home/user/data/1.csv" in a Linux or Unix environment).
To remove the extension from the sheet name, list the file names without the .csv (such as orderedlist = list(range(1, 14))), then:
pandas.read_csv(<directory> + str(csv_file) + '.csv').to_excel(
which might be:
pandas.read_csv('/home/user/data/' + str(csv_file) + '.csv').to_excel(
Alternatively, keep the list as is and change the sheet name to
sheet_name=csv_file.split('.')[0]
to return only the portion of csv_file before the '.'.
