I have a folder that contains a variable number of files, and each file has a variable string in the name. For example:
my_file V1.csv
my_file V2.csv
my_file something_else.csv
I would need to:
Load all the files which name start with "my_file"
Concatenate all of them in a single dataframe
Right now I am doing it with individual pd.read_csv functions for each file, and then merging them with a concatenate.
This is not optimal as every time the files in the source folder change, I need to modify the script.
Is it possible to automate this process, so that it works even if the source files change?
You can combine glob, pandas.concat and pandas.read_csv fairly easily. Assuming the CSV files are in the same folder as your script:
import glob
import pandas as pd
df = pd.concat([pd.read_csv(f) for f in glob.glob('my_file*.csv')])
for filename in os.listdir(directory):
if filename.startswith("my_file") and filename.endswith(".csv"):
# do some stuff here
continue
else:
continue
Related
I want to read a csv file into a data frame from a certain folder with pandas. This folder contains several csv files. They contain different information.
df = pd.read_csv(r'C:\User\Username\Desktop\Statistic\12345678_Reference.csv')
The first part in the filename (1 - 8 is variable). I want to read it in the file which ends with '_Reference.csv', but I have no clue how to manage it. I googled, but could not find a solution if there are more than one csv file in the same folder.
If you import os, then you can use functions for for navigating the file system.
os.listdir(path) will return a list of all of the file names in a directory.
[f for f in os.listdir(path) if f.endswith("Reference.csv")]
Will return a list of all files names ending with "Reference.csv". In your scenario, it sounds like there will be only one item in the list.
So, [f for f in os.listdir(path) if f.endswith("Reference.csv")][0] would return the filename that you're looking for.
Then you can construct a path using the filename, and feed it to pd.read_csv().
I have about 500 '.csv' files starting with letter 'T' e.g. 'T50, T51, T52 ..... T550' and there are some other ',csv' files with other random names in the folder. I want to read all csv files starting with "T" and store them in separate dataframes: 't50, t51, t52... etc.'
The code I have written just reads these files into a dataframe
import glob
import pandas as pd
for file in glob.glob("T*.csv"):
print (file)
I want to have a different name for each dataframe - preferably, their own file names. How can I achieve this within its 'for loop'?
Totally agree with #Comos
But if you still need individual variable names, I adapted the solution from here!
import pandas as pd
import os
folder = '/path/to/my/inputfolder'
filelist = [file for file in os.listdir(folder) if file.startswith('T')]
for file in filelist:
exec("%s = pd.read_csv('%s')" % (file.split('.')[0], os.path.join(folder,file)))
In additions to ABotros's answer, to read all files in different dataframes, I would recommend adding the files to a dictionary, which will allow you to save dataframes with different names in a loop:
filelist = [file for file in os.listdir(folder) if file.startswith('T')]
database = {}
for file in filelist:
database[file] = pd.read_csv(file)
I'm using Python 3.7.
I have to download an excel file (.xls) that has a unique filename every time I download it into a specific downloads folder location.
Then with Python and Pandas, I then have to open the excel file and read/convert it to a dataframe.
I want to automate the process, but I'm having trouble telling Python to get the full name of the XLS file as a variable, which will then be used by pandas:
# add dependencies and set location for downloads folder
import os
import glob
import pandas as pd
download_dir = '/Users/Aaron/Downloads/'
# change working directory to download directory
os.chdir(download_dir)
# get filename of excel file to read into pandas
excel_files = glob.glob('*.xls')
blah = str(excel_files)
blah
So then for example, the output for "blah" is:
"['63676532355861.xls']"
I have also tried just using "blah = print(excel_files)" for the above block, instead of the "str" method, and assigning that to a variable, which still doesn't work.
And then the rest of the process would do the following:
# open excel (XLS) file with unknown filename in pandas as a dataframe
data_df = pd.read_excel('WHATEVER.xls', sheet_name=None)
And then after I convert it to a data frame, I want to DELETE the excel file.
So far, I have spent a lot of time reading about fnames, io, open, os.path, and other libraries.
I still don't know how to get the name of the unknown .XLS file into a variable, and then later deleting that file.
Any suggestions would be greatly appreciated.
This code finds an xls file in your specified path reads the xls file and deletes the file.If your directory contains more than 1 xls file,It reads the last one.You can perform whatever operation you want if you find more than one xls files.
import os
for filename in os.listdir(os.getcwd()):
if filename.endswith(".xls"):
print(filename)
#do your operation
data_df = pd.read_excel(filename, sheet_name=None)
os.remove(filename)
Check this,
lst = os.listdir()
matching = [s for s in lst if '.xls' in s]
matching will have all list of excel files.
As you are having only one excel file, you can save in variable like file_name = matching[0]
I have a script that current reads raw data from a .csv file and performs some pandas data analysis against the data. Currently the .csv file is hardcoded and is read in like this:
data = pd.read_csv('test.csv',sep="|", names=col)
I want to change 2 things:
I want to turn this into a loop so it loops through a directory of .csv files and executes the pandas analysis below each one in the script.
I want to take each .csv file and strip the '.csv' and store that in a another list variable, let's call it 'new_table_list'.
I think I need something like below, at least for the 1st point(though I know this isn't completely correct). I am not sure how to address the 2nd point
Any help is appreciated
import os
path = '\test\test\csvfiles'
table_list = []
for filename in os.listdir(path):
if filename.endswith('.csv'):
table_list.append(file)
data = pd.read_csv(table_list,sep="|", names=col)
Many ways to do it
for filename in os.listdir(path):
if filename.endswith('.csv'):
table_list.append(pd.read_csv(filename,sep="|"))
new_table_list.append(filename.split(".")[0])
One more
for filename in os.listdir(path):
if filename.endswith('.csv'):
table_list.append(pd.read_csv(filename,sep="|"))
new_table_list.append(filename[:-4])
and many more
As #barmar pointed out, better to append path as well to the table_list to avoid any issues related to path and location of files and script.
You can try something like this:
import glob
data = {}
for filename in glob.glob('/path/to/csvfiles/*.csv'):
data[filename[:-4]] = pd.read_csv(filename, sep="|", names=col)
Then data.keys() is the list of filenames without the ".csv" part and data.values() is a list with one pandas dataframe for each file.
I'd start with using pathlib.
from pathlib import Path
And then leverage the stem attribute and glob method.
Let's make an import function.
def read_csv(f):
return pd.read_csv(table_list, sep="|")
The most generic approach would be to store in a dictionary.
p = Path('\test\test\csvfiles')
dod = {f.stem: read_csv(f) for f in p.glob('*.csv')}
And you can also use pd.concat to turn that into a dataframe.
df = pd.concat(dod)
to get the list CSV files in the directory use glob it is easier than os
from glob import glob
# csvs will contain all CSV files names ends with .csv in a list
csvs = glob('you\\dir\\to\\csvs_folder\\*.csv')
# remove the trailing .csv from CSV files names
new_table_list = [csv[:-3] for csv in csvs]
# read csvs as dataframes
dfs = [pd.read_csv(csv, sep="|", names=col) for csv in csvs]
#concatenate all dataframes into a single dataframe
df = pd.concat(dfs, ignore_index=True)
you can try so:
import os
path = 'your path'
all_csv_files = [f for f in os.listdir(path) if f.endswith('.csv')]
for f in all_csv_files:
data = pd.read_csv(os.path.join(path, f), sep="|", names=col)
# list without .csv
files = [f[:-4] for f all_csv_files]
You can (at the moment of opening) add the filename to a Dataframe attribute as follow:
ds.attrs['filename']='filename.csv'
You can subsequently query the dataframe for the name
ds.attrs['filename']
'filename.csv'
I have a folder with .exp files. They're basically .csv files but with a .exp extension (just the format of files exported from the instrument). I know because changing .exp to .csv still allows to open them in Excel as csv files. Example here: https://uowmailedu-my.sharepoint.com/personal/tonyd_uow_edu_au/Documents/LAB/MC-ICPMS%20solution/Dump%20data%20here?csf=1
In Python, I want to read the data from each file into data frames (one for each file). I've tried the following code, but it makes the list dfs with all the files and:
(i) I don't know how to access the content of list dfs and turn it into several data frames
(ii) it looks like the columns in the original .exp files were lost.
import os
# change directory
os.chdir('..\LAB\MC-ICPMS solution\Dump data here')
path = os.getcwd()
import glob
import pandas as pd
# get data file names
filenames = glob.glob(path + "/*.csv")
dfs = []
for filename in filenames:
dfs.append(pd.read_csv(filename))
do you guys have any ideas how I could read these files into data frames, so I can easily access the content?
I found this post: Storing csv file's contents into data Frames [Python Pandas] but not too helpful in my case.
thanks
I would recommend you switch to using an absolute path to your folder. Also it is safer to use os.path.join() when combining file parts (better than string concatenation).
To make things easier to understand, I suggest rather than just creating a list of dataframes, that you create a list of tuples containing the filename and the dataframe, that way you will know which is which.
In your code, you are currently searching for csv files not exp files.
The following creates the list of dataframes, each entry also stores the corresponding filename. At end end it cycles through all of the entries and displays the data.
Lastly, it shows you how you would for example display just the first entry.
import pandas as pd
import glob
import os
# change directory
os.chdir('..\LAB\MC-ICPMS solution\Dump data here')
path = os.getcwd()
# get data file names
dfs = []
for filename in glob.glob(os.path.join(path, "*.exp")):
dfs.append((filename, pd.read_csv(filename)))
print "Found {} exp files".format(len(dfs))
# display each of your dataframes
for filename, df in dfs:
print filename
print df
# To display just the first entry:
print "Filename:", df[0][0]
print df[0][1]