Open - Edit - Save - Loop csv files in a folder with python - python

I will receive a folder with 100+ .csv files and I will need to edit them in the same way. Files have the same structure.
Folder looks like this:
df1.csv
df2.csv
df3.csv
...
df100.csv
I need to open every file, edit it, and save the result as "df1-edited", "df2-edited", and so on.
The code runs perfectly on a single file, but I am not sure how to run it automatically over every file and save the results accordingly.
Here is my code:
import pandas as pd
df = pd.read_csv('df1.csv')
[Edit steps here]
df.to_csv("df1-edited.csv", index=None, encoding='utf-8-sig', decimal=',')
Thanks!

For this you can use a module from the standard library that works with your operating system's file system.
Essentially, you need to find all the .csv files in your folder and iterate over them.
Let's use pathlib. This is untested, but something like this should work:
from pathlib import Path
import pandas as pd
csv_folder = Path('path/to/csvs')
for file in csv_folder.glob('*.csv'):  # glob() yields every matching path in the folder
    df = pd.read_csv(file)
    # do stuff
    new_file_name = file.parent.joinpath(f"{file.stem}-edited.csv")
    df.to_csv(new_file_name, index=None, encoding='utf-8-sig', decimal=',')
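One caveat with the loop above: the output files also end in .csv, so a second run over the same folder would pick up df1-edited.csv and write df1-edited-edited.csv. A minimal sketch of one way to skip them, assuming the "-edited" suffix from the question (the helper name pending_csvs is hypothetical):

```python
from pathlib import Path


def pending_csvs(folder, suffix="-edited"):
    """Return the CSV files in `folder` whose names do not end in `suffix`."""
    folder = Path(folder)
    return sorted(
        p for p in folder.glob("*.csv")
        if not p.stem.endswith(suffix)  # skip files written by an earlier run
    )
```

Iterating over pending_csvs('path/to/csvs') instead of csv_folder.glob('*.csv') should then leave previously written output files alone.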

Related

How to add filename as column to every file in a directory python

Hi there stack overflow community,
I have several CSV files in a folder and I need to add a column containing the first 8 characters of each filename to each CSV. After this step I want to save the DataFrame, including the new column, back to the same file.
I get the right output, but it doesn't save the changes in the csv file :/
Maybe someone has some inspiration for me. Thanks a lot!
import pandas as pd
import glob, os
import fnmatch
files = glob.glob(r'path\*.csv')
for fp in files:
    df = pd.concat([pd.read_csv(fp).assign(date=os.path.basename(fp).split('.')[0][:8])])
    #for i in df('date'):
    #Decoder problem
    print(df)
Use df.to_csv, like this:
for fp in files:
    df = pd.concat([pd.read_csv(fp).assign(date=os.path.basename(fp).split('.')[0][:8])])
    df.to_csv(fp, index=False)  # index=False keeps the index from being saved as an extra column in the CSV
By the way, I think this also works and is more readable:
for fp in files:
    df = pd.read_csv(fp)
    df['date'] = os.path.basename(fp).split('.')[0][:8]
    df.to_csv(fp, index=False)
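The prefix expression can also be written with pathlib. This is only a sketch: the helper name filename_prefix and the sample file name are made up for illustration. For file names with a single extension dot it gives the same result as os.path.basename(fp).split('.')[0][:8].

```python
from pathlib import Path


def filename_prefix(fp, length=8):
    """Return the first `length` characters of the file name, without its extension."""
    return Path(fp).stem[:length]


print(filename_prefix("data/20210201_sales.csv"))  # → 20210201
```

Inside the loop this would read df['date'] = filename_prefix(fp).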

Loop/iterate through a directory of excel files & add to the bottom of the dataframe

I am currently working on importing and formatting a large number of excel files (all the same format/scheme, but different values) with Python.
I have already read in and formatted one file and everything worked fine so far.
I would now do the same for all the other files and combine everything in one dataframe, i.e. read in the first excel in one dataframe, add the second at the bottom of the dataframe, add the third at the bottom the dataframe, and so on until I have all the excel files in one dataframe.
So far my script looks something like this:
import pandas as pd
import numpy as np
import xlrd
import os
path = os.getcwd()
path = "path of the directory"
wbname = "name of the excel file"
files = os.listdir(path)
files
wb = xlrd.open_workbook(path + wbname)
# I only need the second sheet
df = pd.read_excel(path + wbname, sheet_name="sheet2", skiprows = 2, header = None,
skipfooter=132)
# here is where all the formatting is happening ...
df
So, "files" is a list with all the relevant file names. Now I need to put one file after the other into a loop (?) so that they all eventually end up in df.
Has anyone ever done something like this, or can anyone help me here?
Something like this might work:
import os
import pandas as pd
folder = 'path_to_all_xlsx'
list_dfs = []
for file in os.listdir(folder):
    # os.listdir returns bare file names, so join each one with the folder
    df = pd.read_excel(os.path.join(folder, file), <the rest of your config to parse>)
    list_dfs.append(df)
all_dfs = pd.concat(list_dfs)
You read all the DataFrames into a list, and then the concat function combines them into one big DataFrame.
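To illustrate the concat step on its own, here is a small sketch with two made-up DataFrames standing in for the parsed Excel files. Passing ignore_index=True is an addition to the answer above; it renumbers the rows instead of repeating each file's own index.

```python
import pandas as pd

# Two small stand-ins for DataFrames read from separate Excel files (made-up data).
first = pd.DataFrame({"id": [1, 2], "value": [10.0, 20.0]})
second = pd.DataFrame({"id": [3], "value": [30.0]})

# Stack the frames vertically; ignore_index=True renumbers the rows 0..n-1.
all_dfs = pd.concat([first, second], ignore_index=True)
print(all_dfs.shape)  # → (3, 2)
```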

programtically ingesting xl files to pandas data frame by reading filename

I have a folder with 6 files, 4 are excel files that I would like to bring into pandas and 2 are just other files. I want to be able to use pathlib to work with the folder to automatically ingest the excel files I want into individual pandas dataframes. I would also like to be able to name each new dataframe with the name of the excel file (without the file extension)
for example.
import pandas as pd
import pathlib as pl
folder = pl.WindowsPath(r'C:\Users\username\project\output')
files = [e for e in folder.iterdir()]
for i in files:
    print(i)
['C:\Users\username\project\output\john.xlsx',
'C:\Users\username\project\output\paul.xlsx',
'C:\Users\username\project\output\random other file not for df.xlsx',
'C:\Users\username\project\output\george.xlsx',
'C:\Users\username\project\output\requirements for project.txt',
'C:\Users\username\project\output\ringo.xlsx' ]
From here, I'd like to be able to do something like
for i in files:
    if ' ' not in str(i.name):
        str(i.name.strip('.xlsx')) = pd.read_excel(i)
That is: read the file name; if it doesn't contain any spaces, take the name, remove the file extension, and use that as the variable name for a pandas DataFrame built from the Excel file.
If what I'm doing isn't possible then I have other ways to do it, but they repeat a lot of code.
Any help is appreciated.
Using pathlib and re, we can exclude any files that match a certain pattern inside a dictionary comprehension; here, any files with whitespace in the name.
from pathlib import Path
import re
import pandas as pd
pth = r'C:\Users\username\project\output'
files = Path(pth).glob('*.xlsx')  # use `rglob` if you want to search subdirectories too
dfs = {file.stem: pd.read_excel(file) for file in files
       if not re.search(r'\s', file.stem)}
Based on the above you'll get:
{'john': pandas.core.frame.DataFrame,
'paul': pandas.core.frame.DataFrame,
'george': pandas.core.frame.DataFrame,
'ringo': pandas.core.frame.DataFrame}
where pandas.core.frame.DataFrame is your target DataFrame.
You can then access each one by name, e.g. dfs['john'].
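One detail worth illustrating: Path.stem also sidesteps a pitfall in the question's str.strip('.xlsx'), which removes any of the characters '.', 'x', 'l' and 's' from both ends of the string rather than the literal suffix. A small sketch using the file names from the question, with no Excel files needed:

```python
from pathlib import Path
import re

names = [
    "john.xlsx",
    "paul.xlsx",
    "random other file not for df.xlsx",
    "george.xlsx",
    "ringo.xlsx",
]

# str.strip treats its argument as a set of characters, not as a suffix.
print("paul.xlsx".strip(".xlsx"))  # → pau
print(Path("paul.xlsx").stem)      # → paul

# Keep only the names without whitespace, keyed by their stem.
keys = [Path(n).stem for n in names if not re.search(r"\s", Path(n).stem)]
print(keys)  # → ['john', 'paul', 'george', 'ringo']
```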

Opening Multiple`.xls` files from a folder in a different directory and creating one dataframe using Pandas

I am trying to open multiple xls files in a folder from a particular directory. I wish to read into these files and open all of them in one data frame. So far I am able to access the directory and put all the xls files into a list like this
import os
import pandas as pd
path = ('D:\Anaconda Hub\ARK analysis\data\year2021\\february')
files = os.listdir(path)
files
# outputting the variable files which appears to be a list.
Output:
['ARK_Trade_02012021_0619PM_EST_601875e069e08.xls',
'ARK_Trade_02022021_0645PM_EST_6019df308ae5e.xls',
'ARK_Trade_02032021_0829PM_EST_601b2da2185c6.xls',
'ARK_Trade_02042021_0637PM_EST_601c72b88257f.xls',
'ARK_Trade_02052021_0646PM_EST_601dd4dc308c5.xls',
'ARK_Trade_02082021_0629PM_EST_6021c739595b0.xls',
'ARK_Trade_02092021_0642PM_EST_602304eebdd43.xls',
'ARK_Trade_02102021_0809PM_EST_6024834cc5c8d.xls',
'ARK_Trade_02112021_0639PM_EST_6025bf548f5e7.xls',
'ARK_Trade_02122021_0705PM_EST_60270e4792d9e.xls',
'ARK_Trade_02162021_0748PM_EST_602c58957b6a8.xls']
I am now trying to get it into one dataframe like this:
frame = pd.DataFrame()
for f in files:
    data = pd.read_excel(f, 'Sheet1')
    frame.append(data)
df = pd.concat(frame, axis=0, ignore_index=True)
However, when doing this I sometimes obtain a blank data frame or it throws an error like this:
FileNotFoundError: [Errno 2] No such file or directory: 'ARK_Trade_02012021_0619PM_EST_601875e069e08.xls'
Help would truly be appreciated with this task.
Thanks in advance.
The issue happens because, if you pass only the file name, the interpreter looks for it in the current working directory. You need the os module to build the full path:
import os
import pandas as pd
path = r'D:\Anaconda Hub\ARK analysis\data\year2021\february'
files = os.listdir(path)
#frame = pd.DataFrame() ...This will not work!
frame = []  # Do this instead
for f in files:
    data = pd.read_excel(os.path.join(path, f), 'Sheet1')  # join the file name with the folder location
    frame.append(data)
df = pd.concat(frame, axis=0, ignore_index=True)
The other issue is that frame should be a list or some other iterable of DataFrames, because that is what pd.concat expects. DataFrame.append returned a new DataFrame instead of modifying frame in place, which explains the blank result, and it was removed in pandas 2.0.
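The path problem can be reproduced without any Excel files. The sketch below uses a temporary folder and a made-up file name to show that os.listdir returns bare names, and that joining each name with the folder gives a path that resolves regardless of the current working directory:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create an empty stand-in file (hypothetical name).
    open(os.path.join(folder, "example.xls"), "w").close()

    names = os.listdir(folder)
    full_paths = [os.path.join(folder, name) for name in names]

    print(names)                          # → ['example.xls'] (no folder component)
    print(os.path.exists(full_paths[0]))  # → True
```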
You can add an intermediate step that checks whether each path exists. I suspect this is an isolated issue with your server: from memory, on older Windows servers (namely 2012) I would sometimes hit cases where a path couldn't be found even though it definitely existed.
import pandas as pd
from pathlib import Path
# assuming you want xls and xlsx.
files = Path('folder_location').glob('*.xls*')
dfs = []
for file in files:
    if file.is_file():
        df = pd.read_excel(file, sheet_name='sheet')
        dfs.append(df)
final_df = pd.concat(dfs)

pandas.read_csv() in a for loop. Why is my code not working for more than 2 cycles?

It works fine for simple files but not with more complex ones.
My files are not corrupted and they are in the right directory.
I tried it with simple generated files (1,2,3,4... a,b,c,d...).
I put it at Github tonight so you can run the code and see the files.
import os
import glob
import pandas as pd
def concatenate(indir='./files/', outfile='./all.csv'):
    os.chdir(indir)
    fileList = glob.glob('*.CSV')
    dfList = []
    '''colnames = ['Time', 'Number', 'Reaction', 'Code', 'Message', 'date']'''
    print(len(fileList))
    for filename in fileList:
        print(filename)
        df = pd.read_csv(filename, header=0)
        dfList.append(df)
    '''print(dfList)'''
    concatDf = pd.concat(dfList, axis=0)
    '''concatDf.columns = colnames'''
    concatDf.to_csv(outfile, index=None)
concatenate()
Error
Unable to open parsers.pyx: Unable to read file (Error: File not found
(/Users/alf4/Documents/vs_code/files/pandas/_libs/parsers.pyx)).
But it only happens with more than two files.
Complex ones? Do you mean bigger CSV files?
Instead of appending the data to an empty list and then concatenating, you can do it in a single step: start with an empty DataFrame (df1) and keep appending df to df1 inside the loop.
df1 = df1.append(df)
and then write it out at the end:
df1.to_csv(outfile, index=None)
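DataFrame.append was removed in pandas 2.0, so on current versions the same accumulate-inside-the-loop idea can be sketched with pd.concat instead. The function name accumulate and the sample data are made up for illustration. Collecting the frames in a list and concatenating once, as the question already does, is generally cheaper when there are many files.

```python
import pandas as pd


def accumulate(frames):
    """Combine DataFrames one at a time, as a loop over files would."""
    combined = None  # stays None if `frames` is empty
    for df in frames:
        if combined is None:
            combined = df.copy()
        else:
            # Equivalent of the old combined.append(df), with rows renumbered.
            combined = pd.concat([combined, df], ignore_index=True)
    return combined


result = accumulate([pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3]})])
print(len(result))  # → 3
```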
I am sorry for this question and the wrong topic, because it does not seem to be a code problem.
It seems that my pandas installation is broken. I put the code on repl.it to share it here, and there it works. At the moment I am trying to repair the Python and pandas installation.
Many thanks to everyone in the comments for the help.
