Reading bulk Excel files from a folder with Python (Pandas) - python

I have 40 .xls files in a folder I would like to import into a df in Pandas.
Is there a function similar to read_csv() that will let me point Python at the folder and open each of these files into a dataframe? All headers are the same in each file.

Try pandas.read_excel to open each file. You can loop over the files using the glob module.
import glob
import pandas as pd

dfs = {}
for f in glob.glob('*.xls'):  # use '*.xlsx' for newer Excel files
    dfs[f] = pd.read_excel(f)
df = pd.concat(dfs)  # change the concatenation axis if needed
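A nice side effect of concatenating a dict is that the keys (here, the filenames) become the outer level of a MultiIndex, so each row remembers which file it came from. A small sketch with in-memory stand-ins for the frames read_excel would return (the filenames and data below are hypothetical, so this runs without any Excel files on disk):

```python
import pandas as pd

# hypothetical stand-ins for pd.read_excel results
dfs = {
    "jan.xls": pd.DataFrame({"sales": [1, 2]}),
    "feb.xls": pd.DataFrame({"sales": [3]}),
}

# concat on a dict: filenames become the outer index level
df = pd.concat(dfs)
print(sorted(df.index.get_level_values(0).unique()))  # ['feb.xls', 'jan.xls']

# flatten if provenance is not needed
flat = pd.concat(dfs.values(), ignore_index=True)
print(len(flat))  # 3
```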

You can load the Excel files one by one and concatenate them:
import os
import pandas as pd

files = os.listdir(<path to folder>)
df_all = pd.DataFrame()
for file in files:
    df = pd.read_excel(f"<path to folder>/{file}")
    df_all = pd.concat([df_all, df])
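One caveat with concatenating inside the loop: each pd.concat copies every row accumulated so far, so the cost grows quadratically with the number of files. The usual idiom is to collect the frames in a list and concatenate once at the end. A sketch with in-memory stand-ins for the read_excel results:

```python
import pandas as pd

# stand-ins for the frames read_excel would return
frames = [pd.DataFrame({"x": [i]}) for i in range(3)]

# one concat at the end instead of one per file
df_all = pd.concat(frames, ignore_index=True)
print(df_all["x"].tolist())  # [0, 1, 2]
```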

import os
import pandas as pd

folder = r'C:\Users\AA\Desktop\Excel_file'
files = os.listdir(folder)
for file in files:
    if file.endswith('.xlsx'):
        df = pd.read_excel(os.path.join(folder, file))
Does this help?

Related

How to write a csv with the same filename but with a suffix?

My code reads csv files from one folder and writes them to another. I want each output file to keep the name it has in the source path, plus a suffix: for example, if the input file is called aaa.csv, the output should be aaa_ext.csv.
Instead, the files I get are file_list0.csv, file_list1.csv, file_list2.csv.
This is my code:
import pandas as pd
import numpy as np
import glob
import os

all_files = glob.glob("C:/Users/Gamer/Documents/Colbun/Saturn/*.csv")
file_list = []
for i, f in enumerate(all_files):
    df = pd.read_csv(f, header=0, usecols=["t", "f"])
    df.to_csv(f'C:/Users/Gamer/Documents/Colbun/Saturn2/file_list{i}.csv')
You can modify the line that writes the csv file as follows:
df.to_csv(f'C:/Users/Gamer/Documents/Colbun/Saturn2/{os.path.basename(f).split(".")[0]}_ext.csv')
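Note that split(".")[0] drops everything after the first dot, so a name like report.v2.csv would come out truncated; os.path.splitext only strips the final extension. A sketch with a hypothetical filename:

```python
import os

# hypothetical input path with a dot inside the stem
f = "C:/Users/Gamer/Documents/Colbun/Saturn/report.v2.csv"

# splitext keeps the stem intact and separates only the final extension
stem, ext = os.path.splitext(os.path.basename(f))
print(f"{stem}_ext{ext}")  # report.v2_ext.csv

# compare: split(".") truncates at the first dot
print(os.path.basename(f).split(".")[0])  # report
```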

How to find a required file and read it in a zip file?

I have zip files, and each zip file contains three subfolders (i.e. ini, log, and output). I want to read a file from the output folder, which contains three csv files with different names, say initial.csv, intermediate.csv, and final.csv; I only want to read final.csv.
The code that I tried:
import glob
import zipfile
import numpy as np
import pandas as pd

zipfiles = glob.glob('/home/data/*.zip')
for i in np.arange(len(zipfiles)):
    zip = zipfile.ZipFile(zipfiles[i])
    f = zip.open(zip.namelist().startswith('final'))  # this line raises
    data = pd.read_csv(f, usecols=[3, 7])
The error I got is: 'list' object has no attribute 'startswith'. How can I find the correct file and read it?
Replace
f = zip.open(zip.namelist().startswith('final'))
with
f = zip.open('output/final.csv')
If you need to "find" it first:
filename = [name for name in zip.namelist() if name.startswith('output/final')][0]
f = zip.open(filename)
To search subdirectories as well, switch to pathlib, which supports glob-style matching:
from pathlib import Path
import zipfile
import pandas as pd

dfs = []
files = Path('/home/data/').rglob('*.zip')  # rglob recursively trawls all child dirs
for file in files:
    zip = zipfile.ZipFile(file)
    name = [n for n in zip.namelist() if n.startswith('output/final')][0]
    with zip.open(name) as f:
        df = pd.read_csv(f, usecols=[3, 7])
    dfs.append(df)
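The member lookup itself can be exercised without any archives on disk by building a small zip in memory; the folder layout and file contents below are hypothetical stand-ins for one of the real archives:

```python
import io
import zipfile

# build a small zip in memory as a stand-in for one archive
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("output/initial.csv", "a,b\n1,2\n")
    zf.writestr("output/final.csv", "a,b\n3,4\n")

with zipfile.ZipFile(buf) as zf:
    # namelist() is a plain list of member paths; filter it for the one we want
    name = next(n for n in zf.namelist()
                if n.rsplit("/", 1)[-1].startswith("final"))
    with zf.open(name) as f:
        text = f.read().decode()

print(name)  # output/final.csv
```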

Copy data from CSV and PDF into HDF5 using Python

How can I transfer files from a specific folder into an hdf5 file using Python? The files are PDF and CSV.
For example, I have the path /root/Desktop/mal/ex1/, which contains many CSV and PDF files, and I want to combine all of them into one single hdf5 file.
You could modify the code below based on your requirements:
import numpy as np
import h5py
import pandas as pd
import glob

yourpath = '/root/Desktop/mal/ex1'
all_files = glob.glob(yourpath + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)

hf = h5py.File('data.h5', 'w')
hf.create_dataset('dataset_1', data=frame)  # note: works only if frame is all-numeric
hf.close()
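One caveat worth checking before running this on real data: h5py converts the data with np.asarray, and a frame that mixes text and numbers collapses to dtype=object, which h5py cannot store directly; an all-numeric frame converts cleanly. A quick sketch (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

# mixed text/number columns collapse to dtype=object under np.asarray,
# which h5py's create_dataset rejects
mixed = pd.DataFrame({"t": [1, 2], "name": ["a", "b"]})
print(np.asarray(mixed).dtype)    # object

# an all-numeric frame converts to a plain float array and stores fine
numeric = pd.DataFrame({"t": [1, 2], "f": [0.1, 0.2]})
print(np.asarray(numeric).dtype)  # float64
```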

Read multiple excel files from a folder into pandas

I would like to read several excel files from a folder on the Desktop of my MacBook into pandas.
The folder on the desktop contains a subfolder (project dataset) with all the excel files, plus the Jupyter notebook where I am writing the code (draft progetto).
I wrote the following code:
path = os.getcwd()
files = os.listdir(path)
files
Output:
['.DS_Store', 'draft progetto.ipynb', '.ipynb_checkpoints', 'project_dataset']
Then when I run:
files_xls = [f for f in files if f[3:] == 'xlsx']
files_xls
I get an empty list as output. Why is this?
IIUC, this is something that can be done much more easily with pathlib and unix-style pattern matching.
from pathlib import Path
import pandas as pd

# one-liner
your_path = 'path_to_excel_files'
df = pd.concat([pd.read_excel(f) for f in Path(your_path).rglob('*.xlsx')])
Breaking it down:
# find the excel files
# if you want to change the path, do Path('your_path')...
files = [file for file in Path.cwd().rglob('*.xlsx')]
# create a list of dataframes
dfs_list = [pd.read_excel(file) for file in files]
# concat
df = pd.concat(dfs_list)
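As for why the original list comprehension came back empty: f[3:] slices from the fourth character onward, it does not extract the extension. A quick check with a hypothetical filename:

```python
f = "report.xlsx"
print(f[3:])                # 'ort.xlsx' — not the extension
print(f[-4:])               # 'xlsx' — the last four characters
print(f.endswith(".xlsx"))  # True — the idiomatic test
```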

Open csv files with Pandas and delete those with only 1 row

I have a task to create a script that connects via ssh to a list of 10 Cisco routers weekly, checks for config changes, and sends a notification. I already have the script that logs in, runs the command, and writes the output to csv. I have modified it so that if there are no changes, the csv contains only the router name, for example:
rtr0003#
If there was a config change, the file will contain more lines.
My question is: how do I use pandas to open each file and, if it has only one line/row, delete the file, and otherwise skip it?
This is how I write the files:
files = glob.glob('*.csv')
for file in files:
    df = pd.read_csv(file)
    df = df.dropna()
    df.to_csv(file, index=False)
    df1 = pd.read_csv(file, skiprows=2)
    # df1 = df1.drop(df1.tail(1))
    df1.to_csv(file, index=False)
import os
import glob
import csv

files = glob.glob('*.csv')
for file in files:
    with open(file, "r") as f:
        reader = csv.reader(f, delimiter=",")
        data = list(reader)
    if len(data) == 1:  # delete outside the with block, once the file is closed
        os.remove(file)
Here is a solution using pandas:
import pandas as pd
import glob
import os

csv_files = glob.glob('*.csv')
for file in csv_files:
    df_file = pd.read_csv(file, low_memory=False)
    if len(df_file) == 1:
        os.remove(file)
If you are using excel files, change
glob.glob('*.csv')
to
glob.glob('*.xlsx')
and
pd.read_csv(file, low_memory = False)
to
pd.read_excel(file)
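One caveat with the len(df_file) == 1 check: read_csv treats the first line as the header, so a file holding only the router name parses to zero data rows, not one. Reading with header=None (or testing len == 0) may match the intent better. A sketch with in-memory data standing in for the csv files:

```python
import io
import pandas as pd

# a "no changes" file: just the router name
print(len(pd.read_csv(io.StringIO("rtr0003#\n"))))               # 0 — line became the header

# the same file read without a header row
print(len(pd.read_csv(io.StringIO("rtr0003#\n"), header=None)))  # 1

# a file with config changes after the router name
changed = "rtr0003#\nline1\nline2\n"
print(len(pd.read_csv(io.StringIO(changed), header=None)))       # 3
```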
