How to use Python to extract Excel Sheet data

How to use Python to extract Excel Sheet data - python

I have many folders, each folders contains 1 excel file like 1Aug2022, 2Aug2022...
I want python to Read thru all Folders, and only open the excel file name like 19AUG2022, the excel file have many sheets inside like IP-1*****, IP-2*****, IP-3*****. Then go to sheets with (IP-2*****) to extract 2columns of data.
How can I do it in python?

You can use pandas package: https://pandas.pydata.org/
an example is
import pandas as pd
your_excel_path = "your/path/to/the/excel/file"
data = pd.read_excel(your_excel_path, sheet_name = "19AUG2022") # If you want to read specific sheet's data
data = pd.read_excel(your_excel_path, sheet_name = None) # If you want to read all sheets' data, it will return a list of dataframes

As Fergus said use pandas.
The code to search all directorys may look like that:
import os
import pandas as pd
directory_to_search = "./"
sheet_name = "IP-2*****"
for root, dirs, files in os.walk(directory_to_search):
for file in files:
if file == "19AUG2022":
df = pd.read_excel(io=os.path.join(root, file), sheet_name=sheet_name)

Related

Merge all excel files into one file with multiple sheets

i would like some help.
I have multiple excel files, each file only has one sheet.
I would like to combine all excel files into just one file but with multiple sheets one sheet per excel file keeping the same sheet names.
this is what i have so far:
import pandas as pd
from glob import glob
import os
excelWriter = pd.ExcelWriter("multiple_sheets.xlsx",engine='xlsxwriter')
for file in glob('*.xlsx'):
df = pd.read_excel(file)
df.to_excel(excelWriter,sheet_name=file,index=False)
excelWriter.save()
All the excel files looks like this:
https://iili.io/HfiJRHl.png
sorry i cannot upload images here, dont know why but i pasted the link
But all the excel files have the exact same columns and rows and just one sheet, the only difference is the sheet name
Thanks in advance

import pandas as pd
import os
output_excel = r'/home/bera/Desktop/all_excels.xlsx'
#List all excel files in folder
excel_folder= r'/home/bera/Desktop/GIStest/excelfiles/'
excel_files = [os.path.join(root, file) for root, folder, files in os.walk(excel_folder) for file in files if file.endswith(".xlsx")]
with pd.ExcelWriter(output_excel) as writer:
for excel in excel_files: #For each excel
sheet_name = pd.ExcelFile(excel).sheet_names[0] #Find the sheet name
df = pd.read_excel(excel) #Create a dataframe
df.to_excel(writer, sheet_name=sheet_name, index=False) #Write it to a sheet in the output excel

programtically ingesting xl files to pandas data frame by reading filename

I have a folder with 6 files, 4 are excel files that I would like to bring into pandas and 2 are just other files. I want to be able to use pathlib to work with the folder to automatically ingest the excel files I want into individual pandas dataframes. I would also like to be able to name each new dataframe with the name of the excel file (without the file extension)
for example.
import pandas as pd
import pathlib as pl
folder = pl.WindowsPath(r'C:\Users\username\project\output')
files = [e for e in folder.iterdir()]
for i in files:
print(i)
['C:\Users\username\project\output\john.xlsx',
'C:\Users\username\project\output\paul.xlsx',
'C:\Users\username\project\output\random other file not for df.xlsx',
'C:\Users\username\project\output\george.xlsx',
'C:\Users\username\project\output\requirements for project.txt',
'C:\Users\username\project\output\ringo.xlsx' ]
From here, i'd like to be able to do something like
for i in files:
if ' ' not in str(i.name):
str(i.name.strip('.xlsx'))) = pd.read_excel(i)
read the file name, if it doesn't contain any spaces, take the name, remove the file extension and use that as the variable name for a pandas dataframe built from the excel file.
If what I'm doing isn't possible then I have other ways to do it, but they repeat a lot of code.
Any help is appreciated.

using pathlib and re
we can exclude any files that match a certain pattern in our dictionary comprehension, that is any files with a space.
from pathlib import Path
import re
import pandas as pd
pth = (r'C:\Users\username\project\output')
files = Path(pth).glob('*.xlsx') # use `rglob` if you want to to trawl a directory.
dfs = {file.stem : pd.read_excel(file) for file in
files if not re.search('\s', file.stem)}
based on the above you'll get :
{'john': pandas.core.frame.DataFrame,
'paul': pandas.core.frame.DataFrame,
'george': pandas.core.frame.DataFrame,
'ringo': pandas.core.frame.DataFrame}
where pandas.core.frame.DataFrame is your target dataframe.
you can then call them by doing dfs['john']

How can I write a python scripts using pandas to iterate over Excel .xlsx files with multiple sheets?

I have some Excel .Xlsx files. Each file contains multiple sheets. I have used the following code to read and extract data from the files:
import pandas as pd
file = pd.ExcelFile('my_file.xlsx')
file.sheet_names #Displays the sheet names
df = file.parse('Sheet1') #To parse Sheet1
df.columns #To list columns
My interest is the email columns in each sheet. I have been doing this almost manually with the code above. I need a code to automatically iterate over the sheets and extract all emails. Help!

You can pass over all files and all sheets with a for loop:
import pandas as pd
import os
emails = []
files_dir = "/your_path_to_the_xlsx_files"
for file in os.listdir(files_dir):
excel = pd.ExcelFile(os.path.join(files_dir,file))
for sheet in excel.sheet_names:
df = excel.parse(sheet)
if 'email' not in df.columns:
continue
emails.extend(df['email'].tolist())
Now you have all the emails in the emails list.

How can I extract the same excel sheet from multiple files?

I have a directory of similar excel files and want to extract the first sheet from each file and save it as a .csv file. Currently have code which works to extract and save sheet from individual file:
import glob
import pandas as pd
f = glob.glob('filename.xlsx') # assume the path
for excel in f:
out = excel.split('.')[0]+'.csv'
df = pd.read_excel(excel) # if only the first sheet is needed.
df.to_csv(out)

You can get all your files into a list using glob with a list comprehension:
files_to_be_read = glob.glob("*.xlsx") #Assuming you also have the path to the folder where the excel files are saved
for i in files_to_be_read:
df_in = pd.read_excel(i) #You pass the path, pd.read_excel always uses the first sheet by default
df_out = pd.to_csv(i+'.csv') #You will save the file with the same name, but in csv format

Extract data from multiple excel files in multiple directories in python pandas

I am new to Python and I am posting the question in stack overflow for the first time. Please help in solving the problem.
My main directory is 'E:\Data Science\Macros\ZBILL_Dump', containing month-wise folders and each folder contains date-wise excel data.
I was able to extract data from a single folder:
import os
import pandas as pd
import numpy as np
# Find file names in the specified directory
loc = 'E:\Data Science\Macros\ZBILL_Dump\Apr17\\'
files = os.listdir(loc)
# Find the ONLY Excel files
files_xlsx = [f for f in files if f[-4:] == 'xlsx']
# Create empty dataframe and read in new data
zbill = pd.DataFrame()
for f in files_xlsx:
New_data = pd.read_excel(os.path.normpath(loc + f), 'Sheet1')
zbill = zbill.append(New_data)
zbill.head()
I am trying to extract data from my main directory i.e "ZBILL_Dump" which contains many sub folders, but I could not do it. Please somebody help me.
Thanks a lot.

You can use glob.
import glob
import pandas as pd
# grab excel files only
pattern = 'E:\Data Science\Macros\ZBILL_Dump\Apr17\\*.xlsx'
# Save all file matches: xlsx_files
xlsx_files = glob.glob(pattern)
# Create an empty list: frames
frames = []
# Iterate over csv_files
for file in xlsx_files:
# Read xlsx into a DataFrame
df = pd.read_xlsx(file)
# Append df to frames
frames.append(df)
# Concatenate frames into dataframe
zbill = pd.concat(frames)
You can use regex if you want to look in different sub-directories. Use 'filepath/*/*.xlsx' to search the next level. More info here https://docs.python.org/3/library/glob.html

Use glob with its recursive feature for searching sub-directories:
import glob
files = glob.glob('E:\Data Science\Macros\ZBILL_Dump\**\*.xlsx', recursive=True)
Docs: https://docs.python.org/3/library/glob.html

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to use Python to extract Excel Sheet data - python

Related

Merge all excel files into one file with multiple sheets

programtically ingesting xl files to pandas data frame by reading filename

How can I write a python scripts using pandas to iterate over Excel .xlsx files with multiple sheets?

How can I extract the same excel sheet from multiple files?

Extract data from multiple excel files in multiple directories in python pandas

Categories

Resources