I'm working on a script to automatically extract my data from a .csv file and load it into a .xlsx file in the format I need. I want to create a new sheet for every file I use and rename it accordingly. My code works with a single file, but when I run it on my folder, it doesn't create any new sheets.
I already tried this solution but couldn't get it to work.
Here is my code:
import pandas as pd
import numpy as np
import datetime as dt
from pathlib import Path

pathList = sorted(Path('.').glob('*.csv'))
name = ''
output = "output/writer.xlsx"

def reader(file):
    return pd.read_csv(file, sep=";")

def extract(file, name):
    file['dte'] = pd.to_datetime(file['dte'], errors='coerce')
    file['year'] = pd.DatetimeIndex(file['dte']).year
    df = file.set_index([file.groupby('year').cumcount(), 'year']).unstack(1)
    df = df.sort_index(1, level=1)
    df.columns = [f"{x}_{y}" for x, y in df.columns]
    writer = pd.ExcelWriter(output, engine='xlsxwriter')
    df.to_excel(writer, sheet_name=f"{name}")
    writer.save()
    return

for path in pathList:
    name = ''
    readFile = path
    readName = str(readFile)
    print(readFile)
    for i in readName:
        if i == '.':
            break
        name = name + i
    print(name)
    pathFile = reader(readFile)
    extract(pathFile, f"{name}")
I already tried storing the names of my files in a list and creating my new sheets in a loop, but a list can't be used as a sheet name.
If you want to reproduce the problem, here are two files from my folder:
https://www.mediafire.com/file/mj9u4awael87bhc/ardentes.csv/file
https://www.mediafire.com/file/xfliok17s35lm6e/bas_en_basset.csv/file
Create the writer once, outside extract, and call writer.save() only after the loop. Each call to pd.ExcelWriter(output) starts a fresh workbook, so your extract function was overwriting the file with a single sheet on every iteration:
writer = pd.ExcelWriter(output, engine='xlsxwriter')

def extract(file, name):
    file['dte'] = pd.to_datetime(file['dte'], errors='coerce')
    file['year'] = pd.DatetimeIndex(file['dte']).year
    df = file.set_index([file.groupby('year').cumcount(), 'year']).unstack(1)
    df = df.sort_index(1, level=1)
    df.columns = [f"{x}_{y}" for x, y in df.columns]
    df.to_excel(writer, sheet_name=name)

for path in pathList:
    name = ''
    readFile = path
    readName = str(readFile)
    print(readFile)
    for i in readName:
        if i == '.':
            break
        name = name + i
    print(name)
    pathFile = reader(readFile)
    extract(pathFile, name)

writer.save()
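On newer pandas releases, ExcelWriter.save() has been removed in favor of close() or a context manager. The same one-writer-many-sheets pattern can be sketched as follows; the frames dict here is a hypothetical stand-in for the per-file results built in the loop above:

```python
import pandas as pd

# Hypothetical per-file frames, keyed by the sheet name derived from each filename
frames = {
    'ardentes': pd.DataFrame({'dte_2019': [1, 2]}),
    'bas_en_basset': pd.DataFrame({'dte_2019': [3]}),
}

# One writer for the whole run; each to_excel call adds its own sheet,
# and the workbook is saved when the with-block exits
with pd.ExcelWriter('writer.xlsx') as writer:
    for name, df in frames.items():
        df.to_excel(writer, sheet_name=name)
```

The key point is the same either way: only one ExcelWriter may exist for the output file, and it must be finalized exactly once.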
I am trying to get the file name, sheet name, max rows, and max columns of each sheet in each Excel file. I did some research today on how to use Python to take an inventory of Excel files in a folder. I put together the code below and it seems to get me the file name and sheet name, but it gets stuck on the rows and columns. As far as I know, the rows and columns are strings, right? I'm trying to accommodate that requirement, but something seems to be off here. Can someone tell me what's wrong?
import openpyxl
import glob
import pandas as pd

inventory = []
all_data = pd.DataFrame()
path = '\\Users\\ryans\\OneDrive\\Desktop\\sample\\*.xlsx'
for f in glob.glob(path):
    print(f)
    inventory.append(f)
    theFile = openpyxl.load_workbook(f)
    sheetnames = theFile.active
    for sheet in sheetnames:
        print(sheet)
        inventory.append(sheet)
        row_count = str(sheet.max_row)
        col_count = str(sheet.max_col)
        inventory.append(row_count)
        inventory.append(col_count)
print(inventory)
To iterate over the worksheets in a workbook, you should use for sheet in theFile.worksheets. Your current attempt actually iterates over the rows of the active sheet, not over the sheets themselves.
sheet.max_col is also incorrect; the attribute is sheet.max_column.
So your working code is now:
import openpyxl
import glob

inventory = []
path = '\\Users\\ryans\\OneDrive\\Desktop\\sample\\*.xlsx'
for f in glob.glob(path):
    # print(f)
    inventory.append(f)
    theFile = openpyxl.load_workbook(f)
    for sheet in theFile.worksheets:
        # print(sheet)
        inventory.append(sheet)
        row_count = str(sheet.max_row)
        col_count = str(sheet.max_column)
        inventory.append(row_count)
        inventory.append(col_count)
print(inventory)
Most of the articles I'm seeing either:
a) combine multiple single-sheet Excel workbooks into one master workbook with just a single sheet, or
b) split a multiple-sheet Excel workbook into individual workbooks.
However, my goal is to grab all the Excel files in a specific folder and save them as individual sheets within one new master Excel workbook, renaming each sheet after its original file.
import pandas as pd
import glob
import os

file = "C:\\File\\Path\\"
filename = 'Consolidated Files.xlsx'
pth = os.path.dirname(file)
extension = os.path.splitext(file)[1]
files = glob.glob(os.path.join(pth, '*xlsx'))

w = pd.ExcelWriter(file + filename)
for f in files:
    print(f)
    df = pd.read_excel(f, header=None)
    print(df)
    df.to_excel(w, sheet_name=f, index=False)
w.save()
How do I adjust the names for each sheet? Also, if you see any opportunities to clean this up, please let me know.
You cannot name a sheet with special characters, and f here is the full path including the file name. Use only the file name for the sheet name: os.path.basename gives you the file name with its extension, and split separates the name from the extension.
for f in files:
    print(f)
    df = pd.read_excel(f, header=None)
    print(df)
    # Use basename to get the filename with extension
    # Use split to separate the filename and extension
    new_sheet_name = os.path.basename(f).split('.')[0]
    df.to_excel(w, sheet_name=new_sheet_name, index=False)
I decided to put my solution here as well, just in case it would be useful to anyone.
Thing is, I wanted to be able to recall where each end sheet came from. However, source workbooks can (and likely will) have the same sheet names, like "Sheet 1", so I couldn't just use the sheet names from the original workbooks. I also could not use the source filenames as sheet names, since they might be longer than 31 characters, which is the maximum sheet name length allowed by Excel.
Therefore, I ended up assigning incremental numbers to resulting sheet names, while simultaneously inserting a new column named "source" at the start of each sheet and populating it with file name concatenated with sheet name. Hope it might help someone :)
from glob import glob
import pandas as pd
import os

files_input = glob(r'C:\Path\to\folder\*.xlsx')
result_DFs = []

for xlsx_file in files_input:
    file_DFs = pd.read_excel(xlsx_file, sheet_name=None)

    # save every sheet from every file as dataframe to an array
    for sheet_DF in file_DFs:
        source_name = os.path.basename(xlsx_file) + ":" + sheet_DF
        file_DFs[sheet_DF].insert(0, 'source', source_name)
        result_DFs.append(file_DFs[sheet_DF])

with pd.ExcelWriter(r'C:\Path\to\resulting\file.xlsx') as writer:
    for df_index in range(len(result_DFs)):
        # write dataframe to file using simple incremental number as a new sheet name
        result_DFs[df_index].to_excel(writer, sheet_name=str(df_index), index=False)

        # auto-adjust column width (can be omitted if not needed)
        for i, col in enumerate(result_DFs[df_index].columns):
            column_len = max(result_DFs[df_index][col].astype(str).str.len().max(), len(col) + 3)
            _ = writer.sheets[str(df_index)].set_column(i, i, column_len)
I am trying to read data from multiple xls files and write it to one single file.
My code below is writing only the first file. Not sure what I am missing.
import glob
import os
import pandas as pd

def list_files(dir):
    r = []
    for root, dirs, files in os.walk(dir):
        for name in files:
            r.append(os.path.join(root, name))
    return r

files = list_files("C:\\Users\\12345\\BOFS")
for file in files:
    df = pd.read_excel(file)
    new_header = df.iloc[1]
    df = df[2:]
    df.columns = new_header
    with pd.ExcelWriter("C:\\Users\\12345\\Test\\Test.xls", mode='a') as writer:
        df.to_excel(writer, index=False, header=True)
Documentation says:
ExcelWriter can also be used to append to an existing Excel file:

with pd.ExcelWriter('output.xlsx',
                    mode='a') as writer:
    df.to_excel(writer, sheet_name='Sheet_name_3')

And that probably replaces the given sheet each time, since every write without a distinct sheet_name targets the same sheet.
But you could use pd.concat(<dataframes>) to concatenate the dataframes and write all the data at once into a single sheet.
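A minimal sketch of that pd.concat route, using small hypothetical frames in place of the ones read from the BOFS folder; the commented to_excel line marks where the single write would go:

```python
import pandas as pd

# Hypothetical stand-ins for the per-file frames built in the question's loop
frames = [
    pd.DataFrame({'name': ['a', 'b'], 'value': [1, 2]}),
    pd.DataFrame({'name': ['c'], 'value': [3]}),
]

# Concatenate once, then a single to_excel call writes everything to one sheet:
combined = pd.concat(frames, ignore_index=True)
# combined.to_excel("C:\\Users\\12345\\Test\\Test.xlsx", index=False)
print(combined.shape)  # (3, 2)
```

This sidesteps the append-mode issue entirely, because the output file is written exactly once.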
I tested this piece of code; hopefully it works in your case too. Note that all_data has to be initialized as an empty DataFrame before the loop:

import glob, os
import pandas as pd

os.chdir("D:/Data Science/stackoverflow")

all_data = pd.DataFrame()
for file in glob.glob("*.xlsx"):
    df = pd.read_excel(file)
    all_data = all_data.append(df, ignore_index=True)

# now save the data frame
writer = pd.ExcelWriter('output.xlsx')
all_data.to_excel(writer, 'sheet1')
writer.save()
I want to loop through a directory, find specific xlsx files, and put each of them into its own pandas dataframe. The thing is, I also want all sheets of those Excel files to end up in the dataframes.
Below is a sample of the code I implemented; I just need to add the logic to pick up all sheets:
import pandas as pd
from glob import glob

path = 'path_to_file'
files = glob(path + '/*file*.xlsx')

get_df = lambda f: pd.read_excel(f)
dodf = {f: get_df(f) for f in files}
dodf[files[2]]  # dictionary of dataframes
As described in this answer in Pandas you still have access to the ExcelFile class, which loads the file creating an object.
This object has a .sheet_names property which gives you a list of sheet names in the current file.
xl = pd.ExcelFile('foo.xls')
xl.sheet_names # list of all sheet names
To actually handle the import of the specific sheet, use .parse(sheet_name) on the object of the imported Excel file:
xl.parse(sheet_name) # read a specific sheet to DataFrame
For your code something like:
get_df = lambda f: pd.ExcelFile(f)
dodf = {f: get_df(f) for f in files}
...gives you dodf, a dictionary of ExcelFile objects.
filename = 'yourfilehere.xlsx'
a_valid_sheet = dodf[filename].sheet_names[0]  # First sheet
df = dodf[filename].parse(a_valid_sheet)
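If you would rather end up with plain DataFrames than ExcelFile objects, pd.read_excel with sheet_name=None returns a dict mapping each sheet name to its DataFrame. A minimal sketch, where the two-sheet workbook is a hypothetical file created on the fly so the example is self-contained:

```python
import pandas as pd

# Build a hypothetical two-sheet workbook; in the question, `path`
# would instead be one of the entries in `files`.
path = 'demo_two_sheets.xlsx'
with pd.ExcelWriter(path) as writer:
    pd.DataFrame({'x': [1, 2]}).to_excel(writer, sheet_name='first', index=False)
    pd.DataFrame({'x': [3]}).to_excel(writer, sheet_name='second', index=False)

# sheet_name=None loads every sheet into a {sheet_name: DataFrame} dict
all_sheets = pd.read_excel(path, sheet_name=None)
print(sorted(all_sheets))  # ['first', 'second']
```

With that, dodf would become a dictionary of dictionaries of DataFrames, one inner dict per file.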
Hello, I would like to concatenate three Excel (.xlsx) files using Python.
I have tried using openpyxl, but I don't know which function could help me append three worksheets into one.
Do you have any ideas how to do that?
Thanks a lot.
Here's a pandas-based approach. (It's using openpyxl behind the scenes.)
import pandas as pd

# filenames
excel_names = ["xlsx1.xlsx", "xlsx2.xlsx", "xlsx3.xlsx"]

# read them in
excels = [pd.ExcelFile(name) for name in excel_names]

# turn them into dataframes
frames = [x.parse(x.sheet_names[0], header=None, index_col=None) for x in excels]

# delete the first row for all frames except the first
# i.e. remove the header row -- assumes it's the first
frames[1:] = [df[1:] for df in frames[1:]]

# concatenate them..
combined = pd.concat(frames)

# write it out
combined.to_excel("c.xlsx", header=False, index=False)
I'd use xlrd and xlwt. Assuming you literally just need to append these files (rather than doing any real work on them), I'd do something like: Open up a file to write to with xlwt, and then for each of your other three files, loop over the data and add each row to the output file. To get you started:
import xlwt
import xlrd

wkbk = xlwt.Workbook()
outsheet = wkbk.add_sheet('Sheet1')

xlsfiles = [r'C:\foo.xlsx', r'C:\bar.xlsx', r'C:\baz.xlsx']

outrow_idx = 0
for f in xlsfiles:
    # This is all untested; essentially just pseudocode for concept!
    insheet = xlrd.open_workbook(f).sheets()[0]
    for row_idx in xrange(insheet.nrows):
        for col_idx in xrange(insheet.ncols):
            outsheet.write(outrow_idx, col_idx,
                           insheet.cell_value(row_idx, col_idx))
        outrow_idx += 1
wkbk.save(r'C:\combined.xls')
If your files all have a header line, you probably don't want to repeat that, so you could modify the code above to look more like this:
firstfile = True  # Is this the first sheet?
for f in xlsfiles:
    insheet = xlrd.open_workbook(f).sheets()[0]
    for row_idx in xrange(0 if firstfile else 1, insheet.nrows):
        pass  # processing; etc
    firstfile = False  # We're done with the first sheet.
When I combine excel files (mydata1.xlsx, mydata2.xlsx, mydata3.xlsx) for data analysis, here is what I do:
import pandas as pd
import numpy as np
import glob

all_data = pd.DataFrame()
for f in glob.glob('myfolder/mydata*.xlsx'):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)
Then, when I want to save it as one file:
writer = pd.ExcelWriter('mycollected_data.xlsx', engine='xlsxwriter')
all_data.to_excel(writer, sheet_name='Sheet1')
writer.save()
Solution with openpyxl only (without a bunch of other dependencies).
This script should take care of merging together an arbitrary number of xlsx documents, whether they have one or multiple sheets. It will preserve the formatting.
There's a function to copy sheets in openpyxl, but it only works within the same file. There's also an insert_rows function somewhere, but by itself it won't insert any rows. So I'm afraid we are left to deal (tediously) with one cell at a time.
As much as I dislike for loops and would rather use something compact and elegant like a list comprehension, I don't see how to do that here, as this is all side effects.
Credit to this answer on copying between workbooks.
#!/usr/bin/env python3
#USAGE
#mergeXLSX.py <a bunch of .xlsx files> ... output.xlsx
#
#where output.xlsx is the unified file
#This works FROM/TO the xlsx format. Libreoffice might help to convert from xls.
#localc --headless --convert-to xlsx somefile.xls

import sys
from copy import copy
from openpyxl import load_workbook, Workbook

def createNewWorkbook(manyWb):
    for wb in manyWb:
        for sheetName in wb.sheetnames:
            o = theOne.create_sheet(sheetName)
            safeTitle = o.title
            copySheet(wb[sheetName], theOne[safeTitle])

def copySheet(sourceSheet, newSheet):
    for row in sourceSheet.rows:
        for cell in row:
            newCell = newSheet.cell(row=cell.row, column=cell.col_idx,
                                    value=cell.value)
            if cell.has_style:
                newCell.font = copy(cell.font)
                newCell.border = copy(cell.border)
                newCell.fill = copy(cell.fill)
                newCell.number_format = copy(cell.number_format)
                newCell.protection = copy(cell.protection)
                newCell.alignment = copy(cell.alignment)

filesInput = sys.argv[1:]
theOneFile = filesInput.pop(-1)
myfriends = [load_workbook(f) for f in filesInput]

#try this if you are bored
#myfriends = [ openpyxl.load_workbook(f) for k in range(200) for f in filesInput ]

theOne = Workbook()
del theOne['Sheet']  # We want our new book to be empty. Thanks.
createNewWorkbook(myfriends)
theOne.save(theOneFile)
Tested with openpyxl 2.5.4, python 3.4.
You can simply use the pandas and os libraries to do this.
import pandas as pd
import os

#create an empty dataframe which will have all the combined data
mergedData = pd.DataFrame()

for files in os.listdir():
    #make sure you are only reading excel files
    if files.endswith('.xlsx'):
        data = pd.read_excel(files, index_col=None)
        mergedData = mergedData.append(data)
        #move the file to another folder so that it is not processed multiple times
        os.rename(files, 'path to some other folder')
The mergedData DF will have all the combined data, which you can export to a separate Excel or CSV file. The same code works with CSV files as well; just change the extension in the if condition.
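That CSV variant can be sketched like this; the folder and its contents are hypothetical, created on the fly so the sketch is runnable, and pd.concat stands in for DataFrame.append, which recent pandas releases have removed:

```python
import os
import tempfile
import pandas as pd

# Hypothetical folder with a mix of file types, built here for the sketch
folder = tempfile.mkdtemp()
pd.DataFrame({'x': [1, 2]}).to_csv(os.path.join(folder, 'a.csv'), index=False)
pd.DataFrame({'x': [3]}).to_csv(os.path.join(folder, 'b.csv'), index=False)
open(os.path.join(folder, 'notes.txt'), 'w').close()

mergedData = pd.DataFrame()
for name in os.listdir(folder):
    # make sure you are only reading CSV files
    if name.endswith('.csv'):
        data = pd.read_csv(os.path.join(folder, name))
        mergedData = pd.concat([mergedData, data], ignore_index=True)

print(len(mergedData))  # 3
```

The structure is identical to the Excel version; only the extension check and the reader function change.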
Just to add to p_barill's answer: if you have custom column widths that you need to copy, you can add the following to the bottom of copySheet:
for col in sourceSheet.column_dimensions:
    newSheet.column_dimensions[col] = sourceSheet.column_dimensions[col]
I would just post this in a comment on his or her answer but my reputation isn't high enough.