xlsx to txt column data formatting - python

I am trying to turn a series of Excel Sheets into .txt files. The data I'm working with has some specific formatting I want to keep (decimal places and scientific notation specifically), but I can't seem to get it to work. Am I missing something with .format? The code below works for the most part (except for the final 3 lines, the ones I'm working on).
import pandas as pd
file_names = ["Example.xlsx"]
for xl_file in file_names:
    xl = pd.ExcelFile(xl_file)
    sheet_names = xl.sheet_names
    for k in range(len(sheet_names)):
        txt_name = xl_file.split(".")[0] + str(sheet_names[k]) + ".txt"
        df = pd.read_excel(xl_file, sheet_name=sheet_names[k])
        with open(txt_name, 'w', encoding="utf-8") as outfile:
            df.to_string(outfile, index=False)
col0 = [0]
df0 = pd.read_excel("Example.xlsx", usecols=col0)
"El": "{:<}".format(df0)

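One possible way to keep per-column formatting, sketched under the assumption that the sheet has a text column named El plus numeric columns (the names mass and density and the sample values are invented for illustration): DataFrame.to_string accepts a formatters mapping from column name to a one-argument function, so str.format can be applied to each cell of a column rather than to the whole DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for one sheet; with a real file this would come from
# pd.read_excel(xl_file, sheet_name=...).
df = pd.DataFrame(
    {
        "El": ["H", "He"],
        "mass": [1.008, 4.0026],
        "density": [8.988e-05, 1.785e-04],
    }
)

# One formatter per column: each is a function that takes a single cell value.
formatters = {
    "El": "{:<4}".format,        # left-aligned text, padded to 4 characters
    "mass": "{:.3f}".format,     # fixed three decimal places
    "density": "{:.3e}".format,  # scientific notation
}

text = df.to_string(index=False, formatters=formatters)
print(text)
```

The same formatters argument can be passed in the loop above, as in df.to_string(outfile, index=False, formatters=formatters).
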
Related

Save to Excel strings with '='

I'm trying to save my output to an Excel file, but some of the values have '=' at the beginning of the string.
So while exporting, Excel converts them to formulas, and instead of strings, I have #NAME error in Excel.
I need to save only some columns as text, as I have dates and numerics in other columns, and they should be saved as is.
I've already tried to convert them with the .astype() function, but with no result.
def create_excel(datadir, filename, data):
    df = col_type_converter(filename, pd.DataFrame(data))
    filepath = os.path.join(datadir, filename + '.xlsx')
    writer = pd.ExcelWriter(filepath, engine='xlsxwriter')
    df.to_excel(writer, index=False)
    writer.save()
    return filepath
def col_type_converter(name, dataframe):
    df = dataframe
    if name == 'flights':
        df['departure_station'] = df['departure_station'].astype(str)
        df['arrival_station'] = df['arrival_station'].astype(str)
        return df
    return df
When I import from CSV using Excel's built-in importer, I can make it import values as text.
Is there any way to tell pandas how I want the columns to be written?
nvm, you can just pass xlsxwriter options through pandas:
writer = pd.ExcelWriter(filepath, engine='xlsxwriter', options={'strings_to_formulas': False})
https://xlsxwriter.readthedocs.io/working_with_pandas.html#passing-xlsxwriter-constructor-options-to-pandas
https://xlsxwriter.readthedocs.io/worksheet.html#worksheetwrite
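A sketch of the same idea for recent pandas versions (1.3 and later, as far as I know), where XlsxWriter constructor options are nested under engine_kwargs rather than passed as a bare options keyword. The names WRITER_KWARGS and write_without_formulas are made up for this example, and the xlsxwriter package is assumed to be installed.

```python
import pandas as pd

# XlsxWriter constructor options go under engine_kwargs["options"].
WRITER_KWARGS = {
    "engine": "xlsxwriter",
    "engine_kwargs": {"options": {"strings_to_formulas": False}},
}


def write_without_formulas(df, filepath):
    """Write df to filepath, keeping strings that start with '=' as text."""
    # The context manager saves and closes the workbook on exit.
    with pd.ExcelWriter(filepath, **WRITER_KWARGS) as writer:
        df.to_excel(writer, index=False)
    return filepath
```

Dates and numeric columns are not affected by this option; it only changes how strings beginning with '=' are written.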

When compiling multiple excel files into csv files, datetime turns into integer dtype

I'm using python to merge some excel files into a single csv file, but when doing so, the datetimes get turned into integers. So, when I read it back with pandas to treat my unified database, I would need to convert it back to datetime, which is possible but seems unnecessary. The code for reading and compiling the files:
import csv
import os
from pathlib import Path
import xlrd
folder = Path('myPath')
os.chdir(folder)
files = sorted(os.listdir(os.getcwd()), key=os.path.getctime)
# Open the output once; reopening it in 'w' mode per workbook would overwrite it
with open('Unified database.csv', 'w', newline='', encoding='utf-8') as f:
    c = csv.writer(f)
    for file in files:
        with xlrd.open_workbook(file) as wb:
            sh = wb.sheet_by_index(0)
            for r in range(sh.nrows):
                c.writerow(sh.row_values(r))
Is there a way to solve this in fewer steps and just write the datetime columns as strings, which pandas has a much easier time identifying as dates automatically? Even if I have to pass the datetime columns manually.
Have you tried reading all of the Excel files directly into a pandas DataFrame? The code below is adapted from this answer on how to import multiple CSV files into pandas and concatenate them into one DataFrame. I have added parse_dates so you can specify which columns should be datetime.
import glob
import pandas as pd
path = r'C:\DRO\DCL_rawdata_files'  # use your path
all_files = glob.glob(path + "/*.xlsx")
li = []
for filename in all_files:
    # 'a' is a placeholder: list the names of your datetime columns here
    df = pd.read_excel(filename, index_col=None, header=0, parse_dates=['a'])
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
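To illustrate why this round trip keeps the dates, here is a minimal sketch that uses small in-memory frames in place of the Excel files (the column names when and value and the data are invented). A datetime column is written to CSV as ISO-formatted text and comes back as datetime when it is named in parse_dates.

```python
import io

import pandas as pd

# Two small frames standing in for sheets already loaded with pd.read_excel.
frames = [
    pd.DataFrame({"when": pd.to_datetime(["2021-01-05", "2021-01-06"]), "value": [1, 2]}),
    pd.DataFrame({"when": pd.to_datetime(["2021-02-01"]), "value": [3]}),
]
combined = pd.concat(frames, ignore_index=True)

# to_csv writes datetime columns as ISO-formatted text, not serial numbers.
buffer = io.StringIO()
combined.to_csv(buffer, index=False)

# Reading back: name the datetime columns explicitly via parse_dates.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["when"])
print(restored.dtypes)
```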

Pandas ExcelFile read columns as string

I know that you can specify data types when reading excels using pd.read_excel (as outlined here). Can you do the same using pd.ExcelFile?
I have the following code:
if ".xls" in name:
    xl = pd.ExcelFile(path + "\\" + name)
    for sheet in xl.sheet_names:
        xl_parsed = xl.parse(sheet)
When parsing the sheet, some of the values in the columns are displayed in scientific notation. I don't know the column names before loading so I need to import everything as string. Ideally I would like to be able to do something like xl_parsed = xl.parse(sheet, dtype = str). Any suggestions?
If you would prefer a cleaner solution, I used the following:
excel = pd.ExcelFile(path)
for sheet in excel.sheet_names:
    columns = excel.parse(sheet).columns
    converters = {column: str for column in columns}
    data = excel.parse(sheet, converters=converters)
I went with roganjosh's suggestion - open the excel first, get column names and then pass as converter.
if ".xls" in name:
    xl = pd.ExcelFile(path)
    sheetCounter = 1
    for sheet in xl.sheet_names:
        ### Force to read as string ###
        column_list = []
        df_column = pd.read_excel(path, sheetCounter - 1).columns
        for i in df_column:
            column_list.append(i)
        converter = {col: str for col in column_list}
        ##################
        xl_parsed = xl.parse(sheet, converters=converter)
        sheetCounter = sheetCounter + 1
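A shorter variant, offered as a sketch rather than a drop-in replacement: pd.read_excel accepts dtype=str, and sheet_name=None loads every sheet at once, so the column names do not have to be read first. The function name read_all_sheets_as_text is made up for this example.

```python
import pandas as pd


def read_all_sheets_as_text(path):
    """Return {sheet name: DataFrame} with every column read as strings."""
    # sheet_name=None loads all sheets into a dict keyed by sheet name;
    # dtype=str applies to every column, whatever the columns are called.
    return pd.read_excel(path, sheet_name=None, dtype=str)
```

Empty cells still come back as missing values, and each number is turned into the string form of whatever value the Excel reader returned, so it is worth checking a few long numeric codes after loading.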

Looping through a folder to merge several excel sheets into one column

I have several workbooks, each with three sheets. I want to loop through each workbook and merge all the data from sheet_1 into a new workbook_1 file, sheet_2 into workbook_2 file & sheet_3 into workbook_3.
As far as I can tell the script below does everything I need, except rather than appending the data, it overwrites the data from the previous iteration.
For the sake of parsimony I've shortened, cleaned & simplified my script, but I'm happy to share the full script if needed.
import pandas as pd
import glob
search_dir = '/Users/PATH/*.xlsx'
sheet_names = ['sheet_1', 'sheet_2', 'sheet_3']
def a_joiner(sheet):
    for loop_x in glob.glob(search_dir):
        try:
            if sheet == 'sheet_1':
                id_file = pd.ExcelFile(loop_x)
                df_1 = id_file.parse(sheet, header=None)
                writer = pd.ExcelWriter('/Users/PATH/%s.xlsx' % (sheet), engine='xlsxwriter')
                df_1.to_excel(writer)
                writer.save()
            elif sheet == 'sheet_2':
                pass  # do same as above
            else:
                pass  # and do same as above again
        except Exception as e:
            print('Error:', e)
for sheet in sheet_names:
    a_joiner(sheet)
You can also easily append data like this:
import os
import pandas as pd
df = []
for f in ['c:\\file1.xls', 'c:\\file2.xls']:
    data = pd.read_excel(f, 'Sheet1').iloc[:-2]
    data.index = [os.path.basename(f)] * len(data)
    df.append(data)
df = pd.concat(df)
From:
Using pandas Combining/merging 2 different Excel files/sheets
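The overwriting in the question comes from creating the writer inside the loop, so each workbook replaces the previous output. A minimal sketch of the alternative, collecting the frames first and writing once, using in-memory frames as stand-ins for the parsed sheets (the helper name stack_sheets and the sample data are invented):

```python
import pandas as pd


def stack_sheets(frames):
    """Concatenate the same sheet from several workbooks into one frame."""
    # Build the list first, then concatenate once; writing inside the loop
    # would replace the output file on every iteration.
    return pd.concat(list(frames), ignore_index=True)


# Stand-ins for id_file.parse('sheet_1', header=None) from two workbooks.
from_book_a = pd.DataFrame([[1, "a"], [2, "b"]])
from_book_b = pd.DataFrame([[3, "c"]])

merged = stack_sheets([from_book_a, from_book_b])
print(merged)
# With real files, write once after the loop, for example:
# merged.to_excel('/Users/PATH/sheet_1.xlsx', index=False)
```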

How to concatenate three excels files xlsx using python?

Hello, I would like to concatenate three Excel (.xlsx) files using Python.
I have tried using openpyxl, but I don't know which function could help me append three worksheets into one.
Do you have any ideas how to do that?
Thanks a lot
Here's a pandas-based approach. (It's using openpyxl behind the scenes.)
import pandas as pd
# filenames
excel_names = ["xlsx1.xlsx", "xlsx2.xlsx", "xlsx3.xlsx"]
# read them in
excels = [pd.ExcelFile(name) for name in excel_names]
# turn them into dataframes
frames = [x.parse(x.sheet_names[0], header=None,index_col=None) for x in excels]
# delete the first row for all frames except the first
# i.e. remove the header row -- assumes it's the first
frames[1:] = [df[1:] for df in frames[1:]]
# concatenate them..
combined = pd.concat(frames)
# write it out
combined.to_excel("c.xlsx", header=False, index=False)
I'd use xlrd and xlwt. Assuming you literally just need to append these files (rather than doing any real work on them), I'd do something like: Open up a file to write to with xlwt, and then for each of your other three files, loop over the data and add each row to the output file. To get you started:
import xlwt
import xlrd
wkbk = xlwt.Workbook()
outsheet = wkbk.add_sheet('Sheet1')
# Note: xlrd 2.0 and later reads only .xls files; .xlsx input needs an older xlrd
xlsfiles = [r'C:\foo.xlsx', r'C:\bar.xlsx', r'C:\baz.xlsx']
outrow_idx = 0
for f in xlsfiles:
    # This is all untested; essentially just pseudocode for concept!
    insheet = xlrd.open_workbook(f).sheets()[0]
    for row_idx in range(insheet.nrows):
        for col_idx in range(insheet.ncols):
            outsheet.write(outrow_idx, col_idx,
                           insheet.cell_value(row_idx, col_idx))
        outrow_idx += 1
wkbk.save(r'C:\combined.xls')
If your files all have a header line, you probably don't want to repeat that, so you could modify the code above to look more like this:
firstfile = True  # Is this the first sheet?
for f in xlsfiles:
    insheet = xlrd.open_workbook(f).sheets()[0]
    for row_idx in range(0 if firstfile else 1, insheet.nrows):
        pass  # processing; etc
    firstfile = False  # We're done with the first sheet.
When I combine excel files (mydata1.xlsx, mydata2.xlsx, mydata3.xlsx) for data analysis, here is what I do:
import pandas as pd
import glob
frames = []
for f in glob.glob('myfolder/mydata*.xlsx'):
    frames.append(pd.read_excel(f))
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
all_data = pd.concat(frames, ignore_index=True)
Then, when I want to save it as one file:
# The context manager saves the file on exit (ExcelWriter.save() was removed in pandas 2.0)
with pd.ExcelWriter('mycollected_data.xlsx', engine='xlsxwriter') as writer:
    all_data.to_excel(writer, sheet_name='Sheet1')
Solution with openpyxl only (without a bunch of other dependencies).
This script should take care of merging together an arbitrary number of xlsx documents, whether they have one or multiple sheets. It will preserve the formatting.
There's a function to copy sheets in openpyxl, but it is only from/to the same file. There's also a function insert_rows somewhere, but by itself it won't insert any rows. So I'm afraid we are left to deal (tediously) with one cell at a time.
As much as I dislike using for loops and would rather use something compact and elegant like list comprehension, I don't see how to do that here as this is a side-effect show.
Credit to this answer on copying between workbooks.
#!/usr/bin/env python3
#USAGE
#mergeXLSX.py <a bunch of .xlsx files> ... output.xlsx
#
#where output.xlsx is the unified file
#This works FROM/TO the xlsx format. Libreoffice might help to convert from xls.
#localc --headless --convert-to xlsx somefile.xls
import sys
from copy import copy
from openpyxl import load_workbook,Workbook
def createNewWorkbook(manyWb):
    for wb in manyWb:
        for sheetName in wb.sheetnames:
            o = theOne.create_sheet(sheetName)
            safeTitle = o.title
            copySheet(wb[sheetName],theOne[safeTitle])
def copySheet(sourceSheet,newSheet):
    for row in sourceSheet.rows:
        for cell in row:
            newCell = newSheet.cell(row=cell.row, column=cell.col_idx,
                    value= cell.value)
            if cell.has_style:
                newCell.font = copy(cell.font)
                newCell.border = copy(cell.border)
                newCell.fill = copy(cell.fill)
                newCell.number_format = copy(cell.number_format)
                newCell.protection = copy(cell.protection)
                newCell.alignment = copy(cell.alignment)
filesInput = sys.argv[1:]
theOneFile = filesInput.pop(-1)
myfriends = [ load_workbook(f) for f in filesInput ]
#try this if you are bored
#myfriends = [ openpyxl.load_workbook(f) for k in range(200) for f in filesInput ]
theOne = Workbook()
del theOne['Sheet'] #We want our new book to be empty. Thanks.
createNewWorkbook(myfriends)
theOne.save(theOneFile)
Tested with openpyxl 2.5.4, python 3.4.
You can simply use pandas and os library to do this.
import pandas as pd
import os
#collect each file's data, then combine once at the end
frames = []
for files in os.listdir():
    #make sure you are only reading excel files
    if files.endswith('.xlsx'):
        data = pd.read_excel(files, index_col=None)
        frames.append(data)
        #move the files to other folder so that it does not process multiple times
        os.rename(files, 'path to some other folder')
#DataFrame.append was removed in pandas 2.0; pd.concat does the same job
mergedData = pd.concat(frames)
The mergedData DataFrame will hold all the combined data, which you can export to a separate Excel or CSV file. The same code works with CSV files as well: change the extension in the if condition and read with pd.read_csv.
Just to add to p_barill's answer, if you have custom column widths that you need to copy, you can add the following to the bottom of copySheet:
for col in sourceSheet.column_dimensions:
    newSheet.column_dimensions[col] = sourceSheet.column_dimensions[col]
I would just post this in a comment on his or her answer but my reputation isn't high enough.
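A possible variation on the column-width snippet, given as a sketch: copy only the width value for each column instead of assigning the source sheet's dimension object, which avoids sharing one dimension object between two worksheets. The helper name copy_column_widths is made up; it expects two openpyxl worksheets like the sourceSheet and newSheet arguments of copySheet.

```python
def copy_column_widths(source_sheet, new_sheet):
    """Copy explicit column widths from one openpyxl worksheet to another."""
    for letter, dim in source_sheet.column_dimensions.items():
        # Copy the width value only; columns without a set width are skipped.
        if dim.width is not None:
            new_sheet.column_dimensions[letter].width = dim.width
```

Calling copy_column_widths(sourceSheet, newSheet) at the end of copySheet would take the place of the two-line loop.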
