I have written a script that reads Excel workbooks and writes new ones.
Each row is a separate object, and one of the columns is a date.
I have applied a NamedStyle to the date column to get what I think is the correct format:
from openpyxl.styles import NamedStyle

date_style = NamedStyle(name='datetime', number_format='YYYY-MM-DD')
for row in range(2, ws_kont.max_row + 1):
    ws_kont.cell(row=row, column=4).style = date_style
The problem is that I need to import this workbook into an ancient database that, for some reason, doesn't accept date formatting, only text like "yyyy-dd-mm".
I'm having trouble rewriting these cells as text.
I have tried the =TEXT formula, but that won't work, since a cell can't reference itself in its own formula unless I duplicate the column for the formula to point at:
name = ws_teg.cell(row=row, column=4).coordinate  # coordinate is already a string
date_f = "yyyy-mm-dd"
ws_kont[name] = '=TEXT(%s,"%s")' % (name, date_f)  # the format string must be quoted
I need to do this in a bunch of places across a couple of scripts, so I'm wondering if there is a simpler way to do it?
PS. I'm just an archaeologist trying to automate some tasks in my workday by dabbling in simple code, so please go easy on me if I seem a bit slow.
Found another article that worked out well with minimal code:
writer = pd.ExcelWriter('Sample_Master_Data_edited.xlsx', engine='xlsxwriter',
                        date_format='mm/dd/yyyy', datetime_format='mm/dd/yyyy')
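For completeness, here is a minimal end-to-end sketch of that approach; the DataFrame and its column name are stand-ins of mine, not from the original article:

import pandas as pd

# Hypothetical sample data; substitute your own frame.
df = pd.DataFrame({'find_date': pd.to_datetime(['2022-10-07', '2022-10-08'])})

# date_format/datetime_format set the display format for all date cells.
writer = pd.ExcelWriter('Sample_Master_Data_edited.xlsx', engine='xlsxwriter',
                        date_format='mm/dd/yyyy', datetime_format='mm/dd/yyyy')
df.to_excel(writer, sheet_name='Sheet1', index=False)
writer.close()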
Most likely, it won't be enough to change the format of your date - you'll have to store the date as a string instead of a datetime object.
Loop over the column and format the dates with datetime.strftime:
for row in range(2, ws_kont.max_row + 1):  # start at 2 to skip the header row
    cell = ws_kont.cell(row=row, column=4)
    if cell.value is not None:  # leave empty cells alone
        cell.value = cell.value.strftime('%Y-%m-%d')
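Since you need this in several places, one option (my sketch, not part of the original answer) is to wrap the loop in a small helper you can reuse across your scripts:

def dates_to_text(ws, column, fmt='%Y-%m-%d', first_row=2):
    """Replace datetime values in one column with formatted text."""
    for row in range(first_row, ws.max_row + 1):
        cell = ws.cell(row=row, column=column)
        if hasattr(cell.value, 'strftime'):  # only touch date/datetime cells
            cell.value = cell.value.strftime(fmt)

# For example: dates_to_text(ws_kont, column=4)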
[ 10-07-2022 - For anyone stopping by with the same issue: after much searching, I have yet to find a way that isn't convoluted and complicated to accurately pull mixed-type data from Excel using pandas/Python. My solution is to convert the files with unoconv on the command line, which preserves the formatting, and then read them into pandas from there. ]
I have to concatenate thousands of individual Excel workbooks, each with a single sheet, into one master sheet. I use a for loop to read them into a data frame, then concatenate each onto a master data frame. There is one column in each that could represent currency, percentages, or just contain notes. Sometimes it has been filled out with an explicit indicator in the cell, e.g. '$'; other times, someone has used cell formatting to indicate currency while leaving just a decimal in the cell. I've been using a formatting routine to catch some of this but have run into edge cases.
Consider a case like the following:
In the actual spreadsheet, you see: $0.96
When read_excel siphons this in, it is represented as 0.96. Because of the mixed-type nature of the column, there is no sure way to know whether this is 96% or $0.96.
Is there a way to read Excel files into a data frame for analysis that records what is visually represented in the cell, regardless of whether cell formatting was used or not?
I've tried using dtype="str" and dtype="object", and have tried both the default and openpyxl engines.
UPDATE
Taking the comments below into consideration, I'm rewriting with openpyxl.
import openpyxl
import pandas as pd
from pathlib import Path

def excel_concat(df_source):
    df_master = pd.DataFrame()
    for index, row in df_source.iterrows():
        excel_file = Path(row['Test Path']) / Path(row['Original Filename'])
        wb = openpyxl.load_workbook(filename=excel_file)
        ws = wb.active
        df_data = pd.DataFrame(ws.values)
        df_master = pd.concat([df_master, df_data], ignore_index=True)
    return df_master

df_master1 = excel_concat(df_excel_files)
This appears to be nothing more than a "longcut" to just calling the openpyxl engine from pandas. What am I missing in order to capture the visible values in the Excel files?
Looking here, https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html , I noticed the following:
dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}. Use object to preserve data as stored in Excel and not interpret dtype. **If converters are specified, they will be applied INSTEAD of dtype conversion.**
converters : dict, default None
Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content.
Do you think that might work for you?
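If it helps, here is a rough sketch of the converters idea; the column name 'Amount' is made up for illustration. Note that a converter sees the stored cell value (e.g. 0.96), not the displayed text (e.g. $0.96), so it preserves the raw value as a string but may not capture display formatting:

import pandas as pd

# The converter runs instead of dtype inference for that column,
# so the mixed column arrives as strings rather than floats.
df = pd.read_excel('mixed_types.xlsx', converters={'Amount': str})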
I am struggling to find a way to solve my problem. I have an Excel file which has data.
I need to check the type of data in each column (every cell).
For example, in this column, I need to check that every cell is a string. But as you can see, there is a cell that is an int.
In that situation, I need to write the line to a new text file.
This is the code I have so far:
from openpyxl import load_workbook

book = load_workbook('export.xlsx')
sheet = book['Data']
for row in sheet.rows:
    print(str(row[6].value))
Thanks for any help!
For this special case you may try this:
intList = list()
for row in sheet.rows:
    try:
        intList.append(int(row[6].value))
    except (TypeError, ValueError):  # value is missing or not numeric
        pass
It will try to get the int value of the cell and, if that succeeds, push it into a list.
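To match the original goal of writing the offending rows to a text file, here is a hedged sketch; the output file name is my own choice, and column 7 (index 6) is assumed from the question:

from openpyxl import load_workbook

book = load_workbook('export.xlsx')
sheet = book['Data']

# Record every cell in column 7 whose value is not a string.
with open('non_string_cells.txt', 'w') as out:
    for row in sheet.rows:
        value = row[6].value
        if value is not None and not isinstance(value, str):
            out.write('row %d: %r\n' % (row[6].row, value))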
I have a column in a dataframe with values in the format XX/XX (e.g. 05/23, 4/22, etc.). When I write it to a CSV and open that in Excel, the values are converted to dates. How do I prevent this from happening?
I tried putting an equals sign in front, but then Excel executes it as division (e.g. =4/20 comes out to 0.2).
df['unique_id'] = '=' + df['unique_id']
I want the output to keep the original format XX/XX (e.g. 5/23 stays 5/23 when the CSV is opened in Excel).
Check the datatypes of your dataframe with df.dtypes. I assume your column is interpreted as a date. Then you can do df[col] = df[col].astype(np_type_you_want).
If that doesn't bring the wished-for result, check why the column is interpreted as a date when the df is created. The solution depends on where you get the data from.
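A rough sketch of that suggestion (the column name is made up); note that forcing the dtype on the pandas side alone may not stop Excel from reinterpreting the text when it opens the CSV:

import pandas as pd

df = pd.DataFrame({'unique_id': ['05/23', '4/22']})
print(df.dtypes)                               # see how the column is stored
df['unique_id'] = df['unique_id'].astype(str)  # force plain text in pandas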
The issue is not with Python or pandas. The issue is that Excel thinks it's clever and assumes it knows your data type. You were close with putting an = before your data, but the data also needs to be wrapped in quotes and then prefixed with the =. I can't claim to have come up with this answer myself; I obtained it from this answer.
The following code writes a CSV file that will then open in Excel without any formatting trying to convert to a date or execute division. However, it should be noted that this is only really a strategy if you will only ever open the CSV in Excel, as you are wrapping formatting info around your data which Excel will then strip out. If you are using this CSV in any other software, you might need to rethink it.
import pandas as pd

data = {'key1': [r'4/5']}
df = pd.DataFrame.from_dict(data)
df['key1'] = '="' + df['key1'] + '"'  # wrap in ="..." so Excel treats it as literal text
print(df)
print(df.dtypes)
with open(r'C:\Users\cd00119621\myfile.csv', 'w', newline='') as output:  # newline='' avoids blank rows on Windows
    df.to_csv(output)
RAW OUTPUT in the file:
,key1
0,"=""4/5"""
EXCEL OUTPUT: the cell displays 4/5, kept as text.
I have an Excel table that I would like to sort with xlwings. The table has a header row. I tried sorting like this:
import xlwings as xw
from xlwings.constants import SortOrder

wb = xw.Book(file)
ws = wb.sheets[sheet]
ws.range(table).api.Sort(ws.range(table).api, SortOrder.xlAscending)
But that sorts the table such that data replaces the headers, and the header row ends up at the bottom of the table.
The following all produce the same result:
# Setting the range to include only the table data
ws.range("Table1[#Data]").api.Sort(ws.range("Table1[#Data]").api, SortOrder.xlAscending)
# Specifying that the range has a header
ws.range(table).api.Sort(Key1=ws.range(table).api, Order1=1, Header="xlYes")
# Manually excluding the header row from the range
ws.range('c4:n380').api.Sort(ws.range('c4:n380').api, SortOrder.xlAscending)
I'm at my wits' end. The final table will be very large, so I'd rather not read the whole thing into a dataframe, sort it there, and re-write it to Excel.
Documentation on this topic is sketchy.
After 2 days of trying and searching, this seemed to work:
last_row = ws.range((1, 1)).end('down').row  # last used row in column A
first_col_range = ws.range("A2:A{row}".format(row=last_row))  # sort key, header excluded
data_range = ws.range("A2:N{row}".format(row=last_row))  # table data without the header
data_range.api.Sort(Key1=first_col_range.api, Order1=1, Header=2, Orientation=1)
I found https://learn.microsoft.com/en-us/office/vba/api/excel.range.sort to be of some help.
The solution I'm posting here refers to the example at:
https://www.dataquest.io/blog/python-excel-xlwings-tutorial/
(xlwings Tutorial: Make Excel Faster Using Python)
Good luck!
I hit the same issue using these one-liners: the header row kept being sorted and moved to the last row. Here is how I solved it in Python, using the VBA ListObjects API.
table1 = ws.api.ListObjects("Table1")
sort_range = ws.range("Table1[#Data]").api
table1.Sort.SortFields.Clear()              # drop any previously stacked sort fields
table1.Sort.SortFields.Add(Key=sort_range)
table1.Sort.Apply()                         # ListObject sorts leave the header row in place
I am working with a big Excel file.
I am using
from openpyxl import load_workbook

wb = load_workbook(filename='my_file.xlsx')
ws = wb['Sheet1']
I don't want to alter the worksheet in any way. I just want to take data from several columns and work with them.
My understanding is that I can't just grab a column and call .tolist() on it, because the values live in worksheet cells rather than in a Python list.
Bernie's answer, I think, was for a slightly older version of openpyxl. Worksheet.columns no longer returns tuples; it is now a generator. The new way to access a column is Worksheet['AlphabetLetter'].
So the rewritten code is:
mylist = []
for cell in ws['A']:
    mylist.append(cell.value)
Based on your comment, here's one thing you can do (ws.columns is a generator in current openpyxl, so it can't be indexed directly):
mylist = []
first_column = next(ws.columns)  # first column of the used range
for cell in first_column:
    mylist.append(cell.value)
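If you only need the values, a shorter variant is possible with openpyxl's iter_cols; a sketch, assuming the data sits in column 1:

# values_only=True yields tuples of cell values instead of Cell objects.
first_col = next(ws.iter_cols(min_col=1, max_col=1, values_only=True))
mylist = list(first_col)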