python - How to handle "old" dates when transfering data to excel - python

I have the dataframe where one of the columns contains date strings. I first convert it to datetime with:
mydf['Desk Date'] = pd.to_datetime(mydf['Desk Date'])`
and then drop the dataframe to excel with
Range('A1').value = mydf`
I get the following error:
Traceback (most recent call last):
File "C:\Program Files (x86)\Python271\lib\site-packages\IPython\core\interactiveshell.py", line 3035, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-111-6c6f5ea1ff17>", line 1, in <module>
Import.ImportFWD(test_path)
File "C:\Users\jastrzem\Downloads\pyWFP\Import.py", line 42, in ImportFWD
Range('A1').value = mydf
File "C:\Program Files (x86)\Python271\lib\site-packages\xlwings\main.py", line 818, in value
self.row1, self.col1, row2, col2), data)
File "C:\Program Files (x86)\Python271\lib\site-packages\xlwings\_xlwindows.py", line 151, in set_value
xl_range.Value = data
File "C:\Program Files (x86)\Python271\lib\site-packages\win32com\client\dynamic.py", line 560, in __setattr__
self._oleobj_.Invoke(entry.dispid, 0, invoke_type, 0, value)
com_error: (-2147352567, 'Exception occurred.', (0, None, None, None, 0, -2146827284), None)
One of the dates is Timestamp('1899-01-31 00:00:00')
which I think is the reason for the error.
I tried to use np.where to substitute all values before year 2000 to NaN, but with no luck.
f = lambda x: x.year
mydf['Desk Date'] = np.where(pd.DataFrame(mydf['Desk Date']).applymap(f) > 2000, pd.to_datetime(mydf['Desk Date'], format='%D/%M/%Y'),np.nan)
How can I fix the above command or alternatively how should I handle dates that are "not transferable" to excel?
Thanks!
[EDIT]:
I tried to use to_excel method but with no luck either. The code I put at the end of my function:
writer = pd.ExcelWriter('test7.xlsx', engine='xlsxwriter')
mydf.to_excel(writer, sheet_name = 'Sheet1')
writer.close()
it creates the file but it's empty. I get the following error:
Traceback (most recent call last):
File "C:\Program Files (x86)\Python271\lib\site-packages\IPython\core\interactiveshell.py", line 3035, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-26-6c6f5ea1ff17>", line 1, in <module>
Import.ImportFWD(test_path)
File "C:\Users\jastrzem\Downloads\pyWFP\Import.py", line 44, in ImportFWD
writer.close()
File "C:\Program Files (x86)\Python271\lib\site-packages\pandas\io\excel.py", line 623, in close
return self.save()
File "C:\Program Files (x86)\Python271\lib\site-packages\pandas\io\excel.py", line 1298, in save
return self.book.close()
File "C:\Program Files (x86)\Python271\lib\site-packages\xlsxwriter\workbook.py", line 295, in close
self._store_workbook()
File "C:\Program Files (x86)\Python271\lib\site-packages\xlsxwriter\workbook.py", line 518, in _store_workbook
xml_files = packager._create_package()
File "C:\Program Files (x86)\Python271\lib\site-packages\xlsxwriter\packager.py", line 140, in _create_package
self._write_shared_strings_file()
File "C:\Program Files (x86)\Python271\lib\site-packages\xlsxwriter\packager.py", line 280, in _write_shared_strings_file
sst._assemble_xml_file()
File "C:\Program Files (x86)\Python271\lib\site-packages\xlsxwriter\sharedstrings.py", line 53, in _assemble_xml_file
self._write_sst_strings()
File "C:\Program Files (x86)\Python271\lib\site-packages\xlsxwriter\sharedstrings.py", line 83, in _write_sst_strings
self._write_si(string)
File "C:\Program Files (x86)\Python271\lib\site-packages\xlsxwriter\sharedstrings.py", line 110, in _write_si
self._xml_si_element(string, attributes)
File "C:\Program Files (x86)\Python271\lib\site-packages\xlsxwriter\xmlwriter.py", line 122, in _xml_si_element
self.fh.write("""<si><t%s>%s</t></si>""" % (attr, string))
File "C:\Program Files (x86)\Python271\lib\codecs.py", line 694, in write
return self.writer.write(data)
File "C:\Program Files (x86)\Python271\lib\codecs.py", line 357, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 26: ordinal not in range(128)

The error is not because of the old date, but because you are trying to throw a whole dataframe at a single cell.
Instead, use the to_excel method.

Excel will not accept dates before 1900. My workaround is to replace "old" dates with np.nan since I know they are data errors anyway.
mydf['Desk Date'] = pd.to_datetime(mydf['Desk Date'])
dates_list = list(mydf['Desk Date'])
dates_list = [x if x.year > 1900 else np.nan for x in dates_list ]
mydf['Desk Date'] = dates_list

Related

Error merging multiple CSV files - Python

I'm trying to merge several CSV files into one.
Searching several methods, I found this one:
files = glob.glob("D:\\green_lake\\Projects\\covid_19\\tabelas_relacao\\acre\\*.csv")
files_merged = pd.concat([pd.read_csv(df) for df in files], ignore_index=True)
When running this error is returned:
>>> files_merged = pd.concat([pd.read_csv(df) for df in files], ignore_index=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <listcomp>
File "C:\Users\Leonardo\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\Leonardo\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 678, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\Leonardo\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 581, in _read
return parser.read(nrows)
File "C:\Users\Leonardo\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1253, in read
index, columns, col_dict = self._engine.read(nrows)
File "C:\Users\Leonardo\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 225, in
read
chunks = self._reader.read_low_memory(nrows)
File "pandas\_libs\parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas\_libs\parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas\_libs\parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 243, saw 4
I'm starting to study python and if it's a stupid mistake, I apologize ;)

Python, Pandas write to dataframe, lxml.etree.SerialisationError: IO_WRITE

Code to pick the wanted lines from a dataframe. The original data is in Excel format and I put it in dataframe here.
I want to pick all the rows of “Test Date” fall in “201506” and “201508”, and write them to an Excel file. The lines are working fine.
import pandas as pd
data_short = {'Contract_type' : ["Other", "Other", "Type-I", "Type-I", "Type-I", "Type-II", "Type-II", "Type-III", "Type-III", "Part-time"],
'Test Date': ["20150816", "20150601", "20150204", "20150609", "20150204", "20150806", "20150201", "20150615", "20150822", "20150236" ],
'Test_time' : ["16:26", "07:39", "18:48", "22:32", "03:54", "03:30", "04:00", "22:02", "13:43", "10:29"],
}
df = pd.DataFrame(data_short)
data_201508 = df[df['Test Date'].astype(str).str.startswith('201508')]
data_201506 = df[df['Test Date'].astype(str).str.startswith('201506')]
data_68 = data_201506.append(data_201508)
writer = pd.ExcelWriter("C:\\test-output.xlsx", engine = 'openpyxl')
data_68.to_excel(writer, "Sheet1", index = False)
writer.save()
But when I applied them to a larger file, ~600,000 rows with 25 columns (65 MB in file size), it returns error message like below:
Traceback (most recent call last):
File "C:\Python27\Working Scripts\LL move pick wanted ATA in months.py", line 15, in <module>
writer.save()
File "C:\Python27\lib\site-packages\pandas\io\excel.py", line 732, in save
return self.book.save(self.path)
File "C:\Python27\lib\site-packages\openpyxl\workbook\workbook.py", line 263, in save
save_workbook(self, filename)
File "C:\Python27\lib\site-packages\openpyxl\writer\excel.py", line 239, in save_workbook
writer.save(filename, as_template=as_template)
File "C:\Python27\lib\site-packages\openpyxl\writer\excel.py", line 222, in save
self.write_data(archive, as_template=as_template)
File "C:\Python27\lib\site-packages\openpyxl\writer\excel.py", line 80, in write_data
self._write_worksheets(archive)
File "C:\Python27\lib\site-packages\openpyxl\writer\excel.py", line 163, in _write_worksheets
xml = sheet._write(self.workbook.shared_strings)
File "C:\Python27\lib\site-packages\openpyxl\worksheet\worksheet.py", line 776, in _write
return write_worksheet(self, shared_strings)
File "C:\Python27\lib\site-packages\openpyxl\writer\worksheet.py", line 263, in write_worksheet
xf.write(worksheet.page_breaks.to_tree())
File "src/lxml/serializer.pxi", line 1016, in lxml.etree._FileWriterElement.__exit__ (src\lxml\lxml.etree.c:142025)
File "src/lxml/serializer.pxi", line 904, in lxml.etree._IncrementalFileWriter._write_end_element (src\lxml\lxml.etree.c:140218)
File "src/lxml/serializer.pxi", line 999, in lxml.etree._IncrementalFileWriter._handle_error (src\lxml\lxml.etree.c:141711)
File "src/lxml/serializer.pxi", line 195, in lxml.etree._raiseSerialisationError (src\lxml\lxml.etree.c:131087)
lxml.etree.SerialisationError: IO_WRITE
Does it mean the computer is not good enough (8GB, Win10)? Is there a way to optimize the code (for example, consume less memory)? Thank you.
btw: Question similiar to I/O Error while saving Excel file - Python but no solution...
found a solution: write the output to csv instead (anyway it can be opened in Excel as well)
data_wanted_all.to_csv("C:\\test-output.csv", index=False)
post here for in case some one encounters the same problem. let me know if this question shall be removed. :)

error opening xlsx files in python

I am trying to open an xlsx file that is created by another system (and this is the format in which the data always comes, and is not in my control). I tried both openpyxl (v2.3.2) and xlrd (v1.0.0) (as well as pandas (v0.20.1) read_excel and pd.ExcelFile(), both of which are using xlrd, and so may be moot), and I am running into errors; plus not finding answers from my searches. Any help is appreciated.
xlrd code:
import xlrd
workbook = xlrd.open_workbook(r'C:/Temp/Data.xlsx')
Error:
Traceback (most recent call last):
File "<ipython-input-3-9e5d87f720d0>", line 2, in <module>
workbook = xlrd.open_workbook(r'C:/Temp/Data.xlsx')
File "C:\Program Files\Anaconda3\lib\site-packages\xlrd\__init__.py", line 422, in open_workbook
ragged_rows=ragged_rows,
File "C:\Program Files\Anaconda3\lib\site-packages\xlrd\xlsx.py", line 833, in open_workbook_2007_xml
x12sheet.process_stream(zflo, heading)
File "C:\Program Files\Anaconda3\lib\site-packages\xlrd\xlsx.py", line 548, in own_process_stream
self_do_row(elem)
File "C:\Program Files\Anaconda3\lib\site-packages\xlrd\xlsx.py", line 685, in do_row
self.sheet.put_cell(rowx, colx, None, float(tvalue), xf_index)
ValueError: could not convert string to float:
openpyxl code:
import openpyxl
wb = openpyxl.load_workbook(r'C:/Temp/Data.xlsx')
Error:
Traceback (most recent call last):
File "<ipython-input-2-6083ad2bc875>", line 1, in <module>
wb = openpyxl.load_workbook(r'C:/Temp/Data.xlsx')
File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\reader\excel.py", line 234, in load_workbook
parser.parse()
File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\reader\worksheet.py", line 106, in parse
dispatcher[tag_name](element)
File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\reader\worksheet.py", line 243, in parse_row_dimensions
self.parse_cell(cell)
File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\reader\worksheet.py", line 188, in parse_cell
value = _cast_number(value)
File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\cell\read_only.py", line 23, in _cast_number
return long(value)
ValueError: invalid literal for int() with base 10: ' '
pandas code:
import pandas as pd
df = pd.read_excel(r'C:/Temp/Data.xlsx', sheetname='Sheet1')
Error:
Traceback (most recent call last):
File "<ipython-input-5-b86ec98a4e9e>", line 2, in <module>
df = pd.read_excel(r'C:/Temp/Data.xlsx', sheetname='Sheet1')
File "C:\Program Files\Anaconda3\lib\site-packages\pandas\io\excel.py", line 200, in read_excel
io = ExcelFile(io, engine=engine)
File "C:\Program Files\Anaconda3\lib\site-packages\pandas\io\excel.py", line 257, in __init__
self.book = xlrd.open_workbook(io)
File "C:\Program Files\Anaconda3\lib\site-packages\xlrd\__init__.py", line 422, in open_workbook
ragged_rows=ragged_rows,
File "C:\Program Files\Anaconda3\lib\site-packages\xlrd\xlsx.py", line 833, in open_workbook_2007_xml
x12sheet.process_stream(zflo, heading)
File "C:\Program Files\Anaconda3\lib\site-packages\xlrd\xlsx.py", line 548, in own_process_stream
self_do_row(elem)
File "C:\Program Files\Anaconda3\lib\site-packages\xlrd\xlsx.py", line 685, in do_row
self.sheet.put_cell(rowx, colx, None, float(tvalue), xf_index)
ValueError: could not convert string to float:
For what its worth, here is an example snippet of the input file:
I am guessing that the errors are coming from the first row having blanks beyond the first column - because the errors vanish when I delete the first two rows and . I cannot skip the first two rows, because I want to extract the value in cell A1. I would also like to force the values read to be string type, and will later convert to float with error checking. thanks!
===========
Update(Aug 9 10AM EDT): Using Charlie's suggestion, was able to open excel file in read only mode; and was able to read most of the contents - but still running into an error somewhere.
new code (sorry it is not very pythonic - still a newbie):
wb = openpyxl.load_workbook(r'C:/Temp/Data.xlsx', read_only=True)
ws = wb['Sheet1']
ws.max_row = ws.max_column = None
i=1
for row in ws.rows:
for cell in row:
if i<2000:
i += 1
try:
print(i, cell.value)
except:
print("error")
Error:
Traceback (most recent call last):
File "<ipython-input-65-2e8f3cf2294a>", line 2, in <module>
for row in ws.rows:
File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\worksheet\read_only.py", line 125, in get_squared_range
yield tuple(self._get_row(element, min_col, max_col))
File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\worksheet\read_only.py", line 165, in _get_row
value, data_type, style_id)
File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\cell\read_only.py", line 36, in __init__
self.value = value
File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\cell\read_only.py", line 132, in value
value = _cast_number(value)
File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\cell\read_only.py", line 23, in _cast_number
return long(value)
ValueError: invalid literal for int() with base 10: ' '
=========
Update2 (10:35AM): when i read the file without ws.max_row and ws.max_column set as None, the code was reading just one column, without errors. The value in cell A66 is "Generated from:". But when i read the file with ws.max_row and ws.max_column set as None, this particular cell is causing trouble. But I can read all other cells before that, and that will work fine for me, right now. thanks, #Charlie.
Sounds like the source file is probably corrupt and contains cells that with empty strings that are typed as numbers. You might be able to use openpyxl's read-only mode to skip the first tow rows.
If your program works after you delete the first two rows then lets skip them. try use skiprows to ignore the first 2 rows that are blanks or are headers. you can use the parse method from panda.
xls = pd.read_excel('C:/Temp/Data.xlsx')
df = xls.parse('Sheet1', skiprows=2) #assuming your data is on sheet1.

to_csv error with pandas dataframe with timezone

The following code gives an error under the pandas 0.17 but work very well with the 0.16.2.
No problem with the to_pickle function but get an error with the to_csv.
Has someone a tip to deal with it ?
In[23]: new_index = pd.date_range('2015-01-01', '2015-12-31', freq = 'H', tz='Europe/Paris')
In[24]: df = pd.DataFrame({}, index = new_index)
In[25]: df['test'] = 1.
In[26]: df.to_pickle(r'test.h5')
In[27]: df.to_csv(r'test.csv')
Traceback (most recent call last):
File "C:\Anaconda\lib\site-packages\IPython\core\interactiveshell.py", line 3035, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-27-2ced74ae66e1>", line 1, in <module>
df.to_csv(r'test.csv')
File "C:\Anaconda\lib\site-packages\pandas\core\frame.py", line 1289, in to_csv
formatter.save()
File "C:\Anaconda\lib\site-packages\pandas\core\format.py", line 1494, in save
self._save()
File "C:\Anaconda\lib\site-packages\pandas\core\format.py", line 1594, in _save
self._save_chunk(start_i, end_i)
File "C:\Anaconda\lib\site-packages\pandas\core\format.py", line 1619, in _save_chunk
quoting=self.quoting)
File "C:\Anaconda\lib\site-packages\pandas\core\index.py", line 1292, in to_native_types
return values._format_native_types(**kwargs)
File "C:\Anaconda\lib\site-packages\pandas\tseries\index.py", line 746, in _format_native_types
format = _get_format_datetime64_from_values(self, date_format)
File "C:\Anaconda\lib\site-packages\pandas\core\format.py", line 2191, in _get_format_datetime64_from_values
is_dates_only = _is_dates_only(values)
File "C:\Anaconda\lib\site-packages\pandas\core\format.py", line 2145, in _is_dates_only
values = DatetimeIndex(values)
File "C:\Anaconda\lib\site-packages\pandas\util\decorators.py", line 89, in wrapper
return func(*args, **kwargs)
File "C:\Anaconda\lib\site-packages\pandas\tseries\index.py", line 344, in __new__
ambiguous=ambiguous)
File "pandas\tslib.pyx", line 3753, in pandas.tslib.tz_localize_to_utc (pandas\tslib.c:64516)
AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-10-25 02:00:00'), try using the 'ambiguous' argument
This seems to be a known bug #11619 and should be fixed in 0.17.1
The underlying issue is that your timeframe crosses from standard time to daylight saving time, which is the exact time showing in the error AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-10-25 02:00:00')

Can't export pandas dataframe to excel / encoding

I'm unable to export one of my dataframes due to some encoding difficulty.
sjM.dtypes
Customer Name object
Total Sales float64
Sales Rank float64
Visit_Frequency float64
Last_Sale datetime64[ns]
dtype: object
csv export works fine
path = 'c:\\test'
sjM.to_csv(path + '.csv') # Works
but the excel export fails
sjM.to_excel(path + '.xls')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "testing.py", line 338, in <module>
sjM.to_excel(path + '.xls')
File "c:\Anaconda\Lib\site-packages\pandas\core\frame.py", line 1197, in to_excel
excel_writer.save()
File "c:\Anaconda\Lib\site-packages\pandas\io\excel.py", line 595, in save
return self.book.save(self.path)
File "c:\Anaconda\Lib\site-packages\xlwt\Workbook.py", line 662, in save
doc.save(filename, self.get_biff_data())
File "c:\Anaconda\Lib\site-packages\xlwt\Workbook.py", line 637, in get_biff_data
shared_str_table = self.__sst_rec()
File "c:\Anaconda\Lib\site-packages\xlwt\Workbook.py", line 599, in __sst_rec
return self.__sst.get_biff_record()
File "c:\Anaconda\Lib\site-packages\xlwt\BIFFRecords.py", line 76, in get_biff_record
self._add_to_sst(s)
File "c:\Anaconda\Lib\site-packages\xlwt\BIFFRecords.py", line 91, in _add_to_sst
u_str = upack2(s, self.encoding)
File "c:\Anaconda\Lib\site-packages\xlwt\UnicodeUtils.py", line 50, in upack2
us = unicode(s, encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x81 in position 22: ordinal not in range(128)
I know that the problem is coming from the 'Customer Name' column, as after deletion the export to excel works fine.
I've tried following advice from that question (Python pandas to_excel 'utf8' codec can't decode byte), using a function to decode and re-encode the offending column
def changeencode(data):
cols = data.columns
for col in cols:
if data[col].dtype == 'O':
data[col] = data[col].str.decode('latin-1').str.encode('utf-8')
return data
sJM = changeencode(sjM)
sjM['Customer Name'].str.decode('utf-8')
L2-00864 SETIA 2
K1-00279 BERKAT JAYA
L2-00664 TK. ANTO
BR00035 BRASIL JAYA,TK
RA00011 CV. RAHAYU SENTOSA
so the conversion to unicode appears to be successful
sjM.to_excel(path + '.xls')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "c:\Anaconda\Lib\site-packages\pandas\core\frame.py", line 1197, in to_excel
excel_writer.save()
File "c:\Anaconda\Lib\site-packages\pandas\io\excel.py", line 595, in save
return self.book.save(self.path)
File "c:\Anaconda\Lib\site-packages\xlwt\Workbook.py", line 662, in save
doc.save(filename, self.get_biff_data())
File "c:\Anaconda\Lib\site-packages\xlwt\Workbook.py", line 637, in get_biff_data
shared_str_table = self.__sst_rec()
File "c:\Anaconda\Lib\site-packages\xlwt\Workbook.py", line 599, in __sst_rec
return self.__sst.get_biff_record()
File "c:\Anaconda\Lib\site-packages\xlwt\BIFFRecords.py", line 76, in get_biff_record
self._add_to_sst(s)
File "c:\Anaconda\Lib\site-packages\xlwt\BIFFRecords.py", line 91, in _add_to_sst
u_str = upack2(s, self.encoding)
File "c:\Anaconda\Lib\site-packages\xlwt\UnicodeUtils.py", line 50, in upack2
us = unicode(s, encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 22: ordinal not in range(128)
Why does it fails, even though the conversion to unicode appears to be successful ?
How can i work around this issue to export that dataframe to excel ?
#Jeff
Thanks for showing me the right direction
steps used :
install xlsxwriter (not bundled with pandas)
sjM.to_excel(path + '.xlsx', sheet_name='Sheet1', engine='xlsxwriter')
You need to use pandas >= 0.13, and the xlsxwriter engine for excel, which supports native unicode writing. xlwt, the default engine will support passing an encoding option will be available in 0.14.
see here for the engine docs.

Categories