to_csv error with pandas dataframe with timezone - python

The following code raises an error under pandas 0.17 but works fine with 0.16.2.
to_pickle works without any problem, but to_csv raises an error.
Does anyone have a tip for dealing with it?
In[23]: new_index = pd.date_range('2015-01-01', '2015-12-31', freq = 'H', tz='Europe/Paris')
In[24]: df = pd.DataFrame({}, index = new_index)
In[25]: df['test'] = 1.
In[26]: df.to_pickle(r'test.h5')
In[27]: df.to_csv(r'test.csv')
Traceback (most recent call last):
File "C:\Anaconda\lib\site-packages\IPython\core\interactiveshell.py", line 3035, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-27-2ced74ae66e1>", line 1, in <module>
df.to_csv(r'test.csv')
File "C:\Anaconda\lib\site-packages\pandas\core\frame.py", line 1289, in to_csv
formatter.save()
File "C:\Anaconda\lib\site-packages\pandas\core\format.py", line 1494, in save
self._save()
File "C:\Anaconda\lib\site-packages\pandas\core\format.py", line 1594, in _save
self._save_chunk(start_i, end_i)
File "C:\Anaconda\lib\site-packages\pandas\core\format.py", line 1619, in _save_chunk
quoting=self.quoting)
File "C:\Anaconda\lib\site-packages\pandas\core\index.py", line 1292, in to_native_types
return values._format_native_types(**kwargs)
File "C:\Anaconda\lib\site-packages\pandas\tseries\index.py", line 746, in _format_native_types
format = _get_format_datetime64_from_values(self, date_format)
File "C:\Anaconda\lib\site-packages\pandas\core\format.py", line 2191, in _get_format_datetime64_from_values
is_dates_only = _is_dates_only(values)
File "C:\Anaconda\lib\site-packages\pandas\core\format.py", line 2145, in _is_dates_only
values = DatetimeIndex(values)
File "C:\Anaconda\lib\site-packages\pandas\util\decorators.py", line 89, in wrapper
return func(*args, **kwargs)
File "C:\Anaconda\lib\site-packages\pandas\tseries\index.py", line 344, in __new__
ambiguous=ambiguous)
File "pandas\tslib.pyx", line 3753, in pandas.tslib.tz_localize_to_utc (pandas\tslib.c:64516)
AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-10-25 02:00:00'), try using the 'ambiguous' argument

This seems to be a known bug (#11619) and should be fixed in 0.17.1.
The underlying issue is that your time range crosses the switch from daylight saving time back to standard time. On 2015-10-25 the clocks in Europe/Paris go back one hour, so the hour starting at 02:00 occurs twice. That is exactly the timestamp shown in the error: AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-10-25 02:00:00')
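If upgrading is not an option, one possible workaround is to convert the index to UTC before writing, since UTC has no repeated hours. The snippet below is only a sketch: I have not checked it against 0.17.0 itself, the shorter date range and the in-memory buffer are my own stand-ins for the original data and for test.csv, and freq='60min' is used in place of 'H' only because it is spelled the same way across pandas versions.

```python
import io
import pandas as pd

# Hourly index spanning the autumn clock change in Europe/Paris.
idx = pd.date_range('2015-10-24', '2015-10-26', freq='60min', tz='Europe/Paris')
df = pd.DataFrame({'test': 1.0}, index=idx)

# Convert to UTC first: every instant has exactly one UTC label.
df_utc = df.tz_convert('UTC')

buf = io.StringIO()  # in-memory stand-in for 'test.csv'
df_utc.to_csv(buf)
csv_text = buf.getvalue()

# Reading back: parse the index as UTC, then convert to the original zone.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)
restored.index = pd.to_datetime(restored.index, utc=True).tz_convert('Europe/Paris')
```

The file then stores UTC timestamps, so whoever reads it has to convert back to Europe/Paris as in the last two lines.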


Why am I getting this invalid syntax error?

My code:
import statsmodels.formula.api as smf
model = smf.ols("Delivery Time ~ Sorting Time" , data = dataset).fit()
My error:
File "/usr/local/lib/python3.8/dist-packages/IPython/core/interactiveshell.py", line 3326, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-41-5de8902dbe7c>", line 2, in <module>
model = smf.ols("Delivery Time ~ Sorting Time" , data = dataset).fit()
File "/usr/local/lib/python3.8/dist-packages/statsmodels/base/model.py", line 169, in from_formula
tmp = handle_formula_data(data, None, formula, depth=eval_env,
File "/usr/local/lib/python3.8/dist-packages/statsmodels/formula/formulatools.py", line 63, in handle_formula_data
result = dmatrices(formula, Y, depth, return_type='dataframe',
File "/usr/local/lib/python3.8/dist-packages/patsy/highlevel.py", line 309, in dmatrices
(lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env,
File "/usr/local/lib/python3.8/dist-packages/patsy/highlevel.py", line 164, in _do_highlevel_design
design_infos = _try_incr_builders(formula_like, data_iter_maker, eval_env,
File "/usr/local/lib/python3.8/dist-packages/patsy/highlevel.py", line 66, in _try_incr_builders
return design_matrix_builders([formula_like.lhs_termlist,
File "/usr/local/lib/python3.8/dist-packages/patsy/build.py", line 689, in design_matrix_builders
factor_states = _factors_memorize(all_factors, data_iter_maker, eval_env)
File "/usr/local/lib/python3.8/dist-packages/patsy/build.py", line 354, in _factors_memorize
which_pass = factor.memorize_passes_needed(state, eval_env)
File "/usr/local/lib/python3.8/dist-packages/patsy/eval.py", line 478, in memorize_passes_needed
subset_names = [name for name in ast_names(self.code)
File "/usr/local/lib/python3.8/dist-packages/patsy/eval.py", line 478, in <listcomp>
subset_names = [name for name in ast_names(self.code)
File "/usr/local/lib/python3.8/dist-packages/patsy/eval.py", line 109, in ast_names
for node in ast.walk(ast.parse(code)):
File "/usr/lib/python3.8/ast.py", line 47, in parse
return compile(source, filename, mode, flags,
File "<unknown>", line 1
Delivery Time
^
SyntaxError: invalid syntax
I tried adding parentheses. I tried adding underscores to the strings. It still did not work.
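The last frames of the traceback show patsy passing each term of the formula to ast.parse, so every term has to be valid Python source, and a name with a space in it is not. The sketch below reproduces that check with the standard library only. The helper parses_as_python and the renaming dictionary are my own illustrations, not part of statsmodels or patsy. patsy also provides Q("...") for quoting awkward column names, and the check shows that a term written that way does parse.

```python
import ast

def parses_as_python(term):
    """Return True if `term` is valid Python source (what ast.parse requires)."""
    try:
        ast.parse(term)
        return True
    except SyntaxError:
        return False

space_ok = parses_as_python("Delivery Time")          # name containing a space
underscore_ok = parses_as_python("Delivery_Time")     # plain identifier
quoted_ok = parses_as_python('Q("Delivery Time")')    # patsy-style quoting

# Hypothetical renaming map: the formula and the DataFrame columns must agree,
# e.g. dataset.rename(columns=renamed) before fitting.
columns = ["Delivery Time", "Sorting Time"]
renamed = {name: name.replace(" ", "_") for name in columns}
formula = "{} ~ {}".format(renamed["Delivery Time"], renamed["Sorting Time"])
```

If underscores were added only inside the formula string, the columns in dataset would still have the old names, which may be why that attempt did not work either.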

Error merging multiple CSV files - Python

I'm trying to merge several CSV files into one.
After looking at several methods, I found this one:
files = glob.glob("D:\\green_lake\\Projects\\covid_19\\tabelas_relacao\\acre\\*.csv")
files_merged = pd.concat([pd.read_csv(df) for df in files], ignore_index=True)
When I run it, this error is returned:
>>> files_merged = pd.concat([pd.read_csv(df) for df in files], ignore_index=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <listcomp>
File "C:\Users\Leonardo\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\Leonardo\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 678, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\Leonardo\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 581, in _read
return parser.read(nrows)
File "C:\Users\Leonardo\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1253, in read
index, columns, col_dict = self._engine.read(nrows)
File "C:\Users\Leonardo\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 225, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas\_libs\parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas\_libs\parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas\_libs\parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 243, saw 4
I'm starting to study python and if it's a stupid mistake, I apologize ;)
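"Expected 1 fields in line 243, saw 4" means the parser decided from the start of a file that rows have one field and then met a row with four. One way to find out which file and which line is to count fields per row with the standard csv module before handing anything to pandas. This is only a diagnostic sketch: the function name field_count_report and the sample text are made up for illustration, and the real cause (a preamble before the header, a different delimiter, and so on) depends on your files.

```python
import csv
import io
from collections import Counter

def field_count_report(handle, delimiter=","):
    """Return (fields in first row, Counter of row widths, first differing line)."""
    reader = csv.reader(handle, delimiter=delimiter)
    counts = Counter()
    expected = None
    first_odd_line = None
    for row in reader:
        if expected is None:
            expected = len(row)  # width of the very first row
        counts[len(row)] += 1
        if first_odd_line is None and len(row) != expected:
            first_odd_line = reader.line_num
    return expected, counts, first_odd_line

# Made-up sample: two free-text lines followed by a real 4-column table.
sample = "Report title\nmore preamble\na,b,c,d\n1,2,3,4\n"
expected, counts, first_odd_line = field_count_report(io.StringIO(sample))
```

For a real file, open it with open(path, newline="") and pass the handle in. If the report shows a few one-field lines before the real table, the skiprows argument of read_csv may be what you need; if every row has one field, try another delimiter such as ";".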

TypeError in Dask dataframe while converting to pandas using compute()

I can't figure out what the problem is in the code below.
I am using dask to merge several dataframes. After merging, I want to find the unique values in one of the columns. I get a TypeError when converting from dask to pandas with unique().compute(), but I cannot find what the actual problem is. The message says that a str cannot be interpreted as an integer, yet the code passes for some of the files and fails for others. I also cannot find a problem with the data structure.
Any suggestions?
import pandas as pd
import dask.dataframe as dd
# Everything is fine until merging
# I have put several print(markers) to find the problem code
print('dask cols')
print(df_by_dask_merged.columns)
print()
print(dask_cols)
print()
print('find unique contigs values in dask dataframe')
pd_df = df_by_dask_merged['contig']
print(pd_df)
print()
print('mark 02')
# this is the problem code ??
pd_df_contig = pd_df.unique().compute()
print(pd_df_contig)
print('mark 03')
Output on Terminal:
dask cols
Index(['contig', 'pos', 'ref', 'all-alleles', 'ms01e_PI', 'ms01e_PG_al',
'ms02g_PI', 'ms02g_PG_al', 'all-freq'],
dtype='object')
['contig', 'pos', 'ref', 'all-alleles', 'ms01e_PI', 'ms01e_PG_al', 'ms02g_PI', 'ms02g_PG_al', 'all-freq']
find unique contigs values in dask dataframe
Dask Series Structure:
npartitions=1
int64
...
Name: contig, dtype: int64
Dask Name: getitem, 52 tasks
mark 02
Traceback (most recent call last):
File "/home/everestial007/.local/lib/python3.5/site-packages/pandas/indexes/base.py", line 2145, in get_value
return tslib.get_value_box(s, key)
File "pandas/tslib.pyx", line 880, in pandas.tslib.get_value_box (pandas/tslib.c:17368)
File "pandas/tslib.pyx", line 889, in pandas.tslib.get_value_box (pandas/tslib.c:17042)
TypeError: 'str' object cannot be interpreted as an integer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "merge_haplotype.py", line 305, in <module>
main()
File "merge_haplotype.py", line 152, in main
pd_df_contig = pd_df.unique().compute()
File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/base.py", line 155, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/base.py", line 404, in compute
results = get(dsk, keys, **kwargs)
File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/threaded.py", line 75, in get
pack_exception=pack_exception, **kwargs)
File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/local.py", line 521, in get_async
raise_exception(exc, tb)
File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/compatibility.py", line 67, in reraise
raise exc
File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/local.py", line 290, in execute_task
result = _execute_task(task, data)
File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/local.py", line 271, in _execute_task
return func(*args2)
File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/dataframe/core.py", line 3404, in apply_and_enforce
df = func(*args, **kwargs)
File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/utils.py", line 687, in __call__
return getattr(obj, self.method)(*args, **kwargs)
File "/home/everestial007/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 4133, in apply
return self._apply_standard(f, axis, reduce=reduce)
File "/home/everestial007/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 4229, in _apply_standard
results[i] = func(v)
File "merge_haplotype.py", line 249, in <lambda>
apply(lambda row : update_cols(row, sample_name), axis=1, meta=(int))
File "merge_haplotype.py", line 278, in update_cols
if 'N|N' in df_by_dask[sample_name + '_PG_al']:
File "/home/everestial007/.local/lib/python3.5/site-packages/pandas/core/series.py", line 601, in __getitem__
result = self.index.get_value(self, key)
File "/home/everestial007/.local/lib/python3.5/site-packages/pandas/indexes/base.py", line 2153, in get_value
raise e1
File "/home/everestial007/.local/lib/python3.5/site-packages/pandas/indexes/base.py", line 2139, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas/index.pyx", line 105, in pandas.index.IndexEngine.get_value (pandas/index.c:3338)
File "pandas/index.pyx", line 113, in pandas.index.IndexEngine.get_value (pandas/index.c:3041)
File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4024)
File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13161)
File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13115)
KeyError: ('ms02g_PG_al', 'occurred at index 0')

Python error on data.plot(x=data.timestamp, style=".-")

In Python I am trying to use the following command to get timestamp info on my plot. I've imported the proper packages pandas and pylab. The data is all cleaned as well.
data.plot(x=data.timestamp, style=".-")
I keep getting a massive error with lots of different things in it. I am following along with https://www.youtube.com/watch?v=5XGycFIe8qE and the problem comes up at the 38-minute mark. Here is the error I get:
Traceback (most recent call last):
File "", line 1, in
data.plot(x=data.timestamp, style=".-")
File "C:\Python3\lib\site-packages\pandas\plotting\_core.py", line 2673, in __call__
sort_columns=sort_columns, **kwds)
File "C:\Python3\lib\site-packages\pandas\plotting\_core.py", line 1900, in plot_frame
**kwds)
File "C:\Python3\lib\site-packages\pandas\plotting\_core.py", line 1727, in _plot
plot_obj.generate()
File "C:\Python3\lib\site-packages\pandas\plotting\_core.py", line 260, in generate
self._post_plot_logic_common(ax, self.data)
File "C:\Python3\lib\site-packages\pandas\plotting\_core.py", line 395, in _post_plot_logic_common
self._apply_axis_properties(ax.yaxis, fontsize=self.fontsize)
File "C:\Python3\lib\site-packages\pandas\plotting\_core.py", line 468, in _apply_axis_properties
labels = axis.get_majorticklabels() + axis.get_minorticklabels()
File "C:\Python3\lib\site-packages\matplotlib\axis.py", line 1188, in get_majorticklabels
ticks = self.get_major_ticks()
File "C:\Python3\lib\site-packages\matplotlib\axis.py", line 1339, in get_major_ticks
numticks = len(self.get_major_locator()())
File "C:\Python3\lib\site-packages\matplotlib\dates.py", line 1054, in __call__
self.refresh()
File "C:\Python3\lib\site-packages\matplotlib\dates.py", line 1074, in refresh
dmin, dmax = self.viewlim_to_dt()
File "C:\Python3\lib\site-packages\matplotlib\dates.py", line 832, in viewlim_to_dt
return num2date(vmin, self.tz), num2date(vmax, self.tz)
File "C:\Python3\lib\site-packages\matplotlib\dates.py", line 441, in num2date
return _from_ordinalf(x, tz)
File "C:\Python3\lib\site-packages\matplotlib\dates.py", line 256, in _from_ordinalf
dt = datetime.datetime.fromordinal(ix).replace(tzinfo=UTC)
ValueError: ordinal must be >= 1
I had a similar issue. I believe it is a version difference: the pandas and Jupyter Notebook versions that you and I are using are not the ones Quentin uses in his examples here: https://github.com/QCaudron/pydata_pandas
Try this:
data.plot('timestamp', style='.-')
or
data.plot(x='timestamp', style='.-')
Per the pandas docs (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html), x accepts either a column label or a position, so passing the column name should work:
DataFrame
x : label or position, default None
y : label or position, default None
Allows plotting of one column versus another

python - How to handle "old" dates when transferring data to excel

I have a dataframe in which one of the columns contains date strings. I first convert it to datetime with:
mydf['Desk Date'] = pd.to_datetime(mydf['Desk Date'])
and then drop the dataframe to excel with
Range('A1').value = mydf
I get the following error:
Traceback (most recent call last):
File "C:\Program Files (x86)\Python271\lib\site-packages\IPython\core\interactiveshell.py", line 3035, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-111-6c6f5ea1ff17>", line 1, in <module>
Import.ImportFWD(test_path)
File "C:\Users\jastrzem\Downloads\pyWFP\Import.py", line 42, in ImportFWD
Range('A1').value = mydf
File "C:\Program Files (x86)\Python271\lib\site-packages\xlwings\main.py", line 818, in value
self.row1, self.col1, row2, col2), data)
File "C:\Program Files (x86)\Python271\lib\site-packages\xlwings\_xlwindows.py", line 151, in set_value
xl_range.Value = data
File "C:\Program Files (x86)\Python271\lib\site-packages\win32com\client\dynamic.py", line 560, in __setattr__
self._oleobj_.Invoke(entry.dispid, 0, invoke_type, 0, value)
com_error: (-2147352567, 'Exception occurred.', (0, None, None, None, 0, -2146827284), None)
One of the dates is Timestamp('1899-01-31 00:00:00')
which I think is the reason for the error.
I tried to use np.where to substitute all values before year 2000 to NaN, but with no luck.
f = lambda x: x.year
mydf['Desk Date'] = np.where(pd.DataFrame(mydf['Desk Date']).applymap(f) > 2000, pd.to_datetime(mydf['Desk Date'], format='%D/%M/%Y'),np.nan)
How can I fix the above command or alternatively how should I handle dates that are "not transferable" to excel?
Thanks!
[EDIT]:
I tried to use to_excel method but with no luck either. The code I put at the end of my function:
writer = pd.ExcelWriter('test7.xlsx', engine='xlsxwriter')
mydf.to_excel(writer, sheet_name = 'Sheet1')
writer.close()
It creates the file, but the file is empty. I get the following error:
Traceback (most recent call last):
File "C:\Program Files (x86)\Python271\lib\site-packages\IPython\core\interactiveshell.py", line 3035, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-26-6c6f5ea1ff17>", line 1, in <module>
Import.ImportFWD(test_path)
File "C:\Users\jastrzem\Downloads\pyWFP\Import.py", line 44, in ImportFWD
writer.close()
File "C:\Program Files (x86)\Python271\lib\site-packages\pandas\io\excel.py", line 623, in close
return self.save()
File "C:\Program Files (x86)\Python271\lib\site-packages\pandas\io\excel.py", line 1298, in save
return self.book.close()
File "C:\Program Files (x86)\Python271\lib\site-packages\xlsxwriter\workbook.py", line 295, in close
self._store_workbook()
File "C:\Program Files (x86)\Python271\lib\site-packages\xlsxwriter\workbook.py", line 518, in _store_workbook
xml_files = packager._create_package()
File "C:\Program Files (x86)\Python271\lib\site-packages\xlsxwriter\packager.py", line 140, in _create_package
self._write_shared_strings_file()
File "C:\Program Files (x86)\Python271\lib\site-packages\xlsxwriter\packager.py", line 280, in _write_shared_strings_file
sst._assemble_xml_file()
File "C:\Program Files (x86)\Python271\lib\site-packages\xlsxwriter\sharedstrings.py", line 53, in _assemble_xml_file
self._write_sst_strings()
File "C:\Program Files (x86)\Python271\lib\site-packages\xlsxwriter\sharedstrings.py", line 83, in _write_sst_strings
self._write_si(string)
File "C:\Program Files (x86)\Python271\lib\site-packages\xlsxwriter\sharedstrings.py", line 110, in _write_si
self._xml_si_element(string, attributes)
File "C:\Program Files (x86)\Python271\lib\site-packages\xlsxwriter\xmlwriter.py", line 122, in _xml_si_element
self.fh.write("""<si><t%s>%s</t></si>""" % (attr, string))
File "C:\Program Files (x86)\Python271\lib\codecs.py", line 694, in write
return self.writer.write(data)
File "C:\Program Files (x86)\Python271\lib\codecs.py", line 357, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 26: ordinal not in range(128)
The error is not because of the old date, but because you are trying to throw a whole dataframe at a single cell.
Instead, use the to_excel method.
Excel will not accept dates before 1900. My workaround is to replace "old" dates with np.nan since I know they are data errors anyway.
mydf['Desk Date'] = pd.to_datetime(mydf['Desk Date'])
dates_list = list(mydf['Desk Date'])
dates_list = [x if x.year > 1900 else np.nan for x in dates_list ]
mydf['Desk Date'] = dates_list
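The same replacement can probably be done without building a Python list, using Series.where, which keeps values where the condition holds and puts a missing value (NaT for a datetime column) everywhere else. This is a sketch of that idea on three made-up dates, keeping the same cutoff as the loop above (year strictly greater than 1900):

```python
import pandas as pd

# Made-up sample data; only the column name comes from the question.
mydf = pd.DataFrame({'Desk Date': ['1899-01-31', '2015-03-02', '1900-06-01']})
mydf['Desk Date'] = pd.to_datetime(mydf['Desk Date'])

# Keep dates whose year is after 1900; everything else becomes NaT.
keep = mydf['Desk Date'].dt.year > 1900
mydf['Desk Date'] = mydf['Desk Date'].where(keep)

missing_flags = mydf['Desk Date'].isna().tolist()
```

The column stays a datetime column this way, which should make later date handling a bit easier than a list that mixes Timestamps and np.nan.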
