Write to a Stata .dta file from a notebook in Python

I'm trying to write a pandas DataFrame to a Stata .dta file. Following the advice given in Save .dta files in python, I wrote:
import pandas as pd
df.to_stata(workdir+' generosity.dta')
and I got the error TypeError: object of type 'float' has no len(). I'm not sure what this means.
Most columns in df are objects, but there are three columns that are float64.
I tried another method (as described in this post: Convert .CSV files to .DTA files in Python) via rpy2, but when I tried to install it, I received the error "Error: Tried to guess R's HOME but no R command in the path", so I've given up on it (I have R on my computer but have never used it).
Thank you very much.
edit: here is the result:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-140-7a8f8bc8d446> in <module>()
1 #write the dataframe as a Stata file
----> 2 df.to_stata(workdir+group+' generosity.dta')
C:\Users\chungk\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\frame.pyc in to_stata(self, fname, convert_dates, write_index, encoding, byteorder, time_stamp, data_label)
1262 time_stamp=time_stamp, data_label=data_label,
1263 write_index=write_index)
-> 1264 writer.write_file()
1265
1266 @Appender(fmt.docstring_to_string, indents=1)
C:\Users\chungk\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\stata.pyc in write_file(self)
1245 self._write(_pad_bytes("", 5))
1246 if self._convert_dates is None:
-> 1247 self._write_data_nodates()
1248 else:
1249 self._write_data_dates()
C:\Users\chungk\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\stata.pyc in _write_data_nodates(self)
1327 if var is None or var == np.nan:
1328 var = _pad_bytes('', typ)
-> 1329 if len(var) < typ:
1330 var = _pad_bytes(var, typ)
1331 if compat.PY3:
TypeError: object of type 'float' has no len()
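The traceback hints at the cause: in _write_data_nodates, the check var == np.nan is always False (NaN never compares equal to anything, including itself), so a float NaN sitting in one of your object columns slips through to len(var) and fails. A possible workaround is to clean the object columns before writing (a minimal sketch, not from the original thread; it assumes empty strings are an acceptable stand-in for missing values):
```
import pandas as pd

# Replace NaN in object (string) columns with empty strings and force str,
# so the Stata writer never calls len() on a float
for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].fillna('').astype(str)

df.to_stata(workdir + ' generosity.dta')
```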

Related

Pandas unable to read the CSV file? Showing this error

import pandas as pd
import tensorflow_hub as hub  # assumed import for hub.Module

TF_MODEL_URL = 'https://tfhub.dev/google/on_device_vision/classifier/landmarks_classifier_asia_V1/1'
mo = hub.Module(TF_MODEL_URL)
IMAGE_SHAPE = (321, 321)
df = pd.read_csv(LABLE_MAP_URL)  # LABLE_MAP_URL (the label map CSV) is defined elsewhere
The error is:
if self.low_memory:
--> 230 chunks = self._reader.read_low_memory(nrows)
231 # destructive to chunks
232 data = _concatenate_chunks(chunks)
1775 index,
1776 columns,
1777 col_dict,
-> 1778 ) = self._engine.read( # type: ignore[attr-defined]
1779 nrows
1780 )
deprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper(*args, **kwargs)
209 else:
210 kwargs[new_arg_name] = new_arg_value
--> 211 return func(*args, **kwargs)
The traceback is from pandas' IO tools, so the error likely occurred while reading the .csv. Since you didn't share the file and this is not a reproducible example, you should check the file and see what went wrong. You also didn't show the entire traceback, so it is difficult to tell exactly what kind of error it is, but the part you provided looks similar to the section of pandas' official documentation on malformed lines with too many fields.
Edit:
As suspected, the error you showed does appear to come from bad lines in the dataset, so this may be a duplicate. Have you tried
data = pd.read_csv(LABLE_MAP_URL, on_bad_lines='skip')
as the answer in the duplicate suggested?
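If you would rather inspect the malformed lines instead of silently dropping them, on_bad_lines also accepts a callable (a minimal sketch; this requires pandas >= 1.4 with engine='python', and LABLE_MAP_URL is the variable from your own code):
```
import pandas as pd

bad_rows = []

def log_bad_line(line):
    bad_rows.append(line)  # pandas passes the split fields of each bad line
    return None            # returning None skips the line

df = pd.read_csv(LABLE_MAP_URL, engine='python', on_bad_lines=log_bad_line)
print('skipped %d malformed lines' % len(bad_rows))
```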

Dask Dataframe: Resample partitioned data loaded from multiple parquet files

I am loading multiple parquet files containing time-series data together, but the loaded Dask dataframe has unknown partitions, because of which I can't apply various time-series operations on it.
df = dd.read_parquet('/path/to/*.parquet', index='Timestamps')
For instance, df_resampled = df.resample('1T').mean().compute() gives following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-8e6f7f4340fd> in <module>
1 df = dd.read_parquet('/path/to/*.parquet', index='Timestamps')
----> 2 df_resampled = df.resample('1T').mean().compute()
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in resample(self, rule, closed, label)
2627 from .tseries.resample import Resampler
2628
-> 2629 return Resampler(self, rule, closed=closed, label=label)
2630
2631 @derived_from(pd.DataFrame)
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/tseries/resample.py in __init__(self, obj, rule, **kwargs)
118 "for more information."
119 )
--> 120 raise ValueError(msg)
121 self.obj = obj
122 self._rule = pd.tseries.frequencies.to_offset(rule)
ValueError: Can only resample dataframes with known divisions
See https://docs.dask.org/en/latest/dataframe-design.html#partitions
for more information.
I went to the link https://docs.dask.org/en/latest/dataframe-design.html#partitions, and it says:
In these cases (when divisions are unknown), any operation that requires a cleanly partitioned DataFrame with known divisions will have to perform a sort. This can generally be achieved by calling df.set_index(...).
I then tried the following, but with no success.
df = dd.read_parquet('/path/to/*.parquet')
df = df.set_index('Timestamps')
This step throws the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-468e9af0c4d6> in <module>
1 df = dd.read_parquet(os.path.join(OUTPUT_DATA_DIR, '20*.gzip'))
----> 2 df.set_index('Timestamps')
3 # df_resampled = df.resample('1T').mean().compute()
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in set_index(***failed resolving arguments***)
3915 npartitions=npartitions,
3916 divisions=divisions,
-> 3917 **kwargs,
3918 )
3919
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/shuffle.py in set_index(df, index, npartitions, shuffle, compute, drop, upsample, divisions, partition_size, **kwargs)
483 if divisions is None:
484 sizes = df.map_partitions(sizeof) if repartition else []
--> 485 divisions = index2._repartition_quantiles(npartitions, upsample=upsample)
486 mins = index2.map_partitions(M.min)
487 maxes = index2.map_partitions(M.max)
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in __getattr__(self, key)
3755 return self[key]
3756 else:
-> 3757 raise AttributeError("'DataFrame' object has no attribute %r" % key)
3758
3759 def __dir__(self):
AttributeError: 'DataFrame' object has no attribute '_repartition_quantiles'
Can anybody suggest the right way to load multiple time-series files as a Dask dataframe so that pandas-style time-series operations can be applied to it?
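One approach that may help (a minimal sketch, assuming the parquet files are already sorted by their timestamp column; sorted=True lets Dask compute divisions cheaply instead of shuffling):
```
import dask.dataframe as dd

# Load without an index, then set it explicitly.
# sorted=True tells Dask the data is already ordered by this column,
# so it can derive divisions from the partition boundaries.
df = dd.read_parquet('/path/to/*.parquet')
df = df.set_index('Timestamps', sorted=True)

# With known divisions, time-series operations work:
df_resampled = df.resample('1T').mean().compute()
```
The AttributeError raised inside set_index also looks like it could be a Dask version problem, so upgrading Dask may be worth trying as well.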

TypeError in read_parquet Dask

I have a parquet file called data.parquet. I'm using the dask library in Python. When I run the line
import dask.dataframe as dd
df = dd.read_parquet('data.parquet',engine='pyarrow')
I get the error
TypeError Traceback (most recent call last)
<ipython-input-22-807fa43763c1> in <module>
----> 1 df = dd.read_parquet('data.parquet',engine='pyarrow')
~/anaconda3/lib/python3.7/site-packages/dask/dataframe/io/parquet.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, infer_divisions)
1395 categories=categories,
1396 index=index,
-> 1397 infer_divisions=infer_divisions,
1398 )
1399
~/anaconda3/lib/python3.7/site-packages/dask/dataframe/io/parquet.py in _read_pyarrow(fs, fs_token, paths, columns, filters, categories, index, infer_divisions)
858 _open = lambda fn: pq.ParquetFile(fs.open(fn, mode="rb"))
859 for piece in dataset.pieces:
--> 860 pf = piece.get_metadata(_open)
861 # non_empty_pieces.append(piece)
862 if pf.num_row_groups > 0:
TypeError: get_metadata() takes 1 positional argument but 2 were given
I just don't understand why this happens, since this is how it is implemented here.
Any help will be appreciated!
I faced the same problem. I resolved it by upgrading to dask 2.30.0.
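For reference, a quick way to check which version you have before upgrading (a minimal sketch; the pip command in the comment is an assumption, adjust it for conda environments):
```
import dask

print(dask.__version__)
# If this is older than 2.30.0, upgrade, e.g.:
#   pip install --upgrade "dask[dataframe]" pyarrow
```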

Using the max function in Pandas for Python

I am doing an online tutorial in a Jupyter notebook with Python and pandas, and when I run the following code, I run into this error.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
# reading the csv file
titanic = pd.read_csv("titanic.csv")
titanic_class = titanic.groupby("Pclass")
titanic_class.get_group(1)
titanic_class.max()
AssertionError Traceback (most recent call last)
<ipython-input-26-4d1be28a55cb> in <module>
1 #max ticket fare paid
----> 2 titanic_class.max()
~\anaconda3\lib\site-packages\pandas\core\groupby\groupby.py in f(self, **kwargs)
1369 # try a cython aggregation if we can
1370 try:
-> 1371 return self._cython_agg_general(alias, alt=npfunc, **kwargs)
1372 except DataError:
1373 pass
~\anaconda3\lib\site-packages\pandas\core\groupby\generic.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
992 ) -> DataFrame:
993 agg_blocks, agg_items = self._cython_agg_blocks(
--> 994 how, alt=alt, numeric_only=numeric_only, min_count=min_count
995 )
996 return self._wrap_agged_blocks(agg_blocks, items=agg_items)
~\anaconda3\lib\site-packages\pandas\core\groupby\generic.py in _cython_agg_blocks(self, how, alt, numeric_only, min_count)
1098 # Clean up the mess left over from split blocks.
1099 for locs, result in zip(split_items, split_frames):
-> 1100 assert len(locs) == result.shape[1]
1101 for i, loc in enumerate(locs):
1102 new_items.append(np.array([loc], dtype=locs.dtype))
AssertionError:
Can someone tell me what's wrong? titanic_class.sum() and titanic_class.mean() work without any error.
The last column of the CSV file has letters. Once I removed them, the max function worked.
This happens when any column has empty (NaN) values. Try to remove those columns before using max.
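A workaround that may avoid the assertion altogether is to restrict the aggregation to numeric columns (a minimal sketch; numeric_only is supported on groupby aggregations in recent pandas versions, and the column names come from the standard Titanic dataset rather than the original post):
```
import pandas as pd

titanic = pd.read_csv("titanic.csv")
titanic_class = titanic.groupby("Pclass")

# Skip mixed-type object columns (e.g. 'Ticket', 'Cabin') that can
# break the Cython aggregation path, and aggregate only numbers
print(titanic_class.max(numeric_only=True))
```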

Python Float to Int Conversion Error

I'm pretty new to Python. My goal is to change a float into an int. The float is a Series, and there are no NaNs. I've checked out quite a few posts, including Pandas: change data type of Series to String.
I've tried a few different types of syntax:
```comp.month.apply(int)```
Here's the error that followed that.
```TypeError Traceback (most recent call last) <ipython-input-190-690a8228abec> in <module>()
----> 1 comp.month.apply(int)
/Users/halliebregman/anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in apply(self, func, convert_dtype, args, **kwds)
2058 values = lib.map_infer(values, lib.Timestamp)
----> 2060 mapped = lib.map_infer(values, f, convert=convert_dtype)
2061 if len(mapped) and isinstance(mapped[0], Series):
2062 from pandas.core.frame import DataFrame
pandas/src/inference.pyx in pandas.lib.map_infer (pandas/lib.c:58435)()
TypeError: 'file' object is not callable```
and also:
```with open ("ints.csv", "w") as ints:
for i in range(len(comp)):
months = int(comp['month'][i])
days = int(comp['day'][i])
print months, days
ints.write('{} {} \n'.format(months, days))```
Followed by this error:
```TypeError Traceback (most recent call last) <ipython-input-191-0d6fe0a99830> in <module>()
1 with open ("ints.csv", "w") as ints:
2 for i in range(len(comp)):
----> 3 months = int(comp['month'][i])
4 days = int(comp['day'][i])
5 print months, days
TypeError: 'file' object is not callable```
What am I missing here? It seems like this should be simple :/
Thanks!
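Both tracebacks raise TypeError: 'file' object is not callable exactly where int(...) is called, which strongly suggests the name int was rebound to a file object earlier in the session. A minimal sketch reproducing the symptom (the shadowing line is hypothetical, not from the original post):
```
# Hypothetical earlier cell: the builtin name now points at a file
int = open("ints.csv", "w")

int(3.0)  # TypeError: 'file' object is not callable

# Fix: remove the shadowing name (or restart the kernel),
# then convert normally
del int
months = comp['month'].astype(int)
```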
