pandas unable to read the CSV file? Showing this error - Python

import pandas as pd
import tensorflow_hub as hub

TF_MODEL_URL = 'https://tfhub.dev/google/on_device_vision/classifier/landmarks_classifier_asia_V1/1'
mo = hub.Module(TF_MODEL_URL)
IMAGE_SHAPE = (321, 321)
df = pd.read_csv(LABLE_MAP_URL)  # LABLE_MAP_URL is defined elsewhere in the notebook
The error is:
if self.low_memory:
--> 230 chunks = self._reader.read_low_memory(nrows)
231 # destructive to chunks
232 data = _concatenate_chunks(chunks)
1775 index,
1776 columns,
1777 col_dict,
-> 1778 ) = self._engine.read( # type: ignore[attr-defined]
1779 nrows
1780 )
deprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper(*args, **kwargs)
209 else:
210 kwargs[new_arg_name] = new_arg_value
--> 211 return func(*args, **kwargs)

The traceback is from pandas IO tools, so the error likely occurred while reading the .csv file. Since you didn't show the file and this is not a reproducible example, you should check the file and see what went wrong. You also didn't show the entire traceback, so it is difficult to tell what kind of error it is, but the part you did provide looks similar to the section of pandas' official documentation on malformed lines with too many fields.
Edit:
As suspected, the error you showed does appear to come from bad lines in the dataset, so this may be a possible dupe. Have you tried
data = pd.read_csv(LABLE_MAP_URL, on_bad_lines='skip')
as the answer in the dupe suggested?
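For reference, a minimal sketch of both forms of that call, hedged on your pandas version (on_bad_lines exists from pandas 1.3 onward; older releases use the error_bad_lines/warn_bad_lines keywords). LABLE_MAP_URL here is whatever label-map URL you are already using in your notebook:
import pandas as pd

# pandas >= 1.3: 'skip' silently drops malformed rows, 'warn' reports each one
df = pd.read_csv(LABLE_MAP_URL, on_bad_lines='warn')

# pandas < 1.3: the equivalent older keywords
# df = pd.read_csv(LABLE_MAP_URL, error_bad_lines=False, warn_bad_lines=True)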

Related

Dask Dataframe: Resample partitioned data loaded from multiple parquet files

I am loading multiple parquet files containing time series data together, but the loaded Dask dataframe has unknown partitions, so I can't apply various time series operations to it.
df = dd.read_parquet('/path/to/*.parquet', index='Timestamps')
For instance, df_resampled = df.resample('1T').mean().compute() gives the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-8e6f7f4340fd> in <module>
1 df = dd.read_parquet('/path/to/*.parquet', index='Timestamps')
----> 2 df_resampled = df.resample('1T').mean().compute()
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in resample(self, rule, closed, label)
2627 from .tseries.resample import Resampler
2628
-> 2629 return Resampler(self, rule, closed=closed, label=label)
2630
2631 @derived_from(pd.DataFrame)
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/tseries/resample.py in __init__(self, obj, rule, **kwargs)
118 "for more information."
119 )
--> 120 raise ValueError(msg)
121 self.obj = obj
122 self._rule = pd.tseries.frequencies.to_offset(rule)
ValueError: Can only resample dataframes with known divisions
See https://docs.dask.org/en/latest/dataframe-design.html#partitions
for more information.
I went to the link: https://docs.dask.org/en/latest/dataframe-design.html#partitions and it says,
In these cases (when divisions are unknown), any operation that requires a cleanly partitioned DataFrame with known divisions will have to perform a sort. This can generally be achieved by calling df.set_index(...).
I then tried the following, but with no success.
df = dd.read_parquet('/path/to/*.parquet')
df = df.set_index('Timestamps')
This step throws the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-468e9af0c4d6> in <module>
1 df = dd.read_parquet(os.path.join(OUTPUT_DATA_DIR, '20*.gzip'))
----> 2 df.set_index('Timestamps')
3 # df_resampled = df.resample('1T').mean().compute()
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in set_index(***failed resolving arguments***)
3915 npartitions=npartitions,
3916 divisions=divisions,
-> 3917 **kwargs,
3918 )
3919
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/shuffle.py in set_index(df, index, npartitions, shuffle, compute, drop, upsample, divisions, partition_size, **kwargs)
483 if divisions is None:
484 sizes = df.map_partitions(sizeof) if repartition else []
--> 485 divisions = index2._repartition_quantiles(npartitions, upsample=upsample)
486 mins = index2.map_partitions(M.min)
487 maxes = index2.map_partitions(M.max)
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in __getattr__(self, key)
3755 return self[key]
3756 else:
-> 3757 raise AttributeError("'DataFrame' object has no attribute %r" % key)
3758
3759 def __dir__(self):
AttributeError: 'DataFrame' object has no attribute '_repartition_quantiles'
Can anybody suggest the right way to load multiple time series files as a Dask dataframe so that pandas-style time series operations can be applied to it?
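Not from the thread, but following the documentation quoted above, here is a minimal sketch of one way to obtain known divisions, assuming the parquet files hold a 'Timestamps' column and the data is already globally sorted by it (sorted=True lets Dask derive divisions without a shuffle):
import dask.dataframe as dd

# Read without an index, then set it; sorted=True tells Dask the data is
# already ordered by 'Timestamps', so divisions can be computed cheaply.
df = dd.read_parquet('/path/to/*.parquet')
df = df.set_index('Timestamps', sorted=True)

# With known divisions, time-based operations such as resample work.
df_resampled = df.resample('1T').mean().compute()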

Bokeh: ValueError: Out of range float values are not JSON compliant

I came across this discussion (from a year ago): https://github.com/bokeh/bokeh/issues/2392
I also saw the white screen without any errors, and then I tried to take a small subset of 2 columns and tried the below.
Since pandas also pulls in a bunch of rows with empty data, I tried dropna, but this resulted in there being no data at all. So instead I just specified the rows that should go into the df (hence the df = df.head(n=19) line):
import pandas as pd
from bokeh.plotting import figure, output_file, show
df = pd.read_excel(path, sheetname, parse_cols="A:B")
df = df.head(n=19)
print(df)
rtngs = ['iAAA','iAA+','iAA','iAA-','iA+','iA','iA-','iBBB+','iBBB','iBBB-','iBB+','iBB','iBB-','iB+','iB','iB-','NR','iCCC+']
x = df['Score']
output_file("line.html")
p = figure(plot_width=400, plot_height=400, x_range=(0, 100), y_range=rtngs)
# add a circle renderer with a size, color, and alpha
p.circle(df['Score'], df['Rating'], size=20, color="navy", alpha=0.5)
# show the results
#output_notebook()
show(p)
df:
Rating Score
0 iAAA 64.0
1 iAA+ 33.0
2 iAA 7.0
3 iAA- 28.0
4 iA+ 36.0
5 iA 62.0
6 iA- 99.0
7 iBBB+ 10.0
8 iBBB 93.0
9 iBBB- 91.0
10 iBB+ 79.0
11 iBB 19.0
12 iBB- 95.0
13 iB+ 26.0
14 iB 9.0
15 iB- 26.0
16 NR 49.0
17 iCCC+ 51.0
18 iAAA 18.0
The above shows me output within the notebook, but still throws: ValueError: Out of range float values are not JSON compliant
It also (hence?) doesn't produce the output file. How do I get rid of this error for this small subset? Is it related to NaN values? Would fixing that also solve the 'white screen of death' issue for the larger dataset?
Thanks very much for taking a look!
In case you would like to see the entire error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-4fa6b88aa415> in <module>()
16 # show the results
17 #output_notebook()
---> 18 show(p)
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\site-packages\bokeh\io.py in show(obj, browser, new)
300 if obj not in _state.document.roots:
301 _state.document.add_root(obj)
--> 302 return _show_with_state(obj, _state, browser, new)
303
304
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\site-packages\bokeh\io.py in _show_with_state(obj, state, browser, new)
310
311 if state.notebook:
--> 312 comms_handle = _show_notebook_with_state(obj, state)
313 shown = True
314
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\site-packages\bokeh\io.py in _show_notebook_with_state(obj, state)
334 comms_target = make_id()
335 publish_display_data({'text/html': notebook_div(obj, comms_target)})
--> 336 handle = _CommsHandle(get_comms(comms_target), state.document, state.document.to_json())
337 state.last_comms_handle = handle
338 return handle
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\site-packages\bokeh\document.py in to_json(self)
792 # this is a total hack to go via a string, needed because
793 # our BokehJSONEncoder goes straight to a string.
--> 794 doc_json = self.to_json_string()
795 return loads(doc_json)
796
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\site-packages\bokeh\document.py in to_json_string(self, indent)
785 }
786
--> 787 return serialize_json(json, indent=indent)
788
789 def to_json(self):
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\site-packages\bokeh\core\json_encoder.py in serialize_json(obj, encoder, indent, **kwargs)
97 indent = 2
98
---> 99 return json.dumps(obj, cls=encoder, allow_nan=False, indent=indent, separators=separators, sort_keys=True, **kwargs)
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\json\__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
235 check_circular=check_circular, allow_nan=allow_nan, indent=indent,
236 separators=separators, default=default, sort_keys=sort_keys,
--> 237 **kw).encode(obj)
238
239
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\json\encoder.py in encode(self, o)
197 # exceptions aren't as detailed. The list call should be roughly
198 # equivalent to the PySequence_Fast that ''.join() would do.
--> 199 chunks = self.iterencode(o, _one_shot=True)
200 if not isinstance(chunks, (list, tuple)):
201 chunks = list(chunks)
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\json\encoder.py in iterencode(self, o, _one_shot)
255 self.key_separator, self.item_separator, self.sort_keys,
256 self.skipkeys, _one_shot)
--> 257 return _iterencode(o, 0)
258
259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,
ValueError: Out of range float values are not JSON compliant
I had the same error and debugged the problem: I had NaN values in my plotted dataset, and Bokeh's serialize_json() function (in /core/json_encoder.py) does not allow NaN values (I don't know why...). In the return statement of this function, json.dumps() is called with the allow_nan=False argument :(( The problem occurs only in the IO part of the Bokeh process, when the output file is generated (it calls the serialize_json() function above).
So you have to replace the NaN values in your dataframe, e.g.:
df = df.fillna('')
Have a nice day! :)
NaN will be better supported when this Pull Request to add a binary array serialization option is merged. This should be available in Bokeh 0.12.4 in January 2017. Bokeh does not use allow_nan in the Python JSON encoder, because that is not standard: nan and inf are not part of the official JSON specification (an egregious oversight IMO, but out of our control).
Well, it isn't exactly an answer to your question; it's more like my experience from a week of working with Bokeh, in my case trying to make a plot like the Texas example from Bokeh. After a lot of frustration I noticed that when Bokeh (or JSON, or whatever) encounters a NaN as the first value of the list to be plotted (myList), it refuses to plot, giving the message
ValueError: Out of range float values are not JSON compliant
If I change the first value of the list (myList[0]) to a float, it works fine even if the list contains NaNs in other positions. Taking this into account, someone who understands how these things work can propose a proper answer; mine is to restructure your data so that the first value isn't a NaN.
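A minimal illustration of that workaround, assuming myList is the list handed to the glyph; it only replaces a leading NaN with a finite placeholder, as described above:
import math

# If the first element is NaN, swap in a finite value so serialization
# does not fail on it.
if math.isnan(myList[0]):
    myList[0] = 0.0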
After removing the NaN values, there might still be infinite values.
Trace the whole dataset; it might have some infinite values (inf). Remove those infinite values somehow, and then it should work.
df['column'].describe()
Then, if you find any inf values, remove those rows with
df = df[~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
reference: solution here
I bumped into this problem and realized it was happening because one column of my DataFrame was filled only with NaNs.
You could instead set it to another value, e.g.:
df['column'] = np.zeros(len(df))
I had this error in this line:
save(plot_lda, 'tsne_lda_viz_{}_{}_{}_{}_{}_{}_{}.html'.format(
num_qualified_tweet, n_topics, threshold, n_iter, num_example, n_top_words, end_date))
I used this repo as a baseline: https://github.com/ShuaiW/twitter-analysis/blob/master/topic_tweets.py (mine).
I solved this with the following code (hope it will be useful for others):
for i in range(X_topics.shape[1]):
    topic_coord[i, 0] = 0 if np.isnan(topic_coord[i, 0]) else topic_coord[i, 0]
    topic_coord[i, 1] = 0 if np.isnan(topic_coord[i, 1]) else topic_coord[i, 1]
    plot_lda.text(topic_coord[i, 0], topic_coord[i, 1], [topic_summaries[i]])
The key is:
var = 0 if np.isnan(number) else number
I had this issue and solved it by cleaning my dataset:
check your data set and change the value of the null records.
If anyone else comes across this answer: you can specify a parameter on the read_excel method so that it does not use the default NA values:
pd.read_excel(path, sheetname, parse_cols="A:B", keep_default_na=False)
Reference: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.

python write to a stata .dta file from notebook

I'm trying to write a pandas data frame to a Stata .dta file. Following the advice given in Save .dta files in python, I wrote:
import pandas as pd
df.to_stata(workdir+' generosity.dta')
and I got an error message, TypeError: object of type 'float' has no len(), and I'm not sure what this means.
Most columns in df are objects, but there are three columns that are float64.
I tried following another method (as described in this post: Convert .CSV files to .DTA files in Python) via rpy2, but when I tried to install it, I received the error message "Error: tried to guess R's home but no R command in the path", so I've given up on it (I have R on my computer but have not used it once).
Thank you very much.
Edit: here is the result:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-140-7a8f8bc8d446> in <module>()
1 #write the dataframe as a Stata file
----> 2 df.to_stata(workdir+group+' generosity.dta')
C:\Users\chungk\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\frame.pyc in to_stata(self, fname, convert_dates, write_index, encoding, byteorder, time_stamp, data_label)
1262 time_stamp=time_stamp, data_label=data_label,
1263 write_index=write_index)
-> 1264 writer.write_file()
1265
1266 @Appender(fmt.docstring_to_string, indents=1)
C:\Users\chungk\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\stata.pyc in write_file(self)
1245 self._write(_pad_bytes("", 5))
1246 if self._convert_dates is None:
-> 1247 self._write_data_nodates()
1248 else:
1249 self._write_data_dates()
C:\Users\chungk\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\stata.pyc in _write_data_nodates(self)
1327 if var is None or var == np.nan:
1328 var = _pad_bytes('', typ)
-> 1329 if len(var) < typ:
1330 var = _pad_bytes(var, typ)
1331 if compat.PY3:
TypeError: object of type 'float' has no len()
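The traceback suggests the writer calls len() on values in object-dtype columns, so a missing value (a float NaN) sitting in an object column would trip it. A hedged sketch of one possible cleanup before writing, reusing the df and workdir from the question:
import pandas as pd

# Convert object columns to strings and fill missing values, so every value
# the Stata writer pads via len() really is a string.
obj_cols = df.select_dtypes(include=['object']).columns
df[obj_cols] = df[obj_cols].fillna('').astype(str)
df.to_stata(workdir + ' generosity.dta')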

Pandas and Google Analytics [closed]

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
This question was caused by a typo or a problem that can no longer be reproduced. While similar questions may be on-topic here, this one was resolved in a way less likely to help future readers.
Closed 8 years ago.
I have been using pandas to access and analyse my Google Analytics data for several months; however, yesterday my code failed and I'm not sure why. Even using the most basic example (as given in the documentation), the ga.read_ga function returns an error:
AttributeError Traceback (most recent call last)
/Library/Python/2.7/site-packages/ipython-0.13.1-py2.7.egg/IPython/utils/py3compat.pyc in execfile(fname, *where)
176 else:
177 filename = fname
--> 178 __builtin__.execfile(filename, *where)
/Users/danielcollins/Documents/GA Python/EngCompCombined.py in <module>()
161 # account_id=account_id,
162 max_results=max_results,
--> 163 chunksize=5000)
164
165 df1_conc = pd.concat([x for x in df1])
/Library/Python/2.7/site-packages/pandas/io/ga.py in read_ga(metrics, dimensions, start_date, **kwargs)
100 reader = GAnalytics(**reader_kwds)
101 return reader.get_data(metrics=metrics, start_date=start_date,
--> 102 dimensions=dimensions, **kwargs)
103
104
/Library/Python/2.7/site-packages/pandas/io/ga.py in get_data(self, metrics, start_date, end_date, dimensions, segment, filters, start_index, max_results, index_col, parse_dates, keep_date_col, date_parser, na_values, converters, sort, dayfirst, account_name, account_id, property_name, property_id, profile_name, profile_id, chunksize)
254
255 account = self.get_account(account_name, account_id)
--> 256 web_property = self.get_web_property(account.get('id'), property_name,
257 property_id)
258 profile = self.get_profile(account.get('id'), web_property.get('id'),
AttributeError: 'NoneType' object has no attribute 'get'
The format of the request is:
df1 = ga.read_ga(metrics,
dimensions = dimensions,
start_date = start_date,
end_date = end_date,
token_file_name = '-------',
filters = filters,
max_results=max_results,
chunksize=5000)
I have never had to specify my account id, profile id, etc., but I have tried hardcoding them without much luck. This script was collecting data every day until yesterday, and I have not touched either my Python script or my GA account config.
I am running pandas 0.11.0; however, I have tried reverting to 0.10.1 to see if the error remains.
Authentication flow via refresh token seems to work fine.
I really appreciate any suggestions.

possible inconsistency in text handling of pandas read_table() function

In a previous post, I found out that the pandas read_table() function can handle variable-length whitespace as a delimiter if you use the read_table('datafile', sep=r'\s*') construction. While this works great for many of my files, it does not work for others even though they are highly similar.
EDIT:
I had posted examples that could not replicate the problem when others tried them. So I am posting links to the original files for AY907538 and AY942707, as well as leaving the error message that I cannot manage to solve.
## filename:AY942707
# this will load with no problem
data = read_table('AY942707.hmmdomtblout', header=None, skiprows=3, sep=r'\s*')
## filename: AY907538
data = read_table('AY907538.hmmdomtblout', header=None, skiprows=3, sep=r'\s*')
which will generate the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-7-131d10d1fb1d> in <module>()
2
3 #temp = get_dataset('AY907538.hmmdomtblout')
----> 4 data = read_table('AY907538.hmmdomtblout', header=None, skiprows=3, sep=r'\s*')
5 #data = read_table('AY942707.hmmdomtblout', header=None, skiprows=3, sep=r'\s*')
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in read_table(filepath_or_buffer, sep, dialect, header, index_col, names, skiprows, na_values, thousands, comment, parse_dates, keep_date_col, dayfirst, date_parser, nrows, iterator, chunksize, skip_footer, converters, verbose, delimiter, encoding, squeeze)
282 kwds['encoding'] = None
283
--> 284 return _read(TextParser, filepath_or_buffer, kwds)
285
286 @Appender(_read_fwf_doc)
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(cls, filepath_or_buffer, kwds)
189 return parser
190
--> 191 return parser.get_chunk()
192
193 @Appender(_read_csv_doc)
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in get_chunk(self, rows)
779 msg = ('Expecting %d columns, got %d in row %d' %
780 (col_len, zip_len, row_num))
--> 781 raise ValueError(msg)
782
783 data = dict((k, v) for k, v in izip(self.columns, zipped_content))
ValueError: Expecting 26 columns, got 28 in row 6
The last field, description of target, holds multiple words in both files. Since whitespace is used as the separator, description of target is not treated as a single column by read_table; each word in this field ends up in a different column. In AY942707 the first description of target holds more words than those on all of the other lines; this is not the case in AY907538. read_table determines the number of columns from the first line, and all following lines must have an equal or smaller number of columns.
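One way around this limitation, sketched under the assumption that 28 is the largest number of whitespace-separated fields on any line in AY907538 (and using \s+ rather than \s* as the separator): supplying explicit column names stops read_table from inferring the column count from the first data line, and shorter lines are simply padded with NaN.
import pandas as pd

# Hypothetical column names; replace 28 with the true maximum field count.
names = ['col%d' % i for i in range(28)]
data = pd.read_table('AY907538.hmmdomtblout', header=None, skiprows=3,
                     sep=r'\s+', names=names, engine='python')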
