Bokeh: ValueError: Out of range float values are not JSON compliant - python

I came across this discussion (from a year ago): https://github.com/bokeh/bokeh/issues/2392
I also saw the white screen without any errors, so I took a small subset of 2 columns and tried the code below.
Since pandas also reads in a bunch of rows with empty data, I tried dropna, but that left no data at all. So instead I just limited the rows that go into the df (hence the df = df.head(n=19) line):
import pandas as pd
from bokeh.plotting import figure, output_file, show
df = pd.read_excel(path,sheetname,parse_cols="A:B")
df = df.head(n=19)
print(df)
rtngs = ['iAAA','iAA+','iAA','iAA-','iA+','iA','iA-','iBBB+','iBBB','iBBB-','iBB+','iBB','iBB-','iB+','iB','iB-','NR','iCCC+']
x= df['Score']
output_file("line.html")
p = figure(plot_width=400, plot_height=400, x_range=(0,100),y_range=rtngs)
# add a circle renderer with a size, color, and alpha
p.circle(df['Score'], df['Rating'], size=20, color="navy", alpha=0.5)
# show the results
#output_notebook()
show(p)
df:
Rating Score
0 iAAA 64.0
1 iAA+ 33.0
2 iAA 7.0
3 iAA- 28.0
4 iA+ 36.0
5 iA 62.0
6 iA- 99.0
7 iBBB+ 10.0
8 iBBB 93.0
9 iBBB- 91.0
10 iBB+ 79.0
11 iBB 19.0
12 iBB- 95.0
13 iB+ 26.0
14 iB 9.0
15 iB- 26.0
16 NR 49.0
17 iCCC+ 51.0
18 iAAA 18.0
The above shows me an output within the notebook, but still throws: ValueError: Out of range float values are not JSON compliant
It also (hence?) does not produce the output file. How do I get rid of this error for this small subset? Is it related to NaN values? Would fixing that also solve the 'white screen of death' issue for the larger dataset?
Thanks very much for taking a look!
In case you would like to see the entire error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-4fa6b88aa415> in <module>()
16 # show the results
17 #output_notebook()
---> 18 show(p)
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\site-packages\bokeh\io.py in show(obj, browser, new)
300 if obj not in _state.document.roots:
301 _state.document.add_root(obj)
--> 302 return _show_with_state(obj, _state, browser, new)
303
304
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\site-packages\bokeh\io.py in _show_with_state(obj, state, browser, new)
310
311 if state.notebook:
--> 312 comms_handle = _show_notebook_with_state(obj, state)
313 shown = True
314
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\site-packages\bokeh\io.py in _show_notebook_with_state(obj, state)
334 comms_target = make_id()
335 publish_display_data({'text/html': notebook_div(obj, comms_target)})
--> 336 handle = _CommsHandle(get_comms(comms_target), state.document, state.document.to_json())
337 state.last_comms_handle = handle
338 return handle
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\site-packages\bokeh\document.py in to_json(self)
792 # this is a total hack to go via a string, needed because
793 # our BokehJSONEncoder goes straight to a string.
--> 794 doc_json = self.to_json_string()
795 return loads(doc_json)
796
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\site-packages\bokeh\document.py in to_json_string(self, indent)
785 }
786
--> 787 return serialize_json(json, indent=indent)
788
789 def to_json(self):
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\site-packages\bokeh\core\json_encoder.py in serialize_json(obj, encoder, indent, **kwargs)
97 indent = 2
98
---> 99 return json.dumps(obj, cls=encoder, allow_nan=False, indent=indent, separators=separators, sort_keys=True, **kwargs)
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\json\__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
235 check_circular=check_circular, allow_nan=allow_nan, indent=indent,
236 separators=separators, default=default, sort_keys=sort_keys,
--> 237 **kw).encode(obj)
238
239
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\json\encoder.py in encode(self, o)
197 # exceptions aren't as detailed. The list call should be roughly
198 # equivalent to the PySequence_Fast that ''.join() would do.
--> 199 chunks = self.iterencode(o, _one_shot=True)
200 if not isinstance(chunks, (list, tuple)):
201 chunks = list(chunks)
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\json\encoder.py in iterencode(self, o, _one_shot)
255 self.key_separator, self.item_separator, self.sort_keys,
256 self.skipkeys, _one_shot)
--> 257 return _iterencode(o, 0)
258
259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,
ValueError: Out of range float values are not JSON compliant

I had the same error and debugged the problem: I had NaN values in my plotted dataset, and Bokeh's serialize_json() function (in core/json_encoder.py) does not allow NaN values (I don't know why...). In the return statement of that function, json.dumps() is called with allow_nan=False :(( The problem only occurs in the IO part of the Bokeh process, when the output file is generated (it calls the serialize_json() function above).
So you have to replace the NaN values in your dataframe, e.g.:
df = df.fillna('')
Nice day! :)
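For a numeric column like Score, filling with an empty string turns the column into object dtype, so a minimal alternative sketch (assuming the Score and Rating column names from the question) is to drop or numerically fill the offending rows before plotting:
import pandas as pd
# Drop rows whose Score or Rating is missing, so Bokeh never has to serialize a NaN
df = df.dropna(subset=['Score', 'Rating'])
# Or, if the rows must be kept, fill the numeric column with a placeholder value:
# df['Score'] = df['Score'].fillna(0.0)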

NaN will be better supported when the pull request adding a binary array serialization option is merged; that should be available in Bokeh 0.12.4 in January 2017. Bokeh does not use allow_nan in the Python JSON encoder because it is not standard: NaN and Inf are not part of the official JSON specification (an egregious oversight IMO, but out of our control).

Well, it isn't exactly an answer to your question; it's more my experience after working with Bokeh for a week. In my case I was trying to make a plot like the Texas example from Bokeh. After a lot of frustration I noticed that when Bokeh (or JSON, or whatever) encounters a NaN as the first value of the list to be plotted (myList), it refuses to plot and gives the message
ValueError: Out of range float values are not JSON compliant
If I change the first value of the list (myList[0]) to a float, it works fine even if it contains NaNs in other positions. Taking this into account, someone who understands how these things work may propose a proper answer. Mine is to restructure your data so that the first value isn't a NaN.
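A minimal sketch of that workaround, assuming myList is the plain Python list of floats being handed to the plot:
import math
# If the very first value is NaN, replace it with a real float before plotting
if myList and isinstance(myList[0], float) and math.isnan(myList[0]):
    myList[0] = 0.0  # or any placeholder value that makes sense for your plot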

After removing the NaN values there might still be infinite values.
Go through the whole dataset: it may contain infinite values shown as inf, and those have to be removed as well before it works. Inspect a column with
df['column'].describe()
and if you find any inf values, remove those rows with
df = df[~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
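A short diagnostic sketch of that "go through the whole dataset" step, with the imports it needs (df being the DataFrame that gets plotted):
import numpy as np
import pandas as pd
# Count NaN and +/-inf values per column to see which ones need cleaning
print(df.isna().sum())
print(df.isin([np.inf, -np.inf]).sum())
# Then drop every row that still contains a NaN or an infinity
df = df[~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)]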

I bumped into this problem and realized it was happening because one column of my DataFrame was filled only with NaNs.
You could instead set it to another value, e.g.:
df['column'] = np.zeros(len(df))

I had this error in this line:
save(plot_lda, 'tsne_lda_viz_{}_{}_{}_{}_{}_{}_{}.html'.format(
    num_qualified_tweet, n_topics, threshold, n_iter, num_example, n_top_words, end_date))
I used this repo as a baseline: https://github.com/ShuaiW/twitter-analysis/blob/master/topic_tweets.py (mine)
And I solved it with this code (hope it will be useful for others):
for i in range(X_topics.shape[1]):
    topic_coord[i, 0] = 0 if np.isnan(topic_coord[i, 0]) else topic_coord[i, 0]
    topic_coord[i, 1] = 0 if np.isnan(topic_coord[i, 1]) else topic_coord[i, 1]
    plot_lda.text(topic_coord[i, 0], topic_coord[i, 1], [topic_summaries[i]])
The key is:
var = 0 if np.isnan(number) else number
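If the coordinates live in a NumPy array anyway, a vectorized sketch of the same idea (my variant, not from the linked repo) is np.nan_to_num, which replaces NaN with 0.0 (and +/-inf with large finite numbers) in one call:
import numpy as np
topic_coord = np.nan_to_num(topic_coord)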

I had this issue and solved it by cleaning my dataset:
check your dataset and replace the null record values.

If anyone else comes across this answer: you can specify a parameter on the read_excel method so it does not use the default NA values:
pd.read_excel(path, sheetname, parse_cols="A:B", keep_default_na=False)
Reference: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.
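On recent pandas versions the keyword names in that call have changed (sheetname and parse_cols were removed in favour of sheet_name and usecols), so a sketch of the equivalent call would be:
import pandas as pd
# path and sheetname are the same placeholders used in the question
df = pd.read_excel(path, sheet_name=sheetname, usecols="A:B", keep_default_na=False)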

Related

pandas unable to read the csv file? showing this error

TF_MODEL_URL = 'https://tfhub.dev/google/on_device_vision/classifier/landmarks_classifier_asia_V1/1'
mo = hub.Module('https://tfhub.dev/google/on_device_vision/classifier/landmarks_classifier_asia_V1/1')
IMAGE_SHAPE = (321,321)
df= pd.read_csv(LABLE_MAP_URL)
The error is:
if self.low_memory:
--> 230 chunks = self._reader.read_low_memory(nrows)
231 # destructive to chunks
232 data = _concatenate_chunks(chunks)
1775 index,
1776 columns,
1777 col_dict,
-> 1778 ) = self._engine.read( # type: ignore[attr-defined]
1779 nrows
1780 )
deprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper(*args, **kwargs)
209 else:
210 kwargs[new_arg_name] = new_arg_value
--> 211 return func(*args, **kwargs)
The traceback is from pandas IO tools, so the error likely occurred while reading the .csv. As you didn't show the file and this is not a reproducible example, you should check the file and see what went wrong. You also didn't show the entire traceback, so it is difficult to tell what kind of error it is, but the part of the traceback you provided looks similar to the section of pandas' official documentation on malformed lines with too many fields.
Edit:
As suspected, the error you showed does appear to come from bad lines in the dataset, so this may be a possible dupe. Have you tried
data = pd.read_csv(LABLE_MAP_URL, on_bad_lines='skip')
as the answer in the dupe suggested?
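If you are on a pandas version older than 1.3, on_bad_lines is not available yet; a sketch with the older flags (since deprecated in favour of on_bad_lines) that do roughly the same thing:
import pandas as pd
# pandas < 1.3: skip malformed rows but still warn about them
data = pd.read_csv(LABLE_MAP_URL, error_bad_lines=False, warn_bad_lines=True)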

Python Machine Learning value in e form instead of integer or float

When I run this code, the predicted values come out in scientific notation (e.g. 6.291149e+06, 5.684170e+06):
pred_y_df=pd.DataFrame({'Actual Value':y_test,'Predicted value':y_pred,'Difference':y_test-y_pred})
pred_y_df[0:20]
Actual Value Predicted value Difference
136 5740000 6.291149e+06 -5.511488e+05
80 6629000 5.684170e+06 9.448304e+05
47 7490000 7.709115e+06 -2.191149e+05
526 2310000 2.587718e+06 -2.777181e+05
200 4900000 4.867998e+06 3.200241e+04
527 2275000 2.417865e+06 -1.428652e+05
278 4277000 5.021664e+06 -7.446643e+05
402 3500000 3.228685e+06 2.713153e+05
Try adding this line before you display the results:
pd.options.display.float_format = '${:,.2f}'.format
Alternatively, you can format the actual columns, e.g.:
pred_y_df['Predicted value'] = pred_y_df['Predicted value'].map('${:,.2f}'.format)
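If you only want the display format changed temporarily (the option is global otherwise), a small sketch using a context manager; drop the '$' from the format string if you don't want a currency prefix:
import pandas as pd
with pd.option_context('display.float_format', '{:,.2f}'.format):
    print(pred_y_df[0:20])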

Splitting dataframe column in multiple columns using json_normalize does not work

I have a dataframe df created through an import from a mysql-db
ID CONFIG
0 276 {"pos":[{"type":"geo...
1 349 {"pos":[{"type":"geo...
2 378 {"pos":[{"type":"geo...
3 381 {"pos":[{"type":"geo...
4 385 {"pos":[{"type":"geo...
where the elements in the CONFIG column all have the form:
{"posit":[{"type":"geo_f","priority":1,"validity":0},{"type":"geo_m","priority":2,"validity":0},{"type":"geo_i","priority":3,"validity":0},{"type":"geo_c","priority":4,"validity":0}]}
Now, I was convinced these elements are json-type elements and tried the following method to transform them into columns:
df_new = pd.json_normalize(df['CONFIG'])
However, this returns the following error:
AttributeError: 'str' object has no attribute 'values'
What am I missing? Thanks for any help!
EDIT: Full Traceback
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-22-23db4c0afdab> in <module>
----> 1 df_new = pd.json_normalize(df['CONFIG'])
c:\users\s-degossondevarennes\appdata\local\programs\python\python37\lib\site-packages\pandas\io\json\_normalize.py in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
268
269 if record_path is None:
--> 270 if any([isinstance(x, dict) for x in y.values()] for y in data):
271 # naive normalization, this is idempotent for flat records
272 # and potentially will inflate the data considerably for
c:\users\s-degossondevarennes\appdata\local\programs\python\python37\lib\site-packages\pandas\io\json\_normalize.py in <genexpr>(.0)
268
269 if record_path is None:
--> 270 if any([isinstance(x, dict) for x in y.values()] for y in data):
271 # naive normalization, this is idempotent for flat records
272 # and potentially will inflate the data considerably for
AttributeError: 'str' object has no attribute 'values'
The first issue is that the values in the CONFIG column are strings in disguise, so a literal_eval can turn them into true dictionaries. Then they are all wrapped under a top-level "posit" key, which we'd better get rid of. That leaves us with lists, so explode comes in. Overall,
from ast import literal_eval
pd.json_normalize(df['CONFIG'].apply(lambda x: literal_eval(x)["posit"]).explode())
I get (for a 1-row sample data)
type priority validity
0 geo_f 1 0
1 geo_m 2 0
2 geo_i 3 0
3 geo_c 4 0
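Since the CONFIG values in the question are valid JSON strings (double-quoted keys), a sketch of the same result with json.loads and record_path, assuming every value carries the top-level "posit" list shown above:
import json
import pandas as pd
parsed = df['CONFIG'].apply(json.loads)
df_new = pd.json_normalize(parsed.tolist(), record_path='posit')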

How can I read *.csv files that have numbers with commas using pandas?

I want to read a *.csv file that has numbers with commas.
For example,
File.csv
Date, Time, Open, High, Low, Close, Volume
2016/11/09,12:10:00,'4355,'4358,'4346,'4351,1,201 # The last value is 1201, not 201
2016/11/09,12:09:00,'4361,'4362,'4353,'4355,1,117 # The last value is 1117, not 117
2016/11/09,12:08:00,'4364,'4374,'4359,'4360,10,175 # The last value is 10175, not 175
2016/11/09,12:07:00,'4371,'4376,'4360,'4365,590
2016/11/09,12:06:00,'4359,'4372,'4358,'4369,420
2016/11/09,12:05:00,'4365,'4367,'4356,'4359,542
2016/11/09,12:04:00,'4379,'1380,'4360,'4365,1,697 # The last value is 1697, not 697
2016/11/09,12:03:00,'4394,'4396,'4376,'4381,1,272 # The last value is 1272, not 272
2016/11/09,12:02:00,'4391,'4399,'4390,'4393,524
...
2014/07/10,12:05:00,'10195,'10300,'10155,'10290,219,271 # The last value is 219271, not 271
2014/07/09,12:04:00,'10345,'10360,'10185,'10194,235,711 # The last value is 235711, not 711
2014/07/08,12:03:00,'10339,'10420,'10301,'10348,232,050 # The last value is 232050, not 050
It actually has 7 columns, but the values of the last column sometimes contain commas and pandas takes them as extra columns.
My question is whether there is a way to make pandas split only on the first 6 commas and ignore the rest when it reads the columns, or a way to delete the commas after the 6th one (I'm sorry, but I can't think of any function to do that).
Thank you for reading this :)
You can do all of it in Python without having to save the data into a new file. The idea is to clean the data and put in a dictionary-like format for pandas to grab it and turn it into a dataframe. The following should constitute a decent starting point:
from collections import defaultdict
from collections import OrderedDict
import pandas as pd
# Import the data
data = open('prices.csv').readlines()
# Split on the first 6 commas
data = [x.strip().replace("'","").split(",",6) for x in data]
# Get the headers
headers = [x.strip() for x in data[0]]
# Get the remaining of the data
remainings = [list(map(lambda y: y.replace(",",""), x)) for x in data[1:]]
# Create a dictionary-like container
output = defaultdict(list)
# Loop through the data and save the rows accordingly
for n, header in enumerate(headers):
    for row in remainings:
        output[header].append(row[n])
# Save it in an ordered dictionary to maintain the order of columns
output = OrderedDict((k,output.get(k)) for k in headers)
# Convert your raw data into a pandas dataframe
df = pd.DataFrame(output)
# Print it
print(df)
This yields:
Date Time Open High Low Close Volume
0 2016/11/09 12:10:00 4355 4358 4346 4351 1201
1 2016/11/09 12:09:00 4361 4362 4353 4355 1117
2 2016/11/09 12:08:00 4364 4374 4359 4360 10175
3 2016/11/09 12:07:00 4371 4376 4360 4365 590
4 2016/11/09 12:06:00 4359 4372 4358 4369 420
5 2016/11/09 12:05:00 4365 4367 4356 4359 542
6 2016/11/09 12:04:00 4379 1380 4360 4365 1697
7 2016/11/09 12:03:00 4394 4396 4376 4381 1272
8 2016/11/09 12:02:00 4391 4399 4390 4393 524
The starting file (prices.csv) is the following:
Date, Time, Open, High, Low, Close, Volume
2016/11/09,12:10:00,'4355,'4358,'4346,'4351,1,201
2016/11/09,12:09:00,'4361,'4362,'4353,'4355,1,117
2016/11/09,12:08:00,'4364,'4374,'4359,'4360,10,175
2016/11/09,12:07:00,'4371,'4376,'4360,'4365,590
2016/11/09,12:06:00,'4359,'4372,'4358,'4369,420
2016/11/09,12:05:00,'4365,'4367,'4356,'4359,542
2016/11/09,12:04:00,'4379,'1380,'4360,'4365,1,697
2016/11/09,12:03:00,'4394,'4396,'4376,'4381,1,272
2016/11/09,12:02:00,'4391,'4399,'4390,'4393,524
I hope this helps.
I guess pandas can't handle it, so I would do a pre-processing step with Perl to generate a new csv and work on that.
Perl's split can help you in this situation:
perl -pne '$_ = join("|", split(/,/, $_, 7) )' < input.csv > output.csv
Then you can use the usual read_csv on the output file with the separator set to |.
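A minimal sketch of that read, assuming the Perl one-liner above wrote output.csv:
import pandas as pd
# thousands=',' parses the remaining "1,201"-style values in the last column as 1201;
# skipinitialspace strips the blanks left after each '|' separator.
# The leading apostrophes on the price columns still need to be stripped separately.
df = pd.read_csv('output.csv', sep='|', thousands=',', skipinitialspace=True)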
One more way to solve your problem:
import re
import pandas as pd

l1 = []
with open('/home/yusuf/Desktop/c1') as f:
    headers = [x.strip() for x in f.readline().strip('\n').split(',')]
    for a in f.readlines():
        b = re.findall("(.*?),(.*?),'(.*?),'(.*?),'(.*?),'(.*?),(.*)", a)
        l1.append(list(b[0]))
df = pd.DataFrame(data=l1, columns=headers)
df['Volume'] = df['Volume'].apply(lambda x: x.replace(",", ""))
df
Regex Demo:
https://regex101.com/r/o1zxtO/2
I'm pretty sure pandas can't handle that, but you can easily fix the final column. An approach in Python:
COLUMN_AMOUNT = 7

with open('yourfile.csv') as csv_in, open('newcsv.csv', 'w') as result:
    for line in csv_in:
        columns = line.strip().split(',')
        if len(columns) > COLUMN_AMOUNT:
            # merge everything after the 6th comma back into the last column
            columns[COLUMN_AMOUNT - 1] = ''.join(columns[COLUMN_AMOUNT - 1:])
            columns = columns[:COLUMN_AMOUNT]
        result.write(','.join(columns) + '\n')
Now you can load the new csv into pandas. Other solutions can be AWK or even shell scripting.

How to resample a python pandas TimeSeries containing dtype Decimal values?

I have a pandas Series object filled with decimal numbers of dtype Decimal. I'd like to use the new pandas 0.8 function to resample the decimal time series like this:
resampled = ts.resample('D', how='mean')
When trying this I get a "GroupByError: No numeric types to aggregate" error. I assume the problem is that np.mean is used internally to resample the values, and np.mean expects floats instead of Decimals.
Thanks to the help of this forum I managed to solve a similar question using groupby and the apply function, but I would love to also use the cool resample function.
How do I use the mean method on a pandas TimeSeries with Decimal type values?
Any idea how to solve this?
Here is the complete ipython session creating the error:
In [37]: from decimal import Decimal
In [38]: from pandas import *
In [39]: rng = date_range('1.1.2012',periods=48, freq='H')
In [40]: rnd = np.random.randn(len(rng))
In [41]: rnd_dec = [Decimal(x) for x in rnd]
In [42]: ts = Series(rnd_dec, index=rng)
In [43]: ts[0:3]
Out[43]:
2012-01-01 00:00:00 -0.1020591335576267189022559023214853368699550628
2012-01-01 01:00:00 0.99245713975437366283216533702216111123561859130
2012-01-01 02:00:00 1.80080710727195758558139004890108481049537658691
Freq: H
In [44]: type(ts[0])
Out[44]: decimal.Decimal
In [45]: ts.resample('D', how = 'mean')
---------------------------------------------------------------------------
GroupByError Traceback (most recent call last)
C:\Users\THM\Documents\Python\<ipython-input-45-09c898403ddd> in <module>()
----> 1 ts.resample('D', how = 'mean')
C:\Python27\lib\site-packages\pandas\core\generic.pyc in resample(self, rule, how, axis, fill_method, closed, label, convention, kind, loffset, l
imit, base)
187 fill_method=fill_method, convention=convention,
188 limit=limit, base=base)
--> 189 return sampler.resample(self)
190
191 def first(self, offset):
C:\Python27\lib\site-packages\pandas\tseries\resample.pyc in resample(self, obj)
65
66 if isinstance(axis, DatetimeIndex):
---> 67 rs = self._resample_timestamps(obj)
68 elif isinstance(axis, PeriodIndex):
69 offset = to_offset(self.freq)
C:\Python27\lib\site-packages\pandas\tseries\resample.pyc in _resample_timestamps(self, obj)
184 if len(grouper.binlabels) < len(axlabels) or self.how is not None:
185 grouped = obj.groupby(grouper, axis=self.axis)
--> 186 result = grouped.aggregate(self._agg_method)
187 else:
188 # upsampling shortcut
C:\Python27\lib\site-packages\pandas\core\groupby.pyc in aggregate(self, func_or_funcs, *args, **kwargs)
1215 """
1216 if isinstance(func_or_funcs, basestring):
-> 1217 return getattr(self, func_or_funcs)(*args, **kwargs)
1218
1219 if hasattr(func_or_funcs,'__iter__'):
C:\Python27\lib\site-packages\pandas\core\groupby.pyc in mean(self)
290 """
291 try:
--> 292 return self._cython_agg_general('mean')
293 except GroupByError:
294 raise
C:\Python27\lib\site-packages\pandas\core\groupby.pyc in _cython_agg_general(self, how)
376
377 if len(output) == 0:
--> 378 raise GroupByError('No numeric types to aggregate')
379
380 return self._wrap_aggregated_output(output, names)
GroupByError: No numeric types to aggregate
Any help is appreciated.
Thanks,
Thomas
I found the answer myself. It is possible to provide a function to the 'how' argument of resample:
f = lambda x: Decimal(np.mean(x))
ts.resample('D', how=f)
I get the error for object-type columns in a DataFrame. I got around it by using
df.resample('D', method='ffill', how=lambda c: c[-1])
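On modern pandas the how= keyword is gone; the equivalent of passing a function to how is calling .apply() on the resampler. A sketch of the same idea, assuming ts still holds the Decimal series from the question:
from decimal import Decimal
import numpy as np
resampled = ts.resample('D').apply(lambda x: Decimal(np.mean(x)))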
