Bokeh: ValueError: Out of range float values are not JSON compliant
I came across this discussion (from a year ago): https://github.com/bokeh/bokeh/issues/2392
I also saw the white screen without any errors, and then I took a small subset of 2 columns and tried the code below.
Since pandas also pulls in a bunch of rows with empty data, I tried dropna(), but that resulted in there being no data at all. So instead I just limited the rows that go into the df (hence the df = df.head(n=19) line).
import pandas as pd
from bokeh.plotting import figure, output_file, show
df = pd.read_excel(path,sheetname,parse_cols="A:B")
df = df.head(n=19)
print(df)
rtngs = ['iAAA','iAA+','iAA','iAA-','iA+','iA','iA-','iBBB+','iBBB','iBBB-','iBB+','iBB','iBB-','iB+','iB','iB-','NR','iCCC+']
x= df['Score']
output_file("line.html")
p = figure(plot_width=400, plot_height=400, x_range=(0,100),y_range=rtngs)
# add a circle renderer with a size, color, and alpha
p.circle(df['Score'], df['Rating'], size=20, color="navy", alpha=0.5)
# show the results
#output_notebook()
show(p)
df:
Rating Score
0 iAAA 64.0
1 iAA+ 33.0
2 iAA 7.0
3 iAA- 28.0
4 iA+ 36.0
5 iA 62.0
6 iA- 99.0
7 iBBB+ 10.0
8 iBBB 93.0
9 iBBB- 91.0
10 iBB+ 79.0
11 iBB 19.0
12 iBB- 95.0
13 iB+ 26.0
14 iB 9.0
15 iB- 26.0
16 NR 49.0
17 iCCC+ 51.0
18 iAAA 18.0
The above shows an output within the notebook, but still throws: ValueError: Out of range float values are not JSON compliant
It also doesn't produce the output file (presumably for the same reason). How do I get rid of this error for this small subset? Is it related to NaN values? Would that also solve the 'white screen of death' issue for the larger dataset?
Thanks very much for taking a look!
In case you would like to see the entire error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-4fa6b88aa415> in <module>()
16 # show the results
17 #output_notebook()
---> 18 show(p)
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\site-packages\bokeh\io.py in show(obj, browser, new)
300 if obj not in _state.document.roots:
301 _state.document.add_root(obj)
--> 302 return _show_with_state(obj, _state, browser, new)
303
304
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\site-packages\bokeh\io.py in _show_with_state(obj, state, browser, new)
310
311 if state.notebook:
--> 312 comms_handle = _show_notebook_with_state(obj, state)
313 shown = True
314
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\site-packages\bokeh\io.py in _show_notebook_with_state(obj, state)
334 comms_target = make_id()
335 publish_display_data({'text/html': notebook_div(obj, comms_target)})
--> 336 handle = _CommsHandle(get_comms(comms_target), state.document, state.document.to_json())
337 state.last_comms_handle = handle
338 return handle
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\site-packages\bokeh\document.py in to_json(self)
792 # this is a total hack to go via a string, needed because
793 # our BokehJSONEncoder goes straight to a string.
--> 794 doc_json = self.to_json_string()
795 return loads(doc_json)
796
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\site-packages\bokeh\document.py in to_json_string(self, indent)
785 }
786
--> 787 return serialize_json(json, indent=indent)
788
789 def to_json(self):
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\site-packages\bokeh\core\json_encoder.py in serialize_json(obj, encoder, indent, **kwargs)
97 indent = 2
98
---> 99 return json.dumps(obj, cls=encoder, allow_nan=False, indent=indent, separators=separators, sort_keys=True, **kwargs)
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\json\__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
235 check_circular=check_circular, allow_nan=allow_nan, indent=indent,
236 separators=separators, default=default, sort_keys=sort_keys,
--> 237 **kw).encode(obj)
238
239
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\json\encoder.py in encode(self, o)
197 # exceptions aren't as detailed. The list call should be roughly
198 # equivalent to the PySequence_Fast that ''.join() would do.
--> 199 chunks = self.iterencode(o, _one_shot=True)
200 if not isinstance(chunks, (list, tuple)):
201 chunks = list(chunks)
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\json\encoder.py in iterencode(self, o, _one_shot)
255 self.key_separator, self.item_separator, self.sort_keys,
256 self.skipkeys, _one_shot)
--> 257 return _iterencode(o, 0)
258
259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,
ValueError: Out of range float values are not JSON compliant
I had the same error and I debugged the problem: I had NaN values in my plotted dataset, and Bokeh's serialize_json() function (in core/json_encoder.py) does not allow NaN values: the json.dumps() call in the return statement of that function is made with allow_nan=False. The problem only occurs at the I/O stage of the Bokeh process, when the output file is generated (that is what calls serialize_json()).
So you have to replace the NaN values in your dataframe, e.g.:
df = df.fillna('')
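If the column is numeric, like Score in the question, filling with an empty string changes its dtype; a minimal sketch of two alternatives (made-up data, same column names as the question) might look like this:
import numpy as np
import pandas as pd

# made-up frame with a missing Score value
df = pd.DataFrame({'Rating': ['iAAA', 'iAA+', 'iAA'],
                   'Score': [64.0, np.nan, 7.0]})

# either drop the incomplete rows ...
df_dropped = df.dropna(subset=['Score'])

# ... or fill the holes with a number the plot can serialize
df_filled = df.copy()
df_filled['Score'] = df_filled['Score'].fillna(0.0)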
Nice day! :)
NaN values will be better supported when this Pull Request to add a binary array serialization option is merged. This should be available in Bokeh 0.12.4 in January 2017. Bokeh does not use allow_nan in the Python JSON encoder, because that is not standard: NaN and Infinity are not part of the official JSON specification (an egregious oversight IMO, but out of our control).
Well, it isn't exactly an answer to your question; it's more like my experience after working with Bokeh for a week, trying to make a plot like the Texas example from Bokeh. After a lot of frustration I noticed that when the first value of the list to be plotted (myList[0]) is a NaN, Bokeh (or the JSON layer underneath) refuses to plot and gives the message
ValueError: Out of range float values are not JSON compliant
If I change the first value of the list (myList[0]) to a float, it works fine even if there are NaNs in other positions. Taking this into account, someone who understands how these things work may propose a better answer; mine is to restructure your data so that the first value isn't a NaN.
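A minimal sketch of that workaround, assuming a plain Python list named my_list (a hypothetical name, not from the question):
import math

my_list = [float('nan'), 2.0, float('nan'), 4.0]

# swap the leading NaN for the first real value so the series does not
# start with NaN (purely illustrative, not a general fix)
if math.isnan(my_list[0]):
    my_list[0] = next((x for x in my_list if not math.isnan(x)), 0.0)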
After removing the NaN values, there might still be infinite values.
Check the whole dataset; it may contain infinite values (inf). Remove those as well, and then it should work.
df['column'].describe()
Then, if you find any inf values, remove those rows with:
df = df[~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
I bumped into this problem and realized it was happening because one column of my DataFrame was filled only with NaNs.
You could instead set it to another value, e.g.:
df['column'] = np.zeros(len(df))
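To spot such columns up front, a quick check like this might help (a sketch, not part of the original answer):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Score': [64.0, 33.0], 'column': [np.nan, np.nan]})

# list columns that contain nothing but NaN
print(df.columns[df.isna().all()].tolist())   # ['column']

# replace one of them with zeros, as suggested above
df['column'] = np.zeros(len(df))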
I had this error in this line:
save(plot_lda, 'tsne_lda_viz_{}_{}_{}_{}_{}_{}_{}.html'.format(
num_qualified_tweet, n_topics, threshold, n_iter, num_example, n_top_words, end_date))
I used this repo as a baseline: https://github.com/ShuaiW/twitter-analysis/blob/master/topic_tweets.py (mine)
And I solved it with this code (hope this will be useful for others):
for i in range(X_topics.shape[1]):
    topic_coord[i, 0] = 0 if np.isnan(topic_coord[i, 0]) else topic_coord[i, 0]
    topic_coord[i, 1] = 0 if np.isnan(topic_coord[i, 1]) else topic_coord[i, 1]
    plot_lda.text(topic_coord[i, 0], topic_coord[i, 1], [topic_summaries[i]])
The key is:
var = 0 if np.isnan(number) else number
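If the coordinates are already in a NumPy array, the same replacement can be done in one call with np.nan_to_num (a sketch, not part of the original answer):
import numpy as np

topic_coord = np.array([[np.nan, 1.5],
                        [2.0, np.nan]])

# every NaN becomes 0.0 in a single vectorized call
topic_coord = np.nan_to_num(topic_coord, nan=0.0)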
I had this issue and solved it by cleaning my dataset:
check your dataset and change the value of any null records.
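A quick way to audit a dataset for such records before plotting (a sketch, column names made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Rating': ['iAAA', None], 'Score': [64.0, np.nan]})

# count missing values per column, then decide whether to drop or fill
print(df.isna().sum())
df = df.dropna()        # or df.fillna(...), depending on the data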
If anyone else comes across this answer: you can pass a parameter to the read_excel method so that the default NA values are not used:
pd.read_excel(path, sheetname, parse_cols="A:B", keep_default_na=False)
Reference: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.
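For illustration, a sketch of what that looks like (hypothetical file name; note that newer pandas versions spell the arguments sheet_name and usecols instead of the older sheetname and parse_cols):
import pandas as pd

df = pd.read_excel('ratings.xlsx', sheet_name='Sheet1', usecols='A:B',
                   keep_default_na=False)

# cells that were blank in Excel now arrive as '' instead of NaN,
# so Bokeh's JSON serialization never sees a non-finite float
print(df.dtypes)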
Related
pandas unable to read the csv file? showing this error
TF_MODEL_URL = 'https://tfhub.dev/google/on_device_vision/classifier/landmarks_classifier_asia_V1/1'
mo = hub.Module('https://tfhub.dev/google/on_device_vision/classifier/landmarks_classifier_asia_V1/1')
IMAGE_SHAPE = (321,321)
df = pd.read_csv(LABLE_MAP_URL)
The error is:
        if self.low_memory:
--> 230     chunks = self._reader.read_low_memory(nrows)
    231     # destructive to chunks
    232     data = _concatenate_chunks(chunks)
   1775     index,
   1776     columns,
   1777     col_dict,
-> 1778 ) = self._engine.read(  # type: ignore[attr-defined]
   1779     nrows
   1780 )
deprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper(*args, **kwargs)
    209 else:
    210     kwargs[new_arg_name] = new_arg_value
--> 211     return func(*args, **kwargs)
The traceback is from pandas' IO tools, so the error likely occurred while reading the .csv. As you didn't show the file and this is not a reproducible example, you should check the file and see what went wrong. You also didn't show the entire traceback, so it is difficult to tell exactly what kind of error it is, but the part you provided seems similar to the part of pandas' official documentation on malformed lines with too many fields. Edit: As suspected, the error you showed does appear to be bad lines caused by the dataset, so this may be a possible dupe. Have you tried data = pd.read_csv(LABLE_MAP_URL, on_bad_lines='skip') as the answer in the dupe suggested?
Python Machine Learning value in e form instead of integer or float
When I am running this code, the predicted values come out in scientific (e) notation, e.g. 6.291149e+06, 5.684170e+06:
pred_y_df = pd.DataFrame({'Actual Value': y_test, 'Predicted value': y_pred, 'Difference': y_test - y_pred})
pred_y_df[0:20]
     Actual Value  Predicted value    Difference
136       5740000     6.291149e+06 -5.511488e+05
80        6629000     5.684170e+06  9.448304e+05
47        7490000     7.709115e+06 -2.191149e+05
526       2310000     2.587718e+06 -2.777181e+05
200       4900000     4.867998e+06  3.200241e+04
527       2275000     2.417865e+06 -1.428652e+05
278       4277000     5.021664e+06 -7.446643e+05
402       3500000     3.228685e+06  2.713153e+05
Try adding this line before you are displaying the results:
pd.options.display.float_format = '${:,.2f}'.format
Alternatively, you can format the actual columns, e.g.:
pred_y_df['Predicted'] = pred_y_df['Predicted'].map('${:,.2f}'.format)
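A small self-contained sketch of the first suggestion (values made up from the question):
import pandas as pd

pd.options.display.float_format = '${:,.2f}'.format

pred_y_df = pd.DataFrame({'Actual Value': [5740000, 6629000],
                          'Predicted value': [6.291149e+06, 5.684170e+06]})
print(pred_y_df)   # float columns now print as $6,291,149.00 rather than 6.291149e+06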
Splitting dataframe column in multiple columns using json_normalize does not work
I have a dataframe df created through an import from a mysql-db:

    ID  CONFIG
0  276  {"pos":[{"type":"geo...
1  349  {"pos":[{"type":"geo...
2  378  {"pos":[{"type":"geo...
3  381  {"pos":[{"type":"geo...
4  385  {"pos":[{"type":"geo...

where the elements in the CONFIG column all have the form:

{"posit":[{"type":"geo_f","priority":1,"validity":0},{"type":"geo_m","priority":2,"validity":0},{"type":"geo_i","priority":3,"validity":0},{"type":"geo_c","priority":4,"validity":0}]}

Now, I was convinced these elements are json-type elements and tried the following method to transform them into columns:

df_new = pd.json_normalize(df['CONFIG'])

However, this returns the following error:

AttributeError: 'str' object has no attribute 'values'

What am I missing? Thankful for any help!

EDIT: Full Traceback

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-22-23db4c0afdab> in <module>
----> 1 df_new = pd.json_normalize(df['CONFIG'])

c:\users\s-degossondevarennes\appdata\local\programs\python\python37\lib\site-packages\pandas\io\json\_normalize.py in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
    268
    269     if record_path is None:
--> 270         if any([isinstance(x, dict) for x in y.values()] for y in data):
    271             # naive normalization, this is idempotent for flat records
    272             # and potentially will inflate the data considerably for

c:\users\s-degossondevarennes\appdata\local\programs\python\python37\lib\site-packages\pandas\io\json\_normalize.py in <genexpr>(.0)
    268
    269     if record_path is None:
--> 270         if any([isinstance(x, dict) for x in y.values()] for y in data):
    271             # naive normalization, this is idempotent for flat records
    272             # and potentially will inflate the data considerably for

AttributeError: 'str' object has no attribute 'values'
First issue is that the values in the CONFIG column are strings in disguise. So, a literal_eval can make them true dictionaries. Then, they are all indexed with the "posit" key first, which we'd better get rid of. But then we are left with lists; so explode comes in. Overall:

from ast import literal_eval

pd.json_normalize(df['CONFIG'].apply(lambda x: literal_eval(x)["posit"]).explode())

I get (for a 1-row sample data):

    type  priority  validity
0  geo_f         1         0
1  geo_m         2         0
2  geo_i         3         0
3  geo_c         4         0
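A self-contained version of that idea (a sketch with a made-up one-row frame, using the "posit" key from the question):
from ast import literal_eval

import pandas as pd

df = pd.DataFrame({
    'ID': [276],
    'CONFIG': ['{"posit":[{"type":"geo_f","priority":1,"validity":0},'
               '{"type":"geo_m","priority":2,"validity":0}]}'],
})

# parse each string into a dict, pull the list under "posit",
# explode to one dict per row, then flatten with json_normalize
parsed = df['CONFIG'].apply(lambda x: literal_eval(x)['posit']).explode()
print(pd.json_normalize(parsed))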
How can I read *.csv files that have numbers with commas using pandas?
I want to read a *.csv file that has numbers with commas. For example, File.csv:

Date, Time, Open, High, Low, Close, Volume
2016/11/09,12:10:00,'4355,'4358,'4346,'4351,1,201      # The last value is 1201, not 201
2016/11/09,12:09:00,'4361,'4362,'4353,'4355,1,117      # The last value is 1117, not 117
2016/11/09,12:08:00,'4364,'4374,'4359,'4360,10,175     # The last value is 10175, not 175
2016/11/09,12:07:00,'4371,'4376,'4360,'4365,590
2016/11/09,12:06:00,'4359,'4372,'4358,'4369,420
2016/11/09,12:05:00,'4365,'4367,'4356,'4359,542
2016/11/09,12:04:00,'4379,'1380,'4360,'4365,1,697      # The last value is 1697, not 697
2016/11/09,12:03:00,'4394,'4396,'4376,'4381,1,272      # The last value is 1272, not 272
2016/11/09,12:02:00,'4391,'4399,'4390,'4393,524
...
2014/07/10,12:05:00,'10195,'10300,'10155,'10290,219,271   # The last value is 219271, not 271
2014/07/09,12:04:00,'10345,'10360,'10185,'10194,235,711   # The last value is 235711, not 711
2014/07/08,12:03:00,'10339,'10420,'10301,'10348,232,050   # The last value is 232050, not 050

It actually has 7 columns, but the values of the last column sometimes contain commas and pandas takes them as extra columns. My question is: are there any methods with which I can make pandas take only the first 6 commas and ignore the rest when it reads columns, or any methods to delete the commas after the 6th comma? (I'm sorry, but I can't think of any functions to do that.) Thank you for reading this :)
You can do all of it in Python without having to save the data into a new file. The idea is to clean the data and put it in a dictionary-like format for pandas to grab and turn into a dataframe. The following should constitute a decent starting point:

from collections import defaultdict
from collections import OrderedDict
import pandas as pd

# Import the data
data = open('prices.csv').readlines()

# Split on the first 6 commas
data = [x.strip().replace("'","").split(",",6) for x in data]

# Get the headers
headers = [x.strip() for x in data[0]]

# Get the remaining of the data
remainings = [list(map(lambda y: y.replace(",",""), x)) for x in data[1:]]

# Create a dictionary-like container
output = defaultdict(list)

# Loop through the data and save the rows accordingly
for n, header in enumerate(headers):
    for row in remainings:
        output[header].append(row[n])

# Save it in an ordered dictionary to maintain the order of columns
output = OrderedDict((k,output.get(k)) for k in headers)

# Convert your raw data into a pandas dataframe
df = pd.DataFrame(output)

# Print it
print(df)

This yields:

         Date      Time  Open  High   Low Close Volume
0  2016/11/09  12:10:00  4355  4358  4346  4351   1201
1  2016/11/09  12:09:00  4361  4362  4353  4355   1117
2  2016/11/09  12:08:00  4364  4374  4359  4360  10175
3  2016/11/09  12:07:00  4371  4376  4360  4365    590
4  2016/11/09  12:06:00  4359  4372  4358  4369    420
5  2016/11/09  12:05:00  4365  4367  4356  4359    542
6  2016/11/09  12:04:00  4379  1380  4360  4365   1697
7  2016/11/09  12:03:00  4394  4396  4376  4381   1272
8  2016/11/09  12:02:00  4391  4399  4390  4393    524

The starting file (prices.csv) is the following:

Date, Time, Open, High, Low, Close, Volume
2016/11/09,12:10:00,'4355,'4358,'4346,'4351,1,201
2016/11/09,12:09:00,'4361,'4362,'4353,'4355,1,117
2016/11/09,12:08:00,'4364,'4374,'4359,'4360,10,175
2016/11/09,12:07:00,'4371,'4376,'4360,'4365,590
2016/11/09,12:06:00,'4359,'4372,'4358,'4369,420
2016/11/09,12:05:00,'4365,'4367,'4356,'4359,542
2016/11/09,12:04:00,'4379,'1380,'4360,'4365,1,697
2016/11/09,12:03:00,'4394,'4396,'4376,'4381,1,272
2016/11/09,12:02:00,'4391,'4399,'4390,'4393,524

I hope this helps.
I guess pandas can't handle it, so I would do a pre-processing step with Perl to generate a new csv and work on that. Perl's split can help you in this situation:

perl -pne '$_ = join("|", split(/,/, $_, 7) )' < input.csv > output.csv

Then you can use the usual read_csv on the output file with the separator set to |.
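To round that off, a sketch of the pandas side, assuming the output.csv produced by the one-liner above:
import pandas as pd

# columns are now separated by '|', so the stray commas in the last
# field no longer create extra columns
df = pd.read_csv('output.csv', sep='|', skipinitialspace=True)

# strip the thousands commas from the last column and make it numeric
df['Volume'] = df['Volume'].astype(str).str.replace(',', '').astype(int)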
One more way to solve your problem:

import re
import pandas as pd

l1 = []
with open('/home/yusuf/Desktop/c1') as f:
    headers = [x.strip() for x in f.readline().strip('\n').split(',')]
    for a in f.readlines():
        b = re.findall("(.*?),(.*?),'(.*?),'(.*?),'(.*?),'(.*?),(.*)", a)
        l1.append(list(b[0]))

df = pd.DataFrame(data=l1, columns=headers)
df['Volume'] = df['Volume'].apply(lambda x: x.replace(",", ""))
df

Output:
Regex Demo: https://regex101.com/r/o1zxtO/2
I'm pretty sure pandas can't handle that, but you can easily fix the final column. An approach in Python:

COLUMNAMOUNT = 7  # the real number of columns in the file

with open('yourfile.csv') as csv, open('newcsv.csv', 'w') as result:
    for line in csv:
        columns = line.strip().split(',')
        if len(columns) > COLUMNAMOUNT:
            # glue everything after the 6th comma back onto the last column
            columns[COLUMNAMOUNT - 1] = ''.join(columns[COLUMNAMOUNT - 1:])
            columns = columns[:COLUMNAMOUNT]
        result.write(','.join(columns) + '\n')

Now you can load the new csv into pandas. Other solutions can be AWK or even shell scripting.
How to resample a python pandas TimeSeries containing dtype Decimal values?
I have a pandas Series object filled with decimal numbers of dtype Decimal. I'd like to use the new pandas 0.8 function to resample the decimal time series like this:

resampled = ts.resample('D', how='mean')

When trying this I get a "GroupByError: No numeric types to aggregate" error. I assume the problem is that np.mean is used internally to resample the values and np.mean expects floats instead of Decimals.

Thanks to the help of this forum I managed to solve a similar question using groupby and the apply function, but I would love to also use the cool resample function. How do I use the mean method on a pandas TimeSeries with Decimal type values? Any idea how to solve this?

Here is the complete ipython session creating the error:

In [37]: from decimal import Decimal
In [38]: from pandas import *
In [39]: rng = date_range('1.1.2012', periods=48, freq='H')
In [40]: rnd = np.random.randn(len(rng))
In [41]: rnd_dec = [Decimal(x) for x in rnd]
In [42]: ts = Series(rnd_dec, index=rng)
In [43]: ts[0:3]
Out[43]:
2012-01-01 00:00:00   -0.1020591335576267189022559023214853368699550628
2012-01-01 01:00:00    0.99245713975437366283216533702216111123561859130
2012-01-01 02:00:00    1.80080710727195758558139004890108481049537658691
Freq: H
In [44]: type(ts[0])
Out[44]: decimal.Decimal
In [45]: ts.resample('D', how='mean')
---------------------------------------------------------------------------
GroupByError                              Traceback (most recent call last)
C:\Users\THM\Documents\Python\<ipython-input-45-09c898403ddd> in <module>()
----> 1 ts.resample('D', how = 'mean')
C:\Python27\lib\site-packages\pandas\core\generic.pyc in resample(self, rule, how, axis, fill_method, closed, label, convention, kind, loffset, limit, base)
    187                          fill_method=fill_method, convention=convention,
    188                          limit=limit, base=base)
--> 189         return sampler.resample(self)
    190
    191     def first(self, offset):
C:\Python27\lib\site-packages\pandas\tseries\resample.pyc in resample(self, obj)
     65
     66         if isinstance(axis, DatetimeIndex):
---> 67             rs = self._resample_timestamps(obj)
     68         elif isinstance(axis, PeriodIndex):
     69             offset = to_offset(self.freq)
C:\Python27\lib\site-packages\pandas\tseries\resample.pyc in _resample_timestamps(self, obj)
    184         if len(grouper.binlabels) < len(axlabels) or self.how is not None:
    185             grouped = obj.groupby(grouper, axis=self.axis)
--> 186             result = grouped.aggregate(self._agg_method)
    187         else:
    188             # upsampling shortcut
C:\Python27\lib\site-packages\pandas\core\groupby.pyc in aggregate(self, func_or_funcs, *args, **kwargs)
   1215         """
   1216         if isinstance(func_or_funcs, basestring):
-> 1217             return getattr(self, func_or_funcs)(*args, **kwargs)
   1218
   1219         if hasattr(func_or_funcs,'__iter__'):
C:\Python27\lib\site-packages\pandas\core\groupby.pyc in mean(self)
    290         """
    291         try:
--> 292             return self._cython_agg_general('mean')
    293         except GroupByError:
    294             raise
C:\Python27\lib\site-packages\pandas\core\groupby.pyc in _cython_agg_general(self, how)
    376
    377         if len(output) == 0:
--> 378             raise GroupByError('No numeric types to aggregate')
    379
    380         return self._wrap_aggregated_output(output, names)
GroupByError: No numeric types to aggregate

Any help is appreciated. Thanks, Thomas
I found the answer by myself. It is possible to provide a function to the 'how' argument of resample:

f = lambda x: Decimal(np.mean(x))
ts.resample('D', how=f)
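As a side note, the how= argument was removed in later pandas versions; a hedged sketch of the same idea on current pandas, passing the aggregation function to .apply() and keeping the arithmetic in Decimal:
import numpy as np
import pandas as pd
from decimal import Decimal

rng = pd.date_range('2012-01-01', periods=48, freq='h')
ts = pd.Series([Decimal(str(x)) for x in np.random.randn(len(rng))], index=rng)

# aggregate each daily group with a Decimal-preserving mean
daily_mean = ts.resample('D').apply(lambda x: sum(x, Decimal(0)) / len(x))
print(daily_mean)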
I get the error for object type columns in a DataFrame. I got around it by using

df.resample('D', method='ffill', how=lambda c: c[-1])