In PySpark, for each element of an RDD, I'm trying to get an array of Row elements. Then I want to convert the result into a DataFrame.
I have the following code:
simulation = housesDF.flatMap(lambda house: goThroughAB(jobId, house))
print simulation.toDF().show()
Within that, I am calling these helper methods:
def simulate(jobId, house, a, b):
    return Row(jobId=jobId, house=house, a=a, b=b, myVl=[i for i in range(10)])

def goThroughAB(jobId, house):
    print "in goThroughAB"
    results = []
    for a in as:
        for b in bs:
            results += simulate(jobId, house, a, b)
    print type(results)
    return results
Strangely enough, print "in goThroughAB" doesn't have any effect: there is no output on the screen.
However, I am getting this error:
---> 23 print simulation.toDF().show()
24
25 dfRow = sqlContext.createDataFrame(simulationResults)
/databricks/spark/python/pyspark/sql/context.py in toDF(self, schema, sampleRatio)
62 [Row(name=u'Alice', age=1)]
63 """
---> 64 return sqlContext.createDataFrame(self, schema, sampleRatio)
65
66 RDD.toDF = toDF
/databricks/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
421
422 if isinstance(data, RDD):
--> 423 rdd, schema = self._createFromRDD(data, schema, samplingRatio)
424 else:
425 rdd, schema = self._createFromLocal(data, schema)
/databricks/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, schema, samplingRatio)
308 """
309 if schema is None or isinstance(schema, (list, tuple)):
--> 310 struct = self._inferSchema(rdd, samplingRatio)
311 converter = _create_converter(struct)
312 rdd = rdd.map(converter)
/databricks/spark/python/pyspark/sql/context.py in _inferSchema(self, rdd, samplingRatio)
261
262 if samplingRatio is None:
--> 263 schema = _infer_schema(first)
264 if _has_nulltype(schema):
265 for row in rdd.take(100)[1:]:
/databricks/spark/python/pyspark/sql/types.py in _infer_schema(row)
829
830 else:
--> 831 raise TypeError("Can not infer schema for type: %s" % type(row))
832
833 fields = [StructField(k, _infer_type(v), True) for k, v in items]
TypeError: Can not infer schema for type: <type 'str'>
On this line:
print simulation.toDF().show()
So it looks like goThroughAB is not executed, which means the flatMap may not be executed.
What is the issue with the code?
First, you are not printing on the driver but on the Spark executors. As you know, the executors are remote processes which execute Spark tasks in parallel. They do print that line, but to their own console. You don't know which executor runs a given partition, and you should never rely on print statements in a distributed environment.
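If you need to inspect what the executors produce, pull a small sample back to the driver instead, where print is visible (a quick sketch):
for r in simulation.take(5):
    print r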
Then, the actual problem is that when you want to create the DataFrame, Spark needs to know the schema for the table. If you don't specify it, it will use the sampling ratio and check some rows to determine their types. If you do not specify the sampling ratio, it will only check the first row. This happens in your case, and you probably have a field for which the type cannot be determined (it is probably null).
To solve this, either pass a schema to the toDF() method or specify a non-zero sampling ratio. The schema could be created in advance like this:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([StructField("int_field", IntegerType()),
                     StructField("string_field", StringType())])
This code is not correct: results += simulate(jobId, house, a, b) does not append a Row object. Since Row behaves like a tuple, += extends the list with the Row's individual field values, so the Row objects you build never reach toDF intact.
The key issue, as pointed out by others, is results += simulate(jobId, house, a, b), which won't work as intended when simulate returns a Row object. You could keep results as a list and use list.append, but why not yield?
def goThroughAB(jobId, house):
    print "in goThroughAB"
    for a in as:
        for b in bs:
            yield simulate(jobId, house, a, b)
What happens when you + two Row objects?
In[9]:
from pyspark.sql.types import Row
Row(a='a', b=1) + Row(a='b', b=2)
Out[9]:
('a', 1, 'b', 2)
Then toDF sampled the first element and found it to be a str (your jobId), hence the complaint:
TypeError: Can not infer schema for type: <type 'str'>
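The same tuple behaviour is what silently broke the original loop: += extends a list with a Row's fields instead of appending the Row itself. A quick sketch with the same toy Row:
results = []
results += Row(a='a', b=1)   # extends with the fields, not the Row
# results is now ['a', 1] -- so flatMap emits plain values like 'a'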
Related
I have a dataframe df created through an import from a MySQL database:
ID CONFIG
0 276 {"pos":[{"type":"geo...
1 349 {"pos":[{"type":"geo...
2 378 {"pos":[{"type":"geo...
3 381 {"pos":[{"type":"geo...
4 385 {"pos":[{"type":"geo...
where the elements in the CONFIG column all have the form:
{"posit":[{"type":"geo_f","priority":1,"validity":0},{"type":"geo_m","priority":2,"validity":0},{"type":"geo_i","priority":3,"validity":0},{"type":"geo_c","priority":4,"validity":0}]}
Now, I was convinced these elements are JSON-type elements and tried the following method to transform them into columns:
df_new = pd.json_normalize(df['CONFIG'])
However, this returns the following error:
AttributeError: 'str' object has no attribute 'values'
What am I missing? Thankful for any help!
EDIT: Full Traceback
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-22-23db4c0afdab> in <module>
----> 1 df_new = pd.json_normalize(df['CONFIG'])
c:\users\s-degossondevarennes\appdata\local\programs\python\python37\lib\site-packages\pandas\io\json\_normalize.py in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
268
269 if record_path is None:
--> 270 if any([isinstance(x, dict) for x in y.values()] for y in data):
271 # naive normalization, this is idempotent for flat records
272 # and potentially will inflate the data considerably for
c:\users\s-degossondevarennes\appdata\local\programs\python\python37\lib\site-packages\pandas\io\json\_normalize.py in <genexpr>(.0)
268
269 if record_path is None:
--> 270 if any([isinstance(x, dict) for x in y.values()] for y in data):
271 # naive normalization, this is idempotent for flat records
272 # and potentially will inflate the data considerably for
AttributeError: 'str' object has no attribute 'values'
The first issue is that the values in the CONFIG column are strings in disguise, so literal_eval can turn them into true dictionaries. They are then all nested under the "posit" key, which we'd better get rid of. That leaves us with lists, so explode comes in. Overall:
from ast import literal_eval
pd.json_normalize(df['CONFIG'].apply(lambda x: literal_eval(x)["posit"]).explode())
For a one-row sample of the data, I get:
type priority validity
0 geo_f 1 0
1 geo_m 2 0
2 geo_i 3 0
3 geo_c 4 0
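If you also want to keep the ID column next to the normalized rows, one option (a sketch under the same literal_eval assumption; the name posit_rows is mine) is to explode first, then restore the original index and join:
from ast import literal_eval
posit_rows = df['CONFIG'].apply(lambda x: literal_eval(x)['posit']).explode()
# json_normalize resets the index, so restore the original (duplicated) index
out = pd.json_normalize(posit_rows).set_index(posit_rows.index).join(df['ID'])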
I am trying to create a function that returns either the mean, median, or standard deviation of all columns in a Pandas DataFrame using NumPy functions.
It is for a school assignment, so there's no reason for using NumPy other than it is what is being asked of me. I am struggling to figure out how to use a NumPy function with a Pandas DataFrame for this problem.
Here is the text of the problem.
The code cell below contains a function called comp_sample_stat that accepts 2 parameters: df, which contains data from the Dow Jones index for a particular company, and stat, which will contain 1 of the 3 strings: "mean", "std", or "median".
For this problem:
if the stat is equal to "mean" return the mean of the dataframe columns using numpy's mean function
if the stat is equal to "median" return the median of the dataframe columns using numpy's median function
if the stat is equal to "std" return the std of the dataframe columns using numpy's std function
Here is the function I have written.
def comp_sample_stat(df, stat='mean'):
    '''
    Computes a sample statistic for any dataframe passed in

    Parameters
    ----------
    df: Pandas dataframe

    Returns
    -------
    a pandas dataframe
    '''
    df_mean = df.apply(np.mean(df))
    df_median = df.apply(np.median(df))
    df_std = df.apply(np.std(df))

    if stat is str('std'):
        return df_std
    elif stat is str('median'):
        return df_median
    else:
        return df_mean
df is a DataFrame that has been defined previously in my assignment as follows:
def read_data(file_path):
    '''
    Reads in a dataset using pandas.

    Parameters
    ----------
    file_path : string containing path to a file

    Returns
    -------
    pandas dataframe with data read in from the file path
    '''
    read_file = pd.read_csv(file_path)
    new_df = pd.DataFrame(read_file)
    return new_df
df = read_data('data/dow_jones_index.data')
The variable df_AA has also been previously defined as follows:
def select_stock(df, symbol):
    '''
    Selects data only containing a particular stock symbol.

    Parameters
    ----------
    df: dataframe containing data from the dow jones index
    symbol: string containing the stock symbol to select

    Returns
    -------
    dataframe containing a particular stock
    '''
    stock = df[df.stock == symbol]
    return stock
df_AA = select_stock(df.copy(), 'AA')
When I call the function within a Jupyter Notebook as follows:
comp_sample_stat(df_AA)
I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-17-a2bcbeedcc56> in <module>()
22 return df_mean
23
---> 24 comp_sample_stat(df_AA)
<ipython-input-17-a2bcbeedcc56> in comp_sample_stat(df, stat)
11 a pandas dataframe
12 '''
---> 13 df_mean = df.apply(np.mean(df))
14 df_median = df.apply(np.median(df))
15 df_std = df.apply(np.std(df))
/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6012 args=args,
6013 kwds=kwds)
-> 6014 return op.get_result()
6015
6016 def applymap(self, func):
/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in get_result(self)
316 *self.args, **self.kwds)
317
--> 318 return super(FrameRowApply, self).get_result()
319
320 def apply_broadcast(self):
/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in get_result(self)
140 return self.apply_raw()
141
--> 142 return self.apply_standard()
143
144 def apply_empty_result(self):
/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in apply_standard(self)
246
247 # compute the result using the series generator
--> 248 self.apply_series_generator()
249
250 # wrap results
/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in apply_series_generator(self)
275 try:
276 for i, v in enumerate(series_gen):
--> 277 results[i] = self.f(v)
278 keys.append(v.name)
279 except Exception as e:
TypeError: ("'Series' object is not callable", 'occurred at index quarter')
DataFrame.apply expects you to pass it a function, not a dataframe. So you should be passing np.mean without arguments.
That is, you should be doing something like this:
df_mean = df.apply(np.mean)
The docs.
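Putting it together, a corrected version of the function might look like the sketch below. It also swaps the stat is str('std') comparisons for ==, since is tests object identity and is not a reliable way to compare strings, and it assumes the aggregated columns are numeric:
import numpy as np

def comp_sample_stat(df, stat='mean'):
    '''Computes a sample statistic for the dataframe passed in.'''
    # pass the function object itself; apply calls it on each column
    if stat == 'std':
        return df.apply(np.std)
    elif stat == 'median':
        return df.apply(np.median)
    else:
        return df.apply(np.mean)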
I came across this discussion (from a year ago): https://github.com/bokeh/bokeh/issues/2392
I also saw the white screen without any errors, and then I tried to take a small subset of 2 columns and tried the below.
Since pandas just pulls in a bunch of rows with empty data as well, I tried dropna, but that resulted in there being no data at all. So instead I just specified the rows that should go into the df (hence the df = df.head(n=19) line):
import pandas as pd
from bokeh.plotting import figure, output_file, show
df = pd.read_excel(path, sheetname, parse_cols="A:B")
df = df.head(n=19)
print(df)
rtngs = ['iAAA','iAA+','iAA','iAA-','iA+','iA','iA-','iBBB+','iBBB','iBBB-','iBB+','iBB','iBB-','iB+','iB','iB-','NR','iCCC+']
x= df['Score']
output_file("line.html")
p = figure(plot_width=400, plot_height=400, x_range=(0,100),y_range=rtngs)
# add a circle renderer with a size, color, and alpha
p.circle(df['Score'], df['Rating'], size=20, color="navy", alpha=0.5)
# show the results
#output_notebook()
show(p)
df:
Rating Score
0 iAAA 64.0
1 iAA+ 33.0
2 iAA 7.0
3 iAA- 28.0
4 iA+ 36.0
5 iA 62.0
6 iA- 99.0
7 iBBB+ 10.0
8 iBBB 93.0
9 iBBB- 91.0
10 iBB+ 79.0
11 iBB 19.0
12 iBB- 95.0
13 iB+ 26.0
14 iB 9.0
15 iB- 26.0
16 NR 49.0
17 iCCC+ 51.0
18 iAAA 18.0
The above is showing me an output within the notebook, but it still throws ValueError: Out of range float values are not JSON compliant, and presumably that is also why it doesn't produce the output file. How do I get rid of this error for this small subset? Is it related to NaN values? Would that also solve the 'white screen of death' issue for the larger dataset?
Thanks very much for taking a look!
In case you would like to see the entire error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-4fa6b88aa415> in <module>()
16 # show the results
17 #output_notebook()
---> 18 show(p)
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\site-packages\bokeh\io.py in show(obj, browser, new)
300 if obj not in _state.document.roots:
301 _state.document.add_root(obj)
--> 302 return _show_with_state(obj, _state, browser, new)
303
304
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\site-packages\bokeh\io.py in _show_with_state(obj, state, browser, new)
310
311 if state.notebook:
--> 312 comms_handle = _show_notebook_with_state(obj, state)
313 shown = True
314
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\site-packages\bokeh\io.py in _show_notebook_with_state(obj, state)
334 comms_target = make_id()
335 publish_display_data({'text/html': notebook_div(obj, comms_target)})
--> 336 handle = _CommsHandle(get_comms(comms_target), state.document, state.document.to_json())
337 state.last_comms_handle = handle
338 return handle
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\site-packages\bokeh\document.py in to_json(self)
792 # this is a total hack to go via a string, needed because
793 # our BokehJSONEncoder goes straight to a string.
--> 794 doc_json = self.to_json_string()
795 return loads(doc_json)
796
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\site-packages\bokeh\document.py in to_json_string(self, indent)
785 }
786
--> 787 return serialize_json(json, indent=indent)
788
789 def to_json(self):
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\site-packages\bokeh\core\json_encoder.py in serialize_json(obj, encoder, indent, **kwargs)
97 indent = 2
98
---> 99 return json.dumps(obj, cls=encoder, allow_nan=False, indent=indent, separators=separators, sort_keys=True, **kwargs)
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\json\__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
235 check_circular=check_circular, allow_nan=allow_nan, indent=indent,
236 separators=separators, default=default, sort_keys=sort_keys,
--> 237 **kw).encode(obj)
238
239
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\json\encoder.py in encode(self, o)
197 # exceptions aren't as detailed. The list call should be roughly
198 # equivalent to the PySequence_Fast that ''.join() would do.
--> 199 chunks = self.iterencode(o, _one_shot=True)
200 if not isinstance(chunks, (list, tuple)):
201 chunks = list(chunks)
C:\Users\x\AppData\Local\Continuum\Anaconda3\lib\json\encoder.py in iterencode(self, o, _one_shot)
255 self.key_separator, self.item_separator, self.sort_keys,
256 self.skipkeys, _one_shot)
--> 257 return _iterencode(o, 0)
258
259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,
ValueError: Out of range float values are not JSON compliant
I had the same error and I debugged the problem: I had NaN values in my plotted dataset, and Bokeh's serialize_json() function (in /core/json_encoder.py) does not allow NaN values (I don't know why...); json.dumps() is called there with the allow_nan=False argument. The problem occurs only in the I/O part of the Bokeh process, when the output file is generated (it calls the above serialize_json() function).
So you have to replace NaN values in your dataframe, eg.:
df = df.fillna('')
Nice day! :)
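Filling a numeric column like Score with the empty string changes its dtype, so if the incomplete rows carry no information it may be safer to drop them instead (a sketch using the column names from the question):
# keep only rows where both plotted columns have real values
df = df.dropna(subset=['Rating', 'Score'])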
NaN values will be better supported when this Pull Request to add a binary array serialization option is merged. This should be available in Bokeh 0.12.4 in January 2017. Bokeh does not use allow_nan in the Python JSON encoder, because that is not standard: NaN and Inf are not part of the official JSON specification (an egregious oversight IMO, but out of our control).
Well, it isn't exactly an answer to your question; it's more like my experience from working with Bokeh for a week, in my case trying to make a plot like the Texas example from Bokeh. After a lot of frustration I noticed that when Bokeh (or JSON, or whatever) encounters a NaN as the first value of the list to be plotted (myList[0]), it refuses to plot, giving the message
ValueError: Out of range float values are not JSON compliant
If I change the first value of the list to a float, it works fine even if the list contains NaNs at other positions. Taking this into account, someone who understands how these things work can propose a proper answer; mine is to restructure your data so that the first value isn't a NaN.
After removing the NaN values there might still be infinite values. Trace the whole dataset: it might contain some infinite values as inf, and those need to be removed too. Inspect each column with
df['column'].describe()
then, if you find any inf values, remove those rows with
df = df[~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
reference: solution here
I bumped into this problem and realized it was happening because one column of my DataFrame was filled only with NaNs.
You could instead set it to another value, e.g.:
df['column'] = np.zeros(len(df))
I had this error in this line:
save(plot_lda, 'tsne_lda_viz_{}_{}_{}_{}_{}_{}_{}.html'.format(
num_qualified_tweet, n_topics, threshold, n_iter, num_example, n_top_words, end_date))
I used this repo as a baseline: https://github.com/ShuaiW/twitter-analysis/blob/master/topic_tweets.py (mine)
And I solved it with this code (hope this will be useful for others):
for i in range(X_topics.shape[1]):
    topic_coord[i, 0] = 0 if np.isnan(topic_coord[i, 0]) else topic_coord[i, 0]
    topic_coord[i, 1] = 0 if np.isnan(topic_coord[i, 1]) else topic_coord[i, 1]
    plot_lda.text(topic_coord[i, 0], topic_coord[i, 1], [topic_summaries[i]])
The key is:
var = 0 if np.isnan(number) else number
I had this issue too and solved it by cleaning my dataset: check your dataset and change the value of any null records.
If anyone else comes across this answer: you can specify a parameter on the read_excel method so that it does not use the default NA values:
pd.read_excel(path, sheetname, parse_cols="A:B", keep_default_na=False)
Reference: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.
Pandas to_hdf succeeds but then read_hdf fails when I use custom objects as column headers (I use custom objects because I need to store other info in them).
Is there some way to make this work? Or is this just a Pandas bug or PyTables bug?
As an example, below I first make a DataFrame foo that uses string column headers, and everything works fine with to_hdf/read_hdf; but after changing foo to use a custom Col class for column headers, to_hdf still works fine, while read_hdf raises an AssertionError:
In [48]: foo = pd.DataFrame(np.random.randn(2, 3), columns = ['aaa', 'bbb', 'ccc'])
In [49]: foo
Out[49]:
aaa bbb ccc
0 -0.434303 0.174689 1.373971
1 -0.562228 0.862092 -1.361979
In [50]: foo.to_hdf('foo.h5', 'foo')
In [51]: bar = pd.read_hdf('foo.h5', 'foo')
In [52]: bar
Out[52]:
aaa bbb ccc
0 -0.434303 0.174689 1.373971
1 -0.562228 0.862092 -1.361979
In [52]:
In [53]: class Col(object):
...: def __init__(self, name, other_info):
...: self.name = name
...: self.other_info = other_info
...: def __str__(self):
...: return self.name
...:
In [54]: foo = pd.DataFrame(np.random.randn(2, 3), columns = [Col('aaa', {'z': 5}), Col('bbb', {'y': True}), Col('ccc', {})])
In [55]: foo
Out[55]:
aaa bbb ccc
0 -0.830503 1.066178 1.057349
1 0.406967 -0.131430 1.970204
In [56]: foo.to_hdf('foo.h5', 'foo')
In [57]: bar = pd.read_hdf('foo.h5', 'foo')
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-57-888b061a1d2c> in <module>()
----> 1 bar = pd.read_hdf('foo.h5', 'foo')
/.../python3.4/site-packages/pandas/io/pytables.py in read_hdf(path_or_buf, key, **kwargs)
330
331 try:
--> 332 return store.select(key, auto_close=auto_close, **kwargs)
333 except:
334 # if there is an error, close the store
/.../python3.4/site-packages/pandas/io/pytables.py in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
672 auto_close=auto_close)
673
--> 674 return it.get_result()
675
676 def select_as_coordinates(
/.../python3.4/site-packages/pandas/io/pytables.py in get_result(self, coordinates)
1366
1367 # directly return the result
-> 1368 results = self.func(self.start, self.stop, where)
1369 self.close()
1370 return results
/.../python3.4/site-packages/pandas/io/pytables.py in func(_start, _stop, _where)
665 return s.read(start=_start, stop=_stop,
666 where=_where,
--> 667 columns=columns, **kwargs)
668
669 # create the iterator
/.../python3.4/site-packages/pandas/io/pytables.py in read(self, **kwargs)
2792 blocks.append(blk)
2793
-> 2794 return self.obj_type(BlockManager(blocks, axes))
2795
2796 def write(self, obj, **kwargs):
/.../python3.4/site-packages/pandas/core/internals.py in __init__(self, blocks, axes, do_integrity_check, fastpath)
2180 self._consolidate_check()
2181
-> 2182 self._rebuild_blknos_and_blklocs()
2183
2184 def make_empty(self, axes=None):
/.../python3.4/site-packages/pandas/core/internals.py in _rebuild_blknos_and_blklocs(self)
2271
2272 if (new_blknos == -1).any():
-> 2273 raise AssertionError("Gaps in blk ref_locs")
2274
2275 self._blknos = new_blknos
AssertionError: Gaps in blk ref_locs
UPDATE:
So Jeff answered (a) "this is not supported" and (b) "if you have meta-data then write it to the attributes".
Question 1 regarding (a):
My column header objects have methods to return their properties, etc. For example, instead of a column header string 'x5y3z8' where I would have to parse out the values, I can simply do col_header.x (gives 5) col_header.y (gives 3) etc. This is very object-oriented and pythonic, instead of using a string to store info and having to parse it every time to retrieve info. How do you suggest I replace my current column header objects in a nice way (that's also supported)?
(BTW, you might look at 'x5y3z8' and think hierarchical index works, but that is not the case because not every column header is 'x#y#z#'. I might have one column 'foo' of strings, another one 'bar5baz7' of ints, and another 'x5y3z8' of floats. The column headers aren't uniform.)
Question 2 regarding (a):
When you say it's not supported, are you specifically talking about to_hdf/read_hdf not supporting it, or are you actually saying that Pandas in general doesn't support it? If it's only the HDF5 support that's missing, then I could switch to some other way of saving the DataFrames to disk and have it work, right? Do you foresee any problems with that in the future? Will this ever break with to_pickle/read_pickle, for example? (I lose performance, but got to give up something, right?)
Question 3 regarding (b):
What do you mean by "if you have meta-data then write it to the attributes"? Attributes of what? A simple example would help me a lot. I'm pretty new to Pandas. Thanks!
This is not a supported feature.
It will raise in the next version of pandas (on writing) for format='table'; it should for the fixed format as well, but that's not implemented. This is simply not supported, nor likely to be. You should just use strings. If you have meta-data, then write it to the attributes.
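To illustrate the attributes suggestion: with HDFStore you can stash arbitrary metadata on the stored node's attributes and read it back alongside the frame. A minimal sketch, assuming plain string column names and an attribute name (col_info) that is entirely my own choice:
import numpy as np
import pandas as pd

foo = pd.DataFrame(np.random.randn(2, 3), columns=['aaa', 'bbb', 'ccc'])

with pd.HDFStore('foo.h5') as store:
    store.put('foo', foo, format='table')
    # attach the per-column metadata to the node's attributes
    store.get_storer('foo').attrs.col_info = {'aaa': {'z': 5},
                                              'bbb': {'y': True},
                                              'ccc': {}}

with pd.HDFStore('foo.h5') as store:
    bar = store['foo']
    col_info = store.get_storer('foo').attrs.col_info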
I'm using Pandas version 0.12.0 on Ubuntu 13.04. I'm trying to create a 5D panel object to contain some EEG data split by condition.
How I'm choosing to structure my data:
Let me begin by demonstrating my use of pandas.core.panelnd.create_nd_panel_factory.
Subject = panelnd.create_nd_panel_factory(
    klass_name='Subject',
    axis_orders=['setsize', 'location', 'vfield', 'channels', 'samples'],
    axis_slices={'labels': 'location',
                 'items': 'vfield',
                 'major_axis': 'major_axis',
                 'minor_axis': 'minor_axis'},
    slicer=pd.Panel4D,
    axis_aliases={'ss': 'setsize',
                  'loc': 'location',
                  'vf': 'vfield',
                  'major': 'major_axis',
                  'minor': 'minor_axis'}
    # stat_axis=2 # dafuq is this?
)
Essentially, the organization is as follows:
setsize: an experimental condition, can be 1 or 2
location: an experimental condition, can be "same", "diff" or None
vfield: an experimental condition, can be "lvf" or "rvf"
The last two axes correspond to a DataFrame's major_axis and minor_axis. They have been renamed for clarity:
channels: columns, the EEG channels (129 of them)
samples: rows, the individual samples. samples can be thought of as a time axis.
What I'm trying to do:
Each experimental condition (subject x setsize x location x vfield) is stored in its own tab-delimited file, which I am reading in with pandas.read_table, obtaining a DataFrame object. I want to create one 5-dimensional panel (i.e. Subject) for each subject, which will contain all experimental conditions (i.e. DataFrames) for that subject.
To start, I'm building a nested dictionary for each subject/Subject:
# ... do some boring stuff to get the text files, etc...
for _, factors in df.iterrows():
    # `factors` is a 5-tuple containing
    # (subject number, setsize, location, vfield,
    # and path to the tab-delimited file).
    sn, ss, loc, vf, path = factors
    eeg = pd.read_table(path, sep='\t', names=range(1, 129) + ['ref'],
                        header=None)
    # build nested dict
    subjects.setdefault(sn, {}).setdefault(ss, {}).setdefault(loc, {})[vf] = eeg

# and now attempt to build `Subject`
for sn, d in subjects.iteritems():
    subjects[sn] = Subject(d)
Full stack trace
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-2-831fa603ca8f> in <module>()
----> 1 import_data()
/home/louist/Dropbox/Research/VSTM/scripts/vstmlib.py in import_data()
64
65 import ipdb; ipdb.set_trace()
---> 66 for sn, d in subjects.iteritems():
67 subjects[sn] = Subject(d)
68
/usr/local/lib/python2.7/dist-packages/pandas/core/panelnd.pyc in __init__(self, *args, **kwargs)
65 if 'dtype' not in kwargs:
66 kwargs['dtype'] = None
---> 67 self._init_data(*args, **kwargs)
68 klass.__init__ = __init__
69
/usr/local/lib/python2.7/dist-packages/pandas/core/panel.pyc in _init_data(self, data, copy, dtype, **kwargs)
250 mgr = data
251 elif isinstance(data, dict):
--> 252 mgr = self._init_dict(data, passed_axes, dtype=dtype)
253 copy = False
254 dtype = None
/usr/local/lib/python2.7/dist-packages/pandas/core/panel.pyc in _init_dict(self, data, axes, dtype)
293 raxes = [self._extract_axis(self, data, axis=i)
294 if a is None else a for i, a in enumerate(axes)]
--> 295 raxes_sm = self._extract_axes_for_slice(self, raxes)
296
297 # shallow copy
/usr/local/lib/python2.7/dist-packages/pandas/core/panel.pyc in _extract_axes_for_slice(self, axes)
1477 """ return the slice dictionary for these axes """
1478 return dict([(self._AXIS_SLICEMAP[i], a) for i, a
-> 1479 in zip(self._AXIS_ORDERS[self._AXIS_LEN - len(axes):], axes)])
1480
1481 #staticmethod
KeyError: 'location'
I understand that panelnd is an experimental feature, but I'm fairly certain that I'm doing something wrong. Can somebody please point me in the right direction? If it is a bug, is there something that can be done about it?
As usual, thank you very much in advance!
Working example. You needed to specify the mapping of your axes to the internal axis names via the slices. This fiddles with the internal structure, but the fixed names of pandas still exist (and are somewhat hardcoded via Panel/Panel4D), so you need to provide the mapping.
I would create a Panel4D first, then your Subject, as I did below.
Please post on GitHub / here if you find more bugs. This is not a heavily used feature.
Output
<class 'pandas.core.panelnd.Subject'>
Dimensions: 3 (setsize) x 1 (location) x 1 (vfield) x 10 (channels) x 2 (samples)
Setsize axis: level0_0 to level0_2
Location axis: level1_0 to level1_0
Vfield axis: level2_0 to level2_0
Channels axis: level3_0 to level3_9
Samples axis: level4_1 to level4_2
Code
import pandas as pd
import numpy as np
from pandas.core import panelnd
Subject = panelnd.create_nd_panel_factory(
    klass_name='Subject',
    axis_orders=['setsize', 'location', 'vfield', 'channels', 'samples'],
    axis_slices={'location': 'labels',
                 'vfield': 'items',
                 'channels': 'major_axis',
                 'samples': 'minor_axis'},
    slicer=pd.Panel4D,
    axis_aliases={'ss': 'setsize',
                  'loc': 'labels',
                  'vf': 'items',
                  'major': 'major_axis',
                  'minor': 'minor_axis'})
subjects = dict()
for i in range(3):
    eeg = pd.DataFrame(np.random.randn(10, 2),
                       columns=['level4_1', 'level4_2'],
                       index=["level3_%s" % x for x in range(10)])
    loc, vf = ('level1_0', 'level2_0')
    subjects["level0_%s" % i] = pd.Panel4D({loc: {vf: eeg}})

print Subject(subjects)