Why is my PanelND factory throwing a KeyError?

I'm using Pandas version 0.12.0 on Ubuntu 13.04. I'm trying to create a 5D panel object to contain some EEG data split by condition.
How I'm choosing to structure my data:
Let me begin by demonstrating my use of pandas.core.panelnd.create_nd_panel_factory.
Subject = panelnd.create_nd_panel_factory(
    klass_name='Subject',
    axis_orders=['setsize', 'location', 'vfield', 'channels', 'samples'],
    axis_slices={'labels': 'location',
                 'items': 'vfield',
                 'major_axis': 'major_axis',
                 'minor_axis': 'minor_axis'},
    slicer=pd.Panel4D,
    axis_aliases={'ss': 'setsize',
                  'loc': 'location',
                  'vf': 'vfield',
                  'major': 'major_axis',
                  'minor': 'minor_axis'}
    # stat_axis=2  # not sure what this does
)
Essentially, the organization is as follows:
setsize: an experimental condition, can be 1 or 2
location: an experimental condition, can be "same", "diff" or None
vfield: an experimental condition, can be "lvf" or "rvf"
The last two axes correspond to a DataFrame's major_axis and minor_axis. They have been renamed for clarity:
channels: columns, the EEG channels (129 of them)
samples: rows, the individual samples. samples can be thought of as a time axis.
What I'm trying to do:
Each experimental condition (subject x setsize x location x vfield) is stored in its own tab-delimited file, which I am reading in with pandas.read_table, obtaining a DataFrame object. I want to create one 5-dimensional panel (i.e. Subject) for each subject, which will contain all experimental conditions (i.e. DataFrames) for that subject.
To start, I'm building a nested dictionary for each subject/Subject:
# ... do some boring stuff to get the text files, etc...
for _, factors in df.iterrows():
    # `factors` is a 5-tuple containing
    # (subject number, setsize, location, vfield,
    #  path to the tab-delimited file).
    sn, ss, loc, vf, path = factors
    eeg = pd.read_table(path, sep='\t', names=range(1, 129) + ['ref'], header=None)
    # build nested dict
    subjects.setdefault(sn, {}).setdefault(ss, {}).setdefault(loc, {})[vf] = eeg

# and now attempt to build `Subject`
for sn, d in subjects.iteritems():
    subjects[sn] = Subject(d)
Full stack trace
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-2-831fa603ca8f> in <module>()
----> 1 import_data()
/home/louist/Dropbox/Research/VSTM/scripts/vstmlib.py in import_data()
64
65 import ipdb; ipdb.set_trace()
---> 66 for sn, d in subjects.iteritems():
67 subjects[sn] = Subject(d)
68
/usr/local/lib/python2.7/dist-packages/pandas/core/panelnd.pyc in __init__(self, *args, **kwargs)
65 if 'dtype' not in kwargs:
66 kwargs['dtype'] = None
---> 67 self._init_data(*args, **kwargs)
68 klass.__init__ = __init__
69
/usr/local/lib/python2.7/dist-packages/pandas/core/panel.pyc in _init_data(self, data, copy, dtype, **kwargs)
250 mgr = data
251 elif isinstance(data, dict):
--> 252 mgr = self._init_dict(data, passed_axes, dtype=dtype)
253 copy = False
254 dtype = None
/usr/local/lib/python2.7/dist-packages/pandas/core/panel.pyc in _init_dict(self, data, axes, dtype)
293 raxes = [self._extract_axis(self, data, axis=i)
294 if a is None else a for i, a in enumerate(axes)]
--> 295 raxes_sm = self._extract_axes_for_slice(self, raxes)
296
297 # shallow copy
/usr/local/lib/python2.7/dist-packages/pandas/core/panel.pyc in _extract_axes_for_slice(self, axes)
1477 """ return the slice dictionary for these axes """
1478 return dict([(self._AXIS_SLICEMAP[i], a) for i, a
-> 1479 in zip(self._AXIS_ORDERS[self._AXIS_LEN - len(axes):], axes)])
1480
1481         @staticmethod
KeyError: 'location'
I understand that panelnd is an experimental feature, but I'm fairly certain that I'm doing something wrong. Can somebody please point me in the right direction? If it is a bug, is there something that can be done about it?
As usual, thank you very much in advance!

Working example. You needed to specify the mapping of your axes to the internal axis names via the slices. This fiddles with the internal structure, but the fixed names of pandas still exist (and are somewhat hardcoded via Panel/Panel4D), so you need to provide the mapping.
I would create a Panel4D first, then your Subject, as I did below.
Please post on GitHub / here if you find more bugs. This is not a heavily used feature.
Output
<class 'pandas.core.panelnd.Subject'>
Dimensions: 3 (setsize) x 1 (location) x 1 (vfield) x 10 (channels) x 2 (samples)
Setsize axis: level0_0 to level0_2
Location axis: level1_0 to level1_0
Vfield axis: level2_0 to level2_0
Channels axis: level3_0 to level3_9
Samples axis: level4_1 to level4_2
Code
import pandas as pd
import numpy as np
from pandas.core import panelnd

Subject = panelnd.create_nd_panel_factory(
    klass_name='Subject',
    axis_orders=['setsize', 'location', 'vfield', 'channels', 'samples'],
    axis_slices={'location': 'labels',
                 'vfield': 'items',
                 'channels': 'major_axis',
                 'samples': 'minor_axis'},
    slicer=pd.Panel4D,
    axis_aliases={'ss': 'setsize',
                  'loc': 'labels',
                  'vf': 'items',
                  'major': 'major_axis',
                  'minor': 'minor_axis'})

subjects = dict()
for i in range(3):
    eeg = pd.DataFrame(np.random.randn(10, 2),
                       columns=['level4_1', 'level4_2'],
                       index=["level3_%s" % x for x in range(10)])
    loc, vf = ('level1_0', 'level2_0')
    subjects["level0_%s" % i] = pd.Panel4D({loc: {vf: eeg}})

print Subject(subjects)
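Once built, indexing on the outermost axis should hand back the Panel4D slicer, and further indexing walks down to the underlying DataFrame. A rough sketch only; panelnd is experimental and the exact accessor behavior may differ between versions:
subj = Subject(subjects)
p4d = subj['level0_0']                 # Panel4D: location x vfield x channels x samples (sketch)
frame = p4d['level1_0']['level2_0']    # DataFrame: channels x samples (sketch)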

Related

Proximityhash TypeError: cannot convert the series to <class 'float'>

import proximityhash

# filtering the dataset with the required columns
df_new = df.filter(['latitude', 'longitude', 'cell_radius'])

# assign the column values to a variable
latitude = df_new['latitude']
longitude = df_new['longitude']
radius = df_new['cell_radius']
precision = 7

# passing the variables as the parameters to the proximityhash library
# getting the values and assigning those to a new column as proximityhash
df_new['proximityhash'] = df_new.apply([proximityhash.create_geohash(latitude, longitude, radius, precision=7)])

print(df_new)
I used this code where I imported a dataset, filtered the necessary columns into three variables (latitude, longitude, radius), and tried to create a new column "proximityhash" in the new dataframe, but it shows an error like below:
TypeError Traceback (most recent call last)
Input In [29], in <cell line: 15>()
11 import pygeohash as gh
13 import proximityhash
---> 15 df_new['proximityhash']=df_new.apply([proximityhash.create_geohash(latitude,longitude,radius,precision=7)])
17 print(df_new)
File ~\Anaconda3\lib\site-packages\proximityhash.py:57, in create_geohash(latitude, longitude, radius, precision, georaptor_flag, minlevel, maxlevel)
54 height = (grid_height[precision - 1])/2
55 width = (grid_width[precision-1])/2
---> 57 lat_moves = int(math.ceil(radius / height)) #4
58 lon_moves = int(math.ceil(radius / width)) #2
60 for i in range(0, lat_moves):
File ~\Anaconda3\lib\site-packages\pandas\core\series.py:191, in _coerce_method.<locals>.wrapper(self)
189 if len(self) == 1:
190 return converter(self.iloc[0])
--> 191 raise TypeError(f"cannot convert the series to {converter}")
TypeError: cannot convert the series to <class 'float'>
I figured out a way to solve this; posting the answer since it might be helpful for others.
Define a function and apply it row by row to the relevant columns:
# filtering the dataset with the required columns
df_new = df[['latitude', 'longitude', 'cell_radius']]

# take only the first 100 rows (since running the whole process might kill the kernel)
df_new = df_new.iloc[:100, ]

# predefined precision value
precision = 7

def PH(row):
    latitude = row['latitude']
    longitude = row['longitude']
    cell_radius = row['cell_radius']
    row['proximityhash'] = [proximityhash.create_geohash(float(latitude), float(longitude), float(cell_radius), precision=7)]
    return row

df_new = df_new.apply(PH, axis=1)
df_new['proximityhash'] = pd.Series(df_new['proximityhash'], dtype="string")
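The same idea can be written a bit more directly by returning just the geohash from the applied function instead of the whole row. A sketch under the same assumptions about the df_new columns:
df_new['proximityhash'] = df_new.apply(
    lambda row: proximityhash.create_geohash(float(row['latitude']),
                                             float(row['longitude']),
                                             float(row['cell_radius']),
                                             precision=precision),
    axis=1)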

KeyError on 'InvoiceYearMonth'. The code was working well when written earlier in a Jupyter Notebook in PyCharm

# Revenue = Active Customer Count * Order Count * Average Revenue per Order

# converting the type of Invoice Date field from string to datetime
tx_data['InvoiceDate'] = pd.to_datetime(tx_data['InvoiceDate'])

# creating YearMonth field for the ease of reporting and visualization
tx_data['InvoiceYearMonth'] = tx_data['InvoiceDate'].map(lambda date: 100*date.year + date.month)

# calculate Revenue for each row and create a new dataframe with YearMonth - Revenue columns
tx_data['Revenue'] = tx_data['UnitPrice'] * tx_data['Quantity']
tx_revenue = tx_data.groupby(['InvoiceYearMonth'])['Revenue'].sum().reset_index()
tx_revenue

# creating a new dataframe with UK customers only
tx_uk = tx_data.query("Country=='United Kingdom'").reset_index(drop=True)

# creating monthly active customers dataframe by counting unique Customer IDs
tx_monthly_active = tx_uk.groupby('InvoiceYearMonth')['CustomerID'].nunique().reset_index()

# print the dataframe
tx_monthly_active

# plotting the output
plot_data = [
    go.Bar(
        x=tx_monthly_active.query['InvoiceYearMonth'],
        y=tx_monthly_active.query['CustomerID'],
    )
]
plot_layout = go.Layout(
    xaxis={"type": "category"},
    title='Monthly Active Customers'
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
It was working in the code I had written earlier, but it is showing an error here. I would really appreciate a solution. I am using Jupyter Notebook in PyCharm. I cannot really figure out what the issue is. I am still new to programming, so I am finding it a bit difficult to navigate this issue.
KeyError Traceback (most recent call last)
<ipython-input-26-82f7e61120b9> in <module>
3
4 #creating monthly active customers dataframe by counting unique Customer IDs
----> 5 tx_monthly_active = tx_uk.groupby('InvoiceYearMonth')['CustomerID'].nunique().reset_index()
6
7 #print the dataframe
c:\users\aayus\pycharmprojects\helloworld\venv\lib\site-packages\pandas\core\frame.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed)
5799 axis = self._get_axis_number(axis)
5800
-> 5801 return groupby_generic.DataFrameGroupBy(
5802 obj=self,
5803 keys=by,
c:\users\aayus\pycharmprojects\helloworld\venv\lib\site-packages\pandas\core\groupby\groupby.py in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated)
400 from pandas.core.groupby.grouper import get_grouper
401
--> 402 grouper, exclusions, obj = get_grouper(
403 obj,
404 keys,
c:\users\aayus\pycharmprojects\helloworld\venv\lib\site-packages\pandas\core\groupby\grouper.py in get_grouper(obj, key, axis, level, sort, observed, mutated, validate)
596 in_axis, name, level, gpr = False, None, gpr, None
597 else:
--> 598 raise KeyError(gpr)
599 elif isinstance(gpr, Grouper) and gpr.key is not None:
600 # Add key to exclusions
KeyError: 'InvoiceYearMonth'
Here is the solution. It was a basic syntax error.
Instead of typing this
tx_monthly_active = tx_uk.groupby('InvoiceYearMonth')['CustomerID'].nunique().reset_index()
we have to type
tx_monthly_active = tx_uk.groupby(['InvoiceYearMonth'])['CustomerID'].nunique().reset_index()
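If the KeyError persists, it is also worth confirming that the column actually exists on the UK subset before grouping. A small debugging check (standard pandas, nothing specific to this dataset):
print('InvoiceYearMonth' in tx_uk.columns)   # should print True before the groupby
tx_monthly_active = tx_uk.groupby(['InvoiceYearMonth'])['CustomerID'].nunique().reset_index()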

Is there a way to fix maximum recursion level in python 3?

I'm trying to build a state map for data across a decade, with a slider to select the year displayed on the map. The sort of display where a user can pick 2014 and the map will show the data for 2014.
I merged the data I want to show with the appropriate shapefile. I end up with 733 rows and 5 columns - as many as 9 rows per county with the same county name and coordinates.
Everything seems to be okay until I try to build the map. This error message is returned:
OverflowError: Maximum recursion level reached
I've tried resetting the recursion limit using sys.setrecursionlimit but can't get past that error.
I haven't been able to find an answer on SO that I understand, so I'm hoping someone can point me in the right direction.
I'm using bokeh and json to build the map. I've tried using sys.setrecursionlimit but I get the same error message no matter how high I go.
I used the same code last week but couldn't get data from different years to display because I was using a subset of the data. Now that I've fixed that, I'm stuck on this error message.
def json_data(selectedYear):
    yr = selectedYear
    murders = murder[murder['Year'] == yr]
    merged = mergedfinal
    merged.fillna('0', inplace=True)
    merged_json = json.loads(merged.to_json())
    json_data = json.dumps(merged_json)
    return json_data

geosource = GeoJSONDataSource(geojson=json_data(2018))

palette = brewer['YlOrRd'][9]
palette = palette[::-1]
color_mapper = LinearColorMapper(palette=palette, low=0, high=60, nan_color='#d9d9d9')
hover = HoverTool(tooltips=[('County/City', '@NAME'), ('Victims', '@Victims')])
color_bar = ColorBar(color_mapper=color_mapper, label_standoff=8, width=500, height=30,
                     border_line_color=None, location=(0, 0),
                     orientation='horizontal')

p = figure(title='Firearm Murders in Virginia', plot_height=600, plot_width=950,
           toolbar_location=None, tools=[hover])
p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = None
p.xaxis.visible = False
p.yaxis.visible = False
p.patches('xs', 'ys', source=geosource,
          fill_color={'field': 'Victims', 'transform': color_mapper},
          line_color='black', line_width=0.25, fill_alpha=1)
p.add_layout(color_bar, 'below')

def update_plot(attr, old, new):
    year = slider.value
    new_data = json_data(year)
    geosource.geojson = new_data
    p.title.text = 'Firearm Murders in VA'

slider = Slider(title='Year', start=2009, end=2018, step=1, value=2018)
slider.on_change('value', update_plot)
layout = column(p, widgetbox(slider))
curdoc().add_root(layout)
output_notebook()
show(layout)
The same code worked well enough when I was using a more limited dataset. Here is the full context of the error message:
OverflowError Traceback (most recent call last)
<ipython-input-50-efd821491ac3> in <module>()
8 return json_data
9
---> 10 geosource = GeoJSONDataSource(geojson = json_data(2018))
11
12 palette=brewer['YlOrRd'][9]
<ipython-input-50-efd821491ac3> in json_data(selectedYear)
4 merged = mergedfinal
5 merged.fillna('0', inplace = True)
----> 6 merged_json = json.loads(merged.to_json())
7 json_data = json.dumps(merged_json)
8 return json_data
/Users/mcuddy/anaconda/lib/python3.6/site-packages/pandas/core/generic.py in to_json(self, path_or_buf, orient, date_format, double_precision, force_ascii, date_unit, default_handler, lines)
1087 force_ascii=force_ascii, date_unit=date_unit,
1088 default_handler=default_handler,
-> 1089 lines=lines)
1090
1091 def to_hdf(self, path_or_buf, key, **kwargs):
/Users/mcuddy/anaconda/lib/python3.6/site-packages/pandas/io/json.py in to_json(path_or_buf, obj, orient, date_format, double_precision, force_ascii, date_unit, default_handler, lines)
37 obj, orient=orient, date_format=date_format,
38 double_precision=double_precision, ensure_ascii=force_ascii,
---> 39 date_unit=date_unit, default_handler=default_handler).write()
40 else:
41 raise NotImplementedError("'obj' should be a Series or a DataFrame")
/Users/mcuddy/anaconda/lib/python3.6/site-packages/pandas/io/json.py in write(self)
83 date_unit=self.date_unit,
84 iso_dates=self.date_format == 'iso',
---> 85 default_handler=self.default_handler)
86
87
OverflowError: Maximum recursion level reached
I had a similar problem!
I narrowed my problem down to the .to_json step. For some reason when I merged my geopandas file on the right:
Neighbourhoods_merged = df_2016.merge(gdf_neighbourhoods, how = "left", on = "Neighbourhood#")
I ran into the recursion error. I found success by switching the two:
Neighbourhoods_merged = gdf_neighbourhoods.merge(df_2016, how = "left", on = "Neighbourhood#")
This is what worked for me. Infuriatingly I have no idea why this works, but I hope this might help someone else with the same error!
I solved this problem by changing the merge direction.
So, if you want to merge two dataframes A and B, where A is a geopandas.geodataframe.GeoDataFrame and B is a pandas.core.frame.DataFrame, you should merge them with pd.merge(A, B, on="some column"), not in the opposite direction.
I think the maximum recursion error comes when you execute the .to_json() method on a plain pandas dataframe that has a POLYGON-type column in it.
When you change the direction of the merge so that the result is a GeoDataFrame, .to_json() executes without problems even with a POLYGON-type column in it.
I spent 2 hours on this, and I hope this can help you.
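A minimal sketch of that pattern (the file names and the join column here are made up for illustration): keeping the GeoDataFrame on the left of the merge keeps the result a GeoDataFrame, and GeoDataFrame.to_json() serializes the POLYGON geometry as GeoJSON instead of recursing into it.
import geopandas as gpd
import pandas as pd

gdf_counties = gpd.read_file('counties.shp')   # GeoDataFrame with a POLYGON geometry column (hypothetical file)
df_stats = pd.read_csv('stats.csv')            # plain DataFrame with per-county values (hypothetical file)

# GeoDataFrame on the left keeps the merged result a GeoDataFrame
merged = gdf_counties.merge(df_stats, how='left', on='NAME')
print(type(merged))    # <class 'geopandas.geodataframe.GeoDataFrame'>

# GeoDataFrame.to_json() handles the geometry column without the recursion error
geojson = merged.to_json()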
If you need a higher recursion depth, you can set it using sys:
import sys
sys.setrecursionlimit(1500)
That being said, your error is most likely the result of an infinite recursion, which may be the case if increasing the depth doesn't fix it.

How to create new columns based on multiple conditions in other columns using a for loop?

I am trying to write a for loop that creates new columns with boolean values indicating whether both of the columns being referenced contain True values. I'd like this loop to run through existing columns and compare, but I'm not sure how to get the loop to do so. Thus far, I have been trying to use lists that refer to the different columns. Code follows:
import pandas as pd
import numpy as np
elig = pd.read_excel('spreadsheet.xlsx')
elig['ELA'] = elig['SELECTED_EXAMS'].str.match('.*English Language Arts.*')
elig['LivEnv'] = elig['SELECTED_EXAMS'].str.match('.*Living Environment.*')
elig['USHist'] = elig['SELECTED_EXAMS'].str.match('.*US History.*')
elig['Geometry'] = elig['SELECTED_EXAMS'].str.match('.*Geometry.*')
elig['AlgebraI'] = elig['SELECTED_EXAMS'].str.match('.*Algebra I.*')
elig['GlobalHistory'] = elig['SELECTED_EXAMS'].str.match('.*Global History.*')
elig['Physics'] = elig['SELECTED_EXAMS'].str.match('.*Physics.*')
elig['AlgebraII'] = elig['SELECTED_EXAMS'].str.match('.*Algebra II.*')
elig['EarthScience'] = elig['SELECTED_EXAMS'].str.match('.*Earth Science.*')
elig['Chemistry'] = elig['SELECTED_EXAMS'].str.match('.*Chemistry.*')
elig['LOTE Spanish'] = elig['SELECTED_EXAMS'].str.match('.*LOTE – Spanish.*')
# CHANGE TO LOOP--enter columns for instances in which scorers overlap competencies (e.g. can score two different exams). This is helpful in the event that two exams are scored on the same day, and we need to resolve numbers of scorers.
exam_list = ['ELA','LiveEnv','USHist','Geometry','AlgebraI','GlobalHistory','Physics','AlgebraII','EarthScience','Chemistry','LOTE Spanish']
nestedExam_list = ['ELA','LiveEnv','USHist','Geometry','AlgebraI','GlobalHistory','Physics','AlgebraII','EarthScience','Chemistry','LOTE Spanish']
for exam in exam_list:
    for nestedExam in nestedExam_list:
        elig[exam+nestedExam+' Overlap'] = np.where((elig[exam]==True)&(elig[nestedExam]==True,),True,False)
I think the issue is with np.where(): I want exam and nestedExam to refer to the columns in question, but instead they're just referencing the list items. Error message follows:
ValueError Traceback (most recent call last)
<ipython-input-33-9347975b8865> in <module>
3 for exam in exam_list:
4 for nestedExam in nestedExam_list:
----> 5 elig[exam+nestedExam+' Overlap'] = np.where((elig[exam]==True)&(elig[nestedExam]==True,),True,False)
6
7 """
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\ops.py in wrapper(self, other)
1359
1360 res_values = na_op(self.values, other)
-> 1361 unfilled = self._constructor(res_values, index=self.index)
1362 return filler(unfilled).__finalize__(self)
1363
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py in __init__(self, data, index, dtype, name, copy, fastpath)
260 'Length of passed values is {val}, '
261 'index implies {ind}'
--> 262 .format(val=len(data), ind=len(index)))
263 except TypeError:
264 pass
ValueError: Length of passed values is 1, index implies 26834
Can someone help me out with this?
First, to go through your combinations more efficiently and without double-counting, I'd recommend the built-in itertools library.
import itertools

exam_list = ['A', 'B', 'C', 'D']
for exam1, exam2 in itertools.combinations(exam_list, 2):
    print(exam1 + '_' + exam2)
A_B
A_C
A_D
B_C
B_D
C_D
If you actually need all possible orders/combinations, you can substitute permutations for combinations
To deal with the actual issue, you actually need a whole lot less code to do what you want. If you have two columns elig[exam1] and elig[exam2] that are both boolean arrays, then the array where both are true is (elig[exam1] & elig[exam2]). This is called a "bit-wise" or "logical and" operation.
For example:
df = pd.DataFrame({'A': ['car', 'cat', 'hat']})
df['start=c'] = df['A'].str.startswith('c')
df['end=t'] = df['A'].str.endswith('t')
df['both'] = df['start=c'] & df['end=t']
A start=c end=t both
0 car True False False
1 cat True True True
2 hat False True False
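Putting the two pieces together for the original problem, something like this sketch (assuming elig already has the boolean exam columns created above) builds all the pairwise overlap columns without double-counting:
import itertools

exam_list = ['ELA', 'LivEnv', 'USHist', 'Geometry', 'AlgebraI', 'GlobalHistory',
             'Physics', 'AlgebraII', 'EarthScience', 'Chemistry', 'LOTE Spanish']

for exam1, exam2 in itertools.combinations(exam_list, 2):
    elig[exam1 + exam2 + ' Overlap'] = elig[exam1] & elig[exam2]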

Can not apply flatMap on RDD

In PySpark, for each element of an RDD, I'm trying to get an array of Row elements. Then I want to convert the result into a DataFrame.
I have the following code:
simulation = housesDF.flatMap(lambda house: goThroughAB(jobId, house))
print simulation.toDF().show()
Within that, I am calling these helper methods:
def simulate(jobId, house, a, b):
    return Row(jobId=jobId, house=house, a=a, b=b, myVl=[i for i in range(10)])

def goThroughAB(jobId, house):
    print "in goThroughAB"
    results = []
    for a in as:
        for b in bs:
            results += simulate(jobId, house, a, b)
    print type(results)
    return results
Strangely enough print "in goThroughAB" doesn't have any effect, as there is no output on the screen.
However, I am getting this error:
---> 23 print simulation.toDF().show()
24
25 dfRow = sqlContext.createDataFrame(simulationResults)
/databricks/spark/python/pyspark/sql/context.py in toDF(self, schema, sampleRatio)
62 [Row(name=u'Alice', age=1)]
63 """
---> 64 return sqlContext.createDataFrame(self, schema, sampleRatio)
65
66 RDD.toDF = toDF
/databricks/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
421
422 if isinstance(data, RDD):
--> 423 rdd, schema = self._createFromRDD(data, schema, samplingRatio)
424 else:
425 rdd, schema = self._createFromLocal(data, schema)
/databricks/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, schema, samplingRatio)
308 """
309 if schema is None or isinstance(schema, (list, tuple)):
--> 310 struct = self._inferSchema(rdd, samplingRatio)
311 converter = _create_converter(struct)
312 rdd = rdd.map(converter)
/databricks/spark/python/pyspark/sql/context.py in _inferSchema(self, rdd, samplingRatio)
261
262 if samplingRatio is None:
--> 263 schema = _infer_schema(first)
264 if _has_nulltype(schema):
265 for row in rdd.take(100)[1:]:
/databricks/spark/python/pyspark/sql/types.py in _infer_schema(row)
829
830 else:
--> 831 raise TypeError("Can not infer schema for type: %s" % type(row))
832
833 fields = [StructField(k, _infer_type(v), True) for k, v in items]
TypeError: Can not infer schema for type: <type 'str'>
On this line:
print simulation.toDF().show()
So it looks like goThroughAB is not executed, which means the flatMap may not be executed.
What is the issue with the code?
First, you are not printing on the driver but on the Spark executors. As you know, executors are remote processes that execute Spark tasks in parallel. They do print that line, but on their own console. You don't know which executor runs a given partition, and you should never rely on print statements in a distributed environment.
Then the problem is that when you want to create the DataFrame, Spark needs to know the schema of the table. If you don't specify it, it will use the sampling ratio and check some rows in order to determine their types. If you do not specify the sampling ratio, it will only check the first row. This happens in your case, and you probably have a field for which the type cannot be determined (it is probably null).
To solve this, you should either pass the schema to the toDF() method or specify a non-zero sampling ratio. The schema could be created in advance like this:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([StructField("int_field", IntegerType()),
                     StructField("string_field", StringType())])
This code is not correct. results += simulate(jobId, house, a, b) would try to concatenate a Row and fail. If you don't see a TypeError, that line is not reached and your code fails somewhere else, probably when you create housesDF.
The key issue, as pointed out by others, is results += simulate(jobId, house, a, b), which won't work as intended when simulate returns a Row object. You could make results a list and use list.append, but why not yield?
def goThroughAB(jobId, house):
    print "in goThroughAB"
    for a in as:
        for b in bs:
            yield simulate(jobId, house, a, b)
What happens when you + two Row objects?
In [9]: from pyspark.sql.types import Row
   ...: Row(a='a', b=1) + Row(a='b', b=2)
Out[9]: ('a', 1, 'b', 2)
Then toDF sampled the first element and found it to be a str (your jobId), hence the complaint
TypeError: Can not infer schema for type: <type 'str'>
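With the generator version of goThroughAB above, every element of the flatMapped RDD is a Row, so schema inference has proper Rows to look at (a sketch, reusing the names from the question):
simulation = housesDF.flatMap(lambda house: goThroughAB(jobId, house))
simulation.toDF().show()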
