Cannot create dataframe from list: pyspark - python

I have a list that is generated by a function. When I execute print on my list:
print(preds_labels)
I obtain:
[(0.,8.),(0.,13.),(0.,19.),(0.,19.),(0.,19.),(0.,19.),(0.,19.),(0.,20.),(0.,21.),(0.,23.)]
but when I want to create a DataFrame with this command:
df = sqlContext.createDataFrame(preds_labels, ["prediction", "label"])
I get an error message:
not supported type: <type 'numpy.float64'>
If I create the list manually, I have no problem. Do you have an idea?

pyspark uses its own type system and, unfortunately, it doesn't deal with numpy types well. It works with Python types, though. So you could manually convert the numpy.float64 values to float like:
df = sqlContext.createDataFrame(
    [(float(tup[0]), float(tup[1])) for tup in preds_labels],
    ["prediction", "label"]
)
Note that pyspark will then treat them as pyspark.sql.types.DoubleType.
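For reference, a minimal sketch of the conversion and the resulting schema (the sample values are made up, and it assumes pyspark is installed and a SparkSession can be created); numpy scalars also expose .item(), which returns the matching plain Python type:
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up stand-in for preds_labels holding numpy.float64 values
preds_labels = [(np.float64(0.0), np.float64(8.0)), (np.float64(0.0), np.float64(13.0))]

# .item() on a numpy scalar returns the equivalent plain Python type,
# so it is an alternative to wrapping each value in float()
clean = [(p.item(), l.item()) for p, l in preds_labels]

df = spark.createDataFrame(clean, ["prediction", "label"])
df.printSchema()
# root
#  |-- prediction: double (nullable = true)
#  |-- label: double (nullable = true)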

To anyone arriving here with the error:
TypeError: not supported type: <class 'numpy.str_'>
This is true for strings as well. So if you created your list of strings using numpy, try changing them to pure Python str.
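For example, a small sketch (with made-up values) of converting numpy strings back to plain Python str before handing them to pyspark:
import numpy as np

labels = np.array(['cat', 'dog'])                 # elements are numpy.str_
clean_labels = [str(x) for x in labels]           # or labels.tolist()
print(type(labels[0]), type(clean_labels[0]))     # <class 'numpy.str_'> <class 'str'>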

Unable to call Dask Array Log10 on native datatypes

I'm working on doing some data aggregation across a dask-dataframe. The data is natively stored as parquet, but I can manipulate it up to the following lines. I am power-summing log-values stored in each row into a single vector and then converting back to a log value. It is the final step that I am having issues with:
slice_dataframe = source_dataframe[filter_idx]
#linearize and sum. The below line works when calling vals.compute()
#slice_dataframe['data'].values returns a dask array
vals = da.sum(10**(slice_dataframe['data'].values/10),axis=0)
# cast back to log spacing. This does not work
log_values = 10*da.log10(vals)
I get the error returned as:
'''
TypeError: Parameters of such types are not supported by log10
'''
Any ideas?
After doing some reading in the documentation, I found that the error being thrown is a data-type issue.
The error is sourced in the line:
vals = da.sum(10**(slice_dataframe['data'].values/10),axis=0)
The returned array has the generic dtype object. Explicitly casting the values to float64 resolves the error posted above.
Fixed:
vals = da.sum(10**(slice_dataframe['data'].values/10),axis=0).astype('float64')
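A small runnable sketch of the same fix (the array below is a made-up stand-in for slice_dataframe['data'].values, which carries the generic object dtype):
import numpy as np
import dask.array as da

# Object-dtype input reproduces the unsupported-type situation
data = da.from_array(np.array([[10.0, 20.0], [30.0, 40.0]], dtype=object), chunks=2)

# Cast to float64 so log10 sees a supported dtype
vals = da.sum(10 ** (data.astype('float64') / 10), axis=0)
log_values = 10 * da.log10(vals)
print(log_values.compute())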

pd.merge throwing error while executing through .bat file

The Python script does not run when executed from a .bat file, but runs seamlessly in the editor.
The error is related to a datatype difference in the pd.merge call, although the datatype given to the columns is the same in both dataframes.
df2a["supply"] = df2a["supply"].astype(str)
df2["supply_typ"] = df2["supply_typ"].astype(str)
df2a["supply_typ"] = df2a["supply_typ"].astype(str)
df = pd.merge(
    df2, df2a, how=join,
    on=['entity_id', 'pare', 'grome', 'buame', 'tame', 'prd', 'gsn',
        'supply', 'supply_typ'],
    suffixes=['gs2', 'gs2x'],
)
While running the bat file I am getting the following error in pd.merge:
You are trying to merge on float64 and object columns. If you wish to proceed you should use pd.concat
Not a direct answer, but contains code that cannot be formatted in a comment, and should be enough to solve the problem.
When pandas says that you are trying to merge on float64 and object columns, it is certainly right. It may not be evident because pandas relies on numpy, and a numpy object column can store any data.
I ended with a simple function to diagnose all those data type problem:
def show_types(df):
    for i, c in enumerate(df.columns):
        print(df[c].dtype, type(df.iat[0, i]))
It shows both the pandas datatype of the columns of a dataframe and the actual type of the first element of each column. It can help to see the difference between columns containing str elements and others containing datetime.datetime ones, while the dtype of both is just object.
Use that on both of your dataframes, and the problem should become evident...
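As a small illustration (the frames and column values below are made up), here is how the show_types helper above exposes the mismatch, and how aligning the dtypes on both sides lets the merge go through:
import pandas as pd

# Hypothetical frames: the 'supply' column looks the same but has different types
df2 = pd.DataFrame({'supply': [1.0, 2.0], 'qty': [5, 7]})            # float64
df2a = pd.DataFrame({'supply': ['1.0', '2.0'], 'loc': ['A', 'B']})   # object (str)

show_types(df2)    # supply: float64 <class 'float'>, qty: int64 <class 'int'>
show_types(df2a)   # supply: object <class 'str'>,    loc: object <class 'str'>

# Align the types on both sides, then merge
df2['supply'] = df2['supply'].astype(str)
merged = pd.merge(df2, df2a, on='supply', how='inner')
print(merged)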

unable to parse a column of json strings in modin dataframe (works in pandas)

I have a dataframe of json strings I want to convert to json objects.
df.col.apply(json.loads) works fine for pandas, but fails when using modin dataframes.
example:
import pandas
import modin.pandas
import json
pandas.DataFrame.from_dict({'a': ['{}']}).a.apply(json.loads)
0 {}
Name: a, dtype: object
modin.pandas.DataFrame.from_dict({'a': ['{}']}).a.apply(json.loads)
TypeError: the JSON object must be str, bytes or bytearray, not float
This issue was also raised on GitHub, and was answered here: https://github.com/modin-project/modin/issues/616
The error is coming from the error checking component of the run, where we call the apply (or agg) on an empty DataFrame to determine the return type and let pandas handle the error checking (Link).
Locally, I can reproduce this issue and have fixed it by changing the line to perform the operation on one line of the Series. This may affect the performance, so I need to do some more tuning to see if there is a way to speed it up and still be robust. After the fix the overhead of that check is ~10ms for 256 columns and I don't think we want error checking to take that long.
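The error message itself can be reproduced outside modin: json.loads only accepts str, bytes or bytearray, so passing it anything else (here a float, which is presumably what the empty-frame probe ends up feeding it) raises the same TypeError:
import json

try:
    json.loads(float('nan'))
except TypeError as e:
    print(e)  # the JSON object must be str, bytes or bytearray, not float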
Until the fix is released, it's possible to work around this issue by using code that also works for empty data, for example:
def safe_loads(x):
    try:
        return json.loads(x)
    except:
        return None
modin.pandas.DataFrame.from_dict({'a': ['{}']}).a.apply(safe_loads)

Python - Pandas Describe Throwing Error: unhashable type 'dict'

Update: I am using some example code from "Socrata Open Source API." I note the following comment in the code:
# First 2000 results, returned as JSON from API / converted to Python
# list of dictionaries by sodapy.
I am not very familiar with JSON.
I have downloaded a dataset, creating a DataFrame 'df' with a large number of columns.
df = pd.DataFrame.from_records(results)
When I attempt to use the describe() method, I get "TypeError: unhashable type: 'dict'":
df.describe()
...
TypeError: unhashable type: 'dict'
How can I identify the columns which are generating this error?
UPDATE 2:
Per Yuca's request, I include an extract from the df:
I came across the same problem today and did a bit of research about different versions of pyarrow. I found that in the past (<0.13), pyarrow would write real columns of data for the index, with names. In the most recent versions of pyarrow, there is no column data, but a range-index metadata marker instead. It means parquet files produced with a newer version of pyarrow can't be read by older versions.
pandas 0.25.3 is OK with reading JSON containing dicts; apparently pandas 1.0.1 is not.
df = pd.read_json(path, lines=True)
TypeError: unhashable type: 'dict'
The above is thrown by pandas 1.0.1 for the same file for which it works in pandas 0.25.3.
The issue is tracked and apparently fixed in master, which I suppose will make it into the next version.
Thanks to the user community (h/t G Anderson), I pieced together a solution:
for i in df.columns:
    if (df[i].transform(type) == dict).any():
        df = df.drop(i, axis=1)
(df[i].transform(type) == dict).any() checks all elements in column i and drops the column if any element is of type dict.
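As a rough sketch (the records and column names below are made up), the same check can be written to first identify the offending columns, and pd.json_normalize (pandas >= 1.0) offers an alternative to dropping them outright:
import pandas as pd

# Hypothetical records, similar in shape to what sodapy returns
results = [
    {'id': 1, 'value': 2.5, 'location': {'lat': 40.7, 'lon': -74.0}},
    {'id': 2, 'value': 3.1, 'location': {'lat': 34.0, 'lon': -118.2}},
]
df = pd.DataFrame.from_records(results)

# Find the columns whose elements are dicts before deciding what to do with them
dict_cols = [c for c in df.columns if (df[c].apply(type) == dict).any()]
print(dict_cols)                          # ['location']

print(df.drop(columns=dict_cols).describe())   # describe() now works

# Alternatively, flatten the nested dicts instead of dropping them
flat = pd.json_normalize(results)
print(flat.columns.tolist())              # ['id', 'value', 'location.lat', 'location.lon']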
Thanks to all!

python3 pandas - TypeError: Can't convert 'int' object to str implicitly

Could you please help me clarify the proper way to convert a selected DataFrame column into strings?
The 'Product_ID' column of DataFrame 'df' is automatically read as integer.
If I use the following statement:
df['Product_ID']=df['Product_ID'].to_string()
it generates the error:
TypeError: Can't convert 'int' object to str implicitly
The same error is generated by .astype(str) or .apply(str).
thanks!
First, note that to_string and .astype(str) do different things. to_string returns a single string, and .astype(str) returns a series with string values. Which are you trying to do?
Second, how sure are you that you are working with an integer series? What does df['Product_ID'].dtype return?
Third, can you try to post a reproducible example? One way to narrow down the data that is causing the problem:
for i, v in enumerate(df['Product_ID'].values):
    try:
        str(v)
    except TypeError:
        print(i, v)
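A minimal sketch (with made-up data) of the distinction in the first point, and of the assignment that usually resolves this kind of problem:
import pandas as pd

df = pd.DataFrame({'Product_ID': [101, 102, 103]})

s = df['Product_ID'].to_string()      # one multi-line string describing the column
col = df['Product_ID'].astype(str)    # a Series of strings, element by element

print(type(s))       # <class 'str'>
print(col.dtype)     # object
df['Product_ID'] = col               # assign the converted Series back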
