I'm trying to create a dataframe from a NumPy array in PySpark, as below, but it returns a schema inference error:
data = array([-1.01835623e-01, -2.81103030e-02, 9.39835608e-01, 1.45413309e-01,
3.11870694e-01, 4.00573969e-01, -2.64698595e-01, -4.19898927e-01,
-1.18507199e-01, -3.59607369e-01, 4.42910716e-02, 6.56066418e-01,
2.20986709e-01, -4.60361429e-02, -4.06525940e-01, -2.33521834e-01])
column = ['feature']
from pyspark.sql.types import StructType, StructField, LongType
schema = StructType([StructField("feature", LongType(), True)])
dataframe = spark.createDataFrame(data, column, schema)
dataframe.show()
**TypeError: Can not infer schema for type: <class 'numpy.float32'>**
Should I apply some transformation using NumPy first, or does anyone have a hint for this?
This worked for me, using ArrayType(DoubleType()) for the feature column:
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, DoubleType

data = [('1', [-1.01835623e-01, -2.81103030e-02, 9.39835608e-01, 1.45413309e-01,
               3.11870694e-01, 4.00573969e-01, -2.64698595e-01, -4.19898927e-01,
               -1.18507199e-01, -3.59607369e-01, 4.42910716e-02, 6.56066418e-01,
               2.20986709e-01, -4.60361429e-02, -4.06525940e-01, -2.33521834e-01])]

schema = StructType([StructField("ID", StringType(), True),
                     StructField("feature", ArrayType(DoubleType()), True)])

df = spark.createDataFrame(data, schema)
df.show(truncate=False)
+---+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ID |feature |
+---+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1 |[-0.101835623, -0.028110303, 0.939835608, 0.145413309, 0.311870694, 0.400573969, -0.264698595, -0.419898927, -0.118507199, -0.359607369, 0.0442910716, 0.656066418, 0.220986709, -0.0460361429, -0.40652594, -0.233521834]|
+---+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
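If you want to start from the NumPy array in the question, a minimal sketch (assuming data is the NumPy array above and spark is an existing SparkSession) is to convert it to plain Python floats with tolist() first, since Spark can't map numpy.float32 values to a Spark type:

from pyspark.sql.types import StructType, StructField, ArrayType, DoubleType

schema = StructType([StructField("feature", ArrayType(DoubleType()), True)])

# tolist() converts the numpy.float32 values to plain Python floats
dataframe = spark.createDataFrame([(data.tolist(),)], schema)
dataframe.show(truncate=False)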
I am getting a problem with the code below. I want to create a single-column dataframe.
May I know what I am doing wrong here?
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, ArrayType, StructType, StructField, StringType

data = [ (["James","Jon","Jane"]), (["Miken","Mik","Mike"]), (["John","Johns"])]
cols = StructType([ StructField("Name",ArrayType(StringType()),True) ])
df = spark.createDataFrame(data=data,schema=cols)
df.printSchema()
df.show()
Expected output:
Name
["James","Jon","Jane"]
["Miken","Mik","Mike"]
["John","Johns"]
Instead, I am getting the error below.
Length of object (3) does not match with length of fields (1)
This error occurs because the data is interpreted as a multi-column structure, while your schema expects a single column.
To get the data into a single column, wrap each row in a tuple with [(row,) for row in data]:
data = [(["James","Jon","Jane"]), (["Miken","Mik","Mike"]), (["John","Johns"])]
cols = StructType([ StructField("Name",ArrayType(StringType()),True) ])
df = spark.createDataFrame(data=[(row,) for row in data], schema=cols)
df.printSchema()
df.show()
Output:
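root
 |-- Name: array (nullable = true)
 |    |-- element: string (containsNull = true)

+------------------+
|              Name|
+------------------+
|[James, Jon, Jane]|
|[Miken, Mik, Mike]|
|     [John, Johns]|
+------------------+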
PySpark has this problem. The way I go about it is to introduce an ID column and drop it once the df is created.
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType,ArrayType,StructType,StructField,StringType
data = [ (1,["James","Jon","Jane"]), (2,["Miken","Mik","Mike"]), (3,["John","Johns"])]
cols = StructType([ StructField("ID",IntegerType(),True), StructField("Name",ArrayType(StringType()),True) ])
df = spark.createDataFrame(data=data,schema=cols).drop('ID')
df.printSchema()
df.show()
I have data stored as an array of strings. Internally each string is JSON/tuple-like.
I need to get Eid and reason from it.
Input:
['{"Eid":'1',"reason":"null","deptID":{1,2,3}}','{"Eid":'2',"reason":"happy","deptID":{2,3}}']
I need to parse this to get Eid and reason only. I want each JSON stored as a string to stay in JSON format, like below:
[{"Eid":'1',"reason":"null"},
{"Eid":'2',"reason":"happy"}]
One way of doing that is to parse the JSON string using from_json with a schema, extract the fields you want, and convert the result back to JSON using to_json.
Example:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql import functions as F
data = [
    '{"Eid":"1","reason":"null","deptID":"{1,2,3}"}',
    '{"Eid":"2","reason":"happy","deptID":"{2,3}"}',
]

schema = StructType(
    [
        StructField("Eid", StringType(), True),
        StructField("reason", StringType(), True),
        StructField("deptID", StringType(), True),
    ]
)

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([[x] for x in data], ["value"])
df = (
    df.withColumn("data", F.from_json(F.col("value"), schema))
    .withColumn("Eid", F.col("data")["Eid"])
    .withColumn("reason", F.col("data")["reason"])
    .withColumn("json", F.to_json(F.struct([F.col("Eid"), F.col("reason")])))
    .select(["value", "json"])
)
df.show(20, False)
Result:
+----------------------------------------------+----------------------------+
|value |json |
+----------------------------------------------+----------------------------+
|{"Eid":"1","reason":"null","deptID":"{1,2,3}"}|{"Eid":"1","reason":"null"} |
|{"Eid":"2","reason":"happy","deptID":"{2,3}"} |{"Eid":"2","reason":"happy"}|
+----------------------------------------------+----------------------------+
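If you only need those two fields, a shorter variant is to pull them out with get_json_object instead of defining a full schema. This is a sketch reusing data, spark, and F from the example above, not tested against the exact input in the question:

df = spark.createDataFrame([[x] for x in data], ["value"])
df = df.withColumn(
    "json",
    F.to_json(
        F.struct(
            F.get_json_object("value", "$.Eid").alias("Eid"),
            F.get_json_object("value", "$.reason").alias("reason"),
        )
    ),
).select("value", "json")
df.show(20, False)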
I have JSON which can't be read by Spark (spark.read.json("xxx").show()):
{'event_date_utc': None,'deleted': False, 'cost':1 , 'name':'Mike'}
The problem seems to be that None and False are not quoted, and Spark can't default them to boolean, null, or even string.
I tried to give my spark.read a schema instead of relying on inference, forcing those two columns to be strings, and got the same error.
It feels like Spark tries to read the data first, then applies the schema, and fails during the read.
Is there a way to tell Spark to read those values without modifying the input data? I am using Python.
Your input isn't valid JSON, so you can't read it using spark.read.json. Instead, you can load it as a text DataFrame with spark.read.text and parse the stringified dict into JSON using a UDF:
import ast
import json
from pyspark.sql import functions as F
from pyspark.sql.types import *
schema = StructType([
    StructField("event_date_utc", StringType(), True),
    StructField("deleted", BooleanType(), True),
    StructField("cost", IntegerType(), True),
    StructField("name", StringType(), True)
])

# Parse the Python-literal dict and re-serialize it as valid JSON
dict_to_json = F.udf(lambda x: json.dumps(ast.literal_eval(x)))

df = spark.read.text("xxx") \
    .withColumn("value", F.from_json(dict_to_json("value"), schema)) \
    .select("value.*")
df.show()
#+--------------+-------+----+----+
#|event_date_utc|deleted|cost|name|
#+--------------+-------+----+----+
#|null |false |1 |Mike|
#+--------------+-------+----+----+
The input isn't valid JSON: None and False are Python literals, not JSON null and false, and the strings are single-quoted.
You can eval the lines first, since they look like they're in Python dict format.
df = spark.createDataFrame(
sc.textFile('true.json').map(eval),
'event_date_utc boolean, deleted boolean, cost int, name string'
)
df.show()
+--------------+-------+----+----+
|event_date_utc|deleted|cost|name|
+--------------+-------+----+----+
| null| false| 1|Mike|
+--------------+-------+----+----+
I'm trying to create a small dataframe so that I can save two scalars (doubles) and a string.
Following How to create spark dataframe with column name which contains dot/period?, I tried:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
input_data = ([output_stem, paired_p_value, scalar_pearson])
schema = StructType([StructField("Comparison", StringType(), False), \
StructField("Paired p-value", DoubleType(), False), \
StructField("Pearson coefficient", DoubleType(), True)])
df_compare_AF = sqlContext.createDataFrame(input_data, schema)
display(df_compare_AF)
producing the error message:
TypeError: StructType can not accept object 's3://sanford-biofx-dev/con/dev3/dev' in type <class 'str'>
which doesn't make any sense to me; this column was meant for strings.
My other solution is from Add new rows to pyspark Dataframe:
columns = ["comparison", "paired p", "Pearson coefficient"]
vals = [output_stem, paired_p_value, scalar_pearson]
df = spark.createDataFrame(vals, columns)
display(df)
but this gives an error: TypeError: Can not infer schema for type: <class 'str'>
I just want a small dataframe:
comparison | paired p-value | Pearson Coefficient
-------------------------------------------------
s3://sadf | 0.045 | -0.039
The solution is to put a comma of mystery at the end of input_data, thanks to 10465355 says Reinstate Monica:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
input_data = ([output_stem, paired_p_value, scalar_pearson],)
schema = StructType([StructField("Comparison", StringType(), False), \
StructField("Paired p-value", DoubleType(), False), \
StructField("Pearson coefficient", DoubleType(), True)])
df_compare_AF = sqlContext.createDataFrame(input_data, schema)
display(df_compare_AF)
I didn't understand at first why this comma is necessary, but it does the job: the trailing comma makes input_data a one-element tuple, so createDataFrame sees a collection containing a single row (the list of three values) instead of treating each value in the list as its own row.
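Equivalently, and arguably more readable, you can pass a list containing a single row tuple. This is a sketch assuming the same output_stem, paired_p_value, scalar_pearson variables and sqlContext as above:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# One row, written as a list containing a single tuple
input_data = [(output_stem, paired_p_value, scalar_pearson)]
schema = StructType([StructField("Comparison", StringType(), False),
                     StructField("Paired p-value", DoubleType(), False),
                     StructField("Pearson coefficient", DoubleType(), True)])
df_compare_AF = sqlContext.createDataFrame(input_data, schema)
display(df_compare_AF)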
I have a pandas data frame my_df, and my_df.dtypes gives us:
ts int64
fieldA object
fieldB object
fieldC object
fieldD object
fieldE object
dtype: object
Then I am trying to convert the pandas data frame my_df to a spark data frame by doing below:
spark_my_df = sc.createDataFrame(my_df)
However, I got the following errors:
ValueErrorTraceback (most recent call last)
<ipython-input-29-d4c9bb41bb1e> in <module>()
----> 1 spark_my_df = sc.createDataFrame(my_df)
2 spark_my_df.take(20)
/usr/local/spark-latest/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio)
520 rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
521 else:
--> 522 rdd, schema = self._createFromLocal(map(prepare, data), schema)
523 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
524 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
/usr/local/spark-latest/python/pyspark/sql/session.py in _createFromLocal(self, data, schema)
384
385 if schema is None or isinstance(schema, (list, tuple)):
--> 386 struct = self._inferSchemaFromList(data)
387 if isinstance(schema, (list, tuple)):
388 for i, name in enumerate(schema):
/usr/local/spark-latest/python/pyspark/sql/session.py in _inferSchemaFromList(self, data)
318 schema = reduce(_merge_type, map(_infer_schema, data))
319 if _has_nulltype(schema):
--> 320 raise ValueError("Some of types cannot be determined after inferring")
321 return schema
322
ValueError: Some of types cannot be determined after inferring
Does anyone know what the above error mean? Thanks!
In order to infer the field type, PySpark looks at the non-None records in each field. If a field only has None records, PySpark cannot infer the type and will raise that error.
Manually defining a schema will resolve the issue:
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField("foo", StringType(), True)])
>>> df = spark.createDataFrame([[None]], schema=schema)
>>> df.show()
+----+
|foo |
+----+
|null|
+----+
To fix this problem, you could provide your own schema.
For example:
To reproduce the error:
>>> df = spark.createDataFrame([[None, None]], ["name", "score"])
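ValueError: Some of types cannot be determined after inferring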
To fix the error:
>>> from pyspark.sql.types import StructType, StructField, StringType, DoubleType
>>> schema = StructType([StructField("name", StringType(), True), StructField("score", DoubleType(), True)])
>>> df = spark.createDataFrame([[None, None]], schema=schema)
>>> df.show()
+----+-----+
|name|score|
+----+-----+
|null| null|
+----+-----+
If you are using the RDD[Row].toDF() monkey-patched method, you can increase the sample ratio to check more than the first 100 records when inferring types:
# Set sampleRatio smaller as the data size increases
my_df = my_rdd.toDF(sampleRatio=0.01)
my_df.show()
Assuming there are non-null rows in all fields in your RDD, it will be more likely to find them when you increase the sampleRatio towards 1.0.
I've run into this same issue. If you do not need the columns that are all null, you can simply drop them from the pandas dataframe before importing to Spark:
my_df = my_df.dropna(axis='columns', how='all') # Drops columns with all NA values
spark_my_df = sc.createDataFrame(my_df)
This is probably because of the columns that have all null values. You should drop those columns before converting the dataframe to a Spark dataframe.
The reason for this error is that Spark is not able to determine the data types of your pandas dataframe, so one way to solve this is to pass the schema separately to Spark's createDataFrame function.
For example, suppose your pandas dataframe looks like this:
import pandas as pd

d = {
    'col1': [1, 2],
    'col2': ['A', 'B'],
}
df = pd.DataFrame(data=d)
print(df)
   col1 col2
0     1    A
1     2    B
When you want to convert it into a Spark dataframe, start by defining the schema and passing it to createDataFrame as follows:
from pyspark.sql.types import StructType, StructField, LongType, StringType
schema = StructType([
    StructField("col1", LongType()),
    StructField("col2", StringType()),
])
spark_df = spark.createDataFrame(df, schema=schema)
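As a quick check, printing the schema of the spark_df built above should show the explicitly defined types rather than inferred ones:

spark_df.printSchema()
# root
#  |-- col1: long (nullable = true)
#  |-- col2: string (nullable = true)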