PySpark schema not recognised - python

I'm attempting to read a CSV file into a DataFrame using this schema:

sch = StructType([
    StructField("id", StringType(), True),
    StructField("words", ArrayType(StringType()), True)
])

dataFile = 'mycsv.csv'

df = (
    sqlContext.read
    .option("mode", "DROPMALFORMED")
    .schema(sch)
    .option("delimiter", ",")
    .option("charset", "UTF-8")
    .load(dataFile, format='com.databricks.spark.csv', header='true', inferSchema='false')
)
mycsv.csv contains:

id , words
a , test here

I expect df to contain [Row(id='a', words=['test', 'here'])], but instead df.collect() returns an empty list [].
Is my schema defined correctly?

Well, clearly your words column isn't an array in the CSV; it's just a plain string, and since you have DROPMALFORMED enabled, Spark is dropping the records because they don't match the ArrayType schema. Try a schema like the one below and it should work fine:
sch = StructType([
    StructField("id", StringType(), True),
    StructField("words", StringType(), True)
])
Edit: if you really want the second column as an array/list of words, do this:

from pyspark.sql.functions import split

df.select(df.id, split(df.words, " ").alias("words")).show()

This outputs:
+---+--------------+
| id|         words|
+---+--------------+
| a |[, test, here]|
+---+--------------+
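Note the empty first element in the array above; it comes from the leading space after the comma in the CSV. A small tweak (a sketch on the same df) trims the column before splitting:

from pyspark.sql.functions import split, trim

# trim strips the stray spaces around the comma-delimited value before splitting on spaces
df.select(df.id, split(trim(df.words), " ").alias("words")).show()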

Related

Parsing in PySpark: JSON stored as string, need to store as JSON only

I have data stored as an array of strings, where each string is itself JSON (or a tuple-like dict). I need to get Eid and reason out of it.
Input:

['{"Eid":'1',"reason":"null","deptID":{1,2,3}}','{"Eid":'2',"reason":"happy","deptID":{2,3}}']

I need to parse this to extract Eid and reason only, and I want each JSON stored as a string to stay in JSON format, like below:

[{"Eid":'1',"reason":"null"},
 {"Eid":'2',"reason":"happy"}]
One way of doing that is to parse the JSON string using from_json with a schema, extract the fields you want, and convert the result back to JSON using to_json.
Example:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql import functions as F

data = [
    '{"Eid":"1","reason":"null","deptID":"{1,2,3}"}',
    '{"Eid":"2","reason":"happy","deptID":"{2,3}"}',
]

schema = StructType(
    [
        StructField("Eid", StringType(), True),
        StructField("reason", StringType(), True),
        StructField("deptID", StringType(), True),
    ]
)

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[x] for x in data], ["value"])

df = (
    df.withColumn("data", F.from_json(F.col("value"), schema))
    .withColumn("Eid", F.col("data")["Eid"])
    .withColumn("reason", F.col("data")["reason"])
    .withColumn("json", F.to_json(F.struct([F.col("Eid"), F.col("reason")])))
    .select(["value", "json"])
)

df.show(20, False)
Result:
+----------------------------------------------+----------------------------+
|value                                         |json                        |
+----------------------------------------------+----------------------------+
|{"Eid":"1","reason":"null","deptID":"{1,2,3}"}|{"Eid":"1","reason":"null"} |
|{"Eid":"2","reason":"happy","deptID":"{2,3}"} |{"Eid":"2","reason":"happy"}|
+----------------------------------------------+----------------------------+
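If, as in the question, the JSON strings sit inside a single array column rather than one per row, the same from_json/to_json approach applies after exploding the array. A sketch, reusing data and schema from above (the column names values, value and json are assumptions):

arr_df = spark.createDataFrame([(data,)], ["values"])  # one row holding the whole array of JSON strings

result = (
    arr_df.withColumn("value", F.explode("values"))
    .withColumn("data", F.from_json("value", schema))
    .withColumn("json", F.to_json(F.struct(F.col("data.Eid"), F.col("data.reason"))))
    .select("json")
)
result.show(truncate=False)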

How to read in JSON so each element of dict/hash is a new row in dataframe?

I'm attempting to read a large dataset written in JSON into a dataframe.
A minimal working example of the data:

{"X":{"sex":"Male","age":57,"BMI":"19.7"},"XX":{"BMI":"30.7","age":44,"sex":"Female"},"XXX":{"age":18,"sex":"Female","BMI":"22.3"},"XXXX":{"sex":"Male","age":29,"BMI":"25.7"},"ZZZ":{"sex":"Male","age":61,"BMI":"40.5"}}

However, the dataset is not being read correctly: it should have about 10,999 elements, and I'm only getting one.
The JSON is a hash/dict where each element should become a new row.
I've tried:

df = spark.read.json("dbfs:/FileStore/shared_uploads/xyz/data.json")
df = spark.read.option("multiline", "true").json("dbfs:/FileStore/shared_uploads/xyz/data.json")

I've also tried inferSchema, but this doesn't interpret the schema even close to correctly: I still get one row.
I also made a custom schema, where each field is a sub-key of each row, e.g.

custom_schema = StructType([
    StructField('Admission_Date', StringType(), True),
    StructField('BMI', StringType(), True),
    StructField('age', IntegerType(), True),
    StructField('latest_date', StringType(), True),
    ...
    StructField('sex', StringType(), True)
])
and then loaded with the custom schema:
df = spark.read.option("multiline", "true").schema(custom_schema).json("dbfs:/FileStore/shared_uploads/xyz/data.json")
but this again yields a single row.
How can I load this JSON so that every key is considered a single row?
You can create an array column from all the DataFrame columns, explode it, and star-expand the resulting struct column:
from pyspark.sql import functions as F

df1 = df.select(
    F.explode(F.array(*df.columns)).alias("rows")
).select("rows.*")

df1.show()
#+----+---+------+
#| BMI|age|   sex|
#+----+---+------+
#|19.7| 57|  Male|
#|30.7| 44|Female|
#|22.3| 18|Female|
#|25.7| 29|  Male|
#|40.5| 61|  Male|
#+----+---+------+
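If you also want to keep the top-level key (X, XX, ...) as its own column, one option is to wrap each column in a struct that carries its name along before exploding. A sketch (the field names id and v are assumptions):

from pyspark.sql import functions as F

df2 = df.select(
    F.explode(
        F.array(*[
            F.struct(F.lit(c).alias("id"), F.col(c).alias("v"))
            for c in df.columns
        ])
    ).alias("row")
).select("row.id", "row.v.*")

df2.show()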

pyspark corrupt_record while reading json file

I have a JSON file that can't be read by Spark with spark.read.json("xxx").show():

{'event_date_utc': None,'deleted': False, 'cost':1 , 'name':'Mike'}

The problem seems to be that None and False are not quoted, and Spark can't default them to boolean, null or even string.
I tried giving my Spark read a schema instead of inferring one, forcing those two columns to be strings, and got the same error.
It feels like Spark tries to read the data first, then applies the schema, and fails in the read part.
Is there a way to tell Spark to read those values without modifying the input data? I am using Python.
Your input isn't valid JSON, so you can't read it using spark.read.json. Instead, you can load it as a text DataFrame with spark.read.text and parse the stringified dict into JSON using a UDF:
import ast
import json

from pyspark.sql import functions as F
from pyspark.sql.types import *

schema = StructType([
    StructField("event_date_utc", StringType(), True),
    StructField("deleted", BooleanType(), True),
    StructField("cost", IntegerType(), True),
    StructField("name", StringType(), True)
])

# convert each Python-dict-style line into a valid JSON string
dict_to_json = F.udf(lambda x: json.dumps(ast.literal_eval(x)))

df = spark.read.text("xxx") \
    .withColumn("value", F.from_json(dict_to_json("value"), schema)) \
    .select("value.*")

df.show()
#+--------------+-------+----+----+
#|event_date_utc|deleted|cost|name|
#+--------------+-------+----+----+
#|          null|  false|   1|Mike|
#+--------------+-------+----+----+
The JSON doesn't look good; field values need to be quoted.
You can eval the lines first, since they look like they're in Python dict format.
df = spark.createDataFrame(
    sc.textFile('true.json').map(eval),
    'event_date_utc boolean, deleted boolean, cost int, name string'
)
df.show()
+--------------+-------+----+----+
|event_date_utc|deleted|cost|name|
+--------------+-------+----+----+
|          null|  false|   1|Mike|
+--------------+-------+----+----+
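A safer variant of the bare eval is ast.literal_eval, which only evaluates Python literals. A sketch (event_date_utc is typed as string here on the assumption that it holds dates rather than booleans):

import ast

df = spark.createDataFrame(
    sc.textFile('true.json').map(ast.literal_eval),
    'event_date_utc string, deleted boolean, cost int, name string'
)
df.show()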

How to read a Parquet file, change datatypes and write to another Parquet file in Hadoop using PySpark

My source Parquet file has everything stored as strings. My destination Parquet file needs these converted to different datatypes like int, string, date, etc. How do I do this?
You may want to apply a user-defined schema to speed up data loading. There are two ways to do that:
Use a DDL-formatted string:
spark.read.schema("a INT, b STRING, c DOUBLE").parquet("test.parquet")
Use a StructType schema:
customSchema = StructType([
    StructField("a", IntegerType(), True),
    StructField("b", StringType(), True),
    StructField("c", DoubleType(), True)
])

spark.read.schema(customSchema).parquet("test.parquet")
You should read the file, cast all the columns as required, and then save the result:
from pyspark.sql.functions import *
df = spark.read.parquet('/path/to/file')
df = df.select(col('col1').cast('int'), col('col2').cast('string'))
df.write.parquet('/target/path')
Data file:

| data_extract_id   | Alien_Dollardiff | Alien_Dollar |
| ab1def1gh-123-ea0 | 0                | 0            |

Script:

def createPrqtFParqt(datPath, parquetPath, inpustJsonSchema, outputdfSchema):
    print("## Parsing " + datPath)
    df = ssc.read.schema(outputdfSchema).parquet(datPath)
    print("## Writing " + parquetPath)
    df.write.mode("overwrite").parquet(parquetPath)

Output:

An error occurred while calling Parquet.
Column: Alien_Dollardiff | Expected double, found BINARY.
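That error suggests the Parquet file actually stores Alien_Dollardiff as a string (BINARY on disk) while outputdfSchema declares it as double; a read schema on Parquet will not cast the data for you. A sketch of the read-then-cast approach from the earlier answer, using the column names from the sample data:

from pyspark.sql.functions import col

df = ssc.read.parquet(datPath)
df = df.withColumn("Alien_Dollardiff", col("Alien_Dollardiff").cast("double")) \
       .withColumn("Alien_Dollar", col("Alien_Dollar").cast("double"))
df.write.mode("overwrite").parquet(parquetPath)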

PySpark: Cannot create small dataframe

I'm trying to create a small dataframe so that I can save two scalars (doubles) and a string.
Following How to create spark dataframe with column name which contains dot/period?, I tried:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

input_data = ([output_stem, paired_p_value, scalar_pearson])
schema = StructType([
    StructField("Comparison", StringType(), False),
    StructField("Paired p-value", DoubleType(), False),
    StructField("Pearson coefficient", DoubleType(), True)
])
df_compare_AF = sqlContext.createDataFrame(input_data, schema)
display(df_compare_AF)

This produces the error message:

TypeError: StructType can not accept object 's3://sanford-biofx-dev/con/dev3/dev' in type <class 'str'>

which doesn't make any sense to me; this column was meant for strings.
My other attempt is from Add new rows to pyspark Dataframe:

columns = ["comparison", "paired p", "Pearson coefficient"]
vals = [output_stem, paired_p_value, scalar_pearson]
df = spark.createDataFrame(vals, columns)
display(df)

but this gives an error: TypeError: Can not infer schema for type: <class 'str'>
I just want a small dataframe:
comparison | paired p-value | Pearson Coefficient
-------------------------------------------------
s3://sadf | 0.045 | -0.039
The solution is to put a comma of mystery at the end of input_data, thanks to 10465355 says Reinstate Monica:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

input_data = ([output_stem, paired_p_value, scalar_pearson],)
schema = StructType([
    StructField("Comparison", StringType(), False),
    StructField("Paired p-value", DoubleType(), False),
    StructField("Pearson coefficient", DoubleType(), True)
])
df_compare_AF = sqlContext.createDataFrame(input_data, schema)
display(df_compare_AF)

I don't understand why this comma is necessary, or what it does, but it seems to do the job.
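For the record, the trailing comma turns the parentheses into a one-element tuple, so createDataFrame receives a collection containing a single row of three values rather than three bare scalars. An equivalent, more explicit form (a sketch using the same variables) would be:

rows = [(output_stem, paired_p_value, scalar_pearson)]  # a list holding one row with three fields
df_compare_AF = sqlContext.createDataFrame(rows, schema)
display(df_compare_AF)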
