I'm trying to create a small dataframe so that I can save two scalars (doubles) and a string.
My first attempt is adapted from How to create spark dataframe with column name which contains dot/period?:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
input_data = ([output_stem, paired_p_value, scalar_pearson])
schema = StructType([StructField("Comparison", StringType(), False), \
StructField("Paired p-value", DoubleType(), False), \
StructField("Pearson coefficient", DoubleType(), True)])
df_compare_AF = sqlContext.createDataFrame(input_data, schema)
display(df_compare_AF)
producing the error message:
TypeError: StructType can not accept object 's3://sanford-biofx-dev/con/dev3/dev' in type <class 'str'>
This doesn't make sense to me; this column was meant for strings.
My other attempt is based on Add new rows to pyspark Dataframe:
columns = ["comparison", "paired p", "Pearson coefficient"]
vals = [output_stem, paired_p_value, scalar_pearson]
df = spark.createDataFrame(vals, columns)
display(df)
but this gives an error: TypeError: Can not infer schema for type: <class 'str'>
I just want a small dataframe:
comparison | paired p-value | Pearson Coefficient
-------------------------------------------------
s3://sadf | 0.045 | -0.039
The solution, thanks to 10465355 says Reinstate Monica, is to put a comma of mystery at the end of input_data:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
input_data = ([output_stem, paired_p_value, scalar_pearson],)
schema = StructType([StructField("Comparison", StringType(), False), \
StructField("Paired p-value", DoubleType(), False), \
StructField("Pearson coefficient", DoubleType(), True)])
df_compare_AF = sqlContext.createDataFrame(input_data, schema)
display(df_compare_AF)
I didn't understand at first why this comma is necessary, but it does the job: it turns input_data into a one-element tuple whose single element is the row, so createDataFrame sees one row with three fields. Without it, the outer parentheses do nothing and each scalar is treated as its own row, which is why the string was rejected against the StructType.
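Equivalently, the surprise disappears if input_data is written as a list containing a single row tuple, which is perhaps the clearer spelling. A minimal sketch with placeholder values standing in for output_stem, paired_p_value, and scalar_pearson, assuming an active spark session:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Placeholder values standing in for the real variables
output_stem = "s3://sadf"
paired_p_value = 0.045
scalar_pearson = -0.039

# A list holding exactly one row tuple: createDataFrame iterates the list and
# sees a single row with three fields, instead of three separate "rows"
input_data = [(output_stem, paired_p_value, scalar_pearson)]

schema = StructType([StructField("Comparison", StringType(), False),
                     StructField("Paired p-value", DoubleType(), False),
                     StructField("Pearson coefficient", DoubleType(), True)])

df_compare_AF = spark.createDataFrame(input_data, schema)
df_compare_AF.show()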
Related
I want to get the maximum length of each column in a PySpark dataframe.
Following is the sample dataframe:
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
data2 = [("James","","Smith","36636","M",3000),
("Michael","Rose","","40288","M",4000),
("Robert","","Williams","42114","M",4000),
("Maria","Anne","Jones","39192","F",4000),
("Jen","Mary","Brown","","F",-1)
]
schema = StructType([ \
StructField("firstname",StringType(),True), \
StructField("middlename",StringType(),True), \
StructField("lastname",StringType(),True), \
StructField("id", StringType(), True), \
StructField("gender", StringType(), True), \
StructField("salary", IntegerType(), True) \
])
df = spark.createDataFrame(data=data2,schema=schema)
I tried to implement the solution provided in Scala but could not convert it.
This would work
from pyspark.sql.functions import col, length, max
df=df.select([max(length(col(name))) for name in df.schema.names])
Edit: for reference, converting the result to rows (as asked in pyspark max string length for each column in the dataframe, and updated there as well):
from pyspark.sql import Row

df = df.select([max(length(col(name))).alias(name) for name in df.schema.names])
row = df.first().asDict()
df2 = spark.createDataFrame([Row(col=name, length=row[name]) for name in df.schema.names], ['col', 'length'])
I'm trying to create a dataframe with an array in PySpark, like above, but it returns an infer-schema error:
from numpy import array

data = array([-1.01835623e-01, -2.81103030e-02, 9.39835608e-01, 1.45413309e-01,
3.11870694e-01, 4.00573969e-01, -2.64698595e-01, -4.19898927e-01,
-1.18507199e-01, -3.59607369e-01, 4.42910716e-02, 6.56066418e-01,
2.20986709e-01, -4.60361429e-02, -4.06525940e-01, -2.33521834e-01])
column = ['feature']
from pyspark.sql.types import StructType, StructField, LongType
schema = StructType([StructField("feature", LongType(), True)])
dataframe = spark.createDataFrame(data, column, schema)
dataframe.show()
TypeError: Can not infer schema for type: <class 'numpy.float32'>
Should I try some transformation using NumPy or anyone has a hint for it?
This worked for me, using ArrayType(DoubleType()) for the feature column:
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, DoubleType

data = [('1', [-1.01835623e-01, -2.81103030e-02, 9.39835608e-01, 1.45413309e-01,
               3.11870694e-01, 4.00573969e-01, -2.64698595e-01, -4.19898927e-01,
               -1.18507199e-01, -3.59607369e-01, 4.42910716e-02, 6.56066418e-01,
               2.20986709e-01, -4.60361429e-02, -4.06525940e-01, -2.33521834e-01])]
schema = StructType([StructField("ID", StringType(), True),
                     StructField("feature", ArrayType(DoubleType()), True)])
df = spark.createDataFrame(data, schema)
df.show(truncate=False)
+---+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ID |feature |
+---+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1 |[-0.101835623, -0.028110303, 0.939835608, 0.145413309, 0.311870694, 0.400573969, -0.264698595, -0.419898927, -0.118507199, -0.359607369, 0.0442910716, 0.656066418, 0.220986709, -0.0460361429, -0.40652594, -0.233521834]|
+---+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
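If the data really does start out as a NumPy array, as in the question, one way to avoid the numpy.float32 error (a sketch of my own, assuming an active spark session) is to convert the values to plain Python floats with tolist() before building the dataframe:
import numpy as np
from pyspark.sql.types import StructType, StructField, ArrayType, DoubleType

data = np.array([-1.01835623e-01, -2.81103030e-02, 9.39835608e-01], dtype=np.float32)

# tolist() turns numpy scalars into plain Python floats, which Spark can validate
schema = StructType([StructField("feature", ArrayType(DoubleType()), True)])
df = spark.createDataFrame([(data.tolist(),)], schema)
df.show(truncate=False)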
I have data stored as an array of strings; internally each string is JSON-like.
I need to get the Eid and reason fields from it.
Input:
['{"Eid":'1',"reason":"null","deptID":{1,2,3}}','{"Eid":'2',"reason":"happy","deptID":{2,3}}']
I need to parse this to get Eid and reason only. I want each JSON stored as a string to stay in JSON format, like below:
[{"Eid":'1',"reason":"null"},
{"Eid":'2',"reason":"happy"}]
One way of doing that is parsing the JSON string using from_json and a schema, then extracting the fields you want and converting them back to JSON using to_json.
Example:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql import functions as F
data = [
'{"Eid":"1","reason":"null","deptID":"{1,2,3}"}',
'{"Eid":"2","reason":"happy","deptID":"{2,3}"}',
]
schema = StructType(
[
StructField("Eid", StringType(), True),
StructField("reason", StringType(), True),
StructField("deptID", StringType(), True),
]
)
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[x] for x in data], ["value"])
df = (
df.withColumn("data", F.from_json(F.col("value"), schema))
.withColumn("Eid", F.col("data")["Eid"])
.withColumn("reason", F.col("data")["reason"])
.withColumn("json", F.to_json(F.struct([F.col("Eid"), F.col("reason")])))
.select(["value", "json"])
)
df.show(20, False)
Result:
+----------------------------------------------+----------------------------+
|value |json |
+----------------------------------------------+----------------------------+
|{"Eid":"1","reason":"null","deptID":"{1,2,3}"}|{"Eid":"1","reason":"null"} |
|{"Eid":"2","reason":"happy","deptID":"{2,3}"} |{"Eid":"2","reason":"happy"}|
+----------------------------------------------+----------------------------+
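If only a couple of fields are needed, a shorter alternative (my own sketch, not part of the answer above, reusing the same df) is get_json_object, which extracts values by JSON path without defining a schema:
from pyspark.sql import functions as F

# Pull the two fields straight out of the JSON string by path, then re-serialize
df2 = df.select(
    "value",
    F.to_json(F.struct(
        F.get_json_object("value", "$.Eid").alias("Eid"),
        F.get_json_object("value", "$.reason").alias("reason"),
    )).alias("json"),
)
df2.show(20, False)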
I want to create a generic function in PySpark that takes a dataframe and a datatype as parameters and filters out the columns that do not satisfy the criteria. I am not very good at Python and I am stuck at the point where I cannot figure out how to do that.
I have a Scala version of the code that does the same thing.
//sample data
val df = Seq(("587","mumbai",Some(5000),5.05),("786","chennai",Some(40000),7.055),("432","Gujarat",Some(20000),6.75),("2","Delhi",None,10.0)).toDF("Id","City","Salary","Increase").withColumn("RefID",$"Id")
import org.apache.spark.sql.functions.col
def selectByType(colType: DataType, df: DataFrame) = {
val cols = df.schema.toList
.filter(x => x.dataType == colType)
.map(c => col(c.name))
df.select(cols:_*)
}
val res = selectByType(IntegerType, df)
res is the dataframe that has only the integer columns, in this case the salary column; all the other columns with different types have been dropped dynamically.
I want the same behaviour in PySpark but I am not able to accomplish it.
This is what I have tried
//sample data
from pyspark.sql.types import StructType,StructField, StringType, IntegerType, DoubleType
schema = StructType([ \
StructField("firstname",StringType(),True), \
StructField("middlename",StringType(),True), \
StructField("lastname",StringType(),True), \
StructField("id", StringType(), True), \
StructField("gender", StringType(), True), \
StructField("salary", IntegerType(), True), \
StructField("raise",DoubleType(),True) \
])
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
data2 = [("James","","Smith","36636","M",3000,2.5),
("Michael","Rose","","40288","M",4000,4.7),
("Robert","","Williams","42114","M",4000,8.9),
("Maria","Anne","Jones","39192","F",4000,0.0),
("Jen","Mary","Brown","","F",-1,-1.2)
]
df = spark.createDataFrame(data=data2,schema=schema)
//getting the column list from schema of the dataframe
pschema = df.schema.fields
datatypes = [IntegerType,DoubleType] //column datatype that I want.
out = filter(lambda x: x.dataType.isin(datatypes), pschema) //gives invalid syntax error.
Can someone help me see what I am doing wrong? The Scala code only passes a single datatype, but for my use case I want to handle the scenario where we can pass multiple datatypes and get back a dataframe with only the columns of those specified datatypes.
If someone can show how to make it work for a single datatype first, I can try to extend it to multiple datatypes.
Note: the sample data for Scala and PySpark is different because I copied the PySpark sample data from elsewhere just to save time; I am only concerned about the final output requirement.
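For what it's worth, here is a minimal PySpark sketch of the selectByType idea (my own illustration, not taken from an existing answer). The attempt above breaks because // is not a Python comment marker and DataType objects have no isin method; checking each field's dataType with isinstance works instead. The select_by_type name is just for illustration, and df is the dataframe created above:
from pyspark.sql.types import IntegerType, DoubleType

# Hypothetical helper: keep only columns whose dataType is an instance of the given types
def select_by_type(df, *dtypes):
    cols = [f.name for f in df.schema.fields if isinstance(f.dataType, dtypes)]
    return df.select(cols)

res = select_by_type(df, IntegerType)                     # only the salary column
res_multi = select_by_type(df, IntegerType, DoubleType)   # salary and raise
res_multi.show()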
I'm attempting to convert a CSV file to a dataframe using this schema:
sch = StructType([
StructField("id", StringType(), True),
StructField("words", ArrayType((StringType())), True)
])
dataFile = 'mycsv.csv'
df = sqlContext.read.option("mode", "DROPMALFORMED").schema(sch).option("delimiter", format(",")).option("charset", "UTF-8").load(dataFile, format='com.databricks.spark.csv', header='true', inferSchema='false')
mycsv.csv contains :
id , words
a , test here
I expect df to contain [Row(id='a', words=['test' , 'here'])]
but instead it's empty, since df.collect() returns []
Is my schema defined correctly ?
Well, clearly your words column isn't of type Array; it's of type StringType() only. And since you have DROPMALFORMED enabled, Spark is dropping the records because they don't match the Array schema. Try a schema like the one below and it should work fine:
sch = StructType([
StructField("id", StringType(), True),
StructField("words", StringType(), True)
])
Edit: if you really want the 2nd column as an Array/List of words, do this:
from pyspark.sql.functions import split
df.select(df.id,split(df.words," ").alias('words')).show()
This outputs:
+---+--------------+
| id| words|
+---+--------------+
| a |[, test, here]|
+---+--------------+
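The leading empty string in [, test, here] comes from the space around the comma in mycsv.csv. If that matters, one option (my addition, not part of the original answer) is to trim the columns before splitting:
from pyspark.sql.functions import split, trim

# Strip the whitespace left by the " , " delimiter before splitting into words
df.select(trim(df.id).alias('id'),
          split(trim(df.words), " ").alias('words')).show()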