When I assemble a one-row data frame as follows, my method successfully returns the expected data frame.
from datetime import datetime
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType, BooleanType

def build_job_finish_data_frame(sql_context, job_load_id, is_success):
    job_complete_record_schema = StructType(
        [
            StructField("job_load_id", IntegerType(), False),
            StructField("terminate_datetime", TimestampType(), False),
            StructField("was_success", BooleanType(), False)
        ]
    )
    data = [
        Row(
            job_load_id=job_load_id,
            terminate_datetime=datetime.now(),
            was_success=is_success
        )
    ]
    return sql_context.createDataFrame(data, job_complete_record_schema)
If I change "terminate_datetime" to "end_datetime" or "finish_datetime", as shown below, it throws an error.
def build_job_finish_data_frame(sql_context, job_load_id, is_success):
    job_complete_record_schema = StructType(
        [
            StructField("job_load_id", IntegerType(), False),
            StructField("end_datetime", TimestampType(), False),
            StructField("was_success", BooleanType(), False)
        ]
    )
    data = [
        Row(
            job_load_id=job_load_id,
            end_datetime=datetime.now(),
            was_success=is_success
        )
    ]
    return sql_context.createDataFrame(data, job_complete_record_schema)
The error I receive is
TypeError: IntegerType can not accept object datetime.datetime(2016, 10, 5, 11, 19, 31, 915745) in type <class 'datetime.datetime'>
I can change "terminate_datetime" to "start_datetime" and it works, and I have experimented with other words.
I can see no reason why changing a field name should break this code, as it is doing nothing more than building a data frame by hand.
This is worrying as I am using data frames to load up a data warehouse where I have no control of the field names.
I am running PySpark on Python 3.3.2 on Fedora 20.
Why does the name change anything? The problem is that Row is a tuple sorted by __fields__, so the first case creates:
from pyspark.sql import Row
from datetime import datetime
x = Row(job_load_id=1, terminate_datetime=datetime.now(), was_success=True)
x.__fields__
## ['job_load_id', 'terminate_datetime', 'was_success']
while the second one creates:
y = Row(job_load_id=1, end_datetime=datetime.now(), was_success=True)
y.__fields__
## ['end_datetime', 'job_load_id', 'was_success']
This no longer matches the schema you defined, which expects (IntegerType, TimestampType, BooleanType).
Because Row is useful mostly for schema inference, and you provide a schema directly, you can address this by using a plain tuple:
def build_job_finish_data_frame(sql_context, job_load_id, is_success):
    job_complete_record_schema = StructType(
        [
            StructField("job_load_id", IntegerType(), False),
            StructField("end_datetime", TimestampType(), False),
            StructField("was_success", BooleanType(), False)
        ]
    )
    data = [(job_load_id, datetime.now(), is_success)]
    return sql_context.createDataFrame(data, job_complete_record_schema)
although creating a single-element DataFrame looks strange, if not pointless.
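If you would rather keep using Row, another option (a sketch, not part of the original answer) is to create a Row class with an explicit field order, so that __fields__ matches the schema instead of being sorted:

from datetime import datetime
from pyspark.sql import Row

# A Row "class" with a fixed field order; keyword arguments are what trigger the sorting.
JobFinishRow = Row("job_load_id", "end_datetime", "was_success")

data = [JobFinishRow(job_load_id, datetime.now(), is_success)]
# sql_context.createDataFrame(data, job_complete_record_schema)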
Related
I'm going to ingest data using a Databricks notebook. I want to validate the schema of the ingested data against the schema I expect it to have.
So basically I have:
validation_schema = StructType([
StructField("a", StringType(), True),
StructField("b", IntegerType(), False),
StructField("c", StringType(), False),
StructField("d", StringType(), False)
])
data_ingested_good = [("foo",1,"blabla","36636"),
("foo",2,"booboo","40288"),
("bar",3,"fafa","42114"),
("bar",4,"jojo","39192"),
("baz",5,"jiji","32432")
]
data_ingested_bad = [("foo","1","blabla","36636"),
("foo","2","booboo","40288"),
("bar","3","fafa","42114"),
("bar","4","jojo","39192"),
("baz","5","jiji","32432")
]
df_ingested_good = spark.createDataFrame(data_ingested_good, ["a", "b", "c", "d"])
df_ingested_bad = spark.createDataFrame(data_ingested_bad, ["a", "b", "c", "d"])

df_ingested_good.printSchema()
df_ingested_bad.printSchema()
print(validation_schema)
I've seen similar questions, but the answers are always in Scala.
It really depends on your exact requirements and on the complexity of the schemas you want to compare - for example, whether you ignore the nullability flag or take it into account, the order of columns, support for maps/structs/arrays, etc. Also, do you want to see the difference, or just a flag indicating whether the schemas match?
In the simplest case it could be as simple as the following - just compare the string representations of the schemas:
def compare_schemas(df1, df2):
    return df1.schema.simpleString() == df2.schema.simpleString()
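The same trick also works for checking an ingested DataFrame directly against the expected StructType (a sketch, assuming the ingested lists above have been loaded into DataFrames named df_ingested_good and df_ingested_bad):

def matches_expected_schema(df, expected_schema):
    # StructType exposes simpleString() as well, so a DataFrame's schema can be
    # compared against a hand-written StructType in the same way. Note that
    # simpleString() ignores nullability, and inferred Python ints come out as
    # bigint, so an exact match may require casting first.
    return df.schema.simpleString() == expected_schema.simpleString()

matches_expected_schema(df_ingested_good, validation_schema)
matches_expected_schema(df_ingested_bad, validation_schema)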
I would personally recommend using an existing library, like chispa, which has more advanced schema comparison functions - you can tune the checks, it will show the differences, etc. After installation (you can just do %pip install chispa), this will throw an exception if the schemas are different:
from chispa.schema_comparer import assert_schema_equality
assert_schema_equality(df1.schema, df2.schema)
Another method: you can find the difference with a simple Python list comparison.
dept = [("Finance",10),
("Marketing",20),
("Sales",30),
("IT",40)
]
deptColumns = ["dept_name","dept_id"]
dept1 = [("Finance",10,'999'),
("Marketing",20,'999'),
("Sales",30,'999'),
("IT",40,'999')
]
deptColumns1 = ["dept_name","dept_id","extracol"]
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
dept1DF = spark.createDataFrame(data=dept1, schema = deptColumns1)
deptDF_columns=deptDF.schema.names
dept1DF_columns=dept1DF.schema.names
list_difference = []
for item in dept1DF_columns:
    if item not in deptDF_columns:
        list_difference.append(item)
print(list_difference)
This prints:
['extracol']
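A more compact variant (a sketch using plain Python sets, not part of the original answer) that also catches columns missing in the other direction:

list_difference = sorted(set(dept1DF.schema.names).symmetric_difference(deptDF.schema.names))
print(list_difference)  # ['extracol']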
I want to create a generic function in PySpark that takes a dataframe and a datatype as parameters and filters out the columns that do not satisfy the criteria. I am not very good at Python and I am stuck at the point where I cannot figure out how to do that.
I have a Scala version of the code that does the same thing.
//sample data
val df = Seq(("587","mumbai",Some(5000),5.05),("786","chennai",Some(40000),7.055),("432","Gujarat",Some(20000),6.75),("2","Delhi",None,10.0)).toDF("Id","City","Salary","Increase").withColumn("RefID",$"Id")
import org.apache.spark.sql.functions.col
def selectByType(colType: DataType, df: DataFrame) = {
val cols = df.schema.toList
.filter(x => x.dataType == colType)
.map(c => col(c.name))
df.select(cols:_*)
}
val res = selectByType(IntegerType, df)
res is the dataframe that has only the integer columns, in this case the Salary column, and all the other columns with different types have been dropped dynamically.
I want the same behaviour in PySpark, but I am not able to accomplish it.
This is what I have tried
# sample data
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
    StructField("raise", DoubleType(), True)
])
data2 = [("James","","Smith","36636","M",3000,2.5),
("Michael","Rose","","40288","M",4000,4.7),
("Robert","","Williams","42114","M",4000,8.9),
("Maria","Anne","Jones","39192","F",4000,0.0),
("Jen","Mary","Brown","","F",-1,-1.2)
]
df = spark.createDataFrame(data=data2,schema=schema)
# getting the column list from the schema of the dataframe
pschema = df.schema.fields
datatypes = [IntegerType, DoubleType]  # column datatypes that I want
out = filter(lambda x: x.dataType.isin(datatypes), pschema)  # gives invalid syntax error
Can someone help me understand what I am doing wrong? The Scala code only passes a single datatype, but for my use case I want to handle the scenario where we can pass multiple datatypes and get back a dataframe with only the columns of those specified datatypes.
If someone can give me an idea of how to make it work for a single datatype, I can try to extend it to multiple datatypes.
Note: the sample data for Scala and PySpark is different because I copied the PySpark sample data from somewhere else just to speed things up; I am only concerned about the final output requirement.
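A minimal sketch of one possible approach in PySpark (not an official answer; it assumes the df and schema defined above, and uses isinstance because StructField.dataType is a plain DataType object rather than a Column, so isin does not apply):

from pyspark.sql.types import IntegerType, DoubleType

def select_by_types(df, *col_types):
    # Keep only the columns whose dataType is an instance of one of col_types.
    cols = [f.name for f in df.schema.fields if isinstance(f.dataType, col_types)]
    return df.select(*cols)

res = select_by_types(df, IntegerType, DoubleType)
res.printSchema()  # only salary (integer) and raise (double) remain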
I'm attempting to convert a CSV file using this schema:
sch = StructType([
StructField("id", StringType(), True),
StructField("words", ArrayType((StringType())), True)
])
dataFile = 'mycsv.csv'
df = sqlContext.read \
    .option("mode", "DROPMALFORMED") \
    .schema(sch) \
    .option("delimiter", ",") \
    .option("charset", "UTF-8") \
    .load(dataFile, format='com.databricks.spark.csv', header='true', inferSchema='false')
mycsv.csv contains :
id , words
a , test here
I expect df to contain [Row(id='a', words=['test' , 'here'])]
but instead it is empty, as df.collect() returns []
Is my schema defined correctly ?
Well, clearly your words column isn't of type Array, it is only a StringType(). And since you have DROPMALFORMED enabled, it is dropping the records because they don't match the Array schema. Try a schema like the one below and it should work fine:
sch = StructType([
StructField("id", StringType(), True),
StructField("words", StringType(), True)
])
Edit: if you really want the 2nd column as an Array/List of words, do this:
from pyspark.sql.functions import split
df.select(df.id,split(df.words," ").alias('words')).show()
This outputs:
+---+--------------+
| id| words|
+---+--------------+
| a |[, test, here]|
+---+--------------+
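Not in the original answer, but if the leading empty string in the array bothers you (it comes from the space left around the comma delimiter), one option is to trim the column before splitting - a small sketch:

from pyspark.sql.functions import split, trim

# trim removes the leading/trailing spaces, so split no longer produces an empty first element
df.select(df.id, split(trim(df.words), " ").alias('words')).show()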
I am new to Spark and Python and am facing difficulty building a schema from a metadata file that can be applied to my data file.
Scenario: a metadata file for the data file (CSV format) contains the columns and their types, for example:
id,int,10,"","",id,"","",TRUE,"",0
created_at,timestamp,"","","",created_at,"","",FALSE,"",0
I have successfully converted this to a dataframe that looks like:
+--------------------+---------------+
|                name|           type|
+--------------------+---------------+
|                  id|  IntegerType()|
|          created_at|TimestampType()|
|          updated_at|   StringType()|
+--------------------+---------------+
But when I try to convert this to a StructField format using this
fields = schemaLoansNew.map(lambda l:([StructField(l.name, l.type, 'true')]))
OR
schemaList = schemaLoansNew.map(lambda l: ("StructField(" + l.name + "," + l.type + ",true)")).collect()
And then later convert it to StructType, using
schemaFinal = StructType(schemaList)
I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/mapr/spark/spark-1.4.1/python/pyspark/sql/types.py", line 372, in __init__
assert all(isinstance(f, DataType) for f in fields), "fields should be a list of DataType"
AssertionError: fields should be a list of DataType
I am stuck on this due to my lack of knowledge of data frames. Can you please advise how to proceed? Once I have the schema ready I want to use createDataFrame to apply it to my data file. This process has to be done for many tables, so I do not want to hardcode the types; rather, I want to use the metadata file to build the schema and then apply it to the RDD.
Thanks in advance.
The fields argument has to be a list of DataType objects. This:
.map(lambda l:([StructField(l.name, l.type, 'true')]))
generates, after collect, a list of lists of tuples (Rows) of DataType (list[list[tuple[DataType]]]), not to mention that the nullable argument should be a boolean, not a string.
Your second attempt:
.map(lambda l: ("StructField(" + l.name + "," + l.type + ",true)")).
generates, after collect, a list of str objects.
The correct schema for the record you've shown should look more or less like this:
from pyspark.sql.types import *
StructType([
StructField("id", IntegerType(), True),
StructField("created_at", TimestampType(), True),
StructField("updated_at", StringType(), True)
])
Although using distributed data structures for a task like this is serious overkill, not to mention inefficient, you can try to adjust your first solution as follows:
StructType([
StructField(name, eval(type), True) for (name, type) in df.rdd.collect()
])
but it is not particularly safe (eval). It could be easier to build a schema from JSON / a dictionary. Assuming you have a function which maps from a type description to a canonical type name:
def get_type_name(s: str) -> str:
    """
    >>> get_type_name("int")
    'integer'
    """
    _map = {
        'int': IntegerType().typeName(),
        'timestamp': TimestampType().typeName(),
        # ...
    }
    return _map.get(s, StringType().typeName())
You can build a dictionary of the following shape:
schema_dict = {'fields': [
{'metadata': {}, 'name': 'id', 'nullable': True, 'type': 'integer'},
{'metadata': {}, 'name': 'created_at', 'nullable': True, 'type': 'timestamp'}
], 'type': 'struct'}
and feed it to StructType.fromJson:
StructType.fromJson(schema_dict)
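Putting the pieces together, a sketch (not from the original answer) that builds such a dictionary from name/type pairs parsed out of the metadata file, using the get_type_name helper above; metadata_rows is a hypothetical name for those parsed pairs:

# Hypothetical: (name, raw_type) pairs parsed from the metadata file.
metadata_rows = [("id", "int"), ("created_at", "timestamp"), ("updated_at", "string")]

schema_dict = {
    "type": "struct",
    "fields": [
        {"name": name, "type": get_type_name(raw_type), "nullable": True, "metadata": {}}
        for name, raw_type in metadata_rows
    ],
}

schema_final = StructType.fromJson(schema_dict)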
The steps below can be followed to build and apply the DataType objects:
data_schema = [
    StructField("age", IntegerType(), True),
    StructField("name", StringType(), True)
]
final_struct = StructType(fields=data_schema)
df = spark.read.json('/home/abcde/Python-and-Spark-for-Big-Data-master/Spark_DataFrames/people.json', schema=final_struct)
df.printSchema()
root
|-- age: integer (nullable = true)
|-- name: string (nullable = true)
I have a tab-separated file containing lines as
id1 name1 ['a', 'b'] 3.0 2.0 0.0 1.0
that is, an id, a name, a list with some strings, and a series of 4 float attributes.
I am reading this file as
rdd = sc.textFile('myfile.tsv') \
.map(lambda row: row.split('\t'))
df = sqlc.createDataFrame(rdd, schema)
where I give the schema as
schema = StructType([
StructField('id', StringType(), True),
StructField('name', StringType(), True),
StructField('list', ArrayType(StringType()), True),
StructField('att1', FloatType(), True),
StructField('att2', FloatType(), True),
StructField('att3', FloatType(), True),
StructField('att4', FloatType(), True)
])
The problem is that neither the list nor the attributes get read properly, judging from a collect on the DataFrame. In fact, I get None for all of them:
Row(id=u'id1', brand_name=u'name1', list=None, att1=None, att2=None, att3=None, att4=None)
What am I doing wrong?
It is properly read, it just doesn't work the way you expect. The schema argument declares what the types are, in order to avoid expensive schema inference, not how to cast the data. Providing input that matches the declared schema is your responsibility.
This can also be handled by the data source (take a look at spark-csv and its inferSchema option). It won't handle complex types like arrays, though.
Since your schema is mostly flat and you know the types, you can try something like this:
df = rdd.toDF([f.name for f in schema.fields])
from pyspark.sql.functions import col

exprs = [
    # You should exclude casting
    # on other complex types as well
    col(f.name).cast(f.dataType) if f.dataType.typeName() != "array"
    else col(f.name)
    for f in schema.fields
]

df.select(*exprs)
and handle the complex types separately using string-processing functions or UDFs. Alternatively, since you read the data in Python anyway, just enforce the desired types before you create the DF.
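As an illustration of the string-processing route for the list column (a sketch, not from the original answer; the regex assumes the exact format shown in the sample line):

from pyspark.sql.functions import col, regexp_replace, split

# "['a', 'b']" -> "a,b" -> ["a", "b"]
df_parsed = df.withColumn(
    "list",
    split(regexp_replace(col("list"), r"[\[\]' ]", ""), ",")
)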