How to pass schema to create a new Dataframe from existing Dataframe? - python

To pass a schema when reading a JSON file, we do this:
from pyspark.sql.types import (StructField, StringType, StructType, IntegerType)
data_schema = [StructField('age', IntegerType(), True), StructField('name', StringType(), True)]
final_struc = StructType(fields=data_schema)
df = spark.read.json('people.json', schema=final_struc)
The above code works as expected. However, now I have data in a table, which I display with:
df = sqlContext.sql("SELECT * FROM people_json")
But if I try to pass a new schema to it with the following command, it does not work:
df2 = spark.sql("SELECT * FROM people_json", schema=final_struc)
It gives the following error:
sql() got an unexpected keyword argument 'schema'
NOTE: I am using Databricks Community Edition.
What am I missing?
How do I pass the new schema if I have data in the table instead of some JSON file?

You cannot apply a new schema to an already created DataFrame. However, you can change the type of each column by casting it to another datatype, as below:
df.withColumn("column_name", df["column_name"].cast("new_datatype"))
If you need to apply a new schema, you need to convert the DataFrame to an RDD and create a new DataFrame from it, as below:
df = sqlContext.sql("SELECT * FROM people_json")
newDF = spark.createDataFrame(df.rdd, schema=schema)
Hope this helps!

There is already one answer available, but I still want to add something.
Create a DF from an RDD
using toDF:
newDF = rdd.toDF(schema)  # schema may be a StructType or a list of column names
using createDataFrame:
newDF = spark.createDataFrame(rdd, schema)  # likewise, schema may be a StructType or a list of column names
Create a DF from another DF
Suppose I have a DataFrame with columns and datatypes: name|string, marks|string, gender|string,
and I want to get only marks, as an integer:
newDF = oldDF.select("marks")
newDF_with_int = newDF.withColumn("marks", newDF['marks'].cast('integer'))
This will convert marks to an integer.

Related

filter on the pyspark dataframe schema to get new dataframe with columns having specific type

I want to create a generic function in PySpark that takes a DataFrame and a datatype as parameters and filters out the columns that do not satisfy the criterion. I am not very good at Python, and I am stuck at the point where I cannot find out how to do that.
I have a scala representation of the code that does the same thing.
// sample data
val df = Seq(("587","mumbai",Some(5000),5.05),("786","chennai",Some(40000),7.055),("432","Gujarat",Some(20000),6.75),("2","Delhi",None,10.0)).toDF("Id","City","Salary","Increase").withColumn("RefID",$"Id")

import org.apache.spark.sql.functions.col
def selectByType(colType: DataType, df: DataFrame) = {
  val cols = df.schema.toList
    .filter(x => x.dataType == colType)
    .map(c => col(c.name))
  df.select(cols:_*)
}
val res = selectByType(IntegerType, df)
res is the DataFrame that has only integer columns, in this case the Salary column, and we have dynamically dropped all the other columns that have different types.
I want the same behaviour in PySpark, but I am not able to accomplish that.
This is what I have tried
//sample data
from pyspark.sql.types import StructType,StructField, StringType, IntegerType, DoubleType
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
    StructField("raise", DoubleType(), True),
])
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
data2 = [("James","","Smith","36636","M",3000,2.5),
("Michael","Rose","","40288","M",4000,4.7),
("Robert","","Williams","42114","M",4000,8.9),
("Maria","Anne","Jones","39192","F",4000,0.0),
("Jen","Mary","Brown","","F",-1,-1.2)
]
df = spark.createDataFrame(data=data2,schema=schema)
//getting the column list from schema of the dataframe
pschema = df.schema.fields
datatypes = [IntegerType,DoubleType] //column datatype that I want.
out = filter(lambda x: x.dataType.isin(datatypes), pschema) //gives invalid syntax error.
Can someone help me out with what I am doing wrong? The Scala code only passes a single datatype, but per my use case I want to handle the scenario in which we can pass multiple datatypes and get back a DataFrame with the required columns of those specified datatypes.
If someone can give me an idea of how to make it work for a single datatype first, I can then try to do the same for multiple datatypes.
Note: the sample data for Scala and PySpark is different, as I copied the PySpark sample data from somewhere just to speed things up; I am only concerned about the final output requirement.

Pyspark: how to create a dataframe with only one row?

What I am trying to do seems to be quite simple. I need to create a dataframe with a single column and a single value.
I have tried a few approaches, namely:
Creation of an empty dataframe and appending the data afterwards:
project_id = 'PC0000000042'
schema = T.StructType([T.StructField("ProjectId", T.StringType(), True)])
empty_df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
rdd = sc.parallelize([(project_id)])
df_temp = spark.createDataFrame(rdd, schema)
df = empty_df.union(df_temp)
Creation of dataframe based on this one value.
rdd = sc.parallelize([(project_id)])
df = spark.createDataFrame(rdd, schema)
However, what I get in both cases is:
TypeError: StructType can not accept object 'PC0000000042' in type <class 'str'>
Which I don't quite understand since the type seems to be correct. Thank you for any advice!
One small change. If you have project_id = 'PC0000000042', then:
rdd = sc.parallelize([[project_id]])
You should pass the data as a list of lists: [['PC0000000042']] instead of ['PC0000000042'].
If you have 2 rows, then:
project_id = [['PC0000000042'], ['PC0000000043']]
rdd = sc.parallelize(project_id)
spark.createDataFrame(rdd, schema).show()
+------------+
| ProjectId|
+------------+
|PC0000000042|
|PC0000000043|
+------------+
Without RDDs, you can also do:
project_id = [['PC0000000042']]
spark.createDataFrame(project_id, schema=schema).show()

Cannot create a dataframe in pyspark and write it to Hive table

I am trying to create a dataframe in pyspark, then write it as a Hive table, and then read it back, but it is not working...
sqlContext = HiveContext(sc)
hive_context = HiveContext(sc) #Initialize Hive
#load the control table
cntl_dt = [('2016-04-30')]
rdd = sc.parallelize(cntl_dt)
row_cntl_dt = rdd.map(lambda x: Row(load_dt=x[0]))
df_cntl_dt = sqlContext.createDataFrame(row_cntl_dt)
df_cntl_dt.write.mode("overwrite").saveAsTable("schema.cntrl_tbl")
load_dt = hive_context.sql("select load_dt from schema.cntrl_tbl" ).first()['load_dt'];
print (load_dt)
This prints: 2
I expect: 2016-12-31
This is because:
cntl_dt = [('2016-04-30')]
is not valid syntax for a single-element tuple. The parentheses will be ignored, and the result will be the same as:
['2016-04-30']
and
Row(load_dt=x[0])
will give:
Row(load_dt='2')
Use:
cntl_dt = [('2016-04-30', )]
Also, you're mixing different contexts (SQLContext and HiveContext), which is generally a bad idea (and neither should be used in any recent Spark version).
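The single-element-tuple pitfall is plain Python, independent of Spark; a quick demonstration:

```python
# parentheses alone do not create a tuple; the trailing comma does
not_a_tuple = ('2016-04-30')   # just the string '2016-04-30'
a_tuple = ('2016-04-30',)      # a one-element tuple

# indexing the string yields its first character, which is why the
# question's Row(load_dt=x[0]) ends up storing '2'
first_from_string = not_a_tuple[0]
first_from_tuple = a_tuple[0]
```

Here `first_from_string` is `'2'`, while `first_from_tuple` is the full date string.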

pyspark RDD to DataFrame

I am new to Spark.
I have a DataFrame, and I used the following command to group it by 'userid':
def test_groupby(df):
    return list(df)

high_volumn = self.df.filter(self.df.outmoney >= 1000).rdd.groupBy(
    lambda row: row.userid).mapValues(test_groupby)
It gives an RDD with the following structure:
(326033430, [Row(userid=326033430, poiid=u'114233866', _mt_datetime=u'2017-06-01 14:54:48', outmoney=1127.0, partner=2, paytype=u'157', locationcity=u'\u6f4d\u574a', locationprovince=u'\u5c71\u4e1c\u7701', location=None, dt=u'20170601')])
326033430 is the big group.
My question is: how can I convert this RDD back to a DataFrame? If I cannot do that, how can I get values out of the Row objects?
Thank you.
You should just do:
from pyspark.sql.functions import collect_list

high_volumn = self.df\
    .filter(self.df.outmoney >= 1000)\
    .groupBy('userid').agg(collect_list('outmoney'))
and in the .agg method pass whatever you want to do with the rest of the data.
Follow this link : http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.agg

Duplicated timestamps when reading data from a CSV with PySpark

I want to read data from a CSV file with the following format:
HIERARCHYELEMENTID, REFERENCETIMESTAMP, VALUE
LOUTHNMA,"2014-12-03 00:00:00.0",0.004433333289
LOUTHNMA,"2014-12-03 00:15:00.0",0.004022222182
LOUTHNMA,"2014-12-03 00:30:00.0",0.0037666666289999998
LOUTHNMA,"2014-12-03 00:45:00.0",0.003522222187
LOUTHNMA,"2014-12-03 01:00:00.0",0.0033333332999999996
I am using the following PySpark function to read from this file:
# Define a specific function to load flow data with schema
def load_flow_data(sqlContext, filename, timeFormat):
    # Columns we're interested in
    flow_columns = ['DMAID', 'TimeStamp', 'Value']
    df = load_data(sqlContext, filename, flow_schema, flow_columns)
    # convert type of timestamp column from string to timestamp
    col = unix_timestamp(df['TimeStamp'], timeFormat).cast("timestamp")
    df = df.withColumn('realTimeStamp', col)
    return df
with the following schema and auxiliary function:
flow_schema = StructType([
    StructField('DMAID', StringType(), True),
    StructField('TimeStamp', StringType(), True),
    StructField('Value', FloatType(), True)
])

def load_data(sqlContext, filename, schema=None, columns=None):
    # If no schema is specified, then infer the schema automatically
    if schema is None:
        df = sqlContext.read.format('com.databricks.spark.csv'). \
            option('header', 'true').option('inferschema', 'true'). \
            option('mode', 'DROPMALFORMED'). \
            load(filename)
    else:
        df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load(filename, schema=schema)
    # If no columns are specified, then select all columns
    if columns is None:
        columns = schema.names
    df = df.select(columns)
    return df
I load the data from the CSV file using these commands:
timeFormat = "yyyy-MM-dd HH:mm:SS"
df_flow_DMA = load_flow_data(sqlContext, flow_file, timeFormat)
Then I convert this data frame to Pandas for visualisation purposes.
However, I find that col = unix_timestamp(df['TimeStamp'], timeFormat).cast("timestamp") is mapping different date & time strings in the CSV file (found in the field 'TimeStamp') to the same 'realTimeStamp' value, as shown in the attached screenshot.
I suspect that the problem is related to the date time string format that I am passing to load_flow_data; I have tried several variations but nothing seems to work.
Could someone please provide a hint as to what is wrong with my code? I am using Python 2.7 and Spark 1.6.
Cheers
