PySpark - withColumn is not working when called on an empty dataframe - python

I am creating an empty dataframe for a requirement, and when I call the withColumn function on it, I get the column headers but no data, as shown below:
schema = StructType([])
df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
json = list(map(lambda row: row.asDict(True), df.collect()))
df.show()
++
||
++
++
df = df.withColumn('First_name', F.lit('Tony')) \
       .withColumn('Last_name', F.lit('Chapman')) \
       .withColumn('Age', F.lit('28'))
df.show()
+----------+---------+---+
|First_name|Last_name|Age|
+----------+---------+---+
+----------+---------+---+
What is the reason for this, and how can I solve it?

That's the expected result: withColumn computes the new column's value once per existing row. Since your dataframe has no rows, there is nothing to compute values for, so the result still has zero rows.
If you want to get some data into a dataframe, build the dataframe from the data itself, for example with parallelize:
from pyspark.sql import Row
l = [('Tony','Chapman',28)]
rdd = sc.parallelize(l)
rdd_rows = rdd.map(lambda x: Row(First_Name=x[0], Last_Name=x[1], Age=int(x[2])))
df = sqlContext.createDataFrame(rdd_rows)
Or, from Spark 2.0 onward (thanks pault), you can skip the RDD creation:
l = [('Tony','Chapman',28)]
df = sqlContext.createDataFrame(l, ["First_Name", "Last_Name", "Age"])
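As a side note, if the goal is really just a one-row dataframe of literal values, a minimal sketch (assuming a Spark 2.x SparkSession named spark) is to start from a dataframe that already has one row, e.g. spark.range(1), so the original withColumn chain has a row to fill:
df = spark.range(1) \
    .withColumn('First_name', F.lit('Tony')) \
    .withColumn('Last_name', F.lit('Chapman')) \
    .withColumn('Age', F.lit('28')) \
    .drop('id')
df.show()  # one row: Tony, Chapman, 28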

Related

How can I unnest a long column (map) into multiple columns with pandas?

I have a dataframe like this:
dataframe name: df_test
ID        Data
test-001  {"B":{"1":{"_seconds":1663207410,"_nanoseconds":466000000}},"C":{"1":{"_seconds":1663207409,"_nanoseconds":978000000}},"D":{"1":{"_seconds":1663207417,"_nanoseconds":231000000}}}
test-002  {"B":{"1":{"_seconds":1663202431,"_nanoseconds":134000000}},"C":{"1":{"_seconds":1663208245,"_nanoseconds":412000000}},"D":{"1":{"_seconds":1663203482,"_nanoseconds":682000000}}}
I want it to be unnested like this:
ID        B_1_seconds  B_1_nanoseconds  C_1_seconds  C_1_nanoseconds  D_1_seconds  D_1_nanoseconds
test-001  1663207410   466000000        1663207409   978000000        1663207417   231000000
test-002  1663202431   134000000        1663208245   412000000        1663203482   682000000
I tried df_test.explode but it doesn't work for this.
I used Dataiku to unnest the data and it worked perfectly. Now I want to unnest the data within my Python notebook; what should I do?
Edit:
I tried
df_list = df_test["Data"].tolist()
then
pd.json_normalize(df_list)
it returned an empty dataframe with only an index but no values in it.
Since pd.json_normalize returns an empty dataframe, I'd guess that df["Data"] contains strings. If that's the case you could try:
import json
df_data = pd.json_normalize(json.loads("[" + ",".join(df["Data"]) + "]"), sep="_")
res = pd.concat([df[["ID"]], df_data], axis=1).rename(lambda c: c.replace("__", "_"), axis=1)
or
df_data = pd.json_normalize(df["Data"].map(eval), sep="_")
res = pd.concat([df[["ID"]], df_data], axis=1).rename(lambda c: c.replace("__", "_"), axis=1)
Result for both alternatives is:
         ID  B_1_seconds  B_1_nanoseconds  C_1_seconds  C_1_nanoseconds  \
0  test-001   1663207410        466000000   1663207409        978000000
1  test-002   1663202431        134000000   1663208245        412000000

   D_1_seconds  D_1_nanoseconds
0   1663207417        231000000
1   1663203482        682000000
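A further alternative, assuming as above that df["Data"] holds JSON strings, is to parse each cell with json.loads instead of eval (a minimal sketch):
import json
import pandas as pd

# parse each JSON string into a dict, then flatten with json_normalize
df_data = pd.json_normalize(df["Data"].map(json.loads).tolist(), sep="_")
res = pd.concat([df[["ID"]], df_data], axis=1).rename(lambda c: c.replace("__", "_"), axis=1)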

Pyspark: how to create a dataframe with only one row?

What I am trying to do seems to be quite simple. I need to create a dataframe with a single column and a single value.
I have tried a few approaches, namely:
Creation of empty dataframe and appending the data afterwards:
project_id = 'PC0000000042'
schema = T.StructType([T.StructField("ProjectId", T.StringType(), True)])
empty_df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
rdd = sc.parallelize([(project_id)])
df_temp = spark.createDataFrame(rdd, schema)
df = empty_df.union(df_temp)
Creation of dataframe based on this one value.
rdd = sc.parallelize([(project_id)])
df = spark.createDataFrame(rdd, schema)
However, what I get in both cases is:
TypeError: StructType can not accept object 'PC0000000042' in type <class 'str'>
Which I don't quite understand since the type seems to be correct. Thank you for any advice!
One small change. If you have project_id = 'PC0000000042', then
rdd = sc.parallelize([[project_id]])
You should pass the data as a list of lists: [['PC0000000042']] instead of ['PC0000000042']. The reason for the TypeError is that (project_id) without a trailing comma is not a tuple; it is just the string itself, so each element of your RDD was a bare string rather than a row.
If you have 2 rows, then:
project_id = [['PC0000000042'], ['PC0000000043']]
rdd = sc.parallelize(project_id)
spark.createDataFrame(rdd, schema).show()
+------------+
|   ProjectId|
+------------+
|PC0000000042|
|PC0000000043|
+------------+
Without RDDs, you can also do:
project_id = [['PC0000000042']]
spark.createDataFrame(project_id,schema=schema).show()
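As a side note, a one-element tuple with a trailing comma also works here, since (project_id) without the comma is just the string itself (a minimal sketch reusing the schema defined above):
project_id = 'PC0000000042'
# (project_id,) is a one-element tuple; (project_id) is just the string
spark.createDataFrame([(project_id,)], schema=schema).show()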

Parsing a data frame to add a new column and update it in PySpark

I have the code below, which creates a data frame:
ratings = spark.createDataFrame(
    sc.textFile("myfile.json").map(lambda l: json.loads(l)),
)
ratings.registerTempTable("mytable")
final_df = sqlContext.sql("select * from mytable");
The data frame looks something like this (it has user_id and created_at columns, among others).
I'm storing the created_at and user_id values in lists:
user_id_list = final_df.select('user_id').rdd.flatMap(lambda x: x).collect()
created_at_list = final_df.select('created_at').rdd.flatMap(lambda x: x).collect()
and iterating through one of the lists to call another function:
for i in range(len(user_id_list)):
    status = get_status(user_id_list[i], created_at_list[i])
I want to create a new column in my data frame called status and set its value for each row based on the corresponding user_id_list and created_at_list values.
I know I need to use this functionality, but I'm not sure how to proceed:
final_df.withColumn('status', 'give the condition here')
Don't create lists. Simply apply a UDF to the dataframe:
import pyspark.sql.functions as F
status_udf = F.udf(lambda user_id, created_at: get_status(user_id, created_at))
df = df.select(df.columns + [status_udf(F.col('user_id'),
                                        F.col('created_at')).alias('status')])
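A minimal sketch of the equivalent withColumn form, assuming the get_status function and the user_id/created_at columns from the question, and assuming get_status returns a string:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

# get_status(user_id, created_at) is assumed to return a string
status_udf = F.udf(lambda user_id, created_at: get_status(user_id, created_at), StringType())
final_df = final_df.withColumn('status', status_udf(F.col('user_id'), F.col('created_at')))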

Create a dictionary of dataframes in PySpark

I am trying to create a dictionary of dataframes keyed by year and month. It's a kind of macro that I can call for the required number of years and months. I am facing a challenge while adding dynamic columns to a PySpark dataframe.
from pyspark.sql import Window
from pyspark.sql.functions import col, lag, when, date_add, to_date
from pyspark.sql.types import DateType

df = spark.createDataFrame([(1, "foo1", '2016-1-31'), (1, "test", '2016-1-31'), (2, "bar1", '2012-1-3'), (4, "foo2", '2011-1-11')], ("k", "v", "date"))
w = Window().partitionBy().orderBy(col('date').desc())
df = df.withColumn("next_date", lag('date').over(w).cast(DateType()))
df = df.withColumn("next_name", lag('v').over(w))
df = df.withColumn("next_date", when(col("k") != lag(df.k).over(w), date_add(df.date, 605)).otherwise(col('next_date')))
df = df.withColumn("next_name", when(col("k") != lag(df.k).over(w), "").otherwise(col('next_name')))
import copy

dict_of_YearMonth = {}

for yearmonth in [200901, 200902, 201605]:  # etc. - the full list of year/month values
    key_name = 'Snapshot_' + str(yearmonth)
    dict_of_YearMonth[key_name].withColumn("test", yearmonth)
    dict_of_YearMonth[key_name].withColumn("test_date", to_date('' + yearmonth[:4] + '-' + yearmonth[4:2] + '-1' + ''))
    # now I want to add a condition:
    # if (dict_of_YearMonth[key_name].test_date >= dict_of_YearMonth[key_name].date) and (test_date <= next_date)
    # then output snapshot_yearmonth, i.e. keep only the rows that satisfy this condition.
    # I am able to do this in pandas but am facing a challenge in PySpark.
    dict_of_YearMonth[key_name]

dict_of_YearMonth
Then I want to concatenate all the dataframes into a single PySpark dataframe. I can do this in pandas as shown below, but I need to do it in PySpark:
snapshots=pd.concat([dict_of_YearMonth['Snapshot_201104'],dict_of_YearMonth['Snapshot_201105']])
If there is any other way to build a dictionary of dataframes with dynamically added columns, apply the condition, generate the year-based dataframes and merge them into a single dataframe, that would also work. Any help would be appreciated.
I have tried the code below and it is working fine:
# Function to append all the dataframes using union
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def unionAll(*dfs):
    return reduce(DataFrame.unionAll, dfs)

# Convert a yyyymm value into a date string
import datetime
from dateutil.parser import parse

def is_date(x):
    try:
        x = str(x) + str('01')
        parse(x)
        return datetime.datetime.strptime(x, '%Y%m%d').strftime("%Y-%m-%d")
    except ValueError:
        pass  # if incorrect format, keep trying other formats
dict_of_YearMonth = {}

for yearmonth in [200901, 200910]:
    key_name = 'Snapshot_' + str(yearmonth)
    dict_of_YearMonth[key_name] = df
    func = udf(lambda x: yearmonth, StringType())
    dict_of_YearMonth[key_name] = df.withColumn("test", func(col('v')))
    default_date = udf(lambda x: is_date(x))
    dict_of_YearMonth[key_name] = dict_of_YearMonth[key_name].withColumn("test_date", default_date(col('test')).cast(DateType()))

dict_of_YearMonth
To union multiple dataframes, use the code below:
final_df = unionAll(dict_of_YearMonth['Snapshot_200901'], dict_of_YearMonth['Snapshot_200910'])
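As a follow-up sketch, assuming the dict_of_YearMonth and unionAll defined above, you can also union every snapshot in the dictionary without listing the keys by hand:
# union all snapshot dataframes currently held in the dictionary
final_df = unionAll(*dict_of_YearMonth.values())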

Spark: equivalent of zipWithIndex in dataframe

Assuming I have the following dataframe:
dummy_data = [('a',1),('b',25),('c',3),('d',8),('e',1)]
df = sc.parallelize(dummy_data).toDF(['letter','number'])
And I want to create the following dataframe:
[('a',0),('b',2),('c',1),('d',3),('e',0)]
What I do is convert it to an RDD, use the zipWithIndex function, and afterwards join the results:
convertDF = (df.select('number')
               .distinct()
               .rdd
               .zipWithIndex()
               .map(lambda x: (x[0].number, x[1]))
               .toDF(['old', 'new']))

finalDF = (df
           .join(convertDF, df.number == convertDF.old)
           .select(df.letter, convertDF.new))
Is there a similar function to zipWithIndex for dataframes? Is there another, more efficient way to do this task?
Please check https://issues.apache.org/jira/browse/SPARK-23074 for this direct functionality parity in dataframes, and upvote that JIRA if you're interested in seeing it in Spark at some point.
Here's a workaround though in PySpark:
from pyspark.sql.types import StructType, StructField, LongType

def dfZipWithIndex(df, offset=1, colName="rowId"):
    '''
    Enumerates dataframe rows in native order, like rdd.zipWithIndex(),
    but on a dataframe and preserves the schema.

    :param df: source dataframe
    :param offset: adjustment to zipWithIndex()'s index
    :param colName: name of the index column
    '''
    new_schema = StructType(
        [StructField(colName, LongType(), True)]  # new field added in front
        + df.schema.fields                        # previous schema
    )
    zipped_rdd = df.rdd.zipWithIndex()
    new_rdd = zipped_rdd.map(lambda args: [args[1] + offset] + list(args[0]))
    return spark.createDataFrame(new_rdd, new_schema)
That's also available in the abalon package.
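For the example dataframe above, a minimal usage sketch of this workaround (assuming the imports and the function defined above):
# adds a leading index column; offset=0 makes the index start at 0 instead of 1
df_indexed = dfZipWithIndex(df, offset=0, colName="index")
df_indexed.show()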
