What I am trying to do seems to be quite simple. I need to create a dataframe with a single column and a single value.
I have tried a few approaches, namely:
Creating an empty dataframe and appending the data afterwards:
project_id = 'PC0000000042'
schema = T.StructType([T.StructField("ProjectId", T.StringType(), True)])
empty_df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
rdd = sc.parallelize([(project_id)])
df_temp = spark.createDataFrame(rdd, schema)
df = empty_df.union(df_temp)
Creating a dataframe based on this one value:
rdd = sc.parallelize([(project_id)])
df = spark.createDataFrame(rdd, schema)
However, what I get in both cases is:
TypeError: StructType can not accept object 'PC0000000042' in type <class 'str'>
Which I don't quite understand since the type seems to be correct. Thank you for any advice!
One small change. If you have project_id = 'PC0000000042', then
rdd = sc.parallelize([[project_id]])
You should pass the data as a list of lists: [['PC0000000042']] instead of ['PC0000000042'].
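The underlying reason: in Python, (project_id) is just the string itself, not a one-element tuple, so the RDD ends up holding bare strings, which the StructType rejects. A minimal sketch of two working forms, assuming an active SparkSession spark and the schema defined above:
# (project_id,) with a trailing comma is a real 1-tuple; [project_id] is a 1-element list
df_from_tuples = spark.createDataFrame([(project_id,)], schema)  # list of tuples
df_from_lists = spark.createDataFrame([[project_id]], schema)    # list of lists
df_from_tuples.show()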
If you have 2 rows, then:
project_id = [['PC0000000042'], ['PC0000000043']]
rdd = sc.parallelize(project_id)
spark.createDataFrame(rdd, schema).show()
+------------+
| ProjectId|
+------------+
|PC0000000042|
|PC0000000043|
+------------+
Without RDDs, you can also do:
project_id = [['PC0000000042']]
spark.createDataFrame(project_id, schema=schema).show()
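Another RDD-free variant, if you prefer named fields, is this Row-based sketch (again assuming an active SparkSession spark):
from pyspark.sql import Row

df = spark.createDataFrame([Row(ProjectId=project_id)])
df.show()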
I am getting a problem with the code below. I want to create a single-column dataframe.
May I know what I am doing wrong here?
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, ArrayType, StructType, StructField, StringType
data = [ (["James","Jon","Jane"]), (["Miken","Mik","Mike"]), (["John","Johns"])]
cols = StructType([ StructField("Name",ArrayType(StringType()),True) ])
df = spark.createDataFrame(data=data,schema=cols)
df.printSchema()
df.show()
Expected output:
Name
["James","Jon","Jane"]
["Miken","Mik","Mike"]
["John","Johns"]
Instead, I am getting the error below:
Length of object (3) does not match with length of fields (1)
This error occurs because you passed the data as a multi-column structure, while your schema expects a single column. To get the data into a single column, you need to wrap each row in a one-element tuple with [(row,) for row in data]:
data = [(["James","Jon","Jane"]), (["Miken","Mik","Mike"]), (["John","Johns"])]
cols = StructType([ StructField("Name",ArrayType(StringType()),True) ])
df = spark.createDataFrame(data=[(row,) for row in data], schema=cols)
df.printSchema()
df.show()
Output:
root
 |-- Name: array (nullable = true)
 |    |-- element: string (containsNull = true)

+------------------+
|              Name|
+------------------+
|[James, Jon, Jane]|
|[Miken, Mik, Mike]|
|     [John, Johns]|
+------------------+
PySpark has this problem. The way I go about it is to introduce an ID column and drop it once the df is created:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType,ArrayType,StructType,StructField,StringType
data = [ (1,["James","Jon","Jane"]), (2,["Miken","Mik","Mike"]), (3,["John","Johns"])]
cols = StructType([ StructField("ID",IntegerType(),True), StructField("Name",ArrayType(StringType()),True) ])
df = spark.createDataFrame(data=data,schema=cols).drop('ID')
df.printSchema()
df.show()
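For context on why the ID trick works: (1, [...]) is a genuine two-field tuple, while ([...]) is just the bare list, which Spark unpacks into three fields. You can confirm the difference in plain Python:
row_with_id = (1, ["James","Jon","Jane"])
row_without = (["James","Jon","Jane"])  # the parentheses do nothing here
print(len(row_with_id))   # 2, matching the two-column schema
print(type(row_without))  # <class 'list'>, so Spark sees 3 values for 1 schema field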
I have a column in a Python df like:
TAGS
{user_type:active}
{session_type:session1}
{user_type:inactive}
How can I efficiently turn each tag key into its own column?
Desired:
TAGS |user_type|session_type
{user_type:active} |active |null
{session_type:session1}|null |session1
{user_type:inactive} |inactive |null
My attempt is only able to do this in a boolean sense (not what I want), and only if I specify the tag columns up front (which I don't know ahead of time):
mask = df['tags'].apply(lambda x: 'user_type' in x)
df['user_type'] = mask
There are better ways, but this works with what you've got:
import numpy as np

df['user_type'] = df['tags'].apply(lambda x: x.strip('{}').split(':')[1] if 'user_type' in x else np.nan)
df['session_type'] = df['tags'].apply(lambda x: x.strip('{}').split(':')[1] if 'session_type' in x else np.nan)
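One of those better ways, as a sketch (assuming every cell holds exactly one {key:value} pair): extract the key and value with a regex, then pivot on the key, so the tag names don't need to be known ahead of time:
# key/value per row; the columns are named by the regex groups
extracted = df['tags'].str.extract(r'\{(?P<key>[^:]+):(?P<value>[^}]+)\}')
# one column per distinct key, NaN where a row lacks that key
df = df.join(extracted.pivot(columns='key', values='value'))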
You could use pandas.json_normalize() to expand the TAGS column into one column per key, once each value is a dict; a key missing from a row comes out as NaN:
df2 = pd.json_normalize(df['TAGS'])
Alternatively, check whether user_type is a key of each dict:
df['user_type'] = df['TAGS'].apply(lambda x: x['user_type'] if 'user_type' in x else None)
This is what ended up working for me; I wanted to post a short working example using the json library that helped.
# This example assumes a dataframe that has other fields in addition to tags
import json
import pandas as pd

def js(row):
    if row:
        return json.loads(row)
    else:
        return {'': ''}

df2 = df.copy()
# Make some dummy tags
df2['tags'] = ['{"user_type":"active","nonuser_type":"inactive"}'] * len(df2['tags'])
df2['tags'] = df2['tags'].apply(js)
df_temp = pd.DataFrame(df2['tags'].values.tolist())
df3 = pd.concat([df2.drop('tags', axis=1), df_temp], axis=1)
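As a side note, once the column holds dicts (after the json.loads step above), pd.json_normalize gives the same expansion in one call; a sketch, assuming df2 keeps its default integer index:
df_temp = pd.json_normalize(df2['tags'].tolist())
df3 = pd.concat([df2.drop('tags', axis=1), df_temp], axis=1)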
@Ynjxsjmh your approach reminds me of something I had used in the past, but in this case I got the following error:
AttributeError: 'str' object has no attribute 'values'
@Bing Wang I am a big fan of list comprehension, but in this case I don't know the names of the columns beforehand.
I am creating an empty dataframe for a requirement, and when I call the withColumn function on it, I get the columns but the data comes out empty, as follows:
schema = StructType([])
df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
json = list(map(lambda row: row.asDict(True), df.collect()))
df.show()
++
||
++
++
df= df.withColumn('First_name',F.lit('Tony'))\
.withColumn('Last_name',F.lit('Chapman'))\
.withColumn('Age',F.lit('28'))
df.show()
+----------+---------+---+
|First_name|Last_name|Age|
+----------+---------+---+
+----------+---------+---+
What is the reason for this? How to solve this?
That's the expected result: withColumn means Spark will iterate over all the rows and add the column value to each one. Since your dataframe is empty, there is nothing to iterate over, so there are no values.
If you want to get some data into the dataframe, you need to use parallelize:
from pyspark.sql import Row
l = [('Tony','Chapman',28)]
rdd = sc.parallelize(l)
rdd_rows = rdd.map(lambda x: Row(First_Name=x[0], Last_Name=x[1], Age=int(x[2])))
df = sqlContext.createDataFrame(rdd_rows)
Or, from Spark 2.0 (thanks pault), you can skip the RDD creation:
l = [('Tony','Chapman',28)]
df = sqlContext.createDataFrame(l, ["First_Name", "Last_Name", "Age"])
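And if you want to keep the lit()/withColumn style from the question, one workaround is to start from a one-row dataframe instead of a zero-row one, so there is something to iterate over (a sketch, assuming a SparkSession spark and pyspark.sql.functions imported as F):
df = (spark.range(1)  # one dummy row to iterate over
      .withColumn('First_name', F.lit('Tony'))
      .withColumn('Last_name', F.lit('Chapman'))
      .withColumn('Age', F.lit(28))
      .drop('id'))  # drop the dummy range column
df.show()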
To pass a schema while reading a JSON file, we do this:
from pyspark.sql.types import (StructField, StringType, StructType, IntegerType)
data_schema = [StructField('age', IntegerType(), True), StructField('name', StringType(), True)]
final_struc = StructType(fields = data_schema)
df = spark.read.json('people.json', schema=final_struc)
The above code works as expected. However now, I have data in table which I display by:
df = sqlContext.sql("SELECT * FROM people_json")
But if I try to pass a new schema to it using the following command, it does not work:
df2 = spark.sql("SELECT * FROM people_json", schema=final_struc)
It gives the following error:
sql() got an unexpected keyword argument 'schema'
NOTE: I am using Databricks Community Edition
What am I missing?
How do I pass the new schema if I have data in the table instead of some JSON file?
You cannot apply a new schema to an already created dataframe. However, you can change the type of each column by casting it to another datatype, as below:
df = df.withColumn("column_name", df["column_name"].cast("new_datatype"))
If you need to apply a new schema, you need to convert the dataframe to an RDD and create a new dataframe again, as below:
df = sqlContext.sql("SELECT * FROM people_json")
newDF = spark.createDataFrame(df.rdd, schema=schema)
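An equivalent way to retype several columns at once, as a sketch in PySpark, is to select each column with an explicit cast (the column names here follow the people_json example above):
from pyspark.sql import functions as F

df2 = df.select(
    F.col("age").cast("integer").alias("age"),
    F.col("name").cast("string").alias("name"),
)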
Hope this helps!
There is already an answer available, but I still want to add something.
Create DF from RDD
using toDF:
newDF = rdd.toDF(schema)  # schema can be a StructType or a list of column names
using createDataFrame:
newDF = spark.createDataFrame(rdd, schema)  # likewise, a StructType or a list of column names
Create DF from other DF
Suppose I have a DataFrame with columns name (string), marks (string), gender (string), and I want to get only marks, as an integer:
newDF = oldDF.select("marks")
newDF_with_int = newDF.withColumn("marks", newDF['marks'].cast('integer'))
This will convert marks to an integer.
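A minimal end-to-end sketch of the toDF route, with hypothetical sample data and an active SparkContext sc:
rdd = sc.parallelize([("amit", "85", "M"), ("rina", "92", "F")])
newDF = rdd.toDF(["name", "marks", "gender"])
newDF_with_int = newDF.withColumn("marks", newDF["marks"].cast("integer"))
newDF_with_int.printSchema()  # marks is now integer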
Assume I have the following dataframe:
dummy_data = [('a',1),('b',25),('c',3),('d',8),('e',1)]
df = sc.parallelize(dummy_data).toDF(['letter','number'])
And I want to create the following dataframe:
[('a',0),('b',2),('c',1),('d',3),('e',0)]
What I do is convert it to an RDD, use the zipWithIndex function, and afterwards join the results:
convertDF = (df.select('number')
.distinct()
.rdd
.zipWithIndex()
.map(lambda x:(x[0].number,x[1]))
.toDF(['old','new']))
finalDF = (df
.join(convertDF,df.number == convertDF.old)
.select(df.letter,convertDF.new))
Is there a similar function to zipWithIndex for dataframes? Is there another, more efficient way to do this task?
Please check https://issues.apache.org/jira/browse/SPARK-23074 for direct functionality parity in dataframes; upvote that JIRA if you're interested in seeing this in Spark at some point.
Here's a workaround in PySpark in the meantime:
from pyspark.sql.types import StructType, StructField, LongType

def dfZipWithIndex(df, offset=1, colName="rowId"):
    '''
    Enumerates dataframe rows in native order, like rdd.zipWithIndex(),
    but on a dataframe, and preserves the schema.

    :param df: source dataframe
    :param offset: adjustment to zipWithIndex()'s index
    :param colName: name of the index column
    '''
    new_schema = StructType(
        [StructField(colName, LongType(), True)]  # newly added field in front
        + df.schema.fields                        # previous schema
    )
    zipped_rdd = df.rdd.zipWithIndex()
    new_rdd = zipped_rdd.map(lambda args: [args[1] + offset] + list(args[0]))
    return spark.createDataFrame(new_rdd, new_schema)
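A quick usage sketch with the dummy_data frame from the question (this enumerates rows in native order rather than ranking distinct numbers, so the join step from the question still applies):
df = sc.parallelize([('a',1),('b',25),('c',3),('d',8),('e',1)]).toDF(['letter','number'])
dfZipWithIndex(df, offset=0, colName='rowId').show()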
That's also available in the abalon package.