I need to append a NumPy array to a PySpark DataFrame as a new column.
The result needs to look like this, with the added var38mc column:
+----+------+-------------+-------+
| ID|TARGET| var38|var38mc|
+----+------+-------------+-------+
| 1.0| 0.0| 117310.9790| True|
| 3.0| 0.0| 39205.17000| False|
| 4.0| 0.0| 117310.9790| True|
+----+------+-------------+-------+
First, I calculated a boolean array flagging the values that approximate 117310.979016494:
array_var38mc = np.isclose(train3.select("var38").rdd.flatMap(lambda x: x).collect(), 117310.979016494)
The output is a numpy.ndarray, like this: [True, False, True].
Next, I try to append this NumPy array, calculated from the data of this same PySpark DataFrame, as a new column:
train4 = train3.withColumn('var38mc',col(df_var38mc))
But I got this error:
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
P.S.: I tried converting the NumPy array into a list and into another PySpark DataFrame, without success.
Use a UDF instead:
import pyspark.sql.functions as F
from pyspark.sql.types import BooleanType
import numpy as np
func = F.udf(lambda x: bool(np.isclose(x, 117310.979016494)), BooleanType())
train4 = train3.withColumn('var38mc', func('var38'))
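As a side note (not part of the original answer): if a fixed tolerance is acceptable, the same flag can be computed without a Python UDF, using only built-in column functions. A minimal sketch, assuming an absolute tolerance of 1e-8 (np.isclose also applies a relative tolerance, so results can differ at the margins):
import pyspark.sql.functions as F

# Flag rows whose var38 lies within a fixed absolute tolerance of the target value
tol = 1e-8  # assumed tolerance, not from the original answer
train4 = train3.withColumn(
    'var38mc',
    F.abs(F.col('var38') - F.lit(117310.979016494)) < tol
)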
I have a UDF which takes a key and returns the corresponding value from name_dict.
from pyspark.sql import *
from pyspark.sql.functions import udf, when, col
name_dict = {'James': "manager", 'Robert': 'director'}
func = udf(lambda name: name_dict[name])
The original dataframe: James and Robert are in the dict, but Michael is not.
data = [("James","M"),("Michael","M"),("Robert",None)]
test = spark.createDataFrame(data = data, schema = ['name', 'gender'])
test.show()
+-------+------+
| name|gender|
+-------+------+
| James| M|
|Michael| M|
| Robert| null|
+-------+------+
To prevent a KeyError, I use a when condition to filter the rows before the UDF is applied, but it does not work.
test.withColumn('senior', when(col('name').isin(['James', 'Robert']), func(col('name'))).otherwise(col('gender'))).show()
PythonException: An exception was thrown from a UDF: 'KeyError:
'Michael'', from , line 8. Full traceback
below...
What is the cause of this, and are there any feasible ways to solve it? Assume that not all the names are keys of the dictionary; for those that are not, I would like to copy the value from another column, say gender here.
This is actually the documented behavior of user-defined functions in Spark. From the docs:
The user-defined functions do not support conditional expressions or
short circuiting in boolean expressions and it ends up with being
executed all internally. If the functions can fail on special rows,
the workaround is to incorporate the condition into the functions.
So in your case you need to rewrite your UDF as:
func = udf(lambda name: name_dict.get(name, "NA"))
Then call it with:
test.withColumn('senior', func(col('name'))).show()
#+-------+------+--------+
#| name|gender| senior|
#+-------+------+--------+
#| James| M| manager|
#|Michael| M| NA|
#| Robert| null|director|
#+-------+------+--------+
However, in your case you can actually do this without a UDF, by using a map column:
from itertools import chain
from pyspark.sql.functions import col, create_map, lit
map_col = create_map(*[lit(x) for x in chain(*name_dict.items())])
test.withColumn('senior', map_col[col('name')]).show()
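If you also want the fallback to gender for names that are not in the dictionary, as asked in the question, a coalesce over the map lookup should work (untested sketch building on map_col above):
from pyspark.sql.functions import coalesce

# Map lookups return null for missing keys, so coalesce falls back to gender
test.withColumn('senior', coalesce(map_col[col('name')], col('gender'))).show()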
I do not completely understand when I need to use a lambda function in the definition of a UDF.
My prior understanding was that I needed lambda so that the DataFrame knows it has to iterate over each row, but I have seen many applications of UDFs without a lambda expression.
For example:
I have a silly function that works well like this without using lambda:
#udf("string")
def unknown_city(s, city):
if s == 'KS' and 'MI':
return 'Unknown'
else:
return city
display(df2.
withColumn("new_city", unknown_city(col('geo.state'), col('geo.city')))
)
How can I make it work with lambda? Is it necessary?
A Python lambda is just another way to write a function. See the example code below: the two are pretty much the same, except that a lambda can only hold a single expression.
With lambda function
from pyspark.sql import functions as F
from pyspark.sql import types as T
df.withColumn('num+1', F.udf(lambda num: num + 1, T.IntegerType())('num')).show()
# +---+-----+
# |num|num+1|
# +---+-----+
# | 10| 11|
# | 20| 21|
# +---+-----+
With normal function
from pyspark.sql import functions as F
from pyspark.sql import types as T
def numplus2(num):
return num + 2
df.withColumn('num+2', F.udf(numplus2, T.IntegerType())('num')).show()
# +---+-----+
# |num|num+2|
# +---+-----+
# | 10| 12|
# | 20| 22|
# +---+-----+
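For completeness (not shown in either snippet above), udf can also be used as a decorator, which is the style the question's @udf("string") uses. A sketch with a hypothetical numplus3:
from pyspark.sql import functions as F
from pyspark.sql import types as T

@F.udf(T.IntegerType())
def numplus3(num):
    # A named function decorated with F.udf; no lambda needed
    return num + 3

df.withColumn('num+3', numplus3('num')).show()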
I have a Spark DataFrame (json_df) and I need to create another DataFrame based on the nested JSON:
This is my current Dataframe:
I know I could do it manually, like final_df = json_df.select( col("Body.EquipmentId"),..... ), but I want to do it in a generic way.
Note: for this specific DF, the JSON records all have the same structure.
Any idea?
Thanks!
Programmatically, you can do it like this:
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf
from pyspark.sql import functions as F
conf = SparkConf()
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
df = sc.parallelize([({"A":1, "B":2},), ({"A":3,"B":4},), ({"A":5,"B":6},)]).toDF(['Body'])
keys_df = df.select(F.explode(F.map_keys(F.col('Body')))).distinct()
keys = list(map(lambda row: row[0], keys_df.collect()))
key_cols = list(map(lambda f: F.col("Body").getItem(f).alias(str(f)), keys))
final_cols = df.select(key_cols)
final_cols.show()
Which produces
+---+---+
| B| A|
+---+---+
| 2| 1|
| 4| 3|
| 6| 5|
+---+---+
If you have the entire list of keys already, you can skip the part where it gets the keys and just set the keys manually:
keys = ['A', 'B']
Source: https://mungingdata.com/pyspark/dict-map-to-multiple-columns/
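If Body is a struct (typical when the JSON is parsed with a schema) rather than a map, map_keys will not apply; a hedged sketch of the generic equivalent for that case:
from pyspark.sql import functions as F

# Pull every nested field of the struct up to the top level
final_df = json_df.select("Body.*")

# Or build the column list from the schema if you need explicit aliases
key_cols = [F.col("Body." + f.name).alias(f.name)
            for f in json_df.schema["Body"].dataType.fields]
final_df = json_df.select(key_cols)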
I have a Spark DataFrame which has a column 'X'. The column contains elements which are in the form:
u'[23,4,77,890,455,................]'
How can I convert this unicode string to a list? That is, my output should be
[23,4,77,890,455...................]
I have to apply this to each element in the 'X' column.
I have tried df.withColumn("X_new", ast.literal_eval(x)) and got the error "Malformed String".
I also tried df.withColumn("X_new", json.loads(x)) and got the error "Expected String or Buffer".
df.withColumn("X_new", json.dumps(x)) says JSON not serialisable,
and df_2 = df.rdd.map(lambda x: x.encode('utf-8')) says rdd has no attribute encode.
I don't want to use collect() and toPandas() because they are memory-consuming (but if that's the only way, please do tell). I am using PySpark.
Update: cph_sto gave an answer using a UDF. Though it works well, I find that it is slow. Can somebody suggest any other method?
import ast
from pyspark.sql.functions import udf, col
values = [(u'[23,4,77,890.455]',10),(u'[11,2,50,1.11]',20),(u'[10.05,1,22.04]',30)]
df = sqlContext.createDataFrame(values,['list','A'])
df.show()
+-----------------+---+
| list| A|
+-----------------+---+
|[23,4,77,890.455]| 10|
| [11,2,50,1.11]| 20|
| [10.05,1,22.04]| 30|
+-----------------+---+
# Creating a UDF to convert the string list to proper list
string_list_to_list = udf(lambda row: ast.literal_eval(row))
df = df.withColumn('list',string_list_to_list(col('list')))
df.show()
+--------------------+---+
| list| A|
+--------------------+---+
|[23, 4, 77, 890.455]| 10|
| [11, 2, 50, 1.11]| 20|
| [10.05, 1, 22.04]| 30|
+--------------------+---+
Extension of the Q, as asked by OP -
# Creating a UDF to find length of resulting list.
length_list = udf(lambda row: len(row))
df = df.withColumn('length_list',length_list(col('list')))
df.show()
+--------------------+---+-----------+
| list| A|length_list|
+--------------------+---+-----------+
|[23, 4, 77, 890.455]| 10| 4|
| [11, 2, 50, 1.11]| 20| 4|
| [10.05, 1, 22.04]| 30| 3|
+--------------------+---+-----------+
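As a follow-up on the same answer (not part of the original): if the conversion UDF declares an array return type instead of the default string, the built-in size() can replace the second UDF. A sketch that starts again from a string-typed 'list' column:
import ast
from pyspark.sql.functions import size, udf, col
from pyspark.sql.types import ArrayType, DoubleType

# Parse the string and force floats so the values match the declared element type
typed_list_udf = udf(lambda row: [float(v) for v in ast.literal_eval(row)],
                     ArrayType(DoubleType()))

df2 = df.withColumn('list', typed_list_udf(col('list'))) \
        .withColumn('length_list', size(col('list')))
df2.show()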
Since it's a string, you could remove the first and last characters:
From '[23,4,77,890,455]' to '23,4,77,890,455'
Then apply the split() function to generate an array, taking , as the delimiter.
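A sketch of that split-based approach, using the column names X and X_new from the question (a cast is added so the pieces come back as numbers rather than strings):
from pyspark.sql.functions import regexp_replace, split, col

# Drop the surrounding brackets, split on ',', and cast the pieces to doubles
df = df.withColumn(
    'X_new',
    split(regexp_replace(col('X'), r'[\[\]]', ''), ',').cast('array<double>')
)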
Please use the below code to ignore unicode
df.rdd.map(lambda x: x.encode("ascii","ignore"))
I have a DataFrame with two columns: BrandWatchErwaehnungID and word_counts.
The word_counts column is the output of CountVectorizer (a sparse vector). After dropping the empty rows, I created two new columns: one with the indices of the sparse vector and one with their values.
from pyspark.ml.linalg import DenseVector

help0 = countedwords_text['BrandWatchErwaehnungID','word_counts'].rdd\
    .filter(lambda x: x[1].indices.size != 0)\
    .map(lambda x: (x[0], x[1], DenseVector(x[1].indices), DenseVector(x[1].values))).toDF()\
    .withColumnRenamed("_1", "BrandWatchErwaehnungID").withColumnRenamed("_2", "word_counts")\
    .withColumnRenamed("_3", "word_indices").withColumnRenamed("_4", "single_word_counts")
I needed to convert them to dense vectors before adding them to my DataFrame because Spark does not accept numpy.ndarray. My problem is that I now want to explode this DataFrame on the word_indices column, but the explode method from pyspark.sql.functions only supports array or map types as input.
I have tried:
help1 = help0.withColumn('b' , explode(help0.word_indices))
and get the following error:
cannot resolve 'explode(`word_indices`)' due to data type mismatch: input to function explode should be array or map type
Afterwards I tried:
help1 = help0.withColumn('b' , explode(help0.word_indices.toArray()))
Which also did not work...
Any suggestions?
You have to use a UDF:
from pyspark.sql.functions import udf, explode
from pyspark.sql.types import *
from pyspark.ml.linalg import *
#udf("array<integer>")
def indices(v):
if isinstance(v, DenseVector):
return list(range(len(v)))
if isinstance(v, SparseVector):
return v.indices.tolist()
df = spark.createDataFrame([
(1, DenseVector([1, 2, 3])), (2, SparseVector(5, {4: 42}))],
("id", "v"))
df.select("id", explode(indices("v"))).show()
# +---+---+
# | id|col|
# +---+---+
# | 1| 0|
# | 1| 1|
# | 1| 2|
# | 2| 4|
# +---+---+
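If Spark 3.0+ is available, the Python UDF may be avoidable: pyspark.ml.functions.vector_to_array converts a vector column into a plain array column, which explode accepts directly. Applied to the question's word_indices column (a DenseVector that already holds the indices), a sketch:
from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import explode, col

# vector_to_array yields array<double>, which explode can handle without a UDF
help1 = help0.withColumn('b', explode(vector_to_array(col('word_indices'))))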