I want to add a column in a DataFrame with some arbitrary value (that is the same for each row). I get an error when I use withColumn as follows:
dt.withColumn('new_column', 10).head(5)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-50-a6d0257ca2be> in <module>()
1 dt = (messages
2 .select(messages.fromuserid, messages.messagetype, floor(messages.datetime/(1000*60*5)).alias("dt")))
----> 3 dt.withColumn('new_column', 10).head(5)
/Users/evanzamir/spark-1.4.1/python/pyspark/sql/dataframe.pyc in withColumn(self, colName, col)
1166 [Row(age=2, name=u'Alice', age2=4), Row(age=5, name=u'Bob', age2=7)]
1167 """
-> 1168 return self.select('*', col.alias(colName))
1169
1170 @ignore_unicode_prefix
AttributeError: 'int' object has no attribute 'alias'
It seems that I can trick the function into working as I want by adding and subtracting one of the other columns (so they add to zero) and then adding the number I want (10 in this case):
dt.withColumn('new_column', dt.messagetype - dt.messagetype + 10).head(5)
[Row(fromuserid=425, messagetype=1, dt=4809600.0, new_column=10),
Row(fromuserid=47019141, messagetype=1, dt=4809600.0, new_column=10),
Row(fromuserid=49746356, messagetype=1, dt=4809600.0, new_column=10),
Row(fromuserid=93506471, messagetype=1, dt=4809600.0, new_column=10),
Row(fromuserid=80488242, messagetype=1, dt=4809600.0, new_column=10)]
This is supremely hacky, right? I assume there is a more legit way to do this?
Spark 2.2+
Spark 2.2 introduces typedLit to support Seq, Map, and Tuples (SPARK-19254), and the following calls should be supported (Scala):
import org.apache.spark.sql.functions.typedLit
df.withColumn("some_array", typedLit(Seq(1, 2, 3)))
df.withColumn("some_struct", typedLit(("foo", 1, 0.3)))
df.withColumn("some_map", typedLit(Map("key1" -> 1, "key2" -> 2)))
Spark 1.3+ (lit), 1.4+ (array, struct), 2.0+ (map):
The second argument for DataFrame.withColumn should be a Column, so you have to use a literal:
from pyspark.sql.functions import lit
df.withColumn('new_column', lit(10))
If you need complex columns, you can build them using building blocks like array:
from pyspark.sql.functions import array, create_map, struct
df.withColumn("some_array", array(lit(1), lit(2), lit(3)))
df.withColumn("some_struct", struct(lit("foo"), lit(1), lit(.3)))
df.withColumn("some_map", create_map(lit("key1"), lit(1), lit("key2"), lit(2)))
Exactly the same methods can be used in Scala.
import org.apache.spark.sql.functions.{array, lit, map, struct}
df.withColumn("new_column", lit(10))
df.withColumn("map", map(lit("key1"), lit(1), lit("key2"), lit(2)))
To provide names for structs use either alias on each field:
df.withColumn(
"some_struct",
struct(lit("foo").alias("x"), lit(1).alias("y"), lit(0.3).alias("z"))
)
or cast on the whole object
df.withColumn(
"some_struct",
struct(lit("foo"), lit(1), lit(0.3)).cast("struct<x: string, y: integer, z: double>")
)
It is also possible, although slower, to use a UDF.
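For completeness, here is a minimal PySpark sketch of that UDF route (the name ten_udf is made up; functionally this is just a slower lit(10)):
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# A zero-argument UDF that always returns 10; equivalent to lit(10) but slower
ten_udf = udf(lambda: 10, IntegerType())
df.withColumn('new_column', ten_udf())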
Note:
The same constructs can be used to pass constant arguments to UDFs or SQL functions.
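For example, a rough PySpark sketch of passing a constant into a UDF via lit (the column some_int and the UDF add_n are hypothetical):
from pyspark.sql.functions import lit, udf
from pyspark.sql.types import IntegerType

# The constant 10 is passed as a Column via lit, not as a plain Python int
add_n = udf(lambda x, n: x + n, IntegerType())
df.withColumn('plus_ten', add_n(df['some_int'], lit(10)))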
In Spark 2.2, there are two ways to add a constant value to a column of a DataFrame:
1) Using lit
2) Using typedLit.
The difference between the two is that typedLit can also handle parameterized Scala types, e.g. List, Seq, and Map.
Sample DataFrame:
val df = spark.createDataFrame(Seq((0,"a"),(1,"b"),(2,"c"))).toDF("id", "col1")
+---+----+
| id|col1|
+---+----+
|  0|   a|
|  1|   b|
|  2|   c|
+---+----+
1) Using lit: adding a constant string value in a new column named newcol:
import org.apache.spark.sql.functions.lit
val newdf = df.withColumn("newcol",lit("myval"))
Result:
+---+----+------+
| id|col1|newcol|
+---+----+------+
|  0|   a| myval|
|  1|   b| myval|
|  2|   c| myval|
+---+----+------+
2) Using typedLit:
import org.apache.spark.sql.functions.typedLit
df.withColumn("newcol", typedLit(("sample", 10, .044)))
Result:
+---+----+-----------------+
| id|col1|           newcol|
+---+----+-----------------+
|  0|   a|[sample,10,0.044]|
|  1|   b|[sample,10,0.044]|
|  2|   c|[sample,10,0.044]|
+---+----+-----------------+
As the other answers have described, lit and typedLit are how to add constant columns to DataFrames. lit is an important Spark function that you will use frequently, though usually not for adding constant columns to DataFrames.
You'll commonly be using lit to create org.apache.spark.sql.Column objects because that's the column type required by most of the org.apache.spark.sql.functions.
Suppose you have a DataFrame with a some_date DateType column and would like to add a column with the days between December 31, 2020 and some_date.
Here's your DataFrame:
+----------+
| some_date|
+----------+
|2020-09-23|
|2020-01-05|
|2020-04-12|
+----------+
Here's how to calculate the days till the year end:
import java.sql.Date
import org.apache.spark.sql.functions.{col, datediff, lit}

val diff = datediff(lit(Date.valueOf("2020-12-31")), col("some_date"))
df
.withColumn("days_till_yearend", diff)
.show()
+----------+-----------------+
| some_date|days_till_yearend|
+----------+-----------------+
|2020-09-23|               99|
|2020-01-05|              361|
|2020-04-12|              263|
+----------+-----------------+
You could also use lit to create a yearend column and compute days_till_yearend like so:
import java.sql.Date
df
.withColumn("yearend", lit(Date.valueOf("2020-12-31")))
.withColumn("days_till_yearend", datediff(col("yearend"), col("some_date")))
.show()
+----------+----------+-----------------+
| some_date|   yearend|days_till_yearend|
+----------+----------+-----------------+
|2020-09-23|2020-12-31|               99|
|2020-01-05|2020-12-31|              361|
|2020-04-12|2020-12-31|              263|
+----------+----------+-----------------+
Most of the time, you don't need to use lit to append a constant column to a DataFrame. You just need to use lit to convert a Scala type to an org.apache.spark.sql.Column object because that's what's required by the function.
See the datediff function signature (Scala API):
def datediff(end: Column, start: Column): Column
As you can see, datediff requires two Column arguments.
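For reference, a rough PySpark sketch of the same computation (assuming a DataFrame df with a some_date column as above):
from pyspark.sql.functions import col, datediff, lit, to_date

# lit turns the date string into a Column; to_date makes the cast to DateType explicit
df.withColumn(
    "days_till_yearend",
    datediff(to_date(lit("2020-12-31")), col("some_date"))
).show()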
Related
Suppose I have a dataframe with values in a column like:
ABC00909083888
ABC93890380380
XYZ7394949
XYZ3898302
PQR3799_ABZ
MGE8983_ABZ
I want to trim these values: remove the first 3 characters, and drop the trailing _ABZ suffix when the value ends with it.
00909083888
93890380380
7394949
3898302
3799
8983
I tried some methods but they did not work:
from pyspark.sql import functions as f
new_df = df.withColumn("new_column", f.when((condition on some column),
                       f.substring('Existing_COL', 4, f.length(f.col("Existing_COL")))))
Can anyone please tell me which function I can use in PySpark?
trim only removes whitespace and tab characters.
Based on your input and expected output, see the logic below:
from pyspark.sql.functions import *
df = spark.createDataFrame(
    data=[("ABC00909083888",), ("ABC93890380380",), ("XYZ7394949",), ("XYZ3898302",), ("PQR3799_ABZ",), ("MGE8983_ABZ",)],
    schema=["values"]
)
(df.withColumn("new_vals", when(col("values").rlike("(_ABZ$)"),
                                regexp_replace(col("values"), r"(_ABZ$)", "")).otherwise(col("values")))
   .withColumn("final_vals", expr("substring(new_vals, 4, length(new_vals))"))
).show()
Output
+--------------+--------------+-----------+
|        values|      new_vals| final_vals|
+--------------+--------------+-----------+
|ABC00909083888|ABC00909083888|00909083888|
|ABC93890380380|ABC93890380380|93890380380|
|    XYZ7394949|    XYZ7394949|    7394949|
|    XYZ3898302|    XYZ3898302|    3898302|
|   PQR3799_ABZ|       PQR3799|       3799|
|   MGE8983_ABZ|       MGE8983|       8983|
+--------------+--------------+-----------+
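For what it's worth, a more compact (though less readable) sketch does both steps with a single regexp_replace; it assumes every value starts with exactly 3 prefix characters:
from pyspark.sql.functions import col, regexp_replace

# Group 1 captures everything between the 3-char prefix and an optional trailing _ABZ
df.withColumn("final_vals", regexp_replace(col("values"), r"^.{3}(.*?)(_ABZ)?$", "$1")).show()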
If I understand you correctly, and if you don't insist on using the PySpark substring or trim functions, you can easily define a plain Python function to do what you want and then use it as a UDF in Spark:
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
def mysub(word):
if word.endswith('_ABZ'):
word = word[:-4]
return word[3:]
udf1 = udf(mysub, StringType())
df.withColumn('new_label', udf1('label')).show()
The output will be like:
+---+--------------+-----------+
| id|         label|  new_label|
+---+--------------+-----------+
|  1|ABC00909083888|00909083888|
|  2|ABC93890380380|93890380380|
|  3|    XYZ7394949|    7394949|
|  4|    XYZ3898302|    3898302|
|  5|   PQR3799_ABZ|       3799|
|  6|   MGE8983_ABZ|       8983|
+---+--------------+-----------+
Please let me know if I have misunderstood any of your cases.
Note: this is for Spark version 2.1.1.2.6.1.0-129.
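One caveat worth adding (not part of the original answer): plain Python UDFs receive None for null values, so if the column can contain nulls a small guard avoids an AttributeError inside mysub. A rough sketch, reusing the imports and mysub from the snippet above:
# Guarded variant: pass nulls through instead of calling .endswith on None
udf_safe = udf(lambda x: mysub(x) if x is not None else None, StringType())
df.withColumn('new_label', udf_safe('label')).show()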
I have a spark dataframe (Python). I would like to replace all instances of 0 across the entirety of the dataframe (without specifying particular column names), with NULL values.
The following is the code that I have written:
my_df = my_df.na.replace(0, None)
The following is the error that I receive:
File "<stdin>", line 1, in <module>
File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 1634, in replace
return self.df.replace(to_replace, value, subset)
File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 1323, in replace
raise ValueError("value should be a float, int, long, string, list, or tuple")
ValueError: value should be a float, int, long, string, list, or tuple
Apparently, in Spark 2.1.1 df.na.replace does not support None; the None option is only available since 2.3.0, so it is not applicable in your case.
To replace values dynamically (i.e. without typing column names manually), you can use either df.columns or df.dtypes. The latter also lets you check the datatype (a df.columns variant is sketched after the example below).
from pyspark.sql import functions as F
for c in df.dtypes:
if c[1] == 'bigint':
df = df.withColumn(c[0], F.when(F.col(c[0]) == 0, F.lit(None)).otherwise(F.col(c[0])))
# Input
# +---+---+
# | id|val|
# +---+---+
# | 0| a|
# | 1| b|
# | 2| c|
# +---+---+
# Output
# +----+---+
# | id|val|
# +----+---+
# |null| a|
# | 1| b|
# | 2| c|
# +----+---+
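And here is a hedged sketch of the df.columns variant mentioned above, doing the replacement in a single select (it assumes applying the 0-to-null rule to every column is acceptable, whatever its type):
from pyspark.sql import functions as F

df = df.select([
    F.when(F.col(c) == 0, F.lit(None)).otherwise(F.col(c)).alias(c)
    for c in df.columns
])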
Is there a way, in pyspark, to perform the substr function on a DataFrame column, without specifying the length? Namely, something like df["my-col"].substr(begin).
I am not sure why this function is not exposed as an API in the pyspark.sql.functions module.
Spark SQL supports calling the substring(str, pos, len) function without the len argument.
You can use it through the expr API of the functions module, as below, to achieve the same:
df.withColumn('substr_name', f.expr("substring(name, 2)")).show()
+----------+---+-----------+
|      name| id|substr_name|
+----------+---+-----------+
|Alex Shtof|  1|  lex Shtof|
|      SMaZ|  2|        MaZ|
+----------+---+-----------+
How Spark does it internally:
If you look at the physical plan of the above statement, you will notice that when we don't pass len, Spark automatically adds 2147483647.
As @pault said in a comment, 2147483647 is the maximum positive value for a 32-bit signed binary integer (2^31 - 1).
df.withColumn('substr_name', f.expr("substring(name, 2)")).explain()
== Physical Plan ==
*Project [name#140, id#141L, substring(name#140, 2, 2147483647) AS substr_name#169]
+- Scan ExistingRDD[name#140,id#141L] --> 2147483647 is automatically added
The substring API in the functions module expects us to pass the length explicitly. If you want, you can pass any sufficiently large number as len to cover the maximum length of your column.
df.withColumn('substr_name', f.substring('name', 2, 100)).show()
+----------+---+-----------+
|      name| id|substr_name|
+----------+---+-----------+
|Alex Shtof|  1|  lex Shtof|
|      SMaZ|  2|        MaZ|
+----------+---+-----------+
>>> df.withColumn('substr_name', f.substring('name', 2, 100)).explain()
== Physical Plan ==
*Project [name#140, id#141L, substring(name#140, 2, 100) AS substr_name#189]
+- Scan ExistingRDD[name#140,id#141L] --> 100 is what we passed
If the objective is to make a substring from a position given by a parameter begin to the end of the string, then you can do it as follows:
import pyspark.sql.functions as f
l = [(1, 'Prague'), (2, 'New York')]
df = spark.createDataFrame(l, ['id', 'city'])
begin = 2
l = (f.length('city') - f.lit(begin) + 1)
(
df
.withColumn('substr', f.col('city').substr(f.lit(begin), l))
).show()
+---+--------+-------+
| id|    city| substr|
+---+--------+-------+
|  1|  Prague|  rague|
|  2|New York|ew York|
+---+--------+-------+
I'd create a UDF.
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.types import StringType
>>> df = spark.createDataFrame([('Alice', 23), ('Brian', 25)], schema=["name", "age"])
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice| 23|
|Brian| 25|
+-----+---+
>>> @F.udf(returnType=StringType())
... def substr_udf(col):
... return str(col)[2:]
>>> df = df.withColumn('substr', substr_udf('name'))
>>> df.show()
+-----+---+------+
| name|age|substr|
+-----+---+------+
|Alice| 23|   ice|
|Brian| 25|   ian|
+-----+---+------+
No, we need to specify both parameters, pos and len.
But do make sure that both are of the same type, otherwise it will give an error: Column not iterable.
You can do it this way:
import pyspark.sql.functions as F
df = df.withColumn("new", F.col("previous").substr(F.lit(5), F.length("previous") - 5))
I want to use some string similarity functions that are not native to PySpark, such as the Jaro and Jaro-Winkler measures, on DataFrames. These are readily available in Python modules such as jellyfish. I can write PySpark UDFs fine for cases where there are no null values present, i.e. comparing cat to dog. When I apply these UDFs to data where null values are present, it doesn't work. In problems such as the one I'm solving, it is very common for one of the strings to be null.
I need help getting my string similarity UDF to work in general and, more specifically, to work in cases where one of the values is null.
I wrote a udf that works when there are no null values in the input data:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
import pyspark.sql.functions as F
import jellyfish.cjellyfish
def jaro_winkler_func(df, column_left, column_right):
jaro_winkler_udf = udf(f=lambda s1, s2: jellyfish.jaro_winkler(s1, s2), returnType=DoubleType())
df = (df
.withColumn('test',
jaro_winkler_udf(df[column_left], df[column_right])))
return df
Example input and output:
+-----------+------------+
|string_left|string_right|
+-----------+------------+
|       dude|         dud|
|       spud|         dud|
+-----------+------------+
+-----------+------------+------------------+
|string_left|string_right|              test|
+-----------+------------+------------------+
|       dude|         dud|0.9166666666666666|
|       spud|         dud|0.7222222222222222|
+-----------+------------+------------------+
When I run this on data that has a null value, I get the usual reams of Spark errors; the most applicable one seems to be TypeError: str argument expected. I assume this is due to null values in the data, since it worked when there were none.
I modified the function above to check whether both values are not null and only run the function if that's the case, otherwise return 0.
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
import pyspark.sql.functions as F
import jellyfish.cjellyfish
def jaro_winkler_func(df, column_left, column_right):
jaro_winkler_udf = udf(f=lambda s1, s2: jellyfish.jaro_winkler(s1, s2), returnType=DoubleType())
df = (df
.withColumn('test',
F.when(df[column_left].isNotNull() & df[column_right].isNotNull(),
jaro_winkler_udf(df[column_left], df[column_right]))
.otherwise(0.0)))
return df
However, I still get the same errors as before.
Sample input and what I would like the output to be:
+-----------+------------+
|string_left|string_right|
+-----------+------------+
|       dude|         dud|
|       spud|         dud|
|       spud|        null|
|       null|        null|
+-----------+------------+
+-----------+------------+------------------+
|string_left|string_right|              test|
+-----------+------------+------------------+
|       dude|         dud|0.9166666666666666|
|       spud|         dud|0.7222222222222222|
|       spud|        null|               0.0|
|       null|        null|               0.0|
+-----------+------------+------------------+
We will modify your code a little bit and it should work fine:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
import jellyfish

@udf(DoubleType())
def jaro_winkler(s1, s2):
    if not all((s1, s2)):  # or: if None in (s1, s2):
        out = 0.0
    else:
        out = jellyfish.jaro_winkler(s1, s2)
    return out
def jaro_winkler_func(df, column_left, column_right):
df = df.withColumn("test", jaro_winkler(df[column_left], df[column_right]))
return df
Depending on the expected behavior, you need to change the test:
if not all((s1, s2)): will return 0 for both null and the empty string ''.
if None in (s1, s2): will return 0 only for null.
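A quick usage sketch, assuming the sample DataFrame from the question with string_left and string_right columns:
result = jaro_winkler_func(df, "string_left", "string_right")
result.show()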
I have a Spark DataFrame loaded up in memory, and I want to take the mean (or any aggregate operation) over the columns. How would I do that? (In numpy, this is known as taking an operation over axis=1).
If one were calculating the mean of the DataFrame down the rows (axis=0), then this is already built in:
from pyspark.sql import functions as F
F.mean(...)
But is there a way to programmatically do this against the entries in the columns? For example, from the DataFrame below
+--+--+---+---+
|id|US| UK|Can|
+--+--+---+---+
| 1|50| 0| 0|
| 1| 0|100| 0|
| 1| 0| 0|125|
| 2|75| 0| 0|
+--+--+---+---+
Omitting id, the means would be
+------+
| mean|
+------+
| 16.66|
| 33.33|
| 41.67|
| 25.00|
+------+
All you need here is a standard SQL like this:
SELECT (US + UK + CAN) / 3 AS mean FROM df
which can be used directly with SQLContext.sql or expressed using the DSL:
df.select(((col("UK") + col("US") + col("CAN")) / lit(3)).alias("mean"))
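For example, a minimal sketch of the SQL route (assuming a Spark 2.x session where the DataFrame is registered as a temp view named df):
df.createOrReplaceTempView("df")
spark.sql("SELECT (US + UK + Can) / 3 AS mean FROM df").show()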
If you have a larger number of columns, you can generate the expression as follows:
from functools import reduce
from operator import add
from pyspark.sql.functions import col, lit
n = lit(len(df.columns) - 1.0)
rowMean = (reduce(add, (col(x) for x in df.columns[1:])) / n).alias("mean")
df.select(rowMean)
or
rowMean = (sum(col(x) for x in df.columns[1:]) / n).alias("mean")
df.select(rowMean)
Finally, its equivalent in Scala:
df.select(df.columns
.drop(1)
.map(col)
.reduce(_ + _)
.divide(df.columns.size - 1)
.alias("mean"))
In a more complex scenario, you can combine columns using the array function and use a UDF to compute statistics:
import numpy as np
from pyspark.sql.functions import array, udf
from pyspark.sql.types import FloatType
combined = array(*(col(x) for x in df.columns[1:]))
median_udf = udf(lambda xs: float(np.median(xs)), FloatType())
df.select(median_udf(combined).alias("median"))
The same operation expressed using Scala API:
import org.apache.spark.sql.functions.{array, col, udf}
import org.apache.spark.sql.types.DoubleType

val combined = array(df.columns.drop(1).map(col).map(_.cast(DoubleType)): _*)
val median_udf = udf((xs: Seq[Double]) =>
breeze.stats.DescriptiveStats.percentile(xs, 0.5))
df.select(median_udf(combined).alias("median"))
Since Spark 2.4, an alternative approach is to combine values into an array and apply an aggregate expression; see for example Spark Scala row-wise average by handling null.
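A rough PySpark sketch of that Spark 2.4+ approach, using the aggregate higher-order function on the example columns above (the divisor 3 is hard-coded to match them):
from pyspark.sql.functions import expr

df.withColumn(
    "mean",
    expr("aggregate(array(US, UK, Can), 0D, (acc, x) -> acc + x) / 3")
).show()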
In Scala, something like this would do it:
val cols = Seq("US","UK","Can")
f.map(r => (r.getAs[Int]("id"),r.getValuesMap(cols).values.fold(0.0)(_+_)/cols.length)).toDF