I am loading a DataFrame of tweets, stored as JSON objects, in PySpark.
I am trying to split the text into individual words and then select all the words that include a #. I want to avoid regular Python functions and stick with what is available inside PySpark.
I am running the code in a Jupyter notebook; this is the code overall:
import findspark
findspark.init()
from pyspark.sql import SQLContext, SparkSession
spark = SparkSession \
    .builder \
    .appName("Jupyter Spark shell") \
    .getOrCreate()
sc = spark.sparkContext
folder = 'tweet-id-text-345'
tweets = spark.read.format("json").load(folder)
tweets.count()
I am very unsure how to do this. The desired result would be an array of all the different words, and another of the words that include a #. These would be two separate lists.
Here is what the content looks like:
+------------------------+
| text|
+------------------------+
| โปรทุนน้อย สุดประ...|
| RT #sOLehOXClj1XE...|
|RT #rkayama: 論文「関...|
| SixTONES OneSTのグッ...|
| मुख्यमंत्री #mlkh...|
+------------------------+
only showing top 5 rows
Assuming that the JSON follows this format, the text of the tweet is stored in a field called text.
The text column is split into single words, and the resulting array is filtered using rlike (the filter higher-order function requires Spark 2.4 or later):
from pyspark.sql import functions as f
df=spark.read.option("multiline", "true").json(<...>).select("text")
df.withColumn("all_words", f.split("text", " "))\
    .withColumn("only_hash", f.expr("filter(all_words, w -> rlike(w, '.*#.*'))")) \
    .show(truncate=False)
If the original text was "hello #world how a#re you today#", the output would be:
+--------------------------------+---------------------------------------+----------------------+
|text |all_words |only_hash |
+--------------------------------+---------------------------------------+----------------------+
|hello #world how a#re you today#|[hello, #world, how, a#re, you, today#]|[#world, a#re, today#]|
+--------------------------------+---------------------------------------+----------------------+
I have the following pyspark df:
+------------------+--------+-------+
| ID| Assets|Revenue|
+------------------+--------+-------+
|201542399349300619| 1633944| 32850|
|201542399349300629| 3979760| 850914|
|201542399349300634| 3402687|1983568|
|201542399349300724| 1138291|1097553|
|201522369349300122| 1401406|1010828|
|201522369349300137| 16948| 171534|
|201522369349300142|13474056|2285323|
|201522369349300202| 481045| 241788|
|201522369349300207| 700861|1185640|
|201522369349300227| 178479| 267976|
+------------------+--------+-------+
For each row, I want to get the rows whose Assets are within 20% of that row's Assets amount. For example, for the first row (ID=201542399349300619), I want to get all the rows where Assets are within +/-20% of 1,633,944 (so between 1,307,155 and 1,960,732):
+------------------+--------+-------+
| ID| Assets|Revenue|
+------------------+--------+-------+
|201542399349300619| 1633944| 32850|
|201522369349300122| 1401406|1010828|
Using this subsetted table, I want to get the average assets and add it as a new column. So for the above example, it would be the average of the two Assets values: (1633944 + 1401406) / 2 = 1517675
+------------------+--------+-------+---------+
| ID| Assets|Revenue|AvgAssets|
+------------------+--------+-------+---------+
|201542399349300619| 1633944| 32850| 1517675|
Assuming your DataFrame has a schema similar to the following (i.e. Assets and Revenue are numeric):
df.printSchema()
#root
# |-- ID: long (nullable = true)
# |-- Assets: integer (nullable = true)
# |-- Revenue: integer (nullable = true)
You can join the DataFrame to itself on the condition that you've set forth. After the join, you can group and aggregate by taking the average of the Assets column.
For example:
from pyspark.sql.functions import avg, expr
df.alias("l")\
    .join(
        df.alias("r"),
        on=expr("r.assets between l.assets*0.8 and l.assets*1.2")
    )\
    .groupBy("l.ID", "l.Assets", "l.Revenue")\
    .agg(avg("r.Assets").alias("AvgAssets"))\
    .show()
#+------------------+--------+-------+------------------+
#| ID| Assets|Revenue| AvgAssets|
#+------------------+--------+-------+------------------+
#|201542399349300629| 3979760| 850914| 3691223.5|
#|201522369349300202| 481045| 241788| 481045.0|
#|201522369349300207| 700861|1185640| 700861.0|
#|201522369349300137| 16948| 171534| 16948.0|
#|201522369349300142|13474056|2285323| 1.3474056E7|
#|201522369349300227| 178479| 267976| 178479.0|
#|201542399349300619| 1633944| 32850| 1517675.0|
#|201522369349300122| 1401406|1010828|1391213.6666666667|
#|201542399349300724| 1138291|1097553| 1138291.0|
#|201542399349300634| 3402687|1983568| 3691223.5|
#+------------------+--------+-------+------------------+
Since we are joining the DataFrame to itself, we can use aliases to refer to the left table ("l") and the right table ("r"). The logic above says: join l to r on the condition that the assets in r are within +/-20% of the assets in l.
There are multiple ways to express the +/-20% condition, but I am using the Spark SQL between expression to find rows that are between Assets * 0.8 and Assets * 1.2.
Then we group by all of the columns of the left table (groupBy) and average over the assets in the right table.
The resulting AvgAssets column is a DoubleType column, but you can easily convert it to IntegerType by adding a .cast("int") before the .alias("AvgAssets") if that's what you prefer.
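To sanity-check the output, the band and the average for a single row can be recomputed by hand. The following is a small plain-Python sketch (no Spark involved) that uses the Assets values from the question; the helper name avg_within_band is made up for illustration.

```python
# Assets values copied from the question's table.
assets = [1633944, 3979760, 3402687, 1138291, 1401406,
          16948, 13474056, 481045, 700861, 178479]


def avg_within_band(target, values, tolerance=0.2):
    """Average of all values within +/- tolerance of target (bounds inclusive, like SQL BETWEEN)."""
    low, high = target * (1 - tolerance), target * (1 + tolerance)
    matches = [v for v in values if low <= v <= high]
    return sum(matches) / len(matches)


# First row of the question: only 1633944 and 1401406 fall inside the band.
print(avg_within_band(1633944, assets))
# Row with Assets 1401406: three rows fall inside its band.
print(avg_within_band(1401406, assets))
```

Every row matches at least itself, so the list of matches is never empty and the division is safe.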
See also:
What are the various join types in Spark?
Say I have a very basic Spark DataFrame that consists of a couple of columns, one of which contains a value that I want to modify.
|| value || lang ||
| 3 | en |
| 4 | ua |
Say I want a new column per class, where I add a float to the given value. (The addition itself is not essential to the question: in reality I run a prediction with sklearn there, but for simplicity let's assume we are adding numbers; the point is that I modify the value in some way.) So, given a dict classes={'1':2.0, '2':3.0}, I would like, for each class, a column holding the DataFrame value plus the class value, and then to save each result to a CSV file:
class_1.csv
|| value || lang || my_class || modified ||
| 3 | en | 1 | 5.0 | # this is 3+2.0
| 4 | ua | 1 | 6.0 | # this is 4+2.0
class_2.csv
|| value || lang || my_class || modified ||
| 3 | en | 2 | 6.0 | # this is 3+3.0
| 4 | ua | 2 | 7.0 | # this is 4+3.0
So far I have the following code, which works and modifies the value for each defined class, but it uses a for loop and I am looking for a more efficient approach:
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf
from pyspark.sql.functions import lit
# create session and context
spark = pyspark.sql.SparkSession.builder.master("yarn").appName("SomeApp").getOrCreate()
conf = SparkConf().setAppName('Some_App').setMaster("local[*]")
sc = SparkContext.getOrCreate(conf)
my_df = spark.read.csv("some_file.csv")
# modify the value here
def do_stuff_to_column(value, separate_class):
    # do stuff to column, let's pretend we just add a specific value per specific class that is read from a dictionary
    class_dict = {'1':2.0, '2':3.0} # would be loaded from somewhere
    return float(value+class_dict[separate_class])
# iterate over each given class later
class_dict = {'1':2.0, '2':3.0} # in reality have more than 10 classes
# create a udf function
udf_modify = udf(do_stuff_to_column, FloatType())
# loop over each class
for my_class in class_dict:
    # create the column first with lit
    my_df2 = my_df.withColumn("my_class", lit(my_class))
    # modify using udf function
    my_df2 = my_df2.withColumn("modified", udf_modify("value","my_class"))
    # write to csv now
    my_df2.write.format("csv").save("class_"+my_class+".csv")
So the question is: is there a better/faster way of doing this than with a for loop?
I would use some form of join, in this case crossJoin. Here's an MWE:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(3, 'en'), (4, 'ua')], ['value', 'lang'])
classes = spark.createDataFrame([(1, 2.), (2, 3.)], ['class_key', 'class_value'])
res = df.crossJoin(classes).withColumn('modified', F.col('value') + F.col('class_value'))
res.show()
For saving as separate CSVs with those exact file names, I think there is no better way than to use a loop.
I want to add a column in a DataFrame with some arbitrary value (that is the same for each row). I get an error when I use withColumn as follows:
dt.withColumn('new_column', 10).head(5)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-50-a6d0257ca2be> in <module>()
1 dt = (messages
2 .select(messages.fromuserid, messages.messagetype, floor(messages.datetime/(1000*60*5)).alias("dt")))
----> 3 dt.withColumn('new_column', 10).head(5)
/Users/evanzamir/spark-1.4.1/python/pyspark/sql/dataframe.pyc in withColumn(self, colName, col)
1166 [Row(age=2, name=u'Alice', age2=4), Row(age=5, name=u'Bob', age2=7)]
1167 """
-> 1168 return self.select('*', col.alias(colName))
1169
1170 @ignore_unicode_prefix
AttributeError: 'int' object has no attribute 'alias'
It seems that I can trick the function into working as I want by adding and subtracting one of the other columns (so they add to zero) and then adding the number I want (10 in this case):
dt.withColumn('new_column', dt.messagetype - dt.messagetype + 10).head(5)
[Row(fromuserid=425, messagetype=1, dt=4809600.0, new_column=10),
Row(fromuserid=47019141, messagetype=1, dt=4809600.0, new_column=10),
Row(fromuserid=49746356, messagetype=1, dt=4809600.0, new_column=10),
Row(fromuserid=93506471, messagetype=1, dt=4809600.0, new_column=10),
Row(fromuserid=80488242, messagetype=1, dt=4809600.0, new_column=10)]
This is supremely hacky, right? I assume there is a more legit way to do this?
Spark 2.2+
Spark 2.2 introduces typedLit to support Seq, Map, and Tuples (SPARK-19254), and the following calls should be supported (Scala):
import org.apache.spark.sql.functions.typedLit
df.withColumn("some_array", typedLit(Seq(1, 2, 3)))
df.withColumn("some_struct", typedLit(("foo", 1, 0.3)))
df.withColumn("some_map", typedLit(Map("key1" -> 1, "key2" -> 2)))
Spark 1.3+ (lit), 1.4+ (array, struct), 2.0+ (map):
The second argument for DataFrame.withColumn should be a Column so you have to use a literal:
from pyspark.sql.functions import lit
df.withColumn('new_column', lit(10))
If you need complex columns you can build these using blocks like array:
from pyspark.sql.functions import array, create_map, struct
df.withColumn("some_array", array(lit(1), lit(2), lit(3)))
df.withColumn("some_struct", struct(lit("foo"), lit(1), lit(.3)))
df.withColumn("some_map", create_map(lit("key1"), lit(1), lit("key2"), lit(2)))
Exactly the same methods can be used in Scala.
import org.apache.spark.sql.functions.{array, lit, map, struct}
df.withColumn("new_column", lit(10))
df.withColumn("map", map(lit("key1"), lit(1), lit("key2"), lit(2)))
To provide names for structs use either alias on each field:
df.withColumn(
    "some_struct",
    struct(lit("foo").alias("x"), lit(1).alias("y"), lit(0.3).alias("z"))
)
or cast on the whole object:
df.withColumn(
    "some_struct",
    struct(lit("foo"), lit(1), lit(0.3)).cast("struct<x: string, y: integer, z: double>")
)
It is also possible, although slower, to use a UDF.
Note:
The same constructs can be used to pass constant arguments to UDFs or SQL functions.
In Spark 2.2 there are two ways to add a constant value as a column in a DataFrame:
1) Using lit
2) Using typedLit
The difference between the two is that typedLit can also handle parameterized Scala types, e.g. List, Seq, and Map.
Sample DataFrame:
val df = spark.createDataFrame(Seq((0,"a"),(1,"b"),(2,"c"))).toDF("id", "col1")
+---+----+
| id|col1|
+---+----+
| 0| a|
| 1| b|
| 2| c|
+---+----+
1) Using lit: Adding constant string value in new column named newcol:
import org.apache.spark.sql.functions.lit
val newdf = df.withColumn("newcol",lit("myval"))
Result:
+---+----+------+
| id|col1|newcol|
+---+----+------+
| 0| a| myval|
| 1| b| myval|
| 2| c| myval|
+---+----+------+
2) Using typedLit:
import org.apache.spark.sql.functions.typedLit
df.withColumn("newcol", typedLit(("sample", 10, .044)))
Result:
+---+----+-----------------+
| id|col1| newcol|
+---+----+-----------------+
| 0| a|[sample,10,0.044]|
| 1| b|[sample,10,0.044]|
| 2| c|[sample,10,0.044]|
+---+----+-----------------+
As the other answers have described, lit and typedLit are how to add constant columns to DataFrames. lit is an important Spark function that you will use frequently, but not for adding constant columns to DataFrames.
You'll commonly be using lit to create org.apache.spark.sql.Column objects because that's the column type required by most of the org.apache.spark.sql.functions.
Suppose you have a DataFrame with a some_date DateType column and would like to add a column with the days between December 31, 2020 and some_date.
Here's your DataFrame:
+----------+
| some_date|
+----------+
|2020-09-23|
|2020-01-05|
|2020-04-12|
+----------+
Here's how to calculate the days till the year end:
import java.sql.Date
import org.apache.spark.sql.functions.{col, datediff, lit}
val diff = datediff(lit(Date.valueOf("2020-12-31")), col("some_date"))
df
  .withColumn("days_till_yearend", diff)
  .show()
+----------+-----------------+
| some_date|days_till_yearend|
+----------+-----------------+
|2020-09-23| 99|
|2020-01-05| 361|
|2020-04-12| 263|
+----------+-----------------+
You could also use lit to create a year_end column and compute the days_till_yearend like so:
import java.sql.Date
df
.withColumn("yearend", lit(Date.valueOf("2020-12-31")))
.withColumn("days_till_yearend", datediff(col("yearend"), col("some_date")))
.show()
+----------+----------+-----------------+
| some_date| yearend|days_till_yearend|
+----------+----------+-----------------+
|2020-09-23|2020-12-31| 99|
|2020-01-05|2020-12-31| 361|
|2020-04-12|2020-12-31| 263|
+----------+----------+-----------------+
Most of the time, you don't need to use lit to append a constant column to a DataFrame. You just need lit to convert a Scala value to an org.apache.spark.sql.Column object, because that's what the function requires.
See the datediff function signature:
def datediff(end: Column, start: Column): Column
As you can see, datediff requires two Column arguments.
I have a Spark DataFrame loaded up in memory, and I want to take the mean (or any aggregate operation) over the columns. How would I do that? (In numpy, this is known as taking an operation over axis=1).
If one were calculating the mean of the DataFrame down the rows (axis=0), then this is already built in:
from pyspark.sql import functions as F
F.mean(...)
But is there a way to programmatically do this against the entries in the columns? For example, from the DataFrame below
+--+--+---+---+
|id|US| UK|Can|
+--+--+---+---+
| 1|50| 0| 0|
| 1| 0|100| 0|
| 1| 0| 0|125|
| 2|75| 0| 0|
+--+--+---+---+
Omitting id, the means would be
+------+
| mean|
+------+
| 16.67|
| 33.33|
| 41.67|
| 25.00|
+------+
All you need here is standard SQL like this:
SELECT (US + UK + CAN) / 3 AS mean FROM df
which can be used directly with SQLContext.sql or expressed using the DSL:
df.select(((col("UK") + col("US") + col("CAN")) / lit(3)).alias("mean"))
If you have a larger number of columns, you can generate the expression as follows:
from functools import reduce
from operator import add
from pyspark.sql.functions import col, lit
n = lit(len(df.columns) - 1.0)
rowMean = (reduce(add, (col(x) for x in df.columns[1:])) / n).alias("mean")
df.select(rowMean)
or
rowMean = (sum(col(x) for x in df.columns[1:]) / n).alias("mean")
df.select(rowMean)
Finally, its equivalent in Scala:
df.select(df.columns
.drop(1)
.map(col)
.reduce(_ + _)
.divide(df.columns.size - 1)
.alias("mean"))
In a more complex scenario you can combine columns using the array function and use a UDF to compute statistics:
import numpy as np
from pyspark.sql.functions import array, udf
from pyspark.sql.types import FloatType
combined = array(*(col(x) for x in df.columns[1:]))
median_udf = udf(lambda xs: float(np.median(xs)), FloatType())
df.select(median_udf(combined).alias("median"))
The same operation expressed using Scala API:
val combined = array(df.columns.drop(1).map(col).map(_.cast(DoubleType)): _*)
val median_udf = udf((xs: Seq[Double]) =>
breeze.stats.DescriptiveStats.percentile(xs, 0.5))
df.select(median_udf(combined).alias("median"))
Since Spark 2.4 an alternative approach is to combine values into an array and apply aggregate expression. See for example Spark Scala row-wise average by handling null.
In Scala, something like this would do it (assuming the value columns are integers and spark.implicits._ is imported):
val cols = Seq("US","UK","Can")
df.map(r => (r.getAs[Int]("id"), r.getValuesMap[Int](cols).values.sum.toDouble / cols.length)).toDF("id", "mean")