This question is a follow-up to this answer. Spark displays an error when the following situation arises:
# Group results in 12 second windows of "foo", then by integer buckets of 2 for "bar"
fooWindow = window(col("foo"), "12 seconds")
# A sub-bucket that contains values in [0,2), [2,4), [4,6)...
barWindow = window(col("bar").cast("timestamp"), "2 seconds").cast("struct<start:bigint,end:bigint>")
results = df.groupBy(fooWindow, barWindow).count()
The error is:
"Multiple time window expressions would result in a cartesian product
of rows, therefore they are currently not supported."
Is there some way to achieve the desired behavior?
I was able to come up with a solution using an adaptation of this SO answer.
Note: This solution only works if there is at most one call to window, meaning multiple time windows are not allowed. A quick search of the Spark source on GitHub shows there is a hard limit of at most one window per query.
By using withColumn to define the buckets for each row, we can then group by that new column directly:
from pyspark.sql import functions as F
from datetime import datetime as dt, timedelta as td
start = dt.now()
second = td(seconds=1)
data = [(start, 0), (start + second, 1), (start + 12 * second, 2)]
df = spark.createDataFrame(data, ('foo', 'bar'))
# Create a new column defining the window for each bar
df = df.withColumn("barWindow", F.col("bar") - (F.col("bar") % 2))
# Keep the time window as is
fooWindow = F.window(F.col("foo"), "12 seconds").start.alias("foo")
# Use the new column created
results = df.groupBy(fooWindow, F.col("barWindow")).count()
results.show()
# +-------------------+---------+-----+
# | foo|barWindow|count|
# +-------------------+---------+-----+
# |2019-01-24 14:12:48| 0| 2|
# |2019-01-24 14:13:00| 2| 1|
# +-------------------+---------+-----+
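If you also need each bucket's end boundary (similar to the struct that window produces), the same trick extends naturally. A minimal sketch, assuming the df built above; bucket_width is a hypothetical parameter standing in for the hard-coded 2:
# Hypothetical generalization: any integer bucket width, with explicit start/end columns
bucket_width = 2
bar_start = F.col("bar") - (F.col("bar") % bucket_width)
(df.withColumn("barWindowStart", bar_start)
   .withColumn("barWindowEnd", bar_start + bucket_width)
   .groupBy(F.window(F.col("foo"), "12 seconds").start.alias("foo"),
            "barWindowStart", "barWindowEnd")
   .count()
   .show())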
I have this dataframe:
se_cols = [
    "id",
    "name",
    "associated_countries"
]
person = (
    1,
    "GABRIELE",
    ["ITA", "BEL", "BVI"],
)
company = (
    2,
    "Bad Company",
    ["CYP", "RUS", "ITA"],
)
se_data = [person, company]
se = spark.createDataFrame(se_data).toDF(*se_cols)
Now, what I want is to be able to iterate over each array in the "associated_countries" column and, as soon as I find one country that belongs to a certain set, select that row.
The way I could think of was to use F.exists with a dictionary whose keys are the ISO codes of the target countries I'm looking for.
secrecy = {"CYP":"cyprus", "BVI":"british virgin island"}
def at_least_one_secrecy(x_arr, secrecy_map=secrecy):
    for x in x_arr:
        if secrecy_map.get(x, False) is False:
            continue
        else:
            return True
    return False
se.withColumn("linked_to_secrecy", F.exists("associated_countries", lambda x_arr: at_least_one_secrecy(x_arr=x_arr))).show()
But this returns the error:
TypeError: Column is not iterable
PS: I know this could be solved by adding a column "target_countries" where each row would contain my target ISOs as an array, and doing some sort of array-overlap condition between "associated_countries" and "target_countries". But consider that I have a huge dataset, and that would be very expensive.
You can use the arrays_overlap function with a literal array that contains your ISO country codes:
from pyspark.sql import functions as F

secrecy_array = F.array(*[F.lit(x) for x in secrecy.keys()])
se.withColumn(
    "linked_to_secrecy",
    F.arrays_overlap(F.col("associated_countries"), secrecy_array)
).show()
#+---+-----------+--------------------+-----------------+
#| id| name|associated_countries|linked_to_secrecy|
#+---+-----------+--------------------+-----------------+
#| 1| GABRIELE| [ITA, BEL, BVI]| true|
#| 2|Bad Company| [CYP, RUS, ITA]| true|
#+---+-----------+--------------------+-----------------+
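As a side note on the original attempt: the TypeError happens because the lambda passed to F.exists receives each array element as a Column, not the whole array, so plain Python iteration does not apply. A sketch of the element-wise version, assuming Spark 3.1+ where F.exists is available:
# The lambda gets one element (a Column) at a time; isin checks it against the target codes
se.withColumn(
    "linked_to_secrecy",
    F.exists("associated_countries", lambda c: c.isin(*secrecy.keys()))
).show()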
I have a Spark dataframe with a column emailID, e.g. ram.shyam.78uy#testing.com. I would like to extract the string between "." and "#", i.e. 78uy, and store it in a column.
I tried:
split_for_alias = split(rs_csv['emailID'],'[.]')
rs_csv_alias= rs_csv.withColumn('alias',split_for_alias.getItem(size(split_for_alias) -2))
It adds 78uy#testing as the alias. Another column could be added to chop off the extra values, but is it possible to do it in a single statement?
Extract the alphanumeric string immediately to the left of the special character . and immediately followed by the special character #.
DataFrame
data = [
    (1, "am.shyam.78uy#testing.com"),
    (2, "j.k.kilo#jom.com")
]
df = spark.createDataFrame(data, ("id", "emailID"))
df.show()
+---+--------------------+
| id| emailID|
+---+--------------------+
| 1|am.shyam.78uy#tes...|
| 2| j.k.kilo#jom.com|
+---+--------------------+
Code
from pyspark.sql.functions import regexp_extract

df.withColumn('name', regexp_extract('emailID', r'(?<=\.)(\w+)(?=\#)', 1)).show()
Outcome
+---+--------------------+----+
| id| emailID|name|
+---+--------------------+----+
| 1|am.shyam.78uy#tes...|78uy|
| 2| j.k.kilo#jom.com|kilo|
+---+--------------------+----+
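For completeness, the split-based approach from the question can also be finished in a single statement. A sketch, assuming the same df and Spark 2.4+ for element_at:
from pyspark.sql.functions import col, element_at, split

# Take the piece just before the last '.', then keep only what precedes the '#'
df.withColumn(
    'alias',
    split(element_at(split(col('emailID'), r'\.'), -2), '#').getItem(0)
).show()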
We made the Fugue project to port native Python or Pandas code to Spark or Dask. This lets you keep the logic very readable by expressing it in native Python. Fugue can then port it to Spark for you with one function call.
First we setup a Pandas DataFrame to test:
import pandas as pd
df = pd.DataFrame({"id":[1,2],"email": ["am.shyam.78uy#testing.com", "j.k.kilo#jom.com"]})
Next, we make a native Python function. The logic is clear this way.
from typing import List, Dict, Any
def extract(df: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    for row in df:
        email = row["email"].split("#")[0].split(".")[-1]
        row["new_col"] = email
    return df
Then we can test on the Pandas engine:
from fugue import transform
transform(df, extract, schema="*, new_col:str")
Because it works, we can bring it to Spark by supplying an engine:
import fugue_spark
transform(df, extract, schema="*, new_col:str", engine="spark").show()
+---+--------------------+-------+
| id| email|new_col|
+---+--------------------+-------+
| 1|am.shyam.78uy#tes...| 78uy|
| 2| j.k.kilo#jom.com| kilo|
+---+--------------------+-------+
Note that .show() is needed because Spark evaluates lazily. This transform can take in both Pandas and Spark DataFrames and will output a Spark DataFrame when the Spark engine is used.
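Since transform also accepts an existing Spark DataFrame when the Spark engine is used, here is a sketch (assuming the same extract function and schema as above):
# Hypothetical: start from a Spark DataFrame instead of the Pandas test frame
sdf = spark.createDataFrame(df)
transform(sdf, extract, schema="*, new_col:str", engine="spark").show()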
I have to transform the data by basically merging lines until |#| is found in the data.
Output Needed
I have transformed the data using the lead/lag functions but am unsure how to proceed.
from pyspark.sql import functions as F
from pyspark.sql import Window
from pyspark.sql.functions import *
df = spark.read.text('text.dat')
# Add an index column so each row gets its row number; Spark distributes the data, so we need this to maintain the original order
df_1 = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
df_1.createOrReplaceTempView("linenumber")
# zipWithIndex wraps the value in a struct; extract the string back out
df_2 = spark.sql("select value.value as value , index from linenumber")
df_2.createOrReplaceTempView("linenumber2")
# Keep the value for lines ending with |##| (else null) and flag lines ending with |#|
df_new = spark.sql("select value,case when value like '%|##|' then value else null end as orgval,case when value like '%|#|' then 1 else 0 end as valrow,index from linenumber2")
w = Window().partitionBy().orderBy(col("index"))
df_new=df_new.select("*", lag("valrow").over(w).alias("validrows"))
df_new.createOrReplaceTempView("linenumber3")
spark.sql("select * from linenumber3 order by index").show(100)
Please help.
Here is my code and explanation:
from pyspark.sql import functions as f, Row
from pyspark.sql.window import Window
df = spark.createDataFrame([
    Row(Value='A', LineNumber=6),
    Row(Value='B', LineNumber=7),
    Row(Value='C', LineNumber=8),
    Row(Value='D|#|', LineNumber=9),
    Row(Value='A|#|', LineNumber=10),
    Row(Value='E', LineNumber=11),
    Row(Value='F', LineNumber=12),
    Row(Value='G|#|', LineNumber=13),
    Row(Value='I', LineNumber=23),
    Row(Value='J', LineNumber=24),
    Row(Value='K', LineNumber=25),
    Row(Value='L', LineNumber=25)
])
df = df.withColumn('filename', f.input_file_name())
df = df.repartition('filename')
w = Window.partitionBy('filename').orderBy('index')
# Creating an id to enable window functions
df = df.withColumn('index', f.monotonically_increasing_id())
# Identifying if the previous row has |#| delimiter
df = df.withColumn('delimiter', f.lag('Value', default=False).over(w).contains('|#|'))
# Creating a column to group all values that must be concatenated
df = df.withColumn('group', f.sum(f.col('delimiter').cast('int')).over(w))
# Grouping them, removing |#|, collecting all values and concatenate them
df = (df
      .groupBy('group')
      .agg(f.concat_ws(',', f.collect_list(f.regexp_replace('Value', r'\|#\|', ''))).alias('ConcalValue'),
           f.min('LineNumber').alias('LineNumber')))
# Selecting only desired columns
(df
 .select(f.col('ConcalValue').alias('Concal Value'), f.col('LineNumber').alias('Initial Line Number'))
 .sort('LineNumber')
 .show(truncate=False))
Output:
+------------+-------------------+
|Concal Value|Initial Line Number|
+------------+-------------------+
| A,B,C,D| 6|
| A| 10|
| E,F,G| 11|
| I,J,K,L| 23|
+------------+-------------------+
I'm new to Spark and not quite sure how to ask this (which terms to use, etc.), so here's a picture of what I'm conceptually trying to accomplish:
I have lots of small, individual .txt "ledger" files (e.g., line-delimited files with a timestamp and attribute values at that time).
I'd like to:
Read each "ledger" file into individual data frames (read: NOT combining into one, big data frame);
Perform some basic calculations on each individual data frame, which result in a row of new data values; and then
Merge all the individual result rows into a final object & save it to disk in a line-delimited file.
It seems like nearly every answer I find (when googling related terms) is about loading multiple files into a single RDD or DataFrame, but I did find this Scala code:
val data = sc.wholeTextFiles("HDFS_PATH")
val files = data.map { case (filename, content) => filename}
def doSomething(file: String) = {
  println(file);
  // your logic of processing a single file comes here
  val logData = sc.textFile(file);
  val numAs = logData.filter(line => line.contains("a")).count();
  println("Lines with a: %s".format(numAs));
  // save rdd of single file processed data to hdfs comes here
}
files.collect.foreach( filename => {
  doSomething(filename)
})
... but:
A. I can't tell if this parallelizes the read/analyze operation, and
B. I don't think it provides for merging the results into a single object.
Any direction or recommendations are greatly appreciated!
Update
It seems like what I'm trying to do (run a script on multiple files in parallel and then combine results) might require something like thread pools (?).
For clarity, here's an example of the calculation I'd like to perform on the DataFrame created by reading in the "ledger" file:
from dateutil.relativedelta import relativedelta
from datetime import datetime
from pyspark.sql.functions import to_timestamp
# Read "ledger file"
df = spark.read.json("/path/to/ledger-filename.txt")
# Convert string ==> timestamp & sort
df = (df.withColumn("timestamp", to_timestamp(df.timestamp, 'yyyy-MM-dd HH:mm:ss'))).sort('timestamp')
columns_with_age = ("location", "status")
columns_without_age = ("wh_id",)
# Get the most-recent values (from the last row of the df)
row_count = df.count()
last_row = df.collect()[row_count-1]
# Create an empty "final row" dictionary
final_row = {}
# For each column for which we want to calculate an age value ...
for c in columns_with_age:
    # Initialize loop values
    target_value = last_row.__getitem__(c)
    final_row[c] = target_value
    timestamp_at_lookback = last_row.__getitem__("timestamp")
    look_back = 1
    different = False

    while not different:
        previous_row = df.collect()[row_count - 1 - look_back]
        if previous_row.__getitem__(c) == target_value:
            timestamp_at_lookback = previous_row.__getitem__("timestamp")
            look_back += 1
        else:
            different = True

    # At this point, a difference has been found, so calculate the age
    final_row["days_in_{}".format(c)] = relativedelta(datetime.now(), timestamp_at_lookback).days
Thus, a ledger like this:
+---------+------+-------------------+-----+
| location|status| timestamp|wh_id|
+---------+------+-------------------+-----+
| PUTAWAY| I|2019-04-01 03:14:00| 20|
|PICKABLE1| X|2019-04-01 04:24:00| 20|
|PICKABLE2| X|2019-04-01 05:33:00| 20|
|PICKABLE2| A|2019-04-01 06:42:00| 20|
| HOTPICK| A|2019-04-10 05:51:00| 20|
| ICEXCEPT| A|2019-04-10 07:04:00| 20|
| ICEXCEPT| X|2019-04-11 09:28:00| 20|
+---------+------+-------------------+-----+
Would reduce to (assuming the calculation was run on 2019-04-14):
{ '_id': 'ledger-filename', 'location': 'ICEXCEPT', 'days_in_location': 4, 'status': 'X', 'days_in_status': 3, 'wh_id': 20 }
Using wholeTextFiles is not recommended as it loads each full file into memory at once. If you really want to create an individual data frame per file, you can simply use the full path instead of a directory. However, this is not recommended and will most likely lead to poor resource utilisation. Instead, consider using input_file_name: https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/sql/functions.html#input_file_name--
For example:
spark
  .read
  .textFile("path/to/files")
  .withColumn("file", input_file_name())
  .filter($"value" like "%a%")
  .groupBy($"file")
  .agg(count($"value"))
  .show(10, false)
+----------------------------+------------+
|file |count(value)|
+----------------------------+------------+
|path/to/files/1.txt |2 |
|path/to/files/2.txt |4 |
+----------------------------+------------+
This way the files can be processed individually and later combined.
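A rough PySpark equivalent of the snippet above, since the question itself is in PySpark (a sketch; the path is a placeholder):
from pyspark.sql import functions as F

(spark.read.text("path/to/files")
      .withColumn("file", F.input_file_name())
      .filter(F.col("value").like("%a%"))
      .groupBy("file")
      .agg(F.count("value"))
      .show(10, False))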
You could fetch the file paths in HDFS:
import org.apache.hadoop.fs.{FileSystem,Path}
val files = FileSystem.get(sc.hadoopConfiguration)
  .listStatus(new Path(your_path))
  .map(x => x.getPath)
  .map(x => "hdfs://" + x.toUri().getRawPath())
Create a separate dataframe for each path:
val arr_df = files.map(spark.read.format("csv").option("delimiter", ",").option("header", true).load(_))
Then apply your filter or any transformation before unioning into one dataframe:
val df= arr_df.map(x=> x.where(your_filter)).reduce(_ union _)
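A rough PySpark sketch of the per-file read plus union step, in case you prefer to stay in Python (paths is assumed to be a Python list of file paths you have already collected, and your_filter is the placeholder condition from above):
from functools import reduce
from pyspark.sql import DataFrame

# Read each path into its own dataframe, filter, then union them all
dfs = [spark.read.option("header", True).option("delimiter", ",").csv(p) for p in paths]
combined = reduce(DataFrame.union, (d.where(your_filter) for d in dfs))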
I am trying to divide columns in PySpark by their respective sums. My dataframe (using only one column here) looks like this:
event_rates = [[1,10.461016949152542], [2, 10.38953488372093], [3, 10.609418282548477]]
event_rates = spark.createDataFrame(event_rates, ['cluster_id','mean_encoded'])
event_rates.show()
+----------+------------------+
|cluster_id| mean_encoded|
+----------+------------------+
| 1|10.461016949152542|
| 2| 10.38953488372093|
| 3|10.609418282548477|
+----------+------------------+
I tried two methods to do this but failed to get results.
from pyspark.sql.functions import sum as spark_sum
cols = event_rates.columns[1:]
for each in cols:
    event_rates = event_rates.withColumn(each + "_scaled", event_rates[each] / spark_sum(event_rates[each]))
This gives me the following error
org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and '`cluster_id`' is not an aggregate function. Wrap '((`mean_encoded` / sum(`mean_encoded`)) AS `mean_encoded_scaled`)' in windowing function(s) or wrap '`cluster_id`' in first() (or first_value) if you don't care which value you get.;;
Aggregate [cluster_id#22356L, mean_encoded#22357, (mean_encoded#22357 / sum(mean_encoded#22357)) AS mean_encoded_scaled#2
and following the question here, I tried the following:
stats = (event_rates.agg([spark_sum(x).alias(x + '_sum') for x in cols]))
event_rates = event_rates.join(broadcast(stats))
exprs = [event_rates[x] / event_rates[event_rates + '_sum'] for x in cols]
event_rates.select(exprs)
But I get an error from the first line stating
AssertionError: all exprs should be Column
How do I get across this?
This is an example of how to divide the column mean_encoded by its sum. You need to sum the column first, then crossJoin the result back to the previous dataframe. Then you can divide any column by its sum.
import pyspark.sql.functions as fn
from pyspark.sql.types import *
event_rates = event_rates.crossJoin(event_rates.groupby().agg(fn.sum('mean_encoded').alias('sum_mean_encoded')))
event_rates_div = event_rates.select('cluster_id',
                                     'mean_encoded',
                                     fn.col('mean_encoded') / fn.col('sum_mean_encoded'))
Output
+----------+------------------+---------------------------------+
|cluster_id| mean_encoded|(mean_encoded / sum_mean_encoded)|
+----------+------------------+---------------------------------+
| 1|10.461016949152542| 0.3325183371367686|
| 2| 10.38953488372093| 0.3302461777809474|
| 3|10.609418282548477| 0.3372354850822839|
+----------+------------------+---------------------------------+
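An equivalent way to express the same division without the explicit crossJoin is a window aggregate over the whole frame. A sketch, assuming the same event_rates (note that an unpartitioned window moves all rows to a single partition, which is fine only for small data):
from pyspark.sql import Window

w = Window.partitionBy()  # one window spanning the whole frame
event_rates.withColumn(
    'mean_encoded_scaled',
    fn.col('mean_encoded') / fn.sum('mean_encoded').over(w)
).show()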
Try this out:
from pyspark.sql import functions as F
total = event_rates.groupBy().agg(F.sum("mean_encoded"),F.sum("cluster_id")).collect()
total
The answer will be:
[Row(sum(mean_encoded)=31.459970115421946, sum(cluster_id)=6)]
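A sketch of finishing the division with the collected totals (building on total from above; the Row field name matches the output shown):
# Pull the scalar out of the collected Row and divide by it
sum_mean = total[0]["sum(mean_encoded)"]
event_rates.withColumn(
    "mean_encoded_scaled", F.col("mean_encoded") / F.lit(sum_mean)
).show()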