Spark: Load multiple files, analyze individually, merge results, and save - python

I'm new to Spark and not quite sure how to ask this (which terms to use, etc.), so here's a picture of what I'm conceptually trying to accomplish:
I have lots of small, individual .txt "ledger" files (e.g., line-delimited files with a timestamp and attribute values at that time).
I'd like to:
Read each "ledger" file into its own data frame (read: NOT combining them into one big data frame);
Perform some basic calculations on each individual data frame, producing a row of new data values; and then
Merge all the individual result rows into a final object and save it to disk as a line-delimited file.
It seems like nearly every answer I find (when googling related terms) is about loading multiple files into a single RDD or DataFrame, but I did find this Scala code:
val data = sc.wholeTextFiles("HDFS_PATH")
val files = data.map { case (filename, content) => filename }
def doSomething(file: String) = {
  println(file)
  // your logic of processing a single file comes here
  val logData = sc.textFile(file)
  val numAs = logData.filter(line => line.contains("a")).count()
  println("Lines with a: %s".format(numAs))
  // save rdd of single file processed data to hdfs comes here
}
files.collect.foreach(filename => {
  doSomething(filename)
})
... but:
A. I can't tell if this parallelizes the read/analyze operation, and
B. I don't think it provides for merging the results into a single object.
Any direction or recommendations are greatly appreciated!
Update
It seems like what I'm trying to do (run a script on multiple files in parallel and then combine results) might require something like thread pools (?).
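If that's the right direction, here's a rough sketch of what I'm imagining (process_ledger is a hypothetical placeholder for the per-file calculation shown below):
from concurrent.futures import ThreadPoolExecutor
def process_ledger(path):
    # Hypothetical: read one ledger file, run the per-file calculation
    # (like the example below), and return a single result dict
    df = spark.read.json(path)
    return {"_id": path, "row_count": df.count()}
ledger_paths = ["/path/to/ledger-1.txt", "/path/to/ledger-2.txt"]
# Spark allows job submission from multiple threads, so each file's job
# can run concurrently while the driver collects the result rows
with ThreadPoolExecutor(max_workers=4) as pool:
    final_rows = list(pool.map(process_ledger, ledger_paths))
final_df = spark.createDataFrame(final_rows)
final_df.write.json("/path/to/output")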
For clarity, here's an example of the calculation I'd like to perform on the DataFrame created by reading in the "ledger" file:
from dateutil.relativedelta import relativedelta
from datetime import datetime
from pyspark.sql.functions import to_timestamp
# Read "ledger file"
df = spark.read.json("/path/to/ledger-filename.txt")
# Convert string ==> timestamp & sort
df = df.withColumn("timestamp", to_timestamp(df.timestamp, 'yyyy-MM-dd HH:mm:ss')).sort('timestamp')
columns_with_age = ("location", "status")
columns_without_age = ("wh_id",)  # trailing comma so this is a tuple, not a string
# Get the most-recent values (from the last row of the df)
row_count = df.count()
last_row = df.collect()[row_count - 1]
# Create an empty "final row" dictionary
final_row = {}
# For each column for which we want to calculate an age value ...
for c in columns_with_age:
    # Initialize loop values
    target_value = last_row[c]
    final_row[c] = target_value
    timestamp_at_lookback = last_row["timestamp"]
    look_back = 1
    different = False
    # Walk backwards through the rows until the value changes
    while not different:
        previous_row = df.collect()[row_count - 1 - look_back]
        if previous_row[c] == target_value:
            timestamp_at_lookback = previous_row["timestamp"]
            look_back += 1
        else:
            different = True
    # At this point, a difference has been found, so calculate the age
    final_row["days_in_{}".format(c)] = relativedelta(datetime.now(), timestamp_at_lookback).days
Thus, a ledger like this:
+---------+------+-------------------+-----+
| location|status| timestamp|wh_id|
+---------+------+-------------------+-----+
| PUTAWAY| I|2019-04-01 03:14:00| 20|
|PICKABLE1| X|2019-04-01 04:24:00| 20|
|PICKABLE2| X|2019-04-01 05:33:00| 20|
|PICKABLE2| A|2019-04-01 06:42:00| 20|
| HOTPICK| A|2019-04-10 05:51:00| 20|
| ICEXCEPT| A|2019-04-10 07:04:00| 20|
| ICEXCEPT| X|2019-04-11 09:28:00| 20|
+---------+------+-------------------+-----+
Would reduce to (assuming the calculation was run on 2019-04-14):
{ '_id': 'ledger-filename', 'location': 'ICEXCEPT', 'days_in_location': 4, 'status': 'X', 'days_in_status': 3, 'wh_id': 20 }

Using wholeTextFiles is not recommended, as it loads each full file into memory at once. If you really want to create an individual data frame per file, you can simply pass the full path of a single file instead of a directory. However, this is not recommended and will most likely lead to poor resource utilisation. Instead, consider using input_file_name https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/sql/functions.html#input_file_name--
For example:
spark
  .read
  .textFile("path/to/files")
  .withColumn("file", input_file_name())
  .filter($"value" like "%a%")
  .groupBy($"file")
  .agg(count($"value"))
  .show(10, false)
+----------------------------+------------+
|file |count(value)|
+----------------------------+------------+
|path/to/files/1.txt |2 |
|path/to/files/2.txt |4 |
+----------------------------+------------+
This way the files can be processed individually and then combined later.
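In PySpark the same idea would look roughly like this (path and filter are illustrative):
from pyspark.sql import functions as F
df = (spark.read.text("path/to/files")              # one row per line, in column "value"
        .withColumn("file", F.input_file_name()))   # which file each line came from
(df.filter(F.col("value").like("%a%"))
   .groupBy("file")
   .agg(F.count("value").alias("lines_with_a"))
   .show(10, truncate=False))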

You could fetch the file paths from HDFS:
import org.apache.hadoop.fs.{FileSystem, Path}
val files = FileSystem.get(sc.hadoopConfiguration).listStatus(new Path(your_path)).map(x => x.getPath).map(x => "hdfs://" + x.toUri().getRawPath())
Then create a separate dataframe for each path:
val arr_df = files.map(spark.read.format("csv").option("delimiter", ",").option("header", true).load(_))
Then apply your filter or any other transformation to each dataframe before unioning them into one:
val df = arr_df.map(x => x.where(your_filter)).reduce(_ union _)
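A rough PySpark sketch of the same approach, assuming `paths` already holds the list of file paths (the filter is a placeholder):
from functools import reduce
# `paths` is a Python list of file paths gathered however you like
# (e.g. via the Hadoop FileSystem API or a simple listing)
dfs = [spark.read.option("header", True).csv(p) for p in paths]
# Apply the per-file filter/transformation, then union everything into one dataframe
filtered = [d.where("your_filter") for d in dfs]
combined = reduce(lambda a, b: a.unionByName(b), filtered)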

Related

How to read empty delta partitions without failing in Azure Databricks?

I'm looking for a workaround. Sometimes our automated framework will try to read Delta partitions that do not exist. It fails because there are no Parquet files in that partition.
I don't want it to fail.
What I do at the moment is:
spark_read.format('delta').option("basePath",location) \
.load('/mnt/water/green/date=20221209/object=34')
Instead, I want it to return an empty dataframe, i.e. a dataframe with no records.
I did that as shown below, but found it a bit cumbersome, and was wondering if there is a better way.
df = spark_read.format('delta').load(location)
folder_partition = '/date=20221209/object=34'.split("/")
for folder_pruning_token in folder_partition:
    folder_pruning_token_split = folder_pruning_token.split("=")
    column_name = folder_pruning_token_split[0]
    column_value = folder_pruning_token_split[1]
    df = df.filter(df[column_name] == column_value)
You really don't need to do that trick with Delta Lake tables. This trick was primarily used for Parquet and other file formats to avoid scanning files on HDFS or cloud storage, which is very expensive.
You just need to load the data and filter it using where/filter. It's similar to what you already do:
df = spark_read.format('delta').load(location) \
       .filter("date = '20221209' and object = 34")
If you need to, you can of course extract those values automatically, with maybe slightly simpler code:
df = spark_read.format('delta').load(location)
folder_partition = '/date=20221209/object=34'.split("/")
cols = [f"{s[0]} = '{s[1]}'"
        for s in [f.split('=') for f in folder_partition if f]]  # `if f` skips the empty token from the leading '/'
df = df.filter(" and ".join(cols))
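For the example path above, this produces the same predicate as the hand-written filter; a quick sanity check of the generated string (illustrative):
folder_partition = '/date=20221209/object=34'.split("/")
cols = [f"{s[0]} = '{s[1]}'"
        for s in [f.split('=') for f in folder_partition if f]]
print(" and ".join(cols))  # prints: date = '20221209' and object = '34'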

how to add column index to Parquet output from Apache Beam Python SDK?

I'm trying to batch-process .avro files from GCS and write the result as Parquet files back to GCS; the data is a time series and the elements are timestamped. How can I make a column index from the timestamp column in the Parquet output? In Pandas/Dask it's a simple .set_index('timestamp') statement.
class AddTimestampDoFn(beam.DoFn):
    def process(self, element):
        yield beam.window.TimestampedValue(element, element['timestamp'])
with beam.Pipeline(options=pipeline_options) as pipeline:
    p = pipeline | 'ReadAvro' >> beam.io.ReadFromAvro('gs://input/*.avro')
    timestamped_items = p | 'timestamp' >> beam.ParDo(AddTimestampDoFn())
    fixed_windowed_items = (timestamped_items | 'window' >>
                            beam.WindowInto(window.FixedWindows(60)))
    processed_items = fixed_windowed_items | 'compute' >> beam.ParDo(ComputeDoFn())
    _ = processed_items | beam.io.WriteToParquet('gs://output/out.parquet',
                                                 pyarrow.schema([
                                                     ('timestamp', pyarrow.timestamp('s')),
                                                     ........
beam.io.WriteToParquet uses Arrow's ParquetWriter to write Parquet files. I'm not seeing any way to set the index with this writer. However, you could use Beam's DataFrame support to convert your PCollection to a DataFrame, set the index, and then call to_parquet(...), which delegates to the underlying pandas implementation and should write indices out.
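A minimal sketch of that approach, assuming the PCollection elements carry a schema (e.g. converted to beam.Row) so they can be turned into a deferred DataFrame; names and paths are illustrative and the exact set of supported DataFrame operations may vary by Beam version:
import apache_beam as beam
from apache_beam.dataframe.convert import to_dataframe
with beam.Pipeline(options=pipeline_options) as pipeline:
    # ReadFromAvro yields dicts; convert them to schema'd rows for the DataFrame API
    rows = (pipeline
            | 'ReadAvro' >> beam.io.ReadFromAvro('gs://input/*.avro')
            | 'ToRows' >> beam.Map(lambda d: beam.Row(**d)))
    df = to_dataframe(rows)          # deferred, pandas-like DataFrame
    df = df.set_index('timestamp')   # becomes the index in the Parquet output
    df.to_parquet('gs://output/out.parquet')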

Processing .txt file using wholeTextFiles & wanting to extract filename

I am reading .txt files using wholeTextFiles() in PySpark. I know that after reading with wholeTextFiles(), the resulting RDD will be of the format (filepath, content). I have multiple files to read. I want to cut the file name out of the filepath, save it to a Spark dataframe, and use part of the filename as a date folder in the HDFS output location. But while saving, I am not getting the corresponding filenames. Is there any way to do so? Below is my code:
base_data = sc.wholeTextFiles("/user/nikhil/raw_data/")
data1 = base_data.map(lambda x: x[0]).flatMap(lambda x: x.split('/')).filter(lambda x: x.startswith('CH'))
data2 = data1.flatMap(lambda x: x.split('F_')).filter(lambda x: x.startswith('2'))
print(data1.collect())
print(data2.collect())
df.repartition(1).write.mode('overwrite').parquet(outputLoc + "/xxxxx/" + data2)
logdf = sqlContext.createDataFrame(
    [(data1, pstrt_time, pend_time, 'DeltaLoad Completed')],
    ["filename", "process_start_time", "process_end_time", "status"])
output :
data1: ['CHNC_P0BCDNAF_20200217', 'CHNC_P0BCDNAF_20200227', 'CHNC_P0BCDNAF_20200615', 'CHNC_P0BCDNAF_20200925']
data2: ['20200217', '20200227', '20200615', '20200925']
Here is a Scala version that is easily convertible to PySpark by your good self:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
val files = sc.wholeTextFiles("/FileStore/tables/*ZZ.txt", 0)
val res1 = files.map(line => (line._1, line._2.split("\n").flatMap(x => x.split(" "))))
val res2 = res1.flatMap {
  case (x, y) => y.map(z => (x, z))
}
val res3 = res2.map(line => (line._1, line._1.split("/")(3), line._2))
val df = res3.toDF()
val df2 = df.withColumn("s", split($"_1", "/"))
            .withColumn("f1", $"s"(3))
            .withColumn("f2", $"f1".cast(StringType)) // avoid issues with split subsequently
            .withColumn("filename", substring_index(col("f2"), ".", 1))
df2.show(false)
df2.repartition($"filename").write.mode("overwrite").parquet("my_parquet") // default is 200 partitions; add partitionBy as well for good measure on your `write`.
Some sample output; the intermediate columns you can strip away via .drop or by using select:
+--------------------------------+---------+-------+-------------------------------------+---------+---------+--------+
|_1 |_2 |_3 |s |f1 |f2 |filename|
+--------------------------------+---------+-------+-------------------------------------+---------+---------+--------+
|dbfs:/FileStore/tables/AAAZZ.txt|AAAZZ.txt|wwww |[dbfs:, FileStore, tables, AAAZZ.txt]|AAAZZ.txt|AAAZZ.txt|AAAZZ |
|dbfs:/FileStore/tables/AAAZZ.txt|AAAZZ.txt|wwww |[dbfs:, FileStore, tables, AAAZZ.txt]|AAAZZ.txt|AAAZZ.txt|AAAZZ |
|dbfs:/FileStore/tables/AAAZZ.txt|AAAZZ.txt|rrr |[dbfs:, FileStore, tables, AAAZZ.txt]|AAAZZ.txt|AAAZZ.txt|AAAZZ |
|dbfs:/FileStore/tables/AAAZZ.txt|AAAZZ.txt| |[dbfs:, FileStore, tables, AAAZZ.txt]|AAAZZ.txt|AAAZZ.txt|AAAZZ |
|dbfs:/FileStore/tables/AAAZZ.txt|AAAZZ.txt|4445
...
The usual aspects of punctuation removal and trimming of spaces still apply. You will need to adapt this to your filename situation, of course; I cannot see it from here.
The issue in your code is that you cannot split on something that has already been split.
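Since the original question is in PySpark, here is a rough sketch of the filename extraction using input_file_name instead of wholeTextFiles (the path and the date pattern are illustrative; element_at needs Spark 2.4+):
from pyspark.sql import functions as F
df = (spark.read.text("/user/nikhil/raw_data/")
        .withColumn("filepath", F.input_file_name())
        .withColumn("filename", F.element_at(F.split("filepath", "/"), -1))
        .withColumn("file_date", F.regexp_extract("filename", r"F_(\d{8})", 1)))
df.select("filename", "file_date").distinct().show(truncate=False)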

Can google dataflow convert an input date to a bigquery timestamp

I'm quite new to Dataflow and have been searching for days for a solution to my problem. I need to run a pipeline that reads a date from a CSV file in the following format: 2019010420300033, passes it through the different transforms, and ends up in BigQuery as a timestamp. Is there a way to do this, or must the input file be converted first to a convertible date (I know a format like this works: 2019-01-01 20:30:00.331)?
Or is it possible to have Dataflow output, in some way, a new pipeline with that date converted?
Thanks
This is an easy job for Dataflow. You can use either a ParDo or a Map.
In the example below, each line from the CSV is passed to Map(convertDate). The convertDate function, which you need to modify to fit your date conversion, returns the modified line. The entire converted CSV is then written to the output file set.
Example (simplified) using Map:
def convertDate(line):
    # Convert the date to the desired format:
    # split the line into columns, change the date format for the desired column,
    # then rejoin the columns into a line and return it
    cols = line.split(',')  # change for your column separator
    cols[2] = my_change_method_for_date(cols[2])  # code the date conversion here
    return ",".join(cols)
with beam.Pipeline(argv=pipeline_args) as p:
    lines = p | 'ReadCsvFile' >> beam.io.ReadFromText(args.input)
    lines = lines | 'ConvertDate' >> beam.Map(convertDate)
    lines | 'WriteCsvFile' >> beam.io.WriteToText(args.output)
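For illustration, one possible shape of the my_change_method_for_date helper, assuming the input format is YYYYMMDDHHMMSS followed by fractional-second digits (as in 2019010420300033):
from datetime import datetime
def my_change_method_for_date(raw):
    # '2019010420300033' -> '2019-01-04 20:30:00.33'
    ts = datetime.strptime(raw[:14], '%Y%m%d%H%M%S')
    frac = raw[14:]
    return ts.strftime('%Y-%m-%d %H:%M:%S') + ('.' + frac if frac else '')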

Multiple pyspark "window()" calls shows error when doing a "groupBy()"

This question is a follow-up to this answer. Spark displays an error when the following situation arises:
# Group results in 12 second windows of "foo", then by integer buckets of 2 for "bar"
fooWindow = window(col("foo"), "12 seconds")
# A sub bucket that contains values in [0,2), [2,4), [4,6]...
barWindow = window(col("bar").cast("timestamp"), "2 seconds").cast("struct<start:bigint,end:bigint>")
results = df.groupBy(fooWindow, barWindow).count()
The error is:
"Multiple time window expressions would result in a cartesian product
of rows, therefore they are currently not supported."
Is there some way to achieve the desired behavior?
I was able to come up with a solution using an adaptation of this SO answer.
Note: this solution only works if there is at most one call to window, meaning multiple time windows are not allowed. A quick search of the Spark source on GitHub shows there's a hard limit of at most one window expression per query.
By using withColumn to define the buckets for each row, we can then group by that new column directly:
from pyspark.sql import functions as F
from datetime import datetime as dt, timedelta as td
start = dt.now()
second = td(seconds=1)
data = [(start, 0), (start+second, 1), (start+ (12*second), 2)]
df = spark.createDataFrame(data, ('foo', 'bar'))
# Create a new column defining the window for each bar
df = df.withColumn("barWindow", F.col("bar") - (F.col("bar") % 2))
# Keep the time window as is
fooWindow = F.window(F.col("foo"), "12 seconds").start.alias("foo")
# Use the new column created
results = df.groupBy(fooWindow, F.col("barWindow")).count()
results.show()
# +-------------------+---------+-----+
# | foo|barWindow|count|
# +-------------------+---------+-----+
# |2019-01-24 14:12:48| 0| 2|
# |2019-01-24 14:13:00| 2| 1|
# +-------------------+---------+-----+
