I'm doing some analysis on a large set of waveform data. I have to compute the FFT of every waveform and then take the element-wise mean/min/max and standard deviation across them to check whether all the tests were sound.
What I wrote works... on a very small data set. I think Spark is trying to compute all the FFTs first and only then take the mean. I'm getting a lot of errors, and it seems that I'm either out of memory or that the request is taking too long. I think I should try another approach, maybe computing one FFT at a time and doing something like a rolling average? I'm trying to find a solution but I can't wrap my head around it.
I tried using F.avg(), but that doesn't work on ArrayTypes. I thought of writing my own UDF, but I couldn't figure out how to inject the group size into it. (I left the grouping stage out of the example, though.)
This is what I have. First you'll see the code that sets up the dataframe I get from every test instance, then the shape of that DF, and after that my attempt to aggregate all of those into a single DF.
import numpy as np
import pyspark.sql.types as T
import pyspark.sql.functions as F
from pyspark.sql.functions import col
# Maybe some other imports
# Setup:
def _rfft(x):
    transformed = np.fft.rfft(x)
    return transformed.real.tolist(), transformed.imag.tolist(), np.abs(transformed).tolist()

spark_complex_abs = T.StructType([
    T.StructField("real", T.ArrayType(T.DoubleType(), False), False),
    T.StructField("imag", T.ArrayType(T.DoubleType(), False), False),
    T.StructField("abs", T.ArrayType(T.DoubleType(), False), False),
])

spark_rfft = F.udf(_rfft, spark_complex_abs)

def _rfft_bins(size, periodMicroSeconds):
    return np.fft.rfftfreq(size, d=(periodMicroSeconds / 10**6)).tolist()

spark_rfft_bins = F.udf(_rfft_bins, T.ArrayType(T.DoubleType(), False))
df = df.select('samplePeriod', 'waveformData') \
    .withColumn('dataSize', col('waveformData')['dimensions'][0]) \
    .withColumn('data', col('waveformData')['elements']) \
    .withColumn('fft', spark_rfft('data')) \
    .withColumn('fftAmplitude', col('fft')['abs']) \
    .withColumn('fftBins', spark_rfft_bins('dataSize', 'samplePeriod'))
# Other selects (but not part of this example)
# DataFrame shape:
# | samplePeriod | dataSize | data | fft | fftAmplitude | fftBins |
# | DoubleType | DoubleType | ArrayType(DoubleType) | ComplexType | ArrayType(DoubleType) | ArrayType(DoubleType) |
# Grouping stage not part of this example
# Aggregation
# This will not work on large data sets :(
def _index_avg(arr):
    return np.mean(arr, axis=0).tolist()

spark_index_avg = F.udf(_index_avg, T.ArrayType(T.DoubleType(), False))
df = df.agg(\
spark_index_avg(F.collect_list(col('fftAmplitude'))).alias('avg'), \
F.first('fftBins').alias('fftBins') \
)
result = df.toPandas()
(Censored) error messages:
23/02/14 10:36:52 ERROR YarnScheduler: Lost executor 14 on XXXXXXX: Container from a bad node: container_XXXXXXX on host: XXXXXXX. Exit status: 143. Diagnostics: [2023-02-14 10:36:51.901]Container killed on request. Exit code is 143
[2023-02-14 10:36:51.922]Container exited with a non-zero exit code 143.
[2023-02-14 10:36:51.926]Killed by external signal
.
23/02/14 10:36:52 WARN TaskSetManager: Lost task 0.0 in stage 13.0 XXXXXXX: ExecutorLostFailure (executor 14 exited caused by one of the running tasks) Reason: Container from a bad node: containerXXXXXXX on host: XXXXXXX. Exit status: 143. Diagnostics: [2023-02-14 10:36:51.901]Container killed on request. Exit code is 143
[2023-02-14 10:36:51.922]Container exited with a non-zero exit code 143.
[2023-02-14 10:36:51.926]Killed by external signal
.
23/02/14 10:36:52 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 14 for reason Container from a bad node: container XXXXX on host: XXXXXXX. Exit status: 143. Diagnostics: [2023-02-14 10:36:51.901]Container killed on request. Exit code is 143
[2023-02-14 10:36:51.922]Container exited with a non-zero exit code 143.
[2023-02-14 10:36:51.926]Killed by external signal
EDIT:
I also tried
df = df.agg(F.avg(F.explode(F.collect_list('fftAmplitude'))).alias('avg'))
But then it will complain with: The generator is not supported: nested in expressions "avg(explode(collect_list(fft.abs)))"
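A minimal sketch of where I think this should go (per-index aggregation via posexplode instead of collect_list, grouping key still left out as above; not verified on the full data set):
# posexplode emits one row per array element together with its index,
# so the built-in aggregates can run per frequency bin without ever
# collecting whole arrays onto a single executor.
exploded = df.select(F.posexplode('fftAmplitude').alias('bin', 'amplitude'))
per_bin_stats = (exploded
                 .groupBy('bin')
                 .agg(F.avg('amplitude').alias('avg'),
                      F.min('amplitude').alias('min'),
                      F.max('amplitude').alias('max'),
                      F.stddev('amplitude').alias('std'))
                 .orderBy('bin'))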
I have a very simple pipeline in Python:
with beam.Pipeline(options=create_pipeline_options(pipeline_args)) as p:
    rows = (p | 'ReadFromBigquery' >> beam.io.ReadFromBigQuery(table=f"{known_args.project}:{known_args.datasetId}.{known_args.tableId}", use_standard_sql=True))
    entities = (rows | 'GetEntities' >> beam.ParDo(GetEntity()))
    updated = (entities | 'Update Entities' >> beam.ParDo(UpdateEntity()))
    _ = (updated | 'Write To Datastore' >> WriteToDatastore(known_args.project))
I want to log which entities have been correctly updated after WriteToDatastore has finished running, so that I can write them to a BigQuery audit table. Ideally it would look something like this:
successful_entities, failed_entities = (updated | 'Write To Datastore' >> WriteToDatastoreWrapper(known_args.project))
_ = (successful_entities | 'Write Success To Bigquery' >> beam.io.WriteToBigQuery(table=f"{c.audit_table}:{known_args.datasetId}.{known_args.tableId}"))
_ = (failed_entities | 'Write Failed To Bigquery' >> beam.io.WriteToBigQuery(table=f"{c.audit_table}:{known_args.datasetId}.{known_args.tableId}"))
Is this possible to achieve?
Alternatively, if the whole batch fails after n retries, is it possible to catch that failure and log which batch has failed (assuming I have some sort of runId to keep track of batches)?
I hope this can help.
I propose a solution with a dead letter queue applied before writing the result to Datastore.
Beam suggests using a dead letter queue in this case, and we can achieve that with TupleTags.
You can write it with native Beam, but the code is verbose.
I created a library for Beam in Java and Python called Asgarde.
Here is the link to the Python version: https://github.com/tosun-si/pasgarde
You can install the package with pip:
pip install asgarde==0.16.0
With Asgarde, you can catch errors at each step of the pipeline, before writing the result with the IO (Datastore in this case).
Example:
input_teams: PCollection[str] = p | 'Read' >> beam.Create(team_names)
result = (CollectionComposer.of(input_teams)
.map('Map with country', lambda tname: TeamInfo(name=tname, country=team_countries[tname], city=''))
.map('Map with city', lambda tinfo: TeamInfo(name=tinfo.name, country=tinfo.country, city=team_cities[tinfo.name]))
.filter('Filter french team', lambda tinfo: tinfo.country == 'France'))
result_outputs: PCollection[TeamInfo] = result.outputs
result_failures: PCollection[Failure] = result.failures
Asgarde provides a wrapper, the CollectionComposer class, instantiated from a PCollection.
Each operator like map, flat_map, and filter applies the operation while handling errors.
The result of CollectionComposer is a tuple with:
a PCollection of successful outputs
a PCollection of Failure objects
Failure is an object provided by Asgarde:
@dataclass
class Failure:
    pipeline_step: str
    input_element: str
    exception: Exception
This object gives the input_element that caused the error and the exception that was raised.
pipeline_step is the name of the transformation in which the error occurred.
Your pipeline can be adapted in the following way with Asgarde:
with beam.Pipeline(options=create_pipeline_options(pipeline_args)) as p:
    rows = (p | 'ReadFromBigquery' >> beam.io.ReadFromBigQuery(table=f"{known_args.project}:{known_args.datasetId}.{known_args.tableId}", use_standard_sql=True))

    result = (CollectionComposer.of(rows)
              .map('GetEntities', lambda el: get_entity_function(el))
              .map('Update Entities', lambda entity: update_entity_function(entity)))

    result_outputs = result.outputs
    result_failures: PCollection[Failure] = result.failures

    (result_outputs | 'Write To Datastore' >> WriteToDatastore(known_args.project))

    (result_failures
     | 'Map before Write to BQ' >> beam.Map(failure_to_your_obj_function)
     | 'Write Failed To Bigquery' >> beam.io.WriteToBigQuery(table=f"{c.audit_table}:{known_args.datasetId}.{known_args.tableId}"))
You can also apply the same logic with native Beam; I share an example from my personal GitHub repository:
https://github.com/tosun-si/teams-league-python-dlq-native-beam-summit/blob/main/team_league/domain_ptransform/team_stats_transform.py
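As an illustrative sketch of that native approach with TupleTags (the DoFn, its label, and the failure record format below are just an example, reusing the update_entity_function from above):
import apache_beam as beam
from apache_beam.pvalue import TaggedOutput

class UpdateEntityWithDLQ(beam.DoFn):
    FAILURE_TAG = 'failures'

    def process(self, element):
        try:
            # Reuse the existing update logic; anything it raises is caught below.
            yield update_entity_function(element)
        except Exception as e:
            # Route the failing element to the dead letter output instead of failing the bundle.
            yield TaggedOutput(self.FAILURE_TAG, {'element': str(element), 'error': str(e)})

results = (entities | 'Update Entities with DLQ' >>
           beam.ParDo(UpdateEntityWithDLQ()).with_outputs(
               UpdateEntityWithDLQ.FAILURE_TAG, main='outputs'))
updated, failed = results.outputs, results[UpdateEntityWithDLQ.FAILURE_TAG]
The failed PCollection can then feed the 'Write Failed To Bigquery' step in the same way as in the Asgarde version above.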
I have a Pyomo model and want to compare the performance of solving it normally against solving a reduced model that I generate through an algorithm. I want to set a time limit and see either which one finishes first or, if one or both do not finish within the time limit, which of the two models reached the better solution, by saving the gap value.
Currently my code looks like this:
solver_info = optimizer.solve(self.pyM, warmstart=warmstartHeuristic, tee=True, load_solutions=False)
self.solverSpecs['gap'] = solver_info.solution(0).gap
self.pyM.solutions.load_from(solver_info)
The second line of code is supposed to save the gap value. My test example for the normally solved model ended its optimization with Gurobi with these lines:
Nodes | Current Node | Objective Bounds | Work
Expl Unexpl | Obj Depth IntInf | Incumbent BestBd Gap | It/Node Time
0 0 21125.0674 0 5 - 21125.0674 - - 60s
H 0 0 192740.94072 21125.0674 89.0% - 227s
Explored 1 nodes (56138 simplex iterations) in 300.02 seconds
Thread count was 3 (of 4 available processors)
Solution count 1: 192741
Time limit reached
Best objective 1.927409407219e+05, best bound 2.112506743254e+04, gap 89.0397%
WARNING: Loading a SolverResults object with an 'aborted' status, but
containing a solution
Status: aborted
Return code: 0
Message: Optimization terminated because the time expended exceeded the value specified in the TimeLimit parameter.
Termination condition: maxTimeLimit
Termination message: Optimization terminated because the time expended exceeded the value specified in the TimeLimit parameter.
Wall time: 300.0222430229187
Error rc: 0
Time: 301.3100845813751
which states that the solution has a gap of around 90%. But the value that is saved into the dictionary is zero. I am using:
Pyomo 5.7.2.
Gurobi 9.0.2
Isn't this gap value the MIP gap, or is this a bug? I took a look at the complete solver_info object, and the bounds saved there would result in the correct gap, so I am a little confused.
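For reference, a workaround sketch based on that observation, assuming the bounds are exposed as solver_info.problem.lower_bound and solver_info.problem.upper_bound (which I have not confirmed for this Pyomo/Gurobi combination):
# Recompute the relative MIP gap from the bounds in the results object.
# For this minimization run, lower_bound is the best bound and
# upper_bound is the incumbent objective.
lb = solver_info.problem.lower_bound
ub = solver_info.problem.upper_bound
if ub is not None and ub != 0:
    # Gurobi-style gap: |incumbent - best bound| / |incumbent|
    self.solverSpecs['gap'] = abs(ub - lb) / abs(ub)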
I have the following piece of code:
# fact table
df = (spark.table(f'nn_squad7_{country}.fact_table')
.filter(f.col('date_key').between(start_date,end_date))
#.filter(f.col('is_lidl_plus')==1)
.filter(f.col('source')=='tickets')
.filter(f.col('subtype')=='trx')
.filter(f.col('is_trx_ok') == 1)
.join(dim_stores,'store_id','inner')
.join(dim_customers,'customer_id','inner')
.withColumn('week', f.expr('DATE_FORMAT(DATE_SUB(date_key, 1), "Y-ww")'))
.withColumn('quarter', f.expr('DATE_FORMAT(DATE_SUB(date_key, 1), "Q")')))
#checking metrics
df2 =(df
.groupby('is_client_plus','quarter')
.agg(
f.countDistinct('store_id'),
f.sum('customer_id'),
f.sum('ticket_id')))
display(df2)
When I execute the query I get the following error:
SparkException: Job aborted due to stage failure: Task 58 in stage 13.0 failed 4 times, most recent failure: Lost task 58.3 in stage 13.0 (TID 488, 10.32.14.43, executor 4): java.lang.IllegalArgumentException: Illegal pattern character 'Q'
I'm not sure why I'm getting this error, because when I run the fact table chunk alone I don't get any error.
Any advice? Thanks!
According to the Spark 3 docs, 'Q' is a valid datetime format pattern, even though it is not a valid java.text.SimpleDateFormat pattern. Not sure why it didn't work for you - maybe a Spark version issue. Try using the quarter function instead, which should give the same expected output:
df = (spark.table(f'nn_squad7_{country}.fact_table')
.filter(f.col('date_key').between(start_date,end_date))
#.filter(f.col('is_lidl_plus')==1)
.filter(f.col('source')=='tickets')
.filter(f.col('subtype')=='trx')
.filter(f.col('is_trx_ok') == 1)
.join(dim_stores,'store_id','inner')
.join(dim_customers,'customer_id','inner')
.withColumn('week', f.expr('DATE_FORMAT(DATE_SUB(date_key, 1), "Y-ww")'))
.withColumn('quarter', f.expr('quarter(DATE_SUB(date_key, 1))')))
If you look at the documentation of the DATE_FORMAT function, it explicitly states which pattern letters are valid to use:
All pattern letters of the Java class java.text.SimpleDateFormat can be used.
You can see the valid patterns here:
https://docs.oracle.com/javase/10/docs/api/java/text/SimpleDateFormat.html
It looks like Q is not one of them. As per the comments, @mck shows a suitable alternative.
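A compact sketch of that alternative using the DataFrame API function directly (assuming date_key is a date or timestamp column):
# quarter() parses no format pattern at all, so the unsupported 'Q' letter never comes into play.
df = df.withColumn('quarter', f.quarter(f.date_sub(f.col('date_key'), 1)))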
I am creating a kind of RAM memory. The idea was first to create the RAM "write" functionality, as you can see in the code below. Besides the RAM memory, there is a RAM model driver, which is supposed to write data to the RAM (just to briefly verify that the write functionality works properly).
The RAM model driver and the RAM model are connected to each other, and some transactions should occur, but the problem is that the simulation completes within zero simulation seconds.
Does anybody have an idea what the problem could be?
@gear
def ram_model(write_addr: Uint,
              write_data: Queue['dtype'], *,
              ram_mem=None,
              dtype=b'dtype',
              mem_granularity_in_bytes=1) -> (Queue['dtype']):
    if ram_mem is None and type(ram_mem) is not dict:
        ram_mem = {}

    ram_write_op(write_addr=write_addr,
                 write_data=write_data,
                 ram_memory=ram_mem)

@gear
async def ram_write_op(write_addr: Uint,
                       write_data: Queue, *,
                       ram_memory=None,
                       mem_granularity_in_bytes=1):
    if ram_memory is None and type(ram_memory) is not dict:
        SystemError("Ram memory is %s but it should be dictionary", type(ram_memory))

    byte_t = Array[Uint[8], mem_granularity_in_bytes]

    async with write_addr as addr:
        async for data, _ in write_data:
            for b in code(data, byte_t):
                ram_memory[addr] = b
                addr += 1

@gear
async def ram_model_drv(*, addr_bus_width=b'asize',
                        data_type=b'dtype') -> (Uint[8], Queue['data_type']):
    num_of_w_comnds = 15
    matrix = np.random.randint(10, size=(num_of_w_comnds, 10))
    for command_id in range(num_of_w_comnds):
        for i in range(matrix[command_id].size):
            yield (command_id, (matrix[command_id][i], i == matrix[command_id].size))

stimul = ram_model_drv(addr_bus_width=8, data_type=Fixp[8, 8])
out = ram_model(stimul[0], stimul[1])

sim()
Here is the output message:
python ram_model.py
- [INFO]: Running sim with seed: 3934280405122873233
0 [INFO]: -------------- Simulation start --------------
0 [INFO]: ----------- Simulation done ---------------
0 [INFO]: Elapsed: 0.00
Yeah, this one is a bit convoluted. The gist of the issue is that in the ram_model_drv module you are synchronously outputting data on both of its output interfaces with the yield statement. For PyGears this means that you would like the data on both of these interfaces to be acknowledged before continuing. The ram_write_op module is connected to both of these interfaces via the write_addr and write_data arguments. Inside that module, you acknowledge data from the write_addr interface only after you have read multiple data items from the write_data interface, hence there is a deadlock, and the PyGears simulator detects that no further simulation steps are possible and exits at the end of step 0.
There are also two additional issues in the driver:
It will never generate an eot for the output data Queue. Instead, eot should be generated when i == matrix[command_id].size - 1.
The async modules are run in an endless loop by PyGears, so your ram_model_drv will generate data endlessly unless you explicitly raise a GearDone exception.
OK, back to the main issue. There are several possibilities to circumvent it:
Use decoupling
For this to work, you first need to split the data output into two yield statements, one for write_addr and the other for write_data, since your ram_write_op uses only one address per several write data items.
@gear
async def ram_model_drv(*, addr_bus_width, data_type) -> (Uint[8], Queue['data_type']):
    num_of_w_comnds = 15
    matrix = np.random.randint(10, size=(num_of_w_comnds, 10))
    for command_id in range(num_of_w_comnds):
        yield (command_id, None)
        for i in range(matrix[command_id].size):
            yield (None, (matrix[command_id][i], i == matrix[command_id].size - 1))

    raise GearDone
You can use either dreg or decouple modules to temporarily store output data from ram_model_drv before they are consumed by ram_write_op.
out = ram_model(stimul[0] | decouple, stimul[1] | decouple)
Split driver into two modules, one driving each of the two interfaces
Use low level synchronization API for interfaces
Underneath the yield mechanism, there is a lower level API for communicating via PyGears interfaces. Handles to output interfaces can be obtained via the module().dout field. Data can be sent via an interface without waiting for it to be acknowledged using the put_nb() method. Later, in order to wait for the acknowledgment, the ready() method can be awaited. Finally, the put() method combines the two in one call, so it will both send the data and wait for the acknowledgment.
@gear
async def ram_model_drv(*,
                        addr_bus_width=b'asize',
                        data_type=b'dtype') -> (Uint[8], Queue['data_type']):
    addr, data = module().dout
    num_of_w_comnds = 15
    matrix = np.random.randint(10, size=(num_of_w_comnds, 10))
    for command_id in range(num_of_w_comnds):
        addr.put_nb(command_id)
        for i in range(matrix[command_id].size):
            await data.put((matrix[command_id][i], i == matrix[command_id].size - 1))

        await addr.ready()

    raise GearDone
I run locust using this command:
locust -f locustfile.py --no-web -c10 -r10 &> locust.log &
My understanding is that all output (stdout, stderr) will go to locust.log.
However, when the program stopped without me triggering the stop, the last lines of locust.log were only the stats shown below; no error message could be found:
Name # reqs # fails Avg Min Max | Median req/s
--------------------------------------------------------------------------------------------------------------------------------------------
GET /*******/**********/ 931940 8(0.00%) 45 10 30583 | 23 101.20
GET /**************/************/ 931504 14(0.00%) 47 11 30765 | 24 104.10
GET /**************/***************/ 594 92243(99.36%) 30 12 549 | 23 0.00
--------------------------------------------------------------------------------------------------------------------------------------------
Total 1864038 92265(4.95%) 205.30
Since I didn't specify a number of requests, the job should never stop on its own.
Where and how should I check why the job is stopping?