Can't make apache beam write outputs to bigquery when using DataflowRunner

Can't make apache beam write outputs to bigquery when using DataflowRunner - python

I'm trying to understand why this pipeline writes no output to BigQuery.
What I'm trying to achieve is to calculate the USD index for the last 10 years, starting from different currency pairs observations.
All the data is in BigQuery and I need to organize it and sort it in a chronollogical way (if there is a better way to achieve this, I'm glad to read it because I think this might not be the optimal way to do this).
The idea behing the class Currencies() is to start grouping (and keep) the last observation of a currency pair (eg: EURUSD), update all currency pair values as they "arrive", sort them chronologically and finally get the open, high, low and close value of the USD index for that day.
This code works in my jupyter notebook and in cloud shell using DirectRunner, but when I use DataflowRunner it does not write any output. In order to see if I could figure it out, I tried to just create the data using beam.Create() and then write it to BigQuery (which it worked) and also just read something from BQ and write it on other table (also worked), so my best guess is that the problem is in the beam.CombineGlobally part, but I don't know what it is.
The code is as follows:
import logging
import collections
import apache_beam as beam
from datetime import datetime
SYMBOLS = ['usdjpy', 'usdcad', 'usdchf', 'eurusd', 'audusd', 'nzdusd', 'gbpusd']
TABLE_SCHEMA = "date:DATETIME,index:STRING,open:FLOAT,high:FLOAT,low:FLOAT,close:FLOAT"
class Currencies(beam.CombineFn):
def create_accumulator(self):
return {}
def add_input(self,accumulator,inputs):
logging.info(inputs)
date,currency,bid = inputs.values()
if '.' not in date:
date = date+'.0'
date = datetime.strptime(date,'%Y-%m-%dT%H:%M:%S.%f')
data = currency+':'+str(bid)
accumulator[date] = [data]
return accumulator
def merge_accumulators(self,accumulators):
merged = {}
for accum in accumulators:
ordered_data = collections.OrderedDict(sorted(accum.items()))
prev_date = None
for date,date_data in ordered_data.items():
if date not in merged:
merged[date] = {}
if prev_date is None:
prev_date = date
else:
prev_data = merged[prev_date]
merged[date].update(prev_data)
prev_date = date
for data in date_data:
currency,bid = data.split(':')
bid = float(bid)
currency = currency.lower()
merged[date].update({
currency:bid
})
return merged
def calculate_index_value(self,data):
return data['usdjpy']*data['usdcad']*data['usdchf']/(data['eurusd']*data['audusd']*data['nzdusd']*data['gbpusd'])
def extract_output(self,accumulator):
ordered = collections.OrderedDict(sorted(accumulator.items()))
index = {}
for dt,currencies in ordered.items():
if not all([symbol in currencies.keys() for symbol in SYMBOLS]):
continue
date = str(dt.date())
index_value = self.calculate_index_value(currencies)
if date not in index:
index[date] = {
'date':date,
'index':'usd',
'open':index_value,
'high':index_value,
'low':index_value,
'close':index_value
}
else:
max_value = max(index_value,index[date]['high'])
min_value = min(index_value,index[date]['low'])
close_value = index_value
index[date].update({
'high':max_value,
'low':min_value,
'close':close_value
})
return index
def main():
query = """
select date,currency,bid from data_table
where date(date) between '2022-01-13' and '2022-01-16'
and currency like ('%USD%')
"""
options = beam.options.pipeline_options.PipelineOptions(
temp_location = 'gs://PROJECT/temp',
project = 'PROJECT',
runner = 'DataflowRunner',
region = 'REGION',
num_workers = 1,
max_num_workers = 1,
machine_type = 'n1-standard-1',
save_main_session = True,
staging_location = 'gs://PROJECT/stag'
)
with beam.Pipeline(options = options) as pipeline:
inputs = (pipeline
| 'Read From BQ' >> beam.io.ReadFromBigQuery(query=query,use_standard_sql=True)
| 'Accumulate' >> beam.CombineGlobally(Currencies())
| 'Flat' >> beam.ParDo(lambda x: x.values())
| beam.io.Write(beam.io.WriteToBigQuery(
table = 'TABLE',
dataset = 'DATASET',
project = 'PROJECT',
schema = TABLE_SCHEMA))
)
if __name__ == '__main__':
logging.getLogger().setLevel(logging.INFO)
main()
They way I execute this is from shell, using python3 -m first_script (is this the way I should run this batch jobs?).
What I'm missing or doing wrong? This is my first attemp to use Dataflow, so i'm probably making several mistakes in the book.

For whom it may help: I faced a similar problem but I already used the same code for a different flow that had a pubsub as input where it worked flawless instead a file based input where it simply did not. After a lot of experimenting I found that in the options I changed the flag
options = PipelineOptions(streaming=True, ..
to
options = PipelineOptions(streaming=False,
as of course it is not a streaming source, it's a bounded source, a batch. After I set this flag to true I found my rows in the BigQuery table. After it had finished it even stopped the pipeline as it where a batch operation. Hope this helps

Related

Databricks DLT reading a table from one schema(bronze), process CDC data and store to another schema (processed)

I am developing an ETL pipeline using databricks DLT pipelines for CDC data that I recieve from kafka. I have created 2 pipelines successfully for landing, and raw zone. The raw one will have operation flag, a sequence column, and I would like to process the CDC and store the clean data in processed layer (SCD 1 type). I am having difficulties in reading table from one schema, apply CDC changes, and load to target db schema tables.
I have 100 plus tables, so i am planning to loop through the tables in RAW layer and apply CDC, move to processed layer. Following is my code that I have tried (I have left the commented code just for your reference).
import dlt
from pyspark.sql.functions import *
from pyspark.sql.types import *
raw_db_name = "raw_db"
processed_db_name = "processed_db_name"
def generate_curated_table(src_table_name, tgt_table_name, df):
# #dlt.view(
# name= src_table_name,
# spark_conf={
# "pipelines.incompatibleViewCheck.enabled": "false"
# },
# comment="Processed data for " + str(src_table_name)
# )
# # def create_target_table():
# # return (df)
# dlt.create_target_table(name=tgt_table_name,
# comment= f"Clean, merged {tgt_table_name}",
# #partition_cols=["topic"],
# table_properties={
# "quality": "silver"
# }
# )
# #dlt.view
# def users():
# return spark.readStream.format("delta").table(src_table_name)
#dlt.view
def raw_tbl_data():
return df
dlt.create_target_table(name=tgt_table_name,
comment="Clean, merged customers",
table_properties={
"quality": "silver"
})
dlt.apply_changes(
target = tgt_table_name,
source = f"{raw_db_name}.raw_tbl_data,
keys = ["id"],
sequence_by = col("timestamp_ms"),
apply_as_deletes = expr("op = 'DELETE'"),
apply_as_truncates = expr("op = 'TRUNCATE'"),
except_column_list = ["id", "timestamp_ms"],
stored_as_scd_type = 1
)
return
tbl_name = 'raw_po_details'
df = spark.sql(f'select * from {raw_dbname}.{tbl_name}')
processed_tbl_name = tbl_name.replace("raw", "processed") //processed_po_details
generate_curated_table(tbl_name, processed_tbl_name, df)
I have tried with dlt.view(), dlt.table(), dlt.create_streaming_live_table(), dlt.create_target_table(), but ending up with either of the following errors:
AttributeError: 'function' object has no attribute '_get_object_id'
pyspark.sql.utils.AnalysisException: Failed to read dataset '<raw_db_name.mytable>'. Dataset is not defined in the pipeline
.Expected result:
Read the dataframe which is passed as a parameter (RAW_DB) and
Create new tables in PROCESSED_DB which is configured in DLT pipeline settings
https://www.databricks.com/blog/2022/04/27/how-uplift-built-cdc-and-multiplexing-data-pipelines-with-databricks-delta-live-tables.html
https://cprosenjit.medium.com/databricks-delta-live-tables-job-workflows-orchestration-patterns-bc7643935299
Appreciate any help please.
Thanks in advance

I got the solution myself and got it working, thanks to all. Am adding my solution so it could be a reference to others.
import dlt
from pyspark.sql.functions import *
from pyspark.sql.types import *
def generate_silver_tables(target_table, source_table):
#dlt.table
def customers_filteredB():
return spark.table("my_raw_db.myraw_table_name")
### Create the target table definition
dlt.create_target_table(name=target_table,
comment= f"Clean, merged {target_table}",
#partition_cols=["topic"],
table_properties={
"quality": "silver",
"pipelines.autoOptimize.managed": "true"
}
)
## Do the merge
dlt.apply_changes(
target = target_table,
source = "customers_filteredB",
keys = ["id"],
apply_as_deletes = expr("operation = 'DELETE'"),
sequence_by = col("timestamp_ms"),#primary key, auto-incrementing ID of any kind that can be used to identity order of events, or timestamp
ignore_null_updates = False,
except_column_list = ["operation", "timestamp_ms"],
stored_as_scd_type = "1"
)
return
raw_dbname = "raw_db"
raw_tbl_name = 'raw_table_name'
processed_tbl_name = raw_tbl_name.replace("raw", "processed")
generate_silver_tables(processed_tbl_name, raw_tbl_name)

Problems querying with python to BigQuery (Python String Format)

I am trying to make a query to BigQuery in order to modify all the values of a row (in python). When I use a simple string to query, I have no problems. Nevertheless, when I introduce the string formatting the query does not work. As follows I'm presenting the same query, but diminishing the number of columns that I am modifying.
I already made the connection to BigQuery, by defining the Client, etc (and works properly).
I tried:
"UPDATE `riscos-dev.survey_test.data-test-bdrn` SET informaci_meteorol_gica = {inf}, risc = {ri} WHERE objectid = {obj_id}".format(inf = df.informaci_meteorol_gica[index], ri = df.risc[index], obj_id = df.objectid[index])
To specify the input values in format:
df.informaci_meteorol_gica[index] = 'Neu' , also a string for df.risc[index] and df.objectid[index] = 3
I am obtaining the following error message:
BadRequest: 400 Braced constructors are not supported at [1:77]

Instead of using format method of string, I propose you another approach with the f string formating in Python :
def build_query():
inf = "'test_inf'"
ri = "'test_ri'"
obj_id = "'test_obj_id'"
return f"UPDATE `riscos-dev.survey_test.data-test-bdrn` SET informaci_meteorol_gica = {inf}, risc = {ri} WHERE objectid = {obj_id}"
if __name__ == '__main__':
query = build_query()
print(query)
The result is :
UPDATE `riscos-dev.survey_test.data-test-bdrn` SET informaci_meteorol_gica = 'test_inf', risc = 'test_ri' WHERE objectid = 'test_obj_id'
I mocked the query params in my example with :
inf = "'test_inf'"
ri = "'test_ri'"
obj_id = "'test_obj_id'"

ValueError: timestamp out of range for platform localtime()/gmtime() function

I have a class assignment to write a python program to download end-of-day data last 25 years the major global stock market indices from Yahoo Finance:
Dow Jones Index (USA)
S&P 500 (USA)
NASDAQ (USA)
DAX (Germany)
FTSE (UK)
HANGSENG (Hong Kong)
KOSPI (Korea)
CNX NIFTY (India)
Unfortunately, when I run the program an error occurs.
File "C:\ProgramData\Anaconda2\lib\site-packages\yahoofinancials__init__.py", line 91, in format_date
form_date = datetime.datetime.fromtimestamp(int(in_date)).strftime('%Y-%m-%d')
ValueError: timestamp out of range for platform localtime()/gmtime() function
If you see below, you can see the code that I have written. I'm trying to debug my mistakes. Can you help me out please? Thanks
from yahoofinancials import YahooFinancials
import pandas as pd
# Select Tickers and stock history dates
index1 = '^DJI'
index2 = '^GSPC'
index3 = '^IXIC'
index4 = '^GDAXI'
index5 = '^FTSE'
index6 = '^HSI'
index7 = '^KS11'
index8 = '^NSEI'
freq = 'daily'
start_date = '1993-06-30'
end_date = '2018-06-30'
# Function to clean data extracts
def clean_stock_data(stock_data_list):
new_list = []
for rec in stock_data_list:
if 'type' not in rec.keys():
new_list.append(rec)
return new_list
# Construct yahoo financials objects for data extraction
dji_financials = YahooFinancials(index1)
gspc_financials = YahooFinancials(index2)
ixic_financials = YahooFinancials(index3)
gdaxi_financials = YahooFinancials(index4)
ftse_financials = YahooFinancials(index5)
hsi_financials = YahooFinancials(index6)
ks11_financials = YahooFinancials(index7)
nsei_financials = YahooFinancials(index8)
# Clean returned stock history data and remove dividend events from price history
daily_dji_data = clean_stock_data(dji_financials
.get_historical_stock_data(start_date, end_date, freq)[index1]['prices'])
daily_gspc_data = clean_stock_data(gspc_financials
.get_historical_stock_data(start_date, end_date, freq)[index2]['prices'])
daily_ixic_data = clean_stock_data(ixic_financials
.get_historical_stock_data(start_date, end_date, freq)[index3]['prices'])
daily_gdaxi_data = clean_stock_data(gdaxi_financials
.get_historical_stock_data(start_date, end_date, freq)[index4]['prices'])
daily_ftse_data = clean_stock_data(ftse_financials
.get_historical_stock_data(start_date, end_date, freq)[index5]['prices'])
daily_hsi_data = clean_stock_data(hsi_financials
.get_historical_stock_data(start_date, end_date, freq)[index6]['prices'])
daily_ks11_data = clean_stock_data(ks11_financials
.get_historical_stock_data(start_date, end_date, freq)[index7]['prices'])
daily_nsei_data = clean_stock_data(nsei_financials
.get_historical_stock_data(start_date, end_date, freq)[index8]['prices'])
stock_hist_data_list = [{'^DJI': daily_dji_data}, {'^GSPC': daily_gspc_data}, {'^IXIC': daily_ixic_data},
{'^GDAXI': daily_gdaxi_data}, {'^FTSE': daily_ftse_data}, {'^HSI': daily_hsi_data},
{'^KS11': daily_ks11_data}, {'^NSEI': daily_nsei_data}]
# Function to construct data frame based on a stock and it's market index
def build_data_frame(data_list1, data_list2, data_list3, data_list4, data_list5, data_list6, data_list7, data_list8):
data_dict = {}
i = 0
for list_item in data_list2:
if 'type' not in list_item.keys():
data_dict.update({list_item['formatted_date']: {'^DJI': data_list1[i]['close'], '^GSPC': list_item['close'],
'^IXIC': data_list3[i]['close'], '^GDAXI': data_list4[i]['close'],
'^FTSE': data_list5[i]['close'], '^HSI': data_list6[i]['close'],
'^KS11': data_list7[i]['close'], '^NSEI': data_list8[i]['close']}})
i += 1
tseries = pd.to_datetime(list(data_dict.keys()))
df = pd.DataFrame(data=list(data_dict.values()), index=tseries,
columns=['^DJI', '^GSPC', '^IXIC', '^GDAXI', '^FTSE', '^HSI', '^KS11', '^NSEI']).sort_index()
return df

Your problem is your datetime stamps are in the wrong format. If you look at the error code it clugely tells you:
datetime.datetime.fromtimestamp(int(in_date)).strftime('%Y-%m-%d')
Notice the int(in_date) part?
It wants the unix timestamp. There are several ways to get this, out of the time module or the calendar module, or using Arrow.
import datetime
import calendar
date = datetime.datetime.strptime("1993-06-30", "%Y-%m-%d")
start_date = calendar.timegm(date.utctimetuple())
* UPDATED *
OK so I fixed up to the dataframes portion. Here is my current code:
# Select Tickers and stock history dates
index = {'DJI' : YahooFinancials('^DJI'),
'GSPC' : YahooFinancials('^GSPC'),
'IXIC':YahooFinancials('^IXIC'),
'GDAXI':YahooFinancials('^GDAXI'),
'FTSE':YahooFinancials('^FTSE'),
'HSI':YahooFinancials('^HSI'),
'KS11':YahooFinancials('^KS11'),
'NSEI':YahooFinancials('^NSEI')}
freq = 'daily'
start_date = '1993-06-30'
end_date = '2018-06-30'
# Clean returned stock history data and remove dividend events from price history
daily = {}
for k in index:
tmp = index[k].get_historical_stock_data(start_date, end_date, freq)
if tmp:
daily[k] = tmp['^{}'.format(k)]['prices'] if 'prices' in tmp['^{}'.format(k)] else []
Unfortunately I had to fix a couple things in the yahoo module. For the class YahooFinanceETL:
#staticmethod
def format_date(in_date, convert_type):
try:
x = int(in_date)
convert_type = 'standard'
except:
convert_type = 'unixstamp'
if convert_type == 'standard':
if in_date < 0:
form_date = datetime.datetime(1970, 1, 1) + datetime.timedelta(seconds=in_date)
else:
form_date = datetime.datetime.fromtimestamp(int(in_date)).strftime('%Y-%m-%d')
else:
split_date = in_date.split('-')
d = date(int(split_date[0]), int(split_date[1]), int(split_date[2]))
form_date = int(time.mktime(d.timetuple()))
return form_date
AND:
# private static method to scrap data from yahoo finance
#staticmethod
def _scrape_data(url, tech_type, statement_type):
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
script = soup.find("script", text=re.compile("root.App.main")).text
data = loads(re.search("root.App.main\s+=\s+(\{.*\})", script).group(1))
if tech_type == '' and statement_type != 'history':
stores = data["context"]["dispatcher"]["stores"]["QuoteSummaryStore"]
elif tech_type != '' and statement_type != 'history':
stores = data["context"]["dispatcher"]["stores"]["QuoteSummaryStore"][tech_type]
else:
if "HistoricalPriceStore" in data["context"]["dispatcher"]["stores"] :
stores = data["context"]["dispatcher"]["stores"]["HistoricalPriceStore"]
else:
stores = data["context"]["dispatcher"]["stores"]["QuoteSummaryStore"]
return stores
You will want to look at the daily dict, and rewrite your build_data_frame function, which it should be a lot simpler now since you are working with a dictionary already.

I am actually the maintainer and author of YahooFinancials. I just saw this post and wanted to personally apologize for the inconvenience and let you all know I will be working on fixing the module this evening.
Could you please open an issue on the module's Github page detailing this?
It would also be very helpful to know which version of python you were running when you encountered these issues.
https://github.com/JECSand/yahoofinancials/issues
I am at work right now, however as soon as I get home in ~7 hours or so I will attempt to code a fix and release it. I'll also work on the exception handling. I try my best to maintain this module, but my day (and often night time) job is rather demanding. I will report back with the final results of these fixes and publish to pypi when it is done and stable.
Also if anyone else has any feedback or personal fixes made you can offer, it would be a huge huge help in fixing this. Proper credit will be given of course. I am also in desperate need of contributers, so if anyone is interested in that as well let me know. I am really wanting to take YahooFinancials to the next level and have this project become a stable and reliable alternative for free financial data for python projects.
Thank you for your patience and for using YahooFinancials.

storing data to database using python and retrieving it in a node.js app

The following code annotates a video with labels (or "tags") that are selected based on the image content.
import argparse
from google.cloud import videointelligence
def analyze_labels(path):
""" Detects labels given a GCS path. """
video_client = videointelligence.VideoIntelligenceServiceClient()
features = [videointelligence.enums.Feature.LABEL_DETECTION]
operation = video_client.annotate_video(path, features=features)
print('\nProcessing video for label annotations:')
result = operation.result(timeout=90)
print('\nFinished processing.')
segment_labels = result.annotation_results[0].segment_label_annotations
for i, segment_label in enumerate(segment_labels):
print('Video label description: {}'.format(
segment_label.entity.description))
for category_entity in segment_label.category_entities:
print('\tLabel category description: {}'.format(
category_entity.description))
for i, segment in enumerate(segment_label.segments):
start_time = (segment.segment.start_time_offset.seconds +
segment.segment.start_time_offset.nanos / 1e9)
end_time = (segment.segment.end_time_offset.seconds +
segment.segment.end_time_offset.nanos / 1e9)
positions = '{}s to {}s'.format(start_time, end_time)
confidence = segment.confidence
print('\tSegment {}: {}'.format(i, positions))
print('\tConfidence: {}'.format(confidence))
print('\n')
if __name__ == '__main__':
parser = argparse.ArgumentParser(
description=__doc__,
formatter_class=argparse.RawDescriptionHelpFormatter)
parser.add_argument('path', help='GCS file path for label detection.')
args = parser.parse_args()
analyze_labels(args.path)
The code prints the result to the terminal.
I want to store the Video label description, Label category description and confidence values to a Database(maybe MongoDB) and later retrieve it in a Noje.js application.
How can I do that or Is there any other better way?

There's nothing stopping you from doing that; as long as both your python program and your Node.js program agree on how data is being stored, you should have no trouble storing and retrieving simple strings. MongoDB, being a key/value store, would work well if you will know the keys you need to retrieve beforehand; something like SQLite might be better if you want to retrieve data based on some criteria without knowing an exact key. Both Node.js and Python have MongoDB/SQLite clients that function perfectly well.

How to get 10 random GAE ndb entities?

I have the following class to keep my records:
class List(ndb.Model):
'''
Index
Key: sender
'''
sender = ndb.StringProperty()
...
counter = ndb.IntegerProperty(default=0)
ignore = ndb.BooleanProperty(default=False)
added = ndb.DateTimeProperty(auto_now_add=True, indexed=False)
updated = ndb.DateTimeProperty(auto_now=True, indexed=False)
The following code is used to return all entities I need:
entries = List.query()
entries = entries.filter(List.counter > 5)
entries = entries.filter(List.ignore == False)
entries = entries.fetch()
How should I modify the code to get 10 random records from entries? I am planning to have a daily cron task to extract random records, so they should be really random. What is the best way to get these records (to minimize number of read operations)?
I don't think that the following code is the best:
entries = random.sample(entries, 10)

Well after reading the comments the only improvement you can make as far I can see is to fetch the keys only and limit if possible.
Haven't tested but like so
list_query = List.query()
list_query = list_query.filter(List.counter > 5)
list_query = list_query.filter(List.ignore == False)
list_keys = list_query.fetch(keys_only=True) # maybe put a limit here.
list_keys = random.sample(list_keys, 10)
lists = [list_key.get() for list_key in list_keys]

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Can't make apache beam write outputs to bigquery when using DataflowRunner - python

Related

Databricks DLT reading a table from one schema(bronze), process CDC data and store to another schema (processed)

Problems querying with python to BigQuery (Python String Format)

ValueError: timestamp out of range for platform localtime()/gmtime() function

storing data to database using python and retrieving it in a node.js app

How to get 10 random GAE ndb entities?

Categories

Resources