Passing parameters to reducer in MRjob - python

I am using MRjob to run Hadoop Streaming jobs over our HBase instance. For the life of me I cannot figure out how to pass a parameter to my reducer. I have two parameters that I want to pass to my reducer when I run the job: startDate and endDate. Here's what my current reducer looks like:
def reducer(self, groupId, meterList):
    """
    Print bucket.
    """
    sys.stderr.write("Working on group = " + str(groupId) + "\n")
    #print "Opening connection..."
    conn = open_connection(hostname)
    #print "Getting table..."
    table = get_table(conn, tableName)
    compositeDf = DataFrame()
    for meterId in meterList:
        sys.stderr.write("Querying: " + str(meterId) + "\n")
        df = extract_meter_data(table, meterId, startDate, endDate)
I cannot seem to pass startDate and endDate as parameters to my reducer. The only way I can get the job to pick up the parameters is through global variables defined above the class:
startDate = datetime.datetime(2012, 6, 10)
endDate = datetime.datetime(2012, 6, 11)

class MRDataQuality(MRJob):
    """
    MapReduce job that does a data quality check on the meter data in HBase.
    """
But that is dirty. I want to pass the values in from the code that calls the job. I've tried many approaches: setting them as instance variables, setting them as static class variables, writing an overloaded constructor for MRDataQualityJob... nothing seems to work. I am calling the job from my top-level script programmatically like so:
if args.hadoop:
    mrdq_job = MRDataQuality(args=['-r', 'hadoop', '--conf-path', 'mrjob.conf', '--jobconf', 'mapred.reduce.tasks=42', meterFile])
else:
    mrdq_job = MRDataQuality(args=[meterFile])

with mrdq_job.make_runner() as runner:
    runner.run()
No matter what I do to the mrdq_job instance, runner.run() seems to use a fresh instance of the class, which doesn't have the instance or static variables defined. How can I pass my parameters to the reducer? In regular Hadoop Streaming I can do it by passing a string: "--reducer reducer.py arg1 arg2". Is there any equivalent for MRjob?

How about passing your parameters through the job config and then reading them with get_jobconf_value?
Something like this:
from mrjob.compat import get_jobconf_value

class MRDataQuality(MRJob):

    def reducer(self, groupId, meterList):
        ...
        startDate = get_jobconf_value("my.job.settings.startdate")
        endDate = get_jobconf_value("my.job.settings.enddate")
        for meterId in meterList:
            sys.stderr.write("Querying: " + str(meterId) + "\n")
            df = extract_meter_data(table, meterId, startDate, endDate)
And then set the parameters in code like you did above
mrdq_job = MRDataQuality(args=['-r', 'hadoop', '--conf-path', 'mrjob.conf', '--jobconf', 'mapred.reduce.tasks=42', '--jobconf', 'my.job.settings.startdate=2013-06-10', '--jobconf', 'my.job.settings.enddate=2013-06-11', meterFile])
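Another option, closer to the "--reducer reducer.py arg1 arg2" style you mention, is an mrjob passthrough option, which gets forwarded to every task. A minimal sketch, assuming an older (optparse-based) mrjob release and made-up option names:

class MRDataQuality(MRJob):

    def configure_options(self):
        super(MRDataQuality, self).configure_options()
        # passthrough options are re-passed to the mapper/reducer tasks on the cluster
        self.add_passthrough_option('--start-date', default='2012-06-10')
        self.add_passthrough_option('--end-date', default='2012-06-11')

    def reducer(self, groupId, meterList):
        startDate = self.options.start_date
        endDate = self.options.end_date
        ...

and then add '--start-date', '2013-06-10', '--end-date', '2013-06-11' to the args list you already build. (Newer mrjob versions spell this configure_args / add_passthru_arg.)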

How about passing your parameters through the job config and then reading them with get_jobconf_value inside reducer_init? This way you only have to read the parameters in once.
Something like this:
from mrjob.compat import get_jobconf_value

class MRDataQuality(MRJob):

    def reducer_init(self):
        ...
        self.startDate = get_jobconf_value("my.job.settings.startdate")
        self.endDate = get_jobconf_value("my.job.settings.enddate")

    def reducer(self, groupId, meterList):
        for meterId in meterList:
            sys.stderr.write("Querying: " + str(meterId) + "\n")
            df = extract_meter_data(table, meterId, self.startDate, self.endDate)
And then set the parameters in code like you did above
mrdq_job = MRDataQuality(args=['-r', 'hadoop', '--conf-path', 'mrjob.conf', '--jobconf', 'mapred.reduce.tasks=42', '--jobconf', 'my.job.settings.startdate=2013-06-10', '--jobconf', 'my.job.settings.enddate=2013-06-11', meterFile])
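One detail to keep in mind with either variant: jobconf values arrive as plain strings, while the original reducer worked with datetime objects, so you will probably want to parse them once in reducer_init. A small sketch under that assumption (reusing the datetime import the job already has):

def reducer_init(self):
    # jobconf values are strings like '2013-06-10'; convert them once per task
    self.startDate = datetime.datetime.strptime(
        get_jobconf_value("my.job.settings.startdate"), "%Y-%m-%d")
    self.endDate = datetime.datetime.strptime(
        get_jobconf_value("my.job.settings.enddate"), "%Y-%m-%d")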

Related

Can't make apache beam write outputs to bigquery when using DataflowRunner

I'm trying to understand why this pipeline writes no output to BigQuery.
What I'm trying to achieve is to calculate the USD index for the last 10 years, starting from observations of different currency pairs.
All the data is in BigQuery and I need to organize it and sort it chronologically (if there is a better way to achieve this, I'd be glad to read it, because I think this might not be the optimal approach).
The idea behind the Currencies() class is to group and keep the last observation of each currency pair (e.g. EURUSD), update all currency pair values as they "arrive", sort them chronologically and finally get the open, high, low and close value of the USD index for each day.
This code works in my Jupyter notebook and in Cloud Shell using DirectRunner, but when I use DataflowRunner it does not write any output. To see if I could figure it out, I tried just creating the data with beam.Create() and writing it to BigQuery (which worked), and also just reading something from BQ and writing it to another table (which also worked), so my best guess is that the problem is in the beam.CombineGlobally part, but I don't know what it is.
The code is as follows:
import logging
import collections
import apache_beam as beam
from datetime import datetime

SYMBOLS = ['usdjpy', 'usdcad', 'usdchf', 'eurusd', 'audusd', 'nzdusd', 'gbpusd']
TABLE_SCHEMA = "date:DATETIME,index:STRING,open:FLOAT,high:FLOAT,low:FLOAT,close:FLOAT"

class Currencies(beam.CombineFn):
    def create_accumulator(self):
        return {}

    def add_input(self, accumulator, inputs):
        logging.info(inputs)
        date, currency, bid = inputs.values()
        if '.' not in date:
            date = date + '.0'
        date = datetime.strptime(date, '%Y-%m-%dT%H:%M:%S.%f')
        data = currency + ':' + str(bid)
        accumulator[date] = [data]
        return accumulator

    def merge_accumulators(self, accumulators):
        merged = {}
        for accum in accumulators:
            ordered_data = collections.OrderedDict(sorted(accum.items()))
            prev_date = None
            for date, date_data in ordered_data.items():
                if date not in merged:
                    merged[date] = {}
                if prev_date is None:
                    prev_date = date
                else:
                    prev_data = merged[prev_date]
                    merged[date].update(prev_data)
                    prev_date = date
                for data in date_data:
                    currency, bid = data.split(':')
                    bid = float(bid)
                    currency = currency.lower()
                    merged[date].update({
                        currency: bid
                    })
        return merged

    def calculate_index_value(self, data):
        return data['usdjpy'] * data['usdcad'] * data['usdchf'] / (data['eurusd'] * data['audusd'] * data['nzdusd'] * data['gbpusd'])

    def extract_output(self, accumulator):
        ordered = collections.OrderedDict(sorted(accumulator.items()))
        index = {}
        for dt, currencies in ordered.items():
            if not all([symbol in currencies.keys() for symbol in SYMBOLS]):
                continue
            date = str(dt.date())
            index_value = self.calculate_index_value(currencies)
            if date not in index:
                index[date] = {
                    'date': date,
                    'index': 'usd',
                    'open': index_value,
                    'high': index_value,
                    'low': index_value,
                    'close': index_value
                }
            else:
                max_value = max(index_value, index[date]['high'])
                min_value = min(index_value, index[date]['low'])
                close_value = index_value
                index[date].update({
                    'high': max_value,
                    'low': min_value,
                    'close': close_value
                })
        return index

def main():
    query = """
    select date,currency,bid from data_table
    where date(date) between '2022-01-13' and '2022-01-16'
    and currency like ('%USD%')
    """
    options = beam.options.pipeline_options.PipelineOptions(
        temp_location='gs://PROJECT/temp',
        project='PROJECT',
        runner='DataflowRunner',
        region='REGION',
        num_workers=1,
        max_num_workers=1,
        machine_type='n1-standard-1',
        save_main_session=True,
        staging_location='gs://PROJECT/stag'
    )
    with beam.Pipeline(options=options) as pipeline:
        inputs = (pipeline
                  | 'Read From BQ' >> beam.io.ReadFromBigQuery(query=query, use_standard_sql=True)
                  | 'Accumulate' >> beam.CombineGlobally(Currencies())
                  | 'Flat' >> beam.ParDo(lambda x: x.values())
                  | beam.io.Write(beam.io.WriteToBigQuery(
                        table='TABLE',
                        dataset='DATASET',
                        project='PROJECT',
                        schema=TABLE_SCHEMA))
                  )

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    main()
The way I execute this is from a shell, using python3 -m first_script (is this the way I should run these batch jobs?).
What am I missing or doing wrong? This is my first attempt at using Dataflow, so I'm probably making several textbook mistakes.
For whom it may help: I faced a similar problem. I had already used the same code for a different flow that had a Pub/Sub input, where it worked flawlessly, but with a file-based input it simply did not write anything. After a lot of experimenting I found that in the options I had to change the flag
options = PipelineOptions(streaming=True, ..
to
options = PipelineOptions(streaming=False,
since of course it is not a streaming source but a bounded source, i.e. a batch. After making that change I found my rows in the BigQuery table, and once it had finished it even stopped the pipeline, as you would expect from a batch operation. Hope this helps
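For reference, in the options block from the question that flag would just sit alongside the other settings; a minimal sketch reusing the question's placeholder values:

options = beam.options.pipeline_options.PipelineOptions(
    streaming=False,  # bounded source read from BigQuery, so run it as a batch job
    temp_location='gs://PROJECT/temp',
    project='PROJECT',
    runner='DataflowRunner',
    region='REGION',
    save_main_session=True,
    staging_location='gs://PROJECT/stag'
)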

Airflow Pipeline to read CSVs and load into PostgreSQL

So, I am trying to write an Airflow DAG to 1) read a few different CSVs from my local disk, 2) create different PostgreSQL tables, and 3) load the files into their respective tables. When I run the DAG, the second step seems to fail.
Below is the code for the DAG's logic operators:
AIRFLOW_HOME = os.getenv('AIRFLOW_HOME')

def get_listings_data():
    listings = pd.read_csv(AIRFLOW_HOME + '/dags/data/listings.csv')
    return listings

def get_g01_data():
    demographics = pd.read_csv(AIRFLOW_HOME + '/dags/data/demographics.csv')
    return demographics

def insert_listing_data_func(**kwargs):
    ps_pg_hook = PostgresHook(postgres_conn_id="postgres")
    conn_ps = ps_pg_hook.get_conn()
    ti = kwargs['ti']
    insert_df = pd.DataFrame.listings
    if len(insert_df) > 0:
        col_names = ['host_id', 'host_name', 'host_neighbourhood', 'host_total_listings_count', 'neighbourhood_cleansed', 'property_type', 'price', 'has_availability', 'availability_30']
        values = insert_df[col_names].to_dict('split')
        values = values['data']
        logging.info(values)
        insert_sql = """
            INSERT INTO assignment_2.listings (host_name, host_neighbourhood, host_total_listings_count, neighbourhood_cleansed, property_type, price, has_availability, availability_30)
            VALUES %s
            """
        result = execute_values(conn_ps.cursor(), insert_sql, values, page_size=len(insert_df))
        conn_ps.commit()
    else:
        None
    return None

def insert_demographics_data_func(**kwargs):
    ps_pg_hook = PostgresHook(postgres_conn_id="postgres")
    conn_ps = ps_pg_hook.get_conn()
    ti = kwargs['ti']
    insert_df = pd.DataFrame.demographics
    if len(insert_df) > 0:
        col_names = ['LGA', 'Median_age_persons', 'Median_mortgage_repay_monthly', 'Median_tot_prsnl_inc_weekly', 'Median_rent_weekly', 'Median_tot_fam_inc_weekly', 'Average_num_psns_per_bedroom', 'Median_tot_hhd_inc_weekly', 'Average_household_size']
        values = insert_df[col_names].to_dict('split')
        values = values['data']
        logging.info(values)
        insert_sql = """
            INSERT INTO assignment_2.demographics (LGA, Median_age_persons, Median_mortgage_repay_monthly, Median_tot_prsnl_inc_weekly, Median_rent_weekly, Median_tot_fam_inc_weekly, Average_num_psns_per_bedroom, Median_tot_hhd_inc_weekly, Average_household_size)
            VALUES %s
            """
        result = execute_values(conn_ps.cursor(), insert_sql, values, page_size=len(insert_df))
        conn_ps.commit()
    else:
        None
    return None
And my PostgreSQL operator for the demographics table (just an example) is below:
create_psql_table_demographics = PostgresOperator(
    task_id="create_psql_table_demographics",
    postgres_conn_id="postgres",
    sql="""
        CREATE TABLE IF NOT EXISTS postgres.demographics (
            LGA VARCHAR,
            Median_age_persons INT,
            Median_mortgage_repay_monthly INT,
            Median_tot_prsnl_inc_weekly INT,
            Median_rent_weekly INT,
            Median_tot_fam_inc_weekly INT,
            Average_num_psns_per_bedroom DECIMAL(10,1),
            Median_tot_hhd_inc_weekly INT,
            Average_household_size DECIMAL(10,2)
        );
        """,
    dag=dag)
Am I missing something in my code that stops the create_psql_table_demographics task from completing successfully in Airflow?
If your PostgreSQL database has access to the CSV files, you may simply use the copy_expert method of the PostgresHook class (cf. the documentation).
PostgreSQL is pretty efficient at loading flat files: you'll save a lot of CPU cycles by not involving Python (and pandas!), not to mention the potential encoding issues that you would otherwise have to address.
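A minimal sketch of that approach, reusing the question's paths, connection id and table name, and assuming the CSV column order matches the table (copy_expert(sql, filename) is the PostgresHook method referred to above):

from airflow.providers.postgres.hooks.postgres import PostgresHook  # airflow.hooks.postgres_hook on older Airflow

def load_demographics_csv():
    hook = PostgresHook(postgres_conn_id="postgres")
    # Let Postgres parse the CSV itself via COPY ... FROM STDIN; no pandas involved
    hook.copy_expert(
        sql="COPY assignment_2.demographics FROM STDIN WITH CSV HEADER",
        filename=AIRFLOW_HOME + '/dags/data/demographics.csv',
    )

You could then wrap load_demographics_csv in a PythonOperator in place of the pandas-based insert task.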

Why is my self instance not getting passed in method

I have the following code.
The class ResourceManagerSuspension inherits from TestCase. TestCase executes test methods alphabetically, so a method like testA will be executed before testB. That means testLinkData runs before testSuspension, and self.link gets its value in testLinkData.
I am initializing a variable self.link, and when testSuspension runs it calls getQueryValues.
My question is: why is self.link not being passed to the getQueryValues method? Can anyone explain how the self mechanism works here? Maybe I am doing something wrong.
class ResourceManagerSuspension(TestCase):

    @classmethod
    def setUpClass(self):
        logger.info("=== Starting setup ===")
        # self.rm_obj = ResourceManager(agg='mapper-prefix1-aggs.A.m2-test.akamai.com')
        self.rm_obj = ResourceManager()
        self.rm_leader = self.rm_obj.get_rm_leader()
        logger.info("RM lead target is %s" % (self.rm_leader))
        self.found = ""
        self.link = ""
        logger.info("self.link is : {}".format(self.link))
        logger.info("self in setUpClass is : {}".format(self.__dict__))

    # the name should be get link number to get started etc
    def testLinkData(self):
        linkValues = {}
        # get a random link
        sqlquery = "select * from rm_links_debugonly where adjuster_reason not like '\%suspend\%' and ip=" + self.rm_leader + " and link!=0 limit 1"
        link_obj = self.rm_obj.get_link_info(query=sqlquery)
        for row in link_obj:
            self.link = row.link
        self.getDynamicConfig()
        logger.info("self.link is : {}".format(self.link))
        logger.info("self in testLinkData is : {}".format(self.__dict__))

    def testSuspension(self):
        if not ResourceManagerSuspension.found:
            # get the 'control_reason' from "rm_link_load_control_debugonly" and 'adjuster_cap' from "rm_links_debugonly" before submitting the dynamic config
            self.control_reason_without_config, self.adjuster_cap_without_config = self.getQueryValues()
            logger.info("param not present in the file, submitting with the param")
            self.rm_obj.dyamic_config_submit(fromLocation = self.rm_obj.dynamic_config_modified, to = self.rm_obj.dynamic_config_incoming)
        else:
            logger.info("param is already present, removing it and submitting the config")
            self.rm_obj.dyamic_config_submit(fromLocation = self.rm_obj.dynamic_config_modified, to = self.rm_obj.dynamic_config_incoming)
        logger.info("self.link is : {}".format(self.link))
        logger.info("self in testSuspension is : {}".format(self.__dict__))

    def getQueryValues(self):
        logger.info("self in getQueryValues is : {}".format(self.__dict__))
        logger.info("self.link is : {}".format(self.link))
The output of the last line in the code is:
[05:55:39.709 test_suspension_2: 61 I] self.link is :
Unit tests are supposed to be able to run independently of one another, which means getQueryValues should be able to run before or after testLinkData; but in your implementation, getQueryValues must run after testLinkData in order to produce the output you expect.
To remedy this, write set-up utility methods that your test methods can call as they run, i.e. methods that give self.link a value independently of any other test.
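A minimal sketch of that idea, reusing the query and object names from the question (a set-up helper rather than a test, so ordering no longer matters):

class ResourceManagerSuspension(TestCase):

    def _fetch_link(self):
        # set-up helper: give self.link a value without relying on test order
        sqlquery = ("select * from rm_links_debugonly "
                    "where adjuster_reason not like '\%suspend\%' "
                    "and ip=" + self.rm_leader + " and link!=0 limit 1")
        for row in self.rm_obj.get_link_info(query=sqlquery):
            self.link = row.link

    def testSuspension(self):
        self._fetch_link()  # no longer depends on testLinkData having run first
        self.control_reason_without_config, self.adjuster_cap_without_config = self.getQueryValues()
        ...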

How can I pass a Python parameter in config.py to .sql file?

I am using the Python Snowflake connector to extract data from tables in Snowflake. Here is my file structure:
sql/
    a.sql
    b.sql
    c.sql
configurations.py
data_extract.py
main.py
Here the sql folder contains all my SQL queries in .sql files. I keep them in separate files because they are hundreds of lines long each and look messy if I put them into the Python files.
configurations.py contains the datetime parameters I want to change every time I run the code. It looks like this:
START_TIME = '2018-10-01 00:00:00'
END_TIME = '2019-04-01 00:00:00'
I want to add these parameters into the .sql files. For example, a.sql includes the following content:
DECLARE
#START_PICKUP_DATE DATE,
#END_PICKUP_DATE DATE,
SET
#START_PICKUP_DATE = '2018-10-01'
SET
#END_PICKUP_DATE = '2019-04-01'
select supplier_confirmation_id, pickup_datetime, dropoff_datetime, pickup_station_distance
from SANDBOX.ZQIAN.V_PDL
where pickup_datetime >= START_PICKUP_DATE and pickup_datetime < END_PICKUP_DATE
and supplier_confirmation_id is not null;
I use a.sql in my Python code in the following way:
def executeSQLScriptsFromFile(filepath):
    # snowflake credentials, replace SECRET with your own
    ctx = snowflake.connector.connect(
        user='S_ANALYTICS_USER',
        account=SECRET_A,
        region='us-east-1',
        warehouse=SECRET_B,
        database=SECRET_C,
        role=SECRET_D,
        password=SECRET_E)
    fd = open(filepath, 'r')
    query = fd.read()
    fd.close()
    cs = ctx.cursor()
    try:
        cur = cs.execute(query)
        df = pd.DataFrame.from_records(iter(cur), columns=[x[0] for x in cur.description])
    finally:
        cs.close()
    ctx.close()
    return df

def extract_data():
    a_sqlpath = os.path.join(os.getcwd(), 'sql\a.sql')
    a_df = executeSQLScriptsFromFile(a_sqlpath)
    return a_df
The problem is that I want START_PICKUP_DATE and END_PICKUP_DATE in the a.sql file to stay in sync with START_TIME and END_TIME in configurations.py, so that I only need to change START_TIME and END_TIME in configurations.py to extract data for a different timeframe using a.sql in Snowflake.
I've been looking for solutions online for quite a while, but I'm still not able to find one that is specific to my problem. Many thanks to anyone who can provide a hint!
You should be able to parameterize the SQL statements so that, instead of declaring the dates in the SQL file, you just pass them as parameters at execution time.
select supplier_confirmation_id, pickup_datetime, dropoff_datetime, pickup_station_distance
from SANDBOX.ZQIAN.V_PDL
where pickup_datetime >= %(START_PICKUP_DATE)s and pickup_datetime < %(END_PICKUP_DATE)s and supplier_confirmation_id is not null;
Then, when calling the function, send START_PICKUP_DATE and END_PICKUP_DATE as parameters to the execute call. One way to do this is to pass a mapping from parameter name to parameter value (in this example I'm assuming you have a function that gets the parameter value).
cur = cs.execute(query, {'START_PICKUP_DATE':get_value_from_config('start_pickup'), 'END_PICKUP_DATE':get_value_from_config('end_pickup')})
Or you can pass them by position:
cur = cs.execute(query, [get_value_from_config('start_pickup'), get_value_from_config('end_pickup')])
Which in essence becomes:
cur = cs.execute(query, ['2018-10-01 00:00:00','2019-04-01 00:00:00'])
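To tie this back to the layout in the question, a minimal sketch, assuming executeSQLScriptsFromFile is extended to accept a params dict and forward it to cs.execute:

import configurations

def executeSQLScriptsFromFile(filepath, params=None):
    ...  # connect and read the file as before
    cur = cs.execute(query, params)  # values are bound by the connector, not pasted into the SQL
    ...

def extract_data():
    a_sqlpath = os.path.join(os.getcwd(), 'sql', 'a.sql')
    params = {
        'START_PICKUP_DATE': configurations.START_TIME,
        'END_PICKUP_DATE': configurations.END_TIME,
    }
    return executeSQLScriptsFromFile(a_sqlpath, params)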
To accomplish this, I would take your .sql files and extract the queries into triple-quoted Python strings with format placeholders for your variables. Then import the queries into your main script just like you import your configuration:
sql_queries.py:
sql_a = """
DECLARE
#START_PICKUP_DATE DATE,
#END_PICKUP_DATE DATE,
SET
#START_PICKUP_DATE = {START_TIME}
SET
#END_PICKUP_DATE = {END_TIME}
select supplier_confirmation_id, pickup_datetime, dropoff_datetime, pickup_station_distance
from SANDBOX.ZQIAN.V_PDL
where pickup_datetime >= START_PICKUP_DATE and pickup_datetime < END_PICKUP_DATE
and supplier_confirmation_id is not null;
"""
main:
from sql_queries import sql_a
import configurations

print(sql_a.format(START_TIME=configurations.START_TIME, END_TIME=configurations.END_TIME))

class instance method takes exactly two args, one given

I am having a bit of an issue. First off, I know that this code is able to stand alone and not be in a class, but I would prefer that it is in a class. Second, when I run the code, I get this error: TypeError: set_options() takes exactly 2 arguments (1 given).
Here is my code. If anyone could point me in the right direction, I would appreciate it. I'm assuming that the set_options method isn't getting my jobj instance. Am I correct in assuming that, and how would one go about fixing it? P.S. I do have the correct imports, and here is my command at the terminal: python test.py radar 127.0.0.1 hashNumber testplan:speed
class TransferStuff(object):

    tool = sys.argv[1]
    target = sys.argv[2]
    hash = sys.argv[3]
    options = sys.argv[4]

    def set_options(self, test_options):
        option_arr = test_options.split(',')
        new_arr = [i + ':{}'.format(i) for i in option_arr if ':' not in i]
        for i in option_arr:
            if ':' in i:
                new_arr.append(i)
        d = {}
        for i in new_arr:
            temp = i.split(':')
            d[temp[0]] = temp[1]
        return d

    data = {'target': target, 'test': tool, 'HASH': hash,
            'options': set_options(options)}

    def write_to_json(self):
        """Serialize cli args and tool options in json format.
        Write stream to json file.
        """
        with open('envs.json', 'w') as fi:
            json.dump(TransferStuff.data, fi)

if __name__ == "__main__":
    try:
        jobj = TransferStuff()
        jobj.write_to_json()
Your method is inside a class, so you need to create an instance of the class:
transfer_stuff_instance = TransferStuff()
And call the method with this instance:
transfer_stuff_instance.set_options(options)
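To expand on why the error appears: the line data = {..., 'options': set_options(options)} runs in the class body, while the class is still being defined, so set_options is called as a plain function with only one argument and nothing bound to self. One way to restructure, keeping the question's names (a sketch, not a drop-in replacement):

class TransferStuff(object):

    def __init__(self):
        self.tool = sys.argv[1]
        self.target = sys.argv[2]
        self.hash = sys.argv[3]
        self.options = sys.argv[4]
        # build the payload once an instance exists, so self is bound normally
        self.data = {'target': self.target, 'test': self.tool, 'HASH': self.hash,
                     'options': self.set_options(self.options)}

    def set_options(self, test_options):
        ...  # unchanged from the question

    def write_to_json(self):
        with open('envs.json', 'w') as fi:
            json.dump(self.data, fi)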
