I've been looking around and can't find a solution for my issue. I have a DAG that is mainly checking that the backups are correct, so 1 task connects to a MySql DB and the 2nd one connects to a Postgres. Once I get those counts I want to send those results to another task that checks whether or not they match:
def mysql_count_validator(**kwargs):
    db_hook = MySqlHook(mysql_conn_id='MySQL_DB')
    # Query to grab desired results:
    df_mysql = db_hook.get_pandas_df('''
        SELECT COUNT(*)
        FROM `schema`.`table`;
    ''')
    # Save query results in a variable:
    return df_mysql

def postgres_count_validator(**kwargs):
    db_hook = PostgresHook(postgres_conn_id='Postgres_DB')
    # Query to grab desired results (Postgres quotes identifiers with double quotes, not backticks):
    df_postgres = db_hook.get_pandas_df('''
        SELECT COUNT(*)
        FROM "schema"."table";
    ''')
    # Save query results in a variable:
    return df_postgres

def validator(**kwargs):
    if df_mysql == df_postgres:
        print('Matched')
    else:
        print('Not Matched!')

mysql_count_validator = PythonOperator(
    task_id = 'mysql_count_validator',
    python_callable = mysql_count_validator
)
postgres_count_validator = PythonOperator(
    task_id = 'postgres_count_validator',
    python_callable = postgres_count_validator
)
validator = PythonOperator(
    task_id = 'validator',
    python_callable = validator,
    op_kwarg = {df_mysql, df_postgres}
)
[mysql_count_validator, postgres_count_validator] >> validator
I tried passing it through XCom, since each task returns only one row and the data is not that big, but still no luck. Is it the way I'm saving the query results that is causing the issue, or am I missing something else?
Thanks in advance!
Ok, so after some trial and error I was able to pass the variable into the 3rd task.
My issue was not calling the pull in the third function:
def validator(**kwargs):
    df_mysql = kwargs['task_instance'].xcom_pull(task_ids='mysql_count_validator')
    df_postgress = kwargs['task_instance'].xcom_pull(task_ids='postgres_count_validator')
    if df_mysql == df_postgress:
        print('Matched')
    else:
        print(f'Not Matched!\nMySQL: {df_mysql}\nPostgres: {df_postgress}')

validator = PythonOperator(
    task_id = 'validator',
    python_callable = validator,
    provide_context = True
)
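A minimal sketch of the same pull-and-compare step, assuming each upstream task returns a plain integer (for example `int(df.iloc[0, 0])`) instead of a whole DataFrame. Comparing two DataFrames with `==` gives an element-wise result whose truth value is ambiguous, so plain numbers are easier to compare. `FakeTaskInstance` is a hypothetical stand-in used only so the snippet runs without Airflow; in a real DAG the task instance comes from the context.

```python
class FakeTaskInstance:
    """Hypothetical stand-in for Airflow's TaskInstance, for illustration only."""

    def __init__(self, xcom_values):
        # Maps task_id -> the value that task returned (its XCom).
        self._xcom_values = xcom_values

    def xcom_pull(self, task_ids):
        return self._xcom_values.get(task_ids)


def validator(**kwargs):
    ti = kwargs['task_instance']
    mysql_count = ti.xcom_pull(task_ids='mysql_count_validator')
    postgres_count = ti.xcom_pull(task_ids='postgres_count_validator')
    if mysql_count is None or postgres_count is None:
        raise ValueError('Missing count from an upstream task')
    if mysql_count != postgres_count:
        # Raising makes the task fail instead of only printing a message.
        raise ValueError(
            f'Not matched! MySQL: {mysql_count}, Postgres: {postgres_count}'
        )
    return True


ti = FakeTaskInstance({'mysql_count_validator': 42, 'postgres_count_validator': 42})
print(validator(task_instance=ti))
```

Raising on a mismatch is a design choice: the backup check then shows up as a failed task rather than a log line.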
So, I am trying to write an Airflow DAG to 1) read a few different CSVs from my local disk, 2) create different PostgreSQL tables, and 3) load the files into their respective tables. When I run the DAG, the second step seems to fail.
Below is the code for the DAG's logic and operators:
AIRFLOW_HOME = os.getenv('AIRFLOW_HOME')

def get_listings_data():
    listings = pd.read_csv(AIRFLOW_HOME + '/dags/data/listings.csv')
    return listings

def get_g01_data():
    demographics = pd.read_csv(AIRFLOW_HOME + '/dags/data/demographics.csv')
    return demographics

def insert_listing_data_func(**kwargs):
    ps_pg_hook = PostgresHook(postgres_conn_id="postgres")
    conn_ps = ps_pg_hook.get_conn()
    ti = kwargs['ti']
    insert_df = pd.DataFrame.listings
    if len(insert_df) > 0:
        col_names = ['host_id', 'host_name', 'host_neighbourhood', 'host_total_listings_count', 'neighbourhood_cleansed', 'property_type', 'price', 'has_availability', 'availability_30']
        values = insert_df[col_names].to_dict('split')
        values = values['data']
        logging.info(values)
        insert_sql = """
            INSERT INTO assignment_2.listings (host_name, host_neighbourhood, host_total_listings_count, neighbourhood_cleansed, property_type, price, has_availability, availability_30)
            VALUES %s
        """
        result = execute_values(conn_ps.cursor(), insert_sql, values, page_size=len(insert_df))
        conn_ps.commit()
    else:
        None
    return None

def insert_demographics_data_func(**kwargs):
    ps_pg_hook = PostgresHook(postgres_conn_id="postgres")
    conn_ps = ps_pg_hook.get_conn()
    ti = kwargs['ti']
    insert_df = pd.DataFrame.demographics
    if len(insert_df) > 0:
        col_names = ['LGA', 'Median_age_persons', 'Median_mortgage_repay_monthly', 'Median_tot_prsnl_inc_weekly', 'Median_rent_weekly', 'Median_tot_fam_inc_weekly', 'Average_num_psns_per_bedroom', 'Median_tot_hhd_inc_weekly', 'Average_household_size']
        values = insert_df[col_names].to_dict('split')
        values = values['data']
        logging.info(values)
        insert_sql = """
            INSERT INTO assignment_2.demographics (LGA,Median_age_persons,Median_mortgage_repay_monthly,Median_tot_prsnl_inc_weekly,Median_rent_weekly,Median_tot_fam_inc_weekly,Average_num_psns_per_bedroom,Median_tot_hhd_inc_weekly,Average_household_size)
            VALUES %s
        """
        result = execute_values(conn_ps.cursor(), insert_sql, values, page_size=len(insert_df))
        conn_ps.commit()
    else:
        None
    return None
And my PostgreSQL operator for the demographics table (just an example) is below:
create_psql_table_demographics = PostgresOperator(
    task_id="create_psql_table_demographics",
    postgres_conn_id="postgres",
    sql="""
        CREATE TABLE IF NOT EXISTS postgres.demographics (
            LGA VARCHAR,
            Median_age_persons INT,
            Median_mortgage_repay_monthly INT,
            Median_tot_prsnl_inc_weekly INT,
            Median_rent_weekly INT,
            Median_tot_fam_inc_weekly INT,
            Average_num_psns_per_bedroom DECIMAL(10,1),
            Median_tot_hhd_inc_weekly INT,
            Average_household_size DECIMAL(10,2)
        );
    """,
    dag=dag)
Am I missing something in my code that stops the completion of that create_psql_table_demographics from running successfully on Airflow?
If your Postgresql database has access to the CSV files,
you may simply use the copy_expert method of the PostgresHook class (cf documentation).
Postgresql is pretty efficient in loading flat files: you'll save a lot of cpu cycles by not involving python (and Pandas!), not to mention the potential encoding issues that you would have to address.
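A rough sketch of what that could look like, assuming the CSV has a header row whose columns match the table. `build_copy_sql` and `load_csv` are hypothetical helper names, and the import path is the provider-package one (older Airflow 1.x releases expose the hook under `airflow.hooks.postgres_hook`). With `FROM STDIN`, the file is read on the Airflow worker and streamed over the connection, so it is the worker that needs access to the file.

```python
def build_copy_sql(table, columns):
    """Build a COPY ... FROM STDIN statement for a CSV file with a header row."""
    column_list = ', '.join(columns)
    return f"COPY {table} ({column_list}) FROM STDIN WITH (FORMAT csv, HEADER true)"


def load_csv(table, columns, csv_path, conn_id='postgres'):
    """Hypothetical helper: bulk-load one CSV file through PostgresHook.copy_expert."""
    # Imported lazily so build_copy_sql can be used without Airflow installed.
    from airflow.providers.postgres.hooks.postgres import PostgresHook

    hook = PostgresHook(postgres_conn_id=conn_id)
    # copy_expert(sql, filename) opens the local file and hands it to COPY.
    hook.copy_expert(build_copy_sql(table, columns), csv_path)


print(build_copy_sql('assignment_2.demographics', ['LGA', 'Median_age_persons']))
```

Inside a `PythonOperator` callable, this would be called as `load_csv('assignment_2.demographics', col_names, AIRFLOW_HOME + '/dags/data/demographics.csv')`.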
Disclaimer: Yes I am well aware this is a mad attempt.
Use case:
I am reading from a config file to run test collections, where each collection comprises a set of test cases with corresponding results and a fixed setup.
Flow (for each test case):
Setup: wipe and setup database with specific test case dataset (glorified SQL file)
load expected test case results from csv
execute collections query/report
compare results.
Sounds good, except the people writing the test cases are more from a tech admin perspective, so the goal is to enable this without writing any python code.
code
Assume these functions exist.
# test_queries.py
def gather_collections(): ...               # yields (collection, query, config)
def gather_cases(collection): ...           # returns [test_case, ...]
def load_collection_stubs(collection): ...  # returns None
def load_case_dataset(test_case): ...       # returns None
def read_case_result_csv(test_case): ...    # returns [csv_result]
def execute(query): ...                     # returns [query_result]

class TestQueries(unittest.TestCase):
    def setup_method(self, method):
        collection = self._item.name.replace('test_', '')
        load_collection_stubs(collection)
# conftest.py
import pytest

@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_protocol(item, nextitem):
    item.cls._item = item
    yield
Example Data
Collection stubs / data (setting up of environment)
-- stubs/test_setup_log.sql
DROP DATABASE IF EXISTS `test`;
CREATE DATABASE `test`;
USE test;
CREATE TABLE log (`id` int(9) NOT NULL AUTO_INCREMENT, `timestamp` datetime NOT NULL DEFAULT NOW(), `username` varchar(100) NOT NULL, `message` varchar(500));
Query to test
-- queries/count.sql
SELECT count(*) as `log_count` from test.log where username = 'unicorn';
Test case 1 input data
-- test_case_1.sql
INSERT INTO log (`id`, `timestamp`, `username`, `message`)
VALUES
(1,'2020-12-18T11:23.01Z', 'unicorn', 'user logged in'),
(2,'2020-12-18T11:23.02Z', 'halsey', 'user logged off'),
(3,'2020-12-18T11:23.04Z', 'unicorn', 'user navigated to home')
Test case 1 expected result
test_case_1.csv
log_count
2
Attempt 1
for collection, query, config in gather_collections():
    test_method_name = 'test_{}'.format(collection)
    LOGGER.debug("collections.{}.test - {}".format(collection, config))
    cases = gather_cases(collection)
    LOGGER.debug("collections.{}.cases - {}".format(collection, cases))
    setattr(
        TestQueries,
        test_method_name,
        pytest.mark.parametrize(
            'case_name',
            cases,
            ids=cases
        )(
            lambda self, case_name: (
                load_case_dataset(case_name),
                self.assertEqual(execute(query, case_name), read_case_result_csv(case_name))
            )
        )
    )
Attempt 2
for collection, query, config in gather_collections():
    test_method_name = 'test_{}'.format(collection)
    LOGGER.debug("collections.{}.test - {}".format(collection, config))
    setattr(
        TestQueries,
        test_method_name,
        lambda self, case_name: (
            load_case_dataset(case_name),
            self.assertEqual(execute(query, case_name), read_case_result_csv(case_name))
        )
    )

def pytest_generate_tests(metafunc):
    collection = metafunc.function.__name__.replace('test_', '')
    # FIXME logs and id setting not working
    cases = gather_cases(collection)
    LOGGER.info("collections.{}.pytest.cases - {}".format(collection, cases))
    metafunc.parametrize(
        'case_name',
        cases,
        ids=cases
    )
So I figured it out, but it's not the most elegant solution.
Essentially, you use one test function and then use some of pytest's hooks to change the function names for reporting.
There are numerous issues, e.g. if you don't use pytest.param to pass the parameters to parametrize, the required information is not available.
Also, the method passed to setup_method is not aware of which iteration is being run when it's called, so I had to hack that in with the iter counter.
# test_queries.py
def gather_tests():
    global TESTS
    for test_collection_name in TESTS.keys():
        LOGGER.debug("collections.{}.gather - {}".format(test_collection_name, TESTS[test_collection_name]))
        query = path.join(SRC_DIR, TESTS[test_collection_name]['query'])
        cases_dir = TESTS[test_collection_name]['cases']
        result_sets = path.join(TEST_DIR, cases_dir, '*.csv')
        for case_result_csv in glob.glob(result_sets):
            test_case_name = path.splitext(path.basename(case_result_csv))[0]
            yield test_case_name, query, test_collection_name, TESTS[test_collection_name]

class TestQueries():
    iter = 0

    def setup_method(self, method):
        method_name = method.__name__  # or self._item.originalname
        global TESTS
        if method_name == 'test_scripts_reports':
            _mark = next((m for m in method.pytestmark if m.name == 'parametrize' and 'collection_name' in m.args[0]), None)
            if not _mark:
                raise Exception('test {} missing collection_name parametrization'.format(method_name))  # nothing to do here
            _args = _mark.args[0]
            _params = _mark.args[1]
            LOGGER.debug('setup_method: _params - {}'.format(_params))
            if not _params:
                raise Exception('test {} missing pytest.params'.format(method_name))  # nothing to do here
            _currparams = _params[self.iter]
            self.iter += 1
            _argpos = [arg.strip() for arg in _args.split(',')].index('collection_name')
            collection = _currparams.values[_argpos]
            LOGGER.debug('collections.{}.setup_method[{}] - {}'.format(collection, self.iter, _currparams))
            load_collection_stubs(collection)

    @pytest.mark.parametrize(
        'case_name, collection_query, collection_name, collection_config',
        [pytest.param(*c, id='{}:{}'.format(c[2], c[0])) for c in gather_tests()]
    )
    def test_scripts_reports(self, case_name, collection_query, collection_name, collection_config):
        if not path.isfile(collection_query):
            pytest.skip("report query does not exist: {}".format(collection_query))
        LOGGER.debug("test_scripts_reports.{}.{} - ".format(collection_name, case_name))
        load_case_dataset(case_name)
        assert execute(collection_query, case_name) == read_case_result_csv(case_name)
Then to make the test ids more human you can do this
# conftest.py
import re

def pytest_collection_modifyitems(items):
    # https://stackoverflow.com/questions/61317809/pytest-dynamically-generating-test-name-during-runtime
    for item in items:
        if item.originalname == 'test_scripts_reports':
            item._nodeid = re.sub(r'::\w+::\w+\[', '[', item.nodeid)
The result, with the following files:
stubs/
    00-wipe-db.sql
    setup-db.sql
queries/
    report1.sql
collection/
    report1/
        case1.sql
        case1.csv
        case2.sql
        case2.csv
# results (with setup_method firing before each test and loading the appropriate stubs as per configuration)
FAILED test_queries.py[report1:case1]
FAILED test_queries.py[report1:case2]
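The node id rewrite from conftest.py can be checked in isolation. The sketch below applies the same regular expression to a sample node id; the sample string is an assumption about what pytest would generate for the class and method names used above, and `humanize_nodeid` is a hypothetical helper name.

```python
import re


def humanize_nodeid(nodeid):
    """Drop the '::Class::method' part that sits in front of the parameter id."""
    return re.sub(r'::\w+::\w+\[', '[', nodeid)


# Assumed shape of the default node id for a parametrized method on a class.
sample = 'test_queries.py::TestQueries::test_scripts_reports[report1:case1]'
print(humanize_nodeid(sample))
```

Node ids that do not contain a class, a method, and a parameter id are left unchanged, since the pattern requires all three.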
I use Flask, an API, and SQLAlchemy with SQLite.
I am a beginner in Python and Flask, and I have a problem with a list.
My application works; now I am trying to add a new function.
I need to know whether the information from my JSON is already in my database.
The function find_current_project_team() gets the information from the API.
def find_current_project_team():
    headers = {"Authorization": "bearer " + session['token_info']['access_token']}
    user = requests.get("https://my.api.com/users/xxxx/", headers=headers)
    user = user.json()
    ids = [x['id'] for x in user]
    return ids
I use ids = [x['id'] for x in user], which is the same as:
ids = []
for x in user:
    ids.append(x['id'])
This gives me the ids, which are the ids from the API that I need.
I get this result:
[2766233, 2766237, 2766256]
I want to check the values one by one against my database.
If a value doesn't exist, I want to add it.
If one or all of the values already exist, I want to return "impossible sorry, the ids already exists".
For that I wrote a new function:
def test():
    test = find_current_project_team()
    for find_team in test:
        find_team_db = User.query.filter_by(
            login=session['login'], project_session=test
        ).first()
I have absolutely no idea how to check the values one by one.
If someone can help me, thank you :)
Currently I get this error:
sqlalchemy.exc.InterfaceError: (InterfaceError) Error binding
parameter 1 - probably unsupported type. 'SELECT user.id AS user_id,
user.login AS user_login, user.project_session AS user_project_session
\nFROM user \nWHERE user.login = ? AND user.project_session = ?\n
LIMIT ? OFFSET ?' ('my_tab_login', [2766233, 2766237, 2766256], 1, 0)
It looks to me like you are passing the list directly into the database query:
def test():
    test = find_current_project_team()
    for find_team in test:
        find_team_db = User.query.filter_by(login=session['login'], project_session=test).first()
Instead, you should pass in the ID only:
def test():
    test = find_current_project_team()
    for find_team in test:
        find_team_db = User.query.filter_by(login=session['login'], project_session=find_team).first()
Aside from that, I think the naming could be clearer:
def test():
    project_teams = find_current_project_team()
    for project_team in project_teams:
        project_team_result = User.query.filter_by(login=session['login'], project_session=project_team).first()
Everything works now, thanks.
My code:
project_teams = find_current_project_team()
for project_team in project_teams:
    project_team_result = User.query.filter_by(project_session=project_team).first()
    print(project_team_result)
    if project_team_result is not None:
        print("not none")
    else:
        project_team_result = User(login=session['login'], project_session=project_team)
        db.session.add(project_team_result)
        db.session.commit()
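For comparison, here is a self-contained sketch of the same check-then-insert loop. It deliberately uses the standard library's sqlite3 module instead of Flask-SQLAlchemy so that it runs on its own; the table layout is an assumption based on the columns in the error message, and `add_missing_ids` is a hypothetical helper name.

```python
import sqlite3


def add_missing_ids(conn, login, ids):
    """Insert each id that is not stored yet; return (added, already_present)."""
    added, already_present = [], []
    for project_id in ids:
        # Bind one scalar per query; binding the whole list is what raised the error.
        row = conn.execute(
            "SELECT 1 FROM user WHERE project_session = ?", (project_id,)
        ).fetchone()
        if row is None:
            conn.execute(
                "INSERT INTO user (login, project_session) VALUES (?, ?)",
                (login, project_id),
            )
            added.append(project_id)
        else:
            already_present.append(project_id)
    conn.commit()
    return added, already_present


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user (id INTEGER PRIMARY KEY, login TEXT, project_session INTEGER)"
)
conn.execute("INSERT INTO user (login, project_session) VALUES ('my_tab_login', 2766233)")
result = add_missing_ids(conn, "my_tab_login", [2766233, 2766237, 2766256])
print(result)  # ([2766237, 2766256], [2766233])
```

Returning both lists lets the caller decide whether to show the "ids already exist" message for the ones that were skipped.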
I made a custom airflow operator, this operator takes an input and the output of this operator is on XCOM.
What I want to achieve is to call the operator with some defined input, parse the output as Python callable inside the Branch Operator and then pass the parsed output to another task that calls the same operator tree:
CustomOperator_Task1 = CustomOperator(
    data={
        'type': 'custom',
        'date': '2017-11-12'
    },
    task_id='CustomOperator_Task1',
    dag=dag)

data = {}

def checkOutput(**kwargs):
    result = kwargs['ti'].xcom_pull(task_ids='CustomOperator_Task1')
    if result.success == True:
        data = result.data
        return "CustomOperator_Task2"
    return "Failure"

BranchOperator_Task = BranchPythonOperator(
    task_id='BranchOperator_Task',
    dag=dag,
    python_callable=checkOutput,
    provide_context=True,
    trigger_rule="all_done")

CustomOperator_Task2 = CustomOperator(
    data=data,
    task_id='CustomOperator_Task2',
    dag=dag)

CustomOperator_Task1 >> BranchOperator_Task >> CustomOperator_Task2
In task CustomOperator_Task2 I would want to pass the parsed data from BranchOperator_Task. Right now it is always empty {}
What is the best way to do that?
I see your issue now. Setting the data variable like you are won't work because of how Airflow works. An entirely different process will be running the next task, so it won't have the context of what data was set to.
Instead, BranchOperator_Task has to push the parsed output into another XCom so CustomOperator_Task2 can explicitly fetch it.
def checkOutput(**kwargs):
    ti = kwargs['ti']
    result = ti.xcom_pull(task_ids='CustomOperator_Task1')
    if result.success:
        # Push the parsed output itself, not the module-level `data` variable.
        ti.xcom_push(key='data', value=result.data)
        return "CustomOperator_Task2"
    return "Failure"
BranchOperator_Task = BranchPythonOperator(
    ...)

CustomOperator_Task2 = CustomOperator(
    data_xcom_task_id=BranchOperator_Task.task_id,
    data_xcom_key='data',
    task_id='CustomOperator_Task2',
    dag=dag)
Then your operator might look something like this.
class CustomOperator(BaseOperator):
    @apply_defaults
    def __init__(self, data_xcom_task_id, data_xcom_key, *args, **kwargs):
        # Let BaseOperator handle task_id, dag, and the other standard arguments.
        super().__init__(*args, **kwargs)
        self.data_xcom_task_id = data_xcom_task_id
        self.data_xcom_key = data_xcom_key

    def execute(self, context):
        data = context['ti'].xcom_pull(task_ids=self.data_xcom_task_id, key=self.data_xcom_key)
        ...
Parameters may not be required if you just want to hardcode them. It depends on your use case.
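The push-then-pull hand-off can be tried without a scheduler. The sketch below assumes the first task's result is a dict with `success` and `data` keys (the question uses attribute access instead), and `FakeTaskInstance` is a hypothetical in-memory stand-in for Airflow's task instance, keyed by task id and XCom key.

```python
class FakeTaskInstance:
    """Hypothetical in-memory stand-in for Airflow's TaskInstance."""

    def __init__(self, task_id, store):
        self.task_id = task_id
        self._store = store  # shared dict keyed by (task_id, key)

    def xcom_push(self, key, value):
        self._store[(self.task_id, key)] = value

    def xcom_pull(self, task_ids, key='return_value'):
        # 'return_value' is the key Airflow uses for a task's returned value.
        return self._store.get((task_ids, key))


def check_output(**kwargs):
    ti = kwargs['ti']
    result = ti.xcom_pull(task_ids='CustomOperator_Task1')
    if result and result.get('success'):
        ti.xcom_push(key='data', value=result['data'])
        return 'CustomOperator_Task2'
    return 'Failure'


store = {
    ('CustomOperator_Task1', 'return_value'): {
        'success': True,
        'data': {'type': 'custom', 'date': '2017-11-12'},
    }
}
branch = check_output(ti=FakeTaskInstance('BranchOperator_Task', store))
downstream = FakeTaskInstance('CustomOperator_Task2', store)
pulled = downstream.xcom_pull(task_ids='BranchOperator_Task', key='data')
print(branch, pulled)
```

The downstream task reads from the shared store rather than from a Python variable, which is why the value survives even though each task runs in its own process.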
As your comment suggests, the return value from your custom operator is None, therefore your xcom_pull should expect to be empty.
Please use xcom_push explicitly, as the default behavior of airflow could change over time.
Many times I find myself writing code similar to:
query = MyModel.objects.all()
if request.GET.get('filter_by_field1'):
    query = query.filter(field1=True)
if request.GET.get('filter_by_field2'):
    query = query.filter(field2=False)
field3_filter = request.GET.get('field3')
if field3_filter is not None:
    query = query.filter(field3=field3_filter)
field4_filter = request.GET.get('field4')
if field4_filter:
    query = query.filter(field4=field4_filter)
# etc...
return query
Is there a better, more generic way of building queries such as the one above?
If the only things that are ever going to be in request GET are potential query arguments, you could do this:
query = MyModel.objects.filter(**request.GET)
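Passing the query string straight through lets a client filter on any field name it sends, so one alternative is to map only known parameters to filter arguments. This is a minimal sketch under that assumption; `build_filter_kwargs` and `ALLOWED` are hypothetical names, and a plain dict stands in for `request.GET` so the snippet runs without Django. Unlike the `is not None` check in the question, it also skips empty values.

```python
def build_filter_kwargs(params, allowed):
    """Map query-string parameters to ORM filter kwargs, ignoring unknown keys.

    `allowed` maps a query-string name to a (field lookup, converter) pair.
    """
    filter_kwargs = {}
    for name, (lookup, convert) in allowed.items():
        raw = params.get(name)
        if raw is None or raw == '':
            continue  # parameter absent or empty: do not filter on it
        filter_kwargs[lookup] = convert(raw)
    return filter_kwargs


ALLOWED = {
    'filter_by_field1': ('field1', lambda _: True),
    'filter_by_field2': ('field2', lambda _: False),
    'field3': ('field3', str),
    'field4': ('field4', str),
}

kwargs = build_filter_kwargs(
    {'filter_by_field1': '1', 'field3': 'abc', 'unknown': 'x'}, ALLOWED
)
print(kwargs)  # {'field1': True, 'field3': 'abc'}
```

In a view this would be used as `MyModel.objects.filter(**build_filter_kwargs(request.GET, ALLOWED))`.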