I'm doing ETL with an Airflow PythonOperator to update an SCD1 dimension table (dim_user).
The structure of the MySQL dimension table:
| user_key | open_id             | gender | nickname | mobile       | load_time           | updated_at          |
|----------|---------------------|--------|----------|--------------|---------------------|---------------------|
| 117      | ohwv90JTgZSn******* | 2      | ABC      | ************ | 2019-05-24 10:12:44 | 2019-05-23 19:00:43 |
In the Python script I have a pandas DataFrame, df_users_updated, with the same structure (except the user_key and load_time columns).
Now I want to update the MySQL table wherever the open_id field matches:
# database connection
conn = create_engine(db_conn_str)
# update the rows with a for loop
for index, row in df_users_updated.iterrows():
    info = dict(row)
    conn.execute(
        'update dim_user set gender=%s, nickname=%s, mobile=%s, updated_at=%s where open_id=%s',
        (info['gender'], info['nickname'], info['mobile'], info['updated_at'], info['open_id'])
    )
conn.dispose()
The problem is that df_users_updated only has 1000 rows, yet it took over 10 minutes to execute these update queries.
Is there a better way to do this?
Based on my experience, there are a few tricks that can improve performance:
- use the mysqlclient lib and its cursor.executemany(sql, params) method;
- pass the params as tuples;
- use an index on the WHERE fields (see the sketch below).
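A minimal sketch of the batched version, assuming a mysqlclient connection (the connection details are hypothetical placeholders for whatever db_conn_str encodes; the table and columns are from the question):

import MySQLdb  # the mysqlclient package

# Hypothetical connection details -- adapt them to your db_conn_str.
conn = MySQLdb.connect(host="localhost", user="etl", passwd="...", db="dw")
cur = conn.cursor()

# One params tuple per row, in the same order as the %s placeholders.
params = [
    (row.gender, row.nickname, row.mobile, row.updated_at, row.open_id)
    for row in df_users_updated.itertuples(index=False)
]

cur.executemany(
    "UPDATE dim_user SET gender=%s, nickname=%s, mobile=%s, updated_at=%s"
    " WHERE open_id=%s",
    params,
)
conn.commit()  # all 1000 updates are committed in one transaction
cur.close()
conn.close()

An index on the match column (for example, CREATE INDEX idx_open_id ON dim_user (open_id);) keeps each UPDATE from scanning the whole table.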
I am trying to store some tables I create in my code in an RDS instance using psycopg2. The script runs without issue and I can see the table being stored correctly in the DB. However, if I try to retrieve the query, I only see the columns, but no data:
import pandas as pd
import psycopg2
test=pd.DataFrame({'A':[1,1],'B':[2,2]})
#connect is a function to connect to the RDS instance
connection= connect()
cursor=connection.cursor()
query='CREATE TABLE test (A varchar NOT NULL,B varchar NOT NULL);'
cursor.execute(query)
connection.commit()
cursor.close()
connection.close()
This script runs without issues, and printing out file_check from the following script:
connection=connect()
# check if file already exists in SQL
sql = """
SELECT "table_name","column_name", "data_type", "table_schema"
FROM INFORMATION_SCHEMA.COLUMNS
WHERE "table_schema" = 'public'
ORDER BY table_name
"""
file_check=pd.read_sql(sql, con=connection)
connection.close()
I get:
table_name column_name data_type table_schema
0 test a character varying public
1 test b character varying public
which looks good.
Running the following however:
read='select * from public.test'
df=pd.read_sql(read,con=connection)
returns:
Empty DataFrame
Columns: [a, b]
Index: []
Does anybody have any idea why this is happening? I cannot seem to get around it.
Erm, your first script creates a test dataframe, but it's never referred to after it's defined.
You'll need to
test.to_sql("test", connection)
or similar to actually write it.
A minimal example:
$ createdb so63284022
$ python
>>> import sqlalchemy as sa
>>> import pandas as pd
>>> test = pd.DataFrame({'A':[1,1],'B':[2,2], 'C': ['yes', 'hello']})
>>> engine = sa.create_engine("postgresql://localhost/so63284022")
>>> with engine.connect() as connection:
... test.to_sql("test", connection)
...
>>>
$ psql so63284022
so63284022=# select * from test;
 index | A | B |   C
-------+---+---+-------
     0 | 1 | 2 | yes
     1 | 1 | 2 | hello
(2 rows)
so63284022=# \d+ test
Table "public.test"
 Column |  Type  | Collation | Nullable | Default | Storage  | Stats target | Description
--------+--------+-----------+----------+---------+----------+--------------+-------------
 index  | bigint |           |          |         | plain    |              |
 A      | bigint |           |          |         | plain    |              |
 B      | bigint |           |          |         | plain    |              |
 C      | text   |           |          |         | extended |              |
Indexes:
"ix_test_index" btree (index)
Access method: heap
so63284022=#
I was able to solve this:
As it was pointed out by @AKX, I was only creating the table structure, but I was not filling in the table.
I now import psycopg2.extras as well and, after this:
query='CREATE TABLE test (A varchar NOT NULL,B varchar NOT NULL);'
cursor.execute(query)
I add something like:
update_query = 'INSERT INTO test(A, B) VALUES (%s, %s) ON CONFLICT DO NOTHING'
# execute_batch groups the inserts into far fewer round trips than row-by-row execute()
psycopg2.extras.execute_batch(cursor, update_query, test.values)
connection.commit()  # without the commit, the inserted rows are rolled back on close
cursor.close()
connection.close()
My table is now correctly filled after checking with pd.read_sql.
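For instance, the read-back check can be as simple as (with a fresh connection, as in the earlier snippets):

read = 'select * from public.test'
df = pd.read_sql(read, con=connection)
print(df)  # now shows the two inserted rows instead of an empty frame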
I have a table which has columns named measured_time, data_type and value.
In data_type, there are two types: temperature and humidity.
I want to combine two rows into one if they have the same measured_time, using the Django ORM.
I am using MariaDB.
Using raw SQL, the following query does what I want:
SELECT T1.measured_time, T1.temperature, T2.humidity
FROM ( SELECT CASE WHEN data_type = 1 then value END as temperature,
CASE WHEN data_type = 2 then value END as humidity ,
measured_time FROM data_table) as T1,
( SELECT CASE WHEN data_type = 1 then value END as temperature ,
CASE WHEN data_type = 2 then value END as humidity ,
measured_time FROM data_table) as T2
WHERE T1.measured_time = T2.measured_time and
T1.temperature IS NOT null and T2.humidity IS NOT null and
DATE(T1.measured_time) = '2019-07-01'
Original Table
| measured_time | data_type | value |
|---------------------|-----------|-------|
| 2019-07-01-17:27:03 | 1 | 25.24 |
| 2019-07-01-17:27:03 | 2 | 33.22 |
Expected Result
| measured_time       | temperature | humidity |
|---------------------|-------------|----------|
| 2019-07-01-17:27:03 | 25.24       | 33.22    |
I've never used it and so can't answer in detail, but you can feed a raw SQL query into Django and get the results back through the ORM. Since you already have the SQL, this may be the easiest way to proceed; see the Django documentation on performing raw SQL queries.
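A minimal sketch of that route, reusing the query from the question through a plain database cursor (Model.objects.raw() would need the primary key in the result set, which this pivot does not return; the date is passed as a parameter):

from django.db import connection

pivot_sql = """
    SELECT T1.measured_time, T1.temperature, T2.humidity
    FROM (SELECT CASE WHEN data_type = 1 THEN value END AS temperature,
                 CASE WHEN data_type = 2 THEN value END AS humidity,
                 measured_time
          FROM data_table) AS T1,
         (SELECT CASE WHEN data_type = 1 THEN value END AS temperature,
                 CASE WHEN data_type = 2 THEN value END AS humidity,
                 measured_time
          FROM data_table) AS T2
    WHERE T1.measured_time = T2.measured_time
      AND T1.temperature IS NOT NULL AND T2.humidity IS NOT NULL
      AND DATE(T1.measured_time) = %s
"""

with connection.cursor() as cursor:
    cursor.execute(pivot_sql, ['2019-07-01'])
    rows = cursor.fetchall()  # [(measured_time, temperature, humidity), ...]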
Consider a table users of 6 rows:
+--------+---------+
| userid | name    |
+--------+---------+
| 1      | john    |
| 2      | steve   |
| 3      | joe     |
| 4      | jason   |
| 5      | abraham |
| 6      | leonard |
+--------+---------+
I am using the below SQL query:
SELECT userid,name FROM users where userid IN (2,3,4,5);
which returns 4 rows:
| 2 | steve   |
| 3 | joe     |
| 4 | jason   |
| 5 | abraham |
The equivalent pymysql code is as below:
def get_username(user_ids):
    data = []
    conn = init_db()
    cur = conn.cursor(pymysql.cursors.DictCursor)
    cur.executemany("SELECT userid,name from users WHERE userid IN (%s)", user_ids)
    rows = cur.fetchall()
    for row in rows:
        data.append([row['userid'], row['name']])
    cur.close()
    conn.close()
    return data

user_ids = [2, 3, 4, 5]
get_username(user_ids)
This code just returns the last row, [[5, 'abraham']]. How can I fetch all of those rows?
That's the (partly documented) behaviour of .executemany():
Help on method executemany in module pymysql.cursors:

executemany(self, query, args) method of pymysql.cursors.Cursor instance
    Run several data against one query

    :param query: query to execute on server
    :param args: Sequence of sequences or mappings. It is used as parameter.
    :return: Number of rows affected, if any.

    This method improves performance on multiple-row INSERT and
    REPLACE. Otherwise it is equivalent to looping over args with
    execute().
So what you want here is cursor.execute() - but then, you have a bit more work to build your SQL query:
user_ids = (2, 3, 4, 5)
placeholders = ", ".join(["%s"] * len(user_ids))
sql = "SELECT userid,name from users WHERE userid IN ({})".format(placeholders)
cursor.execute(sql, user_ids)
data = list(cursor)
Note that cursors are iterables, so you don't need to explicitly call cursor.fetchall() and then iterate on the result; you can iterate directly on the cursor. Also note that if you want a list of (id, name) tuples, using a DictCursor is just a double waste of CPU cycles (once for building the dicts and once for rebuilding tuples out of them); you could just use a default cursor and return list(cursor) instead, as in the sketch below.
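Putting that together, a minimal sketch of the whole function with a default cursor (init_db as in the question):

def get_username(user_ids):
    conn = init_db()
    try:
        cur = conn.cursor()  # default cursor yields plain tuples
        placeholders = ", ".join(["%s"] * len(user_ids))
        sql = "SELECT userid, name FROM users WHERE userid IN ({})".format(placeholders)
        cur.execute(sql, user_ids)
        return list(cur)  # [(2, 'steve'), (3, 'joe'), ...]
    finally:
        conn.close()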
My first guess is that it's something related to the SELECT statement.
Could you try this way of generating the query?
def get_username(user_ids):
    data = []
    conn = init_db()
    cur = conn.cursor(pymysql.cursors.DictCursor)
    # build the IN list by string interpolation, so a plain execute() is enough
    cur.execute("SELECT userid,name from users WHERE userid IN (" + ','.join(str(e) for e in user_ids) + ")")
    rows = cur.fetchall()
    for row in rows:
        data.append([row['userid'], row['name']])
    cur.close()
    conn.close()
    return data

user_ids = [2, 3, 4, 5]
get_username(user_ids)
Can the following MySQL query be done with a single SQLAlchemy session.query, or do I have to run a second session.query? If so, how?
SELECT *, (SELECT c FROM table2 WHERE id = table1.id) AS d FROM table1 WHERE foo = x
What you want is SQLAlchemy's subquery object. Essentially, you write a query as normal, but instead of ending the query with .all() or .first() (as you would normally do to return some kind of result directly), you end your query with .subquery() to return a subquery object. The subquery object basically generates the subquery SQL embedded within an alias, but doesn't run it. You can then use it in your primary query, and SQLAlchemy will issue the necessary SQL to perform the query and subquery in a single operation.
Let's say we had the following student_scores table:
+------------+-------+-----+
| name | score | age |
+------------+-------+-----+
| Xu Feng | 95 | 25 |
| John Smith | 88 | 26 |
| Sarah Taft | 89 | 25 |
| Ahmed Zaki | 86 | 26 |
+------------+-------+-----+
(Ignore the horrible database design)
In this example, we want to get a result set containing all the students and their scores, joined to the average score by age. In raw SQL we would do something like this:
SELECT ss.name, ss.age, ss.score, sub.average
FROM student_scores AS "ss"
JOIN ( SELECT age, AVG(score) AS "average"
FROM student_scores
GROUP BY age) AS "sub"
ON ss.age = sub.age
ORDER BY ss.score DESC
The result should be something like this:
+------------+-------+-----+---------+
| name | score | age | average |
+------------+-------+-----+---------+
| Xu Feng | 95 | 25 | 92 |
| John Smith | 88 | 26 | 87 |
| Sarah Taft | 89 | 25 | 92 |
| Ahmed Zaki | 86 | 26 | 87 |
+------------+-------+-----+---------+
In SQLAlchemy, we can first define the subquery on its own:
from sqlalchemy.sql import func
avg_scores = (
    session.query(
        func.avg(StudentScores.score).label('average'),
        StudentScores.age
    )
    .group_by(StudentScores.age)
    .subquery()
)
Now our subquery is defined, but no statements have actually been sent to the database. Nevertheless we can treat our subquery object almost as though it were just another table, and write our main query:
results = (
    session.query(StudentScores, avg_scores)
    .join(avg_scores, StudentScores.age == avg_scores.c.age)
    .order_by(StudentScores.score.desc())
    .all()
)
Only now is any SQL issued to the database, and we get the same results as the raw subquery example.
Having said that, the example you provided is actually pretty trivial and shouldn't require a subquery at all. Depending on how your relationships are defined, SQLAlchemy can eagerly load related objects, so that the object returned by:
results = session.query(Table1).filter(Table1.foo == 'x').all()
will have access to the child (or parent) record(s) from Table2, even though we didn't ask for it here - because the relationship defined directly in the models is handling that for us. Check out "Relationship Loading Techniques" in the SQLAlchemy docs for more information on how this works.
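For illustration, a minimal sketch, assuming hypothetical Table1/Table2 models with a relationship attribute named table2 between them:

from sqlalchemy.orm import joinedload

# Eager-load the related Table2 rows in the same SELECT, so accessing
# .table2 on each result issues no extra queries.
results = (
    session.query(Table1)
    .options(joinedload(Table1.table2))
    .filter(Table1.foo == 'x')
    .all()
)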
I have a table defined like so:
   Column    |  Type   | Modifiers | Storage | Stats target | Description
-------------+---------+-----------+---------+--------------+-------------
 id          | uuid    | not null  | plain   |              |
 user_id     | uuid    |           | plain   |              |
 area_id     | integer |           | plain   |              |
 vote_amount | integer |           | plain   |              |
I want to be able to generate a rank 'column' when I query this table. This rank column would be ordered by the vote_amount column. I have attempted to create a query to do this; it looks like so:
subq_rank = (
    db.session.query(user_stories)
    .add_columns(
        db.func.rank()
        .over(partition_by=user_stories.user_id, order_by=user_stories.vote_amount)
        .label('rank')
    )
    .subquery('slr')
)
data = (
    db.session.query(user_stories)
    .select_entity_from(subq_rank)
    .filter(user_stories.area_id == id)
    .group_by(-subq_rank.c.rank)
    .limit(50)
    .all()
)
Hopefully my attempt will give you an idea of what I am trying to achieve.
Thanks.
Well, if you need these columns in every query, I would rather do it in the DB. I would create a view which contains the rank column, and in the code I would query this view to show the data directly:
CREATE VIEW [ranking_user_stories] AS
SELECT TOP 50 * FROM
(SELECT *, rank() over (partition by user_stories.user_id order by user_stories.vote_amount ASC) AS ranking
FROM user_stories
WHERE user_stories.area_id = id) uS
ORDER BY vote_amount ASC
It's the same logic as your code but in SQL. If you are using MySQL, just change TOP 50 to LIMIT 50 (and put it at the end of the query). I don't see the point of the final GROUP BY on ranking, but if you need it:
CREATE VIEW [ranking_user_stories] AS
SELECT TOP 50 MAX(id) AS id, user_id, area_id, MAX(vote_amount) AS vote_amount, ranking FROM
    (SELECT *, rank() over (partition by user_stories.user_id order by user_stories.vote_amount ASC) AS ranking
     FROM user_stories
     WHERE user_stories.area_id = id) uS
GROUP BY user_id, area_id, ranking
ORDER BY MAX(vote_amount) ASC
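Once the view exists, reading it from (Flask-)SQLAlchemy is a plain SELECT; a minimal sketch, assuming the view name above:

from sqlalchemy import text

# The view already applies the window function, the filter and the limit,
# so the Python side only fetches the rows back.
rows = db.session.execute(text("SELECT * FROM ranking_user_stories")).fetchall()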