I've built a Web UI to serve as an ETL application that lets users select CSV and TSV files containing large numbers of records, which I then insert into a PostgreSQL database. As has already been well commented on, this process is kind of slow. After some research it looked like the UNNEST function would be my answer, but I'm having trouble implementing it. Honestly, I just didn't find the kind of great walk-through tutorial I normally do when researching data processing in Python.
Here's the SQL string as I store it (to be used in functions later):
salesorder_write = """
INSERT INTO api.salesorder (
site,
sale_type,
sales_rep,
customer_number,
shipto_number,
cust_po_number,
fob,
order_number
) VALUES (
UNNEST(ARRAY %s)
"""
I use this string along with a list of tuples like so:
tup_list = []
for order in orders:
    inputs = (
        order['site'],
        order['sale_type'],
        order['sales_rep'],
        order['customer_number'],
        order['shipto_number'],
        order['cust_po_number'],
        order['fob'],
        order['order_number']
    )
    tup_list.append(inputs)
cur.execute(salesorder_write, tup_list)
This gives me the error "not all arguments converted during string formatting". My first question is: how do I need to structure my SQL to be able to pass in my list of tuples? My second is: can I use the existing dictionary structure in much the same way?
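For reference, a minimal sketch (untested against this schema) of how an unnest-based insert is usually structured, relying on psycopg2 adapting Python lists to PostgreSQL arrays:
# Sketch only: unnest takes one array per column, so the data is passed
# column-wise; explicit casts (e.g. %(site)s::text[]) may be needed for
# some column types.
cols = ('site', 'sale_type', 'sales_rep', 'customer_number',
        'shipto_number', 'cust_po_number', 'fob', 'order_number')

salesorder_write = """
    INSERT INTO api.salesorder (
        site, sale_type, sales_rep, customer_number,
        shipto_number, cust_po_number, fob, order_number
    )
    SELECT * FROM UNNEST (
        %(site)s, %(sale_type)s, %(sales_rep)s, %(customer_number)s,
        %(shipto_number)s, %(cust_po_number)s, %(fob)s, %(order_number)s
    )
"""

column_lists = {col: [order[col] for order in orders] for col in cols}
cur.execute(salesorder_write, column_lists)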
unnest is not superior to execute_values, which has been the canonical approach since psycopg2 2.7:
from psycopg2.extras import execute_values

orders = [
    dict(
        site='x',
        sale_type='y',
        sales_rep='z',
        customer_number=1,
        shipto_number=2,
        cust_po_number=3,
        fob=4,
        order_number=5
    )
]

salesorder_write = """
    insert into t (
        site,
        sale_type,
        sales_rep,
        customer_number,
        shipto_number,
        cust_po_number,
        fob,
        order_number
    ) values %s
"""

execute_values(
    cursor,
    salesorder_write,
    orders,
    template="""(
        %(site)s,
        %(sale_type)s,
        %(sales_rep)s,
        %(customer_number)s,
        %(shipto_number)s,
        %(cust_po_number)s,
        %(fob)s,
        %(order_number)s
    )""",
    page_size=1000
)
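For what it's worth, the template argument is only needed because each row is a dict; if the rows are plain tuples in column order, execute_values can fall back on its default template. A sketch, reusing the orders list above:
# Sketch: convert the dicts to tuples in column order; no template is needed then.
columns = ('site', 'sale_type', 'sales_rep', 'customer_number',
           'shipto_number', 'cust_po_number', 'fob', 'order_number')
rows = [tuple(order[c] for c in columns) for order in orders]

execute_values(cursor, salesorder_write, rows, page_size=1000)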
I am using table merging in order to select items from my db against a list of parameter tuples. The query works fine, but cur.fetchall() does not return the entire table that I want.
For example:
data = (
    (1, '2020-11-19'),
    (1, '2020-11-20'),
    (1, '2020-11-21'),
    (2, '2020-11-19'),
    (2, '2020-11-20'),
    (2, '2020-11-21')
)
query = """
with data(song_id, date) as (
values %s
)
select t.*
from my_table t
join data d
on t.song_id = d.song_id and t.date = d.date::date
"""
execute_values(cursor, query, data)
results = cursor.fetchall()
In practice, my list of tuples to check against is thousands of rows long, and I expect the response to also be thousands of rows long.
But I am only getting 5 rows back if I call cur.fetchall() at the end of this request.
I know that this is because execute_values batches the requests, but there is some strange behavior.
If I pass page_size=10 then I get 2 items back. And if I set fetch=True then I get no results at all (even though the rowcount does not match that).
My thought was to batch these requests, but the page_size for the batch is not matching the number of items that I'm expecting per batch.
How should I change this request so that I can get all the results I'm expecting?
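As an aside, this is consistent with how execute_values pages the statement: without fetch=True, cursor.fetchall() only sees the rows from the final page, and with fetch=True the accumulated rows are returned by execute_values itself, so a later cursor.fetchall() has nothing left to fetch. A minimal sketch of the direct fix:
# Sketch: let execute_values accumulate the rows from every page and return them.
from psycopg2.extras import execute_values

results = execute_values(cursor, query, data, fetch=True, page_size=1000)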
Edit: (years later after much experience with this)
What you really want to do here is use the COPY command to bulk load your data into a temporary table, then join that temporary table on both of your columns as you would a normal table. With psycopg2 you can use the copy_expert method to perform the COPY. Following this example, here's how you would do that...
Also, trust me when I say this: if SPEED is an issue for you, this is by far, not-even-close, the fastest method out there.
code in this example is not tested
import pandas as pd
from psycopg2 import sql

df = pd.DataFrame('<whatever your dataframe is>')

# Start by creating the temporary table
string = '''
    create temp table mydata (
        item_id int,
        date date
    );
'''
cur.execute(string)

# Now you need to generate an sql string that will copy
# your data into the db
string = sql.SQL("""
    copy {} ({})
    from stdin (
        format csv,
        null 'NaN',
        delimiter ',',
        header
    )
""").format(
    sql.Identifier('mydata'),
    sql.SQL(',').join([sql.Identifier(i) for i in df.columns])
)

# Write your dataframe to disk as a csv
df.to_csv('./temp_dataframe.csv', index=False, na_rep='NaN')

# Copy into the database
with open('./temp_dataframe.csv') as csv_file:
    cur.copy_expert(string, csv_file)

# Now your data should be in your temporary table, so we can
# perform our select like normal
string = '''
    select t.*
    from my_table t
    join mydata d
      on t.item_id = d.item_id and t.date = d.date
'''
cur.execute(string)
data = cur.fetchall()
I am attempting to return some named columns from a jsonb data set that is stored with PostgreSQL.
I am able to run a raw query that meets my needs directly, however I am trying to run the query utilising SQLAlchemy, in order to ensure that my code is 'pythonic' and easy to read.
The query that returns the correct result (two columns) is:
SELECT
tmp.item->>'id',
tmp.item->>'name'
FROM (SELECT jsonb_array_elements(t.data -> 'users') AS item FROM tpeople t) as tmp
Example json (each user has 20+ columns)
{ "results":247, "users": [
{"id":"202","regdate":"2015-12-01","name":"Bob Testing"},
{"id":"87","regdate":"2014-12-12","name":"Sally Testing"},
{"id":"811", etc etc}
...
]}
The table is simple enough, with a PK, datetime of json extraction, and the jsonb column for the extract
CREATE TABLE tpeople
(
    record_id bigint NOT NULL DEFAULT nextval('"tpeople_record_id_seq"'::regclass),
    scrape_time timestamp without time zone NOT NULL,
    data jsonb NOT NULL,
    CONSTRAINT "tpeople_pkey" PRIMARY KEY (record_id)
);
Additionally I have a People Class that looks as follows:
class people(Base):
    __tablename__ = 'tpeople'

    record_id = Column(BigInteger, primary_key=True, server_default=text("nextval('\"tpeople_record_id_seq\"'::regclass)"))
    scrape_time = Column(DateTime, nullable=False)
    data = Column(JSONB(astext_type=Text()), nullable=False)
Presently my code to return the two columns looks like this:
from db.db_conn import get_session  # Generic connector for my db
from model.models import people
from sqlalchemy import func
sess = get_session()
sub = sess.query(func.jsonb_array_elements(people.data["users"]).label("item")).subquery()
test = sess.query(sub.c.item).select_entity_from(sub).all()
SQLAlchemy generates the following SQL:
SELECT anon_1.item AS anon_1_item
FROM (SELECT jsonb_array_elements(tpeople.data -> %(data_1)s) AS item
FROM tpeople) AS anon_1
{'data_1': 'users'}
But nothing I do seems to let me get only certain columns within the item itself, like I can with the raw SQL. Some of the approaches I have tried are as follows (they all error out):
test = sess.query("sub.item.id").select_entity_from(sub).all()
test = sess.query(sub.item.["id"]).select_entity_from(sub).all()
aas = func.jsonb_to_recordset(people.data["users"])
res = sess.query("id").select_from(aas).all()
sub = select(func.jsonb_array_elements(people.data["users"]).label("item"))
Presently I can extract the columns I need in a simple for loop, but this seems like a hacky way to do it, and I'm sure there is something dead obvious I'm missing.
for row in test:
    print(row.item['id'])
After searching for a few hours I eventually found someone who had stumbled onto this while trying to get a different result:
sub = sess.query(func.jsonb_array_elements(people.data["users"]).label("item")).subquery()
tmp = sub.c.item.op('->>')('id')
tmp2 = sub.c.item.op('->>')('name')
test = sess.query(tmp, tmp2).all()
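A small usage note, as a sketch along the same lines: adding .label(...) to each extracted expression gives the result rows named attributes:
# Sketch: label the extracted fields so each row comes back as a named tuple.
test = sess.query(
    sub.c.item.op('->>')('id').label('id'),
    sub.c.item.op('->>')('name').label('name'),
).all()

for row in test:
    print(row.id, row.name)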
In a postgresql database:
class Persons(models.Model):
    person_name = models.CharField(max_length=10, unique=True)
The persons.csv file, contains 1 million names.
$cat persons.csv
Name-1
Name-2
...
Name-1000000
I want to:
Create the names that do not already exist
Query the database and fetch the id for each name contained in the csv file.
My approach:
Use the COPY command or the django-postgres-copy application that implements it.
Also take advantage of the new Postgresql-9.5+ upsert feature.
Now, all the names in the csv file, are also in the database.
I need to get their ids -from the database- either in memory or in another csv file with an efficient way:
Use Q objects
list_of_million_q = <iterate csv and append Qs>
million_names = Names.objects.filter(list_of_million_q)
or
Use __in to filter based on a list of names:
list_of_million_names = <iterate csv and append strings>
million_names = Names.objects.filter(
    person_name__in=list_of_million_names
)
or
?
I do not feel that any of the above approaches for fetching the ids is efficient.
Update
There is a third option, along the lines of this post that should be a great solution which combines all the above.
Something like:
SELECT * FROM persons;
Make a name: id dictionary out of the names received from the database:
db_dict = {'Harry': 1, 'Bob': 2, ...}
Query the dictionary:
ids = []
for name in list_of_million_names:
    if name in db_dict:
        ids.append(db_dict[name])
This way you're using the quick dictionary indexing as opposed to the slower if x in list approach.
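In Django terms, that lookup dictionary can be built with a single query; a sketch using the Persons model from above:
# Sketch: fetch (person_name, id) pairs in one query, build the lookup dict,
# then resolve the ids with fast dictionary lookups.
db_dict = dict(Persons.objects.values_list('person_name', 'id'))
ids = [db_dict[name] for name in list_of_million_names if name in db_dict]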
But the only way to really know for sure is to benchmark these 3 approaches.
This post describes how to use RETURNING with ON CONFLICT so that, while inserting the contents of the csv file into the database, the ids are saved in another table, either when the insertion succeeds or when, due to the unique constraint, it is skipped.
I tested it on SQL Fiddle, using a setup that resembles the one used for the COPY command, which inserts into the database straight from a csv file while respecting the unique constraints.
The schema:
CREATE TABLE IF NOT EXISTS label (
id serial PRIMARY KEY,
label_name varchar(200) NOT NULL UNIQUE
);
INSERT INTO label (label_name) VALUES
('Name-1'),
('Name-2');
CREATE TABLE IF NOT EXISTS ids (
id serial PRIMARY KEY,
label_ids varchar(12) NOT NULL
);
The script:
CREATE TEMP TABLE tmp_table
(LIKE label INCLUDING DEFAULTS)
ON COMMIT DROP;
INSERT INTO tmp_table (label_name) VALUES
('Name-2'),
('Name-3');
WITH ins AS(
INSERT INTO label
SELECT *
FROM tmp_table
ON CONFLICT (label_name) DO NOTHING
RETURNING id
)
INSERT INTO ids (label_ids)
SELECT
id FROM ins
UNION ALL
SELECT
l.id FROM tmp_table
JOIN label l USING(label_name);
The output:
SELECT * FROM ids;
SELECT * FROM label;
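For completeness, a sketch of how this script might be driven from Python with psycopg2 and copy_expert; the connection string and file name are placeholders, and the SQL is the script above:
# Hypothetical wiring: load the csv into the temp table with COPY, then run
# the INSERT ... ON CONFLICT script, all in one transaction so that
# ON COMMIT DROP cleans the temp table up afterwards.
import psycopg2

conn = psycopg2.connect("dbname=mydb")   # placeholder connection string

with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TEMP TABLE tmp_table
        (LIKE label INCLUDING DEFAULTS)
        ON COMMIT DROP;
    """)
    with open('persons.csv') as f:       # placeholder file name
        cur.copy_expert(
            "COPY tmp_table (label_name) FROM STDIN (FORMAT csv)", f)
    cur.execute("""
        WITH ins AS (
            INSERT INTO label
            SELECT * FROM tmp_table
            ON CONFLICT (label_name) DO NOTHING
            RETURNING id
        )
        INSERT INTO ids (label_ids)
        SELECT id FROM ins
        UNION ALL
        SELECT l.id FROM tmp_table
        JOIN label l USING (label_name);
    """)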
I have multiple sql queries I need to run (via pandas.io.sql / .read_sql) that have a very similar structure so I am attempting to parameterize them.
I am wondering if there is a way to pass in column values using .format (which works for strings).
My query (truncated to simplify this post):
sql= '''
SELECT DISTINCT
CAST(report_suite AS STRING) AS report_suite, post_pagename,
COUNT(DISTINCT(CONCAT(post_visid_high,post_visid_low))) AS unique_visitors
FROM
FOO.db
WHERE
date_time BETWEEN '{0}' AND '{1}'
AND report_suite = '{2}'
GROUP BY
report_suite, post_pagename
ORDER BY
unique_visitors DESC
'''.format(*parameters)
What I would like to do is be able to parameterize the COUNT(DISTINCT(CONCAT(post_visid_high, post_visid_low))) AS unique_visitors expression,
like this somehow:
COUNT(DISTINCT({3})) as {'4'}
The problem I can't seem to get around is that doing this would require storing the column names as something other than strings, to avoid the quotes. Are there any good ways around this?
Consider the following approach:
sql_dynamic_parms = dict(
func1='CONCAT(post_visid_high,post_visid_low)',
name1='unique_visitors'
)
sql= '''
SELECT DISTINCT
CAST(report_suite AS STRING) AS report_suite, post_pagename,
COUNT(DISTINCT({func1})) AS {name1}
FROM
FOO.db
WHERE
date_time BETWEEN %(date_from)s AND %(date_to)s
AND report_suite = %(report_suite)s
GROUP BY
report_suite, post_pagename
ORDER BY
unique_visitors DESC
'''.format(**sql_dynamic_parms)
params = dict(
date_from=pd.to_datetime('2017-01-01'),
date_to=pd.to_datetime('2017-12-01'),
report_suite=111
)
df = pd.read_sql(sql, conn, params=params)
P.S. You may want to read PEP 249 to see what kinds of parameter placeholders are accepted.
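If the driver is psycopg2, another option is to compose the identifier parts with its sql module (as in the copy_expert example earlier) rather than plain str.format; a sketch reusing the conn and params objects from above:
# Sketch: quote the alias as an identifier and keep the aggregate expression
# as trusted SQL text; values still go through normal parameter binding.
from psycopg2 import sql

composed = sql.SQL("""
    SELECT DISTINCT
        CAST(report_suite AS STRING) AS report_suite, post_pagename,
        COUNT(DISTINCT({func1})) AS {name1}
    FROM FOO.db
    WHERE date_time BETWEEN %(date_from)s AND %(date_to)s
      AND report_suite = %(report_suite)s
    GROUP BY report_suite, post_pagename
    ORDER BY unique_visitors DESC
""").format(
    func1=sql.SQL('CONCAT(post_visid_high, post_visid_low)'),
    name1=sql.Identifier('unique_visitors'),
)

df = pd.read_sql(composed.as_string(conn), conn, params=params)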
I am trying to submit data to a SQLite db through Python with executemany(). I am reading data from a JSON file and then placing it into the db. My problem is that the JSON creation is not under my control, and depending on who I get the file from, the order of the values is not the same each time. The keys are correct, so they correlate with the keys in the db, but I can't just toss the values at the executemany() function and have the data appear in the correct columns each time.
Here is what I need to be able to do.
keyTuple = (name, address, telephone)
listOfTuples = [(name1, address1, telephone1),
(name2, address2, telephone2),
(...)]
cur.executemany("INSERT INTO myTable(?,?,?)", keysTuple"
"VALUES(?,?,?)", listOfTuples)
The problem I have is that some JSON files have the order "name, telephone, address" or some other order. I need to be able to feed my keysTuple into the INSERT portion of the command so I can keep my relations straight no matter what order the JSON file comes in, without having to completely rebuild listOfTuples. I know there has got to be a way, but what I have written doesn't match the right syntax for the INSERT portion. The VALUES line works just fine; it uses each element in listOfTuples.
Sorry if I am not asking with the correct verbiage. FNG here and this is my first post. I have looked all over the web, but it only produces examples of using ? in the VALUES portion, never in the INSERT INTO portion.
You cannot use SQL parameters (?) for table/column names.
But when you already have the column names in the correct order, you can simply join them in order to be able to insert them into the SQL command string:
>>> keyTuple = ("name", "address", "telephone")
>>> "INSERT INTO MyTable(" + ",".join(keyTuple) + ")"
'INSERT INTO MyTable(name,address,telephone)'
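If the rows arrive as dicts (as they would straight from json.load), sqlite3's named placeholders sidestep the ordering problem entirely. A sketch, with hypothetical file names and the name/address/telephone keys from the question:
import json
import sqlite3

con = sqlite3.connect('my.db')            # hypothetical database file
cur = con.cursor()

keyTuple = ('name', 'address', 'telephone')

# Build the column list and matching :named placeholders from keyTuple;
# sqlite3 then pulls each value out of the row dict by key, so the key
# order inside the JSON file no longer matters.
columns = ','.join(keyTuple)
placeholders = ','.join(':' + k for k in keyTuple)
query = 'INSERT INTO myTable({}) VALUES({})'.format(columns, placeholders)

with open('data.json') as f:              # hypothetical input file
    rows = json.load(f)                   # assumed to be a list of dicts

cur.executemany(query, rows)
con.commit()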
Try this.
For example, if you have a table named products with the following fields:
Prod_Name Char( 30 )
UOM Char( 10 )
Reference Char( 10 )
Const Float
Price Float
list_products = [('Garlic', '5 Gr.', 'Can', 1.10, 2.00),
                 ('Beans', '8 On.', 'Bag', 1.25, 2.25),
                 ('Apples', '1 Un.', 'Unit', 0.25, 0.30),
                 ]
c.executemany('Insert Into products Values (?,?,?,?,?)', list_products )