Get Row Count in SQLAlchemy Bulk Operations - python

Is there a way to get the number of rows affected as a result of executing session.bulk_insert_mappings() and session.bulk_update_mappings()?
I have seen that you can use ResultProxy.rowcount to get this number with connection.execute(), but how does it work with these bulk operations?

Unfortunately, bulk_insert_mappings and bulk_update_mappings do not return the number of rows created/updated.
If your update is the same for all the objects (for example, increasing some int field by 1), you could use this instead:
updated_count = session.query(People).filter(People.name == "Nadav").update({People.age: People.age + 1})
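If a count matters for the insert side, one option is to drop to Core-level execution, where the result object exposes a rowcount attribute; note that the DB-API leaves rowcount undefined for executemany-style calls, so its value is driver-dependent. A minimal sketch (the People model and the mappings list are placeholders, not from the question):

mappings = [{"name": "Nadav", "age": 30}, {"name": "Dana", "age": 25}]

# Core-level bulk insert; rowcount may be -1 or otherwise unreliable for executemany.
result = session.execute(People.__table__.insert(), mappings)
print(result.rowcount)

# Query.update(), as in the snippet above, does return the number of matched rows.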

Related

What is the best way to query a pytable column with many values?

I have an 11-column x 13,470,621-row PyTables table. The first column contains a unique identifier for each row (each identifier appears only once in the table).
This is how I select rows from the table at the moment:
my_annotations_table = h5r.root.annotations
# Loop through the table and get rows that match gene identifiers (column labeled gene_id).
for record in my_annotations_table.where("(gene_id == b'gene_id_36624') | (gene_id == b'gene_id_14701') | (gene_id == b'gene_id_14702')"):
    # Do something with the data.
    ...
Now this works fine with small datasets, but I will routinely need to perform queries matching many thousands of unique identifiers against the table's gene_id column. For these larger queries, the query string quickly gets very large and I get an exception:
File "/path/to/my/software/python/python-3.9.0/lib/python3.9/site-packages/tables/table.py", line 1189, in _required_expr_vars
cexpr = compile(expression, '<string>', 'eval')
RecursionError: maximum recursion depth exceeded during compilation
I've looked at this question (What is the PyTables counterpart of a SQL query "SELECT col2 FROM table WHERE col1 IN (val1, val2, val3...)"?), which is somewhat similar to mine, but was not satisfactory.
I come from an R background, where we often do these kinds of queries (i.e. my_data_frame[my_data_frame$gene_id %in% c("gene_id_1234", "gene_id_1235"),]), and was wondering whether there is a comparable solution I could use with PyTables.
Thanks very much,
Another approach to consider is combining two functions: Table.get_where_list() with Table.read_coordinates().
Table.get_where_list(): gets the row coordinates fulfilling the given condition.
Table.read_coordinates(): gets a set of rows given their coordinates (in a list), and returns them as a (record) array.
The code would look something like this:
my_annotations_table = h5r.root.annotations
gene_name_list = ['gene_id_36624', 'gene_id_14701', 'gene_id_14702']
# Loop through gene names and get rows that match gene identifiers (column labeled gene_id)
gene_row_list = []
for gene_name in gene_name_list:
    gene_rows = my_annotations_table.get_where_list("gene_id == gene_name")
    gene_row_list.extend(gene_rows)
# Retrieve all of the data in one call
gene_data_arr = my_annotations_table.read_coordinates(gene_row_list)
Okay, I managed to make some satisfactory improvements on this.
1st: optimize the table (with the help of the documentation - https://www.pytables.org/usersguide/optimization.html).
Create the table, making sure to specify the expectedrows=<int> argument, as it has the potential to increase the query speed.
# tb comes from: import tables as tb
table = h5w.create_table("/", 'annotations',
                         DataDescr, "Annotation table unindexed",
                         expectedrows=self._number_of_genes,
                         filters=tb.Filters(complevel=9, complib='blosc'))
I also modified the input data so that the gene_id_12345 fields are plain integers (gene_id_12345 becomes 12345).
Once the table was populated with its 13,470,621 entries (i.e. rows),
I created a completely sorted index on the gene_id column (Column.create_csindex()) and sorted the table by it:
table.cols.gene_id.create_csindex()
table.copy(overwrite=True, sortby='gene_id', newname="Annotation table", checkCSI=True)
# Just make sure that the index is usable. Will print an empty list if not.
print(table.will_query_use_indexing('(gene_id == 57403)'))
2nd: the table is optimized, but I still can't query thousands of gene_ids at a time. So I simply split them into chunks of 31 gene_ids (yes, 31 was the absolute maximum; 32 was apparently too much).
I did not perform benchmarks, but querying ~8000 gene_ids now takes approximately 10 seconds which is acceptable for my needs.
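For reference, a rough sketch of that chunking step, assuming the integer-encoded gene_id column described above (the helper name and the use of Table.read_where() are my own choices for illustration, not from the original code):

import itertools

def read_genes_in_chunks(table, gene_ids, chunk_size=31):
    """Query the indexed table in chunks of gene ids and collect the matching rows."""
    results = []
    ids = iter(gene_ids)
    while True:
        chunk = list(itertools.islice(ids, chunk_size))
        if not chunk:
            break
        # Build a short OR-ed condition per chunk to stay below the compiler's recursion limit.
        condition = " | ".join("(gene_id == {})".format(g) for g in chunk)
        results.extend(table.read_where(condition))
    return results

# e.g. rows = read_genes_in_chunks(table, [57403, 36624, 14701, 14702])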

Python, Postgres, and integers with blank values?

So I have some fairly sparse data columns where most of the values are blank but sometimes have some integer value. In Python, if there is a blank then that column is interpreted as a float and there is a .0 at the end of each number.
I tried two things:
Changed all of the columns to text and then stripped the .0 from everything
Filled blanks with 0 and made each column an integer
Stripping the .0 is kind of time-consuming on 2 million+ rows per day, and the data then ends up in text format, which means I can't do quick sums and such.
Filling blanks seems wasteful because some columns literally have just a few actual values out of millions. My table for just one month is already over 80 GB (200 columns, though many of the columns after the first 30 or so are pretty sparse).
What Postgres datatype is best for this? There are NO decimals, because the columns contain a number of seconds and are pre-rounded by the application.
Edit - here is what I am doing currently (but this bloats up the size and seems wasteful):
def create_int(df, col):
    df[col].fillna(0, inplace=True)
    df[col] = df[col].astype(int)
If I try to create the column astype(int) without filling in the 0s I get the error:
error: Cannot convert NA to integer
Here is the link to the gotcha about this:
http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na
So it makes each int a float. Should I change the datatypes in Postgres to numeric or something? I do not need high precision, because there are no values after the decimal.
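For context, a quick illustration of the upcast the gotcha describes (a sketch; any integer column containing a missing value is promoted to float):

import numpy as np
import pandas as pd

s = pd.Series([10, 20, np.nan])
print(s.dtype)  # float64 -- the NaN forces the integer column to float, hence the trailing .0
print(s)        # 10.0, 20.0, NaN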
You could take advantage of the fact that you are using PostgreSQL (9.3 or above) and implement a "poor man's sparse row" by converting your data into Python dictionaries and then using a JSON datatype (JSONB is better).
The following Python snippets generate random data in the format you said yours is in, convert it to an appropriate dict-per-row JSON form, and upload it into a PostgreSQL table with a JSONB column.
import psycopg2
import json
import random

def row_factory(n=200, sparcity=0.1):
    return [random.randint(0, 2000) if random.random() < sparcity else None for i in range(n)]

def to_row(data):
    result = {}
    for i, element in enumerate(data):
        if element is not None: result[i] = element
    return result

def from_row(row, lenght=200):
    result = [None] * lenght
    for index, value in row.items():
        result[int(index)] = value
    return result

con = psycopg2.connect("postgresql://...")
cursor = con.cursor()
cursor.execute("CREATE TABLE numbers (values JSONB)")

def upload_data(rows=100):
    for i in range(rows):
        cursor.execute("INSERT INTO numbers VALUES(%s)", (json.dumps(to_row(row_factory(sparcity=0.5))),))

upload_data()
# To retrieve the sum of all columns:
cursor.execute("""SELECT {} from numbers limit 10""".format(", ".join("sum(CAST(values->>'{}' as int))".format(i) for i in range(200))))
result = cursor.fetchall()
It took me a while to find out how to perform numeric operations on the JSONB data inside PostgreSQL (if you will be using the data from Python you can just use the from_row snippet above). But the last two lines run a SELECT that performs a SUM over all columns; the statement itself is assembled with Python string formatting. The key to using a JSON value as a number is to select it with the ->> operator and then cast it to a number (the sum(CAST(values->>'0' as int)) part).
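As a small usage sketch of the Python-side path (psycopg2 decodes JSONB columns to Python dicts, so the fetched values can be fed straight to from_row):

cursor.execute("SELECT values FROM numbers LIMIT 5")
for (row,) in cursor.fetchall():
    dense = from_row(row)                        # back to a fixed-length list with None gaps
    print(sum(v for v in dense if v is not None))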

How do I express this query in SQLAlchemy

I am trying to query a table in an existing SQLite database. The data must first be subsetted by a user input, like so:
query(Data.num == input)
Then I want to find the max and min of another field, date, within this subset.
I have tried using func.min/max, as well as union, but received an error saying the columns do not match. One of the issues here is that func.min/max need to be used as query arguments, not in a filter.
ids = session.query(Data).filter(Data.num == input)
q = session.query(func.max(Data.date),
                  func.min(Data.date))
ids.union(q).all()
ArgumentError: All selectables passed to CompoundSelect must have identical numbers of columns; select #1 has 12 columns, select #2 has 2
Similarly, if I use func.max and min separately, the error says #2 has 1 column.
I think seeing this query in SQL might help as well.
Thanks
The following solution works. You first need to set up the query, then filter the data down afterwards.
query = session.query(Data.num, func.min(Data.date),
                      func.max(Data.date), Data.date)
query = query.filter(Data.num == input)
query = query.all()
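If only the two aggregates are needed, a more compact form is also possible (a sketch, assuming the same Data model and user input as above):

from sqlalchemy import func

min_date, max_date = (
    session.query(func.min(Data.date), func.max(Data.date))
    .filter(Data.num == input)
    .one()
)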

SQLAlchemy Select with Exist limiting

I have a table with 4 columns (1 PK) from which I need to select 30 rows.
For these rows, the values of two columns (col. A and B) must also exist in another table (8 columns, 1 PK, two of which are A and B).
The second table is large, containing millions of records, and it's enough for me to know whether even a single row exists in it containing the values of col. A and B from the 1st table.
I am using the code below:
query = db.Session.query(db.Table_1).\
    filter(
        exists().where(db.Table_2.col_a == db.Table_1.col_a).\
            where(db.Table_2.col_b == db.Table_1.col_b)
    ).limit(30).all()
This query gets me the results I desire, however I'm afraid it might be a bit slow, since it does not apply a limit to the exists() subquery, nor does it do a SELECT 1 but a SELECT *.
exists() does not accept a .limit(1).
How can I put a limit on exists() so it does not scan the whole table, hence making this query run faster?
I need n rows from Table_1 whose two columns (A and B) exist in a record in Table_2.
Thank you
You can do the "select 1" thing using a more explicit form, as mentioned here, that is:
exists([1]).where(...)
However, while I've been a longtime diehard "select 1" kind of guy, I've since learned that the usage of "1" vs. "*" for performance is now a myth (more / more).
exists() is also a wrapper around select(), so you can get a limit() by constructing the select() first:
s = select([1]).where(
    table1.c.col_a == table2.c.col_a
).where(
    table1.c.col_b == table2.c.col_b
).limit(30)
s = exists(s)
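A sketch of attaching that construct back to the original query (assuming table1/table2 above correspond to db.Table_1/db.Table_2 from the question; untested):

query = db.Session.query(db.Table_1).filter(s).limit(30).all()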
Alternatively, the lookup can be written with a plain select() and and_():
query = select([db.Table_1])
query = query.where(
    and_(
        db.Table_2.col_a == db.Table_1.col_a,
        db.Table_2.col_b == db.Table_1.col_b
    )
).limit(30)
result = session.execute(query)

Python library for dealing with time associated data?

I've got some data (NOAA-provided weather forecasts) I'm trying to work with. There are various data series (temperature, humidity, etc.), each of which contains a series of data points and indexes into an array of datetimes, on various time scales (some series are hourly, others 3-hourly, some daily). Is there any sort of library for dealing with data like this and accessing it in a user-friendly way?
Ideal usage would be something like:
db = TimeData()
db.set_val('2010-12-01 12:00','temp',34)
db.set_val('2010-12-01 15:00','temp',37)
db.set_val('2010-12-01 12:00','wind',5)
db.set_val('2010-12-01 13:00','wind',6)
db.query('2010-12-01 13:00') # {'wind':6, 'temp':34}
Basically the query would return the most recent value of each series.
I looked at scikits.timeseries, but it isn't very amenable to this use case, due to the amount of pre-computation involved (it expects all the data in one shot, no random-access setting).
If your data is sorted you can use the bisect module to quickly get the entry with the greatest time less than or equal to the specified time.
Something like:
from bisect import bisect_right

i = bisect_right(times, time)
# times[j] <= time for j < i
# times[j] > time for j >= i
if times[i-1] == time:
    # exact match
    value = values[i-1]
else:
    # interpolate
    value = (values[i-1] + values[i]) / 2
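Putting that together, a minimal sketch of a TimeData-like store built on sorted per-series lists and bisect (the class and method names mirror the usage example in the question; query() returns the most recent value at or before the given time, rather than interpolating):

from bisect import bisect_right

class TimeData:
    """Per-series sorted (times, values) lists with most-recent-value lookup."""

    def __init__(self):
        self.series = {}  # name -> ([times], [values]), kept sorted by time

    def set_val(self, t, name, value):
        times, values = self.series.setdefault(name, ([], []))
        i = bisect_right(times, t)
        times.insert(i, t)
        values.insert(i, value)

    def query(self, t):
        out = {}
        for name, (times, values) in self.series.items():
            i = bisect_right(times, t)
            if i:  # at least one sample at or before t
                out[name] = values[i - 1]
        return out

# db = TimeData()
# db.set_val('2010-12-01 12:00', 'temp', 34)
# db.set_val('2010-12-01 13:00', 'wind', 6)
# db.query('2010-12-01 13:00')  # {'temp': 34, 'wind': 6}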
SQLite has a date type. You can also convert all the times to seconds since epoch (by going through time.gmtime() or time.localtime()), which makes comparisons trivial.
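For instance, a small sketch of that conversion using the timestamp format from the question (time.mktime() interprets the parsed time as local time; calendar.timegm() would treat it as UTC instead):

import time

def to_epoch(ts, fmt="%Y-%m-%d %H:%M"):
    # Parse the timestamp string and convert it to seconds since the epoch.
    return time.mktime(time.strptime(ts, fmt))

to_epoch('2010-12-01 13:00') > to_epoch('2010-12-01 12:00')  # True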
It is a classic row-to-column problem; in a good SQL DBMS you can use unions:
SELECT MAX(d_t) AS d_t, SUM(temp) AS temp, SUM(wind) AS wind, ... FROM (
    SELECT d_t, 0 AS temp, value AS wind FROM table
        WHERE type='wind' AND d_t >= some_date
        ORDER BY d_t DESC LIMIT 1
    UNION
    SELECT d_t, value, 0 FROM table
        WHERE type='temp' AND d_t >= some_date
        ORDER BY d_t DESC LIMIT 1
    UNION
    ...
) q1;
The trick is to make a subquery for each dimension while providing placeholder columns for the other dimensions. In Python you can use SQLAlchemy to dynamically generate a query like this.
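As a rough sketch of what that dynamic generation could look like with SQLAlchemy Core, 1.x style (the readings table, its column names, and the latest_values_query helper are assumptions made here for illustration; whether the ORDER BY/LIMIT inside the union members renders as intended depends on the backend):

from sqlalchemy import (Column, DateTime, Float, MetaData, String, Table,
                        func, literal, select, union)

metadata = MetaData()
readings = Table(
    "readings", metadata,
    Column("d_t", DateTime),
    Column("type", String),
    Column("value", Float),
)

def latest_values_query(dimensions, some_date):
    """Build one SELECT per dimension with placeholder zeros, union them, then collapse with SUM/MAX."""
    selects = []
    for dim in dimensions:
        cols = [readings.c.d_t]
        for other in dimensions:
            # Real value for this subquery's dimension, placeholder 0 for the others.
            value_or_zero = readings.c.value if other == dim else literal(0)
            cols.append(value_or_zero.label(other))
        selects.append(
            select(cols)
            .where(readings.c.type == dim)
            .where(readings.c.d_t >= some_date)
            .order_by(readings.c.d_t.desc())
            .limit(1)
        )
    q1 = union(*selects).alias("q1")
    outer = [func.max(q1.c.d_t).label("d_t")] + [func.sum(q1.c[d]).label(d) for d in dimensions]
    return select(outer).select_from(q1)

# e.g. session.execute(latest_values_query(['temp', 'wind'], some_date)).fetchall()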
