Efficiently querying a graph structure - python

I have a database which consists of a graph. The table I need to access looks like this:
Sno  Source  Dest
1    'jack'  'bob'
2    'jack'  'Jill'
3    'bob'   'Jim'
Here Sno is the primary key. Source and Dest are two non-unique values that represent an edge between nodes in my graph. Source and Dest may also be strings, not necessarily a numeric data type. I have around 5 million entries in my database, which I built with PostgreSQL and psycopg2 for Python.
It is very easy and quick to query by the primary key. However, I frequently need to query this database for all the Dest values a particular Source is connected to. Right now I do this with the query:
SELECT * FROM name_table WHERE Source = 'jack'
This turns out to be quite inefficient (up to 2 seconds per query), and there is no way I can make this the primary key as it is not unique. Is there any way I can build an index on these repeated values and query them quickly?

This should make your query much faster:
CREATE INDEX table_name_index_source ON table_name (Source);
However, there are many options you can use.
PostgreSQL Documentation
CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ name ] ON table [ USING method ]
( { column | ( expression ) } [ COLLATE collation ] [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
[ WITH ( storage_parameter = value [, ... ] ) ]
[ TABLESPACE tablespace ]
[ WHERE predicate ]
Read more about indexing with PostgreSQL in their Documentation.
Update
If your table stays as small as it is now, this will certainly help. However, if your dataset keeps growing, you should probably consider a schema change to get unique values, which can be indexed more efficiently.
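If you create the index from Python, a rough psycopg2 sketch (table and column names taken from the question; the connection parameters are placeholders):

import psycopg2

# Placeholder connection details
conn = psycopg2.connect(dbname="mydb", user="me", password="secret", host="localhost")
cur = conn.cursor()

# One-time: build a B-tree index on the non-unique Source column
cur.execute("CREATE INDEX name_table_source_idx ON name_table (Source);")
conn.commit()

# Lookups by Source can now use the index instead of a sequential scan
cur.execute("SELECT Dest FROM name_table WHERE Source = %s", ('jack',))
neighbours = [row[0] for row in cur.fetchall()]

cur.close()
conn.close()

You can check that the planner actually uses the index by prefixing the SELECT with EXPLAIN ANALYZE.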

Related

Grouping DISTINCT SQL results into objects

I'm trying to combine data from two tables (the first a shipment details table with one shipment per row, the second a table that tracks transactions made against shipments) to build a view that shows me the product data for a given shipment.
My current attempt to get the data looks like this:
SELECT DISTINCT
sd.erp_order AS shipment_id,
th.product_code,
th.lot
FROM transaction_history th
JOIN shipment_detail sd
ON sd.shipment_id = th.reference_id
AND sd.item = th.item
WHERE th.transaction_type = '1'
AND sd.erp_order in ('1111', '1112')
which returns my data in the following format:
|shipment_id|product_code|lot|
| 1111| PRODUCT_A| 1A|
| 1111| PRODUCT_B| 2B|
| 1112| PRODUCT_A| 1A|
| 1112| PRODUCT_B| 3B|
This is great, but now I need to organize it so that when it goes through my API (I'm using Django), the lot code and the product code are grouped together in their own object, and all the products are listed under the relevant shipment:
[
    {
        "shipment_id": "1111",
        "products": [
            {
                "product_code": "PRODUCT_A",
                "lot": "1A"
            },
            {
                "product_code": "PRODUCT_B",
                "lot": "2B"
            }
        ]
    }
]
and I'm not quite sure how to do it. Is this something that can be done with SQL, or will I have to do it with Python?
I also recognize that I should be able to get this kind of data from existing tables, but this is a siloed database that I cannot modify, and I'm told by the supporting team that this is the best place to get the data I need.
Try something like:
data = serializers.serialize('json', SomeModel.objects.raw(query), fields=('id', 'name', 'parent'))
Serialisation Django
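If you do end up doing the grouping in Python instead, a minimal sketch (just illustrative, assuming the rows come back as (shipment_id, product_code, lot) tuples, e.g. from cursor.fetchall()):

from collections import defaultdict

rows = [
    ('1111', 'PRODUCT_A', '1A'),
    ('1111', 'PRODUCT_B', '2B'),
    ('1112', 'PRODUCT_A', '1A'),
    ('1112', 'PRODUCT_B', '3B'),
]

# Collect the products under their shipment_id
grouped = defaultdict(list)
for shipment_id, product_code, lot in rows:
    grouped[shipment_id].append({"product_code": product_code, "lot": lot})

# Shape it into the structure the API should return
payload = [
    {"shipment_id": shipment_id, "products": products}
    for shipment_id, products in grouped.items()
]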
You can use this query to generate the JSON data on the database side and then load it into Python:
SELECT DISTINCT
sd.erp_order AS shipment_id,
products.product_code,
products.lot
FROM shipment_detail sd
JOIN transaction_history products
ON sd.shipment_id = products.reference_id
AND sd.item = products.item
WHERE products.transaction_type = '1'
AND sd.erp_order in ('1111', '1112')
FOR JSON AUTO
for more information read here.
Sample running in MSSQL:

ALTER TABLE Statement -> sql syntax error: incorrect syntax near """: line 1 col 49 (at pos 49)'

I'm writing a Python program in which I generate a SQL table. I also want to add a column to this table, but then I get the error idh_jdbc java.lang.Exception: com.sap.db.jdbc.exceptions.JDBCDriverException: SAP DBTech JDBC: [257]: sql syntax error: incorrect syntax near """
My Code:
alter = f'ALTER TABLE "CELONIS_E2E"."CAG_List" ADD COLUMN "{header[x]}" {sqltype}'
where sqltype = NVARCHAR(255) and header[x] = Assigned_Groups_2.
The printed statement is: ALTER TABLE "CELONIS_E2E"."CAG_List" ADD COLUMN "Assigned_Groups_2" NVARCHAR(255)
According to the SAP HANA specification, you should not include "COLUMN".
The SQL should be
ALTER TABLE "CELONIS_E2E"."CAG_List" ADD "Assigned_Groups_2" NVARCHAR(255)
based on the specification for the add columns clause below:
<add_columns_clause> ::= ADD ( <column_list > )
<column_list> ::= <column_specification> [, <column_specification> [,...] ] [ ONLINE [ PREFERRED ] ]
<column_specification> ::=
<column_name> { [ <column_definition> ] [ <column_constraint_short> ] [ COMMENT <string_literal> ] }
CLIENTSIDE ENCRYPTION ON WITH <column_encryption_key_name> [ RANDOM | DETERMINISTIC ] [ <column_load_unit> ]
<column_load_unit> ::= <column_name> <load_unit>
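Applied to the code in the question, that means dropping COLUMN from the f-string. A rough sketch of the corrected statement construction (the cursor stands in for whatever connection object is already in use; the parentheses follow the add-columns grammar quoted above):

sqltype = "NVARCHAR(255)"
column_name = header[x]  # e.g. "Assigned_Groups_2"

# SAP HANA's ADD clause takes a parenthesised column list and no COLUMN keyword
alter = f'ALTER TABLE "CELONIS_E2E"."CAG_List" ADD ("{column_name}" {sqltype})'
cursor.execute(alter)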

Using Unnest With psycopg2

I've built a Web UI to serve as an ETL application that allows users to select some CSV and TSV files containing large amounts of records, and I am attempting to insert them into a PostgreSQL database. As has already been well commented on, this process is kind of slow. After some research it looked like using the UNNEST function would be my answer, but I'm having trouble implementing it. Honestly, I just didn't find a great walk-through tutorial as I normally do when researching any data processing in Python.
Here's the SQL string as I store it (to be used in functions later):
salesorder_write = """
INSERT INTO api.salesorder (
site,
sale_type,
sales_rep,
customer_number,
shipto_number,
cust_po_number,
fob,
order_number
) VALUES (
UNNEST(ARRAY %s)
"""
I use this string along with a list of tuples like so:
for order in orders:
    inputs = (
        order['site'],
        order['sale_type'],
        order['sales_rep'],
        order['customer_number'],
        order['shipto_number'],
        order['cust_po_number'],
        order['fob'],
        order['order_number']
    )
    tup_list.append(inputs)
cur.execute(strSQL, tup_list)
This gives me the error Not all arguments converted during string formatting. My first question is: how do I need to structure my SQL to be able to pass my list of tuples? My second: can I use the existing dictionary structure in much the same way?
unnest is not superior to execute_values, which has been the canonical approach since Psycopg 2.7:
from psycopg2.extras import execute_values

orders = [
    dict(
        site='x',
        sale_type='y',
        sales_rep='z',
        customer_number=1,
        shipto_number=2,
        cust_po_number=3,
        fob=4,
        order_number=5
    )
]

salesorder_write = """
    insert into t (
        site,
        sale_type,
        sales_rep,
        customer_number,
        shipto_number,
        cust_po_number,
        fob,
        order_number
    ) values %s
"""

execute_values(
    cursor,
    salesorder_write,
    orders,
    template="""(
        %(site)s,
        %(sale_type)s,
        %(sales_rep)s,
        %(customer_number)s,
        %(shipto_number)s,
        %(cust_po_number)s,
        %(fob)s,
        %(order_number)s
    )""",
    page_size=1000
)

python pandas to_sql with sqlalchemy : how to speed up exporting to MS SQL?

I have a dataframe with ca 155,000 rows and 12 columns.
If I export it to csv with dataframe.to_csv , the output is an 11MB file (which is produced instantly).
If, however, I export to a Microsoft SQL Server with the to_sql method, it takes between 5 and 6 minutes!
No columns are text: only int, float, bool and dates. I have seen cases where ODBC drivers set nvarchar(max) and this slows down the data transfer, but it cannot be the case here.
Any suggestions on how to speed up the export process? Taking 6 minutes to export 11 MBs of data makes the ODBC connection practically unusable.
Thanks!
My code is:
import pandas as pd
from sqlalchemy import create_engine, MetaData, Table, select
ServerName = "myserver"
Database = "mydatabase"
TableName = "mytable"
engine = create_engine('mssql+pyodbc://' + ServerName + '/' + Database)
conn = engine.connect()
metadata = MetaData(conn)
my_data_frame.to_sql(TableName,engine)
I recently had the same problem and felt like adding an answer to this for others.
to_sql seems to send an INSERT query for every row, which makes it really slow. But since pandas 0.24.0 there is a method parameter in to_sql() where you can define your own insertion function, or just use method='multi' to tell pandas to pass multiple rows in a single INSERT query, which makes it a lot faster.
Note that your database may have a parameter limit. In that case you also have to define a chunksize.
So the solution should simply look like this:
my_data_frame.to_sql(TableName, engine, chunksize=<yourParameterLimit>, method='multi')
If you do not know your database parameter limit, just try it without the chunksize parameter. It will run or give you an error telling you your limit.
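If the target is SQL Server, the 2100-parameter limit mentioned in a later answer gives a rough way to pick the chunksize. A hedged sketch, assuming my_data_frame and engine from the question and no index column being written:

# method='multi' uses one bind parameter per cell, and SQL Server caps a
# single statement at 2100 parameters, so size the chunks accordingly.
params_limit = 2100
rows_per_chunk = max(1, params_limit // len(my_data_frame.columns) - 1)

my_data_frame.to_sql(TableName, engine, method='multi',
                     chunksize=rows_per_chunk, if_exists='append', index=False)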
The DataFrame.to_sql method generates insert statements which are sent to your ODBC connector and then treated by the ODBC connector as regular inserts.
When this is slow, it is not the fault of pandas.
Saving the output of the DataFrame.to_sql method to a file, then replaying that file over an ODBC connector will take the same amount of time.
The proper way of bulk importing data into a database is to generate a csv file and then use a load command, which in the MS flavour of SQL databases is called BULK INSERT
For example:
BULK INSERT mydatabase.myschema.mytable
FROM 'mydatadump.csv';
The syntax reference is as follows:
BULK INSERT
[ database_name . [ schema_name ] . | schema_name . ] [ table_name | view_name ]
FROM 'data_file'
[ WITH
(
[ [ , ] BATCHSIZE = batch_size ]
[ [ , ] CHECK_CONSTRAINTS ]
[ [ , ] CODEPAGE = { 'ACP' | 'OEM' | 'RAW' | 'code_page' } ]
[ [ , ] DATAFILETYPE =
{ 'char' | 'native'| 'widechar' | 'widenative' } ]
[ [ , ] FIELDTERMINATOR = 'field_terminator' ]
[ [ , ] FIRSTROW = first_row ]
[ [ , ] FIRE_TRIGGERS ]
[ [ , ] FORMATFILE = 'format_file_path' ]
[ [ , ] KEEPIDENTITY ]
[ [ , ] KEEPNULLS ]
[ [ , ] KILOBYTES_PER_BATCH = kilobytes_per_batch ]
[ [ , ] LASTROW = last_row ]
[ [ , ] MAXERRORS = max_errors ]
[ [ , ] ORDER ( { column [ ASC | DESC ] } [ ,...n ] ) ]
[ [ , ] ROWS_PER_BATCH = rows_per_batch ]
[ [ , ] ROWTERMINATOR = 'row_terminator' ]
[ [ , ] TABLOCK ]
[ [ , ] ERRORFILE = 'file_name' ]
)]
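For completeness, a rough sketch of driving that flow from Python with pyodbc; the CSV path must be readable by the SQL Server machine itself, and the connection string and paths are placeholders:

import pyodbc

csv_path = r'C:\dumps\mydatadump.csv'  # must be visible to the SQL Server instance

# Write the frame out without header or index so the columns line up with the table
my_data_frame.to_csv(csv_path, index=False, header=False)

conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                      'SERVER=myserver;DATABASE=mydatabase;Trusted_Connection=yes;')
cur = conn.cursor()
cur.execute(rf"""
    BULK INSERT mydatabase.myschema.mytable
    FROM '{csv_path}'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK)
""")
conn.commit()
cur.close()
conn.close()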
You can use this: what makes it faster is the method parameter of pandas to_sql. I hope this helps.
In my experience, this went from effectively never finishing to about 8 seconds.
import time

import pandas as pd
from sqlalchemy import create_engine

df = pd.read_csv('test.csv')
conn = create_engine(<connection_string>)

start_time = time.time()
df.to_sql('table_name', conn, method='multi', index=False, if_exists='replace')
print("--- %s seconds ---" % (time.time() - start_time))
With SQLAlchemy >= 1.3, set fast_executemany=True while creating the engine object. Reference
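A minimal sketch of that, reusing the setup from the question (server, database and driver names are placeholders):

from sqlalchemy import create_engine

# fast_executemany makes pyodbc send rows in batches instead of one INSERT per row
engine = create_engine(
    'mssql+pyodbc://myserver/mydatabase?driver=ODBC+Driver+17+for+SQL+Server',
    fast_executemany=True
)
my_data_frame.to_sql(TableName, engine, if_exists='append', index=False)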
You can use d6tstack which has fast pandas to SQL functionality because it uses native DB import commands. It supports MS SQL, Postgres and MYSQL
uri_psql = 'postgresql+psycopg2://usr:pwd@localhost/db'
d6tstack.utils.pd_to_psql(df, uri_psql, 'table')
uri_mssql = 'mssql+pymssql://usr:pwd@localhost/db'
d6tstack.utils.pd_to_mssql(df, uri_mssql, 'table', 'schema') # experimental
It is also useful for importing multiple CSVs with data schema changes and/or preprocessing with pandas before writing to the db; see further down in the examples notebook:
d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'),
    apply_after_read=apply_fun).to_psql_combine(uri_psql, 'table')
Why is pandas.DataFrame.to_sql slow?
When uploading data from pandas to Microsoft SQL Server, most of the time is actually spent converting from pandas to Python objects to the representation needed by the MS SQL ODBC driver. One of the reasons pandas is much faster for analytics than basic Python code is that it works on lean native arrays of integers / floats / … that don't have the same overhead as their respective Python counterparts. The to_sql method converts all these lean columns into many individual Python objects and thus doesn't get the performance treatment that the other pandas operations do.
Use turbodbc.Cursor.executemanycolumns to speed this up
Given a pandas.DataFrame, you can use turbodbc and pyarrow to insert the data with less conversion overhead than happening with the conversion to Python objects.
import pyarrow as pa
import turbodbc

cursor = …  # cursor to a MS SQL connection initiated with turbodbc
df = …      # the pd.DataFrame to be inserted

# Convert the pandas.DataFrame to a pyarrow.Table; most of the columns
# will be zero-copy and thus this is quite fast.
table = pa.Table.from_pandas(df)

# Insert into the database
cursor.executemanycolumns("INSERT INTO my_table VALUES (?, ?, ?)", table)
Why is this faster?
Instead of the conversion of pd.DataFrame -> collection of Python objects -> ODBC data structures, we are doing a conversion path pd.DataFrame -> pyarrow.Table -> ODBC structure. This is more performant due to:
Most of the columns of a pandas.DataFrame can be converted to columns of the pyarrow.Table without copying. The columns of the table will reference the same memory. So no actual conversion is done.
The conversion is done fully in native code with native types. This means that at no stage do we incur the overhead of Python objects, as long as we don't have object-typed columns.
I was running out of time and memory (more than 18GB allocated for a DataFrame loaded from 120MB CSV) with this line:
df.to_sql('my_table', engine, if_exists='replace', method='multi', dtype={"text_field": db.String(64), "text_field2": db.String(128), "intfield1": db.Integer(), "intfield2": db.Integer(), "floatfield": db.Float()})
Here is the code that helped me to import and track progress of insertions at the same time:
import sqlalchemy as db
engine = db.create_engine('mysql://user:password@localhost:3306/database_name', echo=False)
connection = engine.connect()
metadata = db.MetaData()
my_table = db.Table('my_table', metadata,
    db.Column('text_field', db.String(64), index=True),
    db.Column('text_field2', db.String(128), index=True),
    db.Column('intfield1', db.Integer()),
    db.Column('intfield2', db.Integer()),
    db.Column('floatfield', db.Float())
)
metadata.create_all(engine)
kw_dict = df.reset_index().sort_values(by="intfield2", ascending=False).to_dict(orient="records")
batch_size = 10000
for batch_start in range(0, len(kw_dict), batch_size):
    print("Inserting {}-{}".format(batch_start, batch_start + batch_size))
    connection.execute(my_table.insert(), kw_dict[batch_start:batch_start + batch_size])
My solution to this problem is below, if it helps anyone. From what I've read, the pandas to_sql method loads one record at a time.
You can make a bulk insert statement that loads 1000 lines and commits that transaction instead of committing a single row each time. This increases the speed massively.
import pandas as pd
from sqlalchemy import create_engine
import pymssql
import os
connect_string = [your connection string]
engine = create_engine(connect_string,echo=False)
connection = engine.raw_connection()
cursor = connection.cursor()
def load_data(report_name):
    # my report_name variable is also my sql server table name, so I use it to build the table name string
    sql_table_name = 'AR_' + str(report_name)
    global chunk  # to QC chunks that fail for some reason
    for chunk in pd.read_csv(report_full_path_new, chunksize=1000):
        chunk.replace('\'', '\'\'', inplace=True, regex=True)  # double up single quotes in the data to escape them in SQL
        chunk.fillna('NULL', inplace=True)
        my_data = str(chunk.to_records(index=False).tolist())  # convert data to string
        my_data = my_data[1:-1]  # clean up the ends
        my_data = my_data.replace('\"', '\'').replace('\'NULL\'', 'NULL')  # convert blanks to NULLs
        sql_table_name = [your sql server table name]
        sql = """
            INSERT INTO {0}
            VALUES {1}
        """.format(sql_table_name, my_data)
        cursor.execute(sql)
        # you must call commit() to persist your data if you don't set autocommit to True
        connection.commit()
For sqlalchemy >= 1.3, rather than using to_sql()'s method parameter, use fast_executemany=True in sqlalchemy's create_engine(). This should be at least as fast as method="multi" while avoiding T-SQL's limit of 2100 parameter values for a stored procedure, which causes the error seen here.
Credit to Gord Thompson from the same link.
Based on this answer - Aseem.
You can use the copy_from method to simulate a bulk load with a cursor object.
This was tested on Postgres, try it with your DB:
import pandas as pd
from sqlalchemy import create_engine, MetaData, Table, select
from io import StringIO  # on Python 2: from StringIO import StringIO

ServerName = "myserver"
Database = "mydatabase"
TableName = "mytable"

engine = create_engine('mssql+pyodbc://' + ServerName + '/' + Database)  # don't forget to add a password if needed

my_data_frame.head(0).to_sql(TableName, engine, if_exists='replace', index=False)  # create an empty table - just for the structure

conn = engine.raw_connection()
cur = conn.cursor()
output = StringIO()
my_data_frame.to_csv(output, sep='\t', header=False, index=False)  # a CSV that will be used for the bulk load
output.seek(0)
cur.copy_from(output, TableName, null="")  # empty fields are loaded as NULL
conn.commit()
cur.close()
conn.close()
As said in other answers, the reason for the slowdown and/or time out is because pandas is inserting many single rows over and over. The high volume of insert commands is slow and/or may be overloading the target database.
Using method='multi' tells pandas to upload in chunks. This is much faster and won't time out as easily.
sqlEngine = create_engine('mysql+mysqlconnector://' + config['user'] + ':' + config['pass'] + '@' + config['host'] + '/' + config['dbname'])
dbConnection = sqlEngine.connect()
df.to_sql('table_name', con=dbConnection, method='multi', if_exists='append', index=False)
dbConnection.close()
Probably the pyarrow answer above is best, but for mariadb, I wrote a wrapper on DataFrame to use executemany and fetchall, which gave me a 300x speedup. This also had the added bonus of not using sqlalchemy at all.
You can use it as normal: df.to_sql(...), or df = read_sql_table(...).
See https://gist.github.com/MichaelCurrie/b5ab978c0c0c1860bb5e75676775b43b

Interrelated requests MySQL, analogue in MongoDB

Good day dear colleagues, I decided to move some projects from MySQL to MongoDB and faced several difficulties:
For example there are two tables in MySQL:
Users:
CREATE TABLE `testdb`.`users` (
`id` INT( 11 ) NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`name` VARCHAR( 55 ) NOT NULL ,
`password` VARCHAR( 32 ) NOT NULL
) ENGINE = MYISAM
Rules:
CREATE TABLE `testdb`.`rules` (
`id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`uid` INT NOT NULL ,
`title` VARCHAR( 155 ) NOT NULL ,
`points` INT NOT NULL
) ENGINE = MYISAM
Now, to select all "rules" which belong to a particular user, I can run the SQL query:
SELECT r.`title`, r.`points` FROM `rules` r, `users` u WHERE r.`uid` = u.`id` AND u.`id` = '123'
So far I can't figure out how to do the same in MongoDB; can you please explain and provide an example?
P.S. I make implementation in Python with the help of pymongo
P.P.S. I also wanted to see the alternative ways of solving this problem with the help of ORM mongoengine or mongokit.
Thank you in advance:)
MongoDB does not support joins, unlike RDBMSs such as MySQL. That's because MongoDB is not a relational database. Modelling data in MongoDB the same way you would in an RDBMS is therefore generally a bad idea - you have to design your schemas with a whole different mindset.
In this case, for example, in MongoDB you could have one document per user, with the rules belonging to each user nested inside.
e.g.
{
    "ID": 1,
    "name": "John",
    "password": "eek hope this is secure",
    "rules": [
        {
            "ID": 1,
            "Title": "Rule 1",
            "Points": 100
        },
        {
            "ID": 2,
            "Title": "Rule 2",
            "Points": 200
        }
    ]
}
This means, you only need a single read to pull back a user and all their rules.
A good starting point is the Mongodb.org reference on Schema Design - what I'm talking about above is embedding objects.
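Since the question asks for a pymongo example, here is a minimal sketch of embedding and querying along the lines of the document above (database and collection names are placeholders):

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['testdb']

# One document per user, with that user's rules embedded
db.users.insert_one({
    "_id": 1,
    "name": "John",
    "password": "eek hope this is secure",
    "rules": [
        {"ID": 1, "Title": "Rule 1", "Points": 100},
        {"ID": 2, "Title": "Rule 2", "Points": 200},
    ],
})

# A single read returns the user together with all of their rules
doc = db.users.find_one({"_id": 1}, {"rules": 1, "_id": 0})
rules = doc["rules"] if doc else []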
