Let's say I have a Cassandra table defined like this:
CREATE TABLE IF NOT EXISTS {} (
    user_id bigint,
    username text,
    age int,
    PRIMARY KEY (user_id)
);
I have 3 lists of the same size, say 1,000,000 records in each. Is it good practice to insert the data using a for loop like this:
for index, user_id in enumerate(user_ids):
    query = "INSERT INTO TABLE (user_id, username, age) VALUES ({0}, '{1}', {2});".format(user_id, username[index], age[index])
    session.execute(query)
Prepared statements with concurrent execution will be your best bet. The driver provides utility functions for concurrent execution of statements with sequences of parameters, just as you have with your lists: execute_concurrent_with_args
Zipping your lists together will produce a sequence of parameter tuples suitable for input to that function.
Something like this:
from cassandra.concurrent import execute_concurrent_with_args

prepared = session.prepare("INSERT INTO table (user_id, username, age) VALUES (?, ?, ?)")
execute_concurrent_with_args(session, prepared, zip(user_ids, username, age))
It's probably a good idea to start by looking at the Python driver getting started guide. If you have already seen that then apologies, but I thought it worth mentioning.
Generally speaking, you'd create your session object and then do your inserts inside your loop, probably using something like a prepared statement (talked about further down the getting started page, but also here and here).
The example on that page uses this as a good starting point:
user_lookup_stmt = session.prepare("SELECT * FROM users WHERE user_id=?")

users = []
for user_id in user_ids_to_query:
    user = session.execute(user_lookup_stmt, [user_id])
    users.append(user)
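For the insert case in your question, a minimal sketch along the same lines might look like this (the contact point, keyspace name and the users table name are placeholders, and user_ids, username and age are the three lists from your question):

from cassandra.cluster import Cluster

# Assumed connection details; substitute your own contact points and keyspace.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')

insert_stmt = session.prepare(
    "INSERT INTO users (user_id, username, age) VALUES (?, ?, ?)")

for index, user_id in enumerate(user_ids):
    # Bound parameters are sent separately from the statement, so no string formatting is needed.
    session.execute(insert_stmt, (user_id, username[index], age[index]))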
You may also find this blog post helpful when it comes to getting better throughput with the Python driver.
You might find the Python driver GitHub page a useful resource; in particular, I found this example using a prepared statement here that might help you too.
def update_inv_quant():
    new_quant = int(input("Enter the updated quantity in stock: "))
Hello! I'm wondering how to insert a user variable into an SQL statement so that a record is updated to said variable. Also, it'd be really helpful if you could help me figure out how to print records of the database to the actual Python console. Thank you!
I tried doing something like ("INSERT INTO Inv(ItemName) Value {user_iname)") but I'm not surprised it didn't work.
It would have been more helpful if you had specified an actual database.
First method (Bad)
The usual way (which is highly discouraged, as Graybeard said in the comments) is to use Python's f-strings. You can google what they are and how to use them in more depth.
Basically, say you have two variables user_id = 1 and user_name = 'fish'; an f-string turns something like f"INSERT INTO mytable(id, name) VALUES ({user_id}, '{user_name}')" into the string INSERT INTO mytable(id, name) VALUES (1, 'fish').
As mentioned before, this opens the door to something called SQL injection. There are many good YouTube videos that demonstrate what that is and why it's dangerous.
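To make the danger concrete, here is a rough sketch using the Inv table from your question; the item name is a deliberately malicious, made-up value:

# A hypothetical value typed in by a user.
user_iname = "x'); DROP TABLE Inv; --"

# f-string interpolation splices the value straight into the SQL text.
query = f"INSERT INTO Inv(ItemName) VALUES ('{user_iname}')"
print(query)
# INSERT INTO Inv(ItemName) VALUES ('x'); DROP TABLE Inv; --')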
Second method
The second method depends on which database you are using. For example, in psycopg2 (a driver for the PostgreSQL database), the cursor.execute method uses the following syntax to pass variables: cur.execute('SELECT id FROM users WHERE cookie_id = %s', (cookieid,)). Notice that the variables are passed in a tuple as the second argument.
All databases use similar methods, with minor differences. For example, I believe sqlite3 uses ? instead of psycopg2's %s. That's why I said that specifying the actual database would have been more helpful.
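Applied to the update_inv_quant function from your question, a sketch might look like this (the Inv table and its ItemName/Quantity columns are assumptions from your post, and an already-open cursor cur is assumed; with psycopg2 you would only swap ? for %s):

def update_inv_quant(cur):
    item_name = input("Enter the item name: ")
    new_quant = int(input("Enter the updated quantity in stock: "))
    # The ? placeholders keep the values out of the SQL text entirely.
    cur.execute("UPDATE Inv SET Quantity = ? WHERE ItemName = ?",
                (new_quant, item_name))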
Fetching records
I am most familiar with PostgreSQL and psycopg2, so you will have to read the docs of your database of choice.
To fetch records, you send the query with cursor.execute() as we said before, and then call cursor.fetchone(), which returns a single row, or cursor.fetchall(), which returns all rows in an iterable that you can print directly.
Execute didn't update the database?
Statements executed from drivers are transactional, which is a whole topic by itself that I'm sure people on the internet can explain better than I can. To keep things short: for the statement to actually change the database, call connection.commit() after cursor.execute().
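Putting both pieces together for sqlite3 (the inventory.db file name and the Inv table are assumptions carried over from your question):

import sqlite3

conn = sqlite3.connect("inventory.db")
cur = conn.cursor()

# Change a row, then commit so the change is actually written to disk.
cur.execute("UPDATE Inv SET Quantity = ? WHERE ItemName = ?", (1000, "widgets"))
conn.commit()

# Print every record in the table to the console.
cur.execute("SELECT * FROM Inv")
for row in cur.fetchall():
    print(row)

conn.close()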
So finally to answer both of your questions, read the documentation of the database's driver and look for the execute method.
This is what I do (which is for sqlite3 and would be similar for other SQL type databases):
This assumes that you have connected to the database and that the table exists (otherwise you need to create it). For the purpose of the example, I have used a table called trades.
new_quant = 1000
# insert one record (row)
command = f"""INSERT INTO trades VALUES (
'some_ticker', {new_quant}, other_values, ...
) """
cur.execute(command)
con.commit()
print('trade inserted !!')
You can then wrap the above into your function accordingly.
I am trying to write a general function that will insert a line of data into a table in a database, where the data is an array of unknown length. I am aiming to be able to call this function from any program and write a line of data of any length to the table (assuming the table and the array are the same length).
I have tried adding the array as if it were a single piece of data.
import sqlite3

def add2Db(dbName, tableName, data):
    connection = sqlite3.connect(dbName)
    cur = connection.cursor()
    cur.execute("INSERT INTO " + tableName + " VALUES (?)", (data))
    connection.commit()
    connection.close()
add2Db("items.db", "allItems", (1, "chair", 5, 4))
This just crashes and gives me an error saying it has 4 columns but only one value was supplied.
SQLite does not support arrays - you would have to convert the array to TEXT, using ','.join() to join your array items into a single string, and pass that.
Source: SQLite website
https://www.sqlite.org/datatype3.html
I'm not a Python programmer, but I've been doing SQL a long time. I even wrote my own ORM. My advice is: do not write your own SQL query builder. There are a myriad of subtle issues, especially security issues. I elaborate on a few of them below.
Instead, use a well-established SQL Query Builder or ORM. They've already dealt with these issues. Here's an example using SQLAlchemy.
from datetime import date
from sqlalchemy import create_engine, MetaData
# Connect to the database with debugging on.
engine = create_engine('sqlite:///test.sqlite', echo=True)
conn = engine.connect()
# Read the schemas from the database
meta = MetaData()
meta.reflect(bind=engine)
# INSERT INTO users (name, birthday, state, country) VALUES (?, ?, ?, ?)
users = meta.tables['users']
conn.execute(
users.insert().values(name="Yarrow Hock", birthday=date(1977, 1, 23), state="NY", country="US")
)
SQLAlchemy can do the entire range of SQL operations and will work with different SQL variants. You also get type safety.
conn.execute(
users.insert().values(name="Yarrow Hock", birthday="in the past", state="NY", country="US")
)
sqlalchemy.exc.StatementError: (exceptions.TypeError) SQLite Date type only accepts Python date objects as input. [SQL: u'INSERT INTO users (name, birthday, state, country) VALUES (?, ?, ?, ?)']
insert into table values (...) relies on column definition order
This relies on the order in which the columns were defined in the schema, which leaves two problems. The first is readability.
add2Db(db, 'some_table', (1, 39, 99, 45, 'papa foxtrot', 0, 42, 0, 6))
What does any of that mean? A reader can't tell. They have to go digging into the schema and count columns to figure out what each value means.
Second is a maintenance problem. If, for any reason, the schema is altered and the column order is not exactly the same, this can lead to some extremely difficult to find bugs. For example...
create table users ( name text, birthday date, state text, country text );
vs
create table users ( name text, birthday date, country text, state text );
add2Db(db, 'users', ('Yarrow Hock', date(1977, 1, 23), 'NY', 'US'));
That insert will silently "work" with either column order.
You can fix this by passing in a dictionary and using the keys for column names.
add2Db(db, 'users', {'name': "Yarrow Hock", 'birthday': date(1977, 1, 23), 'state': "NY", 'country': "US"})
Then we'd produce a query like:
insert into users
(name, birthday, state, country)
values (?, ?, ?, ?)
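A rough sketch of that dictionary-based add2Db, assuming an sqlite3 connection named conn (note that the table and column names still get pasted into the SQL text, which is exactly the weakness discussed next):

def add2Db(connection, table_name, row):
    # row is a dict mapping column name -> value.
    columns = ', '.join(row.keys())
    placeholders = ', '.join('?' for _ in row)
    # The values are bound safely, but table_name and the column names are not.
    query = f"INSERT INTO {table_name} ({columns}) VALUES ({placeholders})"
    connection.execute(query, tuple(row.values()))
    connection.commit()

# Birthday stored as ISO text here, purely to keep the sketch simple.
add2Db(conn, 'users', {'name': "Yarrow Hock", 'birthday': "1977-01-23",
                       'state': "NY", 'country': "US"})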
This leads to the next and much bigger problem.
SQL Injection Attack
Now this opens up a new problem. If we simply stick the table and column names into the query that leaves us open to one of the most common security holes, a SQL Injection Attack. That's where someone can craft a value which when naively used in a SQL statement causes the query to do something else. Like Little Bobby Tables.
While the ? protects against SQL Injection for values, it's still possible to inject via the column names. There's no guarantee the column names can be trusted. Maybe they came from the parameters of a web form?
Protecting table and column names is complicated and easy to get wrong.
The more SQL you write the more likely you're vulnerable to an injection attack.
You have to write code for everything else.
Ok, you've done insert. Now update? select? Don't forget about subqueries, group by, unions, joins...
If you want to write a SQL query builder, cool! If, instead, you have a job to do using SQL, writing yet another SQL query builder is not your job.
It's harder for anyone else to understand.
There's a good chance that any given Python programmer knows how SQLAlchemy works, and there's plenty of tutorials and documentation if they don't. There's no chance they know about your home-rolled SQL functions, and you have to write all the tutorials and docs.
You shouldn't try to write your own ORM without a well-argued need. You will run into a lot of problems; for example, here are 25 quick reasons not to.
Instead, use a popular ORM that is proven. I recommend SQLAlchemy as a go-to outside of Django. Using it you can map a dict of values and insert it into a model, just like insert(schema_name).values(**dict_name) (here's an example of insert/update).
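A minimal sketch of that pattern in SQLAlchemy 1.4+ core style, reusing the reflected users table from the earlier example (the file name is a placeholder):

from sqlalchemy import create_engine, MetaData, insert

engine = create_engine("sqlite:///test.sqlite")
meta = MetaData()
meta.reflect(bind=engine)
users = meta.tables["users"]

row = {"name": "Yarrow Hock", "state": "NY", "country": "US"}

# engine.begin() opens a transaction and commits it automatically on success.
with engine.begin() as conn:
    conn.execute(insert(users).values(**row))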
Change your function to this:
def add2Db(dbName, tableName, data):
    # One "?" placeholder per value, so the statement matches the row length.
    qm = ','.join('?' * len(data))
    query = """
        INSERT INTO {table}
        VALUES ({qms})
    """.format(table=tableName, qms=qm)
    connection = sqlite3.connect(dbName)
    cur = connection.cursor()
    cur.execute(query, data)
    connection.commit()
    connection.close()
I want to run various select queries 100 million times, and I have approx. 1 million rows in a table. Therefore, I am looking for the fastest way to run all these select queries.
So far I have tried three different methods, and the results were similar.
The following three methods are, of course, not doing anything useful, but are purely for comparing performance.
First method:
for i in range(100000000):
    cur.execute("select id from testTable where name = 'aaa';")
Second method:
cur.execute("""PREPARE selectPlan AS
    SELECT id FROM testTable WHERE name = 'aaa' ;""")

for i in range(10000000):
    cur.execute("""EXECUTE selectPlan ;""")
Third method:
def _data(n):
    cur = conn.cursor()
    for i in range(n):
        yield (i, 'test')

sql = """SELECT id FROM testTable WHERE name = 'aaa' ;"""
cur.executemany(sql, _data(10000000))
And the table is created like this:
cur.execute("""CREATE TABLE testTable ( id int, name varchar(1000) );""")
cur.execute("""CREATE INDEX indx_testTable ON testTable(name)""")
I thought that using the prepared statement functionality would really speed up the queries, but since that doesn't seem to be the case, I hoped you could give me a hint on other ways of doing this.
This sort of benchmark is unlikely to produce any useful data, but the second method should be fastest, as once the statement is prepared it is stored in memory by the database server. Further calls to repeat the query do not require the text of the query to be transmitted, saving a small amount of time.
This is likely to be moot, as the query is very small (likely the same quantity of packets over the wire as sending the query text each time), and the query cache will serve the same data for every request.
What's the purpose of retrieving such an amount of data at once? I don't know your situation, but I'd definitely page the results using LIMIT and OFFSET. Take a look at:
7.6. LIMIT and OFFSET
If you just want to benchmark SQL all on its own, and not mix Python into the equation, try pgbench.
http://developer.postgresql.org/pgdocs/postgres/pgbench.html
Also what is your goal here?
Assume that I have a table user_count defined as follows:
id primary key, auto increment
user_id unique
count default 0
What I want to do is increment count by one when a record for the user already exists, or else insert a new record.
Currently, I do it this way (in Python):
try:
    cursor.execute("INSERT INTO user_count (user_id) VALUES (%s)", (user.id,))
except IntegrityError:
    cursor.execute("UPDATE user_count SET count = count + 1 WHERE user_id = %s", (user.id,))
And it can also be implemented this way:
cursor.execute("INSERT INTO user_count (user_id) VALUES (user_id) ON DUPLICATE KEY UPDATE count = count + 1", user.id)
What's the difference between these two ways, and which one is better?
The second one is a single SQL command which makes use of the feature that the database offers for solving exactly the problem you have here.
I'd use that, as it should be faster and more reliable.
The first one is a fallback if that feature is not available (older database version?).
The first one uses an exception to direct the flow of the program, which you shouldn't do unless you have no other solution (e.g. getting exclusive access to a file). It also takes work away from the database, which knows better how to handle the case.
The second one handles all the work in the database, which in turn can optimize the query plan very efficiently.
I would use the second solution as the database usually knows better than yourself how to handle a case.
I have a very large db that I am working with, and I need to know how to select a large set of ids which don't have any real pattern to them. This is the segment of code I have so far:
longIdList = [1, 3, 5, 8, ...]

for id in longIdList:
    sql = "select * from Table where id = %s" % id
    result = cursor.execute(sql)
    print result.fetchone()
I was thinking that there must be a quicker way of doing this... I mean, my script needs to search through a db that has over 4 million ids. Is there a way I can use a select command to grab them all in one shot? Could I use the where statement with a list of ids? Thanks
Yes, you can use SQL's IN() predicate to compare a column to a set of values. This is standard SQL and it's supported by every SQL database.
There may be a practical limit to the number of values you can put in an IN() predicate before it becomes too inefficient or simply exceeds a length limit on SQL queries. The largest practical list of values depends on what database you use (in Oracle it's 1000, MS SQL Server it's around 2000). My feeling is that if your list exceeds a few dozen values, I'd seek another solution.
For example, @ngroot suggests using a temp table in his answer. For an analysis of this solution, see this blog post by Stack Overflow regular @Quassnoi: Passing parameters in MySQL: IN list vs. temporary table.
Parameterizing a list of values into an SQL query in a safe way can be tricky. You should be mindful of the risk of SQL injection.
Also see this popular question on Stack Overflow: Parameterizing a SQL IN clause?
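One reasonably safe approach, sketched here with sqlite3-style ? placeholders (drivers such as MySQLdb or psycopg2 use %s instead), is to generate one placeholder per id and pass the list as parameters, keeping in mind the length limits mentioned above; cursor and longIdList are assumed from the question:

# One placeholder per id, so the values never get spliced into the SQL text.
placeholders = ', '.join('?' for _ in longIdList)
sql = f"select * from Table where id in ({placeholders})"
cursor.execute(sql, longIdList)
for row in cursor.fetchall():
    print(row)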
You can use IN to look for multiple items simultaneously:
SELECT * FROM Table WHERE id IN (x, y, z, ...)
So maybe something like:
sql = "select * from Table where id in (%s)" % (', '.join(str(id) for id in longIdList))
Serialize the list in some fashion (comma-separated or XML would be reasonable choices), then have a stored procedure on the other side that will deserialize the list into a temp table. You can then do an INNER JOIN against the temp table.
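A minimal sketch of the temp-table idea using sqlite3 (no stored procedure is needed there; the database file name is a placeholder and longIdList is the list from the question):

import sqlite3

conn = sqlite3.connect("mydata.db")
cur = conn.cursor()

# Load the ids into a temporary table, then join against it in a single query.
cur.execute("CREATE TEMP TABLE wanted_ids (id INTEGER PRIMARY KEY)")
cur.executemany("INSERT INTO wanted_ids (id) VALUES (?)",
                ((i,) for i in longIdList))

cur.execute("""
    SELECT t.*
    FROM "Table" t
    INNER JOIN wanted_ids w ON w.id = t.id
""")
for row in cur.fetchall():
    print(row)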