I'm using the DataStax Python Driver for Apache Cassandra.
I want to read 100 rows from the database and then insert them back after changing one value. I do not want to lose the existing records.
I know how to get my rows:
rows = session.execute('SELECT * FROM columnfamily LIMIT 100;')
for myrecord in rows:
    print(myrecord.timestamp)
I know how to insert new rows into database:
stmt = session.prepare('''
    INSERT INTO columnfamily (rowkey, qualifier, info, act_date, log_time)
    VALUES (?, ?, ?, ?, ?)
    IF NOT EXISTS
''')
results = session.execute(stmt, [arg1, arg2, ...])
My problems are that:
I do not know how to change only one value in a row.
I don't know how to insert rows into the database without using CQL. My column family has more than 150 columns, and writing all their names in the query does not seem like the best idea.
To conclude:
Is there a way to get rows, modify one value in each of them, and then insert these rows back into the database without writing the whole CQL statement by hand?
First, select only the columns you need from Cassandra - it will be faster to transfer the data. You need to include all columns of the primary key plus the column that you want to change.
After you get the data, you can use the UPDATE command to change only the necessary column (example from the documentation):
UPDATE cycling.cyclist_name
SET comments = 'Rides hard, gets along with others, a real winner'
WHERE id = fb372533-eb95-4bb4-8685-6ef61e994caa;
You can also use a prepared statement to make it more performant...
But be careful - UPDATE and INSERT in CQL are really upserts, so if you change a column that is part of the primary key, it will create a new entry instead of modifying the existing one...
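To tie this together, here is a minimal sketch with the Python driver. It assumes (rowkey, qualifier) is the full primary key and that info is the text column being changed; the contact point and keyspace name are placeholders, not details from the question:

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')   # placeholder contact point and keyspace

# Select only the primary-key columns plus the column to modify.
select_stmt = session.prepare(
    'SELECT rowkey, qualifier, info FROM columnfamily LIMIT 100')

# Update just that one column, keyed by the full primary key.
update_stmt = session.prepare(
    'UPDATE columnfamily SET info = ? WHERE rowkey = ? AND qualifier = ?')

for row in session.execute(select_stmt):
    new_info = row.info + '_changed'   # whatever change is needed; assumes info is text
    session.execute(update_stmt, (new_info, row.rowkey, row.qualifier))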
I have code that updates a few MySQL tables with data coming from a Sybase database. The table structures are exactly the same.
Since the number of tables may increase in the future, I wrote a Python script that loops over an array of table names, and based on the number of columns in each of those tables, the insert statement dynamically changes:
'''insert into databaseName.{} ({}) values ({})'''.format(table, columns, parameters)
As you can see, the value parameters are not hardcoded, which is why I can't simply modify this query to do an "ON DUPLICATE KEY UPDATE".
For example, the insert statement may look like:
insert into databaseName.table_foo (col1,col2,col3,col4,col5) values (%s,%s,%s,%s,%s)
or
insert into databaseName.table_bar (col1,col2,col3) values (%s,%s,%s)
How can I use "ON DUPLICATE KEY UPDATE" here to update the non-index columns with their corresponding non-index values?
I can update this question by including more details if needed.
The easiest solution is this:
'''replace into databaseName.{} ({}) values ({})'''.format(table, columns, parameters)
This works similarly to IODKU, in that if the values conflict with a PRIMARY KEY or UNIQUE KEY of the table, it replaces the row, overwriting the other columns, instead of causing a duplicate key error.
The difference is that REPLACE does a DELETE of the old row followed by an INSERT of the new row. Whereas IODKU does either an INSERT or an UPDATE. We know this because if you create triggers on the table, you'll see which triggers are activated.
Anyway, using REPLACE would make your task a lot simpler in this case.
If you must use IODKU, you need to append the update clause at the end of the statement. Unfortunately, there is no syntax for "assign all the columns respectively to the new row's values"; you must assign them individually.
For MySQL 8.0.19 or later use this syntax:
INSERT INTO t1 (a,b,c) VALUES (?,?,?) AS new
ON DUPLICATE KEY UPDATE a = new.a, b = new.b, c = new.c;
In earlier MySQL, use this syntax:
INSERT INTO t1 (a,b,c) VALUES (?,?,?)
ON DUPLICATE KEY UPDATE a = VALUES(a), b = VALUES(b), c = VALUES(c);
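Since your script already knows the column list, the update clause can be generated the same way as the column and parameter lists. A rough sketch, where table, columns, rows, and cursor stand in for your own variables:

# Build the IODKU statement from the same column list used for the INSERT.
table = "table_foo"                                  # example values
columns = ["col1", "col2", "col3", "col4", "col5"]
rows = [(1, "a", "b", "c", "d"), (2, "e", "f", "g", "h")]

col_list = ", ".join(columns)
placeholders = ", ".join(["%s"] * len(columns))

# Pre-8.0.19 style; on 8.0.19+ you could generate "col = new.col" with a row alias instead.
# The key columns may be skipped here, since reassigning them on a conflict is a no-op.
update_clause = ", ".join("{0} = VALUES({0})".format(c) for c in columns)

sql = ('''insert into databaseName.{} ({}) values ({}) '''
       '''on duplicate key update {}''').format(table, col_list, placeholders, update_clause)

cursor.executemany(sql, rows)   # cursor comes from your MySQL connection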
I have created a database using sqlite3 in python that has thousands of tables. Each of these tables contains thousands of rows and ten columns. One of the columns is the date and time of an event: it is a string that is formatted as YYYY-mm-dd HH:MM:SS, which I have defined to be the primary key for each table. Every so often, I collect some new data (hundreds of rows) for each of these tables. Each new dataset is pulled from a server and loaded in directly as a pandas data frame or is stored as a CSV file. The new data contains the same ten columns as my original data. I need to update the tables in my database using this new data in the following way:
Given a table in my database, for each row in the new dataset, if the date and time of the row matches the date and time of an existing row in my database, update the remaining columns of that row using the values in the new dataset.
If the date and time does not yet exist, create a new row and insert it into my database.
Below are my questions:
I've done some searching on Google and it looks like I should be using the UPSERT (merge) functionality of sqlite but I can't seem to find any examples showing how to use it. Is there an actual UPSERT command, and if so, could someone please provide an example (preferably with sqlite3 in Python) or point me to a helpful resource?
Also, is there a way to do this in bulk so that I can UPSERT each new dataset into my database without having to go row by row? (I found this link, which suggests that it is possible, but I'm new to using databases and am not sure how to actually run the UPSERT command.)
Can UPSERT also be performed directly using pandas.DataFrame.to_sql?
My backup solution is to load the table to be upserted using pd.read_sql_query("SELECT * from table", con), perform a pandas.DataFrame.merge, delete that table from the database, and then write the updated table back using pd.DataFrame.to_sql (but this would be inefficient).
Instead of going through the UPSERT command, why don't you write your own logic that updates a row if its date and time is found and inserts a new row otherwise? Check out the code I wrote for you, and let me know if you are still confused. You can even do this for hundreds of tables by replacing the table name with a variable and looping over your list of table names.
import sqlite3
import pandas as pd

csv_data = pd.read_csv("my_CSV_file.csv")  # your CSV file path

def manual_upsert():
    con = sqlite3.connect(connection_str)  # connection_str: path to your SQLite database
    cur = con.cursor()
    cur.execute("SELECT * FROM my_CSV_data")  # read the existing table
    data = cur.fetchall()

    old_data_list = []  # collection of all dates already in the database table
    for line in data:
        old_data_list.append(line[0])  # assuming your date column is at index 0

    # Iterate over the DataFrame rows (iterating the DataFrame directly
    # would yield column names, not rows).
    for new_data in csv_data.itertuples(index=False):
        if new_data[0] in old_data_list:
            # Update the remaining columns when the date already exists.
            cur.execute(
                "UPDATE my_CSV_data SET column1=?, column2=?, column3=? WHERE my_date_column=?",
                (new_data[1], new_data[2], new_data[3], new_data[0]))
        else:
            # Insert a new row when the date is not found.
            cur.execute(
                "INSERT INTO my_CSV_data VALUES (?,?,?,?)",
                (new_data[0], new_data[1], new_data[2], new_data[3]))
    con.commit()
    con.close()

manual_upsert()
First, even though the questions are related, ask them separately in the future.
The SQLite documentation covers UPSERT handling and how to use it, but it is a bit abstract. You can find examples and discussion here: SQLite - UPSERT *not* INSERT or REPLACE
Use a transaction and the statements will be executed in bulk.
As the existence of that library suggests, to_sql does not create UPSERT statements (only INSERT).
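For reference, here is a minimal sketch of SQLite's native UPSERT (available since SQLite 3.24.0) used from Python's sqlite3 module; the readings table, its columns, and the sample rows are invented for illustration:

import sqlite3

con = sqlite3.connect("my_database.db")   # hypothetical database path

# Hypothetical table: event_time is the primary key, col_a/col_b hold the data.
con.execute('''CREATE TABLE IF NOT EXISTS readings (
                   event_time TEXT PRIMARY KEY,
                   col_a REAL,
                   col_b REAL)''')

new_rows = [("2024-01-01 00:00:00", 1.0, 2.0),
            ("2024-01-01 01:00:00", 3.0, 4.0)]

# On a primary-key conflict, update the other columns from the incoming row;
# the special "excluded" table holds the values that failed to insert.
con.executemany(
    '''INSERT INTO readings (event_time, col_a, col_b) VALUES (?, ?, ?)
       ON CONFLICT(event_time) DO UPDATE SET
           col_a = excluded.col_a,
           col_b = excluded.col_b''',
    new_rows)
con.commit()

Wrapped in a single transaction like this, executemany covers the bulk case without committing row by row.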
I need to insert into three columns of a MySQL table at a time. The first two columns are filled by selecting data from other tables with a SELECT statement, while the third column's value is supplied directly and doesn't need any SELECT. I don't know the syntax for this in MySQL. pos is an array and I need to insert it at the same time.
Here is my SQL command in Python:
sql="insert into quranic_index_2(quran_wordid,translationid,pos) select quranic_words.wordid,quran_english_translations.translationid from quranic_words, quran_english_translation where quranic_words.lemma=%s and quran_english_translations.verse_no=%s and
quran_english_translations.translatorid="%s,values(%s)"
data=l,words[2],var1,words[i+1]
r=cursor.execute(sql,data)
data passes in the variables that hold all the values for the query. words[i+1] holds the value for pos.
Try a query like the sample below:
INSERT INTO table_name (field_1, field_2, field_3) VALUES
('value_1', (SELECT value_2 FROM user_table), 'value_3');
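Applied to the question's tables, the directly supplied value can simply be passed as another placeholder while the first two values come from scalar subqueries. This is only a sketch reusing the question's names (quranic_index_2, quranic_words, quran_english_translations, and the variables l, words, var1); each subquery must return exactly one value:

sql = """
    INSERT INTO quranic_index_2 (quran_wordid, translationid, pos)
    VALUES (
        (SELECT wordid FROM quranic_words WHERE lemma = %s),
        (SELECT translationid FROM quran_english_translations
          WHERE verse_no = %s AND translatorid = %s),
        %s
    )
"""
# The parameter order matches the placeholders above; words[i+1] supplies pos.
data = (l, words[2], var1, words[i + 1])
cursor.execute(sql, data)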
I am using the Python code below to update a Postgres DB column value based on id. This loop has to run for thousands of records and it is taking a long time.
Is there a way to pass an array of dataframe values instead of looping over each row?
for i in range(0, len(df)):
    QUERY = """ UPDATE "Table" SET "value"='%s' WHERE "Table"."id"='%s'
            """ % (df['value'][i], df['id'][i])
    cur.execute(QUERY)
conn.commit()
It depends on the library you use to communicate with PostgreSQL, but bulk loads are usually much faster via the COPY FROM command.
If you use psycopg2 it is as simple as the following:
cursor.copy_from(io.StringIO(string_variable), "destination_table", columns=('id', 'value'))
where string_variable is a tab- and newline-delimited dataset like 1\tvalue1\n2\tvalue2\n.
To achieve a performant bulk update I would do:
Create a temporary table that has the same id and value columns as the target, e.g. CREATE TEMPORARY TABLE tmp_table (id text, value text);
Insert records with copy_from;
Just update the destination table with the query UPDATE destination_table SET value = t.value FROM tmp_table t WHERE destination_table.id = t.id, or any other preferred syntax
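A minimal sketch of those three steps with psycopg2, reusing the question's df, cur, and conn; the temporary table's column types are assumptions, so match them to your real "Table":

import io

# Build the tab/newline-delimited payload from the dataframe
# (assumes the values contain no tabs or newlines).
buf = io.StringIO()
for row_id, value in zip(df['id'], df['value']):
    buf.write("{}\t{}\n".format(row_id, value))
buf.seek(0)

# 1. Temporary table (column types assumed; use the same types as the real table).
cur.execute("CREATE TEMPORARY TABLE tmp_table (id text, value text)")

# 2. Bulk-load the new values with COPY.
cur.copy_from(buf, 'tmp_table', columns=('id', 'value'))

# 3. One UPDATE joining against the temporary table.
cur.execute('''
    UPDATE "Table"
       SET "value" = t.value
      FROM tmp_table AS t
     WHERE "Table"."id" = t.id
''')
conn.commit()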
Similar to my last question, but I ran into a problem. Say I have a simple dictionary like the one below, but much bigger. When I try inserting a big dictionary using the method below, I get an operational error on c.execute(schema) for having too many columns. What should my alternative method be for populating the database's columns? Using the ALTER TABLE command and adding each one individually?
import sqlite3

con = sqlite3.connect('simple.db')
c = con.cursor()

dic = {
    'x1': {'y1': 1.0, 'y2': 0.0},
    'x2': {'y1': 0.0, 'y2': 2.0, 'joe bla': 1.5},
    'x3': {'y2': 2.0, 'y3 45 etc': 1.5},
}

# 1. Find the unique column names.
columns = set()
for _, cols in dic.items():
    for key, _ in cols.items():
        columns.add(key)

# 2. Create the schema.
col_defs = [
    # Start with the column for our key name
    '"row_name" VARCHAR(2) NOT NULL PRIMARY KEY'
]
for column in columns:
    col_defs.append('"%s" REAL NULL' % column)
schema = "CREATE TABLE simple (%s);" % ",".join(col_defs)
c.execute(schema)

# 3. Loop through each row.
for row_name, cols in dic.items():
    # Compile the data we have for this row.
    col_names = cols.keys()
    col_values = [str(val) for val in cols.values()]
    # Insert it.
    sql = 'INSERT INTO simple ("row_name", "%s") VALUES ("%s", "%s");' % (
        '","'.join(col_names),
        row_name,
        '","'.join(col_values)
    )
    c.execute(sql)
con.commit()
If I understand you right, you're not trying to insert thousands of rows, but thousands of columns. SQLite has a limit on the number of columns per table (by default 2000), though this can be adjusted if you recompile SQLite. Never having done this, I do not know if you then need to tweak the Python interface, but I'd suspect not.
You probably want to rethink your design. Any non-data warehouse / OLAP application is highly unlikely to need or be terribly efficient with thousands of columns (rows, yes) and SQLite is not a good solution for a data warehouse / OLAP type situation. You may get a bit further with something like an entity-attribute-value setup (not a normal recommendation for genuine relational databases, but a valid application data model and much more likely to accommodate your needs without pushing the limits of SQLite too far).
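For illustration, a minimal sketch of an entity-attribute-value layout for the dictionary from the question; the table name simple_eav and the database file are invented for this example:

import sqlite3

con = sqlite3.connect('simple_eav.db')   # hypothetical database file
c = con.cursor()

# One narrow table instead of thousands of columns:
# each (row, attribute) pair becomes its own row.
c.execute('''CREATE TABLE IF NOT EXISTS simple_eav (
                 row_name  TEXT NOT NULL,
                 attribute TEXT NOT NULL,
                 value     REAL,
                 PRIMARY KEY (row_name, attribute))''')

dic = {
    'x1': {'y1': 1.0, 'y2': 0.0},
    'x2': {'y1': 0.0, 'y2': 2.0, 'joe bla': 1.5},
    'x3': {'y2': 2.0, 'y3 45 etc': 1.5},
}

rows = [(row_name, attr, val)
        for row_name, cols in dic.items()
        for attr, val in cols.items()]
c.executemany('INSERT INTO simple_eav (row_name, attribute, value) VALUES (?, ?, ?)', rows)
con.commit()

# Reassembling one logical "row" is then just a filter on row_name.
print(c.execute("SELECT attribute, value FROM simple_eav WHERE row_name = 'x2'").fetchall())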
If you really are adding a massive number of rows and are running into problems, maybe your single transaction is getting too large.
Do a COMMIT (commit()) after a given number of lines (or even after each insert as a test) if that is acceptable.
Thousands of rows should be easily doable with SQLite. Getting to millions and above, at some point you might need something more; it depends on a lot of things, of course.
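A small sketch of committing in batches; rows, insert_sql, c, and con are stand-ins for your own data, statement, cursor, and connection:

BATCH_SIZE = 1000                 # tune to taste

for i, row in enumerate(rows, start=1):
    c.execute(insert_sql, row)
    if i % BATCH_SIZE == 0:
        con.commit()              # keep each transaction to a bounded size
con.commit()                      # commit the final partial batch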