I have an xlsx file, and I have loaded most of the sheets into Oracle in a transposed manner.
Example:

Xlsx:
Roll No | Name     | Age | Job
1       | Harshita | 25  | IT

Oracle:
Roll No | Parame | Param_Value
1       | Name   | Harshita
1       | Age    | 25
1       | Job    | IT
Now, in the Oracle table, roll no 1 has three rows, whereas in the xlsx it has one.
How can I verify that the count is right and the values are the same when comparing the table data with the sheets?
Is there a way to write a script to compare the sheets with the tables after loading the data?
I tried macros, but that is not something I can put in prod since the sheets will change every six months, and I also tried a few queries like GROUP BY.
One way to verify the count and the values is to use SQL queries to compare the data in the Oracle table against the xlsx sheet.
First, create a new table in Oracle that has the same structure as the xlsx sheet.
Import the xlsx data into this new table.
Then use a SQL query to compare the row counts of both tables and to verify that the values in both tables are the same. For example:
SELECT 'xlsx_table' AS source, COUNT(*) AS row_count FROM xlsx_table
UNION ALL
SELECT 'oracle_table', COUNT(*) FROM oracle_table;
If the count is not the same, then use another query to find out which rows are missing:
SELECT * FROM xlsx_table
MINUS
SELECT * FROM oracle_table;
If the values are not the same, then use another query to find out which values are different:
SELECT x.*, o.*
FROM xlsx_table x
JOIN oracle_table o
  ON x.roll_no = o.roll_no
WHERE x.name != o.name
   OR x.age != o.age
   OR x.job != o.job;
By using these queries, you can verify the count and values of the data and make sure that the data has been loaded correctly.
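Note that because each spreadsheet row is transposed into several key/value rows in Oracle, the raw counts will never match one-to-one; the sheet has to be unpivoted into the same key/value shape before comparing. The script below is only a rough sketch of that idea in Python: the file name, sheet name, connection details, and the table/column names (oracle_table, roll_no, param_name, param_value) are placeholders, and it assumes pandas (with openpyxl) and python-oracledb are installed.

import pandas as pd
import oracledb  # python-oracledb (successor of cx_Oracle)

# 1. Read the sheet and unpivot it into (roll_no, param, value) rows,
#    i.e. the same shape as the key/value table in Oracle.
sheet = pd.read_excel("students.xlsx", sheet_name="Sheet1", dtype=str)
melted = sheet.melt(id_vars=["Roll No"], var_name="PARAM", value_name="PARAM_VALUE")
melted = melted.rename(columns={"Roll No": "ROLL_NO"}).astype(str)

# 2. Pull the rows that were loaded into Oracle (all names are placeholders).
con = oracledb.connect(user="scott", password="tiger", dsn="dbhost/orclpdb1")
cur = con.cursor()
cur.execute("SELECT roll_no, param_name, param_value FROM oracle_table")
loaded = pd.DataFrame(cur.fetchall(),
                      columns=["ROLL_NO", "PARAM", "PARAM_VALUE"]).astype(str)
con.close()

# 3. Count check: after unpivoting, the two sides should match row for row.
print("sheet rows:", len(melted), "| table rows:", len(loaded))

# 4. Value check: anything present on only one side is a mismatch.
#    (Everything is compared as strings; adjust if numeric formatting differs.)
diff = melted.merge(loaded, how="outer",
                    on=["ROLL_NO", "PARAM", "PARAM_VALUE"], indicator=True)
print(diff[diff["_merge"] != "both"])

An empty diff together with equal row counts means every cell from the sheet made it into the table with the same value.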
Related
I am quite new to Python. I have a table that I want to update daily. I get a csv file with a large amount of data, about 15000 entries. Each row from the csv file has to be inserted into my table. But if a specific value from the file matches the primary key of any of the rows, then I want to delete that row from the table and instead insert the corresponding row from the csv file. So, for example, if my csv file is like this:
001|test1|test11|test111
002|test2|test22|test222
003|test3|test33|test333
And if my table has a row with primary key column value 002, then that row should be deleted and the corresponding row from the file inserted.
I have no idea how many rows with values matching the primary key I might get in that csv each day. I know this can be done with a MERGE query, but I am not really sure whether it will take longer than any other method, and it would also require me to create a temp table and truncate it every time. The same goes for WHERE EXISTS; I would need a temp table.
What is the most efficient way to do this task?
I am using Python 2.7.5 and SQL Server 2017
I think using a MERGE statement is the optimal solution. Create a stage table matching your target table, truncate it, and insert the csv into the stage table. If your SQL Server instance has access to the file, you can use BULK INSERT or OPENROWSET to load it; otherwise use Python. To load the staged data into the target table, use a MERGE statement.
If your table has the columns Id, Col1, Col2, and Col3, then it is something like this:
MERGE INTO dbo.MyTable AS TargetTable
USING
(
    SELECT Id, Col1, Col2, Col3
    FROM dbo.stage_MyTable
) AS SourceTable
ON TargetTable.Id = SourceTable.Id
WHEN MATCHED THEN UPDATE SET
    Col1 = SourceTable.Col1,
    Col2 = SourceTable.Col2,
    Col3 = SourceTable.Col3
WHEN NOT MATCHED BY TARGET THEN INSERT
    (Id, Col1, Col2, Col3)
VALUES
    (SourceTable.Id, SourceTable.Col1, SourceTable.Col2, SourceTable.Col3)
;
The benefit of this approach is that the statement executes as a single transaction, so if there are duplicate rows or similar problems, the table is rolled back to its previous state.
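If the instance cannot see the file and the staging load has to happen from Python, a minimal sketch with pyodbc could look like the following; the connection string, file path, and pipe delimiter are assumptions based on the sample rows in the question, and the MERGE text is simply the statement above.

import csv
import pyodbc

# Placeholder connection string and file path
conn = pyodbc.connect("DSN=mydsn;UID=myuser;PWD=mypassword")
cur = conn.cursor()

# Refresh the staging table, then bulk-bind the CSV rows into it.
cur.execute("TRUNCATE TABLE dbo.stage_MyTable")
with open(r"C:\data\daily_feed.csv") as f:
    rows = list(csv.reader(f, delimiter="|"))  # e.g. 001|test1|test11|test111

cur.fast_executemany = True  # lets pyodbc send the parameters in batches
cur.executemany(
    "INSERT INTO dbo.stage_MyTable (Id, Col1, Col2, Col3) VALUES (?, ?, ?, ?)",
    rows,
)

# Apply the MERGE shown above (pasted here verbatim) and commit everything together.
MERGE_SQL = """
MERGE INTO dbo.MyTable AS TargetTable
USING (SELECT Id, Col1, Col2, Col3 FROM dbo.stage_MyTable) AS SourceTable
    ON TargetTable.Id = SourceTable.Id
WHEN MATCHED THEN UPDATE SET
    Col1 = SourceTable.Col1, Col2 = SourceTable.Col2, Col3 = SourceTable.Col3
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, Col1, Col2, Col3)
    VALUES (SourceTable.Id, SourceTable.Col1, SourceTable.Col2, SourceTable.Col3);
"""
cur.execute(MERGE_SQL)
conn.commit()
conn.close()

Because pyodbc leaves autocommit off by default, the TRUNCATE, the inserts, and the MERGE are all committed together at the end.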
I have created a database using sqlite3 in python that has thousands of tables. Each of these tables contains thousands of rows and ten columns. One of the columns is the date and time of an event: it is a string that is formatted as YYYY-mm-dd HH:MM:SS, which I have defined to be the primary key for each table. Every so often, I collect some new data (hundreds of rows) for each of these tables. Each new dataset is pulled from a server and loaded in directly as a pandas data frame or is stored as a CSV file. The new data contains the same ten columns as my original data. I need to update the tables in my database using this new data in the following way:
Given a table in my database, for each row in the new dataset, if the date and time of the row matches the date and time of an existing row in my database, update the remaining columns of that row using the values in the new dataset.
If the date and time does not yet exist, create a new row and insert it to my database.
Below are my questions:
I've done some searching on Google and it looks like I should be using the UPSERT (merge) functionality of sqlite but I can't seem to find any examples showing how to use it. Is there an actual UPSERT command, and if so, could someone please provide an example (preferably with sqlite3 in Python) or point me to a helpful resource?
Also, is there a way to do this in bulk so that I can UPSERT each new dataset into my database without having to go row by row? (I found this link, which suggests that it is possible, but I'm new to using databases and am not sure how to actually run the UPSERT command.)
Can UPSERT also be performed directly using pandas.DataFrame.to_sql?
My backup solution is loading in the table to be UPSERTed using pd.read_sql_query("SELECT * from table", con), performing pandas.DataFrame.merge, deleting the said table from the database, and then adding in the updated table to the database using pd.DataFrame.to_sql (but this would be inefficient).
Instead of going through the UPSERT command, why don't you create your own algorithm that finds each value and replaces it if the date & time is found, and otherwise inserts a new row? Check out the code I wrote for you. Let me know if you are still confused. You can even do this for hundreds of tables by replacing the table name in the algorithm with a variable and changing it over the whole list of your table names.
import sqlite3
import pandas as pd

csv_data = pd.read_csv("my_CSV_file.csv")  # Your CSV data path

def manual_upsert():
    con = sqlite3.connect(connection_str)  # connection_str = path to your .db file
    cur = con.cursor()
    cur.execute("SELECT * FROM my_CSV_data")  # Existing rows in the table
    data = cur.fetchall()
    old_data_list = []  # Collection of all dates already in the database table
    for line in data:
        old_data_list.append(line[0])  # I suppose your date column is at index 0
    for new_data in csv_data.values.tolist():  # iterate rows, not column names
        if new_data[0] in old_data_list:
            # Update the other columns if the date is already present
            cur.execute(
                "UPDATE my_CSV_data SET column1=?, column2=?, column3=? WHERE my_date_column=?",
                (new_data[1], new_data[2], new_data[3], new_data[0]),
            )
        else:
            # Insert a new row if the date is not found
            cur.execute(
                "INSERT INTO my_CSV_data VALUES(?,?,?,?)",
                (new_data[0], new_data[1], new_data[2], new_data[3]),
            )
    con.commit()
    con.close()

manual_upsert()
First, even though the questions are related, ask them separately in the future.
There is documentation on UPSERT handling in SQLite that explains how to use it, but it is a bit abstract. You can find examples and discussion here: SQLite - UPSERT *not* INSERT or REPLACE
Use a transaction and the statements will be executed in bulk.
As the presence of this library suggests, to_sql does not create UPSERT commands (only INSERT).
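For reference, here is a minimal sqlite3 sketch of such a bulk upsert. The table name events, the key column dt, and the other column names are invented, and the ON CONFLICT clause needs SQLite 3.24 or newer (bundled with Python 3.7+).

import sqlite3
import pandas as pd

# Invented names: table "events" keyed on the timestamp string "dt",
# with only two of the other nine columns shown for brevity.
new_data = pd.read_csv("new_batch.csv")  # or the DataFrame pulled from the server

con = sqlite3.connect("mydata.db")
upsert_sql = """
    INSERT INTO events (dt, col1, col2)
    VALUES (?, ?, ?)
    ON CONFLICT(dt) DO UPDATE SET
        col1 = excluded.col1,
        col2 = excluded.col2
"""
# dt must be the primary key (as in the question) for ON CONFLICT(dt) to apply.
# executemany sends the whole batch inside a single transaction, so there is
# no Python-level row-by-row commit.
con.executemany(upsert_sql, new_data[["dt", "col1", "col2"]].values.tolist())
con.commit()
con.close()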
I have a list of around 6 million entries in a text file. I have to check against a table to return ALL rows that are in the text file. For that purpose I want to use SELECT ... IN. I want to know whether it is OK to convert all of them into a single query and run it.
I am using MySQL.
You can create a temporary table or variable in the database, insert the values into that table or variable, and then perform the IN operation as shown below.
SELECT field
FROM table
WHERE value IN (SELECT somevalue FROM sometable)
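For example, a rough end-to-end sketch with Python and MySQL Connector; the connection details, file name, column size, and chunk size are all made up, and for six million values LOAD DATA INFILE into the temporary table would likely be faster than executemany.

import mysql.connector

# Placeholder connection details
con = mysql.connector.connect(host="localhost", user="app",
                              password="secret", database="mydb")
cur = con.cursor()

# 1. Stage the values from the text file in a temporary table.
cur.execute("CREATE TEMPORARY TABLE tmp_values (val VARCHAR(100), INDEX (val))")
batch = []
with open("entries.txt") as f:
    for line in f:
        batch.append((line.strip(),))
        if len(batch) == 10000:  # insert in chunks instead of 6M single-row inserts
            cur.executemany("INSERT INTO tmp_values (val) VALUES (%s)", batch)
            batch = []
if batch:
    cur.executemany("INSERT INTO tmp_values (val) VALUES (%s)", batch)
con.commit()

# 2. Run the IN query against the staged values instead of a 6M-item literal list.
cur.execute("""
    SELECT field
    FROM mytable
    WHERE value IN (SELECT val FROM tmp_values)
""")
matches = cur.fetchall()
print(len(matches), "matching rows")
con.close()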
Thanks
I have a database, which I store as a .db file on my disk. I implemented all the functions necessary for managing this database using sqlite3. However, I noticed that updating the rows in the table takes a large amount of time. My database currently has 608042 rows. The database has one table - let's call it Table1. This table consists of the following columns:
id | name | age | address | job | phone | income
(The id value is generated automatically when a row is inserted into the database.)
After reading in all the rows, I perform some operations (ML algorithms for predicting the income) on the values from the rows, and then I have to update the value of income for each row (thus, for each of the 608042 rows I perform an SQL UPDATE operation).
In order to update, I'm using the following function (copied from my class):
def update_row(self, new_value, idkey):
    update_query = "UPDATE Table1 SET income = ? WHERE name = ?"
    self.cursor.execute(update_query, (new_value, idkey))
    self.db.commit()
And I call this function for each person registered in the database.
for each i out of 608042 rows:
    update_row(new_income_i, i.name)
(values of new_income_i are different for each i).
This takes a huge amount of time, even though the dataset is not giant. Is there any way to speed up the updating of the database? Should I use something else than sqlite3? Or should I instead of storing the database as a .db file store it in memory (using sqlite3.connect(":memory:"))?
Each UPDATE statement must scan the entire table to find any row(s) that match the name.
An index on the name column would prevent this and make the search much faster. (See Query Planning and How does database indexing work?)
However, if the name column is not unique, then that value is not even suitable to find individual rows: each update with a duplicate name would modify all rows with the same name. So you should use the id column to identify the row to be updated; and as the primary key, this column already has an implicit index.
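Concretely, something along these lines; this is only a sketch, and it assumes the predicted incomes have already been collected as (new_income, id) pairs.

import sqlite3

con = sqlite3.connect("people.db")
cur = con.cursor()

# Option A: keep updating by name, but give the name column an index
# so each UPDATE no longer scans all 608042 rows.
cur.execute("CREATE INDEX IF NOT EXISTS idx_table1_name ON Table1(name)")

# Option B (better, as explained above): update by the primary-key id,
# which already has an implicit index and identifies exactly one row.
updates = [(52000.0, 1), (61000.0, 2)]  # placeholder (new_income, id) pairs
cur.executemany("UPDATE Table1 SET income = ? WHERE id = ?", updates)

con.commit()  # one commit for the whole batch instead of one per row
con.close()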
I am trying to upload data from a csv file (its on my local desktop) to my remote SQL database. This is my query
dsn = "dsnname"; pwd = "password"
import pyodbc
csv_data = open(r'C:\Users\folder\Desktop\filename.csv')

def func(dsn):
    cnnctn = pyodbc.connect(dsn)
    cnnctn.autocommit = True
    cur = cnnctn.cursor()
    for rows in csv_data:
        cur.execute("insert into database.tablename (colname) value(?)", rows)
        cur.commit()
    cnnctn.commit()
    cur.close()
    cnnctn.close()
    return()

c = func(dsn)
The problem is that all of my data gets uploaded into the one column that I specified. If I don't specify a column name, it won't run. I have 9 columns in my database table, and I want to upload this data into separate columns.
When you insert with SQL, you need to make sure you specify which columns you want to insert into. For example, when you execute:
INSERT INTO table (column_name) VALUES (val);
You are letting SQL know that you want to map column_name to val for that specific row. So, you need to make sure that the number of columns in the first parentheses matches the number of values in the second set of parentheses.
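So the fix has two parts: split each CSV line into its individual fields, and list one column (with one ? placeholder) per field in the INSERT. A rough sketch, with invented column names (extend both lists to all nine columns) and the placeholder table name from the question:

import csv
import pyodbc

cnnctn = pyodbc.connect("DSN=dsnname;PWD=password")  # illustrative connection string
cur = cnnctn.cursor()

with open(r"C:\Users\folder\Desktop\filename.csv") as f:
    reader = csv.reader(f)  # splits every line into a list of fields
    for row in reader:
        cur.execute(
            "INSERT INTO database.tablename (col1, col2, col3) VALUES (?, ?, ?)",
            row[0], row[1], row[2],  # one value per column, in the same order
        )

cnnctn.commit()
cnnctn.close()

If the file has a header line, skip it first with next(reader).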