I'd like to create a snapshot of a database periodically, execute some queries on the snapshot data to generate input for the next step, and finally discard the snapshot.
Currently I read all the data from the database into an in-memory Python data structure (a dict) and execute the queries (implemented in my own code) on that structure.
The program has a bottleneck in the "execute query" step now that the data size has grown.
How can I query a data snapshot elegantly? Thanks for any advice.
You can get all tables from your database with
SHOW TABLES FROM <yourDBname>
After that you can create copies of the tables in a new database via
CREATE TABLE copy.tableA AS SELECT * FROM <yourDBname>.tableA
Afterwards you can query the copy database instead of the real data.
If you run queries on the copied tables, please add indexes, since they are not copied.
I am trying to fetch data from a SQL Server database (just a simple SELECT * query).
The table contains around 3-5 million records. Performing a SELECT * directly on SQL Server using SSMS takes around 11-15 minutes.
However, when I connect via Python and try to save the data into a pandas dataframe, it takes forever: more than an hour.
Here is the code I am using:
import pymssql
import pandas as pd
from datetime import datetime

startTime = datetime.now()
# Instantiate a Python DB connection object (same form as the psycopg2/python-mysql drivers)
conn = pymssql.connect(server=r"xyz", database="abc", user="user", password="pwd")
print('Connecting to DB: ', datetime.now() - startTime)
stmt = "SELECT * FROM BIG_TABLE;"
# Execute query here
df_big_table = pd.read_sql(stmt, conn)
There must be a better way to do this? Perhaps parallel processing or something similar to fetch the data quickly.
My end goal is to migrate this table from SQL Server to Postgres.
This is the way I am doing it:
Fetch data from SQL Server using Python.
Save it to a pandas dataframe.
Save this data in CSV to disk.
Copy the CSV from disk to Postgres.
Probably I can combine steps 3 and 4 so that I can do the transition in memory, rather than using disk IO.
There are many complexities, like table constraints and definitions, etc., which I will take care of later on. I cannot use a third-party tool.
I am stuck at steps 1 and 2, so help with the Python script (or some other open-source language) would be really appreciated.
If there is any other way to reach my end goal, I welcome suggestions!
Have you tried using the 'chunksize' option of pandas.read_sql? You can read everything into a single dataframe that way and produce the CSV.
If that takes too much time, you can split each chunk into multiple files by using pandas.read_sql as an iterator; then, after you have done your work, consolidate those files into a single one and submit it to Postgres.
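A minimal sketch of the chunked approach, demonstrated on an in-memory SQLite table standing in for the SQL Server one (the table name, chunk size, and file path are illustrative; with SQL Server you would pass the pymssql connection instead):

```python
import os
import sqlite3
import tempfile
import pandas as pd

# Build a stand-in table so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER, val TEXT)")
conn.executemany("INSERT INTO big_table VALUES (?, ?)",
                 [(i, f"row{i}") for i in range(10_000)])
conn.commit()

csv_path = os.path.join(tempfile.mkdtemp(), "big_table.csv")

# chunksize turns read_sql into an iterator of DataFrames, so only one
# chunk is in memory at a time; each chunk is appended to the same CSV,
# writing the header only for the first chunk.
total = 0
for i, chunk in enumerate(
        pd.read_sql("SELECT * FROM big_table", conn, chunksize=2_000)):
    chunk.to_csv(csv_path, mode="a", header=(i == 0), index=False)
    total += len(chunk)
```

On the Postgres side, the resulting file can then be loaded in bulk with COPY (or psql's \copy), which is much faster than row-by-row inserts.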
We're doing streaming inserts on a BigQuery table.
We want to update the schema of a table without changing its name.
For example, we want to drop a column because it has sensitive data but we want to keep all the other data and the table name the same.
Our process is as follows:
copy original table to temp table
delete original table
create new table with original table name and new schema
populate new table with old table's data
cry because the last (up to) 90 minutes of data is stuck in streaming buffer and was not transferred.
How to avoid the last step?
I believe the new streaming API does not use the streaming buffer anymore. Instead, it writes data directly to the destination table.
To enable the API you have to enroll via the BigQuery Streaming V2 Beta Enrollment Form.
You can find out more at the following link.
I hope it addresses your case.
I have a specific problem where I have to query different databases to show results in a dashboard. The tables I have to query in those databases are exactly the same. The number of databases can be at most 50 and at minimum 5.
NOTE: I can't put all the data in the same database.
I am using Postgres and Django. I am not able to work out how to query those databases to get the data. I also need to filter, aggregate, and sort the data, and show 10-100 results based on the search query params.
APPROACH that I have in mind:
Loop through all the databases, fetch the data based on the search params, and order it by created date. After that, take 10-100 results as per the search params.
I am not able to work out what the correct approach is and how it should be done, considering speed and reliability.
I am open to using another database for temporary storage, or any other ideas.
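The loop-through-all-databases approach can stay cheap if each database does its own filtering, ordering, and LIMIT, and Python only merges the small per-database results. A sketch of that fan-out-and-merge idea, using SQLite databases in place of the Postgres ones (the table and column names are assumptions):

```python
import heapq
import sqlite3

def make_db(rows):
    # Stand-in for one of the 5-50 identical databases.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER, created_at TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    conn.commit()
    return conn

dbs = [
    make_db([(1, "2024-01-03"), (2, "2024-01-01")]),
    make_db([(3, "2024-01-02"), (4, "2024-01-04")]),
]

def query_db(conn, limit):
    # Push filtering, ordering, and LIMIT down to each database so at most
    # `limit` rows per database ever reach Python.
    return conn.execute(
        "SELECT id, created_at FROM events "
        "ORDER BY created_at DESC LIMIT ?", (limit,)).fetchall()

limit = 3
# Each per-database result is already sorted descending, so heapq.merge can
# combine them lazily; keep only the first `limit` rows overall.
merged = heapq.merge(*(query_db(db, limit) for db in dbs),
                     key=lambda r: r[1], reverse=True)
page = [row for _, row in zip(range(limit), merged)]
```

With Django the same shape works via `Model.objects.using(alias)` per database alias; the merge step in Python is unchanged.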
I'm looking for advice on efficient ways to stream data incrementally from a Postgres table into Python. I'm in the process of implementing an online learning algorithm and I want to read batches of training examples from the database table into memory to be processed. Any thoughts on good ways to maximize throughput? Thanks for your suggestions.
If you are using psycopg2, then you will want to use a named cursor, otherwise it will try to read the entire query data into memory at once.
cursor = conn.cursor("some_unique_name")
cursor.execute("SELECT aid FROM pgbench_accounts")
for record in cursor:
    something(record)
This will fetch the records from the server in batches of 2000 (default value of itersize) and then parcel them out to the loop one at a time.
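If you'd rather pull explicit batches (e.g. one minibatch of training examples per step) than iterate one record at a time, the same cursor supports fetchmany. A sketch of the loop shape, shown here with SQLite so it is self-contained (with psycopg2 you would use the named cursor from above; the table and sizes are illustrative):

```python
import sqlite3

# Stand-in table so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pgbench_accounts (aid INTEGER)")
conn.executemany("INSERT INTO pgbench_accounts VALUES (?)",
                 [(i,) for i in range(10)])

def batches(cursor, size):
    # fetchmany returns at most `size` rows per call; an empty list
    # means the result set is exhausted.
    while True:
        rows = cursor.fetchmany(size)
        if not rows:
            return
        yield rows

cur = conn.cursor()
cur.execute("SELECT aid FROM pgbench_accounts")
batch_sizes = [len(b) for b in batches(cur, 4)]
```

With a psycopg2 named cursor, each fetchmany call pulls the next batch from the server rather than from a fully materialized result set, which is what keeps memory flat.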
You may want to look into the Postgres LISTEN/NOTIFY functionality https://www.postgresql.org/docs/9.1/static/sql-notify.html
My python project involves an externally provided database: A text file of approximately 100K lines.
This file will be updated daily.
Should I load it into an SQL database, and deal with the diff daily? Or is there an effective way to "query" this text file?
ADDITIONAL INFO:
Each "entry", or line, contains three fields - any one of which can be used as an index.
The update is in the form of the entire database; I would have to generate a diff manually.
The queries are just looking up records and displaying the text.
Querying the database will be a fundamental task of the application.
How often will the data be queried? At one extreme, if only once per day, a sequential search might be more efficient than maintaining a database or index.
For more queries and a daily update, you could build and maintain your own index for more efficient queries. Most likely, it would be worth a negligible (if any) sacrifice in speed to use an SQL database (or other database, depending on your needs) in return for simpler and more maintainable code.
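A minimal sketch of the SQL-database route, assuming a tab-separated file with three fields (the field names and the separator are assumptions). Loading the daily file into SQLite and indexing each field makes every lookup cheap:

```python
import sqlite3

# Stand-in for the ~100K-line external file.
lines = ["alpha\t1\tfoo", "beta\t2\tbar", "gamma\t3\tbaz"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (f1 TEXT, f2 TEXT, f3 TEXT)")
conn.executemany("INSERT INTO entries VALUES (?, ?, ?)",
                 (line.split("\t") for line in lines))
# One index per field, since any of the three can be used as a lookup key.
for col in ("f1", "f2", "f3"):
    conn.execute(f"CREATE INDEX idx_{col} ON entries ({col})")
conn.commit()

# Look up a record by its second field.
row = conn.execute("SELECT f3 FROM entries WHERE f2 = ?", ("2",)).fetchone()
```

For the daily update, the simplest maintainable version is to rebuild the database from scratch each day; at 100K lines that takes well under a second.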
What I've done before is create SQLite databases from txt files which were created from database extracts, one SQLite db for each day.
One can query across the SQLite dbs to check the values etc. and create additional tables of data.
I added an additional column of data that was the SHA1 of the text line so that I could easily identify lines that were different.
It worked in my situation and hopefully may form the barest sniff of an acorn of an idea for you.
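The per-line hash idea can be sketched as follows: hashing each line of yesterday's and today's extracts turns the daily diff into a cheap set comparison (the line format here is illustrative):

```python
import hashlib

def line_hashes(lines):
    # Map each line's SHA1 to the line itself, mirroring the extra
    # SHA1 column described above.
    return {hashlib.sha1(line.encode("utf-8")).hexdigest(): line
            for line in lines}

yesterday = line_hashes(["a\t1", "b\t2", "c\t3"])
today = line_hashes(["a\t1", "b\t9", "c\t3"])

# Lines whose hashes appear on only one side are the additions/removals;
# a modified line shows up once in each list.
added = [today[h] for h in today.keys() - yesterday.keys()]
removed = [yesterday[h] for h in yesterday.keys() - today.keys()]
```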