I have a few questions about Amazon RDS.
I was able to create an instance, and I can see the data I insert with the code below through MySQL Workbench, but is there any way to view the data on RDS itself?
I have a fairly large data set, roughly 80k tuples, which I read from a CSV file and insert into the database. Does switching to a better instance affect the speed of inserting the data? I tried a larger instance with around 7.5 GB of RAM, but it didn't help; insertion was still very slow. What could be the problem? Am I not connecting properly?
I used the following code to connect
db = MySQLdb.connect(host='west........{the endpoint}', port=3306, db='clouddb', user='{my_username}', passwd='{my_password}')
cursor = db.cursor()
cursor.execute('''CREATE TABLE consumer_complaint(
    Complaint VARCHAR(255), Product VARCHAR(255), Subproduct VARCHAR(255),
    Issue VARCHAR(255), State VARCHAR(255), ZIPcode VARCHAR(255),
    Company VARCHAR(255), Companyresponse VARCHAR(255),
    Timelyresponse VARCHAR(255), Consumerdisputed VARCHAR(255));''')
I'm getting no errors in connecting.
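For reference, a minimal sketch of loading the CSV in batches with executemany rather than issuing one INSERT (and commit) per row, which is usually what makes bulk loads slow. The file name and batch size are assumptions; it reuses the db and cursor objects from the code above.

import csv

insert_sql = ("INSERT INTO consumer_complaint VALUES "
              "(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)")  # 10 columns, matching the CREATE TABLE

with open('consumer_complaints.csv') as f:   # assumed file name
    reader = csv.reader(f)
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) == 1000:
            cursor.executemany(insert_sql, batch)  # one round trip per 1000 rows
            batch = []
    if batch:
        cursor.executemany(insert_sql, batch)

db.commit()  # commit once at the end instead of after every row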
cursor.execute("SELECT * FROM Table;")
I am using the above code to execute the select query, but it gets stuck because the table has 93 million records.
Is there any other method to extract all the data from a Snowflake table in a Python script?
Depending on what you are trying to do with that data, it would probably be most efficient to run a COPY INTO <location> statement to unload the data into files on a stage, and then run a GET via Python to bring those files locally to wherever you are running Python.
However, you might want to provide more detail on how you are using the data in python after the cursor.execute statement. Are you going to iterate over that data set to do something (in which case, you may be better off issuing SQL statements directly to Snowflake, instead), loading it into Pandas to do something (there are better Snowflake functions for pandas in that case), or something else? If you are just creating a file from it, then my suggestion above will work.
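A rough sketch of the COPY INTO / GET approach described above, assuming a named internal stage and a local download directory (the stage name, table name, connection details, and path are placeholders):

import snowflake.connector

con = snowflake.connector.connect(
    account='my_account',      # placeholder connection details
    user='my_user',
    password='my_password',
    warehouse='my_wh',
    database='my_db',
    schema='public',
)
cur = con.cursor()

# Unload the table to a named internal stage as compressed CSV files.
cur.execute(
    "COPY INTO @my_stage/big_table/ FROM big_table "
    "FILE_FORMAT = (TYPE = CSV) OVERWRITE = TRUE"
)

# Download the staged files to a local directory on the machine running Python.
cur.execute("GET @my_stage/big_table/ file:///tmp/big_table/")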
The problem is that when you fetch data from Snowflake into Python, the query appears to get stuck because of the volume of records and the Snowflake-to-Python data conversion.
Are you trying to fetch all the data from the table, and how are you using the data downstream? That is the most important question. Also, restrict the number of columns you select where possible.
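If you do need to pull rows into Python, one option is to fetch the result set in batches rather than all at once. A minimal sketch, assuming an open connection con; the column names, batch size, and process() function are placeholders:

cur = con.cursor()
cur.execute("SELECT col1, col2 FROM big_table")  # restrict the columns where possible

while True:
    rows = cur.fetchmany(100000)  # pull 100k rows per batch
    if not rows:
        break
    process(rows)                 # placeholder for whatever the downstream step does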
Improving Query Performance by Bypassing Data Conversion
To improve query performance, use the SnowflakeNoConverterToPython class in the snowflake.connector.converter_null module to bypass data conversions from the Snowflake internal data type to the native Python data type, e.g.:
import snowflake.connector
from snowflake.connector.converter_null import SnowflakeNoConverterToPython

con = snowflake.connector.connect(
    ...,  # account, user, password, warehouse, etc.
    converter_class=SnowflakeNoConverterToPython
)
for rec in con.cursor().execute("SELECT * FROM large_table"):
    # rec includes raw Snowflake data, returned as strings rather than native Python types
    print(rec)
I am trying to fetch data from a SQL Server database (just a simple SELECT * query).
The table contains around 3-5 million records. Performing a SELECT * on the SQL Server directly using SSMS takes around 11-15 minutes.
However, when I connect via Python and try to save the data into a pandas dataframe, it takes forever; more than 1 hour.
Here is the code I am using:
import pymssql
import pandas as pd
from datetime import datetime

startTime = datetime.now()

# instantiate a python db connection object - same form as psycopg2/python-mysql drivers
conn = pymssql.connect(server=r"xyz", database="abc", user="user", password="pwd")
print('Connecting to DB: ', datetime.now() - startTime)

stmt = "SELECT * FROM BIG_TABLE;"
# Execute the query and load the result into a dataframe
df_big_table = pd.read_sql(stmt, conn)
There must be a better way to do this, perhaps parallel processing or something similar to fetch the data quickly.
My end goal is to migrate this table from SQL Server to Postgres.
This is the way I am doing it:
1. Fetch the data from SQL Server using Python.
2. Save it to a pandas dataframe.
3. Save this data to a CSV file on disk.
4. Copy the CSV from disk into Postgres.
Probably, I can combine steps 3 and 4 so that I can do the transition in memory rather than using disk IO, as in the sketch below.
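Something like the following is what I have in mind for combining steps 3 and 4; just a rough sketch assuming psycopg2 on the Postgres side, with connection details, table names, and chunk size as placeholders:

import io
import pandas as pd
import pymssql
import psycopg2

src = pymssql.connect(server=r"xyz", database="abc", user="user", password="pwd")
dst = psycopg2.connect(host="pg_host", dbname="abc", user="user", password="pwd")
dst_cur = dst.cursor()

# Stream the table in chunks, turning each chunk into an in-memory CSV
# and COPYing it straight into Postgres without touching the disk.
for chunk in pd.read_sql("SELECT * FROM BIG_TABLE;", src, chunksize=100000):
    buf = io.StringIO()
    chunk.to_csv(buf, index=False, header=False)
    buf.seek(0)
    dst_cur.copy_expert("COPY big_table FROM STDIN WITH (FORMAT csv)", buf)

dst.commit()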
There are many complexities, like table constraints and definitions, etc., which I will take care of later on. I cannot use a third-party tool.
I am stuck at steps 1 and 2, so help with a Python script (or some other open-source language) would be really appreciated.
If there is any other way to reach my end goal, I welcome suggestions!
Have you tried using the 'chunksize' option of pandas.read_sql? You can gather all of the chunks into a single dataframe and produce the CSV.
If that takes too much time, you can use pandas.read_sql as an iterator, split the output across multiple files (one per chunk), and then, after you have done your work, consolidate those files into a single one and submit it to Postgres.
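A minimal sketch of the chunked approach, with the chunk size and output file names as assumptions:

import pandas as pd
import pymssql

conn = pymssql.connect(server=r"xyz", database="abc", user="user", password="pwd")

for i, chunk in enumerate(pd.read_sql("SELECT * FROM BIG_TABLE;", conn, chunksize=500000)):
    # Write the header only on the first chunk so the files can be concatenated later.
    chunk.to_csv("big_table_part_{}.csv".format(i), index=False, header=(i == 0))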
New to Pandas & SQL. Haven't found an answer specific to this config, and not sure if standard SQL wisdom applies when introducing pandas to the mix.
Doing a school project that involves ~300 GB of data in ~6 GB .csv chunks.
School advised syncing data via dropbox, but this seemed impractical for a 4-person team.
So, the current solution is an AWS EC2 instance plus an RDS instance (MySQL, I think, with 1 table).
What I wanted to confirm before we start setting it up:
If multiple users are working with (and occasionally modifying) the data, can this arrangement manage conflicts? e.g., if user A uses pandas to construct a dataframe from a query, are the records in that query frozen if user B tries to work with them?
My assumption is that the data in the frame are in memory, and the records in the SQL database are free to be modified by others until the dataframe is written back to the db, but I'm hoping that either I'm wrong or there's a simple solution here (like a random sample query for each user or something).
A pandas DataFrame object does not interact directly with the db. Once you read it in, it sits in memory locally. You would have to use a method like DataFrame.to_sql to write your changes back to the MySQL DB. For more information on reading and writing to SQL tables, see the pandas documentation.
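A minimal sketch of that read/modify/write-back round trip, assuming SQLAlchemy with a MySQL driver; the connection URL and table name are placeholders:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@rds-endpoint:3306/projectdb")

# Read a slice of the table into a local, in-memory DataFrame.
df = pd.read_sql("SELECT * FROM measurements LIMIT 100000", engine)

# ... modify df locally; nothing in MySQL changes during this step ...

# Write the result back. Note that to_sql appends or replaces whole tables;
# it does not perform row-level updates.
df.to_sql("measurements", engine, if_exists="append", index=False)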
I have a MySQL server providing access to both a database for the Django ORM and a separate database called "STATES" that I built. I would like to query tables in my STATES database and return results (typically a couple of rows) to Django for rendering, but I don't know the best way to do this.
One way would be to use Django directly. Maybe I could move the relevant tables into the Django ORM database? I'm nervous about doing this because the STATES database contains large tables (10 million rows x 100 columns), and I worry about deleting that data or messing it up in some other way (I'm not very experienced with Django). I also imagine I should avoid creating a separate connection for each query, so I should use the Django connection to query STATE tables?
Alternatively, I could treat the STATE database as existing on a totally different server. I could import SQLAlchemy, create a connection, query STATE.table, return the result, and close that connection.
Which is better, or is there another path?
The docs describe how to connect to multiple databases. After adding another database ("state_db") to DATABASES in settings.py, I can then do the following:
from django.db import connections

def query(lname):
    c = connections['state_db'].cursor()
    c.execute("SELECT last_name FROM STATE.table WHERE last_name=%s;", [lname])
    rows = c.fetchall()
    ...
This is slower than I expected, but I'm guessing this is close to optimal because it uses the open connection and Django without adding extra complexity.
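For completeness, the DATABASES entry this assumes in settings.py looks roughly like the following (engine choice and credentials are placeholders):

# settings.py
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'django_db',
        'USER': 'user',
        'PASSWORD': 'password',
        'HOST': 'localhost',
        'PORT': '3306',
    },
    'state_db': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'STATES',
        'USER': 'user',
        'PASSWORD': 'password',
        'HOST': 'localhost',
        'PORT': '3306',
    },
}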
I'd like to create a snapshot of a database periodically, execute some queries on the snapshot data to generate data for the next step, and finally discard the snapshot.
Currently I read all the data from the database, convert it into an in-memory data structure (a Python dict), and execute queries (implemented by my own code) on that data structure.
The program has a bottleneck at the "execute query" step now that the data size has increased.
How can I query a data snapshot elegantly? Thanks for any advice.
You can get all the tables from your database with:
SHOW TABLES FROM <yourDBname>
After that you can create copies of the tables in a new database via:
CREATE TABLE copy.tableA AS SELECT * FROM <yourDBname>.tableA
Afterwards you can query the copy database instead of the real data.
If you run queries on the copied tables, please add indexes, since they are not copied.
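A rough sketch of automating that with MySQLdb (database names, credentials, and the indexed column are placeholders):

import MySQLdb

db = MySQLdb.connect(host='localhost', user='user', passwd='password')
cur = db.cursor()

cur.execute("CREATE DATABASE IF NOT EXISTS snapshot_db")
cur.execute("SHOW TABLES FROM source_db")
for (table,) in cur.fetchall():
    cur.execute("DROP TABLE IF EXISTS snapshot_db.`%s`" % table)
    cur.execute("CREATE TABLE snapshot_db.`%s` AS SELECT * FROM source_db.`%s`" % (table, table))
    # Indexes are not copied by CREATE TABLE ... AS SELECT,
    # so add the ones your snapshot queries need, e.g.:
    # cur.execute("ALTER TABLE snapshot_db.`%s` ADD INDEX idx_id (id)" % table)
db.close()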