snowflake select cursor statement fails - python

cursor.execute("SELECT * FROM Table")
I am using the above code to execute the select query, but it gets stuck because the table has 93 million records.
Is there any other method to extract all the data from a Snowflake table in a Python script?

Depending on what you are trying to do with that data, it'd probably be most efficient to run a COPY INTO <location> statement to unload the data to a file in a stage, and then run a GET via Python to bring that file down to wherever you are running Python.
However, you might want to provide more detail on how you are using the data in Python after the cursor.execute statement. Are you going to iterate over that data set to do something (in which case you may be better off issuing SQL statements directly to Snowflake instead), load it into pandas to do something (there are better Snowflake functions for pandas in that case), or something else? If you are just creating a file from it, then the suggestion above will work.
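For illustration, here is a minimal sketch of that unload-then-download approach using the Snowflake Python connector; the stage name (my_stage), connection parameters, and local path are placeholders, not anything from the question:
import snowflake.connector

con = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_wh", database="my_db", schema="my_schema",
)
cur = con.cursor()

# Unload the table to compressed CSV files in a named internal stage
cur.execute("""
    COPY INTO @my_stage/large_table/
    FROM large_table
    FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
    OVERWRITE = TRUE
""")

# Download the staged files to a local directory
cur.execute("GET @my_stage/large_table/ file:///tmp/large_table/")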

The problem is that when you fetch data from Snowflake into Python, the query gets stuck due to the volume of records and the Snowflake-to-Python data conversion.
Are you trying to fetch all the data from the table, and how are you using the data downstream? That is the most important question. If possible, restrict the number of columns you select.
Improving Query Performance by Bypassing Data Conversion
To improve query performance, use the SnowflakeNoConverterToPython class in the snowflake.connector.converter_null module to bypass data conversions from the Snowflake internal data type to the native Python data type, e.g.:
import snowflake.connector
from snowflake.connector.converter_null import SnowflakeNoConverterToPython

con = snowflake.connector.connect(
    ...
    converter_class=SnowflakeNoConverterToPython
)
for rec in con.cursor().execute("SELECT * FROM large_table"):
    ...  # rec contains raw Snowflake data, not converted to native Python types

Related

Fetching Million records from SQL server and saving to pandas dataframe

I am trying to fetch data from a SQL Server database (just a simple SELECT * query).
The table contains around 3-5 million records. Performing a SELECT * on the SQL Server directly using SSMS takes around 11-15 minutes.
However, when I connect via Python and try to save the data into a pandas dataframe, it takes forever: more than 1 hour.
Here is the code I am using:
import pymssql
import pandas as pd
from datetime import datetime

startTime = datetime.now()
# Create a Python DB connection object (same form as psycopg2 / python-mysql drivers)
conn = pymssql.connect(server=r"xyz", database="abc", user="user", password="pwd")
print('Connecting to DB: ', datetime.now() - startTime)

stmt = "SELECT * FROM BIG_TABLE;"
# Execute the query here
df_big_table = pd.read_sql(stmt, conn)
There must be a better way to do this; perhaps parallel processing or something else to fetch the data quickly?
My end goal is to migrate this table from SQL Server to Postgres.
This is the way I am doing it:
1. Fetch data from SQL Server using Python.
2. Save it to a pandas dataframe.
3. Save this data as CSV to disk.
4. Copy the CSV from disk to Postgres.
Probably I can combine steps 3 and 4 so that I can do the transfer in memory rather than using disk I/O.
There are many complexities, like table constraints and definitions, etc., which I will take care of later on. I cannot use a third-party tool.
I am stuck at steps 1 and 2, so help with the Python script (or some other open-source language) would be really appreciated.
If there is any other way to reach my end goal, I welcome suggestions!
Have you tried using the chunksize option of pandas.read_sql? You can get all of that into a single dataframe and produce the CSV.
If it still takes too long, you can write each chunk to its own file by using pandas.read_sql as an iterator, then, after you have done your work, consolidate those files into a single one and submit it to Postgres (see the sketch below).
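A minimal sketch of that chunked approach, reusing the pymssql connection and table from the question; the chunk size and output file name are arbitrary choices:
import pymssql
import pandas as pd

conn = pymssql.connect(server=r"xyz", database="abc", user="user", password="pwd")

# With chunksize, read_sql returns an iterator of dataframes instead of one big frame
chunks = pd.read_sql("SELECT * FROM BIG_TABLE;", conn, chunksize=100_000)

for i, chunk in enumerate(chunks):
    # Write the header only for the first chunk, then append the rest
    chunk.to_csv("big_table.csv", mode="a", index=False, header=(i == 0))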

Bigquery data not getting inserted

I'm using the Python client library to insert data into a BigQuery table. The code is as follows:
from google.cloud import bigquery

client = bigquery.Client(project_id)
errors = client.insert_rows_json(table=tablename, json_rows=data_to_insert)
assert errors == []
There are no errors, but the data is not getting inserted either.
Sample JSON rows:
[{'a': 'b', 'c': 'd'}, {'a': 'f', 'q': 'r'}, .....]
What's the problem? There is no exception either.
The client.insert_rows_json method uses streaming inserts.
Inserting data into BigQuery with streaming inserts causes latency in the table preview on the BigQuery console; the data does not appear there immediately.
So you need to query the table to confirm that the data was inserted, for example as sketched below.
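A minimal sketch of such a confirmation query, assuming tablename from the question is a fully qualified project.dataset.table string; the values shown here are placeholders:
from google.cloud import bigquery

project_id = "my-project"                         # placeholder
tablename = "my-project.my_dataset.my_table"      # placeholder, fully qualified

client = bigquery.Client(project_id)

# Streamed rows are queryable even while the console preview lags behind
query = f"SELECT COUNT(*) AS row_count FROM `{tablename}`"
for row in client.query(query).result():
    print(row.row_count)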
There are two possible situations:
your data does not match the schema
your table is freshly created, and the update is just not yet available
References:
Related GitHub issue
Data availability
I got the answer to my question. The problem was that I was inserting data for an extra column that did not exist in the table. I found a workaround to figure out why data is not being inserted into a BigQuery table:
Change the data to newline-delimited JSON, with the keys as the column names and the values as the values you want for each column.
Run bq --location=US load --source_format=NEWLINE_DELIMITED_JSON dataset.tablename newline_delimited_json_file.json in your terminal and see if it throws any errors. If it throws an error, it's likely that something is wrong with your data or table schema.
Fix the data or table schema according to the error and retry the insert via Python.
It would be helpful if the Python API threw an error/exception like the terminal does.
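As an alternative to the bq CLI, a load job submitted through the Python client also surfaces schema problems as exceptions; a minimal sketch, assuming the same newline-delimited JSON file and a dataset.tablename destination as above:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)

with open("newline_delimited_json_file.json", "rb") as f:
    job = client.load_table_from_file(f, "dataset.tablename", job_config=job_config)

# result() blocks until the load job finishes and raises an exception
# if the data does not match the table schema
job.result()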

Create Database using Python on Jupyter Notebook

So I am building a database for a larger program and do not have much experience in this area of coding (mostly embedded systems programming). My task is to import a large Excel file into Python. Since it is large, I'm assuming I must convert it to a CSV, then truncate it by parsing and partitioning it before importing, to avoid my computer crashing. Once the file is imported, I must be able to extract/search specific information based on the column titles. There are other user-interactive aspects that are simply string based, so not very difficult. As for the rest, I am getting the picture but would like a more efficient and specific design. Can anyone offer me guidance on this?
An Excel or CSV file can be read into Python using pandas. The data is stored as rows and columns and is called a dataframe. To import data into such a structure, you need to import pandas first and then read the CSV or Excel file into a dataframe.
import pandas as pd

df1 = pd.read_csv('excelfilename.csv')
This dataframe structure is similar to a table, and you can join different dataframes, group data, etc.
I am not sure if this is what you need; let me know if you need any further clarification.
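Since the file is large and you mention searching by column titles, here is a minimal sketch of reading the CSV in chunks and filtering on a column; the file name, chunk size, column name, and search value are placeholders:
import pandas as pd

matches = []
# Read the large CSV in manageable chunks instead of all at once
for chunk in pd.read_csv('excelfilename.csv', chunksize=50_000):
    # Keep only the rows where a given column matches a search value
    matches.append(chunk[chunk['some_column'] == 'search_value'])

result = pd.concat(matches, ignore_index=True)
print(result.head())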
I would recommend actually loading it into a proper database such as MariaDB or PostgreSQL. This will allow you to access the data from other applications, and it takes the load off of you for writing a database. You can then use an ORM if you would like to interact with the data, or simply use plain SQL via Python.
Import the libraries and read the CSV:
import pandas as pd
import sqlite3

df = pd.read_csv('sample.csv')
Connect to a database:
conn = sqlite3.connect("Any_Database_Name.db")  # if the db does not exist, this creates an Any_Database_Name.db file in the current directory
Store your table in the database:
df.to_sql('Some_Table_Name', conn)
Read a SQL query out of your database and into a pandas dataframe:
sql_string = 'SELECT * FROM Some_Table_Name'
df = pd.read_sql(sql_string, conn)

mysql: update millions of records by applying a python function

I have a Python function (pyfunc):
def pyfunc(x):
    ...
    return someString
I want to apply this function to every item in a MySQL table column,
something like:
UPDATE tbl SET mycol=pyfunc(mycol);
This update includes tens of millions of records.
Is there an efficient way to do this?
Note: I cannot rewrite this function in SQL or any other programming language.
If your pyfunc does not depend on other data sources like APIs or a cache, and just does some data processing like string or mathematical manipulation (or depends only on data stored in the same MySQL database), you should go for MySQL user-defined functions.
Let's assume you create a MySQL function called colFunc; then your query would be
UPDATE tbl SET mycol=colFunc(mycol);
Just prepare an update.sql file using a Python script (see the sketch below).
After this you can test the update on your local machine (with a dump of the db): just connect to MySQL and run the update.sql script that was prepared from Python.
In this case you use raw SQL queries, without Python, for updating the data.
I think it is not a bad solution.
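A minimal sketch of generating such an update.sql from Python, assuming the table has a primary key column id and using PyMySQL to read the rows; the connection details and the pyfunc body are placeholders:
import pymysql

def pyfunc(x):
    # placeholder for the real function from the question
    return str(x).upper()

conn = pymysql.connect(host="localhost", user="user", password="pwd", database="mydb")

with conn.cursor() as cur, open("update.sql", "w") as out:
    cur.execute("SELECT id, mycol FROM tbl")
    for row_id, value in cur:
        new_value = pyfunc(value)
        # conn.escape() quotes the computed value so the generated SQL is safe to run
        out.write(f"UPDATE tbl SET mycol={conn.escape(new_value)} WHERE id={row_id};\n")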

Multiple pandas users connecting to SQL DB

New to Pandas & SQL. Haven't found an answer specific to this config, and not sure if standard SQL wisdom applies when introducing pandas to the mix.
Doing a school project that involves ~300 GB of data in ~6 GB .csv chunks.
School advised syncing data via Dropbox, but this seemed impractical for a 4-person team.
So the current solution is an AWS EC2 & RDS instance (MySQL, I think it'll be, with 1 table).
What I wanted to confirm before we start setting it up:
If multiple users are working with (and occasionally modifying) the data, can this arrangement manage conflicts? e.g., if user A uses pandas to construct a dataframe from a query, are the records in that query frozen if user B tries to work with them?
My assumption is that the data in the frame are in memory, and the records in the SQL database are free to be modified by others until the dataframe is written back to the db, but I'm hoping that either I'm wrong or there's a simple solution here (like a random sample query for each user or something).
A pandas DataFrame object does not interact directly with the db. Once you read it in, it sits in memory locally. You would have to use a method like DataFrame.to_sql to write your changes back to the MySQL DB. For more information on reading and writing to SQL tables, see the pandas documentation.
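A minimal sketch of that read-then-write-back flow using SQLAlchemy against a MySQL instance; the connection string, table, and column names are placeholders:
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string for the RDS MySQL instance
engine = create_engine("mysql+pymysql://user:pwd@rds-host:3306/schooldb")

# Each user pulls a copy of the rows into local memory
df = pd.read_sql("SELECT * FROM measurements WHERE batch = 1", engine)

# Modify the dataframe locally; the DB rows are untouched until we write back
df["value"] = df["value"] * 2

# to_sql replaces (or appends to) a whole table; it does not merge changes row by row
df.to_sql("measurements_userA", engine, if_exists="replace", index=False)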
