I have a table with millions of records, and pulling it into a dataframe in Jupyter takes so much memory that the server crashes before the load finishes.
I learned about the Dask package, which is supposed to help with dataframes too large for pandas, but I am new to Dask and not sure how to set up a connection between Dask and a MySQL server.
I usually connect Jupyter to the MySQL server the following way, and I would really appreciate it if someone could show me how to connect to the same table and server using the Dask framework.
import pyodbc
import pandas as pd

# Connect through the ODBC data source and load the whole table into pandas
sql_conn = pyodbc.connect("DSN=CNVDED")
query = "SELECT * FROM Abc table"
df_training = pd.read_sql(query, sql_conn)
data = df_training
I would really appreciate any help on this. I can't dump the table to CSV first and then use Dask; I need a proper direct connection to the MySQL server.
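For what it's worth, here is a minimal sketch of how that connection could look with dask.dataframe.read_sql_table. The SQLAlchemy-style URI (pymysql driver, credentials, host, database), the table name "Abc", and the index_col="id" column are assumptions that need to be replaced with the real details:

import dask.dataframe as dd

# SQLAlchemy-style connection URI for MySQL (driver, credentials, host and
# database name below are placeholders).
uri = "mysql+pymysql://user:password@host:3306/mydatabase"

# Dask needs an indexed column (ideally a numeric primary key) so it can split
# the table into partitions instead of loading everything at once.
ddf = dd.read_sql_table("Abc", uri, index_col="id", npartitions=20)

# Work lazily; only computed results are pulled into memory.
print(ddf.head())

Because Dask reads each partition lazily, only the rows a computation actually needs are held in memory at any one time.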
I am trying to connect a Jupyter notebook running in a conda environment to a Hadoop cluster through Apache Hive on Cloudera. I understand from this post that I should install and set up the Cloudera ODBC driver and use pyodbc with a connection as follows:
import pyodbc
import pandas as pd
with pyodbc.connect("DSN=<replace DSN name>", autocommit=True) as conn:
    df = pd.read_sql("<Hive Query>", conn)
My question is about the autocommit parameter. I see in the pyodbc connection documentation that setting autocommit to True will make it so that I don't have to explicitly commit transactions, but it doesn't specify what that actually means.
What exactly is a transaction?
I want to select data from the hive server using pd.read_sql_query() but I don't want to make any changes to the actual data on the server.
Apologies if this question is formatted incorrectly or if there are (seemingly simple) details I'm overlooking in my question - this is my first time posting on stackoverflow and I'm new to working with cloudera / Hive.
I haven't tried connecting yet or running any queries yet because I don't want to mess up anything on the server.
Hive does not have the concept of commits and explicit transactions the way RDBMS systems do.
You should not need to worry about autocommit.
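As a minimal read-only sketch (the DSN name and table below are placeholders), a SELECT issued through pd.read_sql_query only reads rows and changes nothing on the server:

import pyodbc
import pandas as pd

# "CLOUDERA_HIVE" and "some_table" are placeholder names; substitute your own DSN and table.
with pyodbc.connect("DSN=CLOUDERA_HIVE", autocommit=True) as conn:
    # A plain SELECT is read-only; nothing on the Hive server is modified.
    df = pd.read_sql_query("SELECT * FROM some_table LIMIT 100", conn)

print(df.shape)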
I am trying to move from SQLite to MySQL, and I am almost there. However, I keep running into one problem that plagues me regardless of whether I am connecting to my local MySQL database or the one hosted on Google Cloud. Has anyone had the same issue?
This is the code I use to append a pandas dataframe to the table:
import pandas as pd
from sqlalchemy import create_engine
connection = create_engine(f"mysql+mysqlconnector://{user}:{pw}@{host}/{db}")
tablename = 'TABLENAME'
df.to_sql(tablename, connection, if_exists='append', index=False)
The pandas table is not very large, a few tens of rows at a time.
Every now and then I get this error
sqlalchemy.exc.OperationalError: (mysql.connector.errors.OperationalError) 2055: Lost connection to MySQL server at 'serverip:port', system error: 32 Broken pipe
So far I have been using the sqlite3 package for SQLite. Do you think the problem lies with SQLAlchemy or with MySQL?
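Error 2055 with a broken pipe usually points to the server having closed a pooled connection that SQLAlchemy then tried to reuse. A hedged sketch of one common mitigation, assuming the same connection URL as above: let the engine test connections before handing them out and recycle them periodically.

from sqlalchemy import create_engine

# pool_pre_ping issues a lightweight check before each checkout, so stale
# connections are transparently replaced instead of raising "Broken pipe".
# pool_recycle closes connections older than the given number of seconds,
# which should stay below MySQL's wait_timeout. Credentials are placeholders.
engine = create_engine(
    f"mysql+mysqlconnector://{user}:{pw}@{host}/{db}",
    pool_pre_ping=True,
    pool_recycle=1800,
)

df.to_sql("TABLENAME", engine, if_exists="append", index=False)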
I'm developing a website where users import CSV files directly into a database, with a front end that performs some analytics on the data once it has been loaded. I'm using pandas to convert the CSV to a dataframe and then import that dataframe into the MySQL database:
Import to MySQL database:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+mysqlconnector://[username]:[password]@[host]:[port]/[schema]', echo=False)
df = pd.read_csv('C:/Users/[user]/Documents/Sales_Records.csv')
df.to_sql(con=engine, name='data', if_exists='replace')
The problem with this is that for the datasets I work with (5 million rows), the performance is too slow and the action times out without importing the data. However, if I try the same thing except using SQLite3:
Import to SQLite3 database:
import sqlite3

conn = sqlite3.connect('customer.db')
df = pd.read_csv('C:/Users/[user]/Documents/Sales_Records.csv')
df.to_sql('Sales', conn, if_exists='append', index=False)

mycursor = conn.cursor()
query = 'SELECT * FROM Sales LIMIT 10'
print(mycursor.execute(query).fetchall())
This block of code executes in seconds and imports all 5 million rows of the dataset. So what should I do? I do not anticipate multiple people passing in large datasets all at the same time so I suppose it would not hurt to just ditch MySQL for the clear performance advantages provided by SQLite in this application. It just feels like there's a better way though...
MySQL sends the data to disk over a network connection.
SQLite3 writes the data to disk directly.
Look at https://gist.github.com/jboner/2841832
You did not mention where the MySQL server is. But even if it were on your local machine, the data would still pass through a TCP/IP stack, whereas SQLite just writes directly to disk.
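If MySQL has to stay, one commonly used mitigation (my assumption, not something stated in the answer) is to batch the insert rather than pushing all 5 million rows in one go; pandas exposes this through the chunksize and method arguments of to_sql:

import pandas as pd
from sqlalchemy import create_engine

# Placeholders: replace the credentials and path with your own.
engine = create_engine('mysql+mysqlconnector://[username]:[password]@[host]:[port]/[schema]')

df = pd.read_csv('C:/Users/[user]/Documents/Sales_Records.csv')

# chunksize limits how many rows are sent per round trip; method='multi' packs
# each batch into a single multi-row INSERT instead of one INSERT per row.
df.to_sql('data', con=engine, if_exists='replace', index=False,
          chunksize=10_000, method='multi')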
I'm trying to connect to an old-school MS SQL Server (normally reached through jTDS) for a variety of analysis tasks: first just using Python with SQLAlchemy, and later from Tableau and Presto as well.
Focusing on SQLAlchemy first, at the moment I'm getting this error:
Data source name not found and no default driver specified
My connection code is based on this thread: Connecting to SQL Server 2012 using sqlalchemy and pyodbc, i.e.:
import urllib
import sqlalchemy as sa

params = urllib.parse.quote_plus("DRIVER={FreeTDS};"
                                 "SERVER=x-y.x.com;"
                                 "DATABASE=;"
                                 "UID=user;"
                                 "PWD=password")
engine = sa.create_engine("mssql+pyodbc:///?odbc_connect={}".format(params))
Connecting works fine through Dbeaver, using a jTDS SQL Server (MSSQL) driver (which is labelled as legacy).
Curious as to how to resolve this issue, I'll keep researching away, but would appreciate any help.
I imagine there is an old driver somewhere on the internet that I need to set up for SQLAlchemy to begin with, and then perhaps migrate this data to something newer.
Appreciate your time
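That error usually means the ODBC layer cannot find a driver registered under the name given in the connection string. As a sketch of an alternative way to build the same connection (assuming SQLAlchemy 1.4+ and that a driver named "FreeTDS" is registered in odbcinst.ini; the host, credentials, database name, and TDS_Version below are placeholders):

import sqlalchemy as sa
from sqlalchemy.engine import URL

# The driver value must exactly match a driver name registered with ODBC
# (e.g. in /etc/odbcinst.ini); "FreeTDS" here is an assumption.
connection_url = URL.create(
    "mssql+pyodbc",
    username="user",
    password="password",
    host="x-y.x.com",
    database="mydb",  # placeholder database name
    query={"driver": "FreeTDS", "TDS_Version": "7.3"},
)

engine = sa.create_engine(connection_url)

with engine.connect() as conn:
    print(conn.execute(sa.text("SELECT 1")).scalar())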
Ultimately, I want to grab data from a Vertica DB into Spark, train a machine learning model, make predictions, and store these predictions into another Vertica DB.
Current issue is identifying the bottleneck in the last part of the flow: storing values in Vertica DB from Spark. It takes about 38 minutes to store 63k rows of data in a Vertica DB. In comparison, when I transfer that same data from Spark to MySQL database, it takes 10 seconds.
I don't know why the difference is so huge.
I have classes called VerticaContext and MySQLContext for Vertica and MySQL connections respectively. Both classes use the SQLContext to read entries using the jdbc format.
df = self._sqlContext.read.format('jdbc').options(url=self._jdbcURL, dbtable=subquery).load()
And write using jdbc.
df.write.jdbc(self._jdbcURL, table_name, save_mode)
There's no difference between the two classes aside from writing to a different target database. I'm confused as to why there's a huge difference in the time it takes to save tables. Is it because of the inherent difference in hardware between the two different databases?
I figured out an alternative solution. Before I dive in, I'll explain what I found and why I think saves to Vertica DB are slow.
Vertica log (search for the file "vertica.log" on your Vertica machine) contains all the recent logs related to reads/writes from and to Vertica DBs. After running the write command, I found out that this is essentially creating INSERT statements into the Vertica DB.
INSERT statements (without the "DIRECT" directive) are slow because the rows are written into the WOS (RAM) instead of the ROS (disk); I don't know the exact details as to why that's the case. In my case the write was issuing individual INSERT statements.
Slow inserts are a known issue. I had trouble finding this information, but I finally found a few links that support that. I'm placing them here for posterity: http://www.vertica-forums.com/viewtopic.php?t=267, http://vertica-forums.com/viewtopic.php?t=124
My solution:
There is documentation that says the COPY command (with a "DIRECT" keyword) is the most efficient way of loading large amounts of data to the database. Since I was looking for a python solution, I used Uber's vertica-python package which allowed me to establish a connection with the Vertica DB and send Vertica commands to execute.
I want to exploit the efficiency of the COPY command, but the data lives somewhere outside of the Vertica cluster. I need to send the data from my Spark cluster to Vertica DB. Fortunately, there's a way to do that from HDFS (see here). I settled on converting the dataframe to a csv file and saving that on HDFS. Then I sent the COPY command to the Vertica DB to grab the file from HDFS.
My code is below (assuming I already have a variable storing the PySpark dataframe; let's call it 'df'):
import vertica_python as VertPy

def create_copy_command(table_name, table_filepath):
    # COPY ... DIRECT loads the file straight from WebHDFS into the ROS (disk),
    # avoiding the slow row-by-row INSERT path.
    return ("COPY " + table_name
            + " SOURCE Hdfs(url='http://hadoop:50070/webhdfs/v1" + table_filepath
            + "', username='root') DELIMITER ',' DIRECT ABORT ON ERROR")

# Dump the Spark dataframe to a CSV file that lives on HDFS.
df.toPandas().to_csv(hdfs_table_absolute_filepath, header=False, index=False)

conn_info = {
    'host': 'your-host-here',
    'port': 5433,  # default Vertica port; replace with your own
    'user': 'username',
    'password': 'password',
    'database': 'database'
}

conn = VertPy.connect(**conn_info)
cur = conn.cursor()

copy_command = create_copy_command(table_name, hdfs_table_relative_filepath)
cur.execute(copy_command)