Please suggest a way to execute a SQL statement and a pandas DataFrame .to_sql() in one transaction.
I have a DataFrame and want to delete some rows on the database side before inserting it.
So basically I need to delete and then insert in one transaction, with the insert done via the DataFrame's .to_sql().
I use a SQLAlchemy engine with pandas df.to_sql().
After further investigation I realized that this is only possible with sqlite3: to_sql() accepts both a SQLAlchemy engine and a plain connection object as its con parameter, but a plain connection is only supported for an SQLite database.
In other words, you have no influence on the connection that the DataFrame's to_sql() function will create.
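A minimal sketch of that SQLite-only approach, assuming a local sqlite3 database and a hypothetical events table (depending on the pandas version, to_sql() may itself commit on success, in which case the explicit commit below is a no-op):

import sqlite3
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

con = sqlite3.connect("example.db")   # plain DB-API connection, accepted by to_sql
try:
    # the DELETE opens an implicit transaction that the inserts below join
    con.execute("DELETE FROM events WHERE id IN (1, 2, 3)")
    df.to_sql("events", con, if_exists="append", index=False)
    con.commit()    # delete and insert become visible together
except Exception:
    con.rollback()  # neither step is applied if anything fails
    raise
finally:
    con.close()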
I imported a table from a SQL Server database as a DataFrame, and I am trying to export it as a PostgreSQL table.
This is what I am doing:
from sqlalchemy import create_engine
import psycopg2
engine = create_engine('postgresql://postgres:000000@localhost:5432/sinistrePY')
df.to_sql('table_name3', engine)
and this is the result
The data integration is working fine, but:
I get the table with read-only privileges
the data types are not as they should be
there is no primary key
I don't need the index column
How can I fix that and control how I want my table to be, from my notebook or directly from the PostgreSQL server if needed? Thanks.
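A hedged sketch of how most of these points can be handled from the notebook side: pass index=False to drop the index column, use the dtype argument to control column types, and add the primary key afterwards with plain DDL (the column names and types below are hypothetical). Table ownership and read-only privileges are governed by the PostgreSQL role you connect as and by GRANT statements, not by to_sql itself.

from sqlalchemy import create_engine, text
from sqlalchemy.types import Integer, Numeric, String

engine = create_engine('postgresql://postgres:000000@localhost:5432/sinistrePY')

df.to_sql(
    'table_name3',
    engine,
    if_exists='replace',
    index=False,                 # don't write the DataFrame index as a column
    dtype={                      # hypothetical column names/types: adjust to your schema
        'claim_id': Integer(),
        'amount': Numeric(12, 2),
        'status': String(50),
    },
)

# to_sql cannot declare a primary key, so add it afterwards with DDL
with engine.begin() as conn:
    conn.execute(text('ALTER TABLE table_name3 ADD PRIMARY KEY (claim_id)'))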
I have a dataframe in my Python program with columns corresponding to a table on my SQL server. I want to append the contents of my dataframe to the SQL table. Here's the catch: I don't have permission to access the SQL table itself; I can only interact with it through a view.
I know that if I could write directly to the table I could use SQLAlchemy's to_sql function. However, I can only write to the table in the database through a view.
Is this even possible? Thanks for the help.
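If the view is updatable (or has an INSTEAD OF INSERT trigger), one possible workaround is to skip to_sql and issue the INSERT statements yourself, since to_sql may try to create the target when it cannot see it as a table. A rough sketch, with a hypothetical connection string, view name and columns:

from sqlalchemy import create_engine, text
import pandas as pd

engine = create_engine('mssql+pyodbc://user:password@my_dsn')  # hypothetical DSN

insert_stmt = text(
    'INSERT INTO my_view (col_a, col_b, col_c) VALUES (:col_a, :col_b, :col_c)'
)

with engine.begin() as conn:
    # executemany-style insert of all DataFrame rows through the view
    conn.execute(insert_stmt, df[['col_a', 'col_b', 'col_c']].to_dict(orient='records'))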
I am using SQLAlchemy (sqlalchemy-redshift) as the engine with pandas. While writing to Redshift with to_sql I get the following error:
sqlalchemy.exc.NotSupportedError: (psycopg2.NotSupportedError) SQL command "CREATE INDEX ix_western_union_answer_pivot_index ON western_union_answer_pivot (index)" not supported on Redshift tables.
[SQL: 'CREATE INDEX ix_western_union_answer_pivot_index ON western_union_answer_pivot (index)'] (Background on this error at: http://sqlalche.me/e/tw8g)
While I understand the problem (see How to create an Index in Amazon Redshift), I have two questions:
1. Shouldn't sqlalchemy-redshift translate the CREATE INDEX into a Redshift-supported SORTKEY clause? That's the point of using an ORM, right?
2. As a workaround, can I stop to_sql from creating the DB index?
UPDATE:
On setting index=False in to_sql, the above issue is solved, but I end up with:
sqlalchemy.exc.DataError: (psycopg2.DataError) value too long for type character varying(256)
Is 256 the maximum size in Redshift? Is there any solution to this apart from slicing the data to 256 characters and losing information?
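One workaround for the second error, assuming the long values need to be kept, is to widen the text columns explicitly through to_sql's dtype argument instead of relying on the default VARCHAR(256); Redshift allows VARCHAR columns up to 65535 bytes. A sketch with a hypothetical column name:

from sqlalchemy.types import VARCHAR

df.to_sql(
    'western_union_answer_pivot',
    engine,
    index=False,                            # also avoids the unsupported CREATE INDEX
    if_exists='replace',
    dtype={'answer_text': VARCHAR(65535)},  # hypothetical column; 65535 is Redshift's VARCHAR ceiling
)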
I would like to add a column which is the result of two existing columns in BigQuery. I am using Apache Beam to read from BigQuery, process the data, and write the results back to the same BigQuery table as a new column.
The Beam BigQuery connector does not explicitly support BigQuery DML; however, you can write a pipeline that inserts the result of your processing into a separate table and, after the pipeline runs, run a DML statement that updates the column in the original table from that auxiliary table.
Alternatively, if your processing logic can be expressed in SQL, you're probably better off just implementing it as an SQL DML statement without using a pipeline.
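As a rough sketch of that post-pipeline DML step (the project, dataset, table and column names are hypothetical, and this assumes the google-cloud-bigquery client is available):

from google.cloud import bigquery

client = bigquery.Client()

# `aux` is the table the Beam pipeline wrote to; `original` is the table to update
dml = """
UPDATE `my_project.my_dataset.original` o
SET new_col = a.new_col
FROM `my_project.my_dataset.aux` a
WHERE o.id = a.id
"""

client.query(dml).result()  # blocks until the DML job finishes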
I want to use Zeppelin to query databases. I currently see two possibilities, but neither of them is sufficient for me:
1. Configure a database connection as an "interpreter", name it e.g. "sql1", use it in a paragraph, run a SQL query and use the built-in plotting tools. All the tutorials and tips seem to deal with this, but then the documentation suddenly stops! I want to do more with the data: I want to filter and process it. If I want to plot it again (with other restrictions), I have to run the query (which may take seconds or minutes) again (see my other question Zeppelin SQL: reuse data of query without another interpreter or a new query).
2. Use Spark with Python, Scala or similar. But the documentation only seems to load CSV data, put it into a dataframe and then access this dataframe with SQL. There is no accessing the SQL data in the first place. What is the best way to access the SQL data? Can I use an already configured "interpreter" (database connection)?
You can use the Zeppelin API to retrieve paragraph data:
// fetch the paragraph definition from the Zeppelin notebook REST API
val buffer = scala.io.Source.fromURL("http://XXXXX:9995/api/notebook/2CN2QP93H/paragraph/20170713-092810_1633770798").mkString
// parse the JSON response and keep only the paragraph text (the SQL query)
val df = sqlContext.read.json(sc.parallelize(buffer :: Nil)).select("body.text")
df.first.getAs[String](0)
These Spark Scala lines retrieve the SQL query used by a paragraph. You could do the same thing to get the results, I think.
I cannot find a solution for 1., but I have made a short solution for 2. that works within Zeppelin with Python (2.7), sqlalchemy (SQL wrapper), mysqldb (MySQL implementation) and pandas (make sure you have these packages installed; all of them are in Debian 9). I wonder why I have not found such a solution before...
%python
from sqlalchemy import create_engine
import pandas as pd
sql = "select col1, col2 from table limit 10"
# run the query through a SQLAlchemy connection and load the result into a DataFrame
df = pd.read_sql(sql,
                 create_engine('mysql+mysqldb://user:password@host:3306/database').connect())
# display the DataFrame with Zeppelin's built-in table/plotting UI
z.show(df)
If you want to connect to another database such as DB2 or Oracle, you have to install the corresponding Python packages and adjust the first part of the create_engine string.
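For example, a hedged sketch for DB2, assuming the ibm_db_sa dialect and the ibm_db driver are installed (only the engine URL changes, here with DB2's default port):

%python
from sqlalchemy import create_engine
import pandas as pd
sql = "select col1, col2 from table limit 10"
# DB2 instead of MySQL: only the dialect/driver part of the URL (and the port) changes
df = pd.read_sql(sql,
                 create_engine('db2+ibm_db://user:password@host:50000/database').connect())
z.show(df)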