I have a database that contains two tables, cdr and mtr. I want to join the two on the columns ego_id and alter_id, and I want to write the output into another table in the same database, complete with the column names, without using pandas.
Here's my current code:
mtr_table = Table('mtr', MetaData(), autoload=True, autoload_with=engine)
print(mtr_table.columns.keys())
cdr_table = Table('cdr', MetaData(), autoload=True, autoload_with=engine)
print(cdr_table.columns.keys())
query = db.select([cdr_table])
query = query.select_from(mtr_table.join(cdr_table,
((mtr_table.columns.ego_id == cdr_table.columns.ego_id) &
(mtr_table.columns.alter_id == cdr_table.columns.alter_id))),
)
results = connection.execute(query).fetchmany()
Currently, for my test code, I convert the results to a pandas DataFrame and then put it back into the original SQL database:
df = pd.DataFrame(results, columns=results[0].keys())
df.to_sql(...)
but I have two problems:
- loading everything into a pandas DataFrame would require too much memory once I start working with the full database
- the column names are (apparently) not included in results and would need to be accessed via results[0].keys()
I've checked this other Stack Overflow question, but it uses SQLAlchemy's ORM framework, which I unfortunately don't understand. If there's a simpler way to do this (something like pandas' to_sql), that would be easier.
What's the easiest way to go about this?
So I found out how to do this via CREATE TABLE AS:
query = """
CREATE TABLE mtr_cdr AS
SELECT
mtr.idx,cdr.*
FROM mtr INNER JOIN cdr
ON (mtr.ego_id = cdr.ego_id AND mtr.alter_id = cdr.alter_id)""".format(new_table)
with engine.connect() as conn:
conn.execute(query)
The query string seems to be highly sensitive to parentheses, though. If I put parentheses around the whole SELECT ... FROM ... statement, it doesn't work.
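For completeness, a pandas-free alternative that stays in SQLAlchemy Core is to stream the join result in chunks and bulk-insert it into the target table. This is only a rough sketch: it assumes the target table mtr_cdr already exists with columns matching the join output, it reuses query and engine from above, and the chunk size of 10000 is arbitrary.

target = Table('mtr_cdr', MetaData(), autoload=True, autoload_with=engine)

with engine.connect() as read_conn, engine.begin() as write_conn:
    result = read_conn.execute(query)      # the join query built above
    cols = result.keys()                   # column names, no pandas needed
    while True:
        rows = result.fetchmany(10000)     # stream in chunks to limit memory
        if not rows:
            break
        write_conn.execute(target.insert(), [dict(zip(cols, r)) for r in rows])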
Here is my code
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine("connection string")
conn_obj = engine.connect()
my_df = pd.DataFrame({'col1': ['29199'], 'date_created': ['2022-06-29 17:15:49.776867']})
my_df.to_sql('SomeSQLTable', conn_obj, if_exists='append', index = False)
I also created SomeSQLTable with script:
CREATE TABLE SomeSQLTable(
col1 nvarchar(90),
date_created datetime2)
GO
Everything runs fine, but no records are inserted into the SQL table and no errors are displayed. I am not sure how to troubleshoot. conn_obj works fine; I was able to pull data.
This isn't exactly an answer, but I don't have the privileges to comment right now.
First of all, pd.to_sql() returns the number of rows affected by the operation; can you please check that?
Secondly, you are defining the data types in the table creation, so it could be a problem of casting the data types. I never create the table through SQL, as pd.to_sql() can create it if needed.
Thirdly, please check the table name; there can be issues with Pascal case in some databases.
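If the row count looks right but the table stays empty, the writes may simply never be committed. Here is a minimal sketch (assuming SQLAlchemy): run to_sql inside engine.begin() so the transaction commits on exit, and check the return value (pandas 1.4 and later report the number of rows written):

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine("connection string")
my_df = pd.DataFrame({'col1': ['29199'],
                      'date_created': ['2022-06-29 17:15:49.776867']})

with engine.begin() as conn:    # commits (or rolls back) on exit
    rows = my_df.to_sql('SomeSQLTable', conn, if_exists='append', index=False)
print(rows)                     # number of rows written (pandas >= 1.4)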
I'm trying to figure out how to treat a pandas data frame as a SQL table when querying a database in Python.
I'm coming from a SAS background where work tables can easily be incorporated into direct database queries.
For example:
Select a.first_col,
b.second_col
from database.table1 a
left join work.table1 b on a.id = b.id;
Here work.table1 is not in the database, but is a dataset held in the local SAS server.
In my research I have found ways to write a data frame to a database and then include that in the query. I do not have write access to the database, so that is not an option for me.
I also know that I can use sqlalchemy with pd.to_sql() to put a data frame into a SQL engine, but I can't figure out if there is a way to connect that engine with the pyodbc connection I have with the database.
I also tried this though I didn't think it would work (names of tables and columns altered).
df = pd.DataFrame(['A342', 'B432', 'W345'], columns=['id'])
query = '''
select a.id, b.id
from df a
left join database.base_table b on a.id= b.id
'''
query_results = pd.read_sql_query(query,connection)
As I expected it didn't work.
I'm connecting to a Netezza database, I'm not sure if that matters.
I don't think it is possible; it would have to be written to its own table in order to be queried, although I'm not familiar with Netezza.
Why not perform the join (a "merge" in pandas) purely in pandas? Understood that if the SQL table is massive this isn't feasible, but:
query = """
SELECT id, x
FROM a
"""
a = pd.read_sql(query, conn)
df = ...  # some dataframe already in memory
pd.merge(df, a, on='id', how='left')
see https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html for good docs about sql and pandas similarities
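If the SQL table is too big to pull in full, one workaround is to push only the local ids into the query as parameters and merge the trimmed result. A sketch, assuming a pyodbc-style connection that accepts ? placeholders (table and column names taken from the question):

ids = df['id'].tolist()
placeholders = ','.join('?' * len(ids))    # one ? per local id
query = f"""
    SELECT b.id, b.second_col
    FROM database.base_table b
    WHERE b.id IN ({placeholders})
"""
a = pd.read_sql_query(query, connection, params=ids)
result = df.merge(a, on='id', how='left')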
I have trouble querying a table created with SQLAlchemy on a Postgres database (local).
While I am able to execute, and receive a query result from:
SELECT * FROM olympic_games
I am getting an error message when I try to access a single column, or perform any other operation on the table:
SELECT games FROM olympic_games
The error message is (a couple of sentences translated from Polish):
ProgrammingError: (psycopg2.errors.UndefinedColumn) ERROR: column "games" does not exist
LINE 1: SELECT COUNT(Sport)
^
HINT: maybe you meant "olympic_games.Games".
[SQL: SELECT games FROM olympic_games LIMIT 5;]
(Background on this error at: http://sqlalche.me/e/f405)
It pretty much boils down to the program not seeing, or not being able to access, the specific column, and reporting that it doesn't exist.
I tried accessing it with the table.column format; that didn't work either. I am also able to see the column names via information_schema.columns.
The data (.csv) was loaded with pd.read_csv and then DataFrame.to_sql. Code below, thanks for the help!
engine = create_engine('postgresql://:@:/olympic_games')
with open('olympic_athletes_2016_14.csv', 'r') as file:
    df = pd.read_csv(file, index_col='ID')
df.to_sql(name='olympic_games', con=engine, if_exists='replace', index_label='ID')
Both execute commands returned the same error:
with engine.connect() as con:
    rs = con.execute("SELECT games FROM olympic_games LIMIT 5;")
    df_fetch = pd.DataFrame(rs.fetchall())

df_fetch2 = engine.execute("""SELECT games FROM olympic_games LIMIT 5;""").fetchall()
Essentially, this is the double quoting issue of column identifiers as mentioned in the PostgreSQL manual:
Quoting an identifier also makes it case-sensitive, whereas unquoted names are always folded to lower case. For example, the identifiers FOO, foo, and "foo" are considered the same by PostgreSQL, but "Foo" and "FOO" are different from these three and each other.
When any of your Pandas data frame columns have mixed case, DataFrame.to_sql preserves the case sensitivity by creating the columns with double quotes at the CREATE TABLE stage. Specifically, the Python Pandas code below, when using replace,
df.to_sql(name='olympic_games', con=engine, if_exists='replace', index_label='ID')
translates as below in Postgres if Sport was a title-case column in the data frame:
DROP TABLE IF EXISTS public."olympic_games";
CREATE TABLE public."olympic_games"
(
...
"Sport" varchar(255)
"Games" varchar(255)
...
);
Once an identifier is quoted with mixed case, it must always be referred to in that manner. Therefore sport is not the same as "Sport". Remember that in SQL, double quotes are different from single quotes, whereas the two are interchangeable in Python.
To fix this, consider rendering all your Pandas columns in lower case, since "games" is the same as games, Games, or GAMES (but not "Games" or "GAMES").
df.columns = df.columns.str.lower()
df.to_sql(name='olympic_games', con=engine, if_exists='replace', index_label='ID')
Alternatively, leave as is and quote appropriately:
SELECT "Games" FROM olympic_games
Try SELECT "games" FROM olympic_games. In some cases PostgreSQL create the quotes around a columns names. For example if the column name contained mixed register. I have to remind you: PostgreSQL is case sensitive
I am writing a script that should create a schema for each customer. I'm fetching all the metadata from a database that defines what each customer's schema should look like, and then creating it. Everything is well defined: the types, the names of the tables, etc. A customer has many tables (e.g. address, customers, contact, item, etc.), and each table has the same metadata.
My procedure now:
Get everything I need from the metadata database.
In a for loop, create a table, then ALTER TABLE to add each piece of metadata (this is done for each table).
Right now my script runs in about a minute per customer, which I think is too slow. It has something to do with the fact that I have a loop, and in that loop I'm altering each table.
I think that instead of altering (which might not be so clever an approach), I should do something like the following:
Note that this is just a stupid but valid example:
for table in tables:
    con.execute("CREATE TABLE IF NOT EXISTS tester.%s (%s, %s);", (table, "last_seen date", "valid_from timestamp"))
But it gives me this error (it seems like it renders the table name as a quoted string within the string):
psycopg2.errors.SyntaxError: syntax error at or near "'billing'"
LINE 1: CREATE TABLE IF NOT EXISTS tester.'billing' ('last_seen da...
Consider creating the tables with a serial (i.e., autonumber) ID field and then adding all other fields with ALTER TABLE, using a combination of sql.Identifier for identifiers (schema names, table names, column names, function names, etc.) and regular string formatting for the data types, which are not literals in the SQL statement.
from psycopg2 import sql

# CREATE TABLE
query = """CREATE TABLE IF NOT EXISTS {shm}.{tbl} (ID serial)"""
cur.execute(sql.SQL(query).format(shm=sql.Identifier("tester"),
                                  tbl=sql.Identifier("table")))

# ALTER TABLE
items = [("last_seen", "date"), ("valid_from", "timestamp")]
query = """ALTER TABLE {shm}.{tbl} ADD COLUMN {col} {typ}"""

for item in items:
    # KEEP IDENTIFIER PLACEHOLDERS
    final_query = query.format(shm="{shm}", tbl="{tbl}", col="{col}", typ=item[1])
    cur.execute(sql.SQL(final_query).format(shm=sql.Identifier("tester"),
                                            tbl=sql.Identifier("table"),
                                            col=sql.Identifier(item[0])))
Alternatively, use str.join with list comprehension for one CREATE TABLE:
query = """CREATE TABLE IF NOT EXISTS {shm}.{tbl} (
"id" serial,
{vals}
)"""
items = [("last_seen", "date"), ("valid_from", "timestamp")]
val = ",\n ".join(["{{}} {typ}".format(typ=i[1]) for i in items])
# KEEP IDENTIFIER PLACEHOLDERS
pre_query = query.format(shm="{shm}", tbl="{tbl}", vals=val)
final_query = sql.SQL(pre_query).format(*[sql.Identifier(i[0]) for i in items],
                                        shm=sql.Identifier("tester"),
                                        tbl=sql.Identifier("table"))
cur.execute(final_query)
SQL (sent to database)
CREATE TABLE IF NOT EXISTS "tester"."table" (
"id" serial,
"last_seen" date,
"valid_from" timestamp
)
However, the one-column-at-a-time ALTER TABLE approach becomes heavy, as it requires too many server round trips.
How many tables with how many columns are you creating that this is slow? Could you ssh to a machine closer to your server and run the python there?
I don't get that error; rather, I get an SQL syntax error. A VALUES list is for conveying data, but ALTER TABLE is not about data, it is about metadata, so you can't use a VALUES list there. You need the names of the columns and types in double quotes (or no quotes) rather than single quotes; you can't have a comma between the name and the type; you can't have parentheses around each pair; and each pair needs to be introduced with ADD, not just the first one. You are using the wrong tool for the job. execute_batch is almost the right tool, except it will use single quotes rather than double quotes around the identifiers. Perhaps you could add a flag to tell it to use quote_ident.
Not only is execute_values the wrong tool for the job, but I think python in general might be as well. Why not just load from a .sql file?
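To illustrate what the comment above describes, here is a rough sketch (not the poster's actual code) that composes a single ALTER TABLE with one ADD COLUMN clause per pair, double-quoting identifiers via psycopg2.sql; the schema and table names are taken from the error message in the question:

from psycopg2 import sql

items = [("last_seen", "date"), ("valid_from", "timestamp")]

# one ADD COLUMN per pair; identifiers are quoted, type names are injected as raw SQL
adds = sql.SQL(", ").join(
    [sql.SQL("ADD COLUMN {} {}").format(sql.Identifier(name), sql.SQL(typ))
     for name, typ in items]
)
stmt = sql.SQL("ALTER TABLE {}.{} {}").format(
    sql.Identifier("tester"), sql.Identifier("billing"), adds
)
cur.execute(stmt)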
I'm trying to replace some old MSSQL stored procedures with Python, in an attempt to take some of the heavy calculations off the SQL server. The part of the procedure I'm having issues replacing is as follows:
UPDATE mytable
SET calc_value = tmp.calc_value
FROM dbo.mytable mytable INNER JOIN
#my_temp_table tmp ON mytable.a = tmp.a AND mytable.b = tmp.b AND mytable.c = tmp.c
WHERE (mytable.a = some_value)
and (mytable.x = tmp.x)
and (mytable.b = some_other_value)
Up to this point, I've made some queries with SQLAlchemy, stored the data in DataFrames, and done the requisite calculations on them. I don't know how to put the data back into the server using SQLAlchemy, either with raw SQL or function calls. The DataFrame I have on my end would essentially have to take the place of the temporary table created in MSSQL Server, but I'm not sure how to do that.
The difficulty, of course, is that I don't know of a way to join a DataFrame with an MSSQL table, and I'm guessing this wouldn't work, so I'm looking for a workaround.
As the pandas docs suggest here:
from sqlalchemy import create_engine
engine = create_engine("mssql+pyodbc://user:password@DSN", echo=False)
dataframe.to_sql('tablename', engine, if_exists='replace')
The engine parameter for MSSQL is basically the connection string; check it here.
The if_exists parameter is a bit tricky, since 'replace' actually drops the table first, then recreates it and inserts all the data at once.
Setting the echo attribute to True shows all the background logs and SQL statements.
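To address the original temp-table question more directly, a rough sketch: write the calculated frame to a staging table with to_sql and then run the original UPDATE ... FROM join server-side. The staging table name my_temp_table and the bind parameters a_val/b_val are placeholders for this example, and some_value/some_other_value stand for the literal filter values from the original procedure:

from sqlalchemy import create_engine, text

engine = create_engine("mssql+pyodbc://user:password@DSN", echo=False)

with engine.begin() as conn:
    # stage the calculated values (plays the role of #my_temp_table)
    dataframe.to_sql('my_temp_table', conn, if_exists='replace', index=False)
    # let SQL Server do the join-update
    conn.execute(text("""
        UPDATE mytable
        SET calc_value = tmp.calc_value
        FROM dbo.mytable mytable
        INNER JOIN dbo.my_temp_table tmp
            ON mytable.a = tmp.a AND mytable.b = tmp.b AND mytable.c = tmp.c
        WHERE mytable.a = :a_val
          AND mytable.x = tmp.x
          AND mytable.b = :b_val
    """), {"a_val": some_value, "b_val": some_other_value})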