Data frame does not display column names - python

I wrote a script that first runs a SQL query to get data from Redshift (via Databricks) and then loads the result into a pandas DataFrame. The problem is that the column names are removed or not displayed. Why?
# SQL query
query = """
SELECT * FROM table1 limit 1;
"""
# Execute the query
try:
    cursor.execute(query)
except OperationalError as msg:
    print("Command skipped: ", msg)
# Fetch all rows from the result
rows = cursor.fetchall()
# Convert into a pandas DataFrame
df = pd.DataFrame( [[ij for ij in i] for i in rows] )
df.head()
Output:
As you can see, the column names turned into numbers (in yellow). The intent was to display column name 1: Customer_id, column name 2: Purchases, column name 3: Product_id etc.
I appreciate any help. Thanks!

As suggested by @Chris, you can use pd.read_sql as follows:
query = """SELECT * FROM table1 limit 1;"""
connection = psycopg2.connect(user='your_username',
                              password='password',
                              host='host_ip',
                              port=5432,
                              database='db_name')
data = pd.read_sql(sql=query, con=connection)
Now, when you print data, the column names are shown as well.
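If you prefer to keep the cursor-based code from the question, the column names can usually be read from cursor.description, which DB-API cursors expose after a query has run. The following is a minimal sketch, assuming an in-memory SQLite database as a stand-in for the real Redshift connection; the table contents are made up for illustration.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the real connection (assumption).
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute(
    "CREATE TABLE table1 (Customer_id INTEGER, Purchases INTEGER, Product_id INTEGER)"
)
cursor.execute("INSERT INTO table1 VALUES (1, 3, 42)")  # illustrative row

cursor.execute("SELECT * FROM table1 LIMIT 1")
rows = cursor.fetchall()

# Each entry of cursor.description is a sequence whose first item is the column name.
column_names = [col[0] for col in cursor.description]

# Passing the names explicitly keeps them instead of the default 0, 1, 2, ...
df = pd.DataFrame(rows, columns=column_names)
print(df.head())
connection.close()
```

The only change compared with the question's code is the columns argument; without it pandas has no way to know the names and falls back to integer labels.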

Related

Duplicated values as an output of python function

I would like to create a Python function that performs multiple SQL inserts. I will use this function in my Airflow DAG for inserts into a Snowflake database, and I will need to create a SnowflakeOperator that uses it. I have only just started using Airflow, so please correct me if I'm wrong.
My example:
I connect to the Snowflake database to read a table that holds schema names and table names. I use this output to run the inserts per schema. I select one schema by setting the variable my_schema = 'my_schema'.
First approach:
sql = "SELECT SCHEMA, TABLE FROM TABLE"
cur.execute(sql)
df = pd.DataFrame.from_records(iter(cur), columns=[x[0] for x in cur.description])
my_dict = dict()
for i in df['SCHEMA'].unique().tolist():
    df_x = df[df['SCHEMA'] == i]
    my_dict[i] = df_x['TABLE'].tolist()
for schema, tables in my_dict.items():
    for table in tables:
        query = f"INSERT INTO {schema}.{table} SELECT * FROM {schema}.{table} where col2 = 1;"
        try:
            cur.execute(query)
        except snowflake.connector.errors.ProgrammingError as e:
            # Something went wrong with the insert
            logging.error(f"Inserting in {schema}.{table}: {e}")
conn.close()
For testing, I created a pandas DataFrame with two columns, schema and table.
data = [['test', 'table01'], ['test', 'table02'], ['my_schema', 'table03'], ['schemaxxx', 'table04']]
# Create the pandas DataFrame
df_new = pd.DataFrame(data, columns=['schema', 'table'])
I created a function for the inserts.
my_schema = 'my_schema'
def my_insert_fnc(df):
    my_dict = dict()
    for i in df['schema'].unique().tolist():
        df_x = df[df['schema'] == i]
        my_dict[i] = df_x['table'].tolist()
    sql_list = []
    for schema, tables in my_dict.items():
        for table in tables:
            if schema == my_schema:
                sql_list.append(f"INSERT INTO {schema}.{table} SELECT * FROM {schema}.{table} where col2 = 1;")
        print(sql_list)
But I'm getting duplicates.
my_insert_fnc(df_new)
['INSERT INTO my_schema.table03 SELECT * FROM my_schema.table03 where col2 = 1;']
['INSERT INTO my_schema.table03 SELECT * FROM my_schema.table03 where col2 = 1;']
I would like to remove the duplicates and log any errors:
try:
    cur.execute(query)
except snowflake.connector.errors.ProgrammingError as e:
    # Something went wrong with the insert
    logging.error(f"Inserting in {schema}.{table}: {e}")
As I mentioned, I need to use this function in my Airflow DAG, so it needs to return a string that I can pass to the SnowflakeOperator. Please correct me if I'm wrong.
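The repeated lines in the output look like the same list being printed more than once, which would happen if print(sql_list) runs inside one of the loops. A possible fix is to return the list once, after all loops have finished. The sketch below assumes the test DataFrame from the question; build_insert_statements is a hypothetical name, and drop_duplicates guards against repeated schema/table rows in the input.

```python
import pandas as pd


def build_insert_statements(df, target_schema):
    """Return one INSERT statement per distinct table in target_schema."""
    pairs = (
        df.loc[df["schema"] == target_schema, ["schema", "table"]]
        .drop_duplicates()
        .itertuples(index=False, name=None)
    )
    return [
        f"INSERT INTO {schema}.{table} SELECT * FROM {schema}.{table} WHERE col2 = 1;"
        for schema, table in pairs
    ]


# Test data from the question, with one row repeated on purpose.
data = [
    ["test", "table01"],
    ["test", "table02"],
    ["my_schema", "table03"],
    ["my_schema", "table03"],
    ["schemaxxx", "table04"],
]
df_new = pd.DataFrame(data, columns=["schema", "table"])

statements = build_insert_statements(df_new, "my_schema")
print(statements)

# If a single string is needed, the statements can be joined.
sql_text = "\n".join(statements)
```

To my understanding the sql argument of SnowflakeOperator accepts either a single string or a list of strings, so the returned list may be usable directly; that is worth checking against the provider version you have installed. Error logging then happens in Airflow's task log rather than in this function, because the function only builds the statements and does not execute them.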

Printing data results from postgresql to panda dataframe

I am trying to print the results of a joined table from PostgreSQL in Python. However, when I print the results, the table shows up but contains only NaN values. Can someone help?
conn = psy.connect(dbname="funda_project", host="localhost", user="postgres",
                   password="ledidhima2021.")
cursor = conn.cursor()
conn.commit()
createjointable2 = '''SELECT(
distance_data."Municipality",
distance_data."Childcare/Nursery",
distance_data."Leisure/Culture/Library",
sales_details."Purchase_price",
sales_details."Publication_date",
sales_details."Date_of_signature",
house_details."Type_of_house",
house_details."Object_categorie",
house_details."Construction_year",
house_details."Energy_label_class",
demo_data."Age_Group_Relation_(15-20)",
demo_data."Age_Group_Relation_(20-25)",
demo_data."Age_Group_Relation_(25-45)")
FROM "distance_data"
INNER JOIN "zip_data"
ON "distance_data"."Municipality" = "zip_data"."Municipality"
INNER JOIN "demo_data"
ON "zip_data"."Municipality" = "demo_data"."Municipality"
INNER JOIN "sales_details"
ON "zip_data"."globalId" = "sales_details"."GlobalID"
INNER JOIN "house_details"
ON "zip_data"."globalId" = "house_details"."GlobalID"
;'''
cursor.execute(createjointable2)
from pandas import DataFrame
eri = pd.DataFrame(cursor.fetchall())
datalist = list(eri)
results = pd.DataFrame(eri, columns=["Municipality", "Childcare/Nursery",
    "Leisure/Culture/Library", "Purchase_price", "Publication_date", "Date_of_signature",
    "Type_of_house", "Object_categorie", "Construction_year", "Energy_label_class",
    "Age_Group_Relation_(15-20)", "Age_Group_Relation_(20-25)", "Age_Group_Relation_(25-45)"])
results
pandas has a built-in function for reading SQL queries, pd.read_sql_query(query, connection), which returns the result of the query as a DataFrame with the column names already set.
dataframe = pd.read_sql_query("SELECT * FROM table;", conn)
Here conn is the connection object you already created in your code.
Another way is close to what you tried:
from pandas import DataFrame
df = DataFrame(cursor.fetchall())
df.columns = [desc[0] for desc in cursor.description]
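The NaN values in the question most likely come from passing an existing DataFrame together with columns=[...]. In that case pandas selects the named columns from the DataFrame instead of renaming them, and since the raw frame only has integer labels, every requested column comes back empty. A minimal sketch with made-up rows (the values are illustrative, not from the real database):

```python
import pandas as pd

# Illustrative rows, shaped like what cursor.fetchall() returns.
rows = [("Amsterdam", 350000), ("Utrecht", 420000)]
names = ["Municipality", "Purchase_price"]

# Step 1 of the question: the frame gets integer column labels 0 and 1.
raw = pd.DataFrame(rows)

# Step 2 of the question: columns= selects by label from `raw`.
# No column is called "Municipality" or "Purchase_price", so all values are NaN.
wrong = pd.DataFrame(raw, columns=names)

# Passing the raw rows (not a DataFrame) assigns the names positionally.
right = pd.DataFrame(rows, columns=names)

print(wrong)
print(right)
```

Separately, wrapping the whole select list in SELECT( ... ) makes PostgreSQL build a single composite value per row, so the query in the question probably returns one column rather than thirteen; dropping those parentheses should give one column per field.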

Looping through dataframe values in columns and using them as a FROM clause using SQL

I am running BigQuery in Jupyter notebook.
query ="""
SELECT
table_catalog,
table_schema,
table_name,
FROM `Project-A.schema_A`.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
"""
The output leads me to the following table:
# This is the output of the query
data = {'table_catalog': ['project-A', 'project-A', 'project-A', 'project-A', 'Project-A', 'Project-A', 'Project-A'],
        'table_schema': ['schema_A', 'schema_A', 'schema_A', 'schema_A', 'schema_A', 'schema_A', 'schema_A'],
        'table_name': ['Table_A', 'Table_B', 'Table_B', 'Table_C', 'Table_C', 'Table_A', 'Table_A']}
# Create DataFrame
df = pd.DataFrame(data)
I want to use Table_A, Table_B and Table_C in the FROM clause of my next query, so that it looks like:
query =f"""
SELECT
*
FROM Project-A.Schema_A.{I want to edit this dynamically - either Table_A, Table_B, Table_C}"""
I tried the following, but it fails. Please help me with this:
list_of_tables = list(df['table_name'].unique())
def loop_tables(x):
    for tables in list_of_tables:
        if x == tables
            # x = df['table_name']
loop_tables()
Try this:
def loop_tables():
    # Uses list_of_tables from the question and an existing BigQuery client object
    list_of_dataframes = []
    for table in list_of_tables:
        print(table)
        dynamic_sql = "select * from project.dataset."
        dynamic_sql += table
        df = client.query(dynamic_sql).to_dataframe()
        list_of_dataframes.append(df)
    return list_of_dataframes
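Because the table name is concatenated into the SQL text, it can help to build and check the query strings separately before sending them to BigQuery. The sketch below is one way to do that; build_queries, the identifier pattern, and the default project and dataset names are assumptions for illustration. The backticks around the full path are there so that a hyphenated project ID such as Project-A is read as a single name.

```python
import re

# Conservative pattern for table names: letters, digits and underscores only (assumption).
_IDENTIFIER = re.compile(r"^[A-Za-z0-9_]+$")


def build_queries(table_names, project="Project-A", dataset="schema_A"):
    """Return {table_name: query_text} with one entry per distinct table."""
    queries = {}
    # dict.fromkeys drops duplicates while keeping the original order.
    for table in dict.fromkeys(table_names):
        if not _IDENTIFIER.match(table):
            raise ValueError(f"Unexpected table name: {table!r}")
        queries[table] = f"SELECT * FROM `{project}.{dataset}.{table}`"
    return queries


table_names = ["Table_A", "Table_B", "Table_B", "Table_C", "Table_C", "Table_A", "Table_A"]
queries = build_queries(table_names)
for name, sql in queries.items():
    print(name, "->", sql)
```

Each query string can then be passed to client.query(...).to_dataframe() as in the answer above, and keeping the results in a dictionary keyed by table name makes it easier to tell the resulting DataFrames apart.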

Fetching data from Postgresq Database using sqlalchemy.select() in python

I am using Python and SQLAlchemy to fetch data from a table.
import sqlalchemy as db
import pandas as pd
DATABASE_URI = 'postgresql+psycopg2://postgres:postgresql@localhost:5432/postgres'
engine = db.create_engine(DATABASE_URI)
connection = engine.connect()
metadata = db.MetaData()
project_table = db.Table('project', metadata, autoload=True, autoload_with=engine)
Here I want to fetch records based on a list of IDs that I have.
l=[557997, 558088, 623106, 558020, 623108, 557836, 557733, 622792, 623511, 623185]
query1 = db.select([project_table]).where(project_table.columns.project_id.in_(l))
# SQL equivalent: select * from project where project_id in (<values of l>)
Result = connection.execute(query1)
Rset = Result.fetchall()
df = pd.DataFrame(Rset)
print(df.head())
When I print df.head(), I get an empty DataFrame. It seems I am not able to pass a list to the query above. Is there a way to pass a list to it?
The result should contain the rows of the table whose project_id is in the given list,
i.e.
project_id project_name project_date project_developer
557997 Test1 24-05-2011 Ajay
558088 Test2 24-06-2003 Alex
These rows will then be loaded into a dataset.
The query is
"select * from project where project_id in (557997, 558088, 623106, 558020, 623108, 557836, 557733, 622792, 623511, 623185)"
Since I can't hard-code the values, I collect them in a list and want to pass this list to the query as a parameter.
This is where I am having a problem: how can I pass a list to db.select()?
After many trials I found that the query returned no results because the result set was large and my workstation has little RAM. So I fetched the rows in batches instead:
Result = connection.execute(query1)
columns = list(Result.keys())
table_data = []
while True:
    rows = Result.fetchmany(10000)
    if not rows:
        break
    for row in rows:
        table_data.append(row)
df1 = pd.DataFrame(table_data)
df1.columns = columns
After this change the program worked fine.
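The same two ideas, passing a list of IDs to an IN clause and reading the result in batches, can be sketched with only the standard library. This is not the SQLAlchemy code from the question: it assumes an in-memory SQLite database as a stand-in for PostgreSQL, and fetch_by_ids is a hypothetical helper name.

```python
import sqlite3


def fetch_by_ids(connection, ids, batch_size=10000):
    """Return (column_names, rows) for projects whose project_id is in ids."""
    if not ids:
        return [], []
    # One "?" placeholder per ID, so the values are bound rather than pasted into the SQL.
    placeholders = ", ".join("?" for _ in ids)
    query = f"SELECT * FROM project WHERE project_id IN ({placeholders})"
    cursor = connection.execute(query, list(ids))
    column_names = [col[0] for col in cursor.description]
    rows = []
    while True:
        # fetchmany returns an empty list once the result is exhausted.
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        rows.extend(batch)
    return column_names, rows


# Stand-in data for illustration only.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE project (project_id INTEGER, project_name TEXT)")
connection.executemany(
    "INSERT INTO project VALUES (?, ?)",
    [(557997, "Test1"), (558088, "Test2"), (1, "Other")],
)

column_names, rows = fetch_by_ids(connection, [557997, 558088], batch_size=1)
print(column_names)
print(rows)
connection.close()
```

The collected rows and column names can be handed to pd.DataFrame(rows, columns=column_names). Note that collecting every batch into one list still holds the full result in memory at the end; if memory is the real constraint, processing each batch as it arrives is the safer design.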

How can I handle errors inside of a for loop inside of a cx_Oracle connection?

Here's a rundown of what I'd like to do. I have a list of table names, and I want to run SQL against an Oracle database and pull back the table name and row count for every table in the list. However, not every table name in my list necessarily exists in the database, which causes my code to throw a database error. Whenever I come to a table name that is not in the database, I would like to create a DataFrame that contains the table name and, instead of count(*), some text such as 'table not found'. At the end of the loop I concatenate all of the DataFrames into one. The overall goal is to validate that certain tables exist and that they have the expected row counts.
query_list = []
df_List = []
connstr = '%s/%s@%s' % (username, password, server)
conn = cx_Oracle.connect(connstr)
with conn:
    query_list = ["SELECT '%s' as tbl, count(*) FROM %s." % (elm, database) + elm for elm in table_list]
    df_List = [pd.read_sql(elm, conn) for elm in query_list]
    df = pd.concat(df_List)
Consider try/except handling to return either the query output or a "table not found" row:
def get_table_count(sql, conn, elm):
    try:
        return pd.read_sql(sql, conn)
    except Exception:
        return pd.DataFrame({'tbl': elm, 'note': 'table not found'}, index=[0])
with conn:
    sql = "SELECT '{t}' as tbl, count(*) as table_count FROM {d}.{t}"
    df_List = [get_table_count(sql.format(t=elm, d=database), conn, elm)
               for elm in table_list]
    df = pd.concat(df_List, ignore_index=True)
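Catching every exception also hides unrelated failures such as a dropped connection or a typo in the SQL. A narrower variant catches only the driver's "no such table" style of error. The sketch below assumes an in-memory SQLite database as a stand-in for Oracle, where that error is sqlite3.OperationalError; with cx_Oracle the analogous exception would presumably be cx_Oracle.DatabaseError, and pd.read_sql may wrap driver errors in its own exception type, so check which one your setup raises.

```python
import sqlite3


def get_table_count(connection, table_name):
    """Return a result row for one table, or a 'table not found' row."""
    try:
        # Identifiers cannot be bound as parameters, so table_name is assumed
        # to come from a trusted list.
        cursor = connection.execute(f"SELECT count(*) FROM {table_name}")
        return {"tbl": table_name, "table_count": cursor.fetchone()[0], "note": ""}
    except sqlite3.OperationalError:
        # SQLite raises OperationalError when the table does not exist.
        return {"tbl": table_name, "table_count": None, "note": "table not found"}


# Stand-in data for illustration only.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE orders (id INTEGER)")
connection.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,), (3,)])

results = [get_table_count(connection, name) for name in ["orders", "missing_table"]]
for row in results:
    print(row)
connection.close()
```

A list of dictionaries like results can be turned into a single DataFrame with pd.DataFrame(results), which avoids concatenating many one-row frames.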
Another option: get a list of all the table names that are in the database, then loop over those tables to query each row count.
Here is a SQL statement to list all tables in an Oracle DB:
SQL:
SELECT DISTINCT TABLE_NAME FROM ALL_TAB_COLUMNS ORDER BY TABLE_NAME ASC;
Python (builds the list of tables that you want row counts for and that exist in the DB):
list(set(tables_that_exist_in_DB) & set(list_of_tables_you_want))
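A set intersection loses the original order and says nothing about the tables that were not found. If both matter for the validation report, a small helper can split the wanted list into found and missing names. This is a sketch; split_by_existence and the sample table names are hypothetical.

```python
def split_by_existence(wanted, existing):
    """Split wanted table names into (found, missing), keeping the order of wanted."""
    existing_set = set(existing)  # set lookup keeps the membership test cheap
    found = [name for name in wanted if name in existing_set]
    missing = [name for name in wanted if name not in existing_set]
    return found, missing


# Illustrative names only.
tables_that_exist_in_DB = ["ORDERS", "CUSTOMERS", "PRODUCTS"]
list_of_tables_you_want = ["CUSTOMERS", "INVOICES", "ORDERS"]

found, missing = split_by_existence(list_of_tables_you_want, tables_that_exist_in_DB)
print("query these:", found)
print("report as not found:", missing)
```

The comparison is case-sensitive. Oracle's data dictionary normally stores unquoted table names in uppercase, so it may be necessary to upper-case the wanted names before comparing.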
