I'm trying to import the results of a complex SQL query into a pandas dataframe. My query requires me to create several temporary tables since the final result table I want includes some aggregates.
My code looks like this:
cnxn = pyodbc.connect(r'DRIVER=foo;SERVER=bar;etc')
cursor = cnxn.cursor()
cursor.execute('SQL QUERY HERE')
cursor.execute('SECONDARY SQL QUERY HERE')
...
df = pd.DataFrame(cursor.fetchall(), columns=[desc[0] for desc in cursor.description])
I get an error that tells me shapes aren't matching:
ValueError: Shape of passed values is (1, 900000), indices imply (5, 900000)
And indeed, the result of all the SQL queries should be a table with 5 columns rather than 1. I've run the SQL query in Microsoft SQL Server Management Studio and it works, returning the 5-column table that I want. I've also tried not passing any column names into the dataframe and printing out its head: pandas has put the information from all 5 columns into 1. The value in each row is a list of 5 comma-separated values, but pandas treats the entire list as 1 column. Why is pandas doing this? I've also tried going the pd.read_sql route but I still get the same error.
EDIT:
I have done some more debugging, taking the comments into account. The issue doesn't appear to stem from the fact that my query is nested. I tried a simple (one line) query to return a 3 column table and I still got the same error. Printing out fetchall() looks like this:
[(str1,str2,str3,datetime.date(stuff),datetime.date(stuff)),
(str1,str2,str3,datetime.date(stuff),datetime.date(stuff)),...]
Use pd.DataFrame.from_records instead:
df = pd.DataFrame.from_records(cursor.fetchall(),
columns = [desc[0] for desc in cursor.description])
Simply adjust the pd.DataFrame() call: cursor.fetchall() returns a list of pyodbc Row objects rather than plain tuples, so map each row with tuple() or list() so that its child elements land in their own columns:
df = pd.DataFrame([tuple(row) for row in cursor.fetchall()],
                  columns=[desc[0] for desc in cursor.description])
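If you'd rather keep the pd.read_sql route from the question, a minimal sketch (the setup statements and 'FINAL SELECT HERE' are placeholders; recent pandas versions warn that they prefer SQLAlchemy connections to raw pyodbc ones, though the call still works) is to run the temp-table statements through the cursor first and hand only the final SELECT to pandas:
import pandas as pd
import pyodbc

cnxn = pyodbc.connect(r'DRIVER=foo;SERVER=bar;etc')
cursor = cnxn.cursor()

# Build the temporary tables first...
cursor.execute('SQL QUERY HERE')
cursor.execute('SECONDARY SQL QUERY HERE')

# ...then let pandas run only the final SELECT and assemble the frame itself.
df = pd.read_sql('FINAL SELECT HERE', cnxn)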
I need to query 200+ tables in a database.
Using a spark.sql(f"select ...") statement I get col(0) as the header (because the query returns one specific piece of information about a column that I retrieve) and the result of the calculation for a particular table, like this:
col(0)
1
My goal is to have 1 CSV file with the name of each table and the result of the calculation:
Table name | Count
accounting | 3
sales      | 1
So far the main part of my code is:
list_tables = ['accounting', 'sales', ...]

for table in list_tables:
    df = spark.sql(
        f"""select distinct errors as counts from {database}.{table} where errors is not null""")
    df.repartition(1).write.mode("append").option("header", "true").csv(f"s3:.......")
    rename_part_file(dir, output, newdir)
I'm kinda new to PySpark and all the structures involved.
So far I'm confused, because I've heard that iterating over DataFrames isn't the best idea.
Using the code above I only get 1 CSV with the most recent record, not all the processed tables from my list_tables.
I'm stuck; I don't know if it's possible to pack all of it into 1 DataFrame, or whether I should union the DataFrames.
Both of the options you mentioned lead to the same thing - you have to iterate over the list of tables (you can't read multiple tables at once), read each of them, execute a SQL statement and save the results into a DataFrame, then union all of the DataFrames and save them as a single CSV file. The sample code could look something like this:
from pyspark.sql.functions import lit
from functools import reduce
tables = ["tableA", "tableB", "tableC"]
dfs = []
for table in tables:
    # Run the per-table SQL statement (placeholder) and tag every row with its source table
    dfs.append(spark.sql(f"my sql statement against {table}").withColumn("TableName", lit(table)))
df = reduce(lambda df1, df2: df1.union(df2), dfs) # Union all DFs
df.coalesce(1).write.mode("overwrite").csv("my_csv.csv") # Combine and write as single file
Note: the union operation takes into account only the position of the columns, not their names. I assume for your case that is the desired behaviour, as you are only extracting a single statistic.
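If you ever do need to line the columns up by name rather than by position, a small variation (not part of the original snippet; the allowMissingColumns flag needs Spark 3.1+) is to use unionByName in the reduce step:
from functools import reduce

# unionByName matches columns by name instead of position;
# allowMissingColumns=True fills columns missing on one side with nulls (Spark 3.1+).
df = reduce(lambda df1, df2: df1.unionByName(df2, allowMissingColumns=True), dfs)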
I've been looking around so hopefully someone here can assist:
I'm attempting to use cx_Oracle in python to interface with a database; my task is to insert data from an excel file to an empty (but existing) table.
My Excel file has almost all of the same column names as the columns in the database's table, so I essentially want to check whether the columns share the same name and, if so, insert that column from the Excel file (a pandas DataFrame) into the table in Oracle.
import pandas as pd
import numpy as np
import cx_Oracle
df = pd.read_excel("employee_info.xlsx")
con = None
try:
    con = cx_Oracle.connect(
        config.username,
        config.password,
        config.dsn,
        encoding=config.encoding)
except cx_Oracle.Error as error:
    print(error)
finally:
    cursor = con.cursor()
    rows = [tuple(x) for x in df.values]
    cursor.executemany('''INSERT INTO ODS.EMPLOYEES({x} VALUES {rows})''', rows)
I'm not sure what SQL I should put, or whether there's a way I can use a for-loop to iterate through the columns, but my main issue is how I can add these dynamically for when our dataset grows in columns.
I check the columns that match by using:
sql = "SELECT * FROM ODS.EMPLOYEES"
cursor.execute(sql)
data = cursor.fetchall()
col_names = []
for i in range(0, len(cursor.description)):
    col_names.append(cursor.description[i][0])
a = np.intersect1d(df.columns, col_names)
print("common columns:", a)
That gives me a list of all the common columns. I've renamed the columns in my Excel file to match the columns in the database's table, but my issue is how I can match these in a dynamic/automated way, so I can keep adding to my datasets without worrying about changing the code.
Bonus: I'm also using a SQL CASE statement to create a new column that rolls up a few other columns; it would be helpful to know whether there's a way to add this to the first part of my SQL, or whether it's advisable to do all the manipulation before using an INSERT statement.
Look at https://github.com/oracle/python-oracledb/blob/main/samples/load_csv.py
You would replace the CSV reading bit with parsing your data frame. You need to construct a SQL statement similar to the one used in that example:
sql = "insert into LoadCsvTab (id, name) values (:1, :2)"
For each spreadsheet column that you decide matches a table column, construct the (id, name) bit of the statement and add another id to the bind section (:1, :2).
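Not part of that sample, but a minimal sketch of the dynamic construction, assuming a is the np.intersect1d result from the question (the column names common to the spreadsheet and ODS.EMPLOYEES), could look like this:
matched_cols = list(a)  # common column names; these come from the table's own metadata, not user input

col_list = ", ".join(matched_cols)                                    # e.g. "EMP_ID, NAME"
bind_list = ", ".join(f":{i + 1}" for i in range(len(matched_cols)))  # e.g. ":1, :2"
sql = f"INSERT INTO ODS.EMPLOYEES ({col_list}) VALUES ({bind_list})"

# Only the matched columns, in the same order as the statement
rows = list(df[matched_cols].itertuples(index=False, name=None))
cursor.executemany(sql, rows)
con.commit()

If the spreadsheet later grows a column that also exists in the table, it is picked up automatically; columns that exist only on one side are simply skipped.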
I have a dataset containing multiple tables. I want to check the list of unique columns, i.e. the column lists for all the tables.
I tried the following, which gave me df and then a list of all the table names:
%%bigquery --project ProjectID df
SELECT * EXCEPT(is_typed) FROM tenjin.INFORMATION_SCHEMA.TABLES
# sort the list a-z of all the tables inside tenjin
all_tables = sorted(list(df.table_name))
Now I want to run a loop or SQL query that can give me all the column names:
I tried
for table in all_tables:
    print("bring magic unique columns list here")
    print("columnslist")
There are a few ways, but depending on your needs I think you could skip a few steps by querying <dataset-name>.INFORMATION_SCHEMA.COLUMNS, e.g.
%%bigquery --project ProjectID df
SELECT * FROM tenjin.INFORMATION_SCHEMA.COLUMNS
result = df.groupby("table_name").column_name.apply(list).to_dict()
The to_dict call is optional but may make life easier downstream. You can get your all_tables back as follows, for example:
all_tables = sorted(list(result.keys()))
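And if you still want the loop from your question, a small sketch on top of that dict is simply:
for table in sorted(result):
    print(table, result[table])  # table name followed by its list of column names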
I have a pandas dataframe containing two columns: ID and MY_DATA. I have an SQL database that contains a column named ID and some other data. I want to match the ID column of the SQL database to the ID column of the dataframe and update the database with a new column, MY_DATA.
So far I used the following:
import sqlite3
import pandas as pd
df = pd.read_csv('my_filename.csv')
con = sqlite3.connect('my_database.sqlite')
cur = con.cursor()
for row in cur.execute('SELECT ID FROM main;'):
    for i in range(len(df)):
        if row[0] == df.ID.iloc[i]:
            update_sqldb(df, i)
However, I think this way of having two nested for-loops is probably ugly and not very pythonic. I thought that maybe I should use the map() function, but is this the right direction to go?
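One direction that avoids the nested loops entirely - a minimal sketch, assuming the target table is called main and already has a MY_DATA column (add one with ALTER TABLE first if it doesn't) - is to let sqlite3's executemany run one parameterized UPDATE per dataframe row:
import sqlite3
import pandas as pd

df = pd.read_csv('my_filename.csv')
con = sqlite3.connect('my_database.sqlite')

# Build (MY_DATA, ID) parameter tuples; .item()/int() unwrap NumPy scalars,
# which sqlite3 would otherwise reject.
rows = [(x.item() if hasattr(x, "item") else x, int(i))
        for x, i in zip(df['MY_DATA'], df['ID'])]

# One UPDATE per dataframe row, matching on ID; parameters avoid manual string building.
con.executemany("UPDATE main SET MY_DATA = ? WHERE ID = ?", rows)
con.commit()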
I'm attempting to write the results of a regression back to MySQL, but am having problems iterating through the fitted values and getting the NaNs to write as null values. Originally, I did the iteration this way:
for i in dataframe:
    cur = cnx.cursor()
    query = ("UPDATE Regression_Data.Input SET FITTEDVALUES=" + (dataframe['yhat'].__str__()) + " where timecount=" + (dataframe['timecount'].__str__()) + ";")
    cur.execute(query)
    cnx.commit()
    cur.close()
...which SQL threw back at me:
"mysql.connector.errors.ProgrammingError: 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'NaN'
So, I've been trying to filter out the NaNs by only asking Python to commit when yhat does not equal NaN:
for i in dataframe:
    if cleandf['yhat'] > (-1000):
        cur = cnx.cursor()
        query = ("UPDATE Regression_Data.Input SET FITTEDVALUES=" + (dataframe['yhat'].__str__()) + " where timecount=" + (dataframe['timecount'].__str__()) + ";")
        cur.execute(query)
        cnx.commit()
        cur.close()
But then I get this:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
So, I try to get around it with this in my above syntax:
if cleandf['yhat'][i]>(-1000):
but then get this:
ValueError: Can only tuple-index with a MultiIndex
And then I tried adding iterrows() to both, as in:
for i in dataframe.iterrows():
    if cleandf['yhat'][i] > (-1000):
but I get the same problems as above.
I'm not sure what I'm doing wrong here, but assume it's something with iterating in Pandas DataFrames. But, even if I got the iteration right, I would want to write Nulls into SQL where the NaN appeared.
So, how do you think I should do this?
I don't have a complete answer, but perhaps I have some tips that might help. I believe you are thinking of your dataframe as an object similar to a SQL record set.
for i in dataframe
This will iterate over the column name strings in the dataframe. i will take on column names, not rows.
dataframe['yhat']
This returns an entire column (pandas.Series, which is a numpy.ndarray), not a single value. Therefore:
dataframe['yhat'].__str__()
will give a string representation of an entire column that is useful for humans to read. It is certainly not a single value that can be converted to string for your query.
if cleandf['yhat']>(-1000)
This gives an error, because again, cleandf['yhat'] is an entire array of values, not just a single value. Think of it as an entire column, not the value from a single row.
if cleandf['yhat'][i]>(-1000):
This is getting closer, but you really want i to be an integer here, not another column name.
for i in dataframe.iterrows():
    if cleandf['yhat'][i]>(-1000):
Using iterrows seems like the right thing for you. However, iterrows yields (index, row) pairs, so i takes on those pairs, not an integer that can index into a column (cleandf['yhat'] is a full column).
Also, note that pandas has better ways to check for missing values than relying on a huge negative number. Try something like this:
non_missing_index = pandas.notnull(dataframe['yhat'])  # True where yhat is present
cleandf = dataframe[non_missing_index]
for row in cleandf.iterrows():
    row_index, row_values = row
    query = ("UPDATE Regression_Data.Input SET FITTEDVALUES=" + str(row_values['yhat']) + " where timecount=" + str(row_values['timecount']) + ";")
    execute_my_query(query)
You can implement execute_my_query better than I can, I expect. However, this solution is not quite what you want. You really want to iterate over all rows and do two types of updates: one for real values and one for NULLs. Try this:
for row in dataframe.iterrows():
    row_index, row_values = row
    if pandas.isnull(row_values['yhat']):
        pass  # populate the NULL update query here
    else:
        query = ("UPDATE Regression_Data.Input SET FITTEDVALUES=" + str(row_values['yhat']) + " where timecount=" + str(row_values['timecount']) + ";")
        execute_my_query(query)
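A possible follow-up, just a sketch assuming cnx is the existing mysql.connector connection: the connector accepts parameterized queries with %s placeholders, and a parameter of None is written as NULL, which handles both the string formatting and the NaN rows in one place.
cur = cnx.cursor()
for row_index, row_values in dataframe.iterrows():
    # NaN becomes None, which the connector writes as NULL
    yhat = None if pandas.isnull(row_values['yhat']) else float(row_values['yhat'])
    # unwrap NumPy scalars, which the connector may not convert on its own
    timecount = row_values['timecount']
    timecount = timecount.item() if hasattr(timecount, "item") else timecount
    cur.execute(
        "UPDATE Regression_Data.Input SET FITTEDVALUES = %s WHERE timecount = %s",
        (yhat, timecount),
    )
cnx.commit()
cur.close()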
Hope it helps.