Use pandas series as input parameter - python

I'm an R user trying to pick up Python. In R, I often used vectors to pass arguments into a SQL query. For example,
ID <- c(1,2,3,4,5)
df <- dbGetQuery(con, paste0("select * from table where ID in (", paste(ID, collapse = ","), ")"))
How can I achieve this in Python? I have a dataframe and would like to use one of its columns as the parameters. So with a dataframe like this,
import pandas as pd

data = {'ID': [1,2,3,4,5],
        'Value': [10,20,30,40,50]}
df = pd.DataFrame(data)
[Edit]
So basically I need a string that reads "select * from table where ID in (1,2,3,4,5)", except that instead of manually typing "1,2,3,4,5" I want to use parameters.

The OP is looking for something like
query = f"select * from table where ID in ({','.join(df['ID'].astype(str))})"
For more ways to build this query from a list, also check the post linked by @Erfan in a comment.
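For completeness, here is a minimal runnable sketch of both routes, assuming an in-memory sqlite3 database and a table name my_table as stand-ins for the real connection and table (the placeholder style varies by driver):
import sqlite3
import pandas as pd

data = {'ID': [1, 2, 3, 4, 5], 'Value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

con = sqlite3.connect(":memory:")         # stand-in for the real connection
df.to_sql("my_table", con, index=False)   # stand-in for the real table

# 1) Build the IN (...) list by string interpolation, as in the answer above
query = f"select * from my_table where ID in ({','.join(df['ID'].astype(str))})"
print(pd.read_sql_query(query, con))

# 2) Safer: generate one placeholder per value and pass the IDs as parameters
placeholders = ','.join('?' * len(df))
query = f"select * from my_table where ID in ({placeholders})"
print(pd.read_sql_query(query, con, params=df['ID'].tolist()))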

Related

Iterate through database with PySpark DataFrame

I need to query 200+ tables in a database.
Using a spark.sql(f"select ...") statement, I get col(0) (because the query returns a specific piece of information about the column I retrieve) and the result of the calculation for a particular table, like this:
col(0)
1
My goal is to have one CSV file with the table name and the result of the calculation:
Table name    Count
accounting    3
sales         1
So far, the main part of my code is:
list_tables = ['accounting', 'sales',...]

for table in list_tables:
    df = spark.sql(
        f""" select distinct errors as counts from {database}.{table} where errors is not null""")
    df.repartition(1).write.mode("append").option("header","true").csv(f"s3:.......")
    rename_part_file(dir, output, newdir)
I'm kind of new to PySpark and all the structures involved.
So far I'm confused, because I've heard that iterating over DataFrames isn't the best idea.
Using the code above, I get only one CSV containing the most recent record, not all of the processed tables from my list_tables.
I'm stuck: I don't know whether it is possible to pack all of it into one DataFrame, or whether I should union the DataFrames.
Both of the options you mentioned lead to the same thing: you have to iterate over a list of tables (you can't read multiple tables at once), read each of them, execute a SQL statement and save the results into a DataFrame, then union all of the DataFrames and save them as a single CSV file. The sample code could look something like this:
from functools import reduce
from pyspark.sql.functions import lit

tables = ["tableA", "tableB", "tableC"]
dfs = []
for table in tables:
    # Run the per-table SQL and tag each result with the table it came from
    dfs.append(spark.sql(f"my sql statement from {table}").withColumn("TableName", lit(table)))

df = reduce(lambda df1, df2: df1.union(df2), dfs)  # Union all DFs
df.coalesce(1).write.mode("overwrite").csv("my_csv.csv")  # Combine and write as a single file
Note: the union operation takes into account only the position of each column, not its name. I assume that is the desired behaviour in your case, as you are only extracting a single statistic.
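If you ever need the columns matched by name instead of position, a minimal variant of the reduce step above (assuming Spark 3.1+ and the same dfs list) would be:
from functools import reduce

# Align columns by name; allowMissingColumns fills absent columns with nulls
df = reduce(lambda df1, df2: df1.unionByName(df2, allowMissingColumns=True), dfs)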

How to extract a list of all column names from BigQuery tables?

I have a dataset containing multiple tables. I want to check the unique column list, i.e. the column lists for all tables.
I tried the following, which gave me df and then a list of all table names:
%%bigquery --project ProjectID df
SELECT * EXCEPT(is_typed) FROM tenjin.INFORMATION_SCHEMA.TABLES
#sort list a-z of all the tables inside tenjin
all_tables = sorted(list(df.table_name))
Now I want to run a loop or SQL query that can give me all the column names. I tried:
for table in all_tables:
    print("bring magic unique columns list here")
    print("columnslist")
There are a few ways, but depending on your needs I think you could skip a few steps by querying <dataset-name>.INFORMATION_SCHEMA.COLUMNS, e.g.
%%bigquery --project ProjectID df
SELECT * FROM tenjin.INFORMATION_SCHEMA.COLUMNS
result = df.groupby("table_name").column_name.apply(list).to_dict()
The to_dict call is optional but may make life easier downstream. You can get your all_tables back as follows, for example:
all_tables = sorted(list(result.keys()))
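To mirror the loop in the question, a short follow-up sketch (using the result dict built above) that prints every table's column list:
# result maps table_name -> list of column names
for table in sorted(result):
    print(table, result[table])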

Is there a way to select all columns from a table except one column?

I have been using sqlite3 with Python for creating databases. So far I have been successful, but unfortunately I can't find a way around this: I have a table with 63 columns but I want to select only 62 of them. I am sure I could write the names of the columns in the SELECT statement, but writing out 62 of them seems like a non-logical idea to a programmer like me. I am using Python's sqlite3 databases. Is there a way around this?
I'm sorry if I've made any grammatical mistakes.
Thanks in advance
With SQLite, you can:
- do a PRAGMA table_info(tablename); query to get a result set that describes that table's columns
- pluck the column names out of that result set and remove the one you don't want
- compose a column list for the SELECT statement using e.g. ', '.join(column_names) (though you might want to consider a higher-level SQL statement builder instead of playing with strings).
Example
A simple example using a simple table and an in-memory SQLite database:
import sqlite3
con = sqlite3.connect(":memory:")
con.executescript("CREATE TABLE kittens (id INTEGER, name TEXT, color TEXT, furriness INTEGER, age INTEGER)")
columns = [row[1] for row in con.execute("PRAGMA table_info(kittens)")]
print(columns)
selected_columns = [column for column in columns if column != 'age']
print(selected_columns)
query = f"SELECT {', '.join(selected_columns)} FROM kittens"
print(query)
This prints out
['id', 'name', 'color', 'furriness', 'age']
['id', 'name', 'color', 'furriness']
SELECT id, name, color, furriness FROM kittens
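As a quick follow-up, the composed query can be executed directly on the same connection (continuing the sketch above; the table is empty here, so this only demonstrates the call):
rows = con.execute(query).fetchall()
print(rows)  # [] in this example, since no kittens were inserted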

Importing SQL query into Pandas results in only 1 column

I'm trying to import the results of a complex SQL query into a pandas dataframe. My query requires me to create several temporary tables since the final result table I want includes some aggregates.
My code looks like this:
import pyodbc
import pandas as pd

cnxn = pyodbc.connect(r'DRIVER=foo;SERVER=bar;etc')
cursor = cnxn.cursor()
cursor.execute('SQL QUERY HERE')
cursor.execute('SECONDARY SQL QUERY HERE')
...
df = pd.DataFrame(cursor.fetchall(), columns=[desc[0] for desc in cursor.description])
I get an error that tells me shapes aren't matching:
ValueError: Shape of passed values is (1,900000),indices imply (5,900000)
And indeed, the result of all the SQL queries should be a table with 5 columns rather than 1. I've run the SQL query in Microsoft SQL Server Management Studio and it works, returning the 5-column table that I want. I've also tried not passing any column names into the dataframe and printing out its head, and found that pandas has put the information from all 5 columns into 1: each row is a list of 5 values separated by commas, but pandas treats the entire list as 1 column. Why is pandas doing this? I've also tried going the pd.read_sql route, but I still get the same error.
EDIT:
I have done some more debugging, taking the comments into account. The issue doesn't appear to stem from the fact that my query is nested. I tried a simple (one line) query to return a 3 column table and I still got the same error. Printing out fetchall() looks like this:
[(str1,str2,str3,datetime.date(stuff),datetime.date(stuff)),
(str1,str2,str3,datetime.date(stuff),datetime.date(stuff)),...]
Use pd.DataFrame.from_records instead:
df = pd.DataFrame.from_records(cursor.fetchall(),
                               columns=[desc[0] for desc in cursor.description])
Simply adjust the pd.DataFrame() call, as right now cursor.fetchall() returns a list of row objects rather than plain tuples. Use tuple() or list() to map the child elements into their own columns:
df = pd.DataFrame([tuple(row) for row in cursor.fetchall()],
                  columns=[desc[0] for desc in cursor.description])
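A minimal end-to-end sketch of the from_records approach, using sqlite3 in place of pyodbc so it runs standalone (the pattern is identical for a pyodbc cursor):
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a TEXT, b TEXT, c INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?, ?)", [("x", "y", 1), ("p", "q", 2)])

cursor = con.execute("SELECT * FROM t")
df = pd.DataFrame.from_records(cursor.fetchall(),
                               columns=[desc[0] for desc in cursor.description])
print(df)  # two rows, three named columns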

Using Pivot in SQL Query for Long to Wide Formatting

I am using sqlalchemy in Python 3 to query a sqlite database. My initial query is:
df = pd.read_sql_query('SELECT member, sku, quantity FROM table GROUP BY member', csvdatabase)
This works and gives me three columns; however, the problem is that I want to convert the result from long to wide format. I tried using the pivot function, but my machine does not have the memory to handle it:
df_new = df.pivot(index='member', columns='sku', values='quantity')
Is there a way to make the SQL query do this? I came across several examples that use the pivot function or MAX(CASE WHEN key ...), but the number of SKUs is too large to type into a SQL query.
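Not from the thread, but one common workaround for the "too many SKUs to type" problem is to fetch the distinct SKUs first and generate the MAX(CASE WHEN ...) expressions programmatically. A sketch, assuming a SQLAlchemy engine named engine and a table named my_table with the member/sku/quantity columns used above:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///mydb.sqlite")  # hypothetical database path

# One aggregate expression per SKU (note: assumes SKU values are safe to embed in SQL)
skus = pd.read_sql_query("SELECT DISTINCT sku FROM my_table", engine)["sku"]
cases = ", ".join(
    f"MAX(CASE WHEN sku = '{s}' THEN quantity END) AS \"{s}\"" for s in skus
)
query = f"SELECT member, {cases} FROM my_table GROUP BY member"

df_wide = pd.read_sql_query(query, engine)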
