I am using sqlalchemy in Python 3 to query a sqlite database. My initial query is:
df = pd.read_sql_query('SELECT member, sku, quantity FROM table GROUP BY member, sku', csvdatabase)
This works and gives me three columns; however, the problem is that I want to convert the result from long to wide format. I tried using the pivot function, but my machine does not have the memory to handle this:
df_new = df.pivot(index = 'member', columns = 'sku', values = 'quantity')
Is there a way to make the SQL query do this? I came across several examples that use a PIVOT function or a MAX(CASE WHEN ...) pattern, but the number of SKUs is too large to type into a SQL query by hand.
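One way to avoid typing every SKU by hand is to pull the distinct sku values first and build the MAX(CASE WHEN ...) query programmatically. A minimal sketch (not a tested answer), assuming con is the open connection and the table/column names from the question:
import pandas as pd

# Build one CASE expression per distinct sku value (assumes sku values contain no quotes)
skus = pd.read_sql_query('SELECT DISTINCT sku FROM table', con)['sku']
cases = ', '.join(
    f"MAX(CASE WHEN sku = '{s}' THEN quantity END) AS \"{s}\"" for s in skus
)
query = f'SELECT member, {cases} FROM table GROUP BY member'
df_wide = pd.read_sql_query(query, con)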
Related
I need to query 200+ tables in a database.
By using a spark.sql(f"select ...") statement, I get col(0) (because the query's result gives me specific information about the column I've retrieved) and the result of the calculation for a particular table, like this:
col(0)
1
My goal is to have one CSV file with the name of the table and the result of the calculation:
Table name | Count
accounting | 3
sales      | 1
So far, the main part of my code is:
list_tables = ['accounting', 'sales', ...]

for table in list_tables:
    df = spark.sql(
        f"""select distinct errors as counts from {database}.{table} where errors is not null""")
    df.repartition(1).write.mode("append").option("header", "true").csv(f"s3:.......")
    rename_part_file(dir, output, newdir)
I'm kind of new to PySpark and all the structures involved.
So far I'm confused, because I've heard that iterating to build DataFrames isn't the best idea.
With the code above I get only one CSV containing the most recent record, not all the processed tables from my list_tables.
I'm stuck; I don't know whether it's possible to pack all of it into one DataFrame, or whether I should union the DataFrames.
Both of the options you mentioned lead to the same thing - you have to iterate over the list of tables (you can't read multiple tables at once), read each of them, execute a SQL statement and save the results into a DataFrame, then union all the DataFrames and save them as a single CSV file. The sample code could look something like this:
from pyspark.sql.functions import lit
from functools import reduce

tables = ["tableA", "tableB", "tableC"]
dfs = []

for table in tables:
    # Run your SQL statement for this table and append the DF with the query results
    dfs.append(spark.sql(f"my sql statement for {table}").withColumn("TableName", lit(table)))

df = reduce(lambda df1, df2: df1.union(df2), dfs)  # Union all DFs
df.coalesce(1).write.mode("overwrite").csv("my_csv.csv")  # Combine and write as a single file
Note: the union operation takes into account only the position of the column, and not its name. I assume that is the desired behaviour in your case, as you are only extracting a single statistic.
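If the column order ever differed between the per-table DataFrames, the union could instead be done by name; a one-line sketch using the same dfs list (unionByName is available since Spark 2.3):
df = reduce(lambda df1, df2: df1.unionByName(df2), dfs)  # match columns by name, not position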
I'm an R user trying to pick up Python. In R, I often used vectors to pass arguments to a SQL query. For example,
ID <- c(1, 2, 3, 4, 5)
df <- dbGetQuery(con, paste("select * from table where ID in (", paste(ID, collapse = ","), ")"))
How can I achieve this in Python? I have a dataframe and would like to use one of its columns as the parameters. So with a dataframe like this,
data = {'ID': [1, 2, 3, 4, 5],
        'Value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
[Edit]
So basically I need a string that would read "Select * from table where ID in (1,2,3,4,5)" except that, instead of manually typing "1,2,3,4,5", I want to use parameters.
The OP is looking for something like:
query = f"select * from table where ID in ({','.join(df['ID'].astype(str))})"
For more ways to build this query from a list, you can also check the post linked by @Erfan in a comment.
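If real bind parameters are preferred over string formatting (for example, when the values are not trusted), here is a sketch assuming a DB-API connection such as sqlite3 with qmark placeholders and the placeholder table name from the question:
ids = df['ID'].tolist()
placeholders = ', '.join('?' for _ in ids)           # '?, ?, ?, ?, ?'
query = f"select * from table where ID in ({placeholders})"
result = pd.read_sql_query(query, con, params=ids)   # values are bound, not pasted into the string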
I have been using sqlite3 with Python for creating databases. So far I have been successful, but now I can't find a way out of this: I have a table with 63 columns, but I want to select only 62 of them. I know I can write the names of the columns in the SELECT statement, but writing out 62 of them seems like an illogical idea for a programmer like me. I am using Python's sqlite3 databases. Is there a way out of this?
Sorry for any grammatical mistakes.
Thanks in advance
With SQLite, you can:
do a PRAGMA table_info(tablename); query to get a result set that describes that table's columns
pluck the column names out of that result set and remove the one you don't want
compose a column list for the select statement using e.g. ', '.join(column_names) (though you might want to consider a higher-level SQL statement builder instead of playing with strings).
Example
A simple example using a simple table and an in-memory SQLite database:
import sqlite3
con = sqlite3.connect(":memory:")
con.executescript("CREATE TABLE kittens (id INTEGER, name TEXT, color TEXT, furriness INTEGER, age INTEGER)")
columns = [row[1] for row in con.execute("PRAGMA table_info(kittens)")]
print(columns)
selected_columns = [column for column in columns if column != 'age']
print(selected_columns)
query = f"SELECT {', '.join(selected_columns)} FROM kittens"
print(query)
This prints out
['id', 'name', 'color', 'furriness', 'age']
['id', 'name', 'color', 'furriness']
SELECT id, name, color, furriness FROM kittens
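To actually run the composed query, it can be passed straight back to the connection; a short follow-up using the con and query from the example above:
rows = con.execute(query).fetchall()
print(rows)  # [] here, since the example table has no rows inserted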
I created two tables in a MySQL database using Python. The following SQL code performs a join on the two tables. How can I do the same by writing equivalent Python code?
MySQL code:
SELECT A.num, B.co_name, A.rep_name
FROM A
JOIN B
ON A.num=B.no
Desired Python code:
sql = "XXX"
df_merged = pd.read_sql(sql, con=cnx)
I managed to resolve this by enclosing my query in triple quotes:
sql = '''SELECT A.num, B.co_name, A.rep_name
FROM A
LEFT JOIN B ON A.num=B.no '''
df_merged = pd.read_sql(sql, con=cnx)
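If both tables are also needed on the Python side, the same join can be expressed with pandas.merge. A sketch, where df_a and df_b are hypothetical names for the two tables loaded via read_sql (not part of the original answer):
df_a = pd.read_sql("SELECT * FROM A", con=cnx)
df_b = pd.read_sql("SELECT * FROM B", con=cnx)

# Equivalent of the LEFT JOIN above: match A.num to B.no
df_merged = df_a.merge(df_b[["no", "co_name"]], how="left", left_on="num", right_on="no")
df_merged = df_merged[["num", "co_name", "rep_name"]]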
One approach you can take is to process each element one by one and insert it into a new table/data frame.
zip(*map(lambda x: pd.read_sql_query(SQL.format(x), connection).loc[0],
         df.yourDataFrame))
This will generate a key-value pair, with the SQL table as the key and the pandas df as the value. You can then add these values wherever you like (a df, or a SQL table).
Hope this helped :)
I'm trying to import the results of a complex SQL query into a pandas dataframe. My query requires me to create several temporary tables since the final result table I want includes some aggregates.
My code looks like this:
cnxn = pyodbc.connect(r'DRIVER=foo;SERVER=bar;etc')
cursor = cnxn.cursor()
cursor.execute('SQL QUERY HERE')
cursor.execute('SECONDARY SQL QUERY HERE')
...
df = pd.DataFrame(cursor.fetchall(), columns=[desc[0] for desc in cursor.description])
I get an error that tells me shapes aren't matching:
ValueError: Shape of passed values is (1, 900000), indices imply (5, 900000)
And indeed, the result of all the SQL queries should be a table with 5 columns rather than 1. I've run the SQL query using Microsoft SQL Server Management Studio and it works, returning the 5-column table that I want. I've also tried not passing any column names into the dataframe, printed out its head, and found that pandas has put all the information from the 5 columns into 1: the value in each row is a list of 5 values separated by commas, but pandas treats the entire list as 1 column. Why is pandas doing this? I've also tried going the pd.read_sql route but I still get the same error.
EDIT:
I have done some more debugging, taking the comments into account. The issue doesn't appear to stem from the fact that my query is nested. I tried a simple (one line) query to return a 3 column table and I still got the same error. Printing out fetchall() looks like this:
[(str1,str2,str3,datetime.date(stuff),datetime.date(stuff)),
(str1,str2,str3,datetime.date(stuff),datetime.date(stuff)),...]
Use pd.DataFrame.from_records instead:
df = pd.DataFrame.from_records(cursor.fetchall(),
                               columns=[desc[0] for desc in cursor.description])
Simply adjust the pd.DataFrame() call, since cursor.fetchall() returns a list of pyodbc Row objects rather than plain tuples. Use tuple() or list() to map the child elements into their own columns:
df = pd.DataFrame([tuple(row) for row in cursor.fetchall()],
                  columns=[desc[0] for desc in cursor.description])