I created two tables in a MySQL database using Python. The following SQL code performs a join on the two tables. How can I do the same with equivalent Python code?
MySQL code:
SELECT A.num, B.co_name, A.rep_name
FROM A
JOIN B
ON A.num=B.no
Desired Python codes:
sql = "XXX"
df_merged = pd.read_sql(sql, con=cnx)
I managed to resolve this by enclosing my query in triple quotes:
sql = '''SELECT A.num, B.co_name, A.rep_name
FROM A
LEFT JOIN B ON A.num=B.no '''
df_merged = pd.read_sql(sql, con=cnx)
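For reference, pandas.read_sql accepts either a DB-API connection or an SQLAlchemy connectable as con. A minimal sketch of how cnx could be created, assuming the pymysql driver and placeholder credentials (none of this comes from the question):
import pandas as pd
from sqlalchemy import create_engine

# Placeholder user/password/host/database names - replace with your own
cnx = create_engine("mysql+pymysql://user:password@localhost/mydatabase")

sql = '''SELECT A.num, B.co_name, A.rep_name
         FROM A
         JOIN B ON A.num = B.no'''
df_merged = pd.read_sql(sql, con=cnx)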
One approach you can take is to process each element one by one and insert it into a new table/DataFrame:
# Run the formatted query once per value in the column and pair up the results
zip(*map(lambda x: pd.read_sql_query(SQL.format(x), connection).loc[0],
         df.yourDataFrame))
This will generate key-value pairs, with the SQL table as the key and the pandas DataFrame as the value. You can then add these values wherever you like (a DataFrame or an SQL table).
Hope this helped :)
I need to query 200+ tables in a database.
Using a spark.sql(f"select ...") statement I get col(0) (because the result of the query gives me specific information about the column I've retrieved) and the result of the calculation for a particular table, like this:
col(0)
------
1
My goal is to have one CSV file with the name of the table and the result of the calculation:
Table name    Count
accounting    3
sales         1
So far, the main part of my code is:
list_tables = ['accounting', 'sales', ...]

for table in list_tables:
    df = spark.sql(
        f"""select distinct errors as counts from {database}.{table} where errors is not null"""
    )
    df.repartition(1).write.mode("append").option("header", "true").csv(f"s3:.......")
    rename_part_file(dir, output, newdir)
I'm kinda new to PySpark and all the structures involved.
So far I'm confused, because I've heard that iterating over DataFrames isn't the best idea.
Using the code above I get only one CSV with the most recent record, not all the processed tables from my list_tables.
I'm stuck; I don't know whether it's possible to pack all of it into one DataFrame, or whether I should union the DataFrames.
I'm stuck; I don't know whether it's possible to pack all of it into one DataFrame, or whether I should union the DataFrames.
Both of the options you mentioned lead to the same thing: you have to iterate over the list of tables (you can't read multiple tables at once), read each of them, execute the SQL statement and save the results into a DataFrame, then union all of the DataFrames and write them out as a single CSV file. The sample code could look something like this:
from functools import reduce
from pyspark.sql.functions import lit

tables = ["tableA", "tableB", "tableC"]

dfs = []
for table in tables:
    # Run your per-table SQL statement and tag the result with its source table
    table_df = spark.sql(f"SELECT ... FROM {table}")  # your SQL statement here
    dfs.append(table_df.withColumn("TableName", lit(table)))

df = reduce(lambda df1, df2: df1.union(df2), dfs)  # Union all DFs
df.coalesce(1).write.mode("overwrite").csv("my_csv.csv")  # Combine and write as a single file
Note: the union operation takes into account only the position of each column, not its name. I assume that is the desired behaviour for your case, as you are only extracting a single statistic.
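Adapted to the query from the question, the loop could look roughly like this (a sketch only: database and list_tables are assumed to be defined as in the question, and the S3 output path below is just a placeholder):
from functools import reduce
from pyspark.sql.functions import lit

dfs = []
for table in list_tables:
    # The per-table query from the question, tagged with its table name
    counts_df = spark.sql(
        f"select distinct errors as counts from {database}.{table} where errors is not null"
    )
    dfs.append(counts_df.withColumn("TableName", lit(table)))

result = reduce(lambda a, b: a.union(b), dfs)
result.coalesce(1).write.mode("overwrite").option("header", "true").csv("s3://your-bucket/output/")  # placeholder path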
I have a delta table in Databricks that I am loading into a dataframe to perform a specific numpy operation:
import pyspark.pandas as ps
df = spark.sql("SELECT Id,Field_to_Transform FROM bronze.Table_A where Field_to_Transform is not null").toPandas()
The operation I perform is to remove special characters:
df['Transformed_Field'] = [''.join(e for e in x if e.isalnum()) for x in df['Field_to_Transform']]
df = df.drop(['Field_to_Transform'], axis=1)
So this leaves me with the dataframe "df" which has just the id and the Transformed_Field in it:
Id      Transformed_Field
00A1    12345
00A2    123456
00A3    1234
00A4    1234568
Now I want to left join df back to bronze.Table_A in Databricks by simply joining on the Id field.
What is the most efficient way to join df back to bronze.Table_A?
Things I have tried so far:
Saved the entire bronze.Table_A in a new dataframe df2, used df.merge to put them together, and then created a brand new table. This worked but it was way too excessive and I do not need a new table, just the transformed column joined back.
Tried to use spark.sql to perform the left join in a SQL query but it seems I cannot access a df inside a spark.sql query.
Any help is much appreciated, thank you.
Option 1 - DataFrame API way
The first option is a modification of your first bullet point:
Saved the entire bronze.Table_A in a new dataframe df2, used df.merge to put them together, and then created a brand new table. This worked but it was way too excessive and I do not need a new table, just the transformed column joined back.
The merge operation is a pandas method, not PySpark, so the performance might not be optimal - docs here. Loading the whole table into a DataFrame is the correct approach; it just needs the built-in join method to perform the required left join, like this:
table_df = spark.read.table("bronze.Table_A")
# Join
merged_df = table_df.join(df, on="Id", how="left")
Option 2 - SQL way
The second option builds on your second bullet point:
Tried to use spark.sql to perform the left join in a SQL query but it seems I cannot access a df inside a spark.sql query.
You can temporarily register the DataFrame as a view and then query it using plain SQL, whether via the spark.sql method or any other way. Try doing this:
df.createOrReplaceTempView("transformed_df")
# Then join like this
spark.sql("SELECT * FROM bronze.Table_A ta LEFT JOIN transformed_df tdf ON ta.Id = tdf.Id")
I have a pandas dataframe containing two columns: ID and MY_DATA. I have an SQL database that contains a column named ID and some other data. I want to match the ID column of the SQL database against the dataframe's ID column and update the database with the new column MY_DATA.
So far I used the following:
import sqlite3
import pandas as pd

df = pd.read_csv('my_filename.csv')
con = sqlite3.connect('my_database.sqlite')
cur = con.cursor()

for row in cur.execute('SELECT ID FROM main;'):
    for i in range(len(df)):
        if row[0] == df.ID.iloc[i]:  # the query returns a single column, so the ID is row[0]
            update_sqldb(df, i)
However, I think this way of having two nested for-loops is probably ugly and not very pythonic. I thought that maybe I should use the map() function, but is this the right direction to go?
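One way to avoid the nested loops is to let SQLite do the matching itself with a single parameterised UPDATE per row via executemany. A rough sketch, assuming the target table is named main, the new column is called MY_DATA, and that this is what update_sqldb was meant to do (all of these are assumptions, not from the question):
import sqlite3
import pandas as pd

df = pd.read_csv('my_filename.csv')
con = sqlite3.connect('my_database.sqlite')
cur = con.cursor()

# Add the new column once (skip this if it already exists)
cur.execute('ALTER TABLE main ADD COLUMN MY_DATA;')

# Update every matching ID in one pass instead of nested Python loops
cur.executemany(
    'UPDATE main SET MY_DATA = ? WHERE ID = ?;',
    df[['MY_DATA', 'ID']].itertuples(index=False, name=None),
)
con.commit()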
I'm trying to import the results of a complex SQL query into a pandas dataframe. My query requires me to create several temporary tables since the final result table I want includes some aggregates.
My code looks like this:
cnxn = pyodbc.connect(r'DRIVER=foo;SERVER=bar;etc')
cursor = cnxn.cursor()
cursor.execute('SQL QUERY HERE')
cursor.execute('SECONDARY SQL QUERY HERE')
...
df = pd.DataFrame(cursor.fetchall(),columns = [desc[0] for desc in cursor.description])
I get an error that tells me shapes aren't matching:
ValueError: Shape of passed values is (1,900000),indices imply (5,900000)
And indeed, the result of all the SQL queries should be a table with 5 columns rather than 1. I've run the SQL query using Microsoft SQL Server Management Studio, and it works and returns the 5-column table that I want. I tried not passing any column names into the dataframe, printed out its head, and found that pandas has put all the information from the 5 columns into 1: the value in each row is a list of 5 values separated by commas, but pandas treats the entire list as 1 column. Why is pandas doing this? I've also tried going the pd.read_sql route but I still get the same error.
EDIT:
I have done some more debugging, taking the comments into account. The issue doesn't appear to stem from the fact that my query is nested. I tried a simple (one line) query to return a 3 column table and I still got the same error. Printing out fetchall() looks like this:
[(str1,str2,str3,datetime.date(stuff),datetime.date(stuff)),
(str1,str2,str3,datetime.date(stuff),datetime.date(stuff)),...]
Use pd.DataFrame.from_records instead:
df = pd.DataFrame.from_records(cursor.fetchall(),
                               columns=[desc[0] for desc in cursor.description])
Simply adjust the pd.DataFrame() call: right now cursor.fetchall() returns a list of pyodbc Row objects, which pandas treats as single values. Use tuple() or list() to map each row's elements into their own columns:
df = pd.DataFrame([tuple(row) for row in cursor.fetchall()],
                  columns=[desc[0] for desc in cursor.description])
I am using sqlalchemy in Python 3 to query a sqlite database. My initial query is:
df = pd.read_sql_query('SELECT member, sku, quantity FROM table GROUP BY member', csvdatabase)
This works and gives me three columns; however, the problem is that I want to convert this from long to wide format. I tried using the pivot function, but my machine does not have the memory to handle this:
df_new = df.pivot(index = 'member', columns = 'sku', values = 'quantity')
Is there a way to make the SQL query do this? I came across several examples where a pivot function or MAX(CASE WHEN Key ...) is used, but the number of skus is too large to type into a SQL query.
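Since the blocker is only that there are too many skus to type by hand, one option is to generate the MAX(CASE WHEN ...) pivot query programmatically from the distinct sku values. A rough sketch, assuming the same table/column names as above and a connection object conn (both assumptions):
import pandas as pd

# Fetch the distinct skus once, then build one CASE expression per sku
skus = pd.read_sql_query('SELECT DISTINCT sku FROM table', conn)['sku']

# Note: sku values are interpolated directly into SQL, so only do this with trusted data
cases = ',\n  '.join(
    f"MAX(CASE WHEN sku = '{sku}' THEN quantity END) AS \"{sku}\""
    for sku in skus
)

pivot_sql = f"SELECT member,\n  {cases}\nFROM table\nGROUP BY member"
df_wide = pd.read_sql_query(pivot_sql, conn)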