How to Left Join a Dataframe into an Existing Table in Databricks - python

I have a delta table in Databricks that I am loading into a dataframe to perform a specific numpy operation:
import pyspark.pandas as ps
df = spark.sql("SELECT Id,Field_to_Transform FROM bronze.Table_A where Field_to_Transform is not null").toPandas()
The operation I perform is to remove special characters:
df['Transformed_Field'] = [''.join(e for e in x if e.isalnum()) for x in df['Field_to_Transform']]
df = df.drop(['Field_to_Transform'], axis=1)
So this leaves me with the dataframe "df" which has just the id and the Transformed_Field in it:
Id      Transformed_Field
00A1    12345
00A2    123456
00A3    1234
00A4    1234568
Now I want to left join the df back to bronze.Table_A in Databricks by simply joining back on the Id field.
What is the most efficient way to join df back to bronze.Table_A?
Things I have tried so far:
Saved the entire bronze.Table_A in a new dataframe df2, used df.merge to put them together, and then created a brand new table. This worked but it was way too excessive and I do not need a new table, just the transformed column joined back.
Tried to use spark.sql to perform the left join in a SQL query but it seems I cannot access a df inside a spark.sql query.
Any help is much appreciated, thank you.

Option 1 - DataFrame API way
The first option is a modification of your first bullet point:
Saved the entire bronze.Table_A in a new dataframe df2, used df.merge to put them together, and then created a brand new table. This worked but it was way too excessive and I do not need a new table, just the transformed column joined back.
The merge operation is a Pandas method, not a PySpark one, so the performance might not be optimal - docs here. Loading the whole table into a DataFrame is the correct approach; since your df is a Pandas DataFrame (it came from toPandas()), convert it back to a Spark DataFrame and use the built-in join method for the required left join, like this:
table_df = spark.read.table("bronze.Table_A")
df_spark = spark.createDataFrame(df)  # convert the Pandas DataFrame back to a Spark DataFrame
# Join
merged_df = table_df.join(df_spark, on="Id", how="left")
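If you want to sanity-check the result before using it downstream, a quick usage sketch (column names taken from the question) could be:
# Inspect a few joined rows; Field_to_Transform comes from bronze.Table_A, Transformed_Field from df
merged_df.select("Id", "Field_to_Transform", "Transformed_Field").show(5)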
Option 2 - SQL way
The second option builds on your second bullet point:
Tried to use spark.sql to perform the left join in a SQL query but it seems I cannot access a df inside a spark.sql query.
You can temporarily register a DataFrame as a view and then query it using plain SQL, whether via the spark.sql method or any other way. Again, because df is a Pandas DataFrame here, convert it to a Spark DataFrame first. Try doing this:
df_spark = spark.createDataFrame(df)  # a Pandas DataFrame cannot be registered as a view directly
df_spark.createOrReplaceTempView("transformed_df")
# Then join like this
spark.sql("SELECT * FROM bronze.Table_A ta LEFT JOIN transformed_df tdf ON ta.Id = tdf.Id")

Related

Iterate through database with PySpark DataFrame

I need to query 200+ tables in a database.
By using a spark.sql(f"select ...") statement I get col(0) (because the result of the query gives me specific information about the column I've retrieved) and the result of the calculation for a particular table, like this:
col(0)
1
My goal is to have one CSV file with the name of the table and the result of the calculation:
Table name    Count
accounting    3
sales         1
So far, the main part of my code is:
list_tables = ['accounting', 'sales',...]
for table in list_tables:
    df = spark.sql(
        f""" select distinct errors as counts from {database}.{table} where errors is not null""")
    df.repartition(1).write.mode("append").option("header","true").csv(f"s3:.......")
    rename_part_file(dir,output,newdir)
I'm kinda new to PySpark and all the structures involved.
So far I'm confused, because I've heard that iterating over DataFrames isn't the best idea.
Using the code above I get only one CSV with the most recent record, not one for every processed table from my list_tables.
I'm stuck; I don't know if it's possible to pack all of it into one DataFrame, or whether I should union the DataFrames?
I'm stuck; I don't know if it's possible to pack all of it into one DataFrame, or whether I should union the DataFrames?
Both of the options you mentioned lead to the same thing - you have to iterate over a list of tables (you can't read multiple tables at once), read each of them, execute a SQL statement and save the results into a DataFrame, then union all of the DataFrames and save them as a single CSV file. The sample code could look something like this:
from pyspark.sql.functions import lit
from functools import reduce

tables = ["tableA", "tableB", "tableC"]
dfs = []
for table in tables:
    # Run the per-table SQL statement and tag the result with the table name
    dfs.append(spark.sql(f"my sql statement FROM {table}").withColumn("TableName", lit(table)))
df = reduce(lambda df1, df2: df1.union(df2), dfs)  # Union all DFs
df.coalesce(1).write.mode("overwrite").csv("my_csv.csv")  # Combine and write as a single file (Spark writes a folder with one part file)
Note: the union operation takes into account only the position of the columns, not their names. I assume that is the desired behaviour in your case, as you are only extracting a single statistic.
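If you ever do need name-based matching instead of positional matching, a small variant (not required here, just a sketch) would be:
# Match columns by name rather than by position when unioning
df = reduce(lambda df1, df2: df1.unionByName(df2), dfs)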

Case Condition in Python with Pandas dataframe

I've been learning pandas for a couple of days. I am migrating a SQL DB to Python and have encountered this SQL statement (example):
select * from
table_A a
left join table_B b
on a.ide = b.ide
and a.credit_type = case when b.type > 0 then b.credit_type else a.credit_type end
I've only been able to migrate the first condition. My difficulty is with the last line and I don't know how to migrate it. The tables are actually SQL queries that I've stored in dataframes.
merge = pd.merge(df_query_a, df_query_b, on='ide', how='left')
Any suggestions, please?
The CASE condition is like an if-then-else statement, which you can implement in Pandas using np.where() as below:
Based on the dataframe merge resulting from the left join:
import numpy as np
merge['credit_type_x'] = np.where(merge['type_y'] > 0, merge['credit_type_y'], merge['credit_type_x'])
Here the column names credit_type_x and credit_type_y will have been created by the Pandas merge function, which renames conflicting (identical) column names from the two source dataframes. If the merged dataframe doesn't have the column type_y because the column type appears only in Table_B and not in Table_A, you can use the column name type here instead.
Alternatively, since you only need to modify the value of credit_type_x when type_y > 0 and keep it unchanged otherwise, you can also do it simply with:
merge.loc[merge['type_y'] > 0, 'credit_type_x'] = merge['credit_type_y']
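For reference, a hedged sketch of the merge that produces those _x/_y columns (dataframe and column names taken from the question):
import pandas as pd

# _x/_y are pandas' default suffixes for columns that exist in both inputs
merge = pd.merge(df_query_a, df_query_b, on='ide', how='left', suffixes=('_x', '_y'))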
Below are two options to approach your problem:
You can add a column to df_query_a based on the condition you need, considering the two dataframes, and after that perform the merge.
You can try a library such as pandasql3 (see the sketch below).
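A minimal sketch of that second option, assuming the intended library is pandasql (which exposes an sqldf function for running SQL directly against in-memory dataframes):
from pandasql import sqldf

# Run the original SQL, including the CASE in the join condition, directly on the dataframes
query = """
SELECT *
FROM df_query_a a
LEFT JOIN df_query_b b
  ON a.ide = b.ide
 AND a.credit_type = CASE WHEN b.type > 0 THEN b.credit_type ELSE a.credit_type END
"""
merge = sqldf(query, locals())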

Replicating Excel VLOOKUP in Python

So I have 2 tables, Table 1 and Table 2; Table 2 is sorted by date, from recent to old. In Excel, when I do a lookup from Table 1 against Table 2, it only picks the first value from Table 2 and does not move on to search for the same value after the first.
So I tried replicating it in Python with the merge function, but found that it repeats the value as many times as it appears in the second table.
pd.merge(Table1, Table2, left_on='Country', right_on='Country', how='left', indicator='indicator_column')
[Images: TABLE1, TABLE2, merge result, expected result (Excel VLOOKUP)]
Is there any way this could be achieved with the merge function or any other python function?
Typing this blind, as you included your data as images, not text.
# The index is a very important element in a DataFrame
# We will see that in a bit
result = table1.set_index('Country')
# For each country, only keep the first row
tmp = table2.drop_duplicates(subset='Country').set_index('Country')
# When you assign one or more columns of a DataFrame to one or more columns of
# another DataFrame, the assignment is aligned based on the index of the two
# frames. This is the equivalence of VLOOKUP
result.loc[:, ['Age', 'Date']] = tmp[['Age', 'Date']]
result.reset_index(inplace=True)
Edit: Since you want a straight up Vlookup, just use join. It appears to find the very first one.
table1.join(table2, rsuffix='r', lsuffix='l')
The docs seem to indicate it performs similarly to a vlookup: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html
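If the two frames are not already aligned by position, a hedged variant of that join which matches on the Country key (and keeps only the first row per country, VLOOKUP-style) could be:
# Join on Country explicitly; drop_duplicates keeps only the first match per country
table1.set_index('Country').join(
    table2.drop_duplicates(subset='Country').set_index('Country'),
    how='left', lsuffix='l', rsuffix='r'
)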
I'd recommend approaching this more like a SQL join than a Vlookup. Vlookup finds the first matching row, from top to bottom, which could be completely arbitrary depending on how you sort your table/array in excel. "True" database systems and their related functions are more detailed than this, for good reason.
In order to join only one row from the right table onto one row of the left table, you'll need some kind of aggregation or selection - So in your case, that'd be either MAX or MIN.
The question is, which column is more important? The date or age?
import pandas as pd
df1 = pd.DataFrame({
'Country':['GERM','LIB','ARG','BNG','LITH','GHAN'],
'Name':['Dave','Mike','Pete','Shirval','Kwasi','Delali']
})
df2 = pd.DataFrame({
'Country':['GERM','LIB','ARG','BNG','LITH','GHAN','LIB','ARG','BNG'],
'Age':[35,40,27,87,90,30,61,18,45],
'Date':['7/10/2020','7/9/2020','7/8/2020','7/7/2020','7/6/2020','7/5/2020','7/4/2020','7/3/2020','7/2/2020']
})
df1.set_index('Country')\
   .join(
       df2.groupby('Country')\
          .agg({'Age': 'max', 'Date': 'max'}),
       how='left', lsuffix='l', rsuffix='r')

Merge the data with the help of python or tableau

I have 2 Excel sheets: one has 63,000 rows and the other has 67,000 rows, containing careers and their eligibility. Both have the same Title column, so I merged based on the title, but the output shows me 4,400,000 rows. Why so? Please help me with this problem, thank you.
import pandas as pd
df = pd.read_excel('c/downloads/knowledge.xlsx')
df1 = pd.read_excel('c/downloads/Abilities.xlsx')
df2 = pd.merge(df, df1, on='Title')
# Create a list of the files in the order you want to merge
all_df_list = [df, df1]
# Merge all the dataframes in all_df_list. Pandas will automatically append based on similar column names if that is what you meant by "same title".
appended_df = pd.concat(all_df_list)
# export as an excel file
appended_df.to_excel("data.xlsx", index=False)
Let me know if this helps. This works only if you have the same labels in both of the files.
Make sure you're using the correct join type. Left, Right, Inner, Outer etc. It sounds like you need to use a Left Join. That will match data from the table on the right to the one on the left and return values accordingly, similar to a VLOOKUP. If the default join type is an Outer join then it will include all values from both tables and will dramatically increase your records.
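Following that advice, a hedged sketch of an explicit left join on Title (with duplicates dropped from the right-hand file, an assumption about where the extra rows come from) could be:
import pandas as pd

# Explicit left join; dropping duplicate titles on the right side keeps one match per left row
df = pd.read_excel('c/downloads/knowledge.xlsx')
df1 = pd.read_excel('c/downloads/Abilities.xlsx')
df2 = pd.merge(df, df1.drop_duplicates(subset='Title'), on='Title', how='left')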

Merge dataframes by left join SQL & Pandas

I made two tables in a MySQL database using Python. The following SQL code performs a join on the two tables in the database. How can I do the same by writing equivalent Python code?
MySQL code:
SELECT A.num, B.co_name, A.rep_name
FROM A
JOIN B
ON A.num=B.no
Desired Python code:
sql = "XXX"
df_merged = pd.read_sql(sql, con=cnx)
I managed to resolve this by enclosing my query in triple quotes:
sql = '''SELECT A.num, B.co_name, A.rep_name
FROM A
LEFT JOIN B ON A.num=B.no '''
df_merged = pd.read_sql(sql, con=cnx)
One approach you can take is to process each element one by one and insert it into a new table/dataframe.
zip(*map(lambda x: pd.read_sql_query(SQL.format(x), connection).loc[0],
         df.yourDataFrame))
This will generate a key-value pair, with the SQL table as the key and the pandas df as the value. You can then add these values wherever you like (a df, or a SQL table).
Hope this helped :)
