How to perform an SQL UPDATE-like operation on a pandas dataframe? - Python

I have two CSV files with 30 to 40 thousand records each.
I loaded the CSV files into two corresponding dataframes.
Now I want to perform this SQL operation on the dataframes instead of in SQLite:
update table1
set column1 = (select column1 from table2 where table1.Id = table2.Id),
    column2 = (select column2 from table2 where table1.Id = table2.Id)
where column3 = 'some_value';
I tried to perform the update on the dataframe in 4 steps (sketched below):
1. merging the dataframes on the common Id
2. getting the Ids from the dataframe where column3 has 'some_value'
3. filtering the dataframe from the 1st step based on the Ids received in the 2nd step
4. using a lambda function to insert into the dataframe where the Id matches
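Roughly, the approach looks like this (a minimal sketch only; df1, df2 and the column names are illustrative, and step 4 is done here with a .loc assignment rather than a lambda):

# 1. Merge the two dataframes on the common Id, bringing in table2's columns with a suffix
merged = df1.merge(df2[['Id', 'column1', 'column2']], on='Id', how='left', suffixes=('', '_t2'))

# 2./3. Keep only the rows where column3 has 'some_value' and a matching Id was found
mask = (merged['column3'] == 'some_value') & merged['column1_t2'].notna()

# 4. Overwrite column1/column2 for those rows, then drop the helper columns
merged.loc[mask, 'column1'] = merged.loc[mask, 'column1_t2']
merged.loc[mask, 'column2'] = merged.loc[mask, 'column2_t2']
df1 = merged.drop(columns=['column1_t2', 'column2_t2'])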
I just want to know other views on this approach and whether there are any better solutions. One important thing is that the dataframes are quite large, so I feel like using SQLite would be better than pandas, as it gives the result in a single query and is much faster.
Should I use SQLite, or is there a better way to perform this operation on the dataframes?
Any views on this will be appreciated. Thank you.

Related

Iterate through database with PySpark DataFrame

I need to query 200+ tables in a database.
Using a spark.sql(f"select ...") statement I get col(0) (because the result of the query gives me specific information about the column I've retrieved) and the result of the calculation for the particular table, like this:
col(0)
1
My goal is to have one CSV file with the name of the table and the result of the calculation:
Table name    Count
accounting    3
sales         1
So far, the main part of my code:
list_tables = ['accounting', 'sales', ...]
for table in list_tables:
    df = spark.sql(
        f"""select distinct errors as counts from {database}.{table} where errors is not null""")
    df.repartition(1).write.mode("append").option("header", "true").csv(f"s3:.......")
    rename_part_file(dir, output, newdir)
I'm kinda new to PySpark and all the structures involved.
So far I'm confused, because I've heard that iterating over dataframes isn't the best idea.
Using the code above I get only 1 CSV with the most recent record, not all processed tables from my list_tables.
I'm stuck; I don't know if it is possible to pack all of it into 1 dataframe, or whether I should union the dataframes.
Both of the options you mentioned lead to the same thing - you have to iterate over a list of tables (you can't read multiple tables at once), read each of them, execute a SQL statement and save the results into a DataFrame, then union all of the DataFrames and save them as a single CSV file. The sample code could look something like this:
from functools import reduce
from pyspark.sql.functions import lit

tables = ["tableA", "tableB", "tableC"]

dfs = []
for table in tables:
    # Run your SQL statement against each table and tag the rows with the table name
    dfs.append(spark.sql(f"my sql statement from {table}").withColumn("TableName", lit(table)))

df = reduce(lambda df1, df2: df1.union(df2), dfs)  # Union all DFs
df.coalesce(1).write.mode("overwrite").csv("my_csv.csv")  # Combine and write as a single file
Note: the union operation takes into account only the position of the columns, not their names. I assume that for your case this is the desired behaviour, as you are only extracting a single statistic.

Case Condition in Python with Pandas dataframe

I've been learning pandas for a couple of days. I am migrating a SQL DB to Python and have encountered this SQL statement (example):
select *
from table_A a
left join table_B b
  on a.ide = b.ide
  and a.credit_type = case when b.type > 0 then b.credit_type else a.credit_type end
I've only been able to migrate the first condition. My difficulty is with the last line, and I don't know how to migrate it. The tables are actually SQL queries that I've stored in dataframes.
merge = pd.merge(df_query_a, df_query_b, on='ide', how='left')
Any suggestions, please?
The CASE condition is like an if-then-else statement, which you can implement in pandas using np.where(), as below.
Based on the dataframe merge resulting from the left join:
import numpy as np
merge['credit_type_x'] = np.where(merge['type_y'] > 0, merge['credit_type_y'], merge['credit_type_x'])
Here the column names credit_type_x and credit_type_y would have been created by the pandas merge function when it renames conflicting (identical) column names from the two source dataframes. In case the merged dataframe doesn't have the column type_y, because the column type appears only in Table_B but not in Table_A, you can use the column name type here instead.
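For illustration, a minimal sketch of how those suffixed columns arise (assuming both source dataframes carry credit_type and type columns, per the discussion above; all other names follow the question):

import numpy as np
import pandas as pd

# Overlapping column names such as credit_type get the default '_x'/'_y' suffixes
merge = pd.merge(df_query_a, df_query_b, on='ide', how='left', suffixes=('_x', '_y'))

# Emulate the CASE expression: take b.credit_type when b.type > 0, else keep a.credit_type
# (if 'type' exists only in df_query_b, the merged column is simply 'type', not 'type_y')
merge['credit_type_x'] = np.where(merge['type_y'] > 0,
                                  merge['credit_type_y'],
                                  merge['credit_type_x'])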
Alternatively, as you only need to modify the value of credit_type_x when type_y > 0 and retain its value unmodified otherwise, you can also do it simply with:
merge.loc[merge['type_y'] > 0, 'credit_type_x'] = merge['credit_type_y']
Below are two options to approach your problem:
You can add a column to df_query_a based on the condition you need, considering the two dataframes, and after that perform the merge.
You can try a library such as pandasql; a minimal sketch follows below.
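For the pandasql option, a rough sketch, assuming df_query_a and df_query_b are already in memory and have the columns used in the SQL above:

import pandasql as ps

query = """
select a.*, b.credit_type as credit_type_b
from df_query_a a
left join df_query_b b
  on a.ide = b.ide
 and a.credit_type = case when b.type > 0 then b.credit_type else a.credit_type end
"""
merged = ps.sqldf(query, locals())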

Replicating Excel VLOOKUP in Python

So I have 2 tables, Table 1 and Table 2. Table 2 is sorted by date, from recent dates to old dates. In Excel, when I do a lookup from Table 1 against Table 2, it only picks the first matching value from Table 2 and does not move on to search for the same value after the first.
So I tried replicating it in Python with the merge function, but found out that it repeats the value the number of times it appears in the second table.
pd.merge(Table1, Table2, left_on='Country', right_on='Country', how='left', indicator='indicator_column')
[Images in the original post: TABLE1, TABLE2, the merge result, and the expected result (Excel VLOOKUP)]
Is there any way this could be achieved with the merge function or any other Python function?
Typing this blind, as you are including your data as images, not text.
# The index is a very important element in a DataFrame
# We will see that in a bit
result = table1.set_index('Country')
# For each country, only keep the first row
tmp = table2.drop_duplicates(subset='Country').set_index('Country')
# When you assign one or more columns of a DataFrame to one or more columns of
# another DataFrame, the assignment is aligned based on the index of the two
# frames. This is the equivalent of VLOOKUP
result.loc[:, ['Age', 'Date']] = tmp[['Age', 'Date']]
result.reset_index(inplace=True)
Edit: Since you want a straight-up VLOOKUP, just use join. It appears to find the very first match.
table1.join(table2, rsuffix='r', lsuffix='l')
The docs seem to indicate it performs similarly to a vlookup: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html
I'd recommend approaching this more like a SQL join than a VLOOKUP. VLOOKUP finds the first matching row, from top to bottom, which could be completely arbitrary depending on how you sort your table/array in Excel. "True" database systems and their related functions are more detailed than this, for good reason.
In order to join only one row from the right table onto each row of the left table, you'll need some kind of aggregation or selection - so in your case, that'd be either MAX or MIN.
The question is: which column is more important, the date or the age?
import pandas as pd

df1 = pd.DataFrame({
    'Country': ['GERM', 'LIB', 'ARG', 'BNG', 'LITH', 'GHAN'],
    'Name': ['Dave', 'Mike', 'Pete', 'Shirval', 'Kwasi', 'Delali']
})

df2 = pd.DataFrame({
    'Country': ['GERM', 'LIB', 'ARG', 'BNG', 'LITH', 'GHAN', 'LIB', 'ARG', 'BNG'],
    'Age': [35, 40, 27, 87, 90, 30, 61, 18, 45],
    'Date': ['7/10/2020', '7/9/2020', '7/8/2020', '7/7/2020', '7/6/2020',
             '7/5/2020', '7/4/2020', '7/3/2020', '7/2/2020']
})

df1.set_index('Country')\
   .join(
       df2.groupby('Country')\
          .agg({'Age': 'max', 'Date': 'max'}),
       how='left', lsuffix='l', rsuffix='r')

PySpark - loop through each row of dataframe and run a hive query

I have a dataframe with 100 rows [name, age, date, hour]. I need to partition this dataframe by the distinct values of date. Let's say there are 20 distinct date values in these 100 rows; then I need to spawn 20 parallel Hive queries, where each Hive QL joins one of these partitions with a Hive table. The Hive table [dept, course, date] is partitioned by the date field.
The Hive table is huge, and hence I need to optimize these joins into multiple smaller joins and then aggregate the results. Any recommendations on how I can achieve this?
You can do this in a single query. Partition both dataframes on date and join them. During the join, broadcast the dataframe that holds the small data (~10 MB). Here is an example:
from pyspark.sql import functions as F

df3 = df1.repartition("date").join(
    F.broadcast(df2.repartition("date")),
    "date"
)
# df2 is your smaller dataframe; in your case it holds name, age, date, hour.
# Now perform any operation on df3

How to SELECT DISTINCT from a pandas hdf5store?

I have a large amount of data in an HDFStore (as a table), on the order of 80M rows with 1500 columns. Column A has integer values ranging between 1 and 40M or so. The values in column A are not unique and there may be between 1 and 30 rows with the same column A value. In addition, all rows which share a common value in column A will also have a common value in column B (not the same value as column A though).
I would like to do a select against the table to get a list of column A values and their corresponding column B values. The equivalent SQL statement would be something like SELECT DISTINCT ColA, ColB FROM someTable. What are some ways to achieve this? Can it be done such that the results of the query are stored directly into another table in the HDFStore?
Blocked Algorithms
One solution would be to look at dask.dataframe which implements a subset of the Pandas API with blocked algorithms.
import dask.dataframe as dd
df = dd.read_hdf('myfile.hdf5', '/my/data', columns=['A', 'B'])
result = df.drop_duplicates().compute()
In this particular case dd.DataFrame.drop_duplicates would pull out a medium-sized block of rows, perform the pd.DataFrame.drop_duplicates call and store the (hopefully smaller) result. It would do this for all blocks, concatenate them, and then perform a final pd.DataFrame.drop_duplicates on the concatenated intermediate result. You could also do this with just a for loop. Your case is a bit odd in that you also have a large number of unique elements. This might still be a challenge to compute even with blocked algorithms. Worth a shot though.
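A rough sketch of that manual for-loop variant, assuming the data was written in table format; the file name and key follow the dask example above, and the chunk size is arbitrary:

import pandas as pd

chunks = []
with pd.HDFStore('myfile.hdf5', mode='r') as store:
    # Stream the table in blocks, keeping only columns A and B,
    # and de-duplicate each block as it is read
    for chunk in store.select('/my/data', columns=['A', 'B'], chunksize=1_000_000):
        chunks.append(chunk.drop_duplicates())

# Concatenate the partial results and de-duplicate once more
result = pd.concat(chunks, ignore_index=True).drop_duplicates()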
Column Store
Alternatively you should consider looking into a storage format that can store your data as individual columns. This would let you collect just the two columns that you need, A and B, rather than having to wade through all of your data on disk. Arguably you should be able to fit 80 million rows into a single Pandas dataframe in memory. You could consider bcolz for this.
To be clear, you tried something like this and it didn't work?
import pandas
import tables
import pandasql
Check that your store is the type you think it is:
in: store
out: <class 'pandas.io.pytables.HDFStore'>
You can select a table from a store like this:
df = store.select('tablename')
Check that it worked:
in: type(df)
out: pandas.core.frame.DataFrame
Then you can do something like this:
q = """SELECT DISTINCT region, segment FROM tablename"""
distinct_df = (pandasql.sqldf(q, locals()))
(note that you will get deprecation warnings doing it this way, but it works)
