This post mentions symmetric difference, using df1.except(df2).union(df2.except(df1)) and/or df1.unionAll(df2).except(df1.intersect(df2)), but I'm getting syntax errors when using except.
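I suspect the errors come from except being a reserved keyword in Python; if so, something along these lines using PySpark's subtract() should give the symmetric difference instead (a sketch of that substitution, not code from the linked post):
# `except` is a Python keyword, so PySpark exposes the set operation as subtract()
sym_diff = df1.subtract(df2).union(df2.subtract(df1))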
I'm trying to compare two dataframes that can have 50 or more columns. I have working code below, but I need to avoid hard-coding the column names.
Sample code and example:
import sys
from pyspark.sql import Window
from pyspark.sql.functions import col, count, lit, when

# Create the two dataframes
df1 = sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),(22,'Tom',2000,'usa','HR','2/11/2019'),
                                  (33,'Kom',3500,'uk','IT','2/11/2019'),(44,'Nom',4000,'can','HR','2/11/2019'),
                                  (55,'Vom',5000,'mex','IT','2/11/2019'),(66,'XYZ',5000,'mex','IT','2/11/2019')],
                                 ['No','Name','Sal','Address','Dept','Join_Date'])
df2 = sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),(22,'Tom',2000,'usa','HR','2/11/2019'),
                                  (33,'Kom',3000,'uk','IT','2/11/2019'),(44,'Nom',4000,'can','HR','2/11/2019'),
                                  (55,'Xom',5000,'mex','IT','2/11/2019'),(77,'XYZ',5000,'mex','IT','2/11/2019')],
                                 ['No','Name','Sal','Address','Dept','Join_Date'])
df1 = df1.withColumn('FLAG',lit('DF1'))
df2 = df2.withColumn('FLAG',lit('DF2'))
# Concatenate the two DataFrames, to create one big dataframe.
df = df1.union(df2)
# Use a window function to check whether the count of identical rows exceeds 1;
# if it does, mark FLAG as SAME, otherwise leave it as is. Finally, drop the duplicates.
my_window = Window.partitionBy('No','Name','Sal','Address','Dept','Join_Date').rowsBetween(-sys.maxsize, sys.maxsize)
df = df.withColumn('FLAG', when((count('*').over(my_window) > 1),'SAME').otherwise(col('FLAG'))).dropDuplicates()
df.show()
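For the sample data this produces nine rows: three flagged SAME (No 11, 22, 44), and the differing rows (No 33 and 55 from each side, plus 66 and 77) kept with their DF1/DF2 flags.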
You can get all column names from df and use that list as the parameter for the Window function:
cols = df.columns
cols.remove('FLAG')
my_window = Window.partitionBy(cols).rowsBetween(-sys.maxsize, sys.maxsize)
The remaining code stays unchanged.
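Putting it together, a minimal sketch of the whole comparison wrapped in a reusable helper might look like this (flag_differences is just an illustrative name, not from the original code):
import sys
from pyspark.sql import Window
from pyspark.sql.functions import col, count, lit, when

def flag_differences(df1, df2):
    # Tag each row with its source frame, stack both frames, and mark
    # rows that occur in both as SAME; partition over every data column.
    df = df1.withColumn('FLAG', lit('DF1')).union(df2.withColumn('FLAG', lit('DF2')))
    cols = [c for c in df.columns if c != 'FLAG']
    w = Window.partitionBy(cols).rowsBetween(-sys.maxsize, sys.maxsize)
    return (df.withColumn('FLAG', when(count('*').over(w) > 1, 'SAME')
                                  .otherwise(col('FLAG')))
              .dropDuplicates())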
Related
I'm wondering what the best way to tackle this issue is. Suppose I have a DataFrame df1 with the following columns:
type_of_fruit name_of_fruit price
..... ..... .....
and a list called
expected_cols = ['name_of_fruit','price']
what's the best way to automate checking df1 against the expected_cols list? I was trying something like
df_cols=df1.columns.values.tolist()
if df_cols != expected_cols:
and then try to drop any columns not in expected_cols into another df, but this doesn't seem like a great idea to me. Is there a way to save the "dropped" columns?
df2 = df1.drop(columns=expected_cols)
But this seems problematic depending on column ordering, and also in cases where there are either more columns than expected or fewer. In the case where there are fewer than expected (i.e. df1 only contains the column name_of_fruit), I'm planning on using
df1.reindex(columns=expected_cols)
But I'm a bit iffy on how to do this programmatically, and on how to handle the case where there are more columns than expected.
You can use set difference with -. Assuming df1 has these columns:
In [539]: df1_cols = df1.columns  # ['type_of_fruit', 'name_of_fruit', 'price']
In [540]: expected_cols = ['name_of_fruit','price']
In [541]: unwanted_cols = list(set(df1_cols) - set(expected_cols))
In [542]: df2 = df1[unwanted_cols]
In [543]: df1.drop(columns=unwanted_cols, inplace=True)
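One caveat: set() does not preserve order, so unwanted_cols may come back in arbitrary order. If the original column order matters, a list comprehension keeps it:
# keeps df1's original column ordering, unlike the set difference
unwanted_cols = [c for c in df1.columns if c not in expected_cols]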
Use groupby along the columns axis to split the DataFrame succinctly. Here, check whether each column is in your list to form the grouper, and store the results in a dict: the True key gets the DataFrame with the columns in the list, and the False key gets the columns that are not.
Sample Data
import pandas as pd
df = pd.DataFrame(data=[[1,2,3]],
                  columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit','price']
Code
d = dict(tuple(df.groupby(df.columns.isin(expected_cols), axis=1)))
# If you need to ensure columns are always there then do
#d[True] = d[True].reindex(expected_cols)
d[True]
# name_of_fruit price
#0 2 3
d[False]
# type_of_fruit
#0 1
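On newer pandas versions, where groupby(axis=1) is deprecated (as far as I know), a boolean mask over the columns should give the same split:
# split columns with a boolean mask instead of groupby(axis=1)
mask = df.columns.isin(expected_cols)
d = {True: df.loc[:, mask], False: df.loc[:, ~mask]}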
I am trying to set up a featurizer which drops all but the first 10 columns of my dataframe. The dataframe consists of 76 columns in total. The idea is to apply PolynomialFeatures(1) to the 10 columns I would like to keep, but I cannot see a smart way to eliminate the remaining 66 columns (I was thinking of something like PolynomialFeatures(0), the idea being to multiply them by the constant 0, but it does not seem to work). There are basically two issues: 1) how to tell DataFrameMapper to apply the same featurizer to a range of columns (namely A_11 to A_76); 2) how to tell DataFrameMapper to apply a featurizer that eliminates such columns.
The (incomplete) code I tried so far looks as follows; in it, I denoted issue 1) (the range) as A_11 - A_76 and issue 2) as ?:
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from dml_iv.utilities import SubsetWrapper, ConstantModel
from econml.sklearn_extensions.linear_model import StatsModelsLinearRegression
col = ["A_"+str(k) for k in range(XW.shape[1])]
XW_db = pd.DataFrame(XW, columns=col)
from sklearn_pandas import DataFrameMapper
subset_names = set(['A_0','A_1','A_2','A_3','A_4','A_5','A_6','A_7','A_8','A_9','A_10'])
# list of indices of features X to use in the final model
mapper = DataFrameMapper([
    ('A_0', PolynomialFeatures(1)),
    ('A_1', PolynomialFeatures(1)),
    ('A_2', PolynomialFeatures(1)),
    ('A_3', PolynomialFeatures(1)),
    ('A_4', PolynomialFeatures(1)),
    ('A_5', PolynomialFeatures(1)),
    ('A_11 - A_76', ?)])  ## PROBLEMATIC PART
Why don't you drop columns you don't want from your dataframe and map what's left?
cols_map = [...] # list of columns to map
cols_drop = [...] # list of columns to drop
XW_db = XW_db.drop(cols_drop, axis=1) # you're left with only what to map
mapper = DataFrameMapper(cols_map)
...
If the reason for not wanting to drop columns is that they will be used later, you can simply assign the result of your drop to other variables, thus creating several subset dataframes which are easier to manipulate:
df2 = df1.drop(cols_drop2,axis=1) # df2 is a subset of df1
df3 = df1.drop(cols_drop3,axis=1) # df3 is a subset of df1
# Alternative is to decide what to keep instead of what to drop
df4 = df1[cols_keep] # df4 is a subset of df1
# df1 remains the full dataframe
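As for issue 1) in the question (the A_11 - A_76 range), the column lists don't have to be typed out; they can be generated, for example (assuming 76 columns named A_0 through A_75, as built from XW.shape[1] above):
# build the mapper entries and the drop list programmatically
cols_map = [(f"A_{k}", PolynomialFeatures(1)) for k in range(11)]  # A_0 .. A_10
cols_drop = [f"A_{k}" for k in range(11, 76)]                      # A_11 .. A_75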
Here is a dummy example of the DF I'm working with ('ETC' represents several columns):
df = pd.DataFrame(data={'PlotCode':['A','A','A','A','B','B','B','C','C'],
                        'INVYR':[2000,2000,2000,2005,1990,2000,1990,2005,2001],
                        'ETC':['a','b','c','d','e','f','g','h','i']})
And here is what I want to end up with:
df1 = pd.DataFrame(data={'PlotCode':['A','A','A','B','B','C'],
                         'INVYR':[2000,2000,2000,1990,1990,2001],
                         'ETC':['a','b','c','e','g','i']})
NOTE: I want ALL rows with the minimum 'INVYR' value for each 'PlotCode', not just one; otherwise I assume I could do something easier with drop_duplicates and sort.
So far, following the answer here: Appending pandas dataframes generated in a for loop, I've tried this with the following code:
df1 = []
for i in df['PlotCode'].unique():
    j = df[df['PlotCode']==i]
    k = j[j['INVYR']==j['INVYR'].min()]
    df1.append(k)
df1 = pd.concat(df1)
This code works but is very slow; my actual data contains some 40,000 different PlotCodes, so this isn't a feasible solution. Does anyone know a smooth, filtering-based way of doing this? I feel like I'm missing something very simple.
Thank you in advance!
Try not to use for loops when using pandas; they are extremely slow compared with pandas' vectorized operations.
Solution 1:
Determine the minimum INVYR for every PlotCode, using .groupby():
min_invyr_per_plotcode = df.groupby('PlotCode', as_index=False)['INVYR'].min()
And use pd.merge() to do an inner join between your original df and the minimum you just found:
result_df = pd.merge(
    df,
    min_invyr_per_plotcode,
    how='inner',
    on=['PlotCode', 'INVYR'],
)
Solution 2:
Again, determine the minimum per group, but now add it as a column to your dataframe. The per-group minimum is broadcast to every row by using .groupby().transform():
df['min_per_group'] = (df
    .groupby('PlotCode')['INVYR']
    .transform('min')
)
Now filter your dataframe to the rows where INVYR equals the minimum of their group:
df[df['INVYR'] == df['min_per_group']]
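As a quick sanity check, solution 2 collapses into a one-liner that reproduces the expected df1 from the sample data, with no Python-level loop over PlotCodes:
df.loc[df['INVYR'] == df.groupby('PlotCode')['INVYR'].transform('min'), 'ETC'].tolist()
# ['a', 'b', 'c', 'e', 'g', 'i'] -- the six expected rows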
I have some 100 dataframes that need to be filled into another big dataframe. I'll present the question with two dataframes:
import pandas as pd
df1 = pd.DataFrame([1,1,1,1,1], columns=["A"])
df2 = pd.DataFrame([2,2,2,2,2], columns=["A"])
Please note that both the dataframes have same column names.
I have a master dataframe that has repetitive index values as follows:-
master_df=pd.DataFrame(index=df1.index)
master_df= pd.concat([master_df]*2)
Expected Output:-
master_df['A']=[1,1,1,1,1,2,2,2,2,2]
I am using a for loop to replace every n rows of master_df with df1, df2, ... df100.
Please suggest a better way of doing it.
In fact df1, df2, ... df100 are the output of a function whose input is the column A values (1, 2). I was wondering if there is something like
another_df=master_df['A'].apply(lambda x: function(x))
Thanks in advance.
If you want to concatenate the dataframes, you can just use pandas concat with a list, as the code below shows.
First you can add df1 and df2 to a list:
df_list = [df1, df2]
Then you can concat the dfs:
master_df = pd.concat(df_list)
I used the default value of 0 for axis in the concat function (which I think is what you are looking for), but if you want to concatenate the dfs side by side you can just set axis=1.
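If df1 ... df100 really are produced by a function of the values in column A, the same concat pattern works over a comprehension; make_df below is a hypothetical stand-in for that function:
# make_df(x) is assumed to return the dataframe built from value x;
# the resulting index repeats 0..4 twice, matching pd.concat([master_df]*2)
master_df = pd.concat([make_df(x) for x in (1, 2)])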
I have a MultiIndex DataFrame and I'm trying to select data in it based on certain criteria; so far so good. The problem is that once I have selected my data using .loc and pd.IndexSlice, the resulting DataFrame, which should logically have fewer rows and fewer elements in the first level of the MultiIndex, keeps exactly the same MultiIndex, with some keys in it referring to empty dataframes.
I've tried creating a completely new DataFrame with a new index, but the structure of my data set is complicated and there is not always the same number of elements in a given level, so it is not easy to create a DataFrame with the right shape to put the data in.
import numpy as np
import pandas as pd
np.random.seed(3)  # so my example is reproducible
idx = pd.IndexSlice
iterables = [['A','B','C'],[0,1,2],['some','rdm','data']]
my_index = pd.MultiIndex.from_product(iterables, names=['first','second','third'])
my_columns = ['col1','col2','col3']
df1 = pd.DataFrame(data=np.random.randint(10, size=(len(my_index), len(my_columns))),
                   index=my_index,
                   columns=my_columns)
# OK, so let's say I want to keep only the elements in the first level of my
# index (['A','B','C']) for which the total sum in col3 is less than 35
boolean_mask = (df1.groupby(level = "first").col3.sum() < 35).tolist()
first_level_to_keep = df1.index.levels[0][boolean_mask].tolist()
# let's select the wanted data and put it in df2
df2 = df1.loc[idx[first_level_to_keep,:,:],:]
So far, everything is as expected
The problem is when I want to access the df2 index. I expected the following:
df2.index.levels[0].tolist() == ['B','C']
to be true. But this is what gives a True statement:
df2.index.levels[0].tolist() == ['A','B','C']
So my question is the following: is there a way to select data and get back a DataFrame with a MultiIndex that reflects what is actually in it? I find it weird to be able to select non-existing data in my df2.
Thank you for your time!
Even if you delete the rows corresponding to a particular value in an index level, that value still exists in the level. You can reset the index and then set those columns back as the index in order to generate a MultiIndex with fresh level values.
df2 = df2.reset_index().set_index(['first','second','third'])
print(df2.index.levels[0].tolist() == ['B','C'])
True
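Alternatively, if I remember correctly, pandas also offers MultiIndex.remove_unused_levels(), which prunes the stale level values without a full reset:
# drop level values that no longer appear anywhere in the index
df2.index = df2.index.remove_unused_levels()
print(df2.index.levels[0].tolist() == ['B','C'])  # True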