How to create summary of multiple columns from multiple pandas dataframes? - python

I am trying to check for any loss of data in categorical columns (such as an entire category disappearing) after data cleansing. I have two Series that contain the number of unique values in each categorical column of the dataframes.
Before data cleansing:
dataframe1.nunique()
Column 1    10
Column 2    20
After data cleansing:
dataframe2.nunique()
Column 1    10
Column 2    15
Any idea how to get a table in the following format for better presentation? Both dataframes have the same columns, but not the same row count.
Column 1    10    10
Column 2    20    15

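You can use the concat() method, where df1 and df2 are the two Series returned by nunique():
df = pd.concat([df1, df2], axis=1)
df.columns = ['Unique Value Count_before', 'Unique Value Count_after']
Or use the to_frame() and merge() methods, joining on the index:
df = df1.to_frame('Unique Value Count').merge(df2.to_frame('Unique Value Count'), left_index=True, right_index=True, suffixes=('_before', '_after'))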
Output:
Column Name    Unique Value Count_before    Unique Value Count_after
Column 1       10                           10
Column 2       20                           15
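For anyone who wants to try this end to end, here is a minimal sketch of the concat() approach. The toy frames `before` and `after` and their contents are made up for illustration; only `nunique()` and `pd.concat()` come from the answer above.

```python
import pandas as pd

# Hypothetical data: "cleansing" drops the last row, which removes category "z"
before = pd.DataFrame({
    "Column 1": ["a", "a", "b", "b"],
    "Column 2": ["w", "x", "y", "z"],
})
after = before.iloc[:3]

# Each nunique() call returns a Series indexed by column name;
# concatenating along axis=1 lines the two Series up side by side.
summary = pd.concat(
    [before.nunique(), after.nunique()],
    axis=1,
    keys=["Unique Value Count_before", "Unique Value Count_after"],
)
summary.index.name = "Column Name"
print(summary)
```

Passing `keys=` names the two columns in the same call, so the separate `df.columns = [...]` assignment is not needed in this variant.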

Related

Fill in NA with other dataframe and then add the rows that are not in the first dataframe

I am using Python to merge two dataframes that look like this:
# Primary
df1 = [['A','2021-03','NA',9,'NA'], ['B','2021-09','NA','NA',27], ['C','2021-12','NA',12,28]]
df1_fin = pd.DataFrame(df1, columns=['ID','Date','Value_1','Value_2','Value_3'])
# Secondary
df2 = [['A','2021-03',80,20,30], ['B','2021-09',90,'NA',20], ['B','2021-12','NA','NA',27], ['D','2020-06',4,12,28]]
df2_fin = pd.DataFrame(df2, columns=['ID','Date','Value_1','Value_2','Value_3'])
I want to perform an outer join but keep the value from the first dataframe if it already exists.
The key columns are ID and Date.
If the ID and Date match, NA values are replaced by values from the second dataframe and existing values are not replaced.
If the ID and Date do not match, a new row is created.
The result dataframe will look like below:
ID    Date       Value_1    Value_2    Value_3
A     2021-03    80         9          30
B     2021-09    90         NA         27
B     2021-12    NA         NA         27
C     2021-12    NA         12         28
D     2020-06    4          12         28
Should I fill in the NA values first and then append the remaining rows? Or is there a function whose parameters let me do both in one step?
Yes, pandas has a function for this, combine_first. From its documentation:
Combine two DataFrame objects by filling null values in one DataFrame
with non-null values from other DataFrame. The row and column indexes
of the resulting DataFrame will be the union of the two.
df1_fin.set_index(['ID', 'Date']).combine_first(df2_fin.set_index(['ID', 'Date'])).reset_index()
(Please note that the two dataframes in your example contain no real missing values, only the string 'NA', which has no special meaning to pandas. Replace 'NA' with None in the example to get the intended behaviour.)
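A runnable version of the one-liner above might look like the following sketch. It assumes the 'NA' strings from the question are replaced with `np.nan` so that pandas treats them as missing; the variable names `primary`, `secondary`, and `keys` are illustrative.

```python
import numpy as np
import pandas as pd

cols = ["ID", "Date", "Value_1", "Value_2", "Value_3"]

# Same data as in the question, with real missing values instead of the string 'NA'
primary = pd.DataFrame(
    [["A", "2021-03", np.nan, 9, np.nan],
     ["B", "2021-09", np.nan, np.nan, 27],
     ["C", "2021-12", np.nan, 12, 28]],
    columns=cols,
)
secondary = pd.DataFrame(
    [["A", "2021-03", 80, 20, 30],
     ["B", "2021-09", 90, np.nan, 20],
     ["B", "2021-12", np.nan, np.nan, 27],
     ["D", "2020-06", 4, 12, 28]],
    columns=cols,
)

keys = ["ID", "Date"]
# Values from `primary` win; gaps are filled from `secondary`;
# rows whose key exists only in `secondary` are added.
result = (
    primary.set_index(keys)
    .combine_first(secondary.set_index(keys))
    .reset_index()
)
print(result)
```

Because the value columns contain NaN, they come back as floats (80.0 rather than 80).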

Update dataframe 1 using two columns in dataframe 2 in python

I want to update the Freq column in df1 using the Freq column in dataframe 2, as shown below.
data = {'Cell': [1, 2, 3, 4, '10-05', '10-09'], 'Freq': [True, True, True, True, True, True]}
df1 = pd.DataFrame(data)
[Figure: Dataframe 1]
Dataframe 2
data2 = {'Cell-1': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
         'Cell-2': [1, 2, 3, 4, '10-05', '10-09', 1, 2, 3, 4, '10-05', '10-09'],
         'Freq': [True, False, True, False, True, True, True, False, True, False, True, False]}
df2 = pd.DataFrame(data2)
[Figure: Dataframe 2]
In df1, column 1 holds the keys and column 2 holds the corresponding value, which in this case is either True or False.
Let's take, for example, key = 1 in Dataframe 1. This key has multiple values in Dataframe 2, as shown in the figure. It has multiple values because of the entries in column 2 of Dataframe 2, which are in turn keys of Dataframe 1, and those are the keys whose values I want to update in column 2 of df1.
[Figure: Algorithm in action]

Sum multiple dataframe columns based on a condition

I have a dataframe in Python with 30 columns.
I would like to add a new column and set it to the sum of only the values equal to 1 in the last 10 columns (20:30).
How can I do that?
Thanks
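One possible reading of this request is "for each row, sum the values in columns 20 to 29 that are equal to 1". Under that assumption, a minimal sketch could look like this; the column names `c0` to `c29`, the sample values, and the new column name `ones_sum` are all made up for illustration.

```python
import pandas as pd

# Hypothetical 3-row frame with 30 integer columns, all zeros to start
df = pd.DataFrame(0, index=range(3), columns=[f"c{i}" for i in range(30)])
df.loc[0, ["c20", "c25"]] = 1   # two ones inside the last 10 columns
df.loc[1, "c29"] = 1            # one 1 inside the last 10 columns
df.loc[1, "c21"] = 5            # not equal to 1, so it is ignored
df.loc[2, "c3"] = 1             # a 1 outside the last 10 columns, ignored

last10 = df.iloc[:, 20:30]
# Keep only the cells equal to 1 (everything else becomes 0), then sum per row
df["ones_sum"] = last10.where(last10 == 1, 0).sum(axis=1)
print(df["ones_sum"].tolist())
```

Since every kept value is 1, this is the same as counting the ones, which could also be written as `(last10 == 1).sum(axis=1)`.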

Pandas dataframe select row by index and column by name

Is there any way to select a row by position (i.e. integer) and a column by name in a pandas data frame?
I tried using loc but it returns an error, and I understand iloc only works with integer positions.
I want to select the first row of the column named 'Volume' in the data frame df, and tried df.loc[0, 'Volume'].
Use the get_loc method of the column Index to get the integer location of a column name.
Suppose you have this dataframe:
>>> df
    A  B  C
10  1  2  3
11  4  5  6
12  7  8  9
You can use .iloc like this:
>>> df.iloc[1, df.columns.get_loc('B')]
5
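The example above can be reproduced with the short script below. It also shows an alternative that goes the other way (turning the row position into a label for .loc); that variant is an addition here, not part of the original answer, and it assumes the index labels are unique.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9]},
    index=[10, 11, 12],
)

# Convert the column name to its integer position, then use .iloc
by_position = df.iloc[1, df.columns.get_loc("B")]

# Alternative: convert the row position to its label, then use .loc
by_label = df.loc[df.index[1], "B"]

print(by_position, by_label)
```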

I have to compare two large dataframes, how can I do it using multiprocessing in python?

Each row of one dataframe should be compared with every row of the other dataframe, and for each row of the second dataframe I want to print the names of the columns whose values are equal.
For example:
a=[['apple','cotton','pineapple']]
b=[['apple','lemon','pineapple'],['apple','cotton','mango'],['grapes','cotton','pineapple']]
Consider a to be a dataframe with one row and 3 columns, and b a dataframe with 3 rows and 3 columns.
My output when comparing the first row of a with b should be:
0 2
0 1
1 2
Here 0 is the name of the first column, 1 is the name of the second column, and 2 is the name of the third column.
The actual problem has millions of rows, so how can I do this using multiprocessing?
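One way this could be approached is to vectorize the comparison for a block of rows and then hand blocks to a process pool. The sketch below is only an illustration under that assumption: the helper names `compare_chunk` and `compare_parallel` and the `workers` parameter are hypothetical, and whether multiprocessing actually beats the plain vectorized version depends on the data size and the cost of copying chunks to the workers.

```python
from functools import partial
from multiprocessing import Pool

import pandas as pd

a = pd.DataFrame([["apple", "cotton", "pineapple"]])
b = pd.DataFrame([["apple", "lemon", "pineapple"],
                  ["apple", "cotton", "mango"],
                  ["grapes", "cotton", "pineapple"]])


def compare_chunk(row_values, chunk):
    """For each row of `chunk`, list the column names whose value equals `row_values`."""
    equal = chunk.to_numpy() == row_values   # boolean matrix, one row per chunk row
    cols = chunk.columns.to_numpy()
    return [cols[mask].tolist() for mask in equal]


def compare_parallel(row_values, frame, workers=4):
    """Split `frame` into row blocks and compare each block in a separate process."""
    step = max(1, -(-len(frame) // workers))  # ceiling division, at least 1
    chunks = [frame.iloc[i:i + step] for i in range(0, len(frame), step)]
    with Pool(workers) as pool:
        parts = pool.map(partial(compare_chunk, row_values), chunks)
    return [cols for part in parts for cols in part]


# Serial check on the small example from the question
matches = compare_chunk(a.iloc[0].to_numpy(), b)
print(matches)
```

For the parallel version, `compare_parallel(a.iloc[0].to_numpy(), b)` should be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.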
