I have two data frames that I have concatenated into one. What I ultimately want to end up with is a list of all the columns that exist in both. The data frames come from two different db tables, and I need to generate queries based on the ones that exist in both tables.
I tried doing the following: concat_per.query('doe_per==focus_per') but it returned an empty data frame.
doe_per focus_per
2 NaN Period_02
3 Period_01 Period_06
4 Period_02 Period_08
5 Period_03 NaN
6 Period_04 NaN
7 Period_05 NaN
8 Period_06 NaN
9 Period_07 NaN
10 Period_08 NaN
You can also use the isin() function.
First, convert the first column to a set or list to serve as your base values, then use isin() to filter the second dataframe.
firstList = set(df1st.doe_per)
targetDF = df2nd[df2nd.focus_per.isin(firstList)]
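If the end goal is just the list of values present in both columns, a set intersection gives it directly. A minimal sketch using the sample data from the question (frame and column names follow the question):
import pandas as pd

concat_per = pd.DataFrame({
    "doe_per":   [None, "Period_01", "Period_02", "Period_03", "Period_04",
                  "Period_05", "Period_06", "Period_07", "Period_08"],
    "focus_per": ["Period_02", "Period_06", "Period_08", None, None,
                  None, None, None, None],
})

# Values present in both columns, ignoring NaN
common = sorted(set(concat_per["doe_per"].dropna())
                & set(concat_per["focus_per"].dropna()))
print(common)  # ['Period_02', 'Period_06', 'Period_08']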
If you want to combine two dataframes into one, you can use
pd.merge(df1st, df2nd, left_on='doe_per', right_on='focus_per', how='inner')
or
pd.concat([df1st, df2nd], join='inner', ignore_index=True)
I may not have every parameter exactly right, but if you want to combine dataframes into one, these are the two functions to use. DataFrame.combine() may also work; check the pandas API docs.
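For example, a minimal sketch of the inner merge on the sample columns above (the frame names df1st/df2nd follow the earlier snippet; how='inner' keeps only the periods present in both):
import pandas as pd

df1st = pd.DataFrame({"doe_per": ["Period_01", "Period_02", "Period_03",
                                  "Period_04", "Period_05", "Period_06",
                                  "Period_07", "Period_08"]})
df2nd = pd.DataFrame({"focus_per": ["Period_02", "Period_06", "Period_08"]})

# An inner merge keeps only the periods that appear in both tables.
common = pd.merge(df1st, df2nd, left_on="doe_per", right_on="focus_per",
                  how="inner")
print(common["doe_per"].tolist())  # ['Period_02', 'Period_06', 'Period_08']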
I have 2 csv files with some random numbers, as follows:
csv1.csv
0 906018
1 007559
2 910475
3 915104
4 600393
...
5070 907525
5071 903079
5072 001910
5073 909735
5074 914861
length 5075
csv2.csv
0 5555
1 7859
2 501303
3 912414
4 913257
...
7497 915031
7498 915030
7499 915033
7500 902060
7501 915038
length 7502
Some elements in csv1 are present in csv2, but I don't know exactly which ones, and I would like to extract those unique values. So my idea was to start by merging the 2 data frames together, and then remove the duplicates.
so I wrote the following code:
import pandas as pd
import csv
unique_users = pd.read_csv('./csv1.csv')
unique_users['id']
identity = pd.read_csv('./csv2.csv')
identityNumber = identity['IDNumber']
identityNumber
df = pd.concat([identityNumber, unique_users])
Up to here everything is perfect and the length of df is the sum of the 2 lengths, but then I realised where I got stuck.
The concat did its job and concatenated based on the index, so now I have tons of NaN.
and when I use the code:
final_result = df.drop_duplicates(keep=False)
The data frame does not drop any values because the df structure now looks like this:
IDNumber     id
5555         NaN
so I guess that drop_duplicates is looking for rows with the exact same values, and since none exist it just keeps everything.
So what I would like to do is loop over both data frames, and if a value in csv1 exists in csv2, I want it to be dropped.
Can anyone help with this please?
And please if you need more info just let me know.
UPDATE:
I think I found the reason why it is not working, but I am not sure how to solve it.
my csv1 looks like this:
id
906018,
007559,
910475,
915104,
600393,
007992,
502313,
004609,
910017,
007954,
006678,
In a Jupyter notebook, when I open the csv, it looks this way:
id
906018 NaN
007559 NaN
910475 NaN
915104 NaN
600393 NaN
... ...
907525 NaN
903079 NaN
001910 NaN
909735 NaN
914861 NaN
and I do not understand why it is seeing the id as NaN.
In fact I tried to add a new column into csv2, and as values I passed the ids from csv1... and I can confirm that they are all NaN.
So I believe this is surely the source of the problem, which then reflects on all the other steps.
Can anyone help me understand how I can solve this issue?
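A likely cause, assuming the trailing commas shown above really are in the file: each data row then has one more field than the header, so read_csv uses the first column as the index and fills the lone 'id' column with the empty trailing field, which becomes NaN. A minimal sketch of a workaround (index_col=False is pandas' documented option for malformed files with a delimiter at the end of each line; dtype=str keeps leading zeros such as 007559):
import pandas as pd

# index_col=False stops pandas from treating the first column as the index
# when every row carries a trailing delimiter; dtype=str preserves leading zeros.
unique_users = pd.read_csv('./csv1.csv', index_col=False, dtype=str)
print(unique_users['id'].head())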
you can achieve this using df.merge():
# Data samples
data_1 = {'col_a': [906018,7559,910475,915104,600393,907525,903079,1910,909735,914861]}
data_2 = {'col_b': [5555,7859,914861,912414,913257,915031,1910,915104,7559,915038]}
df1 = pd.DataFrame(data_1)
df2 = pd.DataFrame(data_2)
# approach 1: find the shared values via an inner merge, then exclude them with isin()
common_vals = df1.merge(df2, right_on='col_b', left_on='col_a')['col_a']
new_df1 = df1[~df1.col_a.isin(common_vals)]
# approach 2: left merge and keep only the rows with no match in df2
new_df1 = df1[df1.merge(df2, right_on='col_b', left_on='col_a', how='left')['col_b'].isna()]
print(new_df1)
# col_a
# 0 906018
# 2 910475
# 4 600393
# 5 907525
# 6 903079
# 8 909735
This will combine your two dataframes into one dataframe, df, and drop the duplicate rows:
df = pd.concat([df1, df2]).drop_duplicates().reset_index(drop=True)
You are getting NaN because, when you concatenate, Pandas doesn't know what you want to do with the different column names of your two dataframes. One of your dataframes has an IDNumber column and the other has an id column. Pandas can't figure out what you want, so it puts both columns into the resulting dataframe.
Try this:
pd.concat([df1["IDNumber"], df2["id"]]).drop_duplicates().reset_index(drop=True)
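If the goal is instead to drop both copies of any value that appears in both files (so only the values unique to one file remain), keep=False does that. A minimal self-contained sketch; the sample numbers below are made-up stand-ins for the two id columns:
import pandas as pd

df1 = pd.DataFrame({"IDNumber": [906018, 7559, 910475]})
df2 = pd.DataFrame({"id": [7559, 915104, 910475, 600393]})

ids = pd.concat([df1["IDNumber"], df2["id"]])
# keep=False drops every value that occurs more than once across the two files,
# leaving only the values that appear in exactly one of them.
only_in_one = ids.drop_duplicates(keep=False).reset_index(drop=True)
print(only_in_one.tolist())  # [906018, 915104, 600393]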
My dataframe has two columns both called Scanned Blank. I want to always select the second column named 'Scanned Blank' below:
df['Scanned Blank'].head()
Scanned Blank Scanned Blank
1 NaN Y
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
I tried
df['Scanned Blank'][1]
which didn't work.
It's not feasible for me to use integer selection, because sometimes the columns will move around. For instance, sometimes the first Scanned Blank will be column 20 and the second one column 40; sometimes they will be 21 and 41 respectively. But whatever the exact position of the first column, I know I will always want the one right after it.
I realized that df['Scanned Blank'] was just returning another dataframe, so:
df['Scanned Blank'].iloc[:,1]
Also, yes, I'm aware this is bad practice. Unfortunately, I don't have any control over this dataset, and this script needs to reliably run when other people use it.
You can use duplicated:
df.loc[:,df.columns.duplicated()]
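A minimal runnable sketch of both approaches on a tiny made-up sample with two columns sharing the same name:
import pandas as pd

df = pd.DataFrame([[None, 'Y'], [None, None], [None, None]],
                  columns=['Scanned Blank', 'Scanned Blank'])

# df['Scanned Blank'] returns a DataFrame holding both matching columns,
# so take the second match by position within that slice.
second_blank = df['Scanned Blank'].iloc[:, 1]

# Equivalent: keep only the columns whose label repeats an earlier one.
second_blank_alt = df.loc[:, df.columns.duplicated()].squeeze()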
I'm just getting started with Pandas so I may be missing something important, but I can't seem to successfully subtract two columns I'm working with. I have a spreadsheet in excel that I imported as follows:
df = pd.read_excel('/path/to/file.xlsx',sheetname='Sheet1')
My table when doing df.head() looks similar to the following:
a b c d
0 stuff stuff stuff stuff
1 stuff stuff stuff stuff
2 data data data data
... ... ... ... ...
89 data data data data
I don't care about the "stuff;" I would like to subtract two columns of just the data and make this its own column. Therefore, it seemed obvious that I should trim off the rows I'm not interested in and work with what remains, so I have tried the following:
dataCol1 = df.ix[2:,0:1]
dataCol2 = df.ix[2:,1:2]
print(dataCol1.sub(dataCol2,axis=0))
But it results in
a b
2 NaN NaN
3 NaN NaN
4 NaN NaN
... ... ...
89 NaN NaN
I get the same result if I also simply try print(dataCol1-dataCol2). I really don't understand how both of these subtraction operations not only result in all NaN's, but also two columns instead of just one with the end result. Because when I print(dataCol1), for example, I do obtain the column I want to work with:
a
2 data
3 data
4 data
... ...
89 data
Is there any way to both work simply and directly from an Excel spreadsheet and perform basic operations with a truncated portion of the columns of said spreadsheet? Maybe there is a better way to go about this than using df.ix and I am definitely open to those methods as well.
The problem is misalignment: DataFrame subtraction aligns on both the index and the column labels, and since your two single-column frames have different column names ('a' and 'b'), nothing lines up and every cell of the result is NaN.
One thing to do would be to subtract the underlying values, so you don't have to deal with alignment issues at all:
dataCol1 = df.iloc[2: , 0:1] # ix is deprecated
dataCol2 = df.iloc[2: , 1:2]
result = pd.DataFrame(dataCol1.values - dataCol2.values)
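An alternative sketch, assuming the first two columns hold numeric data from row 2 onward: select each column as a Series (a single position in iloc), so the subtraction aligns on the index only and yields one column instead of two. The sample frame below is a made-up stand-in for the spreadsheet:
import pandas as pd

df = pd.DataFrame({'a': ['stuff', 'stuff', 3.0, 7.5],
                   'b': ['stuff', 'stuff', 1.0, 2.5]})

# Selecting single columns as Series avoids the column-label alignment
# that produced the all-NaN frame above.
col1 = pd.to_numeric(df.iloc[2:, 0])
col2 = pd.to_numeric(df.iloc[2:, 1])
df['difference'] = col1 - col2   # the first two ("stuff") rows get NaN
print(df)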
I've spent hours browsing everywhere now to try to create a multiindex from dataframe in pandas. This is the dataframe I have (posting excel sheet mockup. I do have this in pandas dataframe):
And this is what I want:
I have tried
newmulti = currentDataFrame.set_index(['user_id','account_num'])
But it returns a dataframe, not a multiindex. Also, I could not figure out how to make 'user_id' level 0 and 'account_num' level 1. I think this must be trivial but I've read so many posts, tutorials, etc. and still could not figure it out. Partly because I'm a very visual person and most posts are not. Please help!
You could simply use groupby in this case, which will create the multi-index automatically when it sums the sales along the requested columns.
df.groupby(['user_id', 'account_num', 'dates']).sales.sum().to_frame()
You should also be able to simply do this:
df.set_index(['user_id', 'account_num', 'dates'])
Although you probably want to avoid any duplicates (e.g. two or more rows with identical user_id, account_num and date values but different sales figures) by summing them, which is why I recommended using groupby.
If you need the multi-index, you can simply access it via new_df.index, where new_df is the new dataframe created from either of the two operations above.
And user_id will be level 0 and account_num will be level 1.
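A minimal runnable sketch of the groupby route (the sample data is made up; column names follow the question):
import pandas as pd

df = pd.DataFrame({
    'user_id':     [1, 1, 2],
    'account_num': [10, 10, 20],
    'dates':       ['2021-01-01', '2021-01-01', '2021-01-02'],
    'sales':       [100, 50, 75],
})

# groupby sums duplicate (user_id, account_num, dates) rows and builds the
# MultiIndex automatically.
summed = df.groupby(['user_id', 'account_num', 'dates']).sales.sum().to_frame()
print(type(summed.index))  # <class 'pandas.core.indexes.multi.MultiIndex'>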
For clarification of future users I would like to add the following:
As said by Alexander,
df.set_index(['user_id', 'account_num', 'dates'])
with a possible inplace=True does the job.
The type(df) gives
pandas.core.frame.DataFrame
whereas type(df.index) is indeed the expected
pandas.core.indexes.multi.MultiIndex
Use pd.MultiIndex.from_arrays
lvl0 = currentDataFrame.user_id.values
lvl1 = currentDataFrame.account_num.values
midx = pd.MultiIndex.from_arrays([lvl0, lvl1], names=['level 0', 'level 1'])
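A minimal sketch of attaching the constructed index to the frame (the sample data here is made up; assigning to .index leaves user_id and account_num in place as regular columns):
import pandas as pd

currentDataFrame = pd.DataFrame({'user_id': [1, 1, 2],
                                 'account_num': [10, 10, 20],
                                 'sales': [100, 50, 75]})

lvl0 = currentDataFrame.user_id.values
lvl1 = currentDataFrame.account_num.values
midx = pd.MultiIndex.from_arrays([lvl0, lvl1], names=['level 0', 'level 1'])

# Attach the MultiIndex directly; the original columns remain available.
currentDataFrame.index = midx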
There are two ways to do it; they are not exactly like what you have shown, but they work.
Say you have the following df:
A B C D
0 nil one 1 NaN
1 bar one 5 5.0
2 foo two 3 8.0
3 bar three 2 1.0
4 foo two 4 2.0
5 bar two 6 NaN
1. Workaround 1:
df.set_index('A', append = True, drop = False).reorder_levels(order = [1,0]).sort_index()
This will return:
         A      B  C    D
A
bar 1  bar    one  5  5.0
    3  bar  three  2  1.0
    5  bar    two  6  NaN
foo 2  foo    two  3  8.0
    4  foo    two  4  2.0
nil 0  nil    one  1  NaN
2. Workaround 2:
df.set_index(['A', 'B']).sort_index()
This will return:
           C    D
A   B
bar one    5  5.0
    three  2  1.0
    two    6  NaN
foo two    3  8.0
    two    4  2.0
nil one    1  NaN
The DataFrame returned by currentDataFrame.set_index(['user_id','account_num']) has its index set to ['user_id','account_num'].
newmulti.index will return the MultiIndex object.
I am attempting to merge 2 pandas dataframes; however, the values in the merge columns are not exactly the same.
I am using the command
pd.merge(D_data, L_data,on="R_Time")
however, in D_data my R_Time column looks like:
4.316667, 4.320834, 4.325000
and in my L_data column my data looks like:
4.31000, 4.32000, ...
Essentially, what I am trying to do is take every item in the first set and match it to the closest element in the second set. I've done this with the VLOOKUP function in Excel, but I'm not entirely sure how to get the same functionality with Pandas DataFrame objects.
Given Data:
D_data:
4.316667
4.320834
4.325
4.329167
4.333334
4.3375
4.341667
4.345834
4.35
4.354167
4.358334
L_data:
4.316667
4.318667
4.320667
4.322667
4.324667
4.326667
4.328667
4.330667
4.332667
4.334667
4.336667
I want to produce a pairing between exactly these elements, even though they are not exactly identical in most cases.
You can use Pandas' merge_asof():
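A minimal sketch of merge_asof on a slice of the sample data above (the D_R_Time column is introduced here only so the matched D_data value stays visible after the merge):
import pandas as pd

D_data = pd.DataFrame({"R_Time": [4.316667, 4.320834, 4.325000, 4.329167]})
L_data = pd.DataFrame({"R_Time": [4.316667, 4.318667, 4.320667, 4.322667]})

# Keep D_data's value in its own column so the pairing is visible after the merge.
D_keyed = D_data.assign(D_R_Time=D_data["R_Time"])

# merge_asof needs both frames sorted on the key; direction="nearest" pairs
# each L_data row with the closest D_data value.
nearest = pd.merge_asof(L_data.sort_values("R_Time"),
                        D_keyed.sort_values("R_Time"),
                        on="R_Time", direction="nearest")
print(nearest)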
Alternatively, first create a column in L_data holding the closest value from D_data["R_Time"] (found via the index of the smallest absolute difference), and then merge:
import pandas as pd
D_data =pd.DataFrame({"R_Time":[4.316667,4.320834,4.325,4.329167,4.333334,4.3375,4.341667,4.345834,4.35,4.354167,4.358334]})
L_data =pd.DataFrame({"_R_Time":[4.316667,4.318667,4.320667,4.322667,4.324667,4.326667,4.328667,4.330667,4.332667,4.334667,4.336667]})
L_data["R_Time"]=L_data.apply(lambda x:D_data["R_Time"][abs(D_data["R_Time"]-x["_R_Time"]).idxmin()],axis=1)
pd.merge(D_data, L_data,on="R_Time")
Result:
R_Time _R_Time
0 4.316667 4.316667
1 4.316667 4.318667
2 4.320834 4.320667
3 4.320834 4.322667
4 4.325000 4.324667
5 4.325000 4.326667
6 4.329167 4.328667
7 4.329167 4.330667
8 4.333334 4.332667
9 4.333334 4.334667
10 4.337500 4.336667