I have two CSV files with some random numbers, as follows:
csv1.csv
0 906018
1 007559
2 910475
3 915104
4 600393
...
5070 907525
5071 903079
5072 001910
5073 909735
5074 914861
length 5075
csv2.csv
0 5555
1 7859
2 501303
3 912414
4 913257
...
7497 915031
7498 915030
7499 915033
7500 902060
7501 915038
length 7502
Some elements in csv1 are present in csv2, but I don't know exactly which ones, and I would like to extract the unique values. So my idea was to merge the two data frames together and then remove the duplicates.
So I wrote the following code:
import pandas as pd

unique_users = pd.read_csv('./csv1.csv')
unique_users['id']

identity = pd.read_csv('./csv2.csv')
identityNumber = identity['IDNumber']
identityNumber

df = pd.concat([identityNumber, unique_users])
Up to this point everything works, and the length of df is the sum of the two lengths, but then I hit the part where I got stuck:
concat did its job and concatenated based on the index, so now I have tons of NaN values.
and when I use the code:
final_result = df.drop_duplicates(keep=False)
The data frame does not drop any values, because the df structure now looks like this:
IDNumber    id
5555        NaN
So I guess drop_duplicates looks for rows with the exact same values; since none exist, it just keeps everything.
So what I would like to do is loop over both data frames, and if a value in csv1 exists in csv2, drop it.
Can anyone help with this please?
And please if you need more info just let me know.
UPDATE:
I think I found the reason why it is not working, but I am not sure how to solve it.
my csv1 looks like this:
id
906018,
007559,
910475,
915104,
600393,
007992,
502313,
004609,
910017,
007954,
006678,
In a Jupyter notebook, when I open the csv it looks like this:
id
906018 NaN
007559 NaN
910475 NaN
915104 NaN
600393 NaN
... ...
907525 NaN
903079 NaN
001910 NaN
909735 NaN
914861 NaN
and I do not understand why it is reading the ids as NaN.
In fact, I tried to add a new column to csv2, passing the ids from csv1 as values, and I can confirm that they are all NaN.
So I believe this is surely the source of the problem, which then cascades into everything else.
Can anyone help me understand how to solve this issue?
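A likely fix, judging by the trailing commas shown above: each line of csv1 ends with a stray delimiter, so pandas treats the numbers as the index and the empty second field as the id column. A minimal sketch using the index_col=False flag, which exists for exactly this kind of malformed file:
import pandas as pd

# index_col=False stops pandas from using the first column as the index
unique_users = pd.read_csv('./csv1.csv', index_col=False)

# if the leading zeros (e.g. 007559) must survive, read the ids as strings instead:
# unique_users = pd.read_csv('./csv1.csv', index_col=False, dtype={'id': str})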
You can achieve this using df.merge():
import pandas as pd

# Data samples
data_1 = {'col_a': [906018, 7559, 910475, 915104, 600393, 907525, 903079, 1910, 909735, 914861]}
data_2 = {'col_b': [5555, 7859, 914861, 912414, 913257, 915031, 1910, 915104, 7559, 915038]}
df1 = pd.DataFrame(data_1)
df2 = pd.DataFrame(data_2)

# approach 1: merge to find the shared values, then filter them out with isin()
shared_vals = df1.merge(df2, right_on='col_b', left_on='col_a')['col_a']
new_df1 = df1[~df1.col_a.isin(shared_vals)]

# approach 2: left-merge and keep only the rows that found no match
new_df1 = df1[df1.merge(df2, right_on='col_b', left_on='col_a', how='left')['col_b'].isna()]

print(new_df1)
# col_a
# 0 906018
# 2 910475
# 4 600393
# 5 907525
# 6 903079
# 8 909735
This will remove the duplicates between your two dataframes and keep all the remaining records in one dataframe df:
df = pd.concat([df1, df2]).drop_duplicates().reset_index(drop=True)
You are getting NaN because, when you concatenate, pandas doesn't know what to do with the differing column names of your two dataframes. One of your dataframes has an IDNumber column and the other has an id column. Pandas can't figure out what you want, so it puts both columns into the resulting dataframe.
Try this:
pd.concat([df1["IDNumber"], df2["id"]]).drop_duplicates().reset_index(drop=True)
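For completeness, a minimal end-to-end sketch of the original goal (values of csv1 that never appear in csv2), assuming the columns load as shown in the question:
import pandas as pd

s1 = pd.read_csv('./csv1.csv', index_col=False)['id']  # see the trailing-comma note above
s2 = pd.read_csv('./csv2.csv')['IDNumber']

# keep only the values of csv1 that never occur in csv2
only_in_csv1 = s1[~s1.isin(s2)]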
I would like to speed up a loop over a Python pandas DataFrame. Unfortunately, decades of using low-level languages mean I often struggle to find prepackaged solutions. Note: the data is private, but I will see if I can fabricate something and add it in an edit if that helps.
The code has three pandas dataframes: drugUseDF; tempDF, which holds the data; and tempDrugUse, which stores what's been retrieved. I loop over every row of tempDF (there will be several million rows), retrieve the prodcode from each row, and then use it to retrieve the corresponding value from the use1 column in drugUseDF. I've added comments to help navigate.
This is the structure of the dataframes:
tempDF
patid eventdate consid prodcode issueseq
0 20001 21/04/2005 2728 85 0
1 25001 21/10/2000 3939 40 0
2 25001 21/02/2001 3950 37 0
drugUseDF
index prodcode ... use1 use2
0 171 479 ... diabetes NaN
1 172 9105 ... diabetes NaN
2 173 5174 ... diabetes NaN
tempDrugUse
use1
0 NaN
1 NaN
2 NaN
This is the code:
from pandas import DataFrame

dfList = []
# if the drug dataframe contains the use1 column. Can this be improved?
if sum(drugUseDF.columns.isin(["use1"])) == 1:
    # predefine the dataframe where we will store the results, same length as the main data dataframe
    tempDrugUse = DataFrame(data=None, index=range(len(tempDF.index)), dtype=str, columns=["use1"])
    # go through each row of the main data dataframe
    for ind in range(len(tempDF)):
        # retrieve the prodcode from the ind-th row of the main data dataframe
        prodcodeStr = tempDF.iloc[ind]["prodcode"]
        # get the corresponding value from the use1 column matching the prodcode column
        useStr = drugUseDF[drugUseDF.loc[:, "prodcode"] == prodcodeStr]["use1"].values[0]
        # update the storing dataframe
        tempDrugUse.iloc[ind]["use1"] = useStr
    print("[DEBUG] End of loop for use1")
    dfList.append(tempDrugUse)
The order of the data matters. I can't retrieve multiple rows by matching the prodcode because each row has a date column. Retrieving multiple rows and adding them to the tempDrugUse dataframe could mean that the rows are no longer in chronological date order.
When trying to combine data from two dataframes you should use merge (similar to JOIN in SQL-like languages). Performance-wise, you should never loop over the rows; use the pandas built-in methods whenever possible. Ordering can be restored afterwards with the sort_values method.
If I understand you correctly, you want to match the prodcode across both tables. You can do this via pd.merge (please note the example in the code below differs from your data):
import pandas as pd

tempDF = pd.DataFrame({'patid': [20001, 25001, 25001],
                       'prodcode': [101, 102, 103]})
drugUseDF = pd.DataFrame({'prodcode': [101, 102, 103],
                          'use1': ['diabetes', 'hypertonia', 'gout']})

merged_df = pd.merge(tempDF, drugUseDF, on='prodcode', how='left')
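If preserving tempDF's row order one-to-one matters, a map-based sketch of the same idea (assuming prodcode is unique in drugUseDF):
# look up use1 for each prodcode; the output keeps exactly one row per tempDF row, in order
tempDrugUse = tempDF['prodcode'].map(drugUseDF.set_index('prodcode')['use1']).to_frame(name='use1')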
I have two csv files. csv1 looks like this:
Title,glide gscore,IFDScore
235,-9.01,-1020.18
235,-8.759,-1020.01
235,-7.301,-1019.28
while csv2 looks like this:
ID,smiles,number
28604361,NC(=O)CNC(=O)CC(c(cc1)cc(c12)OCO2)c3ccccc3,102
14492699,COc1cccc(c1OC)C(=O)N2CCCC(C2)CCC(=O)Nc3ccc(F)cc3C,235
16888863,COc1cc(ccc1O)CN2CCN(CC=C(C)C)C(C2)CCO,108
Both are much larger than what I show here. I need some way to match each value in the Title column of csv1 to the corresponding value in the number column of csv2. When a match is found, I need the value in the Title column of csv1 to be replaced with the corresponding value in the ID column of csv2. Thus I would want my desired output to be:
Title,glide gscore,IFDScore
14492699,-9.01,-1020.18
14492699,-8.759,-1020.01
14492699,-7.301,-1019.28
I am looking for a way to do it through pandas, bash or python.
This answer is close, but it gives me a "truth value of a DataFrame is ambiguous" error.
I also tried update in pandas without luck.
I'm not pasting the exact code I've tried yet because it would be overwhelming to see faulty code in pandas, bash and python all at once.
You could map it; then use fillna in case there were any "Titles" that did not have a matching "number":
import pandas as pd

csv1 = pd.read_csv('first_csv.csv')
csv2 = pd.read_csv('second_csv.csv')

csv1['Title'] = (csv1['Title'].map(csv2.set_index('number')['ID'])
                              .fillna(csv1['Title'])
                              .astype(int))
Output:
Title glide gscore IFDScore
0 14492699 -9.010 -1020.18
1 14492699 -8.759 -1020.01
2 14492699 -7.301 -1019.28
You can use the pandas module to load your dataframes, and then, with the merge function, achieve what you are looking for:
import pandas as pd
df1 = pd.read_csv("df1.csv")
df2 = pd.read_csv("df2.csv")
merged = df1.merge(df2, left_on="Title", right_on="number", how="right")
merged["Title"] = merged["ID"]
merged
Output
      Title  glide gscore  IFDScore        ID                                             smiles  number
0  28604361           NaN       NaN  28604361         NC(=O)CNC(=O)CC(c(cc1)cc(c12)OCO2)c3ccccc3     102
1  14492699         -9.01  -1020.18  14492699  COc1cccc(c1OC)C(=O)N2CCCC(C2)CCC(=O)Nc3ccc(F)cc3C     235
2  14492699        -8.759  -1020.01  14492699  COc1cccc(c1OC)C(=O)N2CCCC(C2)CCC(=O)Nc3ccc(F)cc3C     235
3  14492699        -7.301  -1019.28  14492699  COc1cccc(c1OC)C(=O)N2CCCC(C2)CCC(=O)Nc3ccc(F)cc3C     235
4  16888863           NaN       NaN  16888863              COc1cc(ccc1O)CN2CCN(CC=C(C)C)C(C2)CCO     108
Note that the NaN values appear where a csv2 row had no match in csv1; if your real data covers those rows too, you won't get NaN.
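If the goal is just the cleaned csv1, a short follow-up sketch (assuming the column names shown above) that drops the unmatched rows and the helper columns:
result = merged.dropna(subset=["glide gscore"])[["Title", "glide gscore", "IFDScore"]]
result.to_csv("output.csv", index=False)  # "output.csv" is just an illustrative path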
I am trying to assign values from a column df2['values'] to a column df1['values']. However, values should only be assigned if:
df2['category'] is equal to the df1['category'] (rows are part of the same category)
df1['date'] is in df2['date_range'] (date is in a certain range for a specific category)
So far I have this code, which works, but is far from efficient, since it takes me two days to process the two dfs (df1 has ca. 700k rows).
for i in df1.category.unique():
    for j in df2.category.unique():
        if i == j:  # matching categories
            for ia, ra in df1.loc[df1['category'] == i].iterrows():
                for ib, rb in df2.loc[df2['category'] == j].iterrows():
                    if df1['date'][ia] in df2['date_range'][ib]:
                        df1.loc[ia, 'values'] = rb['values']
                        break
I read that I should try to avoid for-loops when working with dataframes. List comprehensions are great; however, since I do not have much experience yet, I struggle to formulate more complicated code.
How can I approach this problem more efficiently? What key aspects should I think about when iterating over dataframes with conditions?
The code above tends to skip some rows or assign them wrongly, so I need to do a cleanup afterwards. And the biggest problem is that it is really slow.
Thank you.
Some df1 insight:
df1.head()
date category
0 2015-01-07 f2
1 2015-01-26 f2
2 2015-01-26 f2
3 2015-04-08 f2
4 2015-04-10 f2
Some df2 insight:
df2.date_range[0]
DatetimeIndex(['2011-11-02', '2011-11-03', '2011-11-04', '2011-11-05',
'2011-11-06', '2011-11-07', '2011-11-08', '2011-11-09',
'2011-11-10', '2011-11-11', '2011-11-12', '2011-11-13',
'2011-11-14', '2011-11-15', '2011-11-16', '2011-11-17',
'2011-11-18'],
dtype='datetime64[ns]', freq='D')
df2 other two columns:
df2[['values','category']].head()
values category
0 01 f1
1 02 f1
2 2.1 f1
3 2.2 f1
4 03 f1
Edit: Corrected erroneous code and added OP input from a comment
Alright, so if you want to join the dataframes on matching categories, you can merge them:
import pandas as pd
df3 = df1.merge(df2, on="category")
Next, since date is a timestamp and the "date_range" is actually generated from two columns, per OP's comment, we instead use:
mask = (df3["startdate"] <= df3["date"]) & (df3["date"] <= df3["enddate"])
subset = df3.loc[mask]
Now we get back to df1 and merge on the common dates while keeping all the values from df1. This will create NaN for the subset values where they didn't match with df1 in the earlier merge.
As such, we set df1["values"] where the entries in common are not NaN and we leave them be otherwise.
import numpy as np

common_dates = df1.merge(subset, on="date", how="left")  # keeping df1 values
df1["values"] = np.where(common_dates["values_y"].notna(),
                         common_dates["values_y"], df1["values"])
N.B.: If more than one row of df2 matches a given df1["date"], you'll have to drop some values first, otherwise the duplicates will misalign the rows.
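A minimal sketch of that cleanup, assuming keeping the first matching value per date is acceptable:
# deduplicate the date matches before the left merge,
# so common_dates lines up one-to-one with df1
subset = subset.drop_duplicates(subset="date", keep="first")
common_dates = df1.merge(subset, on="date", how="left")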
You could accomplish the first point:
1. df2['category'] is equal to the df1['category']
with the use of a join.
You could then use a for loop to filter out the data points from df1[date] in the merged dataframe that are not covered by df2[date_range]. Unfortunately, I would need more information about the content of df1[date] and df2[date_range] to write code here that does exactly that.
I would like to use a MultiIndex DataFrame to easily select portions of the DataFrame. I created an empty DataFrame as follows:
import pandas as pd

mi = {'input': ['a', 'b', 'c'], 'optim': ['pareto', 'alive']}
mi = pd.MultiIndex.from_tuples([(c, k) for c in mi.keys() for k in mi[c]])
mc = pd.MultiIndex(names=['Generation', 'Individual'], labels=[[], []], levels=[[], []])
population = pd.DataFrame(index=mi, columns=mc)
which seems to be good.
However, I do not know how to insert a single value to start populating my DataFrame. I tried the following:
population.loc[('optim','pareto'),(0,0)]=True
where I tried to define a new column with the double index (0,0), which led to a NotImplementedError. I also tried with (0,1), which gave a ValueError.
I tried also with no columns indexes:
population.loc[('optim','pareto')]=True
Which gave no error...but no change in the DataFrame either...
Any help? Thanks in advance.
EDIT
To clarify my question, once populated, my DataFrame should look like this:
Generation 1 2
Individual 1 2 3 4 5 6
input a 1 1 2 ...
b 1 2 2 ...
c 1 1 2 ...
optim pareto True True False ...
alive True True False ...
EDIT 2
I found out that what I was doing works if I define my first column at the DataFrame creation. In particular with:
mc = pd.MultiIndex.from_tuples([(0,0)])
I get a first column full of NaN and I can add data as I wanted to (also for new columns):
population.loc[('optim','pareto'),(0,1)]=True
I still do not know what is wrong with my first definition...
Even if I do not know why my initial definition was wrong, the following works as expected:
import pandas as pd

mi = {'input': ['a', 'b', 'c'], 'optim': ['pareto', 'alive']}
mi = pd.MultiIndex.from_tuples([(c, k) for c in mi.keys() for k in mi[c]])
mc = pd.MultiIndex.from_tuples([(0, 0)], names=['Generation', 'Individual'])
population = pd.DataFrame(index=mi, columns=mc)
It looks like the solution was to initialize the columns at the DataFrame creation (here with a (0,0) column). The created DataFrame is then:
Generation 0
Individual 0
input a NaN
b NaN
c NaN
optim pareto NaN
alive NaN
which can then be populated by adding values to the current column or to new columns/rows.
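As a quick usage sketch mirroring the assignments above, values can then go into the existing column or straight into new columns:
population.loc[('optim', 'pareto'), (0, 0)] = True   # fill the column created at construction
population.loc[('input', 'a'), (0, 1)] = 1           # a new column (0, 1) appears on assignment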
I'm just getting started with Pandas so I may be missing something important, but I can't seem to successfully subtract two columns I'm working with. I have a spreadsheet in excel that I imported as follows:
df = pd.read_excel('/path/to/file.xlsx', sheet_name='Sheet1')
My table when doing df.head() looks similar to the following:
a b c d
0 stuff stuff stuff stuff
1 stuff stuff stuff stuff
2 data data data data
... ... ... ... ...
89 data data data data
I don't care about the "stuff"; I would like to subtract two columns of just the data and make the result its own column. It seemed obvious that I should trim off the rows I'm not interested in and work with what remains, so I tried the following:
dataCol1 = df.ix[2:,0:1]
dataCol2 = df.ix[2:,1:2]
print(dataCol1.sub(dataCol2,axis=0))
But it results in
a b
2 NaN NaN
3 NaN NaN
4 NaN NaN
... ... ...
89 NaN NaN
I get the same result if I simply try print(dataCol1 - dataCol2). I really don't understand why both of these subtraction operations not only result in all NaNs, but also in two columns instead of just one. When I print(dataCol1), for example, I do get the column I want to work with:
a
2 data
3 data
4 data
... ...
89 data
Is there any way to both work simply and directly from an Excel spreadsheet and perform basic operations with a truncated portion of the columns of said spreadsheet? Maybe there is a better way to go about this than using df.ix and I am definitely open to those methods as well.
The problem is the misalignment of your labels: dataCol1 has column a and dataCol2 has column b, so the subtraction aligns on column names, finds no overlap, and produces NaN in both columns.
One thing to do would be to subtract the values, so you don't have to deal with alignment issues:
dataCol1 = df.iloc[2: , 0:1] # ix is deprecated
dataCol2 = df.iloc[2: , 1:2]
result = pd.DataFrame(dataCol1.values - dataCol2.values)
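If you would rather keep the result attached to the original dataframe, a small variant (the 'difference' column name is illustrative, and it assumes rows 2 onward hold numeric values):
# align the raw elementwise difference back onto rows 2..end of the original frame
df.loc[2:, 'difference'] = (dataCol1.values - dataCol2.values).ravel()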