How to remove rows based on another DataFrame? - python

I've been working with pandas for a while, but I haven't figured out how to achieve the following result.
DF A consists of records that contain active and inactive LOBs. I want to remove the inactive LOBs, but which LOBs are inactive differs between states.
DF B has states as columns, with each column listing that state's inactive LOBs.
So I want a resulting DF that does not contain any inactive LOBs.
For example, an LOB 78 that is inactive in OH could be active in MI.
Reasoning:
In DF A you can see a record with state OH and LOB 78. I do not want this record in DF C because it is considered inactive: 78 appears in the OH column of DF B.
In DF A you can also see a record with state MI and LOB 78. I do want this record in DF C because there is no 78 in the MI column of DF B.
DF A has 500k records in it. Time to run isn't an issue, but it would be great if it were less than 5 minutes.
(I read DF B from a list of dicts: [{state: [list of inactive lob]}])
Sample DF A:
Name, state, LOB, ID
a , OH , 66 , 7979
aa , OH , 78 , 12341
bas , OH , 67 , 13434
basd, VT , 99 , 1241234
badf, MI , 77 , 12341234
bbdf, MI , 78 , 12341234
caff, VT , 66 , 2134
cdse, AZ , 01 , 232
sample DF B:
OH , VT , MI
66 , 99 , 77
78 , 23
I want a DF C:
Name, state, LOB, ID
bas , OH , 67 , 13434
bbdf, MI , 78 , 12341234
caff, VT , 66 , 2134
cdse, AZ , 01 , 232

IIUC, you can do a left anti-join by first melting dfb:
dfc = pd.merge(
    dfa,
    pd.melt(dfb, var_name="state", value_name="LOB"),
    on=["state", "LOB"],
    how="left",
    indicator=True,
).query('_merge != "both"').drop("_merge", axis=1)
print(dfc)
Name state LOB ID
2 bas OH 67 13434
5 bbdf MI 78 12341234
6 caff VT 66 2134
7 cdse AZ 1 232
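A self-contained run of this anti-join on the sample frames (variable names as above; DF B is padded with None because its columns have different lengths, which is an assumption about how it was built):

```python
import pandas as pd

dfa = pd.DataFrame({
    "Name": ["a", "aa", "bas", "basd", "badf", "bbdf", "caff", "cdse"],
    "state": ["OH", "OH", "OH", "VT", "MI", "MI", "VT", "AZ"],
    "LOB": [66, 78, 67, 99, 77, 78, 66, 1],
    "ID": [7979, 12341, 13434, 1241234, 12341234, 12341234, 2134, 232],
})
# DF B: one column per state listing its inactive LOBs (padded with None)
dfb = pd.DataFrame({"OH": [66, 78], "VT": [99, 23], "MI": [77, None]})

# melt DF B into (state, LOB) pairs, then left anti-join against DF A
melted = pd.melt(dfb, var_name="state", value_name="LOB").dropna()
melted["LOB"] = melted["LOB"].astype(int)  # restore int dtype lost to the None padding
dfc = (pd.merge(dfa, melted, on=["state", "LOB"], how="left", indicator=True)
       .query('_merge == "left_only"')
       .drop("_merge", axis=1))
print(dfc)
```

With a left join, `_merge != "both"` and `_merge == "left_only"` are equivalent; the result keeps only bas, bbdf, caff, and cdse.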

You can use a multi-index to achieve this as follows:
First, index A using both state and LOB:
A2 = A.set_index(['state', 'LOB'])
Then remove the rows you don't want in A:
to_remove = [(state, lob) for d in B for state, lobs in d.items() for lob in lobs]  # using the list of dicts directly, without converting it to a DataFrame
C = A2.loc[list(set(A2.index) - set(to_remove))]
After this, C will contain only the rows you want. Let me know if it helps.
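A minimal runnable sketch of this approach on the sample data (the list-of-dicts shape of B is assumed from the question's parenthetical):

```python
import pandas as pd

A = pd.DataFrame({
    "Name": ["a", "aa", "bas", "basd", "badf", "bbdf", "caff", "cdse"],
    "state": ["OH", "OH", "OH", "VT", "MI", "MI", "VT", "AZ"],
    "LOB": [66, 78, 67, 99, 77, 78, 66, 1],
    "ID": [7979, 12341, 13434, 1241234, 12341234, 12341234, 2134, 232],
})
# the list-of-dicts form of DF B mentioned in the question
B = [{"OH": [66, 78]}, {"VT": [99, 23]}, {"MI": [77]}]

# build the (state, LOB) pairs to drop, then anti-select via the MultiIndex
to_remove = [(state, lob) for d in B for state, lobs in d.items() for lob in lobs]
A2 = A.set_index(["state", "LOB"])
C = A2.loc[list(set(A2.index) - set(to_remove))].reset_index()
print(C)
```

Note that the set difference does not preserve row order, so sort afterwards if order matters; `reset_index()` turns state and LOB back into ordinary columns.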

Related

Get the sum of absolutes of columns for a dataframe

If I have a dataframe and I want to sum the values of the columns, I could do something like
import pandas as pd

studentdetails = {
    "studentname": ["Ram", "Sam", "Scott", "Ann", "John", "Bobo"],
    "mathantics": [80, 90, 85, 70, 95, 100],
    "science": [85, 95, 80, 90, 75, 100],
    "english": [90, 85, 80, 70, 95, 100],
}
index_labels = ['r1', 'r2', 'r3', 'r4', 'r5', 'r6']
df = pd.DataFrame(studentdetails, index=index_labels)
print(df)
df3 = df.sum()
print(df3)
col_list = ['studentname', 'mathantics', 'science']
print(df[col_list].sum())
How can I do something similar but, instead of getting only the sum, get the sum of the absolute values (which in this particular case would be the same) of some columns?
I tried abs in several ways, but it did not work.
Edit:
studentname mathantics science english
r1 Ram 80 85 90
r2 Sam 90 95 -85
r3 Scott -85 80 80
r4 Ann 70 90 70
r5 John 95 -75 95
r6 Bobo 100 100 100
Expected output
mathantics 520
science 525
english 520
Edit2:
The col_list cannot include string value columns
You need numeric columns for absolute values:
col_list = df.columns.difference(['studentname'])
df[col_list].abs().sum()
Or:
df.set_index('studentname').abs().sum()
Or (with import numpy as np):
df.select_dtypes(np.number).abs().sum()
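A quick check of the `select_dtypes` variant on the edited data with negatives, reproducing the expected output:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "studentname": ["Ram", "Sam", "Scott", "Ann", "John", "Bobo"],
    "mathantics": [80, 90, -85, 70, 95, 100],
    "science": [85, 95, 80, 90, -75, 100],
    "english": [90, -85, 80, 70, 95, 100],
}, index=['r1', 'r2', 'r3', 'r4', 'r5', 'r6'])

# take absolute values of the numeric columns only, then sum each column
totals = df.select_dtypes(np.number).abs().sum()
print(totals)  # mathantics 520, science 525, english 520
```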

Check if a row in one DataFrame exist in another, BASED ON SPECIFIC COLUMNS ONLY

I have two Pandas DataFrames with different numbers of columns.
df1 is a single-row DataFrame:
a X0 b Y0 c
0 233 100 56 shark -23
df2, instead, is a multi-row DataFrame:
d X0 e f Y0 g h
0 snow 201 32 36 cat 58 336
1 rain 176 99 15 tiger 63 845
2 sun 193 81 42 dog 48 557
3 storm 100 74 18 shark 39 673 # <-- This row
4 cloud 214 56 27 wolf 66 406
I would like to verify whether df1's row is in df2, considering the X0 AND Y0 columns only and ignoring all other columns.
In this example, df1's row matches the df2 row at index 3, which has 100 in X0 and 'shark' in Y0.
The output for this example is:
True
Note: True/False as output is enough for me; I don't care about the index of the matched row.
I found similar questions, but all of them check the entire row...
Use df.merge with an if condition check on len:
In [219]: if len(df1[['X0', 'Y0']].merge(df2)):
...: print(True)
...:
True
OR:
In [225]: not (df1[['X0', 'Y0']].merge(df2)).empty
Out[225]: True
Try this:
df2[df2.X0.isin(df1.X0) & df2.Y0.isin(df1.Y0)]
Output:
d X0 e f Y0 g h
3 storm 100 74 18 shark 39 673
duplicated
df2.append(df1).duplicated(['X0', 'Y0']).iat[-1]
True
Save a tad bit of time
df2[['X0', 'Y0']].append(df1[['X0', 'Y0']]).duplicated().iat[-1]
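Note that `DataFrame.append` was removed in pandas 2.0; the same `duplicated` trick can be written with `pd.concat` (a sketch on the question's sample data):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [233], "X0": [100], "b": [56], "Y0": ["shark"], "c": [-23]})
df2 = pd.DataFrame({
    "d": ["snow", "rain", "sun", "storm", "cloud"],
    "X0": [201, 176, 193, 100, 214],
    "e": [32, 99, 81, 74, 56],
    "f": [36, 15, 42, 18, 27],
    "Y0": ["cat", "tiger", "dog", "shark", "wolf"],
    "g": [58, 63, 48, 39, 66],
    "h": [336, 845, 557, 673, 406],
})

# stack the key columns and check whether the last row (df1's keys) duplicates any df2 row
exists = pd.concat([df2[["X0", "Y0"]], df1[["X0", "Y0"]]]).duplicated().iat[-1]
print(exists)  # True
```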

How can I merge these two datasets on 'Name' and 'Year'?

I am new in this field and stuck on this problem. I have two datasets.
all_batsman_df, this df has 5 columns ('years', 'team', 'pos', 'name', 'salary'):
years team pos name salary
0 1991 SF 1B Will Clark 3750000.0
1 1991 NYY 1B Don Mattingly 3420000.0
2 1991 BAL 1B Glenn Davis 3275000.0
3 1991 MIL DH Paul Molitor 3233333.0
4 1991 TOR 3B Kelly Gruber 3033333.0
all_batting_statistics_df, this df has 31 columns
Year Rk Name Age Tm Lg G PA AB R ... SLG OPS OPS+ TB GDP HBP SH SF IBB Pos Summary
0 1988 1 Glen Davis 22 SDP NL 37 89 83 6 ... 0.289 0.514 48.0 24 1 1 0 1 1 987
1 1988 2 Jim Acker 29 ATL NL 21 6 5 0 ... 0.400 0.900 158.0 2 0 0 0 0 0 1
2 1988 3 Jim Adduci* 28 MIL AL 44 97 94 8 ... 0.383 0.641 77.0 36 1 0 0 3 0 7D/93
3 1988 4 Juan Agosto* 30 HOU NL 75 6 5 0 ... 0.000 0.000 -100.0 0 0 0 1 0 0 1
4 1988 5 Luis Aguayo 29 TOT MLB 99 260 237 21 ... 0.354 0.663 88.0 84 6 1 1 1 3 564
I want to merge these two datasets on 'year' and 'name'. But the problem is that the two data frames spell some names differently: the first dataset has 'Glenn Davis', but the second dataset has 'Glen Davis'.
Now, I want to know how I can merge the two using the difflib library even though the names differ.
Any help will be appreciated...
Thanks in advance.
I have used this code, which I got from a question asked on this platform, but it is not working for me. I am adding a new column after matching names in both of the datasets. I know this is not a good approach. Kindly suggest if I can do it in a better way.
import cdifflib
import pandas as pd

df_a = all_batting_statistics_df
df_b = all_batters
df_a = df_a.astype(str)
df_b = df_b.astype(str)
df_a['merge_year'] = df_a['Year']  # we will use these as the merge keys
df_a['merge_name'] = df_a['Name']
for comp_a, addr_a in df_a[['Year', 'Name']].values:
    for ixb, (comp_b, addr_b) in enumerate(df_b[['years', 'name']].values):
        if cdifflib.CSequenceMatcher(None, comp_a, comp_b).ratio() > .6:
            df_b.loc[ixb, 'merge_year'] = comp_a  # creates a merge key in df_b
        if cdifflib.CSequenceMatcher(None, addr_a, addr_b).ratio() > .6:
            df_b.loc[ixb, 'merge_name'] = addr_a  # creates a merge key in df_b
merged_df = pd.merge(df_a, df_b, on=['merge_name', 'merge_year'], how='inner')  # note: 'merge_year', not 'merge_years'
You can do
import difflib

df_b['name'] = df_b['name'].apply(
    lambda x: difflib.get_close_matches(x, df_a['name'])[0])
to replace the names in df_b with the closest match from df_a, then do your merge. See also this post.
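A runnable toy version of this idea (the frames are stand-ins; column names are taken from the question, and the years are made equal so the two-key merge has a match):

```python
import difflib
import pandas as pd

# toy stand-ins for the question's frames
df_a = pd.DataFrame({"years": [1991, 1991], "name": ["Glenn Davis", "Paul Molitor"]})
df_b = pd.DataFrame({"Year": [1991], "Name": ["Glen Davis"], "OPS": [0.514]})

# replace each name in df_b with its closest match from df_a, then merge on both keys
df_b["Name"] = df_b["Name"].apply(
    lambda x: difflib.get_close_matches(x, df_a["name"], n=1, cutoff=0.6)[0])
merged = df_a.merge(df_b, left_on=["years", "name"], right_on=["Year", "Name"])
print(merged)
```

Caveat: `get_close_matches` returns an empty list when nothing clears the cutoff, so the `[0]` will raise an IndexError on unmatched names; guard for that on real data.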
Let me approach your problem by assuming that you have to make a data set with two columns: 1. 'year' and 2. 'name'.
1. First, we will rename all the names that are wrong. Assuming you know all the wrong names in all_batting_statistics_df, use something like
all_batting_statistics_df.replace(regex=r'^Glen Davis$', value='Glenn Davis')
Once you have corrected all the spellings, work from the smaller dataset whose names you know, so it doesn't take long.
2. We need both data sets to have the same columns, i.e. only 'year' and 'name'. Use this to drop the columns we don't need:
all_batsman_df_1 = all_batsman_df.drop(['team', 'pos', 'salary'], axis=1)
all_batting_statistics_df_1 = all_batting_statistics_df.drop(['Rk', 'Age', 'Tm', 'Lg', 'G', 'PA', 'AB', 'R', 'Summary'], axis=1)
I cannot see all 31 columns, so you will have to extend the list above.
3. We need to change the column names to match, i.e. 'year' and 'name', using DataFrame.rename:
df_new_1 = all_batting_statistics_df_1.rename(columns={'Year': 'year', 'Name': 'name'})
4. Next, merge them:
all_batsman_df_1.merge(df_new_1, on=['year', 'name'])
FINAL THOUGHTS:
If you don't want to do all this, find a way to export the data sets to Google Sheets or Microsoft Excel and edit them with those tools; but if you stick with pandas, it's not that difficult and you will find a way. All the best!

Python Pandas compare CSV keyerror

I am using Python Pandas to try and match the references from CSV2 to the data in CSV1 and create a new output file.
CSV1
reference,name,house
234 8A,john,37
564 68R,bill,3
RT4 VV8,kate,88
76AA,harry ,433
CSV2
reference
234 8A
RT4 VV8
CODE
import pandas as pd
df1 = pd.read_csv(r'd:\temp\data1.csv')
df2 = pd.read_csv(r'd:\temp\data2.csv')
df3 = pd.merge(df1,df2, on= 'reference', how='inner')
df3.to_csv('outpt.csv')
I am getting a KeyError for 'reference' when I run it. Could it be the spaces in the data that are causing the issue? The data is comma delimited.
Most probably you have either leading or trailing whitespace in the reference column name after reading your CSV files.
You can check it this way:
print(df1.columns.tolist())
print(df2.columns.tolist())
You can "fix" it by adding the sep=r'\s*,\s*' parameter to your pd.read_csv() calls (a regex separator makes pandas fall back to the Python parsing engine).
Example:
In [74]: df1
Out[74]:
reference name house
0 234 8A john 37
1 564 68R bill 3
2 RT4 VV8 kate 88
3 76AA harry 433
In [75]: df2
Out[75]:
reference
0 234 8A
1 RT4 VV8
In [76]: df2.columns.tolist()
Out[76]: ['reference ']
In [77]: df1.columns.tolist()
Out[77]: ['reference', 'name', 'house']
In [78]: df1.merge(df2, on='reference')
...
KeyError: 'reference'
fixing df2:
data = """\
reference
234 8A
RT4 VV8"""
df2 = pd.read_csv(io.StringIO(data), sep=r'\s*,\s*')
now it works:
In [80]: df1.merge(df2, on='reference')
Out[80]:
reference name house
0 234 8A john 37
1 RT4 VV8 kate 88
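An alternative, assuming the stray spaces are only in the header (as in the example above), is to strip the column names after reading, which avoids the regex separator entirely:

```python
import io
import pandas as pd

csv2 = "reference \n234 8A\nRT4 VV8\n"  # note the trailing space in the header
df2 = pd.read_csv(io.StringIO(csv2))
print(df2.columns.tolist())  # ['reference ']

# strip leading/trailing whitespace from the column names
df2.columns = df2.columns.str.strip()
print(df2.columns.tolist())  # ['reference']
```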

Annotating one dataFrame using another dataFrame, if there is matching value in column

I have two DataFrames. I want to first look for matching values between col1 in DataFrame1 and col1 in DataFrame2, and then print out all the columns from DataFrame1 together with the additional columns from DataFrame2.
For example, I have tried the following,
data = 'file_1'
Up = pd.DataFrame.from_csv(data, sep='\t')
Up = Up.reset_index(drop=False)
Up.head()
Gene_id baseMean log2FoldChange lfcSE stat pvalue padj
0 ENSG.16 176.275036 0.9475260059 0.4310373793 2.1982455617 0.0279316115 0.198658
1 ENSG.10 80.199435 0.4349592748 0.2691551416 1.6160169639 0.1060906455 0.369578
2 ENSG.15 1649.400749 -0.0215428237 0.1285061198 -0.1676404495 0.8668661474 0.947548
3 ENSG.10 25507.767530 0.5145516695 0.2473335499 2.0803957642 0.0374892475 0.229378
4 ENSG.12 70.122885 -0.2612483888 0.2593848667 -1.00718439
and the second dataframe is,
mydata = 'file_2'
annon = pd.DataFrame.from_csv(mydata, sep='\t')
annon = annon.reset_index(drop=False)
annon.head()
Gene_id sam_1 sam2 sam3 sam4 sam5 sam6 sam7 sam8 sam9 sam10 sam11
0 ENSG.16 404 55 33 39 102 43 193 244 600 174 120
1 ENSG.10 58 89 110 69 64 48 61 81 98 75 119
2 ENSG.15 1536 1246 2540 1751 1850 2137 1460 1362 2158 1367 1320
3 ENSG.10 28508 23073 19982 13821 20355 28835 26875 25632 27131 30991 29351
4 ENSG.12 87 81 121 67 98 47 37 59 68 44 81
and following is what i tried so far,
x=pd.merge(Up[['Gene_id' , 'log2FoldChange ', 'pvalue ' , 'padj']] , annon , on = 'Gene_id')
x.head()
Gene_id log2FoldChange pvalue padj sam_1 sam2 sam3 sam4 sam5 sam6 sam7 sam8 sam9 sam10 sam11
It's just giving me the header of the file and nothing else.
So I looked into file1 (Up) with one row value, like the following.
This is what I am getting:
print(Up.loc[Up['Gene_id'] =='ENSG.16'])
Empty DataFrame
Columns: [Gene_id, baseMean , log2FoldChange , lfcSE , stat , pvalue , padj]
Index: []
But in fact it is not empty; it has values in the DataFrame Up.
Any solutions would be great!
pd.merge(df1[['Gene_Id', 'log2FoldChange', 'pvalue', 'padj']], df2, left_on='Gene_Id', right_on='Gene_id')
You can then easily drop Gene_id if you want.
Hope this helps. Let me know if it works.
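Given that the lookup for 'ENSG.16' comes back empty even though the value is visibly there, the tab-separated files may carry stray whitespace in the column names and key values (note the trailing spaces in the Columns list the question prints); a minimal sketch of cleaning both before merging, on a toy stand-in for file_1:

```python
import io
import pandas as pd

# toy stand-in for file_1, with stray whitespace in a header and in a key value
raw = "Gene_id\tlog2FoldChange \n ENSG.16\t0.9475\nENSG.10\t0.4349\n"
Up = pd.read_csv(io.StringIO(raw), sep='\t')

Up.columns = Up.columns.str.strip()        # clean the header names
Up['Gene_id'] = Up['Gene_id'].str.strip()  # clean the key values too
print(Up.loc[Up['Gene_id'] == 'ENSG.16'])  # no longer empty
```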
import pandas as pd

# creating test DataFrame1
df = pd.DataFrame(['ENSG1', 162.315169869338, 0.920583258294463, 0.260406974056691,
                   3.53517128959092, 0.000407510906151687, 0.0176112964515702])
df = df.T
# the important thing is to make column 0 its index
df.index = df[0]
print(df)
# creating test DataFrame2
df2 = pd.DataFrame(['ENSG1', 404, 55, 33, 39, 102, 43, 193, 244, 600, 174, 120])
df2 = df2.T
# the important thing is to make column 0 its index
df2.index = df2[0]
print(df2)
# concatenate both frames using axis=1 (outer or inner join, as per your need)
x = pd.concat([df, df2], axis=1, join='outer')
print(x)
