Pandas drop subset of dataframe - python

Assume we have df and df_drop:
df = pd.DataFrame({'A': [1,2,3], 'B': [1,1,1]})
df_drop = df[df.A==df.B]
I want to delete df_drop from df without using the explicit conditions used when creating df_drop. I.e., I'm not after the solution df[df.A!=df.B], but would like to, basically, take df minus df_drop somehow. Hope this is clear enough. Otherwise, happy to elaborate!

You can merge both dataframes with indicator=True and keep only the rows where the indicator column is not 'both':
out = pd.merge(df, df_drop, how='outer', indicator=True)
out[out._merge.ne('both')].drop(columns='_merge')
A B
1 2 1
2 3 1
Or, as jon clements points out, if checking by index is enough, you could simply use:
df.drop(df_drop.index)
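For completeness, a minimal runnable sketch of the index-based variant (it works because boolean indexing preserves the original index labels):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
df_drop = df[df.A == df.B]

# df_drop kept its index labels from df, so dropping by label removes exactly those rows
print(df.drop(df_drop.index))
#    A  B
# 1  2  1
# 2  3  1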

In this case, drop_duplicates works because the test criterion is the equality of two rows.
More generally, you can use loc to select the rows that meet (or do not meet) a specified criterion.
import numpy as np
import pandas as pd

a = np.random.randint(1, 50, 100)
b = np.random.randint(1, 50, 100)
df = pd.DataFrame({'a': a, 'b': b})
criteria = df.a > 2 * df.b
df.loc[criteria, :]
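To take the complement instead (the rows that do not meet the criterion), negate the boolean mask; a small sketch using the frame above:

df.loc[~criteria, :]  # rows where a <= 2 * b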

Like this maybe:
In [1468]: pd.concat([df, df_drop]).drop_duplicates(keep=False)
Out[1468]:
A B
1 2 1
2 3 1
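One caveat worth noting: keep=False also removes rows that were already duplicated within df itself, even if they never appear in df_drop. A small sketch of that edge case:

import pandas as pd

df = pd.DataFrame({'A': [2, 2], 'B': [1, 1]})   # two identical rows in df
df_drop = pd.DataFrame({'A': [9], 'B': [9]})    # a row that is not in df at all
print(pd.concat([df, df_drop]).drop_duplicates(keep=False))
#    A  B
# 0  9  9
# both copies of (2, 1) were removed even though df_drop never contained them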


Pandas: correct way to merge dataframes (not with pd.merge() apparently)

I read another question (this one) that probably deals with a similar problem, but I couldn't understand the answer.
Consider two dataframes defined like this:
df1 = pd.DataFrame({
    'A': ['a', 'b', 'c', 'd'],
    'B': [1, np.nan, np.nan, 4]
})
df2 = pd.DataFrame({
    'A': ['a', 'c', 'b', 'd'],
    'B': [np.nan, 8, 9, np.nan]
})
I want to merge them to fill blank cells. At first I used
df1.merge(df2, on='A')
but this caused my df1 to have 2 different columns named B_x and B_y. I also tried with different parameters for the .merge() method but still couldn't solve the issue.
The final dataframe should look like this one
df1 = pd.DataFrame({
    'A': ['a', 'b', 'c', 'd'],
    'B': [1, 9, 8, 4]
})
Do you know what's the most logical way to do that?
I think that pd.concat() should be a useful tool for this job, but I have no idea how to apply it.
EDIT:
I modified the values so that the 'A' columns don't have the same order in both dataframes. This should avoid any ambiguity.
Maybe you can use fillna instead:
new_df = df1.fillna(df2)
Output:
>>> new_df
A B
0 a 1.0
1 b 9.0
2 c 8.0
3 d 4.0
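Note that fillna aligns on the index rather than on 'A', so with the reordered 'A' values from the edit you would want to align on 'A' first. A minimal sketch of that variant:

new_df = df1.set_index('A').fillna(df2.set_index('A')).reset_index()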
Here's a different solution:
(df1.merge(df2, on='A', suffixes=(';', ';'))
    .pipe(lambda x: x.set_axis(x.columns.str.strip(';'), axis=1))
    .groupby(level=0, axis=1)
    .first())
The semicolons (;) are arbitrary; you can use any character as long as it doesn't appear in any column names. Note that recent pandas versions may reject suffixes that produce duplicate column names, and groupby(..., axis=1) is deprecated.
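Given those deprecations, a similar merge-then-coalesce can be written with distinct suffixes and fillna; a hedged sketch:

new_df = (
    df1.merge(df2, on='A', suffixes=('', '_y'))
       .assign(B=lambda d: d['B'].fillna(d['B_y']))
       .drop(columns='B_y')
)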
You can map the values from df2:
df1['B'] = df1['B'].fillna(df1['A'].map(df2.set_index('A')['B']))
output:
A B
0 a 1.0
1 b 9.0
2 c 8.0
3 d 4.0
Alternative
If the values in A are unique, you can merge only column A of df1 with df2 and use combine_first:
df1 = df1.combine_first(df1[['A']].merge(df2))

Apply the same operation to multiple DataFrames efficiently

I have two data frames with the same columns, and similar content.
I'd like to apply the same functions to each, without having to brute-force them or concatenate the dfs. I tried to pass the objects into nested dictionaries, but that seems more trouble than it's worth (I don't believe dataframe.to_dict supports passing into an existing list).
However, it appears that the for loop only stores the filtered result in the local df variable, and I don't know how to get it back into the original dfs... see my example below.
df1 = {'Column1': [1, 2, 2, 4, 5],
       'Column2': ["A", "B", "B", "D", "E"]}
df1 = pd.DataFrame(df1, columns=['Column1', 'Column2'])

df2 = {'Column1': [2, 11, 2, 2, 14],
       'Column2': ["B", "Y", "B", "B", "V"]}
df2 = pd.DataFrame(df2, columns=['Column1', 'Column2'])

def filter_fun(df1, df2):
    for df in (df1, df2):
        df = df[(df['Column1'] == 2) & (df['Column2'].isin(['B']))]
    return df1, df2

filter_fun(df1, df2)
If you write the filter as a function you can apply it in a list comprehension:
def filter(df):
    return df[(df['Column1'] == 2) & (df['Column2'].isin(['B']))]

df1, df2 = [filter(df) for df in (df1, df2)]
I would recommend concatenation with custom keys, because 1) it is easy to assign the result back, and 2) you can do the same operation once instead of twice.
# Concatenate df1 and df2
df = pd.concat([df1, df2], keys=['a', 'b'])
# Perform your operation
out = df[(df['Column1'] == 2) & df['Column2'].isin(['B'])]
out.loc['a'] # result for `df1`
Column1 Column2
1 2 B
2 2 B
out.loc['b'] # result for `df2`
Column1 Column2
0 2 B
2 2 B
3 2 B
This should work fine for most operations. For groupby, you will want to group on the 0th index level as well.
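For instance, a sketch of a per-frame aggregation using the keys assigned at concat time:

df.groupby(level=0)['Column1'].sum()
# a    14
# b    31
# Name: Column1, dtype: int64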

comparing two pandas dataframes with different column names and finding match

I have two dataframes:
df1:
A B C
1 ss 123
2 sv 234
3 sc 333
df2:
A dd xc
1 ss 123
df2 will always have a single row. How can I check whether there is a match for that row of df2 in df1?
Use NumPy comparisons with np.all and axis=1 to compare rows (this broadcasts because df2 has a single row with the same number of columns):
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['ss', 'sv', 'sc'], 'C': [123, 234, 333]})
df2 = pd.DataFrame({'A': [1], 'dd': ['ss'], 'xc': [123]})
df3 = df1.loc[np.all(df1.values == df2.values, axis=1),:]
Or:
df3 = df1.loc[np.all(df1[['B','C']].values == df2[['dd','xc']].values, axis=1),:]
print(df3)
A B C
0 1 ss 123
In addition to Sandeep's answer, you can do:
df1[np.all(df1.values == df2.values, 1)].any().any()
to get a boolean.
Or another way:
df1[(df2.values==df1.values).all(1)].any().any()
Or:
pd.merge(df1,df2).equals(df1)
Note: the first two output True.
Check specific column (same as Sandeep's):
df1[col].isin(df2[col]).any()
How to check whether there is a match for that row of df2, in df1?
You can align columns and then check equality of df1 with the only row of df2:
df2.columns = df1.columns
res = (df1 == df2.iloc[0]).all(1).any() # True
The benefit of this solution is you aren't subsetting df1 (expensive), but instead constructing a Boolean dataframe / array (cheap) and checking if all values in at least one row are True.
This is still not particularly efficient as you are considering every row in df1 rather than stopping when a condition is satisfied. With numeric data, in particular, there are more efficient solutions.
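As a sketch of one such short-circuiting approach (assuming df2's columns are aligned to df1's column order), a generator passed to any() stops at the first matching row:

row = tuple(df2.iloc[0])
found = any(t == row for t in df1.itertuples(index=False, name=None))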

Compare 2 Pandas dataframes, row by row, cell by cell

I have 2 dataframes, df1 and df2, and want to do the following, storing results in df3:
for each row in df1:
    for each row in df2:
        create a new row in df3 (called "df1-1, df2-1" or whatever) to store results
        for each cell (column) in df1:
            for the cell in df2 whose column name is the same as for the cell in df1:
                compare the cells (using some comparing function func(a,b)) and,
                depending on the result of the comparison, write the result into the
                appropriate column of the "df1-1, df2-1" row of df3
For example, something like:
df1
A B C D
foo bar foobar 7
gee whiz herp 10
df2
A B C D
zoo car foobar 8
df3
df1-df2 A B C D
foo-zoo func(foo,zoo) func(bar,car) func(foobar,foobar) func(7,8)
gee-zoo func(gee,zoo) func(whiz,car) func(herp,foobar) func(10,8)
I've started with this:
for r1 in df1.iterrows():
    for r2 in df2.iterrows():
        for c1 in r1:
            for c2 in r2:
but am not sure what to do with it, and would appreciate some help.
So to continue the discussion in the comments, you can use vectorization, which is one of the selling points of a library like pandas or numpy. Ideally, you shouldn't ever be calling iterrows(). To be a little more explicit with my suggestion:
# with df1 and df2 provided as above, an example
df3 = df1['A'] * 3 + df2['A']
# recall that df2 only has one row, so alignment fills the other rows with NaN
df3
0 foofoofoozoo
1 NaN
Name: A, dtype: object
# more generally
# we know that df1 and df2 share column names, so we can initialize df3 with those names
df3 = pd.DataFrame(columns=df1.columns)
for colName in df1:
    df3[colName] = func(df1[colName], df2[colName])
Now, you could even have different functions applied to different columns by, say, creating lambda functions and then zipping them with the column names:
# some example functions
colAFunc = lambda x, y: x + y
colBFunc = lambda x, y: x - y
....
columnFunctions = [colAFunc, colBFunc, ...]
# initialize df3 as above
df3 = pd.DataFrame(columns=df1.columns)
for func, colName in zip(columnFunctions, df1.columns):
    df3[colName] = func(df1[colName], df2[colName])
The only "gotcha" that comes to mind is that you need to be sure that your function is applicable to the data in your columns. For instance, if you were to do something like df1['A'] - df2['A'] (with df1, df2 as you have provided), that would raise a TypeError, as the subtraction of two strings is undefined. Just something to be aware of.
Edit, re: your comment: That is doable as well. Iterate over the columns of whichever frame is larger so you don't run into a KeyError, and add an if statement:
# all the other jazz
# let's say df1 is [['A', 'B', 'C']] and df2 is [['A', 'B', 'C', 'D']]
# so iterate over df2 columns
for colName in df2:
    if colName not in df1:
        df3[colName] = np.nan  # be sure to import numpy as np
    else:
        df3[colName] = func(df1[colName], df2[colName])
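If you really do need every row of df1 compared against every row of df2, as in the original pseudocode, one hedged option is a cross join (available as how='cross' in pandas 1.2+); func here is a placeholder for whatever comparison you want:

import pandas as pd

df1 = pd.DataFrame({'A': ['foo', 'gee'], 'B': ['bar', 'whiz'],
                    'C': ['foobar', 'herp'], 'D': [7, 10]})
df2 = pd.DataFrame({'A': ['zoo'], 'B': ['car'], 'C': ['foobar'], 'D': [8]})

def func(a, b):
    # placeholder elementwise comparison; swap in your own logic
    return a == b

# every row of df1 paired with every row of df2
pairs = df1.merge(df2, how='cross', suffixes=('_1', '_2'))
df3 = pd.DataFrame({col: func(pairs[col + '_1'], pairs[col + '_2'])
                    for col in df1.columns})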

How to keep index when using pandas merge

I would like to merge two DataFrames, and keep the index from the first frame as the index on the merged dataset. However, when I do the merge, the resulting DataFrame has integer index. How can I specify that I want to keep the index from the left data frame?
In [4]: a = pd.DataFrame({'col1': {'a': 1, 'b': 2, 'c': 3},
                          'to_merge_on': {'a': 1, 'b': 3, 'c': 4}})

In [5]: b = pd.DataFrame({'col2': {0: 1, 1: 2, 2: 3},
                          'to_merge_on': {0: 1, 1: 3, 2: 5}})
In [6]: a
Out[6]:
col1 to_merge_on
a 1 1
b 2 3
c 3 4
In [7]: b
Out[7]:
col2 to_merge_on
0 1 1
1 2 3
2 3 5
In [8]: a.merge(b, how='left')
Out[8]:
col1 to_merge_on col2
0 1 1 1.0
1 2 3 2.0
2 3 4 NaN
In [9]: _.index
Out[9]: Int64Index([0, 1, 2], dtype='int64')
EDIT: Switched to example code that can be easily reproduced
In [5]: a.reset_index().merge(b, how="left").set_index('index')
Out[5]:
col1 to_merge_on col2
index
a 1 1 1
b 2 3 2
c 3 4 NaN
Note that for some left merge operations, you may end up with more rows than in a when there are multiple matches between a and b. In this case, you may need to drop duplicates.
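A hedged sketch of that cleanup, assuming you want to keep only the first match per original row ('index' being the column created by reset_index):

merged = a.reset_index().merge(b, how='left')
merged = merged.drop_duplicates(subset='index').set_index('index')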
You can make a copy of the index on the left dataframe and do the merge:
a['copy_index'] = a.index
a.merge(b, how='left')
I found this simple method very useful while working with large dataframes and using pd.merge_asof() (or dd.merge_asof()). This approach is superior when resetting the index is expensive (large dataframe).
There is a non-pd.merge solution using Series.map and DataFrame.set_index.
a['col2'] = a['to_merge_on'].map(b.set_index('to_merge_on')['col2'])
col1 to_merge_on col2
a 1 1 1.0
b 2 3 2.0
c 3 4 NaN
This doesn't introduce a dummy index name for the index.
Note, however, that Series.map handles only one column at a time, so this approach does not extend directly to multiple columns.
df1 = df1.merge(df2, how="inner", left_index=True, right_index=True)
This preserves the index of df1 (note that it merges on the index rather than on a column).
Assuming that the resulting df has the same number of rows and order as your first df, you can do this:
c = pd.merge(a, b, on='to_merge_on')
c.set_index(a.index,inplace=True)
Another simple option is to restore the index to what it was before:
a.merge(b, how="left").set_axis(a.index)
merge preserves the row order of dataframe a but resets the index, so it is safe to use set_axis.
You can also use the DataFrame.join() method to achieve the same thing. The join method persists the original index. The column to join on can be specified with the on parameter.
In [17]: a.join(b.set_index("to_merge_on"), on="to_merge_on")
Out[17]:
col1 to_merge_on col2
a 1 1 1.0
b 2 3 2.0
c 3 4 NaN
Think I've come up with a different solution. I was joining the left table on its index value, and the right table on a column value based on the index of the left table. What I did was a normal merge:
First10ReviewsJoined = pd.merge(First10Reviews, df, left_index=True, right_on='Line Number')
Then I retrieved the new index numbers from the merged table and put them in a new column named Sentiment Line Number:
First10ReviewsJoined['Sentiment Line Number']= First10ReviewsJoined.index.tolist()
Then I manually set the index back to the original, left table index based off pre-existing column called Line Number (the column value I joined on from left table index):
First10ReviewsJoined.set_index('Line Number', inplace=True)
Then removed the index name of Line Number so that it remains blank:
First10ReviewsJoined.index.name = None
Maybe a bit of a hack, but it seems to work well and is relatively simple. Also, I guess it reduces the risk of duplicates or messing up your data. Hopefully that all makes sense.
For those who want to maintain the left index as it was before the left join:

import pandas

def left_join(
    a: pandas.DataFrame, b: pandas.DataFrame, on: list[str], b_columns: list[str] = None
) -> pandas.DataFrame:
    if b_columns:
        # keep only the join keys plus the requested columns
        # (use a list, since newer pandas rejects set indexers)
        b = b[list(set(on + b_columns))]
    df = (
        a.reset_index()
        .merge(
            b,
            how="left",
            on=on,
        )
        # unnamed index levels come back from reset_index() as "index"
        .set_index(keys=[x or "index" for x in a.index.names])
    )
    df.index.names = a.index.names
    return df
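A hypothetical usage with the frames a and b defined earlier in this question:

res = left_join(a, b, on=['to_merge_on'])
print(res)
#    col1  to_merge_on  col2
# a     1            1   1.0
# b     2            3   2.0
# c     3            4   NaN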
