How to extract a subset of a bigger dataset [duplicate]

How to extract a subset of a bigger dataset [duplicate] - python

A simple pandas question:
Is there a drop_duplicates() functionality to drop every row involved in the duplication?
An equivalent question is the following: Does pandas have a set difference for dataframes?
For example:
In [5]: df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
In [6]: df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
In [7]: df1
Out[7]:
col1 col2
0 1 2
1 2 3
2 3 4
In [8]: df2
Out[8]:
col1 col2
0 4 6
1 2 3
2 5 5
so maybe something like df2.set_diff(df1) will produce this:
col1 col2
0 4 6
2 5 5
However, I don't want to rely on indexes because in my case, I have to deal with dataframes that have distinct indexes.
By the way, I initially thought about an extension of the current drop_duplicates() method, but now I realize that the second approach using properties of set theory would be far more useful in general. Both approaches solve my current problem, though.
Thanks!

Bit convoluted but if you want to totally ignore the index data. Convert the contents of the dataframes to sets of tuples containing the columns:
ds1 = set(map(tuple, df1.values))
ds2 = set(map(tuple, df2.values))
This step will get rid of any duplicates in the dataframes as well (index ignored)
set([(1, 2), (3, 4), (2, 3)]) # ds1
can then use set methods to find anything. Eg to find differences:
ds1.difference(ds2)
gives:
set([(1, 2), (3, 4)])
can take that back to dataframe if needed. Note have to transform set to list 1st as set cannot be used to construct dataframe:
pd.DataFrame(list(ds1.difference(ds2)))

Here's another answer that keeps the index and does not require identical indexes in two data frames. (EDIT: make sure there is no duplicates in df2 beforehand)
pd.concat([df2, df1, df1]).drop_duplicates(keep=False)
It is fast and the result is
col1 col2
0 4 6
2 5 5

from pandas import DataFrame
df1 = DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
print(df2[~df2.isin(df1).all(1)])
print(df2[(df2!=df1)].dropna(how='all'))
print(df2[~(df2==df1)].dropna(how='all'))

Apply by the columns of the object you want to map (df2); find the rows that are not in the set (isin is like a set operator)
In [32]: df2.apply(lambda x: df2.loc[~x.isin(df1[x.name]),x.name])
Out[32]:
col1 col2
0 4 6
2 5 5
Same thing, but include all values in df1, but still per column in df2
In [33]: df2.apply(lambda x: df2.loc[~x.isin(df1.values.ravel()),x.name])
Out[33]:
col1 col2
0 NaN 6
2 5 5
2nd example
In [34]: g = pd.DataFrame({'x': [1.2,1.5,1.3], 'y': [4,4,4]})
In [35]: g.columns=df1.columns
In [36]: g
Out[36]: 
   col1  col2
0   1.2     4
1   1.5     4
2   1.3     4
In [32]: g.apply(lambda x: g.loc[~x.isin(df1[x.name]),x.name])
Out[32]:
col1 col2
0 1.2 NaN
1 1.5 NaN
2 1.3 NaN
Note, in 0.13, there will be an isin operator on the frame level, so something like: df2.isin(df1) should be possible

There are 3 methods which work, but two of them have some flaws.
Method 1 (Hash method):
It worked for all cases I tested.
df1.loc[:, "hash"] = df1.apply(lambda x: hash(tuple(x)), axis = 1)
df2.loc[:, "hash"] = df2.apply(lambda x: hash(tuple(x)), axis = 1)
df1 = df1.loc[~df1["hash"].isin(df2["hash"]), :]
Method 2 (Dict method):
It fails if DataFrames contain datetime columns.
df1 = df1.loc[~df1.isin(df2.to_dict(orient="list")).all(axis=1), :]
Method 3 (MultiIndex method):
I encountered cases when it failed on columns with None's or NaN's.
df1 = df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)

Edit: You can now make MultiIndex objects directly from data frames as of pandas 0.24.0 which greatly simplifies the syntax of this answer
df1mi = pd.MultiIndex.from_frame(df1)
df2mi = pd.MultiIndex.from_frame(df2)
dfdiff = df2mi.difference(df1mi).to_frame().reset_index(drop=True)
Original Answer
Pandas MultiIndex objects have fast set operations implemented as methods, so you can convert the DataFrames to MultiIndexes, use the difference() method, then convert the result back to a DataFrame. This solution should be much faster (by ~100x or more from my brief testing) than the solutions given here so far, and it will not depend on the row indexing of the original frames. As Piotr mentioned for his answer, this will fail with null values, since np.nan != np.nan. Any row in df2 with a null value will always appear in the difference. Also, the columns should be in the same order for both DataFrames.
df1mi = pd.MultiIndex.from_arrays(df1.values.transpose(), names=df1.columns)
df2mi = pd.MultiIndex.from_arrays(df2.values.transpose(), names=df2.columns)
dfdiff = df2mi.difference(df1mi).to_frame().reset_index(drop=True)

Numpy's setdiff1d would work and perhaps be faster.
For each column:
np.setdiff1(df1.col1.values, df2.col1.values)
So something like:
setdf = pd.DataFrame({
col: np.setdiff1d(getattr(df1, col).values, getattr(df2, col).values)
for col in df1.columns
})
numpy.setdiff1d docs

Get the indices of the intersection with a merge, then drop them:
>>> df_all = pd.DataFrame(np.arange(8).reshape((4,2)), columns=['A','B']); df_all
A B
0 0 1
1 2 3
2 4 5
3 6 7
>>> df_completed = df_all.iloc[::2]; df_completed
A B
0 0 1
2 4 5
>>> merged = pd.merge(df_all.reset_index(), df_completed); merged
index A B
0 0 0 1
1 2 4 5
>>> df_pending = df_all.drop(merged['index']); df_pending
A B
1 2 3
3 6 7

Assumption:
df1 and df2 have identical columns
it is a set operation so duplicates are ignored
sets are not extremely large so you do not worry about memory
union = pd.concat([df1,df2])
sym_diff = union[~union.duplicated(keep=False)]
union_of_df1_and_sym_diff = pd.concat([df1, sym_diff])
diff = union_of_df1_and_sym_diff[union_of_df1_and_sym_diff.duplicated()]

I'm not sure how pd.concat() implicitly joins overlapping columns but I had to do a little tweak on #radream's answer.
Conceptually, a set difference (symmetric) on multiple columns is a set union (outer join) minus a set intersection (or inner join):
df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
o = pd.merge(df1, df2, how='outer')
i = pd.merge(df1, df2)
set_diff = pd.concat([o, i]).drop_duplicates(keep=False)
This yields:
col1 col2
0 1 2
2 3 4
3 4 6
4 5 5

In Pandas 1.1.0 you can count unique rows with value_counts and find difference between counts:
df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
diff = df2.value_counts().sub(df1.value_counts(), fill_value=0)
Result:
col1 col2
1 2 -1.0
2 3 0.0
3 4 -1.0
4 6 1.0
5 5 1.0
dtype: float64
Get positive counts:
diff[diff > 0].reset_index(name='counts')
col1 col2 counts
0 4 6 1.0
1 5 5 1.0

this should work even if you have multiple columns in both dataframes. But make sure that the column names of both the dataframes are the exact same.
set_difference = pd.concat([df2, df1, df1]).drop_duplicates(keep=False)
With multiple columns you can also use:
col_names=['col_1','col_2']
set_difference = pd.concat([df2[col_names], df1[col_names],
df1[col_names]]).drop_duplicates(keep=False)

Related

How to switch column values in the same Pandas DataFrame

I have the following DataFrame:
I need to switch values of col2 and col3 with the values of col4 and col5. Values of col1 will remain the same. The end result needs to look as the following:
Is there a way to do this without looping through the DataFrame?

Use rename in pandas
In [160]: df = pd.DataFrame({'A':[1,2,3],'B':[3,4,5]})
In [161]: df
Out[161]:
A B
0 1 3
1 2 4
2 3 5
In [167]: df.rename({'B':'A','A':'B'},axis=1)
Out[167]:
B A
0 1 3
1 2 4
2 3 5

This should do:
og_cols = df.columns
new_cols = [df.columns[0], *df.columns[3:], *df.columns[1:3]]
df = df[new_cols] # Sort columns in the desired order
df.columns = og_cols # Use original column names

If you want to swap the column values:
df.iloc[:, 1:3], df.iloc[:, 3:] = df.iloc[:,3:].to_numpy(copy=True), df.iloc[:,1:3].to_numpy(copy=True)

Pandas reindex could help :
cols = df.columns
#reposition the columns
df = df.reindex(columns=['col1','col4','col5','col2','col3'])
#pass in new names
df.columns = cols

update dataframe with series

having a dataframe, I want to update subset of columns with a series of same length as number of columns being updated:
>>> df = pd.DataFrame(np.random.randint(0,5,(6, 2)), columns=['col1','col2'])
>>> df
col1 col2
0 1 0
1 2 4
2 4 4
3 4 0
4 0 0
5 3 1
>>> df.loc[:,['col1','col2']] = pd.Series([0,1])
...
ValueError: shape mismatch: value array of shape (6,) could not be broadcast to indexing result of shape (2,6)
it fails, however, I am able to do the same thing using list:
>>> df.loc[:,['col1','col2']] = list(pd.Series([0,1]))
>>> df
col1 col2
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1
5 0 1
could you please help me to understand, why updating with series fails? do I have to perform some particular reshaping?

When assigning with a pandas object, pandas treats the assignment more "rigorously". A pandas to pandas assignment must pass stricter protocols. Only when you turn it to a list (or equivalently pd.Series([0, 1]).values) did pandas give in and allow you to assign in the way you'd imagine it should work.
That higher standard of assignment requires that the indices line up as well, so even if you had the right shape, it still wouldn't have worked without the correct indices.
df.loc[:, ['col1', 'col2']] = pd.DataFrame([[0, 1] for _ in range(6)])
df
df.loc[:, ['col1', 'col2']] = pd.DataFrame([[0, 1] for _ in range(6)], columns=['col1', 'col2'])
df

How to factorize two data frame meanwhile with python-pandas?

I have two data frame, one is user-item-rating and the other is side information of the items:
#df1
A12VH45Q3H5R5I B000NWJTKW 5.0
A3J8AQWNNI3WSN B000NWJTKW 4.0
A1XOBWIL4MILVM BDASK99000 1.0
#df2
B000NWJTKW ....
BDASK99000 ....
Now I w'd like to map the name of item and user to integer ID. I know there is a way of factorize:
df.apply(lambda x: pd.factorize(x)[0] + 1)
But I 'd like to ensure that the integer of the items in two data frame are consistent. So the resulting data frames is:
#df1
1 1 5.0
2 1 4.0
3 2 1.0
#df2
1 ...
2 ...
Do you know how to ensure that? Thanks in advance!

Concatenate the common column(s), and apply pd.factorize (or pd.Categorical) on that:
codes, uniques = pd.factorize(pd.concat([df1['item'], df2['item']]))
df1['item'] = codes[:len(df1)] + 1
df2['item'] = codes[len(df1):] + 1
For example,
import pandas as pd
df1 = pd.DataFrame(
[('A12VH45Q3H5R5I', 'B000NWJTKW', 5.0),
('A3J8AQWNNI3WSN', 'B000NWJTKW', 4.0),
('A1XOBWIL4MILVM', 'BDASK99000', 1.0)], columns=['user', 'item', 'rating'])
df2 = pd.DataFrame(
[('B000NWJTKW', 10),
('BDASK99000', 20)], columns=['item', 'extra'])
codes, uniques = pd.factorize(pd.concat([df1['item'], df2['item']]))
df1['item'] = codes[:len(df1)] + 1
df2['item'] = codes[len(df1):] + 1
codes, uniques = pd.factorize(df1['user'])
df1['user'] = codes + 1
print(df1)
print(df2)
yields
# df1
user item rating
0 1 1 5
1 2 1 4
2 3 2 1
# df2
item extra
0 1 10
1 2 20
Another way to work-around the problem (if you have enough memory) would be to merge the two DataFrames: df3 = pd.merge(df1, df2, on='item', how='outer'), and then factorize df3['item']:
df3 = pd.merge(df1, df2, on='item', how='outer')
for col in ['item', 'user']:
df3[col] = pd.factorize(df3[col])[0] + 1
print(df3)
yields
user item rating extra
0 1 1 5 10
1 2 1 4 10
2 3 2 1 20

Another option could be to apply factorize on the first dataframe, and then apply the resulting mapping to the second dataframe:
# create factorization:
idx, levels = pd.factorize(df1['item'])
# replace the item codes in the first dataframe with the new index value
df1['item'] = idx
# create a dictionary mapping the original code to the new index value
d = {code: i for i, code in enumerate(codes)}
# apply this mapping to the second dataframe
df2['item'] = df2.item.apply(lambda code: d[code])
This approach will only work if every level is present in both dataframes.

set difference for pandas

A simple pandas question:
Is there a drop_duplicates() functionality to drop every row involved in the duplication?
An equivalent question is the following: Does pandas have a set difference for dataframes?
For example:
In [5]: df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
In [6]: df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
In [7]: df1
Out[7]:
col1 col2
0 1 2
1 2 3
2 3 4
In [8]: df2
Out[8]:
col1 col2
0 4 6
1 2 3
2 5 5
so maybe something like df2.set_diff(df1) will produce this:
col1 col2
0 4 6
2 5 5
However, I don't want to rely on indexes because in my case, I have to deal with dataframes that have distinct indexes.
By the way, I initially thought about an extension of the current drop_duplicates() method, but now I realize that the second approach using properties of set theory would be far more useful in general. Both approaches solve my current problem, though.
Thanks!

Bit convoluted but if you want to totally ignore the index data. Convert the contents of the dataframes to sets of tuples containing the columns:
ds1 = set(map(tuple, df1.values))
ds2 = set(map(tuple, df2.values))
This step will get rid of any duplicates in the dataframes as well (index ignored)
set([(1, 2), (3, 4), (2, 3)]) # ds1
can then use set methods to find anything. Eg to find differences:
ds1.difference(ds2)
gives:
set([(1, 2), (3, 4)])
can take that back to dataframe if needed. Note have to transform set to list 1st as set cannot be used to construct dataframe:
pd.DataFrame(list(ds1.difference(ds2)))

Here's another answer that keeps the index and does not require identical indexes in two data frames. (EDIT: make sure there is no duplicates in df2 beforehand)
pd.concat([df2, df1, df1]).drop_duplicates(keep=False)
It is fast and the result is
col1 col2
0 4 6
2 5 5

from pandas import DataFrame
df1 = DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
print(df2[~df2.isin(df1).all(1)])
print(df2[(df2!=df1)].dropna(how='all'))
print(df2[~(df2==df1)].dropna(how='all'))

Apply by the columns of the object you want to map (df2); find the rows that are not in the set (isin is like a set operator)
In [32]: df2.apply(lambda x: df2.loc[~x.isin(df1[x.name]),x.name])
Out[32]:
col1 col2
0 4 6
2 5 5
Same thing, but include all values in df1, but still per column in df2
In [33]: df2.apply(lambda x: df2.loc[~x.isin(df1.values.ravel()),x.name])
Out[33]:
col1 col2
0 NaN 6
2 5 5
2nd example
In [34]: g = pd.DataFrame({'x': [1.2,1.5,1.3], 'y': [4,4,4]})
In [35]: g.columns=df1.columns
In [36]: g
Out[36]: 
   col1  col2
0   1.2     4
1   1.5     4
2   1.3     4
In [32]: g.apply(lambda x: g.loc[~x.isin(df1[x.name]),x.name])
Out[32]:
col1 col2
0 1.2 NaN
1 1.5 NaN
2 1.3 NaN
Note, in 0.13, there will be an isin operator on the frame level, so something like: df2.isin(df1) should be possible

There are 3 methods which work, but two of them have some flaws.
Method 1 (Hash method):
It worked for all cases I tested.
df1.loc[:, "hash"] = df1.apply(lambda x: hash(tuple(x)), axis = 1)
df2.loc[:, "hash"] = df2.apply(lambda x: hash(tuple(x)), axis = 1)
df1 = df1.loc[~df1["hash"].isin(df2["hash"]), :]
Method 2 (Dict method):
It fails if DataFrames contain datetime columns.
df1 = df1.loc[~df1.isin(df2.to_dict(orient="list")).all(axis=1), :]
Method 3 (MultiIndex method):
I encountered cases when it failed on columns with None's or NaN's.
df1 = df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)

Edit: You can now make MultiIndex objects directly from data frames as of pandas 0.24.0 which greatly simplifies the syntax of this answer
df1mi = pd.MultiIndex.from_frame(df1)
df2mi = pd.MultiIndex.from_frame(df2)
dfdiff = df2mi.difference(df1mi).to_frame().reset_index(drop=True)
Original Answer
Pandas MultiIndex objects have fast set operations implemented as methods, so you can convert the DataFrames to MultiIndexes, use the difference() method, then convert the result back to a DataFrame. This solution should be much faster (by ~100x or more from my brief testing) than the solutions given here so far, and it will not depend on the row indexing of the original frames. As Piotr mentioned for his answer, this will fail with null values, since np.nan != np.nan. Any row in df2 with a null value will always appear in the difference. Also, the columns should be in the same order for both DataFrames.
df1mi = pd.MultiIndex.from_arrays(df1.values.transpose(), names=df1.columns)
df2mi = pd.MultiIndex.from_arrays(df2.values.transpose(), names=df2.columns)
dfdiff = df2mi.difference(df1mi).to_frame().reset_index(drop=True)

Numpy's setdiff1d would work and perhaps be faster.
For each column:
np.setdiff1(df1.col1.values, df2.col1.values)
So something like:
setdf = pd.DataFrame({
col: np.setdiff1d(getattr(df1, col).values, getattr(df2, col).values)
for col in df1.columns
})
numpy.setdiff1d docs

Get the indices of the intersection with a merge, then drop them:
>>> df_all = pd.DataFrame(np.arange(8).reshape((4,2)), columns=['A','B']); df_all
A B
0 0 1
1 2 3
2 4 5
3 6 7
>>> df_completed = df_all.iloc[::2]; df_completed
A B
0 0 1
2 4 5
>>> merged = pd.merge(df_all.reset_index(), df_completed); merged
index A B
0 0 0 1
1 2 4 5
>>> df_pending = df_all.drop(merged['index']); df_pending
A B
1 2 3
3 6 7

Assumption:
df1 and df2 have identical columns
it is a set operation so duplicates are ignored
sets are not extremely large so you do not worry about memory
union = pd.concat([df1,df2])
sym_diff = union[~union.duplicated(keep=False)]
union_of_df1_and_sym_diff = pd.concat([df1, sym_diff])
diff = union_of_df1_and_sym_diff[union_of_df1_and_sym_diff.duplicated()]

I'm not sure how pd.concat() implicitly joins overlapping columns but I had to do a little tweak on #radream's answer.
Conceptually, a set difference (symmetric) on multiple columns is a set union (outer join) minus a set intersection (or inner join):
df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
o = pd.merge(df1, df2, how='outer')
i = pd.merge(df1, df2)
set_diff = pd.concat([o, i]).drop_duplicates(keep=False)
This yields:
col1 col2
0 1 2
2 3 4
3 4 6
4 5 5

In Pandas 1.1.0 you can count unique rows with value_counts and find difference between counts:
df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
diff = df2.value_counts().sub(df1.value_counts(), fill_value=0)
Result:
col1 col2
1 2 -1.0
2 3 0.0
3 4 -1.0
4 6 1.0
5 5 1.0
dtype: float64
Get positive counts:
diff[diff > 0].reset_index(name='counts')
col1 col2 counts
0 4 6 1.0
1 5 5 1.0

this should work even if you have multiple columns in both dataframes. But make sure that the column names of both the dataframes are the exact same.
set_difference = pd.concat([df2, df1, df1]).drop_duplicates(keep=False)
With multiple columns you can also use:
col_names=['col_1','col_2']
set_difference = pd.concat([df2[col_names], df1[col_names],
df1[col_names]]).drop_duplicates(keep=False)

How to keep index when using pandas merge

I would like to merge two DataFrames, and keep the index from the first frame as the index on the merged dataset. However, when I do the merge, the resulting DataFrame has integer index. How can I specify that I want to keep the index from the left data frame?
In [4]: a = pd.DataFrame({'col1': {'a': 1, 'b': 2, 'c': 3},
'to_merge_on': {'a': 1, 'b': 3, 'c': 4}})
In [5]: b = pd.DataFrame({'col2': {0: 1, 1: 2, 2: 3},
'to_merge_on': {0: 1, 1: 3, 2: 5}})
In [6]: a
Out[6]:
col1 to_merge_on
a 1 1
b 2 3
c 3 4
In [7]: b
Out[7]:
col2 to_merge_on
0 1 1
1 2 3
2 3 5
In [8]: a.merge(b, how='left')
Out[8]:
col1 to_merge_on col2
0 1 1 1.0
1 2 3 2.0
2 3 4 NaN
In [9]: _.index
Out[9]: Int64Index([0, 1, 2], dtype='int64')
EDIT: Switched to example code that can be easily reproduced

In [5]: a.reset_index().merge(b, how="left").set_index('index')
Out[5]:
col1 to_merge_on col2
index
a 1 1 1
b 2 3 2
c 3 4 NaN
Note that for some left merge operations, you may end up with more rows than in a when there are multiple matches between a and b. In this case, you may need to drop duplicates.

You can make a copy of index on left dataframe and do merge.
a['copy_index'] = a.index
a.merge(b, how='left')
I found this simple method very useful while working with large dataframe and using pd.merge_asof() (or dd.merge_asof()).
This approach would be superior when resetting index is expensive (large dataframe).

There is a non-pd.merge solution using Series.map and DataFrame.set_index.
a['col2'] = a['to_merge_on'].map(b.set_index('to_merge_on')['col2']))
col1 to_merge_on col2
a 1 1 1.0
b 2 3 2.0
c 3 4 NaN
This doesn't introduce a dummy index name for the index.
Note however that there is no DataFrame.map method, and so this approach is not for multiple columns.

df1 = df1.merge(df2, how="inner", left_index=True, right_index=True)
This allows to preserve the index of df1

Assuming that the resulting df has the same number of rows and order as your first df, you can do this:
c = pd.merge(a, b, on='to_merge_on')
c.set_index(a.index,inplace=True)

another simple option is to rename the index to what was before:
a.merge(b, how="left").set_axis(a.index)
merge preserves the order at dataframe 'a', but just resets the index so it's safe to use set_axis

You can also use DataFrame.join() method to achieve the same thing. The join method will persist the original index. The column to join can be specified with on parameter.
In [17]: a.join(b.set_index("to_merge_on"), on="to_merge_on")
Out[17]:
col1 to_merge_on col2
a 1 1 1.0
b 2 3 2.0
c 3 4 NaN

Think I've come up with a different solution. I was joining the left table on index value and the right table on a column value based off index of left table. What I did was a normal merge:
First10ReviewsJoined = pd.merge(First10Reviews, df, left_index=True, right_on='Line Number')
Then I retrieved the new index numbers from the merged table and put them in a new column named Sentiment Line Number:
First10ReviewsJoined['Sentiment Line Number']= First10ReviewsJoined.index.tolist()
Then I manually set the index back to the original, left table index based off pre-existing column called Line Number (the column value I joined on from left table index):
First10ReviewsJoined.set_index('Line Number', inplace=True)
Then removed the index name of Line Number so that it remains blank:
First10ReviewsJoined.index.name = None
Maybe a bit of a hack but seems to work well and relatively simple. Also, guess it reduces risk of duplicates/messing up your data. Hopefully that all makes sense.

For the people that wants to maintain the left index as it was before the left join:
def left_join(
a: pandas.DataFrame, b: pandas.DataFrame, on: list[str], b_columns: list[str] = None
) -> pandas.DataFrame:
if b_columns:
b_columns = set(on + b_columns)
b = b[b_columns]
df = (
a.reset_index()
.merge(
b,
how="left",
on=on,
)
.set_index(keys=[x or "index" for x in a.index.names])
)
df.index.names = a.index.names
return df

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to extract a subset of a bigger dataset [duplicate] - python

Here's another answer that keeps the index and does not require identical indexes in two data frames. (EDIT: make sure there is no duplicates in df2 beforehand) pd.concat([df2, df1, df1]).drop_duplicates(keep=False) It is fast and the result is col1 col2 0 4 6 2 5 5

from pandas import DataFrame df1 = DataFrame({'col1':[1,2,3], 'col2':[2,3,4]}) df2 = DataFrame({'col1':[4,2,5], 'col2':[6,3,5]}) print(df2[~df2.isin(df1).all(1)]) print(df2[(df2!=df1)].dropna(how='all')) print(df2[~(df2==df1)].dropna(how='all'))

Numpy's setdiff1d would work and perhaps be faster. For each column: np.setdiff1(df1.col1.values, df2.col1.values) So something like: setdf = pd.DataFrame({ col: np.setdiff1d(getattr(df1, col).values, getattr(df2, col).values) for col in df1.columns }) numpy.setdiff1d docs

Related

How to switch column values in the same Pandas DataFrame

update dataframe with series

How to factorize two data frame meanwhile with python-pandas?

set difference for pandas

How to keep index when using pandas merge

Categories

Resources