Comparing columns of 2 dataframes - python

I am trying to get the columns that are unique to a data frame.
DF_A has 10 columns
DF_B has 3 columns (all three match column names in DF_A).
Before I was using:
cols_to_use = DF_A.columns - DF_B.columns
Since my pandas update, I am getting this error:
TypeError: cannot perform sub with this index type:
What should I be doing now instead?
Thank you!

You can use the Index.difference() method:
Demo:
In [12]: df
Out[12]:
a b c d
0 0 8 0 3
1 3 4 1 7
2 0 5 4 0
3 0 9 7 0
4 5 8 5 4
In [13]: df2
Out[13]:
a d
0 4 3
1 3 1
2 1 2
3 3 4
4 0 3
In [14]: df.columns.difference(df2.columns)
Out[14]: Index(['b', 'c'], dtype='object')
In [15]: cols = df.columns.difference(df2.columns)
In [16]: df[cols]
Out[16]:
b c
0 8 0
1 4 1
2 5 4
3 9 7
4 8 5
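
Note that Index.difference returns its result sorted rather than in the original column order (newer pandas versions also accept sort=False). If you need to preserve DF_A's ordering, a plain list comprehension works too; a minimal sketch using the names from the question:

cols_to_use = [c for c in DF_A.columns if c not in DF_B.columns]  # keeps DF_A's column order
DF_A[cols_to_use]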

Expanding the rows of a data frame based on its column containing lists
Say I have the following Pandas Dataframe:
df = pd.DataFrame({"a" : [1,2,3], "b" : [[1,2],[2,3,4],[5]]})
a b
0 1 [1, 2]
1 2 [2, 3, 4]
2 3 [5]
How would I "unstack" the lists in the "b" column in order to transform it into the dataframe:
a b
0 1 1
1 1 2
2 2 2
3 2 3
4 2 4
5 3 5
Starting from pandas 0.25.0, there is the built-in method DataFrame.explode(), which was designed just for that:
res = df.explode("b")
output
In [98]: res
Out[98]:
a b
0 1 1
0 1 2
1 2 2
1 2 3
1 2 4
2 3 5
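Note that explode keeps the original index, hence the repeated 0 and 1 labels above. If you want a fresh 0..n index like in the expected output, reset it; pandas 1.1+ also accepts ignore_index=True:

res = df.explode("b").reset_index(drop=True)
# or, on pandas >= 1.1:
res = df.explode("b", ignore_index=True)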
Solution for pandas versions < 0.25: a generic vectorized approach, which also works for DataFrames with multiple columns:
assuming we have the following DF:
In [159]: df
Out[159]:
a b c
0 1 [1, 2] 5
1 2 [2, 3, 4] 6
2 3 [5] 7
Solution:
In [160]: lst_col = 'b'
In [161]: pd.DataFrame({
     ...:     col: np.repeat(df[col].values, df[lst_col].str.len())
     ...:     for col in df.columns.difference([lst_col])
     ...: }).assign(**{lst_col: np.concatenate(df[lst_col].values)})[df.columns.tolist()]
Out[161]:
a b c
0 1 1 5
1 1 2 5
2 2 2 6
3 2 3 6
4 2 4 6
5 3 5 7
Setup:
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [[1, 2], [2, 3, 4], [5]],
    "c": [5, 6, 7]
})
Vectorized NumPy approach:
In [124]: pd.DataFrame({'a': np.repeat(df.a.values, df.b.str.len()),
     ...:               'b': np.concatenate(df.b.values)})
Out[124]:
a b
0 1 1
1 1 2
2 2 2
3 2 3
4 2 4
5 3 5
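The trick is that np.repeat and np.concatenate produce equal-length arrays that stay aligned: each scalar is repeated once per list element, and the lists are flattened in the same row order. A minimal illustration of the two building blocks on the question's two-column df:

df.b.str.len()                           # lengths 2, 3, 1 -- one per row
np.repeat(df.a.values, df.b.str.len())   # array([1, 1, 2, 2, 2, 3])
np.concatenate(df.b.values)              # array([1, 2, 2, 3, 4, 5])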
OLD answer:
Try this:
In [89]: df.set_index('a', append=True).b.apply(pd.Series).stack().reset_index(level=[0, 2], drop=True).reset_index()
Out[89]:
a 0
0 1 1.0
1 1 2.0
2 2 2.0
3 2 3.0
4 2 4.0
5 3 5.0
Or a bit nicer solution provided by @Boud:
In [110]: df.set_index('a').b.apply(pd.Series).stack().reset_index(level=-1, drop=True).astype(int).reset_index()
Out[110]:
a 0
0 1 1
1 1 2
2 2 2
3 2 3
4 2 4
5 3 5
Here is another approach with itertuples -
df = pd.DataFrame({"a" : [1,2,3], "b" : [[1,2],[2,3,4],[5]]})
data = []
for i in df.itertuples():
    lst = i[2]
    for col2 in lst:
        data.append([i[1], col2])
df_output = pd.DataFrame(data=data, columns=df.columns)
df_output
Output is -
a b
0 1 1
1 1 2
2 2 2
3 2 3
4 2 4
5 3 5
Edit: You can also compress the loops into a single list comprehension and populate data as:
data = [[i[1], col2] for i in df.itertuples() for col2 in i[2]]

Reshape MultiIndex dataframe to tabular format

Given a sample MultiIndex:
idx = pd.MultiIndex.from_product([[0, 1, 2], ['a', 'b', 'c', 'd']])
df = pd.DataFrame({'value' : np.arange(12)}, index=idx)
df
value
0 a 0
b 1
c 2
d 3
1 a 4
b 5
c 6
d 7
2 a 8
b 9
c 10
d 11
How can I efficiently convert this to a tabular format like so?
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Furthermore, given the dataframe above, how can I bring it back to its original multi-indexed state?
What I've tried:
pd.DataFrame(df.values.reshape(-1, df.index.levels[1].size),
             index=df.index.levels[0], columns=df.index.levels[1])
Which works for the first problem, but I'm not sure how to bring it back to its original from there.
Using unstack and stack
In [5359]: dff = df['value'].unstack()
In [5360]: dff
Out[5360]:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
In [5361]: dff.stack().to_frame('name')
Out[5361]:
name
0 a 0
b 1
c 2
d 3
1 a 4
b 5
c 6
d 7
2 a 8
b 9
c 10
d 11
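To recover the original frame exactly, pass the original column name to to_frame instead:

dff.stack().to_frame('value')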
By using get_level_values
pd.crosstab(df.index.get_level_values(0), df.index.get_level_values(1), values=df.value, aggfunc=np.sum)
Out[477]:
col_0 a b c d
row_0
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Another alternative, which you should think of when using stack/unstack (though unstack is clearly better in this case!) is pivot_table:
In [11]: df.pivot_table(values="value", index=df.index.get_level_values(0), columns=df.index.get_level_values(1))
Out[11]:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11

Finding the maximum entry based on another column in a data frame

Suppose I have a data frame with 3 columns: A, B, C. I want to group by column A, and find the row (for each unique A) with the maximum entry in C, so that I can store that row.A, row.B, row.C into a dictionary elsewhere.
What's the best way to do this without using iterrows?
# generate sample data
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 10, (10, 3)))
df.columns = ['A', 'B', 'C']
# sort by C, group by A, take last row of each group
df.sort_values('C').groupby('A').nth(-1)
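An equivalent one-liner in recent pandas, sketched here as an alternative: select each group's index of the maximum C with idxmax, then index back into the frame:

df.loc[df.groupby('A')['C'].idxmax()]  # one row per A: the row with that group's max C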
Here's another method. If df is the DataFrame, you can write df.groupby('A').apply(lambda d: d.loc[d['C'].idxmax()]).
For example,
In [96]: df
Out[96]:
A B C
0 1 0 3
1 3 0 4
2 0 4 5
3 2 4 0
4 3 1 1
5 1 6 2
6 3 6 0
7 4 0 1
8 2 3 4
9 0 5 0
10 7 6 5
11 3 1 2
In [97]: g = df.groupby('A').apply(lambda d: d['C'].idxmax())
In [98]: g
Out[98]:
A
0 2
1 0
2 8
3 1
4 7
7 10
dtype: int64
In [99]: df.loc[g.values]
Out[99]:
A B C
2 0 4 5
0 1 0 3
8 2 3 4
1 3 0 4
7 4 0 1
10 7 6 5
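Since the stated goal was to store row.A, row.B, row.C in a dictionary, you can build one directly from the result; a sketch keyed on A (keying on A is an illustrative choice):

rows = df.loc[df.groupby('A')['C'].idxmax()]
d = {row.A: (row.B, row.C) for row in rows.itertuples()}  # {A: (B, C)}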

Concatenate Two DataFrames With Hierarchical Columns

I would like to merge two DataFrames while creating a multilevel column naming scheme denoting which dataframe the rows came from. For example:
In [98]: A=pd.DataFrame(np.arange(9.).reshape(3,3),columns=list('abc'))
In [99]: A
Out[99]:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
In [100]: B=A.copy()
If I use pd.merge(), then I get
In [104]: pd.merge(A,B,left_index=True,right_index=True)
Out[104]:
a_x b_x c_x a_y b_y c_y
0 0 1 2 0 1 2
1 3 4 5 3 4 5
2 6 7 8 6 7 8
Which is what I expect with that statement, what I would like (but I don't know how to get!) is:
In [104]: <<one or more statements>>
Out[104]:
A B
a b c a b c
0 0 1 2 0 1 2
1 3 4 5 3 4 5
2 6 7 8 6 7 8
Can this be done without changing the original pd.DataFrame calls? I am reading the data in the dataframes in from .csv files and that might be my problem.
The first form below can order the A and B groups arbitrarily (only the order of the top-level keys, not the columns within each group); the second preserves the ordering of the keys list. IMHO this is pandonic!
In [5]: pd.concat(dict(A=A, B=B), axis=1)
Out[5]:
A B
a b c a b c
0 0 1 2 0 1 2
1 3 4 5 3 4 5
2 6 7 8 6 7 8
In [6]: pd.concat([A, B], keys=['A', 'B'], axis=1)
Out[6]:
A B
a b c a b c
0 0 1 2 0 1 2
1 3 4 5 3 4 5
2 6 7 8 6 7 8
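If you also want the new top level of the columns to carry a name, concat accepts a names argument for the resulting levels; a small sketch (the level name 'source' is an arbitrary choice, and the inner level is left unnamed):

pd.concat([A, B], keys=['A', 'B'], axis=1, names=['source', None])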
Here's one way, which does change A and B:
In [10]: from itertools import cycle
In [11]: A.columns = pd.MultiIndex.from_tuples(zip(cycle('A'), A.columns))
In [12]: A
Out[12]:
A
a b c
0 0 1 2
1 3 4 5
2 6 7 8
In [13]: B.columns = pd.MultiIndex.from_tuples(zip(cycle('B'), B.columns))
In [14]: A.join(B)
Out[14]:
A B
a b c a b c
0 0 1 2 0 1 2
1 3 4 5 3 4 5
2 6 7 8 6 7 8
I actually think this would be a good alternative behaviour, rather than suffixes...

python pandas groupby() result

I have the following python pandas data frame:
df = pd.DataFrame({
    'A': [1,1,1,1,2,2,2,3,3,4,4,4],
    'B': [5,5,6,7,5,6,6,7,7,6,7,7],
    'C': [1,1,1,1,1,1,1,1,1,1,1,1]
})
df
A B C
0 1 5 1
1 1 5 1
2 1 6 1
3 1 7 1
4 2 5 1
5 2 6 1
6 2 6 1
7 3 7 1
8 3 7 1
9 4 6 1
10 4 7 1
11 4 7 1
I would like to have another column storing a value of a sum over C values for fixed (both) A and B. That is, something like:
A B C D
0 1 5 1 2
1 1 5 1 2
2 1 6 1 1
3 1 7 1 1
4 2 5 1 1
5 2 6 1 2
6 2 6 1 2
7 3 7 1 2
8 3 7 1 2
9 4 6 1 1
10 4 7 1 2
11 4 7 1 2
I have tried with pandas groupby and it kind of works:
res = {}
for a, group_by_A in df.groupby('A'):
    group_by_B = group_by_A.groupby('B', as_index=False)
    res[a] = group_by_B['C'].sum()
but I don't know how to get the results from res back into df in an orderly fashion. Would be very happy with any advice on this. Thank you.
Here's one way (though it feels like this should work in one go with an apply, I can't get it to).
In [11]: g = df.groupby(['A', 'B'])
In [12]: df1 = df.set_index(['A', 'B'])
The groupby size function is the one you want; we have to match it to the frame with 'A' and 'B' as the index:
In [13]: df1['D'] = g.size() # unfortunately this doesn't play nice with as_index=False
# Same would work with g['C'].sum()
In [14]: df1.reset_index()
Out[14]:
A B C D
0 1 5 1 2
1 1 5 1 2
2 1 6 1 1
3 1 7 1 1
4 2 5 1 1
5 2 6 1 2
6 2 6 1 2
7 3 7 1 2
8 3 7 1 2
9 4 6 1 1
10 4 7 1 2
11 4 7 1 2
You could also do a one-liner using transform applied to the groupby:
df['D'] = df.groupby(['A','B'])['C'].transform('sum')
You could also do a one-liner using merge as follows (with this data, size and the sum of C coincide, since C is all ones):
df = df.merge(pd.DataFrame({'D': df.groupby(['A', 'B'])['C'].size()}), left_on=['A', 'B'], right_index=True)
You can also use this method:
columns = ['col1', 'col2', ...]
df.groupby('col')[columns].sum()
If you want, you can also chain .sort_values(by='colx', ascending=True/False) after .sum() to sort the final output by a specific column (colx) in ascending or descending order.
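For instance, with the df from this question (grouping by 'A' is an illustrative choice):

df.groupby('A')[['C']].sum().sort_values(by='C', ascending=False)  # group totals of C, largest first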
