Pandas inner merge/join returning all rows

Pandas inner merge/join returning all rows - python

I'm trying to merge two data frames based on a column present in both, keeping only the intersection of the two sets.
The desired result is:
foo bar foobar
x y z x j i x y z j i
a 1 2 a 9 0 a 1 2 9 0
b 3 4 b 9 0 b 3 4 9 0
c 5 6 c 9 0 c 5 6 9 0
d 7 8 e 9 0
f 9 0
My code that does not produce the desired result is:
pd.merge(foo, bar, how='inner', on='x')
Instead, the code seems to return:
foo bar foobar
x y z x j i x y z j i
a 1 2 a 9 0 a 1 2 9 0
b 3 4 b 9 0 b 3 4 9 0
c 5 6 c 9 0 c 5 6 9 0
d 7 8 e 9 0 e * * 9 0
f 9 0 f * * 9 0
(where * represents an NaN)
Where am I going wrong? I've already reached the third Google page trying to fix this an nothing works. Whatever I do I get an outer join, with all rows in both sets.

Usually it means that you have duplicates in the column(s) used for joining, resulting in cartesian product.
Demo:
In [35]: foo
Out[35]:
x y z
0 a 1 2
1 b 3 4
2 c 5 6
3 d 7 8
In [36]: bar
Out[36]:
x j i
0 a 9 0
1 b 9 0
2 a 9 0
3 a 9 0
4 b 9 0
In [37]: pd.merge(foo, bar)
Out[37]:
x y z j i
0 a 1 2 9 0
1 a 1 2 9 0
2 a 1 2 9 0
3 b 3 4 9 0
4 b 3 4 9 0

Related

Can You Preserve Column Order When Pandas Dataframe.Combine Or DataFrame.Combine_First?

If you have 2 dataframes, represented as:
A F Y
0 1 2 3
1 4 5 6
And
B C T
0 7 8 9
1 10 11 12
When combining it becomes:
A B C F T Y
0 1 7 8 2 9 3
1 4 10 11 5 12 6
I would like it to become:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
How do I combine 1 data frame with another but keep the original column order?

In [1294]: new_df = df.join(df1)
In [1295]: new_df
Out[1295]:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
OR you can also use pd.merge(not a very clean solution though)
In [1297]: df['tmp' ] =1
In [1298]: df1['tmp'] = 1
In [1309]: pd.merge(df, df1, on=['tmp'], left_index=True, right_index=True).drop('tmp', 1)
Out[1309]:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12

Creating a list of sliced dataframes

I am trying to create a list of dataframes where each dataframe is 3 rows of a larger dataframe.
dframes = [df[0:3], df[3:6],...,df[2000:2003]]
I am still fairly new to programming, why does:
x = 3
dframes = []
for i in range(0, len(df)):
dframes = dframes.append(df[i:x])
i = x
x = x + 3
dframes = dframes.append(df[i:x])
AttributeError: 'NoneType' object has no attribute 'append'

Use np.split
Setup
Consider the dataframe df
df = pd.DataFrame(dict(A=range(15), B=list('abcdefghijklmno')))
Solution
dframes = np.split(df, range(3, len(df), 3))
Output
for d in dframes:
print(d, '\n')
A B
0 0 a
1 1 b
2 2 c
A B
3 3 d
4 4 e
5 5 f
A B
6 6 g
7 7 h
8 8 i
A B
9 9 j
10 10 k
11 11 l
A B
12 12 m
13 13 n
14 14 o

Python raise this error because function append return None and next time in your loot variable dframes will be None
You can use this:
[list(dframes[i:i+3]) for i in range(0, len(dframes), 3)]

You can use list comprehension with groupby by numpy array created by length of index floor divided by 3:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(10,5)), columns=list('ABCDE'))
print (df)
A B C D E
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
5 4 3 7 1 1
6 7 7 0 2 9
7 9 3 2 5 8
8 1 0 7 6 2
9 0 8 2 5 1
dfs = [x for i, x in df.groupby(np.arange(len(df.index)) // 3)]
print (dfs)
[ A B C D E
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8, A B C D E
3 4 0 9 6 2
4 4 1 5 3 4
5 4 3 7 1 1, A B C D E
6 7 7 0 2 9
7 9 3 2 5 8
8 1 0 7 6 2, A B C D E
9 0 8 2 5 1]
If default monotonic index (0,1,2...) solution can be simplify:
dfs = [x for i, x in df.groupby(df.index // 3)]

top k columns with values in pandas dataframe for every row

I have a pandas dataframe like the following:
A B C D
0 7 2 5 2
1 3 3 1 1
2 0 2 6 1
3 3 6 2 9
There can be 100s of columns, in the above example I have only shown 4.
I would like to extract top-k columns for each row and their values.
I can get the top-k columns using:
pd.DataFrame({n: df.T[column].nlargest(k).index.tolist() for n, column in enumerate(df.T)}).T
which, for k=3 gives:
0 1 2
0 A C B
1 A B C
2 C B D
3 D B A
But what I would like to have is:
0 1 2 3 4 5
0 A 7 C 5 B 2
1 A 3 B 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3
Is there a pand(a)oic way to achieve this?

You can use numpy solution:
numpy.argsort for columns names
array already sort (thanks Jeff), need values by indices
interweave for new array
DataFrame constructor
k = 3
vals = df.values
arr1 = np.argsort(-vals, axis=1)
a = df.columns[arr1[:,:k]]
b = vals[np.arange(len(df.index))[:,None], arr1][:,:k]
c = np.empty((vals.shape[0], 2 * k), dtype=a.dtype)
c[:,0::2] = a
c[:,1::2] = b
print (c)
[['A' 7 'C' 5 'B' 2]
['A' 3 'B' 3 'C' 1]
['C' 6 'B' 2 'D' 1]
['D' 9 'B' 6 'A' 3]]
df = pd.DataFrame(c)
print (df)
0 1 2 3 4 5
0 A 7 C 5 B 2
1 A 3 B 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3

>>> def foo(x):
... r = []
... for p in zip(list(x.index), list(x)):
... r.extend(p)
... return r
...
>>> pd.DataFrame({n: foo(df.T[row].nlargest(k)) for n, row in enumerate(df.T)}).T
0 1 2 3 4 5
0 A 7 C 5 B 2
1 A 3 B 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3
Or, using list comprehension:
>>> def foo(x):
... return [j for i in zip(list(x.index), list(x)) for j in i]
...
>>> pd.DataFrame({n: foo(df.T[row].nlargest(k)) for n, row in enumerate(df.T)}).T
0 1 2 3 4 5
0 A 7 C 5 B 2
1 A 3 B 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3

This does the job efficiently : It uses argpartition that found the n biggest in O(n), then sort only them.
values=df.values
n,m=df.shape
k=4
I,J=mgrid[:n,:m]
I=I[:,:1]
if k<m: J=(-values).argpartition(k)[:,:k]
values=values[I,J]
names=np.take(df.columns,J)
J2=(-values).argsort()
names=names[I,J2]
values=values[I,J2]
names_and_values=np.empty((n,2*k),object)
names_and_values[:,0::2]=names
names_and_values[:,1::2]=values
result=pd.DataFrame(names_and_values)
For
0 1 2 3 4 5
0 A 7 C 5 B 2
1 B 3 A 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3

Split Python Dataframe into multiple Dataframes (where chosen rows are the same)

I would like to split one DataFrame into N Dataframes based on columns X and Z where they are the same (as eachother by column value).
For example, this input:
df =
NAME X Y Z Other
0 a 1 1 1 1
1 b 1 1 2 2
2 c 1 2 1 3
3 d 1 2 2 4
4 e 1 1 1 5
5 f 2 1 2 6
6 g 2 2 1 7
7 h 2 2 2 8
8 i 2 1 1 9
9 j 2 1 2 0
Would have this output:
df_group_0 =
NAME X Y Z Other
0 a 1 1 1 1
2 c 1 2 1 3
4 e 1 1 1 5
df_group_1 =
NAME X Y Z Other
1 b 1 1 2 2
3 d 1 2 2 4
df_group_2 =
NAME X Y Z Other
6 g 2 2 1 7
8 i 2 1 1 9
df_group_3 =
NAME X Y Z Other
7 h 2 2 2 8
9 j 2 1 2 0
Is this possible?

groupby generates an iterator of tuples with the first element be the group id, so if you iterate through the groupers and extract the second element from each tuple, you can get a list of data frames each having a unique group:
grouper = [g[1] for g in df.groupby(['X', 'Z'])]
grouper[0]
NAME X Y Z Other
0 a 1 1 1 1
2 c 1 2 1 3
4 e 1 1 1 5
grouper[1]
NAME X Y Z Other
1 b 1 1 2 2
3 d 1 2 2 4
grouper[2]
NAME X Y Z Other
6 g 2 2 1 7
8 i 2 1 1 9
grouper[3]
NAME X Y Z Other
5 f 2 1 2 6
7 h 2 2 2 8
9 j 2 1 2 0

Opposite of melt in python pandas

I cannot figure out how to do "reverse melt" using Pandas in python.
This is my starting data
import pandas as pd
from StringIO import StringIO
origin = pd.read_table(StringIO('''label type value
x a 1
x b 2
x c 3
y a 4
y b 5
y c 6
z a 7
z b 8
z c 9'''))
origin
Out[5]:
label type value
0 x a 1
1 x b 2
2 x c 3
3 y a 4
4 y b 5
5 y c 6
6 z a 7
7 z b 8
8 z c 9
This is the output I would like to have:
label a b c
x 1 2 3
y 4 5 6
z 7 8 9
I'm sure there is an easy way to do this, but I don't know how.

there are a few ways;
using .pivot:
>>> origin.pivot(index='label', columns='type')['value']
type a b c
label
x 1 2 3
y 4 5 6
z 7 8 9
[3 rows x 3 columns]
using pivot_table:
>>> origin.pivot_table(values='value', index='label', columns='type')
value
type a b c
label
x 1 2 3
y 4 5 6
z 7 8 9
[3 rows x 3 columns]
or .groupby followed by .unstack:
>>> origin.groupby(['label', 'type'])['value'].aggregate('mean').unstack()
type a b c
label
x 1 2 3
y 4 5 6
z 7 8 9
[3 rows x 3 columns]

DataFrame.set_index + DataFrame.unstack
df.set_index(['label','type'])['value'].unstack()
type a b c
label
x 1 2 3
y 4 5 6
z 7 8 9
simplifying the passing of pivot arguments
df.pivot(*df)
type a b c
label
x 1 2 3
y 4 5 6
z 7 8 9
[*df]
#['label', 'type', 'value']
For expected output we need DataFrame.reset_index and DataFrame.rename_axis
df.pivot(*df).rename_axis(columns = None).reset_index()
label a b c
0 x 1 2 3
1 y 4 5 6
2 z 7 8 9
if there are duplicates in a,b columns we could lose information so we need GroupBy.cumcount
print(df)
label type value
0 x a 1
1 x b 2
2 x c 3
3 y a 4
4 y b 5
5 y c 6
6 z a 7
7 z b 8
8 z c 9
0 x a 1
1 x b 2
2 x c 3
3 y a 4
4 y b 5
5 y c 6
6 z a 7
7 z b 8
8 z c 9
df.pivot_table(index = ['label',
df.groupby(['label','type']).cumcount()],
columns = 'type',
values = 'value')
type a b c
label
x 0 1 2 3
1 1 2 3
y 0 4 5 6
1 4 5 6
z 0 7 8 9
1 7 8 9
Or:
(df.assign(type_2 = df.groupby(['label','type']).cumcount())
.set_index(['label','type','type_2'])['value']
.unstack('type'))

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas inner merge/join returning all rows - python

Related

Can You Preserve Column Order When Pandas Dataframe.Combine Or DataFrame.Combine_First?

Creating a list of sliced dataframes

top k columns with values in pandas dataframe for every row

Split Python Dataframe into multiple Dataframes (where chosen rows are the same)

Opposite of melt in python pandas

Categories

Resources