Is there any way to merge two data frames when one of them has duplicated indices, such as the following:
dataframe A:

     value
key
a        1
b        2
b        3
b        4
c        5
a        6

dataframe B:

    number
key
a        I
b        X
c        V
after merging, I want to have a data frame like the following:
     value number
key
a        1      I
b        2      X
b        3      X
b        4      X
c        5      V
a        6      I
Or maybe there are better ways to do it using groupby?
Use join:

>>> a = pd.DataFrame(range(1, 7), index=list('abbbca'), columns=['value'])
>>> b = pd.DataFrame(['I', 'X', 'V'], index=list('abc'), columns=['number'])
>>> a.join(b)
   value number
a      1      I
a      6      I
b      2      X
b      3      X
b      4      X
c      5      V

join aligns on the index and repeats b's row for every duplicate key in a. To restore the original row order, sort afterwards:

>>> a.join(b).sort_values('value')
   value number
a      1      I
b      2      X
b      3      X
b      4      X
c      5      V
a      6      I
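The same alignment can also be expressed with merge on the two indexes (a sketch equivalent to the join above, not part of the original answer):

>>> pd.merge(a, b, left_index=True, right_index=True).sort_values('value')
   value number
a      1      I
b      2      X
b      3      X
b      4      X
c      5      V
a      6      I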
I have two DataFrames
df_1:
idx A X
0 1 A
1 2 B
2 3 C
3 4 D
4 1 E
5 2 F
and
df_2:
idx B Y
0 1 H
1 2 I
2 4 J
3 2 K
4 3 L
5 1 M
My goal is to get the following:
df_result:
idx A X B Y
0 1 A 1 H
1 2 B 2 I
2 4 D 4 J
3 2 F 2 K
I am trying to match on both the A and B columns, based on the column B from df_2.
Columns A and B repeat their content after reaching 4. The order matters here, and because of that the row from df_1 with idx = 4 does not match the one from df_2 with idx = 5.
I was trying to use:
matching = list(set(df_1["A"]) & set(df_2["B"]))
and then
df1_filt = df_1[df_1['A'].isin(matching)]
df2_filt = df_2[df_2['B'].isin(matching)]
But this does not take the order into consideration.
I am looking for a solution without many for loops.
Edit:
df_result = (pd.merge_asof(left=df_1, right=df_2,
                           left_on='idx', right_on='idx',
                           left_by='A', right_by='B',
                           direction='backward', tolerance=2)
             .dropna()
             .drop(labels='idx', axis='columns')
             .reset_index(drop=True))
Gets me what I want.
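For context, a runnable reconstruction (frames rebuilt from the question's tables) showing what merge_asof is doing here:

import pandas as pd

df_1 = pd.DataFrame({'idx': range(6), 'A': [1, 2, 3, 4, 1, 2],
                     'X': list('ABCDEF')})
df_2 = pd.DataFrame({'idx': range(6), 'B': [1, 2, 4, 2, 3, 1],
                     'Y': list('HIJKLM')})

# merge_asof pairs each df_1 row with the nearest earlier-or-equal df_2 row
# in the same letter group (A == B), provided idx differs by at most
# `tolerance`. df_1's idx = 4 (A = 1) therefore stays unmatched: the nearest
# earlier B = 1 sits at idx = 0, four steps away, beyond tolerance=2.
df_result = (pd.merge_asof(df_1, df_2, on='idx',
                           left_by='A', right_by='B',
                           direction='backward', tolerance=2)
             .dropna()
             .drop(columns='idx')
             .reset_index(drop=True))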
IIUC this should work:
df_result = df_1.merge(df_2, left_on=['idx', 'A'], right_on=['idx', 'B'])
I have two dataframes:
ONE=pd.read_csv('ONE.csv')
value_one value_two
2 4
3 1
4 2
TWO=pd.read_csv('TWO.csv')
X 1 2 3 4 5 6 7 8
1 a c j a d c c d
2 c k a d c c d e
3 f c k a d c c d
4 c k a d c c d j
I need to create an additional column in the ONE dataframe (ONE['result']) under these conditions:
if value_one equals a column header of dataframe TWO,
and value_two equals a value in the X column of TWO,
put the value at that intersection into the new column.
expected result:
value_one value_two result
2 4 k
3 1 j
4 2 d
I tried to compare only against the header, with ONE['value_one'] == TWO.iloc[0].
Thank you,
S.
lookup
You can use lookup on your second dataframe:
df_two = df_two.set_index('X') # set 'X' column as index
df_two.columns = df_two.columns.astype(int) # ensure column labels are numeric
df_one['result'] = df_two.lookup(df_one['value_two'], df_one['value_one'])
print(df_one)
value_one value_two result
0 2 4 k
1 3 1 j
2 4 2 d
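Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On current pandas the same lookup can be done with positional indexing. A self-contained sketch, with the frames rebuilt from the question and TWO already indexed by X:

import pandas as pd

# Frames rebuilt from the question: df_one = ONE, df_two = TWO indexed by X.
df_one = pd.DataFrame({'value_one': [2, 3, 4], 'value_two': [4, 1, 2]})
df_two = pd.DataFrame([list('acjadccd'), list('ckadccde'),
                       list('fckadccd'), list('ckadccdj')],
                      index=pd.Index([1, 2, 3, 4], name='X'),
                      columns=range(1, 9))

# Translate the row/column labels to positions, then index the raw array.
rows = df_two.index.get_indexer(df_one['value_two'])
cols = df_two.columns.get_indexer(df_one['value_one'])
df_one['result'] = df_two.to_numpy()[rows, cols]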
I have a question about the following table:
A B C
1 A A
2 A A.B
3 B B.C
4 A,B A.A,A.B,B.C
Column A is an index (1 through 4). Column B lists the letters that appear in column C before the dot, where there is one. If there is no dot, the prefix is implicit: the entry in (C, 1) = A stands for the letter after the dot, i.e. it is shorthand for A.A.
Column C lists either both letters around the dot or only the letter after it.
The idea is to split these entries up. Column C should first be split on the comma into separate rows (that works). The problem arises whenever several letters are possible in B: after the split, B should also contain only one letter, the one that fits the entry in column C.
So the result should look like this:
A B C
1 A A
2 A B
3 B C
4 A A
4 A B
4 B C
Can someone help me ensure that column B contains the correct (i.e. matching) letter for each entry in column C?
Thanks and kind regards.
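For reference, the answers below operate on this reconstruction of the question's table:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ['A', 'A', 'B', 'A,B'],
                   'C': ['A', 'A.B', 'B.C', 'A.A,A.B,B.C']})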
First, stack your dataframe to get your combinations:
out = (
    df.set_index(['A', 'B']).C
    .str.split(',').apply(pd.Series)
    .stack().reset_index([0, 1]).drop('B', axis=1)
)
A 0
0 1 A
1 2 A.B
2 3 B.C
3 4 A.A
4 4 A.B
5 4 B.C
Then replace single entries with their counterpart and apply pd.Series:
(out.set_index('A')[0]
    .str.replace(r'^([A-Z])$', r'\1.\1', regex=True)
    .str.split('.').apply(pd.Series)
    .reset_index()
).rename(columns={0: 'B', 1: 'C'})
Output:
A B C
0 1 A A
1 2 A B
2 3 B C
3 4 A A
4 4 A B
5 4 B C
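On pandas 0.25+ the same result can be reached with explode, avoiding the stack/apply round trip (a sketch, not part of the original answer):

# Split the comma lists into rows, then split each entry on the dot.
out = df.assign(C=df['C'].str.split(',')).explode('C')
parts = (out['C'].str.replace(r'^([A-Z])$', r'\1.\1', regex=True)  # 'A' -> 'A.A'
                 .str.split('.', expand=True))
out[['B', 'C']] = parts.to_numpy()
out = out.reset_index(drop=True)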
With a comprehension
def m0(x):
"""Take a string, return a dictionary split on '.' or a self mapping"""
if '.' in x:
return dict([x.split('.')])
else:
return {x: x}
def m1(s):
"""split string on ',' then do the dictionary thing in m0"""
return [*map(m0, s.split(','))]
pd.DataFrame([
(a, b, m[b])
for a, B, C in df.itertuples(index=False)
for b in B.split(',')
for m in m1(C) if b in m
], df.index.repeat(df.C.str.count(',') + 1), df.columns)
A B C
0 1 A A
1 2 A B
2 3 B C
3 4 A A
3 4 A B
3 4 B C
I have data in this format
ID Val
1 A
1 B
1 C
2 A
2 C
2 D
I want to group the data by ID, look at which pairwise combinations of Val occur together within an ID, and sum the counts of each combination across IDs. The resulting output should look like
v1 v2 count
A B 1
A C 2
A D 1
B C 1
C D 1
Is there a smart way to get this instead of looping through each possible combination?
this should work:
>>> ts = df.groupby('Val')['ID'].aggregate(lambda ts: set(ts))
>>> ts
Val
A set([1, 2])
B set([1])
C set([1, 2])
D set([2])
Name: ID, dtype: object
>>> from itertools import product
>>> pd.DataFrame([[i, j, len(ts[i] & ts[j])] for i, j in product(ts.index, ts.index) if i < j],
... columns=['v1', 'v2', 'count'])
v1 v2 count
0 A B 1
1 A C 2
2 A D 1
3 B C 1
4 B D 0
5 C D 1
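A more compact variant of the same idea (a sketch, not from the original answer) builds the pairs per ID with itertools.combinations and counts them with explode and value_counts:

from itertools import combinations

import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2], 'Val': list('ABCACD')})

# One sorted pair per within-ID combination, counted across all IDs;
# pairs that never co-occur (like B, D) simply do not appear.
pairs = (df.groupby('ID')['Val']
           .apply(lambda s: list(combinations(sorted(s), 2)))
           .explode()
           .value_counts())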
What I came up with:
1. Use pd.merge to create the cartesian product.
2. Filter the cartesian product to include only combinations of the desired form.
3. Count the number of combinations.
4. Convert to the desired dataframe format.
Unsure if it is faster than looping through all possible combinations.
import pandas as pd
# Create the dataframe
df = pd.DataFrame([
[1, 'A'],
[1, 'B'],
[1, 'C'],
[2, 'A'],
[2, 'C'],
[2, 'D'],
], columns=['ID', 'Val'])
'''
ID Val
0 1 A
1 1 B
2 1 C
3 2 A
4 2 C
5 2 D
[6 rows x 2 columns]
'''
# Create the cartesian product
df2 = pd.merge(df, df, on='ID')
'''
ID Val_x Val_y
0 1 A A
1 1 A B
2 1 A C
3 1 B A
4 1 B B
5 1 B C
6 1 C A
7 1 C B
8 1 C C
9 2 A A
10 2 A C
11 2 A D
12 2 C A
13 2 C C
14 2 C D
15 2 D A
16 2 D C
17 2 D D
[18 rows x 3 columns]
'''
# Count the pairs, keeping only ordered pairs (v[0] < v[1]) so that
# (A, A) self-pairs and reversed (B, A) duplicates are excluded.
counts = pd.Series([
    v for v in zip(df2.Val_x, df2.Val_y)
    if v[0] < v[1]
]).value_counts(sort=False).sort_index()
'''
(A, B) 1
(A, C) 2
(A, D) 1
(B, C) 1
(C, D) 1
dtype: int64
'''
# Combine the counts
df3 = pd.DataFrame(dict(
v1=[v1 for v1, _ in counts.index],
v2=[v2 for _, v2 in counts.index],
count=counts.values
))
'''
count v1 v2
0 1 A B
1 2 A C
2 1 A D
3 1 B C
4 1 C D
'''
Suppose I create a pandas DataFrame with two columns, one of which contains some numbers and the other contains letters. Like this:
import pandas as pd
from pprint import pprint
df = pd.DataFrame({'a': [1,2,3,4,5,6], 'b': ['y','x','y','x','y', 'y']})
pprint(df)
a b
0 1 y
1 2 x
2 3 y
3 4 x
4 5 y
5 6 y
Now say that I want to make a third column (c) whose value is equal to the last value of a at which b was equal to x. In rows where no x has been encountered in b yet, the value in c should default to 0.
The following loop produces the desired result:
last_a = 0
c = []
for i,b in enumerate(df['b']):
if b == 'x':
last_a = df.iloc[i]['a']
c += [last_a]
df['c'] = c
pprint(df)
a b c
0 1 y 0
1 2 x 2
2 3 y 2
3 4 x 4
4 5 y 4
5 6 y 4
Is there a more elegant way to accomplish this either with or without pandas?
In [140]: df = pd.DataFrame({'a': [1,2,3,4,5,6], 'b': ['y','x','y','x','y', 'y']})
In [141]: df
Out[141]:
a b
0 1 y
1 2 x
2 3 y
3 4 x
4 5 y
5 6 y
Find the rows where column 'b' == 'x' and take the corresponding values of 'a'; assigning them aligns on the index and leaves NaN everywhere else:
In [142]: df['c'] = df.loc[df['b'] == 'x', 'a']
Fill the rest of the values forward, then fill holes with 0
In [143]: df['c'] = df['c'].ffill().fillna(0)
In [144]: df
Out[144]:
a b c
0 1 y 0
1 2 x 2
2 3 y 2
3 4 x 4
4 5 y 4
5 6 y 4
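A one-expression variant of the same idea (a sketch, not from the original answer):

# Keep 'a' only where b == 'x', forward-fill the gaps, default the head to 0.
df['c'] = df['a'].where(df['b'] == 'x').ffill().fillna(0).astype(int)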