Combine and expand Dataframe based on IDs in columns - python

I have 3 dataframes A,B,C:
import pandas as pd
import numpy as np

A = pd.DataFrame({"id": [1, 2],
                  "connected_to_B_id1": ["A", "B"],
                  "connected_to_B_id2": ["B", "C"],
                  "connected_to_B_id3": ["C", np.nan],
                  # an entry can have multiple ids from B
                  })
B = pd.DataFrame({"id": ["A", "B", "C"],
                  "connected_to_C_id1": [1, 1, 2],
                  "connected_to_C_id2": [2, 2, np.nan],
                  # an entry can have multiple ids from C
                  })
C = pd.DataFrame({"id": [1, 2],
                  "name": ["a", "b"],
                  })
# Output should be D:
D = pd.DataFrame({"id_A": [1, 1, 1, 1, 1, 2, 2, 2],
                  "id_B": ["A", "A", "B", "B", "C", "B", "B", "C"],
                  "id_C": [1, 2, 1, 2, 2, 1, 2, 1],
                  "name": ["a", "b", "a", "b", "b", "a", "b", "a"]
                  })
I want to use the IDs stored in the "connected_to_X" columns of each dataframe to build a single dataframe that contains all relationships recorded in the three individual dataframes.
What is the most elegant way to combine the dataframes A, B and C into D?
Currently I am using dicts, lists and for loops, and it's messy and complicated.
D:
|idx |id_A|id_B|id_C|name|
|---:|--:|--:|--:|--:|
| 0 | 1 | A | 1 | a |
| 1 | 1 | A | 2 | b |
| 2 | 1 | B | 1 | a |
| 3 | 1 | B | 2 | b |
| 4 | 1 | C | 2 | b |
| 5 | 2 | B | 1 | a |
| 6 | 2 | B | 2 | b |
| 7 | 2 | C | 1 | a |

You just need to unpivot A and B, then you can join the tables up.
(A
 .melt(id_vars='id')                                   # unpivot A: one row per (A id, B id) link
 .merge(B.melt(id_vars='id'),                          # unpivot B and join on the B id
        left_on='value', right_on='id', how='left')
 .merge(C, left_on='value_y', right_on='id')           # join the C ids to pick up the names
 .drop(columns=['variable_x', 'variable_y', 'value_x'])
 .sort_values(['id_x', 'id_y'])
 .reset_index(drop=True)
 .reset_index()
)
index id_x id_y value_y id name
0 0 1 A 1.0 1 a
1 1 1 A 2.0 2 b
2 2 1 B 1.0 1 a
3 3 1 B 2.0 2 b
4 4 1 C 2.0 2 b
5 5 2 B 1.0 1 a
6 6 2 B 2.0 2 b
7 7 2 C 2.0 2 b
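If you want the result to carry D's column names exactly, the same chain can rename and trim the columns at the end. A sketch (the id_A/id_B/id_C labels are just the ones the question asks for):
D = (A.melt(id_vars='id')
      .merge(B.melt(id_vars='id'), left_on='value', right_on='id', how='left')
      .merge(C, left_on='value_y', right_on='id')
      .drop(columns=['variable_x', 'variable_y', 'value_x', 'value_y'])   # keep only the three id columns and name
      .rename(columns={'id_x': 'id_A', 'id_y': 'id_B', 'id': 'id_C'})
      .sort_values(['id_A', 'id_B'])
      .reset_index(drop=True))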

Related

Movement of a specific string from one column to another

Hi, I have a DataFrame which has values like:
| ID | Value | comments |
|---:|:------|:---------|
|  1 | a     |          |
|  2 | b     |          |
|  3 | a;b;c |          |
|  4 | b;c   |          |
|  5 | d;a;c |          |
I need to move a and b from Value to comments for all the rows they appear in, so that only values other than a and b remain in Value.
The new df would look like this:
| ID | Value | comments |
|---:|:------|:---------|
|  1 |       | a        |
|  2 |       | b        |
|  3 | c     | a;b      |
|  4 | c     | b        |
|  5 | d;c   | a        |
Can you give me a pointer on where I should look for the answer to this?
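For reference, a minimal construction of the sample frame (inferred from the table above) so the snippets below can be run end to end:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'Value': ['a', 'b', 'a;b;c', 'b;c', 'd;a;c'],
                   'comments': ['', '', '', '', '']})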
(i) Use str.split to split on ';' and explode the "Value" column.
(ii) Use boolean indexing to pick out the rows where 'a' or 'b' exist, then group by the original index and join them back with ';' as the separator.
exploded_series = df['Value'].str.split(';').explode()   # one row per item, original index preserved
mask = exploded_series.isin(['a', 'b'])
df['comments'] = exploded_series[mask].groupby(level=0).apply(';'.join)    # a/b items go to comments
df['Value'] = exploded_series[~mask].groupby(level=0).apply(';'.join)      # everything else stays in Value
df = df.fillna('')   # rows with nothing left become empty strings
Output:
ID Value comments
0 1 a
1 2 b
2 3 c a;b
3 4 c b
4 5 d;c a
Explode your Value column, then label each value with the column it belongs to:
import numpy as np

out = df.assign(Value=df['Value'].str.split(';')).explode('Value')
out['col'] = np.where(out['Value'].isin(['a', 'b']), 'comments', 'Value')
print(out)
# Intermediate output
ID Value comments col
0 1 a NaN comments
1 2 b NaN comments
2 3 a NaN comments
2 3 b NaN comments
2 3 c NaN Value
3 4 b NaN comments
3 4 c NaN Value
4 5 d NaN Value
4 5 a NaN comments
4 5 c NaN Value
Now pivot your dataframe:
out = out.pivot_table(index='ID', columns='col', values='Value', aggfunc=';'.join) \
         .fillna('').reset_index().rename_axis(columns=None)
print(out)
# Final output
ID Value comments
0 1 a
1 2 b
2 3 c a;b
3 4 c b
4 5 d;c a

Python: how do I filter data as long as a group contains any of a particular value

df = pd.DataFrame({'VisitID':[1,1,1,1,2,2,2,3,3,4,4], 'Item':['A','B','C','D','A','D','B','B','C','D','C']})
I have a dataset like this:
VisitID | Item |
1 | A |
1 | B |
1 | C |
1 | D |
2 | A |
2 | D |
2 | B |
3 | B |
3 | C |
4 | D |
4 | C |
I want to return the rows of a VisitID as long as that VisitID had an occurrence of item A OR B. How do I go about this? Expected Result:
VisitID | Item |
1 | A |
1 | B |
1 | C |
1 | D |
2 | A |
2 | D |
2 | B |
3 | B |
3 | C |
In R, I can do this via
library(dplyr)
df %>% group_by(VisitID) %>% filter(any(Item %in% c('A', 'B')))
How can I perform this in Python?
Something like df.groupby(['VisitID']).query(any(['A','B']))?
The syntax is similar, just use groupby.filter:
df.groupby('VisitID').filter(lambda g: g.Item.isin(['A','B']).any())
VisitID Item
0 1 A
1 1 B
2 1 C
3 1 D
4 2 A
5 2 D
6 2 B
7 3 B
8 3 C
To extract the groups containing either value, we can use groupby().transform('any') on isin():
s = (df.Item.isin(['A', 'B'])          # flag rows whose Item is A or B
       .groupby(df['VisitID'])
       .transform('any'))              # True for every row of a group that has at least one flagged row
df[s]
Output:
VisitID Item
0 1 A
1 1 B
2 1 C
3 1 D
4 2 A
5 2 D
6 2 B
7 3 B
8 3 C

Replace Some Columns of DataFrame with Another (Based on Column Names)

I have a DataFrame df1:
| A | B | C | D |
-----------------
| 0 | 1 | 3 | 4 |
| 2 | 1 | 8 | 4 |
| 0 | 2 | 3 | 1 |
and a DataFrame df2:
| A | D |
---------
| 2 | 2 |
| 3 | 2 |
| 1 | 9 |
I want to replace columns A and D of df1 with the equivalent columns of df2.
Surely I could do something like
df1['A'] = df2['A']
df1['D'] = df2['D']
But I need a solution for doing this automatically since I have thousands of columns.
You can use combine_first:
df2.combine_first(df1)
# A B C D
#0 2 1.0 3.0 2
#1 3 1.0 8.0 2
#2 1 2.0 3.0 9
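Note that combine_first upcasts B and C to float here (the alignment step introduces NaNs) and returns the union of the columns; if the original column order and dtypes matter, a possible follow-up (a sketch, assuming no missing values remain in the combined result):
out = df2.combine_first(df1)
out = out[df1.columns].astype(df1.dtypes.to_dict())   # restore df1's column order and integer dtypes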
The way to do this is with pd.DataFrame.update
update modifies a dataframe in place with the non-NA values from another dataframe.
df1.update(df2)
The advantage of this is that your dtypes in df1 are preserved.
df1
A B C D
0 2 1 3 2
1 3 1 8 2
2 1 2 3 9
Another way to have done this without updating in place would be to use pd.DataFrame.assign and dictionary unpacking on pd.DataFrame.iteritems. Note, however, that this would also add any columns that exist only in df2.
df1.assign(**dict(df2.iteritems()))
A B C D
0 2 1 3 2
1 3 1 8 2
2 1 2 3 9
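Since DataFrame.iteritems was removed in pandas 2.0, the same idea can be written with items() instead (a sketch of the equivalent call):
df1.assign(**dict(df2.items()))   # items() yields (column name, Series) pairs, just like iteritems did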
A simple for loop should suffice:
for c in df2.columns:
    df1[c] = df2[c]
Or, to overwrite only the columns that exist in both frames:
for col in df1.columns:
    if col in df2.columns:
        df1[col] = df2[col]
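If you want to avoid the explicit loop, the same replacement can be done in one assignment over the shared columns (a sketch; it assumes df1 and df2 share the same index):
common = df1.columns.intersection(df2.columns)   # columns present in both frames
df1[common] = df2[common]                        # aligned on the index, so mismatched rows would become NaN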

Pandas: replace zero value with value of another column

How do I replace a zero value in a column with the value from the same row of another column, but only where the previous row's value in that column is also zero, i.e. only where a non-zero value has not been encountered yet?
For example: Given a dataframe with columns a, b and c:
+----+-----+-----+----+
| | a | b | c |
|----+-----+-----+----|
| 0 | 2 | 0 | 0 |
| 1 | 5 | 0 | 0 |
| 2 | 3 | 4 | 0 |
| 3 | 2 | 0 | 3 |
| 4 | 1 | 8 | 1 |
+----+-----+-----+----+
Replace zero values in b and c with the values of a where the previous value is zero:
+----+-----+-----+----+
| | a | b | c |
|----+-----+-----+----|
| 0 | 2 | 2 | 2 |
| 1 | 5 | 5 | 5 |
| 2 | 3 | 4 | 3 |
| 3 | 2 | 0 | 3 | <-- zero in this row is not replaced because of
| 4 | 1 | 8 | 1 | non-zero value (4) in row before it.
+----+-----+-----+----+
In [90]: (df[~df.apply(lambda c: c.eq(0) & c.shift().fillna(0).eq(0))]
    ...:  .fillna(pd.DataFrame(np.tile(df.a.values[:, None], df.shape[1]),
    ...:                       columns=df.columns, index=df.index))
    ...:  .astype(int)
    ...: )
Out[90]:
a b c
0 2 2 2
1 5 5 5
2 3 4 3
3 2 0 3
4 1 8 1
Explanation:
In [91]: df[~df.apply(lambda c: c.eq(0) & c.shift().fillna(0).eq(0))]
Out[91]:
a b c
0 2 NaN NaN
1 5 NaN NaN
2 3 4.0 NaN
3 2 0.0 3.0
4 1 8.0 1.0
Now we can fill the NaNs with the corresponding values from the DataFrame below (which is simply column a tiled across all three columns):
In [92]: pd.DataFrame(np.tile(df.a.values[:, None], df.shape[1]), columns=df.columns, index=df.index)
Out[92]:
a b c
0 2 2 2
1 5 5 5
2 3 3 3
3 2 2 2
4 1 1 1
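The same shift-based rule can also be written with DataFrame.mask, which some may find easier to read. A sketch, equivalent under the same "previous row value is zero" condition:
cond = df.eq(0) & df.shift().fillna(0).eq(0)   # zero cells whose previous row value was also zero (or the first row)
df.mask(cond, df['a'], axis=0)                 # replace those cells with column a, aligned row by row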

How to groupby count across multiple columns in pandas

I have the following sample dataframe in Python pandas:
+---+------+------+------+
| | col1 | col2 | col3 |
+---+------+------+------+
| 0 | a | d | b |
+---+------+------+------+
| 1 | a | c | b |
+---+------+------+------+
| 2 | c | b | c |
+---+------+------+------+
| 3 | b | b | c |
+---+------+------+------+
| 4 | a | a | d |
+---+------+------+------+
I would like to perform a count of all the 'a,' 'b,' 'c,' and 'd' values across columns 1-3 so that I would end up with a dataframe like this:
+---+--------+-------+
| | letter | count |
+---+--------+-------+
| 0 | a | 4 |
+---+--------+-------+
| 1 | b | 5 |
+---+--------+-------+
| 2 | c | 4 |
+---+--------+-------+
| 3 | d | 2 |
+---+--------+-------+
One way I can do this is stack the columns on top of each other and THEN do a groupby count, but I feel like there has to be a better way. Can someone help me with this?
You can stack() the dataframe to put all columns into rows and then do value_counts:
df.stack().value_counts()
b 5
c 4
a 4
d 2
dtype: int64
You can apply value_counts per column and then sum across the columns:
print (df.apply(pd.value_counts))
col1 col2 col3
a 3.0 1 NaN
b 1.0 2 2.0
c 1.0 1 2.0
d NaN 1 1.0
df1 = df.apply(pd.value_counts).sum(1).reset_index()
df1.columns = ['letter','count']
df1['count'] = df1['count'].astype(int)
print (df1)
letter count
0 a 4
1 b 5
2 c 4
3 d 2
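To get exactly the letter/count layout the question asks for, the stack-based result can be reshaped in one chain; a sketch (the letter/count labels are just the ones from the expected output):
out = (df.stack()
         .value_counts()
         .rename_axis('letter')
         .reset_index(name='count')
         .sort_values('letter', ignore_index=True))   # alphabetical order, as in the expected output
print(out)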
