Pandas: replace zero value with value of another column - python

How can I replace a zero value in a column with the value from the same row of another column, but only where the previous row's value in that column is also zero, i.e. only where a non-zero value has not been encountered yet?
For example: Given a dataframe with columns a, b and c:
+----+-----+-----+----+
| | a | b | c |
|----+-----+-----|----|
| 0 | 2 | 0 | 0 |
| 1 | 5 | 0 | 0 |
| 2 | 3 | 4 | 0 |
| 3 | 2 | 0 | 3 |
| 4 | 1 | 8 | 1 |
+----+-----+-----+----+
Replace the zero values in b and c with the values of a, but only where the previous value is also zero:
+----+-----+-----+----+
| | a | b | c |
|----+-----+-----|----|
| 0 | 2 | 2 | 2 |
| 1 | 5 | 5 | 5 |
| 2 | 3 | 4 | 3 |
| 3 | 2 | 0 | 3 | <-- zero in this row is not replaced because of
| 4 | 1 | 8 | 1 | non-zero value (4) in row before it.
+----+-----+-----+----+

In [90]: (df[~df.apply(lambda c: c.eq(0) & c.shift().fillna(0).eq(0))]
...: .fillna(pd.DataFrame(np.tile(df.a.values[:, None], df.shape[1]),
...: columns=df.columns, index=df.index))
...: .astype(int)
...: )
Out[90]:
a b c
0 2 2 2
1 5 5 5
2 3 4 3
3 2 0 3
4 1 8 1
Explanation:
In [91]: df[~df.apply(lambda c: c.eq(0) & c.shift().fillna(0).eq(0))]
Out[91]:
a b c
0 2 NaN NaN
1 5 NaN NaN
2 3 4.0 NaN
3 2 0.0 3.0
4 1 8.0 1.0
Now we can fill the NaNs with the corresponding values from the DataFrame below (built by tiling column a across all three columns):
In [92]: pd.DataFrame(np.tile(df.a.values[:, None], df.shape[1]), columns=df.columns, index=df.index)
Out[92]:
a b c
0 2 2 2
1 5 5 5
2 3 3 3
3 2 2 2
4 1 1 1
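A shorter sketch of the same idea, using `DataFrame.mask`: build the boolean condition once for the whole frame, then broadcast column a along the index as the replacement source (same toy data as above):

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 5, 3, 2, 1],
                   'b': [0, 0, 4, 0, 8],
                   'c': [0, 0, 0, 3, 1]})

# True where the cell is 0 and the cell above it is also 0
# (the first row's "previous" value is treated as 0 via fillna)
cond = df.eq(0) & df.shift().fillna(0).eq(0)

# Replace masked cells with the same-row value from column a
out = df.mask(cond, df['a'], axis=0)
```

This avoids the `np.tile` helper frame entirely; `axis=0` aligns the replacement Series row-by-row.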

Related

Combine and expand Dataframe based on IDs in columns

I have 3 dataframes A, B and C:
import numpy as np
import pandas as pd

A = pd.DataFrame({"id": [1, 2],
                  "connected_to_B_id1": ["A", "B"],
                  "connected_to_B_id2": ["B", "C"],
                  "connected_to_B_id3": ["C", np.nan],
                  # an entry can have multiple ids from B
                  })
B = pd.DataFrame({"id": ["A", "B", "C"],
                  "connected_to_C_id1": [1, 1, 2],
                  "connected_to_C_id2": [2, 2, np.nan],
                  # an entry can have multiple ids from C
                  })
C = pd.DataFrame({"id": [1, 2],
                  "name": ["a", "b"],
                  })
#Output should be D:
D = pd.DataFrame({"id_A": [1, 1, 1, 1, 1, 2, 2, 2],
                  "id_B": ["A", "A", "B", "B", "C", "B", "B", "C"],
                  "id_C": [1, 2, 1, 2, 2, 1, 2, 1],
                  "name": ["a", "b", "a", "b", "b", "a", "b", "a"]
                  })
I want to use the IDs stored in the "connected_to_X" columns of each dataframe to create a dataframe, which contains all relationships recorded in the three individual dataframes.
What is the most elegant way to combine the dataframes A, B and C into D?
Currently I am using dicts, lists and for loops, and it's messy and complicated.
D:
|idx |id_A|id_B|id_C|name|
|---:|--:|--:|--:|--:|
| 0 | 1 | A | 1 | a |
| 1 | 1 | A | 2 | b |
| 2 | 1 | B | 1 | a |
| 3 | 1 | B | 2 | b |
| 4 | 1 | C | 2 | b |
| 5 | 2 | B | 1 | a |
| 6 | 2 | B | 2 | b |
| 7 | 2 | C | 1 | a |
You just need to unpivot A and B; then you can join the tables up.
(A.
melt(id_vars='id').
merge(B.melt(id_vars='id'), left_on = 'value', right_on='id', how='left').
merge(C, left_on = 'value_y', right_on='id').
drop(columns = ['variable_x', 'variable_y', 'value_x']).
sort_values(['id_x', 'id_y']).
reset_index(drop=True).
reset_index()
)
index id_x id_y value_y id name
0 0 1 A 1.0 1 a
1 1 1 A 2.0 2 b
2 2 1 B 1.0 1 a
3 3 1 B 2.0 2 b
4 4 1 C 2.0 2 b
5 5 2 B 1.0 1 a
6 6 2 B 2.0 2 b
7 7 2 C 2.0 2 b
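The same pipeline can be tidied up so the output matches D's column names; this is a sketch that also drops the NaN placeholder rows before merging. Note that the last row comes out as (2, C, 2, b) rather than the (2, C, 1, a) shown in the question's D, because B's row for id "C" maps only to C id 2:

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({"id": [1, 2],
                  "connected_to_B_id1": ["A", "B"],
                  "connected_to_B_id2": ["B", "C"],
                  "connected_to_B_id3": ["C", np.nan]})
B = pd.DataFrame({"id": ["A", "B", "C"],
                  "connected_to_C_id1": [1, 1, 2],
                  "connected_to_C_id2": [2, 2, np.nan]})
C = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

D = (A.melt(id_vars='id')
       .dropna(subset=['value'])              # drop the unused id3 slot
       .merge(B.melt(id_vars='id').dropna(subset=['value']),
              left_on='value', right_on='id', how='left')
       .merge(C, left_on='value_y', right_on='id')
       .drop(columns=['variable_x', 'variable_y', 'value_x', 'value_y'])
       .rename(columns={'id_x': 'id_A', 'id_y': 'id_B', 'id': 'id_C'})
       .sort_values(['id_A', 'id_B', 'id_C'])
       .reset_index(drop=True))
```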

Python: how do I filter data as long as a group contains any of a particular value

df = pd.DataFrame({'VisitID':[1,1,1,1,2,2,2,3,3,4,4], 'Item':['A','B','C','D','A','D','B','B','C','D','C']})
I have a dataset like this:
VisitID | Item |
1 | A |
1 | B |
1 | C |
1 | D |
2 | A |
2 | D |
2 | B |
3 | B |
3 | C |
4 | D |
4 | C |
I want to return VisitID rows as long as that VisitID had an occurrence of item A OR B. How do I go about it? Expected result:
VisitID | Item |
1 | A |
1 | B |
1 | C |
1 | D |
2 | A |
2 | D |
2 | B |
3 | B |
3 | C |
In R, I can do this via
library(dplyr)
df %>% group_by(VisitID) %>% filter(any(Item %in% c('A', 'B')))
How can I perform this in Python?
Something like df.groupby(['VisitID']).query(any(['A','B']))?
The syntax is similar, just use groupby.filter:
df.groupby('VisitID').filter(lambda g: g.Item.isin(['A','B']).any())
VisitID Item
0 1 A
1 1 B
2 1 C
3 1 D
4 2 A
5 2 D
6 2 B
7 3 B
8 3 C
To extract the groups that contain either value, we can use groupby().transform('any') on the isin() mask:
s = (df.Item.isin(['A','B'])
.groupby(df['VisitID']).transform('any')
)
df[s]
Output:
VisitID Item
0 1 A
1 1 B
2 1 C
3 1 D
4 2 A
5 2 D
6 2 B
7 3 B
8 3 C
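A third option (a sketch on the same toy data) collects the qualifying VisitIDs first and then filters with isin; this avoids the per-group Python callback of groupby.filter, which can matter on large frames:

```python
import pandas as pd

df = pd.DataFrame({'VisitID': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
                   'Item': ['A', 'B', 'C', 'D', 'A', 'D', 'B', 'B', 'C', 'D', 'C']})

# VisitIDs that have at least one A or B row
keep = df.loc[df['Item'].isin(['A', 'B']), 'VisitID'].unique()

# Keep every row belonging to those visits
out = df[df['VisitID'].isin(keep)]
```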

Add column with average value grouped by column

I want to replace column values of a dataframe with the mean (excluding zeros) of that column, grouped by another column.
Dataframe df is like:
ID | TYPE | rate
-------------
1 | A | 0 <- Replace this
2 | B | 2
3 | C | 1
4 | A | 2
5 | C | 1
6 | C | 0 <- Replace this
7 | C | 8
8 | C | 2
9 | D | 0 <- Replace this
I have to replace the values in rate where rate = 0:
df['rate'][df['rate']==0] = ?
with average value for that TYPE.
The average (excluding zeros) for every type is:
A = 2/1 = 2
B = 2/1 = 2
C = (1 + 1 + 8 + 2)/4 = 3
D = 0 (default value when there isn't information for type)
Expected result:
ID | TYPE | rate
-------------
1 | A | 2 <- Changed
2 | B | 2
3 | C | 1
4 | A | 2
5 | C | 1
6 | C | 3 <- Changed
7 | C | 8
8 | C | 2
9 | D | 0 <- Changed
You could mask the rate column in the dataframe, GroupBy the TYPE and transform with the mean, which will exclude NaNs. Then use fillna to replace the values in the masked column:
ma = df.rate.mask(df.rate.eq(0))
df['rate'] = ma.fillna(ma.groupby(df.TYPE).transform('mean').fillna(0))
ID TYPE rate
0 1 A 2.0
1 2 B 2.0
2 3 C 1.0
3 4 A 2.0
4 5 C 1.0
5 6 C 3.0
6 7 C 8.0
7 8 C 2.0
8 9 D 0.0
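An equivalent sketch (same toy data) using replace and np.where instead of mask/fillna; zeros are treated as missing when computing the per-TYPE mean, and type D falls back to 0 because it has no non-zero rates:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': range(1, 10),
                   'TYPE': list('ABCACCCCD'),
                   'rate': [0, 2, 1, 2, 1, 0, 8, 2, 0]})

# Per-TYPE mean with zeros excluded; all-zero types fall back to 0
means = (df['rate'].replace(0, np.nan)
           .groupby(df['TYPE']).transform('mean')
           .fillna(0))

# Substitute the group mean only where rate is 0
df['rate'] = np.where(df['rate'].eq(0), means, df['rate'])
```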

Python pandas count occurrences in each column

I am new to pandas. Can someone help me calculate the frequency of each value in every column?
Dataframe:
id|flag1|flag2|flag3|
---------------------
1 | 1 | 2 | 1 |
2 | 3 | 1 | 1 |
3 | 3 | 4 | 4 |
4 | 4 | 1 | 4 |
5 | 2 | 3 | 2 |
I want something like
id|flag1|flag2|flag3|
---------------------
1 | 1 | 2 | 2 |
2 | 1 | 1 | 1 |
3 | 2 | 1 | 0 |
4 | 1 | 1 | 2 |
Explanation - value 1 occurs once in flag1, twice in flag2 and twice in flag3.
First keep only the flag columns with filter (or by dropping the id column), then apply value_counts to each column, and finally replace NaNs with 0 and cast to int:
df = df.filter(like='flag').apply(lambda x: x.value_counts()).fillna(0).astype(int)
print (df)
flag1 flag2 flag3
1 1 2 2
2 1 1 1
3 2 1 0
4 1 1 2
Or:
df = df.drop('id', 1).apply(lambda x: x.value_counts()).fillna(0).astype(int)
print (df)
flag1 flag2 flag3
1 1 2 2
2 1 1 1
3 2 1 0
4 1 1 2
Thank you, Bharath, for the suggestion (note that value_counts is passed as a function, not called):
df = df.filter(like='flag').apply(pd.Series.value_counts).fillna(0).astype(int)
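An alternative sketch on the same toy data that avoids the NaN/fillna step entirely: melt the flag columns to long form and cross-tabulate value against column name; crosstab fills absent combinations with 0 automatically:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'flag1': [1, 3, 3, 4, 2],
                   'flag2': [2, 1, 4, 1, 3],
                   'flag3': [1, 1, 4, 4, 2]})

# Long form: one row per (column, value) observation
m = df.filter(like='flag').melt()

# Frequency table: rows are values, columns are flag columns
counts = pd.crosstab(m['value'], m['variable'])
```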

Interleaving Pandas Dataframes by Timestamp

I've got 2 Pandas DataFrame, each of them containing 2 columns. One of the columns is a timestamp column [t], the other one contains sensor readings [s].
I now want to create a single DataFrame, containing 4 columns, that is interleaved on the timestamp column.
Example:
First Dataframe:
+----+----+
| t1 | s1 |
+----+----+
| 0 | 1 |
| 2 | 3 |
| 3 | 3 |
| 5 | 2 |
+----+----+
Second DataFrame:
+----+----+
| t2 | s2 |
+----+----+
| 1 | 5 |
| 2 | 3 |
| 4 | 3 |
+----+----+
Target:
+----+----+----+----+
| t1 | t2 | s1 | s2 |
+----+----+----+----+
| 0 | 0 | 1 | 0 |
| 0 | 1 | 1 | 5 |
| 2 | 1 | 3 | 5 |
| 2 | 2 | 3 | 3 |
| 3 | 2 | 3 | 3 |
| 3 | 4 | 3 | 3 |
| 5 | 4 | 2 | 3 |
+----+----+----+----+
I had a look at pandas.merge, but that left me with a lot of NaNs and an unsorted table.
a.merge(b, how='outer')
Out[55]:
t1 s1 t2 s2
0 0 1 NaN NaN
1 2 3 2 3
2 3 3 NaN NaN
3 5 2 NaN NaN
4 NaN NaN 1 5
5 NaN NaN 4 3
Merging will put NaNs in common columns that you merge on, if those values are not present in both indexes. It will not create new data that is not present in the dataframes that are being merged.
For example, index 0 in your target dataframe shows t2 with a value of 0. This is not present in the second dataframe, so you cannot expect it to appear in the merged dataframe either. Same applies for other rows as well.
What you can do instead is reindex the dataframes to a common index. In your case, since the maximum index is 5 in the target dataframe, let's use this list to reindex both input dataframes:
In [382]: ind
Out[382]: [0, 1, 2, 3, 4, 5]
Now we reindex both inputs to this index:
In [372]: x = a.set_index('t1').reindex(ind).fillna(0).reset_index()
In [373]: x
Out[373]:
t1 s1
0 0 1
1 1 0
2 2 3
3 3 3
4 4 0
5 5 2
In [374]: y = b.set_index('t2').reindex(ind).fillna(0).reset_index()
In [375]: y
Out[375]:
t2 s2
0 0 0
1 1 5
2 2 3
3 3 0
4 4 3
5 5 0
And, now we merge it to get something close to the target dataframe:
In [376]: x.merge(y, left_on=['t1'], right_on=['t2'], how='outer')
Out[376]:
t1 s1 t2 s2
0 0 1 0 0
1 1 0 1 5
2 2 3 2 3
3 3 3 3 0
4 4 0 4 3
5 5 2 5 0
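If the timestamps should stay irregular rather than being aligned to a dense grid, pd.merge_ordered is another possible sketch: it outer-joins on the sorted union of timestamps, after which a forward fill carries the last reading onward (leading gaps become 0). Note that simultaneous readings (t = 2 here) collapse into a single row, so this yields 6 rows rather than the target's 7:

```python
import pandas as pd

a = pd.DataFrame({'t1': [0, 2, 3, 5], 's1': [1, 3, 3, 2]})
b = pd.DataFrame({'t2': [1, 2, 4], 's2': [5, 3, 3]})

# Ordered outer join on the union of timestamps, then carry the last
# known reading forward; remaining leading NaNs become 0
merged = (pd.merge_ordered(a, b, left_on='t1', right_on='t2')
            .ffill()
            .fillna(0))
```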
