Find the nearest nonzero element in df column - python

I have df:
id  number
1   5
1   0
1   0
1   2
2   0
3   1
I want to write a function to fill the 0 values. For each id (i.e. for each group), when the value in the number column is zero, I want to search for the closest non-zero value in the column and return it. For example, for id 1 the second and third rows should be filled with 2 (the next non-zero value below). If no such value exists, as for id 2, the zeros should remain as they are.
How can I do that?

You can mask the 0s, bfill per group, and finally fillna with the original values for the groups that only contain zeros:
df['number2'] = (df['number']
                 .mask(df['number'].eq(0))
                 .groupby(df['id'])
                 .bfill()
                 .fillna(df['number'], downcast='infer')
                )
output:
   id  number  number2
0   1       5        5
1   1       0        2
2   1       0        2
3   1       2        2
4   2       0        0
5   3       1        1
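For reference, a minimal reproduction of the approach above, assuming the df from the question (note that fillna's downcast argument is deprecated in recent pandas, so this sketch casts explicitly instead):
import pandas as pd

df = pd.DataFrame({'id':     [1, 1, 1, 1, 2, 3],
                   'number': [5, 0, 0, 2, 0, 1]})

df['number2'] = (df['number']
                 .mask(df['number'].eq(0))   # zeros become NaN
                 .groupby(df['id'])          # within each id...
                 .bfill()                    # ...fill from the next non-zero below
                 .fillna(df['number'])       # all-zero groups keep their 0s
                 .astype(int))               # restore the integer dtype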

Related

pandas: replace values in column with the last character in the column name

I have a dataframe as follows:
import pandas as pd
df = pd.DataFrame({'sent.1': [0, 1, 0, 1],
                   'sent.2': [0, 1, 1, 0],
                   'sent.3': [0, 0, 0, 1],
                   'sent.4': [1, 1, 0, 1]})
I am trying to replace the non-zero values with the 5th character of the column names (which is the numeric part of the names), so the output should be:
   sent.1  sent.2  sent.3  sent.4
0       0       0       0       4
1       1       2       0       4
2       0       2       0       0
3       1       0       3       4
I have tried the following, but it does not work:
print(df.replace(1, pd.Series([i[5] for i in df.columns], [i[5] for i in df.columns])))
However, when I replace it with the full column names, the code works, so I am not sure which part is wrong:
print(df.replace(1, pd.Series(df.columns, df.columns)))
Since you're dealing with 1s and 0s, you can actually just multiply the dataframe by a range:
df = df * range(1, df.shape[1] + 1)
Output:
   sent.1  sent.2  sent.3  sent.4
0       0       0       0       4
1       1       2       0       4
2       0       2       0       0
3       1       0       3       4
Or, if you want to take the numbers from the column names:
df = df * df.columns.str.split('.').str[-1].astype(int)
You could use string multiplication on a boolean array to place the strings based on the condition, and where to restore the zeros:
mask = df.ne(0)
(mask*df.columns.str[5]).where(mask, 0)
To have integers:
mask = df.ne(0)
(mask*df.columns.str[5].astype(int))
output:
   sent.1  sent.2  sent.3  sent.4
0       0       0       0       4
1       1       2       0       4
2       0       2       0       0
3       1       0       3       4
And another one, working with an arbitrary condition (here s.ne(0)):
df.apply(lambda s: s.mask(s.ne(0), s.name.rpartition('.')[-1]))
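Note that both string-based variants produce object-dtype columns (digit strings mixed with the integer zeros); a quick check, assuming the df from the question:
res = df.apply(lambda s: s.mask(s.ne(0), s.name.rpartition('.')[-1]))
print(res.dtypes)  # object for every column: '1'..'4' strings alongside 0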

Subsetting the data frame and applying cumulative operation on multiple columns

I have a dataset that looks like below.
df = pd.DataFrame({'unit': ['ABC', 'DEF', 'GEH', 'IJK', 'DEF', 'XRF', 'BRQ'],
                   'A': [1, 1, 1, 0, 0, 0, 1],
                   'B': [1, 1, 1, 1, 1, 1, 0],
                   'C': [1, 1, 1, 0, 0, 0, 1],
                   'row_num': [7, 6, 5, 4, 3, 2, 1]})
I am trying to implement the following logic:
Step 1 - Consider the subset with row_num <= 4.
Step 2 - Within that subset, columns A, B, C hold 12 values in total (0s and 1s).
Step 3 - Count the number of 1s within columns A, B, C. In the example there are five 1s and seven 0s, which works out to roughly 40% (5/12) 1s.
Step 4 - Since the count of 1s is greater than 40%, create a column flag with 1; else, if the count of 1s is less than 10%, use 0.
Hopefully I got it this time:
subdf = df.iloc[3:, 1:4]  # rows with row_num <= 4, columns A, B, C
df['flag'] = 1 if subdf.values.sum()/subdf.size >= 0.1 else 0  # fraction of 1s vs the 10% floor
output:
  unit  A  B  C  row_num  flag
0  ABC  1  1  1        7     1
1  DEF  1  1  1        6     1
2  GEH  1  1  1        5     1
3  IJK  0  1  0        4     1
4  DEF  0  1  0        3     1
5  XRF  0  1  0        2     1
6  BRQ  1  0  1        1     1
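The positional iloc[3:, 1:4] matches this exact frame; a sketch that selects by the stated conditions instead (the question leaves the 10-40% middle range unspecified, so it is marked as missing here):
sub = df.loc[df['row_num'] <= 4, ['A', 'B', 'C']]  # Step 1: the subset
share = sub.to_numpy().sum() / sub.size            # Steps 2-3: fraction of 1s
# Step 4: >40% -> 1, <10% -> 0; anything in between is left undefined
df['flag'] = 1 if share > 0.4 else (0 if share < 0.1 else pd.NA)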

Python: counting frequency for two columns with the same possible values

I have two columns with two possible values (0 or 1). One column is the predicted value and the other is the real value. Something like this.
ID  Predicted  Real
1   1          1
2   1          0
3   0          0
4   0          1
5   1          0
6   1          0
I want to count the frequency of 0 and 1 in each column, something like this:
Value  Predicted  Real
1      4          2
0      2          4
And I want to make a vertical bar plot with the results.
You can apply pd.value_counts to the dataframe (assuming ID is the index and not a column; if not, set ID as the index first):
out = df.apply(pd.value_counts).rename_axis('Value').reset_index()
   Value  Predicted  Real
0      0          2     4
1      1          4     2
df.apply(pd.value_counts).rename_axis('Value').plot(kind='bar')  # customize as you want
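A complete, runnable sketch of the counting and plotting steps, assuming matplotlib is installed (recent pandas versions deprecate the top-level pd.value_counts, so this calls value_counts per column instead):
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'Predicted': [1, 1, 0, 0, 1, 1],
                   'Real':      [1, 0, 0, 1, 0, 0]})

# one value_counts per column -> rows indexed by value, one column per input column
counts = df.apply(lambda s: s.value_counts()).rename_axis('Value')
counts.plot(kind='bar')  # vertical bars, grouped by value
plt.show()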

Python Selecting and Adding row values of columns in dataframe to create an aggregated dataframe

I need to process my dataframe in Python such that I add up the numeric values of the numeric columns that lie between two rows of the dataframe.
The dataframe can be created using
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['a', 0, 1, 0, 0, 0, 0, 'i'],
                            ['b', 1, 0, 0, 0, 0, 0, 'j'],
                            ['c', 0, 0, 1, 0, 0, 0, 'k'],
                            ['None', 0, 0, 0, 1, 0, 0, 'l'],
                            ['e', 0, 0, 0, 0, 1, 0, 'm'],
                            ['f', 0, 1, 0, 0, 0, 0, 'n'],
                            ['None', 0, 0, 0, 1, 0, 0, 'o'],
                            ['h', 0, 0, 0, 0, 1, 0, 'p']]),
                  columns=[0, 1, 2, 3, 4, 5, 6, 7],
                  index=[0, 1, 2, 3, 4, 5, 6, 7])
I need to add up all rows that occur before the 'None' entries and move the aggregated rows to a new dataframe.
Your dataframe's dtypes are messed up: because you built it from a NumPy array, and one array only accepts a single type, all the ints were coerced to strings. We need to convert them back first:
df = df.apply(pd.to_numeric, errors='ignore')  # convert the numeric columns back
df['newkey'] = df[0].eq('None').cumsum()  # cumsum over the 'None' markers creates the group key
df.loc[df[0].ne('None'), :].groupby('newkey').agg(lambda x: x.sum() if x.dtype == 'int64' else x.head(1))  # then aggregate
Out[742]:
        0  1  2  3  4  5  6  7
newkey
0       a  1  1  1  0  0  0  i
1       e  0  1  0  0  1  0  m
2       h  0  0  0  0  1  0  p
You can also specify the agg funcs:
s = lambda s: sum(int(k) for k in s)
d = {i: s for i in range(8)}
d.update({0: 'first', 7: 'first'})
df.groupby((df[0] == 'None').cumsum().shift().fillna(0)).agg(d)
     0  1  2  3  4  5  6  7
0
0.0  a  1  1  1  1  0  0  i
1.0  e  0  1  0  1  1  0  m
2.0  h  0  0  0  0  1  0  p
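A side note: recent pandas versions deprecate errors='ignore' for pd.to_numeric; a minimal alternative sketch, assuming columns 1 through 6 are the integer-valued ones:
num_cols = list(range(1, 7))             # the 0/1 columns
df[num_cols] = df[num_cols].astype(int)  # convert the strings back to ints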

pandas: Grouping or filtering based on values in list, instead of dataframe

I want to get a row count of the frequency of each value, even if that value doesn't exist in the dataframe.
d = {'light': pd.Series(['b', 'b', 'c', 'a', 'a', 'a', 'a'], index=[1, 2, 3, 4, 5, 6, 9]),
     'injury': pd.Series([1, 5, 5, 5, 2, 2, 4], index=[1, 2, 3, 4, 5, 6, 9])}
testdf = pd.DataFrame(d)
   injury light
1       1     b
2       5     b
3       5     c
4       5     a
5       2     a
6       2     a
9       4     a
I want to get a count of the number of occurrences of each unique value of 'injury' for each unique value in 'light'.
Normally I would just use groupby(), or (in this case, since I want it to be in a specific format), pivot_table:
testdf.reset_index().pivot_table(index='light', columns='injury', fill_value=0, aggfunc='count')
        index
injury      1  2  4  5
light
a           0  2  1  1
b           1  0  0  1
c           0  0  0  1
But in this case I actually want to compare the records in the dataframe to an external list of values-- in this case, ['a','b','c','d']. So if 'd' doesn't exist in this dataframe, then I want it to return a count of zero:
        index
injury      1  2  4  5
light
a           0  2  1  1
b           1  0  0  1
c           0  0  0  1
d           0  0  0  0
The closest I've come is filtering the dataframe based on each value, and then getting the size of that dataframe:
for v in sorted(['a', 'b', 'c', 'd']):
    idx2 = (testdf['light'].isin([v]))
    df2 = testdf[idx2]
    print(df2.shape[0])
4
2
1
0
But that only returns counts from the 'light' column-- instead of a cross-tabulation of both columns.
Is there a way to make a pivot table, or a groupby() object, that groups things based on values in a list, rather than in a column in a dataframe? Or is there a better way to do this?
Try this:
df = pd.crosstab(testdf.light, testdf.injury, margins=True)
df
injury  1  2  4  5  All
light
a       0  2  1  1    4
b       1  0  0  1    2
c       0  0  0  1    1
All     1  2  1  3    7
df["All"]
light
a      4
b      2
c      1
All    7
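Note that crosstab only emits rows for values that actually occur, so 'd' is absent here; to get the zero row for the question's external list, one option (a sketch, assuming the list ['a', 'b', 'c', 'd']) is to reindex the result:
values = ['a', 'b', 'c', 'd']
out = (pd.crosstab(testdf.light, testdf.injury)
         .reindex(values, fill_value=0))  # absent values become all-zero rows
Alternatively, declaring light as a Categorical with categories=values before cross-tabulating should also yield the zero row for 'd'.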
