Python if statement on a dataframe

I would like to set df['pred'] to 0 wherever the corresponding value of df['nonzero'] is not NaN and is <= 1.
beta0 beta1 number_repair t pred nonzero
0 NaN NaN NaN 6 0 NaN
1 NaN NaN NaN 7 0 NaN
2 NaN NaN NaN 8 0 NaN
3 NaN NaN NaN 9 3 0
4 NaN NaN NaN 10 2 0
5 NaN NaN NaN 11 1 0
I tried the following code but it returned an error. How can I correct the code, or could someone suggest another way to achieve this? Thanks!
mapping['pred'] = 0 if (np.all(np.isnan(mapping['nonzero'])),
(mapping['nonzero'] <= 1)) else mapping['pred']

I think you can use loc with a boolean mask built with notnull:
mask = (df['nonzero'].notnull()) & (df['nonzero'] <= 1)
print mask
0 False
1 False
2 False
3 True
4 True
5 True
Name: nonzero, dtype: bool
As pointed out in a comment (thank you PhilChang), it is the same as:
mask = df['nonzero'] <= 1
print mask
0 False
1 False
2 False
3 True
4 True
5 True
Name: nonzero, dtype: bool
df.loc[ mask, 'pred'] = 0
print df
beta0 beta1 number_repair t pred nonzero
0 NaN NaN NaN 6 0 NaN
1 NaN NaN NaN 7 0 NaN
2 NaN NaN NaN 8 0 NaN
3 NaN NaN NaN 9 0 0.0
4 NaN NaN NaN 10 0 0.0
5 NaN NaN NaN 11 0 0.0
Another solution uses the mask method:
df['pred'] = df.pred.mask(mask,0)
print df
beta0 beta1 number_repair t pred nonzero
0 NaN NaN NaN 6 0 NaN
1 NaN NaN NaN 7 0 NaN
2 NaN NaN NaN 8 0 NaN
3 NaN NaN NaN 9 0 0.0
4 NaN NaN NaN 10 0 0.0
5 NaN NaN NaN 11 0 0.0
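For completeness, the same update can also be written with numpy.where, which keeps the original value where the mask is False (a small sketch, assuming numpy is imported as np):
df['pred'] = np.where(mask, 0, df['pred'])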

I don't know how to check whether the cells of a Series contain NaN, but for the other condition this works quite well:
df.ix[df.ix[:,'nonzero'] <=1,'pred'] = 0
You would then just have to combine the first test with the NaN check as a second condition.
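Note that .ix has since been deprecated and later removed from pandas. A hedged sketch of the same idea with .loc, with the NaN check folded in via notnull, would be:
df.loc[df['nonzero'].notnull() & (df['nonzero'] <= 1), 'pred'] = 0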

Related

How to select and print some columns for all the rows that are not NA and are a specific number in Python?

I am having trouble selecting the rows I want and printing the columns of choice.
So I have 8 columns and what I am looking to do is take all the rows where column 8 is not NA and is equal to 2, and print only columns 2 to 5.
I have tried this:
df.where(df['dhch'].notnull())[['scchdg', 'dhch']]
Here I have just selected 2 columns to check that the condition that dhch is not NA worked, and I got the expected output:
scchdg dhch
0 3 1
1 -1 2
2 -1 2
3 1 1
4 3 1
... ...
12094 -9 1
12095 1 1
12096 4 1
12097 3 1
12098 4 1
[12099 rows x 2 columns]
And when I check the equality condition I get the expected output (i.e., values of 2 and NaNs in the dhch column):
df.where(df['dhch']==2)[['scchdg', 'dhch']]
Out[50]:
scchdg dhch
0 NaN NaN
1 -1.0 2.0
2 -1.0 2.0
3 NaN NaN
4 NaN NaN
... ...
12094 NaN NaN
12095 NaN NaN
12096 NaN NaN
12097 NaN NaN
12098 NaN NaN
[12099 rows x 2 columns]
But when I combine these, I just get piles of NAs:
df.where(df['dhch'].notnull() & df['dhch']==2)[['scchdg', 'dhch']]
Out[51]:
scchdg dhch
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
... ...
12094 NaN NaN
12095 NaN NaN
12096 NaN NaN
12097 NaN NaN
12098 NaN NaN
[12099 rows x 2 columns]
What am I doing wrong please?
In R, what I want to do is as follows:
df[!is.na(df$dhch) & df$dhch==2, c('scchdg', 'dhch')]
But how do I do exactly this in Python, please?
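A likely cause of the all-NaN result above is operator precedence: & binds more tightly than ==, so df['dhch'].notnull() & df['dhch']==2 is evaluated as (df['dhch'].notnull() & df['dhch']) == 2. A hedged sketch of the equivalent of the R line, with the comparison parenthesised and loc used to select both rows and columns:
df.loc[df['dhch'].notnull() & (df['dhch'] == 2), ['scchdg', 'dhch']]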

Determine with pandas if values in two columns are close to each other

This is my DataFrame:
max hits
0 NaN NaN
1 NaN NaN
2 NaN True (bad)
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 True NaN
7 NaN True (good)
8 NaN NaN
9 NaN NaN
10 NaN True (good)
11 True NaN
12 NaN NaN
13 NaN NaN
I want to count how many True values in the 'hits' column are near True values in the 'max' column. The proximity criterion is two steps up and two steps down, so in my example the answer is 2.
At the moment I count it this way:
# get the indexes of True values in the hits column
indexes = df.dropna(subset=['hits']).index
count = 0
for index in indexes:
    df_slice = df.iloc[index - 2 : index + 2 + 1].dropna(subset=['max'])
    if len(df_slice) > 0:
        count += 1  # True in 'hits' is close to a True value in 'max'
It works as expected, but very slowly. My DataFrame is very large and I lose a lot of time. Is there a faster way?
Updated:
It started to fly using this method:
df.hits.fillna(method='bfill', inplace=True, limit=2)
df.hits.fillna(method='ffill', inplace=True, limit=2)
count = len (df.dropna(subset=['hits', 'max'], inplace=False, how = 'any'))
Let's try bfill/ffill with limit:
(df.hits.bfill(limit=2).ffill(limit=2) & df['max']).sum()
# out 2
# Introduce a test column
df = df.assign(test=df.sum(1).replace(0, np.nan).fillna(method='ffill', limit=2))
# Create conditions and choices
cond = [df.hits.notna() & df['test'].ne(df['test'].shift(3)),
        df.hits.notna() & df['test'].eq(df['test'].shift(3))]
choices = ['bad', 'good']
# Update status using np.select
df['status'] = np.select(cond, choices, '')
print(df)
max hits test status
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN True 1.0 bad
3 NaN NaN 1.0
4 NaN NaN 1.0
5 NaN NaN NaN
6 True NaN 1.0
7 NaN True 1.0 good
8 NaN NaN 1.0
9 NaN NaN 1.0
10 NaN True 1.0 good
11 True NaN 1.0
12 NaN NaN 1.0
13 NaN NaN 1.0
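Another vectorized option, sketched here under the same two-steps-up/two-steps-down assumption, is to widen the True marks in 'max' with a centered rolling window (a window of 5 covers two rows on each side) and intersect the result with 'hits':
near_max = df['max'].notna().astype(int).rolling(5, center=True, min_periods=1).max().astype(bool)
count = (df['hits'].notna() & near_max).sum()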

How to assign numerical value to each new grouping in a pandas data frame row?

If I have a pandas DataFrame like this:
0 1 2 3 4 5
1 NaN NaN 1 NaN 1 1
2 1 NaN NaN 1 NaN 1
3 NaN 1 1 NaN 1 1
4 1 1 1 1 1 1
5 NaN NaN NaN NaN NaN NaN
How do I identify each group of ones and assign a value based on the order of the groups in each row, such that I get a DataFrame like this:
0 1 2 3 4 5
1 NaN NaN 1 NaN 2 2
2 1 NaN NaN 2 NaN 3
3 NaN 1 NaN NaN 2 2
4 1 1 1 1 1 1
5 NaN NaN NaN NaN NaN NaN
It is a little bit hard to find a simple way:
s = df.isnull().cumsum(1)  # cumsum over the nulls in each row
s = s[df.notnull()].apply(lambda x: pd.factorize(x)[0], 1) + 1  # then assign the group key per row
df = s.mask(s == 0)  # and mask 0 as NaN
df
0 1 2 3 4 5
1 NaN NaN 1.0 NaN 2.0 2.0
2 1.0 NaN NaN 2.0 NaN 3.0
3 NaN 1.0 1.0 NaN 2.0 2.0
4 1.0 1.0 1.0 1.0 1.0 1.0
5 NaN NaN NaN NaN NaN NaN
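A vectorized sketch of the same row-wise numbering, avoiding the per-row apply: mark the cell where each run of ones starts and take a cumulative sum across the columns (the names m, starts and out are only illustrative):
m = df.notnull()
starts = m & ~m.shift(1, axis=1, fill_value=False)  # True at the first cell of each run
out = starts.cumsum(axis=1).where(m)                # number the runs per row, NaN elsewhere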

reshape a pandas dataframe index to columns

Consider the pandas Series object below:
index = list('abcdabcdabcd')
df = pd.Series(np.arange(len(index)), index = index)
My desired output is,
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
I have put some effort into pd.pivot_table and unstack, and the solution probably lies in the correct use of one of them. The closest I have reached is
df.reset_index(level = 1).unstack(level = 1)
but this does not give me the output I am looking for.
Here is something even closer to the desired output, but I am not able to handle the index grouping:
df.to_frame().set_index(df.values, append=True, drop=False).unstack(level=0)
a b c d
0 0.0 NaN NaN NaN
1 NaN 1.0 NaN NaN
2 NaN NaN 2.0 NaN
3 NaN NaN NaN 3.0
4 4.0 NaN NaN NaN
5 NaN 5.0 NaN NaN
6 NaN NaN 6.0 NaN
7 NaN NaN NaN 7.0
8 8.0 NaN NaN NaN
9 NaN 9.0 NaN NaN
10 NaN NaN 10.0 NaN
11 NaN NaN NaN 11.0
A bit more general solution using cumcount to get new index values, and pivot to do the reshaping:
# Reset the existing index, and construct the new index values.
df = df.reset_index()
df.index = df.groupby('index').cumcount()
# Pivot and remove the column axis name.
df = df.pivot(columns='index', values=0).rename_axis(None, axis=1)
The resulting output:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Here is a way that will work if the index is always cycling in the same order, and you know the "period" (in this case 4):
>>> pd.DataFrame(df.values.reshape(-1,4), columns=list('abcd'))
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
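Since the question mentions unstack, here is one more sketch that only uses the Series defined in the question: build a MultiIndex from a per-label counter and unstack the letters into columns.
counter = df.groupby(level=0).cumcount()  # 0, 0, 0, 0, 1, 1, ...
out = pd.Series(df.values, index=pd.MultiIndex.from_arrays([counter, df.index]))
out = out.unstack()                       # the letters become the columns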

Fill in rows to get the same number of rows in each group after groupby (pandas)

My goal is to get the same number of rows for each group after a groupby. Originally, after the groupby I would get something like this:
count mean std min 25% 50% 75% max
X Y
56 2 5 25200 21 0.0 20000.0 20000.0 26000.0 60000.0
8 1.0 20000 NaN 20000 20000 20000 20000 20000.0
952 2 25.0 216132 239321 0 35000 93100 55000 650000.0
233 2 1.0 0 NaN 0 0.0 0.0 0.0 0.0
335 2 9.0 853 60018 0.0 35000 98000 130000 150000.0
6 11.0 3409 4943 0.0 0.0 0.0 7750.0 11000.0
And to meet my goal I should get the following:
count mean std min 25% 50% 75% max
X Y
56 1 0 0 NaN NaN NaN NaN NaN NaN
2 5 252 21 0.0 20000.0 20000.0 26000.0 60000.0
3 0 0 NaN NaN NaN NaN NaN NaN
4 0 0 NaN NaN NaN NaN NaN NaN
5 0 0 NaN NaN NaN NaN NaN NaN
6 0 0 NaN NaN NaN NaN NaN NaN
7 0 0 NaN NaN NaN NaN NaN NaN
8 1.0 20000 NaN 20000 200 20000 20000 20000.0
952 1 0 0 NaN NaN NaN NaN NaN NaN
2 25.0 216132 239 0 35000 93100 55000 650000.0
3 0 0 NaN NaN NaN NaN NaN NaN
4 0 0 NaN NaN NaN NaN NaN NaN
5 0 0 NaN NaN NaN NaN NaN NaN
6 0 0 NaN NaN NaN NaN NaN NaN
7 0 0 NaN NaN NaN NaN NaN NaN
8 1.0 0 NaN 0 0 0 0 0
I think you can use reindex with MultiIndex.from_product:
# if the second level should contain all unique values from the existing second level
mux = pd.MultiIndex.from_product([df.index.get_level_values('X').unique(),
                                  df.index.get_level_values('Y').unique()])
# if the second level should be the range 1 to 8 (1, 2, 3, ..., 8)
mux = pd.MultiIndex.from_product([df.index.get_level_values('X').unique(),
                                  np.arange(1, 9)])
# if the range should start at the min and end at the max value of the level
idx = df.index.get_level_values('Y')
mux = pd.MultiIndex.from_product([df.index.get_level_values('X').unique(),
                                  np.arange(idx.min(), idx.max() + 1)])
df = df.reindex(mux)
print (df)
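After the reindex the newly added rows are all NaN; if, as in the desired output, the count column should show 0 for those rows, a small follow-up (a sketch, assuming the column is literally named 'count') would be:
df['count'] = df['count'].fillna(0)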
