I have a dataframe with 50 columns. I want to replace NAs with 0 in 10 columns.
What's the simplest, most readable way of doing this?
I was hoping for something like:
cols = ['a', 'b', 'c', 'd']
df[cols].fillna(0, inplace=True)
But that gives me ValueError: Must pass DataFrame with boolean values only.
I found this answer, but it's rather hard to understand.
You can use update():
In [145]: df
Out[145]:
a b c d e
0 NaN NaN NaN 3 8
1 NaN NaN NaN 8 7
2 NaN NaN NaN 2 8
3 NaN NaN NaN 7 4
4 NaN NaN NaN 4 9
5 NaN NaN NaN 1 9
6 NaN NaN NaN 7 7
7 NaN NaN NaN 6 5
8 NaN NaN NaN 0 0
9 NaN NaN NaN 9 5
In [146]: df.update(df[['a','b','c']].fillna(0))
In [147]: df
Out[147]:
a b c d e
0 0.0 0.0 0.0 3 8
1 0.0 0.0 0.0 8 7
2 0.0 0.0 0.0 2 8
3 0.0 0.0 0.0 7 4
4 0.0 0.0 0.0 4 9
5 0.0 0.0 0.0 1 9
6 0.0 0.0 0.0 7 7
7 0.0 0.0 0.0 6 5
8 0.0 0.0 0.0 0 0
9 0.0 0.0 0.0 9 5
You can also assign the result of fillna on a subset of columns straight back to those columns:
In [15]: cols = ['one', 'two']
In [16]: df
Out[16]:
one two three four five
a -0.343241 0.453029 -0.895119 bar False
b NaN NaN NaN NaN NaN
c 0.839174 0.229781 -1.244124 bar True
d NaN NaN NaN NaN NaN
e 1.300641 -1.797828 0.495313 bar True
f -0.182505 -1.527464 0.712738 bar False
g NaN NaN NaN NaN NaN
h 0.626568 -0.971003 1.192831 bar True
In [17]: df[cols]=df[cols].fillna(0)
In [18]: df
Out[18]:
one two three four five
a -0.343241 0.453029 -0.895119 bar False
b 0.000000 0.000000 NaN NaN NaN
c 0.839174 0.229781 -1.244124 bar True
d 0.000000 0.000000 NaN NaN NaN
e 1.300641 -1.797828 0.495313 bar True
f -0.182505 -1.527464 0.712738 bar False
g 0.000000 0.000000 NaN NaN NaN
h 0.626568 -0.971003 1.192831 bar True
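Shorter still: fillna also accepts a dict mapping column names to fill values, so you can fill just those columns in one call and leave the rest untouched (a minimal sketch):
cols = ['a', 'b', 'c', 'd']
# build a {column: fill_value} mapping; only the listed columns are filled
df = df.fillna(dict.fromkeys(cols, 0))
# equivalently: df = df.fillna({'a': 0, 'b': 0, 'c': 0, 'd': 0})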
And a version using column slicing, which might be useful in your case:
In [46]:
df
Out[46]:
a b c d e
0 NaN NaN NaN 3 8
1 NaN NaN NaN 8 7
2 NaN NaN NaN 2 8
3 NaN NaN NaN 7 4
4 NaN NaN NaN 4 9
5 9 NaN NaN 1 9
6 NaN NaN NaN 7 7
7 NaN NaN NaN 6 5
8 NaN NaN NaN 0 0
9 NaN NaN NaN 9 5
In [47]:
df.loc[:,'a':'c'] = df.loc[:,'a':'c'].fillna(0)
df
Out[47]:
a b c d e
0 0 0 0 3 8
1 0 0 0 8 7
2 0 0 0 2 8
3 0 0 0 7 4
4 0 0 0 4 9
5 9 0 0 1 9
6 0 0 0 7 7
7 0 0 0 6 5
8 0 0 0 0 0
9 0 0 0 9 5
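If your ten columns are not contiguous, the same pattern works with an explicit list of labels instead of a slice (a sketch):
cols = ['a', 'c']  # any subset of column labels
df.loc[:, cols] = df.loc[:, cols].fillna(0)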
Related
I am looking for a method to create an array of numbers that labels groups, based on the value of the 'number' column, if that's possible.
With this abbreviated example DF:
import numpy as np
import pandas as pd

nan = np.nan
number = [nan, nan, 1, nan, nan, nan, 2, nan, nan, 3, nan, nan, nan, nan, nan, 4, nan, nan]
df = pd.DataFrame({'number': number})
Ideally I would like to make a new column, 'group', based on the ints in column 'number', so there would effectively be runs of 1, 2, 3, etc. FWIW, the DF is thousands of lines long, with sporadically placed ints.
The result would be a new column, something like this:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
All advice much appreciated!
You can use notna combined with cumsum:
df['group'] = df['number'].notna().cumsum()
NB: if you had zeros in place of the NaNs, use df['group'] = df['number'].ne(0).cumsum().
Output:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
You can use forward fill:
df['number'].ffill().fillna(0)
Output:
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 2.0
7 2.0
8 2.0
9 3.0
10 3.0
11 3.0
12 3.0
13 3.0
14 3.0
15 4.0
16 4.0
17 4.0
Name: number, dtype: float64
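If you want this as an integer 'group' column like in the cumsum answer, you could assign it back and cast (a sketch; the cast is safe here because every value is a whole number after the fill):
df['group'] = df['number'].ffill().fillna(0).astype(int)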
The objective is to fill NaN with respect to two columns (i.e., a and b).
a,b,c,d
2,0,1,4
5,0,5,6
6,0,1,1
1,1,1,4
4,1,5,6
5,1,5,6
6,1,1,1
1,2,2,3
6,2,5,6
Such that column a contains the continuous values 1 to 6 for each fixed value in column b, with the rows that were missing filled with NaN.
The following code snippet does the trick:
import numpy as np
import pandas as pd
maxval_col_a = 6
lowval_col_a = 1
maxval_col_b = 2
lowval_col_b = 0
r = list(range(lowval_col_b, maxval_col_b + 1))
df = pd.DataFrame(np.column_stack([[2, 5, 6, 1, 4, 5, 6, 1, 6],
                                   [0, 0, 0, 1, 1, 1, 1, 2, 2],
                                   [1, 5, 1, 1, 5, 5, 1, 2, 5],
                                   [4, 6, 1, 4, 6, 6, 1, 3, 6]]),
                  columns=['a', 'b', 'c', 'd'])
all_df = []
for idx in r:
    k = df.loc[df['b'] == idx].set_index('a').reindex(range(lowval_col_a, maxval_col_a + 1)).reset_index()
    k['b'] = idx
    all_df.append(k)
df = pd.concat(all_df)
But I am curious whether there is a more efficient and better way of doing this with pandas.
The expected output:
a b c d
0 1 0 NaN NaN
1 2 0 1.0 4.0
2 3 0 NaN NaN
3 4 0 NaN NaN
4 5 0 5.0 6.0
5 6 0 1.0 1.0
0 1 1 1.0 4.0
1 2 1 NaN NaN
2 3 1 NaN NaN
3 4 1 5.0 6.0
4 5 1 5.0 6.0
5 6 1 1.0 1.0
0 1 2 2.0 3.0
1 2 2 NaN NaN
2 3 2 NaN NaN
3 4 2 NaN NaN
4 5 2 NaN NaN
5 6 2 5.0 6.0
Create the cartesian product of combinations:
mi = pd.MultiIndex.from_product([df['b'].unique(), range(1, 7)],
names=['b', 'a']).swaplevel()
out = df.set_index(['a', 'b']).reindex(mi).reset_index()
print(out)
# Output
a b c d
0 1 0 NaN NaN
1 2 0 1.0 4.0
2 3 0 NaN NaN
3 4 0 NaN NaN
4 5 0 5.0 6.0
5 6 0 1.0 1.0
6 1 1 1.0 4.0
7 2 1 NaN NaN
8 3 1 NaN NaN
9 4 1 5.0 6.0
10 5 1 5.0 6.0
11 6 1 1.0 1.0
12 1 2 2.0 3.0
13 2 2 NaN NaN
14 3 2 NaN NaN
15 4 2 NaN NaN
16 5 2 NaN NaN
17 6 2 5.0 6.0
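An equivalent sketch without a MultiIndex, assuming pandas >= 1.2 (which added how='cross'): build the full grid of a and b values with a cross merge, then left-join the data onto it:
# full Cartesian grid of b (observed values) and a (1..6)
grid = pd.merge(pd.DataFrame({'b': df['b'].unique()}),
                pd.DataFrame({'a': range(1, 7)}),
                how='cross')
# the left join keeps every grid row, filling missing c/d with NaN
out = grid.merge(df, on=['a', 'b'], how='left')[['a', 'b', 'c', 'd']]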
First create a MultiIndex with cols [a, b], then a new MultiIndex with all the combinations, and then reindex with the new MultiIndex:
(showing all steps)
# set both a and b as index (it's a multiindex)
df.set_index(['a', 'b'], drop=True, inplace=True)
# create the new MultiIndex
new_idx_a = np.tile(np.arange(0, 6 + 1), 3)
new_idx_b = np.repeat([0, 1, 2], 6 + 1)
new_multidx = pd.MultiIndex.from_arrays([new_idx_a, new_idx_b])
# reindex
df = df.reindex(new_multidx)
# convert the MultiIndex back to columns
df.index.names = ['a', 'b']
df = df.reset_index()
Results:
a b c d
0 0 0 NaN NaN
1 1 0 NaN NaN
2 2 0 1.0 4.0
3 3 0 NaN NaN
4 4 0 NaN NaN
5 5 0 5.0 6.0
6 6 0 1.0 1.0
7 0 1 NaN NaN
8 1 1 1.0 4.0
9 2 1 NaN NaN
10 3 1 NaN NaN
11 4 1 5.0 6.0
12 5 1 5.0 6.0
13 6 1 1.0 1.0
14 0 2 NaN NaN
15 1 2 2.0 3.0
16 2 2 NaN NaN
17 3 2 NaN NaN
18 4 2 NaN NaN
19 5 2 NaN NaN
20 6 2 5.0 6.0
We can do it by using a groupby on column b, then setting a as the index and adding the missing values of a using numpy.arange.
To finish, reset the index to get the expected result:
import numpy as np
df.groupby('b').apply(lambda x: x.set_index('a').reindex(np.arange(1, 7))).drop(columns='b').reset_index()
Output :
b a c d
0 0 1 NaN NaN
1 0 2 1.0 4.0
2 0 3 NaN NaN
3 0 4 NaN NaN
4 0 5 5.0 6.0
5 0 6 1.0 1.0
6 1 1 1.0 4.0
7 1 2 NaN NaN
8 1 3 NaN NaN
9 1 4 5.0 6.0
10 1 5 5.0 6.0
11 1 6 1.0 1.0
12 2 1 2.0 3.0
13 2 2 NaN NaN
14 2 3 NaN NaN
15 2 4 NaN NaN
16 2 5 NaN NaN
17 2 6 5.0 6.0
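Note that passing the axis positionally to drop was removed in pandas 2.0; a sketch of the same idea in a form that avoids dropping b altogether, by selecting only c and d before the apply:
out = (df.set_index('a')
         .groupby('b')[['c', 'd']]
         .apply(lambda g: g.reindex(np.arange(1, 7)))
         .reset_index())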
Hi, I have the following dataframe, which is ~1,400,000 rows:
x = pd.DataFrame({'ID':['A','B','D','D','F'], 'start1':[1,2,3,4,5], 'start2':[12,11,10,6,7], 'start3':[1,6,2,4,5], 'start4':[5,4,2,3,1], 'start5':[0,0,0,0,0], 'end1':[2,3,4,7,9] })
ID start1 start2 start3 start4 start5 end1
A 1 12 1 5 0 2
B 2 11 6 4 0 3
D 3 10 2 2 0 4
D 4 6 4 3 0 7
F 5 7 5 1 0 9
I'm looking to collapse all of the columns whose headers contain 'start' or 'end' into the following format:
Desired output:
ID start end
A 1 NaN
A 12 NaN
A 1 NaN
A 5 NaN
A 0 NaN
A NaN 2
B 2 NaN
B 11 NaN
B 6 NaN
B 4 NaN
B 0 NaN
B 3 NaN
...
F 1 NaN
F 0 NaN
F NaN 9
I have tried:
joined = df2.apply(lambda x: ' '.join([str(xi) for xi in x]), axis=1)
split = joined.str.split(' ', expand=True).reset_index(drop=False).melt(id_vars='index')
However, this seems to use up all my memory and the environment crashes.
Any help would be great.
Try melting the start columns and concatenating the end column:
(pd.concat([x.iloc[:,:-1].melt('ID', value_name='start')
.sort_values(['ID','variable']).drop('variable',axis=1),
x[['ID','end1']]
])
.sort_values('ID', kind='mergesort')
)
Output:
ID start end1
0 A 1.0 NaN
5 A 12.0 NaN
10 A 1.0 NaN
15 A 5.0 NaN
20 A 0.0 NaN
0 A NaN 2.0
1 B 2.0 NaN
6 B 11.0 NaN
11 B 6.0 NaN
16 B 4.0 NaN
21 B 0.0 NaN
1 B NaN 3.0
2 D 3.0 NaN
3 D 4.0 NaN
7 D 10.0 NaN
8 D 6.0 NaN
12 D 2.0 NaN
13 D 4.0 NaN
17 D 2.0 NaN
18 D 3.0 NaN
22 D 0.0 NaN
23 D 0.0 NaN
2 D NaN 4.0
3 D NaN 7.0
4 F 5.0 NaN
9 F 7.0 NaN
14 F 5.0 NaN
19 F 1.0 NaN
24 F 0.0 NaN
4 F NaN 9.0
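Another sketch (a variant of my own, not taken from the question): melt every value column at once, then route each value into a start or an end column depending on its header:
long = x.melt('ID')  # stack all start*/end* columns into one long frame
kind = long['variable'].str.extract(r'^(start|end)', expand=False)
long['start'] = long['value'].where(kind.eq('start'))
long['end'] = long['value'].where(kind.eq('end'))
out = long.sort_values('ID', kind='mergesort')[['ID', 'start', 'end']]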
Remember that you are trying to duplicate a large amount of data here, so you need to be careful.
How about this?
import numpy as np
out = pd.DataFrame(columns=['ID', 'start', 'end'])
for col in x.columns:
    if 'start' in col:
        out_col = 'start'
    if 'end' in col:
        out_col = 'end'
    if 'ID' not in col:
        temp = x[['ID', col]].rename(columns={col: out_col})
        out = pd.concat([out, temp])
Output:
ID start end
0 A 1 NaN
0 A 0 NaN
0 A NaN 2
0 A 12 NaN
0 A 5 NaN
0 A 1 NaN
1 B 0 NaN
1 B 4 NaN
1 B NaN 3
1 B 6 NaN
1 B 11 NaN
1 B 2 NaN
2 D 10 NaN
2 D 0 NaN
2 D 2 NaN
3 D 4 NaN
3 D NaN 7
2 D NaN 4
2 D 2 NaN
3 D 3 NaN
3 D 4 NaN
2 D 3 NaN
3 D 6 NaN
3 D 0 NaN
4 F 5 NaN
4 F 1 NaN
4 F 7 NaN
4 F 5 NaN
4 F 0 NaN
4 F NaN 9
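One caution on this pattern: growing a DataFrame with pd.concat inside a loop re-copies the accumulated data on every iteration. With ~1,400,000 rows it is usually much faster to collect the pieces in a list and concatenate once at the end (a sketch):
frames = []
for col in x.columns.drop('ID'):
    out_col = 'start' if 'start' in col else 'end'
    frames.append(x[['ID', col]].rename(columns={col: out_col}))
out = pd.concat(frames).sort_values('ID', kind='mergesort')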
You can merge all the start columns into one with .ravel().
So: save end1 and ID in other variables:
end1values = x['end1']
idvalues = x['ID']
Remove end1 and ID from the data set:
x.drop('end1', axis='columns', inplace=True)
x.drop('ID', axis='columns', inplace=True)
Use ravel for the starts:
df = pd.DataFrame({'start': x.values.ravel()})
Add end1 and ID:
df['ID'] = idvalues
df['end'] = end1values
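Note, though, that after ravel() the new frame has five rows per original row, while idvalues and end1values still have one, so the two assignments above align by index and leave most rows NaN. A sketch that repeats the IDs to match (start_cols is taken from the original x, before the drops):
import numpy as np
start_cols = [c for c in x.columns if c.startswith('start')]
df = pd.DataFrame({
    'ID': np.repeat(idvalues.to_numpy(), len(start_cols)),  # one ID per raveled start value
    'start': x[start_cols].to_numpy().ravel(),
})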
If I have a pandas DataFrame like this:
0 1 2 3 4 5
1 NaN NaN 1 NaN 1 1
2 1 NaN NaN 1 NaN 1
3 NaN 1 1 NaN 1 1
4 1 1 1 1 1 1
5 NaN NaN NaN NaN NaN NaN
How do I count each group of ones and assign a value based on the group's number within each row, such that I get a data frame like this:
0 1 2 3 4 5
1 NaN NaN 1 NaN 2 2
2 1 NaN NaN 2 NaN 3
3 NaN 1 1 NaN 2 2
4 1 1 1 1 1 1
5 NaN NaN NaN NaN NaN NaN
It is a little bit hard to find a simple way:
s = df.isnull().cumsum(1)  # the cumsum over the nulls separates the runs of ones
s = s[df.notnull()].apply(lambda x: pd.factorize(x)[0], 1) + 1  # then assign the group key within each row
df = s.mask(s == 0)  # and mask 0 as NaN
df
0 1 2 3 4 5
1 NaN NaN 1.0 NaN 2.0 2.0
2 1.0 NaN NaN 2.0 NaN 3.0
3 NaN 1.0 1.0 NaN 2.0 2.0
4 1.0 1.0 1.0 1.0 1.0 1.0
5 NaN NaN NaN NaN NaN NaN
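An alternative sketch without apply, using a row-wise shift to find where each run of ones starts and numbering the runs with a cumulative sum:
m = df.notnull()
starts = m & ~m.shift(1, axis=1, fill_value=False)  # True at the first cell of each run
df = starts.cumsum(axis=1).where(m)                 # number the runs, NaN outside them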
I would like to replace the values in column df['pred'] with 0 where the respective value of df['nonzero'] is not NaN and is <= 1.
beta0 beta1 number_repair t pred nonzero
0 NaN NaN NaN 6 0 NaN
1 NaN NaN NaN 7 0 NaN
2 NaN NaN NaN 8 0 NaN
3 NaN NaN NaN 9 3 0
4 NaN NaN NaN 10 2 0
5 NaN NaN NaN 11 1 0
I tried the following code but it returned an error. How could I correct the code, or could someone suggest another way to achieve this? Thanks!
mapping['pred'] = 0 if (np.all(np.isnan(mapping['nonzero'])),
(mapping['nonzero'] <= 1)) else mapping['pred']
I think you can use loc with a boolean mask built with notnull:
mask = df['nonzero'].notnull() & (df['nonzero'] <= 1)
print(mask)
0 False
1 False
2 False
3 True
4 True
5 True
Name: nonzero, dtype: bool
As a comment points out (thank you PhilChang), NaN <= 1 evaluates to False, so it is the same as:
mask = df['nonzero'] <= 1
print(mask)
0 False
1 False
2 False
3 True
4 True
5 True
Name: nonzero, dtype: bool
df.loc[mask, 'pred'] = 0
print(df)
beta0 beta1 number_repair t pred nonzero
0 NaN NaN NaN 6 0 NaN
1 NaN NaN NaN 7 0 NaN
2 NaN NaN NaN 8 0 NaN
3 NaN NaN NaN 9 0 0.0
4 NaN NaN NaN 10 0 0.0
5 NaN NaN NaN 11 0 0.0
Another solution with mask:
df['pred'] = df['pred'].mask(mask, 0)
print(df)
beta0 beta1 number_repair t pred nonzero
0 NaN NaN NaN 6 0 NaN
1 NaN NaN NaN 7 0 NaN
2 NaN NaN NaN 8 0 NaN
3 NaN NaN NaN 9 0 0.0
4 NaN NaN NaN 10 0 0.0
5 NaN NaN NaN 11 0 0.0
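A one-line alternative sketch with numpy.where; since NaN <= 1 evaluates to False, the comparison alone already leaves the NaN rows untouched:
import numpy as np
df['pred'] = np.where(df['nonzero'] <= 1, 0, df['pred'])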
I don't know how to check on a Series whether cells contain NaN, but for the other condition this works quite well (.ix has since been removed from pandas; .loc is the replacement):
df.loc[df['nonzero'] <= 1, 'pred'] = 0
You then just have to combine the first test with your second test using &.