I am trying to do the following: on a dataframe X, I want to select all rows where X['a']>0 but I want to preserve the dimension of X, so that any other row will appear as containing NaN. Is there a fast way to do it? If one does X[X['a']>0] the dimensions of X are not preserved.
Use a double subscript, [[]]:
In [42]:
df = pd.DataFrame({'a':np.random.randn(10)})
df
Out[42]:
a
0 1.042971
1 0.978914
2 0.764374
3 -0.338405
4 0.974011
5 -0.995945
6 -1.649612
7 0.965838
8 -0.142608
9 -0.804508
In [48]:
df[df[['a']] > 1]
Out[48]:
a
0 1.042971
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
The key semantic difference is that double subscripting returns a DataFrame, so the boolean mask is applied to the DataFrame itself rather than to the index.
Note, though, that if you have multiple columns, this will mask all the other columns as NaN, since the boolean frame only has a column for 'a'.
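For the condition actually asked about (a > 0), DataFrame.where gives the same shape-preserving behaviour directly, and it also accepts a Series condition that is broadcast across all columns. A minimal sketch (the variable name masked is mine):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(10)})
# where() keeps values where the condition holds and puts NaN elsewhere,
# so the frame keeps its original dimensions
masked = df.where(df['a'] > 0)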
I have a dataframe and would like to assign multiple values from one row to multiple other rows.
I get it to work with .iloc, but for some reason, when I use conditions with .loc, it only returns NaN.
df = pd.DataFrame(dict(A = [1,2,0,0],B=[0,0,0,10],C=[3,4,5,6]))
df.index = ['a','b','c','d']
When I use loc with conditions or with direct index names:
df.loc[df['A']>0, ['B','C']] = df.loc['d',['B','C']]
df.loc[['a','b'], ['B','C']] = df.loc['d',['B','C']]
it will return
A B C
a 1.0 NaN NaN
b 2.0 NaN NaN
c 0.0 0.0 5.0
d 0.0 10.0 6.0
but when I use .iloc it actually works as expected
df.iloc[0:2,1:3] = df.iloc[3,1:3]
A B C
a 1 10 6
b 2 10 6
c 0 0 5
d 0 10 6
Is there a way to do this with .loc, or do I need to rewrite my code to get the row numbers from my mask?
When you use labels, pandas performs index alignment, and in your case there are no common indices, hence the NaNs; location-based indexing does not align.
You can assign a numpy array to prevent index alignment:
df.loc[['a','b'], ['B','C']] = df.loc['d',['B','C']].values
output:
A B C
a 1 10 6
b 2 10 6
c 0 0 5
d 0 10 6
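The same fix works with the boolean condition from the question, so there is no need to translate the mask into row numbers; a minimal sketch:
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 0, 0], B=[0, 0, 0, 10], C=[3, 4, 5, 6]),
                  index=['a', 'b', 'c', 'd'])
# .values strips the index from the right-hand side, so no alignment occurs
df.loc[df['A'] > 0, ['B', 'C']] = df.loc['d', ['B', 'C']].values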
ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS
0 104.0 PUTNAM Y 3.0
1 197.0 LEXINGTON NaN NaN
2 NaN LEXINGTON N 3.0
3 201.0 BERKELEY NaN 1.0
4 203.0 BERKELEY Y NaN
This is my DataFrame. I want to write a user-defined function that returns a DataFrame listing the number of missing values per column and the row index of each missing value.
The output df should look like this:
col_name index
st_num 2
st_num 6
st_name 8
Num_bedrooms 2
Num_bedrooms 5
Num_bedrooms 7
Num_bedrooms 8 .......
You can slice the index by each column's isnull() mask to get the indices. It is also possible with stacking and a groupby.
def summarize_missing(df):
    # Null counts per column
    s1 = df.isnull().sum().rename('No. Missing')
    # For each column, list the index labels where the value is null
    s2 = pd.Series(data=[df.index[m].tolist() for m in [df[col].isnull() for col in df.columns]],
                   index=df.columns,
                   name='Index')
    # Other way, probably overkill
    #s2 = (df.isnull().replace(False, np.NaN).stack().reset_index()
    #      .groupby('level_1')['level_0'].agg(list)
    #      .rename('Index'))
    return pd.concat([s1, s2], axis=1, sort=False)
summarize_missing(df)
# No. Missing Index
#ST_NUM 1 [2]
#ST_NAME 0 NaN
#OWN_OCCUPIED 2 [1, 3]
#NUM_BEDROOMS 2 [1, 4]
Here's another way:
m = df.isna().sum().to_frame().rename(columns={0: 'No. Missing'})
m['index'] = m.index.map(lambda x: ','.join(map(str, df.loc[df[x].isna()].index.values)))
print(m)
No. Missing index
ST_NUM 1 2
ST_NAME 0
OWN_OCCUPIED 2 1,3
NUM_BEDROOMS 2 1,4
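If you want the long, one-row-per-missing-value format shown in the question, stacking the boolean mask gets there directly. A sketch, assuming df is the frame from the question (the names mask and missing are mine):
mask = df.isna().stack()  # MultiIndex Series: (row, column) -> bool
missing = (mask[mask].reset_index()
           .rename(columns={'level_0': 'index', 'level_1': 'col_name'})
           [['col_name', 'index']])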
I am trying to learn Python 2.7 by converting code I wrote in VB to Python. I have column names and am trying to create an empty DataFrame or list, then add rows by iterating (see below). I do not know the total number of rows I will need in advance. I can create a DataFrame with the column names, but can't figure out how to add the data. I have looked at several questions like mine, but the number of rows of data is unknown in advance.
Snippet of code:
cnames=['Security','Time','Vol_21D','Vol2_21D','MaxAPV_21D','MinAPV_21D' ]
df_Calcs = pd.DataFrame(index=range(10), columns=cnames)
This creates the empty df (df_Calcs). The code below is where I get the data to fill the rows; I use n as a counter for the new row number to insert (there are 20 other columns that I add to each row), but the snippet should explain what I am trying to do.
i = 0
n = 0
while True:
    df_Calcs.Security[n] = i + 1
    df_Calcs.Time[n] = '09:30:00'
    df_Calcs.Vol_21D[n] = i + 2
    df_Calcs.Vol2_21D[n] = i + 3
    df_Calcs.MaxAPV_21D[n] = i + 4
    df_Calcs.MinAPV_21D[n] = i + 5
    i = i + 1
    n = n + 1
    if i > 4:
        break
print df_Calcs
If I should use a list or array instead, please let me know; I am trying to do this in the fastest, most efficient way. This data will then be sent to a MySQL db table.
Result...
Security Time Vol_21D Vol2_21D MaxAPV_21D MinAPV_21D
0 1 09:30:00 2 3 4 5
1 2 09:30:00 3 4 5 6
2 3 09:30:00 4 5 6 7
3 4 09:30:00 5 6 7 8
4 5 09:30:00 6 7 8 9
5 NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN
You have many ways to do that.
Create empty dataframe:
cnames=['Security', 'Time', 'Vol_21D', 'Vol2_21D', 'MaxAPV_21D', 'MinAPV_21D']
df = pd.DataFrame(columns=cnames)
Output:
Empty DataFrame
Columns: [Security, Time, Vol_21D, Vol2_21D, MaxAPV_21D, MinAPV_21D]
Index: []
Then, in a loop, you can create a pd.Series and append it to your DataFrame. Note that append returns a new DataFrame rather than modifying df in place, so assign the result back:
df = df.append(pd.Series([1, 2, 3, 4, 5, 6], index=cnames), ignore_index=True)
Or you can append a dict:
df = df.append({'Security': 1,
                'Time': 2,
                'Vol_21D': 3,
                'Vol2_21D': 4,
                'MaxAPV_21D': 5,
                'MinAPV_21D': 6
                }, ignore_index=True)
It will be the same output:
Security Time Vol_21D Vol2_21D MaxAPV_21D MinAPV_21D
0 1 2 3 4 5 6
But I think a faster and more Pythonic way is to first create a list, append all rows to it, and then build the DataFrame from it in one go:
data = []
for i in range(0, 5):
    data.append([1, 2, 3, 4, i, 6])
df = pd.DataFrame(data, columns=cnames)
I hope it helps.
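Applied to the loop in the question, the same list-accumulation pattern would look roughly like this (a sketch; note that df.append was deprecated in pandas 1.4 and removed in 2.0, which is one more reason to build the frame once at the end):
rows = []
for i in range(5):
    # collect plain dicts; no DataFrame operations inside the loop
    rows.append({'Security': i + 1, 'Time': '09:30:00',
                 'Vol_21D': i + 2, 'Vol2_21D': i + 3,
                 'MaxAPV_21D': i + 4, 'MinAPV_21D': i + 5})
df_Calcs = pd.DataFrame(rows, columns=cnames)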
I would like to fill N/A values in a DataFrame in a selective manner. In particular, if there is a sequence of consecutive NaNs within a column, I want them to be filled by the preceding non-NaN value, but only if the length of the NaN sequence is below a specified threshold. For example, if the threshold is 3, then a within-column sequence of 3 or fewer NaNs will be filled with the preceding non-NaN value, whereas a sequence of 4 or more NaNs will be left as is.
That is, if the input DataFrame is
2 5 4
nan nan nan
nan nan nan
5 nan nan
9 3 nan
7 9 1
I want the output to be:
2 5 4
2 5 nan
2 5 nan
5 5 nan
9 3 nan
7 9 1
The fillna function, when applied to a DataFrame, has the method and limit options. But these are unfortunately not sufficient to achieve the task. I tried to specify method='ffill' and limit=3, but that fills in the first 3 NaNs of any sequence, not selectively as described above.
I suppose this can be coded by going column by column with some conditional statements, but I suspect there must be something more Pythonic. Any suggestions on an efficient way to achieve this?
Working with contiguous groups is still a little awkward in pandas; or at least I don't know of a slick way to do it, which isn't at all the same thing. :-)
One way to get what you want would be to use the compare-cumsum-groupby pattern: comparing the null mask with a shifted copy of itself and taking the cumulative sum gives every contiguous run its own group id, so each run's length can then be measured:
In [68]: nulls = df.isnull()
...: groups = (nulls != nulls.shift()).cumsum()
...: to_fill = groups.apply(lambda x: x.groupby(x).transform(len) <= 3)
...: df.where(~to_fill, df.ffill())
...:
Out[68]:
0 1 2
0 2.0 5.0 4.0
1 2.0 5.0 NaN
2 2.0 5.0 NaN
3 5.0 5.0 NaN
4 9.0 3.0 NaN
5 7.0 9.0 1.0
Okay, another alternative which I don't like because it's too tricky:
def method_2(df):
    nulls = df.isnull()
    # a limited ffill fills at most the first 3 NaNs of each run
    filled = df.ffill(limit=3)
    # cells still null after the limited ffill belong to runs longer than 3
    unfilled = nulls & (~filled.notnull())
    nf = nulls.replace({False: 2.0, True: np.nan})
    # backfilling marks every cell of a too-long run with 1
    do_not_fill = nf.combine_first(unfilled.replace(False, np.nan)).bfill() == 1
    return df.where(do_not_fill, df.ffill())
This doesn't use any groupby tools and so should be faster. Note that a different approach would be to manually (using shifts) determine which elements are to be filled because they're a group of length 1, 2, or 3.
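For reference, here is a small variant of the first method that measures run lengths with value_counts/map instead of groupby/transform and restricts the fill mask to the null cells (the variable names are my own):
nulls = df.isnull()
groups = (nulls != nulls.shift()).cumsum()
# map every cell's group id to the size of its run
run_len = groups.apply(lambda col: col.map(col.value_counts()))
to_fill = nulls & (run_len <= 3)
out = df.where(~to_fill, df.ffill())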
I have an issue with pandas pivot_table. The order of the columns specified in the "values" list sometimes does not match the order in the output.
In [11]: p = pivot_table(df, values=["x","y"], cols=["month"],
                         rows="name", aggfunc=np.sum)
I get the wrong order, (y, x) instead of (x, y):
Out[12]:
y x
month 1 2 3 1 2 3
name
a 1 NaN 7 2 NaN 8
b 3 NaN 9 4 NaN 10
c NaN 5 NaN NaN 6 NaN
Is there something I'm not doing right?
According to the pandas documentation, values should take the name of a single column, not an iterable.
values : column to aggregate, optional
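If you do pass a list and only need a deterministic order, one workaround is to reorder the top level of the resulting MultiIndex columns afterwards. A sketch using the modern index=/columns= keyword names (the rows=/cols= arguments in the question belong to an old pandas API and were later removed):
p = df.pivot_table(values=['x', 'y'], columns='month',
                   index='name', aggfunc='sum')
# select the top column level in the desired order
p = p[['x', 'y']]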