I have a pandas DataFrame tdf
I am extracting a slice based on a boolean condition:
idx = tdf['MYcol1'] == 1
myslice = tdf.loc[idx]  # I want myslice to be a view, not a copy
Now I want to fill the missing values in a column of myslice, and I want this to be reflected in tdf, my original DataFrame:
myslice.loc[:, 'MYcol2'].fillna(myslice['MYcol2'].mean(), inplace=True)  # 1
myslice.ix[:, 'MYcol2'].fillna(myslice['MYcol2'].mean(), inplace=True)  # 2
Both 1 and 2 above throw the warning: A value is trying to be set on a copy of a slice from a DataFrame.
What am I doing wrong?
When you assign the slice to a new variable, it creates a copy. What you do after that is irrelevant. Consider this:
tdf
Out:
A B C
0 NaN 0.195070 -1.781563
1 -0.729045 0.196557 0.354758
2 0.616887 0.008628 NaN
3 NaN NaN 0.037006
4 0.767902 NaN NaN
5 -0.805627 NaN NaN
6 1.133080 NaN -0.659892
7 -1.139802 0.784958 -0.554310
8 -0.470638 -0.216950 NaN
9 -0.392389 -3.046143 0.543312
idx = tdf['A'] > 0
myslice = tdf.loc[idx]
Fill the NaNs in myslice:
myslice.loc[:,'B'].fillna(myslice['B'].mean(), inplace = True)
C:\Anaconda3\envs\p3\lib\site-packages\pandas\core\generic.py:3191: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._update_inplace(new_data)
myslice
Out:
A B C
2 0.616887 0.008628 NaN
4 0.767902 0.008628 NaN
6 1.133080 0.008628 -0.659892
tdf
Out:
A B C
0 NaN 0.195070 -1.781563
1 -0.729045 0.196557 0.354758
2 0.616887 0.008628 NaN
3 NaN NaN 0.037006
4 0.767902 NaN NaN
5 -0.805627 NaN NaN
6 1.133080 NaN -0.659892
7 -1.139802 0.784958 -0.554310
8 -0.470638 -0.216950 NaN
9 -0.392389 -3.046143 0.543312
It is not reflected in tdf, because myslice is internally flagged as a copy:
myslice.is_copy
Out: <weakref at 0x000001CC842FD318; to 'DataFrame' at 0x000001CC8422D6A0>
If you change it to:
tdf.loc[:, 'B'].fillna(tdf.loc[idx, 'B'].mean(), inplace=True)
tdf
Out:
A B C
0 NaN 0.195070 -1.781563
1 -0.729045 0.196557 0.354758
2 0.616887 0.008628 NaN
3 NaN 0.008628 0.037006
4 0.767902 0.008628 NaN
5 -0.805627 0.008628 NaN
6 1.133080 0.008628 -0.659892
7 -1.139802 0.784958 -0.554310
8 -0.470638 -0.216950 NaN
9 -0.392389 -3.046143 0.543312
then it works. On the right-hand side you can also use myslice['B'].mean(), because there you are only reading those values, not updating them. The left-hand side, however, should be the original DataFrame.
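If you only want to fill the rows selected by idx (rather than every NaN in column B), a minimal sketch that also avoids the warning is to assign through .loc on the original frame in one step:

tdf.loc[idx, 'B'] = tdf.loc[idx, 'B'].fillna(tdf.loc[idx, 'B'].mean())  # no intermediate slice variable, so no chained assignment

This reads the selected values, fills them, and writes the result straight back into tdf.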
I have the following dataframe:
import pandas as pd
df = pd.DataFrame(index=[0, 1, 2], columns=["test1", "test2"])
df.at[1, "test1"] = 3
df.at[2, "test2"] = 5
print(df)
test1 test2
0 NaN NaN
1 3 NaN
2 NaN 5
I tried the following line in order to set all NaN values at indices 1 and 2 to False:
df.loc[[1, 2] & pd.isna(df)] = False
However, this gives me an error.
My expected output would be:
test1 test2
0 NaN NaN
1 3 False
2 False 5
You can do this:
In [917]: df.loc[1:2] = df.loc[1:2].fillna(False)
In [918]: df
Out[918]:
test1 test2
0 NaN NaN
1 3 False
2 False 5
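Note that .loc slices by label and is inclusive of both endpoints, so df.loc[1:2] covers rows 1 and 2 here. With a non-contiguous selection you could pass a list of labels instead:

df.loc[[1, 2]] = df.loc[[1, 2]].fillna(False)  # same result for an arbitrary set of row labels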
pd.isna(df) is a mask the shape of your DataFrame, and you can't use that as a slice in a .loc call. In this case you want to selectively update the null values of your DataFrame on specific rows, so we can use .fillna on the row subset and update to assign the changes back.
df.update(df.loc[[1, 2]].fillna(False))
print(df)
test1 test2
0 NaN NaN
1 3 False
2 False 5
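If you want to keep the frame-shaped mask idea from the question, a sketch that makes it legal (assuming the same df as above) is to blank out the mask outside the target rows and assign through it:

mask = pd.isna(df)
mask.loc[~df.index.isin([1, 2])] = False  # keep only rows 1 and 2 in the mask
df[mask] = False  # df[frame_mask] = value writes wherever the mask is True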
Let us try fillna and pass a dict, filling row-wise through a double transpose:
df = df.T.fillna(dict.fromkeys(df.index[1:], False)).T
test1 test2
0 NaN NaN
1 3 False
2 False 5
I am trying to learn Python 2.7 by converting code I wrote in VB to Python. I have column names, and I am trying to create an empty DataFrame or list and then add rows by iterating (see below). I do not know the total number of rows I will need to add in advance. I can create a DataFrame with the column names, but I can't figure out how to add the data. I have looked at several questions like mine, but in those the rows/columns of the data are also unknown in advance.
snippet of code:
cnames=['Security','Time','Vol_21D','Vol2_21D','MaxAPV_21D','MinAPV_21D' ]
df_Calcs = pd.DataFrame(index=range(10), columns=cnames)
This creates the empty df (df_Calcs). The code below is where I get the data to fill the rows; I use n as a counter for the new row number to insert (there are 20 other columns that I add to each row), but the snippet should explain what I am trying to do.
i = 0
n = 0
while True:
    df_Calcs.Security[n] = i + 1
    df_Calcs.Time[n] = '09:30:00'
    df_Calcs.Vol_21D[n] = i + 2
    df_Calcs.Vol2_21D[n] = i + 3
    df_Calcs.MaxAPV_21D[n] = i + 4
    df_Calcs.MinAPV_21D[n] = i + 5
    i = i + 1
    n = n + 1
    if i > 4:
        break
print df_Calcs
If I should use a list or array instead, please let me know; I am trying to do this in the fastest, most efficient way. This data will then be sent to a MySQL db table.
Result...
Security Time Vol_21D Vol2_21D MaxAPV_21D MinAPV_21D
0 1 09:30:00 2 3 4 5
1 2 09:30:00 3 4 5 6
2 3 09:30:00 4 5 6 7
3 4 09:30:00 5 6 7 8
4 5 09:30:00 6 7 8 9
5 NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN
There are many ways to do that.
Create empty dataframe:
cnames=['Security', 'Time', 'Vol_21D', 'Vol2_21D', 'MaxAPV_21D', 'MinAPV_21D']
df = pd.DataFrame(columns=cnames)
Output:
Empty DataFrame
Columns: [Security, Time, Vol_21D, Vol2_21D, MaxAPV_21D, MinAPV_21D]
Index: []
Then, in a loop, you can create a pd.Series and append it to your DataFrame (append returns a new frame rather than modifying df in place), for example:
df = df.append(pd.Series([1, 2, 3, 4, 5, 6], index=cnames), ignore_index=True)
Or you can append a dict:
df = df.append({'Security': 1,
                'Time': 2,
                'Vol_21D': 3,
                'Vol2_21D': 4,
                'MaxAPV_21D': 5,
                'MinAPV_21D': 6}, ignore_index=True)
The output will be the same:
Security Time Vol_21D Vol2_21D MaxAPV_21D MinAPV_21D
0 1 2 3 4 5 6
But I think a faster and more Pythonic way is to first create a list, append all the rows to it, and then make the DataFrame from that list:
data = []
for i in range(5):
    data.append([1, 2, 3, 4, i, 6])
df = pd.DataFrame(data, columns=cnames)
I hope it helps.
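Applied to the loop in the question, a sketch of the list-first approach (reconstructing the same column values from the counters shown above) would be:

data = []
for i in range(5):
    # one row per iteration, mirroring the per-column assignments in the question
    data.append([i + 1, '09:30:00', i + 2, i + 3, i + 4, i + 5])
df_Calcs = pd.DataFrame(data, columns=cnames)

This builds a plain Python list first and constructs the DataFrame once at the end, instead of growing it cell by cell.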
I have an issue with pandas pivot_table: sometimes, the order of the columns specified in the "values" list does not match the order in the output.
In [11]: p = pivot_table(df, values=["x", "y"], cols=["month"],
                         rows="name", aggfunc=np.sum)
I get the wrong order (y, x) instead of (x, y):
Out[12]:
y x
month 1 2 3 1 2 3
name
a 1 NaN 7 2 NaN 8
b 3 NaN 9 4 NaN 10
c NaN 5 NaN NaN 6 NaN
Is there something I'm not doing right?
According to the pandas documentation, values should take the name of a single column, not an iterable.
values : column to aggregate, optional
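That said, if you do pass a list and just want your original order back, a sketch (assuming the MultiIndex column layout shown above) is to reselect the outermost column level after pivoting:

p = p[["x", "y"]]  # reorders the top level of the columns back to (x, y)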
Suppose I have a DataFrame as follows:
import pandas as pd
columns=['A','B','C','D', 'E', 'F']
index=['1','2','3','4','5','6']
df = pd.DataFrame(columns=columns,index=index)
df['D']['1'] = 1
df['E'] = 1
df['F']['1'] = 1
df['A']['2'] = 1
df['B']['3'] = 1
df['C']['4'] = 1
df['A']['5'] = 1
df['B']['5'] = 1
df['C']['5'] = 1
df['D']['6'] = 1
df['F']['6'] = 1
df
A B C D E F
1 NaN NaN NaN 1 1 1
2 1 NaN NaN NaN 1 NaN
3 NaN 1 NaN NaN 1 NaN
4 NaN NaN 1 NaN 1 NaN
5 1 1 1 NaN 1 NaN
6 NaN NaN NaN 1 1 1
My condition is: I want to remove the columns that have values only where A, B, C (together) don't have values; in other words, I want to find the columns that are mutually exclusive with the A, B, C columns, and keep only the columns that have values when A or B or C has values. The output here would be to remove columns D and F. But my DataFrame has 400 columns, and I want a way to check this for A, B, C vs. the rest of the columns.
One way I can think of is to remove the NA rows from A, B, C:
df = df[np.isfinite(df['A'])]
df = df[np.isfinite(df['B'])]
df = df[np.isfinite(df['C'])]
then get the NA count of every column, check it against the total number of rows,
df.isnull().sum()
and remove the columns whose counts match.
Is there a better, more efficient way to do this?
Thanks
Rather than deleting rows one column at a time, just select the rows where A, B, and C are not all NaN at the same time:
mask = df[["A", "B", "C"]].isnull().all(axis=1)
df = df[~mask]
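To scale this to the full 400 columns, here is a minimal sketch of the column check itself, under the interpretation that a "mutually exclusive" column has non-null values only on rows where A, B, and C are all NaN:

abc_absent = df[['A', 'B', 'C']].isnull().all(axis=1)  # rows with no A/B/C value
others = df.columns.difference(['A', 'B', 'C'])
# drop a column if it has no values at all on the rows where A/B/C are present
to_drop = [c for c in others if df.loc[~abc_absent, c].isnull().all()]
df = df.drop(to_drop, axis=1)

On the sample frame above this drops D and F and keeps E.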
I am trying to do the following: on a dataframe X, I want to select all rows where X['a']>0 but I want to preserve the dimension of X, so that any other row will appear as containing NaN. Is there a fast way to do it? If one does X[X['a']>0] the dimensions of X are not preserved.
Use double subscript [[]]:
In [42]:
df = pd.DataFrame({'a':np.random.randn(10)})
df
Out[42]:
a
0 1.042971
1 0.978914
2 0.764374
3 -0.338405
4 0.974011
5 -0.995945
6 -1.649612
7 0.965838
8 -0.142608
9 -0.804508
In [48]:
df[df[['a']] > 1]
Out[48]:
a
0 1.042971
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
The key semantic difference here is that the double subscript returns a DataFrame rather than a Series, so the comparison masks the df itself rather than the index, and the original shape is preserved with NaN wherever the condition fails.
Note, though, that if you have multiple columns, this will mask all the other columns as NaN as well.
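If you need the other columns to keep their values on the matching rows, a sketch using DataFrame.where with a row-level condition, which broadcasts across all columns, would be:

out = df.where(df['a'] > 0)  # non-matching rows become NaN in every column; matching rows are untouched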