I have this DataFrame and want to keep only the records whose "Total" column is not NaN, while also dropping records where more than two of the columns A~E are NaN:
A B C D E Total
1 1 3 5 5 8
1 4 3 5 5 NaN
3 6 NaN NaN NaN 6
2 2 5 9 NaN 8
i.e. something like df.dropna(...) to get this resulting DataFrame:
A B C D E Total
1 1 3 5 5 8
2 2 5 9 NaN 8
Here's my code
import pandas as pd
dfInputData = pd.read_csv(path)
dfInputData = dfInputData.dropna(axis=1, how='any')
RowCnt = dfInputData.shape[0]
But it looks like no modification has been made, and no error is raised either.
Please help!! Thanks
Use boolean indexing: count the missing values across all columns except Total, and require that Total itself is not missing:
df = df[df.drop('Total', axis=1).isna().sum(axis=1).le(2) & df['Total'].notna()]
print (df)
A B C D E Total
0 1 1 3.0 5.0 5.0 8.0
3 2 2 5.0 9.0 NaN 8.0
Or filter columns between A:E:
df = df[df.loc[:, 'A':'E'].isna().sum(axis=1).le(2) & df['Total'].notna()]
print (df)
A B C D E Total
0 1 1 3.0 5.0 5.0 8.0
3 2 2 5.0 9.0 NaN 8.0
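Both variants can be checked end to end; here is a minimal, self-contained sketch using the sample data from the question (reading from a CSV path, as in the asker's code, would work the same way):

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'A': [1, 1, 3, 2],
                   'B': [1, 4, 6, 2],
                   'C': [3, 3, np.nan, 5],
                   'D': [5, 5, np.nan, 9],
                   'E': [5, 5, np.nan, np.nan],
                   'Total': [8, np.nan, 6, 8]})

# Keep rows where Total is present and A..E contain at most two NaNs
keep = df.drop('Total', axis=1).isna().sum(axis=1).le(2) & df['Total'].notna()
out = df[keep]
print(out)
```

Note that dropna alone cannot express this condition, because the threshold applies only to the A~E columns while Total has its own rule; that is why the boolean mask is built explicitly.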
I have the following Dataset:
col value
0 A 1
1 A NaN
2 B NaN
3 B NaN
4 B NaN
5 B 1
6 C 3
7 C NaN
8 C NaN
9 D 5
10 E 6
There is only one value set per group; the rest are NaN. What I want to do now is fill the NaNs with the value of the group. If a group has no NaNs, I just want to ignore it.
Outcome should look like this:
col value
0 A 1
1 A 1
2 B 1
3 B 1
4 B 1
5 B 1
6 C 3
7 C 3
8 C 3
9 D 5
10 E 6
What I've tried so far is the following:
df["value"] = df.groupby("col")["value"].transform(lambda x: x.fillna(x.mean()))
However, this method is not only super slow, it also doesn't give me the desired result.
Anybody an idea?
It depends on the data. If there is at most one non-missing value per group, you can sort and then replace with GroupBy.ffill; this works well even if some groups contain only NaNs:
df = df.sort_values(['col','value'])
df["value"] = df.groupby('col')["value"].ffill()
# if there is always exactly one non-missing value per group; fails if some group is all NaN
# df["value"] = df["value"].ffill()
print (df)
col value
0 A 1.0
1 A 1.0
5 B 1.0
2 B 1.0
3 B 1.0
4 B 1.0
6 C 3.0
7 C 3.0
8 C 3.0
9 D 5.0
10 E 6.0
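As a runnable, self-contained sketch of the sort-and-ffill approach (data as in the question; sorting puts each group's non-missing value first, so the per-group forward fill reaches every NaN):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': list('AABBBBCCCDE'),
                   'value': [1, np.nan, np.nan, np.nan, np.nan, 1,
                             3, np.nan, np.nan, 5, 6]})

# NaNs sort last within each group, so the known value comes first
df = df.sort_values(['col', 'value'])
# Forward-fill within each group only; an all-NaN group would stay NaN
df['value'] = df.groupby('col')['value'].ffill()
print(df)
```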
Or, if there are multiple values per group and the NaNs should be replaced by the group mean, improve the performance of your solution by computing the means with GroupBy.transform and passing them to Series.fillna:
df["value"] = df["value"].fillna(df.groupby('col')["value"].transform('mean'))
print (df)
col value
0 A 1.0
1 A 1.0
5 B 1.0
2 B 1.0
3 B 1.0
4 B 1.0
6 C 3.0
7 C 3.0
8 C 3.0
9 D 5.0
10 E 6.0
You can use ffill, which is the same as fillna() with method='ffill' (see the docs):
df["value"] = df["value"].ffill()
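A minimal check of this on the example data. Note that a plain ffill ignores the grouping: each NaN takes the last value seen anywhere above it, so it only gives the desired result here because group B's value happens to equal the value carried over from group A:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': list('AABBBBCCCDE'),
                   'value': [1, np.nan, np.nan, np.nan, np.nan, 1,
                             3, np.nan, np.nan, 5, 6]})

# Plain ffill: each NaN takes the last preceding value, regardless of group.
# B's leading NaNs are filled from A's value, which is coincidentally correct
# in this example because both groups' value is 1.
df['value'] = df['value'].ffill()
print(df)
```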
I'm trying to group by in pandas, sort the values, and have a Results column show what you need to add to get to the next row in the group, or, if you are at the end of the group, the value 3. Anyone have an idea how to do it?
import pandas as pd
df = pd.DataFrame({'label': 'a a b c b c'.split(), 'Val': [2,6,6, 4,16, 8]})
df
label Val
0 a 2
1 a 6
2 b 6
3 c 4
4 b 16
5 c 8
I'd like the results as shown below: within each group the values are taken in sorted order, so e.g. in group a you have to add 4 to 2 to get 6. Where there is no next value in the group, the resulting NaN should be replaced with the value 3. I have shown below what the results should look like:
label Val Results
0 a 2 4.0
1 a 6 3.0
2 b 6 10.0
3 c 4 4.0
4 b 16 3.0
5 c 8 3.0
I tried this, and was thinking of shifting values up but the problem is that the labels aren't sorted.
df['Results'] = df.groupby('label')['Val'].transform(lambda x: x - x.shift())
df
label Val Results
0 a 2 NaN
1 a 6 4.0
2 b 6 NaN
3 c 4 NaN
4 b 16 10.0
5 c 8 4.0
Hope someone can help:D!
Use groupby, diff and abs:
df['Results'] = abs(df.groupby('label')['Val'].diff(-1)).fillna(3)
label Val Results
0 a 2 4.0
1 a 6 3.0
2 b 6 10.0
3 c 4 4.0
4 b 16 3.0
5 c 8 3.0
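A self-contained sketch of this answer: within each label, diff(-1) computes current minus next value, abs() turns that into the step size, and fillna(3) handles the last row of each group:

```python
import pandas as pd

df = pd.DataFrame({'label': 'a a b c b c'.split(),
                   'Val': [2, 6, 6, 4, 16, 8]})

# diff(-1) per group: value minus the next value in that group (NaN at the
# group's end); abs() gives the step size; fillna(3) fills the group ends.
df['Results'] = df.groupby('label')['Val'].diff(-1).abs().fillna(3)
print(df)
```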
Say I have an incomplete dataset in a Pandas DataFrame such as:
incData = pd.DataFrame({'comp': ['A']*3 + ['B']*5 + ['C']*4,
'x': [1,2,3] + [1,2,3,4,5] + [1,2,3,4],
'y': [3,None,7] + [1,4,7,None,None] + [4,None,2,1]})
And also a DataFrame with fitting parameters that I could use to fill holes:
fitTable = pd.DataFrame({'slope': [2,3,-1],
'intercept': [1,-2,5]},
index=['A','B','C'])
I would like to achieve the following using y=x*slope+intercept for the None entries only:
comp x y
0 A 1 3.0
1 A 2 5.0
2 A 3 7.0
3 B 1 1.0
4 B 2 4.0
5 B 3 7.0
6 B 4 10.0
7 B 5 13.0
8 C 1 4.0
9 C 2 3.0
10 C 3 2.0
11 C 4 1.0
One way I envisioned is by using join and drop:
incData = incData.join(fitTable,on='comp')
incData.loc[incData['y'].isnull(),'y'] = incData[incData['y'].isnull()]['x']*\
incData[incData['y'].isnull()]['slope']+\
incData[incData['y'].isnull()]['intercept']
incData.drop(['slope','intercept'], axis=1, inplace=True)
However, that does not seem very efficient, because it adds and removes columns. It seems that I am making this too complicated; am I overlooking a simpler, more direct solution? Something more like this non-functional code:
incData.loc[incData['y'].isnull(),'y'] = incData[incData['y'].isnull()]['x']*\
fitTable[incData[incData['y'].isnull()]['comp']]['slope']+\
fitTable[incData[incData['y'].isnull()]['comp']]['intercept']
I am pretty new to Pandas, so I sometimes get a bit mixed up with the strict indexing rules...
You can use map on the column 'comp', restricted by a mask of the null values in 'y', like:
mask = incData['y'].isna()
incData.loc[mask, 'y'] = incData.loc[mask, 'x']*\
incData.loc[mask,'comp'].map(fitTable['slope']) +\
incData.loc[mask,'comp'].map(fitTable['intercept'])
As for your non-functional code, I guess it would be something like:
incData.loc[mask,'y'] = incData.loc[mask, 'x']*\
fitTable.loc[incData.loc[mask, 'comp'],'slope'].to_numpy()+\
fitTable.loc[incData.loc[mask, 'comp'],'intercept'].to_numpy()
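Putting the map version together as a runnable sketch, with the data from the question:

```python
import pandas as pd

incData = pd.DataFrame({'comp': ['A']*3 + ['B']*5 + ['C']*4,
                        'x': [1, 2, 3] + [1, 2, 3, 4, 5] + [1, 2, 3, 4],
                        'y': [3, None, 7] + [1, 4, 7, None, None] + [4, None, 2, 1]})
fitTable = pd.DataFrame({'slope': [2, 3, -1],
                         'intercept': [1, -2, 5]},
                        index=['A', 'B', 'C'])

# Fill only the missing y values, mapping each row's comp to its fit parameters
mask = incData['y'].isna()
incData.loc[mask, 'y'] = (incData.loc[mask, 'x']
                          * incData.loc[mask, 'comp'].map(fitTable['slope'])
                          + incData.loc[mask, 'comp'].map(fitTable['intercept']))
print(incData)
```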
IIUC:
incData.loc[pd.isna(incData['y']), 'y'] = incData[pd.isna(incData['y'])].apply(
    lambda row: row['x'] * fitTable.loc[row['comp'], 'slope']
                + fitTable.loc[row['comp'], 'intercept'],
    axis=1)
incData
comp x y
0 A 1 3.0
1 A 2 5.0
2 A 3 7.0
3 B 1 1.0
4 B 2 4.0
5 B 3 7.0
6 B 4 10.0
7 B 5 13.0
8 C 1 4.0
9 C 2 3.0
10 C 3 2.0
11 C 4 1.0
merge is another option
# merge two dataframe together on comp
m = incData.merge(fitTable, left_on='comp', right_index=True)
# y = mx+b
m['y'] = m['x']*m['slope']+m['intercept']
comp x y slope intercept
0 A 1 3 2 1
1 A 2 5 2 1
2 A 3 7 2 1
3 B 1 1 3 -2
4 B 2 4 3 -2
5 B 3 7 3 -2
6 B 4 10 3 -2
7 B 5 13 3 -2
8 C 1 4 -1 5
9 C 2 3 -1 5
10 C 3 2 -1 5
11 C 4 1 -1 5
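In this example the fit parameters reproduce the known y values exactly, so recomputing y everywhere happens to be harmless. A variant that fills only the missing values and then drops the helper columns (a sketch, redefining the frames from the question so it runs standalone):

```python
import pandas as pd

incData = pd.DataFrame({'comp': ['A']*3 + ['B']*5 + ['C']*4,
                        'x': [1, 2, 3] + [1, 2, 3, 4, 5] + [1, 2, 3, 4],
                        'y': [3, None, 7] + [1, 4, 7, None, None] + [4, None, 2, 1]})
fitTable = pd.DataFrame({'slope': [2, 3, -1],
                         'intercept': [1, -2, 5]},
                        index=['A', 'B', 'C'])

# Merge the fit parameters onto each row by comp
m = incData.merge(fitTable, left_on='comp', right_index=True)
# Fill only the missing y values, then drop the helper columns
m['y'] = m['y'].fillna(m['x'] * m['slope'] + m['intercept'])
out = m.drop(columns=['slope', 'intercept'])
print(out)
```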
I would like to apply a function that acts like fillna() but takes a different value than nan. Unfortunately DataFrame.replace() will not work in my case. Here is an example: Given a DataFrame:
df = pd.DataFrame([[1,2,3],[4,-1,-1],[5,6,-1],[7,8,None]])
0 1 2
0 1 2.0 3.0
1 4 -1.0 -1.0
2 5 6.0 -1.0
3 7 8.0 NaN
I am looking for a function which will output:
0 1 2
0 1 2.0 3.0
1 4 2.0 3.0
2 5 6.0 3.0
3 7 8.0 NaN
So df.replace() with to_replace=-1 and method='ffill' will not work, because it requires a column-independent value to replace the -1 entries, while in my example the replacement is column-dependent. I know I can code it with a loop, but I am looking for efficient code, as it will be applied to a large DataFrame. Any suggestions? Thank you.
You can just replace the value with NaN and then call ffill:
In [3]:
df.replace(-1, np.nan).ffill()
Out[3]:
0 1 2
0 1 2.0 3.0
1 4 2.0 3.0
2 5 6.0 3.0
3 7 8.0 3.0
I think you're overthinking this
EDIT
If you already have NaN values that must be preserved, then create a boolean mask and update just the -1 elements, using ffill on the frame with the -1 entries masked out:
In [15]:
df[df == -1] = df[df != -1].ffill()
df
Out[15]:
0 1 2
0 1 2 3
1 4 2 3
2 5 6 3
3 7 8 NaN
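Checked end to end as a self-contained sketch (the constructor includes the NaN row from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, -1, -1], [5, 6, -1], [7, 8, np.nan]])

# df[df != -1] turns every -1 into NaN; ffill fills those holes from above,
# and the assignment writes the filled values back only where df == -1, so
# the pre-existing NaN in the last row is left untouched (NaN == -1 is False).
df[df == -1] = df[df != -1].ffill()
print(df)
```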
Another method (thanks to @DSM in the comments) is to use where to do essentially the same thing as above:
In [17]:
df.where(df != -1, df.replace(-1, np.nan).ffill())
Out[17]:
0 1 2
0 1 2 3
1 4 2 3
2 5 6 3
3 7 8 NaN
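The where variant can likewise be verified on the full data, including the pre-existing NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, -1, -1], [5, 6, -1], [7, 8, np.nan]])

# Keep original values wherever df != -1; elsewhere (the -1 cells) take the
# forward-filled replacement. A pre-existing NaN satisfies NaN != -1, so it
# is kept as NaN rather than being filled.
out = df.where(df != -1, df.replace(-1, np.nan).ffill())
print(out)
```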