I need to populate the NaN values of my df with a static 0, starting from the first non-NaN value in each column.
In effect, I want to combine method="ffill" (identify the first non-NaN value per column, and only act on the NaN values that follow it) with value=0 (filling with 0, not with the forward-filled quantity from df).
How can I do that? This post is close, but not it: How to replace NaNs by preceding or next values in pandas DataFrame?
Example df
0 1 2
0 NaN NaN NaN
1 6.0 NaN 1.0
2 NaN 3.0 NaN
3 NaN NaN 4.0
Desired output:
0 1 2
0 NaN NaN NaN
1 6.0 NaN 1.0
2 0.0 3.0 0.0
3 0.0 0.0 4.0
If possible, df.fillna(value=0, method='ffill') would be great. But that returns ValueError: Cannot specify both 'value' and 'method'.
Edit: Oh, and time matters. We are talking ~60M rows and 4k columns, so looping is out of the question, and masking is only an option if it is really, really fast.
You can try mask(), ffill() and fillna():
df=df.fillna(df.mask(df.ffill().notna(),0))
#OR via where
df=df.fillna(df.where(df.ffill().isna(),0))
output:
0 1 2
0 NaN NaN NaN
1 6.0 NaN 1.0
2 0.0 3.0 0.0
3 0.0 0.0 4.0
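Given the sizes mentioned in the question (~60M rows, 4k columns), a slightly leaner sketch of the same idea is to build the boolean mask once and assign in place; this is an assumption about performance, not part of the answer above:
# Positions that are NaN but come after the first non-NaN value in their column
fill_here = df.isna() & df.ffill().notna()
df[fill_here] = 0   # or equivalently: df = df.mask(fill_here, 0)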
Related
I want to fill column e's NaN values with the value of its closest non-NaN column (by position, looking from the left side).
a b c d e
0 1 2.0 3.0 6.0 3.0
1 3 5.0 7.0 NaN NaN
2 2 4.0 NaN NaN NaN
3 5 6.0 NaN NaN NaN
4 3 NaN NaN NaN NaN
For example, for the second row, the closest non-NaN column to e by position is c, so we take 7.0. Is it possible to do this in Pandas? Thanks.
The expected output is like this:
a b c d e
0 1 2.0 3.0 6.0 3.0
1 3 5.0 7.0 NaN 7.0
2 2 4.0 NaN NaN 4.0
3 5 6.0 NaN NaN 6.0
4 3 NaN NaN NaN 3.0
This can be simplified: forward fill the missing values along each row, so the last column ends up holding the closest non-missing value from the left, then select that last column by position:
df.e = df.ffill(axis=1).iloc[:, -1]
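For reference, a small self-contained reproduction of this approach on the question's data (a minimal sketch, not from the original answer):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 2, 5, 3],
                   'b': [2.0, 5.0, 4.0, 6.0, np.nan],
                   'c': [3.0, 7.0, np.nan, np.nan, np.nan],
                   'd': [6.0, np.nan, np.nan, np.nan, np.nan],
                   'e': [3.0, np.nan, np.nan, np.nan, np.nan]})

# Forward fill along each row; the last column then holds the closest
# non-NaN value to the left of (or in) column e.
df['e'] = df.ffill(axis=1).iloc[:, -1]
print(df)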
I have a data frame and want to create a separate column. This column must be based on the rightmost value in the data frame; but if that value is NaN/None, skip that column and take the next one to the left.
Data frame:
Column_0 Column_1 Column_2 Column_3
nan nan nan nan
1 2 nan nan
1 2 3 4
1 nan 3 nan
Output:
Column_Output
nan
2
4
3
I searched for solutions... but even finding the right search terms was causing me trouble. Thanks a lot in advance!
First forward fill the missing values along each row, then select the last column:
df['Column_Output'] = df.ffill(axis=1).iloc[:, -1]
print (df)
Column_0 Column_1 Column_2 Column_3 Column_Output
0 NaN NaN NaN NaN NaN
1 1.0 2.0 NaN NaN 2.0
2 1.0 2.0 3.0 4.0 4.0
3 1.0 NaN 3.0 NaN 3.0
I am taking a column from a csv file and inputting the data from it into an array using pandas. However, many of the cells are empty and get saved in the array as 'nan'. I want to either identify the empty cells so I can skip them or remove them all from the array after. Something like the following pseudo-code:
if df.row(column number) == nan
skip
or
if df.row(column number) != nan
do stuff
Basically, how do I identify whether a cell from the csv file is empty?
Best is to get rid of the NaN rows after you load the data, by boolean indexing:
df = df[df['column_to_check'].notnull()]
For example, to get rid of rows with NaN values in column 3 of the following dataframe:
>>> df
0 1 2 3 4
0 1.0 1.0 NaN 1.0 1.0
1 1.0 NaN 1.0 1.0 1.0
2 NaN NaN NaN NaN NaN
3 NaN 1.0 1.0 NaN NaN
4 1.0 NaN NaN 1.0 1.0
>>> df[df[3].notnull()]
0 1 2 3 4
0 1.0 1.0 NaN 1.0 1.0
1 1.0 NaN 1.0 1.0 1.0
4 1.0 NaN NaN 1.0 1.0
pd.isnull() and pd.notnull() are standard ways of checking individual null values if you're iterating over a DataFrame row by row and indexing by column as you suggest in your code above. You could then use this expression to do whatever you like with that value.
Example:
import pandas as pd
import numpy as np
a = np.nan
pd.isnull(a)
Out[4]: True
pd.notnull(a)
Out[5]: False
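If the goal is to "do stuff" only with the non-empty cells, the same check can be applied to a whole column at once instead of cell by cell. A minimal sketch (the file name and column name are placeholders, not from the question):
import pandas as pd

df = pd.read_csv('data.csv')                    # placeholder file name
non_empty = df.loc[df['col'].notnull(), 'col']  # 'col' is a placeholder column
for value in non_empty:
    print(value)                                # "do stuff" with each non-NaN value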
If you want to manipulate all (or certain) NaN values in a DataFrame: handling missing data is a big topic when working with tabular data and there are many methods of doing so. I'd recommend chapter 7 of this book; its first section would be the most pertinent to your question.
If you just want to exclude missing values, you can use pd.DataFrame.dropna()
Below is an example based on the one described by @sacul:
>>> import pandas as pd
>>> df
0 1 2 3 4
0 0.0 1.0 NaN 1.0 1.0
1 1.0 NaN 1.0 1.0 1.0
2 NaN NaN NaN NaN NaN
3 NaN 1.0 1.0 NaN NaN
4 1.0 NaN NaN 1.0 1.0
>>> df.dropna(axis=0, subset=[3])
0 1 2 3 4
0 0.0 1.0 NaN 1.0 1.0
1 1.0 NaN 1.0 1.0 1.0
4 1.0 NaN NaN 1.0 1.0
axis=0 indicates that rows containing NaN are excluded.
subset=[3] indicates to only consider column 3.
See the link above for details.
I want to assign to every value its position within the sorted list of values in its row (excluding NaNs). I just can't figure out how to do this in an elegant way with pandas.
I think it's easier to explain in an example:
A B C D
Date
2002-02-28 -0.051272 -0.005851 -0.012669 NaN
2002-03-29 0.103416 0.050121 0.050203 0.5
2002-04-30 -0.090579 -0.042308 0.019293 0.03
2002-05-31 0.160239 -0.078983 0.047319 0.66
For every row, I want to do the following:
Exclude NaNs
Calculate the position of the value within the list of sorted values in that row and assign this number (position 1 being the smallest (negative) number and position N being the biggest positive number)
The result would be:
A B C D
Date
2002-02-28 1 3 2 NaN
2002-03-29 3 1 2 4
2002-04-30 1 2 3 4
2002-05-31 3 1 2 4
In a second step, I then want to apply a 3-row "rolling" function per column which checks whether the current row and the 2 rows before it within one column were all smaller than a certain threshold X. If so, display the average of those 3 values, else note down NaN. If any of those 3 values is NaN, also note down NaN. This can only be calculated from 2002-04-30 onward, because at least 3 values are needed. For column D, this would yield NaN in row 2002-04-30 because there are only two numeric values beforehand. For column D and row 2002-05-31 it would also yield NaN because the 3 values are 4, 4 and 4, with 4 being greater than the threshold.
Let's say the threshold X=3. (I leave out column D because my explanations make the data too wide):
E.g.:
A B C
Date
2002-02-28 NaN NaN NaN
2002-03-29 NaN NaN NaN
2002-04-30 Avg(1,3,1) Avg(3,1,2) Avg(2,2,3)
2002-05-31 Avg(3,1,3) Avg(1,2,1) Avg(2,3,2)
EDIT:
I think I got both steps myself. Could you please evaluate whether this is correct and sensible?:
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'X': [0.1, 0.2, 0.3, 0.4], 'Y': [0.5, -0.2, np.nan, -1], 'Z': [np.nan, -0.21, -5, 10]})
df.apply(lambda row: [sorted([y for y in row if not np.isnan(y)]).index(x)+1 if not np.isnan(x) else np.nan for x in row], axis=1)
df:
X Y Z
0 0.1 0.5 NaN
1 0.2 -0.2 -0.21
2 0.3 NaN -5.00
3 0.4 -1.0 10.00
After .apply:
X Y Z
0 1.0 2.0 NaN
1 3.0 2.0 1.0
2 2.0 NaN 1.0
3 2.0 1.0 3.0
# Step 2 with new example data and only one column
df = pd.DataFrame(data={'A': [1, 2, 3, np.nan, 3, 1, 3, 4, 3, np.nan, 2, 2, 1, 2]})
threshold = 3
df['A_rolling'] = df['A'].rolling(window=3, min_periods=3).apply(lambda x: x.mean() if all([val <= threshold for val in x]) else np.nan)
A A_rolling
0 1.0 NaN
1 2.0 NaN
2 3.0 2.000000
3 NaN NaN
4 3.0 NaN
5 1.0 NaN
6 3.0 2.333333
7 4.0 NaN
8 3.0 NaN
9 NaN NaN
10 2.0 NaN
11 2.0 NaN
12 1.0 1.666667
13 2.0 1.666667
So only gotta run it for all columns now :)
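A rough sketch of doing that in one go (my assumption, not verified): a DataFrame's rolling().apply() is evaluated column by column, so the same lambda can run over the whole step-1 frame at once, with min_periods=3 taking care of windows that contain a NaN.
result = df.rolling(window=3, min_periods=3).apply(
    lambda x: x.mean() if (x <= threshold).all() else np.nan
)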
Any ideas?
Thanks
For step one you could use the rank method:
step1 = df.rank(axis=1)
A B C D
Date
2002-02-28 1.0 3.0 2.0 NaN
2002-03-29 3.0 1.0 2.0 4.0
2002-04-30 1.0 2.0 3.0 4.0
2002-05-31 3.0 1.0 2.0 4.0
For step two it might be less verbose to replace all values greater than the threshold with NaN and run a rolling mean:
threshold = 3
step1[step1 > threshold] = np.nan
step2 = step1.rolling(window=3, min_periods=3).mean()
A B C D
Date
2002-02-28 NaN NaN NaN NaN
2002-03-29 NaN NaN NaN NaN
2002-04-30 1.666667 2.000000 2.333333 NaN
2002-05-31 2.333333 1.333333 2.333333 NaN
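The two steps can also be chained into a single expression; a compact sketch equivalent to the above (assuming threshold = 3):
step2 = (df.rank(axis=1)                       # step 1: per-row ranks, NaNs stay NaN
           .where(lambda r: r <= threshold)    # ranks above the threshold become NaN
           .rolling(window=3, min_periods=3)
           .mean())                            # windows containing any NaN stay NaN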
I am trying to create a very large dataframe, made up of one column from many smaller dataframes (renamed to the dataframe name). I am using pd.concat() and looping through dictionary values which represent dataframes, and looping over index values, to create the large dataframe. The concat() join_axes is the common index to all the dataframes. This works fine; however, I then have duplicate column names.
I must be able to loop over the indexes at specific windows as part of my final dataframe creation, so removing this step isn't an option.
For example, this results in a final dataframe with duplicate columns. Is there any way I can use pd.concat() exactly as I am, but merge the duplicate columns in the output?
I think you need:
df = pd.concat([df1, df2])
Or, if there are duplicate column names, use groupby, where overlapping values are summed:
print (df.groupby(level=0, axis=1).sum())
Sample:
df1 = pd.DataFrame({'A':[5,8,7, np.nan],
'B':[1,np.nan,np.nan,9],
'C':[7,3,np.nan,0]})
df2 = pd.DataFrame({'A':[np.nan,np.nan,np.nan,2],
'B':[1,2,np.nan,np.nan],
'C':[np.nan,6,np.nan,3]})
print (df1)
A B C
0 5.0 1.0 7.0
1 8.0 NaN 3.0
2 7.0 NaN NaN
3 NaN 9.0 0.0
print (df2)
A B C
0 NaN 1.0 NaN
1 NaN 2.0 6.0
2 NaN NaN NaN
3 2.0 NaN 3.0
df = pd.concat([df1, df2],axis=1)
print (df)
A B C A B C
0 5.0 1.0 7.0 NaN 1.0 NaN
1 8.0 NaN 3.0 NaN 2.0 6.0
2 7.0 NaN NaN NaN NaN NaN
3 NaN 9.0 0.0 2.0 NaN 3.0
print (df.groupby(level=0, axis=1).sum())
A B C
0 5.0 2.0 7.0
1 8.0 2.0 9.0
2 7.0 NaN NaN
3 2.0 9.0 3.0
What you want is df1.combine_first(df2). Refer to the pandas documentation.
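For illustration, applying it to the sample frames above keeps df1's values and patches its holes with the matching cells from df2 (a quick sketch; the commented output is what combine_first produces here):
print(df1.combine_first(df2))
#      A    B    C
# 0  5.0  1.0  7.0
# 1  8.0  2.0  3.0
# 2  7.0  NaN  NaN
# 3  2.0  9.0  0.0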