I have a data frame and want to create a separate column. This column must be based on the right-most value in each row, but if that value is NaN/None it should be skipped (i.e. take the right-most non-missing value).
Data frame:
Column_0 Column_1 Column_2 Column_3
nan nan nan nan
1 2 nan nan
1 2 3 4
1 nan 3 nan
Output:
Column_Output
nan
2
4
3
I searched for solutions... but even finding the right search terms was causing me trouble. Thanks a lot in advance!
First forward fill missing values along the rows, then select the last column:
df['Column_Output'] = df.ffill(axis=1).iloc[:, -1]
print (df)
Column_0 Column_1 Column_2 Column_3 Column_Output
0 NaN NaN NaN NaN NaN
1 1.0 2.0 NaN NaN 2.0
2 1.0 2.0 3.0 4.0 4.0
3 1.0 NaN 3.0 NaN 3.0
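For reference, a minimal runnable sketch of the whole approach (the sample frame is rebuilt from the question, so the column names are an assumption):
import numpy as np
import pandas as pd

# sample frame from the question
df = pd.DataFrame({'Column_0': [np.nan, 1, 1, 1],
                   'Column_1': [np.nan, 2, 2, np.nan],
                   'Column_2': [np.nan, np.nan, 3, 3],
                   'Column_3': [np.nan, np.nan, 4, np.nan]})

# forward fill along each row, then take the last column:
# it now holds the right-most non-missing value of every row
df['Column_Output'] = df.ffill(axis=1).iloc[:, -1]
print(df)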
I'm trying to pivot a dataframe in pandas to produce heatmaps (pandas version 1.4.3). The issue is that after pivoting, the original sorting of the index column is lost. Since my data represent samples from geographical locations, I need them to be sorted by latitude (which is how they are in the 'TILE' column in the example below).
MWE:
import pandas as pd

dummy = [{'TILE':'N59TE010A','METRIC':'ELD_RMSE','LOW':2},
{'TILE':'N59TE009G','METRIC':'ELD_RMSE','LOW':2},
{'TILE':'N59RE009G','METRIC':'ELD_RMSE','LOW':1},
{'TILE':'N59TE010B','METRIC':'ELD_RMSE','LOW':2},
{'TILE':'N59TE010C','METRIC':'ELD_RMSE','LOW':2},
{'TILE':'S24TW047F','METRIC':'RUF_RMSE','LOW':2},
{'TILE':'S24TW047G','METRIC':'ELD_LE90','LOW':2},
{'TILE':'S24MW047D','METRIC':'SMD_LE90','LOW':2},
{'TILE':'S24MW047C','METRIC':'RUF_RMSE','LOW':0},
{'TILE':'S24MW047D','METRIC':'RUF_RMSE','LOW':0}]
df = pd.DataFrame.from_dict(dummy)
df
TILE METRIC LOW
0 N59TE010A ELD_RMSE 2
1 N59TE009G ELD_RMSE 2
2 N59RE009G ELD_RMSE 1
3 N59TE010B ELD_RMSE 2
4 N59TE010C ELD_RMSE 2
5 S24TW047F RUF_RMSE 2
6 S24TW047G ELD_LE90 2
7 S24MW047D SMD_LE90 2
8 S24MW047C RUF_RMSE 0
9 S24MW047D RUF_RMSE 0
df.pivot(index='TILE', columns='METRIC', values='LOW')
METRIC ELD_LE90 ELD_RMSE RUF_RMSE SMD_LE90
TILE
N59RE009G NaN 1.0 NaN NaN
N59TE009G NaN 2.0 NaN NaN
N59TE010A NaN 2.0 NaN NaN
N59TE010B NaN 2.0 NaN NaN
N59TE010C NaN 2.0 NaN NaN
S24MW047C NaN NaN 0.0 NaN
S24MW047D NaN NaN 0.0 2.0
S24TW047F NaN NaN 2.0 NaN
S24TW047G 2.0 NaN NaN NaN
Never mind the NaN values, the point is that the first row should have tile N59TE010A, and not N59RE009G (and so on).
I've been trying a few solutions I found here and elsewhere but without luck. Is there a way to preserve the sorting of the 'TILE' column?
Thanks
You can use pivot_table, which has more options, including sort=False:
df.pivot_table(index='TILE', columns='METRIC', values='LOW', sort=False)
Another option is to add a dummy column with the desired order to use as part of the index, using for example pandas.factorize to keep the original order of appearance.
(df.assign(idx=pd.factorize(df['TILE'])[0])
   .pivot(index=['idx', 'TILE'], columns='METRIC', values='LOW')
   .droplevel('idx')
)
output:
METRIC ELD_LE90 ELD_RMSE RUF_RMSE SMD_LE90
TILE
N59TE010A NaN 2.0 NaN NaN
N59TE009G NaN 2.0 NaN NaN
N59RE009G NaN 1.0 NaN NaN
N59TE010B NaN 2.0 NaN NaN
N59TE010C NaN 2.0 NaN NaN
S24TW047F NaN NaN 2.0 NaN
S24TW047G 2.0 NaN NaN NaN
S24MW047D NaN NaN 0.0 2.0
S24MW047C NaN NaN 0.0 NaN
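As a further sketch (not part of the original answer), you could also keep the plain pivot and simply reindex the result with the original order of appearance of 'TILE':
# df is the sample frame built from `dummy` above
out = (df.pivot(index='TILE', columns='METRIC', values='LOW')
         .reindex(df['TILE'].unique()))   # unique() preserves order of appearance
print(out)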
I need to replace NaN values in my df with a static 0, but only after the first non-NaN value in each column.
In a way, it combines method="ffill" (identify the first valid value per column and act only on the NaN values that follow it) with value=0 (fill with 0, not with the forward-filled value).
How can I do that? This post is close, but not it: How to replace NaNs by preceding or next values in pandas DataFrame?
Example df
0 1 2
0 NaN NaN NaN
1 6.0 NaN 1.0
2 NaN 3.0 NaN
3 NaN NaN 4.0
Desired output:
0 1 2
0 NaN NaN NaN
1 6.0 NaN 1.0
2 0.0 3.0 0.0
3 0.0 0.0 4.0
If possible, df.fillna(value=0, method='ffill') would be great. But that returns ValueError: Cannot specify both 'value' and 'method'.
Edit: Oh, and time matters. We are talking ~60M rows and 4k columns, so looping is out of the question, and masking only if it is really, really fast.
You can try mask(), ffill() and fillna():
df = df.fillna(df.mask(df.ffill().notna(), 0))
# or via where
df = df.fillna(df.where(df.ffill().isna(), 0))
output:
0 1 2
0 NaN NaN NaN
1 6.0 NaN 1.0
2 0.0 3.0 0.0
3 0.0 0.0 4.0
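A compact equivalent (a sketch, not from the original answer) is a single mask: set a cell to 0 only where it is NaN and some earlier value in its column is not, which keeps everything in one vectorized pass for the large frame mentioned in the edit:
df = df.mask(df.isna() & df.ffill().notna(), 0)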
I am taking a column from a csv file and inputting the data from it into an array using pandas. However, many of the cells are empty and get saved in the array as 'nan'. I want to either identify the empty cells so I can skip them or remove them all from the array after. Something like the following pseudo-code:
if df.row(column number) == nan
    skip
or
if df.row(column number) != nan
    do stuff
Basically, how do I identify whether a cell from the csv file is empty?
Best is to get rid of the NaN rows after you load it, by indexing:
df = df[df['column_to_check'].notnull()]
For example to get rid of NaN values found in column 3 in the following dataframe:
>>> df
0 1 2 3 4
0 1.0 1.0 NaN 1.0 1.0
1 1.0 NaN 1.0 1.0 1.0
2 NaN NaN NaN NaN NaN
3 NaN 1.0 1.0 NaN NaN
4 1.0 NaN NaN 1.0 1.0
>>> df[df[3].notnull()]
0 1 2 3 4
0 1.0 1.0 NaN 1.0 1.0
1 1.0 NaN 1.0 1.0 1.0
4 1.0 NaN NaN 1.0 1.0
pd.isnull() and pd.notnull() are the standard ways of checking individual null values if you're iterating over a DataFrame row by row and indexing by column, as you suggest in your code above. You can then use this check to do whatever you like with that value.
Example:
import pandas as pd
import numpy as np
a = np.nan
pd.isnull(a)
Out[4]: True
pd.notnull(a)
Out[5]: False
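If you do end up looping as in your pseudo-code, a hedged sketch could look like the following (df stands for your DataFrame loaded from the csv, and 'column_to_check' is just a placeholder for your column name):
import pandas as pd

for value in df['column_to_check']:
    if pd.isnull(value):   # empty csv cell -> NaN
        continue           # skip it
    # ... do stuff with the non-empty value ...
    print(value)
That said, the vectorized filtering shown above is usually preferable for anything but small frames.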
If you want to manipulate all (or certain) NaN values in a DataFrame, note that handling missing data is a big topic when working with tabular data and there are many methods for doing so. I'd recommend chapter 7 of this book; its first section would be most pertinent to your question.
If you just want to exclude missing values, you can use pd.DataFrame.dropna()
Below is an example based on the one described by @sacul:
>>> import pandas as pd
>>> df
0 1 2 3 4
0 0.0 1.0 NaN 1.0 1.0
1 1.0 NaN 1.0 1.0 1.0
2 NaN NaN NaN NaN NaN
3 NaN 1.0 1.0 NaN NaN
4 1.0 NaN NaN 1.0 1.0
>>> df.dropna(axis=0, subset=[3])
0 1 2 3 4
0 0.0 1.0 NaN 1.0 1.0
1 1.0 NaN 1.0 1.0 1.0
4 1.0 NaN NaN 1.0 1.0
axis=0 indicates that rows (not columns) containing NaN are dropped.
subset=[3] restricts the check to column 3 only.
See the link above for details.
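For completeness, a small sketch that rebuilds a frame like the one shown and applies dropna with a subset (the values are only illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame([[0.0, 1.0, np.nan, 1.0, 1.0],
                   [1.0, np.nan, 1.0, 1.0, 1.0],
                   [np.nan] * 5,
                   [np.nan, 1.0, 1.0, np.nan, np.nan],
                   [1.0, np.nan, np.nan, 1.0, 1.0]])

# keep only the rows where column 3 is not NaN
print(df.dropna(axis=0, subset=[3]))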
I'm preparing data for machine learning where data is in pandas DataFrame which looks like this:
Column v1 v2
first 1 2
second 3 4
third 5 6
Now I want to transform it into:
Column v1 v2 first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
first 1 2 1 2 NaN NaN NaN NaN
second 3 4 NaN NaN 3 4 NaN NaN
third 5 6 NaN NaN NaN NaN 5 6
What I've tried is something like this:
# we know how many values there are but
# length can be changed into length of [1, 2, 3, ...] values
values = ['v1', 'v2']
# data with description from above is saved in data
for value in values:
    data[str(data['Column'] + '-' + value)] = data[value]
The result is columns with names like:
['first-v1' 'second-v1' ...], ['first-v2' 'second-v2' ...]
where the values are correct. What am I doing wrong? Is there a more optimal way to do this, since my data is big?
Thank you for your time!
You can use unstack with swapping and sorting the MultiIndex in columns:
df = (data.set_index('Column', append=True)[values].unstack()
          .swaplevel(0, 1, axis=1).sort_index(axis=1))
df.columns = df.columns.map('-'.join)
print (df)
first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
0 1.0 2.0 NaN NaN NaN NaN
1 NaN NaN 3.0 4.0 NaN NaN
2 NaN NaN NaN NaN 5.0 6.0
Or stack + unstack:
df = data.set_index('Column', append=True).stack().unstack([1,2])
df.columns = df.columns.map('-'.join)
print (df)
first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
0 1.0 2.0 NaN NaN NaN NaN
1 NaN NaN 3.0 4.0 NaN NaN
2 NaN NaN NaN NaN 5.0 6.0
Last, join to the original:
df = data.join(df)
print (df)
Column v1 v2 first-v1 first-v2 second-v1 second-v2 third-v1 \
0 first 1 2 1.0 2.0 NaN NaN NaN
1 second 3 4 NaN NaN 3.0 4.0 NaN
2 third 5 6 NaN NaN NaN NaN 5.0
third-v2
0 NaN
1 NaN
2 6.0
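To make this reproducible end to end, here is a hedged sketch that rebuilds the question's sample data and runs the first variant:
import pandas as pd

# sample data from the question
data = pd.DataFrame({'Column': ['first', 'second', 'third'],
                     'v1': [1, 3, 5],
                     'v2': [2, 4, 6]})
values = ['v1', 'v2']

wide = (data.set_index('Column', append=True)[values].unstack()
            .swaplevel(0, 1, axis=1).sort_index(axis=1))
wide.columns = wide.columns.map('-'.join)
print(data.join(wide))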
I am trying to create a very large dataframe, made up of one column from each of many smaller dataframes (renamed to the dataframe's name). I am using concat() and looping through dictionary values which represent dataframes, and looping over index values, to create the large dataframe. The concat() join_axes is the common index to all the dataframes. This works fine, however I then have duplicate column names.
I must be able to loop over the indexes at specific windows as part of my final dataframe creation, so removing this step isn't an option.
The result is a final dataframe with duplicate columns. Is there any way I can use concat() exactly as I am, but merge the duplicate columns into single columns?
I think you need:
df = pd.concat([df1, df2])
Or, if there are duplicates in the columns, use groupby, where overlapping values are summed:
print (df.groupby(level=0, axis=1).sum())
Sample:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [5, 8, 7, np.nan],
                    'B': [1, np.nan, np.nan, 9],
                    'C': [7, 3, np.nan, 0]})
df2 = pd.DataFrame({'A': [np.nan, np.nan, np.nan, 2],
                    'B': [1, 2, np.nan, np.nan],
                    'C': [np.nan, 6, np.nan, 3]})
print (df1)
A B C
0 5.0 1.0 7.0
1 8.0 NaN 3.0
2 7.0 NaN NaN
3 NaN 9.0 0.0
print (df2)
A B C
0 NaN 1.0 NaN
1 NaN 2.0 6.0
2 NaN NaN NaN
3 2.0 NaN 3.0
df = pd.concat([df1, df2],axis=1)
print (df)
A B C A B C
0 5.0 1.0 7.0 NaN 1.0 NaN
1 8.0 NaN 3.0 NaN 2.0 6.0
2 7.0 NaN NaN NaN NaN NaN
3 NaN 9.0 0.0 2.0 NaN 3.0
print (df.groupby(level=0, axis=1).sum())
A B C
0 5.0 2.0 7.0
1 8.0 2.0 9.0
2 7.0 NaN NaN
3 2.0 9.0 3.0
What you want is df1.combine_first(df2). Refer to the pandas documentation.
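For comparison, a sketch of combine_first on the same df1/df2 from the sample above: unlike groupby(...).sum(), overlapping cells are taken from df1 and only filled from df2 where df1 is missing:
print(df1.combine_first(df2))
A B C
0 5.0 1.0 7.0
1 8.0 2.0 3.0
2 7.0 NaN NaN
3 2.0 9.0 0.0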