Looking to preserve NaN values when changing the shape of the dataframe.
These two questions may be related:
How to preserve NaN instead of filling with zeros in pivot table?
How to make two NaN as NaN after the operation instead of making it zero?
but I haven't been able to use the answers provided - can I set a min count for np.sum somehow?
import pandas as pd
import numpy as np
df = pd.DataFrame([['Y1', np.nan], ['Y2', np.nan], ['Y1', 6], ['Y2', 8]],
                  columns=['A', 'B'],
                  index=['1988-01-01', '1988-01-01', '1988-01-04', '1988-01-04'])
df.index.name = 'Date'
df
pivot_df = pd.pivot_table(df, values='B', index=['Date'], columns=['A'], aggfunc=np.sum)
pivot_df
The output is:
A Y1 Y2
Date
1988-01-01 0.0 0.0
1988-01-04 6.0 8.0
and the desired output is:
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 6.0 8.0
From the helpful comments the following solution meets my requirements:
pivot_df_2 = pd.pivot_table(df, values='B', index=['Date'], columns=['A'], aggfunc=min, dropna=False)
pivot_df_2
Values are supposed to be unique per slot, so replacing the sum function with a min function shouldn't make a difference (in my case).
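For completeness, a min count can also be set on the sum itself, which answers the original "can I set a min count for np.sum somehow?" directly. A sketch, assuming pandas 0.22 or later where Series.sum accepts min_count; dropna=False keeps the all-NaN row in the result:
# min_count=1 makes an all-NaN group aggregate to NaN instead of 0
pivot_df_3 = pd.pivot_table(df, values='B', index=['Date'], columns=['A'],
                            aggfunc=lambda s: s.sum(min_count=1), dropna=False)
pivot_df_3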
If you have no duplicate entries, use set_index + unstack
df.set_index('A', append=True)['B'].unstack(-1)
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 6.0 8.0
If you have duplicates, use a groupby with min_count
>>> df
A B
Date
1988-01-01 Y1 NaN
1988-01-01 Y2 NaN
1988-01-04 Y1 6.0
1988-01-04 Y2 8.0
1988-01-01 Y1 NaN
1988-01-01 Y2 NaN
1988-01-04 Y1 6.0
1988-01-04 Y2 8.0
df.set_index('A', append=True).groupby(level=[0, 1])['B'].sum(min_count=1).unstack(-1)
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 12.0 16.0
In this case, I would solve it with groupby:
(df.groupby(['Date', 'A']).B
   .apply(lambda x: np.nan if x.isna().all() else x.sum())
   .unstack('A')
)
output:
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 6.0 8.0
Change isna().all() to isna().any() if needed.
It is possible to count the values, and drop when 0 (or less than the expected count):
pivot_df = pd.pivot_table(df, values='B', index=['Date'], columns=['A'],
                          aggfunc=['sum', 'count'])
# build the mask from count
mask = (pivot_df.xs('count', axis=1) == 0)  # or ... < min_limit
# build the actual pivot_df from sum
pivot_df = pivot_df.xs('sum', axis=1)
# and reset to NaN when not enough values
pivot_df[mask] = np.nan
It gives as expected:
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 6.0 8.0
This one also gives a sensible result when you are summing more than one value.
Try adding dropna=False to your pivot_table call?
Related
I have a DataFrame with columns containing duplicate data under different names:
In[1]: df
Out[1]:
   X1   X2   Y1   Y2
0  0.0  0.0  6.0  6.0
1  3.0  3.0  7.1  7.1
2  7.6  7.6  1.2  1.2
I know .drop(columns=...) exists, but is there a more efficient way to drop these without having to list the column names? If not, please let me know, as I can just use .drop().
We can use np.unique over axis 1. Unfortunately, there's no pandas built-in function to drop duplicate columns.
df.drop_duplicates only removes duplicate rows.
Return DataFrame with duplicate rows removed.
We can create a function around np.unique to drop duplicate columns.
def drop_duplicate_cols(df):
    uniq, idxs = np.unique(df, return_index=True, axis=1)
    return pd.DataFrame(uniq, index=df.index, columns=df.columns[idxs])

drop_duplicate_cols(df)
X1 Y1
0 0.0 6.0
1 3.0 7.1
2 7.6 1.2
NB: np.unique docs:
Returns the sorted unique elements of an array.
Workaround: To retain the original order, sort the idxs.
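A sketch of that workaround (the helper name is made up here): slice the original frame with the sorted indices instead of rebuilding it from the unique array, which keeps both the original column order and the original dtypes.
def drop_duplicate_cols_keep_order(df):
    # np.unique sorts its output, so only use the indices it returns,
    # sorted back into ascending (original) column order, to slice the frame
    _, idxs = np.unique(df.to_numpy(), return_index=True, axis=1)
    return df.iloc[:, np.sort(idxs)]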
Using .T on a dataframe with multiple dtypes is going to mess with your actual dtypes.
df = pd.DataFrame({'A': [0, 1], 'B': ['a', 'b'], 'C': [0, 1], 'D':[2.1, 3.1]})
df.dtypes
A int64
B object
C int64
D float64
dtype: object
df.T.T.dtypes
A object
B object
C object
D object
dtype: object
# To get back original `dtypes` we can use `.astype`
df.T.T.astype(df.dtypes).dtypes
A int64
B object
C int64
D float64
dtype: object
You could transpose with T and drop_duplicates then transpose back:
>>> df.T.drop_duplicates().T
X1 Y1
0 0.0 6.0
1 3.0 7.1
2 7.6 1.2
>>>
Or with loc and duplicated:
>>> df.loc[:, ~df.T.duplicated()]
X1 Y1
0 0.0 6.0
1 3.0 7.1
2 7.6 1.2
>>>
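A small variant on the same idea: passing keep='last' to duplicated keeps the right-most copy of each duplicated column instead, which for this example would retain X2 and Y2 rather than X1 and Y1.
df.loc[:, ~df.T.duplicated(keep='last')]   # keeps X2 and Y2 here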
I have multiple data frames that I saved in a concatenated list like below. Each df represents a matrix.
my_df = pd.concat([df1, df2, df3, .....])
How do I sum all these dfs (matrices) into one df (matrix)?
I found a discussion here, but it only answers how to add two data frames, by using code like below.
df_x.add(df_y, fill_value=0)
Should I use the code above in a loop, or is there a more concise way?
I tried print(my_df.sum()) but got a very confusing result (it suddenly turned into one row instead of a two-dimensional matrix).
Thank you.
I believe you need functools.reduce if each DataFrame in the list has the same index and column values:
np.random.seed(2018)
df1 = pd.DataFrame(np.random.choice([1,np.nan,2], size=(3,3)), columns=list('abc'))
df2 = pd.DataFrame(np.random.choice([1,np.nan,3], size=(3,3)), columns=list('abc'))
df3 = pd.DataFrame(np.random.choice([1,np.nan,4], size=(3,3)), columns=list('abc'))
print (df1)
a b c
0 2.0 2.0 2.0
1 NaN NaN 1.0
2 1.0 2.0 NaN
print (df2)
a b c
0 NaN NaN 1.0
1 3.0 3.0 3.0
2 NaN 1.0 3.0
print (df3)
a b c
0 4.0 NaN NaN
1 4.0 1.0 1.0
2 4.0 NaN 1.0
from functools import reduce
my_df = [df1,df2, df3]
df = reduce(lambda x, y: x.add(y, fill_value=0), my_df)
print (df)
a b c
0 6.0 2.0 3.0
1 7.0 4.0 5.0
2 5.0 3.0 4.0
I believe the idiomatic solution to this is to preserve the information about different DataFrames with the help of the keys parameter and then use sum on the innermost level:
dfs = [df1, df2, df3]
my_df = pd.concat(dfs, keys=['df{}'.format(i+1) for i in range(len(dfs))])
my_df.sum(level=1)
which yields
a b c
0 6.0 2.0 3.0
1 7.0 4.0 5.0
2 5.0 3.0 4.0
with jezrael's sample DataFrames.
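On more recent pandas versions, where sum(level=...) has been deprecated, the same idea is written with an explicit groupby on the inner level (a small sketch, same result):
my_df.groupby(level=1).sum()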
One method is to use sum with a list of arrays. The output here will be an array rather than a dataframe.
This assumes you need to replace np.nan with 0:
res = sum([x.fillna(0).values for x in [df1, df2, df3]])
Alternatively, you can use numpy directly in a couple of different ways:
res_np1 = np.add.reduce([x.fillna(0).values for x in [df1, df2, df3]])
res_np2 = np.nansum([x.values for x in [df1, df2, df3]], axis=0)
numpy.nansum treats np.nan as zero for summing purposes.
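If you want a DataFrame back rather than a bare array, the result can be rewrapped; a small sketch, assuming the frames share the same index and columns as in the sample data above:
res = np.nansum([x.to_numpy() for x in [df1, df2, df3]], axis=0)
summed = pd.DataFrame(res, index=df1.index, columns=df1.columns)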
As per the pandas 0.19.2 documentation, I can provide a keys argument to create a resulting multi-index DataFrame. An example (from the pandas documentation) is:
result = pd.concat(frames, keys=['x', 'y', 'z'])
How would I concat the dataframes so that I can provide the keys at the column level instead of the index level?
What I basically need is something like this: df1 and df2 concatenated side by side, each under its own top-level column key.
This is supported by the keys parameter of pd.concat when specifying axis=1:
df1 = pd.DataFrame(np.random.random((4, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.random((4, 3)), columns=list('BDF'), index=[2, 3, 6, 7])
df = pd.concat([df1, df2], keys=['X', 'Y'], axis=1)
The resulting output:
X Y
A B C D B D F
0 0.654406 0.495906 0.601100 0.309276 NaN NaN NaN
1 0.020527 0.814065 0.907590 0.924307 NaN NaN NaN
2 0.239598 0.089270 0.033585 0.870829 0.882028 0.626650 0.622856
3 0.983942 0.103573 0.370121 0.070442 0.986487 0.848203 0.089874
6 NaN NaN NaN NaN 0.664507 0.319789 0.868133
7 NaN NaN NaN NaN 0.341145 0.308469 0.884074
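A small usage note: the keys become the top level of the resulting column MultiIndex, so each original block can be pulled back out by its key:
df['X']          # the columns that came from df1
df[('Y', 'B')]   # a single column from the df2 block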
I'm trying to find the difference between the first valid value and the last valid value in a DataFrame per row.
I have working code with a for loop and am looking for something faster.
Here's an example of what I'm doing currently:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    np.arange(16).astype(float).reshape(4, 4),
    columns=['a', 'b', 'c', 'd'])
# Fill some NaN
df.loc[0, ['a', 'd']] = np.nan
df.loc[1, ['c', 'd']] = np.nan
df.loc[2, 'b'] = np.nan
df.loc[3, :] = np.nan
print(df)
# a b c d
# 0 NaN 1.0 2.0 NaN
# 1 4.0 5.0 NaN NaN
# 2 8.0 NaN 10.0 11.0
# 3 NaN NaN NaN NaN
diffs = pd.Series(index=df.index, dtype=float)
for i in df.index:
    row = df.loc[i]
    min_i = row.first_valid_index()
    max_i = row.last_valid_index()
    if min_i is None or min_i == max_i:  # 0 or 1 valid values
        continue
    diffs[i] = df.loc[i, max_i] - df.loc[i, min_i]
df['diff'] = diffs
print(df)
# a b c d diff
# 0 NaN 1.0 2.0 NaN 1.0
# 1 4.0 5.0 NaN NaN 1.0
# 2 8.0 NaN 10.0 11.0 3.0
# 3 NaN NaN NaN NaN NaN
One way would be to back and forward fill the missing values, and then just compare the first and last columns.
df2 = df.fillna(method='ffill', axis=1).fillna(method='bfill', axis=1)
df['diff'] = df2.iloc[:, -1] - df2.iloc[:, 0]
If you want to do it in one line, without creating a new dataframe:
df['diff'] = df.fillna(method='ffill', axis=1).fillna(method='bfill', axis=1).apply(lambda r: r.d - r.a, axis=1)
Pandas makes your life easy, one method (first_valid_index()) at a time. Note that you'll have to delete any rows that are all NaN first (no point in having these anyway):
For first valid values:
a = [df.loc[x, i] for x, i in enumerate(df.apply(lambda row: row.first_valid_index(), axis=1))]  # assumes the default integer index
For last valid values:
b = [df.loc[x, i] for x, i in enumerate(df.apply(lambda row: row[::-1].first_valid_index(), axis=1))]
Subtract to get the final result (convert the lists to arrays so the subtraction is element-wise):
np.array(b) - np.array(a)
I want to know the first year with incoming revenue for various projects.
Given the following, dataframe:
ID Y1 Y2 Y3
0 NaN 8 4
1 NaN NaN 1
2 NaN NaN NaN
3 5 3 NaN
I would like to return the name of the first column with a non-null value by row.
In this case, I would want to return:
['Y2','Y3',NaN,'Y1']
My goal is to add this as a column to the original dataframe.
The following code mostly works, but is really clunky.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Y1':[np.nan, np.nan, np.nan, 5],'Y2':[8, np.nan, np.nan, 3], 'Y3':[4, 1, np.nan, np.nan]})
df['first'] = np.nan
for ID in df.index:
    row = df.loc[ID,]
    for i in range(0, len(row)):
        if (~pd.isnull(row[i])):
            df.loc[ID, 'first'] = row.index[i]
            break
returns:
Y1 Y2 Y3 first
0 NaN 8 4 Y2
1 NaN NaN 1 Y3
2 NaN NaN NaN first
3 5 3 NaN Y1
Does anyone know a more elegant solution?
You can apply first_valid_index to each row in the dataframe using a lambda expression with axis=1 to specify rows.
>>> df.apply(lambda row: row.first_valid_index(), axis=1)
ID
0 Y2
1 Y3
2 None
3 Y1
dtype: object
To apply it to your dataframe:
df = df.assign(first = df.apply(lambda row: row.first_valid_index(), axis=1))
>>> df
Y1 Y2 Y3 first
ID
0 NaN 8 4 Y2
1 NaN NaN 1 Y3
2 NaN NaN NaN None
3 5 3 NaN Y1
Avoiding apply is preferable as it's not vectorized. The following is vectorized. It was tested with Pandas 1.1.
Setup
import numpy as np
import pandas as pd
df = pd.DataFrame({'Y1':[np.nan, np.nan, np.nan, 5],'Y2':[8, np.nan, np.nan, 3], 'Y3':[4, 1, np.nan, np.nan]})
# df.dropna(how='all', inplace=True) # Optional but cleaner
# For ranking only:
col_ranks = pd.DataFrame(index=df.columns, data=np.arange(1, 1 + len(df.columns)), columns=['first_notna_rank'], dtype='UInt8') # UInt8 supports max value of 255.
To find the name of the first non-null column
df['first_notna_name'] = df.dropna(how='all').notna().idxmax(axis=1).astype('string')
If df is guaranteed to have no rows with all nulls, the .dropna operation above can optionally be removed.
To then find the first non-null value
Using bfill:
df['first_notna_value'] = df[df.columns.difference(['first_notna_name'])].bfill(axis=1).iloc[:, 0]
Using melt:
df['first_notna_value'] = df.melt(id_vars='first_notna_name', value_vars=df.columns.difference(['first_notna_name']), ignore_index=False).query('first_notna_name == variable').merge(df[[]], how='right', left_index=True, right_index=True).loc[df.index, 'value']
If df is guaranteed to have no rows with all nulls, the .merge operation above can optionally be removed.
To rank the name
df = df.merge(col_ranks, how='left', left_on='first_notna_name', right_index=True)
Is there a better way?
Output
Y1 Y2 Y3 first_notna_name first_notna_value first_notna_rank
0 NaN 8.0 4.0 Y2 8.0 2
1 NaN NaN 1.0 Y3 1.0 3
2 NaN NaN NaN <NA> NaN <NA>
3 5.0 3.0 NaN Y1 5.0 1
Partial credit: answers by me, piRSquared, and Andy
Applied to a dataframe with only one row, this returns the name of the last column in that row containing a non-null value (use [0] instead of [-1] for the first one):
row.columns[~(row.loc[:].isna()).all()][-1]
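A quick usage sketch with a hypothetical one-row frame (the data here is made up for illustration):
row = pd.DataFrame({'Y1': [np.nan], 'Y2': [8.0], 'Y3': [4.0]})
row.columns[~(row.loc[:].isna()).all()][-1]   # -> 'Y3', the last non-null column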