I have two separate data frames, and I want to do the same thing to both. I want to pair the columns according to the first substring before the underscore (a, b, y, z), and then, if a value in the first column of a pair contains a word but the corresponding row in its totals column is null, I want to update that total to zero. I want to update both data frames and then output them separately.
import pandas as pd
import numpy as np

d1 = pd.DataFrame(data={'a': ['yes', 'no', 'maybe', 'sometimes', np.nan],
                        'a_total': [5, 12, 4, np.nan, 0],
                        'b': ['blue', 'orange', 'pink', np.nan, np.nan],
                        'b_total': [12, 6, 0, 0, np.nan]})
d2 = pd.DataFrame(data={'y': ['frog', 'snail', 'snake', 'spider', 'pig'],
                        'y_total': [182, 32, 13, np.nan, 8],
                        'z': ['car', 'bike', 'walk', np.nan, np.nan],
                        'z_total': [12, 6, np.nan, np.nan, np.nan]})
Then I want to do something to both data frames and output the updated versions separately. My current code, copied below, is not outputting properly. I am trying to output a dictionary of dataframes, but if I can just output the two data frames (d1 and d2) somehow, that would also be good.
out = {}
for i, df in enumerate([d1, d2]):
    key_id = [*df.loc[:, ~df.columns.str.endswith('total')].columns]
    totals = [*df.loc[:, df.columns.str.endswith('total')].columns]
    for col in key_id:
        pairs = df.loc[:, df.columns.str.startswith(col)]
        pairs[col + '_total'].loc[(pairs[col].notnull()) & (pairs[col + '_total'].isnull())] = 0
        out[i] = pd.concat([pairs], axis=1)
Thank you for looking.
Not sure I exactly understand what you need in your output, but maybe this works?
import pandas as pd
import numpy as np

d1 = pd.DataFrame(data={'a': ['yes', 'no', 'maybe', 'sometimes', np.nan],
                        'a_total': [5, 12, 4, np.nan, 0],
                        'b': ['blue', 'orange', 'pink', np.nan, np.nan],
                        'b_total': [12, 6, 0, 0, np.nan]})
d2 = pd.DataFrame(data={'y': ['frog', 'snail', 'snake', 'spider', 'pig'],
                        'y_total': [182, 32, 13, np.nan, 8],
                        'z': ['car', 'bike', 'walk', np.nan, np.nan],
                        'z_total': [12, 6, np.nan, np.nan, np.nan]})

# show d1 before making changes
print(d1)

# make the changes directly to d1 and d2
for i, df in enumerate([d1, d2]):
    cols = [c for c in df.columns if not c.endswith('total')]
    for col in cols:
        tot_col = col + '_total'
        df.loc[df[col].notnull() & df[tot_col].isnull(), tot_col] = 0

# show d1 after making changes
print(d1)
d1 before changes:
a a_total b b_total
0 yes 5.0 blue 12.0
1 no 12.0 orange 6.0
2 maybe 4.0 pink 0.0
3 sometimes NaN NaN 0.0
4 NaN 0.0 NaN NaN
d1 after changes:
a a_total b b_total
0 yes 5.0 blue 12.0
1 no 12.0 orange 6.0
2 maybe 4.0 pink 0.0
3 sometimes 0.0 NaN 0.0
4 NaN 0.0 NaN NaN
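If a dictionary of the updated frames is still wanted, the same loop from the answer can populate one while leaving the originals untouched. A minimal sketch, assuming the sample frames from the question (the dict keys 'd1' and 'd2' are just an illustrative choice):

```python
import pandas as pd
import numpy as np

d1 = pd.DataFrame({'a': ['yes', 'no', 'maybe', 'sometimes', np.nan],
                   'a_total': [5, 12, 4, np.nan, 0],
                   'b': ['blue', 'orange', 'pink', np.nan, np.nan],
                   'b_total': [12, 6, 0, 0, np.nan]})
d2 = pd.DataFrame({'y': ['frog', 'snail', 'snake', 'spider', 'pig'],
                   'y_total': [182, 32, 13, np.nan, 8],
                   'z': ['car', 'bike', 'walk', np.nan, np.nan],
                   'z_total': [12, 6, np.nan, np.nan, np.nan]})

out = {}
for name, df in {'d1': d1, 'd2': d2}.items():
    df = df.copy()  # work on a copy so the original frame is not modified
    for col in [c for c in df.columns if not c.endswith('total')]:
        tot = col + '_total'
        # word present but total missing -> set total to 0
        df.loc[df[col].notna() & df[tot].isna(), tot] = 0
    out[name] = df

print(out['d1'])
print(out['d2'])
```

Drop the `.copy()` if modifying d1 and d2 in place (as in the answer above) is acceptable.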
I searched a lot but couldn't find an answer about which way (merge vs. combine_first vs. fillna + append) is the best / most efficient / most robust way to combine two dataframes that are complementary (but the data in the first should not be overridden).
import pandas as pd

test1 = {'name': ['name a', 'name b', 'name c', 'name d', 'name e'],
         'val1': [1, 1, 1, 1, 1],
         'val2': [None, None, None, None, None],
         'val3': [4, 4, 4, None, None],
         'val4': [5, 5, 5, 5, 5]}
d1 = pd.DataFrame(test1)

test2 = {'name': ['name a', 'name c', 'name b', 'name e', 'name f'],
         'val1': [1, 2, 3, 3, 4],
         'val3': [3, 3, None, 3, None],
         'val2': [2, 2, 3, 5, None],
         'val5': [6, 6, 6, 6, 6]}
d2 = pd.DataFrame(test2)

d1.set_index('name', inplace=True)
d2.set_index('name', inplace=True)

# V1
d3 = d1.combine_first(d2)
d3.reset_index('name', inplace=True)

# V2
adddf = d2[~d2.index.isin(d1.index)]
d4 = d1.append(adddf)
d4 = d4.fillna(d2)

# V3
d5 = pd.merge(left=d1, right=d2, left_on='name', right_on='name', suffixes=('', '_x'), how='outer')
d5['val1'].fillna(d5['val1_x'], inplace=True)
d5['val2'].fillna(d5['val2_x'], inplace=True)
d5['val3'].fillna(d5['val3_x'], inplace=True)
d5.drop(d5.filter(regex='_x$').columns.tolist(), axis=1, inplace=True)
Background:
I have a set of data points that I want to complete (with data from multiple sources; each source is used to add new rows or fill in missing columns of existing data, but should not override existing data).
Each row has one identifier (name) column and there will/should not be any duplicates; if a source has conflicting data, I want to keep the current value (as long as there already is data; if not, fill in the new value).
From my research, all three approaches work equally well (with small dataframes there were no big time differences; that could change with bigger dfs, which I didn't test) and were able to handle the multiple edge cases I introduced:
row order switched in df2 (name c, name b)
column order switched in df2 (val3, val2)
different data in df2 that should not replace data in df1
missing data in df1 and df2
What should happen:
a: 1, nan, 4, 5 -> 1, 2, 4 (not 3), 5, 6
b: 1, nan, 4, 5 -> 1, 3 (not 2), 4, 5, 6
c: 1, nan, 4, 5 -> 1 (not 2/3), 2 (not 3), 4, 5, 6
d: 1, n, n, 5 -> 1, n, n, 5, n (not 6)
e: -> 1 (not 3), 5, 3, 5, 6
f: -> 4, n, n, n, 6
Which one would you prefer or is there an even better way for what I'm trying to achieve?
Try:
x = d1.merge(d2, on="name", how="outer", suffixes=("", "_x"))
for c in d1.columns:
    if c + "_x" not in x.columns:
        continue
    x[[c, c + "_x"]] = x[[c, c + "_x"]].transform(sorted, key=pd.isna, axis=1)
print(x[d1.columns.union(d2.columns)])
Prints:
name val1 val2 val3 val4 val5
0 name a 1.0 2.0 4.0 5.0 6.0
1 name b 1.0 3.0 4.0 5.0 6.0
2 name c 1.0 2.0 4.0 5.0 6.0
3 name d 1.0 NaN NaN 5.0 NaN
4 name e 1.0 5.0 3.0 5.0 6.0
5 name f 4.0 NaN NaN NaN 6.0
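For comparison, the V1 route from the question (`combine_first`) already satisfies every rule listed: it keeps d1's values wherever they exist and fills gaps (missing cells and missing rows) from d2. A minimal sketch using the question's data:

```python
import pandas as pd

test1 = {'name': ['name a', 'name b', 'name c', 'name d', 'name e'],
         'val1': [1, 1, 1, 1, 1],
         'val2': [None, None, None, None, None],
         'val3': [4, 4, 4, None, None],
         'val4': [5, 5, 5, 5, 5]}
test2 = {'name': ['name a', 'name c', 'name b', 'name e', 'name f'],
         'val1': [1, 2, 3, 3, 4],
         'val3': [3, 3, None, 3, None],
         'val2': [2, 2, 3, 5, None],
         'val5': [6, 6, 6, 6, 6]}
d1 = pd.DataFrame(test1).set_index('name')
d2 = pd.DataFrame(test2).set_index('name')

# combine_first: d1's non-null values win; NaNs and missing rows come from d2
d3 = d1.combine_first(d2).reset_index()
print(d3)
```

It aligns on the index regardless of row or column order in d2, which is why the switched-order edge cases don't affect it.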
I have a pandas dataframe (starting_df) with nan values in the left-hand columns. I'd like to shift all values over to the left for a left-aligned dataframe. My dataframe is 24x24, but for argument's sake I'm just posting a 4x4 version.
After some cool initial answers here, I modified the dataframe to also include a non-leading nan, whose position I'd like to preserve.
I have a piece of code that accomplishes what I want, but it relies on nested for-loops and suppressing an IndexError, which does not feel very pythonic. I have no experience with error handling in general, but simply suppressing an error does not seem to be the right strategy.
Starting dataframe and desired final dataframe:
Here is the code that (poorly) accomplishes the goal.
import pandas as pd
import numpy as np

def get_left_aligned(starting_df):
    """Take a starting df with right-aligned numbers and nan, and
    turn it into a left-aligned table."""
    left_aligned_df = pd.DataFrame()
    for temp_index_1 in range(0, starting_df.shape[0]):
        temp_series = []
        for temp_index_2 in range(0, starting_df.shape[0]):
            try:
                temp_series.append(starting_df.iloc[temp_index_2, temp_index_2 + temp_index_1])
            except IndexError:
                pass
        temp_series = pd.DataFrame(temp_series, columns=['col' + str(temp_index_1 + 1)])
        left_aligned_df = pd.concat([left_aligned_df, temp_series], axis=1)
    return left_aligned_df
df = pd.DataFrame(dict(col1=[1, np.nan, np.nan, np.nan],
                       col2=[5, 2, np.nan, np.nan],
                       col3=[7, np.nan, 3, np.nan],
                       col4=[9, 8, 6, 4]))
df_expected = pd.DataFrame(dict(col1=[1, 2, 3, 4],
                                col2=[5, np.nan, 6, np.nan],
                                col3=[7, 8, np.nan, np.nan],
                                col4=[9, np.nan, np.nan, np.nan]))
df_left = get_left_aligned(df)
I appreciate any help with this.
Thanks!
Or transpose the df and use shift to shift column by column, since the number of NAs increases by one from each column to the next.
dfn = df.T.copy()
for i, col in enumerate(dfn.columns):
    dfn[col] = dfn[col].shift(-i)
dfn = dfn.T
print(dfn)
col1 col2 col3 col4
0 1.0 5.0 7.0 9.0
1 2.0 NaN 8.0 NaN
2 3.0 6.0 NaN NaN
3 4.0 NaN NaN NaN
One way to resolve your challenge is to move the data into numpy territory, sort the data, then return it as a pandas DataFrame:
Numpy converts pandas NA to object data type; pd.to_numeric resolves that to data types numpy can work with.
pd.DataFrame(
    np.sort(df.transform(pd.to_numeric).to_numpy(), axis=1),
    columns=df.columns,
    dtype="Int64",
)
col1 col2 col3
0 1 4 6
1 2 5 <NA>
2 3 <NA> <NA>
You can sort the values in each row by their positions, keeping the nan values at the end by giving them a very high sort key (np.inf, for example) rather than their actual position.
df.T.apply(
    lambda x: [z[1] for z in sorted(enumerate(x), key=(lambda k: np.inf if pd.isna(k[1]) else k[0]), reverse=False)],
    axis=0
).T
Here is an example:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=[
        [np.nan, 2, 4, 7],
        [np.nan, np.nan, 6, 9],
        [np.nan, np.nan, np.nan, 10],
        [np.nan, np.nan, np.nan, np.nan],
    ],
    columns=['A', 'B', 'C', 'D']
)
df2 = df.T.apply(
    lambda x: [z[1] for z in sorted(enumerate(x), key=(lambda k: np.inf if pd.isna(k[1]) else k[0]), reverse=False)],
    axis=0
).T
And this is df2:
A B C D
0 2.0 4.0 7.0 NaN
1 6.0 9.0 NaN NaN
2 10.0 NaN NaN NaN
3 NaN NaN NaN NaN
EDIT
If you have rows with NaNs after the first non-NaN value, you can use this approach based on first_valid_index:
df.apply(
    lambda x: x.shift(-list(x.index).index(x.first_valid_index() or x.index[0])),
    axis=1,
)
An example for this case:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=[
        [np.nan, 2, 4, 7],
        [np.nan, np.nan, 6, 9],
        [np.nan, np.nan, np.nan, 10],
        [np.nan, np.nan, np.nan, np.nan],
        [np.nan, 5, np.nan, 3],
    ],
    columns=['A', 'B', 'C', 'D']
)
df3 = df.apply(
    lambda x: x.shift(-list(x.index).index(x.first_valid_index() or x.index[0])),
    axis=1,
)
And df3 is:
A B C D
0 2.0 4.0 7.0 NaN
1 6.0 9.0 NaN NaN
2 10.0 NaN NaN NaN
3 NaN NaN NaN NaN
4 5.0 NaN 3.0 NaN
Let's say I have data:
a b
0 1.0 NaN
1 6.0 1
2 3.0 NaN
3 1.0 NaN
I would like to iterate over this data to check: if a value in column 'b' is NaN **and** column 'a' equals 1.0, then replace that NaN with 4, instead of replacing every NaN with 4. How shall I go about it? I tried every for/if combination I could think of and it didn't work. I also tried
for i in df.itertuples():
but the problem is that df.itertuples() doesn't have replace functionality, and the other methods I've seen do it one value at a time.
End Result looking for:
a b
0 1.0 4
1 6.0 1
2 3.0 NaN
3 1.0 4
import pandas as pd
import numpy as np

def func(x):
    if x['a'] == 1 and pd.isna(x['b']):
        x['b'] = 4
    return x

df = pd.DataFrame.from_dict({'a': [1.0, 6.0, 3.0, 1.0], 'b': [np.nan, 1, np.nan, np.nan]})
df = df.apply(func, axis=1)
Instead of iterrows(), apply() may be a better option.
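The same update can also be done without apply at all, using a vectorized boolean mask with .loc, which is usually the most idiomatic pandas form. A minimal sketch, assuming the question's data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, 6.0, 3.0, 1.0],
                   'b': [np.nan, 1, np.nan, np.nan]})

# select rows where a == 1 AND b is NaN, and assign 4 to column b only
df.loc[(df['a'] == 1) & df['b'].isna(), 'b'] = 4
print(df)
```

Because the mask restricts both the rows and the target column, no other values are touched.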
You can create a mask and then fill in the intended NaNs using that mask:
df = pd.DataFrame({'a': [1, 6, 3, 1], 'b': [np.nan, 1, np.nan, np.nan]})
mask = df[['a', 'b']].apply(lambda x: (x['a'] == 1) and pd.isna(x['b']), axis=1)
df['b'] = df['b'].mask(mask, df['b'].fillna(4))
print(df)
a b
0 1 4.0
1 6 1.0
2 3 NaN
3 1 4.0
df2 = df[df['a']==1.0].fillna(4.0)
df2.combine_first(df)
Can this help you?
As you said, you can achieve this by combining two conditions: a == 1 and b is NaN.
To combine two conditions in pandas you can use &.
In your example:
import pandas as pd
import numpy as np
# Create sample data
d = {'a': [1, 6, 3, 1], 'b': [np.nan, 1, np.nan, np.nan]}
df = pd.DataFrame(data=d)
# Convert to numeric
df = df.apply(pd.to_numeric, errors='coerce')
print(df)
# Replace NaNs in column b where a == 1 (target only column b, so a is not overwritten)
df.loc[(df['a'] == 1) & df['b'].isna(), 'b'] = 4
print(df)
Should do the trick.
I am converting a piece of code written in R to Python. df1 and df2 are the dataframes; id, case, feature, and feature_value are column names. The code in R is:
for(i in 1:dim(df1)[1]){
    temp = subset(df2, df2$id == df1$case[i], select = df1$feature[i])
    df1$feature_value[i] = temp[, df1$feature[i]]
}
My code in Python is as follows.
for i in range(0, len(df1)):
    temp = np.where(df1['case'].iloc[i] == df2['id']), df1['feature'].iloc[i]
    df1['feature_value'].iloc[i] = temp[:, df1['feature'].iloc[i]]
but it gives
TypeError: tuple indices must be integers or slices, not tuple
How to rectify this error? Appreciate any help.
Unfortunately, R and Pandas handle dataframes pretty differently. If you'll be using Pandas a lot, it would probably be worth going through a tutorial on it.
I'm not too familiar with R so this is what I think you want to do:
Find rows in df1 where the 'case' matches an 'id' in df2. If such a row is found, add the "feature" in df1 to a new df1 column called "feature_value."
If so, you can do this with the following:
#create a sample df1 and df2
>>> df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5]})
>>> df1
case feature
0 1 3
1 2 4
2 3 5
>>> df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39]})
>>> df2
id age
0 1 45
1 3 63
2 7 39
#create a list with all the "id" values of df2
>>> df2_list = df2['id'].to_list()
>>> df2_list
[1, 3, 7]
#lambda allows small functions; in this case, the value of df1['feature_value']
#for each row is assigned df1['feature'] if df1['case'] is in df2_list,
#and otherwise it is assigned np.nan.
>>> df1['feature_value'] = df1.apply(lambda x: x['feature'] if x['case'] in df2_list else np.nan, axis=1)
>>> df1
case feature feature_value
0 1 3 3.0
1 2 4 NaN
2 3 5 5.0
Instead of a lambda, a full function can be created, which may be easier to understand:
def get_feature_values(df, id_list):
    if df['case'] in id_list:
        feature_value = df['feature']
    else:
        feature_value = np.nan
    return feature_value

df1['feature_value'] = df1.apply(get_feature_values, id_list=df2_list, axis=1)
Another way of going about this would involve merging df1 and df2 to find rows where the "case" value in df1 matches an "id" value in df2 (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)
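The merge-based route mentioned above can be sketched as follows, assuming the sample frames from earlier (the `merged` name and the use of `Series.where` are illustrative choices, not part of the original answer):

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5]})
df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39]})

# left-merge df1 against df2's id column; unmatched rows get NaN in 'id'
merged = df1.merge(df2, how='left', left_on='case', right_on='id')

# keep 'feature' only where a matching id was found, NaN otherwise
df1['feature_value'] = df1['feature'].where(merged['id'].notna())
print(df1)
```

This avoids building an intermediate list and scales better than a per-row membership test when df2 is large.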
===================
To address the follow-up question in the comments:
You can do this by merging the databases and then creating a function.
#create example dataframes
>>> df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5], 'names': ['a', 'b', 'c']})
>>> df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39], 'a': [30, 31, 32], 'b': [40, 41, 42], 'c': [50, 51, 52]})
#merge the dataframes
>>> df1 = df1.merge(df2, how='left', left_on='case', right_on='id')
>>> df1
case feature names id age a b c
0 1 3 a 1.0 45.0 30.0 40.0 50.0
1 2 4 b NaN NaN NaN NaN NaN
2 3 5 c 3.0 63.0 31.0 41.0 51.0
Then you can create the following function:
def get_feature_values_2(df):
    if pd.notnull(df['id']):
        feature_value = df['feature']
        column_of_interest = df['names']
        feature_extended_value = df[column_of_interest]
    else:
        feature_value = np.nan
        feature_extended_value = np.nan
    return feature_value, feature_extended_value

# "result_type='expand'" allows multiple values to be returned from the function
df1[['feature_value', 'feature_extended_value']] = df1.apply(get_feature_values_2, result_type='expand', axis=1)
#This results in the following dataframe:
case feature names id age a b c feature_value \
0 1 3 a 1.0 45.0 30.0 40.0 50.0 3.0
1 2 4 b NaN NaN NaN NaN NaN NaN
2 3 5 c 3.0 63.0 31.0 41.0 51.0 5.0
feature_extended_value
0 30.0
1 NaN
2 51.0
#To keep only a subset of the columns:
#First create a copy-pasteable list of the column names
list(df1.columns)
['case', 'feature', 'names', 'id', 'age', 'a', 'b', 'c', 'feature_value', 'feature_extended_value']
#Choose the subset of columns you would like to keep
df1 = df1[['case', 'feature', 'names', 'feature_value', 'feature_extended_value']]
df1
case feature names feature_value feature_extended_value
0 1 3 a 3.0 30.0
1 2 4 b NaN NaN
2 3 5 c 5.0 51.0
I'm wondering if there is a concise way to exclude all columns with more than N NaNs, while exempting one column from this rule.
For example:
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))
Results in:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
Running the following, I get:
df.dropna(thresh=2, axis=1)
B D
0 2.0 0
1 4.0 1
2 NaN 5
I would like to keep column 'C'. I.e., to perform this thresholding except on column 'C'.
Is that possible?
You can put the column back once you've done the thresholding. If you do this all on one line, you don't even need to store a reference to the column.
import pandas as pd
import numpy as np

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))
df.dropna(thresh=2, axis=1).assign(C=df['C'])
You could also do
C = df['C']
df = df.dropna(thresh=2, axis=1)
df = df.assign(C=C)
As suggested by #Wen, you can also do an indexing operation that won't remove column C to begin with.
threshold = 2
df = df.loc[:, (df.isnull().sum(0) < threshold) | (df.columns == 'C')]
The index here for the column will select columns that have fewer than threshold NaN values, or whose name is C. If you wanted to include more than just one column in the exception, you can chain more conditions with the "or" operator |. For example:
df = df.loc[
    :,
    (df.isnull().sum(0) < threshold) |
    (df.columns == 'C') |
    (df.columns == 'D')]
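When the exception list grows, `Index.isin` keeps the chain readable instead of or-ing one comparison per column. A sketch of the same filter (the `keep` list name is just an illustrative choice):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))

threshold = 2
keep = ['C']  # columns exempt from the NaN threshold
df = df.loc[:, (df.isnull().sum(0) < threshold) | df.columns.isin(keep)]
print(df)
```

Adding more exempt columns is then just a matter of extending the `keep` list.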
df.loc[:,(df.isnull().sum(0)<=1)|(df.isnull().sum(0)==len(df))]
Out[415]:
B C D
0 2.0 NaN 0
1 4.0 NaN 1
2 NaN NaN 5
As per Zero's suggestion
df.loc[:,(df.isnull().sum(0)<=1)|(df.isnull().all(0))]
EDIT:
df.loc[:,(df.isnull().sum(0)<=1)|(df.columns=='C')]
Another take that blends some concepts from other answers.
df.loc[:, df.isnull().assign(C=False).sum().lt(2)]
B C D
0 2.0 NaN 0
1 4.0 NaN 1
2 NaN NaN 5