Replace zeros in one dataframe with values from another dataframe - python

I have two dataframes df1 and df2:
df1 is shown here:
age
0 42
1 52
2 36
3 24
4 73
df2 is shown here:
age
0 0
1 0
2 1
3 0
4 0
I want to replace all the zeros in df2 with their corresponding entries in df1. In more technical words, if the element at a certain index in df2 is zero, then I would want this element to be replaced by the corresponding entry in df1.
Hence, I want df2 to look like:
age
0 42
1 52
2 1
3 24
4 73
I tried using the replace method but it is not working. Please help :)
Thanks in advance.

You could use where:
In [19]: df2.where(df2 != 0, df1)
Out[19]:
age
0 42
1 52
2 1
3 24
4 73
Above, df2 != 0 is a boolean DataFrame.
In [16]: df2 != 0
Out[16]:
age
0 False
1 False
2 True
3 False
4 False
df2.where(df2 != 0, df1) returns a new DataFrame. Where df2 != 0 is True, the corresponding value of df2 is used. Where it is False, the corresponding value of df1 is used.
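For reference, a minimal self-contained version of the above (the two frames are rebuilt here by assumption from the output shown in the question):
import pandas as pd

df1 = pd.DataFrame({'age': [42, 52, 36, 24, 73]})
df2 = pd.DataFrame({'age': [0, 0, 1, 0, 0]})

# keep df2's value where it is nonzero, otherwise fall back to df1's value
print(df2.where(df2 != 0, df1))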
Another alternative is to make an assignment with df.loc:
df2.loc[df2['age'] == 0, 'age'] = df1['age']
df.loc[mask, col] selects the rows of df where the boolean Series mask is True, restricted to the column labeled col.
In [17]: df2.loc[df2['age'] == 0, 'age']
Out[17]:
0 0
1 0
3 0
4 0
Name: age, dtype: int64
When used in an assignment, such as df2.loc[df2['age'] == 0, 'age'] = df1['age'],
Pandas performs automatic index label alignment. (Notice the index labels above are 0, 1, 3, 4 -- with 2 being skipped.) So the values in df2.loc[df2['age'] == 0, 'age'] are replaced by the corresponding values from df1['age']. Even though df1['age'] is a Series with index labels 0, 1, 2, 3 and 4, the 2 is ignored because there is no corresponding index label on the left-hand side.
In other words,
df2.loc[df2['age'] == 0, 'age'] = df1.loc[df2['age'] == 0, 'age']
would work as well, but the added restriction on the right-hand side is unnecessary.
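To see the alignment in one runnable piece (again rebuilding the toy frames from the question):
import pandas as pd

df1 = pd.DataFrame({'age': [42, 52, 36, 24, 73]})
df2 = pd.DataFrame({'age': [0, 0, 1, 0, 0]})

mask = df2['age'] == 0             # True at index labels 0, 1, 3, 4
df2.loc[mask, 'age'] = df1['age']  # aligned by label; label 2 is left untouched
print(df2)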

Another option is mask combined with combine_first:
In [30]: df2.mask(df2==0).combine_first(df1)
Out[30]:
age
0 42.0
1 52.0
2 1.0
3 24.0
4 73.0
or "negating" beautiful #unutbu's solution:
In [46]: df2.mask(df2==0, df1)
Out[46]:
age
0 42
1 52
2 1
3 24
4 73

Or try mul (this needs import numpy as np, and note it only works for this particular data, because the single nonzero entry in df2 happens to be 1 and df1 contains no zeros):
df1.mul(np.where(df2==1, 0, 1)).replace({0: 1})
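A more general sketch using np.where directly, which does not depend on the nonzero values in df2 being 1:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'age': [42, 52, 36, 24, 73]})
df2 = pd.DataFrame({'age': [0, 0, 1, 0, 0]})

# take df1's value wherever df2 is zero, otherwise keep df2's value
df2['age'] = np.where(df2['age'] == 0, df1['age'], df2['age'])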

Related

How to create a new column based on a condition in another column

In pandas, How can I create a new column B based on a column A in df, such that:
B=1 if A_(i+1)-A_(i) > 5 or A_(i) <= 10
B=0 if A_(i+1)-A_(i) <= 5
However, the first B_i value is always one
Example:
 A   B
 5   1   (the first B_i)
12   1
14   0
22   1
20   0
33   1
Use diff with le to compare against your threshold, invert, and convert the boolean result to int:
N = 5
df['B'] = (~df['A'].diff().le(N)).astype(int)
NB: using le(N) with inversion yields 1 for the first value: diff() produces NaN there, NaN compares as False under le, and the inversion flips it to True.
output:
A B
0 5 1
1 12 1
2 14 0
3 22 1
4 20 0
5 33 1
Updated answer: simply combine a second condition with OR (|):
df['B'] = (~df['A'].diff().le(5)|df['A'].lt(10)).astype(int)
output: same as above with the provided data
I was a little confused by your row numbering, because if we compute B_i from the condition A_(i+1)-A_(i), the missing value should be on the last row rather than the first (the first row has both A_(i) and A_(i+1), while the last row lacks A_(i+1)).
Anyway, based on your example I assumed that we calculate B_(i+1).
import pandas as pd
df = pd.DataFrame(columns=["A"],data=[5,12,14,22,20,33])
df['shifted_A'] = df['A'].shift(1) # This row can be removed - it was added only to show how shift works on the final dataframe
df['B'] = ''
df.loc[((df['A']-df['A'].shift(1))>5) + (df['A'].shift(1)<=10), 'B'] = 1 # set 1 where either condition holds
df.loc[(df['A']-df['A'].shift(1))<=5, 'B'] = 0 # set 0 where the condition holds
df.loc[df.index==0, 'B'] = 1 # update the first row in column B
print(df)
That prints:
A shifted_A B
0 5 NaN 1
1 12 5.0 1
2 14 12.0 0
3 22 14.0 1
4 20 22.0 0
5 33 20.0 1
I am not sure if it is the fastest way, but I think it is one of the easier ones to understand.
A little explanation:
df.loc[mask, columnname] = newvalue lets us update values in the given column wherever the condition (mask) is fulfilled.
((df['A']-df['A'].shift(1))>5) + (df['A'].shift(1)<=10)
Each condition here returns True or False. Adding the masks yields True if either one is True (which is simply OR); if we needed AND we could multiply them instead.
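As a side note, pandas also has dedicated boolean operators; a sketch of the same masks written with them (| is OR, & would be AND):
cond_jump = (df['A'] - df['A'].shift(1)) > 5   # difference with previous row exceeds 5
cond_low  = df['A'].shift(1) <= 10             # previous value is at most 10
df.loc[cond_jump | cond_low, 'B'] = 1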
Use Series.diff, fill the first missing value with N (so the first row compares as True), compare with Series.ge (greater or equal), combine with the second condition using |, and cast to int:
N = 5
df['B'] = (df.A.diff().fillna(N).ge(N) | df.A.lt(10)).astype(int)
print (df)
A B
0 5 1
1 12 1
2 14 0
3 22 1
4 20 0
5 33 1

Updating values from different dataframe on a certain id value [duplicate]

Note: for simplicity's sake, I'm using a toy example, because copy/pasting dataframes is difficult on Stack Overflow (please let me know if there's an easy way to do this).
Is there a way to merge the values from one dataframe onto another without getting the _X, _Y columns? I'd like the values on one column to replace all zero values of another column.
df1:
Name Nonprofit Business Education
X 1 1 0
Y 0 1 0 <- Y and Z have zero values for Nonprofit and Educ
Z 0 0 0
Y 0 1 0
df2:
Name Nonprofit Education
Y 1 1 <- this df has the correct values.
Z 1 1
pd.merge(df1, df2, on='Name', how='outer')
Name Nonprofit_X Business Education_X Nonprofit_Y Education_Y
Y 1 1 1 1 1
Y 1 1 1 1 1
X 1 1 0 nan nan
Z 1 1 1 1 1
In a previous post, I tried combine_first and dropna(), but these don't do the job.
I want to replace zeros in df1 with the values in df2.
Furthermore, I want all rows with the same Names to be changed according to df2.
Name Nonprofit Business Education
Y 1 1 1
Y 1 1 1
X 1 1 0
Z 1 0 1
(To clarify: the value in the 'Business' column where Name = Z should be 0.)
My existing solution does the following:
I subset based on the names that exist in df2, and then replace those values with the correct value. However, I'd like a less hacky way to do this.
pubunis_df = df2
sdf = df1
regex = str_to_regex(', '.join(pubunis_df.ORGS))
pubunis = searchnamesre(sdf, 'ORGS', regex)
sdf.ix[pubunis.index, ['Education', 'Public']] = 1
searchnamesre(sdf, 'ORGS', regex)
Attention: in the latest versions of pandas, both of the other answers here (KSD's and EdChum's, quoted below) no longer work.
KSD's answer will raise an error:
df1 = pd.DataFrame([["X",1,1,0],
["Y",0,1,0],
["Z",0,0,0],
["Y",0,0,0]],columns=["Name","Nonprofit","Business", "Education"])
df2 = pd.DataFrame([["Y",1,1],
["Z",1,1]],columns=["Name","Nonprofit", "Education"])
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2.loc[df2.Name.isin(df1.Name),['Nonprofit', 'Education']].values
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']].values
Out[851]:
ValueError: shape mismatch: value array of shape (2,) could not be broadcast to indexing result of shape (3,)
and EdChum's answer will give us the wrong result:
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']]
df1
Out[852]:
Name Nonprofit Business Education
0 X 1.0 1 0.0
1 Y 1.0 1 1.0
2 Z NaN 0 NaN
3 Y NaN 1 NaN
Well, it will work safely only if the values in column 'Name' are unique and sorted identically in both data frames.
Here is my answer:
Way 1:
df1 = df1.merge(df2, on='Name', how='left')
df1['Nonprofit_y'] = df1['Nonprofit_y'].fillna(df1['Nonprofit_x'])
df1['Education_y'] = df1['Education_y'].fillna(df1['Education_x'])
df1.drop(['Nonprofit_x', 'Education_x'], inplace=True, axis=1)
df1.rename(columns={'Nonprofit_y': 'Nonprofit', 'Education_y': 'Education'}, inplace=True)
(Only the shared columns Nonprofit and Education get the _x/_y suffixes after the merge; Business keeps its name.)
Way 2:
df1 = df1.set_index('Name')
df2 = df2.set_index('Name')
df1.update(df2)
df1.reset_index(inplace=True)
More on update: the index column names do not need to match before calling update; you could use 'Name1' and 'Name2'. It also works when df2 contains extra rows that df1 lacks (they simply update nothing). In other words, df2 does not need to be a superset of df1.
Example:
df1 = pd.DataFrame([["X",1,1,0],
["Y",0,1,0],
["Z",0,0,0],
["Y",0,1,0]],columns=["Name1","Nonprofit","Business", "Education"])
df2 = pd.DataFrame([["Y",1,1],
["Z",1,1],
['U',1,3]],columns=["Name2","Nonprofit", "Education"])
df1 = df1.set_index('Name1')
df2 = df2.set_index('Name2')
df1.update(df2)
result:
Nonprofit Business Education
Name1
X 1.0 1 0.0
Y 1.0 1 1.0
Z 1.0 0 1.0
Y 1.0 1 1.0
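One side effect visible in the result above: the updated columns come back as float (update aligns through NaN-capable intermediates). A one-line sketch to cast them back if you need integers:
df1[['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']].astype(int)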
Use the boolean mask from isin to filter the df and assign the desired row values from the rhs df:
In [27]:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']]
df
Out[27]:
Name Nonprofit Business Education
0 X 1 1 0
1 Y 1 1 1
2 Z 1 0 1
3 Y 1 1 1
[4 rows x 4 columns]
This is the correct one.
In [27]:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']].values
df
Out[27]:
Name Nonprofit Business Education
0 X 1 1 0
1 Y 1 1 1
2 Z 1 0 1
3 Y 1 1 1
[4 rows x 4 columns]
The above will work only when all the rows in df1 exist in df; in other words, df should be a superset of df1.
In case df1 has rows that do not match df (that is, df is not a superset of df1), you should do the following instead:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1.loc[df1.Name.isin(df.Name), ['Nonprofit', 'Education']].values
Or, setting Name as the index on both frames so that combine_first takes df2's values where present and falls back to df1:
df2.set_index('Name').combine_first(df1.set_index('Name')).reset_index()

How to assign a value to a column for a subset of dataframe based on a condition in Pandas?

I have a data frame:
df = pd.DataFrame([[0,4,0,0],
[1,5,1,0],
[2,6,0,0],
[3,7,1,0]], columns=['index', 'A', 'class', 'label'])
df:
   index  A  class  label
0      0  4      0      0
1      1  5      1      0
2      2  6      0      0
3      3  7      1      0
I want to change label to 1 if the mean of column A over the rows with class 0 is bigger than the mean of all the data in column A.
How can I do this in a few lines of code?
I tried this but didn't work:
if df[df['class'] == 0]['A'].mean() > df['A'].mean():
df[df['class']]['lable'] = 1
Use the following: pandas.DataFrame.groupby on 'class', take the group mean of 'A', check whether it is greater than df['A'].mean(), then pandas.Series.map that boolean Series onto df['class'] and astype(int) the result before assigning it to df['label']:
>>> df['label'] = df['class'].map(
df.groupby('class')['A'].mean() > df['A'].mean()
).astype(int)
>>> df
index A class label
0 0 4 0 0
1 1 5 1 1
2 2 6 0 0
3 3 7 1 1
Since you are checking only for class == 0, you need to add another boolean mask on df['class']:
>>> df['label'] = (df['class'].map(
df.groupby('class')['A'].mean() > df['A'].mean()
) & (~df['class'].astype(bool))
).astype(int)
index A class label
0 0 4 0 0
1 1 5 1 0 # because (5+7)/2 < (4+5+6+7)/4
2 2 6 0 0
3 3 7 1 0 # because (5+7)/2 < (4+5+6+7)/4
So even if your code had worked, you would not see any change, because the condition is never fulfilled for this data.
If I understand correctly: if the condition you mentioned is fulfilled, then the labels of all rows change to 1, right? In that case what you did is close, but you missed something; the code should look like this:
if df[df['class'] == 0]['A'].mean() > df['A'].mean():
    df['label'] = 1
This should work.
What you did does not work because df[df['class']] does not select the rows you intend, and the chained assignment df[...]['lable'] = 1 operates on a temporary copy rather than on df (note also the 'lable' typo), so the 'label' column you want to modify is never changed.
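For completeness, a sketch of the same check written with .loc, which avoids the chained-indexing pitfall (assuming the intent was to flag only the class 0 rows):
# set label to 1 on the class 0 rows when their mean of A beats the overall mean
if df.loc[df['class'] == 0, 'A'].mean() > df['A'].mean():
    df.loc[df['class'] == 0, 'label'] = 1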

Not able to split value of a column in dataframe on condition

I have an input data frame which looks like
0 1
0 0 10,30
1 1 10,40
2 2 20,50
Now I am trying to split the second column and store one of its values in a new column: if the value in column A is divisible by 2, take the first value from column B, otherwise the second, like below:
A B C
0 0 10,30 10
1 1 10,40 10
2 2 20,50 50
My Code :
import pandas as pd
import numpy as np
df = pd.DataFrame([(0, '10,30'), (1, '10,40'), (2, '20,50')])
df['n'] = np.where(df[0] % 2 == 0, df[0], 0 )
df[2] = (df[1]).str.split(',').str[df['n'].fillna(0)]
print(df)
It's throwing an error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I believe you need to split the column into a DataFrame with expand=True and then use lookup, casting the boolean mask to int so that 0 selects the first column and 1 selects the second:
df[2] = df[1].str.split(',', expand=True).lookup(df.index, (df[0] % 2 == 0).astype(int))
print (df)
0 1 2
0 0 10,30 30
1 1 10,40 10
2 2 20,50 50
print (df[0] % 2 == 0)
0 True
1 False
2 True
Name: 0, dtype: bool
#select second, first, second column
print ((df[0] % 2 == 0).astype(int))
0 1
1 0
2 1
Name: 0, dtype: int32
Similar solution with changed condition:
df[2] = df[1].str.split(',', expand=True).lookup(df.index, (df[0] % 2 != 0).astype(int))
print (df)
0 1 2
0 0 10,30 10
1 1 10,40 40
2 2 20,50 20
print (df[0] % 2 != 0)
0 False
1 True
2 False
Name: 0, dtype: bool
#select first, second, first column
print ((df[0] % 2 != 0).astype(int))
0 0
1 1
2 0
Name: 0, dtype: int32
print (df[1].str.split(',', expand=True))
0 1
0 10 30 <-first 10
1 10 40 <-second 40
2 20 50 <-first 20
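Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so on newer versions the same row/column pick can be done with plain NumPy indexing (a sketch, shown for the first variant's == condition):
import numpy as np

splitted = df[1].str.split(',', expand=True)
cols = (df[0] % 2 == 0).astype(int)            # 0 -> first column, 1 -> second
df[2] = splitted.to_numpy()[np.arange(len(df)), cols]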
I think you can also achieve it with the method apply.
First let's put column 1, split into its parts, together with the target index into a new dataframe df1:
df1 = pd.concat({i:df[1].str.split(',').str.get(i) for i in range(2)}, axis=1)
df1['ind'] = df[0] % 2
df1
0 1 ind
0 10 30 0
1 10 40 1
2 20 50 0
Next you can put the new values into the column 2 with
df[2] = df1.apply(lambda p: p.loc[p["ind"]], axis=1)
df[2]
0 10
1 40
2 20
dtype: object
If you don't want to create a new data frame, you can apply the lambda to df directly and get the same result:
df[2] = df.apply(lambda p: p.loc[1].split(",")[p.loc[0] % 2], axis=1)

pandas index skipping values

I'm reading in two csv files, selecting data from a specific column, dropping NA/nulls, and then using the data that fits some condition in one file to print the associated data in another:
data1 = pandas.read_csv(filename1, usecols = ['X', 'Y', 'Z']).dropna()
data2 = pandas.read_csv(filename2, usecols = ['X', 'Y', 'Z']).dropna()
i = 0
for item in data1['Y']:
    if item > -20:
        print data2['X'][i]
    i += 1  # advance the positional counter
But this throws me an error:
File "hashtable.pyx", line 381, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:7035)
File "hashtable.pyx", line 387, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6976)
KeyError: 6L
It turns out that when I print data2['X'], I see missing numbers in the row index:
0 -1.953779
1 -2.010039
2 -2.562191
3 -2.723993
4 -2.302720
5 -2.356181
7 -1.928778
...
How do I fix this and renumber the index values? Or is there a better way?
Found a solution in another question from here: Reindexing dataframes
.reset_index(drop=True) does the trick!
0 -1.953779
1 -2.010039
2 -2.562191
3 -2.723993
4 -2.302720
5 -2.356181
6 -1.928778
7 -1.925359
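A minimal sketch of that fix applied right after the dropna calls from the question:
data1 = pandas.read_csv(filename1, usecols=['X', 'Y', 'Z']).dropna().reset_index(drop=True)
data2 = pandas.read_csv(filename2, usecols=['X', 'Y', 'Z']).dropna().reset_index(drop=True)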
Are your two files/dataframes the same length? If so, you can leverage boolean masks and do this (and it avoids a for loop):
data2['X'][data1['Y'] > -20]
Edit: in response to the comment
What happens in between:
In [16]: df1
Out[16]:
X Y
0 0 0
1 1 2
2 2 4
3 3 6
4 4 8
In [17]: df2
Out[17]:
Y X
0 64 75
1 65 73
2 36 44
3 13 58
4 92 54
# creates a pandas Series object of True/False, which you can then use as a "mask"
In [18]: df2['Y'] > 50
Out[18]:
0 True
1 True
2 False
3 False
4 True
Name: Y, dtype: bool
# mask is applied element-wise to (in this case) the column of your DataFrame you want to filter
In [19]: df1['X'][ df2['Y'] > 50 ]
Out[19]:
0 0
1 1
4 4
Name: X, dtype: int64
# same as doing this (where the mask is applied to the whole dataframe, and then you grab your column)
In [20]: df1[ df2['Y'] > 50 ]['X']
Out[20]:
0 0
1 1
4 4
Name: X, dtype: int64
