Fill the dataframe values from another dataframe in pandas

I have two dataframes, df1 and df2. df1 is filled with values and df2 is empty.
As can be seen below, both dataframes always have the same index and column labels; the only difference is that df1 contains no duplicate index or column labels, while df2 does.
How can I fill the values in df2 from df1, so that each cell is matched on its combination of index and column label?
import pandas as pd

df1 = pd.DataFrame({'Ind': pd.Series([1, 2, 3, 4]),
                    1: pd.Series([1, 0.2, 0.2, 0.8]),
                    2: pd.Series([0.2, 1, 0.2, 0.8]),
                    3: pd.Series([0.2, 0.2, 1, 0.8]),
                    4: pd.Series([0.8, 0.8, 0.8, 1])})
df1 = df1.set_index('Ind')
df2 = pd.DataFrame(columns=[1, 1, 2, 2, 3, 4], index=[1, 1, 2, 2, 3, 4])

IIUC, you want to update:
df2.update(df1)
print(df2)
1 1 2 2 3 4
1 1.0 1.0 0.2 0.2 0.2 0.8
1 1.0 1.0 0.2 0.2 0.2 0.8
2 0.2 0.2 1.0 1.0 0.2 0.8
2 0.2 0.2 1.0 1.0 0.2 0.8
3 0.2 0.2 0.2 0.2 1.0 0.8
4 0.8 0.8 0.8 0.8 0.8 1.0
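Since df1's labels are unique, the same filled frame can also be built in one step by reindexing onto df2's duplicated labels (a minimal alternative sketch, not part of the original answer):
# reindex repeats df1's rows/columns once for every duplicated label in df2
df2 = df1.reindex(index=df2.index, columns=df2.columns)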

Related

How to fill NaN with zero until data appears in each column, then forward-fill each column in a pandas data frame

I have the following data frame,
date x1 x2 x3
2001-01-01 nan 0.4 0.1
2001-01-02 nan 0.3 nan
2001-01-03 nan nan 0.5
...
2001-05-05 nan 0.1 0.2
2001-05-06 0.1 nan 0.3
...
So I want to first fill the NaN values in each column with zero until the first data point appears; after that, I want the remaining rows to use forward fill.
So the above data frame should look like this,
date x1 x2 x3
2001-01-01 0 0.4 0.1
2001-01-02 0 0.3 0.1
2001-01-03 0 0.3 0.5
...
2001-05-05 0 0.1 0.2
2001-05-06 0.1 0.1 0.3
...
If I do fillna with 0 first and then forward-fill, like this,
df = df.fillna(0)
df = df.ffill()
all the NaN values just become zero, and there is nothing left for ffill to do in the stretches after the data starts.
Is there a way to do the forward fill the way I want?
Reverse the logic: forward-fill first, so the only NaN values left are the ones before each column's first observation, then fill those with zero:
out = df.ffill().fillna(0)
print(out)
# Output
date x1 x2 x3
0 2001-01-01 0.0 0.4 0.1
1 2001-01-02 0.0 0.3 0.1
2 2001-01-03 0.0 0.3 0.5
3 2001-05-05 0.0 0.1 0.2
4 2001-05-06 0.1 0.1 0.3
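A minimal self-contained version of that one-liner (my sketch, using just the five rows shown in the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': ['2001-01-01', '2001-01-02', '2001-01-03', '2001-05-05', '2001-05-06'],
    'x1': [np.nan, np.nan, np.nan, np.nan, 0.1],
    'x2': [0.4, 0.3, np.nan, 0.1, np.nan],
    'x3': [0.1, np.nan, 0.5, 0.2, 0.3],
})

# forward-fill first so real observations propagate; the only NaN left
# afterwards sit before each column's first observation, so fillna(0)
# cannot clobber anything that should have been forward-filled
out = df.ffill().fillna(0)
print(out)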

Pandas: Return first column number that matches a condition

I have a dataframe that looks like this
a b c d
0 0.6 -0.4 0.2 0.7
1 0.8 0.2 -0.2 0.3
2 -0.1 0.5 0.5 -0.4
3 0.8 -0.6 -0.7 -0.2
And I wish to create column 'e' such that it holds the (1-based) column number of the first value in each row that is less than 0.
So the goal result will look like this:
a b c d e
0 0.6 -0.4 0.2 0.7 2
1 0.8 0.2 -0.2 0.3 3
2 -0.1 0.5 0.5 -0.4 1
3 0.8 -0.6 -0.7 -0.2 2
I can do this in Excel using a MATCH(True) type function but am struggling to make progress in Pandas.
Thanks for any help
You can use np.argmax:
import numpy as np

# where the values are less than 0
a = df.values < 0
# 1-based position of the first True per row; 0 if the row is all non-negative
df['e'] = np.where(a.any(axis=1), np.argmax(a, axis=1) + 1, 0)
Output:
a b c d e
0 0.6 -0.4 0.2 0.7 2
1 0.8 0.2 -0.2 0.3 3
2 -0.1 0.5 0.5 -0.4 1
3 0.8 -0.6 -0.7 -0.2 2
Something like idxmin with np.sign:
import numpy as np
df['e'] = df.columns.get_indexer(np.sign(df).idxmin(axis=1)) + 1
df
a b c d e
0 0.6 -0.4 0.2 0.7 2
1 0.8 0.2 -0.2 0.3 3
2 -0.1 0.5 0.5 -0.4 1
3 0.8 -0.6 -0.7 -0.2 2
Get the first match with idxmax, combined with get_indexer_for to get the column numbers:
df["e"] = df.columns.get_indexer_for(df.lt(0).idxmax(axis=1).array) + 1
df
a b c d e
0 0.6 -0.4 0.2 0.7 2
1 0.8 0.2 -0.2 0.3 3
2 -0.1 0.5 0.5 -0.4 1
3 0.8 -0.6 -0.7 -0.2 2
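One caveat worth flagging (my note, not part of the original answers): the idxmin/idxmax variants silently report the first column for rows that contain no negative value at all, whereas the argmax answer guards against that with any(axis=1). A guarded version of the idxmax approach might look like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0.6, 0.8, -0.1, 0.8],
                   'b': [-0.4, 0.2, 0.5, -0.6],
                   'c': [0.2, -0.2, 0.5, -0.7],
                   'd': [0.7, 0.3, -0.4, -0.2]})

mask = df.lt(0)
# idxmax returns the first column label even when a row has no True at all,
# so guard with any(axis=1): rows with no negative value get 0 instead of 1
df['e'] = np.where(mask.any(axis=1),
                   df.columns.get_indexer(mask.idxmax(axis=1)) + 1,
                   0)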

How to use boolean indexing with Pandas

I have a dataframe:
df =
time time b
0 0.0 1.1 21
1 0.1 2.2 22
2 0.2 3.3 23
3 0.3 4.4 24
4 0.4 5.5 24
I also have a series for my units, defined as
su =
time sal
time zulu
b m/s
Now, I want to set df.index equal to the "time (sal)" values. Those values can be in any column and I will need to check.
I can do this as:
df.index = df.values[:,(df.columns == 'time') & (su.values == 'sal')]
But, my index looks like:
array([[0.0],
[0.1],
[0.2],
[0.3],
[0.4]])
However, this is an array of arrays, and on bigger datasets plotting seems to take longer. If I hardcode the column position, I get just a flat array:
df.index = df.values[:, 0]
array([0.0, 0.1, 0.2, 0.3, 0.4])
I can also do the following:
inx = ((df.columns == 'time') & (su.values == 'sal')).tolist().index(True)
This sets inx to 0 and then gets a flat array:
df.index = df.values[:, inx]
However, I shouldn't have to do this. Am I using pandas and boolean indexing incorrectly?
I want:
df =
time time b
0.0 0.0 1.1 21
0.1 0.1 2.2 22
0.2 0.2 3.3 23
0.3 0.3 4.4 24
0.4 0.4 5.5 24
If I understood correctly, this is what you expected. However, I renamed the time columns to time1 and time2, since a Python dictionary cannot hold duplicate keys.
df = {'time1': [0.0, 0.1, 0.2, 0.3, 0.4], 'time2': [1.1, 2.2, 3.3, 4.4, 5.5], 'b': [21, 22, 23, 24, 24]}
su = {'time1': 'sal', 'time2': 'zulu', 'b': 'm/s'}
indexes = df[list(su.keys())[list(su.values()).index('sal')]]
df = pd.DataFrame(df, index=indexes, columns=['time1', 'time2', 'b'])
print(df)
Your original DataFrame has duplicate column names, which adds complexity.
Try renaming the columns.
Sample Code
unit = pd.Series(['sal', 'zulu', 'm/s'], index=['time', 'time', 'b'])
>>> df
time time b
0 0.0 1.1 21.0
1 0.1 2.2 22.0
2 0.2 3.3 23.0
3 0.3 4.4 24.0
4 0.4 5.5 25.0
new_col = ['{}({})'.format(df.columns[i], unit.iloc[i]) for i in range(len(df.columns))]
>>> new_col
['time(sal)', 'time(zulu)', 'b(m/s)']
>>> df.columns = new_col
>>> df
time(sal) time(zulu) b(m/s)
0 0.0 1.1 21.0
1 0.1 2.2 22.0
2 0.2 3.3 23.0
3 0.3 4.4 24.0
4 0.4 5.5 25.0
>>> df.index = df['time(sal)'].values
>>> df
time(sal) time(zulu) b(m/s)
0.0 0.0 1.1 21.0
0.1 0.1 2.2 22.0
0.2 0.2 3.3 23.0
0.3 0.3 4.4 24.0
0.4 0.4 5.5 25.0
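As to the original question of whether boolean indexing was being used incorrectly: not really. Boolean selection on a 2-D array always returns a 2-D result, even when exactly one column matches. A minimal sketch (assuming the df and su from the question) that flattens it:
mask = (df.columns == 'time') & (su.values == 'sal')
# values[:, mask] stays 2-D even with a single True in mask;
# ravel() flattens it to the 1-D array wanted for the index
df.index = df.values[:, mask].ravel()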

Subtracting columns based on key column in pandas dataframe

I have two dataframes looking like
df1:
ID A B C D
0 'ID1' 0.5 2.1 3.5 6.6
1 'ID2' 1.2 5.5 4.3 2.2
2 'ID1' 0.7 1.2 5.6 6.0
3 'ID3' 1.1 7.2 10. 3.2
df2:
ID A B C D
0 'ID1' 1.0 2.0 3.3 4.4
1 'ID2' 1.5 5.0 4.0 2.2
2 'ID3' 0.6 1.2 5.9 6.2
3 'ID4' 1.1 7.2 8.5 3.0
df1 can have multiple entries with the same ID, whereas each ID occurs only once in df2. Also, not every ID in df2 is necessarily present in df1. I can't solve this with a plain set_index(), since multiple rows in df1 can share an ID, and the IDs in df1 and df2 are not aligned.
I want to create a new dataframe where I subtract the values in df2[['A','B','C','D']] from df1[['A','B','C','D']] based on matching the ID.
The resulting dataframe would look like:
df_new:
ID A B C D
0 'ID1' -0.5 0.1 0.2 2.2
1 'ID2' -0.3 0.5 0.3 0.0
2 'ID1' -0.3 -0.8 2.3 1.6
3 'ID3' 0.5 6.0 4.1 -3.0
I know how to do this with a loop, but since I'm dealing with huge data quantities this is not practical at all. What is the best way of approaching this with Pandas?
You just need set_index and subtraction:
(df1.set_index('ID')-df2.set_index('ID')).dropna(axis=0)
Out[174]:
A B C D
ID
'ID1' -0.5 0.1 0.2 2.2
'ID1' -0.3 -0.8 2.3 1.6
'ID2' -0.3 0.5 0.3 0.0
'ID3' 0.5 6.0 4.1 -3.0
If the order matters add reindex for df2
(df1.set_index('ID')-df2.set_index('ID').reindex(df1.ID)).dropna(axis=0).reset_index()
Out[211]:
ID A B C D
0 'ID1' -0.5 0.1 0.2 2.2
1 'ID2' -0.3 0.5 0.3 0.0
2 'ID1' -0.3 -0.8 2.3 1.6
3 'ID3' 0.5 6.0 4.1 -3.0
Similarly to what Wen (who beat me to it) proposed, you can use pd.DataFrame.subtract:
df1.set_index('ID').subtract(df2.set_index('ID')).dropna().reset_index()
ID A B C D
0 'ID1' -0.5 0.1 0.2 2.2
1 'ID1' -0.3 -0.8 2.3 1.6
2 'ID2' -0.3 0.5 0.3 0.0
3 'ID3' 0.5 6.0 4.1 -3.0
One method is to use numpy. We can extract the ordered indices required from df2 using numpy.searchsorted (note that this relies on df2['ID'] being sorted, as it is here).
Then feed this into the construction of a new dataframe.
idx = np.searchsorted(df2['ID'], df1['ID'])
res = pd.DataFrame(df1.iloc[:, 1:].values - df2.iloc[:, 1:].values[idx],
                   index=df1['ID']).reset_index()
print(res)
print(res)
ID 0 1 2 3
0 'ID1' -0.5 0.1 0.2 2.2
1 'ID2' -0.3 0.5 0.3 0.0
2 'ID1' -0.3 -0.8 2.3 1.6
3 'ID3' 0.5 6.0 4.1 -3.0
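For completeness, a merge keeps df1's row order explicit without relying on sorted IDs (my sketch, assuming df1 and df2 as defined in the question):
cols = ['A', 'B', 'C', 'D']
# left-merge df2 onto df1's IDs: rows align one-to-one with df1, and any
# ID missing from df2 simply yields NaN in the result
aligned = df1[['ID']].merge(df2, on='ID', how='left')
out = df1.copy()
out[cols] = df1[cols].to_numpy() - aligned[cols].to_numpy()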

Group rows by date and overwrite NaN values

I have a dataframe of the following structure which is simplified for this question.
A B C D E
0 2014/01/01 nan nan 0.2 nan
1 2014/01/01 0.1 nan nan nan
2 2014/01/01 nan 0.3 nan 0.7
3 2014/01/02 nan 0.4 nan nan
4 2014/01/02 0.5 nan 0.6 0.8
What I have here is a series of readings across several timestamps on single days. The columns B,C,D and E represent different locations. The data I am reading in is set up such that at a specified timestamp it takes data from certain locations and fills in nan values for the other locations.
What I wish to do is group the data by timestamp, which I can easily do with a .groupby() call. From there, I want the NaN values in each group to be overwritten with the valid values from the other rows, so that the following result is obtained.
A B C D E
0 2014/01/01 0.1 0.3 0.2 0.7
1 2014/01/02 0.5 0.4 0.6 0.8
How do I go about achieving this?
Try df.groupby with DataFrameGroupBy.agg:
In [528]: df.groupby('A', as_index=False, sort=False).agg(np.nansum)
Out[528]:
A B C D E
0 2014/01/01 0.1 0.3 0.2 0.7
1 2014/01/02 0.5 0.4 0.6 0.8
A shorter version with DataFrameGroupBy.sum (thanks MaxU!):
In [537]: df.groupby('A', as_index=False, sort=False).sum()
Out[537]:
A B C D E
0 2014/01/01 0.1 0.3 0.2 0.7
1 2014/01/02 0.5 0.4 0.6 0.8
You can try this using GroupBy.first, which takes the first non-null value in each column per group:
df.groupby('A', as_index=False).first()
A B C D E
0 2014/01/01 0.1 0.3 0.2 0.7
1 2014/01/02 0.5 0.4 0.6 0.8
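Worth noting (my addition, not from the answers): sum() only happens to work here because each group holds at most one non-null value per column; first() expresses the intent directly. A self-contained sketch with the sample data:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['2014/01/01'] * 3 + ['2014/01/02'] * 2,
    'B': [np.nan, 0.1, np.nan, np.nan, 0.5],
    'C': [np.nan, np.nan, 0.3, 0.4, np.nan],
    'D': [0.2, np.nan, np.nan, np.nan, 0.6],
    'E': [np.nan, np.nan, 0.7, np.nan, 0.8],
})

# first() takes the first non-null value per column within each group,
# so it keeps working even if a group ever held two different readings
out = df.groupby('A', as_index=False).first()
print(out)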
