Subtracting columns based on key column in pandas dataframe - python

I have two dataframes looking like
df1:
ID A B C D
0 'ID1' 0.5 2.1 3.5 6.6
1 'ID2' 1.2 5.5 4.3 2.2
2 'ID1' 0.7 1.2 5.6 6.0
3 'ID3' 1.1 7.2 10. 3.2
df2:
ID A B C D
0 'ID1' 1.0 2.0 3.3 4.4
1 'ID2' 1.5 5.0 4.0 2.2
2 'ID3' 0.6 1.2 5.9 6.2
3 'ID4' 1.1 7.2 8.5 3.0
df1 can have multiple entries with the same ID, whereas each ID occurs only once in df2. Also, not all IDs in df2 are necessarily present in df1. I can't solve this with set_index(), since multiple rows in df1 can share the same ID and the IDs in df1 and df2 are not aligned.
I want to create a new dataframe where I subtract the values in df2[['A','B','C','D']] from df1[['A','B','C','D']] based on matching the ID.
The resulting dataframe would look like:
df_new:
ID A B C D
0 'ID1' -0.5 0.1 0.2 2.2
1 'ID2' -0.3 0.5 0.3 0.0
2 'ID1' -0.3 -0.8 2.3 1.6
3 'ID3' 0.5 6.0 4.1 -3.0
I know how to do this with a loop, but since I'm dealing with huge data quantities this is not practical at all. What is the best way of approaching this with Pandas?

You just need set_index and subtract
(df1.set_index('ID')-df2.set_index('ID')).dropna(axis=0)
Out[174]:
A B C D
ID
'ID1' -0.5 0.1 0.2 2.2
'ID1' -0.3 -0.8 2.3 1.6
'ID2' -0.3 0.5 0.3 0.0
'ID3' 0.5 6.0 4.1 -3.0
If the order matters, add reindex for df2:
(df1.set_index('ID')-df2.set_index('ID').reindex(df1.ID)).dropna(axis=0).reset_index()
Out[211]:
ID A B C D
0 'ID1' -0.5 0.1 0.2 2.2
1 'ID2' -0.3 0.5 0.3 0.0
2 'ID1' -0.3 -0.8 2.3 1.6
3 'ID3' 0.5 6.0 4.1 -3.0
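A self-contained version of the set_index/reindex approach, with the question's frames rebuilt inline so it runs standalone:

```python
import pandas as pd

# The question's sample frames, rebuilt so the snippet runs standalone
df1 = pd.DataFrame({'ID': ['ID1', 'ID2', 'ID1', 'ID3'],
                    'A': [0.5, 1.2, 0.7, 1.1],
                    'B': [2.1, 5.5, 1.2, 7.2],
                    'C': [3.5, 4.3, 5.6, 10.0],
                    'D': [6.6, 2.2, 6.0, 3.2]})
df2 = pd.DataFrame({'ID': ['ID1', 'ID2', 'ID3', 'ID4'],
                    'A': [1.0, 1.5, 0.6, 1.1],
                    'B': [2.0, 5.0, 1.2, 7.2],
                    'C': [3.3, 4.0, 5.9, 8.5],
                    'D': [4.4, 2.2, 6.2, 3.0]})

# reindex lines df2 up with df1's ID order (duplicates included),
# so the subtraction is purely position-wise
res = (df1.set_index('ID') - df2.set_index('ID').reindex(df1['ID'])).reset_index()
print(res)
```

Because the reindexed df2 carries exactly the same index as df1 (same values, same order), pandas skips realignment and subtracts row by row.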

Similarly to what Wen (who beat me to it) proposed, you can use pd.DataFrame.subtract:
df1.set_index('ID').subtract(df2.set_index('ID')).dropna()
A B C D
ID
'ID1' -0.5 0.1 0.2 2.2
'ID1' -0.3 -0.8 2.3 1.6
'ID2' -0.3 0.5 0.3 0.0
'ID3' 0.5 6.0 4.1 -3.0

One method is to use numpy. We can extract the ordered indices required from df2 using numpy.searchsorted (this requires df2['ID'] to be sorted, as it is here), then feed them into the construction of a new dataframe:
idx = np.searchsorted(df2['ID'], df1['ID'])
res = pd.DataFrame(df1.iloc[:, 1:].values - df2.iloc[:, 1:].values[idx],
                   index=df1['ID']).reset_index()
print(res)
ID 0 1 2 3
0 'ID1' -0.5 0.1 0.2 2.2
1 'ID2' -0.3 0.5 0.3 0.0
2 'ID1' -0.3 -0.8 2.3 1.6
3 'ID3' 0.5 6.0 4.1 -3.0
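A runnable version of the numpy approach, with the question's data rebuilt inline and named columns added. Note that np.searchsorted only returns correct positions because df2 is already sorted by 'ID':

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': ['ID1', 'ID2', 'ID1', 'ID3'],
                    'A': [0.5, 1.2, 0.7, 1.1], 'B': [2.1, 5.5, 1.2, 7.2],
                    'C': [3.5, 4.3, 5.6, 10.0], 'D': [6.6, 2.2, 6.0, 3.2]})
df2 = pd.DataFrame({'ID': ['ID1', 'ID2', 'ID3', 'ID4'],
                    'A': [1.0, 1.5, 0.6, 1.1], 'B': [2.0, 5.0, 1.2, 7.2],
                    'C': [3.3, 4.0, 5.9, 8.5], 'D': [4.4, 2.2, 6.2, 3.0]})

# For each df1 ID, find its row position in the sorted df2['ID']
idx = np.searchsorted(df2['ID'], df1['ID'])
res = pd.DataFrame(df1.iloc[:, 1:].values - df2.iloc[:, 1:].values[idx],
                   index=df1['ID'], columns=df1.columns[1:]).reset_index()
print(res)
```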

Related

How to plot a boxplot grouped by the column names in pandas?

my dataframe:
Q JJ R S R' S' P T JJ Q ... P T JJ Q \
0 -0.2 0.0 6.1 -1.0 0.0 0 0.6 2.1 0.0 0.0 ... 0.9 3.9 -0.3 0.0
1 -0.6 0.0 7.2 0.0 0.0 0 0.4 1.5 0.0 0.0 ... 0.4 2.6 -0.5 0.0
2 1.0 0.0 4.5 -2.8 0.0 0 0.3 2.5 0.8 -0.4 ... 0.4 3.4 0.9 0.0
3 0.9 0.0 7.8 -0.7 0.0 0 1.1 1.9 0.1 0.0 ... 0.6 3.0 0.1 0.0
4 0.0 0.0 5.2 -1.4 0.0 0 0.9 2.3 0.1 0.0 ... -0.2 2.9 -0.4 0.0
R S R' S' P T
0 9.0 -0.9 0.0 0 0.9 2.9
1 8.5 0.0 0.0 0 0.2 2.1
2 9.5 -2.4 0.0 0 0.3 3.4
3 12.2 -2.2 0.0 0 0.4 2.6
4 13.1 -3.6 0.0 0 -0.1 3.9
I'm trying to plot a boxplot grouped by the column names (there are 8 groups so I would expect 8 boxplots).
I used:
bp = df_net_wave_amplitude_for_std.plot.box(figsize=(20,8))
and
bp = df_net_wave_amplitude_for_std.boxplot(figsize=(20,8))
but I'm getting all of the columns on the x-axis instead of having them grouped by name:
I figured it out:
First I should have stacked the data:
df_net_wave_amplitude_for_std = df_net_wave_amplitude_for_std.stack()
Then I arranged the result: reset the index, got rid of a redundant column (called 'level_0') and renamed the columns:
df_net_wave_amplitude_for_std = df_net_wave_amplitude_for_std.reset_index()
del df_net_wave_amplitude_for_std['level_0']
df_net_wave_amplitude_for_std.columns = ['Wave', 'data']
and finally I could use the boxplot function:
bp = df_net_wave_amplitude_for_std.boxplot('data', by='Wave', figsize=(20,8))
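A minimal runnable sketch of the same stack-then-boxplot idea, using a tiny hypothetical frame with duplicate column names (the plotting call is commented out so the snippet runs headless):

```python
import pandas as pd

# Tiny stand-in for the wave data: two repeated column names
df = pd.DataFrame([[0.1, 6.1, 0.2, 9.0],
                   [0.4, 7.2, 0.3, 8.5]],
                  columns=['Q', 'R', 'Q', 'R'])

# stack() turns the repeated column labels into a grouping key
stacked = df.stack().reset_index()
stacked.columns = ['row', 'Wave', 'data']
stacked = stacked.drop(columns='row')
print(stacked)
# stacked.boxplot('data', by='Wave')  # draws one box per distinct column name
```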

Concat/join/merge multiple dataframes based on row index (number) of each individual dataframes

I want to read every nth row of a list of DataFrames and create a new DataFrame by appending all of the nth rows.
Let's say we have the following DataFrames:
>>> df1
A B C D
0 -0.8 -2.8 -0.3 -0.1
1 -0.1 -0.9 0.2 -0.7
2 0.7 -3.3 -1.1 -0.4
>>> df2
A B C D
0 1.4 -0.7 1.5 -1.3
1 1.6 1.4 1.4 0.2
2 -1.4 0.2 -1.7 0.7
>>> df3
A B C D
0 0.3 -0.5 -1.6 -0.8
1 0.2 -0.5 -1.1 1.6
2 -0.3 0.7 -1.0 1.0
I have used the following approach to get the desired df:
df = pd.DataFrame()
df_list = [df1, df2, df3]
for i in range(len(df1)):
    for x in df_list:
        df = df.append(x.loc[i], ignore_index=True)
Here's the result:
>>> df
A B C D
0 -0.8 -2.8 -0.3 -0.1
1 1.4 -0.7 1.5 -1.3
2 0.3 -0.5 -1.6 -0.8
3 -0.1 -0.9 0.2 -0.7
4 1.6 1.4 1.4 0.2
5 0.2 -0.5 -1.1 1.6
6 0.7 -3.3 -1.1 -0.4
7 -1.4 0.2 -1.7 0.7
8 -0.3 0.7 -1.0 1.0
I was just wondering if there is a pandas way of rewriting this code which would do the same thing (maybe by using .iterrows, pd.concat, pd.join, or pd.merge)?
Cheers
Update
Simply appending one df after another is not what I am looking for here.
The code should do:
df.row1 = df1.row1
df.row2 = df2.row1
df.row3 = df3.row1
df.row4 = df1.row2
df.row5 = df2.row2
df.row6 = df3.row2
...
For a single output dataframe, you can concatenate and sort by index; a stable sort keeps the df1 → df2 → df3 order within each row number:
res = pd.concat([df1, df2, df3]).sort_index(kind='stable').reset_index(drop=True)
A B C D
0 -0.8 -2.8 -0.3 -0.1
1 1.4 -0.7 1.5 -1.3
2 0.3 -0.5 -1.6 -0.8
3 -0.1 -0.9 0.2 -0.7
4 1.6 1.4 1.4 0.2
5 0.2 -0.5 -1.1 1.6
6 0.7 -3.3 -1.1 -0.4
7 -1.4 0.2 -1.7 0.7
8 -0.3 0.7 -1.0 1.0
For a dictionary of dataframes, you can concatenate and then group by index:
res = dict(tuple(pd.concat([df1, df2, df3]).groupby(level=0)))
With the dictionary defined as above, each value represents a row number. For example, res[0] will give the first row from each input dataframe.
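A self-contained sketch of the concat-and-sort interleaving with tiny stand-in frames (single-column frames assumed for brevity):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3]})
df2 = pd.DataFrame({'A': [4, 5, 6]})
df3 = pd.DataFrame({'A': [7, 8, 9]})

# All three share indexes 0..2; a stable sort on that index interleaves
# them while preserving the df1 -> df2 -> df3 order within each row number
res = pd.concat([df1, df2, df3]).sort_index(kind='stable').reset_index(drop=True)
print(res['A'].tolist())  # [1, 4, 7, 2, 5, 8, 3, 6, 9]
```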
There is pd.concat
df=pd.concat([df1,df2,df3]).reset_index(drop=True)
recommended by Jez
df=pd.concat([df1,df2,df3],ignore_index=True)
Try:
>>> df1 = pd.DataFrame({'A':['-0.8', '-0.1', '0.7'],
... 'B':['-2.8', '-0.9', '-3.3'],
... 'C':['-0.3', '0.2', '-1.1'],
... 'D':['-0.1', '-0.7', '-0.4']})
>>>
>>> df2 = pd.DataFrame({'A':['1.4', '1.6', '-1.4'],
... 'B':['-0.7', '1.4', '0.2'],
... 'C':['1.5', '1.4', '-1.7'],
... 'D':['-1.3', '0.2', '0.7']})
>>>
>>> df3 = pd.DataFrame({'A':['0.3', '0.2', '-0.3'],
... 'B':['-0.5', '-0.5', '0.7'],
... 'C':['-1.6', '-1.1', '-1.0'],
... 'D':['-0.8', '1.6', '1.0']})
>>> df=pd.concat([df1,df2,df3],ignore_index=True)
>>> print(df)
A B C D
0 -0.8 -2.8 -0.3 -0.1
1 -0.1 -0.9 0.2 -0.7
2 0.7 -3.3 -1.1 -0.4
3 1.4 -0.7 1.5 -1.3
4 1.6 1.4 1.4 0.2
5 -1.4 0.2 -1.7 0.7
6 0.3 -0.5 -1.6 -0.8
7 0.2 -0.5 -1.1 1.6
8 -0.3 0.7 -1.0 1.0
OR
df=pd.concat([df1,df2,df3], axis=0, join='outer', ignore_index=True)
Note:
axis: whether to concatenate along rows (0) or columns (1).
join: either 'inner' or 'outer' (the default); controls how labels on the other axis are combined. With 'outer', the non-concatenation axis labels may be sorted lexicographically.
ignore_index: whether to discard the original row labels. By default False; if True, the index labels are not used and the result is relabeled 0..n-1.
You can concatenate them keeping their original indexes as a column this way:
df_total = pd.concat([df1.reset_index(), df2.reset_index(),
df3.reset_index()])
>> df_total
index A B C D
0 0 -0.8 -2.8 -0.3 -0.1
1 1 -0.1 -0.9 0.2 -0.7
2 2 0.7 -3.3 -1.1 -0.4
0 0 1.4 -0.7 1.5 -1.3
1 1 1.6 1.4 1.4 0.2
2 2 -1.4 0.2 -1.7 0.7
0 0 0.3 -0.5 -1.6 -0.8
1 1 0.2 -0.5 -1.1 1.6
2 2 -0.3 0.7 -1.0 1.0
Then you can make a multiindex dataframe and order by index:
df_joined = df_total.reset_index(drop=True).reset_index()
>> df_joined
level_0 index A B C D
0 0 0 -0.8 -2.8 -0.3 -0.1
1 1 1 -0.1 -0.9 0.2 -0.7
2 2 2 0.7 -3.3 -1.1 -0.4
3 3 0 1.4 -0.7 1.5 -1.3
4 4 1 1.6 1.4 1.4 0.2
5 5 2 -1.4 0.2 -1.7 0.7
6 6 0 0.3 -0.5 -1.6 -0.8
7 7 1 0.2 -0.5 -1.1 1.6
8 8 2 -0.3 0.7 -1.0 1.0
>> df_joined = df_joined.set_index(['index', 'level_0']).sort_index()
>> df_joined
A B C D
index level_0
0 0 -0.8 -2.8 -0.3 -0.1
3 1.4 -0.7 1.5 -1.3
6 0.3 -0.5 -1.6 -0.8
1 1 -0.1 -0.9 0.2 -0.7
4 1.6 1.4 1.4 0.2
7 0.2 -0.5 -1.1 1.6
2 2 0.7 -3.3 -1.1 -0.4
5 -1.4 0.2 -1.7 0.7
8 -0.3 0.7 -1.0 1.0
You can put all this in a dataframe with a plain 0..n-1 index just by doing:
>> pd.DataFrame(df_joined.values, columns = df_joined.columns)
A B C D
0 -0.8 -2.8 -0.3 -0.1
1 1.4 -0.7 1.5 -1.3
2 0.3 -0.5 -1.6 -0.8
3 -0.1 -0.9 0.2 -0.7
4 1.6 1.4 1.4 0.2
5 0.2 -0.5 -1.1 1.6
6 0.7 -3.3 -1.1 -0.4
7 -1.4 0.2 -1.7 0.7
8 -0.3 0.7 -1.0 1.0

How to use boolean indexing with Pandas

I have a dataframe:
df =
time time b
0 0.0 1.1 21
1 0.1 2.2 22
2 0.2 3.3 23
3 0.3 4.4 24
4 0.4 5.5 24
I also have a series for my units, defined as
su =
time sal
time zulu
b m/s
Now, I want to set df.index equal to the "time (sal)" values. Those values can be in any column and I will need to check.
I can do this as:
df.index = df.values[:,(df.columns == 'time') & (su.values == 'sal')]
But, my index looks like:
array([[0.0],
[0.1],
[0.2],
[0.3],
[0.4]])
However, this is an array of arrays, and on bigger datasets plot seems to take longer. If I hardcode the column position, I get just a flat array:
df.index = df.values[:, 0]
array([0.0, 0.1, 0.2, 0.3, 0.4])
I can also do the following:
inx = ((df.columns == 'time') & (su.values == 'sal')).tolist().index(True)
This sets inx to 0, and then I get a single flat array:
df.index = df.values[:, inx]
However, I shouldn't have to do this. Am I using pandas and boolean indexing incorrectly?
I want:
df =
time time b
0.0 0.0 1.1 21
0.1 0.1 2.2 22
0.2 0.2 3.3 23
0.3 0.3 4.4 24
0.4 0.4 5.5 24
As I understood it, this is what you expected. However, I renamed the time columns time1 and time2, since a Python dictionary cannot hold two identical keys:
df = {'time1': [0.0, 0.1, 0.2, 0.3, 0.4], 'time2': [1.1, 2.2, 3.3, 4.4, 5.5], 'b': [21, 22, 23, 24, 24]}
su = {'time1': 'sal', 'time2': 'zulu', 'b': 'm/s'}
indexes = df[list(su.keys())[list(su.values()).index('sal')]]
df = pd.DataFrame(df, index=indexes, columns=['time1', 'time2', 'b'])
print(df)
Your original DataFrame has a duplicate column name, which adds complexity. Try modifying the column names.
Sample Code
unit = pd.Series(['sal', 'zulu', 'm/s'], index=['time', 'time', 'b'])
>>> df
time time b
0 0.0 1.1 21.0
1 0.1 2.2 22.0
2 0.2 3.3 23.0
3 0.3 4.4 24.0
4 0.4 5.5 25.0
new_col = ['{}({})'.format(df.columns[i], unit[i]) for i in range(len(df.columns))]
>>> new_col
['time(sal)', 'time(zulu)', 'b(m/s)']
>>> df.columns = new_col
>>> df
time(sal) time(zulu) b(m/s)
0 0.0 1.1 21.0
1 0.1 2.2 22.0
2 0.2 3.3 23.0
3 0.3 4.4 24.0
4 0.4 5.5 25.0
>>> df.index = df['time(sal)'].values
>>> df
time(sal) time(zulu) b(m/s)
0.0 0.0 1.1 21.0
0.1 0.1 2.2 22.0
0.2 0.2 3.3 23.0
0.3 0.3 4.4 24.0
0.4 0.4 5.5 25.0
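To answer the original question directly: the "array of arrays" is just an (n, 1) slice, and flattening it with ravel gives the plain index the OP wanted. A sketch with the sample values rebuilt (the duplicate 'time' columns are kept as in the question):

```python
import pandas as pd

df = pd.DataFrame([[0.0, 1.1, 21], [0.1, 2.2, 22], [0.2, 3.3, 23],
                   [0.3, 4.4, 24], [0.4, 5.5, 24]],
                  columns=['time', 'time', 'b'])
su = pd.Series(['sal', 'zulu', 'm/s'], index=['time', 'time', 'b'])

# Positional mask: column is named 'time' AND its unit is 'sal'
mask = (df.columns == 'time') & (su.values == 'sal')
df.index = df.values[:, mask].ravel()  # ravel() flattens the (n, 1) slice
print(df.index.tolist())  # [0.0, 0.1, 0.2, 0.3, 0.4]
```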

Taking each element in a column to calculate and create a new column using python

I have a dataset that looks like the following;
ID val
1 3.1
2 2.7
3 6.3
4 1.3
And want to calculate the similarity of val between each row and each other in order to obtain a matrix like the following
ID val c_1 c_2 c_3 c_4
1 3.1 0.0 0.4 -3.2 1.8
2 2.7 -0.4 0.0 -3.6 1.4
3 6.3 3.2 3.6 0.0 5.0
4 1.3 -1.8 -1.4 -5.0 0.0
I have got the following code:
def similarities(data):
    j = 0
    k = 0
    for i in data:
        data[j, k + 2] = data[j + 1] - data[j]
        j = j + 1
        k = k + 1
    return None
This evidently doesn't work at the moment, but is iterating through the dataset with indexes even the right approach?
I think you need np.subtract.outer: create a new DataFrame and join it to the original:
df1=pd.DataFrame(np.subtract.outer(df['val'], df['val']), columns=df['ID']).add_prefix('c_')
df = df.join(df1)
print (df)
ID val c_1 c_2 c_3 c_4
0 1 3.1 0.0 0.4 -3.2 1.8
1 2 2.7 -0.4 0.0 -3.6 1.4
2 3 6.3 3.2 3.6 0.0 5.0
3 4 1.3 -1.8 -1.4 -5.0 0.0
Another solution with broadcasting:
val = df.val.values
ids = df.ID.values
df1 = pd.DataFrame(val[:, None] - val, columns = ids).add_prefix('c_')
df = df.join(df1)
print (df)
ID val c_1 c_2 c_3 c_4
0 1 3.1 0.0 0.4 -3.2 1.8
1 2 2.7 -0.4 0.0 -3.6 1.4
2 3 6.3 3.2 3.6 0.0 5.0
3 4 1.3 -1.8 -1.4 -5.0 0.0
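A runnable version of the outer-subtraction solution, with the sample data rebuilt inline (the round(1) only tidies float noise for display):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4], 'val': [3.1, 2.7, 6.3, 1.3]})

# Pairwise differences val_i - val_j in one outer operation
diffs = pd.DataFrame(np.subtract.outer(df['val'].to_numpy(), df['val'].to_numpy()),
                     columns=df['ID']).add_prefix('c_')
res = df.join(diffs.round(1))
print(res)
```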
You can try this:
s = """
ID val
1 3.1
2 2.7
3 6.3
4 1.3
"""
data = [i.split() for i in filter(None, s.split('\n'))]
rows = [float(r[-1]) for r in data[1:]]
final_data = [[i + b] + [round(b - c, 2) for c in rows] for i, b in enumerate(rows, start=1)]
print('ID val {}'.format(' '.join('c_{}'.format(i) for i in range(1, len(rows)+1))))
for row in final_data:
    print(' '.join(map(str, row)))
Output:
ID val c_1 c_2 c_3 c_4
4.1 0.0 0.4 -3.2 1.8
4.7 -0.4 0.0 -3.6 1.4
9.3 3.2 3.6 0.0 5.0
5.3 -1.8 -1.4 -5.0 0.0

Pandas join/merge/concat two DataFrames and combine rows of identical key/index [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I am attempting to combine two sets of data, but I can't figure out which method is most suitable (join, merge, concat, etc.) for this application, and the documentation doesn't have any examples that do what I need to do.
I have two sets of data, structured like so:
>>> A
Time Voltage
1.0 5.1
2.0 5.5
3.0 5.3
4.0 5.4
5.0 5.0
>>> B
Time Current
-1.0 0.5
0.0 0.6
1.0 0.3
2.0 0.4
3.0 0.7
I would like to combine the data columns and merge the 'Time' column together so that I get the following:
>>> AB
Time Voltage Current
-1.0 0.5
0.0 0.6
1.0 5.1 0.3
2.0 5.5 0.4
3.0 5.3 0.7
4.0 5.4
5.0 5.0
I've tried AB = pd.merge_ordered(A, B, on='Time', how='outer'), and while it successfully combined the data, it output something akin to:
>>> AB
Time Voltage Current
-1.0 0.5
0.0 0.6
1.0 5.1
1.0 0.3
2.0 5.5
2.0 0.4
3.0 5.3
3.0 0.7
4.0 5.4
5.0 5.0
You'll note that it did not combine rows with shared 'Time' values.
I have also tried merging a la AB = A.merge(B, on='Time', how='outer'), but that outputs something combined, but not sorted, like so:
>>> AB
Time Voltage Current
-1.0 0.5
0.0 0.6
1.0 5.1
2.0 5.5
3.0 5.3 0.7
4.0 5.4
5.0 5.0
1.0 0.3
2.0 0.4
...it essentially skips some of the data in 'Current' and appends it to the bottom, but it does so inconsistently. And again, it does not merge the rows together.
I have also tried AB = pandas.concat([A, B], axis=1), but the result does not get merged. I simply get, well, the concatenation of the two DataFrames, like so:
>>> AB
Time Voltage Time Current
1.0 5.1 -1.0 0.5
2.0 5.5 0.0 0.6
3.0 5.3 1.0 0.3
4.0 5.4 2.0 0.4
5.0 5.0 3.0 0.7
I've been scouring the documentation and this site to try to figure out the exact differences between merge and join, but from what I gather they're pretty similar. Still, I haven't found anything that specifically answers the question of how to merge rows that share an identical key/index. Can anyone enlighten me on how to do this? I only have a few days' worth of experience with Pandas!
merge
merge combines on columns. By default it takes all commonly named columns. Otherwise, you can specify which columns to combine on. In this example, I chose 'Time'.
A.merge(B, 'outer', 'Time')
Time Voltage Current
0 1.0 5.1 0.3
1 2.0 5.5 0.4
2 3.0 5.3 0.7
3 4.0 5.4 NaN
4 5.0 5.0 NaN
5 -1.0 NaN 0.5
6 0.0 NaN 0.6
join
join combines on index values unless you specify the left-hand side's column instead. That is why I set the index on the right-hand side and specify a column, 'Time', for the left-hand side.
A.join(B.set_index('Time'), 'Time', 'outer')
Time Voltage Current
0 1.0 5.1 0.3
1 2.0 5.5 0.4
2 3.0 5.3 0.7
3 4.0 5.4 NaN
4 5.0 5.0 NaN
4 -1.0 NaN 0.5
4 0.0 NaN 0.6
pd.concat
concat combines on index values, so I set 'Time' as the index of each dataframe via a list comprehension over [A, B] (each frame takes the name d, hence the for d in [A, B]). axis=1 says to combine them side by side, using the index as the joining feature.
pd.concat([d.set_index('Time') for d in [A, B]], axis=1).reset_index()
Time Voltage Current
0 -1.0 NaN 0.5
1 0.0 NaN 0.6
2 1.0 5.1 0.3
3 2.0 5.5 0.4
4 3.0 5.3 0.7
5 4.0 5.4 NaN
6 5.0 5.0 NaN
combine_first
A.set_index('Time').combine_first(B.set_index('Time')).reset_index()
Time Current Voltage
0 -1.0 0.5 NaN
1 0.0 0.6 NaN
2 1.0 0.3 5.1
3 2.0 0.4 5.5
4 3.0 0.7 5.3
5 4.0 NaN 5.4
6 5.0 NaN 5.0
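The answers above all assume frames shaped like the question's; here is the merge variant end to end, sorted back into time order so it matches the desired AB layout:

```python
import pandas as pd

# Sample frames from the question
A = pd.DataFrame({'Time': [1.0, 2.0, 3.0, 4.0, 5.0],
                  'Voltage': [5.1, 5.5, 5.3, 5.4, 5.0]})
B = pd.DataFrame({'Time': [-1.0, 0.0, 1.0, 2.0, 3.0],
                  'Current': [0.5, 0.6, 0.3, 0.4, 0.7]})

# Outer merge keeps unmatched times from both sides as NaN rows
AB = A.merge(B, on='Time', how='outer').sort_values('Time').reset_index(drop=True)
print(AB)
```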
It should work properly if the Time column is of the same dtype in both DFs:
In [192]: A.merge(B, how='outer').sort_values('Time')
Out[192]:
Time Voltage Current
5 -1.0 NaN 0.5
6 0.0 NaN 0.6
0 1.0 5.1 0.3
1 2.0 5.5 0.4
2 3.0 5.3 0.7
3 4.0 5.4 NaN
4 5.0 5.0 NaN
In [193]: A.dtypes
Out[193]:
Time float64
Voltage float64
dtype: object
In [194]: B.dtypes
Out[194]:
Time float64
Current float64
dtype: object
Reproducing your problem:
In [198]: A.merge(B.assign(Time=B.Time.astype(str)), how='outer').sort_values('Time')
Out[198]:
Time Voltage Current
5 -1.0 NaN 0.5
6 0.0 NaN 0.6
0 1.0 5.1 NaN
7 1.0 NaN 0.3
1 2.0 5.5 NaN
8 2.0 NaN 0.4
2 3.0 5.3 NaN
9 3.0 NaN 0.7
3 4.0 5.4 NaN
4 5.0 5.0 NaN
In [199]: B.assign(Time=B.Time.astype(str)).dtypes
Out[199]:
Time object # <------ NOTE
Current float64
dtype: object
Visually it's hard to distinguish:
In [200]: B.assign(Time=B.Time.astype(str))
Out[200]:
Time Current
0 -1.0 0.5
1 0.0 0.6
2 1.0 0.3
3 2.0 0.4
4 3.0 0.7
In [201]: B
Out[201]:
Time Current
0 -1.0 0.5
1 0.0 0.6
2 1.0 0.3
3 2.0 0.4
4 3.0 0.7
Solution found
As per the suggestions below, I had to round the numbers in the 'Time' column prior to merging, despite the fact that both columns were of the same dtype (float64). The suggestion was to round like so:
A = A.assign(Time=A.Time.round(4))
But in my actual situation the column was labeled 'Time, (sec)' (the punctuation prevented its use as an assign keyword), so instead I used the following line to round it:
A['Time, (sec)'] = A['Time, (sec)'].round(4)
And it worked like a charm. Are there any issues with doing it like that?
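To illustrate why the rounding mattered, here is a hypothetical pair of frames whose keys differ only by float noise; the un-rounded merge fails to match them, while rounding first restores the match:

```python
import pandas as pd

A = pd.DataFrame({'Time': [0.1 + 0.2], 'Voltage': [5.1]})  # key is 0.30000000000000004
B = pd.DataFrame({'Time': [0.3], 'Current': [0.7]})

bad = A.merge(B, on='Time', how='outer')                                   # keys don't match
good = A.assign(Time=A['Time'].round(4)).merge(B, on='Time', how='outer')  # keys match
print(len(bad), len(good))  # 2 1
```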
