How to difference of two dataframes except one column in pandas? [duplicate]


This question already has answers here:
Subtract one dataframe from another excluding the first column Pandas
(3 answers)
Closed 4 years ago.
I have two data frames with same column names.
wave num stlines fwhm EWs MeasredWave
0 4050.32 3.0 0.282690 0.073650 22.160800 4050.311360
1 4208.98 5.5 0.490580 0.084925 44.323130 4208.973512
2 4374.94 9.0 0.714830 0.114290 86.964970 4374.927110
3 4379.74 9.0 0.314040 0.091070 30.442710 4379.760601
4 4398.01 14.0 0.504150 0.098450 52.832360 4398.007473
5 4502.21 8.0 0.562780 0.101090 60.559960 4502.205220
wave num stlines fwhm EWs MeasredWave
0 4050.32 3.0 0.276350 0.077770 22.876240 4050.310469
1 4208.98 5.5 0.493035 0.084065 44.095755 4208.974363
2 4374.94 6.0 0.716760 0.111550 85.111070 4374.927649
3 4379.74 1.0 0.299070 0.098400 31.325300 4379.759339
4 4398.01 6.0 0.508810 0.084530 45.783740 4398.004164
5 4502.21 9.0 0.572320 0.100540 61.252070 4502.205764
Both dataframes have the same column names, and the first column, wave, is identical in both. I want to take the difference of every column except wave.
The resulting dataframe should therefore contain wave unchanged, plus the difference of all the other columns.
How can I do that?

I believe you need to select the column names with Index.difference and then use DataFrame.sub:
cols = df1.columns.difference(['wave'])
# several columns can be excluded at once:
# cols = df1.columns.difference(['wave', 'MeasredWave'])
# assigning to df1[cols] overwrites only those columns; 'wave' in df1 is left untouched
df1[cols] = df1[cols].sub(df2[cols])
print(df1)
wave num stlines fwhm EWs MeasredWave
0 4050.32 0.0 0.006340 -0.00412 -0.715440 0.000891
1 4208.98 0.0 -0.002455 0.00086 0.227375 -0.000851
2 4374.94 3.0 -0.001930 0.00274 1.853900 -0.000539
3 4379.74 8.0 0.014970 -0.00733 -0.882590 0.001262
4 4398.01 8.0 -0.004660 0.01392 7.048620 0.003309
5 4502.21 -1.0 -0.009540 0.00055 -0.692110 -0.000544
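# alternative: store the result in df2 instead (run this on the original frames, not after the snippet above)
cols = df1.columns.difference(['wave'])
# assigning to df2[cols] overwrites only those columns; 'wave' in df2 is left untouched
df2[cols] = df1[cols].sub(df2[cols])
print(df2)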
wave num stlines fwhm EWs MeasredWave
0 4050.32 0.0 0.006340 -0.00412 -0.715440 0.000891
1 4208.98 0.0 -0.002455 0.00086 0.227375 -0.000851
2 4374.94 3.0 -0.001930 0.00274 1.853900 -0.000539
3 4379.74 8.0 0.014970 -0.00733 -0.882590 0.001262
4 4398.01 8.0 -0.004660 0.01392 7.048620 0.003309
5 4502.21 -1.0 -0.009540 0.00055 -0.692110 -0.000544
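As a self-contained illustration of the approach above, here is a minimal sketch on a tiny made-up pair of frames (the values are hypothetical, chosen only so the arithmetic is easy to check). It writes the result into a copy instead of overwriting df1, and also shows a variant that aligns rows on wave, which assumes the wave values are unique:

```python
import pandas as pd

# Hypothetical miniature versions of the two frames (not the data from the question)
df1 = pd.DataFrame({"wave": [4050.32, 4208.98], "num": [3.0, 5.5], "EWs": [22.0, 44.5]})
df2 = pd.DataFrame({"wave": [4050.32, 4208.98], "num": [3.0, 4.5], "EWs": [21.5, 44.0]})

# Every column except 'wave'
cols = df1.columns.difference(["wave"])

# Work on a copy so that df1 and df2 stay unchanged
diff = df1.copy()
diff[cols] = df1[cols].sub(df2[cols])

# Variant: align rows on 'wave' instead of on row position
# (assumes each wave value occurs exactly once in each frame)
diff_aligned = df1.set_index("wave").sub(df2.set_index("wave")).reset_index()

print(diff)
```

The copy-based form is handy when the original frames are still needed afterwards; the index-based form is safer when the rows of the two frames might not be in the same order.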

Related

More efficient alternative to nested For loop

I have two dataframes which contain data collected at two different frequencies.
I want to update the label of df2, to that of df1 if it falls into the duration of an event.
I created a nested for-loop to do it, but it takes a rather long time.
Here is the code I used:
for i in np.arange(len(df1)-1):
    for j in np.arange(len(df2)):
        if (df2.timestamp[j] > df1.timestamp[i]) & (df2.timestamp[j] < (df1.timestamp[i] + df1.duration[i])):
            df2.loc[j,"label"] = df1.loc[i,"label"]
Is there a more efficient way of doing this?
df1 size (367, 4)
df2 size (342423, 9)
short example data:
import numpy as np
import pandas as pd
data1 = {'timestamp': [1,2,3,4,5,6,7,8,9],
'duration': [0.5,0.3,0.8,0.2,0.4,0.5,0.3,0.7,0.5],
'label': ['inh','exh','inh','exh','inh','exh','inh','exh','inh']
}
df1 = pd.DataFrame (data1, columns = ['timestamp','duration','label'])
data2 = {'timestamp': [1,1.5,2,2.5,3,3.5,4,4.5,5,5.5,6,6.5,7,7.5,8,8.5,9,9.5],
'label': ['plc','plc','plc','plc','plc','plc','plc','plc','plc','plc','plc','plc','plc','plc','plc','plc','plc','plc']
}
df2 = pd.DataFrame (data2, columns = ['timestamp','label'])

I would first use merge_asof to pick, for each row of df2, the latest df1 timestamp that is at or before the df2 timestamp. A simple (vectorized) comparison of df2.timestamp with df1.timestamp + df1.duration is then enough to select the matching rows.
The code could be:
df1['t2'] = df1['timestamp'].astype('float64') # types of join columns must be the same
temp = pd.merge_asof(df2, df1, left_on='timestamp', right_on='t2')
df2.loc[temp.timestamp_x <= temp.t2 + temp.duration, 'label'] = temp.label_y
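It gives for df2 (the comparison here is inclusive, so timestamps exactly on an event boundary are relabelled, unlike with the strict inequalities in the question):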
timestamp label
0 1.0 inh
1 1.5 inh
2 2.0 exh
3 2.5 plc
4 3.0 inh
5 3.5 inh
6 4.0 exh
7 4.5 plc
8 5.0 inh
9 5.5 plc
10 6.0 exh
11 6.5 exh
12 7.0 inh
13 7.5 plc
14 8.0 exh
15 8.5 exh
16 9.0 inh
17 9.5 inh
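The same idea can be sketched without modifying either input frame. The snippet below is only an illustration under a few assumptions: the frame names events and samples, the renamed columns start and event_label, and the sample values are all made up for the example, and both frames are assumed to be sorted by timestamp (merge_asof requires sorted keys of matching dtype).

```python
import pandas as pd

# Hypothetical data: events with a start time and duration, samples at a higher rate
events = pd.DataFrame({
    "timestamp": [1.0, 2.0, 3.0],
    "duration": [0.5, 0.3, 0.8],
    "label": ["inh", "exh", "inh"],
})
samples = pd.DataFrame({
    "timestamp": [1.0, 1.25, 1.75, 2.0, 2.5, 3.5],
    "label": ["plc"] * 6,
})

# Rename the event columns so nothing collides with the sample columns
ev = events.rename(columns={"timestamp": "start", "label": "event_label"})

# For every sample, attach the latest event that started at or before it
merged = pd.merge_asof(samples, ev, left_on="timestamp", right_on="start")

# A sample belongs to the event only if it falls before the event ends
inside = merged["timestamp"] <= merged["start"] + merged["duration"]

# Take the event label where the sample is inside, keep the old label otherwise
merged["label"] = merged["event_label"].where(inside, merged["label"])
labelled = merged[["timestamp", "label"]]

print(labelled)
```

Samples that precede the first event get a missing start after the merge, so the comparison is False for them and they keep their original label.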

How to make different dataframes of different lengths become equal in length (downsampling and upsampling)

I have many dataframes (timeseries) that are of different lengths ranging between 28 and 179. I need to make them all of length 104. (upsampling those below 104 and downsampling those above 104)
For upsampling, the linear method can be sufficient to my needs. For downsampling, the mean of the values should be good.
To get all files to be the same length, I thought that I need to make all dataframes start and end at the same dates.
I was able to downsample all to the size of the smallest dataframe (i.e. 28) using below lines of code:
df.set_index(pd.date_range(start='1/1/1991' ,periods=len(df), end='1/1/2000'), inplace=True)
resampled=df.resample('120D').mean()
However, this will not give me good results when I feed them into the model I need them for as it shrinks the longer files so much thus distorting the data.
This is what I tried so far:
df.set_index(pd.date_range(start='1/1/1991' ,periods=len(df), end='1/1/2000'), inplace=True)
if df.shape[0]>100: resampled=df.resample('D').mean()
elif df.shape[0]<100: resampled=df.astype(float).resample('33D').interpolate(axis=0, method='linear')
else: break
Now, in the above lines of code, I am getting the files to be the same length (length 100). The downsampling part works fine too.
What's not working is the interpolation in the upsampling part. It just returns dataframes of length 100 in which the first value of every column is copied to all the rows.
What I need is to make them all size 104 (average size). This means any df of length>104 needs to downsampled and any df of length<104 needs to be upsampled.
As an example, please consider the two dfs as follows:
>>df1
index
0 3 -1 0
1 5 -3 2
2 9 -5 0
3 11 -7 -2
>>df2
index
0 3 -1 0
1 5 -3 2
2 9 -5 0
3 6 -3 -2
4 4 0 -4
5 8 2 -6
6 10 4 -8
7 12 6 -10
Suppose the avg length is 6, the expected output would be:
df1 upsampled to length 6 using interpolation, e.g. resample(rule).interpolate().
And df2 downsampled to length 6 using resample(rule).mean().
Update:
If I could get all the files to be upsampled to 179, that would be fine as well.

I assume the problem is that when you resample in the up-sampling case, the other values are not kept. With your example df1, you can see this by using asfreq on one column:
print (df1.set_index(pd.date_range(start='1/1/1991' ,periods=len(df1), end='1/1/2000'))[1]
          .resample('33D').asfreq().isna().sum(0))
# 99 of the 100 rows in the resampled column are NaN
So when you call interpolate instead of asfreq, it only has the first value to work with, which is why that first value is "repeated" over all the rows.
To get the result you want, call mean before interpolating, even in the up-sampling case:
print (df1.set_index(pd.date_range(start='1/1/1991' ,periods=len(df1), end='1/1/2000'))[1]
          .resample('33D').mean().interpolate().head())
1991-01-01 3.000000
1991-02-03 3.060606
1991-03-08 3.121212
1991-04-10 3.181818
1991-05-13 3.242424
Freq: 33D, Name: 1, dtype: float64
This gives the values you want.
To conclude, I think you can use the same command in both the up-sampling and the down-sampling case, because interpolate does not affect the result when down-sampling:
resampled = (df.set_index(pd.date_range(start='1/1/1991' ,periods=len(df), end='1/1/2000'))
               .resample('33D').mean().interpolate())
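To make the mean-then-interpolate step easy to try, here is a small self-contained sketch. The column names a and b are assumptions made for the example (the question does not name its columns), and the values are taken from the first two columns of the example df1.

```python
import pandas as pd

# Short frame to be up-sampled; column names are hypothetical
df = pd.DataFrame({"a": [3, 5, 9, 11], "b": [-1, -3, -5, -7]})

# Spread the rows evenly over a fixed date span, as in the question
stamped = df.set_index(
    pd.date_range(start="1991-01-01", end="2000-01-01", periods=len(df))
)

# mean() places each original value in the 33-day bin that contains it,
# leaving NaN in the empty bins; interpolate() then fills those linearly
upsampled = stamped.resample("33D").mean().interpolate()

print(len(upsampled))
print(upsampled.head())
```

Changing the rule string changes the output length, so reaching one specific length for every frame means choosing the rule (or the date span) accordingly.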

Here is my version using skimage.transform.resize() function:
df1 = pd.DataFrame({
'a': [3,5,9,11],
'b': [-1,-3,-5,-7],
'c': [0,2,0,-2]
})
df1
a b c
0 3 -1 0
1 5 -3 2
2 9 -5 0
3 11 -7 -2
import pandas as pd
import numpy as np
from skimage.transform import resize
def df_resample(df1, num=1):
    df2 = pd.DataFrame()
    for key, value in df1.items():  # DataFrame.iteritems() was removed in pandas 2.0
        temp = value.to_numpy()/value.abs().max() # normalize
        resampled = resize(temp, (num,1), mode='edge')*value.abs().max() # de-normalize
        df2[key] = resampled.flatten().round(2)
    return df2
df2 = df_resample(df1, 20) # resample to 20 rows
df2
a b c
0 3.0 -1.0 0.0
1 3.0 -1.0 0.0
2 3.0 -1.0 0.0
3 3.4 -1.4 0.4
4 3.8 -1.8 0.8
5 4.2 -2.2 1.2
6 4.6 -2.6 1.6
7 5.0 -3.0 2.0
8 5.8 -3.4 1.6
9 6.6 -3.8 1.2
10 7.4 -4.2 0.8
11 8.2 -4.6 0.4
12 9.0 -5.0 0.0
13 9.4 -5.4 -0.4
14 9.8 -5.8 -0.8
15 10.2 -6.2 -1.2
16 10.6 -6.6 -1.6
17 11.0 -7.0 -2.0
18 11.0 -7.0 -2.0
19 11.0 -7.0 -2.0

Pandas data manipulation - multiple measurements per line to one per line [duplicate]

This question already has answers here:
Reshape wide to long in pandas
(2 answers)
Closed 4 years ago.
I am manipulating a data frame using Pandas in Python to match a specific format.
I currently have a data frame with a row for each measurement location (A or B). Each row has a nominal target and multiple measured data points.
This is the format I currently have:
df=
Location Nominal Meas1 Meas2 Meas3
A 4.0 3.8 4.1 4.3
B 9.0 8.7 8.9 9.1
I need to manipulate this data so there is only one measured data point per row, and copy the Location and Nominal values from the source rows to the new rows. The measured data also needs to be put in the first column.
This is the format I need:
df =
Meas Location Nominal
3.8 A 4.0
4.1 A 4.0
4.3 A 4.0
8.7 B 9.0
8.9 B 9.0
9.1 B 9.0
I have tried concat and append functions with and without transpose() with no success.
This is the most similar example I was able to find, but it did not get me there:
for index, row in df.iterrows():
    pd.concat([row]*3, ignore_index=True)
Thank you!

It's a wide-to-long problem:
pd.wide_to_long(df,'Meas',i=['Location','Nominal'],j='drop').reset_index().drop(columns='drop')
Out[637]:
Location Nominal Meas
0 A 4.0 3.8
1 A 4.0 4.1
2 A 4.0 4.3
3 B 9.0 8.7
4 B 9.0 8.9
5 B 9.0 9.1

Another solution, using melt:
new_df = (df.melt(['Location','Nominal'],
                  ['Meas1', 'Meas2', 'Meas3'],
                  value_name = 'Meas')
            .drop('variable', axis=1)
            .sort_values('Location'))
>>> new_df
Location Nominal Meas
0 A 4.0 3.8
2 A 4.0 4.1
4 A 4.0 4.3
1 B 9.0 8.7
3 B 9.0 8.9
5 B 9.0 9.1
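Neither output above puts the measurement in the first column, which the question asked for. A minimal sketch of one way to do that with melt follows; the helper column name source and the variable names are assumptions made for the example.

```python
import pandas as pd

# The wide frame from the question
df = pd.DataFrame({
    "Location": ["A", "B"],
    "Nominal": [4.0, 9.0],
    "Meas1": [3.8, 8.7],
    "Meas2": [4.1, 8.9],
    "Meas3": [4.3, 9.1],
})

# Pick up every measurement column by its prefix
meas_cols = [c for c in df.columns if c.startswith("Meas")]

long_df = (
    df.melt(id_vars=["Location", "Nominal"], value_vars=meas_cols,
            var_name="source", value_name="Meas")
      # keep each location's measurements together, in their original order
      .sort_values(["Location", "source"])
      .reset_index(drop=True)
      # reorder so the measurement comes first and drop the helper column
      [["Meas", "Location", "Nominal"]]
)

print(long_df)
```

Sorting on the helper column as well as Location keeps Meas1, Meas2, Meas3 in order within each location, independent of the sort algorithm used.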

Pandas apply based on conditional from another column

I'm looking to adjust values of one column based on a conditional in another column.
I'm using np.busday_count, but I don't want the weekend values to behave like a Monday (Sat to Tues is given 1 working day, I'd like that to be 2)
dispdf = df[(df.dispatched_at.isnull()==False) & (df.sold_at.isnull()==False)]
dispdf["dispatch_working_days"] = np.busday_count(dispdf.sold_at.tolist(), dispdf.dispatched_at.tolist())
for i in range(len(dispdf)):
    if dispdf.dayofweek.iloc[i] == 5 or dispdf.dayofweek.iloc[i] == 6:
        dispdf.dispatch_working_days.iloc[i] +=1
Sample:
dayofweek dispatch_working_days
43159 1.0 3
48144 3.0 3
45251 6.0 1
49193 3.0 0
42470 3.0 1
47874 6.0 1
44500 3.0 1
43031 6.0 3
43193 0.0 4
43591 6.0 3
Expected Results:
dayofweek dispatch_working_days
43159 1.0 3
48144 3.0 3
45251 6.0 2
49193 3.0 0
42470 3.0 1
47874 6.0 2
44500 3.0 1
43031 6.0 2
43193 0.0 4
43591 6.0 4
At the moment I'm using this for loop to add a working day to Saturday and Sunday values. It's slow!
Can I use vectorization instead to speed this up? I tried using .apply, but to no avail.

Pretty sure this works, but there are more optimized implementations:
def adjust_dispatch(df_line):
    if df_line['dayofweek'] >= 5:
        return df_line['dispatch_working_days'] + 1
    else:
        return df_line['dispatch_working_days']
df['dispatch_working_days'] = df.apply(adjust_dispatch, axis=1)

The for loop in your code could be replaced by this line (note >= 5, so that both Saturday, 5, and Sunday, 6, are covered):
dispdf.loc[dispdf.dayofweek >= 5, 'dispatch_working_days'] += 1
Or you could use numpy.where:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html
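A minimal sketch of the numpy.where variant, on a small made-up frame (the four rows are hypothetical and only meant to show one weekday, one Saturday, one Sunday and another weekday):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: dayofweek 5 is Saturday, 6 is Sunday
dispdf = pd.DataFrame({
    "dayofweek": [1.0, 5.0, 6.0, 3.0],
    "dispatch_working_days": [3, 1, 1, 0],
})

# Boolean mask for rows that fall on a weekend
is_weekend = dispdf["dayofweek"] >= 5

# Add one working day on weekends, leave the other rows unchanged
dispdf["dispatch_working_days"] = np.where(
    is_weekend,
    dispdf["dispatch_working_days"] + 1,
    dispdf["dispatch_working_days"],
)

print(dispdf)
```

Both this and the .loc form operate on whole columns at once, which is what removes the per-row Python loop.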

python pandas:get rolling value of one Dataframe by rolling index of another Dataframe

I have two dataframes: one has multi levels of columns, and another has only single level column (which is the first level of the first dataframe, or say the second dataframe is calculated by grouping the first dataframe).
These two dataframes look like the following:
first dataframe-df1
second dataframe-df2
The relationship between df1 and df2 is:
df2 = df1.groupby(axis=1, level='sector').mean()
Then, I get the index of rolling_max of df1 by:
result1=pd.rolling_apply(df1,window=5,func=lambda x: pd.Series(x).idxmax(),min_periods=4)
Let me explain result1 a little bit. For example, during the five days (window length) 2016/2/23 - 2016/2/29, the max price of the stock sh600870 happened in 2016/2/24, the index of 2016/2/24 in the five-day range is 1. So, in result1, the value of stock sh600870 in 2016/2/29 is 1.
Now, I want to get the sector price for each stock by the index in result1.
Taking the same stock as an example, sh600870 is in the sector '家用电器视听器材白色家电'. So for 2016/2/29, I want to get the sector price on 2016/2/24, which is 8.770.
How can I do that?

idxmax (or np.argmax) returns an index which is relative to the rolling
window. To make the index relative to df1, add the index of the left edge of
the rolling window:
index = pd.rolling_apply(df1, window=5, min_periods=4, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=5, min_periods=4)
index = index.add(shift, axis=0)
Once you have ordinal indices relative to df1, you can use them to index
into df1 or df2 using .iloc.
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
N = 15
columns = pd.MultiIndex.from_product([['foo','bar'], ['A','B']])
columns.names = ['sector', 'stock']
dates = pd.date_range('2016-02-01', periods=N, freq='D')
df1 = pd.DataFrame(np.random.randint(10, size=(N, 4)), columns=columns, index=dates)
df2 = df1.groupby(axis=1, level='sector').mean()
window_size, min_periods = 5, 4
index = pd.rolling_apply(df1, window=window_size, min_periods=min_periods, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=window_size, min_periods=min_periods)
# alternatively, you could use
# shift = np.pad(np.arange(len(df1)-window_size+1), (window_size-1, 0), mode='constant')
# but this is harder to read and understand, and therefore may be more prone to bugs.
index = index.add(shift, axis=0)
result = pd.DataFrame(index=df1.index, columns=df1.columns)
for col in index:
    sector, stock = col
    mask = pd.notnull(index[col])
    idx = index.loc[mask, col].astype(int)
    result.loc[mask, col] = df2[sector].iloc[idx].values
print(result)
yields
sector foo bar
stock A B A B
2016-02-01 NaN NaN NaN NaN
2016-02-02 NaN NaN NaN NaN
2016-02-03 NaN NaN NaN NaN
2016-02-04 5.5 5 5 7.5
2016-02-05 5.5 5 5 8.5
2016-02-06 5.5 6.5 5 8.5
2016-02-07 5.5 6.5 5 8.5
2016-02-08 6.5 6.5 5 8.5
2016-02-09 6.5 6.5 6.5 8.5
2016-02-10 6.5 6.5 6.5 6
2016-02-11 6 6.5 4.5 6
2016-02-12 6 6.5 4.5 4
2016-02-13 2 6.5 4.5 5
2016-02-14 4 6.5 4.5 5
2016-02-15 4 6.5 4 3.5
Note that the rolling_apply syntax changed in pandas 0.18, and the module-level pd.rolling_* functions have since been removed. DataFrames and Series now have a rolling method, so you would use:
index = df1.rolling(window=window_size, min_periods=min_periods).apply(np.argmax)
shift = (pd.Series(np.arange(len(df1)))
.rolling(window=window_size, min_periods=min_periods).min())
index = index.add(shift.values, axis=0)
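Putting the pieces together with the rolling method, a complete run might look like the sketch below. It is an illustration under some assumptions: the data is random (so the printed numbers will not match the table above), the sector means are computed through a transpose to avoid groupby(axis=1), which recent pandas releases deprecate, and the left edge of each window is computed directly with np.maximum instead of a rolling minimum.

```python
import numpy as np
import pandas as pd

# Random demo data with (sector, stock) columns; values are made up
rng = np.random.default_rng(2016)
N = 15
columns = pd.MultiIndex.from_product([["foo", "bar"], ["A", "B"]],
                                     names=["sector", "stock"])
dates = pd.date_range("2016-02-01", periods=N, freq="D")
df1 = pd.DataFrame(rng.integers(10, size=(N, 4)).astype(float),
                   columns=columns, index=dates)

# Sector mean per date, computed via a transpose instead of groupby(axis=1)
df2 = df1.T.groupby(level="sector").mean().T

window_size, min_periods = 5, 4

# Position of the maximum inside each rolling window (relative to the window)
relative = df1.rolling(window=window_size, min_periods=min_periods).apply(
    np.argmax, raw=True)

# Row number of the left edge of each window, clipped at the first row
left_edge = pd.Series(np.maximum(np.arange(N) - (window_size - 1), 0),
                      index=df1.index)

# Position of the maximum relative to df1 itself
absolute = relative.add(left_edge, axis=0)

# Look up the sector value at that position for every stock
result = pd.DataFrame(np.nan, index=df1.index, columns=df1.columns)
for sector, stock in df1.columns:
    pos = absolute[(sector, stock)]
    valid = pos.notna()
    rows = pos[valid].astype(int).to_numpy()
    result.loc[valid, (sector, stock)] = df2[sector].to_numpy()[rows]

print(result)
```

The first min_periods - 1 rows stay NaN because the window is not yet long enough to produce a position.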
