Find the time difference between two dataframes - Python

df1:
time_1 a b c
0 1.675168e+09 -90.56 5.28 -6.23
1 1.675168e+09 -87.98 5.27 -5.68
2 1.675168e+09 -83.96 14.74 -9.44
3 1.675168e+09 -85.58 -5.72 -5.27
4 1.675168e+09 -95.13 -4.15 -5.46
5 1.675168e+09 -90.56 5.28 -6.23
6 1.675168e+09 -87.98 5.27 -5.68
7 1.675168e+09 -83.96 14.74 -9.44
8 1.675168e+09 -85.58 -5.72 -5.27
9 1.675168e+09 -95.13 -4.15 -5.46
df2:
time_2 x y z
0 1.675168e+09 -6.64 542.397494 2.25
1 1.675168e+09 -6.64 541.233179 2.25
2 1.675169e+09 -6.63 567.644365 2.25
3 1.675169e+09 -6.63 530.368776 2.25
4 1.675170e+09 -6.63 552.896863 2.25
I would like to get the difference of the times, i.e. time_1 in df1 minus all the time_2 values in df2.
df:
time_1 - time_2 a b c y
0 1.675168e+09 - 1.675168e+09
1 1.675168e+09 - 1.675168e+09
2 1.675168e+09 - 1.675169e+09
3 1.675168e+09 - 1.675169e+09
4 1.675168e+09 - 1.675170e+09
... and so on for every remaining (time_1, time_2) combination

import pandas as pd

df = pd.DataFrame({'time_1': [1, -1, 1.5]})
df1 = pd.DataFrame({'time_2': [-2.5, 1.5, 2.6]})
# combine the two dataframes into one, concatenated along the column axis
df2 = pd.concat([df, df1], axis="columns")
# assign a new column holding the difference between the two times
df2 = df2.assign(Time_diff=df2['time_1'] - df2['time_2'])
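For the toy frames above, Time_diff is the element-wise difference (values checked by hand):

print(df2)
#    time_1  time_2  Time_diff
# 0     1.0    -2.5        3.5
# 1    -1.0     1.5       -2.5
# 2     1.5     2.6       -1.1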

Update according to your comment: the row-wise subtraction above only works for equal-length, index-aligned frames. To subtract every time_2 value from every time_1, use merge with how='cross':
out = df1.merge(df2, how='cross').assign(time=lambda x: x.pop('time_1') - x.pop('time_2'))
print(out)
# Output
a b c x y z time
0 -90.56 5.28 -6.23 -6.64 542.397494 2.25 0.0
1 -90.56 5.28 -6.23 -6.64 541.233179 2.25 0.0
2 -90.56 5.28 -6.23 -6.63 567.644365 2.25 -1000.0
3 -90.56 5.28 -6.23 -6.63 530.368776 2.25 -1000.0
4 -90.56 5.28 -6.23 -6.63 552.896863 2.25 -2000.0
...
45 -95.13 -4.15 -5.46 -6.64 542.397494 2.25 0.0
46 -95.13 -4.15 -5.46 -6.64 541.233179 2.25 0.0
47 -95.13 -4.15 -5.46 -6.63 567.644365 2.25 -1000.0
48 -95.13 -4.15 -5.46 -6.63 530.368776 2.25 -1000.0
49 -95.13 -4.15 -5.46 -6.63 552.896863 2.25 -2000.0
Alternatively, for a row-by-row difference, you can join your dataframes (aligned on their indexes):
out = df2.join(df1).assign(time=lambda x: x.pop('time_1') - x.pop('time_2'))
print(out)
# Output
x y z a b c time
0 -6.64 542.397494 2.25 -90.56 5.28 -6.23 0.0
1 -6.64 541.233179 2.25 -87.98 5.27 -5.68 0.0
2 -6.63 567.644365 2.25 -83.96 14.74 -9.44 -1000.0
3 -6.63 530.368776 2.25 -85.58 -5.72 -5.27 -1000.0
4 -6.63 552.896863 2.25 -95.13 -4.15 -5.46 -2000.0
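Note that time_1 and time_2 look like Unix epoch timestamps in seconds, so the time column above is a difference in seconds (-1000.0 is about -16.7 minutes). If you prefer a proper Timedelta, a minimal sketch, assuming the columns really are epoch seconds:

# convert the seconds difference into a pandas Timedelta
out['time'] = pd.to_timedelta(out['time'], unit='s')
# e.g. -1000.0 becomes Timedelta('-1 days +23:43:20')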

Related

Add to values of a DataFrame using coordinates

I have a dataframe a:
Out[68]:
p0_4 p5_7 p8_9 p10_14 p15 p16_17 p18_19 p20_24 p25_29 \
0 1360.0 921.0 676.0 1839.0 336.0 668.0 622.0 1190.0 1399.0
1 308.0 197.0 187.0 411.0 67.0 153.0 172.0 336.0 385.0
2 76.0 59.0 40.0 72.0 16.0 36.0 20.0 56.0 82.0
3 765.0 608.0 409.0 1077.0 220.0 359.0 342.0 873.0 911.0
4 1304.0 906.0 660.0 1921.0 375.0 725.0 645.0 1362.0 1474.0
5 195.0 135.0 78.0 262.0 44.0 97.0 100.0 265.0 229.0
6 1036.0 965.0 701.0 1802.0 335.0 701.0 662.0 1321.0 1102.0
7 5072.0 3798.0 2865.0 7334.0 1399.0 2732.0 2603.0 4976.0 4575.0
8 1360.0 962.0 722.0 1758.0 357.0 710.0 713.0 1761.0 1660.0
9 743.0 508.0 369.0 1118.0 286.0 615.0 429.0 738.0 885.0
10 1459.0 1015.0 679.0 1732.0 337.0 746.0 677.0 1493.0 1546.0
11 828.0 519.0 415.0 1057.0 190.0 439.0 379.0 788.0 1024.0
12 1042.0 690.0 503.0 1204.0 219.0 451.0 465.0 1193.0 1406.0
p30_44 p45_59 p60_64 p65_74 p75_84 p85_89 p90plus
0 4776.0 8315.0 2736.0 5463.0 2819.0 738.0 451.0
1 1004.0 2456.0 988.0 2007.0 1139.0 313.0 153.0
2 291.0 529.0 187.0 332.0 108.0 31.0 10.0
3 2807.0 5505.0 2060.0 4104.0 2129.0 516.0 252.0
4 4524.0 9406.0 3034.0 6003.0 3366.0 840.0 471.0
5 806.0 1490.0 606.0 1288.0 664.0 185.0 108.0
6 4127.0 8311.0 2911.0 6111.0 3525.0 1029.0 707.0
7 16917.0 27547.0 8145.0 15950.0 9510.0 2696.0 1714.0
8 5692.0 9380.0 3288.0 6458.0 3830.0 1050.0 577.0
9 2749.0 5696.0 2014.0 4165.0 2352.0 603.0 288.0
10 4676.0 7654.0 2502.0 5077.0 3004.0 754.0 461.0
11 2799.0 4880.0 1875.0 3951.0 2294.0 551.0 361.0
12 3288.0 5661.0 1974.0 4007.0 2343.0 623.0 303.0
and a series d:
Out[70]:
2 p45_59
10 p45_59
11 p45_59
Is there a simple way to add 1 to the numbers in a at the index and column labels given by d?
I have tried:
a[d] +=1
However this adds 1 to every value in the column, not just the values with indices 2, 10 and 11.
Thanks in advance.
You might want to try this.
a.loc[list(d.index), list(d.values)] += 1
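If d could map different rows to different columns, selecting with the two full lists would target the whole row/column cross product (and the repeated labels in d.values are better deduplicated). A per-pair loop is a safe general fallback (a sketch):

# increment exactly one cell per (row index, column label) pair in d
for idx, col in d.items():
    a.loc[idx, col] += 1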

Split a pandas dataframe into multiple dataframes if all rows are nan

I have the following dataframe.
a b c d
0 4.65 30.572857 133.899994 23.705000
1 4.77 30.625713 134.690002 23.225000
2 4.73 30.138571 132.250000 23.040001
3 5.07 30.082857 130.000000 23.290001
4 4.98 30.282858 133.520004 23.389999
5 NaN NaN NaN NaN
6 4.82 29.674286 127.349998 23.700001
7 4.83 30.092857 129.110001 24.254999
8 4.85 29.918571 127.349998 24.695000
9 4.70 29.418571 127.139999 24.424999
10 4.69 30.719999 127.610001 25.200001
11 NaN NaN NaN NaN
12 4.43 29.724285 126.620003 24.764999
13 NaN NaN NaN NaN
14 4.29 29.010000 120.309998 24.730000
15 4.11 29.420000 119.480003 25.035000
I want to split this df into multiple dfs wherever a row is all NaN.
I explored the following links but could not figure out how to apply them to my problem.
Split pandas dataframe in two if it has more than 10 rows
Splitting dataframe into multiple dataframes
In my example, I would have 4 dataframes with 5, 5, 1 and 2 rows as the output.
Please suggest the way forward.
Using isna, all, cumsum and groupby.
First we check whether all the values in a row are NaN, then use cumsum to build a group indicator that increments at every all-NaN row, and finally collect these dataframes in a list with groupby:
grps = df.isna().all(axis=1).cumsum()
dfs = [df.dropna() for _, df in df.groupby(grps)]
for df in dfs:
    print(df)
a b c d
0 4.65 30.572857 133.899994 23.705000
1 4.77 30.625713 134.690002 23.225000
2 4.73 30.138571 132.250000 23.040001
3 5.07 30.082857 130.000000 23.290001
4 4.98 30.282858 133.520004 23.389999
a b c d
6 4.82 29.674286 127.349998 23.700001
7 4.83 30.092857 129.110001 24.254999
8 4.85 29.918571 127.349998 24.695000
9 4.70 29.418571 127.139999 24.424999
10 4.69 30.719999 127.610001 25.200001
a b c d
12 4.43 29.724285 126.620003 24.764999
a b c d
14 4.29 29.01 120.309998 24.730
15 4.11 29.42 119.480003 25.035
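For reference, the intermediate grps is a Series that increments at every all-NaN row, which is exactly what makes groupby cut the frame at those positions:

print(grps.tolist())
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3]

The all-NaN separator rows carry the new group label but are removed again by dropna.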
Something like this should do the trick:
import pandas as pd
import numpy as np

data_frame = pd.DataFrame({"a": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "b": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "c": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "d": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "e": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "f": [1, np.nan, 3, np.nan, 4, np.nan, 5]})
# labels of the all-NaN separator rows
all_nan = data_frame.index[data_frame.isnull().all(1)]
df_list = []
prev = 0
for i in all_nan:
    df_list.append(data_frame[prev:i])
    prev = i + 1
# keep the chunk after the last separator row as well
df_list.append(data_frame[prev:])
for i in df_list:
    print(i)
Just another flavor of doing the same thing:
nan_indices = df.index[df.isna().all(axis=1)]
df_list = [df.dropna() for df in np.split(df, nan_indices)]
df_list
[ a b c d
0 4.65 30.572857 133.899994 23.705000
1 4.77 30.625713 134.690002 23.225000
2 4.73 30.138571 132.250000 23.040001
3 5.07 30.082857 130.000000 23.290001
4 4.98 30.282858 133.520004 23.389999,
a b c d
6 4.82 29.674286 127.349998 23.700001
7 4.83 30.092857 129.110001 24.254999
8 4.85 29.918571 127.349998 24.695000
9 4.70 29.418571 127.139999 24.424999
10 4.69 30.719999 127.610001 25.200001,
a b c d
12 4.43 29.724285 126.620003 24.764999,
a b c d
14 4.29 29.01 120.309998 24.730
15 4.11 29.42 119.480003 25.035]
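One caveat on this np.split flavor: np.split cuts at positions, not labels, so passing nan_indices only works because the frame has the default RangeIndex, where labels and positions coincide. With an arbitrary index, the positions can be computed explicitly (a sketch):

import numpy as np

# positional locations of the all-NaN separator rows
nan_pos = np.flatnonzero(df.isna().all(axis=1))
df_list = [d.dropna() for d in np.split(df, nan_pos)]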

Pandas/Python: interpolation of multiple columns based on values specified for one reference column

df
Out[1]:
PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
0 978.0 345 17.0 16.5 97 12.22 0 0 292.0 326.8 294.1
1 977.0 354 17.8 16.7 93 12.39 1 0 292.9 328.3 295.1
2 970.0 416 23.4 15.4 61 11.47 4 2 299.1 332.9 301.2
3 963.0 479 24.0 14.0 54 10.54 8 3 300.4 331.6 302.3
4 948.7 610 23.0 13.4 55 10.28 15 6 300.7 331.2 302.5
5 925.0 830 21.4 12.4 56 9.87 20 5 301.2 330.6 303.0
6 916.0 914 20.7 11.7 56 9.51 20 4 301.3 329.7 303.0
7 884.0 1219 18.2 9.2 56 8.31 60 4 301.8 326.7 303.3
8 853.1 1524 15.7 6.7 55 7.24 35 3 302.2 324.1 303.5
9 850.0 1555 15.4 6.4 55 7.14 20 2 302.3 323.9 303.6
10 822.8 1829 13.3 5.6 60 6.98 300 4 302.9 324.0 304.1
How do I interpolate the values of all the columns at specified PRES (pressure) values, say PRES=[950, 900, 875]? Is there an elegant, pandas-native way to do this?
The only way I can think of is to first insert an empty (all-NaN) row for each specified PRES value in a loop, then set PRES as the index and use pandas' native interpolate option:
df.interpolate(method='index', inplace=True)
Is there a more elegant solution?
Use your solution without a loop: reindex by the union of the original index values with the PRES list. Note this works only if all values are unique:
PRES=[950, 900, 875]
df = df.set_index('PRES')
df = df.reindex(df.index.union(PRES)).sort_index(ascending=False).interpolate(method='index')
print (df)
HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
978.0 345.0 17.0 16.5 97.0 12.22 0.0 0.0 292.0 326.8 294.1
977.0 354.0 17.8 16.7 93.0 12.39 1.0 0.0 292.9 328.3 295.1
970.0 416.0 23.4 15.4 61.0 11.47 4.0 2.0 299.1 332.9 301.2
963.0 479.0 24.0 14.0 54.0 10.54 8.0 3.0 300.4 331.6 302.3
950.0 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
948.7 610.0 23.0 13.4 55.0 10.28 15.0 6.0 300.7 331.2 302.5
925.0 830.0 21.4 12.4 56.0 9.87 20.0 5.0 301.2 330.6 303.0
916.0 914.0 20.7 11.7 56.0 9.51 20.0 4.0 301.3 329.7 303.0
900.0 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
884.0 1219.0 18.2 9.2 56.0 8.31 60.0 4.0 301.8 326.7 303.3
875.0 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
853.1 1524.0 15.7 6.7 55.0 7.24 35.0 3.0 302.2 324.1 303.5
850.0 1555.0 15.4 6.4 55.0 7.14 20.0 2.0 302.3 323.9 303.6
822.8 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
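One caveat: interpolate(method='index') assumes a monotonically increasing index, which may be why the inserted rows above show the last row's values rather than interpolated ones. A variant that interpolates on the ascending index first and only then restores the descending order (a sketch):

PRES = [950, 900, 875]
df = df.set_index('PRES')
df = (df.reindex(df.index.union(PRES))
        .sort_index()                      # ascending, as interpolate expects
        .interpolate(method='index')
        .sort_index(ascending=False))      # restore the descending order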
If the PRES column can contain non-unique values, use concat with sort_index instead:
PRES=[950, 900, 875]
df = df.set_index('PRES')
df = (pd.concat([df, pd.DataFrame(index=PRES)])
        .sort_index(ascending=False)
        .interpolate(method='index'))
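Either way, if you only need the rows for the requested pressure levels afterwards, they can be selected back by label (a sketch):

print(df.loc[PRES])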

Rename specific columns with numbers with str+number

I originally have r csv files.
I created one dataframe with 9 columns, and r of them have numbers as headers.
I would like to target only those columns and rename them to 'Apple' plus a running number (Apple1, Apple2, ..., up to len(files)).
Example:
I have 3 csv files.
The current 3 targeted columns in my dataframe are:
0 1 2
0 444.0 286.0 657.0
1 2103.0 2317.0 2577.0
2 157.0 200.0 161.0
3 4000.0 3363.0 4986.0
4 1042.0 541.0 872.0
5 1607.0 1294.0 3305.0
I would like:
Apple1 Apple2 Apple3
0 444.0 286.0 657.0
1 2103.0 2317.0 2577.0
2 157.0 200.0 161.0
3 4000.0 3363.0 4986.0
4 1042.0 541.0 872.0
5 1607.0 1294.0 3305.0
Thank you
IIUC, you can initialise an itertools.count object and rebuild the columns in a list comprehension.
from itertools import count
cnt = count(1)
df.columns = ['Apple{}'.format(next(cnt)) if str(x).isdigit() else x
              for x in df.columns]
This also works well when the numeric column names are not contiguous but you want the new suffixes to be:
print(df)
1 Col1 5 Col2 500
0 1240.0 552.0 1238.0 52.0 1370.0
1 633.0 435.0 177.0 2201.0 185.0
2 1518.0 936.0 385.0 288.0 427.0
3 212.0 660.0 320.0 438.0 1403.0
4 15.0 556.0 501.0 1259.0 1298.0
5 177.0 718.0 1420.0 833.0 984.0
cnt = count(1)
df.columns = ['Apple{}'.format(next(cnt)) if str(x).isdigit() else x
              for x in df.columns]
print(df)
Apple1 Col1 Apple2 Col2 Apple3
0 1240.0 552.0 1238.0 52.0 1370.0
1 633.0 435.0 177.0 2201.0 185.0
2 1518.0 936.0 385.0 288.0 427.0
3 212.0 660.0 320.0 438.0 1403.0
4 15.0 556.0 501.0 1259.0 1298.0
5 177.0 718.0 1420.0 833.0 984.0
You can use rename with a columns mapper (rename_axis with a callable used to relabel columns as well, but that behaviour is deprecated in favour of rename):
df.rename(columns=lambda x: 'Apple{}'.format(int(x) + 1) if str(x).isdigit() else x)
Out[9]:
Apple1 Apple2 Apple3
0 444.0 286.0 657.0
1 2103.0 2317.0 2577.0
2 157.0 200.0 161.0
3 4000.0 3363.0 4986.0
4 1042.0 541.0 872.0
5 1607.0 1294.0 3305.0

How to group pandas DataFrame by varying dates?

I am trying to roll up daily data into fiscal quarter data. For example, I have a table with fiscal quarter end dates:
Company Period Quarter_End
M 2016Q1 05/02/2015
M 2016Q2 08/01/2015
M 2016Q3 10/31/2015
M 2016Q4 01/30/2016
WFM 2015Q2 04/12/2015
WFM 2015Q3 07/05/2015
WFM 2015Q4 09/27/2015
WFM 2016Q1 01/17/2016
and a table of daily data:
Company Date Price
M 06/20/2015 1.05
M 06/22/2015 4.05
M 07/10/2015 3.45
M 07/29/2015 1.86
M 08/24/2015 1.58
M 09/02/2015 8.64
M 09/22/2015 2.56
M 10/20/2015 5.42
M 11/02/2015 1.58
M 11/24/2015 4.58
M 12/03/2015 6.48
M 12/05/2015 4.56
M 01/03/2016 7.14
M 01/30/2016 6.34
WFM 06/20/2015 1.05
WFM 06/22/2015 4.05
WFM 07/10/2015 3.45
WFM 07/29/2015 1.86
WFM 08/24/2015 1.58
WFM 09/02/2015 8.64
WFM 09/22/2015 2.56
WFM 10/20/2015 5.42
WFM 11/02/2015 1.58
WFM 11/24/2015 4.58
WFM 12/03/2015 6.48
WFM 12/05/2015 4.56
WFM 01/03/2016 7.14
WFM 01/17/2016 6.34
And I would like to create the table below.
Company Period Quarter_end Sum(Price)
M 2016Q2 8/1/2015 10.41
M 2016Q3 10/31/2015 18.2
M 2016Q4 1/30/2016 30.68
WFM 2015Q3 7/5/2015 5.1
WFM 2015Q4 9/27/2015 18.09
WFM 2016Q1 1/17/2016 36.1
However, I don't know how to group by varying dates without looping through each record. Any help is greatly appreciated.
Thanks!
I think you can use merge_ordered:
#first convert columns to datetime
df1.Quarter_End = pd.to_datetime(df1.Quarter_End)
df2.Date = pd.to_datetime(df2.Date)
df = pd.merge_ordered(df1,
                      df2,
                      left_on=['Company', 'Quarter_End'],
                      right_on=['Company', 'Date'],
                      how='outer')
print (df)
Company Period Quarter_End Date Price
0 M 2016Q1 2015-05-02 NaT NaN
1 M NaN NaT 2015-06-20 1.05
2 M NaN NaT 2015-06-22 4.05
3 M NaN NaT 2015-07-10 3.45
4 M NaN NaT 2015-07-29 1.86
5 M 2016Q2 2015-08-01 NaT NaN
6 M NaN NaT 2015-08-24 1.58
7 M NaN NaT 2015-09-02 8.64
8 M NaN NaT 2015-09-22 2.56
9 M NaN NaT 2015-10-20 5.42
10 M 2016Q3 2015-10-31 NaT NaN
11 M NaN NaT 2015-11-02 1.58
12 M NaN NaT 2015-11-24 4.58
13 M NaN NaT 2015-12-03 6.48
14 M NaN NaT 2015-12-05 4.56
15 M NaN NaT 2016-01-03 7.14
16 M 2016Q4 2016-01-30 2016-01-30 6.34
17 WFM 2015Q2 2015-04-12 NaT NaN
18 WFM NaN NaT 2015-06-20 1.05
19 WFM NaN NaT 2015-06-22 4.05
20 WFM 2015Q3 2015-07-05 NaT NaN
21 WFM NaN NaT 2015-07-10 3.45
22 WFM NaN NaT 2015-07-29 1.86
23 WFM NaN NaT 2015-08-24 1.58
24 WFM NaN NaT 2015-09-02 8.64
25 WFM NaN NaT 2015-09-22 2.56
26 WFM 2015Q4 2015-09-27 NaT NaN
27 WFM NaN NaT 2015-10-20 5.42
28 WFM NaN NaT 2015-11-02 1.58
29 WFM NaN NaT 2015-11-24 4.58
30 WFM NaN NaT 2015-12-03 6.48
31 WFM NaN NaT 2015-12-05 4.56
32 WFM NaN NaT 2016-01-03 7.14
33 WFM 2016Q1 2016-01-17 2016-01-17 6.34
Then backfill the NaN values in the Period and Quarter_End columns with bfill and aggregate with sum. If you need to remove the all-NaN groups, add Series.dropna and finish with reset_index:
df.Period = df.Period.bfill()
df.Quarter_End = df.Quarter_End.bfill()
print (df.groupby(['Company','Period','Quarter_End'])['Price'].sum().dropna().reset_index())
Company Period Quarter_End Price
0 M 2016Q2 2015-08-01 10.41
1 M 2016Q3 2015-10-31 18.20
2 M 2016Q4 2016-01-30 30.68
3 WFM 2015Q3 2015-07-05 5.10
4 WFM 2015Q4 2015-09-27 18.09
5 WFM 2016Q1 2016-01-17 36.10
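The backfill-plus-groupby steps can also be written as a single chained expression (a sketch, equivalent to the steps above):

out = (df.assign(Period=df.Period.bfill(), Quarter_End=df.Quarter_End.bfill())
         .groupby(['Company', 'Period', 'Quarter_End'])['Price']
         .sum()
         .dropna()
         .reset_index())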
Another approach: set_index, pd.concat to align the indices, then groupby with agg:
prd_df = period_df.set_index(['Company', 'Quarter_End'])
prc_df = price_df.set_index(['Company', 'Date'], drop=False)
# align both frames on the (Company, date) MultiIndex
df = pd.concat([prd_df, prc_df], axis=1)
# group by company and the backfilled quarter label,
# keeping the quarter-end date and summing the prices
df.groupby([df.index.get_level_values(0), df.Period.bfill()]) \
  .agg(dict(Date='last', Price='sum')).dropna()
