Formatting a file with hours and date in the same column - Python

Our electricity provider apparently thinks it's fun to make the CSV files they provide hard to read. This is fine-grained electricity consumption, one reading every 30 minutes, but the SAME column contains both hours and dates. Example:
[EDIT: here is the raw version of the CSV file, my bad]
;
"Récapitulatif de mes puissances atteintes en W";
;
"Date et heure de relève par le distributeur";"Puissance atteinte (W)"
;
"19/11/2022";
"00:00:00";4494
"23:30:00";1174
"23:00:00";1130
[...]
"01:30:00";216
"01:00:00";2672
"00:30:00";2816
;
"18/11/2022";
"00:00:00";4494
"23:30:00";1174
"23:00:00";1130
[...]
"01:30:00";216
"01:00:00";2672
"00:30:00";2816
How on earth can I obtain this kind of lovely formatted file:
2022-11-19 00:00:00 2098
2022-11-19 23:30:00 218
2022-11-19 23:00:00 606
etc.

Okay, I have an idiotic brute-force solution for you, so don't take this as a coding recommendation, just as something that gets the job done:
import itertools

# every possible zero-padded dd/mm/2022 date, used to recognise date lines
dList = [f"{d:02d}/{m:02d}/2022" for d, m in itertools.product(range(1, 32), range(1, 13))]
I assume you have a text file with that, so I'm just going to use that:
file = 'yourfilename.txt'
# make sure you're running the program in the same directory as the .txt file
with open(file, "r") as f:
    lines = [line.strip().replace('"', '') for line in f]

curD = None
# open the output once; re-opening it inside the loop with 'w' would overwrite
# everything except the last line
with open('output.txt', 'w') as g:
    for i in lines:
        if i.rstrip(';') in dList:
            curD = i.rstrip(';')  # remember the current date line
        elif ':' in i and curD is not None:
            time, value = i.split(';')[:2]
            g.write(f'{curD} {time} {value}\n')
output.txt is created automatically in the same directory, and everything gets written into it.
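If you also want the ISO date order from the question (2022-11-19 rather than 19/11/2022), you could convert curD before writing it; a minimal illustration with a hard-coded value:
from datetime import datetime

# dd/mm/yyyy -> yyyy-mm-dd (illustrative input)
iso = datetime.strptime('19/11/2022', '%d/%m/%Y').strftime('%Y-%m-%d')
print(iso)  # 2022-11-19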

Try:
import pandas as pd

current_date = None
all_data = []

with open("your_file.txt", "r") as f_in:
    # skip first 5 rows (header)
    for _ in range(5):
        next(f_in)
    for row in map(str.strip, f_in):
        row = row.replace('"', "")
        # skip empty lines and lone separators (";")
        if row.strip(";") == "":
            continue
        if "/" in row:
            current_date = row.rstrip(";")  # drop the trailing ';' on date lines
        else:
            all_data.append([current_date, *row.split(";")])

df = pd.DataFrame(all_data, columns=["Date", "Time", "Value"])
print(df)
Prints:
          Date      Time Value
0   19/11/2022  00:00:00  4494
1   19/11/2022  23:30:00  1174
2   19/11/2022  23:00:00  1130
3   19/11/2022  01:30:00   216
4   19/11/2022  01:00:00  2672
5   19/11/2022  00:30:00  2816
6   18/11/2022  00:00:00  4494
7   18/11/2022  23:30:00  1174
8   18/11/2022  23:00:00  1130
9   18/11/2022  01:30:00   216
10  18/11/2022  01:00:00  2672
11  18/11/2022  00:30:00  2816
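To get the exact layout asked for in the question (an ISO datetime plus the value), a possible follow-up on top of this df; this step is not part of the original answer:
df["Datetime"] = pd.to_datetime(df["Date"] + " " + df["Time"], format="%d/%m/%Y %H:%M:%S")
print(df[["Datetime", "Value"]].to_string(index=False, header=False))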

Using pandas operations, it would look like the following:
data.csv
19/11/2022
00:00:00 2098
23:30:00 218
23:00:00 606
01:30:00 216
01:00:00 2672
00:30:00 2816
18/11/2022
00:00:00 1994
23:30:00 260
23:00:00 732
01:30:00 200
01:00:00 1378
00:30:00 2520
17/11/2022
00:00:00 1830
23:30:00 96
23:00:00 122
01:30:00 694
01:00:00 2950
00:30:00 3062
16/11/2022
00:00:00 2420
23:30:00 678
23:00:00 644
Implementation
import pandas as pd

df = pd.read_csv('data.csv', header=None)
# rows containing ":" are "time value" pairs; rows containing "/" are dates
df['amount'] = df[0].apply(lambda item: item.split(' ')[-1] if item.find(':') > 0 else None)
df['time'] = df[0].apply(lambda item: item.split(' ')[0] if item.find(':') > 0 else None)
df['date'] = df[0].apply(lambda item: item if item.find('/') > 0 else None)
df['date'] = df['date'].ffill()  # forward-fill each date down over its time rows
df = df.dropna(subset=['amount'], how='any')
df = df.drop(0, axis=1)
print(df)
output
amount time date
1 2098 00:00:00 19/11/2022
2 218 23:30:00 19/11/2022
3 606 23:00:00 19/11/2022
4 216 01:30:00 19/11/2022
5 2672 01:00:00 19/11/2022
6 2816 00:30:00 19/11/2022
8 1994 00:00:00 18/11/2022
9 260 23:30:00 18/11/2022
10 732 23:00:00 18/11/2022
11 200 01:30:00 18/11/2022
12 1378 01:00:00 18/11/2022
13 2520 00:30:00 18/11/2022
15 1830 00:00:00 17/11/2022
16 96 23:30:00 17/11/2022
17 122 23:00:00 17/11/2022
18 694 01:30:00 17/11/2022
19 2950 01:00:00 17/11/2022
20 3062 00:30:00 17/11/2022
22 2420 00:00:00 16/11/2022
23 678 23:30:00 16/11/2022
24 644 23:00:00 16/11/2022
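One remaining wrinkle, which the answer does not address: within each day the times run in reverse. A hedged final step to restore true chronological order, assuming the df built above:
df['ts'] = pd.to_datetime(df['date'] + ' ' + df['time'], dayfirst=True)
df = df.sort_values('ts').reset_index(drop=True)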

Related

Why do I get this plot with matplotlib.pyplot when adding dates to the x axis?

First data frame:
date time open high low close volume avg
0 2021-05-23 00:00:00 37458.51 38270.64 31111.01 34655.25 217136.046593 NaN
1 2021-05-24 00:00:00 34681.44 39920.00 34031.00 38796.29 161630.893971 NaN
2 2021-05-25 00:00:00 38810.99 39791.77 36419.62 38324.72 111996.228404 NaN
3 2021-05-26 00:00:00 38324.72 40841.00 37800.44 39241.91 104780.773396 NaN
4 2021-05-27 00:00:00 39241.92 40411.14 37134.27 38529.98 86547.158794 NaN
5 2021-05-28 00:00:00 38529.99 38877.83 34684.00 35663.49 135377.629720 NaN
6 2021-05-29 00:00:00 35661.79 37338.58 33632.76 34605.15 112663.092689 NaN
7 2021-05-30 00:00:00 34605.15 36488.00 33379.00 35641.27 73535.386967 NaN
8 2021-05-31 00:00:00 35641.26 37499.00 34153.84 37253.81 94160.735289 NaN
9 2021-01-06 00:00:00 37253.82 37894.81 35666.00 36693.09 81234.663770 NaN
10 2021-02-06 00:00:00 36694.85 38225.00 35920.00 37568.68 67587.372495 NaN
11 2021-03-06 00:00:00 37568.68 39476.00 37170.00 39246.79 75889.106011 NaN
12 2021-04-06 00:00:00 39246.78 39289.07 35555.15 36829.00 91317.799245 NaN
13 2021-05-06 00:00:00 36829.15 37925.00 34800.00 35513.20 70459.621490 NaN
14 2021-06-06 00:00:00 35516.07 36480.00 35222.00 35796.31 47650.206637 NaN
15 2021-07-06 00:00:00 35796.31 36900.00 33300.00 33552.79 77574.952573 NaN
16 2021-08-06 00:00:00 33556.96 34068.01 31000.00 33380.81 123251.189037 NaN
17 2021-09-06 00:00:00 33380.80 37534.79 32396.82 37388.05 136607.597517 NaN
18 2021-10-06 00:00:00 37388.05 38491.00 35782.00 36675.72 109527.284943 NaN
19 2021-11-06 00:00:00 36677.83 37680.40 35936.77 37331.98 78466.005300 NaN
20 2021-12-06 00:00:00 37331.98 37463.63 34600.36 35546.11 87717.549990 NaN
21 2021-06-13 00:00:00 35546.12 39380.00 34757.00 39020.57 86921.025555 NaN
22 2021-06-14 00:00:00 39020.56 41064.05 38730.00 40516.29 108522.391949 NaN
23 2021-06-15 00:00:00 40516.28 41330.00 39506.40 40144.04 80679.622838 NaN
24 2021-06-16 00:00:00 40143.80 40527.14 38116.01 38349.01 87771.976937 NaN
25 2021-06-17 00:00:00 38349.00 39559.88 37365.00 38092.97 79541.307119 NaN
26 2021-06-18 00:00:00 38092.97 38202.84 35129.29 35819.84 95228.042935 NaN
27 2021-06-19 00:00:00 35820.48 36457.00 34803.52 35483.72 68712.449461 NaN
28 2021-06-20 00:00:00 35483.72 36137.72 33336.00 35600.16 89878.170850 NaN
29 2021-06-21 00:00:00 35600.17 35750.00 31251.23 31608.93 168778.873159 NaN
30 2021-06-22 00:00:00 31614.12 33298.78 28805.00 32509.56 204208.179762 NaN
31 2021-06-23 00:00:00 32509.56 34881.00 31683.00 33678.07 126966.100563 NaN
32 2021-06-24 00:00:00 33675.07 35298.00 32286.57 34663.09 86625.804260 NaN
33 2021-06-25 00:00:00 34663.08 35500.00 31275.00 31584.45 116061.130356 NaN
34 2021-06-26 00:00:00 31576.09 32730.00 30151.00 32283.65 107820.375287 NaN
35 2021-06-27 00:00:00 32283.65 34749.00 31973.45 34700.34 96613.244211 NaN
36 2021-06-28 00:00:00 34702.49 35297.71 33862.72 34494.89 82222.267819 NaN
37 2021-06-29 00:00:00 34494.89 36600.00 34225.43 35911.73 90788.796220 NaN
38 2021-06-30 00:00:00 35911.72 36100.00 34017.55 35045.00 77152.197634 NaN
39 2021-01-07 00:00:00 35045.00 35057.57 32711.00 33504.69 71708.266112 15.362372
40 2021-02-07 00:00:00 33502.33 33977.04 32699.00 33786.55 56172.181378 15.386331
41 2021-03-07 00:00:00 33786.54 34945.61 33316.73 34669.13 43044.578641 15.154877
42 2021-04-07 00:00:00 34669.12 35967.85 34357.15 35286.51 43703.475789 14.677524
43 2021-05-07 00:00:00 35288.13 35293.78 33125.55 33690.14 64123.874245 14.486827
44 2021-06-07 00:00:00 33690.15 35118.88 33532.00 34220.01 58210.596349 14.305665
45 2021-07-07 00:00:00 34220.02 35059.09 33777.77 33862.12 53807.521675 14.133561
46 2021-08-07 00:00:00 33862.11 33929.64 32077.00 32875.71 70136.480320 14.336865
47 2021-09-07 00:00:00 32875.71 34100.00 32261.07 33815.81 47153.939899 14.479159
48 2021-10-07 00:00:00 33815.81 34262.00 33004.78 33502.87 34761.175468 14.564313
49 2021-11-07 00:00:00 33502.87 34666.00 33306.47 34258.99 31572.647448 14.517866
50 2021-12-07 00:00:00 34259.00 34678.43 32658.34 33086.63 48181.403762 14.627892
51 2021-07-13 00:00:00 33086.94 33340.00 32202.25 32729.77 41126.361008 14.839689
52 2021-07-14 00:00:00 32729.12 33114.03 31550.00 32820.02 46777.823484 15.192346
53 2021-07-15 00:00:00 32820.03 33185.25 31133.00 31880.00 51639.576353 15.623083
54 2021-07-16 00:00:00 31874.49 32249.18 31020.00 31383.87 48499.864154 16.058731
55 2021-07-17 00:00:00 31383.86 31955.92 31164.31 31520.07 34012.242132 16.472596
56 2021-07-18 00:00:00 31520.07 32435.00 31108.97 31778.56 35923.716186 16.669426
57 2021-07-19 00:00:00 31778.57 31899.00 30407.44 30839.65 47340.468499 17.041150
58 2021-07-20 00:00:00 30839.65 31063.07 29278.00 29790.35 61034.049017 17.671053
59 2021-07-21 00:00:00 29790.34 32858.00 29482.61 32144.51 82796.265128 17.564616
60 2021-07-22 00:00:00 32144.51 32591.35 31708.00 32287.83 46148.092433 17.463500
61 2021-07-23 00:00:00 32287.58 33650.00 31924.32 33634.09 50112.863626 16.984139
62 2021-07-24 00:00:00 33634.10 34500.00 33401.14 34258.14 47977.550138 16.242346
63 2021-07-25 00:00:00 34261.51 35398.00 33851.12 35381.02 47852.928313 15.607586
64 2021-07-26 00:00:00 35381.02 40550.00 35205.78 37237.60 152452.512724 16.219395
65 2021-07-27 00:00:00 37241.33 39542.61 36383.00 39457.87 88397.267015 16.800613
66 2021-07-28 00:00:00 39456.61 40900.00 38772.00 40019.56 101344.528441 17.599907
67 2021-07-29 00:00:00 40019.57 40640.00 39200.00 40016.48 53998.439283 18.359237
68 2021-07-30 00:00:00 40018.49 42316.71 38313.23 42206.37 73602.784805 19.368676
69 2021-07-31 00:00:00 42206.36 42448.00 41000.15 41461.83 44849.791012 20.349200
70 2021-01-08 00:00:00 41461.84 42599.00 39422.01 39845.44 53953.186326 20.714136
71 2021-02-08 00:00:00 39850.27 40480.01 38690.00 39147.82 50837.351954 20.816480
72 2021-03-08 00:00:00 39146.86 39780.00 37642.03 38207.05 57117.435853 20.578895
73 2021-04-08 00:00:00 38207.04 39969.66 37508.56 39723.18 52329.352430 20.396351
74 2021-05-08 00:00:00 39723.17 41350.00 37332.70 40862.46 84343.755621 20.526294
75 2021-06-08 00:00:00 40862.46 43392.43 39853.86 42836.87 75753.941347 21.042989
76 2021-07-08 00:00:00 42836.87 44700.00 42446.41 44572.54 73396.740808 21.756471
77 2021-08-08 00:00:00 44572.54 45310.00 43261.00 43794.37 69329.092698 22.533424
78 2021-09-08 00:00:00 43794.36 46454.15 42779.00 46253.40 74587.884845 23.450453
79 2021-10-08 00:00:00 46248.87 46700.00 44589.46 45584.99 53814.643421 24.359303
80 2021-11-08 00:00:00 45585.00 46743.47 45341.14 45511.00 52734.901977 25.229618
81 2021-12-08 00:00:00 45510.67 46218.12 43770.00 44399.00 55266.108781 25.471002
82 2021-08-13 00:00:00 44400.06 47886.00 44217.39 47800.00 48239.370431 25.995794
83 2021-08-14 00:00:00 47799.99 48144.00 45971.03 47068.51 46114.359022 26.537795
84 2021-08-15 00:00:00 47068.50 47372.27 45500.00 46973.82 42110.711334 26.878796
85 2021-08-16 00:00:00 46973.82 48053.83 45660.00 45901.29 52480.574014 27.326937
86 2021-08-17 00:00:00 45901.30 47160.00 44376.00 44695.95 57039.341629 27.285215
87 2021-08-18 00:00:00 44695.95 46000.00 44203.28 44705.29 54099.415985 27.184539
88 2021-08-19 00:00:00 44699.37 47033.00 43927.70 46760.62 53411.753920 27.302916
89 2021-08-20 00:00:00 46760.62 49382.99 46622.99 49322.47 56850.352228 27.840242
90 2021-08-21 00:00:00 49322.47 49757.04 48222.00 48821.87 46745.136584 28.412062
91 2021-08-22 00:00:00 48821.88 49500.00 48050.00 49239.22 37007.887795 28.889153
92 2021-08-23 00:00:00 49239.22 50500.00 49029.00 49488.85 52462.541954 29.512800
93 2021-08-24 00:00:00 49488.85 49860.00 47600.00 47674.01 51014.594748 29.565824
94 2021-08-25 00:00:00 47674.01 49264.30 47126.28 48973.32 44655.830342 29.446836
95 2021-08-26 00:00:00 48973.32 49352.84 46250.00 46843.87 49371.277774 29.028026
96 2021-08-27 00:00:00 46843.86 49149.93 46348.00 49069.90 42068.104965 28.630156
97 2021-08-28 00:00:00 49069.90 49299.00 48346.88 48895.35 26681.063786 28.287626
98 2021-08-29 00:00:00 48895.35 49632.27 47762.54 48767.83 32652.283473 27.744622
99 2021-08-30 00:00:00 48767.84 48888.61 46853.00 46982.91 40288.350830 26.903998
100 2021-08-31 00:00:00 46982.91 48246.11 46700.00 47100.89 48645.527370 26.051605
101 2021-01-09 00:00:00 47100.89 49156.00 46512.00 48810.52 49904.655280 25.499838
102 2021-02-09 00:00:00 48810.51 50450.13 48584.06 49246.64 54410.770538 25.311075
103 2021-03-09 00:00:00 49246.63 51000.00 48316.84 49999.14 59025.644157 25.265214
104 2021-04-09 00:00:00 49998.00 50535.69 49370.00 49915.64 34664.659590 25.221647
105 2021-05-09 00:00:00 49917.54 51900.00 49450.00 51756.88 40544.835873 25.504286
106 2021-06-09 00:00:00 51756.88 52780.00 50969.33 52663.90 49249.667081 25.962876
107 2021-07-09 00:00:00 52666.20 52920.00 42843.05 46863.73 123048.802719 25.276717
108 2021-08-09 00:00:00 46868.57 47340.99 44412.02 46048.31 65069.315200 24.624866
109 2021-09-09 00:00:00 46048.31 47399.97 45513.08 46395.14 50651.660020 23.989928
110 2021-10-09 00:00:00 46395.14 47033.00 44132.29 44850.91 49048.266180 23.670387
111 2021-11-09 00:00:00 44842.20 45987.93 44722.22 45173.69 30440.408100 23.366822
112 2021-12-09 00:00:00 45173.68 46460.00 44742.06 46025.24 32094.280520 22.938381
113 2021-09-13 00:00:00 46025.23 46880.00 43370.00 44940.73 65429.150560 22.820722
114 2021-09-14 00:00:00 44940.72 47250.00 44594.44 47111.52 44855.850990 22.594896
115 2021-09-15 00:00:00 47103.28 48500.00 46682.32 48121.41 43204.711740 22.007531
116 2021-09-16 00:00:00 48121.40 48557.00 47021.10 47737.82 40725.088950 21.432816
117 2021-09-17 00:00:00 47737.81 48150.00 46699.56 47299.98 34461.927760 20.965565
118 2021-09-18 00:00:00 47299.98 48843.20 47035.56 48292.74 30906.470380 20.306487
119 2021-09-19 00:00:00 48292.75 48372.83 46829.18 47241.75 29847.243490 19.735184
120 2021-09-20 00:00:00 47241.75 47347.25 42500.00 43015.62 78003.524443 20.139851
121 2021-09-21 00:00:00 43016.64 43639.00 39600.00 40734.38 84534.080485 20.985744
122 2021-09-22 00:00:00 40734.09 44000.55 40565.39 43543.61 58349.055420 21.676235
123 2021-09-23 00:00:00 43546.37 44978.00 43069.09 44865.26 48699.576550 22.029837
124 2021-09-24 00:00:00 44865.26 45200.00 40675.00 42810.57 84113.426292 22.735109
125 2021-09-25 00:00:00 42810.58 42966.84 41646.28 42670.64 33594.571890 23.405118
126 2021-09-26 00:00:00 42670.63 43950.00 40750.00 43160.90 49879.997650 23.734984
127 2021-09-27 00:00:00 43160.90 44350.00 42098.00 42147.35 39776.843830 23.925323
128 2021-09-28 00:00:00 42147.35 42787.38 40888.00 41026.54 43372.262400 24.312088
129 2021-09-29 00:00:00 41025.01 42590.00 40753.88 41524.28 33511.534870 24.702028
130 2021-09-30 00:00:00 41524.29 44141.37 41410.17 43824.10 46381.227810 24.581907
131 2021-01-10 00:00:00 43820.01 48495.00 43283.03 48141.61 66244.874920 23.367632
132 2021-02-10 00:00:00 48141.60 48336.59 47430.18 47634.90 30508.981310 22.214071
133 2021-03-10 00:00:00 47634.89 49228.08 47088.00 48200.01 30825.056010 21.285226
134 2021-04-10 00:00:00 48200.01 49536.12 46891.00 49224.94 46796.493720 20.470586
135 2021-05-10 00:00:00 49224.93 51886.30 49022.40 51471.99 52125.667930 20.178783
136 2021-06-10 00:00:00 51471.99 55750.00 50382.41 55315.00 79877.545181 20.539207
137 2021-07-10 00:00:00 55315.00 55332.31 53357.00 53785.22 54917.377660 20.881611
138 2021-08-10 00:00:00 53785.22 56100.00 53617.61 53951.43 46160.257850 21.322501
139 2021-09-10 00:00:00 53955.67 55489.00 53661.67 54949.72 55177.080130 21.741347
140 2021-10-10 00:00:00 54949.72 56561.31 54080.00 54659.00 89237.836128 22.304343
141 2021-11-10 00:00:00 54659.01 57839.04 54415.06 57471.35 52933.165751 23.025557
142 2021-12-10 00:00:00 57471.35 57680.00 53879.00 55996.93 53471.285500 23.546775
143 2021-10-13 00:00:00 55996.91 57777.00 54167.19 57367.00 55808.444920 24.057061
144 2021-10-14 00:00:00 57370.83 58532.54 56818.05 57347.94 43053.336781 24.660876
145 2021-10-15 00:00:00 57347.94 62933.00 56850.00 61672.42 82512.908022 25.811065
146 2021-10-16 00:00:00 61672.42 62378.42 60150.00 60875.57 35467.880960 26.903744
147 2021-10-17 00:00:00 60875.57 61718.39 58963.00 61528.33 39099.241240 27.563757
148 2021-10-18 00:00:00 61528.32 62695.78 59844.45 62009.84 51798.448440 28.318027
149 2021-10-19 00:00:00 62005.60 64486.00 61322.22 64280.59 53628.107744 29.251726
150 2021-10-20 00:00:00 64280.59 67000.00 63481.40 66001.41 51428.934856 30.405550
151 2021-10-21 00:00:00 66001.40 66639.74 62000.00 62193.15 68538.645370 31.054053
152 2021-10-22 00:00:00 62193.15 63732.39 60000.00 60688.22 52119.358860 31.117531
153 2021-10-23 00:00:00 60688.23 61747.64 59562.15 61286.75 27626.936780 31.062358
154 2021-10-24 00:00:00 61286.75 61500.00 59510.63 60852.22 31226.576760 30.995921
155 2021-10-25 00:00:00 60852.22 63710.63 60650.00 63078.78 36853.838060 31.244720
156 2021-10-26 00:00:00 63078.78 63293.48 59817.55 60328.81 40217.500830 31.249961
157 2021-10-27 00:00:00 60328.81 61496.00 58000.00 58413.44 62124.490160 30.779004
158 2021-10-28 00:00:00 58413.44 62499.00 57820.00 60575.89 61056.353010 30.489479
159 2021-10-29 00:00:00 60575.90 62980.00 60174.81 62253.71 43973.904140 30.289382
160 2021-10-30 00:00:00 62253.70 62359.25 60673.00 61859.19 31478.125660 30.099291
161 2021-10-31 00:00:00 61859.19 62405.30 59945.36 61299.80 39267.637940 29.713720
162 2021-01-11 00:00:00 61299.81 62437.74 59405.00 60911.11 44687.666720 29.196216
163 2021-02-11 00:00:00 60911.12 64270.00 60624.68 63219.99 46368.284100 29.031364
164 2021-03-11 00:00:00 63220.57 63500.00 60382.76 62896.48 43336.090490 28.804634
165 2021-04-11 00:00:00 62896.49 63086.31 60677.01 61395.01 35930.933140 28.589242
166 2021-05-11 00:00:00 61395.01 62595.72 60721.00 60937.12 31604.487490 28.384619
167 2021-06-11 00:00:00 60940.18 61560.49 60050.00 61470.61 25590.574080 27.973716
168 2021-07-11 00:00:00 61470.62 63286.35 61322.78 63273.59 25515.688300 27.926901
169 2021-08-11 00:00:00 63273.58 67789.00 63273.58 67525.83 54442.094554 28.579845
170 2021-09-11 00:00:00 67525.82 68524.25 66222.40 66947.66 44661.378068 29.294016
171 2021-10-11 00:00:00 66947.67 69000.00 62822.90 64882.43 65171.504046 29.014734
172 2021-11-11 00:00:00 64882.42 65600.07 64100.00 64774.26 37237.980580 28.749416
173 2021-12-11 00:00:00 64774.25 65450.70 62278.00 64122.23 44490.108160 28.041179
174 2021-11-13 00:00:00 64122.22 65000.00 63360.22 64380.00 22504.973830 27.368353
175 2021-11-14 00:00:00 64380.01 65550.51 63576.27 65519.10 25705.073470 26.832078
176 2021-11-15 00:00:00 65519.11 66401.82 63400.00 63606.74 37829.371240 26.479925
177 2021-11-16 00:00:00 63606.73 63617.31 58574.07 60058.87 77455.156090 25.267463
178 2021-11-17 00:00:00 60058.87 60840.23 58373.00 60344.87 46289.384910 24.154719
179 2021-11-18 00:00:00 60344.86 60976.00 56474.26 56891.62 62146.999310 23.454728
180 2021-11-19 00:00:00 56891.62 58320.00 55600.00 58052.24 50715.887260 22.944550
181 2021-11-20 00:00:00 58057.10 59845.00 57353.00 59707.51 33811.590100 22.122892
182 2021-11-21 00:00:00 59707.52 60029.76 58486.65 58622.02 31902.227850 21.302202
183 2021-11-22 00:00:00 58617.70 59444.00 55610.00 56247.18 51724.320470 21.040602
184 2021-11-23 00:00:00 56243.83 58009.99 55317.00 57541.27 49917.850170 20.840946
185 2021-11-24 00:00:00 57541.26 57735.00 55837.00 57138.29 39612.049640 20.651273
186 2021-11-25 00:00:00 57138.29 59398.90 57000.00 58960.36 42153.515220 20.071560
187 2021-11-26 00:00:00 58960.37 59150.00 53500.00 53726.53 65927.870660 20.117912
188 2021-11-27 00:00:00 53723.72 55280.00 53610.00 54721.03 29716.999570 20.161946
189 2021-11-28 00:00:00 54716.47 57445.05 53256.64 57274.88 36163.713700 19.704241
190 2021-11-29 00:00:00 57274.89 58865.97 56666.67 57776.25 40125.280090 18.969898
191 2021-11-30 00:00:00 57776.25 59176.99 55875.55 56950.56 49161.051940 18.417868
192 2021-01-12 00:00:00 56950.56 59053.55 56458.01 57184.07 44956.636560 17.893439
193 2021-02-12 00:00:00 57184.07 57375.47 55777.77 56480.34 37574.059760 17.525876
194 2021-03-12 00:00:00 56484.26 57600.00 51680.00 53601.05 58927.690270 17.858850
195 2021-04-12 00:00:00 53601.05 53859.10 42000.30 49152.47 114203.373748 19.217441
196 2021-05-12 00:00:00 49152.46 49699.05 47727.21 49396.33 45580.820120 20.508102
197 2021-06-12 00:00:00 49396.32 50891.11 47100.00 50441.92 58571.215750 21.472003
198 2021-07-12 00:00:00 50441.91 51936.33 50039.74 50588.95 38253.468770 22.161968
199 2021-08-12 00:00:00 50588.95 51200.00 48600.00 50471.19 38425.924660 22.962218
200 2021-09-12 00:00:00 50471.19 50797.76 47320.00 47545.59 37692.686650 23.846688
201 2021-10-12 00:00:00 47535.90 50125.00 46852.00 47140.54 44233.573910 24.732127
202 2021-11-12 00:00:00 47140.54 49485.71 46751.00 49389.99 28889.193580 25.583369
203 2021-12-12 00:00:00 49389.99 50777.00 48638.00 50053.90 26017.934210 26.077754
204 2021-12-13 00:00:00 50053.90 50189.97 45672.75 46702.75 50869.520930 26.859770
205 2021-12-14 00:00:00 46702.76 48700.41 46290.00 48343.28 39955.984450 27.602685
206 2021-12-15 00:00:00 48336.95 49500.00 46547.00 48864.98 51629.181000 28.109255
207 2021-12-16 00:00:00 48864.98 49436.43 47511.00 47632.38 31949.867390 28.590496
208 2021-12-17 00:00:00 47632.38 47995.96 45456.00 46131.20 43104.488700 29.278437
209 2021-12-18 00:00:00 46133.83 47392.37 45500.00 46834.48 25020.052710 29.931981
210 2021-12-19 00:00:00 46834.47 48300.01 46406.91 46681.23 29305.706650 30.303705
211 2021-12-20 00:00:00 46681.24 47537.57 45558.85 46914.16 35848.506090 30.761072
212 2021-12-21 00:00:00 46914.17 49328.96 46630.00 48889.88 37713.929240 30.715132
213 2021-12-22 00:00:00 48887.59 49576.13 48421.87 48588.16 27004.202200 30.607162
214 2021-12-23 00:00:00 48588.17 51375.00 47920.42 50838.81 35192.540460 30.051098
215 2021-12-24 00:00:00 50838.82 51810.00 50384.43 50820.00 31661.949460 29.417439
The code below runs fine, but I need the date on the x axis:
test['avg'].plot(legend=True,figsize=(12,5))
plt.grid(True)
plt.xlabel('ADX')
plt.ylabel('date')
plt.title('ADX indicator')
plt.gcf().autofmt_xdate()
plt.show()
Correct plot: (image not included)
But when I choose date for the x axis, I get a bad plot. Code is below:
df.set_index('date',drop=True, inplace=True)
Modified data
test['avg'].plot(legend=True,figsize=(12,5))
plt.grid(True)
plt.xlabel('ADX')
plt.ylabel('date')
plt.title('ADX indicator')
plt.gcf().autofmt_xdate()
plt.show()
Bad plot: (image not included)
Also, why do I get NaN values for ADX in TA-Lib?
Can you help me with this problem?
It appears to be a problem with the source file: the column names are not tab-separated. Once this is fixed, the plotting works fine.
The NaN issue also comes from the source file; the average was simply not calculated for the first several rows.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

test = pd.read_csv(r"modified_data.dat", sep='\t')
date = test['date']
avg = test['avg']

fig, ax = plt.subplots(figsize=(20, 10))
ax.plot(date, avg)
ax.tick_params(rotation=30, width=2)
# show only every 5th date label so the axis stays readable
plt.xticks(np.arange(0, len(date) + 1, 5))
plt.show()
Output looks like this: (plot image not included)
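As a side note, if the date column is parsed as real datetimes rather than strings, matplotlib spaces and labels the axis automatically, so the manual tick thinning becomes unnecessary. A minimal sketch, assuming the dates in modified_data.dat are consistently formatted:
import matplotlib.pyplot as plt
import pandas as pd

test = pd.read_csv("modified_data.dat", sep="\t", parse_dates=["date"])
ax = test.set_index("date")["avg"].plot(legend=True, figsize=(12, 5))
ax.set_xlabel("date")
ax.set_ylabel("ADX")
plt.gcf().autofmt_xdate()  # tilt the date labels
plt.show()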

How to import time data in the format of “M-d-y h-m”?

The sample data from the CSV file is shown as follows:
datetime,symbol,price,volume
10/1/2020 9:00,XYZ,10.68,375
10/1/2020 9:00,XYZ,10.9,66
10/1/2020 9:00,XYZ,11.42,103
10/1/2020 9:00,XYZ,12.62,280
10/1/2020 9:00,XYZ,10.73,23
10/1/2020 9:00,XYZ,11.44,299
10/1/2020 9:00,XYZ,12.66,152
10/1/2020 9:00,XYZ,11.04,401
10/1/2020 9:00,XYZ,10.61,392
10/1/2020 9:00,XYZ,11.21,473
I executed the following line to read the data:
schemaTB = extractTextSchema(csvFile)
update schemaTB set type="DATETIME" where name="datetime"
schemaTB[`format]=["M-d-y h:m:s",,,];
t = loadText(csvFile,,schemaTB)
But it reported an error:
t = loadText(csvFile, , schemaTB) => Invalid temporal format M-d-y h:m:s
You can use pandas for this task.
First, read the datetime column with the correct type while importing your CSV; then convert it to the desired format:
import pandas as pd

df = pd.read_csv(csvFile, sep=",", parse_dates=['datetime'])
# strftime returns strings, so assign the result back to keep the new format
df["datetime"] = df["datetime"].dt.strftime("%m-%d-%Y %H:%M:%S")
print(df)
Output:
              datetime symbol  price  volume
0  10-01-2020 09:00:00    XYZ  10.68     375
1  10-01-2020 09:00:00    XYZ  10.90      66
2  10-01-2020 09:00:00    XYZ  11.42     103
3  10-01-2020 09:00:00    XYZ  12.62     280
4  10-01-2020 09:00:00    XYZ  10.73      23
5  10-01-2020 09:00:00    XYZ  11.44     299
6  10-01-2020 09:00:00    XYZ  12.66     152
7  10-01-2020 09:00:00    XYZ  11.04     401
8  10-01-2020 09:00:00    XYZ  10.61     392
9  10-01-2020 09:00:00    XYZ  11.21     473
Edit: you can even directly import in the right format with a lambda function:
df = pd.read_csv(csvFile, sep=",", parse_dates=['datetime'], date_parser=lambda d: pd.Timestamp(d).strftime("%m-%d-%Y %H:%M:%S"))
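Note that date_parser is deprecated in pandas 2.x; a rough equivalent under newer versions (assuming the same csvFile) is to parse with an explicit input format and reformat afterwards:
import pandas as pd

# pandas >= 2.0: date_format replaces date_parser for simple cases
df = pd.read_csv(csvFile, sep=",", parse_dates=["datetime"], date_format="%m/%d/%Y %H:%M")
df["datetime"] = df["datetime"].dt.strftime("%m-%d-%Y %H:%M:%S")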

Downsampling in Pandas DataFrame by dividing observations into ratios

Given a DataFrame with a timestamp column (ts), I'd like to downsample these values by the hour. Values that were previously indexed by ts should now be divided into ratios based on the number of minutes left in an hour. [note: divide data in ratios for NaN columns while doing resampling]
ts event duration
0 2020-09-09 21:01:00 a 12
1 2020-09-10 00:10:00 a 22
2 2020-09-10 01:31:00 a 130
3 2020-09-10 01:50:00 b 60
4 2020-09-10 01:51:00 b 50
5 2020-09-10 01:59:00 b 26
6 2020-09-10 02:01:00 c 72
7 2020-09-10 02:51:00 b 51
8 2020-09-10 03:01:00 b 63
9 2020-09-10 04:01:00 c 79
def create_dataframe():
    df = pd.DataFrame([{'duration': 12, 'event': 'a', 'ts': '2020-09-09 21:01:00'},
                       {'duration': 22, 'event': 'a', 'ts': '2020-09-10 00:10:00'},
                       {'duration': 130, 'event': 'a', 'ts': '2020-09-10 01:31:00'},
                       {'duration': 60, 'event': 'b', 'ts': '2020-09-10 01:50:00'},
                       {'duration': 50, 'event': 'b', 'ts': '2020-09-10 01:51:00'},
                       {'duration': 26, 'event': 'b', 'ts': '2020-09-10 01:59:00'},
                       {'duration': 72, 'event': 'c', 'ts': '2020-09-10 02:01:00'},
                       {'duration': 51, 'event': 'b', 'ts': '2020-09-10 02:51:00'},
                       {'duration': 63, 'event': 'b', 'ts': '2020-09-10 03:01:00'},
                       {'duration': 79, 'event': 'c', 'ts': '2020-09-10 04:01:00'},
                       {'duration': 179, 'event': 'c', 'ts': '2020-09-10 06:05:00'},
                       ])
    df.ts = pd.to_datetime(df.ts)
    return df
I want to estimate the amount produced in each hour, based on the ratio of time spent in that hour. Think of it as lines of code completed per actual hour worked.
For example: at "2020-09-10 00:10:00" we have 22. The period from 21:01 to 00:10 spans 189 minutes, so the 22 is split as
59 min of the 21:00 hour -> 7 => =ROUND(22/189*59,0)
60 min of the 22:00 hour -> 7 => =ROUND(22/189*60,0)
60 min of the 23:00 hour -> 7 => =ROUND(22/189*60,0)
10 min of the 00:00 hour -> 1 => =ROUND(22/189*10,0)
The result should be something like:
   ts                   event  duration
0  2020-09-09 20:00:00      a       NaN
1  2020-09-09 21:00:00      a         7
2  2020-09-09 22:00:00      a         7
3  2020-09-09 23:00:00      a         7
4  2020-09-10 00:00:00      a         1
5  2020-09-10 01:00:00      b        ..
6  2020-09-10 02:00:00      c        ..
Problem with this approach:
It appears to me that we have a serious issue with this approach. If you look at rows[1] -> 2020-09-10 07:00:00, we have 4, and we need to divide it between 3 hours. Taking the base duration value as 1 (base unit), we however get:
def create_dataframe2():
    df = pd.DataFrame([{'duration': 4, 'event': 'c', 'c': 'event3.5', 'ts': '2020-09-10 07:00:00'},
                       {'duration': 4, 'event': 'c', 'c': 'event3.5', 'ts': '2020-09-10 10:00:00'}])
    df.ts = pd.to_datetime(df.ts)
    return df
Source
duration event c ts
0 4 c event3.5 2020-09-10 07:00:00
1 4 c event3.5 2020-09-10 10:00:00
Expected Output
ts_hourly mins duration
0 2020-09-10 07:00:00 60.0 2
1 2020-09-10 08:00:00 60.0 1
2 2020-09-10 09:00:00 60.0 1
3 2020-09-10 10:00:00 0.0 0
The first step is to add "previous ts" column to the source DataFrame:
df['tsPrev'] = df.ts.shift()
Then set ts column as the index:
df.set_index('ts', inplace=True)
The third step is to create an auxiliary index, composed of the original
index and "full hours":
ind = df.event.resample('H').asfreq().index.union(df.index)
Then create an auxiliary DataFrame, reindexed with the just created index
and "back fill" event column:
df2 = df.reindex(ind)
df2.event = df2.event.bfill()
Define a function to be applied to each group of rows from df2:
def parts(grp):
    lstRow = grp.iloc[-1]           # last row from the group
    if pd.isna(lstRow.tsPrev):      # first group
        return pd.Series([lstRow.duration], index=[grp.index[0]], dtype=int)
    # other groups: prepend a zero at tsPrev, interpolate the cumulative duration
    # over the index, and take differences between consecutive points
    # (note: Series.append was removed in pandas 2.0; use pd.concat there)
    return -pd.Series([0], index=[lstRow.tsPrev]).append(grp.duration)\
        .interpolate(method='index').round().diff(-1)[:-1].astype(int)
Then generate the source data for "produced" column in 2 steps:
Generate detailed data:
prodDet = df2.groupby(np.isfinite(df2.duration.values[::-1]).cumsum()[::-1],
sort=False).apply(parts).reset_index(level=0, drop=True)
The source is df2, grouped in such a way that each group ends with a row
holding a non-null value in the duration column. Each group is then
processed with the parts function.
The result is:
2020-09-09 21:00:00 12
2020-09-09 21:01:00 7
2020-09-09 22:00:00 7
2020-09-09 23:00:00 7
2020-09-10 00:00:00 1
2020-09-10 00:10:00 80
2020-09-10 01:00:00 50
2020-09-10 01:31:00 60
2020-09-10 01:50:00 50
2020-09-10 01:51:00 26
2020-09-10 01:59:00 36
2020-09-10 02:00:00 36
2020-09-10 02:01:00 51
2020-09-10 02:51:00 57
2020-09-10 03:00:00 6
2020-09-10 03:01:00 78
2020-09-10 04:00:00 1
2020-09-10 04:01:00 85
2020-09-10 05:00:00 87
2020-09-10 06:00:00 7
dtype: int32
Generate aggregated data, for the time being also as a Series:
prod = prodDet.resample('H').sum().rename('produced')
This time prodDet is resampled (broken down by hours) and the
result is the sum of values.
The result is:
2020-09-09 21:00:00 19
2020-09-09 22:00:00 7
2020-09-09 23:00:00 7
2020-09-10 00:00:00 81
2020-09-10 01:00:00 222
2020-09-10 02:00:00 144
2020-09-10 03:00:00 84
2020-09-10 04:00:00 86
2020-09-10 05:00:00 87
2020-09-10 06:00:00 7
Freq: H, Name: produced, dtype: int32
Let's describe the content of prodDet:
There is no row for 2020-09-09 20:00:00, because no source row is
from this hour (your data start from 21:01:00).
Row 21:00:00 12 comes from the first source row (you forgot about
it when writing the expected result).
Rows for 21:01:00, 22:00:00, 23:00:00 and 00:00:00 come from
"partitioning" of row 00:10:00 a 22, just as a part of your
expected result.
Rows with 80 and 50 come from the row containing 130, divided
between the rows labelled 00:10:00 and 01:00:00.
And so on.
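To see the interpolation trick from parts in isolation, here is a minimal sketch (not part of the original answer) using the 22 from the question: a cumulative series runs from 0 at the previous timestamp to 22 at the current one, is interpolated over the hour marks, and its differences give the per-hour shares.
import pandas as pd

idx = pd.to_datetime(['2020-09-09 21:01:00', '2020-09-09 22:00:00',
                      '2020-09-09 23:00:00', '2020-09-10 00:00:00',
                      '2020-09-10 00:10:00'])
cum = pd.Series([0, None, None, None, 22], index=idx, dtype=float)
print(cum.interpolate(method='index').diff().round())  # NaN, 7, 7, 7, 1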
Now we start to assemble the final result.
Join prod (converted to a DataFrame) with event column:
result = prod.to_frame().join(df2.event)
Add tsMin column - the minimal ts in each hour (as you asked
in one of comments):
result['tsMin'] = df.duration.resample('H').apply(lambda grp: grp.index.min())
Change the index into a regular column and set its name to ts
(like in the source DataFrame):
result = result.reset_index().rename(columns={'index': 'ts'})
The final result is:
ts produced event tsMin
0 2020-09-09 21:00:00 19 a 2020-09-09 21:01:00
1 2020-09-09 22:00:00 7 a NaT
2 2020-09-09 23:00:00 7 a NaT
3 2020-09-10 00:00:00 81 a 2020-09-10 00:10:00
4 2020-09-10 01:00:00 222 a 2020-09-10 01:31:00
5 2020-09-10 02:00:00 144 c 2020-09-10 02:01:00
6 2020-09-10 03:00:00 84 b 2020-09-10 03:01:00
7 2020-09-10 04:00:00 86 c 2020-09-10 04:01:00
8 2020-09-10 05:00:00 87 c NaT
9 2020-09-10 06:00:00 7 c 2020-09-10 06:05:00
E.g. the value of 81 for 00:00:00 is a sum of 1 and 80 (the first
part resulting from row with 130), see prodDet above.
Some values in tsMin column are empty, for hours in which there is no
source row.
If you want to totally drop the result from the first row (with
duration == 12), change return pd.Series([lstRow.duration]... to
return pd.Series([0]... (the 4-th row of parts function).
To sum up, my solution is more pandasonic and significantly shorter
than yours (17 rows (my solution) vs. about 70 (yours), excluding comments).
I was not able to find a solution in pandas, so I created one in plain Python.
Basically, I iterate over all the values after sorting and send two datetimes, start_time and end_time, to a function which does the processing.
from datetime import timedelta

import numpy as np
import pandas as pd

def get_ratio_per_hour(start_time, end_time, data_: int):
    # get total hours between the start and end; used for looping
    totalhrs = lambda x: [1 for _ in range(int(x // 3600))] + [
        (x % 3600 / 3600
         or 0.1  # added for loop fix afterwards
         )]
    # check if start and end are not in the same hour
    if start_time.hour != end_time.hour:
        seconds = (end_time - start_time).total_seconds()
        if seconds < 3600:
            parts_ = [1] + totalhrs(seconds)
        else:
            parts_ = totalhrs(seconds)
    else:
        # parts_ defines the loop iterations
        parts_ = totalhrs((end_time - start_time).total_seconds())
    sum_of_hrs = sum(parts_)

    # for constructing the DataFrame
    new_hours = []
    mins = []
    # clone data
    start_time_ = start_time
    end_time_ = end_time
    for e in range(len(parts_)):
        if sum_of_hrs != 0:
            if sum_of_hrs > 1:
                if end_time_.hour != start_time_.hour:
                    # floor, based on the start time + 1 hour
                    floor_time = (start_time_ + timedelta(hours=1)).floor('H')
                    new_hours.append(start_time_.floor('H'))
                    mins.append((floor_time - start_time_).total_seconds() // 60)
                    sum_of_hrs = sum_of_hrs - 1
                    start_time_ = floor_time
                else:
                    # hour is the same
                    floor_time = (start_time_ + timedelta(hours=1)).floor('H')
                    new_hours.append(start_time_.floor('H'))
                    mins.append((floor_time - start_time_).total_seconds() // 60)
                    sum_of_hrs = sum_of_hrs - 1
                    start_time_ = floor_time
            else:
                if end_time_.hour != start_time_.hour:
                    # get the rounded-off hour
                    floor_time = (end_time_ + timedelta(hours=1)).floor('H')
                    new_hours.append(end_time_.floor('H'))
                    mins.append(60 - ((floor_time - end_time_).total_seconds() // 60))
                    sum_of_hrs = sum_of_hrs - 1
                    start_time_ = floor_time
                else:
                    # hour is the same
                    floor_time = (end_time_ + timedelta(hours=1)).floor('H')
                    new_hours.append(end_time_.floor('H'))
                    mins.append((end_time_ - start_time_).total_seconds() // 60)
                    sum_of_hrs = sum_of_hrs - 1
                    start_time_ = floor_time

    # build the result DataFrame
    df_out = pd.DataFrame()
    df_out['hours'] = pd.Series(new_hours)
    df_out['mins'] = pd.Series(mins)
    df_out['ratios'] = round(data_ / sum(mins) * df_out['mins'])
    return df_out
Now, let's run the code for each iteration:
time_val = []
split_f_val = []
split_field = 'duration'
time_field = 'ts'

# DataFrames for intermediate results
df_final = pd.DataFrame()
df2 = pd.DataFrame()

for ix, row in df.iterrows():
    time_val.append(row[str(time_field)])
    split_f_val.append(int(row[str(split_field)]))
    # skip the first element, so that we always have at least two values
    if ix != 0:
        # take the last two values
        new_time_list = time_val[-2:]
        new_data_list = split_f_val[-2:]
        # the two times to compare
        start_time = new_time_list[:-1][0]
        end_time = new_time_list[1:][0]
        # the latest value, which gets divided
        data_ = new_data_list[1:][0]
        df2 = get_ratio_per_hour(start_time, end_time, data_)
        df_final = pd.concat([df_final, df2], ignore_index=True)
    else:
        # create a placeholder row for the first value
        df_final = pd.DataFrame([[np.nan, np.nan, np.nan]],
                                columns=['hours', 'mins', 'ratios'])

result = df_final.groupby(['hours'])['ratios'].sum()
Intermediate DataFrame:
hours mins ratios
0                  NaT   NaN     NaN
0 2020-09-09 21:00:00 59.0 7.0
1 2020-09-09 22:00:00 60.0 7.0
2 2020-09-09 23:00:00 60.0 7.0
3 2020-09-10 00:00:00 10.0 1.0
0 2020-09-10 00:00:00 50.0 80.0
1 2020-09-10 01:00:00 31.0 50.0
0 2020-09-10 01:00:00 19.0 60.0
0 2020-09-10 01:00:00 1.0 50.0
0 2020-09-10 01:00:00 8.0 26.0
0 2020-09-10 01:00:00 1.0 36.0
1 2020-09-10 02:00:00 1.0 36.0
0 2020-09-10 02:00:00 50.0 51.0
0 2020-09-10 02:00:00 9.0 57.0
1 2020-09-10 03:00:00 1.0 6.0
0 2020-09-10 03:00:00 59.0 78.0
1 2020-09-10 04:00:00 1.0 1.0
0 2020-09-10 04:00:00 59.0 85.0
1 2020-09-10 05:00:00 60.0 87.0
2 2020-09-10 06:00:00 5.0 7.0
Final Output:
hours ratios
2020-09-09 21:00:00 7.0
2020-09-09 22:00:00 7.0
2020-09-09 23:00:00 7.0
2020-09-10 00:00:00 81.0
2020-09-10 01:00:00 222.0
2020-09-10 02:00:00 144.0
2020-09-10 03:00:00 84.0
2020-09-10 04:00:00 86.0
2020-09-10 05:00:00 87.0
2020-09-10 06:00:00 7.0

Folding pandas time series into single day

I have a time series of events that spans multiple days. I'm mostly interested in counts per 10-minute interval, so currently, after resampling, it looks like this:
2018-02-27 16:20:00 5
2018-02-27 16:30:00 4
2018-02-27 16:40:00 0
2018-02-27 16:50:00 0
2018-02-27 17:00:00 0
...
2018-06-19 05:30:00 0
2018-06-19 05:40:00 0
2018-06-19 05:50:00 1
How can I "fold" this data over to have just one "day" of data, with the counts added up? So it would look something like this
00:00:00 0
00:10:00 0
...
11:00:00 47
11:10:00 36
11:20:00 12
...
23:40:00 1
23:50:00 0
If your series index is a DatetimeIndex, you can use the attribute time -- if it's a DataFrame and your datetimes are a column, you can use .dt.time. For example:
In [19]: times = pd.date_range("2018-02-27 16:20:00", "2018-06-19 05:50:00", freq="10 min")
...: ser = pd.Series(np.random.randint(0, 6, len(times)), index=times)
...:
...:
In [20]: ser.head()
Out[20]:
2018-02-27 16:20:00 0
2018-02-27 16:30:00 1
2018-02-27 16:40:00 4
2018-02-27 16:50:00 5
2018-02-27 17:00:00 0
Freq: 10T, dtype: int32
In [21]: out = ser.groupby(ser.index.time).sum()
In [22]: out.head()
Out[22]:
00:00:00 285
00:10:00 293
00:20:00 258
00:30:00 263
00:40:00 307
dtype: int32
In [23]: out.tail()
Out[23]:
23:10:00 280
23:20:00 291
23:30:00 236
23:40:00 303
23:50:00 299
dtype: int32
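For the DataFrame case mentioned at the start (datetimes in a column rather than in the index), an equivalent sketch with hypothetical column names ts and count; the sums are the same as above since it is the same data:
In [24]: df = ser.rename('count').reset_index().rename(columns={'index': 'ts'})
In [25]: df.groupby(df['ts'].dt.time)['count'].sum().head()
Out[25]:
00:00:00    285
00:10:00    293
00:20:00    258
00:30:00    263
00:40:00    307
Name: count, dtype: int32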
If I understand correctly, you want a sum of values per 10-minute interval from the first time column. You can perhaps try something like:
df.groupby('columns')['value'].agg(['count'])

Sum set of values from pandas dataframe within certain time frame

I have a fairly complicated question. I need to select rows from a data frame within a certain set of start and end dates, and then sum those values and put them in a new dataframe.
So I start off with the data frame df:
import random

import pandas as pd

dates = pd.date_range('20150101 020000', periods=1000)
df = pd.DataFrame({'_id': random.choice(range(0, 1000)),
                   'time_stamp': dates,
                   'value': random.choice(range(2, 60))
                   })
and define some start and end dates:
start_date = ["2-13-16", "2-23-16", "3-17-16", "3-24-16", "3-26-16", "5-17-16", "5-25-16", "10-10-16", "10-18-16", "10-23-16", "10-31-16", "11-7-16", "11-14-16", "11-22-16", "1-23-17", "1-29-17", "2-06-17", "3-11-17", "3-23-17", "6-21-17", "6-28-17"]
end_date = pd.DatetimeIndex(start_date) + pd.DateOffset(7)
Then I need to create a new data frame with a weekly_sum column, which sums the value column of df for rows that occur between the start_date and end_date.
So, for example, the first row of the new data frame would return the sum of the values between 2-13-16 and 2-20-16. I imagine I'd use groupby.sum() or something similar.
It might look like this:
id start_date end_date weekly_sum
65 2016-02-13 2016-02-20 100
Any direction is greatly appreciated!
P.S. I know my use of random.choice is a little wonky so if you have a better way of generating random numbers, I'd love to see it!
You can use:
import numpy as np
import pandas as pd

def get_dates(x):
    # select the df values between the start and end datetimes
    n = df[(df['time_stamp'] > x['start']) & (df['time_stamp'] < x['end'])]
    # return the first id and the sum of values
    return n['id'].values[0], n['value'].sum()

dates = pd.date_range('20150101 020000', periods=1000)
df = pd.DataFrame({'id': np.random.randint(0, 1000, size=(1000,)),
                   'time_stamp': dates,
                   'value': np.random.randint(2, 60, size=(1000,))
                   })

ndf = pd.DataFrame({'start': pd.to_datetime(start_date), 'end': end_date})
# unpack and assign values to the id and value columns
ndf[['id', 'value']] = ndf.apply(lambda x: get_dates(x), 1).apply(pd.Series)

print(df.head(5))
id time_stamp value
0 770 2015-01-01 02:00:00 59
1 781 2015-01-02 02:00:00 32
2 761 2015-01-03 02:00:00 40
3 317 2015-01-04 02:00:00 16
4 538 2015-01-05 02:00:00 20
print(ndf.head(5))
end start id value
0 2016-02-20 2016-02-13 569 221
1 2016-03-01 2016-02-23 28 216
2 2016-03-24 2016-03-17 152 258
3 2016-03-31 2016-03-24 892 265
4 2016-04-02 2016-03-26 606 244
You can calculate a weekly summary with the following code. It treats Monday as the first day of the week.
import numpy as np
import pandas as pd
import random

dates = pd.date_range('20150101 020000', periods=1000)
df = pd.DataFrame({'_id': random.choice(range(0, 1000)),
                   'time_stamp': dates,
                   'value': random.choice(range(2, 60))
                   })

df['day_of_week'] = df['time_stamp'].dt.day_name()  # dt.weekday_name was removed in newer pandas
df['start'] = np.where(df["day_of_week"] == "Monday", 1, 0)
df['week'] = df["start"].cumsum()
# It is based on Monday.
df.head(20)
# Out[109]:
# _id time_stamp value day_of_week start week
# 0 396 2015-01-01 02:00:00 59 Thursday 0 0
# 1 396 2015-01-02 02:00:00 59 Friday 0 0
# 2 396 2015-01-03 02:00:00 59 Saturday 0 0
# 3 396 2015-01-04 02:00:00 59 Sunday 0 0
# 4 396 2015-01-05 02:00:00 59 Monday 1 1
# 5 396 2015-01-06 02:00:00 59 Tuesday 0 1
# 6 396 2015-01-07 02:00:00 59 Wednesday 0 1
# 7 396 2015-01-08 02:00:00 59 Thursday 0 1
# 8 396 2015-01-09 02:00:00 59 Friday 0 1
# 9 396 2015-01-10 02:00:00 59 Saturday 0 1
# 10 396 2015-01-11 02:00:00 59 Sunday 0 1
# 11 396 2015-01-12 02:00:00 59 Monday 1 2
# 12 396 2015-01-13 02:00:00 59 Tuesday 0 2
# 13 396 2015-01-14 02:00:00 59 Wednesday 0 2
# 14 396 2015-01-15 02:00:00 59 Thursday 0 2
# 15 396 2015-01-16 02:00:00 59 Friday 0 2
# 16 396 2015-01-17 02:00:00 59 Saturday 0 2
# 17 396 2015-01-18 02:00:00 59 Sunday 0 2
# 18 396 2015-01-19 02:00:00 59 Monday 1 3
# 19 396 2015-01-20 02:00:00 59 Tuesday 0 3
aggfunc = {"time_stamp": [np.min, np.max], "value": [np.sum]}
df2 = df.groupby("week", as_index=False).agg(aggfunc)
df2.columns = ["week", "start_date", "end_date", "weekly_sum"]
df2.iloc[58:61]
# Out[110]:
# week start_date end_date weekly_sum
# 58 58 2016-02-08 02:00:00 2016-02-14 02:00:00 413
# 59 59 2016-02-15 02:00:00 2016-02-21 02:00:00 413
# 60 60 2016-02-22 02:00:00 2016-02-28 02:00:00 413
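If calendar weeks (rather than the arbitrary start dates from the question) are acceptable, resample is a shorter route to the same kind of weekly sum; a sketch, where the closed/label options should be checked against your definition of a week:
df3 = (df.set_index('time_stamp')
         .resample('W-MON', label='left', closed='left')['value']
         .sum())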
