Python Pandas interpolate with new x-axis - python

I want to do interpolation for a Pandas series of the following structure
X
22.88 3.047
45.75 3.215
68.63 3.328
91.50 3.423
114.38 3.516
137.25 3.578
163.40 3.676
196.08 3.756
228.76 3.861
261.44 3.942
294.12 4.012
326.80 4.084
359.48 4.147
392.16 4.197
Name: Y, dtype: float64
I want to interpolate the data so that I have a new series covering X=[23:392:1]. I looked through the documentation but didn't find where I could pass in the new x-axis. Did I miss something? How can I interpolate onto the new x-axis?

This can be done with pandas's reindex and interpolate:
In [27]: s
Out[27]:
1
0
22.88 3.047
45.75 3.215
68.63 3.328
91.50 3.423
114.38 3.516
137.25 3.578
163.40 3.676
196.08 3.756
228.76 3.861
261.44 3.942
294.12 4.012
326.80 4.084
359.48 4.147
392.16 4.197
[14 rows x 1 columns]
In [28]: idx = pd.Index(np.arange(23, 392))
In [29]: s.reindex(s.index.union(idx)).interpolate(method='values')  # older pandas spelled this union as s.index + idx
Out[29]:
1
22.88 3.047000
23.00 3.047882
24.00 3.055227
25.00 3.062573
26.00 3.069919
27.00 3.077265
28.00 3.084611
29.00 3.091957
30.00 3.099303
31.00 3.106648
32.00 3.113994
33.00 3.121340
34.00 3.128686
35.00 3.136032
36.00 3.143378
37.00 3.150724
38.00 3.158070
39.00 3.165415
40.00 3.172761
41.00 3.180107
42.00 3.187453
43.00 3.194799
44.00 3.202145
45.00 3.209491
45.75 3.215000
46.00 3.216235
47.00 3.221174
48.00 3.226112
The idea is to create the index you want (the sorted union of s.index and idx; older pandas versions spelled this s.index + idx, current ones need s.index.union(idx)), reindex on that (which produces NaNs at the new points), and then interpolate to fill the NaNs using the values method, which interpolates at the index points.
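The reindex-then-interpolate idea can be sketched as follows. This is a minimal example under stated assumptions: it uses only the first three points of the series, and the names new_x, expanded, filled and result are illustrative, not from the original answer.

```python
import numpy as np
import pandas as pd

# First three points of the series from the question.
s = pd.Series([3.047, 3.215, 3.328], index=[22.88, 45.75, 68.63], name="Y")

# New integer grid that lies inside the original x-range.
new_x = pd.Index(np.arange(23, 69), dtype="float64")

# Sorted union of old and new x-values; the new points hold NaN.
expanded = s.reindex(s.index.union(new_x))

# method="index" interpolates linearly using the index values as x
# (method="values" is an alias for the same behaviour).
filled = expanded.interpolate(method="index")

# Keep only the points of the new grid.
result = filled.loc[new_x]
print(result.head())
```

Selecting with .loc at the end drops the original, non-integer x-values, so the result lives only on the requested grid.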

You can call numpy.interp() directly:
import numpy as np
import pandas as pd
import io
data = """x y
22.88 3.047
45.75 3.215
68.63 3.328
91.50 3.423
114.38 3.516
137.25 3.578
163.40 3.676
196.08 3.756
228.76 3.861
261.44 3.942
294.12 4.012
326.80 4.084
359.48 4.147
392.16 4.197"""
# data is a str, so wrap it in StringIO; squeeze("columns") turns the one-column frame into a Series
s = pd.read_csv(io.StringIO(data), sep=r"\s+", index_col=0).squeeze("columns")
new_idx = np.arange(23,393)
new_val = np.interp(new_idx, s.index.values.astype(float), s.values)
s2 = pd.Series(new_val, new_idx)

Related

Pandas DataFrame (long) to Series ("wide")

I have the following DataFrame:
completeness homogeneity label_f1_score label_precision label_recall mean_bbox_iou mean_iou px_accuracy px_f1_score px_iou px_precision px_recall t_eval v_score
mean 0.1 1 0.92 0.92 0.92 0.729377 0.784934 0.843802 0.898138 0.774729 0.998674 0.832576 1.10854 0.1
std 0.0707107 0 0.0447214 0.0447214 0.0447214 0.0574177 0.0313196 0.0341158 0.0224574 0.0299977 0.000432499 0.0327758 0.0588322 0.0707107
What I would like to obtain is a Series composed of completeness_mean, completeness_std, homogeneity_mean, homogeneity_std, ..., i.e. a label {column}_{index} for every cell.
Does Pandas have a function for this or do I have to iterate over all cells myself to build the desired result?
EDIT: I mean a Series with {column}_{index} as index and the corresponding values from the table.
(I believe this is not a duplicate of the other questions on SO related to wide-to-long conversion.)
IIUC, unstack and flatten the index:
df2 = df.unstack()
df2.index = df2.index.map('_'.join)
output:
completeness_mean 0.100000
completeness_std 0.070711
homogeneity_mean 1.000000
homogeneity_std 0.000000
label_f1_score_mean 0.920000
label_f1_score_std 0.044721
label_precision_mean 0.920000
label_precision_std 0.044721
label_recall_mean 0.920000
label_recall_std 0.044721
mean_bbox_iou_mean 0.729377
mean_bbox_iou_std 0.057418
mean_iou_mean 0.784934
mean_iou_std 0.031320
px_accuracy_mean 0.843802
px_accuracy_std 0.034116
px_f1_score_mean 0.898138
px_f1_score_std 0.022457
px_iou_mean 0.774729
px_iou_std 0.029998
px_precision_mean 0.998674
px_precision_std 0.000432
px_recall_mean 0.832576
px_recall_std 0.032776
t_eval_mean 1.108540
t_eval_std 0.058832
v_score_mean 0.100000
v_score_std 0.070711
dtype: float64
or with stack for a different order:
df2 = df.stack()
df2.index = df2.swaplevel().index.map('_'.join)
output:
completeness_mean 0.100000
homogeneity_mean 1.000000
label_f1_score_mean 0.920000
label_precision_mean 0.920000
label_recall_mean 0.920000
mean_bbox_iou_mean 0.729377
mean_iou_mean 0.784934
px_accuracy_mean 0.843802
px_f1_score_mean 0.898138
px_iou_mean 0.774729
px_precision_mean 0.998674
px_recall_mean 0.832576
t_eval_mean 1.108540
v_score_mean 0.100000
completeness_std 0.070711
homogeneity_std 0.000000
label_f1_score_std 0.044721
label_precision_std 0.044721
label_recall_std 0.044721
mean_bbox_iou_std 0.057418
mean_iou_std 0.031320
px_accuracy_std 0.034116
px_f1_score_std 0.022457
px_iou_std 0.029998
px_precision_std 0.000432
px_recall_std 0.032776
t_eval_std 0.058832
v_score_std 0.070711
dtype: float64
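To make the two orderings concrete, here is a small self-contained sketch. It assumes a reduced stand-in for the question's table (two columns only), and the names by_column and by_row are illustrative.

```python
import pandas as pd

# Small stand-in for the table in the question (two of its columns).
df = pd.DataFrame(
    {"completeness": [0.1, 0.0707107], "homogeneity": [1.0, 0.0]},
    index=["mean", "std"],
)

# unstack() yields a Series with a (column, row) MultiIndex, grouped by column.
by_column = df.unstack()
by_column.index = by_column.index.map("_".join)

# stack() yields (row, column) pairs, grouped by row; build the labels explicitly.
by_row = df.stack()
by_row.index = [f"{col}_{row}" for row, col in by_row.index]

print(by_column)
print(by_row)
```

Both results hold the same values; only the order of the labels differs.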
Is this what you're looking for?
pd.merge(df.columns.to_frame(), df.index.to_frame(), 'cross').apply('_'.join, axis=1)
# OR
pd.Series(df.unstack().index.map('_'.join))
Output:
0 completeness_mean
1 completeness_std
2 homogeneity_mean
3 homogeneity_std
4 label_f1_score_mean
5 label_f1_score_std
6 label_precision_mean
7 label_precision_std
8 label_recall_mean
9 label_recall_std
10 mean_bbox_iou_mean
11 mean_bbox_iou_std
12 mean_iou_mean
13 mean_iou_std
14 px_accuracy_mean
15 px_accuracy_std
16 px_f1_score_mean
17 px_f1_score_std
18 px_iou_mean
19 px_iou_std
20 px_precision_mean
21 px_precision_std
22 px_recall_mean
23 px_recall_std
24 t_eval_mean
25 t_eval_std
26 v_score_mean
27 v_score_std
dtype: object

Mean of values in some columns with Pandas/Numpy

I've just started with Pandas and Numpy a couple of months ago and I've learned already quite a lot thanks to all the threads here. But now I can't find what I need.
For work, I have created an excel sheet that calculates some figures to be used for re-ordering inventory. To practice and maybe actually use it, I'd wanted to give it a try to replicate the functionality in Python. Later I might want to add some more sophisticated calculations with the help of Scikit-learn.
So far I've managed to load a csv with sales figures from our ERP into a dataframe, calculate mean and std. The calculations have been done on a subset of the data because I don't know how to apply calculations only to the specific columns. The csv does also contain for example product codes and leadtimes and these should not be used for the average and std calculations. Not sure yet also how to merge this subset back with the original dataframe.
The reason I didn't hardcode the column names is that the ERP reports the sales numbers over the past x months, so the order of the columns will change throughout the year and I want to keep them in chronological order.
My data from the csv looks like:
"code","leadtime","jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec"
"001.002",60,299,821,351,614,246,957,968,939,125,368,727,231
"001.002",25,340,274,733,575,904,953,614,268,638,960,617,757
"001.002",130,394,327,435,767,377,699,424,951,972,717,317,264
What I've done so far, which works fine (this can probably be done much more easily or efficiently):
import numpy as np
import timeit
import csv
import pandas as pd
sd = 1
csv_in = "data_in.csv"
csv_out = "data_out.csv"
# Use Pandas
df = pd.read_csv(csv_in,dtype={'code': str})
# Get the number of columns and subtract 2 for code and leadtime
cols = df.shape[1] - 2
# Create a subset and count the columns
df_subset = df.iloc[:, -cols:]
subset_cols = df_subset.shape[1]
# Add columns for std dev and average
df_subset = (df_subset.assign(mean=df_subset.mean(axis=1),
                              stddev=df_subset.std(axis=1, ddof=0))
             )
# Add columns for min and max values based on mean +/- std multiplied by factor sd
df_subset = (df_subset.assign(minSD=df_subset['mean'].sub(df_subset['stddev'] * sd),
                              maxSD=df_subset['mean'].add(df_subset['stddev'] * sd))
             )
df_subset
Which gives me:
jan feb mar apr may jun jul aug sep oct nov dec mean stddev minSD maxSD
0 299 821 351 614 246 957 968 939 125 368 727 231 553.833333 304.262998 249.570335 858.096332
1 340 274 733 575 904 953 614 268 638 960 617 757 636.083333 234.519530 401.563804 870.602863
2 394 327 435 767 377 699 424 951 972 717 317 264 553.666667 242.398203 311.268464 796.064870
However for my next calculation I'm stuck again:
I want to calculate the average over values from the "month" columns and only the values that match the condition >= minSD and <= maxSD
So for row 0, I'm looking for the value (299+821+351+614+368+727)/6 = 530
How can I achieve this?
I've tried this, but this doesn't seem to work:
df_subset = df_subset.assign(avgwithSD=df_subset.iloc[:,0:subset_cols].values(where(df_subset.values>=df_subset['minSD'] & df_subset.values>=df_subset['maxSD'])).mean(axis=1))
Some help would be very welcome. Thanks
EDIT: With help I ended up using this to get further with my program
import numpy as np
import timeit
import csv
import pandas as pd
# sd will determine if range will be SD1 or SD2
sd = 1
# file to use
csv_in = "data_in.csv"
csv_out = "data_out.csv"
# Function to calculate the mean of the values within the range between minSD and maxSD
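def CalcMeanSD(row):
    # Positional access: columns 2..13 are the months, the last two are minSD and maxSD
    months_ = row.iloc[2:14]
    min_SD = row.iloc[-2]
    max_SD = row.iloc[-1]
    # Keep only the months inside [minSD, maxSD]
    return months_[(months_ >= min_SD) & (months_ <= max_SD)]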
# Use Pandas
df = pd.read_csv(csv_in,dtype={'code': str})
# Define the month/data columns and set them to floatvalues
months_cols = df.columns[2:]
df.loc[:, months_cols] = df.loc[:, months_cols].astype('float64')
# Add columns for stddev and mean. Based on these values set new range between minSD and maxSD
df['stddev'] = df.loc[:,months_cols].std(axis=1, ddof=0)
df['mean'] = df.loc[:, months_cols].mean(axis=1)
df['minSD'] = df['mean'].sub(df['stddev'] * sd)
df['maxSD'] = df['mean'].add(df['stddev'] * sd)
# Add column with the mean of the new range
df['avgwithSD'] = np.nanmean(df.apply(CalcMeanSD, axis=1), axis=1)
df
Result is:
code leadtime jan feb mar apr may jun jul aug sep oct nov dec stddev mean minSD maxSD avgwithSD
0 001.002 60 299.0 821.0 351.0 614.0 246.0 957.0 968.0 939.0 125.0 368.0 727.0 231.0 304.262998 553.833333 249.570335 858.096332 530.000000
1 001.002 25 340.0 274.0 733.0 575.0 904.0 953.0 614.0 268.0 638.0 960.0 617.0 757.0 234.519530 636.083333 401.563804 870.602863 655.666667
2 001.002 130 394.0 327.0 435.0 767.0 377.0 699.0 424.0 951.0 972.0 717.0 317.0 264.0 242.398203 553.666667 311.268464 796.064870 495.222222
3 001.002 90 951.0 251.0 411.0 469.0 359.0 220.0 192.0 250.0 818.0 768.0 937.0 128.0 292.572925 479.500000 186.927075 772.072925 365.000000
4 001.002 35 228.0 400.0 46.0 593.0 61.0 293.0 5.0 203.0 850.0 506.0 37.0 631.0 264.178746 321.083333 56.904588 585.262079 281.833333
5 001.002 10 708.0 804.0 208.0 380.0 531.0 125.0 500.0 773.0 354.0 238.0 805.0 215.0 242.371773 470.083333 227.711560 712.455106 451.833333
6 001.002 14 476.0 628.0 168.0 946.0 29.0 324.0 3.0 400.0 981.0 467.0 459.0 571.0 295.814225 454.333333 158.519109 750.147558 436.625000
7 001.002 14 92.0 906.0 18.0 537.0 57.0 399.0 544.0 977.0 909.0 687.0 881.0 459.0 333.154577 538.833333 205.678756 871.987910 525.200000
8 001.002 90 487.0 634.0 5.0 918.0 158.0 447.0 713.0 459.0 465.0 643.0 482.0 672.0 233.756447 506.916667 273.160220 740.673113 555.777778
9 001.002 130 741.0 43.0 976.0 461.0 35.0 321.0 434.0 8.0 330.0 32.0 896.0 531.0 326.216782 400.666667 74.449885 726.883449 415.400000
EDIT:
Instead of your original code:
# first part:
months_cols = df.columns[2:]
df.loc[:, months_cols] = df.loc[:, months_cols].astype('float64')
df['stddev'] = df.loc[:,months_cols].std(axis=1, ddof=0)
df['mean'] = df.loc[:, months_cols].mean(axis=1)
df['minSD'] = df['mean'].sub(df['stddev'] * sd)
df['maxSD'] = df['mean'].add(df['stddev'] * sd)
# second part: (the one that doesn't work for you)
def calc_mean_per_row_by_condition(row):
    # Positional access: columns 2..13 are the months, the last two are minSD and maxSD
    months_ = row.iloc[2:14]
    min_SD = row.iloc[-2]
    max_SD = row.iloc[-1]
    return months_[(months_ >= min_SD) & (months_ <= max_SD)]
df['avgwithSD'] = np.nanmean(df.apply(calc_mean_per_row_by_condition, axis=1), axis=1)
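As a possible alternative to the row-wise apply, the same band-limited mean can be computed with a boolean mask. This is only a sketch: it assumes the first two data rows from the question, and the names months, lower, upper and in_band are illustrative.

```python
import pandas as pd

sd = 1  # width of the band in standard deviations, as in the question

df = pd.DataFrame(
    {
        "code": ["001.002", "001.002"],
        "leadtime": [60, 25],
        "jan": [299, 340], "feb": [821, 274], "mar": [351, 733],
        "apr": [614, 575], "may": [246, 904], "jun": [957, 953],
        "jul": [968, 614], "aug": [939, 268], "sep": [125, 638],
        "oct": [368, 960], "nov": [727, 617], "dec": [231, 757],
    }
)

month_cols = df.columns[2:]  # everything after code and leadtime
months = df[month_cols].astype("float64")

mean = months.mean(axis=1)
std = months.std(axis=1, ddof=0)
lower = mean - sd * std
upper = mean + sd * std

# ge/le with axis=0 compare every row against that row's own bound.
in_band = months.ge(lower, axis=0) & months.le(upper, axis=0)

# where() turns out-of-band cells into NaN, which mean() skips by default.
df["avgwithSD"] = months.where(in_band).mean(axis=1)
print(df[["code", "leadtime", "avgwithSD"]])
```

For row 0 this keeps 299, 821, 351, 614, 368 and 727, which gives the 530 the question asks for.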

What is the best way to merge a pandas.DataFrame with a pandas.Series based on df.columns and Series.index names?

I'm facing the following problem and I don't know the cleanest/smartest way to solve it.
I have a dataframe called wfm that contains the input for my simulation
wfm.head()
Out[51]:
OPN Vin Vout_ref Pout_ref CFN ... Cdclink Cdm L k ron
0 6 350 750 80500 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
1 7 400 800 92000 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
2 8 350 900 80500 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
3 9 450 750 103500 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
4 10 450 900 103500 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
[5 rows x 13 columns]
Then, in every simulation loop, I receive 2 Series, outputs_rms and outputs_avg, that look like this:
outputs_rms outputs_avg
Out[53]: Out[54]:
time.rms 0.057751 time.avg 5.78E-02
Vi_dc.voltage.rms 400 Vi_dc.voltage.avg 4.00E+02
Vi_dc.current.rms 438.333188 Vi_dc.current.avg 3.81E+02
Vi_dc.power.rms 175333.2753 Vi_dc.power.avg 1.53E+05
Am_in.current.rms 438.333188 Am_in.current.avg 3.81E+02
Cdm.voltage.rms 396.614536 Cdm.voltage.avg 3.96E+02
Cdm.current.rms 0.213185 Cdm.current.avg -5.14E-05
motor_phU.current.rms 566.035833 motor_phU.current.avg -5.67E+02
motor_phU.voltage.rms 296.466083 motor_phU.voltage.avg -9.17E-02
motor_phV.current.rms 0.061024 motor_phV.current.avg 2.58E-02
motor_phV.voltage.rms 1.059341 motor_phV.voltage.avg -1.24E-09
motor_phW.current.rms 566.005071 motor_phW.current.avg 5.67E+02
motor_phW.voltage.rms 297.343876 motor_phW.voltage.avg 9.17E-02
S_ULS.voltage.rms 305.017804 S_ULS.voltage.avg 2.65E+02
S_ULS.current.rms 358.031053 S_ULS.current.avg -1.86E+02
S_UHS.voltage.rms 253.340047 S_UHS.voltage.avg 1.32E+02
S_UHS.current.rms 438.417985 S_UHS.current.avg 3.81E+02
S_VLS.voltage.rms 295.509073 S_VLS.voltage.avg 2.64E+02
S_VLS.current.rms 0 S_VLS.current.avg 0.00E+00
S_VHS.voltage.rms 152.727975 S_VHS.voltage.avg 1.32E+02
S_VHS.current.rms 0.061024 S_VHS.current.avg -2.58E-02
S_WLS.voltage.rms 509.388666 S_WLS.voltage.avg 2.64E+02
S_WLS.current.rms 438.417985 S_WLS.current.avg 3.81E+02
S_WHS.voltage.rms 619.258959 S_WHS.voltage.avg 5.37E+02
S_WHS.current.rms 357.982417 S_WHS.current.avg -1.86E+02
Cdclink.voltage.rms 801.958092 Cdclink.voltage.avg 8.02E+02
Cdclink.current.rms 103.73088 Cdclink.current.avg 2.08E-05
Am_out.current.rms 317.863371 Am_out.current.avg 1.86E+02
Vo_dc.voltage.rms 800 Vo_dc.voltage.avg 8.00E+02
Vo_dc.current.rms 317.863371 Vo_dc.current.avg -1.86E+02
Vo_dc.power.rms 254290.6969 Vo_dc.power.avg -1.49E+05
CFN 1 CFN 1.00E+00
OPN 6 OPN 6.00E+00
dtype: float64 dtype: float64
My goal is then to place outputs_rms and outputs_avg on the right row of wfm, based on the 'CFN' and 'OPN' values.
What are your suggestions?
thanks
Riccardo
Suppose that you create these series as outputs output_rms_1, output_rms_2, etc.;
then the series can be combined into one DataFrame:
import pandas as pd
dfRms = pd.DataFrame([output_rms_1, output_rms_2, output_rms_3])
The next output, say output_rms_10, can be added with pd.concat (DataFrame.append was removed in pandas 2.0):
dfRms = pd.concat([dfRms, output_rms_10.to_frame().T], ignore_index=True)
Finally, when all outputs are joined into one Dataframe,
you can merge the original wfm with the output, i.e.
result = pd.merge(wfm, dfRms, on=['CFN', 'OPN'], how='left')
Similarly for avg.
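The whole flow can be sketched end to end as follows. The data here is a reduced, hypothetical stand-in (only the key columns, one input column and one output quantity), and the names outputs and df_rms are illustrative.

```python
import pandas as pd

# Reduced stand-in for wfm: only the key columns and one input column.
wfm = pd.DataFrame({"OPN": [6, 7, 8], "CFN": [1, 1, 1], "Vin": [350, 400, 350]})

# Hypothetical per-loop outputs; the real Series have many more entries.
outputs = [
    pd.Series({"Vi_dc.voltage.rms": 400.0, "CFN": 1.0, "OPN": 6.0}),
    pd.Series({"Vi_dc.voltage.rms": 410.0, "CFN": 1.0, "OPN": 8.0}),
]

# One row per simulation loop; the Series index becomes the column names.
df_rms = pd.DataFrame(outputs)

# The key columns arrive as floats in the Series, so cast them to match wfm.
df_rms = df_rms.astype({"CFN": "int64", "OPN": "int64"})

# Left merge keeps every row of wfm; loops that have not run yet stay NaN.
result = wfm.merge(df_rms, on=["CFN", "OPN"], how="left")
print(result)
```

Collecting the Series in a list and building the DataFrame once at the end avoids growing a DataFrame row by row inside the loop.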

Selecting with same condition, but HDFStore gives different answers

I stored my data in an HDF5 file. The strange thing is that selecting from a table twice with the same condition gives different answers from HDFStore.
Can anyone tell me why?
In [2]: import pandas as pd
In [3]: store=pd.HDFStore("./data/m2016.h5","r")
In [4]: store
Out[4]:
<class 'pandas.io.pytables.HDFStore'>
File path: ./data/m2016.h5
/m2016 frame_table (typ->appendable,nrows->37202055,ncols->6,indexers->[index],dc->[dt,code])
In [5]: a=store.select('m2016',where="code='000001'")
In [6]: b=store.select('m2016',where="code='000001'")
In [7]: a.shape
Out[7]: (2388318, 6)
In [8]: b.shape
Out[8]: (2374525, 6)
In [9]: a.head()
Out[9]:
dt market code price volume preclose
85920 2016-01-04 09:30:00 0 000001 11.98 1102900 11.99
85921 2016-01-04 09:31:00 0 000001 11.96 289100 11.99
85922 2016-01-04 09:32:00 0 000001 11.97 361800 11.99
85923 2016-01-04 09:33:00 0 000001 12.00 279200 11.99
85924 2016-01-04 09:34:00 0 000001 12.00 405600 11.99
I tested it on all three of my computers, with these results:
PC1, os:Win2012server, python:winpython 2.7.10.3 (64bits), select result is wrong.
PC2, os:Win10, python winpython 2.7.10.3 (64bits), select result is wrong.
PC3, os:Win7, python:Winpython 2.7.10.3 (64bits), select result is ok!
Maybe HDFStore.select only works on Win7?
Maybe the default encoding of your operating system varies?
Would this work: b=store.select('m2016',where="code=u'000001'")
I tested it further on my Win7 PC, and it still returned random wrong results.
In [1]: import pandas as pd
In [2]: cd /projects
C:\projects
In [3]: store=pd.HDFStore("./data/m2016.h5","r")
In [4]: d0=store.select("m2016",where='dt<Timestamp("2016-01-10")')
In [5]: d1=store.select("m2016",where='dt<Timestamp("2016-01-10")')
In [6]: d0.shape
Out[6]: (6917149, 6)
In [7]: d1.shape
Out[7]: (4199769, 6)
In [8]: d0.tail()
Out[8]:
dt market code price volume preclose
455381 2016-04-21 11:11:00 1 600461 13.33 16400 13.2
455386 2016-04-21 11:16:00 1 600461 13.36 13800 13.2
455387 2016-04-21 11:17:00 1 600461 13.37 8300 13.2
455388 2016-04-21 11:18:00 1 600461 13.36 9800 13.2
455389 2016-04-21 11:19:00 1 600461 13.34 15300 13.2
In [9]: d1.tail()
Out[9]:
dt market code price volume preclose
573543 2016-04-22 14:03:00 1 601333 3.94 8200 3.97
573548 2016-04-22 14:08:00 1 601333 3.96 45000 3.97
573549 2016-04-22 14:09:00 1 601333 3.96 8800 3.97
573550 2016-04-22 14:10:00 1 601333 3.97 10700 3.97
573551 2016-04-22 14:11:00 1 601333 3.96 6800 3.97
In [10]: !ptdump m2016.h5
/ (RootGroup) ''
/m2016 (Group) ''
/m2016/table (Table(50957318,), shuffle, zlib(9)) ''
I uploaded my HDF5 file here

Combine daily data into monthly data in Excel using Python

I am trying to figure out how to combine daily dates into months, summing the data for each day that falls within a given month.
Note: I have a huge list of daily dates, but I put a small sample here to simplify the example.
File name: (test.xlsx)
For example, sheet1 contains the following (shown as a DataFrame):
DATE 51 52 53 54 55 56
0 20110706 28.52 27.52 26.52 25.52 24.52 23.52
1 20110707 28.97 27.97 26.97 25.97 24.97 23.97
2 20110708 28.52 27.52 26.52 25.52 24.52 23.52
3 20110709 28.97 27.97 26.97 25.97 24.97 23.97
4 20110710 30.5 29.5 28.5 27.5 26.5 25.5
5 20110711 32.93 31.93 30.93 29.93 28.93 27.93
6 20110712 35.54 34.54 33.54 32.54 31.54 30.54
7 20110713 33.02 32.02 31.02 30.02 29.02 28.02
8 20110730 35.99 34.99 33.99 32.99 31.99 30.99
9 20110731 30.5 29.5 28.5 27.5 26.5 25.5
10 20110801 32.48 31.48 30.48 29.48 28.48 27.48
11 20110802 31.04 30.04 29.04 28.04 27.04 26.04
12 20110803 32.03 31.03 30.03 29.03 28.03 27.03
13 20110804 34.01 33.01 32.01 31.01 30.01 29.01
14 20110805 27.44 26.44 25.44 24.44 23.44 22.44
15 20110806 32.48 31.48 30.48 29.48 28.48 27.48
What I would like is to edit ("test.xlsx",'sheet1') to result in what is below:
DATE 51 52 53 54 55 56
0 201107 313.46 303.46 293.46 283.46 273.46 263.46
1 201108 189.48 183.48 177.48 171.48 165.48 159.48
How would I go about implementing this?
Here is my code thus far:
import pandas as pd
from pandas import ExcelWriter
df = pd.read_excel('thecddhddtestquecdd.xlsx')
def sep_yearmonths(x):
    x['month'] = str(x['DATE'])[:-2]
    return x
df = df.apply(sep_yearmonths,axis=1)
df.groupby('month').sum()
writer = ExcelWriter('thecddhddtestquecddMERGE.xlsx')
df.to_excel(writer,'Sheet1',index=False)
writer.save()
This will work if 'DATE' is a column of strings and not your index.
Example dataframe - shortened for clarity:
df = pd.DataFrame({'DATE': {0: '20110706', 1:'20110707', 2: '20110801'},
52: {0: 28.52, 1: 28.97, 2: 28.52},
55: { 0: 24.52, 1: 24.97, 2:24.52 }
})
Which yields:
52 55 DATE
0 28.52 24.52 20110706
1 28.97 24.97 20110707
2 28.52 24.52 20110801
Apply the following function over the dataframe to generate a new column:
def sep_yearmonths(x):
    x['month'] = x['DATE'][:-2]
    return x
Like this:
df = df.apply(sep_yearmonths,axis=1)
Over which you can then groupby and sum:
df.groupby('month').sum()
Resulting in the following:
52 55
month
201107 57.49 49.49
201108 28.52 24.52
If 'DATE' is your index, simply call reset_index first. If it is not a column of string values, convert it to strings beforehand.
Finally, you can rename your 'month' column to 'DATE'. I suppose you could just overwrite the 'DATE' column in place, but I chose to do things explicitly. You can overwrite it like so:
df['DATE'] = df['DATE'].apply(lambda x: x[:-2])
Then 'groupby' 'DATE' instead of month.
Use resample
import pandas as pd
myTable=pd.read_excel('test.xlsx')
myTable['DATE']=pd.to_datetime(myTable['DATE'], format="%Y%m%d")
myTable=myTable.set_index('DATE')
myTable.resample("M").sum()  # month-end frequency; pandas 2.2+ prefers the alias "ME"
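If the goal is a yyyymm label in the DATE column, as in the desired output of the question, grouping on the integer date divided by 100 is one more option. This is a minimal sketch assuming DATE is stored as an integer and using only four sample rows and two data columns; the names month and monthly are illustrative.

```python
import pandas as pd

# A few rows from the question's sheet; DATE is stored as an integer yyyymmdd.
df = pd.DataFrame(
    {
        "DATE": [20110706, 20110707, 20110801, 20110802],
        51: [28.52, 28.97, 32.48, 31.04],
        52: [27.52, 27.97, 31.48, 30.04],
    }
)

# Integer division drops the day, leaving yyyymm.
month = (df["DATE"] // 100).rename("DATE")

# Sum every data column per month and turn the month back into a column.
monthly = df.drop(columns="DATE").groupby(month).sum().reset_index()
print(monthly)
```

The resulting frame can then be written back with to_excel(..., index=False), which needs an Excel writer engine such as openpyxl to be installed.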
