final value in column of exponentially weighted average values - python

I have a dataframe stockData which looks like this:
Name: BBG.XLON.VOD.S_MKTCAP_EUR,
04/02/2008 125761.8868
05/02/2008 124513.4973
06/02/2008 124299.8368
07/02/2008 122973.7429
08/02/2008 123451.0086
11/02/2008 122948.5002
12/02/2008 124336.3475
13/02/2008 124546.6607
14/02/2008 124434.8762
15/02/2008 123370.2129
18/02/2008 123246.854
19/02/2008 121965.328
20/02/2008 119154.8945
I am trying to create an exponentially weighted moving average with an alpha of 0.1, so the resulting dataframe should look like:
Name: BBG.XLON.VOD.S_MKTCAP_EUR, expon
04/02/2008 125761.8868 125761.8868
05/02/2008 124513.4973 125637.0478
06/02/2008 124299.8368 125503.3267
07/02/2008 122973.7429 125250.3683
08/02/2008 123451.0086 125070.4324
11/02/2008 122948.5002 124858.2391
12/02/2008 124336.3475 124806.05
13/02/2008 124546.6607 124780.111
14/02/2008 124434.8762 124745.5876
15/02/2008 123370.2129 124608.0501
18/02/2008 123246.854 124471.9305
19/02/2008 121965.328 124221.2702
20/02/2008 119154.8945 123714.6327
I have tried using the following from pandas:
stockData['expon'] = pd.ewma(stockData[unique_id+"_MKTCAP_EUR"], span = 0.1)
but get a result which does not equal what I am expecting:
Name: BBG.XLON.VOD.S_MKTCAP_EUR, expon
04/02/2008 125761.8868 125761.8868
05/02/2008 124513.4973 123681.2377
06/02/2008 124299.8368 124062.4362
07/02/2008 122973.7429 121107.3884
08/02/2008 123451.0086 124216.9907
11/02/2008 122948.5002 122075.8313
12/02/2008 124336.3475 126868.3597
13/02/2008 124546.6607 124942.6688
14/02/2008 124434.8762 124220.0306
15/02/2008 123370.2129 121296.275
18/02/2008 123246.854 123004.4148
19/02/2008 121965.328 119431.9075
20/02/2008 119154.8945 113577.3494
Could someone let me know what I need to do in order to return the expected result please.
Also if I wanted just to return the last value in the exponentially weighted series (123714.6327) could someone also let me know how that would be possible please?
Thanks

Just simplifying column names:
df.columns = ['date', 'ticker']
Use adjust=False (see docs for calculation of weights)
df['ewma'] = pd.ewma(df.ticker, alpha=0.1, adjust=False)
date ticker ewma
0 04/02/2008 125761.8868 125761.886800
1 05/02/2008 124513.4973 125637.047850
2 06/02/2008 124299.8368 125503.326745
3 07/02/2008 122973.7429 125250.368361
4 08/02/2008 123451.0086 125070.432384
5 11/02/2008 122948.5002 124858.239166
6 12/02/2008 124336.3475 124806.049999
7 13/02/2008 124546.6607 124780.111069
8 14/02/2008 124434.8762 124745.587583
9 15/02/2008 123370.2129 124608.050114
10 18/02/2008 123246.8540 124471.930503
11 19/02/2008 121965.3280 124221.270253
12 20/02/2008 119154.8945 123714.632677
and to get the last value:
df.ewma.iloc[-1]
123714.632677
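Note: pd.ewma has since been removed from pandas. On current versions the same numbers come from the ewm accessor; a minimal sketch, assuming the simplified column names used above:
import pandas as pd

# equivalent of pd.ewma(df.ticker, alpha=0.1, adjust=False) on modern pandas
df['ewma'] = df['ticker'].ewm(alpha=0.1, adjust=False).mean()

# last value of the smoothed series
df['ewma'].iloc[-1]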


Is it possible to apply seasonal_decompose() after a groupby() where date is not the index of the data frame?

So, for a forecasting project, I have a really long Dataframe of multiple time series of the following type (it has a numerical index):
date        time_series_id  value
2015-08-01  0               0
2015-08-02  0               1
2015-08-03  0               2
2015-08-04  0               3
2015-08-01  1               2
2015-08-02  1               3
2015-08-03  1               4
2015-08-04  1               5
My objective is to add 3 new columns to this dataset, for each individual time series (each id), corresponding to trend, seasonal and resid.
Given the characteristics of the dataset, the series tend to have NaNs at the start and end of their date ranges.
What I was trying to do was the following:
from statsmodels.tsa.seasonal import seasonal_decompose
df.assign(trend = lambda x: x.groupby("time_series_id")["value"].transform(
    lambda s: s.mask(~s.isna(), other=seasonal_decompose(
        s[~s.isna()], model='additive', extrapolate_trend='freq').trend)))
The expected output (the trend values shown are not the actual values) should be:
date        time_series_id  value  trend
2015-08-01  0               0      1
2015-08-02  0               1      1
2015-08-03  0               2      1
2015-08-04  0               3      1
2015-08-01  1               2      1
2015-08-02  1               3      1
2015-08-03  1               4      1
2015-08-04  1               5      1
But I get the following error message:
AttributeError: 'Int64Index' object has no attribute 'inferred_freq'
In a previous iteration of my code, this worked for my individual time series data frames, since I had set the date column as the index of the data frame instead of keeping it as an additional column, so the "x" that the lambda function takes already has a datetime index appropriate for the seasonal_decompose function.
df.assign(
    trend = lambda x: x["value"].mask(~x["value"].isna(), other=
        seasonal_decompose(x["value"][~x["value"].isna()], model='additive', extrapolate_trend='freq').trend))
My questions are: first, is it possible to achieve this using groupby, or are other approaches possible? Second, is there a way to handle this that doesn't eat much memory? The original dataset I'm working on has approximately 1MM rows, so any help is really welcome :).
Did one of the already posted solutions work? If so, or if you found a different solution, please share. I tried each without success, but I'm new to Python so I'm probably missing something.
Here is what I came up with, using a for loop. For my dataset it took 8 minutes to decompose 20 million rows consisting of 6,000 different subsets. This works but I wish it were faster.
Date Time            Segment ID  Travel Time(Minutes)
2021-11-09 07:15:00  1           30
2021-11-09 07:30:00  1           18
2021-11-09 07:15:00  2           30
2021-11-09 07:30:00  2           17
import pandas as pd
import statsmodels.api as sm

segments = set(frame['Segment ID'])
data = pd.DataFrame([])
for s in segments:
    df = frame[frame['Segment ID'] == s].set_index('Date Time').resample('H').mean()
    comp = sm.tsa.seasonal_decompose(x=df['Travel Time(Minutes)'], period=24*7, two_sided=False)
    df = df.join(comp.trend).join(comp.seasonal).join(comp.resid)
    # optional columns with some statistics to find outliers and trend changes
    df['resid zscore'] = (df['resid'] - df['resid'].mean()).div(df['resid'].std())
    df['trend pct_change'] = df.trend.pct_change()
    df['trend pct_change zscore'] = (df['trend pct_change'] - df['trend pct_change'].mean()).div(df['trend pct_change'].std())
    data = data.append(df.dropna())
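The repeated DataFrame.append is likely the slow part here (and it has since been deprecated in pandas). Collecting the per-segment frames in a list and concatenating once at the end is usually faster; a sketch under the same assumptions as the loop above, with the optional statistics columns omitted for brevity:
frames = []
for s in segments:
    df = frame[frame['Segment ID'] == s].set_index('Date Time').resample('H').mean()
    comp = sm.tsa.seasonal_decompose(x=df['Travel Time(Minutes)'], period=24*7, two_sided=False)
    frames.append(df.join(comp.trend).join(comp.seasonal).join(comp.resid).dropna())
data = pd.concat(frames)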
Where you have lambda x: x.groupby(..., you don't have anything to group; you are telling it to group a row (I believe). You can try a setup like the following instead.
Here you define a function to act on the group you are sending via the apply() method. Then you should be able to use your original code.
I have not tested this, but I use this setup quite often to work on groups.
def trend_function(x):
    # do your lambda function here, as you are sent each grouping
    x = x.assign(
        trend = lambda x: x["value"].mask(~x["value"].isna(), other=
            seasonal_decompose(x["value"][~x["value"].isna()], model='additive', extrapolate_trend='freq').trend))
    return x

dfnew = df.groupby('time_series_id').apply(trend_function)
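One caveat (a hedged variant, not tested on the original data): seasonal_decompose needs either a DatetimeIndex with an inferable frequency or an explicit period, which is exactly what the Int64Index error is complaining about. Inside the per-group function you could set the date column as the index first, for example:
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

def decompose_group(g):
    # column names 'date' and 'value' are taken from the question
    g = g.set_index(pd.to_datetime(g['date'])).sort_index()
    s = g['value'].dropna()
    # period=7 is a placeholder; use the true seasonal period of your data
    dec = seasonal_decompose(s, model='additive', extrapolate_trend='freq', period=7)
    g['trend'] = dec.trend          # aligned on the datetime index; NaN rows stay NaN
    g['seasonal'] = dec.seasonal
    g['resid'] = dec.resid
    return g.reset_index(drop=True)

dfnew = df.groupby('time_series_id', group_keys=False).apply(decompose_group)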
Use extrapolate_trend='freq' as a parameter. You can pull the trend, seasonal, and residual components from the decomposition result and plot them:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics import tsaplots
import statsmodels.api as sm
date=['2015-08-01','2015-08-02','2015-08-03','2015-08-04','2015-08-01','2015-08-02','2015-08-03','2015-08-04']
time_series_id=[0,0,0,0,1,1,1,1]
value=[0,1,2,3,2,3,4,5]
df=pd.DataFrame({'date':date,'time_series_id':time_series_id,'value':value})
df['date']=pd.to_datetime(df['date'])
df=df.set_index('date')
print(df)
index_day = df.index.day
value_by_day = df.groupby(index_day)['value'].mean()
fig,ax = plt.subplots(figsize=(12,4))
value_by_day.plot(ax=ax)
plt.title('value by month')
plt.show()
df[['value']].boxplot()
plt.show()
fig,ax = plt.subplots(figsize=(12,4))
df[['value']].hist(ax=ax, bins=5)
plt.show()
fig,ax = plt.subplots(figsize=(12,4))
df[['value']].plot(kind='density', ax=ax)
plt.show()
plt.clf()
fig,ax = plt.subplots(figsize=(12,4))
plt.style.use('seaborn-pastel')
fig = tsaplots.plot_acf(df['value'], lags=4,ax=ax)
plt.show()
decomposition=sm.tsa.seasonal_decompose(x=df['value'],model='additive', extrapolate_trend='freq', period=1)
decomposition.plot()
plt.show()
decomposition_trend=decomposition.trend
ax= decomposition_trend.plot(figsize=(14,2))
ax.set_xlabel('Date')
ax.set_ylabel('Trend of time series')
ax.set_title('Trend values of the time series')
plt.show()
I changed the first piece of code according to my scenario.
Here's my code and the resulting output:
data = pd.DataFrame([])
segments = set(subset['Planning_Material'])
for s in segments:
    df = subset[subset['Planning_Material'] == s].set_index('Cal_year_month').resample('M').sum()
    comp = sm.tsa.seasonal_decompose(df)
    df = df.join(comp.trend).join(comp.seasonal).join(comp.resid)
    df['Planning_Material'] = s
    data = pd.concat([data, df])
data = data.reset_index()
data = data[['Planning_Material', 'Cal_year_month', 'Historical_demand', 'trend', 'seasonal', 'resid']]
data

How to use one dataframe's index to reindex another one in pandas

I am so sorry that I truly don't know what title I should use. But here is my question
Stocks_Open
d-1 d-2 d-3 d-4
000001.HR 1817.670960 1808.937405 1796.928768 1804.570628
000002.ZH 4867.910878 4652.713598 4652.713598 4634.904168
000004.HD 92.046474 92.209029 89.526880 96.435445
000005.SS 28.822245 28.636893 28.358865 28.729569
000006.SH 192.362963 189.174626 185.986290 187.403328
000007.SH 79.190528 80.515892 81.509916 78.693516
Stocks_Volume
d-1 d-2 d-3 d-4
000001.HR 324234 345345 657546 234234
000002.ZH 4867343 465234 4652598 4634168
000004.HD 9246474 929029 826880 965445
000005.SS 2822245 2836893 2858865 2829569
000006.SH 19262963 1897466 1886290 183328
000007.SH 7190528 803892 809916 7693516
Above are the data I parsed from a database. What I want to do is obtain the correlation of open price and volume over 4 days for each stock (the first column consists of the codes of the different stocks). In other words, I am trying to calculate the correlation of the corresponding rows of the two DataFrames. (This is only a simplified example; the real data extend to more than 1000 different stocks.)
My attempt is to create a dataframe and run a loop, assigning the results to that dataframe. But there is a problem: the index of the created dataframe is not exactly what I want. When I try to append the correlation column, the bug occurs. (Please ignore the correlation values, which I concocted here just to give an example.)
r = pd.DataFrame(index=range(6), columns=['c'])
for i in range(6):
    r.iloc[i-1, :] = Stocks_Open.iloc[i-1].corr(Stocks_Volume.iloc[i-1])
Correlation_in_4days = pd.concat([Stocks_Open, Stocks_Volume], axis=1)
Correlation_in_4days['corr'] = r['c']
for i in range(6):
    Correlation_in_4days.iloc[i-1, 8] = r.iloc[i-1, :]
r c
1 0.654
2 -0.454
3 0.3321
4 0.2166
5 -0.8772
6 0.3256
The bug occurred:
"ValueError: Incompatible indexer with Series"
I realized that my correlation dataframe's index is the integer index rather than the stock codes, but I don't know how to fix it. Could anyone help?
My ideal result is:
corr
000001.HR 0.654
000002.ZH -0.454
000004.HD 0.3321
000005.SS 0.2166
000006.SH -0.8772
000007.SH 0.3256
Try assigning the index back:
r.index = Stocks_Open.index
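Alternatively, the loop can be avoided entirely with DataFrame.corrwith, which computes the pairwise row correlations when called with axis=1; a sketch, assuming both frames share the stock codes as index and the same d-1..d-4 columns:
import pandas as pd

# row-by-row Pearson correlation between the two frames;
# the result is a Series indexed by the stock codes
corr = Stocks_Open.corrwith(Stocks_Volume, axis=1)

Correlation_in_4days = pd.concat([Stocks_Open, Stocks_Volume], axis=1)
Correlation_in_4days['corr'] = corr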

How to find the average of data samples at random intervals in python?

I have temperature data stored in a csv file; when plotted it looks like the image below. How do I find the average during each interval when the temperature goes above 12? The result should be T1, T2, T3, the average temperature during each interval where the value is above 12.
Could you please suggest how to achieve this in python?
Highlighted the areas approximately over which I need to calculate the average:
Please find below sample data:
R3,R4
1,11
2,11
3,11
4,11
5,11
6,15.05938512
7,15.12975992
8,15.05850141
18,15.1677708
19,15.00921862
20,15.00686921
21,15.01168888
22,11
23,11
24,11
25,11
26,11
27,15.05938512
28,15.12975992
29,15.05850141
30,15.00328706
31,15.12622611
32,15.01479819
33,15.17778891
34,15.01411488
35,9
36,9
37,9
38,9
39,16.16042435
40,16.00091253
41,16.00419677
42,16.15381827
43,16.0471766
44,16.03725301
45,16.13925003
46,16.00072279
47,11
48,1
In pandas, an idea would be to group the data based on the condition T > 12 and use mean as agg func. Ex:
import pandas as pd
# a dummy df:
df = pd.DataFrame({'T': [11, 13, 13, 10, 14]})
# set the condition
m = df['T'] > 12
# define groups
grouper = (~m).cumsum().where(m)
# ...looks like
# 0 NaN
# 1 1.0
# 2 1.0
# 3 NaN
# 4 2.0
# Name: T, dtype: float64
# now we can easily calculate the mean for each group:
grp_mean = df.groupby(grouper)['T'].mean()
# T
# 1.0 13
# 2.0 14
# Name: T, dtype: int64
Note: if you have noisy data (T jumps up and down), it might be clever to apply a filter first (savgol, median etc. - whatever is appropriate) so you don't end up with groups caused by the noise.
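Applied to the sample data in the question, that looks roughly like this (a sketch: the filename is hypothetical and I'm assuming R4 holds the temperature):
import pandas as pd

df = pd.read_csv('temperature.csv')        # hypothetical filename with columns R3, R4
m = df['R4'] > 12
grouper = (~m).cumsum().where(m)
interval_means = df.groupby(grouper)['R4'].mean()
print(interval_means)                      # one mean per contiguous interval above 12 (T1, T2, T3, ...)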
I couldn't find a good pattern for this - here's a clunky bit of code that does what you want, though.
In general, use .shift() to find transition points, and use groupby with transform to get your means.
import pandas as pd
import numpy as np

#if you had a csv with Dates and Temps, do this
#tempsDF = pd.read_csv("temps.csv", names=["Date","Temp"])
#tempsDF.set_index("Date", inplace=True)
#Using fake data since I don't have your csv
tempsDF = pd.DataFrame({'Temp': [0,13,14,13,8,7,5,0,14,16,16,0,0,0]})
#This is a bit clunky - I bet there's a more elegant way to do it
tempsDF["CumulativeFlag"] = 0
tempsDF.loc[tempsDF["Temp"]>12, "CumulativeFlag"] = 1
tempsDF.loc[tempsDF["CumulativeFlag"] > tempsDF["CumulativeFlag"].shift(), "HighTempGroup"] = list(range(1, len(tempsDF.loc[tempsDF["CumulativeFlag"] > tempsDF["CumulativeFlag"].shift()]) + 1))
tempsDF["HighTempGroup"].fillna(method='ffill', inplace=True)
tempsDF.loc[tempsDF["Temp"]<=12, "HighTempGroup"] = None
tempsDF["HighTempMean"] = tempsDF.groupby("HighTempGroup").transform(np.mean)["Temp"]

While loop on a dataframe column?

I have a small dataframe comprised of two columns, an ORG column and a percentage column. The dataframe is sorted largest to smallest based on the percentage column.
I'd like to create a while loop that adds up the values in the percentage column up until it hits a value of .80 (80%).
So far I've tried:
retail_pareto = 0
counter = 0
while retail_pareto < .80:
    retail_pareto += retailerDF[counter]['RETAILER_PCT_OF_CHANGE']
    counter += 1
This does not work; both counter and retail_pareto remain at zero, with no real error message to help me troubleshoot what I'm doing incorrectly. Ideally, I'd like to end up with a list of the orgs with the largest percentages that together add up to 80%.
I'm not exactly sure what to try next. I've searched these forums, but haven't found anything similar in the forums yet.
Any advice or help is much appreciated. Thank you.
Example Dataframe:
ORG PCT
KST 0.582561
ISL 0.290904
BOV 0.254456
BRH 0.10824
GNT 0.0913631
DSH 0.023441
RDM -0.0119665
JBL -0.0348893
JBD -0.071883
WEG -0.232227
The output that I would expect would be something along the lines of:
ORG PCT
KST 0.582561
ISL 0.290904
Use:
df_filtered = df.loc[df['PCT'].shift(fill_value=0).cumsum().le(0.80),:]
#if you don't want include where cumsum is greater than 0,80
#df_filtered = df.loc[df['PCT'].cumsum().le(0.80),:]
print(df_filtered)
ORG PCT
0 KST 0.582561
1 ISL 0.290904
Can you use this example to help you?
import pandas as pd
retail_pareto = 0
orgs = []
for i, row in retailerDF.iterrows():
    if retail_pareto <= .80:
        retail_pareto += row['RETAILER_PCT_OF_CHANGE']
        orgs.append(row)
    else:
        break
new_df = pd.DataFrame(orgs)
Edit: made it more like your example and added the new DataFrame.
Instead of your loop, take a more pandasonic approach.
Start with computing an additional column containing cumulative sum
of RETAILER_PCT_OF_CHANGE:
df['pct_cum'] = df.RETAILER_PCT_OF_CHANGE.cumsum()
For your data, the result is:
ORG RETAILER_PCT_OF_CHANGE pct_cum
0 KST 0.582561 0.582561
1 ISL 0.290904 0.873465
2 BOV 0.254456 1.127921
3 BRH 0.108240 1.236161
4 GNT 0.091363 1.327524
5 DSH 0.023441 1.350965
6 RDM -0.011967 1.338999
7 JBL -0.034889 1.304109
8 JBD -0.071883 1.232226
9 WEG -0.232227 0.999999
And now, to print the rows which together cover 80 % of the change, ending on the first row that crosses the limit, run:
df[df.pct_cum.shift(1).fillna(0) < 0.8]
The result, together with the cumulated sum, is:
ORG RETAILER_PCT_OF_CHANGE pct_cum
0 KST 0.582561 0.582561
1 ISL 0.290904 0.873465
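And if, as mentioned in the question, you just want the list of ORG values, you can chain a column selection onto that same filter (a small usage sketch):
orgs = df[df.pct_cum.shift(1).fillna(0) < 0.8]['ORG'].tolist()
# ['KST', 'ISL']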

Inexpensive way to add time series intensity in python pandas dataframe

I am trying to sum (and plot) a total from functions which change states at different times using Python's Pandas.DataFrame. For example:
Suppose we have 3 people whose states can be a) holding nothing, b) holding a 5 pound weight, and c) holding a 10 pound weight. Over time, these people pick weights up and put them down. I want to plot the total amount of weight being held. So, given:
My brute force attempt:
import pandas as ps
import math
import numpy as np
person1=[3,0,10,10,10,10,10]
person2=[4,0,20,20,25,25,40]
person3=[5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(zip(person1,person2,person3)).T)
allPeopleDf.columns=['count','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDfNoCount=allPeopleDf[['start1', 'end1', 'start2', 'end2', 'start3','end3']]
uniqueTimes=sorted(ps.unique(allPeopleDfNoCount.values.ravel()))
possibleStates=[-1,0,1,2] #extra state 0 for initialization
stateData={}
comboStates={}
#initialize dict to add up all of the stateData
for time in uniqueTimes:
    comboStates[time]=0.0
allPeopleDf['track']=-1
allPeopleDf['status']=-1
numberState=len(possibleStates)
starti=-1
endi=0
startState=0
for i in range(3):
    starti=starti+2
    print starti
    endi=endi+2
    for time in uniqueTimes:
        def helper(row):
            start=row[starti]
            end=row[endi]
            track=row[7]
            if start <= time and time < end:
                return possibleStates[i+1]
            else:
                return possibleStates[0]
        def trackHelp(row):
            status=row[8]
            track=row[7]
            if track<=status:
                return status
            else:
                return track
        def Multiplier(row):
            x=row[8]
            if x==0:
                return 0.0*row[0]
            if x==1:
                return 5.0*row[0]
            if x==2:
                return 10.0*row[0]
            if x==-1: #numeric place holder for non-contributing
                return 0.0*row[0]
        allPeopleDf['status']=allPeopleDf.apply(helper,axis=1)
        allPeopleDf['track']=allPeopleDf.apply(trackHelp,axis=1)
        stateData[time]=allPeopleDf.apply(Multiplier,axis=1).sum()
    for k,v in stateData.iteritems():
        comboStates[k]=comboStates.get(k,0)+v
print allPeopleDf
print stateData
print comboStates
Plots of weight being held over time might look like the following:
And the sum of the intensities over time might look like the black line in the following:
with the black line defined with the Cartesian points: (0,0 lbs),(5,0 lbs),(5,5 lbs),(15,5 lbs),(15,10 lbs),(20,10 lbs),(20,15 lbs),(25,15 lbs),(25,20 lbs),(40,20 lbs). However, I'm flexible and don't necessarily need to define the combined intensity line as a set of Cartesian points. The unique times can be found with:
print list(set(uniqueTimes).intersection(allNoCountT[1].values.ravel())).sort()
,but I can't come up with a slick way of getting the corresponding intensity values.
I started out with a very ugly function to break apart each "person's" graph so that all people had start and stop times (albeit many stop and start times without a state change) at the same time, and then I could add up all the "chunks" of time. This was cumbersome; there has to be a slick pandas way of handling this. If anyone can offer a suggestion or point me to another SO question like this that I might have missed, I'd appreciate the help!
In case my simplified example isn't clear, another might be plotting the intensity of sound coming from a piano: there are many notes being played for different durations with different intensities. I would like the sum of intensity coming from the piano over time. While my example is simplistic, I need a solution that is more on the scale of a piano song: thousands of discrete intensity levels per key, and many keys contributing over the course of a song.
Edit--Implementation of mgab's provided solution:
import pandas as ps
import math
import numpy as np
person1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
person2=['person2',4,0,20,20,25,25,40]
person3=['person3',5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(zip(person1,person2,person3)).T)
allPeopleDf.columns=['id','intensity','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDf=ps.melt(allPeopleDf,id_vars=['intensity','id'])
allPeopleDf.columns=['intensity','id','timeid','time']
df=ps.DataFrame(allPeopleDf).drop('timeid',1)
df[df.id=='person1'].drop('id',1) #easier to visualize one id for check
df['increment']=df.groupby('id')['intensity'].transform( lambda x: x.sub(x.shift(), fill_value= 0 ))
TypeError: unsupported operand type(s) for -: 'str' and 'int'
End Edit
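(A side note on the TypeError in the edit above, offered as a guess: building the frame via np.array(zip(...)) coerces every column to strings because the id strings are mixed in, so intensity arrives as text. Converting the numeric columns first should clear it:)
df['intensity'] = ps.to_numeric(df['intensity'])
df['time'] = ps.to_numeric(df['time'])
df['increment'] = df.groupby('id')['intensity'].transform(
    lambda x: x.sub(x.shift(), fill_value=0))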
Going for the piano keys example, let's assume you have three keys, with 30 levels of intensity.
I would try to keep the data in this format:
import pandas as pd
df = pd.DataFrame([[10,'A',5],
                   [10,'B',7],
                   [13,'C',10],
                   [15,'A',15],
                   [20,'A',7],
                   [23,'C',0]], columns=["time", "key", "intensity"])
time key intensity
0 10 A 5
1 10 B 7
2 13 C 10
3 15 A 15
4 20 A 7
5 23 C 0
where you record every change in intensity of any of the keys. From here you can already get the Cartesian coordinates for each individual key as (time,intensity) pairs
df[df.key=="A"].drop('key',1)
time intensity
0 10 5
3 15 15
4 20 7
Then, you can easily create a new column increment that will indicate the change in intensity that occurred for that key at that time point (intensity indicates just the new value of intensity)
df["increment"]=df.groupby("key")["intensity"].transform(
lambda x: x.sub(x.shift(), fill_value= 0 ))
df
time key intensity increment
0 10 A 5 5
1 10 B 7 7
2 13 C 10 10
3 15 A 15 10
4 20 A 7 -8
5 23 C 0 -10
And then, using this new column, you can generate the (time, total_intensity) pairs to use as Cartesian coordinates
df.groupby("time").sum()["increment"].cumsum()
time
10 12
13 22
15 32
20 24
23 14
dtype: int64
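From that cumulative series, a step plot reproduces the combined-intensity line described in the question (a sketch; assumes matplotlib):
import matplotlib.pyplot as plt

total = df.groupby("time")["increment"].sum().cumsum()
plt.step(total.index, total.values, where='post')   # intensity holds its value until the next change
plt.xlabel("time")
plt.ylabel("total intensity")
plt.show()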
EDIT: applying the specific data presented in question
Assuming the data comes as a list of values, starting with the element id (person/piano key), then a factor multiplying the measured weight/intensities for this element, and then pairs of time values indicating the start and end of a series of known states (weight being carried/intensity being emitted). Not sure if I got the data format right. From your question:
data1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
data2=['person2',4,0,20,20,25,25,40]
data3=['person3',5,0,5,5,15,15,40]
And if we know the weight/intensity of each one of the states, we can define:
known_states = [5, 10, 15]
DF_columns = ["time", "id", "intensity"]
Then, the easiest way I came up to load the data includes this function:
import pandas as pd
def read_data(data, states, columns):
    id = data[0]
    factor = data[1]
    reshaped_data = []
    for i in xrange(len(states)):
        j = 2 + 2*i
        if not data[j] == data[j+1]:
            reshaped_data.append([data[j], id, factor*states[i]])
            reshaped_data.append([data[j+1], id, -1*factor*states[i]])
    return pd.DataFrame(reshaped_data, columns=columns)
Notice that the if not data[j] == data[j+1]: avoids loading data to the dataframe when start and end times for a given state are equal (seems uninformative, and wouldn't appear in your plots anyway). But take it out if you still want these entries.
Then, you load the data:
df = read_data(data1, known_states, DF_columns)
df = df.append(read_data(data2, known_states, DF_columns), ignore_index=True)
df = df.append(read_data(data3, known_states, DF_columns), ignore_index=True)
# and so on...
And then you're right at the beginning of this answer (substituting 'key' by 'id' and the ids, of course)
Appears to be what .sum() is for:
In [10]:
allPeopleDf.sum()
Out[10]:
aStart 0
aEnd 35
bStart 35
bEnd 50
cStart 50
cEnd 90
dtype: int32
