I want to extract electricity consumption for Site 2
>>> df4 = pd.read_excel(xls, 'Elec Monthly Cons')
>>> df4
Site Unnamed: 1 2014-01-01 00:00:00 2014-02-01 00:00:00 2014-03-01 00:00:00 ... 2017-08-01 00:00:00 2017-09-01 00:00:00 2017-10-01 00:00:00 2017-11-01 00:00:00 2017-12-01 00:00:00
0 Site Profile JAN 2014 FEB 2014 MAR 2014 ... AUG 2017 SEP 2017 OCT 2017 NOV 2017 DEC 2017
1 Site 1 NHH 10344 NaN NaN ... NaN NaN NaN NaN NaN
2 Site 2 HH 258351 229513 239379 ... NaN NaN NaN NaN NaN
>>> type(df4)
<class 'pandas.core.frame.DataFrame'>
My goal is to extract the numerical values, but I do not know how to set the index properly. What I have tried so far does not work at all.
df1 = df.loc[idx[:,1:2],:]
But I get:
raise IndexingError('Too many indexers')
pandas.core.indexing.IndexingError: Too many indexers
It seems that I do not understand indexing. Does the series type play any role?
df.head
<bound method NDFrame.head of Site Site 2
Unnamed: 1 HH
EDIT
print (df.index)
Index([ 'Site', 'Unnamed: 1', 2014-01-01 00:00:00,
2014-02-01 00:00:00, 2014-03-01 00:00:00, 2014-04-01 00:00:00,
2014-05-01 00:00:00, 2014-06-01 00:00:00, 2014-07-01 00:00:00,
How to solve this?
In my opinion it is necessary to remove the :, because it means select all columns, but a Series has no columns.
Also it seems there is no MultiIndex, so you then need:
df1 = df.iloc[1:2]
There is a problem: the first 2 rows are headers, so for a MultiIndex DataFrame you need:
df4 = pd.read_excel(xls, 'Elec Monthly Cons', header=[0,1], index_col=[0,1])
And then to select, use:
idx = pd.IndexSlice
df1 = df.loc[:, idx[:,'FEB 2014':'MAR 2014']]
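A self-contained sketch of that selection, with a small made-up frame standing in for the Excel data (the labels are illustrative, not the real sheet):
import pandas as pd

# the two header rows become a column MultiIndex, as with header=[0,1]
cols = pd.MultiIndex.from_tuples([('2014-01-01', 'JAN 2014'),
                                  ('2014-02-01', 'FEB 2014'),
                                  ('2014-03-01', 'MAR 2014')])
df = pd.DataFrame([[10344, None, None],
                   [258351, 229513, 239379]],
                  index=pd.Index(['Site 1', 'Site 2'], name='Site'),
                  columns=cols)

idx = pd.IndexSlice
# a list of labels on the second column level needs no sorting,
# whereas a label slice would require a lexsorted column index
print(df.loc['Site 2', idx[:, ['FEB 2014', 'MAR 2014']]])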
This is my data:
df = pd.DataFrame([
    {'start_date': '2019/12/01', 'end_date': '2019/12/05', 'spend': 10000, 'campaign_id': 1},
    {'start_date': '2019/12/05', 'end_date': '2019/12/09', 'spend': 50000, 'campaign_id': 2},
    {'start_date': '2019/12/01', 'end_date': '', 'spend': 10000, 'campaign_id': 3},
    {'start_date': '2019/12/01', 'end_date': '2019/12/01', 'spend': 50, 'campaign_id': 4},
])
I need to add a column for each day since 2019/12/01 and calculate the spend on that campaign that day, which I'll get by dividing the campaign's spend by the total number of days it was active.
So here I'd add a column for each day between 1 December and today (10 December). For row 1, each of the five columns for 1 Dec to 5 Dec would contain 2000, and the five columns from 6 Dec to 10 Dec would contain zero.
I know pandas is well-designed for this kind of problem, but I have no idea where to start!
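To be concrete, for campaign 1 I'd expect spend divided by active days, something like this back-of-the-envelope sketch (the names are just illustrative):
import pandas as pd

start, end, spend = pd.Timestamp('2019-12-01'), pd.Timestamp('2019-12-05'), 10000
active_days = pd.date_range(start, end, freq='D')  # both ends inclusive: 5 days
per_day = spend / len(active_days)                 # 10000 / 5 == 2000.0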
Doesn't seem like a straightforward task to me. But first, convert your date columns if you haven't already:
df["start_date"] = pd.to_datetime(df["start_date"])
df["end_date"] = pd.to_datetime(df["end_date"])
Then create a helper function for resampling:
def resampler(data, func):
    # apply the per-campaign reindexing function, then turn the generated
    # DatetimeIndex back into a start_date column
    temp = (data.set_index('start_date').groupby('campaign_id')
                .apply(func)
                .drop("campaign_id", axis=1)
                .reset_index().rename(columns={"level_1": "start_date"}))
    return temp
Now it's a three-step process. First, resample your data according to the end_date of each group:
df1 = resampler(df, lambda d: d.reindex(pd.date_range(min(d.index),max(d["end_date"]),freq="D")) if d["end_date"].notnull().all() else d)
df1["spend"] = df1.groupby("campaign_id")["spend"].transform(lambda x: x.mean()/len(x))
With the average values calculated, resample again to current date:
dates = pd.date_range(min(df["start_date"]),pd.Timestamp.today(),freq="D")
df1 = resampler(df1,lambda d: d.reindex(dates))
Finally, reshape your dataframe so each date becomes a column:
df1 = pd.concat([df1.drop("end_date",axis=1).set_index(["campaign_id","start_date"]).unstack(),
df1.groupby("campaign_id")["end_date"].min()], axis=1)
df1.columns = [*dates,"end_date"]
print (df1)
2019-12-01 00:00:00 2019-12-02 00:00:00 2019-12-03 00:00:00 2019-12-04 00:00:00 2019-12-05 00:00:00 2019-12-06 00:00:00 2019-12-07 00:00:00 2019-12-08 00:00:00 2019-12-09 00:00:00 2019-12-10 00:00:00 end_date
campaign_id
1 2000.0 2000.0 2000.0 2000.0 2000.0 NaN NaN NaN NaN NaN 2019-12-05
2 NaN NaN NaN NaN 10000.0 10000.0 10000.0 10000.0 10000.0 NaN 2019-12-09
3 10000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
4 50.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2019-12-01
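If you want zeros instead of NaN for the inactive days, as described above, one extra line should do it:
df1[dates] = df1[dates].fillna(0)  # assumes df1 and dates from the previous steps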
I have a question regarding duplicating rows in a pandas dataframe. I have allocated the relevant dates for each observation as lists in the column "relevant shocks". Observation 22 has an empty list, 23 a list with one date, 24 a list with two dates and 25 a list with three dates (as seen in the column "listlength").
My aim is to expand the dataframe so that rows with empty lists remain as one row, while rows with x relevant dates appear x times. So rows 22 and 23 should each stay in the dataframe once (22 despite the empty list, 23 because it has one relevant date), row 24 should be duplicated once and thus appear twice, and row 25 should be duplicated twice and thus appear three times. In short, each row should appear as many times as it has relevant shocks (as measured by listlength), except rows with list length 0, which should still remain once.
Further, I want to create a new column "relevant shock" which is filled by each of the relevant shocks once and separately.
This is the current dataframe:
quarter year pddate relevant shocks listlength
22 1 2012 2012-02-15 [] 0.0
23 4 2011 2011-11-15 [2011-08-18 00:00:00] 1.0
24 3 2011 2011-08-15 [2011-08-18 00:00:00, 2011-09-22 00:00:00] 2.0
25 2 2011 2011-05-13 [2011-08-04 00:00:00, 2011-08-08 00:00:00, 2011-08-10 00:00:00] 3.0
The new dataframe should have 7 rows and look as follows:
quarter year pddate relevant shocks listlength relevant shock
22 1 2012 2012-02-15 [] 0.0
23 4 2011 2011-11-15 [2011-08-18 00:00:00] 1.0 2011-08-18 00:00:00
24 3 2011 2011-08-15 [2011-08-18 00:00:00, 2011-09-22 00:00:00] 2.0 2011-08-18 00:00:00
25 3 2011 2011-08-15 [2011-08-18 00:00:00, 2011-09-22 00:00:00] 2.0 2011-09-22 00:00:00
26 2 2011 2011-05-13 [2011-08-04 00:00:00, 2011-08-08 00:00:00, 2011-08-10 00:00:00] 3.0 2011-08-04 00:00:00
27 2 2011 2011-05-13 [2011-08-04 00:00:00, 2011-08-08 00:00:00, 2011-08-10 00:00:00] 3.0 2011-08-08 00:00:00
28 2 2011 2011-05-13 [2011-08-04 00:00:00, 2011-08-08 00:00:00, 2011-08-10 00:00:00] 3.0 2011-08-10 00:00:00
So the basic idea would be: add the new column "relevant shock" and go through each row. Keep a row unchanged if "relevant shocks" is an empty list. If it has one date, keep the row but fill "relevant shock" with that one list entry. If it has two entries, duplicate the row and fill "relevant shock" in each copy with one of the two entries, respectively, and so on.
Is this possible with Python?
EDIT: for pandas version >= 0.25, the new method explode does the job really easily:
#first create a copy of the column
df['relevant shock'] = df['relevant shocks']
#explode the new column
df = df.explode('relevant shock').fillna('')
print (df)
#same result as the one below
Old answer
From the column 'relevant shocks' you can use apply, pd.Series and stack to create a row for each date, such as:
df['relevant shocks'].apply(pd.Series).stack()
Out[448]:
23 0 2011-08-18 00:00:00
24 0 2011-08-18 00:00:00
1 2011-09-22 00:00:00
25 0 2011-08-04 00:00:00
1 2011-08-08 00:00:00
2 2011-08-10 00:00:00
dtype: object
I know the one with the empty list is missing, but it comes back after you join the result to your df, then reset_index, fillna and drop the extra column. With a df like this:
df = pd.DataFrame({'quarter':[1,2,3,4],
'relevant shocks':[[],['2011-08-18 00:00:00'],
['2011-08-18 00:00:00', '2011-09-22 00:00:00'],
['2011-08-04 00:00:00', '2011-08-08 00:00:00', '2011-08-10 00:00:00']]},
index=[22,23,24,25])
then you do:
df = (df.join(df['relevant shocks'].apply(pd.Series).stack()
.reset_index(1,name='relevant shock'))
.fillna('').drop('level_1',1))
and you get:
quarter relevant shocks \
22 1 []
23 2 [2011-08-18 00:00:00]
24 3 [2011-08-18 00:00:00, 2011-09-22 00:00:00]
24 3 [2011-08-18 00:00:00, 2011-09-22 00:00:00]
25 4 [2011-08-04 00:00:00, 2011-08-08 00:00:00, 201...
25 4 [2011-08-04 00:00:00, 2011-08-08 00:00:00, 201...
25 4 [2011-08-04 00:00:00, 2011-08-08 00:00:00, 201...
relevant shock
22
23 2011-08-18 00:00:00
24 2011-08-18 00:00:00
24 2011-09-22 00:00:00
25 2011-08-04 00:00:00
25 2011-08-08 00:00:00
25 2011-08-10 00:00:00
EDIT: it seems that with the real data an error occurred with empty lists, so to solve it, select only the non-empty lists and reset_index at the end:
df = (df.join(df.loc[df['relevant shocks'].str.len() > 0, 'relevant shocks']
.apply(pd.Series).stack().reset_index(1,name='relevant shock'))
.fillna('').drop('level_1',1).reset_index(drop=True))
Now you can use pandas.DataFrame.explode:
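A minimal sketch of that, assuming pandas >= 0.25 and a toy frame like the one above:
import pandas as pd

df = pd.DataFrame({'quarter': [1, 2, 3, 4],
                   'relevant shocks': [[], ['2011-08-18'],
                                       ['2011-08-18', '2011-09-22'],
                                       ['2011-08-04', '2011-08-08', '2011-08-10']]},
                  index=[22, 23, 24, 25])

df['relevant shock'] = df['relevant shocks']
# explode gives each list element its own row; an empty list yields one NaN row
df = df.explode('relevant shock').fillna('')
print(df)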
Given the following excerpt of a pandas dataframe:
df =
index date
7838 2012 January
7790 2012 January
7853 2015 September
7889 2016 March
7928 2015 October
7847 1999 January
7884 2006 January
7826 1992 January
Is there a simple (and pythonic) way to convert such free text into a standard datetime variable? Something like:
df =
index date
7838 2012-01-01
7790 2012-01-01
7853 2015-09-01
7889 2016-03-01
7928 2015-10-01
7847 1999-01-01
7884 2006-01-01
7826 1992-01-01
Use pd.to_datetime() to convert from text to datetime. You can glean the appropriate format codes from the strftime documentation.
df['date'] = pd.to_datetime(df['date'], format='%Y %B')
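Here %Y is the four-digit year and %B the full month name (standard strftime directives), so a quick sanity check:
pd.to_datetime('2012 January', format='%Y %B')  # Timestamp('2012-01-01 00:00:00')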
to_datetime handles this fine without any specific format specifier:
In [83]:
pd.to_datetime(df['date'])
Out[83]:
0 2012-01-01
1 2012-01-01
2 2015-09-01
3 2016-03-01
4 2015-10-01
5 1999-01-01
6 2006-01-01
7 1992-01-01
Name: date, dtype: datetime64[ns]
I have a series that looks like this
2014 7 2014-07-01 -0.045417
8 2014-08-01 -0.035876
9 2014-09-02 -0.030971
10 2014-10-01 -0.027471
11 2014-11-03 -0.032968
12 2014-12-01 -0.031110
2015 1 2015-01-02 -0.028906
2 2015-02-02 -0.035563
3 2015-03-02 -0.040338
4 2015-04-01 -0.032770
5 2015-05-01 -0.025762
6 2015-06-01 -0.019746
7 2015-07-01 -0.018541
8 2015-08-03 -0.028101
9 2015-09-01 -0.043237
10 2015-10-01 -0.053565
11 2015-11-02 -0.062630
12 2015-12-01 -0.064618
2016 1 2016-01-04 -0.064852
I want to be able to get the value from a date. Something like:
myseries.loc('2015-10-01') and it returns -0.053565
The index entries are tuples of the form (2016, 1, 2016-01-04)
You can do it like this:
In [32]:
df.loc(axis=0)[:,:,'2015-10-01']
Out[32]:
value
year month date
2015 10 2015-10-01 -0.053565
You can also pass a slice for each level:
In [39]:
df.loc[(slice(None),slice(None),'2015-10-01'),]
Out[39]:
value
year month date
2015 10 2015-10-01 -0.053565
Or just pass the first 2 index levels:
In [40]:
df.loc[2015,10]
Out[40]:
value
date
2015-10-01 -0.053565
Try xs:
print(s.xs('2015-10-01', level=2))
#year datetime
#2015 10 -0.053565
#Name: series, dtype: float64
print(s.xs(7, level=1))
#year datetime
#2014 2014-07-01 -0.045417
#2015 2015-07-01 -0.018541
#Name: series, dtype: float64
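If you only need the scalar out of the Series, a lookup with the full key should also work; a sketch assuming the third level holds strings:
s.loc[(2015, 10, '2015-10-01')]  # -0.053565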
Suppose I was trying to organize sales data for a membership business.
I only have the start and end dates. Ideally sales between the start and end dates appear as 1, instead of missing.
I can't get the 'date' column to be filled with the in-between dates; that is, I want a continuous set of months instead of gaps. I also need to fill the missing data in the other columns with ffill.
I have tried different ways such as stack/unstack and reindex but different errors occur. I'm guessing there's a clean way to do this. What's the best practice to do this?
Suppose the multiindexed data structure:
variable sales
vendor date
a 2014-01-01 start date 1
2014-03-01 end date 1
b 2014-03-01 start date 1
2014-07-01 end date 1
And the desired result
variable sales
vendor date
a 2014-01-01 start date 1
2014-02-01 NaN 1
2014-03-01 end date 1
b 2014-03-01 start date 1
2014-04-01 NaN 1
2014-05-01 NaN 1
2014-06-01 NaN 1
2014-07-01 end date 1
you can do:
>>> f = lambda d: d.resample('M').first()
>>> df.reset_index(level=0).groupby('vendor').apply(f).drop('vendor', axis=1)
variable sales
vendor date
a 2014-01-31 start date 1
2014-02-28 NaN NaN
2014-03-31 end date 1
b 2014-03-31 start date 1
2014-04-30 NaN NaN
2014-05-31 NaN NaN
2014-06-30 NaN NaN
2014-07-31 end date 1
and then just .fillna/.ffill on the sales column if needed.
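For that fill step, something along these lines should work (a sketch; grouping by vendor so values cannot leak between groups):
out = df.reset_index(level=0).groupby('vendor').apply(f).drop('vendor', axis=1)
out['sales'] = out.groupby(level='vendor')['sales'].ffill()  # forward-fill within each vendor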
I have a solution, but it's not really simple:
so, here's your DataFrame:
>>> df
sales date variable
vendor date
a 2014-01-01 1 start date
2014-01-03 1 end date
b 2014-01-03 1 start date
2014-01-07 1 end date
first, I want to create data for new MultiIndex:
>>> df2 = df.set_index('date variable', append=True).reset_index(level='date')['date']
>>> df2
vendor date variable
a start date 2014-01-01
end date 2014-01-03
b start date 2014-01-03
end date 2014-01-07
>>> df2 = df2.unstack()
>>> df2
date variable end date start date
vendor
a 2014-01-03 2014-01-01
b 2014-01-07 2014-01-03
now, create tuples for new MultiIndex:
>>> tuples = [(x[0], d) for x in df2.iterrows() for d in pd.date_range(x[1]['start date'], x[1]['end date'])]
>>> tuples
[('a', '2014-01-01'), ..., ('b', '2014-01-07')]
and create MultiIndex and reindex():
>>> mi = pd.MultiIndex.from_tuples(tuples,names=df.index.names)
>>> df.reindex(mi)
sales date variable
vendor date
a 2014-01-01 1 start date
2014-01-02 NaN NaN
2014-01-03 1 end date
b 2014-01-03 1 start date
2014-01-04 NaN NaN
2014-01-05 NaN NaN
2014-01-06 NaN NaN
2014-01-07 1 end date
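And as before, the sales gaps can then be forward-filled if needed; a minimal follow-up sketch (the "date variable" column is left as NaN on purpose):
res = df.reindex(mi)
res['sales'] = res.groupby(level='vendor')['sales'].ffill()  # interior days become 1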