I'm resampling from a multi-index dataframe containing seasonal data (with some years/seasons missing). I want to resample a random winter followed by a random spring followed by a random summer, but the method I'm using only samples a random season followed by a random season, even though I'm specifying which season to choose from. I can't see where I'm going wrong, so here's code to illustrate:
Take a multi-index dataframe from which to resample:
import pandas as pd
import numpy as np
dates = pd.date_range('20100101',periods=1825)
df = pd.DataFrame(data=np.random.randint(0,100,(1825,2)), columns =list('AB'))
df['date'] = dates
df = df[['date','A', 'B']]
#season function
def get_season(row):
    if row['date'].month >= 3 and row['date'].month <= 5:
        return '2'
    elif row['date'].month >= 6 and row['date'].month <= 8:
        return '3'
    elif row['date'].month >= 9 and row['date'].month <= 11:
        return '4'
    else:
        return '1'
#apply the season function to dataframe
df['Season'] = df.apply(get_season, axis=1)
#Year column for multi-index
df['Year'] = df['date'].dt.year
#season column for multi-index
df = df.set_index(['Year', 'Season'], inplace=False)
Re-index so it's missing some seasons (necessary to reproduce what I'm working with):
newindex = [(2010, '1'), (2011, '1'), (2011, '3'), (2012, '4'), (2013, '2'), (2015, '3')]
df = df.loc[newindex]
#recreate season and year
df['Season'] = df.apply(get_season, axis=1)
df['Year'] = df['date'].dt.year
Years variable to select range from:
years = df['date'].dt.year.unique()
Sample from the dataframe:
dfs = []
for i in range(100):
dfs.append(df.query("Year == %d and Season == '1'" %np.random.choice(years, 1)))
dfs.append(df.query("Year == %d and Season == '2'" %np.random.choice(years, 1)))
dfs.append(df.query("Year == %d and Season == '3'" %np.random.choice(years, 1)))
dfs.append(df.query("Year == %d and Season == '4'" %np.random.choice(years, 1)))
rnd = pd.concat(dfs)
This outputs a dataframe and samples seasons randomly, but even though I've told each query to choose from Season == '1', then Season == '2', then Season == '3', then Season == '4', it seems to be choosing randomly and not respecting the order of Winter, Spring, Summer, Autumn (1, 2, 3, 4). I've tried adding replace=True but this has no effect.
How can I adjust this so it selects a random Winter, followed by a random Spring, followed by a random Summer, then random Autumn?
Thanks
EDIT 1:
Changing the code so it only selects season and not year helps, but it now selects more than one winter (even though I'm specifying to choose only one):
dfs = []
for i in range(100):
dfs.append(df.query("Season == '1'" %np.random.choice(years, 1)))
dfs.append(df.query("Season == '2'" %np.random.choice(years, 1)))
dfs.append(df.query("Season == '3'" %np.random.choice(years, 1)))
dfs.append(df.query("Season == '4'" %np.random.choice(years, 1)))
rnd = pd.concat(dfs)
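For what it's worth, the immediate problem in the first version is that each query draws a random year without checking that the (year, season) combination exists; for missing combinations the query returns an empty frame, so seasons appear to be skipped at random. A minimal sketch of a direct fix (assuming df still has the Year and Season columns recreated above) samples only from the years that actually contain each season, preserving the Winter, Spring, Summer, Autumn order:
dfs = []
for i in range(100):
    for season in ['1', '2', '3', '4']:
        # only consider years in which this season actually exists
        valid_years = df.loc[df['Season'] == season, 'Year'].unique()
        year = np.random.choice(valid_years)
        dfs.append(df[(df['Year'] == year) & (df['Season'] == season)])
rnd = pd.concat(dfs)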
You could use .groupby() with pd.Grouper(freq='Q-NOV') (pd.TimeGrouper('Q-Nov') in older pandas) to produce your seasons, .sample() from each season, set a new index for each season sample and then .sort_index() accordingly:
Starting with your sample df, but setting a DatetimeIndex:
dates = pd.date_range('20100101', periods=1825)
df = pd.DataFrame(data=np.random.randint(0, 100, (1825, 2)), columns=list('AB'), index=dates)
DatetimeIndex: 1825 entries, 2010-01-01 to 2014-12-30
Freq: D
Data columns (total 2 columns):
A 1825 non-null int64
B 1825 non-null int64
This allows a groupby() with the quarter end shifted to November (values in December at the end of the series are assigned to the first season again). Take the max() of .month for each group, translate it via season_dict, and broadcast it back to the original df using .transform():
season_dict = {2: 1, 5: 2, 8: 3, 11: 4}
df['season'] = df.groupby(pd.Grouper(freq='Q-NOV')).A.transform(lambda x: season_dict.get(x.index.month.max(), 1))
Create year column and set season and year to index:
df['year'] = df.index.to_series().dt.year.astype(int)
df = df.reset_index().set_index(['year', 'season'])
Get unique (year, season) combinations from the index:
sample_seasons = df.reset_index().loc[:, ['year', 'season']].drop_duplicates()
Sample from the result, using .reset_index() to ensure you can sort after:
sample_seasons = sample_seasons.groupby('season').apply(lambda x: x.sample(frac=0.5).reset_index(drop=True))
sample_seasons = sample_seasons.reset_index(0, drop=True).sort_index()
Convert into a format that lets you select entire seasons from the MultiIndex later:
sample_seasons = list(sample_seasons.values)
sample_seasons = [tuple(s) for s in sample_seasons]
[(2011, 1), (2013, 2), (2011, 3), (2014, 4), (2014, 1), (2010, 2), (2010, 3), (2012, 4)]
sample = df.loc[sample_seasons]
which yields:
index A B
year season
2011 1 2011-01-01 33 64
1 2011-01-02 91 66
1 2011-01-03 37 47
1 2011-01-04 1 87
1 2011-01-05 68 47
1 2011-01-06 92 60
1 2011-01-07 81 7
1 2011-01-08 78 13
1 2011-01-09 31 67
1 2011-01-10 24 50
1 2011-01-11 71 55
1 2011-01-12 56 37
1 2011-01-13 25 87
1 2011-01-14 24 55
1 2011-01-15 29 97
1 2011-01-16 70 94
1 2011-01-17 18 37
1 2011-01-18 95 30
1 2011-01-19 58 87
1 2011-01-20 75 96
1 2011-01-21 52 63
1 2011-01-22 60 75
1 2011-01-23 39 58
1 2011-01-24 86 24
1 2011-01-25 61 21
1 2011-01-26 19 24
1 2011-01-27 5 71
1 2011-01-28 72 81
1 2011-01-29 0 45
1 2011-01-30 80 48
... ... .. ..
2012 4 2012-11-01 90 44
4 2012-11-02 43 53
4 2012-11-03 3 49
4 2012-11-04 38 7
4 2012-11-05 64 44
4 2012-11-06 82 44
4 2012-11-07 38 75
4 2012-11-08 7 96
4 2012-11-09 52 9
4 2012-11-10 32 64
4 2012-11-11 30 38
4 2012-11-12 91 70
4 2012-11-13 63 18
4 2012-11-14 77 29
4 2012-11-15 58 51
4 2012-11-16 90 17
4 2012-11-17 87 85
4 2012-11-18 64 79
4 2012-11-19 10 61
4 2012-11-20 76 52
4 2012-11-21 9 40
4 2012-11-22 15 28
4 2012-11-23 14 33
4 2012-11-24 24 74
4 2012-11-25 38 43
4 2012-11-26 27 87
4 2012-11-27 6 30
4 2012-11-28 91 3
4 2012-11-29 32 64
4 2012-11-30 0 28
Related
I have looked through many questions with that error, but I didn't find anything that helps with my problem.
I've got a DataFrame:
Errors.dtypes
Date object
Hour int64
Minute int64
Second int64
Machine object
Position object
ErrorVal object
Duration int64
dtype: object
and a list of lists:
list_of
[[datetime.date(2019, 1, 27), 'MAS1', 'OBS', '15'],
[datetime.date(2019, 1, 10), 'MAS1', 'OBS', '21'],
...
Now, I want to add a new column to Errors based on list_of: when a row's 'Date', 'Machine', 'Position', 'ErrorVal' values appear in list_of, the value of the new column 'AboveAv' is True, otherwise False. I tried this:
Errors['AboveAv'] = True if ([Errors['Date'], Errors['Machine'], Errors['Position'], Errors['ErrorVal']] in list_of) else False
But I get an error when I run it: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
How can I handle this? I just want a new column indicating whether the row appears in list_of.
Example:
DataFrame Error:
Date Hour Minute Second Machine Position ErrorVal Duration
1 2019-01-12 22 50 30 MAS1 POS 76 94
2 2019-01-14 3 13 21 MAS1 POS 76 87
3 2019-01-21 3 14 54 MAS1 POS 14 19
4 2019-01-22 3 59 57 MAS1 POS 76 87
5 2019-01-25 4 1 30 MAS1 POS 14 12
6 2019-01-27 11 15 28 MAS1 POS 76 63
list_of:
[[datetime.date(2019, 1, 21), 'MAS1', 'POS', '14'],
[datetime.date(2019, 1, 22), 'MAS1', 'POS', '76'],
[datetime.date(2019, 1, 27), 'MAS1', 'POS', '76']]
My new DataFrame:
Date Hour Minute Second Machine Position ErrorVal Duration AboveAv
1 2019-01-12 22 50 30 MAS1 POS 76 94 False
2 2019-01-14 3 13 21 MAS1 POS 76 87 False
3 2019-01-21 3 14 54 MAS1 POS 14 19 True
4 2019-01-22 3 59 57 MAS1 POS 76 87 True
5 2019-01-25 4 1 30 MAS1 POS 14 12 False
6 2019-01-27 11 15 28 MAS1 POS 76 63 True
You can make another DataFrame and merge them.
import datetime
list_of = [[datetime.date(2019, 1, 21), 'MAS1', 'POS', '14'],
           [datetime.date(2019, 1, 22), 'MAS1', 'POS', '76'],
           [datetime.date(2019, 1, 27), 'MAS1', 'POS', '76']]
df = pd.DataFrame(list_of, columns=['Date', 'Machine', 'Position', 'ErrorVal'])
df['AboveAv'] = True
Error = pd.merge(Error, df, on=['Date', 'Machine', 'Position', 'ErrorVal'], how='left')
Error = Error.fillna(False)
Results
Date Hour Minute Second Machine Position ErrorVal Duration \
0 2019-01-12 22 50 30 MAS1 POS 76 94
1 2019-01-14 3 13 21 MAS1 POS 76 87
2 2019-01-21 3 14 54 MAS1 POS 14 19
3 2019-01-22 3 59 57 MAS1 POS 76 87
4 2019-01-25 4 1 30 MAS1 POS 14 12
5 2019-01-27 11 15 28 MAS1 POS 76 63
AboveAv
0 False
1 False
2 True
3 True
4 False
5 True
Make sure the dtypes are the same or this will not work! Check with Error.info() and df.info() to get more specific results than Error.dtypes and df.dtypes.
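If you'd rather avoid the merge, a minimal alternative sketch (assuming the same Errors frame and list_of as above, with matching types between the frame and the list) is a row-wise membership test against a set of tuples:
keys = set(map(tuple, list_of))
Errors['AboveAv'] = [(d, m, p, e) in keys
                     for d, m, p, e in zip(Errors['Date'], Errors['Machine'],
                                           Errors['Position'], Errors['ErrorVal'])]
This sidesteps the ambiguous-truth-value error because each comparison happens per row on plain Python tuples, not on whole Series.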
I have a Pandas dataframe with a multiindex
A B
year age
1895 0 10 12
1895 1 13 14
...
1965 0 34 45
1965 1 41 34
...
1965 50 56 22
1966 0 10 34
...
I would like to sum column A (and column B) over all ages between two values (e.g. 10 and 20). I played around a bit with .xs, e.g.
pops.xs(20, level='age')
which gives the age-20 rows for each year, but I cannot get this for multiple ages (and summed).
E.g. for ages 0 and 1 I would like to get:
       A   B
year
1895  23  26
...
1965  75  79
...
Any suggestions for an elegant (efficient) way to do that?
Use query to select, then sum per the first level (year):
print (df)
A B
year age
1895 8 10 12
12 13 14
1965 0 34 45
14 41 34
12 56 22
1966 0 10 34
df = df.query('10 <= age <= 20').groupby(level=0).sum()
print (df)
A B
year
1895 13 14
1965 97 56
Detail:
print (df.query('10 <= age <= 20'))
A B
year age
1895 12 13 14
1965 14 41 34
12 56 22
Another solution is to use Index.get_level_values on the index and filter by boolean indexing:
i = df.index.get_level_values('age')
print (i)
Int64Index([8, 12, 0, 14, 12, 0], dtype='int64', name='age')
df = df[(i >= 10) & (i <= 20)].groupby(level=0).sum()
print (df)
A B
year
1895 13 14
1965 97 56
You can use loc with slice objects to select the part of the DF you want, such as:
df.loc[(slice(None), slice(10, 20)), :].groupby(level=0).sum()
where (slice(None), slice(10, 20)) keeps all years and all ages between 10 and 20 inclusive.
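An equivalent sketch with pd.IndexSlice, which reads a little more cleanly than nested slice() calls (label slicing on a MultiIndex level assumes a sorted index, hence the sort_index()):
idx = pd.IndexSlice
df.sort_index().loc[idx[:, 10:20], :].groupby(level=0).sum()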
I have a Pandas timeseries:
days = pd.DatetimeIndex([
    '2011-01-01T00:00:00.000000000',
    '2011-01-02T00:00:00.000000000',
    '2011-01-03T00:00:00.000000000',
    '2011-01-04T00:00:00.000000000',
    '2011-01-05T00:00:00.000000000',
    '2011-01-06T00:00:00.000000000',
    '2011-01-07T00:00:00.000000000',
    '2011-01-08T00:00:00.000000000',
    '2011-01-09T00:00:00.000000000',
    '2011-01-11T00:00:00.000000000',
    '2011-01-12T00:00:00.000000000',
    '2011-01-13T00:00:00.000000000',
    '2011-01-14T00:00:00.000000000',
    '2011-01-16T00:00:00.000000000',
    '2011-01-18T00:00:00.000000000',
    '2011-01-19T00:00:00.000000000',
    '2011-01-21T00:00:00.000000000',
])
counts = [85, 97, 24, 64, 3, 37, 73, 86, 87, 82, 75, 84, 43, 51, 42, 3, 70]
df = pd.DataFrame(counts,
                  index=days,
                  columns=['count'])
df['day of the week'] = df.index.dayofweek
And it looks like this:
count day of the week
2011-01-01 85 5
2011-01-02 97 6
2011-01-03 24 0
2011-01-04 64 1
2011-01-05 3 2
2011-01-06 37 3
2011-01-07 73 4
2011-01-08 86 5
2011-01-09 87 6
2011-01-11 82 1
2011-01-12 75 2
2011-01-13 84 3
2011-01-14 43 4
2011-01-16 51 6
2011-01-18 42 1
2011-01-19 3 2
2011-01-21 70 4
Notice that some days are missing; these should be filled with zeros. I want to convert this so it looks like a calendar: rows increase by week, columns are days of the week, and the values are the count for that particular day. The end result should look like:
0 1 2 3 4 5 6
0 0 0 0 0 0 85 97
1 24 64 3 37 73 86 87
2 0 82 75 84 0 0 51
3 0 42 3 0 70 0 0
# create week numbers from where the day-of-week counter rolls over
df['weeks'] = (df['day of the week'].diff() < 0).cumsum()
# pivot the table
df.pivot(index='weeks', columns='day of the week', values='count').fillna(0)
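Note that detecting week boundaries from the day-of-week counter rolling over can miscount if an entire week is missing. A more defensive sketch (same days and counts as above) reindexes to a complete daily range first, so missing days become explicit zeros before week numbers are computed:
full = df.reindex(pd.date_range(days.min(), days.max()), fill_value=0)
full['day of the week'] = full.index.dayofweek
full['weeks'] = (full['day of the week'].diff() < 0).cumsum()
full.pivot(index='weeks', columns='day of the week', values='count').fillna(0).astype(int)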
I am using Python/pandas and want to know how to get the week number in the year for a given day, with Saturday as the first day of the week.
I did search a lot, but every approach takes either Monday or Sunday as the first day of the week...
Please help... thanks.
Thanks all! I really appreciated all your quick answers, but I have to apologise that I didn't make my question clear.
I want to know the week number in the year. For example, 2015-08-09 is week 32 with Monday as the first day of the week, but week 33 with Saturday as the first day of the week.
Thanks @Cyphase and everyone; I changed Cyphase's code a bit and it works:
from datetime import date

def week_number(start_week_on, date_=None):
    assert 1 <= start_week_on <= 7  # Monday=1, Sunday=7
    if not date_:
        date_ = date.today()
    __, normal_current_week, normal_current_day = date_.isocalendar()
    print(date_, normal_current_week, normal_current_day)
    if normal_current_day >= start_week_on:
        week = normal_current_week + 1
    else:
        week = normal_current_week
    return week
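A quick check against the example date above (Saturday is 6 in this 1-7 scheme):
print(week_number(6, date(2015, 8, 9)))  # 2015-08-09 is ISO week 32, day 7, so this returns 33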
If I understand correctly the following does what you want:
In [101]:
import datetime as dt
import pandas as pd
df = pd.DataFrame({'date':pd.date_range(start=dt.datetime(2015,8,9), end=dt.datetime(2015,9,1))})
df['week'] = df['date'].dt.isocalendar().week.shift(-2).ffill()
df['orig week'] = df['date'].dt.isocalendar().week
df['day of week'] = df['date'].dt.dayofweek
df
Out[101]:
date week orig week day of week
0 2015-08-09 33 32 6
1 2015-08-10 33 33 0
2 2015-08-11 33 33 1
3 2015-08-12 33 33 2
4 2015-08-13 33 33 3
5 2015-08-14 33 33 4
6 2015-08-15 34 33 5
7 2015-08-16 34 33 6
8 2015-08-17 34 34 0
9 2015-08-18 34 34 1
10 2015-08-19 34 34 2
11 2015-08-20 34 34 3
12 2015-08-21 34 34 4
13 2015-08-22 35 34 5
14 2015-08-23 35 34 6
15 2015-08-24 35 35 0
16 2015-08-25 35 35 1
17 2015-08-26 35 35 2
18 2015-08-27 35 35 3
19 2015-08-28 35 35 4
20 2015-08-29 36 35 5
21 2015-08-30 36 35 6
22 2015-08-31 36 36 0
23 2015-09-01 36 36 1
The above takes the ISO week number (dt.isocalendar().week in current pandas; dt.week in older versions), shifts it up by 2 rows, and then forward fills the resulting NaN values. The 2-row shift works because the data is a complete daily series: with a Saturday week start, each date belongs to the ISO week of the date 2 days later, and 2 rows later is exactly 2 days later.
import datetime
datetime.date(2015, 8, 9).isocalendar()[1]
You could just do this:
from datetime import date
def week_number(start_week_on, date_=None):
    assert 0 <= start_week_on <= 6
    if not date_:
        date_ = date.today()
    __, normal_current_week, normal_current_day = date_.isocalendar()
    if normal_current_day >= start_week_on:
        week = normal_current_week
    else:
        week = normal_current_week - 1
    return week
print("Week starts We're in")
for start_week_on in range(7):
this_week = week_number(start_week_on)
print(" day {0} week {1}".format(start_week_on, this_week))
Output on day 4 (Thursday):
Week starts We're in
day 0 week 33
day 1 week 33
day 2 week 33
day 3 week 33
day 4 week 33
day 5 week 32
day 6 week 32
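In pandas itself, a vectorized sketch that matches the question's example (2015-08-09 -> week 33 with a Saturday start): since ISO weeks begin on Monday, 2 days after Saturday, each date belongs to the ISO week of the date 2 days later.
s = pd.Series(pd.to_datetime(['2015-08-07', '2015-08-08', '2015-08-09', '2015-08-10']))
weeks = (s + pd.Timedelta(days=2)).dt.isocalendar().week
# Fri 2015-08-07 -> 32, Sat 2015-08-08 -> 33, Sun 2015-08-09 -> 33, Mon 2015-08-10 -> 33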
If I have a dataframe with columns that share the same name, is there a way to combine such columns with some sort of function (e.g. sum)?
For instance with:
In [186]:
df["NY-WEB01"].head()
Out[186]:
NY-WEB01 NY-WEB01
DateTime
2012-10-18 16:00:00 5.6 2.8
2012-10-18 17:00:00 18.6 12.0
2012-10-18 18:00:00 18.4 12.0
2012-10-18 19:00:00 18.2 12.0
2012-10-18 20:00:00 19.2 12.0
How might I collapse the NY-WEB01 columns (there are a bunch of duplicate columns, not just NY-WEB01) by summing each row where the column name is the same?
I believe this does what you are after:
df.groupby(lambda x:x, axis=1).sum()
Alternatively, between 3% and 15% faster depending on the length of the df:
df.groupby(df.columns, axis=1).sum()
EDIT: To extend this beyond sums, use .agg() (short for .aggregate()):
df.groupby(df.columns, axis=1).agg(np.max)
pandas >= 0.20: df.groupby(level=0, axis=1)
You don't need a lambda here, nor do you explicitly have to query df.columns; groupby accepts a level argument you can specify in conjunction with the axis argument. This is cleaner, IMO.
# Setup
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('AABBB'))
df
A A B B B
0 44 47 0 3 3
1 39 9 19 21 36
2 23 6 24 24 12
3 1 38 39 23 46
4 24 17 37 25 13
df.groupby(level=0, axis=1).sum()
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
Handling MultiIndex columns
Another case to consider is when dealing with MultiIndex columns. Consider
df.columns = pd.MultiIndex.from_arrays([['one']*3 + ['two']*2, df.columns])
df
one two
A A B B B
0 44 47 0 3 3
1 39 9 19 21 36
2 23 6 24 24 12
3 1 38 39 23 46
4 24 17 37 25 13
To aggregate across the upper level (grouping only by the lower level), use
df.groupby(level=1, axis=1).sum()
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
or, to aggregate within each upper level while keeping it, use
df.groupby(level=[0, 1], axis=1).sum()
one two
A B B
0 91 0 6
1 48 19 57
2 29 24 36
3 39 39 69
4 41 37 38
Alternate Interpretation: Dropping Duplicate Columns
If you came here looking to find out how to simply drop duplicate columns (without performing any aggregation), use Index.duplicated:
df.loc[:,~df.columns.duplicated()]
A B
0 44 0
1 39 19
2 23 24
3 1 39
4 24 37
Or, to keep the last ones, specify keep='last' (default is 'first'),
df.loc[:,~df.columns.duplicated(keep='last')]
A B
0 47 3
1 9 36
2 6 12
3 38 46
4 17 13
The groupby alternatives for the two solutions above are df.groupby(level=0, axis=1).first(), and ... .last(), respectively.
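Spelled out, a minimal illustration (assuming the single-level 'AABBB' columns from the setup above, and no NaNs):
df.groupby(level=0, axis=1).first()  # keeps the first duplicate column's values
df.groupby(level=0, axis=1).last()   # keeps the last duplicate column's values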
Here is a possibly simpler solution for common aggregation functions like sum, mean, median, max, min and std: just pass axis=1 to work along columns, together with level (note that the level argument of these reducers was deprecated in pandas 1.3 in favor of an explicit groupby):
#coldspeed samples
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('AABBB'))
print (df)
print (df.sum(axis=1, level=0))
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
df.columns = pd.MultiIndex.from_arrays([['one']*3 + ['two']*2, df.columns])
print (df.sum(axis=1, level=1))
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
print (df.sum(axis=1, level=[0,1]))
one two
A B B
0 91 0 6
1 48 19 57
2 29 24 36
3 39 39 69
4 41 37 38
It works similarly for the index; just use axis=0 instead of axis=1:
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('ABCDE'), index=list('aabbc'))
print (df)
A B C D E
a 44 47 0 3 3
a 39 9 19 21 36
b 23 6 24 24 12
b 1 38 39 23 46
c 24 17 37 25 13
print (df.min(axis=0, level=0))
A B C D E
a 39 9 0 3 3
b 1 6 24 23 12
c 24 17 37 25 13
df.index = pd.MultiIndex.from_arrays([['bar']*3 + ['foo']*2, df.index])
print (df.mean(axis=0, level=1))
A B C D E
a 41.5 28.0 9.5 12.0 19.5
b 12.0 22.0 31.5 23.5 29.0
c 24.0 17.0 37.0 25.0 13.0
print (df.max(axis=0, level=[0,1]))
A B C D E
bar a 44 47 19 21 36
b 23 6 24 24 12
foo b 1 38 39 23 46
c 24 17 37 25 13
If you need other functions like first, last, size or count, it is necessary to use coldspeed's answer.
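For reference, in newer pandas (where the level argument of the reducers has been removed and axis=1 groupby is deprecated), a sketch of the equivalent column-wise aggregation transposes first:
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('AABBB'))
# old: df.sum(axis=1, level=0)
df.T.groupby(level=0).sum().T  # same result: duplicate column names summed together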