I have a pandas time series:
import pandas as pd

days = pd.DatetimeIndex([
'2011-01-01T00:00:00.000000000',
'2011-01-02T00:00:00.000000000',
'2011-01-03T00:00:00.000000000',
'2011-01-04T00:00:00.000000000',
'2011-01-05T00:00:00.000000000',
'2011-01-06T00:00:00.000000000',
'2011-01-07T00:00:00.000000000',
'2011-01-08T00:00:00.000000000',
'2011-01-09T00:00:00.000000000',
'2011-01-11T00:00:00.000000000',
'2011-01-12T00:00:00.000000000',
'2011-01-13T00:00:00.000000000',
'2011-01-14T00:00:00.000000000',
'2011-01-16T00:00:00.000000000',
'2011-01-18T00:00:00.000000000',
'2011-01-19T00:00:00.000000000',
'2011-01-21T00:00:00.000000000',
])
counts = [85, 97, 24, 64, 3, 37, 73, 86, 87, 82, 75, 84, 43, 51, 42, 3, 70]
df = pd.DataFrame(counts,
index=days,
columns=['count'],
)
df['day of the week'] = df.index.dayofweek
And it looks like this:
count day of the week
2011-01-01 85 5
2011-01-02 97 6
2011-01-03 24 0
2011-01-04 64 1
2011-01-05 3 2
2011-01-06 37 3
2011-01-07 73 4
2011-01-08 86 5
2011-01-09 87 6
2011-01-11 82 1
2011-01-12 75 2
2011-01-13 84 3
2011-01-14 43 4
2011-01-16 51 6
2011-01-18 42 1
2011-01-19 3 2
2011-01-21 70 4
Notice that some days are missing; these should be filled with zeros. I want to convert this into a calendar layout, where the rows are successive weeks, the columns are days of the week, and the values are the count for that particular day. So the end result should look like:
0 1 2 3 4 5 6
0 0 0 0 0 0 85 97
1 24 64 3 37 73 86 87
2 0 82 75 84 0 0 51
3 0 42 3 0 70 0 0
# assign a week number: each time the day-of-week value decreases,
# a new week has started
df['weeks'] = (df['day of the week'].diff() < 0).cumsum()
# pivot so weeks are rows and days of the week are columns
df.pivot(index='weeks', columns='day of the week', values='count').fillna(0)
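Note that the diff()-based week counter assumes every week has at least one observation. If an entire week could be missing, a variant (my own sketch, not part of the original answer) that derives the week number arithmetically from the dates themselves is more robust:
# week number = whole weeks elapsed since the Monday of the first date;
# this stays correct even when a full week has no observations
start = df.index[0] - pd.Timedelta(days=int(df.index[0].dayofweek))
df['weeks'] = (df.index - start).days // 7
df.pivot(index='weeks', columns='day of the week', values='count').fillna(0)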
I have the following dataframe:
import pandas as pd
array = {'test_ID': [10, 13, 10, 13, 16],
'test_date': ['2010-09-05', '2010-10-23', '2011-09-12', '2010-05-05', '2010-06-01'],
'Value1': [40, 56, 23, 78, 67],
'Value2': [25, 0, 68, 0, 0]}
df = pd.DataFrame(array)
df
test_ID test_date Value1 Value2
0 10 2010-09-05 40 25
1 13 2010-10-23 56 0
2 10 2011-09-12 23 68
3 13 2010-05-05 78 0
4 16 2010-06-01 67 0
I would like to delete column 'Value2' and fold it into column 'Value1' as extra rows, but only where Value2 != 0; as the expected output shows, those new rows should get test_ID 99.
The expected output is:
test_ID test_date Value1
0 10 2010-09-05 40
1 99 2010-09-05 25
2 13 2010-10-23 56
3 10 2011-09-12 23
4 99 2011-09-12 68
5 13 2010-05-05 78
6 16 2010-06-01 67
Use DataFrame.set_index with DataFrame.stack to reshape, remove the rows whose value is 0, then reset the index and set test_ID to 99 on the rows that came from Value2:
# stack Value1/Value2 into one column, keyed by (test_ID, test_date, column)
s = df.set_index(['test_ID','test_date']).stack()
# keep the non-zero values; the stacked column name lands in 'level_2'
df = s[s.ne(0)].reset_index(name='Value1')
# rows that originated from Value2 get test_ID 99
df['test_ID'] = df['test_ID'].mask(df.pop('level_2').eq('Value2'), 99)
print(df)
test_ID test_date Value1
0 10 2010-09-05 40
1 99 2010-09-05 25
2 13 2010-10-23 56
3 10 2011-09-12 23
4 99 2011-09-12 68
5 13 2010-05-05 78
6 16 2010-06-01 67
Another solution uses DataFrame.melt, removing the 0 rows with DataFrame.loc:
df = (df.melt(['test_ID','test_date'], value_name='Value1', ignore_index=False)
.assign(test_ID = lambda x: x['test_ID'].mask(x.pop('variable').eq('Value2'), 99))
.sort_index()
.loc[lambda x: x['Value1'].ne(0)]
.reset_index(drop=True))
print(df)
test_ID test_date Value1
0 10 2010-09-05 40
1 99 2010-09-05 25
2 13 2010-10-23 56
3 10 2011-09-12 23
4 99 2011-09-12 68
5 13 2010-05-05 78
6 16 2010-06-01 67
Here is a simple solution that filters on the non-zero values.
df = pd.DataFrame(array)
# take the rows where Value2 is non-zero (copy to avoid SettingWithCopyWarning)
filtered_rows = df.loc[df["Value2"] != 0].copy()
filtered_rows.loc[:, 'Value1'] = filtered_rows.loc[:, 'Value2']
filtered_rows.loc[:, 'test_ID'] = 99
# append the new rows next to their originals and drop Value2
df = pd.concat([df, filtered_rows]).sort_index().drop(['Value2'], axis=1)
This gives us the expected data (the index keeps duplicate labels from the original rows):
test_ID test_date Value1
0 10 2010-09-05 40
0 99 2010-09-05 25
1 13 2010-10-23 56
2 10 2011-09-12 23
2 99 2011-09-12 68
3 13 2010-05-05 78
4 16 2010-06-01 67
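If you also want the clean 0 to 6 index shown in the expected output, reset it at the end:
df = df.reset_index(drop=True)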
id numbers
1 {'105': 1, '65': 11, '75': 0, '85': 51, '95': 0}
2 {'105': 1, '65': 11, '75': 0, '85': 50, '95': 0}
3 {'105': 1, '65': 11, '75': 0, '85': 51, '95': 0}
4 {}
5 {}
6 {}
7 {'75 cm': 7, '85 cm': 52, '95 cm': 10}
8 {'75 cm': 51, '85 cm': 114, '95 cm': 10}
9 {'75 cm': 9, '85 cm': 60, '95 cm': 10}
This is the current table. I know how to turn the dict into columns and rows (keys as columns, values as rows), but what I am looking for is for each key and value to be rows with their own column headers.
test = pd.concat([df.drop(['numbers'], axis=1).sort_values(['id']),
df['numbers'].apply(pd.Series)], axis=1)
test2 = test.melt(id_vars=['id'],
var_name="name",
value_name="nameN").fillna(0)
I'm trying to get each key and value in the dictionary to be rows, like this:
id name nameN
1 105 1
1 65 11
1 75 0
1 85 51
1 95 0
You should use comprehensions to build the data for a new DataFrame. If you can simply drop the ids where numbers is an empty dictionary, you can do:
test = pd.DataFrame([[x['id'], k, v] for _, x in df.iterrows()
for k,v in x['numbers'].items()], columns=['id', 'name', 'nameN'])
to get:
id name nameN
0 1 105 1
1 1 65 11
2 1 75 0
3 1 85 51
4 1 95 0
5 2 105 1
6 2 65 11
7 2 75 0
8 2 85 50
9 2 95 0
10 3 105 1
11 3 65 11
12 3 75 0
13 3 85 51
14 3 95 0
15 7 75 cm 7
16 7 85 cm 52
17 7 95 cm 10
18 8 75 cm 51
19 8 85 cm 114
20 8 95 cm 10
21 9 75 cm 9
22 9 85 cm 60
23 9 95 cm 10
If you want a line with a specific value when numbers is empty:
test2 = pd.DataFrame([i for lst in [[[x['id'], '', '']] if x['numbers'] == {}
else [[x['id'], k, v] for k,v in x['numbers'].items()]
for _, x in df.iterrows()] for i in lst],
columns=['id', 'name', 'nameN']).sort_values('id').reset_index(drop=True)
giving:
id name nameN
0 1 105 1
1 1 65 11
2 1 75 0
3 1 85 51
4 1 95 0
5 2 105 1
6 2 65 11
7 2 75 0
8 2 85 50
9 2 95 0
10 3 95 0
11 3 75 0
12 3 85 51
13 3 105 1
14 3 65 11
15 4
16 5
17 6
18 7 75 cm 7
19 7 85 cm 52
20 7 95 cm 10
21 8 75 cm 51
22 8 85 cm 114
23 8 95 cm 10
24 9 85 cm 60
25 9 75 cm 9
26 9 95 cm 10
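A vectorized alternative (my own sketch, not from the answer above) maps each dict to a list of (key, value) pairs, substitutes a single blank pair for empty dicts, and uses DataFrame.explode:
# each dict becomes a list of (key, value) pairs; empty dicts become [('', '')]
pairs = df['numbers'].map(lambda d: list(d.items()) or [('', '')])
out = df[['id']].join(pairs.rename('items')).explode('items')
out['name'] = [k for k, _ in out['items']]
out['nameN'] = [v for _, v in out['items']]
out = out[['id', 'name', 'nameN']].reset_index(drop=True)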
dummy_df = pd.DataFrame({
'accnt' : [101, 102, 103, 104, 101, 102, 103, 104, 101, 102, 103, 104, 101, 102, 103, 104, 101, 102, 103, 104],
'value' : [10, 20, 30, 40, 5, 2, 6, 48, 22, 23, 24, 25, 18, 25, 26, 14, 78, 72, 54, 6],
'category' : [1,1,1,1,2,2,2,2,1,1,2,2,3,3,3,3,1,3,2,3]
})
dummy_df
accnt value category
101 10 1
102 20 1
103 30 1
104 40 1
101 5 2
102 2 2
103 6 2
104 48 2
101 22 1
102 23 1
103 24 2
104 25 2
101 18 3
102 25 3
103 26 3
104 14 3
101 78 1
102 72 3
103 54 2
104 6 3
I want to get a dataframe like below:
accnt sum_val_c1 count_c1 sum_val_c2 count_c2 sum_val_c3 count_c3
101 110 3 5 1 18 1
102 43 2 2 1 97 2
103 30 1 84 3 26 1
104 40 1 73 2 20 2
That is, count the occurrences of each category into count_c# and sum the values of that category into sum_val_c#, grouping by accnt. I have tried pivot() and groupby(), but I know I'm missing something.
Use groupby, agg, and unstack:
u = dummy_df.groupby(['accnt', 'category'])['value'].agg(['sum', 'count']).unstack(1)
u.columns = u.columns.map('{0[0]}_c{0[1]}'.format)
u
sum_c1 sum_c2 sum_c3 count_c1 count_c2 count_c3
accnt
101 110 5 18 3 1 1
102 43 2 97 2 1 2
103 30 84 26 1 3 1
104 40 73 20 1 2 2
Similarly, with pivot_table,
u = dummy_df.pivot_table(index=['accnt'],
                         columns='category',
                         values='value',
                         aggfunc=['sum', 'count'])
u.columns = u.columns.map('{0[0]}_c{0[1]}'.format)
u
sum_c1 sum_c2 sum_c3 count_c1 count_c2 count_c3
accnt
101 110 5 18 3 1 1
102 43 2 97 2 1 2
103 30 84 26 1 3 1
104 40 73 20 1 2 2
Pandas has a method to do that.
pivot2 = dummy_df.pivot_table(values='value', index='accnt', columns='category', aggfunc=['count', 'sum'])
That returns a dataframe like this:
count sum
category 1 2 3 1 2 3
accnt
101 3 1 1 110 5 18
102 2 1 2 43 2 97
103 1 3 1 30 84 26
104 1 2 2 40 73 20
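If you want the flat sum_val_c# / count_c# names from the question, the MultiIndex columns can be flattened afterwards (a small follow-up sketch):
# flatten ('sum', 1) -> 'sum_val_c1' and ('count', 1) -> 'count_c1'
pivot2.columns = [f'sum_val_c{cat}' if agg == 'sum' else f'count_c{cat}'
                  for agg, cat in pivot2.columns]
pivot2 = pivot2.reset_index()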
Imagine a pandas DataFrame like this
date id initial_value part_value
2016-01-21 1 100 10
2016-05-18 1 100 20
2016-03-15 2 150 75
2016-07-28 2 150 50
2016-08-30 2 150 25
2015-07-21 3 75 75
Generated with the following:
df = pd.DataFrame({
'id': (1, 1, 2, 2, 2, 3),
'date': tuple(pd.to_datetime(date) for date in
('2016-01-21', '2016-05-18', '2016-03-15', '2016-07-28', '2016-08-30', '2015-07-21')),
'initial_value': (100, 100, 150, 150, 150, 75),
'part_value': (10, 20, 75, 50, 25, 75)}).sort_values(['id', 'date'])
I wish to add a column with the remaining value, defined as initial_value minus the cumulative sum of part_value over the earlier dates of the same id. Hence my goal is:
date id initial_value part_value goal
2016-01-21 1 100 10 100
2016-05-18 1 100 20 90
2016-03-15 2 150 75 150
2016-07-28 2 150 50 75
2016-08-30 2 150 25 25
2015-07-21 3 75 75 75
I'm thinking that a solution can be made by combining the solution from here and here, but I can't exactly figure it out.
If the date values are not needed for the computation, combine add, sub and groupby with cumsum; adding back the current row's part_value means each row subtracts only the parts that came before it:
df['goal'] = df.initial_value.add(df.part_value).sub(df.groupby('id').part_value.cumsum())
print (df)
date id initial_value part_value goal
0 2016-01-21 1 100 10 100
1 2016-05-18 1 100 20 90
2 2016-03-15 2 150 75 150
3 2016-07-28 2 150 50 75
4 2016-08-30 2 150 25 25
5 2015-07-21 3 75 75 75
Which is the same as:
df['goal'] = df.initial_value + df.part_value - df.groupby('id').part_value.cumsum()
print (df)
date id initial_value part_value goal
0 2016-01-21 1 100 10 100
1 2016-05-18 1 100 20 90
2 2016-03-15 2 150 75 150
3 2016-07-28 2 150 50 75
4 2016-08-30 2 150 25 25
5 2015-07-21 3 75 75 75
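Equivalently, a sketch using transform with a shifted cumulative sum, so each row subtracts exactly the part values strictly before it:
# remaining value before each payment: subtract the cumsum of the prior rows
df['goal'] = df.initial_value - df.groupby('id').part_value.transform(
    lambda s: s.cumsum().shift(fill_value=0))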
I actually came up with a solution myself as well. I guess it is essentially the same computation:
df['goal'] = df.initial_value - (df.part_value.groupby(df.id).cumsum() - df.part_value)
df
date id initial_value part_value goal
0 2016-01-21 1 100 10 100
1 2016-05-18 1 100 20 90
2 2016-03-15 2 150 75 150
3 2016-07-28 2 150 50 75
4 2016-08-30 2 150 25 25
5 2015-07-21 3 75 75 75
I'm resampling from a multi-index dataframe containing seasonal data (with some years/seasons missing). I want to sample a random winter, followed by a random spring, followed by a random summer, and so on, but the method I'm using only gives one random season after another, even though I'm specifying which season to choose from. I can't see where I'm going wrong, so here's code to illustrate.
Take a multi-index dataframe from which to resample:
import pandas as pd
import numpy as np
dates = pd.date_range('20100101',periods=1825)
df = pd.DataFrame(data=np.random.randint(0,100,(1825,2)), columns =list('AB'))
df['date'] = dates
df = df[['date','A', 'B']]
#season function
def get_season(row):
if row['date'].month >= 3 and row['date'].month <= 5:
return '2'
elif row['date'].month >= 6 and row['date'].month <= 8:
return '3'
elif row['date'].month >= 9 and row['date'].month <= 11:
return '4'
else:
return '1'
#apply the season function to dataframe
df['Season'] = df.apply(get_season, axis=1)
#Year column for multi-index
df['Year'] = df['date'].dt.year
#season column for multi-index
df = df.set_index(['Year', 'Season'])
Re-index so it's missing some seasons (necessary to reproduce my situation):
newindex = [(2010, '1'), (2011, '1'), (2011, '3'), (2012, '4'), (2013, '2'), (2015, '3')]
df = df.loc[newindex]
#recreate season and year
df['Season'] = df.apply(get_season, axis=1)
df['Year'] = df['date'].dt.year
Years variable to select range from:
years = df['date'].dt.year.unique()
Sample from the dataframe:
dfs = []
for i in range(100):
dfs.append(df.query("Year == %d and Season == '1'" %np.random.choice(years, 1)))
dfs.append(df.query("Year == %d and Season == '2'" %np.random.choice(years, 1)))
dfs.append(df.query("Year == %d and Season == '3'" %np.random.choice(years, 1)))
dfs.append(df.query("Year == %d and Season == '4'" %np.random.choice(years, 1)))
rnd = pd.concat(dfs)
This outputs a dataframe and samples seasons, but even though each query asks for a specific season (Season == '1', then '2', '3', '4'), the result seems to be chosen randomly and doesn't respect the order Winter, Spring, Summer, Autumn (1, 2, 3, 4). I've tried adding replace=True but this has no effect.
How can I adjust this so it selects a random Winter, followed by a random Spring, followed by a random Summer, then random Autumn?
Thanks
EDIT 1:
Changing the code so it only selects on season and not year helps, but it now selects more than one winter (even though I'm specifying to choose only one):
dfs = []
for i in range(100):
dfs.append(df.query("Season == '1'" %np.random.choice(years, 1)))
dfs.append(df.query("Season == '2'" %np.random.choice(years, 1)))
dfs.append(df.query("Season == '3'" %np.random.choice(years, 1)))
dfs.append(df.query("Season == '4'" %np.random.choice(years, 1)))
rnd = pd.concat(dfs)
You could use .groupby() with pd.Grouper(freq='Q-NOV') to produce your seasons, .sample() from each season, set a new index for each season sample and then .sort_index() accordingly:
Starting with your sample df, but setting DateTimeIndex:
dates = pd.date_range('20100101', periods=1825)
df = pd.DataFrame(data=np.random.randint(0, 100, (1825, 2)), columns=list('AB'), index=dates)
DatetimeIndex: 1825 entries, 2010-01-01 to 2014-12-30
Freq: D
Data columns (total 2 columns):
A 1825 non-null int64
B 1825 non-null int64
This allows a groupby() with pd.Grouper(), shifting the quarter end to November (values in December at the end of the series are assigned to the first season again). The max() of .month in each group is translated via season_dict and written back onto the original df using .transform():
season_dict = {2: 1, 5: 2, 8: 3, 11: 4}
df['season'] = df.groupby(pd.Grouper(freq='Q-NOV')).A.transform(lambda x: season_dict.get(x.index.month.max(), 1))
Create year column and set season and year to index:
df['year'] = df.index.to_series().dt.year.astype(int)
df = df.reset_index().set_index(['year', 'season'])
Get unique (year, season) combinations from the index:
sample_seasons = df.reset_index().loc[:, ['year', 'season']].drop_duplicates()
Sample from the result, using .reset_index() to ensure you can sort after:
sample_seasons = sample_seasons.groupby('season').apply(lambda x: x.sample(frac=0.5).reset_index(drop=True))
sample_seasons = sample_seasons.reset_index(0, drop=True).sort_index()
Convert into a format that lets you select whole seasons from the MultiIndex later:
sample_seasons = list(sample_seasons.values)
sample_seasons = [tuple(s) for s in sample_seasons]
[(2011, 1), (2013, 2), (2011, 3), (2014, 4), (2014, 1), (2010, 2), (2010, 3), (2012, 4)]
sample = df.loc[sample_seasons]
which yields:
index A B
year season
2011 1 2011-01-01 33 64
1 2011-01-02 91 66
1 2011-01-03 37 47
1 2011-01-04 1 87
1 2011-01-05 68 47
1 2011-01-06 92 60
1 2011-01-07 81 7
1 2011-01-08 78 13
1 2011-01-09 31 67
1 2011-01-10 24 50
1 2011-01-11 71 55
1 2011-01-12 56 37
1 2011-01-13 25 87
1 2011-01-14 24 55
1 2011-01-15 29 97
1 2011-01-16 70 94
1 2011-01-17 18 37
1 2011-01-18 95 30
1 2011-01-19 58 87
1 2011-01-20 75 96
1 2011-01-21 52 63
1 2011-01-22 60 75
1 2011-01-23 39 58
1 2011-01-24 86 24
1 2011-01-25 61 21
1 2011-01-26 19 24
1 2011-01-27 5 71
1 2011-01-28 72 81
1 2011-01-29 0 45
1 2011-01-30 80 48
... ... .. ..
2012 4 2012-11-01 90 44
4 2012-11-02 43 53
4 2012-11-03 3 49
4 2012-11-04 38 7
4 2012-11-05 64 44
4 2012-11-06 82 44
4 2012-11-07 38 75
4 2012-11-08 7 96
4 2012-11-09 52 9
4 2012-11-10 32 64
4 2012-11-11 30 38
4 2012-11-12 91 70
4 2012-11-13 63 18
4 2012-11-14 77 29
4 2012-11-15 58 51
4 2012-11-16 90 17
4 2012-11-17 87 85
4 2012-11-18 64 79
4 2012-11-19 10 61
4 2012-11-20 76 52
4 2012-11-21 9 40
4 2012-11-22 15 28
4 2012-11-23 14 33
4 2012-11-24 24 74
4 2012-11-25 38 43
4 2012-11-26 27 87
4 2012-11-27 6 30
4 2012-11-28 91 3
4 2012-11-29 32 64
4 2012-11-30 0 28
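As a much simpler alternative (my addition, not part of the answer above), you can stay with the question's original df and its Year/Season columns, and draw each random year only from the years that actually contain that season. This is likely why the original loop appeared to ignore the order: a randomly chosen year may have no data for the requested season, so that query silently returns an empty frame.
# one random (year, season) per season, in order 1..4; restrict the
# random year to years where that season actually exists
dfs = []
for season in ['1', '2', '3', '4']:
    available = df.loc[df['Season'] == season, 'Year'].unique()
    if len(available):
        year = np.random.choice(available)
        dfs.append(df[(df['Year'] == year) & (df['Season'] == season)])
rnd = pd.concat(dfs)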