Melt and find average counts in a pandas dataframe - python

I have one pandas Dataframe like below:
import pandas as pd

df = pd.DataFrame({'name': ['AAA', 'BBB', 'CCC'],
                   '2017-01-06': ['3', '3', '4'],
                   '2017-01-13': ['2', '1', '5'],
                   '2017-01-20': ['1', '3', '4'],
                   '2017-01-27': ['8', '3', '5'],
                   'average_count': ['4', '3', '5']})
df = df.reindex(columns=['name', '2017-01-06', '2017-01-13', '2017-01-20', '2017-01-27', 'average_count'])
print(df)
name 2017-01-06 2017-01-13 2017-01-20 2017-01-27 average_count
0 AAA 3 2 1 8 4
1 BBB 3 1 3 3 3
2 CCC 4 5 4 5 5
I want one output dataframe with four columns: name, date, count, average_count.
The name column contains the names from the above dataframe.
The date column contains the four dates for each name.
The count column contains the count value for the respective date.
The average_count column contains the running average count value for each date.
For the first week of the month, the average count is (count of first week) / 1.
For the 2nd week, it is (count of first week + count of second week) / 2.
For the 3rd week, (count of first week + count of second week + count of third week) / 3.
For the 4th week, (count of first week + count of second week + count of third week + count of fourth week) / 4.
A month can contain at most five weeks (the five-week scenario needs to be handled as well).
Edit1: Average count value calculation
The average count value is rounded half up: a value of x.49 or less becomes x, and x.50 or more becomes x + 1 (e.g. 2.49 -> 2, 2.50 -> 3).
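Taken literally, the rule is an expanding (cumulative) mean over the weekly counts with half-up rounding. A minimal sketch of just that rule, using a hypothetical helper running_average that is not part of the question's code:
def running_average(counts):
    # hypothetical helper: expanding mean of weeks 1..k, rounded half up
    # (x.49 or less rounds down, x.50 or more rounds up)
    out = []
    for k in range(1, len(counts) + 1):
        avg = sum(counts[:k]) / float(k)
        out.append(int(avg + 0.5))  # half-up rounding for non-negative averages
    return out

print(running_average([3, 2, 1, 8]))  # -> [3, 3, 2, 4]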
Output Dataframe looks like below:
name date count average_count
0 AAA 2017-01-06 3 3
1 AAA 2017-01-13 2 2
2 AAA 2017-01-20 1 2
3 AAA 2017-01-27 8 4
4 BBB 2017-01-06 3 3
5 BBB 2017-01-13 1 2
6 BBB 2017-01-20 3 3
7 BBB 2017-01-27 3 3
8 CCC 2017-01-06 4 4
9 CCC 2017-01-13 5 5
10 CCC 2017-01-20 4 3
11 CCC 2017-01-27 5 5

You can stack the values and reset_index to get the dataframe of 4 columns, i.e.
import numpy as np

def round_next(x):
    # round half up: an exact .5 goes to the next integer, otherwise use np.round
    if x % 1 == 0.5:
        return x + 0.5
    else:
        return np.round(x)

ndf = df.set_index(['name', 'average_count']).stack().reset_index().rename(columns={'level_2': 'date', 0: 'count'})
ndf['date'] = pd.to_datetime(ndf['date'])
ndf['count'] = ndf['count'].astype(int)  # the counts are stored as strings

# Thank you @Zero. Since the dates are taken weekly, groupby().cumcount() + 1 gives the week number within each name.
# In case you have missing weeks, use the calendar week instead, i.e. ndf['date'].dt.week.
ndf['average_count'] = (ndf.groupby('name')['count'].cumsum() / (ndf.groupby('name')['count'].cumcount() + 1)).apply(round_next)
ndf
name average_count date count
0 AAA 3.0 2017-01-06 3
1 AAA 3.0 2017-01-13 2
2 AAA 2.0 2017-01-20 1
3 AAA 4.0 2017-01-27 8
4 BBB 3.0 2017-01-06 3
5 BBB 2.0 2017-01-13 1
6 BBB 2.0 2017-01-20 3
7 BBB 3.0 2017-01-27 3
8 CCC 4.0 2017-01-06 4
9 CCC 5.0 2017-01-13 5
10 CCC 4.0 2017-01-20 4
11 CCC 5.0 2017-01-27 5

Use df.melt, df.sort_values and df.reset_index for the first bit.
df2 = df.iloc[:, :-1].melt('name', var_name='date', value_name='count')\
        .sort_values('name').reset_index(drop=True)
# cleaning up OP's data
df2['count'] = pd.to_numeric(df2['count'])
df2['date'] = pd.to_datetime(df2.date)
df2
name date count
0 AAA 2017-01-06 3
1 AAA 2017-01-13 2
2 AAA 2017-01-20 1
3 AAA 2017-01-27 8
4 BBB 2017-01-06 3
5 BBB 2017-01-13 1
6 BBB 2017-01-20 3
7 BBB 2017-01-27 3
8 CCC 2017-01-06 4
9 CCC 2017-01-13 5
10 CCC 2017-01-20 4
11 CCC 2017-01-27 5
Now, you'll need to groupby name, get the cumsum of count and divide by the week number, which you can access by dt.week.
df2['average_count'] = np.round(df2.groupby('name')\
['count'].cumsum() / df2.date.dt.week).astype(int)
df2
name date count average_count
0 AAA 2017-01-06 3 3
1 AAA 2017-01-13 2 2
2 AAA 2017-01-20 1 2
3 AAA 2017-01-27 8 4
4 BBB 2017-01-06 3 3
5 BBB 2017-01-13 1 2
6 BBB 2017-01-20 3 2
7 BBB 2017-01-27 3 2
8 CCC 2017-01-06 4 4
9 CCC 2017-01-13 5 4
10 CCC 2017-01-20 4 4
11 CCC 2017-01-27 5 4
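Note that Series.dt.week has since been deprecated; on pandas 1.1+ the same week number is available via isocalendar(). A small adjustment of the line above, assuming df2 from the previous step:
# same computation with the non-deprecated week accessor
week_no = df2['date'].dt.isocalendar().week
df2['average_count'] = np.round(df2.groupby('name')['count'].cumsum() / week_no).astype(int)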

Related

"Rank" DataFrame columns per row

Given a time-series DataFrame, is it possible to create a new DataFrame with the same dimensions, where each value is replaced by its rank within its row compared to the other columns (smallest value ranked first)?
Example:
ABC DEFG HIJK XYZ
date
2018-01-14 0.110541 0.007615 0.063217 0.002543
2018-01-21 0.007012 0.042854 0.061271 0.007988
2018-01-28 0.085946 0.177466 0.046432 0.069297
2018-02-04 0.018278 0.065254 0.038972 0.027278
2018-02-11 0.071785 0.033603 0.075826 0.073270
The first row would become:
ABC DEFG HIJK XYZ
date
2018-01-14 4 2 3 1
as XYZ has the smallest value in that row and ABC the largest.
numpy.argsort looks like it might help; however, since it outputs the positions of the sorted values rather than the ranks themselves, I have not managed to get it to work.
Many thanks
Use a double argsort to rank within each row and pass the result to the DataFrame constructor:
df1 = pd.DataFrame(df.values.argsort().argsort() + 1, index=df.index, columns=df.columns)
print (df1)
ABC DEFG HIJK XYZ
date
2018-01-14 4 2 3 1
2018-01-21 1 3 4 2
2018-01-28 3 4 1 2
2018-02-04 1 4 3 2
2018-02-11 2 1 4 3
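The double argsort works because the first argsort returns, for each rank position, the column index holding that value, and argsorting that result again gives each column's rank; a quick check on the first row (values copied from the question):
import numpy as np

row = np.array([0.110541, 0.007615, 0.063217, 0.002543])
print(row.argsort())                 # [3 1 2 0] -> column positions in ascending order
print(row.argsort().argsort() + 1)   # [4 2 3 1] -> rank of each original column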
Or use DataFrame.rank with method='dense':
df1 = df.rank(axis=1, method='dense').astype(int)
print (df1)
ABC DEFG HIJK XYZ
date
2018-01-14 4 2 3 1
2018-01-21 1 3 4 2
2018-01-28 3 4 1 2
2018-02-04 1 4 3 2
2018-02-11 2 1 4 3
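Both approaches agree here because no row contains ties; with ties the rank methods differ, and the argsort approach always assigns distinct ranks by breaking ties on position. A small illustration, not taken from the question's data:
s = pd.Series([0.1, 0.1, 0.3])
print(s.rank(method='dense').astype(int).tolist())  # [1, 1, 2]
print(s.rank(method='min').astype(int).tolist())    # [1, 1, 3]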

to_datetime assemblage error due to extra keys

My pandas version is 0.23.4.
I tried to run this code:
df['date_time'] = pd.to_datetime(df[['year','month','day','hour_scheduled_departure','minute_scheduled_departure']])
and the following error appeared:
extra keys have been passed to the datetime assemblage: [hour_scheduled_departure, minute_scheduled_departure]
Any ideas of how to get the job done by pd.to_datetime?
@anky_91: the image shows an extract of the first 10 rows. First column [int32]: year; second column [int32]: month; third column [int32]: day; fourth column [object]: hour; fifth column [object]: minute. The object values are strings of length 2.
Another solution: build one string per row from the date parts and parse it. Joining the parts with '0' zero-pads the single-digit month/day/hour/minute values here (e.g. 2002, 7, 1, 5, 7 becomes '200207010507'):
>>pd.concat([df.A,pd.to_datetime(pd.Series(df[df.columns[1:]].fillna('').values.tolist(),name='Date').map(lambda x: '0'.join(map(str,x))))],axis=1)
A Date
0 a 2002-07-01 05:07:00
1 b 2002-08-03 03:08:00
2 c 2002-09-05 06:09:00
3 d 2002-04-07 09:04:00
4 e 2002-02-01 02:02:00
5 f 2002-03-05 04:03:00
For the example you added as an image (I have skipped the last 3 columns to save time), the month and day are zero-padded explicitly before joining:
>>df.month=df.month.map("{:02}".format)
>>df.day = df.day.map("{:02}".format)
>>pd.concat([df.A,pd.to_datetime(pd.Series(df[df.columns[1:]].fillna('').values.tolist(),name='Date').map(lambda x: ''.join(map(str,x))))],axis=1)
A Date
0 a 2015-01-01 00:05:00
1 b 2015-01-01 00:01:00
2 c 2015-01-01 00:02:00
3 d 2015-01-01 00:02:00
4 e 2015-01-01 00:25:00
5 f 2015-01-01 00:25:00
You can rename the columns so that pandas.to_datetime receives the component names it expects: year, month, day, hour, minute.
df = pd.DataFrame({
    'A': list('abcdef'),
    'year': [2002, 2002, 2002, 2002, 2002, 2002],
    'month': [7, 8, 9, 4, 2, 3],
    'day': [1, 3, 5, 7, 1, 5],
    'hour_scheduled_departure': [5, 3, 6, 9, 2, 4],
    'minute_scheduled_departure': [7, 8, 9, 4, 2, 3]
})
print (df)
A year month day hour_scheduled_departure minute_scheduled_departure
0 a 2002 7 1 5 7
1 b 2002 8 3 3 8
2 c 2002 9 5 6 9
3 d 2002 4 7 9 4
4 e 2002 2 1 2 2
5 f 2002 3 5 4 3
cols = ['year','month','day','hour_scheduled_departure','minute_scheduled_departure']
d = {'hour_scheduled_departure':'hour','minute_scheduled_departure':'minute'}
df['date_time'] = pd.to_datetime(df[cols].rename(columns=d))
#if necessary remove columns
df = df.drop(cols, axis=1)
print (df)
A date_time
0 a 2002-07-01 05:07:00
1 b 2002-08-03 03:08:00
2 c 2002-09-05 06:09:00
3 d 2002-04-07 09:04:00
4 e 2002-02-01 02:02:00
5 f 2002-03-05 04:03:00
Detail:
print (df[cols].rename(columns=d))
year month day hour minute
0 2002 7 1 5 7
1 2002 8 3 3 8
2 2002 9 5 6 9
3 2002 4 7 9 4
4 2002 2 1 2 2
5 2002 3 5 4 3

Keep similar rows pandas dataframe with maximum overlap

I have a dataframe which looks like this (example):
index ID time value
0 1 2h 10
1 1 2.15h 15
2 1 2.30h 5
3 1 2.45h 24
4 2 2.15h 6
5 2 2.30h 12
6 2 2.45h 18
7 3 2.15h 2
8 3 2.30h 1
I would like to keep only the rows whose time occurs for the maximum number of IDs, i.e. the times shared by all IDs.
So:
index ID time value
1 1 2.15h 15
2 1 2.30h 5
4 2 2.15h 6
5 2 2.30h 12
7 3 2.15h 2
8 3 2.30h 1
I know I could create a dataframe of unique times, merge each ID onto it separately, and then keep only the times for which every ID has a value, but this is quite impractical. I have looked but have not found a smarter way. Does someone have an idea how to make this more practical?
Use:
cols = df.groupby(['ID', 'time']).size().unstack().dropna(axis=1).columns
df = df[df['time'].isin(cols)]
print (df)
ID time value
1 1 2.15h 15
2 1 2.30h 5
4 2 2.15h 6
5 2 2.30h 12
7 3 2.15h 2
8 3 2.30h 1
Details:
First aggregate the DataFrame with groupby and size, then reshape with unstack - NaNs are created for the non-overlapping values:
print (df.groupby(['ID', 'time']).size().unstack())
time 2.15h 2.30h 2.45h 2h
ID
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 NaN
3 1.0 1.0 NaN NaN
Remove columns with dropna and get columns names:
print (df.groupby(['ID', 'time']).size().unstack().dropna(axis=1))
time 2.15h 2.30h
ID
1 1.0 1.0
2 1.0 1.0
3 1.0 1.0
And finally filter with isin and boolean indexing:
df = df[df['time'].isin(cols)]
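An alternative sketch (not from the answer above), starting from the original df: count the distinct IDs per time with transform('nunique') and keep the times seen for every ID:
# keep rows whose time occurs for every distinct ID
n_ids = df['ID'].nunique()
out = df[df.groupby('time')['ID'].transform('nunique') == n_ids]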

Pandas groupby() transform() max() with filter

I have a dataframe like this:
id date value
1 12/01/2016 5
1 25/02/2016 7
1 10/03/2017 13
2 02/04/2016 0
2 06/07/2016 1
2 12/03/2017 6
I'm looking to create a column called 'max_ever' for each unique value of 'id'
I can do: df['max_ever']=df.groupby(['id'])['value'].transform(max)
Which would give me:
id date value max_ever
1 12/01/2016 5 13
1 25/02/2016 7 13
1 10/03/2017 13 13
2 02/04/2016 0 6
2 06/07/2016 1 6
2 12/03/2017 6 6
But I would also like to add another column called 'max_12_months': the maximum value over the 12 months up to today, for each unique value of 'id'.
I could create a new dataframe with the dates filtered and repeat the above, but I'd like to filter and transform within this dataframe.
The final dataframe would look like this:
id date value max_ever max_12_months
1 12/01/2016 13 13 7
1 25/05/2016 7 13 7
1 10/03/2017 5 13 7
2 02/04/2016 6 6 2
2 06/07/2016 1 6 2
2 12/03/2017 2 6 2
Appreciate any help!
Custom agg function to be apply'd... Then join
today = pd.Timestamp.today().floor('D')
year_ago = today - pd.offsets.Day(366)

def max12(df):
    # max of 'value' within the last ~12 months
    return df.value.loc[df.date.between(year_ago, today)].max()

def aggf(df):
    return pd.Series(
        [df.value.max(), max12(df)],
        ['max_ever', 'max_12_months']
    )

df.join(df.groupby('id').apply(aggf), on='id')
id date value max_ever max_12_months
0 1 2016-01-12 13 13 7
1 1 2016-05-25 7 13 7
2 1 2017-03-10 5 13 7
3 2 2016-04-02 6 6 2
4 2 2016-07-06 1 6 2
5 2 2017-03-12 2 6 2
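A possible alternative sketch for max_12_months without the custom apply: mask the values outside the window, then broadcast the group max with transform (this assumes df.date is already datetime and uses the same ~366-day window as above):
# NaN-out values older than the window, then take the per-id max
cutoff = pd.Timestamp.today().floor('D') - pd.offsets.Day(366)
df['max_12_months'] = df['value'].where(df['date'] >= cutoff).groupby(df['id']).transform('max')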

How to indicate the multi index columns using read_sql_query (pandas dataframes)

I have a table with the following columns:
| Date | ProductId | SubProductId | Value |
I am trying to retrieve the data from that table and to put it in a pandas DataFrame.
I want the DataFrame to have the following structure:
index: dates
columns: products
sub-columns: sub-products
(products) 1 2 ...
(subproducts) 1 2 3 1 2 3 ...
date
2015-01-02 val val val ...
2015-01-03 val val val ...
2015-01-04 ...
2015-01-05
...
I already have dataframes with the products and the subproducts and the dates.
I understand that I need to use the MultiIndex, here is what I tried:
query ="SELECT Date, ProductId, SubProductId, Value " \
" FROM table "\
" WHERE SubProductId in (1,2,3)"\
" AND ProductId in (1,2,3)"\
" AND Date BETWEEN '2015-01-02' AND '2015-01-08' "\
" GROUP BY Date, ProductId, SubProductId, Value "\
" ORDER BY Date, ProductId, SubProductId "
df = pd.read_sql_query(query, conn, index_col=pd.MultiIndex.from_product([df_products['products'].tolist(), df_subproducts['subproducts'].tolist()]))
But it does not work, because the query returns a vector of values (shape: number of rows x 1), while I need a matrix (shape: number of distinct dates x (number of products * number of subproducts)) in the dataframe.
How can this be achieved:
directly via the read_sql_query call?
or by transforming the dataframe once the database values are loaded?
NB: I am using Microsoft SQL Server.
IIUC you can use the unstack() method:
df = pd.read_sql_query(query, conn, index_col=['Date','ProductID','SubProductId']) \
.unstack(['ProductID','SubProductId'])
Demo:
In [413]: df
Out[413]:
Date ProductID SubProductId Value
0 2015-01-02 1 1 11
1 2015-01-02 1 2 12
2 2015-01-02 1 3 13
3 2015-01-02 2 1 14
4 2015-01-02 2 2 15
5 2015-01-02 2 3 16
6 2015-01-03 1 1 17
7 2015-01-03 1 2 18
8 2015-01-03 1 3 19
9 2015-01-03 2 1 20
10 2015-01-03 2 2 21
In [414]: df.set_index(['Date','ProductID','SubProductId']).unstack(['ProductID','SubProductId'])
Out[414]:
Value
ProductID 1 2
SubProductId 1 2 3 1 2 3
Date
2015-01-02 11.0 12.0 13.0 14.0 15.0 16.0
2015-01-03 17.0 18.0 19.0 20.0 21.0 NaN
You can also use pivot_table
df.pivot_table('Value', 'Date', ['ProductId', 'SubProductId'])
demo
df = pd.DataFrame(dict(
    Date=pd.date_range('2017-03-31', periods=2).repeat(9),
    ProductId=[1, 1, 1, 2, 2, 2, 3, 3, 3] * 2,
    SubProductId=list('abc') * 6,
    Value=np.random.randint(10, size=18)
))
print(df)
Date ProductId SubProductId Value
0 2017-03-31 1 a 8
1 2017-03-31 1 b 2
2 2017-03-31 1 c 5
3 2017-03-31 2 a 4
4 2017-03-31 2 b 3
5 2017-03-31 2 c 2
6 2017-03-31 3 a 9
7 2017-03-31 3 b 3
8 2017-03-31 3 c 1
9 2017-04-01 1 a 3
10 2017-04-01 1 b 5
11 2017-04-01 1 c 7
12 2017-04-01 2 a 3
13 2017-04-01 2 b 6
14 2017-04-01 2 c 4
15 2017-04-01 3 a 5
16 2017-04-01 3 b 2
17 2017-04-01 3 c 0
df.pivot_table('Value', 'Date', ['ProductId', 'SubProductId'])
ProductId 1 2 3
SubProductId a b c a b c a b c
Date
2017-03-31 8 2 5 4 3 2 9 3 1
2017-04-01 3 5 7 3 6 4 5 2 0
