Frequency count of unique values in Pandas - python

I have a Pandas Series as follows:
2014-05-24 23:59:49 1.3
2014-05-24 23:59:50 2.17
2014-05-24 23:59:50 1.28
2014-05-24 23:59:51 1.30
2014-05-24 23:59:51 2.17
2014-05-24 23:59:53 2.17
2014-05-24 23:59:58 2.17
Name: api_id, Length: 483677
I'm trying to count the frequency of each id per day.
For now I'm doing this:
count = {}
for x in apis.unique():
    count[x] = apis[apis == x].resample('D', 'count')
count_df = pd.DataFrame(count)
That gives me what I want, which is:
... 2.13 2.17 2.4 2.6 2.7 3.5(user) 3.9 4.2 5.1 5.6
timestamp ...
2014-05-22 ... 391 49962 3727 161 2 444 113 90 1398 90
2014-05-23 ... 450 49918 3861 187 1 450 170 90 629 90
2014-05-24 ... 396 46359 3603 172 3 513 171 89 622 90
But is there a way to do this without the for loop?

You can use the value_counts function for this (docs), applying it after a groupby (which is similar to the resample('D') you did, but resample expects an aggregated output, so we have to use the more general groupby in this case). With a small example:
In [16]: s = pd.Series([1,1,2,2,1,2,5,6,2,5,4,1], index=pd.date_range('2012-01-01', periods=12, freq='8H'))
In [17]: counts = s.groupby(pd.Grouper(freq='D')).value_counts()
In [18]: counts
Out[18]:
2012-01-01  1    2
            2    1
2012-01-02  2    2
            1    1
2012-01-03  2    1
            6    1
            5    1
2012-01-04  1    1
            5    1
            4    1
dtype: int64
To get this in the desired format, you can just unstack this (move the second level row indices to the columns):
In [19]: counts.unstack()
Out[19]:
              1    2    4    5    6
2012-01-01    2    1  NaN  NaN  NaN
2012-01-02    1    2  NaN  NaN  NaN
2012-01-03  NaN    1  NaN    1    1
2012-01-04    1  NaN    1    1  NaN
Note: for the use of groupby(pd.Grouper(freq='D')) you need pandas 0.14. If you have an older version, you can use groupby(pd.TimeGrouper(freq='D')) to obtain exactly the same result. This is also similar to doing groupby(s.index.date) (with the difference that you then have datetime.date objects in the index).
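Applied to the series from the question (assuming it is named apis, as in the question's code, and has a DatetimeIndex), a minimal sketch would be:
count_df = apis.groupby(pd.Grouper(freq='D')).value_counts().unstack()
pd.crosstab gives an equivalent wide table in a single call (with zeros instead of NaN for day/id combinations that never occur):
count_df = pd.crosstab(apis.index.normalize(), apis)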

Related

PatsyError: numbers besides '0' and '1' are only allowed with ** doesn't resolve when using Q

I'm trying to run an ANOVA test on a dataframe that looks like this:
>>>code 2020-11-01 2020-11-02 2020-11-03 2020-11-04 ...
0 1 22.5 73.1 12.2 77.5
1 1 23.1 75.4 12.4 78.3
2 2 43.1 72.1 13.4 85.4
3 2 41.6 85.1 34.1 96.5
4 3 97.3 43.2 31.1 55.3
5 3 12.1 44.4 32.2 52.1
...
I want to calculate a one-way ANOVA for each column based on the code. I have used statsmodels and a for loop for that:
keys = []
tables = []
for variable in df.columns[1:]:
    model = ols('{} ~ code'.format(variable), data=df).fit()
    anova_table = sm.stats.anova_lm(model)
    keys.append(variable)
    tables.append(anova_table)
df_anova = pd.concat(tables, keys=keys, axis=0)
df_anova
The problem is that I keep getting an error on the 4th line:
PatsyError: numbers besides '0' and '1' are only allowed with **
2020-11-01 ~ code
^^^^
I have tried to use the Q argument as suggested here:
...
model = ols('{Q(x)} ~ code'.format(x=variable), data=df).fit()
KeyError: 'Q(x)'
I have also tried placing the Q outside the braces but got the same error.
My end goal: to calculate a one-way ANOVA for each day (each column) based on the "code" column.
You can try pivoting the data to long format and skip the iteration over columns:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
df = pd.DataFrame({"code": [1,1,2,2,3,3],
                   "2020-11-01": [22.5,23.1,43.1,41.6,97.3,12.1],
                   "2020-11-02": [73.1,75.4,72.1,85.1,43.2,44.4]})
df_long = df.melt(id_vars="code")
df_long
code variable value
0 1 2020-11-01 22.5
1 1 2020-11-01 23.1
2 2 2020-11-01 43.1
3 2 2020-11-01 41.6
4 3 2020-11-01 97.3
5 3 2020-11-01 12.1
6 1 2020-11-02 73.1
7 1 2020-11-02 75.4
8 2 2020-11-02 72.1
9 2 2020-11-02 85.1
10 3 2020-11-02 43.2
11 3 2020-11-02 44.4
Then applying your code:
tables = []
keys = df_long.variable.unique()
for D in keys:
    model = ols('value ~ code', data=df_long[df_long.variable == D]).fit()
    anova_table = sm.stats.anova_lm(model)
    tables.append(anova_table)
pd.concat(tables, keys=keys)
Or simply:
def aov_func(x):
    model = ols('value ~ code', data=x).fit()
    return sm.stats.anova_lm(model)
df_long.groupby("variable").apply(aov_func)
Gives this result:
                      df     sum_sq      mean_sq         F    PR(>F)
variable
2020-11-01 code      1.0  1017.6100  1017.610000  1.115768  0.350405
           Residual  4.0  3648.1050   912.026250       NaN       NaN
2020-11-02 code      1.0   927.2025   927.202500  6.194022  0.067573
           Residual  4.0   598.7725   149.693125       NaN       NaN
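As an aside, the original PatsyError can also be avoided without reshaping, by quoting the non-identifier column names with patsy's Q() inside the formula string. A minimal sketch, keeping the loop from the question:
model = ols('Q("{}") ~ code'.format(variable), data=df).fit()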

Pandas dataframe Groupby and retrieve date range

Here is the dataframe I am working on. There are two pay periods defined:
the first 15 days and the last 15 days of each month.
date employee_id hours_worked id job_group report_id
0 2016-11-14 2 7.50 385 B 43
1 2016-11-15 2 4.00 386 B 43
2 2016-11-30 2 4.00 387 B 43
3 2016-11-01 3 11.50 388 A 43
4 2016-11-15 3 6.00 389 A 43
5 2016-11-16 3 3.00 390 A 43
6 2016-11-30 3 6.00 391 A 43
I need to group by employee_id and job_group, but at the same time
I have to get the pay-period date range for each grouped row.
For example, the grouped results would look like the following:
Expected Output:
date employee_id hours_worked job_group report_id
1 2016-11-15 2 11.50 B 43
2 2016-11-30 2 4.00 B 43
4 2016-11-15 3 17.50 A 43
5 2016-11-16 3 9.00 A 43
Is this possible using pandas dataframe groupby?
Use the 'SM' (semi-month) frequency with Grouper, and finally add SemiMonthEnd:
df['date'] = pd.to_datetime(df['date'])
d = {'hours_worked':'sum','report_id':'first'}
df = (df.groupby(['employee_id','job_group',pd.Grouper(freq='SM',key='date', closed='right')])
        .agg(d)
        .reset_index())
df['date'] = df['date'] + pd.offsets.SemiMonthEnd(1)
print (df)
employee_id job_group date hours_worked report_id
0 2 B 2016-11-15 11.5 43
1 2 B 2016-11-30 4.0 43
2 3 A 2016-11-15 17.5 43
3 3 A 2016-11-30 9.0 43
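The SemiMonthEnd addition in the last step is what rolls each semi-month group label forward to the pay-period end date; SemiMonthEnd snaps a timestamp to the next 15th or month end, for example (a small sketch):
pd.Timestamp('2016-10-31') + pd.offsets.SemiMonthEnd(1)   # Timestamp('2016-11-15')
pd.Timestamp('2016-11-15') + pd.offsets.SemiMonthEnd(1)   # Timestamp('2016-11-30')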
a. First, (for each employee_id) use multiple Grouper with .sum() on the hours_worked column. Second, use DateOffset to get a bi-weekly date column. After these two steps, I assigned the date in the grouped DF based on 2 brackets (date ranges): if the day of month (from the date column) is <= 15, I set the day in date to 15, otherwise I set the day to the month end. This day is then used to assemble a new date. I calculated the month-end day based on 1, 2.
b. (For each employee_id) get the .last() record for the job_group and report_id columns
c. merge a. and b. on the employee_id key
# a.
hours = (df.groupby([
             pd.Grouper(key='employee_id'),
             pd.Grouper(key='date', freq='SM')
         ])['hours_worked']
         .sum()
         .reset_index())
hours['date'] = pd.to_datetime(hours['date'])
hours['date'] = hours['date'] + pd.DateOffset(days=14)
# Assign day based on bracket (date range) 0-15 or bracket (date range) >15
from pandas.tseries.offsets import MonthEnd
hours['bracket'] = hours['date'] + MonthEnd(0)
hours['bracket'] = pd.to_datetime(hours['bracket']).dt.day
hours.loc[hours['date'].dt.day <= 15, 'bracket'] = 15
hours['date'] = pd.to_datetime(dict(year=hours['date'].dt.year,
                                    month=hours['date'].dt.month,
                                    day=hours['bracket']))
hours.drop('bracket', axis=1, inplace=True)
# b.
others = (df.groupby('employee_id')[['job_group', 'report_id']]
            .last()
            .reset_index())
# c.
merged = hours.merge(others, how='inner', on='employee_id')
Raw data for employee_id==1 and employee_id==3
df.sort_values(by=['employee_id','date'], inplace=True)
print(df[df.employee_id.isin([1,3])])
index date employee_id hours_worked id job_group report_id
0 0 2016-11-14 1 7.5 481 A 43
10 10 2016-11-21 1 6.0 491 A 43
11 11 2016-11-22 1 5.0 492 A 43
15 15 2016-12-14 1 7.5 496 A 43
25 25 2016-12-21 1 6.0 506 A 43
26 26 2016-12-22 1 5.0 507 A 43
6 6 2016-11-02 3 6.0 487 A 43
4 4 2016-11-08 3 6.0 485 A 43
3 3 2016-11-09 3 11.5 484 A 43
5 5 2016-11-11 3 3.0 486 A 43
20 20 2016-11-12 3 3.0 501 A 43
21 21 2016-12-02 3 6.0 502 A 43
19 19 2016-12-08 3 6.0 500 A 43
18 18 2016-12-09 3 11.5 499 A 43
Output
print(merged)
employee_id date hours_worked job_group report_id
0 1 2016-11-15 7.5 A 43
1 1 2016-11-30 11.0 A 43
2 1 2016-12-15 7.5 A 43
3 1 2016-12-31 11.0 A 43
4 2 2016-11-15 31.0 B 43
5 2 2016-12-15 31.0 B 43
6 3 2016-11-15 29.5 A 43
7 3 2016-12-15 23.5 A 43
8 4 2015-03-15 5.0 B 43
9 4 2016-02-29 5.0 B 43
10 4 2016-11-15 5.0 B 43
11 4 2016-11-30 15.0 B 43
12 4 2016-12-15 5.0 B 43
13 4 2016-12-31 15.0 B 43

Improve Performance of Apply Method

I would like to group by the "cod_id" variable of my df and then apply this function:
[df.loc[df['dt_op'].between(d, d + pd.Timedelta(days = 7)), 'quantity'].sum() \
for d in df['dt_op']]
Moving from this df:
print(df)
dt_op quantity cod_id
20/01/18 1 613
21/01/18 8 611
21/01/18 1 613
...
To this one:
print(final_df)
n = 7
dt_op quantity product_code Final_Quantity
20/01/18 1 613 2
21/01/18 8 611 8
25/01/18 1 613 1
...
I tried with:
def lookforward(x):
    L = [x.loc[x['dt_op'].between(row.dt_op, row.dt_op + pd.Timedelta(days=7)),
               'quantity'].sum() for row in x.itertuples(index=False)]
    return pd.Series(L, index=x.index)
s = df.groupby('cod_id').apply(lookforward)
s.index = s.index.droplevel(0)
df['Final_Quantity'] = s
print(df)
dt_op quantity cod_id Final_Quantity
0 2018-01-20 1 613 2
1 2018-01-21 8 611 8
2 2018-01-21 1 613 1
But it is not an efficient solution, since it is computationally slow.
How can I improve its performance?
I would accept even new code/a new function that leads to the same result.
EDIT:
A subset of the original dataset, with just one product (cod_id == 2), on which I tried to run the code provided by "w-m":
print(df)
cod_id dt_op quantita final_sum
0 2 2017-01-03 1 54.0
1 2 2017-01-04 1 53.0
2 2 2017-01-13 1 52.0
3 2 2017-01-23 2 51.0
4 2 2017-01-26 1 49.0
5 2 2017-02-03 1 48.0
6 2 2017-02-27 1 47.0
7 2 2017-03-05 1 46.0
8 2 2017-03-15 1 45.0
9 2 2017-03-23 1 44.0
10 2 2017-03-27 2 43.0
11 2 2017-03-31 3 41.0
12 2 2017-04-04 1 38.0
13 2 2017-04-05 1 37.0
14 2 2017-04-15 2 36.0
15 2 2017-04-27 2 34.0
16 2 2017-04-30 1 32.0
17 2 2017-05-16 1 31.0
18 2 2017-05-18 1 30.0
19 2 2017-05-19 1 29.0
20 2 2017-06-03 1 28.0
21 2 2017-06-04 1 27.0
22 2 2017-06-07 1 26.0
23 2 2017-06-13 2 25.0
24 2 2017-06-14 1 23.0
25 2 2017-06-20 1 22.0
26 2 2017-06-22 2 21.0
27 2 2017-06-28 1 19.0
28 2 2017-06-30 1 18.0
29 2 2017-07-03 1 17.0
30 2 2017-07-06 2 16.0
31 2 2017-07-07 1 14.0
32 2 2017-07-13 1 13.0
33 2 2017-07-20 1 12.0
34 2 2017-07-28 1 11.0
35 2 2017-08-06 1 10.0
36 2 2017-08-07 1 9.0
37 2 2017-08-24 1 8.0
38 2 2017-09-06 1 7.0
39 2 2017-09-16 2 6.0
40 2 2017-09-20 1 4.0
41 2 2017-10-07 1 3.0
42 2 2017-11-04 1 2.0
43 2 2017-12-07 1 1.0
Edit 181017: this approach doesn't work, because forward-rolling functions on sparse time series are not currently supported by pandas; see the comments.
Using for loops can be a performance killer when doing pandas operations.
The for loop around the rows plus their timedelta of 7 days can be replaced with a .rolling("7D"). To get a forward-rolling time delta (current date + 7 days), we reverse the df by date, as shown here.
Then no custom function is required anymore, and you can just take .quantity.sum() from the groupby.
quant_sum = df.sort_values("dt_op", ascending=False).groupby("cod_id") \
              .rolling("7D", on="dt_op").quantity.sum()
cod_id  dt_op
611     2018-01-21    8.0
613     2018-01-21    1.0
        2018-01-20    2.0
Name: quantity, dtype: float64
result = df.set_index(["cod_id", "dt_op"])
result["final_sum"] = quant_sum
result.reset_index()
cod_id dt_op quantity final_sum
0 613 2018-01-20 1 2.0
1 611 2018-01-21 8 8.0
2 613 2018-01-21 1 1.0
Implementing the exact behavior from the question is difficult due to two shortcomings in pandas: neither groupby/rolling/transform nor forward-looking rolling on sparse dates is implemented (see the other answer for more details).
This answer attempts to work around both by resampling the data, filling in all days, and then joining the quant_sums back with the original data.
# Create a temporary df with all in between days filled in with zeros
filled = df.set_index("dt_op").groupby("cod_id") \
           .resample("D").asfreq().fillna(0) \
           .quantity.to_frame()

# Reverse and sum
filled["quant_sum"] = filled.reset_index().set_index("dt_op") \
                            .iloc[::-1] \
                            .groupby("cod_id") \
                            .rolling(7, min_periods=1) \
                            .quantity.sum().astype(int)

# Join with original `df`, dropping the filled days
result = df.set_index(["cod_id", "dt_op"]).join(filled.quant_sum).reset_index()
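As a quick sanity check, the three steps above can be run on the small example from the question (a sketch, assuming the same column names and a datetime dtype for dt_op):
import pandas as pd
df = pd.DataFrame({"dt_op": pd.to_datetime(["2018-01-20", "2018-01-21", "2018-01-21"]),
                   "quantity": [1, 8, 1],
                   "cod_id": [613, 611, 613]})
# After the three steps, `result` should hold quant_sum values of 2, 8 and 1
# for the three rows, matching the Final_Quantity column the question asks for.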

Parsing week of year to datetime objects with pandas

A B C D yearweek
0 245 95 60 30 2014-48
1 245 15 70 25 2014-49
2 150 275 385 175 2014-50
3 100 260 170 335 2014-51
4 580 925 535 2590 2015-02
5 630 126 485 2115 2015-03
6 425 90 905 1085 2015-04
7 210 670 655 945 2015-05
The last column contains the year along with the week number. Is it possible to convert this to a datetime column with pd.to_datetime?
I've tried:
pd.to_datetime(df.yearweek, format='%Y-%U')
0 2014-01-01
1 2014-01-01
2 2014-01-01
3 2014-01-01
4 2015-01-01
5 2015-01-01
6 2015-01-01
7 2015-01-01
Name: yearweek, dtype: datetime64[ns]
But that output is incorrect, even though I believe %U is the format code for the week number. What am I missing here?
You need another directive to specify the day - check this:
df = pd.to_datetime(df.yearweek.add('-0'), format='%Y-%W-%w')
print (df)
0 2014-12-07
1 2014-12-14
2 2014-12-21
3 2014-12-28
4 2015-01-18
5 2015-01-25
6 2015-02-01
7 2015-02-08
Name: yearweek, dtype: datetime64[ns]
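As an aside, %W counts weeks from the first Monday of the year; if the week numbers in the data are ISO 8601 week numbers, the ISO directives %G (ISO year), %V (ISO week) and %u (ISO weekday, Monday=1) can be used instead, and they have to appear together. A sketch, assuming ISO week numbering (these directives need Python 3.6+ and a reasonably recent pandas):
pd.to_datetime(df.yearweek.add('-1'), format='%G-%V-%u')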

Dataframe Merge in Pandas

For some reason, I cannot get this merge to work correctly.
This Dataframe (rspars) has 2,000+ rows...
rsparid f1mult f2mult f3mult
0 1 0.318 0.636 0.810
1 2 0.348 0.703 0.893
2 3 0.384 0.777 0.000
3 4 0.296 0.590 0.911
4 5 0.231 0.458 0.690
5 6 0.275 0.546 0.839
6 7 0.248 0.486 0.731
7 8 0.430 0.873 0.000
8 9 0.221 0.438 0.655
9 11 0.204 0.399 0.593
When trying to join the above to a table based on the rsparid columns to this Dataframe...
line_track line_race rsparid
line_date
2013-03-23 TP 10 1400
2013-02-23 GP 7 634
2013-01-01 GP 7 1508
2012-11-11 AQU 5 96
2012-10-11 BEL 2 161
Using this...
df = pd.merge(datalines, rspars, how='left', on='rsparid')
I get blanks..
line_track line_race rsparid f1mult f2mult f3mult
0 TP 10 1400 NaN NaN NaN
1 TP 10 1400 NaN NaN NaN
2 TP 10 1400 NaN NaN NaN
3 GP 7 634 NaN NaN NaN
4 GP 10 634 NaN NaN NaN
Note, the "datalines" column can have thousands more rows than the rspars, thus the left join. I must be doing something wrong?
I also tried it this way...
df = datalines.merge(rspars, how='left', on='rsparid')
EXAMPLE #2
I dropped the data down to a few rows...
rspars:
rsparid f1mult f2mult f3mult
0 1400 0.216 0.435 0.656
datalines:
rsparid
0 1400
1 634
2 1508
3 96
4 161
5 1011
6 1007
7 518
8 1955
9 678
Merging...
datalines.merge(rspars, how='left', on='rsparid')
Output...
rsparid f1mult f2mult f3mult
0 1400 NaN NaN NaN
1 634 NaN NaN NaN
2 1508 NaN NaN NaN
3 96 NaN NaN NaN
4 161 NaN NaN NaN
5 1011 NaN NaN NaN
6 1007 NaN NaN NaN
7 518 NaN NaN NaN
8 1955 NaN NaN NaN
9 678 NaN NaN NaN
The NaNs mean the two frames have no rsparid values in common. This can be tricky when merging things that may look the same when they are repr'd.
The repr of small DataFrames with strings (of integers) or integers looks the same, and no dtype information is printed when frames are small. You can get this information (and more) by calling the DataFrame.info() method, like so: df.info(). This will give you a nice summary of what's in the DataFrame and what the dtypes of its columns are:
In [205]: datalines_int = DataFrame({'rsparid':[1400,634,1508,96,161,1011,1007,518,1955,678]})
In [206]: datalines_str = DataFrame({'rsparid':map(str,[1400,634,1508,96,161,1011,1007,518,1955,678])})
In [207]: datalines_int
Out[207]:
rsparid
0 1400
1 634
2 1508
3 96
4 161
5 1011
6 1007
7 518
8 1955
9 678
In [208]: datalines_str
Out[208]:
rsparid
0 1400
1 634
2 1508
3 96
4 161
5 1011
6 1007
7 518
8 1955
9 678
In [209]: datalines_int.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
rsparid 10 non-null values
dtypes: int64(1)
In [210]: datalines_str.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
rsparid 10 non-null values
dtypes: object(1)
NOTE: You'll notice a slight difference in the reprs here, most likely because of the padding of numeric DataFrames. The point is, no one would really be able to see that interactively unless they were specifically looking for the difference.
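If mismatched dtypes turn out to be the problem here, the fix is simply to make the key dtypes agree before merging. A minimal sketch, assuming datalines.rsparid is stored as strings while rspars.rsparid holds integers:
datalines['rsparid'] = datalines['rsparid'].astype('int64')
df = pd.merge(datalines, rspars, how='left', on='rsparid')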
