I have a dataframe dfp containing prices for two assets, a and b, and another dataframe dfm containing metadata about those assets. Each asset has a start and an end date. I'd like to keep the prices inside that range and set all prices outside it to np.nan. So far I have:
import pandas as pd
import numpy as np
# sample prices for two assets
dfp = pd.DataFrame(data=np.random.random_sample((20, 2)),
columns=['a', 'b'],
index=pd.date_range(end='2020-12-10', periods=20))
print(dfp)
a b
2020-11-21 0.411653 0.001124
2020-11-22 0.773671 0.210065
2020-11-23 0.143332 0.090111
2020-11-24 0.062085 0.475205
2020-11-25 0.160982 0.557469
2020-11-26 0.025793 0.353725
2020-11-27 0.651929 0.794265
2020-11-28 0.266566 0.270451
2020-11-29 0.713030 0.346842
2020-11-30 0.838571 0.969477
2020-12-01 0.701627 0.480349
2020-12-02 0.946619 0.344399
2020-12-03 0.430523 0.857529
2020-12-04 0.202790 0.003393
2020-12-05 0.293010 0.250172
2020-12-06 0.172535 0.932216
2020-12-07 0.508303 0.775843
2020-12-08 0.704445 0.760226
2020-12-09 0.515398 0.193958
2020-12-10 0.219717 0.040269
# metadata information for those two assets
dfm = pd.DataFrame(data=[['a', '2020-11-22', '2020-11-29'],
['b', '2020-12-01', '2020-12-07']],
columns=['name', 'start', 'end'])
# set all prices outside the range to np.nan in a loop :(
for index, row in dfm.iterrows():
    dfp.loc[(dfp.index < row['start']) | (row['end'] < dfp.index), row['name']] = np.nan
print(dfp)
a b
2020-11-21 NaN NaN
2020-11-22 0.773671 NaN
2020-11-23 0.143332 NaN
2020-11-24 0.062085 NaN
2020-11-25 0.160982 NaN
2020-11-26 0.025793 NaN
2020-11-27 0.651929 NaN
2020-11-28 0.266566 NaN
2020-11-29 0.713030 NaN
2020-11-30 NaN NaN
2020-12-01 NaN 0.480349
2020-12-02 NaN 0.344399
2020-12-03 NaN 0.857529
2020-12-04 NaN 0.003393
2020-12-05 NaN 0.250172
2020-12-06 NaN 0.932216
2020-12-07 NaN 0.775843
2020-12-08 NaN NaN
2020-12-09 NaN NaN
2020-12-10 NaN NaN
Is it possible to replace the loop with advanced indexing here? If so, how?
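If your dataframe isn't too large, you can use melt and merge, then apply a conditional using np.where. Note that the strict comparisons (> and <) in this version drop the start and end dates themselves, as the output shows; use >= and <= to reproduce the inclusive bounds of the loop in the question.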
df1 = pd.merge(
pd.melt(dfp.reset_index(), id_vars="index", var_name="name"),
dfm,
on=["name"],
how="left",
)
df1['value_new'] = np.where(
(df1['index'] > df1['start']) &
(df1['index'] < df1['end']),
df1['value'],
np.nan
)
index name value start end value_new
0 2020-11-21 a 0.460695 2020-11-22 2020-11-29 NaN
1 2020-11-22 a 0.818190 2020-11-22 2020-11-29 NaN
2 2020-11-23 a 0.869208 2020-11-22 2020-11-29 0.869208
3 2020-11-24 a 0.466557 2020-11-22 2020-11-29 0.466557
4 2020-11-25 a 0.218630 2020-11-22 2020-11-29 0.218630
5 2020-11-26 a 0.769285 2020-11-22 2020-11-29 0.769285
6 2020-11-27 a 0.066418 2020-11-22 2020-11-29 0.066418
7 2020-11-28 a 0.746973 2020-11-22 2020-11-29 0.746973
8 2020-11-29 a 0.881565 2020-11-22 2020-11-29 NaN
9 2020-11-30 a 0.856797 2020-11-22 2020-11-29 NaN
10 2020-12-01 a 0.303156 2020-11-22 2020-11-29 NaN
11 2020-12-02 a 0.152055 2020-11-22 2020-11-29 NaN
12 2020-12-03 a 0.239251 2020-11-22 2020-11-29 NaN
13 2020-12-04 a 0.579377 2020-11-22 2020-11-29 NaN
14 2020-12-05 a 0.950465 2020-11-22 2020-11-29 NaN
15 2020-12-06 a 0.017557 2020-11-22 2020-11-29 NaN
16 2020-12-07 a 0.459709 2020-11-22 2020-11-29 NaN
17 2020-12-08 a 0.235053 2020-11-22 2020-11-29 NaN
18 2020-12-09 a 0.935113 2020-11-22 2020-11-29 NaN
19 2020-12-10 a 0.121584 2020-11-22 2020-11-29 NaN
20 2020-11-21 b 0.982475 2020-12-01 2020-12-07 NaN
21 2020-11-22 b 0.006563 2020-12-01 2020-12-07 NaN
22 2020-11-23 b 0.863132 2020-12-01 2020-12-07 NaN
23 2020-11-24 b 0.059826 2020-12-01 2020-12-07 NaN
24 2020-11-25 b 0.853701 2020-12-01 2020-12-07 NaN
25 2020-11-26 b 0.494347 2020-12-01 2020-12-07 NaN
26 2020-11-27 b 0.680949 2020-12-01 2020-12-07 NaN
27 2020-11-28 b 0.247310 2020-12-01 2020-12-07 NaN
28 2020-11-29 b 0.777140 2020-12-01 2020-12-07 NaN
29 2020-11-30 b 0.552633 2020-12-01 2020-12-07 NaN
30 2020-12-01 b 0.330672 2020-12-01 2020-12-07 NaN
31 2020-12-02 b 0.295119 2020-12-01 2020-12-07 0.295119
32 2020-12-03 b 0.361580 2020-12-01 2020-12-07 0.361580
33 2020-12-04 b 0.874205 2020-12-01 2020-12-07 0.874205
34 2020-12-05 b 0.754738 2020-12-01 2020-12-07 0.754738
35 2020-12-06 b 0.135053 2020-12-01 2020-12-07 0.135053
36 2020-12-07 b 0.998768 2020-12-01 2020-12-07 NaN
37 2020-12-08 b 0.955664 2020-12-01 2020-12-07 NaN
38 2020-12-09 b 0.330856 2020-12-01 2020-12-07 NaN
39 2020-12-10 b 0.826502 2020-12-01 2020-12-07 NaN
Data:
np.random.seed(44)
dfp = pd.DataFrame(data=np.random.random_sample((20, 2)),
columns=['a', 'b'],
index=pd.date_range(end='2020-12-10', periods=20))
dfm = pd.DataFrame(data=[['a', '2020-11-22', '2020-11-29'],
['b', '2020-12-01', '2020-12-07']],
columns=['name', 'start', 'end'])
dfp:
a b
2020-11-21 0.834842 0.104796
2020-11-22 0.744640 0.360501
2020-11-23 0.359311 0.609238
2020-11-24 0.393780 0.409073
2020-11-25 0.509902 0.710148
2020-11-26 0.960526 0.456621
2020-11-27 0.427652 0.113464
2020-11-28 0.217899 0.957472
2020-11-29 0.943351 0.881824
2020-11-30 0.646411 0.213825
2020-12-01 0.636832 0.139146
2020-12-02 0.458704 0.873863
2020-12-03 0.258450 0.664851
2020-12-04 0.862674 0.148848
2020-12-05 0.562950 0.159155
2020-12-06 0.172895 0.104023
2020-12-07 0.202938 0.455189
2020-12-08 0.794575 0.990823
2020-12-09 0.805017 0.377415
2020-12-10 0.515737 0.058899
dfm:
name start end
0 a 2020-11-22 2020-11-29
1 b 2020-12-01 2020-12-07
x = dfm.apply(lambda row: (dfp.index < row['start']) | (row['end'] < dfp.index),
axis=1)
final = dfp[~pd.DataFrame({'a' : x[0], 'b' : x[1]}, index=dfp.index)]
final:
a b
2020-11-21 NaN NaN
2020-11-22 0.744640 NaN
2020-11-23 0.359311 NaN
2020-11-24 0.393780 NaN
2020-11-25 0.509902 NaN
2020-11-26 0.960526 NaN
2020-11-27 0.427652 NaN
2020-11-28 0.217899 NaN
2020-11-29 0.943351 NaN
2020-11-30 NaN NaN
2020-12-01 NaN 0.139146
2020-12-02 NaN 0.873863
2020-12-03 NaN 0.664851
2020-12-04 NaN 0.148848
2020-12-05 NaN 0.159155
2020-12-06 NaN 0.104023
2020-12-07 NaN 0.455189
2020-12-08 NaN NaN
2020-12-09 NaN NaN
2020-12-10 NaN NaN
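A further loop-free variant builds the whole boolean mask in one step with NumPy broadcasting and hands it to DataFrame.where. This is only a minimal sketch, assuming every column of dfp has exactly one row in dfm; the names bounds, dates, keep and masked are illustrative and do not come from the answers above.

```python
import numpy as np
import pandas as pd

np.random.seed(44)
dfp = pd.DataFrame(data=np.random.random_sample((20, 2)),
                   columns=['a', 'b'],
                   index=pd.date_range(end='2020-12-10', periods=20))
dfm = pd.DataFrame(data=[['a', '2020-11-22', '2020-11-29'],
                         ['b', '2020-12-01', '2020-12-07']],
                   columns=['name', 'start', 'end'])

# Parse the bounds as datetimes and put them in the same order as dfp's columns
# (assumption: each column name appears exactly once in dfm['name']).
bounds = dfm.set_index('name')[['start', 'end']].apply(pd.to_datetime).reindex(dfp.columns)

# Column vector of dates (n_rows, 1) compared against row vectors of bounds (n_cols,)
# broadcasts to an (n_rows, n_cols) mask; both ends are inclusive, like the loop.
dates = dfp.index.to_numpy()[:, None]
keep = (dates >= bounds['start'].to_numpy()) & (dates <= bounds['end'].to_numpy())

# where() keeps values where the mask is True and writes NaN elsewhere.
masked = dfp.where(keep)
print(masked)
```

Unlike the loop, this leaves dfp untouched and returns a new frame, and it does not need the column names to be hard-coded.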
Getting what I need from my data takes several steps. Here are the steps I have taken so far and the point where I got stuck:
I have this df:
df = pd.DataFrame(np.array([['Iza', '2020-12-01 10:34:00'],['Iza', '2020-12-02 10:34:00'],['Iza', '2020-12-01 17:34:00'],['Iza', '2020-12-01 17:34:00'],['Sara', '2020-12-04 17:34:00'], ['Sara', '2020-12-04 20:11:00'], ['Sara', '2020-12-06 17:34:00'],['Silvia', '2020-12-07 18:34:00'],['Silvia', '2020-12-09 11:22:00'],['Paul', '2020-12-09 11:22:00'],['Paul', '2020-12-08 11:22:00'],['Paul', '2020-12-07 11:22:00']]),
columns=['Name', 'Time'])
df:
Name Time
0 Iza 2020-12-01 10:34:00
1 Iza 2020-12-02 10:34:00
2 Iza 2020-12-01 17:34:00
3 Iza 2020-12-01 17:34:00
4 Sara 2020-12-04 17:34:00
5 Sara 2020-12-04 20:11:00
6 Sara 2020-12-06 17:34:00
7 Silvia 2020-12-07 18:34:00
8 Silvia 2020-12-09 11:22:00
9 Paul 2020-12-09 11:22:00
10 Paul 2020-12-08 11:22:00
11 Paul 2020-12-07 11:22:00
I converted the time column to datetime:
df['Time'] = pd.to_datetime(df['Time'])
Now I want to get the day names and find, for each name, the percentage of rows that fall on each day, with the days as columns:
df['Day'] = df['Time'].dt.day_name()
Result:
Name Time Day
0 Iza 2020-12-01 10:34:00 Tuesday
1 Iza 2020-12-02 10:34:00 Wednesday
2 Iza 2020-12-01 17:34:00 Tuesday
3 Iza 2020-12-01 17:34:00 Tuesday
4 Sara 2020-12-04 17:34:00 Friday
5 Sara 2020-12-04 20:11:00 Friday
6 Sara 2020-12-06 17:34:00 Sunday
7 Silvia 2020-12-07 18:34:00 Monday
8 Silvia 2020-12-09 11:22:00 Wednesday
9 Paul 2020-12-09 11:22:00 Wednesday
10 Paul 2020-12-08 11:22:00 Tuesday
11 Paul 2020-12-07 11:22:00 Monday
df2 = round(df.groupby(['Name'])['Day'].apply(lambda x: x.value_counts(normalize=True)) * 100)
Result:
Name
Iza Tuesday 75.0
Wednesday 25.0
Paul Monday 33.0
Tuesday 33.0
Wednesday 33.0
Sara Friday 67.0
Sunday 33.0
Silvia Wednesday 50.0
Monday 50.0
Name: Day, dtype: float64
I am stuck here. My desired output has the days in columns, with the percentage of each day per name:
Name Sunday Monday Tuesday Wednesday Friday
Iza NaN NaN 75 25 NaN
Paul NaN 33 33 33 NaN
Sara 33 NaN NaN NaN 67
Silvia NaN 50 NaN 50 NaN
Use Categorical to get the correct column order after the final Series.unstack. The solution is also simplified by dropping apply:
df['Time'] = pd.to_datetime(df['Time'])
week = ['Sunday',
'Monday',
'Tuesday',
'Wednesday',
'Thursday',
'Friday',
'Saturday']
df['Day'] = pd.Categorical(df['Time'].dt.day_name(), ordered=True, categories=week)
df1 = df.groupby('Name')['Day'].value_counts(normalize=True).unstack().mul(100).round()
print (df1)
Day Sunday Monday Tuesday Wednesday Friday
Name
Iza NaN NaN 75.0 25.0 NaN
Paul NaN 33.0 33.0 33.0 NaN
Sara 33.0 NaN NaN NaN 67.0
Silvia NaN 50.0 NaN 50.0 NaN
An alternative that orders the columns by weekday number (dt.dayofweek counts Monday as 0 and Sunday as 6):
df['Time'] = pd.to_datetime(df['Time'])
df['Day'] = df['Time'].dt.dayofweek
d = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday',
     4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
df1 = df.groupby('Name')['Day'].value_counts(normalize=True).unstack().mul(100).round().rename(columns=d)
print (df1)
Day Monday Tuesday Wednesday Friday Sunday
Name
Iza NaN 75.0 25.0 NaN NaN
Paul 33.0 33.0 33.0 NaN NaN
Sara NaN NaN NaN 67.0 33.0
Silvia 50.0 NaN 50.0 NaN NaN
Just unstack. You are on the money
df2 = round(df.groupby(['Name'])['Day'].apply(lambda x: x.value_counts(normalize=True)) * 100).unstack(level=1)
df2=df2[['Sunday','Monday','Tuesday', 'Wednesday','Friday']]
Sunday Monday Tuesday Wednesday Friday
Name
Iza NaN NaN 75.0 25.0 NaN
Paul NaN 33.0 33.0 33.0 NaN
Sara 33.0 NaN NaN NaN 67.0
Silvia NaN 50.0 NaN 50.0 NaN
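The same table can also be sketched with pd.crosstab, which counts and normalizes in one call. This is a minimal sketch rather than a drop-in replacement: crosstab fills missing combinations with 0 instead of NaN, and the names week and pct are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Iza', 'Iza', 'Iza', 'Iza', 'Sara', 'Sara', 'Sara',
             'Silvia', 'Silvia', 'Paul', 'Paul', 'Paul'],
    'Time': ['2020-12-01 10:34:00', '2020-12-02 10:34:00', '2020-12-01 17:34:00',
             '2020-12-01 17:34:00', '2020-12-04 17:34:00', '2020-12-04 20:11:00',
             '2020-12-06 17:34:00', '2020-12-07 18:34:00', '2020-12-09 11:22:00',
             '2020-12-09 11:22:00', '2020-12-08 11:22:00', '2020-12-07 11:22:00'],
})
df['Time'] = pd.to_datetime(df['Time'])
df['Day'] = df['Time'].dt.day_name()

week = ['Sunday', 'Monday', 'Tuesday', 'Wednesday',
        'Thursday', 'Friday', 'Saturday']

# normalize='index' divides each row by its total, giving per-name fractions.
pct = pd.crosstab(df['Name'], df['Day'], normalize='index').mul(100).round()

# Put the observed days into calendar order (days that never occur are left out).
pct = pct.reindex(columns=[day for day in week if day in pct.columns])
print(pct)
```

If NaN is preferred for days that never occur for a name, pct.replace(0, float('nan')) would restore the look of the outputs above.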
I started to work with Pandas and I have some issues that I don't really know how to solve.
I have a dataframe with date, product, stock and sales. Some dates and products are missing. I would like to get a timeseries for each product in a range of dates.
For example:
product udsStock udsSales
date
2019-12-26 14 161 848
2019-12-27 14 1340 914
2019-12-30 14 856 0
2019-12-25 4 3132 439
2019-12-27 4 3177 616
2020-01-01 4 500 883
It has to be the same range for all products even if one product doesn't appear in one date in the range.
If I want the range 2019-12-25 to 2020-01-01, the final dataframe should be like this one:
product udsStock udsSales
date
2019-12-25 14 NaN NaN
2019-12-26 14 161 848
2019-12-27 14 1340 914
2019-12-28 14 NaN NaN
2019-12-29 14 NaN NaN
2019-12-30 14 856 0
2019-12-31 14 NaN NaN
2020-01-01 14 NaN NaN
2019-12-25 4 3132 439
2019-12-26 4 NaN NaN
2019-12-27 4 3177 616
2019-12-28 4 NaN NaN
2019-12-29 4 NaN NaN
2019-12-30 4 NaN NaN
2019-12-31 4 NaN NaN
2020-01-01 4 500 883
I have tried to reindex by the range but it doesn't work because there are identical indexes.
idx = pd.date_range('25-12-2019', '01-01-2020')
df = df.reindex(idx)
I have also tried to index by date and product and then reindex, but I don't know how to fill in the product for the rows that are missing.
Any more ideas?
Thanks in advance
We can use pd.date_range and reindex each product group inside groupby.apply to achieve your result:
date_range = pd.date_range(start='2019-12-25', end='2020-01-01', freq='D')
df = df.groupby('product', sort=False).apply(lambda x: x.reindex(date_range))
df['product'] = df.groupby(level=0)['product'].ffill().bfill()
df = df.droplevel(0)
product udsStock udsSales
2019-12-25 14.0 NaN NaN
2019-12-26 14.0 161.0 848.0
2019-12-27 14.0 1340.0 914.0
2019-12-28 14.0 NaN NaN
2019-12-29 14.0 NaN NaN
2019-12-30 14.0 856.0 0.0
2019-12-31 14.0 NaN NaN
2020-01-01 14.0 NaN NaN
2019-12-25 4.0 3132.0 439.0
2019-12-26 4.0 NaN NaN
2019-12-27 4.0 3177.0 616.0
2019-12-28 4.0 NaN NaN
2019-12-29 4.0 NaN NaN
2019-12-30 4.0 NaN NaN
2019-12-31 4.0 NaN NaN
2020-01-01 4.0 500.0 883.0
Convert the index to datetime:
df2.index = pd.to_datetime(df2.index)
Create the unique combinations of date and product:
import itertools
idx = pd.date_range("25-12-2019", "01-01-2020")
product = df2["product"].unique()
temp = itertools.product(idx, product)
temp = pd.MultiIndex.from_tuples(temp, names=["date", "product"])
temp
MultiIndex([('2019-12-25', 14),
('2019-12-25', 4),
('2019-12-26', 14),
('2019-12-26', 4),
('2019-12-27', 14),
('2019-12-27', 4),
('2019-12-28', 14),
('2019-12-28', 4),
('2019-12-29', 14),
('2019-12-29', 4),
('2019-12-30', 14),
('2019-12-30', 4),
('2019-12-31', 14),
('2019-12-31', 4),
('2020-01-01', 14),
('2020-01-01', 4)],
names=['date', 'product'])
Reindex the dataframe:
df2.set_index("product", append=True).reindex(temp).sort_index(
level=1, ascending=False
).reset_index(level="product")
product udsStock udsSales
date
2020-01-01 14 NaN NaN
2019-12-31 14 NaN NaN
2019-12-30 14 856.0 0.0
2019-12-29 14 NaN NaN
2019-12-28 14 NaN NaN
2019-12-27 14 1340.0 914.0
2019-12-26 14 161.0 848.0
2019-12-25 14 NaN NaN
2020-01-01 4 500.0 883.0
2019-12-31 4 NaN NaN
2019-12-30 4 NaN NaN
2019-12-29 4 NaN NaN
2019-12-28 4 NaN NaN
2019-12-27 4 3177.0 616.0
2019-12-26 4 NaN NaN
2019-12-25 4 3132.0 439.0
In R, specifically the tidyverse, this can be achieved with the complete function from tidyr. In Python, the pyjanitor package has something similar, but a few kinks remain to be ironed out (a PR has already been submitted for this).
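The date/product grid can also be built without itertools by using pd.MultiIndex.from_product. The following is a minimal sketch on the sample data from the question; the names dates, full_index and out are illustrative, and the dates are written in ISO format to avoid day-first parsing ambiguity.

```python
import pandas as pd

df = pd.DataFrame(
    {'product': [14, 14, 14, 4, 4, 4],
     'udsStock': [161, 1340, 856, 3132, 3177, 500],
     'udsSales': [848, 914, 0, 439, 616, 883]},
    index=pd.to_datetime(['2019-12-26', '2019-12-27', '2019-12-30',
                          '2019-12-25', '2019-12-27', '2020-01-01']).rename('date'),
)

dates = pd.date_range('2019-12-25', '2020-01-01', freq='D', name='date')

# Every (product, date) pair, with products in order of first appearance.
full_index = pd.MultiIndex.from_product([df['product'].unique(), dates],
                                        names=['product', 'date'])

out = (df.set_index('product', append=True)  # index levels: (date, product)
         .swaplevel()                        # index levels: (product, date)
         .reindex(full_index)                # missing pairs become NaN rows
         .reset_index(level='product'))      # product back to a column
print(out)
```

Because product is part of the index during the reindex, it is never missing and does not need to be forward- or back-filled afterwards.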
I have the following data:
(Pdb) df1 = pd.DataFrame({'id': ['SE0000195570','SE0000195570','SE0000195570','SE0000195570','SE0000191827','SE0000191827','SE0000191827','SE0000191827', 'SE0000191827'],'val': ['1','2','3','4','5','6','7','8', '9'],'date': pd.to_datetime(['2014-10-23','2014-07-16','2014-04-29','2014-01-31','2018-10-19','2018-07-11','2018-04-20','2018-02-16','2018-12-29'])})
(Pdb) df1
id val date
0 SE0000195570 1 2014-10-23
1 SE0000195570 2 2014-07-16
2 SE0000195570 3 2014-04-29
3 SE0000195570 4 2014-01-31
4 SE0000191827 5 2018-10-19
5 SE0000191827 6 2018-07-11
6 SE0000191827 7 2018-04-20
7 SE0000191827 8 2018-02-16
8 SE0000191827 9 2018-12-29
UPDATE:
As per the suggestions of @user3483203, I have gotten a bit further but am not quite there. I've amended the example data above with a new row to better illustrate the problem.
(Pdb) df2.assign(calc=(df2.dropna()['val'].groupby(level=0).rolling(4).sum().shift(-3).reset_index(0, drop=True)))
id val date calc
id date
SE0000191827 2018-02-28 SE0000191827 8 2018-02-16 26.0
2018-03-31 NaN NaN NaT NaN
2018-04-30 SE0000191827 7 2018-04-20 27.0
2018-05-31 NaN NaN NaT NaN
2018-06-30 NaN NaN NaT NaN
2018-07-31 SE0000191827 6 2018-07-11 NaN
2018-08-31 NaN NaN NaT NaN
2018-09-30 NaN NaN NaT NaN
2018-10-31 SE0000191827 5 2018-10-19 NaN
2018-11-30 NaN NaN NaT NaN
2018-12-31 SE0000191827 9 2018-12-29 NaN
SE0000195570 2014-01-31 SE0000195570 4 2014-01-31 10.0
2014-02-28 NaN NaN NaT NaN
2014-03-31 NaN NaN NaT NaN
2014-04-30 SE0000195570 3 2014-04-29 NaN
2014-05-31 NaN NaN NaT NaN
2014-06-30 NaN NaN NaT NaN
2014-07-31 SE0000195570 2 2014-07-16 NaN
2014-08-31 NaN NaN NaT NaN
2014-09-30 NaN NaN NaT NaN
2014-10-31 SE0000195570 1 2014-10-23 NaN
For my requirements, the row (SE0000191827, 2018-03-31) should have a calc value since it has four consecutive rows with a value. Currently the row is being removed with the dropna call and I can't figure out how to solve that problem.
What I need
Calculations: The dates in my initial data are quarterly. However, I need to transform this data into monthly rows ranging between the first and last date of each id and, for each month, calculate the sum of the four closest consecutive rows of the input data within that id. That's a mouthful. This led me to resample. See expected output below. I need the data to be grouped by both id and the monthly dates.
Performance: The data I'm testing on now is just for benchmarking but I will need the solution to be performant. I'm expecting to run this on upwards of 100k unique ids which may result in around 10 million rows. (100k ids, dates range back up to 10 years, 10years * 12months = 120 months per id, 100k*120 = 12million rows).
What I've tried
(Pdb) res = df.groupby('id').resample('M',on='date')
(Pdb) res.first()
id val date
id date
SE0000191827 2018-02-28 SE0000191827 8 2018-02-16
2018-03-31 NaN NaN NaT
2018-04-30 SE0000191827 7 2018-04-20
2018-05-31 NaN NaN NaT
2018-06-30 NaN NaN NaT
2018-07-31 SE0000191827 6 2018-07-11
2018-08-31 NaN NaN NaT
2018-09-30 NaN NaN NaT
2018-10-31 SE0000191827 5 2018-10-19
SE0000195570 2014-01-31 SE0000195570 4 2014-01-31
2014-02-28 NaN NaN NaT
2014-03-31 NaN NaN NaT
2014-04-30 SE0000195570 3 2014-04-29
2014-05-31 NaN NaN NaT
2014-06-30 NaN NaN NaT
2014-07-31 SE0000195570 2 2014-07-16
2014-08-31 NaN NaN NaT
2014-09-30 NaN NaN NaT
2014-10-31 SE0000195570 1 2014-10-23
This data looks very nice for my case since it's grouped by id and has the dates lined up by month. Here it seems like I could use something like df['val'].rolling(4), make sure it skips NaN values, and put that result in a new column.
Expected output (new column calc):
id val date calc
id date
SE0000191827 2018-02-28 SE0000191827 8 2018-02-16 26
2018-03-31 NaN NaN NaT
2018-04-30 SE0000191827 7 2018-04-20 NaN
2018-05-31 NaN NaN NaT
2018-06-30 NaN NaN NaT
2018-07-31 SE0000191827 6 2018-07-11 NaN
2018-08-31 NaN NaN NaT
2018-09-30 NaN NaN NaT
2018-10-31 SE0000191827 5 2018-10-19 NaN
SE0000195570 2014-01-31 SE0000195570 4 2014-01-31 10
2014-02-28 NaN NaN NaT
2014-03-31 NaN NaN NaT
2014-04-30 SE0000195570 3 2014-04-29 NaN
2014-05-31 NaN NaN NaT
2014-06-30 NaN NaN NaT
2014-07-31 SE0000195570 2 2014-07-16 NaN
2014-08-31 NaN NaN NaT
2014-09-30 NaN NaN NaT
2014-10-31 SE0000195570 1 2014-10-23 NaN
2014-11-30 NaN NaN NaT
2014-12-31 SE0000195570 1 2014-10-23 NaN
Here the result in calc is 26 for the first date since it adds the three preceding (8+7+6+5). The rest for that id is NaN since four values are not available.
The problems
While it may look like the data is grouped by id and date, it seems like it's actually grouped by date. I'm not sure how this works. I need the data to be grouped by id and date.
(Pdb) res['val'].get_group(datetime.date(2018,2,28))
7 6.730000e+08
Name: val, dtype: object
The result of the resample above returns a DatetimeIndexResamplerGroupby which doesn't have rolling...
(Pdb) res['val'].rolling(4)
*** AttributeError: 'DatetimeIndexResamplerGroupby' object has no attribute 'rolling'
What to do? My guess is that my approach is wrong but after scouring the documentation I'm not sure where to start.
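One possible starting point is to compute the four-row sum on the original quarterly rows, per id, before any resampling, so that neither dropna nor a shift across group boundaries is involved. This is only a sketch under a stated assumption: "four consecutive rows" is taken to mean the row itself plus the next three rows by date within the same id. It reproduces the 26, 27 and 10 shown in the update, but it does not fill calc for the empty monthly rows.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': ['SE0000195570'] * 4 + ['SE0000191827'] * 5,
    'val': ['1', '2', '3', '4', '5', '6', '7', '8', '9'],
    'date': pd.to_datetime(['2014-10-23', '2014-07-16', '2014-04-29', '2014-01-31',
                            '2018-10-19', '2018-07-11', '2018-04-20', '2018-02-16',
                            '2018-12-29']),
})

# The sample stores val as strings; convert so the values can be summed.
df1['val'] = df1['val'].astype(int)
df1 = df1.sort_values(['id', 'date'])

# Forward-looking window: reverse each group, take a trailing rolling sum of 4,
# then reverse back. Rows with fewer than three later rows in the group get NaN.
df1['calc'] = (df1.groupby('id')['val']
                  .transform(lambda s: s[::-1].rolling(4).sum()[::-1]))
print(df1)
```

Since transform works group by group, the window never mixes values from two different ids.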
In the dataframe below, I want to set row values in the column p50 to NaN if they are below 2.0 between the dates May 15th and August 15th 2018.
date p50
2018-03-02 2018-03-02 NaN
2018-03-03 2018-03-03 NaN
2018-03-04 2018-03-04 0.022590
2018-03-05 2018-03-05 NaN
2018-03-06 2018-03-06 -0.042227
2018-03-07 2018-03-07 NaN
2018-03-08 2018-03-08 NaN
2018-03-09 2018-03-09 -0.028646
2018-03-10 2018-03-10 NaN
2018-03-11 2018-03-11 -0.045244
2018-03-12 2018-03-12 NaN
2018-03-13 2018-03-13 NaN
2018-03-14 2018-03-14 -0.020590
2018-03-15 2018-03-15 NaN
2018-03-16 2018-03-16 -0.028317
2018-03-17 2018-03-17 NaN
2018-03-18 2018-03-18 NaN
2018-03-19 2018-03-19 NaN
2018-03-20 2018-03-20 NaN
2018-03-21 2018-03-21 NaN
2018-03-22 2018-03-22 NaN
2018-03-23 2018-03-23 NaN
2018-03-24 2018-03-24 -0.066800
2018-03-25 2018-03-25 NaN
2018-03-26 2018-03-26 -0.104135
2018-03-27 2018-03-27 NaN
2018-03-28 2018-03-28 NaN
2018-03-29 2018-03-29 -0.115200
2018-03-30 2018-03-30 NaN
2018-03-31 2018-03-31 -0.000455
... ...
2018-07-03 2018-07-03 NaN
2018-07-04 2018-07-04 2.313035
2018-07-05 2018-07-05 NaN
2018-07-06 2018-07-06 NaN
2018-07-07 2018-07-07 NaN
2018-07-08 2018-07-08 NaN
2018-07-09 2018-07-09 0.054513
2018-07-10 2018-07-10 NaN
2018-07-11 2018-07-11 NaN
2018-07-12 2018-07-12 3.711159
2018-07-13 2018-07-13 NaN
2018-07-14 2018-07-14 6.583810
2018-07-15 2018-07-15 NaN
2018-07-16 2018-07-16 NaN
2018-07-17 2018-07-17 0.070182
2018-07-18 2018-07-18 NaN
2018-07-19 2018-07-19 3.688812
2018-07-20 2018-07-20 NaN
2018-07-21 2018-07-21 NaN
2018-07-22 2018-07-22 0.876552
2018-07-23 2018-07-23 NaN
2018-07-24 2018-07-24 1.077895
2018-07-25 2018-07-25 NaN
2018-07-26 2018-07-26 NaN
2018-07-27 2018-07-27 3.802159
2018-07-28 2018-07-28 NaN
2018-07-29 2018-07-29 0.077402
2018-07-30 2018-07-30 NaN
2018-07-31 2018-07-31 NaN
2018-08-01 2018-08-01 3.202214
The dataframe has a datetime index. I do the following:
mask = (group['date'] > '2018-5-15') & (group['date'] <= '2018-8-15')
group[mask].loc[group[mask]['p50'] < 2.]['p50'] = np.NaN
However, this does not update the dataframe. How to fix this?
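I think you should use .loc with the combined mask. Your version chains two indexing operations, so the assignment goes to a temporary copy rather than to group: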
mask = (group['date'] > '2018-5-15') & (group['date'] <= '2018-8-15')
group.loc[mask&(group['p50'] < 2),'p50']=np.nan
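To see the effect on a small scale, here is a minimal, self-contained sketch. The five rows are made up for illustration (three p50 values are borrowed from the question's listing), and in_window is just a descriptive name for the date mask.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2018-05-10', '2018-07-04', '2018-07-09',
                        '2018-07-12', '2018-08-20'])
group = pd.DataFrame({'date': dates,
                      'p50': [0.5, 2.313035, 0.054513, 3.711159, 0.1]},
                     index=dates)

# Rows strictly after May 15th and up to and including August 15th.
in_window = (group['date'] > '2018-05-15') & (group['date'] <= '2018-08-15')

# A single .loc call with a row mask and a column label writes into group itself.
group.loc[in_window & (group['p50'] < 2.0), 'p50'] = np.nan
print(group)
```

Only the 2018-07-09 row is blanked: it is the only one that lies inside the window and is below 2.0. The low values on 2018-05-10 and 2018-08-20 fall outside the window and are kept.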
I am learning pandas and trying to plot the id column, but I get the error AttributeError: Unknown property color_cycle and an empty graph. The graph only appears in the interactive shell. When I run the code as a script, I get the same error and no graph appears at all.
Below is the log:
>>> import pandas as pd
>>> pd.set_option('display.mpl_style', 'default')
>>> df = pd.read_csv('2015.csv', parse_dates=['log_date'])
>>> employee_198 = df[df['employee_id'] == 198]
>>> print(employee_198)
id version company_id early_minutes employee_id late_minutes \
90724 91635 0 1 NaN 198 NaN
90725 91636 0 1 NaN 198 0:20:00
90726 91637 0 1 0:20:00 198 NaN
90727 91638 0 1 0:05:00 198 NaN
90728 91639 0 1 0:25:00 198 NaN
90729 91640 0 1 0:15:00 198 0:20:00
90730 91641 0 1 NaN 198 0:15:00
90731 91642 0 1 NaN 198 NaN
90732 91643 0 1 NaN 198 NaN
90733 91644 0 1 NaN 198 NaN
90734 91645 0 1 NaN 198 NaN
90735 91646 0 1 NaN 198 NaN
90736 91647 0 1 NaN 198 NaN
90737 91648 0 1 NaN 198 NaN
90738 91649 0 1 NaN 198 NaN
90739 91650 0 1 NaN 198 0:10:00
90740 91651 0 1 NaN 198 NaN
90741 91652 0 1 NaN 198 NaN
90742 91653 0 1 NaN 198 NaN
90743 91654 0 1 NaN 198 NaN
90744 91655 0 1 NaN 198 NaN
90745 91656 0 1 NaN 198 NaN
90746 91657 0 1 1:30:00 198 NaN
90747 91658 0 1 0:04:25 198 NaN
90748 91659 0 1 NaN 198 NaN
90749 91660 0 1 NaN 198 NaN
90750 91661 0 1 NaN 198 NaN
90751 91662 0 1 NaN 198 NaN
90752 91663 0 1 NaN 198 NaN
90753 91664 0 1 NaN 198 NaN
90897 91808 0 1 NaN 198 0:04:14
91024 91935 0 1 NaN 198 0:21:43
91151 92062 0 1 NaN 198 0:42:07
91278 92189 0 1 NaN 198 0:16:36
91500 92411 0 1 NaN 198 0:07:12
91532 92443 0 1 NaN 198 NaN
91659 92570 0 1 NaN 198 0:53:03
91786 92697 0 1 NaN 198 NaN
91913 92824 0 1 NaN 198 NaN
92040 92951 0 1 NaN 198 NaN
92121 93032 0 1 4:22:35 198 NaN
92420 93331 0 1 NaN 198 NaN
92421 93332 0 1 NaN 198 3:51:15
log_date log_in_time log_out_time over_time remarks \
90724 2015-11-15 No In No Out NaN [Absent]
90725 2015-10-18 10:00:00 17:40:00 NaN NaN
90726 2015-10-19 9:20:00 17:10:00 NaN NaN
90727 2015-10-25 9:30:00 17:25:00 NaN NaN
90728 2015-10-26 9:34:00 17:05:00 NaN NaN
90729 2015-10-27 10:00:00 17:15:00 NaN NaN
90730 2015-10-28 9:55:00 17:30:00 NaN NaN
90731 2015-10-29 9:40:00 17:30:00 NaN NaN
90732 2015-10-30 9:00:00 17:30:00 0:30:00 NaN
90733 2015-10-20 No In No Out NaN [Absent]
90734 2015-10-21 No In No Out NaN [Maha Asthami]
90735 2015-10-22 No In No Out NaN [Nawami/Dashami]
90736 2015-10-23 No In No Out NaN [Absent]
90737 2015-10-24 No In No Out NaN [Off]
90738 2015-11-01 9:15:00 17:30:00 0:15:00 NaN
90739 2015-11-02 9:50:00 17:30:00 NaN NaN
90740 2015-11-03 9:30:00 17:30:00 NaN NaN
90741 2015-11-04 9:40:00 17:30:00 NaN NaN
90742 2015-11-05 9:38:00 17:30:00 NaN NaN
90743 2015-11-06 9:30:00 17:30:00 NaN NaN
90744 2015-11-08 9:30:00 17:30:00 NaN NaN
90745 2015-11-09 9:30:00 17:30:00 NaN NaN
90746 2015-11-10 9:30:00 16:00:00 NaN NaN
90747 2015-11-16 9:30:00 17:25:35 NaN NaN
90748 2015-11-07 No In No Out NaN [Off]
90749 2015-11-11 No In No Out NaN [Laxmi Puja]
90750 2015-11-12 No In No Out NaN [Govardhan Puja]
90751 2015-11-13 No In No Out NaN [Bhai Tika]
90752 2015-11-14 No In No Out NaN [Off]
90753 2015-10-31 No In No Out NaN [Off]
90897 2015-11-17 9:44:14 17:35:01 NaN NaN
91024 2015-11-18 10:01:43 17:36:29 NaN NaN
91151 2015-11-19 10:22:07 17:43:47 NaN NaN
91278 2015-11-20 9:56:36 17:37:00 NaN NaN
91500 2015-11-22 9:47:12 17:46:44 NaN NaN
91532 2015-11-21 No In No Out NaN [Off]
91659 2015-11-23 10:33:03 17:30:00 NaN NaN
91786 2015-11-24 9:34:11 17:32:24 NaN NaN
91913 2015-11-25 9:36:05 17:35:00 NaN NaN
92040 2015-11-26 9:35:39 17:58:05 0:22:26 NaN
92121 2015-11-27 9:08:45 13:07:25 NaN NaN
92420 2015-11-28 No In No Out NaN [Off]
92421 2015-11-29 13:31:15 17:34:44 NaN NaN
shift_in_time shift_out_time work_time under_time
90724 9:30:00 17:30:00 NaN NaN
90725 9:30:00 17:30:00 7:40:00 0:20:00
90726 9:30:00 17:30:00 7:50:00 0:10:00
90727 9:30:00 17:30:00 7:55:00 0:05:00
90728 9:30:00 17:30:00 7:31:00 0:29:00
90729 9:30:00 17:30:00 7:15:00 0:45:00
90730 9:30:00 17:30:00 7:35:00 0:25:00
90731 9:30:00 17:30:00 7:50:00 0:10:00
90732 9:30:00 17:30:00 8:30:00 NaN
90733 9:30:00 17:30:00 NaN NaN
90734 9:30:00 17:30:00 NaN NaN
90735 9:30:00 17:30:00 NaN NaN
90736 9:30:00 17:30:00 NaN NaN
90737 9:30:00 17:30:00 NaN NaN
90738 9:30:00 17:30:00 8:15:00 NaN
90739 9:30:00 17:30:00 7:40:00 0:20:00
90740 9:30:00 17:30:00 8:00:00 NaN
90741 9:30:00 17:30:00 7:50:00 0:10:00
90742 9:30:00 17:30:00 7:52:00 0:08:00
90743 9:30:00 17:30:00 8:00:00 NaN
90744 9:30:00 17:30:00 8:00:00 NaN
90745 9:30:00 17:30:00 8:00:00 NaN
90746 9:30:00 17:30:00 6:30:00 1:30:00
90747 9:30:00 17:30:00 7:55:35 0:04:25
90748 9:30:00 17:30:00 NaN NaN
90749 9:30:00 17:30:00 NaN NaN
90750 9:30:00 17:30:00 NaN NaN
90751 9:30:00 17:30:00 NaN NaN
90752 9:30:00 17:30:00 NaN NaN
90753 9:30:00 17:30:00 NaN NaN
90897 9:30:00 17:30:00 7:50:47 0:09:13
91024 9:30:00 17:30:00 7:34:46 0:25:14
91151 9:30:00 17:30:00 7:21:40 0:38:20
91278 9:30:00 17:30:00 7:40:24 0:19:36
91500 9:30:00 17:30:00 7:59:32 0:00:28
91532 9:30:00 17:30:00 NaN NaN
91659 9:30:00 17:30:00 6:56:57 1:03:03
91786 9:30:00 17:30:00 7:58:13 0:01:47
91913 9:30:00 17:30:00 7:58:55 0:01:05
92040 9:30:00 17:30:00 8:22:26 NaN
92121 9:30:00 17:30:00 3:58:40 4:01:20
92420 9:30:00 17:30:00 NaN NaN
92421 9:30:00 17:30:00 4:03:29 3:56:31
>>> employee_198['id'].plot()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 3497, in __call__
**kwds)
File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 2587, in plot_series
**kwds)
File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 2384, in _plot
plot_obj.generate()
File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 987, in generate
self._make_plot()
File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 1664, in _make_plot
**kwds)
File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 1678, in _plot
lines = MPLPlot._plot(ax, x, y_values, style=style, **kwds)
File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 1300, in _plot
return ax.plot(*args, **kwds)
File "C:\Python27\lib\site-packages\matplotlib\__init__.py", line 1811, in inner
return func(ax, *args, **kwargs)
File "C:\Python27\lib\site-packages\matplotlib\axes\_axes.py", line 1427, in plot
for line in self._get_lines(*args, **kwargs):
File "C:\Python27\lib\site-packages\matplotlib\axes\_base.py", line 386, in _grab_next_args
for seg in self._plot_args(remaining, kwargs):
File "C:\Python27\lib\site-packages\matplotlib\axes\_base.py", line 374, in _plot_args
seg = func(x[:, j % ncx], y[:, j % ncy], kw, kwargs)
File "C:\Python27\lib\site-packages\matplotlib\axes\_base.py", line 280, in _makeline
seg = mlines.Line2D(x, y, **kw)
File "C:\Python27\lib\site-packages\matplotlib\lines.py", line 366, in __init__
self.update(kwargs)
File "C:\Python27\lib\site-packages\matplotlib\artist.py", line 856, in update
raise AttributeError('Unknown property %s' % k)
AttributeError: Unknown property color_cycle
>>>
There's currently a bug in pandas 0.17.1 with Matplotlib 1.5.0. You can check which versions you have with:
import pandas
import matplotlib
print(pandas.__version__)
print(matplotlib.__version__)
Instead of using
import pandas as pd
pd.set_option('display.mpl_style', 'default')
Use:
import matplotlib
matplotlib.style.use('ggplot')
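Put together, a script version might look like the sketch below. It is an illustration only: the four id values are copied from the question's printout instead of being read from 2015.csv, and the Agg backend plus the in-memory buffer are assumptions made so the example runs without a display. With a GUI backend you would call plt.show() at the end instead, which is what makes the window appear when running as a script.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for an interactive window
import matplotlib.pyplot as plt
import pandas as pd

# Set the style through matplotlib rather than pd.set_option('display.mpl_style', ...).
plt.style.use("ggplot")

# A few id values from the question's output, indexed like the original rows.
ids = pd.Series([91635, 91636, 91637, 91638],
                index=[90724, 90725, 90726, 90727], name="id")

ax = ids.plot()

# Render the figure to PNG bytes in memory; in a desktop session use plt.show().
buffer = io.BytesIO()
ax.figure.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
```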