I have a series that looks like this:
2014 7 2014-07-01 -0.045417
8 2014-08-01 -0.035876
9 2014-09-02 -0.030971
10 2014-10-01 -0.027471
11 2014-11-03 -0.032968
12 2014-12-01 -0.031110
2015 1 2015-01-02 -0.028906
2 2015-02-02 -0.035563
3 2015-03-02 -0.040338
4 2015-04-01 -0.032770
5 2015-05-01 -0.025762
6 2015-06-01 -0.019746
7 2015-07-01 -0.018541
8 2015-08-03 -0.028101
9 2015-09-01 -0.043237
10 2015-10-01 -0.053565
11 2015-11-02 -0.062630
12 2015-12-01 -0.064618
2016 1 2016-01-04 -0.064852
I want to be able to get the value from a date. Something like:
myseries.loc('2015-10-01') and it returns -0.053565
The index entries are tuples of the form (2016, 1, 2016-01-04).
You can do it like this:
In [32]:
df.loc(axis=0)[:,:,'2015-10-01']
Out[32]:
value
year month date
2015 10 2015-10-01 -0.053565
You can also pass slice for each level:
In [39]:
df.loc[(slice(None),slice(None),'2015-10-01'),:]
Out[39]:
value
year month date
2015 10 2015-10-01 -0.053565
Or just pass the first 2 index levels:
In [40]:
df.loc[2015,10]
Out[40]:
value
date
2015-10-01 -0.053565
Try xs:
print(s.xs('2015-10-01',level=2,axis=0))
#year datetime
#2015 10 -0.053565
#Name: series, dtype: float64
print(s.xs(7,level=1,axis=0))
#year datetime
#2014 2014-07-01 -0.045417
#2015 2015-07-01 -0.018541
#Name: series, dtype: float64
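For reference, here is a minimal self-contained sketch of the lookups above. The level names (year, month, date) and the three sample rows are assumptions chosen to mirror the question, and the date level is assumed to hold Timestamps.

```python
import pandas as pd

# Hypothetical reconstruction of a few rows of the question's series.
dates = pd.to_datetime(["2015-09-01", "2015-10-01", "2015-11-02"])
index = pd.MultiIndex.from_arrays(
    [dates.year, dates.month, dates], names=["year", "month", "date"]
)
s = pd.Series([-0.043237, -0.053565, -0.062630], index=index, name="series")

# Select on the date level only; the matched level is dropped from the result.
by_date = s.xs(pd.Timestamp("2015-10-01"), level="date")

# Pin all three levels to get the scalar itself.
value = s.loc[(2015, 10, pd.Timestamp("2015-10-01"))]

print(by_date)
print(value)
```

xs is convenient when only the date is known; the full-tuple loc lookup returns the bare float.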
I have a pandas DataFrame of the form:
id amount birth
0 4 78.0 1980-02-02 00:00:00
1 5 24.0 1989-03-03 00:00:00
2 6 49.5 2014-01-01 00:00:00
3 7 34.0 2014-01-01 00:00:00
4 8 49.5 2014-01-01 00:00:00
I am interested only in the year, month and day in the birth column of the dataframe. I tried to use pandas' datetime conversion, but it resulted in an error:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1054-02-07 00:00:00
The birth column is an object dtype.
My guess would be that it is an incorrect date. I would not like to pass the parameter errors="coerce" into the to_datetime method, because each item is important and I need just the YYYY-MM-DD.
I tried to use pandas' regex support:
df["birth"].str.find("(\d{4})-(\d{2})-(\d{2})")
But this is returning NaNs. How can I resolve this?
Thanks
Because the values cannot be converted to datetimes, you can split on the first whitespace and then select the first value:
df['birth'] = df['birth'].str.split().str[0]
Then, if necessary, convert to periods (see "Representing out-of-bounds spans" in the pandas documentation).
print (df)
id amount birth
0 4 78.0 1980-02-02 00:00:00
1 5 24.0 1989-03-03 00:00:00
2 6 49.5 2014-01-01 00:00:00
3 7 34.0 2014-01-01 00:00:00
4 8 49.5 0-01-01 00:00:00
def to_per(x):
    splitted = x.split('-')
    return pd.Period(year=int(splitted[0]),
                     month=int(splitted[1]),
                     day=int(splitted[2]), freq='D')
df['birth'] = df['birth'].str.split().str[0].apply(to_per)
print (df)
id amount birth
0 4 78.0 1980-02-02
1 5 24.0 1989-03-03
2 6 49.5 2014-01-01
3 7 34.0 2014-01-01
4 8 49.5 0000-01-01
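Since the question tried a regex, here is a hedged sketch that uses str.extract to pull out the date part and then builds daily periods as above. The two sample rows, the column names birth_str and birth_period, and the helper name to_period are assumptions for illustration only.

```python
import pandas as pd

# Sample values mirroring the question; the 1054 date is outside the range
# that nanosecond-resolution Timestamps can represent.
df = pd.DataFrame({"birth": ["1980-02-02 00:00:00", "1054-02-07 00:00:00"]})

# str.extract returns the captured group; str.find only searches for a literal substring.
df["birth_str"] = df["birth"].str.extract(r"(\d+-\d{2}-\d{2})", expand=False)

def to_period(text):
    # Build a daily Period from the numeric year, month and day fields.
    year, month, day = (int(part) for part in text.split("-"))
    return pd.Period(year=year, month=month, day=day, freq="D")

df["birth_period"] = df["birth_str"].apply(to_period)
print(df[["birth_str", "birth_period"]])
```

If the plain YYYY-MM-DD string is all that is needed, the birth_str column is already enough and the Period step can be skipped.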
I need to write a function and then apply it for a dataframe's column in pandas.
My dataframe looks like this. The data is sorted by id and then by the period column.
period id column1
0 2013-01-31 5 NaT
1 2013-02-28 5 28 days
2 2013-03-31 5 31 days
3 2013-04-30 5 30 days
4 2016-05-31 6 NaT
5 2016-06-30 6 30 days
6 2016-08-31 6 62 days
The new column values should be defined according to the values in column1:
if column1 = NaT or column1 > 31
then the new column equals the value in the period column.
Else, the value of the new column should be copied from its previous row:
new column, ith row = new column, (i-1)th row.
I am very new to Python and my code doesn't work:
def f(x):
if not x or x > 31
return x=df['period']
else
return x=x.shift()
df['newcolumn'] = df['column1'].apply(f)
The output should be this:
period id column1 newcolumn
0 2013-01-31 5 NaT 2013-01-31
1 2013-02-28 5 28 days 2013-01-31
2 2013-03-31 5 31 days 2013-01-31
3 2013-04-30 5 30 days 2013-01-31
4 2016-05-31 6 NaT 2016-05-31
5 2016-06-30 6 30 days 2016-05-31
6 2016-08-31 6 62 days 2016-08-31
Any help would be much appreciated.
First, it might be necessary to convert period to datetime using pd.to_datetime:
df['period']=pd.to_datetime(df['period'])
Then you can use Series.where with Series.ffill:
df['newcolumn']=df['period'].where((df["column1"]>pd.Timedelta("31 days"))|(df["column1"].isnull())).ffill()
print(df)
period id column1 newcolumn
0 2013-01-31 5 NaT 2013-01-31
1 2013-02-28 5 28 days 2013-01-31
2 2013-03-31 5 31 days 2013-01-31
3 2013-04-30 5 30 days 2013-01-31
4 2016-05-31 6 NaT 2016-05-31
5 2016-06-30 6 30 days 2016-05-31
6 2016-08-31 6 62 days 2016-08-31
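You can also use Series.where(cond, other), which keeps the original value where the condition holds and takes other elsewhere. If other is left unset, the non-matching rows become NaT, and a forward fill then carries the last kept value down:
df["newcolumn"] = df["period"].where(df["column1"].isnull() | (df["column1"]>pd.Timedelta("31D"))).ffill()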
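To make the where/ffill idea easy to try, here is a self-contained sketch. It assumes that column1 is the gap to the previous row within each id, which reproduces the values shown in the question; the name starts is introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "period": pd.to_datetime([
        "2013-01-31", "2013-02-28", "2013-03-31", "2013-04-30",
        "2016-05-31", "2016-06-30", "2016-08-31",
    ]),
    "id": [5, 5, 5, 5, 6, 6, 6],
})
# Assumption: column1 is the difference to the previous period within each id.
df["column1"] = df.groupby("id")["period"].diff()

# Rows that start a new block: no previous value, or a gap longer than 31 days.
starts = df["column1"].isna() | (df["column1"] > pd.Timedelta(days=31))

# Keep period on the start rows, blank the rest, then carry the last kept value down.
df["newcolumn"] = df["period"].where(starts).ffill()
print(df)
```

Comparisons against NaT evaluate to False, which is why the isna() check is needed alongside the Timedelta comparison.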
I have a simple DataFrame that looks something like this:
TimeStamp, Value
1-Jan 06:10, 5
1-Jan 08:15, 7
1-Jan 15:30, 3
2-Jan 07:05, 1
2-Jan 10:15, 3
2-Jan 13:30, 2
How can I add a third column to the same DataFrame that would show me the running max value of 'Value' for each day and reset with each next day? I want the DataFrame to look like this:
TimeStamp, Value, DayMax
1-Jan 06:10, 5, 7
1-Jan 08:15, 7, 7
1-Jan 15:30, 3, 7
2-Jan 07:05, 1, 3
2-Jan 10:15, 3, 3
2-Jan 13:30, 2, 3
I tried using .rolling().max(...), but the problem is that I need the max value even in earlier rows, before the maximum is encountered and before min_periods is reached. I also need the max to reset with each day, and thus to ignore the window parameter.
I am hoping to avoid looping and complex code manipulations, as I will be doing it over a very large DataFrame, so would much prefer something built-in!
If you convert the TimeStamp column to a datetime using to_datetime then you can groupby on the date and call transform to return a Series that is the max value for each day:
In [54]:
df['TimeStamp'] = pd.to_datetime(df['TimeStamp'], format='%d-%b %H:%M')
df
Out[54]:
TimeStamp Value
0 1900-01-01 06:10:00 5
1 1900-01-01 08:15:00 7
2 1900-01-01 15:30:00 3
3 1900-01-02 07:05:00 1
4 1900-01-02 10:15:00 3
5 1900-01-02 13:30:00 2
In [55]:
df['DayMax'] = df.groupby(df['TimeStamp'].dt.date)['Value'].transform('max')
df
Out[55]:
TimeStamp Value DayMax
0 1900-01-01 06:10:00 5 7
1 1900-01-01 08:15:00 7 7
2 1900-01-01 15:30:00 3 7
3 1900-01-02 07:05:00 1 3
4 1900-01-02 10:15:00 3 3
5 1900-01-02 13:30:00 2 3
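As a self-contained sketch of the same idea, the example below also shows groupby(...).cummax() for comparison, in case a true running maximum that restarts each day is wanted instead. The explicit year 2015 and the RunningMax column are assumptions added for illustration.

```python
import pandas as pd

# Assumption: full timestamps with an explicit year (2015 is arbitrary here).
df = pd.DataFrame({
    "TimeStamp": pd.to_datetime([
        "2015-01-01 06:10", "2015-01-01 08:15", "2015-01-01 15:30",
        "2015-01-02 07:05", "2015-01-02 10:15", "2015-01-02 13:30",
    ]),
    "Value": [5, 7, 3, 1, 3, 2],
})

day = df["TimeStamp"].dt.date  # grouping key: the calendar day

# Per-day maximum, broadcast back to every row of that day.
df["DayMax"] = df.groupby(day)["Value"].transform("max")

# Running maximum that restarts at the first row of each day.
df["RunningMax"] = df.groupby(day)["Value"].cummax()
print(df)
```

transform("max") matches the output requested in the question; cummax only knows about the rows seen so far within the day.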
I have this dataframe df:
U,Datetime
01,2015-01-01 20:00:00
01,2015-02-01 20:05:00
01,2015-04-01 21:00:00
01,2015-05-01 22:00:00
01,2015-07-01 22:05:00
02,2015-08-01 20:00:00
02,2015-09-01 21:00:00
02,2014-01-01 23:00:00
02,2014-02-01 22:05:00
02,2015-01-01 20:00:00
02,2014-03-01 21:00:00
03,2015-10-01 20:00:00
03,2015-11-01 21:00:00
03,2015-12-01 23:00:00
03,2015-01-01 22:05:00
03,2015-02-01 20:00:00
03,2015-05-01 21:00:00
03,2014-01-01 20:00:00
03,2014-02-01 21:00:00
made up of U and a Datetime object. What I would like to do is keep the U values that have at least three consecutive occurrences in months/year. So far I have grouped by U, year and month as:
m = df.groupby(['U',df.index.year,df.index.month]).size()
obtaining:
U
1 2015 1 1
2 1
4 1
5 1
7 1
2 2014 1 1
2 1
3 1
2015 1 1
8 1
9 1
3 2014 1 1
2 1
2015 1 1
2 1
5 1
10 1
11 1
12 1
The third column counts the occurrences in each month/year. In this case only the U values 02 and 03 contain at least three consecutive months. Now I can't figure out how to select those users and get them out as a list, for instance, or just keep them in the original dataframe df and discard the others. I also tried:
g = m.groupby(level=[0,1]).diff()
But I can't get any useful information.
Finally I came up with a solution :)
To give you an idea of how the custom function works: it subtracts from each month the value of the preceding month. For consecutive months the result is 1, and this has to happen twice in a row. For example, for the list [5, 6, 7], 7 - 6 = 1 and 6 - 5 = 1, so 1 appears twice and the condition is fulfilled.
In [80]:
df.reset_index(inplace=True)
In [281]:
df['month'] = df.Datetime.dt.month
df['year'] = df.Datetime.dt.year
df
Out[281]:
Datetime U month year
0 2015-01-01 20:00:00 1 1 2015
1 2015-02-01 20:05:00 1 2 2015
2 2015-04-01 21:00:00 1 4 2015
3 2015-05-01 22:00:00 1 5 2015
4 2015-07-01 22:05:00 1 7 2015
5 2015-08-01 20:00:00 2 8 2015
6 2015-09-01 21:00:00 2 9 2015
7 2014-01-01 23:00:00 2 1 2014
8 2014-02-01 22:05:00 2 2 2014
9 2015-01-01 20:00:00 2 1 2015
10 2014-03-01 21:00:00 2 3 2014
11 2015-10-01 20:00:00 3 10 2015
12 2015-11-01 21:00:00 3 11 2015
13 2015-12-01 23:00:00 3 12 2015
14 2015-01-01 22:05:00 3 1 2015
15 2015-02-01 20:00:00 3 2 2015
16 2015-05-01 21:00:00 3 5 2015
17 2014-01-01 20:00:00 3 1 2014
18 2014-02-01 21:00:00 3 2 2014
In [284]:
g = df.groupby([df['U'] , df.year])
In [86]:
res = g.filter(lambda x : is_at_least_three_consec(x['month'].diff().values.tolist()))
res
Out[86]:
Datetime U month year
7 2014-01-01 23:00:00 2 1 2014
8 2014-02-01 22:05:00 2 2 2014
10 2014-03-01 21:00:00 2 3 2014
11 2015-10-01 20:00:00 3 10 2015
12 2015-11-01 21:00:00 3 11 2015
13 2015-12-01 23:00:00 3 12 2015
14 2015-01-01 22:05:00 3 1 2015
15 2015-02-01 20:00:00 3 2 2015
16 2015-05-01 21:00:00 3 5 2015
If you want to see the result of the custom function for each group:
In [84]:
res = g['month'].agg(lambda x : is_at_least_three_consec(x.diff().values.tolist()))
res
Out[84]:
U year
1 2015 False
2 2014 True
2015 False
3 2014 False
2015 True
Name: month, dtype: bool
This is how the custom function is implemented:
In [53]:
def is_at_least_three_consec(month_diff):
    consec_count = 0
    #print(month_diff)
    for index , val in enumerate(month_diff):
        if index != 0 and val == 1:
            consec_count += 1
            if consec_count == 2:
                return True
        else:
            consec_count = 0
    return False
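A possible variation, sketched under assumptions rather than taken from the answer above: number the months as year * 12 + month so that a December to January step also counts as consecutive, and sort the months per user before looking for runs. The function name users_with_consecutive_months and the min_run parameter are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "U": [1] * 5 + [2] * 6 + [3] * 8,
    "Datetime": pd.to_datetime([
        "2015-01-01", "2015-02-01", "2015-04-01", "2015-05-01", "2015-07-01",
        "2015-08-01", "2015-09-01", "2014-01-01", "2014-02-01", "2015-01-01", "2014-03-01",
        "2015-10-01", "2015-11-01", "2015-12-01", "2015-01-01", "2015-02-01",
        "2015-05-01", "2014-01-01", "2014-02-01",
    ]),
})

def users_with_consecutive_months(frame, min_run=3):
    """Return the U values that have at least min_run consecutive calendar months."""
    # Month number that keeps counting across years: Dec 2014 -> Jan 2015 is +1.
    month_no = frame["Datetime"].dt.year * 12 + frame["Datetime"].dt.month
    keep = []
    for user, months in month_no.groupby(frame["U"]):
        ordered = months.drop_duplicates().sort_values()
        # A new run starts wherever the gap to the previous month is not exactly 1.
        run_id = (ordered.diff() != 1).cumsum()
        if run_id.value_counts().max() >= min_run:
            keep.append(user)
    return keep

selected = users_with_consecutive_months(df)
print(selected)
# Keep only the rows of the selected users in the original frame.
filtered = df[df["U"].isin(selected)]
```

Unlike the per-year grouping above, this keeps all rows of a qualifying user rather than only the rows of the qualifying year.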
I have a column in a pandas data frame looking like:
test1.Received
Out[9]:
0 01/01/2015 17:25
1 02/01/2015 11:43
2 04/01/2015 18:21
3 07/01/2015 16:17
4 12/01/2015 20:12
5 14/01/2015 11:09
6 15/01/2015 16:05
7 16/01/2015 21:02
8 26/01/2015 03:00
9 27/01/2015 08:32
10 30/01/2015 11:52
This represents a time stamp as Day Month Year Hour Minute. I would like to rearrange the date as Year Month Day Hour Minute. So that it would look like:
test1.Received
Out[9]:
0 2015/01/01 17:25
1 2015/01/02 11:43
...
Just use pd.to_datetime:
In [33]:
import pandas as pd
pd.to_datetime(df['date'])
Out[33]:
index
0 2015-01-01 17:25:00
1 2015-02-01 11:43:00
2 2015-04-01 18:21:00
3 2015-07-01 16:17:00
4 2015-12-01 20:12:00
5 2015-01-14 11:09:00
6 2015-01-15 16:05:00
7 2015-01-16 21:02:00
8 2015-01-26 03:00:00
9 2015-01-27 08:32:00
10 2015-01-30 11:52:00
Name: date, dtype: datetime64[ns]
In your case:
pd.to_datetime(test1['Received'])
should just work
If you want to change the display format then you need to parse as a datetime and then apply datetime.strftime:
In [35]:
import datetime as dt
pd.to_datetime(df['date']).apply(lambda x: dt.datetime.strftime(x, '%m/%d/%y %H:%M:%S'))
Out[35]:
index
0 01/01/15 17:25:00
1 02/01/15 11:43:00
2 04/01/15 18:21:00
3 07/01/15 16:17:00
4 12/01/15 20:12:00
5 01/14/15 11:09:00
6 01/15/15 16:05:00
7 01/16/15 21:02:00
8 01/26/15 03:00:00
9 01/27/15 08:32:00
10 01/30/15 11:52:00
Name: date, dtype: object
The above now shows month/day/year. In your case the following should work:
pd.to_datetime(test1['Received']).apply(lambda x: dt.datetime.strftime(x, '%Y/%m/%d %H:%M'))
EDIT
It looks like you need to pass the parameter dayfirst=True to to_datetime:
In [45]:
pd.to_datetime(df['date'], dayfirst=True).apply(lambda x: dt.datetime.strftime(x, '%m/%d/%y %H:%M:%S'))
Out[45]:
index
0 01/01/15 17:25:00
1 01/02/15 11:43:00
2 01/04/15 18:21:00
3 01/07/15 16:17:00
4 01/12/15 20:12:00
5 01/14/15 11:09:00
6 01/15/15 16:05:00
7 01/16/15 21:02:00
8 01/26/15 03:00:00
9 01/27/15 08:32:00
10 01/30/15 11:52:00
Name: date, dtype: object
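Putting the pieces together for the format in the question, here is a small sketch that parses with an explicit day-first format and writes the year first. The variable names received, parsed and formatted are illustrative, and only four of the question's rows are used.

```python
import pandas as pd

received = pd.Series([
    "01/01/2015 17:25", "02/01/2015 11:43",
    "12/01/2015 20:12", "14/01/2015 11:09",
])

# An explicit format removes any day/month ambiguity.
parsed = pd.to_datetime(received, format="%d/%m/%Y %H:%M")

# Vectorised formatting back to strings, year first.
formatted = parsed.dt.strftime("%Y/%m/%d %H:%M")
print(formatted.tolist())
```

Series.dt.strftime does the same job as the apply/lambda call without a Python-level loop.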
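pandas has this built in. You can specify your datetime format explicitly; see
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html
Alternatively, use infer_datetime_format (deprecated since pandas 2.0, where a consistent format is inferred by default).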
>>> import pandas as pd
>>> i = pd.date_range('20000101',periods=100)
>>> df = pd.DataFrame(dict(year = i.year, month = i.month, day = i.day))
>>> pd.to_datetime(df.year*10000 + df.month*100 + df.day, format='%Y%m%d')
0 2000-01-01
1 2000-01-02
...
98 2000-04-08
99 2000-04-09
Length: 100, dtype: datetime64[ns]
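As a side note to the integer trick above, to_datetime can also assemble dates directly from a DataFrame whose columns are named year, month and day. A minimal sketch, reusing the same sample data:

```python
import pandas as pd

i = pd.date_range("2000-01-01", periods=100)
df = pd.DataFrame({"year": i.year, "month": i.month, "day": i.day})

# to_datetime assembles dates from columns named year, month and day.
dates = pd.to_datetime(df[["year", "month", "day"]])
print(dates.iloc[-1])
```

This avoids the arithmetic on the components and the matching format string.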
You can use the datetime functions to convert from and to strings.
from datetime import datetime
# converts to a datetime (date_string is day/month/year, e.g. '01/01/2015 17:25')
parsed = datetime.strptime(date_string, '%d/%m/%Y %H:%M')
and
# converts to your requested string format
parsed.strftime('%Y/%m/%d %H:%M')