I have some monthly data with a date column in the format: YYYY.fractional month. For example:
0 1960.500
1 1960.583
2 1960.667
3 1960.750
4 1960.833
5 1960.917
Here the first index is June 1960 (6/12 = .5), the second is July 1960 (7/12 = .583), and so on.
The answers in this question don't seem to apply well, though I feel like pd.to_datetime should be able to help somehow. Obviously I can use a map to split this into components and build a datetime, but I'm hoping for a faster and more rigorous method since the data is large.
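For reference, here is a minimal setup that the answers below assume (the column name 'date' is an assumption, not given in the question):
import pandas as pd

# Sample data from the question, stored as a float column
df = pd.DataFrame({'date': [1960.500, 1960.583, 1960.667, 1960.750, 1960.833, 1960.917]})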
I think you need a bit of math:
a = df['date'].astype(int)
print (a)
0 1960
1 1960
2 1960
3 1960
4 1960
5 1960
Name: date, dtype: int32
b = df['date'].sub(a).add(1/12).mul(12).round(0).astype(int)  # fraction -> month number (January is 1960.0, hence the +1/12)
print (b)
0 7
1 8
2 9
3 10
4 11
5 12
Name: date, dtype: int32
c = pd.to_datetime(a.astype(str) + '.' + b.astype(str), format='%Y.%m')
print (c)
0 1960-07-01
1 1960-08-01
2 1960-09-01
3 1960-10-01
4 1960-11-01
5 1960-12-01
Name: date, dtype: datetime64[ns]
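The same result can also be assembled in one step from the year and month components via a PeriodIndex; this is just a sketch under the same rounding assumptions:
years = df['date'].astype(int)
months = df['date'].sub(years).add(1/12).mul(12).round(0).astype(int)
c = pd.PeriodIndex(year=years, month=months, freq='M').to_timestamp()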
Solution with map:
d = {'500':'7','583':'8','667':'9','750':'10','833':'11','917':'12'}
#if necessary
#df['date'] = df['date'].astype(str)
a = df['date'].str[:4]
b = df['date'].str[5:].map(d)
c = pd.to_datetime(a + '.' + b, format='%Y.%m')
print (c)
0 1960-07-01
1 1960-08-01
2 1960-09-01
3 1960-10-01
4 1960-11-01
5 1960-12-01
Name: date, dtype: datetime64[ns]
For future reference, here's the map I was using before. I actually made a mistake in the question; the data is set so that January 1960 is 1960.0, which means 1/12 must be added to each fractional component.
import datetime

def date_conv(d):
    y, frac_m = str(d).split('.')
    y = int(y)
    m = int(round((float('0.{}'.format(frac_m)) + 1/12) * 12, 0))
    d = 1
    try:
        date = datetime.datetime(year=y, month=m, day=d)
    except ValueError:
        print(y, m, frac_m)
        raise
    return date

dates_series = dates_series.map(date_conv)
The try/except block was just something I added for troubleshooting while writing it.
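For what it's worth, the same +1/12 correction can be applied without a Python-level map, along the lines of the accepted answer (a sketch; dates_series is assumed to hold the raw floats):
years = dates_series.astype(int)
months = dates_series.sub(years).add(1/12).mul(12).round(0).astype(int)
dates_series = pd.to_datetime(pd.DataFrame({'year': years, 'month': months, 'day': 1}))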
Related
There is an Excel file logging a set of data. Its columns are as below, where each column is separated by a comma.
SampleData
year,date,month,location,time,count
2019,20,Jan,Japan,22:33,1
2019,31,Jan,Japan,19:21,1
2019,1,Jan,Japan,8:00,1
2019,4,Jan,Japan,4:28,2
2019,13,Feb,Japan,6:19,1
From this data, I would like to create python pandas dataframe, which looks like below.
DataFrame
u_datetime,location,count
1547991180,Japan,1
1548930060,Japan,1
1546297200,Japan,1
1546543680,Japan,2
1550006340,Japan,1
One of the DataFrame methods (pandas.to_datetime) looks useful for this operation, but my attempt below does not accept a date with only one digit.
pandas.to_datetime(
DataFrame["year"].astype(str)
+ DataFrame["month"].astype(str)
+ DataFrame["date"].astype(str)
+ DataFrame["time"].astype(str),
format="%Y%b%d%-H%M"
)
Could anybody give me a hand?
Thank you.
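For reference, a minimal setup matching the sample (reading the CSV text inline; file paths in the answers below are placeholders):
import pandas as pd
from io import StringIO

csv_text = StringIO('''year,date,month,location,time,count
2019,20,Jan,Japan,22:33,1
2019,31,Jan,Japan,19:21,1
2019,1,Jan,Japan,8:00,1
2019,4,Jan,Japan,4:28,2
2019,13,Feb,Japan,6:19,1''')
data = pd.read_csv(csv_text)
DataFrame = data  # the question uses the name 'DataFrame' for this frame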
Try this:
from datetime import datetime

data['datetime'] = data[['year', 'date', 'month', 'time']].apply(
    lambda x: datetime.strptime(
        '{}-{}-{} {}'.format(x['year'], x['date'], x['month'], x['time']),
        '%Y-%d-%b %H:%M'
    ).timestamp(),
    axis=1
)
data[['datetime', 'location', 'count']]
Output
       datetime location  count
0  1548003780.0    Japan      1
1  1548942660.0    Japan      1
2  1546309800.0    Japan      1
3  1546556280.0    Japan      2
4  1550018940.0    Japan      1
In case you are working with a CSV file, this can be done easily using parse_dates. Columns 0, 1, 2 and 4 (year, date, month, time) are merged into one string like '2019 20 Jan 22:33' before parsing, hence the format below:
import numpy as np
import pandas as pd
from datetime import datetime

dateparse = lambda x: datetime.strptime(x, '%Y %d %b %H:%M')
df = pd.read_csv('/home/users/user/xxx.csv',
                 parse_dates={'date_time': [0, 1, 2, 4]},
                 date_parser=dateparse)
df['u_datetime'] = df['date_time'].values.astype(np.int64) // 10 ** 9
df_new = df[['u_datetime', 'location', 'count']]
You are close; you need the %Y%b%d%H:%M format, and then convert to Unix time by casting to int64 with integer division by 10**9:
import numpy as np

s = (DataFrame["year"].astype(str) +
     DataFrame["month"].astype(str) +
     DataFrame["date"].astype(str) +
     DataFrame["time"].astype(str))
DataFrame['u_datetime'] = pd.to_datetime(s, format="%Y%b%d%H:%M").astype(np.int64) // 10**9
DataFrame = DataFrame[['u_datetime','location','count']]
print (DataFrame)
u_datetime location count
0 1548023580 Japan 1
1 1548962460 Japan 1
2 1546329600 Japan 1
3 1546576080 Japan 2
4 1550038740 Japan 1
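As a side note, the Unix values here differ from the .timestamp() answer above because datetime.timestamp() interprets naive datetimes in the machine's local timezone, while the int64 cast treats them as UTC. To pin the epoch values to UTC explicitly (a sketch):
DataFrame['u_datetime'] = pd.to_datetime(s, format="%Y%b%d%H:%M", utc=True).astype(np.int64) // 10**9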
I am a beginner with Python, so my questions may come across as trivial. I would appreciate your support or any leads on my problem.
Problem:
There are about 10 different states. An order moves across different states, and a timestamp is generated when a state ends. For example, below there are four states A, B, C, D.
A 10 AM
B 1 PM
C 4 Pm
D 5 PM
Time spent in B = 1 PM - 10 AM = 3 hours.
Sometimes the same state can occur multiple times, so we need a variable to store the time-difference value for a single state.
Attached are the raw data CSV and my code so far. There are multiple orders for which this calculation needs to be performed; however, for simplicity, I have data for just one order now.
sample data:
Order States modified_at
1 Resolved 2018-06-18T15:05:52.2460000
1 Edited 2018-05-24T21:44:07.9030000
1 Pending PO Creation 2018-06-06T19:52:51.5990000
1 Assigned 2018-05-24T17:46:03.2090000
1 Edited 2018-06-04T15:02:57.5130000
1 Draft 2018-05-24T17:45:07.9960000
1 PO Placed 2018-06-06T20:49:37.6540000
1 Edited 2018-06-04T11:18:13.9830000
1 Edited 2018-05-24T17:45:39.4680000
1 Pending Approval 2018-05-24T21:48:23.9180000
1 Edited 2018-06-06T21:00:19.6350000
1 Submitted 2018-05-24T21:44:37.8830000
1 Edited 2018-05-30T11:19:36.5460000
1 Edited 2018-05-25T11:16:07.9690000
1 Edited 2018-05-24T21:43:35.0770000
1 Assigned 2018-06-07T18:39:00.2580000
1 Pending Review 2018-05-24T17:45:10.5980000
1 Pending PO Submission 2018-06-06T14:16:26.6580000
Code I tried:
import pandas as pd
import datetime
from dateutil.relativedelta import relativedelta
fileName = "SamplePR.csv"
df = pd.read_csv(fileName, delimiter=',')
df['modified_at'] = pd.to_datetime(df.modified_at)
df = df.sort_values(by='modified_at')
df = df.reset_index(drop=True)
df1 = df[:-1]
df2 = df[1:]
dfm1 = df1['modified_at']
dfm2 = df2['modified_at']
dfm1 = dfm1.reset_index(drop=True)
dfm2 = dfm2.reset_index(drop=True)
for i in range(len(df)-1):
    start = datetime.datetime.strptime(str(dfm1[i]), '%Y-%m-%d %H:%M:%S')
    ends = datetime.datetime.strptime(str(dfm2[i]), '%Y-%m-%d %H:%M:%S')
    diff = relativedelta(ends, start)
    print(diff)
So far, I have tried to sort the list by time and then calculate the difference between two states. I would really appreciate it if someone could help with the logic or point me in the right direction.
You can use diff from pandas to get the difference between two rows. Here is a sample:
In [1]: import pandas as pd
In [2]: from io import StringIO
In [3]: data = StringIO('''Order,States,modified_at
...: 1,Resolved,2018-06-18T15:05:52.2460000
...: 1,Edited,2018-05-24T21:44:07.9030000
...: 1,Pending PO Creation,2018-06-06T19:52:51.5990000
...: ''')
In [4]: df = pd.read_csv(data, sep=',')
In [5]: df['modified_at'] = pd.to_datetime(df['modified_at']) #convert the type to datetime
In [6]: df
Out[6]:
Order States modified_at
0 1 Resolved 2018-06-18 15:05:52.246
1 1 Edited 2018-05-24 21:44:07.903
2 1 Pending PO Creation 2018-06-06 19:52:51.599
In [7]: df['diff'] = df['modified_at'].diff() #get the diff and add to a new column
In [8]: df
Out[8]:
Order States modified_at diff
0 1 Resolved 2018-06-18 15:05:52.246 NaT
1 1 Edited 2018-05-24 21:44:07.903 -25 days +06:38:15.657000
2 1 Pending PO Creation 2018-06-06 19:52:51.599 12 days 22:08:43.696000
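Note that the sample rows are not in time order, which is why the first diff above is negative; sorting first (as the question already does) gives forward-looking durations:
df = df.sort_values('modified_at').reset_index(drop=True)
df['diff'] = df['modified_at'].diff()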
Welcome visal. If your intention is just to check the time difference between timestamps, use to_datetime to convert the column and take differences by shifting:
index Order States modified_at
0 0 1 Resolved 2018-06-18 15:05:52.246
1 1 1 Edited 2018-05-24 21:44:07.903
2 0 1 Edited 2018-06-06 21:00:19.635
3 1 1 Submitted 2018-05-24 21:44:37.883
4 2 1 Edited 2018-05-30 11:19:36.546
5 3 1 Edited 2018-05-25 11:16:07.969
6 4 1 Edited 2018-05-24 21:43:35.077
7 5 1 Assigned 2018-06-07 18:39:00.258
df.modified_at = pd.to_datetime(df.modified_at)
df['time_spent'] = df.modified_at - df.modified_at.shift()
Out:
0 NaT
1 -25 days +06:38:15.657000
2 12 days 23:16:11.732000
3 -13 days +00:44:18.248000
4 5 days 13:34:58.663000
5 -6 days +23:56:31.423000
6 -1 days +10:27:27.108000
7 13 days 20:55:25.181000
Name: modified_at, dtype: timedelta64[ns]
You can use a pivot table for this requirement:
import numpy as np

df.time_spent = df.time_spent.dt.seconds
pd.pivot_table(df, values='time_spent', index=['Order'], columns=['States'], aggfunc=np.sum)
Out:
States Assigned Edited Resolved Submitted
Order
0 NaN 83771.0 0.0 NaN
1 NaN 23895.0 NaN 2658.0
2 NaN 48898.0 NaN NaN
3 NaN 86191.0 NaN NaN
4 NaN 37647.0 NaN NaN
5 75325.0 NaN NaN NaN
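One caveat on the conversion above: .dt.seconds only returns the within-day seconds component of a timedelta, so any span longer than 24 hours silently loses its day part. If whole durations are wanted, .dt.total_seconds() is the safer choice (a sketch):
import numpy as np

df.time_spent = df.time_spent.dt.total_seconds()
pd.pivot_table(df, values='time_spent', index=['Order'], columns=['States'], aggfunc=np.sum)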
I have a dataframe with records spanning multiple years:
WarName | StartDate | EndDate
---------------------------------------------
'fakewar1' 01-01-1990 02-02-1995
'examplewar' 05-01-1990 03-07-1998
(...)
'examplewar2' 05-07-1999 06-09-2002
I am trying to convert this dataframe to a summary overview of the total wars per year, e.g.:
Year | Number_of_wars
----------------------------
1989 0
1990 2
1991 2
1992 3
1994 2
Usually I would use something like df.groupby('year').count() to get total wars by year, but since I am currently working with ranges instead of set dates, that approach wouldn't work.
I am currently writing a function that generates a list of years, and then for each year in the list checks each row in the dataframe and runs a function that checks if the year is within the date-range of that row (returning True if that is the case).
years = range(1816, 2006)
year_dict = {}
for year in years:
    for index, row in df.iterrows():
        in_range = year_in_range(year, row)
        if in_range:
            year_dict[year] = year_dict.get(year, 0) + 1
This works, but it also seems extremely convoluted. So I was wondering, what am I missing? What would be the canonical 'pandas way' to solve this issue?
Use a comprehension with pd.value_counts
pd.value_counts([
d.year for s, e in zip(df.StartDate, df.EndDate)
for d in pd.date_range(s, e, freq='Y')
]).sort_index()
1990 2
1991 2
1992 2
1993 2
1994 2
1995 1
1996 1
1997 1
1999 1
2000 1
2001 1
dtype: int64
Alternate
from functools import reduce
def r(t):
return pd.date_range(t.StartDate, t.EndDate, freq='Y')
pd.value_counts(reduce(pd.Index.append, map(r, df.itertuples())).year).sort_index()
Setup
df = pd.DataFrame(dict(
WarName=['fakewar1', 'examplewar', 'feuxwar2'],
StartDate=pd.to_datetime(['01-01-1990', '05-01-1990', '05-07-1999']),
EndDate=pd.to_datetime(['02-02-1995', '03-07-1998', '06-09-2002'])
), columns=['WarName', 'StartDate', 'EndDate'])
df
WarName StartDate EndDate
0 fakewar1 1990-01-01 1995-02-02
1 examplewar 1990-05-01 1998-03-07
2 feuxwar2 1999-05-07 2002-06-09
By using np.unique:
x, y = np.unique(
    sum([list(range(s.year, e.year)) for s, e in zip(df.StartDate, df.EndDate)], []),
    return_counts=True)
pd.Series(dict(zip(x, y)))
Out[222]:
1990 2
1991 2
1992 2
1993 2
1994 2
1995 1
1996 1
1997 1
1999 1
2000 1
2001 1
dtype: int64
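The sum(..., []) flattening above is quadratic in the number of wars; an equivalent sketch with itertools.chain avoids that:
from itertools import chain

flat = list(chain.from_iterable(
    range(s.year, e.year) for s, e in zip(df.StartDate, df.EndDate)))
x, y = np.unique(flat, return_counts=True)
pd.Series(dict(zip(x, y)))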
The other answers with pandas are far preferable, but the native Python answer you showed didn't have to be so convoluted; just instantiate and directly index into an array:
wars = [0] * 191  # max(df['EndDate']).year - min(df['StartDate']).year + 1
yr_offset = 1816  # min(df['StartDate']).year
for _, row in df.iterrows():
    for yr in range(row['StartDate'].year - yr_offset, row['EndDate'].year - yr_offset):  # or maybe (year+1)
        wars[yr] += 1
I need to calculate the compound interest rate. So, let's say I have a DataFrame like this:
days
1 10
2 15
3 20
What I want to get is this (suppose the interest rate is 1% every day):
   days  interest rate
1    10         10.46%
2    15         16.10%
3    20         22.02%
My code is as follows:
def inclusao_juros(x):
    dias = df_arrumada_4['Prazo Medio']
    return ((1.0009723)^dias) - 1

df_arrumada_4['juros_acumulado'] = df_arrumada_4['Prazo Medio'].apply(inclusao_juros)
What should I do? Thanks!
I think you need numpy.power:
import numpy as np

df['new'] = np.power(1.01, df['days']) - 1
print (df)
days new
1 10 0.104622
2 15 0.160969
3 20 0.220190
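Applied to the question's own column and daily factor (the names and the 1.0009723 factor are taken from the question), this would look like:
import numpy as np

df_arrumada_4['juros_acumulado'] = np.power(1.0009723, df_arrumada_4['Prazo Medio']) - 1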
IIUC
pd.Series([1.01]*len(df)).pow(df.reset_index().days,0).sub(1)
Out[695]:
0 0.104622
1 0.160969
2 0.220190
dtype: float64
jezrael's version: pd.Series([1.01]*len(df), index=df.index).pow(df.days, 0).sub(1)
Or using your apply
df.days.apply(lambda x: 1.01**x -1)
Out[697]:
1 0.104622
2 0.160969
3 0.220190
Name: days, dtype: float64
I have a table that has multiple subgroups. For example, person A has a total of three visits and person B has a total of two visits. I also have the time of each visit:
id visit time_of_visit
A 1 2002-01-15
A 2 2003-01-15
A 3 2003-02-15
B 1 1996-08-09
B 2 1998-08-09
I want to compute how far apart consecutive visits are, in years, for each person. So I want something like this:
id visit time_of_visit difference_in_time
A 1 2002-01-15 na
A 2 2003-01-15 1
A 3 2003-02-15 0.0833
B 1 1996-08-09 na
B 2 1998-08-09 2
Any ideas how to do this in python pandas? Thanks!
groupby.diff on a datetime column will give you the differences as timedeltas:
df['time_of_visit'] = pd.to_datetime(df['time_of_visit'])
df.groupby('id')['time_of_visit'].diff()
Out:
0 NaT
1 365 days
2 31 days
3 NaT
4 730 days
Name: time_of_visit, dtype: timedelta64[ns]
However, timedeltas cannot give you years, as a year is not a standard, fixed-length measure. You can always convert by your own rules, of course (for example, divide by 365):
df.groupby('id')['time_of_visit'].diff().dt.days / 365
Out:
0 NaN
1 1.000000
2 0.084932
3 NaN
4 2.000000
Name: time_of_visit, dtype: float64
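Equivalently, with total_seconds and the same 365-day-year convention (a sketch, not a calendar-exact year):
df.groupby('id')['time_of_visit'].diff().dt.total_seconds() / (365 * 24 * 3600)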