Pandas: convert YEARMODA to datetimelike values - python

I have the following dataframe:
YEARMODA TEMP MAX MIN
0 19730701 74.5 90.0 53.6
1 19730702 74.5 88.9 57.9
2 19730703 81.7 95.0 63.0
3 19730704 85.0 95.0 65.8
4 19730705 85.0 97.9 63.9
How do I convert YEARMODA to something datetimelike? I want to get the average and standard deviation of TEMP by year and by month. I know how to use groupby; it's just working with YEARMODA that is the problem.

Here are two ways to solve this; take your pick:
df['YEARMODA'] = pd.to_datetime(df['YEARMODA'], format='%Y%m%d')
YEARMODA TEMP MAX MIN
0 1973-07-01 74.5 90.0 53.6
1 1973-07-02 74.5 88.9 57.9
2 1973-07-03 81.7 95.0 63.0
3 1973-07-04 85.0 95.0 65.8
4 1973-07-05 85.0 97.9 63.9
--------------------------------------------------------------------
from functools import partial
p = partial(pd.to_datetime, format='%Y%m%d')
df['YEARMODA'] = df['YEARMODA'].apply(p)
YEARMODA TEMP MAX MIN
0 1973-07-01 74.5 90.0 53.6
1 1973-07-02 74.5 88.9 57.9
2 1973-07-03 81.7 95.0 63.0
3 1973-07-04 85.0 95.0 65.8
4 1973-07-05 85.0 97.9 63.9
Edit: The issue you are having is that you are not providing the correct format to your pd.to_datetime call, hence it is failing.
Edit 2: To get the standard deviation by month the way you want, you would do it as such:
df.groupby(df.YEARMODA.apply(p).dt.strftime('%B')).TEMP.std()
YEARMODA
July 5.321936
Name: TEMP, dtype: float64
df.assign(temp=pd.to_datetime(df['YEARMODA'], format='%Y%m%d')
                 .dt
                 .strftime('%B')) \
  .groupby('temp') \
  .TEMP \
  .std()
temp
July 5.321936
Name: TEMP, dtype: float64
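Since the question also asks for the mean and standard deviation of TEMP by year and by month, here is a minimal sketch, assuming df['YEARMODA'] has already been converted with the first snippet above:
# Group TEMP by year and month, then aggregate both statistics at once.
stats = (df.groupby([df['YEARMODA'].dt.year, df['YEARMODA'].dt.month])['TEMP']
           .agg(['mean', 'std']))
print(stats)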

Related

Get data for a given time range in Python when the timestamps are not regular

time a b
2021-05-23 22:06:54 10.4 70.1
2021-05-23 22:21:41 10.7 68.3
2021-05-23 22:36:28 10.4 69.4
2021-05-23 22:51:15 9.9 71.7
2021-05-23 23:06:02 9.5 73.1
... ... ...
2021-11-19 08:18:31 19.8 43.0
2021-11-19 08:20:04 21.0 42.0
2021-11-19 08:21:25 35.5 20.0
2021-11-19 08:21:32 19.8 43.0
2021-11-19 08:23:05 21.0 42.0
Here time is in the index, not a column.
When I do df.between_time("2021-11-17 08:15:00", "2021-11-19 08:00:00"),
it throws the error ValueError: Cannot convert arg ['2021-11-17 08:15:00'] to a time.
The dataframe does not have regular timestamps.
What I want to do: when I pass a time range or date range, I want to get all the data between the given times.
Thanks
Use truncate (between_time filters by time of day only, which is why passing full timestamps raises that error):
>>> df.truncate("2021-05-23 23:00:00", "2021-11-19 08:20:00")
a b
time
2021-05-23 23:06:02 9.5 73.1
2021-11-19 08:18:31 19.8 43.0
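If the index is a sorted DatetimeIndex, plain label slicing with .loc gives the same result; a minimal sketch:
# Label slicing on a monotonic DatetimeIndex selects the closed range.
df.loc["2021-05-23 23:00:00":"2021-11-19 08:20:00"]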

30-day distance between dates in a datetime64[ns] column

I have data of the following form:
6460 2001-07-24 00:00:00 67.5 75.1 75.9 71.0 75.2 81.8
6490 2001-06-24 00:00:00 68.4 74.9 76.1 70.9 75.5 82.7
6520 2001-05-25 00:00:00 69.6 74.7 76.3 70.8 75.5 83.2
6550 2001-04-25 00:00:00 69.2 74.6 76.1 70.6 75.0 83.1
6580 2001-03-26 00:00:00 69.1 74.4 75.9 70.5 74.3 82.8
6610 2001-02-24 00:00:00 69.0 74.0 75.3 69.8 73.8 81.9
6640 2001-01-25 00:00:00 68.9 73.9 74.6 69.7 73.5 80.0
6670 2000-12-26 00:00:00 69.0 73.5 75.0 69.5 72.6 81.8
6700 2000-11-26 00:00:00 69.8 73.2 75.1 69.5 72.0 82.7
6730 2000-10-27 00:00:00 70.3 73.1 75.0 69.4 71.3 82.6
6760 2000-09-27 00:00:00 69.4 73.0 74.8 69.4 71.0 82.3
6790 2000-08-28 00:00:00 69.6 72.8 74.6 69.2 70.7 81.9
6820 2000-07-29 00:00:00 67.8 72.9 74.4 69.1 70.6 81.8
I want all the dates to have a 30-day difference between each other. I know how to set a specific day or month on a datetime object with something like:
ndfd = ndf['Date'].astype('datetime64[ns]')
ndfd = ndfd.apply(lambda dt: dt.replace(day=15))
But this does not take into account the difference in days from month to month.
How can I ensure there is a consistent step in days from month to month in my data, given that I am able to change the day as long as it remains in the same month?
You could use date_range:
df['date'] = pd.date_range(start=df['date'].iloc[0], periods=len(df), freq='30D')
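Note that the sample data runs in descending date order; date_range also accepts a negative frequency if the steps should go backwards. A minimal sketch, assuming the first row holds the latest date:
# Step backwards 30 days at a time from the first row's date.
df['date'] = pd.date_range(start=df['date'].iloc[0], periods=len(df), freq='-30D')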
IIUC you could change your date column like this:
import datetime
a = df.iloc[0,0] # first date, assuming date col is first
df['date'] = [a + datetime.timedelta(days=30 * i) for i in range(len(df))]
I haven't tested this, so I'm not sure it works as smoothly as I thought it would =).
You can convert your first day to an ordinal, add 30 * i to it, and then convert it back:
import datetime
first_day = df.iloc[0]['date_column'].toordinal()
df['date'] = [datetime.date.fromordinal(first_day + 30 * i) for i in range(len(df))]

Selecting dates of a dataframe and looping over a date index

I have a dataframe with dates from 2006 to 2016, with seven corresponding values for each date.
The data is like below:
H PS T RH TD WDIR WSP
date
2006-01-01 11:28:00 38 988.6 0.9 98.0 0.6 120.0 14.4
2006-01-01 11:28:00 46 987.6 0.5 91.0 -0.7 122.0 15.0
2006-01-01 11:28:00 57 986.3 0.5 89.0 -1.1 124.0 15.5
2006-01-01 11:28:00 66 985.1 0.5 90.0 -1.1 126.0 16.0
2006-01-01 11:28:00 74 984.1 0.4 90.0 -1.1 127.0 16.5
2006-01-01 11:28:00 81 983.3 0.4 90.0 -1.1 129.0 17.0
I want to select a few columns for each year (for example, T and RH for all of 2006). So, for each year from 2006 to 2016, I want to select a bunch of columns, then write each of the new dataframes to its own file.
I did the following:
df_H_T=(df[['RH','T']])
mask = (df_H_T['date'] >'2016-01-01 00:00:00') & (df_H_T['date'] <='2016-12-31 23:59:59')
df_H_T_2006 =df.loc[mask]
print(df_H_T_2006.head(20))
print(df_H_T_2006.tail(20))
But it is not working: it seems it doesn't know what 'date' is, yet when I print the head of the dataframe, date seems to be there. What am I doing wrong?
My second question is: how can I put this in a loop over the year variable, so that I don't write each new dataframe by hand, selecting one year at a time up to 2016? (I'm new and have never used loops in Python.)
Thanks,
Ioana
date is in the original dataframe, but then you take df_H_T = df[['RH','T']], so date isn't in df_H_T. You can use masks generated from one dataframe to slice another as long as they have the same index, so you can do:
mask = (df['date'] >'2016-01-01 00:00:00') & (df['date'] <='2016-12-31 23:59:59')
df_H_T_2006 =df_H_T.loc[mask]
(Note: you're applying the mask to df, but presumably you want to apply it to df_H_T.)
If date is in datetime format, you can just take df['date'].apply(lambda x: x.year == 2016). For your for-loop, it would be:
df_H_T = df[['RH', 'T']]
for year in range(2006, 2017):  # 2006 through 2016
    mask = df['date'].apply(lambda x: x.year == year)
    df_H_T_cur_year = df_H_T.loc[mask]
    print(df_H_T_cur_year.head(20))
    print(df_H_T_cur_year.tail(20))
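To also write each year's selection to its own file, groupby on the year does the looping for you. A sketch, assuming date is a datetime column; the filename pattern is only illustrative:
# One CSV per year; T_RH_<year>.csv is a hypothetical naming scheme.
for year, group in df.groupby(df['date'].dt.year):
    group[['RH', 'T']].to_csv('T_RH_{}.csv'.format(year))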

How to assign values to a dataframe column by comparing values in another dataframe

I have two data frames. One has rows for every five minutes in a day:
df
TIMESTAMP TEMP
1 2011-06-01 00:05:00 24.5
200 2011-06-01 16:40:00 32.0
1000 2011-06-04 11:20:00 30.2
5000 2011-06-18 08:40:00 28.4
10000 2011-07-05 17:20:00 39.4
15000 2011-07-23 02:00:00 29.3
20000 2011-08-09 10:40:00 29.5
30656 2011-09-15 10:40:00 13.8
I have another dataframe that ranks the days:
ranked
TEMP DATE RANK
62 43.3 2011-08-02 1.0
63 43.1 2011-08-03 2.0
65 43.1 2011-08-05 3.0
38 43.0 2011-07-09 4.0
66 42.8 2011-08-06 5.0
64 42.5 2011-08-04 6.0
84 42.2 2011-08-24 7.0
56 42.1 2011-07-27 8.0
61 42.1 2011-08-01 9.0
68 42.0 2011-08-08 10.0
Both the columns TIMESTAMP and DATE are datetime datatypes (dtype returns dtype('M8[ns]')).
What I want to be able to do is add a column to the dataframe df and fill it with each row's rank, looked up from ranked by the day of the TIMESTAMP (so within a day, all the 5-minute timesteps have the same rank).
So, the final result would look something like this:
df
TIMESTAMP TEMP RANK
1 2011-06-01 00:05:00 24.5 98.0
200 2011-06-01 16:40:00 32.0 98.0
1000 2011-06-04 11:20:00 30.2 96.0
5000 2011-06-18 08:40:00 28.4 50.0
10000 2011-07-05 17:20:00 39.4 9.0
15000 2011-07-23 02:00:00 29.3 45.0
20000 2011-08-09 10:40:00 29.5 40.0
30656 2011-09-15 10:40:00 13.8 100.0
What I have done so far:
# Separate the date and times.
df['DATE'] = df['YYYYMMDDHHmm'].dt.normalize()
df['TIME'] = df['YYYYMMDDHHmm'].dt.time
df = df[['DATE', 'TIME', 'TAIR']]
df['RANK'] = 0
for index, row in df.iterrows():
    df.loc[index, 'RANK'] = ranked[ranked['DATE'] == row['DATE']]['RANK'].values
But I think I am going in a very wrong direction because this takes ages to complete.
How do I improve this code?
IIUC, you can play with indexes to match the values:
df = df.set_index(df.TIMESTAMP.dt.date)\
       .assign(RANK=ranked.set_index('DATE').RANK)\
       .set_index(df.index)
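An equivalent sketch using map, which avoids re-setting the index; it assumes ranked has one row per DATE:
# Normalize each timestamp to midnight, then look up that day's rank.
rank_by_date = ranked.set_index('DATE')['RANK']
df['RANK'] = df['TIMESTAMP'].dt.normalize().map(rank_by_date)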

How can I add summary rows to a pandas DataFrame, calculated on multiple columns by agg functions like mean, median, etc.?

I have some data with multiple observations for a given Collector, Date, Sample, and Type where the observation values vary by ID.
from io import StringIO
import pandas as pd
data = """Collector,Date,Sample,Type,ID,Value
Emily,2014-06-20,201,HV,A,34
Emily,2014-06-20,201,HV,B,22
Emily,2014-06-20,201,HV,C,10
Emily,2014-06-20,201,HV,D,5
John,2014-06-22,221,HV,A,40
John,2014-06-22,221,HV,B,39
John,2014-06-22,221,HV,C,11
John,2014-06-22,221,HV,D,2
Emily,2014-06-23,203,HV,A,33
Emily,2014-06-23,203,HV,B,35
Emily,2014-06-23,203,HV,C,13
Emily,2014-06-23,203,HV,D,1
John,2014-07-01,218,HV,A,35
John,2014-07-01,218,HV,B,29
John,2014-07-01,218,HV,C,13
John,2014-07-01,218,HV,D,1
"""
>>> df = pd.read_csv(StringIO(data), parse_dates=["Date"])
After doing some graphing with the data in this long format, I pivot it to a wide summary table format with columns for each ID.
>>> table = df.pivot_table(index=["Collector", "Date", "Sample", "Type"], columns="ID", values="Value")
ID A B C D
Collector Date Sample Type
Emily 2014-06-20 201 HV 34 22 10 5
2014-06-23 203 HV 33 35 13 1
John 2014-06-22 221 HV 40 39 11 2
2014-07-01 218 HV 35 29 13 1
However, I can't find a concise way to calculate and add some summary rows to the wide format data with mean, median, and maybe some custom aggregation function applied to each of the ID-based columns. This is what I want to end up with:
ID Collector Date Sample Type A B C D
0 Emily 2014-06-20 201 HV 34 22 10 5
2 John 2014-06-22 221 HV 40 39 11 2
1 Emily 2014-06-23 203 HV 33 35 13 1
3 John 2014-07-01 218 HV 35 29 13 1
4 mean 35.5 31.3 11.8 2.3
5 median 34.5 32.0 12.0 1.5
I tried things like calling mean or median on the summary table, but I end up with a Series rather than a row I can concatenate to the summary table. The summary rows I want are sort of like pivot_table margins, but the aggregation function is not sum.
>>> table.mean()
ID
A 35.50
B 31.25
C 11.75
D 2.25
dtype: float64
>>> table.median()
ID
A 34.5
B 32.0
C 12.0
D 1.5
dtype: float64
You could use aggfunc=[np.mean, np.median] to compute both the means and the medians. Then you could use margins=True to also obtain the means and medians for each column and for each row.
import numpy as np

result = df.pivot_table(index=["Collector", "Date", "Sample", "Type"],
                        columns="ID", values="Value", margins=True,
                        aggfunc=[np.mean, np.median]).stack(level=0)
yields
ID A B C D All
Collector Date Sample Type
Emily 2014-06-20 201 HV mean 34.0 22.00 10.00 5.00 17.7500
median 34.0 22.00 10.00 5.00 16.0000
2014-06-23 203 HV mean 33.0 35.00 13.00 1.00 20.5000
median 33.0 35.00 13.00 1.00 23.0000
John 2014-06-22 221 HV mean 40.0 39.00 11.00 2.00 23.0000
median 40.0 39.00 11.00 2.00 25.0000
2014-07-01 218 HV mean 35.0 29.00 13.00 1.00 19.5000
median 35.0 29.00 13.00 1.00 21.0000
All mean 35.5 31.25 11.75 2.25 20.1875
median 34.5 32.00 12.00 1.50 17.5000
Yes, result contains more data than you asked for, but
result.loc['All']
has the additional values:
ID A B C D All
Date Sample Type
mean 35.5 31.25 11.75 2.25 20.1875
median 34.5 32.00 12.00 1.50 17.5000
Or, you could further subselect result to get just the rows you are looking for:
result.index.names = [u'Collector', u'Date', u'Sample', u'Type', u'aggfunc']
mask = result.index.get_level_values('aggfunc') == 'mean'
mask[-1] = True
result = result.loc[mask]
print(result)
yields
ID A B C D All
Collector Date Sample Type aggfunc
Emily 2014-06-20 201 HV mean 34.0 22.00 10.00 5.00 17.7500
2014-06-23 203 HV mean 33.0 35.00 13.00 1.00 20.5000
John 2014-06-22 221 HV mean 40.0 39.00 11.00 2.00 23.0000
2014-07-01 218 HV mean 35.0 29.00 13.00 1.00 19.5000
All mean 35.5 31.25 11.75 2.25 20.1875
median 34.5 32.00 12.00 1.50 17.5000
This might not be super clean, but you could assign to the new entries with .loc.
In [131]: table_mean = table.mean()
In [132]: table_median = table.median()
In [134]: table.loc['Mean', :] = table_mean.values
In [135]: table.loc['Median', :] = table_median.values
In [136]: table
Out[136]:
ID A B C D
Collector Date Sample Type
Emily 2014-06-20 201 HV 34.0 22.00 10.00 5.00
2014-06-23 203 HV 33.0 35.00 13.00 1.00
John 2014-06-22 221 HV 40.0 39.00 11.00 2.00
2014-07-01 218 HV 35.0 29.00 13.00 1.00
Mean 35.5 31.25 11.75 2.25
Median 34.5 32.00 12.00 1.50
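A shorter variant along the same lines, sketched under the assumption that table is the pivoted frame from above: DataFrame.agg computes both summary rows at once, and pd.concat appends them once the MultiIndex has been flattened into columns.
import pandas as pd

flat = table.reset_index()                     # move the MultiIndex into columns
summary = table.agg(['mean', 'median'])        # two rows, indexed 'mean'/'median'
summary.insert(0, 'Collector', summary.index)  # label the rows in the Collector column
result = pd.concat([flat, summary], ignore_index=True)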
