Pandas: group columns into a time series - python

Consider this set of data:
data = [{'Year':'1959:01','0':138.89,'1':139.39,'2':139.74,'3':139.69,'4':140.68,'5':141.17},
        {'Year':'1959:07','0':141.70,'1':141.90,'2':141.01,'3':140.47,'4':140.38,'5':139.95},
        {'Year':'1960:01','0':139.98,'1':139.87,'2':139.75,'3':139.56,'4':139.61,'5':139.58}]
How can I convert it to a Pandas time series, like this:
Year Value
1959-01 138.89
1959-02 139.39
1959-03 139.74
...
1959-07 141.70
1959-08 141.90
...

Code
import pandas as pd

df = pd.DataFrame(data).set_index('Year').stack().droplevel(1)
df.index = pd.date_range(start=pd.to_datetime(df.index, format='%Y:%m')[0],
                         periods=len(df.index), freq='M').to_period('M')
df = df.to_frame().reset_index().rename(columns={'index': 'Year', 0: 'Value'})
Explanation
First, stack converts the frame to a Series and droplevel drops the column level, which is no longer needed.
Then the index is rebuilt as a contiguous monthly date range starting from the first parsed date; since we need the output at monthly frequency, to_period('M') converts the dates to monthly periods. (Note this relies on the rows covering consecutive months with no gaps.)
The last step converts the Series back to a frame and renames the columns.
Output as required
Year Value
0 1959-01 138.89
1 1959-02 139.39
2 1959-03 139.74
3 1959-04 139.69
4 1959-05 140.68
5 1959-06 141.17
6 1959-07 141.70
7 1959-08 141.90
8 1959-09 141.01
9 1959-10 140.47
10 1959-11 140.38
11 1959-12 139.95
12 1960-01 139.98
13 1960-02 139.87
14 1960-03 139.75
15 1960-04 139.56
16 1960-05 139.61
17 1960-06 139.58

Here is one way:
s = pd.DataFrame(data).set_index("Year").stack()
s.index = pd.Index([pd.to_datetime(start, format="%Y:%m") + pd.DateOffset(months=int(off))
                    for start, off in s.index], name="Year")
df = s.to_frame("Value")
First we set Year as the index and stack the values next to it. Then we build a new index from the current MultiIndex: each date is parsed from the Year level and shifted by the column label, interpreted as a month offset. Lastly we convert back to a frame, naming the new column Value, to get:
>>> df
Value
Year
1959-01-01 138.89
1959-02-01 139.39
1959-03-01 139.74
1959-04-01 139.69
1959-05-01 140.68
1959-06-01 141.17
1959-07-01 141.70
1959-08-01 141.90
1959-09-01 141.01
1959-10-01 140.47
1959-11-01 140.38
1959-12-01 139.95
1960-01-01 139.98
1960-02-01 139.87
1960-03-01 139.75
1960-04-01 139.56
1960-05-01 139.61
1960-06-01 139.58
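The month-offset idea above can also be vectorized, avoiding the Python-level loop. A sketch under the same assumptions (the column labels are integer month offsets; data is the list from the question):

import pandas as pd

s = pd.DataFrame(data).set_index('Year').stack()

# Parse the Year level to monthly periods, then add the column labels as month offsets.
base = pd.to_datetime(s.index.get_level_values(0), format='%Y:%m').to_period('M')
offsets = s.index.get_level_values(1).astype(int)
s.index = (base + offsets).rename('Year')

df = s.rename('Value').reset_index()

Unlike the date_range approach in the first answer, this does not require the rows to cover consecutive months.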

Related

How to interpolate only over a specific window?

I have a dataset that follows a weekly indexation, and a list of dates that I need to get interpolated data for. For example, I have the following df with weekly aggregation:
data value
1/01/2021 10
7/01/2021 10
14/01/2021 10
28/01/2021 10
and a list of dates that do not coincide with the df indexed dates, for example:
list_dates = [12/01/2021, 13/01/2021 ...]
I need to get the interpolated value for every date in list_dates, but within a given window (for example: using only 4 values in the df to calculate the interpolation, split between before and after, i.e. the 2 closest dates before the list date and the 2 closest dates after it).
To get the interpolated value of the list date 12/01/2021 in the list, I would need to use:
1/1/2021
7/1/2021
14/1/2021
28/1/2021
The output would then be:
data value
1/01/2021 10
7/01/2021 10
12/01/2021 10
13/01/2021 10
14/01/2021 10
28/01/2021 10
I have successfully coded an example of this, but it fails when there are multiple consecutive NaNs (for example: 12/01 and 13/01). I also can't concat each interpolated value before running the next one in the list, as that would use an interpolated date to calculate the next interpolated date (for example: using 12/01 to calculate 13/01).
Any advice on how to do this?
Use interpolate to get the expected outcome, but first you have to prepare your dataframe as below.
I slightly modified your input data to demonstrate time-based interpolation with a DatetimeIndex (method='time'):
import pandas as pd

# Input data
df = pd.DataFrame({'data': ['1/01/2021', '7/01/2021', '14/01/2021', '28/01/2021'],
                   'value': [10, 10, 17, 10]})
list_dates = ['12/01/2021', '13/01/2021']

# Conversion of dates
df['data'] = pd.to_datetime(df['data'], format='%d/%m/%Y')
new_dates = pd.to_datetime(list_dates, format='%d/%m/%Y')

# Set datetime column as index and append new dates
df = df.set_index('data')
df = df.reindex(df.index.append(new_dates)).sort_index()

# Interpolate with method='time'
df['value'] = df['value'].interpolate(method='time')
Output:
>>> df
value
2021-01-01 10.0
2021-01-07 10.0
2021-01-12 15.0 # <- time interpolation
2021-01-13 16.0 # <- time interpolation
2021-01-14 17.0 # <- changed from 10 to 17
2021-01-28 10.0
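Note that method='time' handles the consecutive inserted dates (12/01 and 13/01) in one pass: each NaN is interpolated against the surrounding known observations, never against a neighbouring NaN, which sidesteps the chaining problem described in the question. If you also want to guarantee that values are never extrapolated past the first or last known point, limit_area can be added (a small sketch under the same setup):

# Only fill NaNs strictly between known values; any leading/trailing NaNs are left as-is.
df['value'] = df['value'].interpolate(method='time', limit_area='inside')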

Date Difference based on matching values in two columns - Pandas

I have a dataframe and I am struggling to create a column based on other columns. I will illustrate the problem with sample data.
Date Target1 Close
0 2018-05-25 198.0090 188.580002
1 2018-05-25 197.6835 188.580002
2 2018-05-25 198.0090 188.580002
3 2018-05-29 196.6230 187.899994
4 2018-05-29 196.9800 187.899994
5 2018-05-30 197.1375 187.500000
6 2018-05-30 196.6965 187.500000
7 2018-05-30 196.8750 187.500000
8 2018-05-31 196.2135 186.869995
9 2018-05-31 196.2135 186.869995
10 2018-05-31 196.5600 186.869995
11 2018-05-31 196.7700 186.869995
12 2018-05-31 196.9275 186.869995
13 2018-05-31 196.2135 186.869995
14 2018-05-31 196.2135 186.869995
15 2018-06-01 197.2845 190.240005
16 2018-06-01 197.2845 190.240005
17 2018-06-04 201.2325 191.830002
18 2018-06-04 201.4740 191.830002
I want to create another column (called days_to_hit_target, for example) holding, for each observation, the number of days until Close hits (or crosses) that day's target.
The idea is: the close price on 2018-05-25 is 188.58 and the target is 198.0090. Close first reaches that target later, on 2018-06-04; the difference in days between those two dates is what should be fed into days_to_hit_target for the first observation.
Use loc to find the first later date at which the target is hit, then subtract the dates.
df['TargetDate'] = pd.NaT
for i, row in df.iterrows():
    t = row['Target1']
    d = row['Date']
    # Only rows after the current date whose Close reaches the target qualify.
    targdf = df.loc[(df['Close'] >= t) & (df['Date'] > d)]
    if len(targdf) > 0:
        # iloc[0] takes the first match regardless of its index label.
        df.at[i, 'TargetDate'] = targdf['Date'].iloc[0]
df['Diff'] = df['TargetDate'].sub(df['Date'])
import pandas as pd

csv = pd.read_csv(
    'sample.csv',
    parse_dates=['Date']
)
csv.sort_values('Date', inplace=True)

def find_closest(row):
    target = row['Target1']
    date = row['Date']
    matches = csv[
        (csv['Close'] >= target) &
        (csv['Date'] > date)
    ]
    closest_date = matches['Date'].iloc[0] if not matches.empty else None
    row['days to hit target'] = (closest_date - date).days if closest_date else None
    return row

final = csv.apply(find_closest, axis=1)
It's a bit hard to test because none of the targets appear in Close, but the idea is simple: subset your original frame so that Date is after the current row's date and Close is greater than or equal to Target1, and take the first entry (this is after you've sorted it using df.sort_values).
If the subset is empty, use None; otherwise, use the Date. Days to hit target is straightforward at that point.
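For larger frames, the row-wise apply can be traded for a single boolean comparison matrix. A sketch under the same assumptions (frame sorted by Date, columns as above; the matrix is O(n²) in memory, so this only suits moderate sizes):

import numpy as np

dates = csv['Date'].to_numpy()
close = csv['Close'].to_numpy()
target = csv['Target1'].to_numpy()

# hit[i, j] is True when row j is a later date whose Close reaches row i's target.
hit = (close[None, :] >= target[:, None]) & (dates[None, :] > dates[:, None])

# argmax returns the first True per row; rows with no hit at all are masked to NaN.
first = hit.argmax(axis=1)
days = (dates[first] - dates) / np.timedelta64(1, 'D')
csv['days to hit target'] = np.where(hit.any(axis=1), days, np.nan)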

how to apply filter condition in percentage string column using pandas?

I am working on the df below but am unable to apply a filter to the percentage field, even though it works in plain Excel.
I need to apply the filter condition > 100.00% to that field using pandas.
I tried reading the data from HTML, CSV and Excel into pandas but was unable to use the condition.
It requires a float conversion, but that is not working with the given data.
I am assuming that the values you have are read as strings in Pandas:
import pandas as pd

data = ['4,700.00%', '3,900.00%', '1,500.00%', '1,400.00%', '1,200.00%',
        '0.15%', '0.13%', '0.12%', '0.10%', '0.08%', '0.07%']
df = pd.DataFrame(data)
df.columns = ['data']
printing the df:
data
0 4,700.00%
1 3,900.00%
2 1,500.00%
3 1,400.00%
4 1,200.00%
5 0.15%
6 0.13%
7 0.12%
8 0.10%
9 0.08%
10 0.07%
then:
df['data'] = df['data'].str.rstrip('%').str.replace(',','').astype('float')
df_filtered = df[df['data'] > 100]
Results:
data
0 4700.0
1 3900.0
2 1500.0
3 1400.0
4 1200.0
I have used the code below as well: .str.rstrip('%') and .str.replace(',','').astype('float'). It is working fine.
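If the column may also contain blanks or stray text, a slightly more defensive variant of the same conversion (a sketch; not needed for the clean data above) is to coerce bad cells to NaN instead of raising:

# Anything that is not a valid number after stripping becomes NaN.
cleaned = df['data'].str.rstrip('%').str.replace(',', '', regex=False)
df['data'] = pd.to_numeric(cleaned, errors='coerce')
df_filtered = df[df['data'] > 100]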

Forward filling missing dates into Python Pandas Dataframe

I have a Pandas dataframe that is filled as follows:
ref_date tag
1/29/2010 1
2/26/2010 3
3/31/2010 4
4/30/2010 4
5/31/2010 1
6/30/2010 3
8/31/2010 1
9/30/2010 4
12/31/2010 2
Note how there are missing months (i.e. 7, 10, 11) in the data. I want to fill in the missing data through a forward filling method so that it looks like this:
ref_date tag
1/29/2010 1
2/26/2010 3
3/31/2010 4
4/30/2010 4
5/31/2010 1
6/30/2010 3
7/30/2010 3
8/31/2010 1
9/30/2010 4
10/29/2010 4
11/30/2010 4
12/31/2010 2
Each missing date takes the tag of the previous date. All dates represent the last business day of the month.
This is what I tried to do:
idx = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
df.ref_date.index = pd.to_datetime(df.ref_date.index)
df = df.reindex(index=[idx], columns=[ref_date], method='ffill')
It's giving me the error:
TypeError: Cannot compare type 'Timestamp' with type 'int'
where pd is pandas and df is the dataframe.
I'm new to Pandas Dataframe, so any help would be appreciated!
You were very close. You just need to set the dataframe's index to ref_date, reindex it to the business-month-end index while specifying ffill as the method, then reset the index and rename the column back to the original:
# First ensure the dates are Pandas Timestamps.
df['ref_date'] = pd.to_datetime(df['ref_date'])
# Create a business-month-end index.
idx_monthly = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')

# Reindex to the monthly index, forward filling the missing months.
>>> (df
     .set_index('ref_date')
     .reindex(idx_monthly, method='ffill')
     .reset_index()
     .rename(columns={'index': 'ref_date'}))
ref_date tag
0 2010-01-29 1.0
1 2010-02-26 3.0
2 2010-03-31 4.0
3 2010-04-30 4.0
4 2010-05-31 1.0
5 2010-06-30 3.0
6 2010-07-30 3.0
7 2010-08-31 1.0
8 2010-09-30 4.0
9 2010-10-29 4.0
10 2010-11-30 4.0
11 2010-12-31 2.0
Thanks to the previous person who answered this question but deleted their answer, I got the solution:
df['ref_date'] = pd.to_datetime(df['ref_date'])
idx = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
df = df.set_index('ref_date').reindex(idx).ffill().reset_index().rename(columns={'index': 'ref_date'})
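Starting again from the original frame with ref_date parsed to datetimes, an equivalent shortcut is to resample at business-month-end frequency and forward fill (a sketch; it avoids building the index by hand, but assumes the data begins on a business month end):

out = (df.set_index('ref_date')
         .resample('BM').ffill()
         .reset_index())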

Group by multiple time units in pandas data frame

I have a data frame that consists of a time series data with 15-second intervals:
date_time value
2012-12-28 11:11:00 103.2
2012-12-28 11:11:15 103.1
2012-12-28 11:11:30 103.4
2012-12-28 11:11:45 103.5
2012-12-28 11:12:00 103.3
The data spans many years. I would like to group by both year and time to look at the distribution of time-of-day effect over many years. For example, I may want to compute the mean and standard deviation of every 15-second interval across days, and look at how the means and standard deviations change from 2010, 2011, 2012, etc. I naively tried data.groupby(lambda x: [x.year, x.time]) but it didn't work. How can I do such grouping?
In case date_time is not your index, a date_time-indexed DataFrame could be created with:
dfts = df.set_index('date_time')
From there you can group by intervals using
dfts.groupby(lambda x: x.month).mean()
to see mean values for each month. Similarly, you can do
dfts.groupby(lambda x: x.year).std()
for standard deviations across the years.
If I understood the example task correctly, you could simply split the data into years using xs, group each year by month, and concatenate the results into a new DataFrame.
years = range(2012, 2015)
yearly_month_stats = [dfts.xs(str(year)).groupby(lambda x: x.month).mean()
                      for year in years]
df2 = pd.concat(yearly_month_stats, axis=1, keys=years)
From which you get something like
2012 2013 2014
value value value
1 NaN 5.324165 15.747767
2 NaN -23.193429 9.193217
3 NaN -14.144287 23.896030
4 NaN -21.877975 16.310195
5 NaN -3.079910 -6.093905
6 NaN -2.106847 -23.253183
7 NaN 10.644636 6.542562
8 NaN -9.763087 14.335956
9 NaN -3.529646 2.607973
10 NaN -18.633832 0.083575
11 NaN 10.297902 14.059286
12 33.95442 13.692435 22.293245
You were close:
data.groupby([lambda x: x.year, lambda x: x.time])
Also be sure to set date_time as the index, as in kermit666's answer.
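For the year-by-time-of-day grouping the question actually asks for, the same idea can be written with index attributes instead of lambdas (a sketch, assuming dfts is indexed by date_time as above and the column is named value):

# Mean and standard deviation of each 15-second time slot, per year.
stats = (dfts.groupby([dfts.index.year, dfts.index.time])['value']
             .agg(['mean', 'std']))
stats.index.names = ['year', 'time']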
