I am trying to find a way to get the next day (the next row, in this case) in a Pandas dataframe. I thought this would be easy to find, but I'm struggling.
Starting Data:
ts = pd.DataFrame(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts.columns = ['Val']
ts['Week'] = ts.index.week
ts
Val Week
2000-01-01 -0.639345 52
2000-01-02 1.294537 52
2000-01-03 1.181486 1
2000-01-04 -0.011694 1
2000-01-05 -0.224887 1
2000-01-06 -0.493120 1
2000-01-07 1.439436 1
2000-01-08 1.017722 1
2000-01-09 1.125153 1
2000-01-10 0.209741 2
Subset of the data:
tsSig = ts[ts.Val>1.5].drop_duplicates(subset='Week')
tsSig.head()
Val Week
2000-01-24 2.215559 4
2000-02-09 1.561941 6
2000-02-24 1.645916 8
2000-03-16 1.745079 11
2000-04-10 1.570023 15
I now want to use the index from my tsSig subset to find the next day in ts and then create a new column ts['Val_Dayplus1'] showing the values from the 25th (-0.309811), the 10th (-1.644814), etc.
I am trying things like ts.loc[tsSig.index].shift(1) to get the next day, but this is obviously not correct.
Desired output:
Val Val_Dayplus1 Week
2000-01-24 2.215559 -0.309811 4
2000-02-09 1.561941 -1.644814 6
2000-02-24 1.645916 -0.187440 8
(for all rows in tsSig.index)
Edit:
This appears to give me what I need in terms of shifting the date index on tsSig.index. I would like to hear of any other ways to do this as well.
ts.loc[tsSig.index + pd.DateOffset(days=1)]
tsSig['Val_Dayplus1'] = ts['Val'].loc[tsSig.index + pd.DateOffset(days=1)].values
I managed to work this one out, so I'm sharing the answer:
ts.loc[tsSig.index + pd.DateOffset(days=1)]
tsSig['Val_Dayplus1'] = ts['Val'].loc[tsSig.index + pd.DateOffset(days=1)].values
tsSig
Val Week Val_Dayplus1
2000-02-15 1.551125 7 -0.102154
2000-02-24 1.525402 8 -0.009776
2000-03-11 1.801845 10 0.832837
2000-03-22 1.546953 12 0.377510
2000-04-17 1.568720 16 -0.258558
2000-06-04 1.646147 22 0.853044
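Another way to get the same column, which relies on ts having exactly one row per calendar day, is to shift the whole Val series back by one and then select the tsSig rows; a sketch:
ts['Val_Dayplus1'] = ts['Val'].shift(-1)   # next row's value, i.e. the next day's Val here
ts.loc[tsSig.index]                        # Val, Week and Val_Dayplus1 for the tsSig dates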
I'm not sure I completely understand your question, but in general you can index into a pandas dataframe with df.iloc[ROWS, COLS]. So in your case, for a positional index i, you could use ts.iloc[i + 1, :] to get all of the info from the next row of the ts dataframe.
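A positional variant of that idea, applied to the frames in the question (a sketch; it assumes no tsSig date is the very last row of ts):
pos = ts.index.get_indexer(tsSig.index)                   # positions of the tsSig dates within ts
tsSig['Val_Dayplus1'] = ts['Val'].iloc[pos + 1].values    # the value one row (one day) later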
As the title says, I want to know how to set every n-th value in a Python list to None. I looked for a solution in a lot of forums but didn't find much.
I also don't want to overwrite existing values with None; instead I want to insert new entries with the value None.
The list contains dates (12 dates = 1 year), and every 13th value should be empty because that row will hold the average, so I don't need a date there.
Here is my code how i generated the dates with pandas
import pandas as pd
numdays = 370  # I have 370 values, each value = 1 month, starting from 1990 till June 2019
date1 = '1990-01-01'
date2 = '2019-06-01'
mydates = pd.date_range(date1, date2,).tolist()
date_all = pd.date_range(start=date1, end=date2, freq='1BMS')
date_lst = [date_all]
The expected Output:
01.01.1990
01.02.1990
01.03.1990
01.04.1990
01.05.1990
01.06.1990
01.07.1990
01.08.1990
01.09.1990
01.10.1990
01.11.1990
01.12.1990
None
01.01.1991
.
.
.
If I understood correctly:
import pandas as pd
numdays = 370
date1 = '1990-01-01'
date2 = '2019-06-01'
mydates = pd.date_range(date1, date2,).tolist()
date_all = pd.date_range(start=date1, end=date2, freq='1BMS')
date_lst = [date_all]
for i in range(12, len(mydates), 13):  # add this
    mydates.insert(i, None)
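If the monthly date_all dates are what actually need the gaps (rather than the daily mydates list), a variant that sidesteps the index arithmetic is to build a new list and append a None after every 12th date; a small sketch:
monthly = list(date_all)      # the business-month-start dates from above
with_gaps = []
for i, d in enumerate(monthly, start=1):
    with_gaps.append(d)
    if i % 12 == 0:           # after every 12 dates, leave room for the average row
        with_gaps.append(None)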
I saw some of the answers above, but there's a way of doing this without having to loop over the complete list:
date_lst[12::12] = [None] * len(date_lst[12::12])
The first 12 in [12::12] means that the first item that should be changed is item number 12. The second 12 means that from then on every 12th item should be changed.
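As a quick illustration of the slice assignment on a short hypothetical list (step 3 instead of 12, purely for readability):
data = list(range(1, 10))               # [1, 2, 3, 4, 5, 6, 7, 8, 9]
data[2::3] = [None] * len(data[2::3])   # overwrite every 3rd item in place
print(data)                             # [1, 2, None, 4, 5, None, 7, 8, None]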
You can add a step in iloc and set values this way.
Let's generate some dummy data:
df = pd.DataFrame({'Vals':
                   pd.date_range('01-01-19', '02-02-19', freq='D')})
print(df)
Vals
0 2019-01-01
1 2019-01-02
2 2019-01-03
3 2019-01-04
4 2019-01-05
5 2019-01-06
6 2019-01-07
7 2019-01-08
Now, you can decide your step:
step = 5
new_df = df.iloc[step::step]
print(new_df)
Vals
5 2019-01-06
10 2019-01-11
15 2019-01-16
20 2019-01-21
25 2019-01-26
30 2019-01-31
Now, if you want to write a value into a specific column:
df.iloc[step::step, df.columns.get_loc('Vals')] = pd.NaT  # positional row/column indexing avoids chained assignment
print(df)
Vals
0 2019-01-01
1 2019-01-02
2 2019-01-03
3 2019-01-04
4 2019-01-05
5 NaT
Here is an example that sets every 3rd element of a list to None; you can make it every 13th element by changing the condition to ((index + 1) % 13 == 0):
data = [1,2,3,4,5,6,7,8,9]
data = [None if ((index+1)%3 == 0) else d for index, d in enumerate(data)]
print(data)
output:
[1, 2, None, 4, 5, None, 7, 8, None]
Adapted to your code, try this:
date_lst = list(date_all)
dateWithNone = [None if ((index+1)%13 == 0) else d for index, d in enumerate(date_lst)]
print(dateWithNone)
I have a dataframe and I am struggling to create a column based on other columns. I will illustrate the problem with sample data.
Date Target1 Close
0 2018-05-25 198.0090 188.580002
1 2018-05-25 197.6835 188.580002
2 2018-05-25 198.0090 188.580002
3 2018-05-29 196.6230 187.899994
4 2018-05-29 196.9800 187.899994
5 2018-05-30 197.1375 187.500000
6 2018-05-30 196.6965 187.500000
7 2018-05-30 196.8750 187.500000
8 2018-05-31 196.2135 186.869995
9 2018-05-31 196.2135 186.869995
10 2018-05-31 196.5600 186.869995
11 2018-05-31 196.7700 186.869995
12 2018-05-31 196.9275 186.869995
13 2018-05-31 196.2135 186.869995
14 2018-05-31 196.2135 186.869995
15 2018-06-01 197.2845 190.240005
16 2018-06-01 197.2845 190.240005
17 2018-06-04 201.2325 191.830002
18 2018-06-04 201.4740 191.830002
I want to create another column for each observation (called days_to_hit_target, for example) that holds the number of days until Close hits (or crosses) that observation's target; that difference in days goes into the days_to_hit_target column.
The idea is: suppose the close price on 2018-05-25 is 188.58. I want to find the date on which Close reaches that row's target (198.0090), which happens later, on 2018-06-04. The number of days between those two dates is what goes into the first observation of days_to_hit_target.
Use a combination of loc and at to find the date at which the target is hit, then subtract the dates.
df['TargetDate'] = pd.NaT
for i, row in df.iterrows():
    t = row['Target1']
    d = row['Date']
    # Rows on a later date where Close reaches this row's target.
    targdf = df.loc[(df['Close'] >= t) & (df['Date'] > d)]
    if len(targdf) > 0:
        targdt = targdf.at[targdf.index[0], 'Date']
        df.at[i, 'TargetDate'] = targdt
# Days from each row's date to the date the target is reached (NaT if never hit).
df['Diff'] = df['TargetDate'].sub(df['Date'])
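If a plain number of days is wanted rather than a timedelta, the Diff column can be converted afterwards; a small sketch:
# Rows whose target is never hit stay as NaN.
df['days_to_hit_target'] = df['Diff'].dt.days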
import pandas as pd

csv = pd.read_csv(
    'sample.csv',
    parse_dates=['Date']
)
csv.sort_values('Date', inplace=True)

def find_closest(row):
    target = row['Target1']
    date = row['Date']
    matches = csv[
        (csv['Close'] >= target) &
        (csv['Date'] > date)
    ]
    closest_date = matches['Date'].iloc[0] if not matches.empty else None
    row['days to hit target'] = (closest_date - date).days if closest_date else None
    return row

final = csv.apply(find_closest, axis=1)
It's a bit hard to test because none of the targets are actually reached by Close in the sample. But the idea is simple: subset your original frame so that Date is after the current row's date and Close is greater than or equal to Target1, then take the first entry (after sorting with df.sort_values).
If the subset is empty, use None. Otherwise, use the Date. Days to hit target is pretty simple at that point.
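For completeness, a rough vectorized sketch of the same subsetting idea (it builds an n-by-n boolean matrix, so it only suits small frames; it assumes the csv frame from above, already sorted by Date):
import numpy as np

dates = csv['Date'].to_numpy()
close = csv['Close'].to_numpy()
target = csv['Target1'].to_numpy()

# hit[i, j] is True when row j is on a later date than row i and its Close
# reaches row i's target.
hit = (close[None, :] >= target[:, None]) & (dates[None, :] > dates[:, None])

first = hit.argmax(axis=1)        # index of the first hit per row (0 if none)
found = hit.any(axis=1)
days = (dates[first] - dates) / np.timedelta64(1, 'D')
csv['days to hit target'] = np.where(found, days, np.nan)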
So I have a pandas dataframe which has a large number of columns, and one of the columns is a timestamp in datetime format. Each row in the dataframe represents a single "event". What I'm trying to do is graph the frequency of these events over time. Basically a simple bar graph showing how many events per month.
Started with this code:
data.groupby([(data.Timestamp.dt.year),(data.Timestamp.dt.month)]).count().plot(kind = 'bar')
plt.show()
This "kind of" works. But there are 2 problems:
1) The graph comes with a legend which includes all the columns in the original data (like 30+ columns). And each bar on the graph has a tiny sub-bar for each of the columns (all of which are the same value since I'm just counting events).
2) There are some months where there are zero events. And these months don't show up on the graph at all.
I finally came up with code that gets the graph looking the way I wanted, but it seems to me that I'm not doing this the "correct" way, since this must be a fairly common use case.
Basically I created a new dataframe with one column "count" and an index that's a string representation of month/year. I populated that with zeroes over the time range I care about and then I copied over the data from the first frame into the new one. Here is the code:
import pandas as pd
import matplotlib.pyplot as plt
cnt = data.groupby([(data.Timestamp.dt.year),(data.Timestamp.dt.month)]).count()
index = []
for year in [2015, 2016, 2017, 2018]:
    for month in range(1, 13):
        index.append('%04d-%02d' % (year, month))
cnt_new = pd.DataFrame(index=index, columns=['count'])
cnt_new = cnt_new.fillna(0)
for i, row in cnt.iterrows():
    cnt_new.at['%04d-%02d' % i, 'count'] = row[0]
cnt_new.plot(kind = 'bar')
plt.show()
Anyone know an easier way to go about this?
EDIT --> Per request, here's an idea of the type of dataframe. It's the results from an SQL query. Actual data is my company's so...
Timestamp FirstName LastName HairColor \
0 2018-11-30 02:16:11 Fred Schwartz brown
1 2018-11-29 16:25:55 Sam Smith black
2 2018-11-19 21:12:29 Helen Hunt red
OK, so I think I got it. Thanks to Yuca for the resample command. I just need to run it on the Timestamp series (rather than on the whole dataframe) and it gives me exactly what I was looking for.
data.index = data.Timestamp
data.Timestamp.resample('M').count()
Timestamp
2017-11-30 0
2017-12-31 0
2018-01-31 1
2018-02-28 2
2018-03-31 7
2018-04-30 9
2018-05-31 2
2018-06-30 6
2018-07-31 5
2018-08-31 4
2018-09-30 1
2018-10-31 0
2018-11-30 5
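To turn that series into the bar chart I was after (a small sketch; the strftime step is just to keep the x-axis labels readable):
import matplotlib.pyplot as plt

counts = data.Timestamp.resample('M').count()
counts.index = counts.index.strftime('%Y-%m')   # e.g. '2018-03' instead of a full timestamp
counts.plot(kind='bar')
plt.show()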
So the OP's request is: "Basically a simple bar graph showing how many events per month."
Using resample with a monthly frequency yields the desired result:
df[['FirstName']].resample('M').count()
Output:
FirstName
Timestamp
2018-11-30 3
To include non-observed months, we need to create a baseline calendar:
df_a = pd.DataFrame(index = pd.date_range(df.index[0].date(), periods=12, freq='M'))
and then assign to it the result of our resample
df_a['count'] = df[['FirstName']].resample('M').count()
Output:
count
2018-11-30 3.0
2018-12-31 NaN
2019-01-31 NaN
2019-02-28 NaN
2019-03-31 NaN
2019-04-30 NaN
2019-05-31 NaN
2019-06-30 NaN
2019-07-31 NaN
2019-08-31 NaN
2019-09-30 NaN
2019-10-31 NaN
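To match the original goal of showing zero bars for the empty months, the NaNs can be filled before plotting; a small sketch:
df_a['count'] = df_a['count'].fillna(0)   # unobserved months become zero counts
df_a['count'].plot(kind='bar')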
I have a Pandas dataframe that is filled as follows:
ref_date tag
1/29/2010 1
2/26/2010 3
3/31/2010 4
4/30/2010 4
5/31/2010 1
6/30/2010 3
8/31/2010 1
9/30/2010 4
12/31/2010 2
Note how there are missing months (i.e. 7, 10, 11) in the data. I want to fill in the missing data through a forward filling method so that it looks like this:
ref_date tag
1/29/2010 1
2/26/2010 3
3/31/2010 4
4/30/2010 4
5/31/2010 1
6/30/2010 3
7/30/2010 3
8/31/2010 1
9/30/2010 4
10/29/2010 4
11/30/2010 4
12/31/2010 2
The tag of a missing date takes the tag of the previous date. All dates represent the last business day of the month.
This is what I tried to do:
idx = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
df.ref_date.index = pd.to_datetime(df.ref_date.index)
df = df.reindex(index=[idx], columns=[ref_date], method='ffill')
It's giving me the error:
TypeError: Cannot compare type 'Timestamp' with type 'int'
where pd is pandas and df is the dataframe.
I'm new to Pandas Dataframe, so any help would be appreciated!
You were very close. You just need to set the dataframe's index to ref_date, reindex it to the business-month-end index specifying ffill as the method, then reset the index and rename the column back to the original:
# First ensure the dates are Pandas Timestamps.
df['ref_date'] = pd.to_datetime(df['ref_date'])
# Create a monthly index.
idx_monthly = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
# Set the index, reindex to the business-month-end index with forward fill, then restore the column.
>>> (df
.set_index('ref_date')
.reindex(idx_monthly, method='ffill')
.reset_index()
.rename(columns={'index': 'ref_date'}))
ref_date tag
0 2010-01-29 1.0
1 2010-02-26 3.0
2 2010-03-31 4.0
3 2010-04-30 4.0
4 2010-05-31 1.0
5 2010-06-30 3.0
6 2010-07-30 3.0
7 2010-08-31 1.0
8 2010-09-30 4.0
9 2010-10-29 4.0
10 2010-11-30 4.0
11 2010-12-31 2.0
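If integer tags are preferred over the floats shown above (the float dtype comes from the NaNs that exist before the forward fill), the column can be cast back afterwards; a sketch, with result standing for the chained expression above assigned to a name:
result = (df
          .set_index('ref_date')
          .reindex(idx_monthly, method='ffill')
          .reset_index()
          .rename(columns={'index': 'ref_date'}))
result['tag'] = result['tag'].astype(int)   # safe once ffill has removed all NaNs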
Thanks to the previous person who answered this question but deleted their answer. I got the solution:
df['ref_date'] = pd.to_datetime(df['ref_date'])
idx = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
df = df.set_index('ref_date').reindex(idx).ffill().reset_index().rename(columns={'index': 'ref_date'})
I have 4188006 rows of data. I want to group my data by its Code column value and put it into a dict, with the Code value as the key and the corresponding data as the value.
_a_stock_basic_data is my data:
Code date_time open high low close \
0 000001.SZ 2007-03-01 19.000000 19.000000 18.100000 18.100000
1 000002.SZ 2007-03-01 14.770000 14.800000 13.860000 14.010000
2 000004.SZ 2007-03-01 6.000000 6.040000 5.810000 6.040000
3 000005.SZ 2007-03-01 4.200000 4.280000 4.000000 4.040000
4 000006.SZ 2007-03-01 13.050000 13.470000 12.910000 13.110000
... ... ... ... ... ... ...
88002 603989.SH 2015-06-30 44.950001 50.250000 41.520000 49.160000
88003 603993.SH 2015-06-30 10.930000 12.500000 10.540000 12.360000
88004 603997.SH 2015-06-30 21.400000 24.959999 20.549999 24.790001
88005 603998.SH 2015-06-30 65.110001 65.110001 65.110001 65.110001
amt volume
0 418404992 22927500
1 659624000 46246800
2 23085800 3853070
3 131162000 31942000
4 251946000 19093500
.... ....
88002 314528000 6933840
88003 532364992 46215300
88004 169784992 7503370
88005 0 0
[4188006 rows x 8 columns]
And my code is:
_a_stock_basic_data = pandas.concat(dfs)
_all_universe = set(all_universe.values.tolist())
for _code in _all_universe:
    _temp_data = _a_stock_basic_data[_a_stock_basic_data['Code'] == _code]
    data[_code] = _temp_data[_temp_data.notnull()]
_all_universe contains the values of _a_stock_basic_data['Code']. Its length is about 2816, so the for loop runs 2816 times and takes a long time to complete.
So I wonder how to group these data with a higher-performance method. I think multiprocessing is an option, but shared memory would be a problem there. And as the data grows larger and larger, the performance of the code needs to be taken into consideration, otherwise it will cost a lot. Thank you for your help.
I'll show an example which I think will solve your problem. Below I make a dataframe with random elements, where the Code column has duplicate values:
a = pd.DataFrame({'a': np.arange(20), 'b': np.random.random(20), 'Code': np.random.randint(0, 11, 20)})
To group by the column Code, set it as index:
a.index = a['Code']
You can now use the index to access the data by the value of Code:
In : a.loc[8]
Out:
a b Code
Code
8 1 0.589938 8
8 3 0.030435 8
8 13 0.228775 8
8 14 0.329637 8
8 17 0.915402 8
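Another option that avoids one boolean scan per code is to let groupby split the frame once and build the dict directly; a sketch, assuming the Code column as in the question:
# One pass over the data instead of ~2816 separate filters.
data = {code: group for code, group in _a_stock_basic_data.groupby('Code')}
Since groupby only walks the frame once, this tends to be considerably faster than filtering per code.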
Did you try the pd.concat function? With it you can append frames along an axis of your choice.
pd.concat([data, _temp_data], axis=1)
- dict(_a_stock_basic_data.groupby(['Code']).size())  ## Number of occurrences per code
- dict(_a_stock_basic_data.groupby(['Code'])['Column_you_want_to_Aggregate'].sum())  ## If you want to do an aggregation on a certain column