Python Pandas: Find index based on value in DataFrame

Is there a way to specify a DataFrame index (row) based on matching text inside the dataframe?
I am importing a text file from the internet (http://oasis.pjm.com/doc/projload.txt) into a python pandas DataFrame every day. I am parsing out just some of the data and doing calculations to give me the peak value for each day. The specific group of data I need to gather starts with the section headed "RTO COMBINED HOUR ENDING INTEGRATED FORECAST LOAD MW".
I only need part of the data for my calculations. I am able to manually specify which index line to start with, but this number can change daily because of text added to the top of the file by the authors, for example:
Updated as of: 05-05-2016 1700 Constrained operations ARE expected in
the AEP, APS, BC, COMED, DOM,and PS zones on 05-06-2016. Constrained
operations ARE expected in the AEP, APS, BC, COMED, DOM,and PS zones
on 05-07-2016. The PS/ConEd 600/400 MW contract will be limited to
700MW on 05-06-16.
Is there a way to match text in the pandas DataFrame and specify the index of that match? Currently I am manually specifying the index I want to start with using the variable 'day' below on the 6th line. I would like this variable to hold the index (row) of the dataframe that includes the text I want to match.
The code below works but may stop working if the line number (index) changes:
from openpyxl import load_workbook
import pandas as pd

def forecastload():
    wb = load_workbook(filename='pjmactualload.xlsx')
    ws = wb['PJM Load']
    printRow = 13
    # put this in iteration to pull 2 rows of data at a time (one for each day) for 7 days max
    day = 239
    while day < 251:
        # pulls in first day only
        data = pd.read_csv("http://oasis.pjm.com/doc/projload.txt", skiprows=day, delim_whitespace=True, header=None, nrows=2)
        # sets data at HE 24 = to data that is in HE 13 - so I can delete column 0 data to allow checking 'max'
        data.at[1, 13] = data.at[1, 1]
        # get date for printing it with max load later on
        newDate = str(data.at[0, 0])
        # now delete first column to get rid of date data. date already saved as newDate
        data = data.drop(0, axis=1)
        data = data.drop(1, axis=1)
        # pull out max value of day
        # add index to this for iteration ie dayMax[x] = data.values.max()
        dayMax = data.max().max()
        dayMin = data.min().min()
        # print date and max load for that date
        actualMax = "Forecast Max"
        actualMin = "Forecast Min"
        dayMax = int(dayMax)
        maxResults = [str(newDate), int(dayMax), actualMax, dayMin, actualMin]
        d = 1
        for items in maxResults:
            ws.cell(row=printRow, column=d).value = items
            d += 1
        printRow += 1
        # print maxResults
        # l.writerows(maxResults)
        day = day + 2
    wb.save('pjmactualload.xlsx')

In this case I recommend using the command line to obtain a dataset that you can later read with pandas and process however you want.
To retrieve the data you can use curl and grep:
$ curl -s http://oasis.pjm.com/doc/projload.txt | grep -A 17 "RTO COMBINED HOUR ENDING INTEGRATED FORECAST" | tail -n +5
05/06/16 am 68640 66576 65295 65170 66106 70770 77926 83048 84949 85756 86131 86089
pm 85418 85285 84579 83762 83562 83289 82451 82460 84009 82771 78420 73258
05/07/16 am 66809 63994 62420 61640 61848 63403 65736 68489 71850 74183 75403 75529
pm 75186 74613 74072 73950 74386 74978 75135 75585 77414 76451 72529 67957
05/08/16 am 63583 60903 59317 58492 58421 59378 60780 62971 66289 68997 70436 71212
pm 71774 71841 71635 71831 72605 73876 74619 75848 78338 77121 72665 67763
05/09/16 am 63865 61729 60669 60651 62175 66796 74620 79930 81978 83140 84307 84778
pm 85112 85562 85568 85484 85766 85924 85487 85737 87366 84987 78666 72166
05/10/16 am 67581 64686 62968 62364 63400 67603 75311 80515 82655 84252 86078 87120
pm 88021 88990 89311 89477 89752 89860 89256 89327 90469 87730 81220 74449
05/11/16 am 70367 67044 65125 64265 65054 69060 76424 81785 84646 87097 89541 91276
pm 92646 93906 94593 94970 95321 95073 93897 93162 93615 90974 84335 77172
05/12/16 am 71345 67840 65837 64892 65600 69547 76853 82077 84796 87053 89135 90527
pm 91495 92351 92583 92473 92541 92053 90818 90241 90750 88135 81816 75042
Let's use the previous output (saved in the rto.txt file) to produce more readable, comma-separated data with awk and sed (save this result as rto2.txt for the next step):
$ awk '/^ [0-9]/{d=$1;print $0;next}{print d,$0}' rto.txt | sed 's/^ //;s/\s\+/,/g'
05/06/16,am,68640,66576,65295,65170,66106,70770,77926,83048,84949,85756,86131,86089
05/06/16,pm,85418,85285,84579,83762,83562,83289,82451,82460,84009,82771,78420,73258
05/07/16,am,66809,63994,62420,61640,61848,63403,65736,68489,71850,74183,75403,75529
05/07/16,pm,75186,74613,74072,73950,74386,74978,75135,75585,77414,76451,72529,67957
05/08/16,am,63583,60903,59317,58492,58421,59378,60780,62971,66289,68997,70436,71212
05/08/16,pm,71774,71841,71635,71831,72605,73876,74619,75848,78338,77121,72665,67763
05/09/16,am,63865,61729,60669,60651,62175,66796,74620,79930,81978,83140,84307,84778
05/09/16,pm,85112,85562,85568,85484,85766,85924,85487,85737,87366,84987,78666,72166
05/10/16,am,67581,64686,62968,62364,63400,67603,75311,80515,82655,84252,86078,87120
05/10/16,pm,88021,88990,89311,89477,89752,89860,89256,89327,90469,87730,81220,74449
05/11/16,am,70367,67044,65125,64265,65054,69060,76424,81785,84646,87097,89541,91276
05/11/16,pm,92646,93906,94593,94970,95321,95073,93897,93162,93615,90974,84335,77172
05/12/16,am,71345,67840,65837,64892,65600,69547,76853,82077,84796,87053,89135,90527
05/12/16,pm,91495,92351,92583,92473,92541,92053,90818,90241,90750,88135,81816,75042
Now, read and reshape the above result with pandas:
df = pd.read_csv("rto2.txt",names=["date","period"]+list(range(1,13)),index_col=[0,1])
df = df.stack().reset_index().rename(columns={"level_2":"hour",0:"value"})
df.index = pd.to_datetime(df.apply(lambda x: "{date} {hour}:00 {period}".format(**x),axis=1))
df.drop(["date", "hour", "period"], axis=1, inplace=True)
At this point you have a beautiful time series :)
In [10]: df.head()
Out[10]:
value
2016-05-06 01:00:00 68640
2016-05-06 02:00:00 66576
2016-05-06 03:00:00 65295
2016-05-06 04:00:00 65170
2016-05-06 05:00:00 66106
to obtain the statistics:
In[11]: df.groupby(df.index.date).agg([min,max])
Out[11]:
value
min max
2016-05-06 65170 86131
2016-05-07 61640 77414
2016-05-08 58421 78338
2016-05-09 60651 87366
2016-05-10 62364 90469
2016-05-11 64265 95321
2016-05-12 64892 92583
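If you prefer to stay in Python/pandas, here is a minimal sketch of locating the section heading programmatically instead of hard-coding day = 239. The offset from the heading to the first data row is an assumption; check it against the actual file.
import io
import urllib.request

import pandas as pd

# Download the raw text once, so we can both search it and feed it to pandas.
raw = urllib.request.urlopen("http://oasis.pjm.com/doc/projload.txt").read().decode("latin-1")
lines = raw.splitlines()

# Row index of the section heading we care about.
header_idx = next(i for i, line in enumerate(lines)
                  if "RTO COMBINED HOUR ENDING INTEGRATED FORECAST" in line)

# Assumed offset from the heading to the first forecast row; adjust after
# inspecting the file, since the preamble text changes daily.
day = header_idx + 4
data = pd.read_csv(io.StringIO(raw), skiprows=day, delim_whitespace=True,
                   header=None, nrows=2)
print(data)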
I hope this can help you.
Regards.

Here is how you can do what you are looking for. Sample code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10,4), columns=list('abcd'))
df.loc[df['a'] < 0.5, 'a'] = 1
You can refer to this documentation
(The original answer included a screenshot showing how to access the index.)
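As a minimal sketch of the same idea, here is one way to get the index labels of the rows matching a condition, using the toy frame from above:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 4), columns=list('abcd'))

# Index labels of the rows where the condition holds
matches = df.index[df['a'] < 0.5]
print(matches.tolist())

# For matching text instead of numbers, something like:
# df.index[df['a'].astype(str).str.contains("0.1")]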

Related

Python/Pandas: Search for date in one dataframe and return value in column of another dataframe with matching date

I have two dataframes, one with earnings date and code for before market/after market and the other with daily OHLC data.
First dataframe df:
     earnDate    anncTod
103  2015-11-18  0900
104  2016-02-24  0900
105  2016-05-18  0900
...  ..........  ....
128  2022-03-01  0900
129  2022-05-18  0900
130  2022-08-17  0900
Second dataframe af:
Datetime    Open      High      Low       Close     Volume
2005-01-03  36.3458   36.6770   35.5522   35.6833   3343500
...         ...       ...       ...       ...       ...
2022-04-22  246.5500  247.2000  241.4300  241.9100  1817977
I want to take a date from the first dataframe and find the open and/or close price in the second dataframe. Depending on anncTod value, I want to find the close price of the previous day (if =0900) or the open and close price on the following day (else). I'll use these numbers to calculate the overnight, intraday and close-to-close move which will be stored in new columns on df.
I'm not sure how to search for matching values and fetch values from that row but in a different column. I'm trying to do this with df.iloc and a for loop.
Here's the full code:
import pandas as pd
import requests
import datetime as dt
ticker = 'TGT'
## pull orats earnings dates and store in pandas dataframe
url = f'https://api.orats.io/datav2/hist/earnings.json?token=keyhere={ticker}'
response = requests.get(url, allow_redirects=True)
data = response.json()
df = pd.DataFrame(data['data'])
## reduce number of dates to last 28 quarters and remove updatedAt column
n = len(df.index)-28
df.drop(index=df.index[:n], inplace=True)
df = df.iloc[: , 1:-1]
## import daily OHLC stock data file
loc = f"C:\\Users\\anon\\Historical Stock Data\\us3000_tickers_daily\\{ticker}_daily.txt"
af = pd.read_csv(loc, delimiter=',', names=['Datetime','Open','High','Low','Close','Volume'])
## create total return, overnight and intraday columns in df
df['Total Move'] = '' ##col #2
df['Overnight'] = '' ##col #3
df['Intraday'] = '' ##col #4
for date in df['earnDate']:
    if df.iloc[date,1] == '0900':
        priorday = af.loc[af.index.get_loc(date)-1,0]
        priorclose = af.loc[priorday,4]
        open = af.loc[date,1]
        close = af.loc[date,4]
        df.iloc[date,2] = close/priorclose
        df.iloc[date,3] = open/priorclose
        df.iloc[date,4] = close/open
    else:
        print('afternoon')
I get an error:
if df.iloc[date,1] == '0900':
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
Converting the date columns to integers creates another error. Is there a better way I should go about doing this?
Ideal output would look like (made up numbers, abbreviated output):
earnDate    anncTod  Total Move  Overnight Move  Intraday Move
2015-11-18  0900     9%          7.2%            1.8%
But would include all the dates given in the first dataframe.
UPDATE
I swapped df.iloc for df.loc and that seems to have solved that problem. The new issue is searching for the variable 'date' in the second dataframe af. I have simplified the code to just print the value in the 'Open' column while I troubleshoot.
Here is updated and simplified code (all else remains the same):
import pandas as pd
import requests
import datetime as dt
ticker = 'TGT'
## pull orats earnings dates and store in pandas dataframe
url = f'https://api.orats.io/datav2/hist/earnings.json?token=keyhere={ticker}'
response = requests.get(url, allow_redirects=True)
data = response.json()
df = pd.DataFrame(data['data'])
## reduce number of dates to last 28 quarters and remove updatedAt column
n = len(df.index)-28
df.drop(index=df.index[:n], inplace=True)
df = df.iloc[: , 1:-1]
## set index to earnDate
df = df.set_index(pd.DatetimeIndex(df['earnDate']))
## import daily OHLC stock data file
loc = f"C:\\Users\\anon\\Historical Stock Data\\us3000_tickers_daily\\{ticker}_daily.txt"
af = pd.read_csv(loc, delimiter=',', names=['Datetime','Open','High','Low','Close','Volume'])
## create total return, overnight and intraday columns in df
df['Total Move'] = '' ##col #2
df['Overnight'] = '' ##col #3
df['Intraday'] = '' ##col #4
for date in df['earnDate']:
    if df.loc[date, 'anncTod'] == '0900':
        print(af.loc[date,'Open']) ##this is line generating error
    else:
        print('afternoon')
I now get KeyError:'2015-11-18'
Using loc to access a certain row assumes that the label you search for is in the index. Specifically, that means you'll need to set the date column as the index. For example:
import pandas as pd

df = pd.DataFrame({'earnDate': ['2015-11-18', '2015-11-19', '2015-11-20'],
                   'anncTod': ['0900', '1000', '0800'],
                   'Open': [111, 222, 333]})

df = df.set_index(df["earnDate"])

for date in df['earnDate']:
    if df.loc[date, 'anncTod'] == '0900':
        print(df.loc[date, 'Open'])

# prints
# 111
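The same idea applies to the lookup in the OHLC frame af that raised the KeyError: af.loc[date, 'Open'] only works once the dates are af's index. A minimal sketch, assuming the Datetime column holds parseable dates:
af = af.set_index(pd.DatetimeIndex(af['Datetime']))
print(af.loc['2015-11-18', 'Open'])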

Pandas parsing excel file all in column A

I have a wireless radio readout that basically dumps all of the data into one column (column 'A') of a spreadsheet (.xlsx). Is there any way to parse the twenty-plus columns into a dataframe for pandas? This is an example of the data that is in column A of the excel file:
DSP ALLMSINFO:SECTORID=0,CARRIERID=0;
Belgium351G
+++ HUAWEI 2020-04-03 10:04:47 DST
O&M #4421590
%%/*35687*/DSP ALLMSINFO:SECTORID=0,CARRIERID=0;%%
RETCODE = 0 Operation succeeded
Display Information of All MSs-
------------------------------
Sector ID Carrier ID MSID MSSTATUS MSPWR(dBm) DLCINR(dB) ULCINR(dB) DLRSSI(dBm) ULRSSI(dBm) DLFEC ULFEC DLREPETITIONFATCTOR ULREPETITIONFATCTOR DLMIMOFLAG BENUM NRTPSNUM RTPSNUM ERTPSNUM UGSNUM UL PER for an MS(0.001) NI Value of the Band Where an MS Is Located(dBm) DL Traffic Rate for an MS(byte/s) UL Traffic Rate for an MS(byte/s)
0 0 0011-4D10-FFBA Enter -2 29 27 -56 -107 21 20 0 0 MIMO B 2 0 0 0 0 0 -134 158000 46000
0 0 501F-F63B-FB3B Enter 13 27 28 -68 -107 21 20 0 0 MIMO A 2 0 0 0 0 0 -134 12 8
Basically I just want to parse this data and have the table in a dataframe. Any help would be greatly appreciated.
You could try pandas read excel
df = pd.read_excel(filename, skiprows=9)
This assumes we want to ignore the first 9 rows that don't make up the dataframe! Docs here https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
Load the excel file and split the column on the spaces.
A problem may occur with "DLMIMOFLAG" because it has a space in the data and this will cause it to be split over two columns. It's optional whether this is acceptable or if the columns are merged back together afterwards.
Add the header manually rather than load it, otherwise all the spaces in the header will confuse the loading & splitting routines.
import numpy as np
import pandas as pd
# Start on the first data row - row 10
# Make sure pandas knows that only data is being loaded by using
# header=None
df = pd.read_excel('radio.xlsx', skiprows=10, header=None)
This gives a dataframe that is only data, all held in one column.
To split these out, reference the first column with df.iloc[:,0], split it on whitespace with str.split(), and convert the result to a list of lists with values.tolist().
Together this looks like:
df2 = pd.DataFrame(df.iloc[:,0].str.split().values.tolist())
Note the example given has an extra column because of the space in "DLMIMOFLAG" causing it to be split over two columns. This will be referred to as "DLMIMOFLAG_A" and "DLMIMOFLAG_B".
Now add on the column headers.
Optionally create a list first.
column_names = ["Sector ID", "Carrier ID", "MSID", "MSSTATUS", "MSPWR(dBm)", "DLCINR(dB)", "ULCINR(dB)",
                "DLRSSI(dBm)", "ULRSSI(dBm)", "DLFEC", "ULFEC", "DLREPETITIONFATCTOR", "ULREPETITIONFATCTOR",
                "DLMIMOFLAG_A", "DLMIMOFLAG_B", "BENUM", "NRTPSNUM", "RTPSNUM", "ERTPSNUM", "UGSNUM",
                "UL PER for an MS(0.001)", "NI Value of the Band Where an MS Is Located(dBm)",
                "DL Traffic Rate for an MS(byte/s)", "UL Traffic Rate for an MS(byte/s)"]

df2.columns = column_names
This gives the output as a full dataframe with column headers.
Sector ID Carrier ID MSID MSSTATUS
0 0 0011-4D10-FFBA Enter
0 0 501F-F63B-FB3B Enter
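If you would rather merge the two DLMIMOFLAG halves back into a single column, a small sketch (assuming the _A/_B column names used above):
# Rejoin the value that was split over two columns, then drop the halves
df2["DLMIMOFLAG"] = df2["DLMIMOFLAG_A"] + " " + df2["DLMIMOFLAG_B"]
df2 = df2.drop(columns=["DLMIMOFLAG_A", "DLMIMOFLAG_B"])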

Generating monthly means for all columns without initializing a list for each column?

I have time series data and I want to generate the mean for each month, for each column. I have successfully done so, but by creating a list for each column, which wouldn't be feasible for thousands of columns.
How can I adapt my code to auto-populate the column names and values into a dataframe with thousands of columns?
For context, this data has 20 observations per hour for 12 months.
Original data:
timestamp 56TI1164 56FI1281 56TI1281 52FC1043 57TI1501
2016-12-31 23:55:00 117.9673 17876.27 39.10074 9302.815 49.23963
2017-01-01 00:00:00 118.1080 17497.48 39.10759 9322.773 48.97919
2017-01-01 00:05:00 117.7809 17967.33 39.11348 9348.223 48.94284
Output:
56TI1164 56FI1281 56TI1281 52FC1043 57TI1501
0 106.734147 16518.428734 16518.428734 7630.187992 45.992215
1 115.099825 18222.911023 18222.911023 9954.252911 47.334477
2 111.555504 19090.607211 19090.607211 9283.845649 48.939581
3 102.408996 18399.719852 18399.719852 7778.897037 48.130057
4 118.371951 20245.378742 20245.378742 9024.424210 64.796939
5 127.580516 21859.212675 21859.212675 9595.477455 70.952311
6 134.159082 22349.853561 22349.853561 10305.252112 75.195480
7 137.990638 21122.233427 21122.233427 10024.709142 74.755469
8 144.958318 18633.290818 18633.290818 11193.381098 66.776627
9 122.406489 20258.135923 20258.135923 10504.604420 61.793355
10 104.817850 18762.070668 18762.070668 9361.052983 51.802615
11 106.589672 20049.809554 20049.809554 9158.685383 51.611633
Successful code:
#separate data into months
v = list(range(1,13))
data_month = []
for i in v:
    data_month.append(data[(data.index.month==i)])
# average per month for each sensor
mean_56TI1164 = []
mean_56FI1281 = []
mean_56TI1281 = []
mean_52FC1043 = []
mean_57TI1501 = []
for i in range(0,12):
    mean_56TI1164.append(data_month[i]['56TI1164'].mean())
    mean_56FI1281.append(data_month[i]['56FI1281'].mean())
    mean_56TI1281.append(data_month[i]['56FI1281'].mean())
    mean_52FC1043.append(data_month[i]['52FC1043'].mean())
    mean_57TI1501.append(data_month[i]['57TI1501'].mean())
mean_df = {'56TI1164': mean_56TI1164, '56FI1281': mean_56FI1281, '56TI1281': mean_56TI1281, '52FC1043': mean_52FC1043, '57TI1501': mean_57TI1501}
mean_df = pd.DataFrame(mean_df, columns= ['56TI1164', '56FI1281', '56TI1281', '52FC1043', '57TI1501'])
mean_df
Unsuccessful attempt to condense:
col = list(data.columns)
mean_df = pd.DataFrame()
for i in range(0,12):
    for j in col:
        mean_df[j].append(data_month[i][j].mean())
mean_df
As suggested by G. Anderson, you can use groupby as in this example:
import pandas as pd
import io
csv="""timestamp 56TI1164 56FI1281 56TI1281 52FC1043 57TI1501
2016-12-31 23:55:00 117.9673 17876.27 39.10074 9302.815 49.23963
2017-01-01 00:00:00 118.1080 17497.48 39.10759 9322.773 48.97919
2017-01-01 00:05:00 117.7809 17967.33 39.11348 9348.223 48.94284
2018-01-01 00:05:00 120.0000 17967.33 39.11348 9348.223 48.94284
2018-01-01 00:05:00 124.0000 17967.33 39.11348 9348.223 48.94284"""
# The following lines read your data into a pandas dataframe;
# it may help if your data comes in the form you wrote in the question
dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
data = pd.read_csv(io.StringIO(csv), sep=r'\s+(?!\d\d:\d\d:\d\d)',
                   date_parser=dateparse, index_col=0, engine='python')
# Here is where your data is resampled by month and mean is calculated
data.groupby(pd.Grouper(freq='M')).mean()
# If you have missing months, use this instead:
#data.groupby(pd.Grouper(freq='M')).mean().dropna()
Result of data.groupby(pd.Grouper(freq='M')).mean().dropna() will be:
56TI1164 56FI1281 56TI1281 52FC1043 57TI1501
timestamp
2016-12-31 117.96730 17876.270 39.100740 9302.815 49.239630
2017-01-31 117.94445 17732.405 39.110535 9335.498 48.961015
2018-01-31 122.00000 17967.330 39.113480 9348.223 48.942840
Please note that I used data.groupby(pd.Grouper(freq='M')).mean().dropna() to get rid of NaN for the missing months (I added some data for January 2018 skipping what's in between).
Also note that the convoluted read_csv uses a regular expression as a separator: \s+ means one or more whitespace characters, while (?!\d\d:\d\d:\d\d) means "skip this whitespace if followed by something like 23:55:00".
Lastly, engine='python' avoids warnings when read_csv() is used with a regular expression separator.

Organizing dates and holidays in a dataframe

Scenario: I have one dataframe with different columns of data, and another single dataframe with lists of dates.
Example of dataframe1:
iterationcount  datecolumn  list
iteration5                  1
iteration5                  2
iteration3                  2
iteration3                  2
iteration4                  33
iteration3                  4
iteration1                  5
iteration2                  3
iteration5                  2
iteration4                  22
Example of dataframe2:
iteration1 01.01.2018 26.01.2018 30.03.2018
iteration2 01.01.2018 30.03.2018 02.04.2018 25.12.2018 26.12.2018
iteration3
iteration4 01.01.2018 15.01.2018 19.02.2018
iteration5 01.01.2018 19.02.2018 30.03.2018 21.05.2018 02.07.2018 06.08.2018 03.09.2018 08.10.2018 12.11.2018
The second dataframe is a list of holidays for each of the iterations, and it will be used to fill the second column of the first dataframe.
Constraints: For each iteration of the first dataframe the user will select a month and year; the script will then find the first date of that month. If that date is in the list of dates of dataframe2 for that iteration, then pick the next working date based on the program calendar.
Ex: The user selects January 2018 and the code returns 01/01/2018. For the first iteration, that date is a holiday, so pick the next workday, in this case 02/01/2018, and then enter this date into all rows of dataframe1 corresponding to that iteration:
iterationcount  datecolumn  list
iteration5                  1
iteration5                  2
iteration3                  2
iteration3                  2
iteration4                  33
iteration3                  4
iteration1      02/01/2018  5
iteration2                  3
iteration5                  2
iteration4                  22
Then move to the next iteration (some iterations will have the same calendar dates).
Code: I have tried multiple approaches so far, but could not achieve the result. The closest I think I got was with:
import pandas as pd
import datetime
import os
from os import listdir
from os.path import isfile, join
import glob

## Get Adjustments
mypath3 = "//DGMS/Desktop/Uploader_v1.xlsm"
ApplyOnDates = pd.read_excel(open(mypath3, 'rb'), sheet_name='Holidays')

# Get content
mypath = "//DGMS/Desktop/Uploaded"
all_files = glob.glob(os.path.join(mypath, "*.xls*"))
contentdataframes = []
contentdataframes2 = []
for f in all_files:
    df = pd.read_excel(f)
    df['Name'] = os.path.basename(f).split('.')[0].split('_')[0]
    df['ApplyOn'] = ''
    mask = df.columns.str.contains('Base|Last|Fixing|Cash')
    c2 = df.columns[~mask].tolist()
    df = df[c2]
    contentdataframes.append(df)

finalfinal = pd.concat(contentdataframes2)
for row in finalfinal.Name.itertuple():
    datedatedate = datetime.datetime(2018, 01, 1)
    if (pd.np.where(ApplyOnDates.Index.str.contains(finalfinal(row)).isin(datedatedate) = True:
        datetouse = datedatedate + datetime.timedelta(days=1)
    else:
        datetouse = datedatedate
    finalfinal['ApplyOn'] = datetouse
Question: Basically, my main trouble here is being able to match the rows in both dataframes and search the date in the column of the holidays dataframe. Is there a proper way to do this?
Obs: I was able to achieve a similar result directly in VBA by using Excel functions (VLOOKUP, MATCH...); the problem is that doing this in Excel for this amount of data basically crashes the file every time.
So you basically want to merge the columns of dataframe2 into dataframe1, right? Try using merge:
newdf = pd.DataFrame.merge(dataframe1, dataframe2, left_on='iterationcount',
                           right_on='iterationcount', how='inner', indicator=False)
That should give you a new frame.
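For the holiday check itself, here is a minimal sketch of the logic described in the question; the column names and the holidays_by_iter lookup are assumptions for illustration:
import pandas as pd
from pandas.tseries.offsets import BDay

def first_workday(year, month, holidays):
    """First day of the month, pushed forward past weekends and listed holidays."""
    day = pd.Timestamp(year=year, month=month, day=1)
    while day.weekday() >= 5 or day in holidays:
        day = day + BDay(1)
    return day

# Hypothetical per-iteration holiday sets built from dataframe2
holidays_by_iter = {
    "iteration1": {pd.Timestamp("2018-01-01"), pd.Timestamp("2018-01-26"), pd.Timestamp("2018-03-30")},
    "iteration3": set(),
}

df1 = pd.DataFrame({"iterationcount": ["iteration1", "iteration3"], "list": [5, 4]})
df1["datecolumn"] = df1["iterationcount"].map(
    lambda it: first_workday(2018, 1, holidays_by_iter.get(it, set()))
)
print(df1)  # iteration1 gets 2018-01-02 because 2018-01-01 is listed as a holiday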

manipulating value of pandas dataframe cell based on value in previous row without iteration

I have a pandas dataframe with ~3900 rows and 6 columns compiled from Google Finance. One of these columns defines a time in unix format, specifically a time during the trading day for a market, in this case the DJIA from 930A EST to 4P EST. However, only the cell for the beginning of each day (930A) has the complete unix time stamp (prefixed with an 'a'); the others are the minutes after the first time of the day.
Here is an example of the raw data:
Date Close High Low Open Volume
0 a1450449000 173.87 173.87 173.83 173.87 46987
1 1 173.61 173.83 173.55 173.78 19275
2 2 173.37 173.63 173.37 173.60 16014
3 3 173.50 173.59 173.31 173.34 14198
4 4 173.50 173.57 173.46 173.52 7010
Date Close High Low Open Volume
388 388 171.16 171.27 171.15 171.26 11809
389 389 171.11 171.23 171.07 171.18 30449
390 390 170.89 171.16 170.89 171.09 163937
391 a1450708200 172.28 172.28 172.28 172.28 23880
392 1 172.27 172.27 172.00 172.06 2719
The change at index 391 is not contiguous, so a solution like @Stefan's would unfortunately not correctly adjust the Date value.
I can easily enough go through with a lambda, line by line, remove the 'a' (if necessary), convert the values to integers, and convert the minutes past 930A into seconds with the following code:
import pandas as pd
import numpy as np
import datetime
bars = pd.read_csv(r'http://www.google.com/finance/getprices?i=60&p=10d&f=d,o,h,l,c,v&df=cpct&q=DIA', skiprows=7, header=None, names=['Date', 'Close', 'High', 'Low', 'Open', 'Volume'])
bars['Date'] = bars['Date'].map(lambda x: int(x[1:]) if x[0] == 'a' else int(x))
bars['Date'] = bars['Date'].map(lambda u: u * 60 if u < 400 else u)
Now what I would like to do is, without iterating over the dataframe, determine whether the value of bars['Date'] is not a unix time stamp (e.g. < 24000 in terms of this data set). If so, I want to add that value to the time stamp for that particular day to create a complete unix time stamp for each entry.
I know that I can compare the previous row via:
bars['Date'][:-1]>bars['Date'][1:]
I feel like that would be the way to go, but I can't figure out a way to use this in a function as it returns a series.
Thanks in advance for any help!
You could add a new column that always contains the latest Timestamp and then add to the Date where necessary.
threshold = 24000
bars['Timestamp'] = bars[bars['Date']>threshold].loc[:, 'Date']
bars['Timestamp'] = bars['Timestamp'].fillna(method='ffill')
bars['Date'] = bars.apply(lambda x: x.Date + x.Timestamp if x.Date < threshold else x.Date, axis=1)
bars.drop('Timestamp', axis=1, inplace=True)
to get:
Date Close High Low Open Volume
0 1450449000 173.87 173.870 173.83 173.87 46987
1 1450449060 173.61 173.830 173.55 173.78 19275
2 1450449120 173.37 173.630 173.37 173.60 16014
3 1450449180 173.50 173.590 173.31 173.34 14198
4 1450449240 173.50 173.570 173.46 173.52 7010
5 1450449300 173.66 173.680 173.44 173.45 10597
6 1450449360 173.40 173.670 173.34 173.67 14270
7 1450449420 173.36 173.360 173.13 173.32 22485
8 1450449480 173.29 173.480 173.25 173.36 18542
