How to create a dataframe of the output - python

I run a query against a web server with certain added criteria.
I specify a date range which alters the date in the URL.
I then pull the data lines for the specified symbols, which gives me the short volume etc. for each stock over that time frame.
However, I want to be able to get the output as a dataframe.
At the moment df still holds only the dataframe from the last URL read, not the accumulated output.
I tried to use list_.append, which I did not get to work.
import pandas as pd
from datetime import datetime
import urllib.error

symbols = ['AABA']
start_date = datetime(2019, 5, 10)
end_date = datetime(2019, 5, 15)
datelist = pd.date_range(start_date, periods=(end_date - start_date).days + 1).tolist()

for date in datelist:
    url = f"http://regsho.finra.org/FNYXshvol{date.strftime('%Y%m%d')}.txt"
    try:
        df = pd.read_csv(url, delimiter='|')
        if any(df['Symbol'].isin(symbols)):
            stocks = df[df['Symbol'].isin(symbols)].to_string(index=False, header=False)
            print(stocks)
        else:
            print(f'No stock found for {date.date()}')
    except urllib.error.HTTPError:
        continue
The result is now:
20190510 AABA 2300.0 0.0 14617.0 N
20190513 AABA 2816.0 0.0 39128.0 N
20190514 AABA 1761.0 0.0 26191.0 N
20190515 AABA 24092.0 0.0 62745.0 N
I want the result to be in a dataframe so that I can directly export it to CSV.

Why do you convert the dataframe to a string when you want the output to be a dataframe? (For example, df[df['Symbol'].isin(symbols)].to_csv('AABA.csv', index=False, header=False) would write it out directly.) Anyway, to convert a string back to a dataframe you can use pandas.read_fwf:
from io import StringIO
df=pd.read_fwf(StringIO(stocks), header=None)
OUTPUT:
0 1 2 3 4 5
0 20190510 AABA 2300.0 0.0 14617.0 N
1 20190513 AABA 2816.0 0.0 39128.0 N
2 20190514 AABA 1761.0 0.0 26191.0 N
3 20190515 AABA 24092.0 0.0 62745.0 N

stocks is a dataframe before you convert it to a string. Just keep it as a dataframe, store it in a list and just concat that list to obtain a full dataframe:
dflist = []
for date in datelist:
    url = f"http://regsho.finra.org/FNYXshvol{date.strftime('%Y%m%d')}.txt"
    try:
        df = pd.read_csv(url, delimiter='|')
        if any(df['Symbol'].isin(symbols)):
            stocks = df[df['Symbol'].isin(symbols)]
            print(stocks.to_string(index=False, header=False))
            dflist.append(stocks)
        else:
            print(f'No stock found for {date.date()}')
    except urllib.error.HTTPError:
        continue

df = pd.concat(dflist)
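Since the stated goal is a CSV export, the concatenated frame can then be written out in one step (the filename here is just an example):

df.to_csv('short_volume.csv', index=False)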

Related

Python/Pandas: Search for date in one dataframe and return value in column of another dataframe with matching date

I have two dataframes, one with earnings dates and a code for before-market/after-market announcements, and the other with daily OHLC data.
First dataframe df:
     earnDate    anncTod
103  2015-11-18  0900
104  2016-02-24  0900
105  2016-05-18  0900
...  ..........  .......
128  2022-03-01  0900
129  2022-05-18  0900
130  2022-08-17  0900
Second dataframe af:
Datetime    Open      High      Low       Close     Volume
2005-01-03  36.3458   36.6770   35.5522   35.6833   3343500
...         ...       ...       ...       ...       ...
2022-04-22  246.5500  247.2000  241.4300  241.9100  1817977
I want to take a date from the first dataframe and find the open and/or close price in the second dataframe. Depending on anncTod value, I want to find the close price of the previous day (if =0900) or the open and close price on the following day (else). I'll use these numbers to calculate the overnight, intraday and close-to-close move which will be stored in new columns on df.
I'm not sure how to search for matching values and fetch a value from the same row but a different column. I'm trying to do this with df.iloc and a for loop.
Here's the full code:
import pandas as pd
import requests
import datetime as dt

ticker = 'TGT'

## pull orats earnings dates and store in pandas dataframe
url = f'https://api.orats.io/datav2/hist/earnings.json?token=keyhere={ticker}'
response = requests.get(url, allow_redirects=True)
data = response.json()
df = pd.DataFrame(data['data'])

## reduce number of dates to last 28 quarters and remove updatedAt column
n = len(df.index) - 28
df.drop(index=df.index[:n], inplace=True)
df = df.iloc[:, 1:-1]

## import daily OHLC stock data file
loc = f"C:\\Users\\anon\\Historical Stock Data\\us3000_tickers_daily\\{ticker}_daily.txt"
af = pd.read_csv(loc, delimiter=',', names=['Datetime', 'Open', 'High', 'Low', 'Close', 'Volume'])

## create total return, overnight and intraday columns in df
df['Total Move'] = ''   ## col #2
df['Overnight'] = ''    ## col #3
df['Intraday'] = ''     ## col #4

for date in df['earnDate']:
    if df.iloc[date, 1] == '0900':
        priorday = af.loc[af.index.get_loc(date) - 1, 0]
        priorclose = af.loc[priorday, 4]
        open = af.loc[date, 1]
        close = af.loc[date, 4]
        df.iloc[date, 2] = close / priorclose
        df.iloc[date, 3] = open / priorclose
        df.iloc[date, 4] = close / open
    else:
        print('afternoon')
I get an error:
if df.iloc[date,1] == '0900':
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
Converting the date columns to integers creates another error. Is there a better way I should go about doing this?
Ideal output would look like (made up numbers, abbreviated output):
earnDate    anncTod  Total Move  Overnight Move  Intraday Move
2015-11-18  0900     9%          7.2%            1.8%
But would include all the dates given in the first dataframe.
UPDATE
I swapped df.iloc for df.loc and that seems to have solved that problem. The new issue is searching for the variable date in the second dataframe af. I have simplified the code to just print the value in the 'Open' column while I troubleshoot.
Here is updated and simplified code (all else remains the same):
import pandas as pd
import requests
import datetime as dt

ticker = 'TGT'

## pull orats earnings dates and store in pandas dataframe
url = f'https://api.orats.io/datav2/hist/earnings.json?token=keyhere={ticker}'
response = requests.get(url, allow_redirects=True)
data = response.json()
df = pd.DataFrame(data['data'])

## reduce number of dates to last 28 quarters and remove updatedAt column
n = len(df.index) - 28
df.drop(index=df.index[:n], inplace=True)
df = df.iloc[:, 1:-1]

## set index to earnDate
df = df.set_index(pd.DatetimeIndex(df['earnDate']))

## import daily OHLC stock data file
loc = f"C:\\Users\\anon\\Historical Stock Data\\us3000_tickers_daily\\{ticker}_daily.txt"
af = pd.read_csv(loc, delimiter=',', names=['Datetime', 'Open', 'High', 'Low', 'Close', 'Volume'])

## create total return, overnight and intraday columns in df
df['Total Move'] = ''   ## col #2
df['Overnight'] = ''    ## col #3
df['Intraday'] = ''     ## col #4

for date in df['earnDate']:
    if df.loc[date, 'anncTod'] == '0900':
        print(af.loc[date, 'Open'])   ## this is the line generating the error
    else:
        print('afternoon')
I now get KeyError: '2015-11-18'
Using loc to access a certain row assumes that the label you search for is in the index. Specifically, that means you'll need to set the date column as the index. Example:
import pandas as pd

df = pd.DataFrame({'earnDate': ['2015-11-18', '2015-11-19', '2015-11-20'],
                   'anncTod': ['0900', '1000', '0800'],
                   'Open': [111, 222, 333]})
df = df.set_index(df["earnDate"])

for date in df['earnDate']:
    if df.loc[date, 'anncTod'] == '0900':
        print(df.loc[date, 'Open'])

# prints
# 111
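The same fix applies to af: once its Datetime column is the index, the previous trading day's close can be looked up positionally with get_loc. A minimal self-contained sketch (column names follow the question; the prices are made up):

import pandas as pd

af = pd.DataFrame({'Datetime': ['2015-11-17', '2015-11-18', '2015-11-19'],
                   'Open': [70.0, 72.0, 71.5],
                   'Close': [71.0, 70.5, 71.8]})
af = af.set_index('Datetime')

date = '2015-11-18'                     # an earnings date announced at 0900
pos = af.index.get_loc(date)            # positional location of that row
priorclose = af['Close'].iloc[pos - 1]  # previous trading day's close
total_move = af.loc[date, 'Close'] / priorclose
overnight = af.loc[date, 'Open'] / priorclose
intraday = af.loc[date, 'Close'] / af.loc[date, 'Open']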

Create a dataframe to detail information of another dataframe

I have one dataframe with the value and number of payments and the start date. I'd like to create a new dataframe with all the payments, one row per month.
Can you give me a tip about how to finish it?
# Import pandas library
import pandas as pd

# initialize list of lists
data = [[1, '2017-06-09', 300, 3]]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['ID', 'DATE', 'VALUE', 'PAYMENTS'])

# print dataframe
df
EXISTING DATAFRAME FIELDS:
DESIRED DATAFRAME (open up the payments and update the date):
My first thought was to make a loop appending the payment numbers. If that loop also filled in the other fields and generated the new dataframe, the task would be done.
result = []
for value in df["PAYMENTS"]:
    if value == 1:
        result.append(1)
    elif value == 3:
        for x in range(1, 4):
            result.append(x)
    else:
        for x in range(1, 7):
            result.append(x)
Here's my try:
df.VALUE = df.VALUE / df.PAYMENTS
df = df.merge(df.ID.repeat(df.PAYMENTS), on='ID', how='outer')
df.PAYMENTS = df.groupby('ID').cumcount() + 1
Output:
ID DATE VALUE PAYMENTS
0 1 2017-06-09 100.0 1
1 1 2017-06-09 100.0 2
2 1 2017-06-09 100.0 3
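If the DATE should also advance one month per installment, a date offset derived from the installment counter can be layered on top of this output. A sketch, assuming evenly spaced monthly payments:

df['DATE'] = pd.to_datetime(df['DATE'])
df['DATE'] = df.apply(lambda r: r['DATE'] + pd.DateOffset(months=r['PAYMENTS'] - 1), axis=1)
# DATE becomes 2017-06-09, 2017-07-09, 2017-08-09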

Generating monthly means for all columns without initializing a list for each column?

I have time series data for which I want to generate the mean of each month, for each column. I have successfully done so, but by creating a list for each column, which wouldn't be feasible for thousands of columns.
How can I adapt my code to auto-populate the column names and values into a dataframe with thousands of columns?
For context, this data has 20 observations per hour for 12 months.
Original data:
timestamp 56TI1164 56FI1281 56TI1281 52FC1043 57TI1501
2016-12-31 23:55:00 117.9673 17876.27 39.10074 9302.815 49.23963
2017-01-01 00:00:00 118.1080 17497.48 39.10759 9322.773 48.97919
2017-01-01 00:05:00 117.7809 17967.33 39.11348 9348.223 48.94284
Output:
56TI1164 56FI1281 56TI1281 52FC1043 57TI1501
0 106.734147 16518.428734 16518.428734 7630.187992 45.992215
1 115.099825 18222.911023 18222.911023 9954.252911 47.334477
2 111.555504 19090.607211 19090.607211 9283.845649 48.939581
3 102.408996 18399.719852 18399.719852 7778.897037 48.130057
4 118.371951 20245.378742 20245.378742 9024.424210 64.796939
5 127.580516 21859.212675 21859.212675 9595.477455 70.952311
6 134.159082 22349.853561 22349.853561 10305.252112 75.195480
7 137.990638 21122.233427 21122.233427 10024.709142 74.755469
8 144.958318 18633.290818 18633.290818 11193.381098 66.776627
9 122.406489 20258.135923 20258.135923 10504.604420 61.793355
10 104.817850 18762.070668 18762.070668 9361.052983 51.802615
11 106.589672 20049.809554 20049.809554 9158.685383 51.611633
Successful code:
# separate data into months
v = list(range(1, 13))
data_month = []
for i in v:
    data_month.append(data[data.index.month == i])

# average per month for each sensor
mean_56TI1164 = []
mean_56FI1281 = []
mean_56TI1281 = []
mean_52FC1043 = []
mean_57TI1501 = []
for i in range(0, 12):
    mean_56TI1164.append(data_month[i]['56TI1164'].mean())
    mean_56FI1281.append(data_month[i]['56FI1281'].mean())
    mean_56TI1281.append(data_month[i]['56FI1281'].mean())
    mean_52FC1043.append(data_month[i]['52FC1043'].mean())
    mean_57TI1501.append(data_month[i]['57TI1501'].mean())

mean_df = {'56TI1164': mean_56TI1164, '56FI1281': mean_56FI1281, '56TI1281': mean_56TI1281,
           '52FC1043': mean_52FC1043, '57TI1501': mean_57TI1501}
mean_df = pd.DataFrame(mean_df, columns=['56TI1164', '56FI1281', '56TI1281', '52FC1043', '57TI1501'])
mean_df
Unsuccessful attempt to condense:
col = list(data.columns)
mean_df = pd.DataFrame()
for i in range(0,12):
for j in col:
mean_df[j].append(data_month[i][j].mean())
mean_df
As suggested by G. Anderson, you can use groupby as in this example:
import pandas as pd
import io
from datetime import datetime

csv = """timestamp 56TI1164 56FI1281 56TI1281 52FC1043 57TI1501
2016-12-31 23:55:00 117.9673 17876.27 39.10074 9302.815 49.23963
2017-01-01 00:00:00 118.1080 17497.48 39.10759 9322.773 48.97919
2017-01-01 00:05:00 117.7809 17967.33 39.11348 9348.223 48.94284
2018-01-01 00:05:00 120.0000 17967.33 39.11348 9348.223 48.94284
2018-01-01 00:05:00 124.0000 17967.33 39.11348 9348.223 48.94284"""

# The following lines read your data into a pandas dataframe;
# it may help if your data comes in the form you wrote in the question
dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
data = pd.read_csv(io.StringIO(csv), sep=r'\s+(?!\d\d:\d\d:\d\d)',
                   date_parser=dateparse, index_col=0, engine='python')

# Here is where your data is resampled by month and the mean is calculated
data.groupby(pd.Grouper(freq='M')).mean()

# If you have missing months, use this instead:
# data.groupby(pd.Grouper(freq='M')).mean().dropna()
Result of data.groupby(pd.Grouper(freq='M')).mean().dropna() will be:
56TI1164 56FI1281 56TI1281 52FC1043 57TI1501
timestamp
2016-12-31 117.96730 17876.270 39.100740 9302.815 49.239630
2017-01-31 117.94445 17732.405 39.110535 9335.498 48.961015
2018-01-31 122.00000 17967.330 39.113480 9348.223 48.942840
Please note that I used data.groupby(pd.Grouper(freq='M')).mean().dropna() to get rid of NaN for the missing months (I added some data for January 2018, skipping what's in between).
Also note that the somewhat convoluted read_csv uses a regular expression as a separator: \s+ means one or more whitespace characters, while (?!\d\d:\d\d:\d\d) means "skip this whitespace if it is followed by something like 23:55:00".
Finally, engine='python' avoids warnings when read_csv() is used with a regular-expression separator.
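Note that pd.Grouper(freq='M') yields one row per year-month. If you instead want the twelve-row output from the question (each calendar month pooled across years), grouping on the month number condenses the original per-column code to a single line:

mean_df = data.groupby(data.index.month).mean()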

Organizing dates and holidays in a dataframe

Scenario: I have one dataframe with different columns of data, and another dataframe with lists of dates.
Example of dataframe1:
iterationcount  datecolumn  list
iteration5                  1
iteration5                  2
iteration3                  2
iteration3                  2
iteration4                  33
iteration3                  4
iteration1                  5
iteration2                  3
iteration5                  2
iteration4                  22
Example of dataframe2:
iteration1 01.01.2018 26.01.2018 30.03.2018
iteration2 01.01.2018 30.03.2018 02.04.2018 25.12.2018 26.12.2018
iteration3
iteration4 01.01.2018 15.01.2018 19.02.2018
iteration5 01.01.2018 19.02.2018 30.03.2018 21.05.2018 02.07.2018 06.08.2018 03.09.2018 08.10.2018 12.11.2018
The second dataframe is a list of holidays for each of the iterations, and it will be used to fill the second column of the first dataframe.
Constraints: For each iteration of the first dataframe the user will select a month and year; the script will then find the first date of that month. If that date is on the list of dates in dataframe2 for that iteration, pick the next working date based on the program calendar.
Ex: The user selects January 2018 and the code returns 01/01/2018. For the first iteration that date is a holiday, so pick the next workday, in this case 02/01/2018, and then write this date to all rows of dataframe1 corresponding to that iteration:
iterationcount  datecolumn  list
iteration5                  1
iteration5                  2
iteration3                  2
iteration3                  2
iteration4                  33
iteration3                  4
iteration1      02/01/2018  5
iteration2                  3
iteration5                  2
iteration4                  22
Then move to the next iteration (some iterations will have the same calendar dates).
Code: I have tried multiple approaches so far, but could not achieve the result. The closest I think I got was with:
import pandas as pd
import datetime
import os
from os import listdir
from os.path import isfile, join
import glob

## Get Adjustments
mypath3 = "//DGMS/Desktop/Uploader_v1.xlsm"
ApplyOnDates = pd.read_excel(open(mypath3, 'rb'), sheet_name='Holidays')

# Get content
mypath = "//DGMS/Desktop/Uploaded"
all_files = glob.glob(os.path.join(mypath, "*.xls*"))

contentdataframes = []
contentdataframes2 = []
for f in all_files:
    df = pd.read_excel(f)
    df['Name'] = os.path.basename(f).split('.')[0].split('_')[0]
    df['ApplyOn'] = ''
    mask = df.columns.str.contains('Base|Last|Fixing|Cash')
    c2 = df.columns[~mask].tolist()
    df = df[c2]
    contentdataframes.append(df)

finalfinal = pd.concat(contentdataframes2)
for row in finalfinal.Name.itertuples():
    datedatedate = datetime.datetime(2018, 1, 1)
    if pd.np.where(ApplyOnDates.Index.str.contains(finalfinal(row)).isin(datedatedate)) == True:
        datetouse = datedatedate + datetime.timedelta(days=1)
    else:
        datetouse = datedatedate
    finalfinal['ApplyOn'] = datetouse
Question: Basically, my main trouble here is matching the rows in both dataframes and searching for the date among the columns of the holidays dataframe. Is there a proper way to do this?
Obs: I was able to achieve a similar result directly in VBA, by using Excel functions (VLOOKUP, MATCH...); the problem is that doing it in Excel for this amount of data basically crashes the file every time.
So you want to basically merge the columns of dataframe2 onto dataframe1, right? Try to use merge:
newdf = pd.DataFrame.merge(dataframe1, dataframe2, left_on='iterationcount',
right_on='iterationcount', how='inner', indicator=False)
That should give you a new frame.
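Merging brings the holiday dates alongside dataframe1, but the "first workday of the selected month" logic still needs a roll-forward step. A minimal sketch of that part, assuming the per-iteration holidays from dataframe2 have been collected into a plain dict (the names and dates below are illustrative):

import pandas as pd
from pandas.tseries.offsets import CustomBusinessDay

holidays = {'iteration1': ['2018-01-01', '2018-01-26', '2018-03-30'],
            'iteration3': []}  # illustrative subset of dataframe2

def first_workday(year, month, iteration):
    # roll the first of the month forward past weekends and that iteration's holidays
    bday = CustomBusinessDay(holidays=pd.to_datetime(holidays.get(iteration, [])))
    return bday.rollforward(pd.Timestamp(year=year, month=month, day=1))

print(first_workday(2018, 1, 'iteration1'))  # 2018-01-02, since 01-01 is a holiday

The datecolumn of dataframe1 could then be filled with something like dataframe1['iterationcount'].map(lambda it: first_workday(2018, 1, it)).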

Taking certain column values from one row in a Pandas Dataframe and adding them into another dataframe

I would like to copy certain column values from a specific row in my dataframe df to another dataframe called bestdf
Here I create an empty dataframe (called bestdf):
new_columns = ['DATE', 'PRICE1', 'PRICE2']
bestdf = pd.DataFrame(columns=new_columns)
bestdf = bestdf.set_index(['DATE'])
I've located a certain row of df and assigned it to a variable last_time:
last_time = df.iloc[-1]
print(last_time)
gives me
DATETIME PRC
2016-10-03 00:07:39.295000 335.82
I then want to take the 2016-10-03 from the DATETIME column and put that into the DATE column of my other dataframe (bestdf).
I also want to take the PRC and put it into the PRICE1 column of my empty dataframe. I want bestdf to look like this:
DATE PRICE1 PRICE2
2016-10-03 335.82
Here is what I've got so far:
sample_date = str(last_time).split()
best_price = sample_date[2]
sample_date = sample_date[0]
bestdf['DATE'] = sample_date
bestdf['PRICE1'] = best_price
This doesn't seem to work though. FYI I also want to put this into a loop (where last_time will be amended and each time the new values will be written to a new row). I'm just currently trying to get the functionality correct.
Please help!
Thanks
There are multiple ways to do what you are looking to do.
You can also break your problem down into multiple pieces; that way you will be able to apply different steps to solve them.
Here is an example:
import pandas as pd

data = [{'DATETIME': '2016-10-03 00:07:39.295000', 'PRC': 335.29},
        {'DATETIME': '2016-10-03 00:07:39.295000', 'PRC': 33.9},
        {'DATETIME': '2016-10-03 00:07:39.295000', 'PRC': 10.9}]
df = pd.DataFrame.from_dict(data, orient='columns')
df
df
output:
DATETIME PRC
0 2016-10-03 00:07:39.295000 335.29
1 2016-10-03 00:07:39.295000 33.90
2 2016-10-03 00:07:39.295000 10.90
code continued:
bestdf = df[df['PRC'] > 15].copy()
# we filter data from original df and make a copy
bestdf.columns = ['DATE','PRICE1']
# we change columns as we need
bestdf['PRICE2'] = None
bestdf
output:
DATE PRICE1 PRICE2
0 2016-10-03 00:07:39.295000 335.29 None
1 2016-10-03 00:07:39.295000 33.90 None
code continued:
bestdf['DATE'] = bestdf['DATE'].apply(lambda value: value.split(' ')[0])
# we change column format based on how we need it to be
bestdf
output:
DATE PRICE1 PRICE2
0 2016-10-03 335.29 None
1 2016-10-03 33.90 None
We can also do the same thing with datetime objects; the column doesn't necessarily have to be a string.
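To cover the loop from the original question, a common pattern is to collect one dict per selected row and build bestdf once at the end, instead of assigning scalars to an empty dataframe. A sketch, assuming each pass of the (here stubbed) loop picks the last row of the current df:

rows = []
for _ in range(1):                      # stand-in for the real loop that amends last_time
    last_time = df.iloc[-1]             # the row selected on this pass
    rows.append({'DATE': str(last_time['DATETIME']).split(' ')[0],
                 'PRICE1': last_time['PRC'],
                 'PRICE2': None})
bestdf = pd.DataFrame(rows, columns=['DATE', 'PRICE1', 'PRICE2'])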
