I have written code to get data from website for a particular date say 26th feb 2021. the processed data from code is as follows.
Client Type               Client     DII       FII       Pro
Future Index Long         126331     584       82434     27321
Future Index Short        133088     34291     40107     29184
Option Index Call Long    1022372    267       198308    310605
Option Index Put Long     790647     12740     291494    292811
Option Index Call Short   964795     0         147444    419313
Option Index Put Short    919882     0         157139    310671
I want to convert this dataframe into multiple dataframes. For example, dfclient should look like this:
Date         Future Index Long   Future Index Short   Option Index Call Long   Option Index Put Long   Option Index Call Short   Option Index Put Short
26-02-2021   126331              133088               1022372                  790647                  964795                    919882
What is the fastest way to achieve this? I would need to run a loop, as I want data for the last 10 business days.
The entire code is as follows:
from numpy.core.fromnumeric import transpose
import pandas as pd
import datetime as dt
import xlwings as xw
from pandas.tseries.offsets import *
import requests as req

hols = ["2021-01-26", "2021-03-11", "2021-03-29", "2021-04-02",
        "2021-04-14", "2021-04-21", "2021-05-13", "2021-07-21",
        "2021-08-19", "2021-09-10", "2021-10-15", "2021-11-04",
        "2021-11-05", "2021-11-19"]
hols = pd.to_datetime(hols)

bdays = pd.date_range(end=dt.date.today(), periods=60, freq=BDay())
wdays = bdays.difference(hols)[-10:]
wodays = pd.DataFrame(wdays, columns=['Business_day'])
wodays['Datestring'] = wodays['Business_day'].dt.strftime("%d%m%Y")

# getting data from NSE website
url = 'https://archives.nseindia.com/content/nsccl/fao_participant_oi_' + wodays.Datestring[0] + '.csv'
headers = {'User-Agent': 'Mozilla/5.0'}
OC = req.get(url, headers=headers).content
data = pd.read_csv(url, header=1, usecols=[0, 1, 2, 5, 6, 7, 8], index_col=0)
data = data.head(4).transpose()
To loop over all the dates and save a dataframe for each date, you need to create a dictionary of dataframes, in your case a dictionary of dfclient frames.
Here is the chunk of code you need:
from io import BytesIO

dfclient = {}
for i, date in enumerate(wodays.Datestring):
    # getting data from NSE website (reuse the downloaded bytes so the
    # request is made only once, with the User-Agent header set)
    url = 'https://archives.nseindia.com/content/nsccl/fao_participant_oi_' + date + '.csv'
    headers = {'User-Agent': 'Mozilla/5.0'}
    OC = req.get(url, headers=headers).content
    data = pd.read_csv(BytesIO(OC), header=1, usecols=[0, 1, 2, 5, 6, 7, 8], index_col=0)
    data = data.head(4).iloc[0]                  # keep only the 'Client' row
    data = pd.DataFrame(data).transpose().reset_index()
    data.at[0, 'index'] = dt.date(wdays[i].year, wdays[i].month, wdays[i].day)
    data.rename(columns={'index': 'Date'}, inplace=True)
    dfclient[date] = data
You can access each dataframe in dfclient by passing the datestring from wodays as the key,
i.e. dfclient['01032021'] or dfclient['02032021'] will give you the required dataframe.
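If you later want all ten business days in a single table laid out like the desired dfclient example above, a minimal sketch (assuming the dfclient dictionary built in the loop) is to concatenate the one-row frames:

# a minimal sketch: stack the one-row per-date frames into a single
# dataframe with one row per business day (assumes dfclient from the loop above)
dfclient_all = pd.concat(dfclient.values(), ignore_index=True)
print(dfclient_all)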
I have written code to retrieve JSON data from a URL. It works fine: I give the start and end date and it loops through the date range and appends everything to a dataframe.
The columns are populated with the JSON sensor data and their corresponding values, so the column names are like sensor_1. When I request the data from the URL, it sometimes happens that there are new sensors while the old ones are switched off and no longer deliver data, and often the number of columns changes. In that case my code just adds new columns.
What I want, instead of new columns, is a new header in the ongoing dataframe.
What I currently get with my code:
datetime;sensor_1;sensor_2;sensor_3;new_sensor_8;new_sensor_9;sensor_10;sensor_11;
2023-01-01;23.2;43.5;45.2;NaN;NaN;NaN;NaN;NaN;
2023-01-02;13.2;33.5;55.2;NaN;NaN;NaN;NaN;NaN;
2023-01-03;26.2;23.5;76.2;NaN;NaN;NaN;NaN;NaN;
2023-01-04;NaN;NaN;NaN;75;12;75;93;123;
2023-01-05;NaN;NaN;NaN;23;31;24;15;136;
2023-01-06;NaN;NaN;NaN;79;12;96;65;72;
What I want:
datetime;sensor_1;sensor_2;sensor_3;
2023-01-01;23.2;43.5;45.2;
2023-01-02;13.2;33.5;55.2;
2023-01-03;26.2;23.5;76.2;
datetime;new_sensor_8;new_sensor_9;sensor_10;sensor_11;
2023-01-04;75;12;75;93;123;
2023-01-05;23;31;24;15;136;
2023-01-06;79;12;96;65;72;
My loop to retrieve the data:
start_date = datetime.datetime(2023, 1, 1, 0, 0)
end_date = datetime.datetime(2023, 1, 6, 0, 0)

sensor_data = pd.DataFrame()
while start_date < end_date:
    q = 'url'
    r = requests.get(q)
    j = json.loads(r.text)
    sub_data = pd.DataFrame()
    if 'result' in j:
        timestamps = pd.to_datetime(np.array(j['result']['data'])[:, 0])
        sensors = np.array(j['result']['sensors'])
        data = np.array(j['result']['data'])[:, 1:]
        df_new = pd.DataFrame(data, index=timestamps, columns=sensors)
        sub_data = pd.concat([sub_data, df_new])
    sensor_data = pd.concat([sensor_data, sub_data])
    start_date += timedelta(days=1)
If two DataFrames will do for you, then you can simply split using the column names:
df1 = df[['datetime', 'sensor_1', 'sensor_2', 'sensor_3']]
df2 = df[['datetime', 'new_sensor_8', 'new_sensor_9', 'sensor_10', 'sensor_11']]
Note the double brackets [[ ]] used,
and use .dropna() to drop the NaN rows.
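If you literally want one ongoing file with a fresh header where the columns change (as in your desired output), a minimal sketch, assuming the df1/df2 split above and a hypothetical output name sensor_data.csv:

# write both column groups into one semicolon-separated file,
# each block preceded by its own header line
# (df1/df2 are the splits from above; "sensor_data.csv" is just an example name)
with open("sensor_data.csv", "w") as f:
    df1.dropna().to_csv(f, sep=";", index=False)
    df2.dropna().to_csv(f, sep=";", index=False)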
In a CSV file I have a column with 150k ID values, like below. I am trying to iterate through them and call an API with each value. The API has a request limit of 5000/min.
OBJEKT_ID
id1
id2
id3
...
I then want to put the requested data (CLASS) into a new CSV file together with the corresponding ID in another column, like below.
OBJEKT_ID,CLASS
id1,X
id2,Y
id3,Z
...,...
However, I am only able to create one row of data (like below) in the new CSV file before I get an error message.
OBJEKT_ID,CLASS
id1,X
The error message is "index 1 is out of bounds for axis 0 with size 1". Why is this the case?
Here's the code:
object_df = pandas.read_csv("CSV_FILE.csv")
for index, row in object_df.iterrows():
    response = requests.get(
        f"url/{row[index]}",
        headers=headers)
    data = response.json()
    result = data["features"][0]["properties"]["agande"][0]["agare"]["analyser"]
    print(result)
    df = pandas.DataFrame()
    df['OBJEKT_ID'] = [row[index]]
    df['CLASS'] = [result]
    df.to_csv("collected_data.csv", index=False)
    time.sleep(0.0125)
Corrections for the following errors:
row is a one-element pandas Series (the CSV has a single column), so row[index] goes out of bounds as soon as index reaches 1
You're creating a new dataframe on each iteration, i.e. df = pandas.DataFrame()
You're overwriting the CSV file on each iteration, i.e. df.to_csv("collected_data.csv", index=False)
Code

object_df = pandas.read_csv("CSV_FILE.csv")

# Will hold the output data
output = {'OBJEKT_ID': [],
          'CLASS': []}

for index, row in object_df.iterrows():
    response = requests.get(
        f"url/{row['OBJEKT_ID']}",   # index by column name, not by the row counter
        headers=headers)
    data = response.json()
    result = data["features"][0]["properties"]["agande"][0]["agare"]["analyser"]
    print(result)
    output['OBJEKT_ID'].append(row['OBJEKT_ID'])  # ID column
    output['CLASS'].append(result)                # class column
    time.sleep(0.0125)  # rate limiting (note: another option is a rate-limiting module
                        # such as https://pypi.org/project/ratelimit/)

# Create dataframe
df = pandas.DataFrame(output)
# Write to csv
df.to_csv("collected_data.csv", index=False)
Alternative Method
Use a rate limiter and the apply function:

import requests
import pandas
from ratelimit import limits, sleep_and_retry

@sleep_and_retry                 # sleep until the window resets instead of raising
@limits(calls=4900, period=60)   # limit to 4900 calls per minute (backing off from the 5000 max)
def call_api(row):  # function to process requests
    response = requests.get(
        f"url/{row}",
        headers=headers)  # note: headers not shown in posted code
    if response.status_code != 200:
        raise Exception('API response: {}'.format(response.status_code))
    data = response.json()
    result = data["features"][0]["properties"]["agande"][0]["agare"]["analyser"]
    return result

# Create dataframe from CSV file
object_df = pandas.read_csv("CSV_FILE.csv")
# Add class column (calling the API on each ID)
object_df['CLASS'] = object_df['OBJEKT_ID'].apply(call_api)
# Write to csv
object_df.to_csv("collected_data.csv", index=False)
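Note on the design choice: on its own, the limits decorator raises a RateLimitException once the call budget for the period is used up; stacking sleep_and_retry on top (as in the sketch above) makes the call block until the window resets instead of failing. The plain time.sleep in the first version is a simpler, less precise alternative.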
I am trying to output a pandas dataframe and I am only getting one column (PADD 5) instead of all five (PADD 1 through PADD 5). In addition, I cannot get the index to format as YYYY-MM-DD. I would appreciate it if anyone knew how to fix these two things. Thanks much!
# API Key from EIA
api_key = 'xxxxxxxxxxx'
# api_key = os.getenv("EIA_API_KEY")
# PADD Names to Label Columns
# Change to whatever column labels you want to use.
PADD_NAMES = ['PADD 1','PADD 2','PADD 3','PADD 4','PADD 5']
# Enter all your Series IDs here separated by commas
PADD_KEY = ['PET.MCRRIP12.M',
            'PET.MCRRIP22.M',
            'PET.MCRRIP32.M',
            'PET.MCRRIP42.M',
            'PET.MCRRIP52.M']
# Initialize list - this is the final list that you will store all the data from the json pull. Then you will use this list to concat into a pandas dataframe.
final_data = []
# Choose start and end dates
startDate = '2009-01-01'
endDate = '2023-01-01'
for i in range(len(PADD_KEY)):
    url = 'https://api.eia.gov/series/?api_key=' + api_key + '&series_id=' + PADD_KEY[i]
    r = requests.get(url)
    json_data = r.json()
    if r.status_code == 200:
        print('Success!')
    else:
        print('Error')
    print(json_data)
    df = pd.DataFrame(json_data.get('series')[0].get('data'),
                      columns=['Date', PADD_NAMES[i]])
    df.set_index('Date', drop=True, inplace=True)
    final_data.append(df)
# Combine all the data into one dataframe
crude = pd.concat(final_data, axis=1)
# Create date as datetype datatype
crude['Year'] = crude.index.astype(str).str[:4]
crude['Month'] = crude.index.astype(str).str[4:]
crude['Day'] = 1
crude['Date'] = pd.to_datetime(crude[['Year','Month','Day']])
crude.set_index('Date',drop=True,inplace=True)
crude.sort_index(inplace=True)
crude = crude[startDate:endDate]
crude = crude.iloc[:,:5]
df.head()
PADD 5
Date
202201 1996
202112 2071
202111 2125
202110 2128
202109 2232
I am using the link below to read the table on the page:
http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664
The last part of the link (allbin) is an ID. This ID changes, and by using different IDs you can access different tables and records. The link itself stays the same; only the ID at the end changes each time. I have about 1000 more IDs like this. So how can I use those different IDs to access the different tables and join them together?
Output Like this,
ID Number Type FileDate
2016664 NB 14581-26 New Building 12/21/2020
4257909 NB 1481-29 New Building 3/6/2021
4138920 NB 481-29 New Building 9/4/2020
List of other IDs to use:
['4257909', '4138920', '4533715']
This was my attempt; I can read a single table with this code:
import requests
import pandas as pd
url = 'http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664'
html = requests.get(url).content
df_list = pd.read_html(html,header=0)
df = df_list[3]
df
To get all pages for a list of IDs you can use the following example:
import requests
import pandas as pd
from io import StringIO

url = "http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin={}&allcount={}"

def get_info(ID, page=1):
    out = []
    while True:
        try:
            print("ID: {} Page: {}".format(ID, page))
            t = requests.get(url.format(ID, page), timeout=1).text
            df = pd.read_html(StringIO(t))[3].loc[1:, :]
            if len(df) == 0:
                break
            df.columns = ["NUMBER", "NUMBER", "TYPE", "FILE DATE"]
            df["ID"] = ID
            out.append(df)
            page += 25
        except requests.exceptions.ReadTimeout:
            print("Timeout...")
            continue
    return out

list_of_ids = [2016664, 4257909, 4138920, 4533715]

dfs = []
for ID in list_of_ids:
    dfs.extend(get_info(ID))

df = pd.concat(dfs)
print(df)
df.to_csv("data.csv", index=None)
Prints:
NUMBER NUMBER TYPE FILE DATE ID
1 ALT 1469-1890 NaN ALTERATION 00/00/0000 2016664
2 ALT 1313-1874 NaN ALTERATION 00/00/0000 2016664
3 BN 332-1938 NaN BUILDING NOTICE 00/00/0000 2016664
4 BN 636-1916 NaN BUILDING NOTICE 00/00/0000 2016664
5 CO NB 1295-1923 (PDF) CERTIFICATE OF OCCUPANCY 00/00/0000 2016664
...
And saves data.csv.
The code below will extract all the tables in a web page
import numpy as np
import pandas as pd

url = 'http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664'
df_list = pd.read_html(url)  # returns a list of dataframes from the web page
print(len(df_list))          # print the number of dataframes

i = 0
while i < len(df_list):      # loop through the list to print all tables
    df = df_list[i]
    print(df)
    i = i + 1
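To cover all the IDs from the question, a hedged sketch (assuming, as in your single-table attempt, that the records of interest sit in table index 3) is to loop over the ID list, read that table for each ID, and concatenate:

import pandas as pd

# IDs from the question; extend with the rest of your ~1000 IDs
ids = ['2016664', '4257909', '4138920', '4533715']
base = 'http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin={}'

frames = []
for allbin in ids:
    tables = pd.read_html(base.format(allbin), header=0)  # all tables on the page
    table = tables[3]        # same table position as in the single-ID attempt
    table['ID'] = allbin     # remember which ID the rows came from
    frames.append(table)

combined = pd.concat(frames, ignore_index=True)
combined.to_csv("all_ids.csv", index=False)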
markowitz = pd.read_excel('C:/Users/jordan/Desktop/book2.xlsx')
markowitz = markowitz.set_index('Dates')
markowitz
There are some NaN values in the data; some of them are weekends and some are holidays. I have to identify the holidays and set them to the previous value.
Is there a simple way I can do this? I used:
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
dr = pd.date_range(start='2013-01-01', end='2018-06-12')
df = pd.DataFrame()
df['Date'] = dr
cal = calendar()
holidays = cal.holidays(start=dr.min(), end=dr.max())
df['Holiday'] = df['Date'].isin(holidays)
print (df)
df = df[df['Holiday'] == True]
df
but there are still a lot of dates I have to copy and paste (can I just display the second column, 'Date'?) and then set them to the previous trading day's value. Is there a simpler way to do this? Thanks a lot in advance.
There may be a simpler way, if I know what you are trying to do. The forward-fill methods on dataframes let you fill NaNs from the previous row. So if you don't want to fill weekend days but do want to fill all other NaNs (i.e. holidays), you can just exclude Saturdays and Sundays as follows:
mask = ~df['Date'].dt.day_name().isin(['Saturday', 'Sunday'])
df.loc[mask] = df.loc[mask].ffill()
You can use this on the whole dataframe or on particular columns.
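For example, applied to the markowitz frame from the question, a minimal sketch (assuming its index is a DatetimeIndex, as set with set_index('Dates') above):

# forward-fill only the non-weekend rows, so weekend NaNs stay untouched
# (assumes markowitz has a DatetimeIndex)
weekday_rows = ~markowitz.index.day_name().isin(['Saturday', 'Sunday'])
markowitz.loc[weekday_rows] = markowitz.loc[weekday_rows].ffill()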
I think your best bet is to get an API key from quandl.com. It's free and it gives you access to all kinds of historical time series data. There used to be access to Yahoo Finance and Google Finance, but I think both were deprecated well over a year ago.
Here is a small sample of code that can definitely help you.
import quandl
quandl.ApiConfig.api_key = 'your_api_key_goes_here'
# get the table for daily stock prices and,
# filter the table for selected tickers, columns within a time range
# set paginate to True because Quandl limits tables API to 10,000 rows per call
data = quandl.get_table('WIKI/PRICES', ticker=['AAPL', 'MSFT', 'WMT'],
                        qopts={'columns': ['ticker', 'date', 'adj_close']},
                        date={'gte': '2015-12-31', 'lte': '2016-12-31'},
                        paginate=True)
print(data)
Check the link below for info about how to get the data you need.
https://blog.quandl.com/api-for-stock-data
Also, please see this for more details about using Python for quantitative finance.
https://financetrain.com/best-python-librariespackages-finance-financial-data-scientists/
Finally, and I apologize if this is a little off topic, but I think it may be helpful at some level...consider something like this...
import requests
import pandas
from bs4 import BeautifulSoup

base_url = 'http://finviz.com/screener.ashx?v=152&s=ta_topgainers&o=price&c=0,1,2,3,4,5,6,7,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs={'id': 'screener-content'})

light_rows = main_div.find_all('tr', class_="table-light-row-cp")
dark_rows = main_div.find_all('tr', class_="table-dark-row-cp")

data = []
for rows_set in (light_rows, dark_rows):
    for row in rows_set:
        row_data = []
        for cell in row.find_all('td'):
            val = cell.a.get_text()
            row_data.append(val)
        data.append(row_data)

# sort rows to maintain original order
data.sort(key=lambda x: int(x[0]))

pandas.DataFrame(data).to_csv("AAA.csv", header=False)
It's not time series data, but rather fundamental data. I haven't spent a lot of time on that site, but maybe you can poke around and find something there that suits your needs. Just a thought.