I don't understand why I can't do even the simplest data manipulation with this data I've scraped. I've tried all sorts of methods to manipulate the data, but they all produce the same sort of error. Is my data even in a data frame yet? I can't tell.
import pandas as pd
from urllib.request import Request, urlopen
req = Request('https://smallcaps.com.au/director-transactions/',
              headers={'User-Agent': 'Mozilla/5.0'})
trades = urlopen(req).read()
df = pd.read_html(trades)
print(df)        # <-- This line prints the df and works fine
df.drop([0, 1])  # <-- This one raises the error below
print(df)
Error:
Traceback (most recent call last):
File "C:\Users\User\PycharmProjects\Scraper\DirectorTrades.py", line 10, in <module>
df.drop([0, 1])
AttributeError: 'list' object has no attribute 'drop'
The main issue is, as mentioned, that pandas.read_html() returns a list of dataframes, and you have to specify by index which one you want.
Is my data even in a data frame yet?
df = pd.read_html(trades): no, it is not, because this gives you a list of dataframes.
df = pd.read_html(trades)[0]: yes, this gives you the first dataframe from the list of frames.
Example
import pandas as pd
from urllib.request import Request, urlopen
req = Request('https://smallcaps.com.au/director-transactions/',
              headers={'User-Agent': 'Mozilla/5.0'})
trades = urlopen(req).read()
df = pd.read_html(trades)[0]
df.drop([0, 1])
df
Output
        Date Code                               Company     Director     Value
0  27/4/2022  ESR                    Estrella Resources   L. Pereira   ↑$1,075
1  27/4/2022  LNY                     Laneway Resources   S. Bizzell  ↑126,750
2  26/4/2022  FGX  Future Generation Investment Company    G. Wilson  ↑$13,363
3  26/4/2022  CDM                       Cadence Capital   J. Webster  ↑$25,110
4  26/4/2022  TEK                  Thorney Technologies  A. Waislitz  ↑$35,384
5  26/4/2022  FGX  Future Generation Investment Company   K. Thorley   ↑$7,980
...
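Note that drop returns a new dataframe rather than modifying df in place, which is why rows 0 and 1 still appear in the output above. Assign the result if you want to keep the change:
df = df.drop([0, 1])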
read_html returns a list of dataframes.
Try:
dfs = pd.read_html(trades)
dfs = [df.drop([0,1]) for df in dfs]
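If you want a single frame afterwards, a minimal sketch (assuming the tables share the same columns) is to concatenate the list:
df = pd.concat(dfs, ignore_index=True)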
Related
I'm trying to convert the strings in plans_data into numbers:
import pandas as pd
from bs4 import BeautifulSoup
import requests
plans_data = pd.DataFrame(columns=['Country', 'Currency', 'Mobile', 'Basic', 'Standard', 'Premium'])
for index, row in countries_list.iterrows():
    country = row['ID']
    url = f'https://help.netflix.com/en/node/24926/{country}'
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.find("table", class_="c-table")
    try:
        plan_country = pd.read_html(results.prettify())[0]  # creates a list(!) of dataframe objects
        plan_country = plan_country.rename(columns={'Unnamed: 0': 'Currency'})
        plan_country = pd.DataFrame(plan_country.iloc[0, :]).transpose()
        plans_data = pd.concat([plans_data, plan_country], ignore_index=True)
    except AttributeError:
        country_name = row['Name']
        print(f'No data found for {country_name}.')
    plans_data.loc[index, 'Country'] = row['Name']
plans_data
First, I attempted the conversion using the float function:
# 1. Here we import pandas
import pandas as pd
# 2. Here we import numpy
import numpy as np
ans_2_1_ = float(plans_data['Basic', 'Standard', 'Premium'])
However, I always get the NameError:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_15368/3072127414.py in <module>
3 # 2. Here we import numpy
4 import numpy as np
----> 5 ans_2_1_ = float(plans_data['Basic', 'Standard', 'Premium'])
NameError: name 'plans_data' is not defined
How can I solve this problem?
If my code is not appropriate for the task of converting strings into numbers, can you advise me on how to do the conversion?
The error indicates that the second piece of code does not know what plans_data is, so first make sure plans_data is defined where you use it, i.e. in the same file or the same Jupyter notebook.
The second problem is that plans_data['Basic', 'Standard', 'Premium'] is not valid syntax.
Thirdly, and probably your real question: how to convert the values in those columns to floats.
The elements in the columns 'Basic', 'Standard' and 'Premium' are strings in currency format, e.g. '£ 5.99'. You can convert them to floats as follows (you need to do it for each column):
ans_2_1_ = plans_data['Basic'].str[1:].astype(float)
... # same for Standard and Premium
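To handle all three columns in one go, here is a minimal sketch (assuming every value is a currency symbol followed by a number, as in '£ 5.99'; the regex strips everything except digits and the decimal point):
for col in ['Basic', 'Standard', 'Premium']:
    # assumption: values look like '£ 5.99'
    plans_data[col] = plans_data[col].str.replace(r'[^\d.]', '', regex=True).astype(float)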
I am using the html link below to read the table in the page:
http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664
The last part of the link (allbin) is an ID. This ID changes, and by using different IDs you can access different tables and records. The link itself stays the same; only the ID at the end changes each time. I have about 1000 more different IDs like this. So how can I use those different IDs to access different tables and join them together?
Output like this:
ID       Number       Type          FileDate
2016664  NB 14581-26  New Building  12/21/2020
4257909  NB 1481-29   New Building  3/6/2021
4138920  NB 481-29    New Building  9/4/2020
List of other IDs to use:
['4257909', '4138920', '4533715']
This was my attempt; I can read a single table with this code.
import requests
import pandas as pd
url = 'http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664'
html = requests.get(url).content
df_list = pd.read_html(html,header=0)
df = df_list[3]
df
To get all pages for the list of IDs, you can use the next example:
import requests
import pandas as pd
from io import StringIO
url = "http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin={}&allcount={}"
def get_info(ID, page=1):
    out = []
    while True:
        try:
            print("ID: {} Page: {}".format(ID, page))
            t = requests.get(url.format(ID, page), timeout=1).text
            df = pd.read_html(StringIO(t))[3].loc[1:, :]
            if len(df) == 0:
                break
            df.columns = ["NUMBER", "NUMBER", "TYPE", "FILE DATE"]
            df["ID"] = ID
            out.append(df)
            page += 25
        except requests.exceptions.ReadTimeout:
            print("Timeout...")
            continue
    return out
list_of_ids = [2016664, 4257909, 4138920, 4533715]
dfs = []
for ID in list_of_ids:
dfs.extend(get_info(ID))
df = pd.concat(dfs)
print(df)
df.to_csv("data.csv", index=None)
Prints:
NUMBER NUMBER TYPE FILE DATE ID
1 ALT 1469-1890 NaN ALTERATION 00/00/0000 2016664
2 ALT 1313-1874 NaN ALTERATION 00/00/0000 2016664
3 BN 332-1938 NaN BUILDING NOTICE 00/00/0000 2016664
4 BN 636-1916 NaN BUILDING NOTICE 00/00/0000 2016664
5 CO NB 1295-1923 (PDF) CERTIFICATE OF OCCUPANCY 00/00/0000 2016664
...
And saves data.csv.
The code below will extract all the tables in a web page:
import pandas as pd

url = 'http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664'
df_list = pd.read_html(url)  # returns a list of dataframes from the web page
print(len(df_list))          # print the number of dataframes
for df in df_list:           # loop through the list to print all tables
    print(df)
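If you already know which table you need, you can index the list directly, as in your own attempt:
df = pd.read_html(url)[3]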
I have written code to get data from a website for a particular date, say 26th Feb 2021. The processed data from the code is as follows:
Client Type              Client   DII    FII     Pro
Future Index Long        126331   584    82434   27321
Future Index Short       133088   34291  40107   29184
Option Index Call Long   1022372  267    198308  310605
Option Index Put Long    790647   12740  291494  292811
Option Index Call Short  964795   0      147444  419313
Option Index Put Short   919882   0      157139  310671
I want to convert the dataframe into multiple dataframes; e.g. dfclient should look like this:
Date        Future Index Long  Future Index Short  Option Index Call Long  Option Index Put Long  Option Index Call Short  Option Index Put Short
26-02-2021  126331             133088              1022372                 790647                 964795                   919882
What is the fastest way to achieve this? I would need to run a loop, since I want data for the last 10 business days.
The entire code is as follows:
from numpy.core.fromnumeric import transpose
import pandas as pd
import datetime as dt
import xlwings as xw
from pandas.tseries.offsets import *
import requests as req
hols = ["2021-01-26", "2021-03-11", "2021-03-29", "2021-04-02",
"2021-04-14", "2021-04-21", "2021-05-13", "2021-07-21",
"2021-08-19", "2021-09-10", "2021-10-15", "2021-11-04",
"2021-11-05", "2021-11-19"]
hols = pd.to_datetime(hols)
bdays = pd.date_range(end = dt.date.today(),periods=60,freq = BDay())
wdays = bdays.difference(hols)[-10:]
wodays = pd.DataFrame(wdays,columns = ['Business_day'])
wodays['Datestring'] = wodays['Business_day'].dt.strftime("%d%m%Y")
# getting data from NSE website
url = 'https://archives.nseindia.com/content/nsccl/fao_participant_oi_'+wodays.Datestring[0]+'.csv'
headers = {'User-Agent': 'Mozilla/5.0'}
OC = req.get(url,headers=headers).content
data = pd.read_csv(url, header = 1, usecols = [0,1,2,5,6,7,8], index_col = 0 )
data = data.head(4).transpose()
To loop over all the dates and save a dataframe for each date, you need to create a dictionary of dataframes; in your case, a dictionary of dfclient frames.
Here is the chunk of code you need:
dfclient = {}
for i, date in enumerate(wodays.Datestring):
    # getting data from NSE website
    url = 'https://archives.nseindia.com/content/nsccl/fao_participant_oi_' + date + '.csv'
    headers = {'User-Agent': 'Mozilla/5.0'}
    OC = req.get(url, headers=headers).content
    data = pd.read_csv(url, header=1, usecols=[0, 1, 2, 5, 6, 7, 8], index_col=0)
    data = data.head(4).iloc[0]
    data = pd.DataFrame(data).transpose().reset_index()
    data.at[0, 'index'] = dt.date(wdays[i].year, wdays[i].month, wdays[i].day)
    data.rename(columns={'index': 'Date'}, inplace=True)
    dfclient[date] = data
You can access each dataframe in dfclient by using the Datestring of wodays as the key, i.e. dfclient['01032021'] or dfclient['02032021'] will give you the required dataframe.
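If you later want a single combined table rather than a dictionary, a minimal sketch (all_days is just an illustrative name; this assumes the per-date frames share the same columns) is:
all_days = pd.concat(dfclient.values(), ignore_index=True)
print(all_days)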
markowitz = pd.read_excel('C:/Users/jordan/Desktop/book2.xlsx')
markowitz = markowitz.set_index('Dates')
markowitz
There are some NaN values in the data; some of them are weekends and some of them are holidays. I have to identify the holidays and set them to the previous value.
Is there a simple way I can do this? I used
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
dr = pd.date_range(start='2013-01-01', end='2018-06-12')
df = pd.DataFrame()
df['Date'] = dr
cal = calendar()
holidays = cal.holidays(start=dr.min(), end=dr.max())
df['Holiday'] = df['Date'].isin(holidays)
print (df)
df = df[df['Holiday'] == True]
df
but there are still a lot of dates I have to copy and paste (can I just display the second row, "Date"?) and then set them to the previous trading day's value. Is there a simpler way to do this? Thanks a lot in advance.
There may be a simpler way, if I understand what you are trying to do. The fillna method on dataframes lets you forward fill. So if you don't want to fill weekend days but want to fill all other NaNs (i.e. holidays), you can just exclude Saturdays and Sundays as follows:
mask = ~df['Date'].dt.day_name().isin(['Saturday', 'Sunday'])
df.loc[mask] = df.loc[mask].ffill()
You can use this on the whole dataframe or on particular columns.
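A self-contained illustration on toy data (the Date and Price values here are made up):
import pandas as pd

df = pd.DataFrame({
    'Date': pd.date_range('2018-01-01', periods=7),        # Mon 1 Jan .. Sun 7 Jan
    'Price': [10.0, None, 11.0, 12.0, 13.0, None, None],   # NaN on one weekday (pretend holiday) and the weekend
})

# forward-fill only the non-weekend rows; the weekend NaNs stay untouched
mask = ~df['Date'].dt.day_name().isin(['Saturday', 'Sunday'])
df.loc[mask] = df.loc[mask].ffill()
print(df)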
I think your best bet is to get an API key from quandl.com. It's free and it gives you access to all kinds of historical time series data. There used to be access to Yahoo Finance and Google Finance, but I think both were deprecated well over a year ago.
Here is a small sample of code that can definitely help you.
import quandl
quandl.ApiConfig.api_key = 'your_api_key_goes_here'
# get the table for daily stock prices and,
# filter the table for selected tickers, columns within a time range
# set paginate to True because Quandl limits tables API to 10,000 rows per call
data = quandl.get_table('WIKI/PRICES',
                        ticker=['AAPL', 'MSFT', 'WMT'],
                        qopts={'columns': ['ticker', 'date', 'adj_close']},
                        date={'gte': '2015-12-31', 'lte': '2016-12-31'},
                        paginate=True)
print(data)
Check the link below for info about how to get the data you need.
https://blog.quandl.com/api-for-stock-data
Also, please see this for more details about using Python for quantitative finance.
https://financetrain.com/best-python-librariespackages-finance-financial-data-scientists/
Finally, and I apologize if this is a little off topic, but I think it may be helpful at some level...consider something like this...
import requests
from bs4 import BeautifulSoup
base_url = 'http://finviz.com/screener.ashx?v=152&s=ta_topgainers&o=price&c=0,1,2,3,4,5,6,7,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs = {'id':'screener-content'})
light_rows = main_div.find_all('tr', class_="table-light-row-cp")
dark_rows = main_div.find_all('tr', class_="table-dark-row-cp")
data = []
for rows_set in (light_rows, dark_rows):
    for row in rows_set:
        row_data = []
        for cell in row.find_all('td'):
            val = cell.a.get_text()
            row_data.append(val)
        data.append(row_data)

# sort rows to maintain original order
data.sort(key=lambda x: int(x[0]))

import pandas
pandas.DataFrame(data).to_csv("AAA.csv", header=False)
It's not time series data, but rather fundamental data. I haven't spent a lot of time on that site, but maybe you can poke around and find something there that suits your needs. Just a thought.
I'm looking to pull the historical data for ~200 securities in a given index. I import the list of securities from a csv file then iterate over them to pull their respective data from the quandl api. That dataframe for each security has 12 columns, so I create a new column with the name of the security and the Adjusted Close value, so I can later identify the series.
I'm receiving an error when I try to join all the new columns into an empty dataframe. I receive an attribute error:
'''
Print output data
'''
grab_constituent_data()
AttributeError: 'Series' object has no attribute 'join'
Below is the code I have used to arrive here thus far.
'''
Import the modules necessary for analysis
'''
import quandl
import pandas as pd
import numpy as np
'''
Set file pathes and API keys
'''
ticker_path = ''
auth_key = ''
'''
Pull a list of tickers in the IGM ETF
'''
def ticker_list():
    df = pd.read_csv('{}IGM Tickers.csv'.format(ticker_path))
    # print(df['Ticker'])
    return df['Ticker']
'''
Pull the historical prices for the securities within Ticker List
'''
def grab_constituent_data():
    tickers = ticker_list()
    main_df = pd.DataFrame()
    for abbv in tickers:
        query = 'EOD/{}'.format(str(abbv))
        df = quandl.get(query, authtoken=auth_key)
        print('Completed the query for {}'.format(query))
        df['{} Adj_Close'.format(str(abbv))] = df['Adj_Close'].copy()
        df = df['{} Adj_Close'.format(str(abbv))]
        print('Completed the column adjustment for {}'.format(str(abbv)))
        if main_df.empty:
            main_df = df
        else:
            main_df = main_df.join(df)
    print(main_df.head())
It seems that in your line
df = df['{} Adj_Close'.format(str(abbv))]
you're getting a Series and not a DataFrame. If you want to convert your Series to a DataFrame, you can use the to_frame() method:
df = df['{} Adj_Close'.format(str(abbv))].to_frame()
I didn't check whether your code could be simplified, but this should fix your issue.
To change a Series into a pandas DataFrame, you can use the following:
df = pd.DataFrame(df)
After running the above code, the Series will become a DataFrame, and you can then proceed with the join tasks you mentioned earlier.
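As a small illustration on toy data (the series names here are hypothetical):
import pandas as pd

s = pd.Series([1.0, 2.0], name='AAPL Adj_Close')
df = s.to_frame()  # equivalent: pd.DataFrame(s)
other = pd.Series([3.0, 4.0], name='MSFT Adj_Close').to_frame()
print(df.join(other))  # join works now that df is a DataFrame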