Create a Pandas DataFrame from web-scraping results of a stock price - python

I'm trying to write a script which creates a Pandas DataFrame (df) and adds a stock price to the df at every time interval x. The data comes from web scraping.
This is my code, but I have no idea how to add new data to the df every time x (e.g. 1 min) instead of replacing the old data.
import bs4 as bs
import urllib.request
import time as t
import re
import pandas as pd

i = 1
while i == 1:
    # !! Scraping
    link = 'https://www.onvista.de/aktien/DELIVERY-HERO-SE-Aktie-DE000A2E4K43'
    parser = urllib.request.urlopen(link).read()
    soup = bs.BeautifulSoup(parser, 'lxml')
    stock_data = soup.find('ul', {'class': 'KURSDATEN'})
    stock_price_eur_temp = stock_data.find('li')
    stock_price_eur = stock_price_eur_temp.get_text()
    final_stock_price = re.sub('[EUR]', '', stock_price_eur)
    print(final_stock_price)
    t.sleep(60)

    # !! Building a dataframe
    localtime = t.asctime(t.localtime(t.time()))
    stock_data_b = {
        'Price': [final_stock_price],
        'Time': [localtime],
    }
    df = pd.DataFrame(stock_data_b, columns=['Price', 'Time'])
I hope you can help me with an idea for this problem.

Because you create df inside the loop, you're re-writing that variable name each time, writing over the data from the previous iteration. You want to initialize a dataframe before the loop, and then add to it each time.
Before your loop, add the line
df2 = pd.DataFrame()
which just creates an empty dataframe. After the end of the code you posted, add
df2 = df2.append(df, ignore_index = True)
which will tack each new df on to the end of df2.
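Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same idea is usually written by collecting the scraped rows in a plain list and rebuilding the frame from it. A minimal sketch of the whole loop in that style, assuming the page still has the KURSDATEN list used in the question:

import bs4 as bs
import urllib.request
import time as t
import re
import pandas as pd

link = 'https://www.onvista.de/aktien/DELIVERY-HERO-SE-Aktie-DE000A2E4K43'
rows = []  # one dict per scrape, collected across iterations
while True:
    html = urllib.request.urlopen(link).read()
    soup = bs.BeautifulSoup(html, 'lxml')
    stock_data = soup.find('ul', {'class': 'KURSDATEN'})  # assumes this class still exists on the page
    price = re.sub('[EUR]', '', stock_data.find('li').get_text())
    rows.append({'Price': price, 'Time': t.asctime()})
    df2 = pd.DataFrame(rows, columns=['Price', 'Time'])  # rebuilt each pass, keeps every row
    print(df2.tail(1))
    t.sleep(60)

If you only need the assembled frame once in a while, it is cheaper to keep appending to the list and build the DataFrame only when you actually use it.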

Related

Web scraping golf data from ESPN. I am receiving 3 outputs of the same table and only want 1. How can I limit this?

I am new to Python and am stuck. I can't figure out how to output only one of the tables given. The output includes the desired table, but three versions of it: the first two are awfully formatted, and the last one is the table I want.
I have tried running a for loop and counting so that only the third table is printed.
import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header=0)
for df in dfs:
    print(df[0:])
Just use the index to print the table you want.
import pandas as pd
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header = 0)
print(dfs[2])
OR
print(dfs[-1])
Or, if you want to use a loop, try this:
import pandas as pd
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header = 0)
for df in range(len(dfs)):
    if df == 2:
        print(dfs[df])
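If you'd rather not depend on the leaderboard being the third table on the page, read_html can also filter tables by their content via its match argument. A small sketch; 'PLAYER' is an assumed column header on the ESPN leaderboard, so adjust it if the page uses different text:

import pandas as pd

url = 'https://www.espn.com/golf/leaderboard'
# match keeps only tables whose text matches the given string/regex,
# so the leaderboard no longer has to sit at a fixed index
dfs = pd.read_html(url, header=0, match='PLAYER')
print(dfs[0])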

Web scraping from a website returning multiple pages of results

I want the results table from all the web pages returned for my selection on a website that paginates its results.
I tried the code below:
import pandas as pd

dfs = []
while i < 27:
    url = " "
    dframe = pd.read_html(url.str(i), header=1)
    dfs.append(dframe[0].dropna(thresh=3))
    i = i + 1
I expect dframe to hold the records of all 30 pages of results, but I'm unable to run this; it never stops running even after hours.
import pandas as pd
import numpy as np

df2 = pd.DataFrame()
for i in np.arange(26):
    url = "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=2;page=" + str(i) + ";spanmin1=01+Jan+2007;spanval1=span;template=results;type=bowling"
    df = pd.read_html(url)[2]
    df2 = pd.concat([df2, df])
df2.drop(columns='Unnamed: 14', inplace=True)
This worked for me. When I browsed the website I only had 26 pages, though. I also investigated a single page, and the table you're looking at is the [2] df in the list that read_html returns. Unnamed: 14 is the column with the arrow on the very right.
I've added and changed some stuff from your original code to make it work.
import pandas as pd

dfs = []
i = 0
while i < 26:
    url = (
        "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=2;page="
        + str(i)
        + ";spanmin1=01+Jan+2007;spanval1=span;template=results;type=bowling"
    )
    dframe = pd.read_html(url, attrs={"class": "engineTable"})
    dfs.append(dframe[2].drop(columns="Unnamed: 14"))
    i = i + 1

result = pd.concat(dfs)
print(result)

Why is my for loop overwriting instead of appending CSV?

I am trying to scrape the IB website. I have created the URLs to iterate over and I am able to extract the required information, but it seems the dataframe keeps being overwritten instead of appended to.
import pandas as pd
from pandas import DataFrame as df
from bs4 import BeautifulSoup
import csv
import requests

base_url = "https://www.interactivebrokers.com/en/index.phpf=2222&exch=mexi&showcategories=STK&p=&cc=&limit=100"
n = 1
url_list = []
while n <= 2:
    url = (base_url + "&page=%d" % n)
    url_list.append(url)
    n = n + 1

def parse_websites(url_list):
    for url in url_list:
        html_string = requests.get(url)
        soup = BeautifulSoup(html_string.text, 'lxml')  # Parse the HTML as a string
        table = soup.find('div', {'class': 'table-responsive no-margin'})  # Grab the first table
        df = pd.DataFrame(columns=range(0, 4), index=[0])  # I know the size
        for row_marker, row in enumerate(table.find_all('tr')):
            column_marker = 0
            columns = row.find_all('td')
            try:
                df.loc[row_marker] = [column.get_text() for column in columns]
            except ValueError:
                # It's a safe way when [column.get_text() for column in columns] is an empty list.
                continue
        print(df)
        df.to_csv('path_to_file\\test1.csv')

parse_websites(url_list)
Can you please take a look at my code and advise what I am doing wrong?
One solution, if you want to append the data frames to the file, is to write in append mode:
df.to_csv('path_to_file\\test1.csv', mode='a', header=False)
otherwise you should create the data frame outside as mentioned in the comments.
If you define a data structure from within a loop, each iteration of the loop
will redefine the data structure, meaning that the work is being rewritten.
The dataframe should be defined outside of the loop if you do not want it to be overwritten.
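A sketch of that second suggestion applied to the code in the question: collect one frame per page in a list inside the loop and write the CSV once at the end, so nothing gets overwritten. It assumes the same 'table-responsive no-margin' div still wraps the listing.

import pandas as pd
import requests
from bs4 import BeautifulSoup

def parse_websites(url_list):
    frames = []  # one DataFrame per page
    for url in url_list:
        soup = BeautifulSoup(requests.get(url).text, 'lxml')
        table = soup.find('div', {'class': 'table-responsive no-margin'})
        rows = []
        for tr in table.find_all('tr'):
            cells = [td.get_text(strip=True) for td in tr.find_all('td')]
            if cells:  # header rows have no <td>, skip them
                rows.append(cells)
        frames.append(pd.DataFrame(rows))
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv('test1.csv', index=False)  # written once, after the loop
    return combined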

python pd.read_html gives error

I've been using pandas and requests to pull some tables to get NFL statistics. It's been going pretty well; I've been able to pull tables from other sites, but I got stuck trying to get the NFL combine table from this one particular site.
It gives me an error after df_list = pd.read_html(html).
The error I get is:
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U1') dtype('<U1') dtype('<U1')
Here's the code I've been using at other sites that worked really well.
import requests
import pandas as pd

df = pd.DataFrame()
url = 'http://nflcombineresults.com/nflcombinedata_expanded.php?year=1987&pos=&college='
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
I've read and seen a little bit about BeautifulSoup, but the simplicity of the pd.read_html() is just so nice and compact. So I don't know if there's a quick fix that I am not aware of, or if I need to indeed dive into BeautifulSoup to get these tables from 1987 - 2017.
This isn't shorter, but may be more robust:
import requests
import pandas as pd
from bs4 import BeautifulSoup
A convenience function:
def souptable(table):
    for row in table.find_all('tr'):
        yield [col.text for col in row.find_all('td')]
Return a DataFrame with data loaded for a given year:
def getyear(year):
    url = 'http://nflcombineresults.com/nflcombinedata_expanded.php?year=%d&pos=&college=' % year
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    data = list(souptable(soup.table))
    df = pd.DataFrame(data[1:], columns=data[0])
    df = df[pd.notnull(df['Name'])]
    return df.apply(pd.to_numeric, errors="ignore")
This function slices out the heading row when the DataFrame is created, uses the first row for column names, and filters out any rows with an empty Name value.
Finally, concatenate up as many years as you need into a single DataFrame:
dfs = pd.concat([getyear(year) for year in range(1987, 1990)])
OK, after doing some more research, it looks like the issue is that the last row is one merged cell, and that's probably where the error comes from. So I did go into BeautifulSoup to pull the data. Here is my solution:
import requests
import pandas as pd
from bs4 import BeautifulSoup
I wanted to pull data for each year from 1987 to 2017:
seasons = list(range(1987, 2018))
df = pd.DataFrame()
temp_df = pd.DataFrame()
The loop runs through each year, appending each row of cells to the dataframe. Knowing the last row is a blank, I eliminate it by redefining the dataframe as df[:-1] before the loop appends the next year's data.
for i in seasons:
    df = df[:-1]
    url = 'http://nflcombineresults.com/nflcombinedata_expanded.php?year=%s&pos=&college=' % (i)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    for tr in soup.table.find_all('tr'):
        row = [td.text for td in tr.find_all('td')]
        temp_df = row
        df = df.append(temp_df, ignore_index=True)
Finally, since there is no new year to append, I need to eliminate the last row once more. Then I reshape the dataframe into 16 columns, take the column names from the first row, and drop the repeated header rows within the dataframe.
df = df[:-1]
df = (pd.DataFrame(df.values.reshape(-1, 16)))
df.columns = df.iloc[0]
df = df[df.Name != 'Name']
I'm still learning python, so any input, advice, any respectful constructive criticism is always welcome. Maybe there is a better, more appropriate solution?

For loop after for loop produces wrong output Python

I am trying to use for loops to iterate through some Yahoo Finance data and calculate the return of the securities. The problem is that I want to do this for different periods, and I have a document containing the different start and end dates. This is the code I have been using:
import pandas as pd
import numpy as np
from pandas.io.data import DataReader
from datetime import datetime

# This function is just used to download the data I want and save
# it to a csv file.
def downloader():
    start = datetime(2005,1,1)
    end = datetime(2010,1,1)
    tickers = ['VIS', 'VFH', 'VPU']
    stock_data = DataReader(tickers, "yahoo", start, end)
    price = stock_data['Adj Close']
    price.to_csv('data.csv')

downloader()

# Reads the data into a Pandas DataFrame.
price = pd.read_csv('data.csv', index_col='Date', parse_dates=True)

# Creates a Pandas DataFrame that holds multiple dates. The format is the same
# as the format of the dates when I load the full csv file of dates.
inp = [{'start': datetime(2005,1,3), 'end': datetime(2005,12,30)},
       {'start': datetime(2005,2,1), 'end': datetime(2006,1,31)},
       {'start': datetime(2005,3,1), 'end': datetime(2006,2,28)}]
df = pd.DataFrame(inp)

# Everything above this is not part of the original script; it is just used
# to replicate the problem I am having.
results = pd.DataFrame()
for index, row in df.iterrows():
    start = row['start']
    end = row['end']
    price_initial = price.ix[start:end]
    for column1 in price_initial:
        price1 = price_initial[column1]
        startprice = price1.ix[end]
        endprice = price1.ix[start]
        momentum_value = (startprice / endprice) - 1
        results = results.append({'Ticker': column1, 'Momentum': momentum_value}, ignore_index=True)
    results = results.sort(columns="Momentum", ascending=False).head(1)
    print(results.to_csv(sep='\t', index=False))
I am not sure what I am doing wrong here, but I suspect it is something about the way I iterate or the way I collect the output from the script.
The output I get is this:
Momentum Ticker
0.16022263953253435 VPU
Momentum Ticker
0.16022263953253435 VPU
Momentum Ticker
0.16022263953253435 VPU
That is clearly not correct. Hope someone can help me get this right.
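A guess at the cause, based only on the repeated output shown: results is sorted and cut down to its single best row at the end of every pass through the outer loop, so the overall winner from an earlier period can keep outranking the freshly appended rows and gets printed again. A sketch of one way to restructure it, keeping the best ticker per period in a separate collector (written for current pandas, where .ix, .sort and DataFrame.append no longer exist, and assuming start and end fall on dates that are actually in the price index, as the original lookups already assume):

results_rows = []
for _, row in df.iterrows():
    start, end = row['start'], row['end']
    window = price.loc[start:end]
    period_rows = []
    for ticker in window.columns:
        momentum = window[ticker].loc[end] / window[ticker].loc[start] - 1
        period_rows.append({'Ticker': ticker, 'Momentum': momentum})
    period = pd.DataFrame(period_rows)
    # best ticker for THIS period only, not for everything seen so far
    top = period.sort_values('Momentum', ascending=False).head(1)
    results_rows.append(top)

results = pd.concat(results_rows, ignore_index=True)
print(results.to_csv(sep='\t', index=False))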
