Parsing CSV to chart stock ticker data - python

I created a program that takes stock tickers, crawls the web to find a CSV of each ticker's historical prices, and plots them using matplotlib. Almost everything is working fine, but I've run into a problem parsing the CSV to separate out each price.
The error I get is:
prices = [float(row[4]) for row in csv_rows]
IndexError: list index out of range
I get what the problem is here, I'm just not really sure how I should fix it.
(The issue is in the parseCSV() method)
# Loop to chart multiple stocks
def chartStocks(*tickers):
    for ticker in tickers:
        chartStock(ticker)

# Single chart stock method
def chartStock(ticker):
    url = "http://finance.yahoo.com/q/hp?s=" + str(ticker) + "+Historical+Prices"
    sourceCode = requests.get(url)
    plainText = sourceCode.text
    soup = BeautifulSoup(plainText, "html.parser")
    csv = findCSV(soup)
    parseCSV(csv)

# Find the CSV URL
def findCSV(soupPage):
    CSV_URL_PREFIX = 'http://real-chart.finance.yahoo.com/table.csv?s='
    links = soupPage.findAll('a')
    for link in links:
        href = link.get('href', '')
        if href.startswith(CSV_URL_PREFIX):
            return href

# Parse CSV for daily prices
def parseCSV(csv_text):
    csv_rows = csv.reader(csv_text.split('\n'))
    prices = [float(row[4]) for row in csv_rows]
    days = list(range(len(prices)))
    point = collections.namedtuple('Point', ['x', 'y'])
    for price in prices:
        i = 0
        p = point(days[i], prices[i])
        points = []
        points.append(p)
        print(points)
    plotStock(points)

# Plot the data
def plotStock(points):
    plt.plot(points)
    plt.show()

The problem is that parseCSV() expects a string containing CSV data, but it is actually being passed the URL of the CSV file.
This is because findCSV(soup) returns the href value of the CSV link found on the page, and that value is passed straight to parseCSV(). The CSV reader then sees a single undelimited row, so there is only one column, not the five or more that row[4] expects.
At no point is the CSV data actually being downloaded.
You could write the first few lines of parseCSV() like this:
def parseCSV(csv_url):
    r = requests.get(csv_url)
    csv_rows = csv.reader(r.text.splitlines())

You need to check if your row has at least five elements (i.e. index location 4).
prices = [float(row[4]) for row in csv_rows if len(row) > 4]
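Putting both pieces together, here is a minimal sketch of a parseCSV() that downloads the CSV before parsing it and skips short rows; the check against the literal 'Close' header value is an assumption about Yahoo's column layout, and plotStock() is the helper from the question:

import collections
import csv
import requests

def parseCSV(csv_url):
    # Download the CSV data from the URL that findCSV() returned
    r = requests.get(csv_url)
    csv_rows = csv.reader(r.text.splitlines())
    # Keep only rows with at least five columns, and skip the header row
    prices = [float(row[4]) for row in csv_rows
              if len(row) > 4 and row[4] != 'Close']
    Point = collections.namedtuple('Point', ['x', 'y'])
    points = [Point(day, price) for day, price in enumerate(prices)]
    plotStock(points)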


How to convert wikipedia tables into pandas dataframes? [duplicate]

This question already has answers here:
scraping data from wikipedia table (3 answers)
Closed 2 years ago.
I want to apply some statistics to data tables obtained directly from specific internet pages.
This tutorial https://towardsdatascience.com/web-scraping-html-tables-with-python-c9baba21059 helped me create a data frame from a table at the webpage http://pokemondb.net/pokedex/all. However, I want to do the same for geographic data, such as population and GDP of several countries.
I found some tables on Wikipedia, but it doesn't work quite well and I don't understand why. Here's my code, which follows the above-mentioned tutorial:
import requests
import lxml.html as lh
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_African_countries_by_population'
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')
#Check the length of the first 12 rows
print('Length of first 12 rows')
print([len(T) for T in tr_elements[:12]])
#Create empty list
col=[]
i=0
#For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print('%d:"%s"'%(i,name))
    col.append((name,[]))
#Since our first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    #If row is not of size 10, the //tr data is not from our table
    if len(T)!=10:
        break
    #i is the index of our column
    i=0
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content()
        #Check if row is empty
        if i>0:
            #Convert any numerical value to integers
            try:
                data=int(data)
            except:
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1
print('Data gathering: done!')
print('Column length:')
print([len(C) for (title,C) in col])
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
print(df.head())
The output is the following:
Length of first 12 rows
[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
1:"Ranks
"
2:"Countries(or dependent territory)
"
3:"Officialfigure(whereavailable)
"
4:"Date oflast figure
"
5:"Source
"
Data gathering: done!
Column length:
[0, 0, 0, 0, 0]
Empty DataFrame
Columns: [Ranks
, Countries(or dependent territory)
, Officialfigure(whereavailable)
, Date oflast figure
, Source
]
Index: []
The length of the columns shouldn't be zero. The format is not the same as the one in the tutorial. Any idea how to make it right? Or maybe another data source that doesn't return this strange output format?
The length of your rows, as shown by your print statement on line 16 (which corresponds to the first line of your output), is not 10. It is 5. So your code breaks out of the loop on the very first iteration instead of populating col.
Changing this statement:
if len(T)!=10:
    break
to
if len(T)!=5:
    break
should fix the problem.
Instead of using requests, use pandas to read the URL data:
df = pd.read_html(url)
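Note that pd.read_html() returns a list of DataFrames, one per table found on the page, so you still need to pick out the one you want. A minimal sketch (the index 0 is an assumption about which table appears first on that page):

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_African_countries_by_population'
# read_html needs lxml or html5lib installed and returns one DataFrame per <table>
tables = pd.read_html(url)
df = tables[0]  # assumption: the population table is the first table on the page
print(df.head())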
On line 52 you are trying to edit a tuple, which is not possible in Python.
To correct this, use a list instead.
Change line 25 to col.append([name,[]])
In addition, the break statement exits the for loop, which is why no data ends up in your columns.
When doing these sorts of things you also need to look at the HTML. The table isn't formatted as nicely as one would hope: for example, it has a bunch of newlines and also includes the images of the countries' flags. You can look at the equivalent North America table to see how the format differs every time.
It seems like you want an easy way to do this, so I would look into BeautifulSoup4. I have added a way that I would do this with bs4; you'll have to do some editing to make it look better.
import requests
import bs4 as bs
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_African_countries_by_population'
column_names = []
data = []
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the html in the soup object
soup = bs.BeautifulSoup(page.content, 'html.parser')
#Gets the table html
table = soup.find_all('table')[0]
#gets the table header
thead = table.find_all('th')
#Puts the header into the column names list. We will use this for the dict keys later
for th in thead:
    column_names.append(th.get_text())
#gets all the rows of the table
rows = table.find_all('tr')
#I do not take the first row as it is the header
for row in rows[1:]:
    #Creates a list with each index being a different entry in the row.
    values = [r for r in row]
    #Gets each value that we care about
    rank = values[1].get_text()
    country = values[3].get_text()
    pop = values[5].get_text()
    date = values[7].get_text()
    source = values[9].get_text()
    temp_list = [rank,country,pop,date,source]
    #Creates a dictionary with keys being the column names and the values being temp_list. Appends this to list data
    data.append(dict(zip(column_names, temp_list)))
print(column_names)
df = pd.DataFrame(data)

Adding extra column for MIN value on each row in excel sheet from python

I have a Python program which gets data and writes it to an Excel file. However, I would like to add at the end of each row a formula that gets the MIN value of all numbers from column C to column CM, so that the MIN of those is written in column CO. Here I have an image of what I would like it to be like:
The MIN function is in the CO column, click here for image (this was done manually using the MIN function).
My current code is as follows
import requests
import csv
from bs4 import BeautifulSoup

players = []

def get_name(td):
    return td.find('b').text

def get_id(td):
    links = td.find_all('a')
    link = links[0]['href']
    link = link.split('/')
    return link[5]

def get_ovr(ovr):
    return ovr.text

def read_page(pageId):
    link = 'https://www.futwiz.com/en/fifa21/players?page=' + str(pageId) + '&release=icons'
    r = requests.get(link)
    soup = BeautifulSoup(r.text, 'html.parser')
    ovrList = soup.find_all('div', {'class', 'otherversion21-txt'})
    td = soup.find_all('td', {'class', 'player'})
    for i in range(0, len(ovrList)):
        player = {
            'name': get_name(td[i + 1]),
            'id': get_id(td[i + 1]),
            'ovr': get_ovr(ovrList[i])
        }
        players.append(player)

def get_prices(playerId):
    link = 'https://www.futwiz.com/en/app/price_history_player21?p=' + str(playerId) + '&c=xb'
    r = requests.get(link)
    pricePageText = r.text
    pricePageText = pricePageText.replace('[', '')
    pricePageText = pricePageText.replace(']', '')
    pricePageText = pricePageText.split(',')
    prices = []
    for i in range(len(pricePageText)):
        if i % 2 == 1:
            prices.append(pricePageText[i])
    return prices

for pageId in range(0, 13):
    print('Reading page: ' + str(pageId+1))
    read_page(pageId)

f = open('newData.csv', 'w', newline='')
with f:
    writer = csv.writer(f)
    for player in players:
        print('Read player: ' + player['name'])
        prices = get_prices(player['id'])
        profile = [player['name'], player['ovr']]
        #To download all numbers, uncomment following line and comment (1) line
        profile.extend(prices)
        #To download only last number
        # profile.append(prices[len(prices) - 1]) #(1) Line
        writer.writerow(profile)
Any help would be appreciated, thanks in advance. Hope it's not too confusing.
I don't know if it's a requirement to get the minimum value within the Python script, or whether it's also acceptable to just have it in your Excel workbook. If the Python script doesn't have to calculate the min, the following solution can be used:
Create a new .xlsx file, click on the Data tab in the ribbon and select Get Data > From File > From CSV. See the picture below:
Select the .csv file that your Python script has created and click on Transform Data. The Power Query editor will pop up; this is where you can add a column that holds the minimum value. Do this by selecting the columns that participate in the min (column C to CM); select multiple columns by holding Ctrl and clicking on the column names.
Once all columns are selected, click on the 'Add Column' tab, click on Statistics and choose 'Minimum'. Also change the name of the column if desired. Now that the column is created you can click on the 'Home' tab and choose Close & Load.
The result: run your Python script, open the Excel file, click on Refresh in the Data tab, and you will have your dataset with a minimum column.
Let me know if you get stuck somewhere.
Jimmy
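If the minimum does have to come out of the Python script itself, here is a minimal self-contained sketch of the idea. The example rows are made up, converting the prices to float is an assumption about their format, and the commented-out alternative writes an Excel MIN formula instead of a plain value:

import csv

# Hypothetical rows in the same shape as the question's profile lists:
# name, overall rating, then the price strings returned by get_prices()
rows = [
    ['Player A', '93', '100000', '95000', '99000'],
    ['Player B', '91', '120000', '118000'],
]

with open('newData.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row_number, row in enumerate(rows, start=1):
        prices = [float(p) for p in row[2:] if p.strip()]
        # Option 1: compute the minimum in Python and write it as a value
        row.append(min(prices) if prices else '')
        # Option 2 (instead of option 1): let Excel compute it when the file is opened
        # row.append('=MIN(C{0}:CM{0})'.format(row_number))
        writer.writerow(row)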

How to Loop and Save Data from Each Iteration

I am trying to learn how to scrape data from a webpage in Python and am running into trouble with how to structure my nested loops. I received some assistance with how I was scraping in this question (How to pull links from within an 'a' tag). I am trying to have that code essentially iterate through different weeks (and eventually years) of webpages. What I have currently is below, but it is not iterating through the two weeks I would like it to and saving the data off.
import requests, re, json
import pandas as pd
from bs4 import BeautifulSoup

weeks=['1','2']
data = pd.DataFrame(columns=['Teams','Link'])
scripts_head = soup.find('head').find_all('script')
all_links = {}
for i in weeks:
    r = requests.get(r'https://www.espn.com/college-football/scoreboard/_/year/2018/seasontype/2/week/'+i)
    soup = BeautifulSoup(r.text, 'html.parser')
    for script in scripts_head:
        if 'window.espn.scoreboardData' in script.text:
            json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
            for event in json_scoreboard['events']:
                name = event['name']
                for link in event['links']:
                    if link['text'] == 'Gamecast':
                        gamecast = link['href']
                        all_links[name] = gamecast
            #Save data to dataframe
            data2=pd.DataFrame(list(all_links.items()),columns=['Teams','Link'])
        #Append new data to existing data
        data=data.append(data2,ignore_index = True)
#Save dataframe with all links to csv for future use
data.to_csv(r'game_id_data.csv')
Edit: To add some clarification, it is creating duplicates of the data from one week and repeatedly appending them to the end. I also edited the code to include the proper libraries; it should be able to be copied and pasted and run in Python.
The problem is in your loop logic:
if 'window.espn.scoreboardData' in script.text:
    ...
    data2=pd.DataFrame(list(all_links.items()),columns=['Teams','Link'])
#Append new data to existing data
data=data.append(data2,ignore_index = True)
Your indentation on the last line is wrong. As given, you append data2 regardless of whether you have new scoreboard data. When you don't, you skip the if body and simply append the previous data2 value.
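For illustration, a sketch of the snippet with the corrected indentation (same variable names as in the question), so the append only happens when scoreboard data was actually found:

for script in scripts_head:
    if 'window.espn.scoreboardData' in script.text:
        json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
        for event in json_scoreboard['events']:
            name = event['name']
            for link in event['links']:
                if link['text'] == 'Gamecast':
                    all_links[name] = link['href']
        # Build and append the frame only inside the if, when new data exists
        data2 = pd.DataFrame(list(all_links.items()), columns=['Teams', 'Link'])
        data = data.append(data2, ignore_index=True)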
So the workaround I came up with is below. I am still getting duplicate game IDs in my final dataset, but at least I am looping through the entire desired set and getting all of them; then at the end I dedupe.
import requests, re, json
from bs4 import BeautifulSoup
import csv
import pandas as pd

years=['2015','2016','2017','2018']
weeks=['1','2','3','4','5','6','7','8','9','10','11','12','13','14']
data = pd.DataFrame(columns=['Teams','Link'])
all_links = {}
for year in years:
    for i in weeks:
        r = requests.get(r'https://www.espn.com/college-football/scoreboard/_/year/'+ year + '/seasontype/2/week/'+i)
        soup = BeautifulSoup(r.text, 'html.parser')
        scripts_head = soup.find('head').find_all('script')
        for script in scripts_head:
            if 'window.espn.scoreboardData' in script.text:
                json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
                for event in json_scoreboard['events']:
                    name = event['name']
                    for link in event['links']:
                        if link['text'] == 'Gamecast':
                            gamecast = link['href']
                            all_links[name] = gamecast
                #Save data to dataframe
                data2=pd.DataFrame(list(all_links.items()),columns=['Teams','Link'])
                #Append new data to existing data
                data=data.append(data2,ignore_index = True)
#Save dataframe with all links to csv for future use
data_test=data.drop_duplicates(keep='first')
data_test.to_csv(r'all_years_deduped.csv')
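For what it's worth, the duplicates come from rebuilding data2 out of the ever-growing all_links dict on every week and appending it each time. A simpler sketch (same page-structure assumptions as the code above) accumulates the links in the loops and builds the DataFrame once at the end, so nothing has to be deduplicated:

import requests, re, json
import pandas as pd
from bs4 import BeautifulSoup

years = ['2015', '2016', '2017', '2018']
weeks = [str(w) for w in range(1, 15)]
all_links = {}

for year in years:
    for week in weeks:
        r = requests.get('https://www.espn.com/college-football/scoreboard/_/year/'
                         + year + '/seasontype/2/week/' + week)
        soup = BeautifulSoup(r.text, 'html.parser')
        for script in soup.find('head').find_all('script'):
            if 'window.espn.scoreboardData' in script.text:
                json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
                for event in json_scoreboard['events']:
                    for link in event['links']:
                        if link['text'] == 'Gamecast':
                            all_links[event['name']] = link['href']

# One row per unique game name; no per-iteration appends, so no duplicates to drop
data = pd.DataFrame(list(all_links.items()), columns=['Teams', 'Link'])
data.to_csv(r'game_id_data.csv')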

Python3 Read Html Table With Pandas

Need some help here. I plan to extract all the statistical data of this site: https://lotostats.ro/toate-rezultatele-win-for-life-10-20
My issue is that I am not able to read the table. I can't do this even for the first page.
Can someone please help?
import requests
import lxml.html as lh
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
url='https://lotostats.ro/toate-rezultatele-win-for-life-10-20'
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')
#Create empty list
col=[]
i=0
#For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print('%d:"%s"'%(i,name))
    col.append((name,[]))
#Since our first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    #If row is not of size 10, the //tr data is not from our table
    # if len(T)!=10:
    #     break
    #i is the index of our column
    i=0
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content()
        #Check if row is empty
        if i>0:
            #Convert any numerical value to integers
            try:
                data=int(data)
            except:
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
df.head()
print(df)
Data is dynamically added. You can find the source, which returns JSON, in the browser's network tab:
import requests
r = requests.get('https://lotostats.ro/all-rez/win_for_life_10_20?draw=1&columns%5B0%5D%5Bdata%5D=0&columns%5B0%5D%5Bname%5D=&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=false&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=1&columns%5B1%5D%5Bname%5D=&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=false&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&start=0&length=20&search%5Bvalue%5D=&search%5Bregex%5D=false&_=1564996040879').json()
You can decode that URL and likely remove the timestamp part (worth verifying), or simply replace it with an arbitrary number:
import requests
r = requests.get('https://lotostats.ro/all-rez/win_for_life_10_20?draw=1&columns[0][data]=0&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=1&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=false&columns[1][search][value]=&columns[1][search][regex]=false&start=0&length=20&search[value]=&search[regex]=false&_=1').json()
To see the lottery lines:
print(r['data'])
The draw parameter seems to be related to the page of draws, e.g. the 2nd page:
https://lotostats.ro/all-rez/win_for_life_10_20?draw=2&columns[0][data]=0&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=1&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=false&columns[1][search][value]=&columns[1][search][regex]=false&start=20&length=20&search[value]=&search[regex]=false&_=1564996040880
You can alter the length to retrieve more results. For example, I can deliberately oversize it to get all results:
import requests
r = requests.get('https://lotostats.ro/all-rez/win_for_life_10_20?draw=1&columns[0][data]=0&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=1&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=false&columns[1][search][value]=&columns[1][search][regex]=false&start=0&length=100000&search[value]=&search[regex]=false&_=1').json()
print(len(r['data']))
Otherwise, you can set the length param to a fixed number, do an initial request, and calculate the number of pages from the total record count (r['recordsFiltered']) divided by the results per page.
import math
total_results = r['recordsFiltered']
results_per_page = 20
num_pages = math.ceil(total_results/results_per_page)
Then do a loop to get all results (remembering to alter the draw param). Obviously, the fewer requests the better. A sketch of such a loop is shown below.
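A minimal sketch of such a loop, assuming the start, length, and draw parameters behave as described above (the URL is the same one used earlier, with the trailing timestamp replaced by 1):

import math
import requests

BASE = ('https://lotostats.ro/all-rez/win_for_life_10_20?draw={draw}'
        '&columns[0][data]=0&columns[0][name]=&columns[0][searchable]=true'
        '&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false'
        '&columns[1][data]=1&columns[1][name]=&columns[1][searchable]=true'
        '&columns[1][orderable]=false&columns[1][search][value]=&columns[1][search][regex]=false'
        '&start={start}&length={length}&search[value]=&search[regex]=false&_=1')

results_per_page = 20
# Initial request to learn how many records there are in total
r = requests.get(BASE.format(draw=1, start=0, length=results_per_page)).json()
num_pages = math.ceil(r['recordsFiltered'] / results_per_page)

all_rows = list(r['data'])
for page in range(1, num_pages):
    r = requests.get(BASE.format(draw=page + 1,
                                 start=page * results_per_page,
                                 length=results_per_page)).json()
    all_rows.extend(r['data'])

print(len(all_rows))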

Why is my for loop overwriting instead of appending CSV?

I am trying to scrape the IB website. So, what I am doing: I have created the urls to iterate over, and I am able to extract the required information, but it seems the dataframe keeps being overwritten instead of appended to.
import pandas as pd
from pandas import DataFrame as df
from bs4 import BeautifulSoup
import csv
import requests

base_url = "https://www.interactivebrokers.com/en/index.phpf=2222&exch=mexi&showcategories=STK&p=&cc=&limit=100"
n = 1
url_list = []
while n <= 2:
    url = (base_url + "&page=%d" % n)
    url_list.append(url)
    n = n+1

def parse_websites(url_list):
    for url in url_list:
        html_string = requests.get(url)
        soup = BeautifulSoup(html_string.text, 'lxml') # Parse the HTML as a string
        table = soup.find('div',{'class':'table-responsive no-margin'}) #Grab the first table
        df = pd.DataFrame(columns=range(0,4), index = [0]) # I know the size
        for row_marker, row in enumerate(table.find_all('tr')):
            column_marker = 0
            columns = row.find_all('td')
            try:
                df.loc[row_marker] = [column.get_text() for column in columns]
            except ValueError:
                # It's a safe way when [column.get_text() for column in columns] is empty list.
                continue
        print(df)
        df.to_csv('path_to_file\\test1.csv')

parse_websites(url_list)
Can you please take a look at my code and advise what I am doing wrong?
One solution, if you want to append the data frames to the file, is to write in append mode:
df.to_csv('path_to_file\\test1.csv', mode='a', header=False)
Otherwise you should create the data frame outside the loop, as mentioned in the comments.
If you define a data structure from within a loop, each iteration of the loop will redefine the data structure, meaning that the work is being overwritten. The dataframe should be defined outside of the loop if you do not want it to be overwritten.
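A minimal sketch of that idea, keeping the question's URL and selector (neither of which I have verified): collect the rows from every page into one DataFrame and write the CSV once at the end.

import pandas as pd
import requests
from bs4 import BeautifulSoup

base_url = "https://www.interactivebrokers.com/en/index.phpf=2222&exch=mexi&showcategories=STK&p=&cc=&limit=100"
url_list = [base_url + "&page=%d" % n for n in (1, 2)]

def parse_websites(url_list):
    rows = []                              # accumulate rows across all pages
    for url in url_list:
        html_string = requests.get(url)
        soup = BeautifulSoup(html_string.text, 'lxml')
        table = soup.find('div', {'class': 'table-responsive no-margin'})
        for row in table.find_all('tr'):
            cells = [td.get_text() for td in row.find_all('td')]
            if cells:                      # skip header and empty rows
                rows.append(cells)
    df = pd.DataFrame(rows)                # one frame, built once after all pages
    df.to_csv('test1.csv', index=False)

parse_websites(url_list)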
