Scraping multiple webpages and writing to a CSV file - python

I'm writing a program that will take seven pieces of data from a website and write them to a CSV file, one row per company in the symbols.txt file (such as AAPL or NFLX). My problem seems to come from my confusion over the indexing that makes the script work; I am at a loss as to how it fits. I thought that this format would work...
import urllib2
from BeautifulSoup import BeautifulSoup
import csv
import re
import urllib
# import modules

symbolfile = open("symbols.txt")
symbolslist = symbolfile.read()
newsymbolslist = symbolslist.split("\n")

i = 0
f = csv.writer(open("pe_ratio.csv", "wb"))
# short cut to write
f.writerow(["Name", "PE", "Revenue % Quarterly", "ROA% YOY", "Operating Cashflow", "Debt to Equity"])
# first writerow statement

# define name_company as the following
while i < len(newsymbolslist):
    page = urllib2.urlopen("http://finance.yahoo.com/q/ks?s=" + newsymbolslist[i] + "%20Key%20Statistics").read()
    soup = BeautifulSoup(page)
    name_company = soup.findAll("div", {"class": "title"})
    for name in name_company:  # add multiple iterations?
        all_data = soup.findAll('td', "yfnc_tabledata1")
        stock_name = name.find('h2').string  # find company's name in name_company with h2 tag
        f.writerow([stock_name, all_data[2].getText(), all_data[17].getText(), all_data[13].getText(), all_data[29].getText(), all_data[26].getText()])  # write down PE data
    i += 1
I get the following error below when I try to run the code as is:
Traceback (most recent call last):
  File "company_data_v1.py", line 28, in <module>
    f.writerow([stock_name, all_data[2].getText(), all_data[17].getText(), all_data[13].getText(), all_data[29].getText(), all_data[26].getText()]) #write down PE data
IndexError: list index out of range
Thanks for your help in advance.

name_company = soup.findAll("div", {"class" : "title"})
soup = BeautifulSoup(page) #this is the first time you define soup
You define soup on the line after you attempt to call soup.findAll. Python tells you exactly what the problem is: soup hasn't been defined yet at the findAll line.
Flip the order of those lines.
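A minimal sketch of the corrected order, using the same names as the question:
soup = BeautifulSoup(page)  # define soup first
name_company = soup.findAll("div", {"class": "title"})  # then query it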

I assume that when you said "where to put the variables to make the script work" you were referring to this soup variable, the one in your error message?
If so, then I suggest declaring soup before you try to use it in soup.findAll(). As you can see, you declared soup = BeautifulSoup(page) one line after calling soup.findAll(); it should go above it.

Related

Is there such a thing as an "if x (or any variable) has any value" function in Python?

I'm trying to build a web crawler that generates a text file for multiple different websites. After it crawls a page, it is supposed to collect all the links on that page. However, I have encountered a problem while crawling Wikipedia. The Python script gives me the error:
Traceback (most recent call last):
  File "/home/banana/Desktop/Search engine/data/crawler?.py", line 22, in <module>
    urlwaitinglist.write(link.get('href'))
TypeError: write() argument must be str, not None
I looked deeper into it by having it print the discovered links, and "None" appears at the top of the output. I'm wondering if there is a function to check whether a variable has any value.
Here is the code I have written so far:
from bs4 import BeautifulSoup
import os
import requests
import random
import re

toscan = "https://en.wikipedia.org/wiki/Wikipedia:Contents"
url = toscan
source_code = requests.get(url)
plain_text = source_code.text

removal_list = ["http://", "https://", "/"]
for word in removal_list:
    toscan = toscan.replace(word, "")

soup = BeautifulSoup(plain_text, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
    urlwaitinglist = open("/home/banana/Desktop/Search engine/data/toscan", "a")
    urlwaitinglist.write('\n')
    urlwaitinglist.write(link.get('href'))
    urlwaitinglist.close()

print(soup.get_text())

directory = "/home/banana/Desktop/Search engine/data/Crawled Data/"
results = soup.get_text()
results = results.strip()

f = open("/home/banana/Desktop/Search engine/data/Crawled Data/" + toscan + ".txt", "w")
f.write(url)
f.write('\n')
f.write(results)
f.close()
Looks like not every <a> tag you are grabbing returns a value. I would suggest converting every link you grab to a string and checking that it's not None. It is also bad practice to open a file without using the with statement. I have added an example below that grabs every http/https link and writes it to a file, reusing some of your code:
from bs4 import BeautifulSoup
import os
import requests
import random
import re

file_directory = './'  # your specified directory location
filename = 'urls.txt'  # your specified filename

url = "https://en.wikipedia.org/wiki/Wikipedia:Contents"
res = requests.get(url)
html = res.text
soup = BeautifulSoup(html, 'html.parser')

links = []
for link in soup.find_all('a'):
    link = link.get('href')
    print(link)
    match = re.search('^(http|https)://', str(link))
    if match:
        links.append(str(link))

with open(file_directory + filename, 'w') as file:
    for link in links:
        file.write(link + '\n')

BeautifulSoup: save each iteration of loop's resulting HTML

I have written the following code to obtain the HTML of some pages, based on an id that I can pass into a URL. I would then like to save each page's HTML as a .txt file in a desired path. This is the code that I have written for that purpose:
import urllib3
import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_html(id):
    url = f'https://www.myurl&id={id}'
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    html = print(soup)
    return html

id = ['11111', '22222']
for id in id:
    path = f'D://MyPath//{id}.txt'
    a = open(path, 'w')
    a.write(get_html(id))
    a.close()
Although generating the HTML pages is quite simple, this loop is not working properly. I am getting the message TypeError: write() argument must be str, not None, which means the first loop is somehow failing to produce a string to be saved as a text file.
I should mention that in the original data I have around 9k ids, so you can also let me know whether, instead of several .txt files, you would recommend one big CSV to store all the results. Thanks!
The problem is that print() returns None. Use str() instead:
def get_html(id):
    url = f'https://www.myurl&id={id}'
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    #html=print(soup) <-- print() returns None
    return str(soup)  # <--- convert soup to string
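For the calling loop itself, here is a minimal sketch reusing the paths from the question; a with block guarantees the file is closed, and renaming the loop variable avoids shadowing the built-in id:
ids = ['11111', '22222']
for page_id in ids:
    path = f'D://MyPath//{page_id}.txt'  # path pattern from the question
    with open(path, 'w', encoding='utf-8') as f:
        f.write(get_html(page_id))  # get_html() as corrected above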

I tried a lot of times to grab the data from booking.com, but I couldn't

I want to scrape data from booking.com, but I got some errors and couldn't find any similar code.
I want to scrape the name of the hotel, the price, and so on.
I have tried BeautifulSoup 4 and tried to get the data into a CSV file.
import requests
from bs4 import BeautifulSoup
import pandas
# Replace search_url with a valid one by visiting and searching booking.com
search_url = 'https://www.booking.com/searchresults.....'
page = requests.get(search_url)
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(id = 'search_results_table' )
#print(week)
items = week.find_all(class_='sr-hotel__name')
print(items[0])
print(items[0].find(class_ = 'sr-hotel__name').get_text())
print(items[0].find(class_ = 'short-desc').get_text())
Here is a sample URL that can be used in place of search_url.
This is the output, followed by the error message:
<span class="sr-hotel__name " data-et-click="
">
The Fort Printers
</span>
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-44-77b38c8546bb> in <module>
11 items = week.find_all(class_='sr-hotel__name')
12 print(items[0])
---> 13 print(items[0].find(class_ = 'sr-hotel__name').get_text())
14 print(items[0].find(class_ = 'short-desc').get_text())
15
AttributeError: 'NoneType' object has no attribute 'get_text'
Instead of calling the find() method again on each result, you can call the getText() method on it directly.
import requests
from bs4 import BeautifulSoup
import pandas
# Replace search_url with a valid one by visiting and searching booking.com
search_url = 'https://www.booking.com/searchresults.....'
page = requests.get(search_url)
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(id = 'search_results_table' )
#print(week)
items = week.find_all(class_='sr-hotel__name')
# print the whole thing
print(items[0])
hotel_name = items[0].getText()
# print hotel name
print(hotel_name)
# print without newlines
print(hotel_name[1:-1])
Hope this helps. I would suggest reading more of the BeautifulSoup documentation.
First of all, buddy, using requests might be really hard, since you have to completely imitate the request your browser would send.
You'll have to use a sniffing tool (Burp, Fiddler, Wireshark) or, in some cases, look at the network tab in your browser's developer mode, which is relatively hard...
I'd suggest you use selenium, which is a web driver that makes your life easy when trying to scrape sites; read more about it here: https://medium.com/the-andela-way/introduction-to-web-scraping-using-selenium-7ec377a8cf72
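A minimal selenium sketch, assuming chromedriver is installed and on your PATH, and reusing the search_url and the sr-hotel__name class from the question:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes chromedriver is available
driver.get(search_url)  # search_url as defined in the question
soup = BeautifulSoup(driver.page_source, 'html.parser')  # the HTML after JavaScript has run
for name in soup.find_all(class_='sr-hotel__name'):
    print(name.get_text(strip=True))
driver.quit()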
And as for your error, I think you should use only .text instead of .get_text()

How to Grab Specific Text

I want to grab the price of bitcoin from this website: https://www.coindesk.com/price/bitcoin
but I am not sure how to do it; I'm pretty new to coding.
This is my code so far. I am not sure what I am doing wrong. Thanks in advance.
from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.coindesk.com/price/bitcoin')
r_content = r.content
soup = BeautifulSoup(r_content, 'lxml')
p_value = soup.find('span', {'class': "currency-price", "data-value": True})['data-value']
print(p_value)
This is the result:
Traceback (most recent call last):
  File "C:/Users/aidan/PycharmProjects/scraping/Scraper.py", line 8, in <module>
    p_value = soup.find('span', {'class': "currency-price", "data-value": True})['data-value']
TypeError: 'NoneType' object is not subscriptable
Content is dynamically sourced from an API call that returns JSON. You can request a list of currencies or a single currency. With requests, JavaScript doesn't run, so this content is never added to the DOM, and the various DOM changes that produce the HTML you see in the browser don't occur.
import requests
r = requests.get('https://production.api.coindesk.com/v1/currency/ticker?currencies=BTC').json()
print(r)
price = r['data']['currency']['BTC']['quotes']['USD']['price']
print(price)
r = requests.get('https://production.api.coindesk.com/v1/currency/ticker?currencies=ADA,BCH,BSV,BTC,BTG,DASH,DCR,DOGE,EOS,ETC,ETH,IOTA,LSK,LTC,NEO,QTUM,TRX,XEM,XLM,XMR,XRP,ZEC').json()
print(r)
The problem here is that the soup.find() call is not returning a value (that is, there is no span with the attributes you defined anywhere on the page), so when you try to get data-value there is no dictionary to look it up in.
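A minimal sketch of guarding against this, checking the result of find() before subscripting it:
span = soup.find('span', {'class': "currency-price", "data-value": True})
if span is not None:
    print(span['data-value'])
else:
    print('currency-price span not found; the value is likely rendered by JavaScript')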
Your website doesn't hold the data in the HTML, so you can't scrape it that way, but they are using an endpoint that you can call directly:
data = requests.get('https://production.api.coindesk.com/v1/currency/ticker?currencies=BTC').json()
p_value = data['data']['currency']['BTC']['quotes']['USD']['price']
print(p_value)
# output: 11375.678380772
The price changes all the time, so your output may be different.

Trouble grabbing data from a webpage located within comments

I've written a script in Python to get some data from a website. It seems I did it the right way; however, when I print the data I get the error list index out of range. The data are within comments, so in my script I tried to use BeautifulSoup's built-in comment handling. Could anybody point out where I'm going wrong?
Link to the website: website_link
The script I've tried so far:
import requests
from bs4 import BeautifulSoup, Comment

res = requests.get("replace_with_the_above_link")
soup = BeautifulSoup(res.text, 'lxml')
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    sauce = BeautifulSoup(comment, 'lxml')
    items = sauce.select("#tco_detail_data")[0]
    data = ' '.join([' '.join(item.text.split()) for item in items.select("li")])
    print(data)
This is the traceback:
Traceback (most recent call last):
  File "C:\Users\Local\Programs\Python\Python35-32\new_line_one.py", line 8, in <module>
    items = sauce.select("#tco_detail_data")[0]
IndexError: list index out of range
Please click on the link below to see which portion of the data I would like to grab: Expected_output_link
None of the comments contain HTML with a "#tco_detail_data" tag, so select returns an empty list, which raises an IndexError when you try to select the first item.
However, you can find the data in a "ul#tco_detail_data" tag in the page's own HTML.
res = requests.get(link)
soup = BeautifulSoup(res.text, 'lxml')
data = soup.select_one("#tco_detail_data")
print(data)
If you want data in a list,
data = [list(item.stripped_strings) for item in data.select("ul")]
If you prefer a string,
data = '\n'.join([item.get_text(' ', strip=True) for item in data.select("ul")])
