I am still extremely new to Python, and I am working on an assignment for my school.
I need to write code to pull all of the HTML from a website and then save the links to a CSV file.
I believe I somehow need to turn the links into a list and then write the list, but I'm unsure how to do that.
This is what I have so far:
import bs4
import requests
from bs4 import BeautifulSoup, SoupStrainer
import csv

search_link = "https://www.census.gov/programs-surveys/popest.html"
r = requests.get(search_link)
raw_html = r.text
soup = BeautifulSoup(raw_html, 'html.parser')

all_links = soup.find_all("a")

rem_dup = set()
for link in all_links:
    hrefs = str(link.get("href"))
    if hrefs.startswith('#http'):
        rem_dup.add(hrefs[1:])
    elif hrefs.endswith('.gov'):
        rem_dup.add(hrefs + '/')
    elif hrefs.startswith('/'):
        rem_dup.add('https://www.census.gov' + hrefs)
    else:
        rem_dup.add(hrefs)

filename = "Page_Links.csv"
f = open(filename, "w+")
f.write("LINKS\n")
f.write(all_links)
f.close()
The write() function expects a string as a parameter. all_links holds the ResultSet of all the hyperlink tags, so instead of -
f.write(all_links)
You should write the values stored in the rem_dup set (since those are the actual hyperlinks, represented as strings) -
for hyperlink in rem_dup:
    f.write(hyperlink + "\n")
all_links is a ResultSet of results from Beautiful Soup. rem_dup is where you are storing all of the hrefs, so I assume that is what you want to write to the file. Note that write() only accepts strings, so you cannot pass the set itself; either write "\n".join(rem_dup) or loop over the set as shown above.
Further explanation: rem_dup is actually a set. If you want it to be a list, say rem_dup = list() instead of set() and use append() in place of add(); append() is the method used with lists.
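Since the question already imports csv and names the output file Page_Links.csv, here is a minimal sketch of writing the deduplicated links with csv.writer; it assumes rem_dup has been built as in the question above:

import csv

# a minimal sketch, assuming rem_dup is the set of cleaned hrefs built above
with open("Page_Links.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["LINKS"])           # header row
    for hyperlink in sorted(rem_dup):    # sorted() just makes the output deterministic
        writer.writerow([hyperlink])     # one link per row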
I have written the following code to obtain the html of some pages, according to some id which I can input in a URL. I would like to then save each html as a .txt file in a desired path. This is the code that I have written for that purpose:
import urllib3
import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_html(id):
    url = f'https://www.myurl&id={id}'
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    html = print(soup)
    return html

id = ['11111','22222']

for id in id:
    path = f'D://MyPath//{id}.txt'
    a = open(path, 'w')
    a.write(get_html(id))
    a.close()
Although generating the html pages is quite simple, this loop is not working properly. I am getting the following message: TypeError: write() argument must be str, not None, which means that the function is somehow failing to return a string to be saved as a text file.
I should add that in the original data I have around 9k ids, so you can also let me know if you would recommend one big csv to store all the results instead of several .txt files. Thanks!
The problem is that print() returns None. Use str() instead:
def get_html(id):
    url = f'https://www.myurl&id={id}'
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    # html = print(soup)  <-- print() returns None
    return str(soup)  # <-- convert the soup back to a string
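The original loop then works as written once get_html() returns a string. If you do end up preferring one big file for the ~9k ids, here is a minimal sketch that stores (id, html) rows in a single CSV; the id list and the output path are placeholders taken from the question:

import csv

ids = ['11111', '22222']  # placeholder ids from the question

with open('D://MyPath//pages.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'html'])                    # header row
    for page_id in ids:
        writer.writerow([page_id, get_html(page_id)])  # one row per page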
I am trying to write a python program that will count the words on a web page. I use Beautiful Soup 4 to scrape the page but I have difficulties accessing nested HTML tags (for example: <p class="hello"> inside <div>).
Every time I try to find such a tag using the page.findAll() method (page is the BeautifulSoup object containing the whole page), it simply doesn't find any, although they are there. Is there a simple method or another way to do it?
If I'm guessing right, what you are trying to do is first look in a specific div tag, then search for all p tags inside it and count them, or do whatever else you want. For example:
soup = bs4.BeautifulSoup(content, 'html.parser')

# This will get the div
div_container = soup.find('div', class_='some_class')

# Then search in that div_container for all p tags with class "hello"
for ptag in div_container.find_all('p', class_='hello'):
    # prints the p tag content
    print(ptag.text)
Hope that helps
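Since the original goal was counting words, a small follow-up sketch (reusing the placeholder class names from the snippet above) could sum the words in those nested tags:

# a minimal sketch, assuming div_container from the example above
word_count = sum(len(ptag.text.split())
                 for ptag in div_container.find_all('p', class_='hello'))
print(word_count)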
Try this one:
data = []
for nested_soup in soup.find_all('xyz'):
    data = data + nested_soup.find_all('abc')
Maybe you can turn it into a lambda and make it cool, but this works. Thanks.
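For what it's worth, the same idea fits in a single list comprehension; the tag names xyz and abc are just the placeholders used above:

# a sketch: collect every 'abc' tag found inside any 'xyz' tag
data = [abc_tag
        for xyz_tag in soup.find_all('xyz')
        for abc_tag in xyz_tag.find_all('abc')]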
UPDATE: I noticed that .text does not always return the expected result. At the same time, I realized there is a built-in way to get the text; sure enough, reading the docs, we find that there is a method called get_text(). Use it as:
from bs4 import BeautifulSoup

fd = open('index.html', 'r')
website = fd.read()
fd.close()
soup = BeautifulSoup(website, 'html.parser')
contents = soup.get_text(separator=" ")
print("number of words %d" % len(contents.split(" ")))
INCORRECT, please read the update above. Supposing that you have your html file locally in index.html, you can:
from bs4 import BeautifulSoup
import re

BLACKLIST = ["html", "head", "title", "script"]  # tags to be ignored

fd = open('index.html', 'r')
website = fd.read()
soup = BeautifulSoup(website, 'html.parser')
tags = soup.find_all(True)  # find everything
print("there are %d" % len(tags))

count = 0
matcher = re.compile(r"(\s|\n|<br>)+")
for tag in tags:
    if tag.name.lower() in BLACKLIST:
        continue
    temp = matcher.split(tag.text)   # split using tokens such as \s and \n
    temp = list(filter(None, temp))  # remove empty elements in the list
    count += len(temp)
print("number of words in the document %d" % count)
fd.close()
Please note that it may not be accurate, perhaps because of errors in formatting, false positives (it detects any word, even if it is code), text that is shown dynamically using JavaScript or CSS, or other reasons.
You can find all <p> tags using regular expressions (re module).
Note that r.text is a string which contains the whole html of the site (r.content holds the raw bytes).
For example:
r = requests.get(url, headers=headers)
p_tags = re.findall(r'<p\b[^>]*>.*?</p>', r.text, flags=re.DOTALL)
This should get you all the <p> tags, irrespective of whether they are nested or not. And if you want the tags inside one specific tag, you can pass that enclosing tag's HTML as a string in the second argument instead of r.text.
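For instance, a sketch of that idea (the div class name is a placeholder, and regex on HTML is fragile, so treat this as illustrative only):

# a sketch: search only inside one enclosing tag's HTML instead of the whole page
div_match = re.search(r'<div class="some_class".*?</div>', r.text, flags=re.DOTALL)
if div_match:
    inner_p_tags = re.findall(r'<p\b[^>]*>.*?</p>', div_match.group(0), flags=re.DOTALL)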
Alternatively, if you just want the text you can try this:
from readability import Document #pip install readability-lxml
import requests
r = requests.get(url,headers=headers)
doc = Document(r.content)
simplified_html = doc.summary()
This will get you a more bare-bones form of the html from the site, and you can then proceed with the parsing.
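As a follow-up sketch (assuming simplified_html from the readability example above), you could feed the simplified html back into BeautifulSoup and do the actual word count there:

from bs4 import BeautifulSoup

# a sketch, assuming simplified_html produced by doc.summary() above
soup = BeautifulSoup(simplified_html, 'html.parser')
text = soup.get_text(separator=" ")
print("number of words %d" % len(text.split()))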
I have a list of URLs I want to run through, clean using BeautifulSoup and save to a .txt file.
This is my code right now with just a couple of items in the list; there will be many more coming in from a txt file, but for now this keeps it simple.
While the loop is working, it writes the output for both URLs to the same url.txt file. I would like each item in the list to be written to its own unique .txt file.
import urllib.request
from bs4 import BeautifulSoup

x = ["https://www.sec.gov/Archives/edgar/data/1000298/0001047469-13-002555.txt",
     "https://www.sec.gov/Archives/edgar/data/1001082/0001104659-13-011967.txt"]

for url in x:
    # I want to open the URL listed in my list
    fp = urllib.request.urlopen(url)
    test = fp.read()
    soup = BeautifulSoup(test, "lxml")
    output = soup.get_text()
    # and then save the get_text() results to a unique file.
    file = open("url.txt", "w", encoding='utf-8')
    file.write(output)
    file.close()
Thank you for taking a look. Best, George
Create a different filename for each item in the list, like below:
import urllib.request
from bs4 import BeautifulSoup

x = ["https://www.sec.gov/Archives/edgar/data/1000298/0001047469-13-002555.txt",
     "https://www.sec.gov/Archives/edgar/data/1001082/0001104659-13-011967.txt"]

for index, url in enumerate(x):
    # I want to open the URL listed in my list
    fp = urllib.request.urlopen(url)
    test = fp.read()
    soup = BeautifulSoup(test, "lxml")
    output = soup.get_text()
    # and then save the get_text() results to a unique file.
    file = open("url%s.txt" % index, "w", encoding='utf-8')
    file.write(output)
    file.close()
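If you would rather name each file after the document itself instead of its position in the list, a small sketch (assuming the URLs keep the pattern shown above) could derive the name from the last path segment:

# a sketch, assuming the SEC URLs from the question
for url in x:
    fp = urllib.request.urlopen(url)
    soup = BeautifulSoup(fp.read(), "lxml")
    filename = url.rsplit('/', 1)[-1]  # e.g. 0001047469-13-002555.txt
    with open(filename, "w", encoding="utf-8") as f:
        f.write(soup.get_text())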
I'm doing a project in Python right now where we are supposed to parse the HTML from a Project Gutenberg file to isolate the book's contents. I've managed to get rid of everything except the Table of Contents. I want to remove the Table of Contents by making the soup.prettify() a string object, splitting it on the last phrase of the Table of Contents, and pulling the last element out of the list, which will be everything except for the table of contents. This is what I have so far.
import requests
from bs4 import BeautifulSoup

def get_text():  # writes the html into a new text file called new_christie.txt
    with open('new_christie.txt', 'w', encoding='utf-8') as book:
        url = 'http://www.gutenberg.org/files/1155/1155-h/1155-h.htm'
        r = requests.get(url)
        data = r.text
        soup = BeautifulSoup(data, 'html.parser')
        str = soup.prettify()
        text = str.split('XXVIII. AND AFTER')  # last phrase in Table of Contents
        text = soup.find_all('p')  # finds all of the text between paragraphs
        content = text[-1:]
        for p in content:
            line = p.get_text()
            book.write(line)
I think my problem lies in how I pull the last element out of the list using content = text[-1:], but I can't figure out another way to do it.
I offer this solution, but note that I use lxml instead of Beautiful Soup because I know it better. I don't remember if it is installed by default, but you can install it with pip install lxml in your terminal.
import requests
from lxml import html

def get_text():
    with open('new_christie.txt', 'w') as book:
        url = 'http://www.gutenberg.org/files/1155/1155-h/1155-h.htm'
        r = requests.get(url)
        data = r.text
        soup = html.fromstring(data.encode('utf8'))
        text = ' '.join(soup.xpath('//p/text()'))
        text = text.partition('AND AFTER')[2]
        book.write(text)
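If you would rather stay with BeautifulSoup as in your original code, here is a minimal sketch of the same idea, reusing the question's URL and the 'AND AFTER' marker: extract the paragraph text first, then keep only what comes after the Table of Contents.

import requests
from bs4 import BeautifulSoup

# a sketch of the same approach with BeautifulSoup
def get_text():
    with open('new_christie.txt', 'w', encoding='utf-8') as book:
        url = 'http://www.gutenberg.org/files/1155/1155-h/1155-h.htm'
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        text = ' '.join(p.get_text() for p in soup.find_all('p'))
        text = text.partition('AND AFTER')[2]  # drop everything up to the end of the ToC
        book.write(text)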
I'm trying to open a webpage and return all the links as a dictionary that would look like this.
{"http://my.computer.com/some/file.html" : "link text"}
So the link would be after the href= and the text would be between the > and the </a>
I'm using https://www.yahoo.com/ as my test website
I keep getting this error:
'href=' in line:
TypeError: a bytes-like object is required, not 'str'
Here's my code:
def urlDict(myUrl):
    url = myUrl
    page = urllib.request.urlopen(url)
    pageText = page.readlines()
    urlList = {}
    for line in pageText:
        if '<a href=' in line:
            try:
                url = line.split('<a href="')[-1].split('">')[0]
                txt = line.split('<a href="')[-1].split('">')[-1].split('< /a>')[0]
                urlList[url] = txt
            except:
                pass
    return urlList
What am I doing wrong? I've looked around and people have mostly suggested the BeautifulSoup parser. I'd use it, but I don't think that would fly with my teacher.
The issue is that you're attempting to compare a byte string to a regular string. If you add print(line) as the first command in your for loop, you'll see that it prints a line of HTML, but with a b' at the beginning, indicating it is a bytes object rather than a decoded string. This makes things difficult. The proper way to use urllib here is the following:
def url_dict(myUrl):
    with urllib.request.urlopen(myUrl) as f:
        s = f.read().decode('utf-8')
This will have the s variable hold the entire text of the page. You can then use a regular expression to parse out the links and the link target. Here is an example which will pull the link targets without the HTML.
import urllib.request
import re

def url_dict():
    # url = myUrl
    with urllib.request.urlopen('http://www.yahoo.com') as f:
        s = f.read().decode('utf-8')
    r = re.compile('(?<=href=").*?(?=")')
    print(r.findall(s))

url_dict()
Using regex to get both the link text and the link target into a dictionary is outside the scope of where you are in your class, so I would absolutely not recommend submitting it for the assignment, although I would recommend learning it for later use.
You'll want to use BeautifulSoup as suggested, as it makes this entire thing extremely easy. There is an example in the docs that you can cut and paste to extract the URLs.
For what it's worth, here is a BeautifulSoup and requests approach.
Feel free to replace requests with urllib, but BeautifulSoup doesn't really have a nice replacement.
import requests
from bs4 import BeautifulSoup

def get_links(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    # href=True skips anchor tags that have no href attribute
    return {a_tag['href']: a_tag.text for a_tag in soup.find_all('a', href=True)}

for link, text in get_links('https://www.yahoo.com/').items():
    print(text.strip(), link)