Loop through list of URLs, run BeautifulSoup, write to file - python

I have a list of URLs I want to run through, clean using BeautifulSoup and save to a .txt file.
This is my code right now with just a couple of items in the list; there will be many more coming in from a txt file, but for now this keeps it simple.
While the loop works, it writes the output for both URLs to the same url.txt file. I would like each item in the list to be written to its own unique .txt file.
import urllib
from bs4 import BeautifulSoup
x = ["https://www.sec.gov/Archives/edgar/data/1000298/0001047469-13-002555.txt",
"https://www.sec.gov/Archives/edgar/data/1001082/0001104659-13-011967.txt"]
for url in x:
    # I want to open the URL listed in my list
    fp = urllib.request.urlopen(url)
    test = fp.read()
    soup = BeautifulSoup(test, "lxml")
    output = soup.get_text()
    # and then save the get_text() results to a unique file.
    file = open("url.txt", "w", encoding='utf-8')
    file.write(output)
    file.close()
Thank you for taking a look. Best, George

Create a different filename for each item in the list, like below:
import urllib.request
from bs4 import BeautifulSoup

x = ["https://www.sec.gov/Archives/edgar/data/1000298/0001047469-13-002555.txt",
     "https://www.sec.gov/Archives/edgar/data/1001082/0001104659-13-011967.txt"]

for index, url in enumerate(x):
    # I want to open the URL listed in my list
    fp = urllib.request.urlopen(url)
    test = fp.read()
    soup = BeautifulSoup(test, "lxml")
    output = soup.get_text()
    # and then save the get_text() results to a file named after the item's index
    file = open("url%s.txt" % index, "w", encoding='utf-8')
    file.write(output)
    file.close()
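If you would rather have each output file named after the document itself than after a running index, here is a small variation on the same loop (only a sketch; it assumes the last path segment of each URL is a usable, unique filename, which holds for the SEC links above):

import urllib.request
from bs4 import BeautifulSoup

for url in x:
    fp = urllib.request.urlopen(url)
    soup = BeautifulSoup(fp.read(), "lxml")
    name = url.rsplit('/', 1)[-1]  # e.g. "0001047469-13-002555.txt"; assumed unique
    with open(name, "w", encoding="utf-8") as out:
        out.write(soup.get_text())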

Related

What's wrong with this BS4 to CSV code? Not getting list of items in csv correctly

import csv
from email import header
from fileinput import filename
from tokenize import Name
import requests
from bs4 import BeautifulSoup

url = "https://www.azlyrics.com/l/linkinpark.html"
r = requests.get(url)
htmlContent = r.content
soup = BeautifulSoup(htmlContent, "html.parser")

albumList = []
table = soup.find('div', id="listAlbum")
for row in table.find_all('div', class_="album"):
    albumName = {}
    albumName['Name'] = row.b.text
    albumList.append(albumName)
    print(albumName)

noOfAlbum = len(albumList)
print(noOfAlbum)

with open('lpalbumr6.csv', 'w', encoding='utf8', newline='') as f:
    thewriter = csv.writer(f)
    header = ['Name']
    thewriter.writerow(header)
    for i in albumList:
        thewriter.writerow(albumName)
Hello,
I was trying to get the list of albums on an artist page on azlyrics.com. When I export the list to csv, the exported list is not what I expect.
print(albumName) works perfectly, but the exported csv does not contain the album names correctly.
albumList contains all the information you need, so the issue is just with the part where you write the csv at the end.
You have:
for i in albumList:
    thewriter.writerow(albumName)
but albumName is not referring to the elements of albumList - it's the temporary variable you used when creating that list. You need to refer to the loop variable i in the loop. You also need to specify that you need the value of the Name key in each dictionary:
for i in albumList:
    thewriter.writerow([i['Name']])
The value is wrapped in an extra [] because csv.writer treats a bare string as a sequence of characters rather than as a single field.
With this change the generated csv contains one album name per row under the Name header.
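As a quick standalone illustration of why the extra [] matters (this is just a demo of csv.writer's behaviour with a bare string, separate from the scraper):

import csv, io

buf = io.StringIO()
w = csv.writer(buf)
w.writerow("Hybrid Theory")    # a bare string: every character becomes its own field
w.writerow(["Hybrid Theory"])  # a one-element list: one field for the whole row
print(buf.getvalue())
# H,y,b,r,i,d, ,T,h,e,o,r,y
# Hybrid Theory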

How to create a loop to go through a list of URLs and scrape all data, when all URLs of the similar pages are saved in one file.txt (one URL per line)?

I want to extract some information from multiple pages which have similar page structures.
All URLs of the pages are saved in one file.txt (one URL per line).
I already created the code to scrape all the data from one link (it works).
But I don't know how to create a loop that goes through the whole list of URLs from the txt file and scrapes all the data.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
from bs4 import Comment
import re
import rispy # Writing an ris file
with open('F:\Python\Python-FilePy-Thesis-DownLoad/Thesis2.txt', 'r') as f:
    for line in f:
        url = line.strip()

html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
Just work with each url:page inside the loop!
for line in f:
    url = line.strip()
    html = requests.get(url).text  # is .content better?
    soup = BeautifulSoup(html, "html.parser")
    # work with soup here!
Creating more functions may help make your program easier to read if you find yourself packing a lot into a single block (see the sketch after the snippet below).
See Cyclomatic Complexity (which is practically a count of the control statements like if and for)
Additionally, if you want to collect all the values before doing further processing (though this is frequently better accomplished with more esoteric logic like a generator, or asyncio to collect many pages in parallel), you might consider creating some collection before the loop to store the results:
collected_results = []  # a new list

...

for line in fh:
    result = ...  # process the line via whatever logic
    collected_results.append(result)

# now collected_results has the result from each line
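Putting both suggestions together, a minimal sketch of what the function split could look like (the function names and the use of get_text() are illustrative, not part of the original code):

import requests
from bs4 import BeautifulSoup

def fetch(url):
    return requests.get(url).text

def parse(html):
    return BeautifulSoup(html, "html.parser")

def scrape_file(path):
    collected_results = []
    with open(path, 'r') as fh:
        for line in fh:
            soup = parse(fetch(line.strip()))
            collected_results.append(soup.get_text())  # or whatever you extract from soup
    return collected_results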
You are making a big mistake by writing:
with open('F:\Python\Python-FilePy-Thesis-DownLoad/Thesis2.txt', 'r') as f:
    for line in f:
        url = line.strip()

html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
because that will only store the html data of the last url obtained from the TXT file in the html variable.
After the for loop finishes, the last line of the TXT file is what remains in the url variable, which means you will only ever request the last url in the TXT file.
The code should be:
with open('F:\Python\Python-FilePy-Thesis-DownLoad/Thesis2.txt', 'r') as f:
    for line in f:
        url = line.strip()
        html = requests.get(url).text
        soup = BeautifulSoup(html, "html.parser")

BeautifulSoup: save each iteration of the loop's resulting HTML

I have written the following code to obtain the html of some pages, according to some id which I can input in a URL. I would like to then save each html as a .txt file in a desired path. This is the code that I have written for that purpose:
import urllib3
import requests
from bs4 import BeautifulSoup
import pandas as pd
def get_html(id):
    url = f'https://www.myurl&id={id}'
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    html = print(soup)
    return html

id = ['11111', '22222']
for id in id:
    path = f'D://MyPath//{id}.txt'
    a = open(path, 'w')
    a.write(get_html(id))
    a.close()
Although generating the html pages is quite simple, this loop is not working properly. I am getting the following message: TypeError: write() argument must be str, not None. This means that the first part is somehow failing to produce a string to be saved as a text file.
I should mention that in the original data I have around 9k ids, so you can also let me know whether, instead of several .txt files, you would recommend one big csv to store all the results. Thanks!
The problem is that print() returns None. Use str() instead:
def get_html(id):
    url = f'https://www.myurl&id={id}'
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    # html = print(soup)  <-- print() returns None
    return str(soup)  # <-- convert soup to string
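As for the side question about the ~9k ids: whether one big csv beats thousands of .txt files depends on how you plan to use the data, but if you go the csv route a minimal sketch could look like this (the output path and column names are only illustrative, and it reuses the corrected get_html() above):

import csv

ids = ['11111', '22222']  # in practice, the ~9k ids

with open('D://MyPath//all_pages.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'html'])
    for page_id in ids:
        writer.writerow([page_id, get_html(page_id)])  # csv quoting handles commas/newlines in the html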

Python write web scrape data to csv

I am still extremely new to Python, and I am working on an assignment for my school.
I need to write code to pull all of the html from a website then save it to a csv file.
I believe I somehow need to turn the links into a list and then write the list, but I'm unsure how to do that.
This is what I have so far:
import bs4
import requests
from bs4 import BeautifulSoup, SoupStrainer
import csv

search_link = "https://www.census.gov/programs-surveys/popest.html"
r = requests.get(search_link)
raw_html = r.text
soup = BeautifulSoup(raw_html, 'html.parser')

all_links = soup.find_all("a")
rem_dup = set()
for link in all_links:
    hrefs = str(link.get("href"))
    if hrefs.startswith('#http'):
        rem_dup.add(hrefs[1:])
    elif hrefs.endswith('.gov'):
        rem_dup.add(hrefs + '/')
    elif hrefs.startswith('/'):
        rem_dup.add('https://www.census.gov' + hrefs)
    else:
        rem_dup.add(hrefs)

filename = "Page_Links.csv"
f = open(filename, "w+")
f.write("LINKS\n")
f.write(all_links)
f.close()
The write() function expects a string as its parameter. all_links holds the ResultSet of all the hyperlink tags, not a string. So, instead of:
f.write(all_links)
You should be writing the values in the rem_dup set (since those are the actual hyperlinks represented as strings):
for hyperlink in rem_dup:
    f.write(hyperlink + "\n")
all_links is a ResultSet from Beautiful Soup, not a string. rem_dup is where you are storing all of the hrefs, so I assume that's what you want to be writing to the file - loop over rem_dup and write each entry, as shown above.
Further explanation: rem_dup is actually a set. If you want it to be a list, then say rem_dup = list() instead of set() and use append instead of add; append is the equivalent method for lists.
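Since the output file is a .csv, you could also let the csv module do the writing instead of raw f.write() calls (a sketch, assuming rem_dup has been filled as in the question):

import csv

with open("Page_Links.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["LINKS"])
    for hyperlink in sorted(rem_dup):
        writer.writerow([hyperlink])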

Scrape websites and export only the visible text to a text document Python 3 (Beautiful Soup)

Problem: I am trying to scrape multiple websites using beautifulsoup for only the visible text and then export all of the data to a single text file.
This file will be used as a corpus for finding collocations using NLTK. I'm working with something like this so far but any help would be much appreciated!
import requests
from bs4 import BeautifulSoup
from collections import Counter
urls = ["http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart","http://en.wikipedia.org/wiki/Golf"]
for url in urls:
    website = requests.get(url)
    soup = BeautifulSoup(website.content)
    text = [''.join(s.findAll(text=True)) for s in soup.findAll('p')]

with open('thisisanew.txt', 'w') as file:
    for item in text:
        print(file, item)
Unfortunately, there are two issues with this: when I try to export the text to a .txt file, it comes out completely blank.
Any ideas?
print(file, item) should be print(item, file=file).
But don't name your file handle file, as this shadows the file builtin; something like this is better:
with open('thisisanew.txt', 'w') as outfile:
    for item in text:
        print(item, file=outfile)
To solve the next problem, overwriting the data from the first URL, you can move the file writing code into your loop, and open the file once before entering the loop:
import requests
from bs4 import BeautifulSoup
from collections import Counter
urls = ["http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart","http://en.wikipedia.org/wiki/Golf"]
with open('thisisanew.txt', 'w', encoding='utf-8') as outfile:
    for url in urls:
        website = requests.get(url)
        soup = BeautifulSoup(website.content)
        text = [''.join(s.findAll(text=True)) for s in soup.findAll('p')]
        for item in text:
            print(item, file=outfile)
There is another problem: you are collecting the text only from the last url, because you reassign the text variable on every iteration.
Define text as an empty list before the loop and add new data to it inside:
text = []
for url in urls:
    website = requests.get(url)
    soup = BeautifulSoup(website.content)
    text += [''.join(s.findAll(text=True)) for s in soup.findAll('p')]
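As a side note, ''.join(s.findAll(text=True)) gathers the same strings that s.get_text() returns, so the comprehension can also be written as:

text += [s.get_text() for s in soup.findAll('p')]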
