I am trying to crawl several links, extract the text found in the <p> HTML tags, and write the output to different files. Each link should have its own output file. So far:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re
import csv
import pyperclip
import pprint
import requests
urls = ['https://link1',
'https://link2']
url_list = list(urls)
#scrape elements
for url in urls:
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.content, "html.parser")
    page = soup.find_all('p')
    page = soup.getText()

for line in urls:
    with open('filename{}.txt'.format(line), 'w', encoding="utf8") as outfile:
        outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))
I am getting OSError: [Errno 22] Invalid argument: filenamehttps://link1
If I change my code into this
for index, line in enumerate(urls):
    with open('filename{}.txt'.format(index), 'w', encoding="utf8") as outfile:
        outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))
The script runs but I have a semantic error; both output files contain the text extracted from link2. I guess the second for-loop does this.
I've researched S/O for similar answers but I can't figure it out.
I'm guessing you're on some sort of *nix system, as the error has to do with / being interpreted as part of the path.
So you have to name your files correctly, or create the path where you want to save the output.
That said, using a raw URL as a file name is not a great idea, precisely because of this error.
You could either replace the / with, say, _, or just name your files differently.
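For illustration, here's a minimal sketch of one way to sanitize a URL into a safe file name (the helper name and the exact replacement scheme are my own choices, not from the question):

```python
import re

def safe_filename(url):
    """Turn a URL into a string that is safe to use as a file name."""
    name = re.sub(r'^https?://', '', url)        # drop the scheme
    return re.sub(r'[^A-Za-z0-9.-]', '_', name)  # replace anything unsafe with _

print(safe_filename('https://link1/some/page'))  # link1_some_page
```

With this, `'filename_{}.txt'.format(safe_filename(url))` produces a distinct, valid file name per URL on both *nix and Windows.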
Also, this:
urls = ['https://link1',
'https://link2']
Is already a list, so no need for this:
url_list = list(urls)
And there's no need for two for loops. You can write to a file as you scrape the URLs from the list.
Here's the working code with some dummy website:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
urls = ['https://lipsum.com/', 'https://de.lipsum.com/']
for url in urls:
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.content, "html.parser")
    page = soup.find("div", {"id": "Panes"}).find("p").getText()
    with open('filename_{}.txt'.format(url.replace("/", "_")), 'w', encoding="utf8") as outfile:
        outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))
You could also use your approach with enumerate():
import requests
from bs4 import BeautifulSoup
urls = ['https://lipsum.com/', 'https://de.lipsum.com/']
for index, url in enumerate(urls, start=1):
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.content, "html.parser")
    page = soup.find("div", {"id": "Panes"}).find("p").getText()
    with open('filename_{}.txt'.format(index), 'w', encoding="utf8") as outfile:
        outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))
I have been able to successfully scrape the website, but I'm having trouble saving the output as a CSV and need help seeing where I've messed up. Here is my code; I have also included the error message below:
import bs4 as BeautifulSoup
import CSV
import re
import urllib.request
from IPython.display import HTML
# Program that scraps the website for
r = urllib.request.urlopen('https://www.census.gov/programs-surveys/popest.html').read()
soup = BeautifulSoup(r, "html.parser")

for link in soup.find_all('a'):
    print(link.get('href'))

with open("Giles_C996.csv", "w") as csv_file:
    writer = csv.writer(csv_file, delimiter="/n")
    writer.writerow(Links)

Close()
Error message:
Traceback (most recent call last):
  File "C:\Users\epiph\Giles_C996 Project 2.txt", line 2, in <module>
    import CSV
ModuleNotFoundError: No module named 'CSV'
You've imported the csv and bs4 modules incorrectly, and Close() is not defined. You can also convert the list to a set to get rid of duplicates.
import csv
import urllib.request
from bs4 import BeautifulSoup
r = urllib.request.urlopen('https://www.census.gov/programs-surveys/popest.html').read()
soup = BeautifulSoup(r, "html.parser")
links = set([a['href'] for a in soup.find_all('a', href=True)])
with open("Giles_C996.csv", "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerows([link] for link in links)
Output is:
https://www.census.gov/programs-surveys/cps.html
/newsroom/press-releases/2020/65-older-population-grows/65-older-population-grows-spanish.html
https://www.census.gov/businessandeconomy
https://www.census.gov/data
/programs-surveys/popest/library.html
etc.
You had some erroneous imports and referenced an undefined variable.
I'm not very familiar with IPython, so I can't comment much on your use of it, and I always have trouble with urllib, so I just used requests.
I included some scrap code for an alternative layout of the CSV file, a function which can help determine whether a link is valid, and a list comprehension in case you prefer that approach.
The script also opens your CSV file for you.
import csv, re, urllib.request, os
import requests
from bs4 import BeautifulSoup
# from IPython.display import HTML


def exists(link) -> bool:
    """
    Check if the request response is 200.
    """
    try:
        return 200 == requests.get(link).status_code
    except requests.exceptions.MissingSchema:
        return False
    except requests.exceptions.InvalidSchema:
        return False


def scrapeLinks(url):
    checked = set()
    page = requests.get(url).text
    soup = BeautifulSoup(page, "html.parser")
    for a in soup.find_all('a', href=True):
        link = a['href']
        if not link in checked and exists(link):
            yield link
            checked.add(link)


# Program that scrapes the website
url = 'https://www.census.gov/programs-surveys/popest.html'
# r = urllib.request.urlopen(url).read()
r = requests.get(url).text
soup = BeautifulSoup(r, "html.parser")

# links = [
#     a['href'] for a in soup.find_all('a', href=True)
#     if exists(a['href'])
# ]

file_name = "Giles_C996.csv"
with open(file_name, "w") as csv_file:
    # writer = csv.writer(csv_file, delimiter="/n")
    writer = csv.writer(csv_file)
    # writer.writerow(set(links))  # conversion to remove duplicates
    writer.writerow(scrapeLinks(url))
    # writer.writerows(enumerate(scrapeLinks(url), 1))  # if you want a 2d-indexed collection

os.startfile(file_name)
# Close()
I am trying to find downloadable video links on a website. For example, I am working with URLs like this: https://www.loc.gov/item/2015669100/. You can see that there is an m3u8 video link under the mejs__mediaelement div tag.
However, my code is not printing anything, meaning it's not finding the video URLs for the website.
My code is below
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
with open('pages2crawl.txt', 'r') as inFile:
    lines = [line.rstrip() for line in inFile]

for page in lines:
    req = Request(page, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(urlopen(req).read(), 'html.parser')
    pages = soup.findAll('div', attrs={'class': 'mejs__mediaelement'})
    for e in pages:
        video = e.find("video").get("src")
        if video.endswith("m3u8"):
            print(video)
If you just want to make a simple script, it would probably be easier to use regex.
import re, requests

url = "https://www.loc.gov/item/2015669100/"  # example page from the question
s = requests.Session()  # start the session
data = s.get(url)  # HTTP GET request to download the page
data = data.text  # get the raw text
vidlinks = re.findall("src='(.*?).m3u8'/>", data)  # find everything between the two anchors
print(vidlinks[0] + ".m3u8")  # print the full link with extension
You can use CSS selector source[type="application/x-mpegURL"] to extract MPEG link (or source[type="video/mp4"] to extract mp4 link):
import requests
from bs4 import BeautifulSoup
url = "https://www.loc.gov/item/2015669100/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
link_mpeg = soup.select_one('source[type="application/x-mpegURL"]')["src"]
link_mp4 = soup.select_one('source[type="video/mp4"]')["src"]
print(link_mpeg)
print(link_mp4)
Prints:
https://tile.loc.gov/streaming-services/iiif/service:afc:afc2010039:afc2010039_crhp0001:afc2010039_crhp0001_mv04/full/full/0/full/default.m3u8
https://tile.loc.gov/storage-services/service/afc/afc2010039/afc2010039_crhp0001/afc2010039_crhp0001_mv04.mp4
I'm new to Python programming and am trying to scrape every link listed in my Urls.txt file.
the code I wrote is :
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
user_agent = UserAgent()
fp = open("Urls.txt", "r")
values = fp.readlines()
fin = open("soup.html", "a")
for link in values:
    print(link)
    page = requests.get(link, headers={"user-agent": user_agent.chrome})
    html = page.content
    soup = BeautifulSoup(html, "html.parser")
    fin.write(str(soup))
The code works absolutely fine when the links are provided directly as strings instead of via the variable, but when run as-is the output differs.
Maybe the string you read from the file has a line break.
To remove it, use link.strip("\n").
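A minimal sketch of what goes wrong and how the fix behaves, with hypothetical URLs standing in for the lines of Urls.txt:

```python
# readlines() keeps the trailing newline on every line, so requests
# would be handed e.g. 'https://example.com\n' instead of the bare URL.
values = ['https://example.com\n', 'https://example.org\n']

# strip('\n') removes the line break; a bare strip() would additionally
# drop '\r' (Windows line endings) and stray spaces.
urls = [link.strip('\n') for link in values]
print(urls)  # ['https://example.com', 'https://example.org']
```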
I'm using Python 3 to write a webscraper to pull URL links and write them to a csv file. The code does this successfully; however, there are many duplicates. How can I create the csv file with only single instances (unique) of each URL?
Thanks for the help!
import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin
r = requests.get('url')
soup = BeautifulSoup(r.text, 'html.parser')
data = []
for link in soup.find_all('a', href=True):
    if '#' in link['href']:
        pass
    else:
        print(urljoin('base-url', link.get('href')))
        data.append(urljoin('base-url', link.get('href')))

with open('test.csv', 'w', newline='') as csvfile:
    write = csv.writer(csvfile)
    for row in data:
        write.writerow([row])
Using set() somewhere along the line is the way to go. In the code below I've added that as data = set(data) on its own line to best illustrate the usage; replacing data with set(data) drops your ~250-URL list to around ~130:
import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin
r = requests.get('https://www.census.gov/programs-surveys/popest.html')
soup = BeautifulSoup(r.text, 'html.parser')
data = []
data = []
for link in set(soup.find_all('a', href=True)):
    if '#' in link['href']:
        pass
    else:
        print(urljoin('https://www.census.gov', link.get('href')))
        data.append(urljoin('https://www.census.gov', link.get('href')))

data = set(data)

with open('CensusLinks.csv', 'w', newline='') as csvfile:
    write = csv.writer(csvfile)
    for row in data:
        write.writerow([row])
import requests
import csv
import re
from bs4 import BeautifulSoup

page = requests.get("https://www.google.com/search?q=cars")
soup = BeautifulSoup(page.content, "lxml")
links = soup.findAll("a")

with open('aaa.csv', 'wb') as myfile:
    for link in soup.find_all("a", href=re.compile("(?<=/url\?q=)(htt.*://.*)")):
        a = (re.split(":(?=http)", link["href"].replace("/url?q=", "")))
        wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
        wr.writerow(a)
The output of this code is a CSV file with 28 saved URLs; however, the URLs are not correct. For example, this is a wrong URL:
http://www.imdb.com/title/tt0317219/&sa=U&ved=0ahUKEwjg5fahi7nVAhWdHsAKHSQaCekQFgg9MAk&usg=AFQjCNFu_Vg9v1oVhEtR-vKqCJsR2YGd2A
Instead it should be:
http://www.imdb.com/title/tt0317219/
How can I remove the second part of each URL, starting from "&sa=", so that all URLs are saved like the second one?
I am using python 2.7 and Ubuntu 16.04.
If the redundant part of the URL always starts with &, you can apply split() to each URL:
url = 'http://www.imdb.com/title/tt0317219/&sa=U&ved=0ahUKEwjg5fahi7nVAhWdHsAKHSQaCekQFgg9MAk&usg=AFQjCNFu_Vg9v1oVhEtR-vKqCJsR2YGd2A'
url = url.split('&')[0]
print(url)
output:
http://www.imdb.com/title/tt0317219/
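Applied to a whole list of scraped links (the list here is hypothetical), the same split is a one-liner:

```python
# Hypothetical list of scraped Google-redirect URLs.
urls = [
    'http://www.imdb.com/title/tt0317219/&sa=U&usg=AFQjCNFu',
    'https://de.wikipedia.org/wiki/Cars_(Film)&sa=U',
]

# Keep only the part before the first '&' in each URL.
cleaned = [u.split('&')[0] for u in urls]
print(cleaned)
# ['http://www.imdb.com/title/tt0317219/', 'https://de.wikipedia.org/wiki/Cars_(Film)']
```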
Not the best way, but you could split one more time, adding one more line after a:
a = [a[0].split("&")[0]]
print(a)
Result:
['https://de.wikipedia.org/wiki/Cars_(Film)']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:I2SHYtLktRcJ']
['https://de.wikipedia.org/wiki/Cars_(Film)%23Handlung']
['https://de.wikipedia.org/wiki/Cars_(Film)%23Synchronisation']
['https://de.wikipedia.org/wiki/Cars_(Film)%23Soundtrack']
['https://de.wikipedia.org/wiki/Cars_(Film)%23Kritik']
['https://www.mytoys.de/disney-cars/']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:9Ohx4TRS8KAJ']
['https://www.youtube.com/watch%3Fv%3DtNmo09Q3F8s']
['https://www.youtube.com/watch%3Fv%3DtNmo09Q3F8s']
['https://www.youtube.com/watch%3Fv%3DkLAnVd5y7M4']
['https://www.youtube.com/watch%3Fv%3DkLAnVd5y7M4']
['http://cars.disney.com/']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:1BoR6M9fXwcJ']
['http://cars.disney.com/']
['http://cars.disney.com/']
['https://www.whichcar.com.au/car-style/12-cartoon-cars']
['https://www.youtube.com/watch%3Fv%3D6JSMAbeUS-4']
['http://filme.disney.de/cars-3-evolution']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:fO7ypFFDGk0J']
['http://www.4players.de/4players.php/spielinfonews/Allgemein/36859/2169193/Project_CARS_2-Zehn_Ferraris_erweitern_den_virtuellen_Fuhrpark.html']
['http://www.4players.de/4players.php/spielinfonews/Allgemein/36859/2169193/Project_CARS_2-Zehn_Ferraris_erweitern_den_virtuellen_Fuhrpark.html']
['http://www.play3.de/2017/08/02/project-cars-2-6/']
['http://www.imdb.com/title/tt0317219/']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:-xdXy-yX2fMJ']
['http://www.carmagazine.co.uk/']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:PRPbHf_kD9AJ']
['http://google.com/search%3Ftbm%3Disch%26q%3DCars']
['http://www.imdb.com/title/tt0317219/']
['https://de.wikipedia.org/wiki/Cars_(Film)']