I'm new in python programming and trying to scrape every link available in my Urls.txt file.
the code I wrote is :
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
user_agent = UserAgent()
fp = open("Urls.txt", "r")
values = fp.readlines()
fin = open("soup.html", "a")
for link in values:
print( link )
page = requests.get(link, headers={"user-agent": user_agent.chrome})
html = page.content
soup = BeautifulSoup(html, "html.parser")
fin.write(str(soup))
The code works absolutely fine when the links are provided directly as string instead of as variable but when used as it is the output differs.
Maybe the string you read from the file has a line break.
To remove it use link.strip("\n")
Related
I am doing a lot of work with Beautiful Soup. However, my supervisor does not want me doing the work "in real time" from the web. Instead, he wants me to download all the text from a webpage and then work on it later. He wants to avoid repeated hits on a website.
Here is my code:
import requests
from bs4 import BeautifulSoup
url = 'https://scholar.google.com/citations?user=XpmZBggAAAAJ'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
I am unsure whether I should save "page" as a file and then import that into Beautiful Soup, or whether I should save "soup" as a file to open later. I also do not know how to save this as a file in a way that can be accessed as if it were "live" from the internet. I know almost nothing about Python, so I need the absolute easiest and simplest process for this.
So saving soup would be... tough, and out of my experience (read more about the pickleing process if interested). You can save the page as follows:
page = requests.get(url)
with open('path/to/saving.html', 'wb+') as f:
f.write(page.content)
Then later, when you want to do analysis on it:
with open('path/to/saving.html', 'rb') as f:
soup = BeautifulSoup(f.read(), 'lxml')
Something like that, anyway.
The following code iterates over url_list and saves all the responses into the list all_pages, which is stored to the response.pickle file.
import pickle
import requests
from bs4 import BeautifulSoup
all_pages = []
for url in url_list:
all_pages.append(requests.get(url))
with open("responses.pickle", "wb") as f:
pickle.dump(all_pages, f)
Then later on, you can load this data, "soupify" each response and do whatever you need with it.
with open("responses.pickle", "rb") as f:
all_pages = pickle.load(f)
for page in all_pages:
soup = BeautifulSoup(page.text, 'lxml')
# do stuff
Working with our request:
url = 'https://scholar.google.com/citations?user=XpmZBggAAAAJ'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
you can use this also:
f=open("path/page.html","w")
f.write(page.prettify())
f.close
I am trying to crawl several links, extract text found on <p> HTML tag and write output to different files. Each link should have its own output file. So far:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re
import csv
import pyperclip
import pprint
import requests
urls = ['https://link1',
'https://link2']
url_list = list(urls)
#scrape elements
for url in urls:
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.content, "html.parser")
page = soup.find_all('p')
page = soup.getText()
for line in urls:
with open('filename{}.txt'.format(line), 'w', encoding="utf8") as outfile:
outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))
I am getting OSError: [Errno 22] Invalid argument: filenamehttps://link1
If I change my code into this
for index, line in enumerate(urls):
with open('filename{}.txt'.format(index), 'w', encoding="utf8") as outfile:
outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))
The script runs but I have a semantic error; both output files contain the text extracted from link2. I guess the second for-loop does this.
I've researched S/O for similar 1 answers but I can't figure it out.
I'm guessing you're on some sort of *nix system as the error has to do with / interpreted a part of the path.
So, you have to do something to name your files correctly or create the path you want to save the output.
Having said that, using the URL as a file name is not a great idea, because of the above error.
You could either replace the / with, say _ or just name your files differently.
Also, this:
urls = ['https://link1',
'https://link2']
Is already a list, so no need for this:
url_list = list(urls)
And there's no need for two for loops. You can write to a file as you scrape the URLS from the list.
Here's the working code with some dummy website:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
urls = ['https://lipsum.com/', 'https://de.lipsum.com/']
for url in urls:
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.content, "html.parser")
page = soup.find("div", {"id": "Panes"}).find("p").getText()
with open('filename_{}.txt'.format(url.replace("/", "_")), 'w', encoding="utf8") as outfile:
outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))
You could also use your approach with enumerate():
import requests
from bs4 import BeautifulSoup
urls = ['https://lipsum.com/', 'https://de.lipsum.com/']
for index, url in enumerate(urls, start=1):
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.content, "html.parser")
page = soup.find("div", {"id": "Panes"}).find("p").getText()
with open('filename_{}.txt'.format(index), 'w', encoding="utf8") as outfile:
outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))
I'm trying to scrape images from a site using beautifulsoup HTML parser.
There are 2 kinds of image tags for each image on the site. One is for the thumbnail and the other is the bigger size image that only appears after I click on the thumbnail and expand. The bigger size tag contains a class="expanded-image" attribute.
I'm trying to parse through the HTML and get the "src" attribute of the expanded image which contains the source for the image.
When I try to execute my code, nothing happens. It just says the process finished without scraping any image. But when I don't try to filter the code and just give tag as an argument, it downloads all the thumbnails.
Here's my code:
import webbrowser, requests, os
from bs4 import BeautifulSoup
def getdata(url):
r = requests.get(url)
return r.text
htmldata = getdata('https://boards.4chan.org/a/thread/30814')
soup = BeautifulSoup(htmldata, 'html.parser')
list = []
for i in soup.find_all("img",{"class":"expanded-thumb"}):
list.append(i['src'].replace("//","https://"))
def download(url, pathname):
if not os.path.isdir(pathname):
os.makedirs(pathname)
filename = os.path.join(pathname, url.split("/")[-1])
response = requests.get(url, stream=True)
with open(filename, "wb") as f:
f.write(response.content)
for a in list:
download(a,"file")
You might be running into a problem using "list" as a variable name. It's a type in python. Start with this (replacing TEST_4CHAN_URL with whatever thread you want), incorporating my suggestion from the comment above.
import requests
from bs4 import BeautifulSoup
TEST_4CHAN_URL = "https://boards.4chan.org/a/thread/<INSERT_THREAD_ID_HERE>"
def getdata(url):
r = requests.get(url)
return r.text
htmldata = getdata(TEST_4CHAN_URL)
soup = BeautifulSoup(htmldata, "html.parser")
src_list = []
for i in soup.find_all("a", {"class":"fileThumb"}):
src_list.append(i['href'].replace("//", "https://"))
print(src_list)
I am trying to find a downloadable video links in a website. For example, I am working with urls like these https://www.loc.gov/item/2015669100/. You can see that there is a m3u8 video link under mejs__mediaelement div tag.
However my code is not printing anything. Meaning that it's not finding the Video urls for the website.
My code is below
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
with open('pages2crawl.txt', 'r') as inFile:
lines = [line.rstrip() for line in inFile]
for page in lines:
req = Request(page, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(urlopen(req).read(), 'html.parser')
pages = soup.findAll('div', attrs={'class' : 'mejs__mediaelement'})
for e in pages:
video = e.find("video").get("src")
if video.endswith("m3u8"):
print(video)
If you just want to make a simple script it would probably be easier to use regex.
import re, requests
s = requests.Session() #start the session
data = s.get(url) #http get request to download data
data = data.text #get the raw text
vidlinks = re.findall("src='(.*?).m3u8'/>", data) #find all between the two parts in the data
print(vidlinks[0] + ".m3u8") #print the full link with extension
You can use CSS selector source[type="application/x-mpegURL"] to extract MPEG link (or source[type="video/mp4"] to extract mp4 link):
import requests
from bs4 import BeautifulSoup
url = "https://www.loc.gov/item/2015669100/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
link_mpeg = soup.select_one('source[type="application/x-mpegURL"]')["src"]
link_mp4 = soup.select_one('source[type="video/mp4"]')["src"]
print(link_mpeg)
print(link_mp4)
Prints:
https://tile.loc.gov/streaming-services/iiif/service:afc:afc2010039:afc2010039_crhp0001:afc2010039_crhp0001_mv04/full/full/0/full/default.m3u8
https://tile.loc.gov/storage-services/service/afc/afc2010039/afc2010039_crhp0001/afc2010039_crhp0001_mv04.mp4
import requests
import csv
from bs4 import BeautifulSoup
page = requests.get("https://www.google.com/search?q=cars")
soup = BeautifulSoup(page.content, "lxml")
import re
links = soup.findAll("a")
with open('aaa.csv', 'wb') as myfile:
for link in soup.find_all("a",href=re.compile("(?<=/url\?q=)(htt.*://.*)")):
a = (re.split(":(?=http)",link["href"].replace("/url?q=","")))
wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
wr.writerow(a)
The output of this code is that I have a CSV file where 28 URLs are saved, however the URLs are not correct. For example this is a wrong URL:-
http://www.imdb.com/title/tt0317219/&sa=U&ved=0ahUKEwjg5fahi7nVAhWdHsAKHSQaCekQFgg9MAk&usg=AFQjCNFu_Vg9v1oVhEtR-vKqCJsR2YGd2A
Instead it should be:-
http://www.imdb.com/title/tt0317219/
How can I remove the second part for each URL if it contains "&sa="
Because then the second part of the URL starting from:-
"&sa=" should be removed, so that all URLs are saved like the second URL.
I am using python 2.7 and Ubuntu 16.04.
If every time redundant part of url starts with &, you can apply split() to each url:
url = 'http://www.imdb.com/title/tt0317219/&sa=U&ved=0ahUKEwjg5fahi7nVAhWdHsAKHSQaCekQFgg9MAk&usg=AFQjCNFu_Vg9v1oVhEtR-vKqCJsR2YGd2A'
url = url.split('&')[0]
print(url)
output:
http://www.imdb.com/title/tt0317219/
Not the best way, but you could do one more time split, adding one more line after a:
a=[a[0].split("&")[0]]
print(a)
Result:
['https://de.wikipedia.org/wiki/Cars_(Film)']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:I2SHYtLktRcJ']
['https://de.wikipedia.org/wiki/Cars_(Film)%23Handlung']
['https://de.wikipedia.org/wiki/Cars_(Film)%23Synchronisation']
['https://de.wikipedia.org/wiki/Cars_(Film)%23Soundtrack']
['https://de.wikipedia.org/wiki/Cars_(Film)%23Kritik']
['https://www.mytoys.de/disney-cars/']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:9Ohx4TRS8KAJ']
['https://www.youtube.com/watch%3Fv%3DtNmo09Q3F8s']
['https://www.youtube.com/watch%3Fv%3DtNmo09Q3F8s']
['https://www.youtube.com/watch%3Fv%3DkLAnVd5y7M4']
['https://www.youtube.com/watch%3Fv%3DkLAnVd5y7M4']
['http://cars.disney.com/']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:1BoR6M9fXwcJ']
['http://cars.disney.com/']
['http://cars.disney.com/']
['https://www.whichcar.com.au/car-style/12-cartoon-cars']
['https://www.youtube.com/watch%3Fv%3D6JSMAbeUS-4']
['http://filme.disney.de/cars-3-evolution']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:fO7ypFFDGk0J']
['http://www.4players.de/4players.php/spielinfonews/Allgemein/36859/2169193/Project_CARS_2-Zehn_Ferraris_erweitern_den_virtuellen_Fuhrpark.html']
['http://www.4players.de/4players.php/spielinfonews/Allgemein/36859/2169193/Project_CARS_2-Zehn_Ferraris_erweitern_den_virtuellen_Fuhrpark.html']
['http://www.play3.de/2017/08/02/project-cars-2-6/']
['http://www.imdb.com/title/tt0317219/']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:-xdXy-yX2fMJ']
['http://www.carmagazine.co.uk/']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:PRPbHf_kD9AJ']
['http://google.com/search%3Ftbm%3Disch%26q%3DCars']
['http://www.imdb.com/title/tt0317219/']
['https://de.wikipedia.org/wiki/Cars_(Film)']