How to get text from a specific URL? - Python

I was wondering if there's any way to get the text from a certain URL using Python.
For example, from this one: https://www.ixbt.com/news/2022/04/20/160-radeon-rx-6400.html
Thank you in advance.

You can do web scraping in Python using BeautifulSoup:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.ixbt.com/news/2022/04/20/160-radeon-rx-6400.html"
html = urlopen(url).read()                          # download the raw HTML
soup = BeautifulSoup(html, features="html.parser")  # parse it
text = soup.get_text()                              # strip all tags, keep only the text
After that, you can save the extracted text to a text file:
with open("webscrap.txt", "w", encoding="utf-8") as text_file:
    text_file.write(text)
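Note that get_text() returns everything on the page, including menus and footers. If you only want the article body, a rough refinement (a sketch; which container actually holds the article on ixbt.com isn't shown here) is to keep only the paragraph tags:
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]  # one entry per <p> tag
text = "\n".join(paragraphs)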

Related

How to get video src using BeautifulSoup in Python

I am trying to find downloadable video links on a website. For example, I am working with URLs like this one: https://www.loc.gov/item/2015669100/. You can see that there is an m3u8 video link under the mejs__mediaelement div tag.
However, my code is not printing anything, meaning that it's not finding the video URLs for the website.
My code is below:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

with open('pages2crawl.txt', 'r') as inFile:
    lines = [line.rstrip() for line in inFile]

for page in lines:
    req = Request(page, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(urlopen(req).read(), 'html.parser')
    pages = soup.findAll('div', attrs={'class': 'mejs__mediaelement'})
    for e in pages:
        video = e.find("video").get("src")
        if video.endswith("m3u8"):
            print(video)
If you just want to make a simple script, it would probably be easier to use a regex:
import re, requests

url = "https://www.loc.gov/item/2015669100/"        # the page to scan (from the question)
s = requests.Session()                              # start the session
data = s.get(url)                                   # HTTP GET request to download the page
data = data.text                                    # get the raw text
vidlinks = re.findall("src='(.*?).m3u8'/>", data)   # find all between the two delimiters in the data
print(vidlinks[0] + ".m3u8")                        # print the full link with extension
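Bear in mind the regex is brittle: it assumes the src attribute is single-quoted and immediately followed by />, so any change to the markup breaks it. The parser-based approach below is more robust.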
You can use the CSS selector source[type="application/x-mpegURL"] to extract the MPEG link (or source[type="video/mp4"] to extract the mp4 link):
import requests
from bs4 import BeautifulSoup
url = "https://www.loc.gov/item/2015669100/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
link_mpeg = soup.select_one('source[type="application/x-mpegURL"]')["src"]
link_mp4 = soup.select_one('source[type="video/mp4"]')["src"]
print(link_mpeg)
print(link_mp4)
Prints:
https://tile.loc.gov/streaming-services/iiif/service:afc:afc2010039:afc2010039_crhp0001:afc2010039_crhp0001_mv04/full/full/0/full/default.m3u8
https://tile.loc.gov/storage-services/service/afc/afc2010039/afc2010039_crhp0001/afc2010039_crhp0001_mv04.mp4

Download multiple CSV files from a web directory using Python and store them on disk, using the anchor text as the filename

From this URL:
http://vs-web-fs-1.oecd.org/piaac/puf-data/CSV
I want to download all the files and save them with the text of the anchor tag as the filename. My main struggle right now is retrieving the text of the anchor tag:
from bs4 import BeautifulSoup
import requests
import urllib.request

url_base = "http://vs-web-fs-1.oecd.org"
url_dir = "http://vs-web-fs-1.oecd.org/piaac/puf-data/CSV"
r = requests.get(url_dir)
data = r.text
soup = BeautifulSoup(data, features="html5lib")

for link in soup.find_all('a'):
    if link.get('href').endswith(".csv"):
        print(link.find("a"))
        urllib.request.urlretrieve(url_base + link.get('href'), "test.csv")
The line print(link.find("a")) returns None. How can I retrieve the text?
You get the text by accessing the element's contents, like this:
link.contents[0]
or
link.string
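Putting it together, a sketch of the full download loop (assuming every anchor text is usable as a filename; in practice you may want to sanitise it first):
from bs4 import BeautifulSoup
import requests
import urllib.request

url_base = "http://vs-web-fs-1.oecd.org"
url_dir = "http://vs-web-fs-1.oecd.org/piaac/puf-data/CSV"
soup = BeautifulSoup(requests.get(url_dir).text, features="html5lib")

for link in soup.find_all('a'):
    href = link.get('href')
    if href and href.endswith(".csv"):
        filename = link.get_text(strip=True)  # the anchor text becomes the filename
        urllib.request.urlretrieve(url_base + href, filename)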

Python BeautifulSoup find_all() only returns the first table from the HTML instead of all tables?

I'm trying to use BeautifulSoup to get the tables from this website: https://www.basketball-reference.com/players/b/bryanko01.html
My code is as follows:
from bs4 import BeautifulSoup
from urllib.request import urlopen

f = open("testhtml.txt", 'w')
url = "https://www.basketball-reference.com/players/b/bryanko01.html"
html = urlopen(url)
bs = BeautifulSoup(html, "html5lib")

totals = [s.encode('utf-8') for s in bs.find_all("table")]
print(len(totals))  # prints 1

f.write(bs.prettify().encode('utf-8'))
f.close()
I write to a file to look at the raw HTML, and there are multiple tables (with table tags), but for some reason my call to find_all("table") only returns one table.
Please let me know if you have any thoughts as to what I may be doing wrong.
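A likely cause (a guess, since basketball-reference.com's markup changes over time): the site ships most of its tables inside HTML comments and reveals them with JavaScript, so a plain parse only sees the first, uncommented table. A sketch that also re-parses the commented-out markup:
from bs4 import BeautifulSoup
from bs4.element import Comment
from urllib.request import urlopen

url = "https://www.basketball-reference.com/players/b/bryanko01.html"
bs = BeautifulSoup(urlopen(url), "html5lib")

# Tables inside <!-- ... --> are invisible to find_all("table"),
# so parse every comment and collect any tables found in it.
tables = bs.find_all("table")
for comment in bs.find_all(string=lambda text: isinstance(text, Comment)):
    tables += BeautifulSoup(comment, "html5lib").find_all("table")

print(len(tables))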

Cut HTML in half using Python BeautifulSoup

I'm trying to scrape a website, and I need to cut the HTML code in half. The problem is that the HTML code is not really well organized, so I can't just use findAll.
Here is my code to parse the HTML:
resultats = requests.get(URL)
bs = BeautifulSoup(resultats.text, 'html.parser')
What I want to do is divide bs at each <h2> I find.
The solution is probably really simple, but I can't find it...
Edit: the website, here
This scrapes the whole text, without the HTML in it:
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://fr.wikipedia.org/wiki/Liste_de_sondages_sur_l'%C3%A9lection_pr%C3%A9sidentielle_fran%C3%A7aise_de_2017#Avril"
resultats = urlopen(url)
html = resultats.read()
soup = BeautifulSoup(html, 'html5lib')
text = soup.get_text()  # extracts the text from the HTML
print(text)
If you want to leave certain information out, you could add this:
text = re.sub(re.compile('yourRegex', re.DOTALL), '', text).strip()
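The question asked about splitting at each <h2>, though. A rough sketch of that, reusing the soup object from above and assuming each <h2> and the content it introduces sit side by side in the markup:
sections = {}
for h2 in soup.find_all('h2'):
    chunk = []
    for sibling in h2.find_next_siblings():
        if sibling.name == 'h2':  # reached the next heading: current section ends
            break
        chunk.append(sibling)
    sections[h2.get_text(strip=True)] = chunk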

Python web scraping - NoneObject failure - broken HTML?

I've got a problem with my parsing script in Python. I already tried it on another page (Yahoo Finance) and it worked fine. On Morningstar, however, it's not working: I get a "NoneObject" error in the terminal for the table variable. I guess it has to do with the structure of the Morningstar site, but I'm not sure. Maybe someone can tell me what went wrong.
Or is it simply not possible to use my simple script because of the Morningstar site structure?
A simple CSV export directly from Morningstar is not a solution, because I would like to use the script for other sites that don't have this functionality.
import requests
import csv
from bs4 import BeautifulSoup
from lxml import html

url = 'http://financials.morningstar.com/ratios/r.html?t=SBUX&region=USA&culture=en_US'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
table = soup.find('table', attrs={'class': 'r_table1 text2'})
print table.prettify()  # debugging

list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll(['th', 'td']):
        text = cell.text.replace(' ', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

print list_of_rows  # debugging

outfile = open("./test.csv", "wb")
writer = csv.writer(outfile)
writer.writerows(list_of_rows)
The table is dynamically loaded via a separate XHR call to an endpoint that returns a JSONP response. Simulate that request, extract the JSON string from the JSONP response, load it with json, extract the HTML from the componentData key, and parse it with BeautifulSoup:
import json
import re
import requests
from bs4 import BeautifulSoup

# make a request
url = 'http://financials.morningstar.com/financials/getFinancePart.html?&callback=jsonp1450279445504&t=XNAS:SBUX&region=usa&culture=en-US&cur=&order=asc&_=1450279445578'
response = requests.get(url)

# strip the JSONP padding and extract the HTML under "componentData"
# (use response.text rather than response.content so re.sub gets a str, not bytes)
data = json.loads(re.sub(r'([a-zA-Z_0-9\.]*\()|(\);?$)', '', response.text))["componentData"]

# parse the HTML
soup = BeautifulSoup(data, "html.parser")
table = soup.find('table', attrs={'class': 'r_table1 text2'})
print(table.prettify())
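To do the same on other sites, watch the browser developer tools' Network tab (filtered to XHR) while the page loads; the request that returns the table data is the one to simulate, and the callback= and _= parameters can usually be copied verbatim from there.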
