How can I convert a Wikimedia dump XML file into text in Python? Is there any package in Python for this?
It is not clear which dump you have, but from the post it sounds like you want to read the web content, extract the elements, and write them to a file using Python.
It is easier to scrape the website directly using requests and bs4:
# Scrape the data from the website
import requests, bs4

# Get the HTML of the Wikipedia page
url = "https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors"
req = requests.get(url)

# Create a bs4 object and select the math elements
soup = bs4.BeautifulSoup(req.text, "html5lib")
element = soup.select('.mwe-math-element')
print(element)
# You can save the required content to a file by manipulating the content in the element list
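For example, a minimal sketch that writes the text of each matched element to a file (the filename math_elements.txt is an arbitrary choice):

# Minimal sketch: write the text of each matched element to a file
with open("math_elements.txt", "w", encoding="utf-8") as f:
    for el in element:
        f.write(el.get_text() + "\n")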
Related
How do I parse the following SEC data from a .html website in Python?
I'm trying to parse the HTML from the following webpage: https://www.sec.gov/Archives/edgar/data/1831187/0001831187-23-000001-index.html
The .txt version of the page contains the three top-level elements that I need to extract and parse into data frames:
Header: <SEC-HEADER>
Primary Document: <edgarSubmission...>
Information Table: <informationTable...>
I can see some of the information with the following code, but I am ignorant of how to find the equivalent text element in the HTML and extract it. How can I proceed?
from bs4 import BeautifulSoup
import requests

url = "https://www.sec.gov/Archives/edgar/data/1831187/0001831187-23-000001-index.html"
request = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(request.text, 'lxml')
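One way to proceed is to fetch the .txt version of the filing and slice out each top-level block with a regular expression. A minimal sketch, assuming the full-text submission lives at the same path with "-index.html" replaced by ".txt" (EDGAR's usual convention, but not confirmed in the question):

import re
import requests

url = "https://www.sec.gov/Archives/edgar/data/1831187/0001831187-23-000001-index.html"
# Assumption: the full-text submission is the index URL with ".txt" in place of "-index.html"
txt_url = url.replace("-index.html", ".txt")
raw = requests.get(txt_url, headers={"User-Agent": "Mozilla/5.0"}).text

# Extract the <SEC-HEADER> block; the same pattern works for
# <edgarSubmission ...> and <informationTable ...>
header = re.search(r"<SEC-HEADER>.*?</SEC-HEADER>", raw, re.DOTALL)
if header:
    print(header.group(0))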
I'm trying to make an app that gets the source links on Bandcamp, but I'm kind of stuck. Is there a way to get the source link with BeautifulSoup?
The link I'm trying to get is a Bandcamp album page.
The data is within the <script> tags in JSON format, so use BeautifulSoup to get the 'script'. The data you are after is in the data-tralbum attribute.
Once you get that, have json read it in, then just iterate through the JSON structure:
from bs4 import BeautifulSoup
import requests
import json
url = 'https://vine.bandcamp.com/album/another-light'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# The data-tralbum attribute sits on the fifth <script> tag on the page
script = str(soup.find_all('script')[4]['data-tralbum'])
jsonData = json.loads(script)

trackinfo = jsonData['trackinfo']
links = []
for each in trackinfo:
    links.append(each['file']['mp3-128'])
Output:
print(links)
['https://t4.bcbits.com/stream/efbba461835eff472bd04a2f9e9910a9/mp3-128/1761020287?p=0&ts=1638288735&t=8ae6343808036ab513cd5436ea009e5d0de784e4&token=1638288735_9139d56ec86f2d44b83a11f3eed8caf7075d6039', 'https://t4.bcbits.com/stream/3e5ef92e6d83e853958ed01955c95f5f/mp3-128/1256475880?p=0&ts=1638288735&t=745a6c701cf1c5772489da5467f6cae5d3622818&token=1638288735_7e86a32c635ba92e0b8320ef56a457d988286cff', 'https://t4.bcbits.com/stream/bbb49d4a72cb80feaf759ec7890abbb6/mp3-128/3439518541?p=0&ts=1638288735&t=dcc7ef7d1d7823e227339fb3243385089478ebe7&token=1638288735_5db36a29c58ea038828d7b34b67e13bd80597dd8', 'https://t4.bcbits.com/stream/8c8a69959337f6f4809f6491c2822b45/mp3-128/1330130896?p=0&ts=1638288735&t=d108dac84dfaac901a546c5fcf5064240cca376b&token=1638288735_8d9151aa82e7a00042025f924660dd3a093c2f74', 'https://t4.bcbits.com/stream/4d4253633405f204d7b1c101379a73be/mp3-128/2478242466?p=0&ts=1638288735&t=a8cd539d0ce8ff417f9b69740070870ed9a182a5&token=1638288735_ad8b5e93c8ffef6623615ce82a6754678fa67b67', 'https://t4.bcbits.com/stream/6c4feee38e289aea76080e9ddc997fa5/mp3-128/2243532902?p=0&ts=1638288735&t=83417c3aba0cef0f969f93bac5165e582f24a588&token=1638288735_c1d9d43b4e10cc6d02c822de90eda3a52c382df2', 'https://t4.bcbits.com/stream/a24dc5dad7b619d47b006e26084ff38f/mp3-128/3054008347?p=0&ts=1638288735&t=4563c326a272c9f5b8462fef1d082e46fac7f605&token=1638288735_55978e7edbe0410ff745913224b8740becad59d5', 'https://t4.bcbits.com/stream/6221790d7f55d3b1f006bd5fac5458fe/mp3-128/1500140939?p=0&ts=1638288735&t=9ecc210c53af05f4034ee00cd1a96a043312a4a7&token=1638288735_0f2faba41da8952f841669513d04bdaaae35a629', 'https://t4.bcbits.com/stream/030506909569626a0d2d7d182b61c691/mp3-128/1707615013?p=0&ts=1638288735&t=c8dcbb2c491789928f5cb6ef8b755df999cb58b8&token=1638288735_b278ba825129ae1b5588b47d5cda345ef2db4e58', 'https://t4.bcbits.com/stream/d1ae0cbc281fc81ddd91f3a3e3d80973/mp3-128/2808772965?p=0&ts=1638288735&t=1080ff51fc40bb5b7afb3a2460f3209cbda549e3&token=1638288735_c93249c847acba5cf23521fa745e05b426a5ba05', 'https://t4.bcbits.com/stream/1b9d50f8210bdc3cf4d2e33986f319ae/mp3-128/2751220220?p=0&ts=1638288735&t=9f24f06dfc5c8a06f24f28664438a6f1a75a038c&token=1638288735_f3a98a20b3c344dc5a37a602a41572d5fe8539c1', 'https://t4.bcbits.com/stream/203cd15629ba03e3249f850d5e1ac42e/mp3-128/4188265472?p=0&ts=1638288735&t=4b4bc2f2194c63a1d3b957e3dd6046bd764c272a&token=1638288735_53a70e7d83ce8c2800baeaf92a5c19db4e146e3f', 'https://t4.bcbits.com/stream/c63b5c9ca090b233e675974c7e7ee4b2/mp3-128/258670123?p=0&ts=1638288735&t=a81ae9dc33dea2b2660d13dbbec93dbcb06e6b63&token=1638288735_446d0ae442cbbadbceb342fe4f7b69d0fbab2928', 'https://t4.bcbits.com/stream/2e824d3c643658c8e9e24b548bc8cb0b/mp3-128/2332945345?p=0&ts=1638288735&t=5bdf0264b9ffe4616d920c55f5081744bf0822d4&token=1638288735_872191bb67a3438ef0fd1ce7e8a9e5ca09e6c37e']
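As a side note, hard-coding index [4] is brittle. Continuing from the snippet above, a sketch of a more defensive lookup, assuming only one script tag on the page carries the data-tralbum attribute:

# Find the script tag by its data-tralbum attribute instead of by position
script_tag = soup.find('script', attrs={'data-tralbum': True})
jsonData = json.loads(script_tag['data-tralbum'])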
I'm trying to use BS4 to parse the HTML of the About page of a YouTube channel so I can scrape the number of channel views. Below is the code to scrape the channel views (located in a 'yt-formatted-string') and also the whole right column of the page. The two lines return an empty list and a None value for the find_all() and find() calls, respectively.
I read another thread saying I may be getting an empty list or None because the page calls an API to get the total channel views, so the values aren't actually in the HTML I'm parsing.
I know I could access much of this info through the Youtube API, but I want to iterate this code over multiple channels that are not my own. Moreover, I want to understand how to use BS4 to its full extent so I can replicate this process on an Instagram page or Facebook page.
Should I be using a different library that isn't BS4? Is what I'm looking to accomplish even possible?
My code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
#find Youtube channel views and subscriber counts
my_url = 'https://www.youtube.com/c/Rozziofficial/about'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html,"html.parser")
body = page_soup.body
views_count = body.find_all('yt-formatted-string',{"class":"style-scope ytd-channel-about-metadata-renderer"})
right_column = body.find('div', {"id":"right-column"})
print(right_column)
print(views_count)
YouTube is loaded dynamically, therefore urllib won't support it.
However, the data is available in JSON format on the website. You can convert this data to a Python dictionary (dict) using the built-in json library.
This example uses the URL you provided, https://www.youtube.com/c/Rozziofficial/about, but you can change the channel name; it will work for all channels.
Here's an example using requests; you can use urllib instead:
import re
import json
import requests
from bs4 import BeautifulSoup
URL = "https://www.youtube.com/c/Rozziofficial/about"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
# Locate the JSON data using a regular-expression pattern
data = re.search(r"var ytInitialData = ({.*});", str(soup)).group(1)

# This converts the JSON data to a Python dictionary (dict)
json_data = json.loads(data)

# Uncomment to view all the data
# print(json.dumps(json_data, indent=2))
# This is the info from the webpage on the right-side under "stats", it contains the data you want
stats = json_data["contents"]["twoColumnBrowseResultsRenderer"]["tabs"][5]["tabRenderer"]["content"]["sectionListRenderer"]["contents"][0]["itemSectionRenderer"]["contents"][0]["channelAboutFullMetadataRenderer"]
print("Channel Views:", stats["viewCountText"]["simpleText"])
print("Joined:", stats["joinedDateText"]["runs"][1]["text"])
Output:
Channel Views: 10,263,762 views
Joined: Jun 30, 2007
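Note that the tabs[5] index can shift if YouTube reorders the tabs. A more defensive lookup is sketched below; it assumes only that the "channelAboutFullMetadataRenderer" key itself is stable:

# Walk the tabs instead of hard-coding tabs[5]
stats = None
tabs = json_data["contents"]["twoColumnBrowseResultsRenderer"]["tabs"]
for tab in tabs:
    content = tab.get("tabRenderer", {}).get("content", {})
    for section in content.get("sectionListRenderer", {}).get("contents", []):
        for item in section.get("itemSectionRenderer", {}).get("contents", []):
            if "channelAboutFullMetadataRenderer" in item:
                stats = item["channelAboutFullMetadataRenderer"]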
Further reading:
Web-scraping JavaScript page with Python.
I am trying to scrape the data from this link.
I have tried this:
from bs4 import BeautifulSoup
import urllib.request
import csv
# specify the url
urlpage = 'https://www.ikh.se/sysNet/getProductsJSON/getProductsJSONDB.aspx?' \
'sua=1&lang=2&navid=19277994'
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(urlpage)
# parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')
tag = soup.find('div', attrs={'class':'dnsCell'})
text = ''.join(tag.stripped_strings)
print(text)
I got the HTML DOM, but the product-list DOM is missing. I suspect the product list is built from a JSON array requested from this link, but I am not sure how the product-list DOM is loaded. Am I right or wrong?
I want to scrape all the product details from this site and export them to Excel.
The requests library does not execute the JavaScript. If you want to download the completely rendered website, use the selenium library: https://selenium-python.readthedocs.io/
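A minimal sketch with selenium, assuming a Chrome driver is available; the rendered page source is then handed to BeautifulSoup as before:

from bs4 import BeautifulSoup
from selenium import webdriver

urlpage = 'https://www.ikh.se/sysNet/getProductsJSON/getProductsJSONDB.aspx?' \
          'sua=1&lang=2&navid=19277994'

# Let the browser execute the JavaScript, then grab the rendered HTML
driver = webdriver.Chrome()
driver.get(urlpage)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

tag = soup.find('div', attrs={'class': 'dnsCell'})
print(tag)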
From this URL:
http://vs-web-fs-1.oecd.org/piaac/puf-data/CSV
I want to download all the files and save them with the text of the anchor tag. I guess my main struggle is to retrieve the text of the anchor tag right now:
from bs4 import BeautifulSoup
import requests
import urllib.request
url_base = "http://vs-web-fs-1.oecd.org"
url_dir = "http://vs-web-fs-1.oecd.org/piaac/puf-data/CSV"
r = requests.get(url_dir)
data = r.text
soup = BeautifulSoup(data,features="html5lib")
for link in soup.find_all('a'):
    if link.get('href').endswith(".csv"):
        print(link.find("a"))
        urllib.request.urlretrieve(url_base+link.get('href'), "test.csv")
Line print(link.find("a")) returns None. How can I retrieve the text?
You get the text by accessing the contents, like this:
link.contents[0]
or
link.string
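Putting it together, a sketch of the download loop that names each file after its anchor text (assuming that text is filesystem-safe):

import urllib.request
import requests
from bs4 import BeautifulSoup

url_base = "http://vs-web-fs-1.oecd.org"
url_dir = "http://vs-web-fs-1.oecd.org/piaac/puf-data/CSV"

soup = BeautifulSoup(requests.get(url_dir).text, features="html5lib")
for link in soup.find_all('a'):
    href = link.get('href')
    if href and href.endswith(".csv"):
        # link.string would also work here; get_text() is safer if the
        # anchor contains nested tags
        filename = link.get_text(strip=True)
        urllib.request.urlretrieve(url_base + href, filename)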