I'm trying to scrape text from the web and organize it. The text looks something like this:
https://i.stack.imgur.com/bKuXl.png
It is a mess.
How can I organize it into a JSON file or something?
Also I just did a basic web scrape:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://api.hypixel.net/skyblock/bazaar').text
soup = BeautifulSoup(source, 'lxml')
print(soup.prettify())
Try this:
source = requests.get('https://api.hypixel.net/skyblock/bazaar')
json_response = source.json()
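json_response is then an ordinary Python dict that you can work with directly. A minimal sketch; the "products" key is an assumption about this API's layout, so inspect the real response first:
import requests

source = requests.get('https://api.hypixel.net/skyblock/bazaar')
json_response = source.json()

# "products" is an assumed key; inspect json_response.keys() to see the real structure
for product_id in json_response.get("products", {}):
    print(product_id)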
You can use the json module to parse the response and dump it back out in JSON format:
import json
import requests

source = requests.get('https://api.hypixel.net/skyblock/bazaar').text
data = json.loads(source)               # parse the JSON string into a Python dict
json_file = json.dumps(data, indent=4)  # returns a formatted JSON str

# alternatively, write it straight to a file:
with open("path/to/file.txt", 'w') as file:
    json.dump(data, file, indent=4)     # outputs json to file.txt
Related
I'm currently using BS4 to extract some information from a Kickstarter webpage: https://www.kickstarter.com/projects/louisalberry/louis-alberry-debut-album-uk-european-tour
The project information is located inside one of the script tags: (pseudo-code)
...
<script>...</script>
<script>
window.current_ip = ...
...
window.current_project = "<I want this part>"
</script>
...
My current code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import html
html_ = urlopen("https://www.kickstarter.com/projects/louisalberry/louis-alberry-debut-album-uk-european-tour").read()
soup = BeautifulSoup(html_, 'html.parser')
# why does this not work?
# soup.find('script', re.compile("window.current_project"))
# currently, I'm doing this:
all_string = html.unescape(soup.find_all('script')[4].get_text())
# then some regex here on all_string to extract the current_project information
Currently I can get the section I want by indexing with [4], but since I'm not sure that index will always be correct, how can I extract the text from the correct script tag?
Thanks!
You can gather all the script elements and loop over them. Access the response object's content with requests:
from bs4 import BeautifulSoup
import requests
res = requests.get("https://www.kickstarter.com/projects/louisalberry/louis-alberry-debut-album-uk-european-tour")
soup = BeautifulSoup(res.content, 'lxml')
scripts = soup.select('script')

for script in scripts:
    if 'window.current_project' in script.text:
        print(script)
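Once you have found the right script element, you can pull the value out of the window.current_project assignment. A rough sketch with re and html.unescape; the regex assumes the value is a double-quoted string on a single line, so adjust it to the actual page source:
import html
import re

# reuses the `scripts` list from the snippet above
for script in scripts:
    if 'window.current_project' in script.text:
        # assumes the value is a double-quoted string on a single line
        match = re.search(r'window\.current_project\s*=\s*"(.*)"', script.text)
        if match:
            print(html.unescape(match.group(1)))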
This should work. (Instead of dumping to JSON, you could simply print the output if you prefer. Also, remember to change the placeholders where I wrote "CHOOSE A PATH" and "If theres any class add it here".)
from bs4 import BeautifulSoup
import requests
import json
import os

website = requests.get("https://www.kickstarter.com/projects/louisalberry/louis-alberry-debut-album-uk-european-tour")
soup = BeautifulSoup(website.content, 'lxml')
mytext = soup.findAll("script", {"class": "If theres any class add it here, or else delete this part"})

save_path = 'CHOOSE A PATH'
ogname = "kickstarter_text.json"
completename = os.path.join(save_path, ogname)

with open(completename, "w") as output:
    json.dump([script.get_text() for script in mytext], output)  # dump the script text, not the Tag objects
I have a CSV containing a list of URLs to HTML pages.
I need to extract all the links from each and every one of those pages.
Ok, I think this will do what you want.
import csv
import urllib2
import re
urls = csv.reader(open('C:\\your_path_here\\download_data.csv'))

for url in urls:
    response = urllib2.urlopen(url[0])
    html = response.read()
    # find the target of every href attribute in the page
    print re.findall(r'href="(.*?)"', html)
##################
In the CSV file:
http://www.cnn.com
http://www.yahoo.com
http://www.cbc.ca
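If you are on Python 3, a rough equivalent with requests and BeautifulSoup that pulls the actual link targets would look like this (the CSV path and the one-URL-per-row layout are carried over from the snippet above):
import csv

import requests
from bs4 import BeautifulSoup

with open('C:\\your_path_here\\download_data.csv') as f:
    for row in csv.reader(f):
        response = requests.get(row[0])
        soup = BeautifulSoup(response.text, 'html.parser')
        # collect the href of every anchor tag on the page
        links = [a['href'] for a in soup.find_all('a', href=True)]
        print(links)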
I am trying to import a list of URLs and grab pn2 and main1. I can run it without importing the file, so I know the scraping works, but I have no idea how to handle the import. Here is my most recent attempt, and below it is a small portion of the URLs. Thanks in advance.
import urllib
import urllib.request
import csv
from bs4 import BeautifulSoup
csvfile = open("ecco1.csv")
csvfilelist = csvfile.read()
theurl="csvfilelist"
soup = BeautifulSoup(theurl,"html.parser")
for row in csvfilelist:
for pn in soup.findAll('td',{"class":"productText"}):
pn2.append(pn.text)
for main in soup.find_all('div',{"class":"breadcrumb"}):
main1 = main.text
print (main1)
print ('\n'.join(pn2))
Urls:
http://www.eccolink.com/products/productresults.aspx?catId=2458
http://www.eccolink.com/products/productresults.aspx?catId=2464
http://www.eccolink.com/products/productresults.aspx?catId=2435
http://www.eccolink.com/products/productresults.aspx?catId=2446
http://www.eccolink.com/products/productresults.aspx?catId=2463
From what I see, you are opening a CSV file and using BeautifulSoup to parse it.
That is not the way to do it.
BeautifulSoup parses HTML files, not CSV.
Looking at your code, it would work if you were passing HTML code into bs4.
from bs4 import BeautifulSoup
import requests

links = []
file = open('links.txt', 'w')
html = requests.get('http://www.example.com').text
soup = BeautifulSoup(html, 'html.parser')

for x in soup.find_all('a', {"class": "abc"}):
    links.append(x)
    file.write(str(x) + '\n')

file.close()
Above is a very basic implementation of how I would get a target element in the HTML code and write it to a file or append it to a list. Use requests rather than urllib; it is a more modern and better library.
If you want to read your input data from a CSV, your best option is to use the csv module's reader, as in the sketch below.
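A rough sketch of that idea under the same assumptions as your original code (ecco1.csv contains one URL per row, and the pages use the productText and breadcrumb classes):
import csv

import requests
from bs4 import BeautifulSoup

pn2 = []

with open("ecco1.csv") as csvfile:
    for row in csv.reader(csvfile):
        html = requests.get(row[0]).text
        soup = BeautifulSoup(html, "html.parser")
        for pn in soup.find_all('td', {"class": "productText"}):
            pn2.append(pn.text)
        for main in soup.find_all('div', {"class": "breadcrumb"}):
            print(main.text)

print('\n'.join(pn2))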
Hope that helps.
I've got a problem with my parsing script in Python. I've already tried it on another page (Yahoo Finance) and it worked fine, but on Morningstar it is not working.
I get an error in the terminal saying the table variable is None. I guess it has to do with the structure of the Morningstar site, but I'm not sure. Maybe someone can tell me what went wrong.
Or is it simply not possible to use my script because of the site structure of the Morningstar page?
A simple CSV export directly from Morningstar is not a solution, because I would like to use the script for other sites which don't have this functionality.
import requests
import csv
from bs4 import BeautifulSoup
from lxml import html
url = 'http://financials.morningstar.com/ratios/r.html?t=SBUX&region=USA&culture=en_US'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
table = soup.find('table', attrs={'class': 'r_table1 text2'})
print table.prettify() #debugging
list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll(['th','td']):
        text = cell.text.replace(' ', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

print list_of_rows #debugging

outfile = open("./test.csv", "wb")
writer = csv.writer(outfile)
writer.writerows(list_of_rows)
The table is dynamically loaded with a separate XHR call to an endpoint which would return JSONP response. Simulate that request, extract the JSON string from the JSONP response, load it with json, extract the HTML from the componentData key and load with BeautifulSoup:
import json
import re
import requests
from bs4 import BeautifulSoup
# make a request
url = 'http://financials.morningstar.com/financials/getFinancePart.html?&callback=jsonp1450279445504&t=XNAS:SBUX&region=usa&culture=en-US&cur=&order=asc&_=1450279445578'
response = requests.get(url)
# extract the HTML under the "componentData"
data = json.loads(re.sub(r'([a-zA-Z_0-9\.]*\()|(\);?$)', '', response.text))["componentData"]
# parse HTML
soup = BeautifulSoup(data, "html.parser")
table = soup.find('table', attrs={'class': 'r_table1 text2'})
print(table.prettify())
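If you still want the CSV file your original script was producing, you can walk the parsed table in the same way. A minimal Python 3 sketch that continues from the table variable above (the ./test.csv path is carried over from your script):
import csv

# continue from the `table` variable in the snippet above
list_of_rows = []
for row in table.find_all('tr'):
    cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
    list_of_rows.append(cells)

with open("./test.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerows(list_of_rows)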
I am trying to get all the URLs on a website using Python. At the moment I am just copying the website's HTML into the Python program and then using code to extract all the URLs. Is there a way I could do this straight from the web without having to copy the entire HTML?
In Python 2, you can use urllib2.urlopen:
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
In Python 3, you can use urllib.request.urlopen:
import urllib.request
with urllib.request.urlopen('http://python.org/') as response:
    html = response.read()
If you have to perform more complicated tasks like authentication or passing parameters, I suggest having a look at the requests library.
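For comparison, a minimal sketch of the same fetch using requests (not part of the original snippets):
import requests

response = requests.get('http://python.org/')
html = response.text  # the page body as a string
print(response.status_code)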
The most straightforward would probably be urllib.urlopen if you're using Python 2, or urllib.request.urlopen if you're using Python 3 (you have to do import urllib or import urllib.request first, of course). That way you get a file-like object from which you can read (i.e. f.read()) the HTML document.
Example for python 2:
import urllib

f = urllib.urlopen("http://stackoverflow.com")
http_document = f.read()
f.close()
The good news is that you seem to have done the hard part which is analyzing the html document for links.
You might want to use the bs4 (BeautifulSoup) library.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
You can download bs4 with the following command at the cmd line: pip install BeautifulSoup4
import urllib2
import urlparse
from bs4 import BeautifulSoup
url = "http://www.google.com"
response = urllib2.urlopen(url)
content = response.read()
soup = BeautifulSoup(content, "html.parser")
for link in soup.find_all('a', href=True):
    print urlparse.urljoin(url, link['href'])
You can simply use the combination of requests and BeautifulSoup.
First make an HTTP request using requests to get the HTML content. You will get it as a Python string, which you can manipulate as you like.
Take the HTML content string and pass it to BeautifulSoup, which does all the work of parsing the DOM; then get all the URLs, i.e. the <a> elements.
Here is an example of how to fetch all links from StackOverflow:
import requests
from bs4 import BeautifulSoup, SoupStrainer
response = requests.get('http://stackoverflow.com')
html_str = response.text
bs = BeautifulSoup(html_str, 'html.parser', parse_only=SoupStrainer('a'))

for a_element in bs:
    if a_element.has_attr('href'):
        print(a_element['href'])
Sample output:
/questions/tagged/facebook-javascript-sdk
/questions/31743507/facebook-app-request-dialog-keep-loading-on-mobile-after-fb-login-called
/users/3545752/user3545752
/questions/31743506/get-nuspec-file-for-existing-nuget-package
/questions/tagged/nuget
...