I am trying to extract tables from an HTML document using the xpath module in Python. If I print the downloaded HTML, I see the full DOM as it should be. However, when I use xpath.get, it gives me a tbody section, but not the one I want, and certainly not the only one that should be there. Here is the script:
import requests
from webscraping import download, xpath
D = download.Download()
url = 'http://labs.mementoweb.org/timemap/json/http://www.awebsiteimscraping.com'
r = requests.get(url)
data = []
mementos = r.json()['mementos']['list']
for memento in mementos:
    data.append(D.get(memento['uri']))
# print xpath.get(data[10], '//table')
print type(data[0])
# print data[10]
print len(data)
I'm new to this, so I don't know if it matters, but the type of each element in data is str.
Convert the elements of data from str to dict using json.loads().
Try this:
import requests
import json
from webscraping import download, xpath
D = download.Download()
url = 'http://labs.mementoweb.org/timemap/json/http://www.awebsiteimscraping.com'
r = requests.get(url)
data = []
mementos = r.json()['mementos']['list']
for memento in mementos:
    data.append(D.get(memento['uri']))
# print xpath.get(data[10], '//table')
print type(data[0])
# print data[10]
print len(data)
json_data = [json.loads(d) for d in data]  # loads() takes a single string, so convert element by element
print type(json_data[0])
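For what it's worth, the webscraping package's xpath helper is, as far as I know, a lightweight regex-based matcher rather than a full DOM parser, which could explain the unexpected tbody. Here is a minimal sketch of the same query with lxml.html instead (this assumes lxml is installed; it is a real HTML parser):
import lxml.html

# Parse the first downloaded page with a real DOM parser
# and run the same XPath query against it.
tree = lxml.html.fromstring(data[0])
tables = tree.xpath('//table')
print len(tables)  # how many <table> elements the parser actually sees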
I am trying to parse a .txt file; an example is at the link below.
The file, however, is in the form of HTML. I am trying to get "COMPANY CONFORMED NAME", which is located at the top of the file, and my function should return "Monocle Acquisition Corp".
https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt
Here is what I have tried:
import requests
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html")
However, "soup" does not contain "COMPANY CONFORMED NAME" at all.
Can someone point me in the right direction?
The data you are looking for is not in an HTML structure, so Beautiful Soup is not the best tool. The correct and fast way of searching for this data is with a simple regular expression, like this:
import re
import requests
url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
text_string = r.content.decode()
name_re = re.compile("COMPANY CONFORMED NAME:[\\t]*(.+)\n")
match = name_re.search(text_string).group(1)
print(match)
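The same idea generalizes to other header fields; here is a small sketch (the extra field names below also appear in EDGAR headers, but treat this as illustrative):
# Hypothetical extension: pull several header fields with the same pattern
for field in ["COMPANY CONFORMED NAME", "CENTRAL INDEX KEY", "STATE OF INCORPORATION"]:
    m = re.search(field + r":[\t]*(.+)\n", text_string)
    if m:
        print(field, "->", m.group(1).strip())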
The part you are looking for is inside a huge <SEC-HEADER> tag.
You can get the whole section by using soup.find('sec-header'),
but you will need to parse the section manually. Something like this works, but it's a dirty job:
(view it in replit : https://repl.it/#gui3/stackoverflow-parsing-html)
import requests
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html")
header = soup.find('sec-header').text
company_name = None
for line in header.split('\n'):
    split = line.split(':')
    if len(split) > 1:
        key = split[0]
        value = split[1]
        if key.strip() == 'COMPANY CONFORMED NAME':
            company_name = value.strip()
            break
print(company_name)
There may be a library that can parse this data better than this code does.
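For example, here is a rough sketch that generalizes the loop above into a dict of every key: value pair in the header (assuming the one-field-per-line layout holds throughout):
# Collect all header fields into a dict instead of stopping at one
header_fields = {}
for line in header.split('\n'):
    # partition splits on the first ':' only, so values containing ':' survive
    key, sep, value = line.partition(':')
    if sep and value.strip():
        header_fields[key.strip()] = value.strip()

print(header_fields.get('COMPANY CONFORMED NAME'))  # Monocle Acquisition Corp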
I have a link that represents the exact graph I want to scrape: https://index.minfin.com.ua/ua/economy/index/svg.php?indType=1&fromYear=2010&acc=1
I simply can't tell whether it is XML or an SVG graph, or how to scrape the data from it. I think I need to use bs4 and requests, but I don't know the way to do that.
Could anyone help?
You will load HTML like this:
import requests
url = "https://index.minfin.com.ua/ua/economy/index/svg.php?indType=1&fromYear=2010&acc=1"
resp = requests.get(url)
data = resp.text
Then you will create a BeautifulSoup object from this HTML.
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, features="html.parser")
After this, it is usually very subjective how to parse out what you want; candidate approaches can vary a lot. This is how I did it:
Using BeautifulSoup, I parsed all "rect" elements and checked whether "onmouseover" exists on each one.
rects = soup.svg.find_all("rect")
yx_points = []
for rect in rects:
    if rect.has_attr("onmouseover"):
        text = rect["onmouseover"]
        x_start_index = text.index("'") + 1
        y_finish_index = text[x_start_index:].index("'") + x_start_index
        yx = text[x_start_index:y_finish_index].split()
        print(text[x_start_index:y_finish_index])
        yx_points.append(yx)
I scraped the onmouseover= part and got those 02.2015 155,1 parts.
Here is what yx_points looks like now:
[['12.2009', '100,0'], ['01.2010', '101,8'], ['02.2010', '103,7'], ...]
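If you then need the points as real dates and numbers, here is a small follow-up sketch (assuming the MM.YYYY date format and comma decimal separator hold for every point):
from datetime import datetime

parsed = []
for date_str, value_str in yx_points:
    date = datetime.strptime(date_str, "%m.%Y")   # '12.2009' -> datetime(2009, 12, 1)
    value = float(value_str.replace(",", "."))    # '100,0' -> 100.0
    parsed.append((date, value))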
from bs4 import BeautifulSoup
import requests
import re
#First get all the text from the url.
url="https://index.minfin.com.ua/ua/economy/index/svg.php?indType=1&fromYear=2010&acc=1"
response = requests.get(url)
html = response.text
#Find all the tags in which the data is stored.
soup = BeautifulSoup(html, 'lxml')
texts = soup.findAll("rect")
final = []
for each in texts:
    names = each.get('onmouseover')
    try:
        q = re.findall(r"'(.*?)'", names)
        final.append(q[0])
    except Exception as e:
        print(e)
#The details are appended to the final variable
I am currently writing a web scraper to scrape some reviews. The goal is to scrape reviews over multiple URLs, so I made a list of URLs. I want to retrieve the content of the specific reviews per URL and merge them into one list.
When I only scrape one page, everything works like a charm. However, when I try to scrape multiple pages, I get an error. See the following code plus error:
from lxml import html
from urllib import request
import requests
from datetime import datetime
import dateparser
import csv
import re
links = open('file')
urls = links.readlines()
for url in urls:
    req = requests.get(url)
    tree = html.fromstring(request.urlopen(req).read().decode(encoding="utf-8", errors="ignore"))
    reviews = tree.xpath('//*[@class="review-body"]')
    reviews = [r.text_content() for r in reviews]
    reviews = [r.replace('\n', ' ') for r in reviews]
    reviews = [r.replace('\r', ' ') for r in reviews]
    reviews = [r.replace(' ', '') for r in reviews]
    protocol = req.type
AttributeError: 'Response' object has no attribute 'type'
Can somebody explain to me what this is and how i can solve this?
You need to have the reviews list outside of the for loop.
This way you will fill it while iterating.
You can either:
append the temporary list of reviews in each loop step (temp), which leaves you with nested lists, reviews = [[...], [...]], or
add the temporary list with the += operator, e.g. reviews += temp, which should result in the flat list you probably expect to end up with (see the short demo just below).
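The difference in a nutshell (toy values, just for illustration):
reviews = []
reviews.append(['a', 'b'])  # nested: [['a', 'b']]

reviews = []
reviews += ['a', 'b']       # flat: ['a', 'b']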
Here is a possible resolution:
from lxml import html
from urllib import request
import requests
from datetime import datetime
import dateparser
import csv
import re
links = open('file', 'r')
reviews = []
for url in links:
    req = requests.get(url)
    tree = html.fromstring(req.content.decode(encoding="utf-8", errors="ignore"))
    temp = tree.xpath('//*[@class="review-body"]')
    temp = [r.text_content() for r in temp]
    temp = [r.replace('\n', ' ') for r in temp]
    temp = [r.replace('\r', ' ') for r in temp]
    temp = [r.replace(' ', '') for r in temp]
    reviews += temp
As for the AttributeError, it seems you are trying to access an attribute that doesn't exist.
Edit 1.
links is an iterable that can be consumed line by line, so you do not have to read all the lines into memory at once.
req has content and text attributes. Both hold the page's HTML source; they differ only in type and encoding (raw bytes vs. decoded text).
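A quick illustration (example.com is just a stand-in URL):
req = requests.get('http://example.com')
print(type(req.content))  # <class 'bytes'> - raw bytes as received
print(type(req.text))     # <class 'str'> - decoded with the detected encoding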
I am trying to write an HTML parser in Python that takes as its input a URL or list of URLs and outputs specific data about each of those URLs in the format:
URL: data1: data2
The data points can be found at the exact same HTML node in each of the URLs; they are consistently between the same starting and ending tags. If anyone out there would like to help an amateur Python programmer get the job done, it would be greatly appreciated. Extra points if you can come up with a way to output the information so it can be easily copied and pasted into an Excel document for subsequent data analysis!
For example, let's say I would like to output the view count for a particular YouTube video. For the URL http://www.youtube.com/watch?v=QOdW1OuZ1U0, the view count is around 3.6 million. For all YouTube videos, this number is found in the following format within the page's source:
<span class="watch-view-count ">
3,595,057
</span>
Fortunately, these exact tags are found only once on a particular YouTube video's page. The starting and ending tags can be supplied to the program or built in and modified when necessary. The output of the program would be:
http://www.youtube.com/watch?v=QOdW1OuZ1U0: 3,595,057 (or 3595057).
import urllib2
from bs4 import BeautifulSoup
url = 'http://www.youtube.com/watch?v=QOdW1OuZ1U0'
f = urllib2.urlopen(url)
data = f.read()
soup = BeautifulSoup(data)
span = soup.find('span', attrs={'class':'watch-view-count'})
print '{}:{}'.format(url, span.text)
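Since you mentioned a list of URLs, here is a sketch of the same approach in a loop (the list itself is a placeholder; fill in your own URLs):
urls = ['http://www.youtube.com/watch?v=QOdW1OuZ1U0']  # extend with more video URLs
for url in urls:
    soup = BeautifulSoup(urllib2.urlopen(url).read())
    span = soup.find('span', attrs={'class': 'watch-view-count'})
    print '{}:{}'.format(url, span.text)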
If you do not want to use BeautifulSoup, you can use re:
import urllib2
import re
url = 'http://www.youtube.com/watch?v=QOdW1OuZ1U0'
f = urllib2.urlopen(url)
data = f.read()
pattern = re.compile('<span class="watch-view-count.*?([\d,]+).*?</span>', re.DOTALL)
r = pattern.search(data)
print '{}:{}'.format(url, r.group(1))
As for the outputs, I think you can store them in a csv file.
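For example, a minimal sketch of that CSV output (the filename is arbitrary; Excel opens the result directly):
import csv

with open('view_counts.csv', 'wb') as f:  # 'wb' because this is Python 2's csv module
    writer = csv.writer(f)
    writer.writerow(['url', 'view_count'])
    writer.writerow([url, r.group(1)])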
I prefer HTMLParser over re for this type of task. However, HTMLParser can be a bit tricky. I use module-level mutable objects to store data... I'm sure this is the wrong way of doing it, but it's worked on several projects for me in the past.
import urllib2
from HTMLParser import HTMLParser
import csv
position = []
results = [""]
class hp(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'span' and ('class', 'watch-view-count ') in attrs:
            position.append('bingo')
    def handle_endtag(self, tag):
        if tag == 'span' and 'bingo' in position:
            position.remove('bingo')
    def handle_data(self, data):
        if 'bingo' in position:
            results[0] += " " + data.strip() + " "
my_pages = ["http://www.youtube.com/watch?v=QOdW1OuZ1U0"]
data = []
for url in my_pages:
    response = urllib2.urlopen(url)
    page = str(response.read())
    parser = hp()
    parser.feed(page)
    data.append(results[0])
    # reinitialize module-level containers
    position = []
    results = [""]
index = 0
with open('/path/to/test.csv', 'wb') as f:
    writer = csv.writer(f)
    header = ['url', 'output']
    writer.writerow(header)
    for d in data:
        row = [my_pages[index], data[index]]
        writer.writerow(row)
        index += 1
Then just open /path/to/test.csv in Excel
I've been attempting to write a little scraper in Python using BeautifulSoup.
Everything goes smoothly until I attempt to print (or write to a file) the strings contained inside the various HTML elements. The site I'm scraping is http://www.yellowpages.ca/search/si/1/Boots/Montreal+QC, which contains various French characters. For some reason, when I attempt to print the content in the terminal or into a file, instead of decoding the string like it's supposed to, I'm getting the raw escaped bytes.
Here's the script:
from BeautifulSoup import BeautifulSoup as bs
import urllib as ul
##import re
base_url = 'http://www.yellowpages.ca'
data_file = open('yellow_file.txt', 'a')
data = ul.urlopen(base_url + '/locations/Quebec/Montreal/90014002.html').readlines()
bt = bs(str(data))
result = bt.findAll('div', 'ypgCategory')
bt = bs(str(result))
result = bt.findAll('a')
for tag in result:
    link = base_url + tag['href']
    ##print str(link)
    data = ul.urlopen(link).readlines()
    #data = str(data).decode('latin-1')
    bt = bs(str(data), convertEntities=bs.HTML_ENTITIES, fromEncoding='latin-1')
    titles = bt.findAll('span', 'listingTitle')
    phones = bt.findAll('a', 'phoneNumber')
    entries = zip(titles, phones)
    for title, phone in entries:
        #print title.prettify(encoding='latin-1')
        #data_file.write(title.text.decode('utf-8') + " " + phone.text.decode('utf-8') + "\n")
        print title.text
data_file.close()
And the output of this is: Projets Autochtones Du Qu\xc3\xa9bec
As you can see, the accented e that's supposed to go in Québec isn't displaying. I've tried everything mentioned on SO: calling unicode(), passing fromEncoding to soup, .decode('latin-1'), but I'm getting nothing.
Any ideas?
This should be something like what you want:
from BeautifulSoup import BeautifulSoup as bs
import urllib as ul
base_url = 'http://www.yellowpages.ca'
data_file = open('yellow_file.txt', 'a')
bt = bs(ul.urlopen(base_url + '/locations/Quebec/Montreal/90014002.html'))
for div in bt.findAll('div', 'ypgCategory'):
    for a in div.findAll('a'):
        link = base_url + a['href']
        bt = bs(ul.urlopen(link), convertEntities=bs.HTML_ENTITIES)
        titles = bt.findAll('span', 'listingTitle')
        phones = bt.findAll('a', 'phoneNumber')
        for title, phone in zip(titles, phones):
            line = '%s %s\n' % (title.text, phone.text)
            data_file.write(line.encode('utf-8'))
            print line.rstrip()
data_file.close()
Who told you to use latin-1 to decode something that is UTF-8? (It is clearly specified in the meta tag.)
If you are on Windows you may have problems outputting Unicode to the console, so it is better to test by writing to text files first.
If you open a file in text mode, do not write binary data to it. Either:
codecs.open(...,"w","utf-8").write(unicode_str)
or:
open(...,"wb").write(unicode_str.encode("utf_8"))