I am parsing an .html file using BeautifulSoup4, doing the following:
data = [item.text.strip() for item in soup.find_all('span')]
The code takes all the items in a given table and stores them in data. I noticed some of the elements in data contain text with what seems like HTML entity encoding. An example element:
data[5] stores 'CSCI-GA.1144-\u200b001'
The text I expected was just 'CSCI-GA.1144-001'.
In the HTML file, I find it as 'CSCI-GA.1144-001'.
Why does it show up differently when I parse it versus when I inspect the HTML code? And how do I parse the data so these characters are not included? Is there a way to exclude them?
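For what it's worth, '\u200b' is the Unicode zero-width space, an actual character in the text rather than an HTML entity; it renders invisibly, which is why the page looks normal when you inspect it. A minimal sketch of filtering it out during extraction, assuming it is the only such character involved:

data = [item.text.strip().replace('\u200b', '') for item in soup.find_all('span')]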
I have a test PDF file with just a 3x3 table that is marked up properly with table headings and the like. What I want to do is extract the format of the table, like so:
left | center | right
One | Two | Three
If that table was in the PDF, I want to be able to know programmatically that the table has three headers ("left", "center", "right") and one row of data ("One", "Two", "Three").
I am using fitz (PyMuPDF), and when I use this code:
for page in doc:
    tp = page.get_textpage()   # text page for the current page
    html = tp.extractHTML()    # HTML format
    print(html)
It seems to discard all the actual table markup and replace it with just paragraph and div tags. What am I doing wrong?
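If what you need is the table structure rather than markup, it seems extractHTML() only reconstructs text spans as paragraphs and divs and does not detect tables. One option, sketched below under the assumption of a recent PyMuPDF (1.23+, where Page.find_tables() exists), is to let fitz find the tables directly:

import fitz  # PyMuPDF

doc = fitz.open('test.pdf')  # hypothetical file name
for page in doc:
    for table in page.find_tables().tables:
        rows = table.extract()      # list of rows, each a list of cell strings
        print('headers:', rows[0])  # e.g. ['left', 'center', 'right']
        print('data:', rows[1:])    # e.g. [['One', 'Two', 'Three']]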
I have a problem while trying to access some values on a website while web scraping. The problem is that the text I want to extract is in a class that contains several pieces of text separated by <b> tags (these <b> tags also hold text that is important to me).
So firstly, I tried to look for the <b> tag with the text I needed ('Category' in this case) and then extract the exact category from the text below that tag. I could use a precise XPath, but that won't work here because the other pages I need to scrape have a different number of rows in this sidebar, so the locations, and thus the XPaths, differ.
The expected output is 'utility' - the category in the sidebar.
The website and the text I need to extract look like this (see the sidebar containing 'Category'); the element looks like this:
And the code I tried:
driver = webdriver.Safari()
driver.get('https://www.statsforsharks.com/entry/MC_Squares')
element = driver.find_elements_by_xpath("//b[contains(text(), 'Category')]/following-sibling")
for value in element:
    print(value.text)
driver.close()
The link to the page with the data is https://www.statsforsharks.com/entry/MC_Squares.
Thank you!
You might be better off using regex here, as the whole text comes under the 'company-sidebar-body' class, where only some of the text is between <b> tags and some is not.
So, you can get the text of the class first:
sidebartext = driver.find_element_by_class_name("company-sidebar-body").text
That will give you the following:
"EOY Proj Sales: $1,000,000\r\nSales Prev Year: $200,000\r\nCategory: Utility\r\nAsking Deal\r\nEquity: 10%\r\nAmount: $300,000\r\nValue: $3,000,000\r\nEquity Deal\r\nSharks: Kevin O'Leary\r\nEquity: 25%\r\nAmount: $300,000\r\nValue: $1,200,000\r\nBite: -$1,800,000"
You can then use regex to target the category:
import re

c = re.search(r"Category:\s\w+", sidebartext).group()
print(c)
c will be 'Category: Utility', which you can then work with. This will also work if the value of the category ('Utility') is different on other pages.
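If you only want the value itself, you can then split off the label (a small follow-up sketch):

category = c.split(': ', 1)[1]  # 'Utility'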
There are easier ways when it's a MediaWiki website. You could, for instance, access the page data through the API with a JSON request and parse it with a much more limited DOM.
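For example, something along these lines; this is only a sketch, and it assumes the standard MediaWiki Action API is exposed at /api.php on this site, which you would need to verify:

import requests

api = 'https://www.statsforsharks.com/api.php'  # assumed endpoint
params = {'action': 'parse', 'page': 'MC_Squares', 'format': 'json'}
data = requests.get(api, params=params).json()
page_html = data['parse']['text']['*']  # rendered page HTML, ready for a small DOM parse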
Any particular reason you want to scrape my website?
I am working on a project where I am crawling thousands of websites to extract text data; the end use case is natural language processing.
EDIT: Since I am crawling hundreds of thousands of websites, I cannot tailor scraping code to each one, which means I cannot search for specific element IDs; the solution I am looking for is a general one.
I am aware of solutions such as the .get_text() function from Beautiful Soup. The issue with this method is that it gets all the text from the website, much of it irrelevant to the main topic of that particular page. For the most part a page will be dedicated to a single main topic, but the sides, top, and bottom may carry links or text about other subjects, promotions, or other content.
The .get_text() function returns all the text on the page in one go; the problem is that it combines everything, the relevant parts with the irrelevant ones. Is there another function similar to .get_text() that returns all the text, but as a list where every list item is a specific section of the text, so that it can be known where new subjects start and end?
As a bonus, is there a way to identify the main body of text on a web page?
Below are snippets you could use to query the data the way you want, using BeautifulSoup4 and Python 3:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://yoursite/page')
soup = BeautifulSoup(response.text, 'html.parser')

# Print the body's children in list form
print(soup.body.contents)

# Print the first div found on the page
print(soup.find('div'))

# Print all divs on the page in list form
print(soup.find_all('div'))

# Print the element with id 'required_element_id'
print(soup.find(id='required_element_id'))

# Print all elements matching a CSS selector, in list form
print(soup.select('your-css-selector'))

# Print the value of an attribute
print(soup.find(id='someid').get('attribute-name'))

# You can also break one large query into multiple queries
parent = soup.find(id='someid')

# getText() returns the text between an element's opening and closing tags
print(parent.select('.some-class')[0].getText())
For more advanced requirements, you can check out Scrapy as well. Let me know if you face any challenge implementing this or if your requirement is something else.
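As for identifying the main body of text, there is no exact rule, but a rough heuristic (a sketch only; it assumes the page uses semantic containers such as <main> or <article>) is to prefer those containers and collect text per paragraph, so each list entry marks a section boundary:

# Fall back through progressively less specific containers
main = soup.find('main') or soup.find('article') or soup.body

# One list entry per paragraph keeps section boundaries visible
sections = [p.get_text(' ', strip=True) for p in main.find_all('p')]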
I want to scrape the data at this link:
http://www.realclearpolitics.com/epolls/json/5491_historical.js?1453388629140&callback=return_json
I am not sure what type of link this is: is it HTML, JSON, or something else? Sorry for my poor web knowledge. I tried to use the following code to scrape it:
import requests

url = 'http://www.realclearpolitics.com/epolls/json/5491_historical.js?1453388629140&callback=return_json'
source = requests.get(url).text
The type of source is unicode. I also tried to use urllib2 to scrape it, like:
source2 = urllib2.urlopen(url).read()
The type of source2 is str. I am not sure which method is better, because the link is not like a normal webpage containing different tags. If I want to clean the scraped data and form a dataframe (like a pandas DataFrame), what method or process should I follow?
Thanks.
The returned response is text with valid JSON data inside it. You can validate it yourself using a service like http://jsonlint.com/ if you want; to do so, just copy the code within the brackets of
return_json("JSON code to copy")
In order to make use of that data you just need to parse it in your program. Here's an example: https://docs.python.org/2/library/json.html
The response is text, but it does contain JSON; you just need to extract it:

import json

# The body looks like return_json({...}); so drop the wrapper around the JSON
strip_len = len("return_json(")
source = requests.get(url).text[strip_len:-2]
source = json.loads(source)
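If you then want a pandas DataFrame, something like pd.json_normalize can flatten the parsed object into columns; this is only a sketch, since the exact keys depend on the structure of this particular feed, which you should inspect first:

import pandas as pd

df = pd.json_normalize(source)  # flattens nested dicts into columns
print(df.head())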
I have an HTML file (encoded in UTF-8). I open it with codecs.open(). The file's structure is:
<html>
  // header
  <body>
    // some text
    <table>
      // some rows with cells here
      // some cells contain tables
    </table>
    // maybe some text here
    <table>
      // a form and other stuff
    </table>
    // probably some more text
  </body>
</html>
I need to retrieve only the first table (discarding the one with the form), omitting everything before the first <table> and after the corresponding </table>. Some cells also contain paragraphs, bold text, and scripts. There is no more than one nested table per row of the main table.
How can I extract it to get a list of rows, where each element holds the plain (unicode string) data of a cell, and a list of rows for each nested table? There's no more than one level of nesting.
I tried HTMLParser, PyParsing, and the re module, but couldn't get this working.
I'm quite new to Python.
Try Beautiful Soup.
In principle you need to use a real parser (which Beautiful Soup is); regex cannot deal with nested elements, for computer-science reasons (finite state machines can't parse context-free grammars, IIRC).
You may like lxml. I'm not sure I fully understood what you want to do with that structure, but maybe this example will help...
import lxml.html

def process_row(row):
    # Yield each cell's plain text, or a list of parsed
    # sub-tables for cells that contain a nested table.
    for cell in row.xpath('./td'):
        inner_tables = cell.xpath('./table')
        if len(inner_tables) < 1:
            yield cell.text_content()
        else:
            yield [process_table(t) for t in inner_tables]

def process_table(table):
    # Materialize each row's generator so the result is plain nested lists.
    return [list(process_row(row)) for row in table.xpath('./tr')]

html = lxml.html.parse('test.html')
first_table = html.xpath('//body/table[1]')[0]
data = process_table(first_table)
If the HTML is well-formed you can parse it into a DOM tree and use XPath to extract the table you want. I usually use lxml for parsing XML, and it can parse HTML as well.
The XPath for pulling out the first table in the document would be "(//table)[1]" (note the parentheses; a plain "//table[1]" would match every table that is the first table child of its parent, including nested ones).
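A minimal sketch with lxml (the file name is a placeholder):

import lxml.html

tree = lxml.html.parse('page.html')
first_table = tree.xpath('(//table)[1]')[0]  # parenthesized: first table in the whole document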