So I'm webscraping a page (http://canoeracing.org.uk/marathon/results/burton2016.htm) where there are multiline cells in tables:
I'm using the following code to scrape each column (the one below so happens to scrape the names):
import lxml.html
from lxml.cssselect import CSSSelector
# get some html
import requests
r = requests.get('http://canoeracing.org.uk/marathon/results/burton2016.htm')
# build the DOM Tree
tree = lxml.html.fromstring(r.text)
# construct a CSS Selector
sel1 = CSSSelector('body > table > tr > td:nth-child(2)')
# Apply the selector to the DOM tree.
results1 = sel1(tree)
# get the text out of all the results
data1 = [result.text for result in results1]
Unfortunately it's only returning the first name from each cell, not both. I've tried a similar thing with the web-scraping tool Kimono and was able to scrape both, but I want to set up Python code, as Kimono falls down when running over multiple webpages.
The problem is that some of the cells contain multiple text nodes delimited by a <br>. In cases like this, find all text nodes and join them:
data1 = [", ".join(result.xpath("text()")) for result in results1]
For the provided rows in the screenshot, you would get:
OSCAR HUISSOON, FREJA WEBBER
ELLIE LAWLEY, RHYS TIPPINGS
ALLISON MILES, ALEX MILES
NICOLA RUDGE, DEBORAH CRUMP
You could have also used the .text_content() method, but you would lose the delimiter between the text nodes, getting things like OSCAR HUISSOONFREJA WEBBER in the result.
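To see why joining the individual text nodes matters, here is a minimal, self-contained sketch of the same idea using only the standard library's html.parser (the sample HTML is made up for illustration; the original answer uses lxml):

```python
from html.parser import HTMLParser

# A <td> whose text is split by <br> yields two separate text nodes;
# collect them per cell and join, so both names survive.
class CellText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.cells = []       # one list of text nodes per <td>
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append([])

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and data.strip():
            self.cells[-1].append(data.strip())

parser = CellText()
parser.feed("<table><tr><td>OSCAR HUISSOON<br>FREJA WEBBER</td></tr></table>")
print([", ".join(cell) for cell in parser.cells])
# ['OSCAR HUISSOON, FREJA WEBBER']
```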
Related
I am working on a project where I am crawling thousands of websites to extract text data, the end use case is natural language processing.
EDIT: *since I am crawling hundreds of thousands of websites I cannot tailor scraping code to each one, which means I cannot search for specific element IDs; the solution I am looking for is a general one*
I am aware of solutions such as the .get_text() function from Beautiful Soup. The issue with this method is that it gets all the text from the website, much of it irrelevant to the main topic of that particular page. For the most part a page is dedicated to a single main topic, but along the sides, top, and bottom there may be links or text about other subjects, promotions, or other content.
The .get_text() function returns all the text on the page in one go; the problem is that it combines the relevant parts with the irrelevant ones. Is there another function similar to .get_text() that returns all the text, but as a list where every list item is a specific section of the text? That way it can be known where new subjects start and end.
As a bonus, is there a way to identify the main body of text on a web page?
Below are some snippets you could use to query the data the way you want using BeautifulSoup4 and Python 3:
import requests
from bs4 import BeautifulSoup
response = requests.get('https://yoursite/page')
soup = BeautifulSoup(response.text, 'html.parser')
# Print the body's contents in list form
print(soup.body.contents)
# Print the first found div on html page
print(soup.find('div'))
# Print all divs on the html page in list form
print(soup.find_all('div'))
# Print the element with 'required_element_id' id
print(soup.find(id='required_element_id'))
# Print all html elements in list form that match the selector
print(soup.select('.your-css-selector'))
# Print the value of 'attribute-name' on the matched element
print(soup.find(id='someid').get("attribute-name"))
# You can also break your one large query into multiple queries
parent = soup.find(id='someid')
# getText() returns the text between the opening and closing tags
print(parent.select(".some-class")[0].getText())
For more advanced requirements, you can check out Scrapy as well. Let me know if you face any challenges implementing this or if your requirement is something else.
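As for the original ask of getting the text back as a list with one item per section, here is a minimal sketch using only the standard library's html.parser; the block-level tag set and the sample HTML are assumptions for illustration, not part of any library's API:

```python
from html.parser import HTMLParser

# Group text by block-level element, so each list item is one "section"
# rather than one undifferentiated get_text() blob.
class SectionText(HTMLParser):
    BLOCKS = {"p", "div", "section", "article", "li", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.sections = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCKS:
            self._flush()          # a new block starts a new section

    def handle_data(self, data):
        if data.strip():
            self._buf.append(data.strip())

    def _flush(self):
        if self._buf:
            self.sections.append(" ".join(self._buf))
            self._buf = []

    def close(self):
        super().close()
        self._flush()              # emit the trailing section

p = SectionText()
p.feed("<div>Main topic text.</div><div>Sidebar promo.</div>")
p.close()
print(p.sections)
# ['Main topic text.', 'Sidebar promo.']
```

A heuristic like "the section with the most text is the main body" can then be applied to the resulting list, though dedicated libraries handle that more robustly.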
I'm trying to extract the text from an element whose class value contains compText. The problem is that it extracts everything but the text that I want.
The CSS selector identifies the element correctly when I use it in the developer tools.
I'm trying to scrape the text that appears in Yahoo SERP when the query entered doesn't have results.
If my query is (quotes included) "klsf gl glkjgsdn lkgsdg", nothing is displayed except the complementary text "We did not find results blabla", and the selector extracts the data correctly.
If my query is (quotes included) "based specialty. Blocks. Organosilicone. Reference". Yahoo will add ads because of the keyword "Organosilicone" and that triggers the behavior described in the first paragraph.
Here is the code:
import requests
from bs4 import BeautifulSoup
url = "http://search.yahoo.com/search?p="
query = '"based specialty chemicals. Blocks. Organosilicone. Reference"'
r = requests.get(url + query)
soup = BeautifulSoup(r.text, "html.parser")
for EachPart in soup.select('div[class*="compText"]'):
    print(EachPart.text)
What could be wrong?
Thx,
EDIT: The text extracted seems to be the definition of the word "Organosilicone", which I can find on the SERP.
EDIT2: This is a snippet of the text I get: "The products created and produced by ‘Specialty Chemicals’ member companies, many of which are Small and Medium Enterprises, stem from original and continuous innovation. They drive the low-carbon, resource-efficient and knowledge based economy of the future." and a screenshot of the SERP when I use my browser
I am trying to scrape column names (player, cost, sel., form, pts) from the page below:
https://fantasy.premierleague.com/a/statistics/total_points
However, I am failing to do so.
Before I go further, let me show you what I have done.
from lxml import html
import requests
page = 'https://fantasy.premierleague.com/a/statistics/total_points'
#Take site and structure html
page = requests.get(page)
tree = html.fromstring(page.content)
#Using the page's CSS classes, extract all links pointing to a team
Location = tree.cssselect('.ism-thead-bold tr .ism-table--el-stats__name')
When I do this, Location should be a list that contains a string "Player".
However, it returns an empty list which means cssselect did not capture anything.
Though each column name has a different 'th class', I used one of them (ism-table--el-stats__name) for this specific trial just to make it simple.
When this problem is fixed, I want to use regex since every class has different suffix after two underscores.
If anyone can help me with these two tasks, I would really appreciate it!
thank you guys.
I've written a script in Python using XPath to parse tabular data from a webpage. Upon execution, it parses the data from the tables flawlessly. The only thing I can't fix is parsing the table headers, i.e. the th tags. If I were doing the same with a CSS selector, I could have used .cssselect("th,td"), but with XPath I got stuck. Any help on how to parse the data from the th tags as well would be highly appreciated.
Here is the script which is able to fetch everything from different tables except for the data within th tag:
import requests
from lxml.html import fromstring
response = requests.get("https://fantasy.premierleague.com/player-list/")
tree = fromstring(response.text)
for row in tree.xpath("//*[@class='ism-table']//tr"):
    tab_d = row.xpath('.//td/text()')
    print(tab_d)
I'm not sure I get your point, but if you want to fetch both th and td nodes with a single XPath expression, you can try replacing
tab_d = row.xpath('.//td/text()')
with
tab_d = row.xpath('.//*[name()="th" or name()="td"]/text()')
Change
.//td/text()
to
.//*[self::td or self::th]/text()
to include th elements too.
Note that it would be reasonable to assume that both td and th are immediate children of the tr context node, so you might further simplify your XPath to this:
*[self::td or self::th]/text()
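As an aside, if you ever need the same th-or-td union with only the standard library, `xml.etree.ElementTree`'s limited XPath subset has no `self::` axis, but the filter is easy to express in plain Python (the sample row below is made up for illustration):

```python
import xml.etree.ElementTree as ET

# ElementTree's XPath subset can't do self::td or self::th,
# so filter the row's children by tag name instead.
row = ET.fromstring("<tr><th>Player</th><td>Alonso</td><td>7.1</td></tr>")
tab_d = [cell.text for cell in row if cell.tag in ("th", "td")]
print(tab_d)
# ['Player', 'Alonso', '7.1']
```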
I'm learning Python requests and BeautifulSoup. I've managed to write a script that logs in to a site and scrapes a table. Here's the code:
soup = BeautifulSoup(req.content, "lxml")
table = soup.find_all('table', attrs={'class': 'griglia_tab', 'id':'data_table'})[2]
print(table.text)
When I run the script I get the desired output, but there are a lot of empty rows between the values. How can I delete them and maybe output the values in a proper way?
If you were trying to scrape this page, here's an approach that should work: use re to substitute runs of consecutive newlines with a single newline (remember to import re first).
print(re.sub(r'\n\s*\n', '\n', table.text.strip()))
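To see what that substitution does, here is a self-contained sketch on a made-up string with the kind of blank runs the question describes:

```python
import re

# Raw table text with runs of blank (whitespace-only) lines between values.
raw = "Name\n\n\n  \nValue\n\nEnd"

# Collapse each run of newlines-plus-whitespace into a single newline.
cleaned = re.sub(r'\n\s*\n', '\n', raw.strip())
print(cleaned)
# Name
# Value
# End
```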
Edit:
A few suggestions for your code:
An id attribute will uniquely identify a single table in the page. Since the table you need to print has one, using 'class': 'griglia_tab' is unnecessary. You can do away with find_all as well, and use find instead.
So, replace your code to assign to table with this:
table = soup.find('table', attrs={'id':'data_table'})