Delete empty rows from print output in Python

I'm learning Python requests and BeautifulSoup. I've managed to write a script that logs in to a site and scrapes a table. Here's the code:
soup = BeautifulSoup(req.content, "lxml")
table = soup.find_all('table', attrs={'class': 'griglia_tab', 'id':'data_table'})[2]
print(table.text)
When I run the script I get the desired output, but there are a lot of empty rows between the values. How can I delete them and format the output properly?

If you were trying to scrape this page, here's code that might work. You'll need to use re to substitute multiple consecutive newlines with a single newline:
print(re.sub(r'\n\s*\n', '\n', table.text.strip()))
Edit:
A few suggestions on your code:
An id attribute uniquely identifies a single element in the page. Since the table you need to print has one, the 'class': 'griglia_tab' filter is unnecessary. You can do away with find_all as well and use find instead.
So, replace your code to assign to table with this:
table = soup.find('table', attrs={'id':'data_table'})
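Putting both suggestions together, a minimal end-to-end sketch (assuming req is the logged-in response from your session code, which isn't shown here):
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(req.content, "lxml")  # req: the logged-in response object
table = soup.find('table', attrs={'id': 'data_table'})  # the id alone is enough

# Collapse runs of blank lines into single newlines before printing
print(re.sub(r'\n\s*\n', '\n', table.text.strip()))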

Related

Trying to find HREF from table with Selenium in Python

Webscraping a table into an Excel file. It's a "dynamic" table showing 10 rows per page.
All the data is placed into Excel correctly, but I'm having issues with the HREF data.
The issue I am facing is that some rows don't have an HREF. I am using the following XPath:
map = driver.find_elements(By.XPATH, '//*[@id="table_1"]/tbody//td[12]/a')
To get the HREF:
.get_attribute("href")[30:].split(",%20")[0]
.get_attribute("href")[30:].split(",%20")[1]
Via the above XPath I can find every HREF, but when a row has no HREF, the following row's HREF data is placed into the row where no HREF data should be.
I tried the below (without the "/a"), but it returns nothing:
map_test = driver.find_elements(By.XPATH, '//*[@id="table_1"]/tbody//td[12]')
When the code below is used, it returns the text content, which is not what I need, but it keeps the data where it should be.
.get_attribute("textContent")
Any idea how I can find the HREFs and keep the data in the rows where it belongs?
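One way to keep the rows aligned, sketched here without access to the actual page, is to select the td cells themselves and then look for an anchor inside each one, so a cell with no link still produces an entry:
from selenium.webdriver.common.by import By

# One result per row: select the cells, not the anchors
cells = driver.find_elements(By.XPATH, '//*[@id="table_1"]/tbody//td[12]')

hrefs = []
for cell in cells:
    links = cell.find_elements(By.TAG_NAME, 'a')  # empty list when the cell has no link
    hrefs.append(links[0].get_attribute('href') if links else None)
Because find_elements returns an empty list rather than raising, a missing link shows up as None in hrefs and the data in the remaining rows stays in position.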

Can't parse data from `th` tag along with `td` tag from different tables

I've written a script in Python using XPath to parse tabular data from a webpage. Upon execution, it parses the data from the tables flawlessly. The only thing I can't fix is parsing the table headers, that is, the th tags. If I were doing the same with CSS selectors, I could have used .cssselect("th,td"), but with XPath I got stuck. Any help on how I can parse the data from the th tags as well would be highly appreciated.
Here is the script which is able to fetch everything from different tables except for the data within th tag:
import requests
from lxml.html import fromstring
response = requests.get("https://fantasy.premierleague.com/player-list/")
tree = fromstring(response.text)
for row in tree.xpath("//*[@class='ism-table']//tr"):
    tab_d = row.xpath('.//td/text()')
    print(tab_d)
I'm not sure I get your point, but if you want to fetch both th and td nodes with a single XPath, you can try to replace
tab_d = row.xpath('.//td/text()')
with
tab_d = row.xpath('.//*[name()="th" or name()="td"]/text()')
Change
.//td/text()
to
.//*[self::td or self::th]/text()
to include th elements too.
Note that it would be reasonable to assume that both td and th are immediate children of the tr context node, so you might further simplify your XPath to this:
*[self::td or self::th]/text()
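Applied to the original script, the whole thing would look like this (a sketch using the self:: form):
import requests
from lxml.html import fromstring

response = requests.get("https://fantasy.premierleague.com/player-list/")
tree = fromstring(response.text)
for row in tree.xpath("//*[@class='ism-table']//tr"):
    # picks up header (th) and data (td) cells in document order
    tab_d = row.xpath('*[self::td or self::th]/text()')
    print(tab_d)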

soup.find("div", id = "tournamentTable"), None returned - python 2.7 - BS 4.5.1

I'm trying to parse the following page: http://www.oddsportal.com/soccer/france/ligue-1-2015-2016/results/
The part I'm interested in is getting the table along with the scores and odds.
The code I have so far:
url = "http://www.oddsportal.com/soccer/france/ligue-1-2015-2016/results/"
req = requests.get(url, timeout = 9)
soup = BeautifulSoup(req.text)
print soup.find("div", id = "tournamentTable"), soup.find("#tournamentTable")
>>> <div id="tournamentTable"></div> None
Very simple, but I'm weirdly stuck at finding the table in the tree. Although I found already-prepared datasets, I would like to know why the printed results are a tag and None.
Any ideas?
Thanks
First, this page uses JavaScript to fetch data. If you disable JS in your browser, you will notice that the div tag exists but there is nothing in it, so the first find will print a single empty tag.
Second, # is a CSS selector; you cannot use it in find(). The BeautifulSoup docs say:
Any argument that's not recognized will be turned into a filter on one
of a tag's attributes.
So the second find looks for a tag with #tournamentTable as an attribute, nothing matches, and it returns None.
It looks like the table gets populated with an Ajax call back to the server. That is why when you print soup.find("div", id = "tournamentTable") you get only the empty tag. When you print soup.find("#tournamentTable"), you get None because that is trying to find an element with the tag name "#tournamentTable". If you want to use CSS selectors, you should use soup.select(), like this: soup.select('#tournamentTable'), or soup.select('div#tournamentTable') if you want to be even more particular.
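To make the distinction concrete, a minimal sketch of the two APIs:
# Attribute filter: keyword arguments, no CSS punctuation
empty_div = soup.find("div", id="tournamentTable")

# CSS selector: only works with select()
matches = soup.select("#tournamentTable")  # returns a list of matching tags
Either way you only get the empty placeholder div; the rows themselves arrive via the later Ajax request, so you would need something like Selenium, or the underlying XHR endpoint, to see the actual data.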

Webscraping multiline cells in tables using CSS Selectors and Python

So I'm webscraping a page (http://canoeracing.org.uk/marathon/results/burton2016.htm) that has multiline cells in its tables.
I'm using the following code to scrape each column (the one below happens to scrape the names):
import lxml.html
from lxml.cssselect import CSSSelector
# get some html
import requests
r = requests.get('http://canoeracing.org.uk/marathon/results/burton2016.htm')
# build the DOM Tree
tree = lxml.html.fromstring(r.text)
# construct a CSS Selector
sel1 = CSSSelector('body > table > tr > td:nth-child(2)')
# Apply the selector to the DOM tree.
results1 = sel1(tree)
# get the text out of all the results
data1 = [result.text for result in results1]
Unfortunately it's only returning the first name from each cell, not both. I've tried a similar thing with the webscraping tool Kimono and was able to scrape both, but I want to set up Python code since Kimono falls down when running over multiple webpages.
The problem is that some of the cells contain multiple text nodes delimited by a <br>. In cases like this, find all text nodes and join them:
data1 = [", ".join(result.xpath("text()")) for result in rows]
For the provided rows in the screenshot, you would get:
OSCAR HUISSOON, FREJA WEBBER
ELLIE LAWLEY, RHYS TIPPINGS
ALLISON MILES, ALEX MILES
NICOLA RUDGE, DEBORAH CRUMP
You could have also used the .text_content() method, but you would lose the delimiter between the text nodes, getting things like OSCAR HUISSOONFREJA WEBBER in the result.
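Put together with the original script, the fixed version might look like this (same URL and selector as in the question):
import requests
import lxml.html
from lxml.cssselect import CSSSelector

r = requests.get('http://canoeracing.org.uk/marathon/results/burton2016.htm')
tree = lxml.html.fromstring(r.text)

sel1 = CSSSelector('body > table > tr > td:nth-child(2)')
results1 = sel1(tree)

# xpath("text()") returns every text node in the cell, including those
# after each <br>, so joining them keeps all the names
data1 = [", ".join(result.xpath("text()")) for result in results1]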

How do I find all rows in one table with BeautifulSoup

I'm trying to write my first parser with BeautifulSoup (BS4) and hitting a conceptual issue, I think. I haven't done much with Python -- I'm much better at PHP.
I can get BeautifulSoup to find the table I want, but when I try to step into the table and find all the rows, I get some variation on:
AttributeError: 'ResultSet' object has no attribute 'attr'
I tried walking through the sample code at How do I draw out specific data from an opened url in Python using urllib2? and got more or less the same error (note: if you want to try it you'll need a working URL.)
Some of what I'm reading says that the issue is that the ResultSet is a list. How would I know that? If I do print type(table) it just tells me <class 'bs4.element.ResultSet'>
I can find text in the table with:
for row in table:
    text = ''.join(row.findAll(text=True))
    print text
but if I try to search for HTML with:
for row in table:
    text = ''.join(row.find_all('tr'))
    print text
It complains about expected string, Tag found. So how do I wrangle this string (which is a string full of HTML) back into a BeautifulSoup object that I can parse?
BeautifulSoup data types are bizarre, to say the least. A lot of times they don't give enough information to easily piece the puzzle together. I know your pain! Anyway, on to my answer...
It's hard to provide a completely accurate example without seeing more of your code or knowing the actual site you're attempting to scrape, but I'll do my best.
The problem is your ''.join(). .findAll('tr') returns a list of BeautifulSoup Tag elements, not strings; that's how BS represents the trs it finds. Because of this, you're passing the wrong datatype to your ''.join().
You should add one more level of iteration. (I'm assuming there are td tags within the trs.)
text_list = []
for row in table:
    table_row = row('tr')  # calling a Tag is shorthand for find_all('tr')
    for table_data in table_row:
        td = table_data('td')  # likewise, shorthand for find_all('td')
        for td_contents in td:
            content = td_contents.contents[0]
            text_list.append(content)
text = ' '.join(str(x) for x in text_list)
This returns the entire table content as a single string. You can refine the value of text by changing where text_list is initialized and where text is built.
This probably looks like more code than is required, and that might be true, but I've found my scrapes to be much more thorough and accurate when I go about it this way.
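For comparison, a shorter sketch of the same idea, assuming the table came from find() (a single Tag) rather than find_all() (a ResultSet):
# find() returns one Tag, which can be searched directly
table = soup.find('table')

for row in table.find_all('tr'):
    # get_text(strip=True) flattens each cell to clean text
    cells = [td.get_text(strip=True) for td in row.find_all(['td', 'th'])]
    print(' '.join(cells))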
