Extract values with a condition using XPath and Python

This is an extraction of an HTML file from http://www.flashscore.com/hockey/sweden/shl/results/
<td title="Click for match detail!" class="cell_sa score bold">4:3<br><span class="aet">(3:3)</span></td>
<td title="Click for match detail!" class="cell_sa score bold">2:5</td>
I would now like to extract the scores after regulation time.
This means whenever 'span class = "aet"' is present (after td class="cell_sa score bold") I need to get the text from span class = "aet". If span class = "aet" is not present I would like to extract the text from td class="cell_sa score bold".
In the above case the desired output (in a list) would be:
['3:3', '2:5']
How could I go about that with XPath statements in Python?

You can reach the text nodes of the desired tags, while obeying the conditions you defined, with this:
(/tr/td[count(./span[@class = 'aet']) > 0]/span[@class = 'aet'] | /tr/td[0 = count(./span[@class = 'aet'])])/text()
I assumed the <td> tags are grouped in a <tr> tag.
If you want to strictly select only <td>s having 'cell_sa', 'score' and 'bold', add [contains(@class, 'cell_sa')][contains(@class, 'score')][contains(@class, 'bold')] after each td, as below:
(/tr/td[contains(@class, 'cell_sa')][contains(@class, 'score')][contains(@class, 'bold')][count(./span[@class = 'aet']) > 0]/span[@class = 'aet'] | /tr/td[contains(@class, 'cell_sa')][contains(@class, 'score')][contains(@class, 'bold')][0 = count(./span[@class = 'aet'])])/text()
As you can see, I tried to implement the @class check order-independently and loosely (just as it is in a CSS selector). You could implement this check as a simple string comparison, but that results in a fragile data consumer.
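For completeness, here is a sketch of running that XPath with lxml. The fragment is condensed from the question and wrapped in a table so the HTML parser keeps the row structure; flashscore's live markup may differ.

```python
from lxml import html

# Minimal fragment assumed from the question (wrapped in a <table> so the
# lenient HTML parser preserves the <tr>/<td> structure)
fragment = """<table><tr>
<td title="Click for match detail!" class="cell_sa score bold">4:3<br/><span class="aet">(3:3)</span></td>
<td title="Click for match detail!" class="cell_sa score bold">2:5</td>
</tr></table>"""

tree = html.fromstring(fragment)
raw = tree.xpath(
    "(//tr/td[count(./span[@class = 'aet']) > 0]/span[@class = 'aet']"
    " | //tr/td[0 = count(./span[@class = 'aet'])])/text()"
)
# The <span class="aet"> text comes wrapped in parentheses, so strip them
scores = [s.strip('()') for s in raw]
print(scores)  # ['3:3', '2:5']
```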

Related

How can I use Beautiful Soup to webscrape also the URLs to be used in my pandas dataframe

I am attempting to webscrape the following URL: http://eecs.qmul.ac.uk/postgraduate/programmes
I have the following code:
# Create a loop to look for the td tags and print the rows
for row in rows:
    row_td = row.find_all('td')
    row_url = row.find_all('a')
    print(row_td)
    type(row_td)
This produces the following result:
[<td>Artificial Intelligence</td>, <td style="text-align: center;">I4U2 </td>, <td style="text-align: center;">I4U1 </td>]
[<td>Artificial Intelligence with Machine Learning (January 2022 Entry Only)</td>, <td style="text-align: center;"> </td>, <td style="text-align: center;">I4U8</td>]
[<td><span>Big Data Science</span></td>, <td style="text-align: center;">H6J6</td>, <td style="text-align: center;">H6J7</td>]
[<td><span>Big Data Science with Machine Learning Systems (January 2022 Entry Only)</span></td>, <td style="text-align: center;"> </td>, <td style="text-align: center;">I4U7</td>]
As you can see, for each course there is a title, a URL and code for full time, and then a URL and code for part time on the subsequent line. I would like to record these so it would read:
Artificial Intelligence, https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/artificial-intelligence-msc, I4U2, https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/artificial-intelligence-msc/, I4U1.
I have used the following code to clean up the rows so that I can achieve this; however, it isn't including the URLs.
INPUT:
str_cells = str(row_td)
cleantext = BeautifulSoup(str_cells).get_text()
print(cleantext)
OUTPUT:
[Digital and Technology Solutions (Apprenticeship), I4DA, ]
Would you be able to assist so that I can include the URLs in the output as well please?
I have found out that I can select the URLs using soup.find_all("a") but I cannot work out how to combine this with the code above.
Thank you.
EDIT:
I have amended my code, using the suggestion below, however, when I am attempting to add this as a pandas dataframe it does not seem to be working correctly. Would someone be able to spot my mistake please, and help me to convert this to a pandas dataframe?
dfObj = pd.DataFrame(columns = ['C1', 'C2', 'C3', 'C4', 'C5'])
for row in rows:
    courses = row.find_all("td")
    # The fragments list will store things to be included in the final
    # string, such as the course title and its URLs
    fragments = []
    for course in courses:
        if course.text.isspace():
            continue
        # Add the <td>'s text to fragments
        fragments.append(course.text)
        # Try and find an <a> tag
        a_tag = course.find("a")
        if a_tag:
            # If one was found, add the URL to fragments
            fragments.append(a_tag["href"])
    # Make a string containing every fragment with ", " spacing them apart.
    cleantext = ", ".join(fragments)
    series_obj = pd.Series(cleantext, index=dfObj.columns)
    # Add a series as a row to the dataframe
    mod_df = dfObj.append(series_obj, ignore_index=True)
print(mod_df)
I then get the following output:
You can build the string by taking advantage of the .text property of BeautifulSoup4's Tag objects, and storing things in an intermediate variable (I chose a list so that combining the items with a comma is easier).
Here's a fully-fledged example:
for row in rows:
    children = row.find_all("td")
    # The fragments list will store all of the stuff we want
    # included in the final string, such as the course title and
    # its URLs.
    fragments = []
    for child in children:
        # Ignore <td> tags without text (can happen when no
        # part-time version of a course exists)
        if child.text.isspace():
            continue
        # Add the <td>'s text to fragments. This could be the course
        # title (e.g. Internet of Things (Data)) or its ID (e.g. I1T2)
        fragments.append(child.text)
        # Try and find an <a> tag in the child <td>.
        a_tag = child.find("a")
        if a_tag:
            # If one was found, add the URL to fragments.
            fragments.append(a_tag["href"])
    # Make a string containing every fragment with ", " spacing them apart.
    cleantext = ", ".join(fragments)
    print(cleantext)
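Regarding the pandas EDIT: DataFrame.append returns a new frame rather than modifying dfObj in place, so assigning the result to mod_df inside the loop keeps only the last row. A common pattern is to collect the rows in a plain list and build the frame once at the end. Below is a minimal self-contained sketch; the table HTML is a simplified stand-in for the real page, and the C1..C5 column names come from the question.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Simplified stand-in for the scraped table (structure assumed from the question)
html = """<table>
<tr><td><a href="/ai">Artificial Intelligence</a></td><td>I4U2</td><td>I4U1</td></tr>
<tr><td><a href="/bds">Big Data Science</a></td><td>H6J6</td><td>H6J7</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

records = []
for row in soup.find_all("tr"):
    fragments = []
    for td in row.find_all("td"):
        if td.text.isspace():
            continue                      # skip empty part-time cells
        fragments.append(td.text)
        a_tag = td.find("a")
        if a_tag:
            fragments.append(a_tag["href"])
    # Pad to five columns so every row fits C1..C5
    records.append((fragments + [None] * 5)[:5])

# Build the DataFrame once, outside the loop. The original loop's
# `mod_df = dfObj.append(...)` discarded all rows but the last, because
# append returns a new frame instead of mutating dfObj.
df = pd.DataFrame(records, columns=['C1', 'C2', 'C3', 'C4', 'C5'])
print(df)
```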

Beautiful soup: Finding <a> tags within a <td> but not within <sup> inside the <td>

Some background here, please skip to the end if you're not interested in this.
I'm trying to scrape some data about singers on the Billboard Hot 100 and have run into the following problem while scraping information about what genres the singers belong to. I'm currently using the infobox table all singers have.
Eg. https://en.wikipedia.org/wiki/Lil_Nas_X
Inside the infobox, the genres cell is structured differently for different singers. Some singers, like Lil Nas X above, have the genres inside a list:
<th scope="row">Genres</th><td><div class="hlist hlist-separated">
<ul>
<li>Hip hop<sup id="cite_ref-Allmusic_genre_1-0" class="reference">[1]</sup></li>
<li>country rap<sup id="cite_ref-Allmusic_genre_1-1" class="reference">[1]</sup><sup id="cite_ref-2" class="reference">[2]</sup><sup id="cite_ref-3" class="reference">[3]</sup></li>
<li>pop<sup id="cite_ref-Allmusic_genre_1-2" class="reference"><a href="#cite_note-Allmusic_genre-1">
While some others don't.
<th scope="row">Genres</th>
<td>Alternative rock,
post-grunge
<sup id="cite_ref-Villains_AMG_1-0" class="reference">[1]</sup></td></tr>
In these cases I run into a problem: if I extract all the <a> tags within the <td>, I also get the <a> tag from the cite note, which is inside the <sup>.
I want to be able to extract the 'a' tags inside 'td' but not inside the 'sup'. This is the code I am currently using for this portion.
cell_containing_genres = soup.find('th', string = "Genres").find_next('td')  # This finds the next cell after the title Genres
if cell_containing_genres.find_all('li') == []:
    # Extracting all the links from the cell above without list items
    genre_list = cell_containing_genres.find_all('a')
    for genre in genre_list:
        genres.append(genre.get('href'))
else:
    # Extracting all the links from the cell above with list items
    genre_list = cell_containing_genres.find_all('li')
    for genre in genre_list:
        genres.append(genre.find_all('a')[0].get('href'))
Thanks a lot for your help.
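One way to skip the citation links, without branching on the cell layout at all, is to check each <a> tag's ancestry with find_parent: a link that sits inside a <sup> belongs to a cite note and can be ignored. A minimal sketch, using a simplified version of the second infobox layout from the question:

```python
from bs4 import BeautifulSoup

# Simplified infobox cell (structure assumed from the question)
html = """<td>Alternative rock,
<a href="/wiki/Post-grunge">post-grunge</a>
<sup id="cite_ref-1" class="reference"><a href="#cite_note-1">[1]</a></sup></td>"""

cell = BeautifulSoup(html, "html.parser").td

# Keep only <a> tags that have no <sup> ancestor;
# this filters out the citation links inside the cite notes.
genres = [a.get("href") for a in cell.find_all("a")
          if a.find_parent("sup") is None]
print(genres)  # ['/wiki/Post-grunge']
```

The same filter works for the list-based layout, since the cite-note links inside each <li> are still wrapped in <sup> tags.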

How to get the <h1> elements and then next sibling <h2> and <p> elements

I'm working in python on a Selenium problem. I'm trying to gather each element with an h1 tag and following that tag, I want to get the closest h2 and paragraph text tags and place that data into an object.
My current code looks like:
cards = browser.find_elements_by_tag_name("h1")
ratings = browser.find_elements_by_tag_name('h3')
descriptions = browser.find_elements_by_tag_name('p')
print(len(cards))
print(len(ratings))
print(len(descriptions))
which is generating inconsistent numbers.
To get the <h1> tag elements and then the next sibling <h2> and <p> tag elements you can use the following solution:
cards = browser.find_elements_by_tag_name("h1")
ratings = browser.find_elements_by_xpath("//h1/following-sibling::h2")
descriptions = browser.find_elements_by_xpath("//h1/following-sibling::p")
print(len(cards))
print(len(ratings))
print(len(descriptions))
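Note that these global queries return flat lists, so the three lists can still fall out of alignment when a card lacks an <h2> or <p>. A more robust pattern is to query relative to each <h1> with `./following-sibling::h2[1]` (the `[1]` takes only the nearest sibling). The sketch below demonstrates the idea with lxml on an assumed page structure; in Selenium the same relative expressions can be passed to element-level XPath lookups.

```python
from lxml import html

# Assumed page structure: each card is an <h1> followed by its
# <h2> rating and <p> description as siblings.
doc = html.fromstring("""
<div>
  <h1>Card A</h1><h2>4.5</h2><p>First description</p>
  <h1>Card B</h1><h2>3.0</h2><p>Second description</p>
</div>
""")

cards = []
for h1 in doc.xpath("//h1"):
    # [1] picks the *nearest* following sibling, so each card keeps its
    # own rating/description even if another card is incomplete.
    rating = h1.xpath("./following-sibling::h2[1]/text()")
    description = h1.xpath("./following-sibling::p[1]/text()")
    cards.append((h1.text,
                  rating[0] if rating else None,
                  description[0] if description else None))
print(cards)
```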

Parsing XML in Python finding element by id-tag

I'm trying to use regex to parse an XML file (in my case this seemed like the right way).
My XML looks like this:
line='<form id="main">\n<input {disable} style="display:none" id="CALLERID" value="58713780">\n<input {disable} style="display:none" id="GR_BUS" value="VGH1"\n<td><input id="label" {disable} style="font-size:9px;width:100%;margin:0;padding:1;" type=text></td>\n</form>>'
To access the text , I'm using:
attr = re.search('[#id = (CALLERID|GR_BUS|label)]', line)
I want to get the result of parsing xml in format:
<CALLERID>58713780</CALLERID><GR_BUS>VGH1</GR_BUS><label></label>
but nothing is being returned.
Can someone point out what I'm doing wrong?
Here is the output:
line = '''<form id="main">\n
<input {disable} style="display:none" id="CALLERID" value = "58713780" >\n
<input{disable} style = "display:none" id = "GR_BUS" value = "VGH1"\n >
< td >< inputid = "label"{disable}style = "font-size: 9px;width: 100 %;margin: 0;padding: 1;" type=text></td>
</form>>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(line, "lxml")
for values in soup.findAll("input"):
    id = values["id"]
    value = values["value"]
    print(id, value)
output :
('CALLERID', '58713780')
('GR_BUS', 'VGH1')
First, what you have in your example is not valid XML, but HTML. More likely an HTML template, considering the {disable} directives in your string.
Second, your regex is invalid, as it does not take into account the quotes around the id attribute. I also suppose you need a capturing group for the value attribute too, in order to build your final result and take into account that the value is not always present (i.e. in case of the label id).
The regex which does that is id=\"(CALLERID|GR_BUS|label)\"(\s*value=\"(\S*)\")?. For each match, the first capturing group will contain the value of the id attribute and the 3rd group (if present) will contain the value of the value attribute.
You can test it at https://regex101.com, by selecting python as language.
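To illustrate, here is a short sketch applying that regex with re.finditer to build the requested string. The sample line is condensed from the question, without the spaces around = that appear in some variants; when the optional value group is absent (the label case), the element is simply left empty.

```python
import re

# Condensed sample input, assumed from the question's HTML template
line = ('<input {disable} style="display:none" id="CALLERID" value="58713780">'
        '<input {disable} style="display:none" id="GR_BUS" value="VGH1">'
        '<input id="label" {disable} type=text>')

pattern = r'id="(CALLERID|GR_BUS|label)"(\s*value="(\S*)")?'
# Group 1 is the id; group 3 (if present) is the value attribute's content
result = "".join(
    f"<{m.group(1)}>{m.group(3) or ''}</{m.group(1)}>"
    for m in re.finditer(pattern, line)
)
print(result)  # <CALLERID>58713780</CALLERID><GR_BUS>VGH1</GR_BUS><label></label>
```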

Extracting text from find_next_sibling(), BeautifulSoup

I am trying to extract the description of the Chinese character from this website: http://www.hsk.academy/en/hsk_1
Example html:
<tr>
<td>
<span class="hanzi">爱</span>
<br/>ài</td>
<td>to love; affection; to be fond of; to like</td>
</tr>
I would like the last <td> tag's text to be put into a list for each character's description. However, currently I am given the whole tag, including the tags themselves, and I can't call .text on the result of find_next_sibling(): AttributeError: 'NoneType' object has no attribute 'text'.
This is my code:
for item in soup.find_all("td"):
    EnglishItem = item.find_next_sibling()
    if EnglishItem:
        if not any(EnglishItem in s for s in EnglishDescriptionList):
            EnglishDescriptionList.insert(count, EnglishItem)
            count += 1
print EnglishDescriptionList
Try this:
english_descriptions = []
table = soup.find('table', id='flat_list')
for e in table.select('.hanzi'):
    english_desc = e.parent.find_next_sibling().text
    if not any(english_desc in s for s in english_descriptions):
        english_descriptions.append(english_desc)
This selects (finds) all tags of class hanzi (within the table with id="flat_list") which will be the <span> tags. Then the parent of each <span> is accessed - this is the first <td> in each row. Finally the next sibling is accessed and this is the target tag that contains the English description.
You can do away with the count and just append items to the list with english_descriptions.append().
Also, I don't think that you need to check whether the current english description is a substring of an existing one (is that what you're trying to do?). If not you can simplify to this list comprehension:
table = soup.find('table', id='flat_list')
english_descriptions = [e.parent.find_next_sibling().text for e in table.select('.hanzi')]
