How to collect siblings between two specific siblings? - python

I have an HTML book with many <h2> headers and I want to collect the content after each of them using BeautifulSoup. This is my current code, which crashes because it keeps adding every following sibling over and over through the rest of the document and then repeats that for each element in the loop:
chapter_data = []
for e in soup.select('h2'): # select the tags
    chapter = ''
    for sib in e.next_siblings:
        if sib.name == 'h2':
            break
        chapter += str(sib)
        # add data to list
        chapter_data.append({
            'chapter': e.get_text(strip=True).split()[-1],
            'text': chapter
        })
chapters = pd.DataFrame(chapter_data)
The goal is to get each chapter into its own cell. I figured this would collect all the siblings and content after each <h2></h2> tag separately and store it in a list. There are a lot of headings and I could do this manually, but I would like to know the proper code that could do it for me.
I also tried the extract() method, but that doesn't work either.

How you did it is pretty much what I would do, but you should unindent the chapter_data.append... statement by a level, since there is no need to append to chapter_data for every sib. The example below parses the Wikipedia page on coal gas (see the commented-out URL).
# [ linkToSoup defined at https://pastebin.com/rBTr06vy ]
# soup = linkToSoup('https://en.wikipedia.org/wiki/Coal_gas')
headerTags = soup.select('h2')
chapter_data, totalCh = [], len(headerTags)
for ci, (h, nxt_h) in enumerate(zip(headerTags, headerTags[1:] + [None]), 1):
    print('', end=f'\rAdding chapter {ci} of {totalCh}')  ## track progress
    texts, htmls = '', ''
    for sib in h.next_siblings:
        if sib == nxt_h:
            break
        if callable(getattr(sib, 'get_text', None)):
            texts += '\n' + sib.get_text(' ', strip=True)
        htmls += '\n' + str(sib).strip()
    # texts = BeautifulSoup(htmls).get_text(' ', strip=True)
    chapter_data.append({
        'chapter': h.get_text(' ', strip=True),
        'text': texts.strip(), 'html': htmls.strip()
    })
print('\n')
chapters = pd.DataFrame(chapter_data)
Btw, if the chapter_data.append... statement were indented like yours, there would be many redundant appends for each chapter.
EDIT: I just realized that you can probably build chapter_data in one statement with a list comprehension:
chapter_data = [{
    'chapter': h.get_text(' ', strip=True),
    'html': '\n'.join(
        str(sib) for sib in h.next_siblings if
        sib != nxt_h and nxt_h not in sib.previous_siblings
    ).strip()
} for h, nxt_h in zip(headerTags, headerTags[1:] + [None])]
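As with the longer version above, the result can go straight into a DataFrame (assuming pandas is imported as pd, as in the question):
chapters = pd.DataFrame(chapter_data)
print(chapters.head())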

Related

Parsing just first result with beautiful soup

I have the following code which successfully pulls links, titles, etc. for podcast episodes. How would I go about pulling just the first one it comes to (i.e. the latest episode) and then immediately stopping and producing just that result? Any advice would be greatly appreciated.
def get_playable_podcast(soup):
    """
    @param soup: parsed html page
    """
    subjects = []
    for content in soup.find_all('item'):
        try:
            link = content.find('enclosure')
            link = link.get('url')
            print("\n\nLink: ", link)

            title = content.find('title')
            title = title.get_text()

            desc = content.find('itunes:subtitle')
            desc = desc.get_text()

            thumbnail = content.find('itunes:image')
            thumbnail = thumbnail.get('href')
        except AttributeError:
            continue

        item = {
            'url': link,
            'title': title,
            'desc': desc,
            'thumbnail': thumbnail
        }
        subjects.append(item)

    return subjects


def compile_playable_podcast(playable_podcast):
    """
    @param playable_podcast: list containing dicts of key/value pairs for playable podcasts
    """
    items = []
    for podcast in playable_podcast:
        items.append({
            'label': podcast['title'],
            'thumbnail': podcast['thumbnail'],
            'path': podcast['url'],
            'info': podcast['desc'],
            'is_playable': True,
        })
    return items
The answer of @John Gordon is completely correct. He pointed out that:
soup.find()
will always return the first found item (for you that's perfectly fine, since you want to scrape the "latest episode").
However, imagine you wanted to select the second, third, fourth, etc. item instead. Then you could do that with a line like the following:
soup.find_all('item')[0]  # index 0 gives the same result as soup.find('item'), i.e. the first item
When you replace the 0 with any other index (e.g. 3) you get only the chosen (in this example, the fourth) item ;).
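To directly answer the original question, here is a minimal sketch (reusing the structure of your get_playable_podcast; the early return and the function name are my additions) that stops after the first item that parses successfully:
def get_latest_podcast(soup):
    """Return a dict for the first (latest) episode only, or None if nothing parses."""
    for content in soup.find_all('item'):
        try:
            return {
                'url': content.find('enclosure').get('url'),
                'title': content.find('title').get_text(),
                'desc': content.find('itunes:subtitle').get_text(),
                'thumbnail': content.find('itunes:image').get('href'),
            }
        except AttributeError:
            continue  # skip malformed items and try the next one
    return None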

Removing new line characters in web scrape

I'm trying to scrape baseball lineup data but would only like to return the player names. However, as of right now, it is giving me the position, a newline character, the name, another newline character, and then the batting side. For example, I want
'D. Fletcher'
but instead I get
'LF\nD. Fletcher\nR'
Additionally, it is giving me all players on the page. It would be preferable to group them by team, which maybe requires a dictionary of some sort, but I am not sure what that code would look like.
I've tried using the strip function, but I believe that only removes leading or trailing characters as opposed to ones in the middle. I've also tried researching how to get just the title information from the anchor tag but have not figured out how to do that.
from bs4 import BeautifulSoup
import requests
url = 'https://www.rotowire.com/baseball/daily_lineups.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
players = soup.find_all('li', {'class': 'lineup__player'})
####for link in players.find('a'):
##### print (link.string)
awayPlayers = [player.text.strip() for player in players]
print(awayPlayers)
You should only get the .text for the a tag, not the whole li:
awayPlayers = [player.find('a').text.strip() for player in players]
That would result in something like the following:
['L. Martin', 'Jose Ramirez', 'J. Luplow', 'C. Santana', ...
Say you wanted to build a dict with team names and players; you could do something like the following. I don't know if you want the highlighted players, e.g. Trevor Bauer? I have added variables to hold them in case they are needed.
Ad boxes and tools boxes are excluded via the :not pseudo-class, which is passed a list of classes to ignore.
from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://www.rotowire.com/baseball/daily-lineups.php')
soup = bs(r.content, 'lxml')

team_dict = {}
teams = [item.text for item in soup.select('.lineup__abbr')]  # 26
matches = {}
i = 0

for teambox in soup.select('.lineups > div:not(.is-ad, .is-tools)'):
    team_visit = teams[i]
    team_home = teams[i + 1]
    highlights = teambox.select('.lineup__player-highlight-name a')
    visit_highlight = highlights[0].text
    home_highlight = highlights[1].text
    match = team_visit + ' v ' + team_home
    visitors = [item['title'] for item in teambox.select('.is-visit .lineup__player [title]')]
    home = [item['title'] for item in teambox.select('.is-home .lineup__player [title]')]
    matches[match] = {'visitor': [{team_visit: visitors}],
                      'home': [{team_home: home}]
                      }
    i += 2  # each game consumes two entries from teams (visitor, home)
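If you want to check the resulting structure yourself, a quick way (my addition) is to pretty-print it:
from pprint import pprint

pprint(matches)
# e.g. {'AAA v BBB': {'visitor': [{'AAA': [player names]}], 'home': [{'BBB': [player names]}]}, ...}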
I think you were almost there, you just needed to tweak it a little bit:
awayPlayers = [player.find('a').text for player in players]
This list comprehension iterates over the players and pulls the text from each anchor, so you get just a list of the names:
['L. Martin',
'Jose Ramirez',
'J. Luplow'...]
You have to find the a tag and get the title attribute from it; see the code below.
awayPlayers = [player.find('a').get('title') for player in players]
print(awayPlayers)
Output is:
['Leonys Martin', 'Jose Ramirez', 'Jordan Luplow', 'Carlos Santana',

How to go through all items and then save them in a dictionary key

I want to automatically load a code from a website.
I have a list with some names and want to go through every item: take an item, make a request, open the website, copy the code/number from the HTML (the text in a span), save this result in a dictionary, and so on for all items.
I read all lines from a CSV and save them into a list.
After this I make a request to load the HTML from the website, search for the company and read the numbers from the span.
My code:
with open(test_f, 'r') as file:
    rows = csv.reader(file,
                      delimiter=',',
                      quotechar='"')
    data = [data for data in rows]
    print(data)

url_part1 = "http://www.monetas.ch/htm/651/de/Firmen-Suchresultate.htm?Firmensuche="
url_enter_company = [data for data in rows]
url_last_part = "&CompanySearchSubmit=1"
firma_noga = []
for data in firma_noga:
    search_noga = url_part1 + url_enter_company + url_last_part
    r = requests.get(search_noga)
    soup = BeautifulSoup(r.content, 'html.parser')
    lii = soup.find_all("span")
    # print all numbers that are in a span
    numbers = [d.text for d in lii]
    print("NOGA Codes: ")
I want to get the result into a dictionary, where the key should be the company name (the item in the list) and the value should be the number(s) that I read from the span:
dict = {"firma1": "620100", "firma2": "262000, 465101"}
Can someone help me? I am new at web scraping and Python, and don't know what I am doing wrong.
Split your string with a regex and do your stuff depending on whether it is a number or not:
import re

for partial in re.split('([0-9]+)', myString):
    try:
        print(int(partial))
    except:
        print(partial + ' is not a number')
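For example, applied to one of the code strings from the question, the capturing group keeps the digit runs as separate list elements (the empty strings at the edges are normal for re.split with a capturing group):
import re

print(re.split('([0-9]+)', '262000, 465101'))
# ['', '262000', ', ', '465101', '']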
EDIT:
Well, myString is somewhat expected to be a string.
To get the text content of your spans as a string, you should be able to use .text, something like this:
spans = soup.find_all('span')
for span in spans:
    myString = span.text
    for partial in re.split('([0-9]+)', myString):
        try:
            print(int(partial))
        except:
            print(partial + ' is not a number')
Leaving aside my remarks in the comments, I think something like this should work for you:
firma_noga = ['firma1', 'firma2', 'firma3']  # NOT EMPTY as in your code!
res_dict = {}
for data in firma_noga:
    search_noga = url_part1 + data + url_last_part  # build the URL from the current company name
    r = requests.get(search_noga)
    soup = BeautifulSoup(r.content, 'html.parser')
    lii = soup.find_all("span")
    for l in lii:
        if data not in res_dict:
            res_dict[data] = [l]
        else:
            res_dict[data].append(l)
Obviously this will only work if firma_noga is not empty, unlike in your code; all the rest of your parsing logic should be valid as well.
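Putting the two answers together, a rough sketch of the whole flow might look like the following (the URL parts come from the question; reading company names from the first CSV column and keeping only the digit runs found inside the spans are my assumptions):
import csv
import re

import requests
from bs4 import BeautifulSoup

url_part1 = "http://www.monetas.ch/htm/651/de/Firmen-Suchresultate.htm?Firmensuche="
url_last_part = "&CompanySearchSubmit=1"

# Assumption: each CSV row has the company name in its first column.
with open('test.csv', 'r') as f:
    firma_noga = [row[0] for row in csv.reader(f) if row]

res_dict = {}
for firma in firma_noga:
    r = requests.get(url_part1 + firma + url_last_part)
    soup = BeautifulSoup(r.content, 'html.parser')
    # Collect every digit sequence that appears inside a <span>.
    codes = []
    for span in soup.find_all('span'):
        codes += re.findall('[0-9]+', span.text)
    res_dict[firma] = ', '.join(codes)

print(res_dict)  # desired shape: {"firma1": "620100", "firma2": "262000, 465101"}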

Finding direct child of an element

I'm writing a solution to test this phenomenon in Python. I have most of the logic done, but there are many edge cases that arise when following links in Wikipedia articles.
The problem I'm running into arises for a page like this where the first <p> has multiple levels of child elements and the first <a> tag after the first set of parentheses needs to be extracted. In this case, (to extract this link), you have to skip over the parentheses, and then get to the very next anchor tag/href. In most articles, my algorithm can skip over the parentheses, but with the way that it looks for links in front of parentheses (or if they don't exist), it is finding the anchor tag in the wrong place. Specifically, here: <span style="font-size: small;"><span id="coordinates">Coordinates
The algorithm works by iterating through the elements in the first paragraph tag (in the main body of the article), stringifying each element iteratively, and first checking to see if it contains either a '(' or an '<a'.
Is there any straightforward way to avoid embedded anchor tags and only take the first link that is a direct child of the first <p>?
Below is the function with this code for reference:
def getValidLink(self, currResponse):
    currRoot = BeautifulSoup(currResponse.text, "lxml")
    temp = currRoot.body.findAll('p')[0]
    parenOpened = False
    parenCompleted = False
    openCount = 0
    foundParen = False

    while temp.next:
        temp = temp.next
        curr = str(temp)
        if '(' in curr and str(type(temp)) == "<class 'bs4.element.NavigableString'>":
            foundParen = True
            break
        if '<a' in curr and str(type(temp)) == "<class 'bs4.element.Tag'>":
            link = temp
            break

    temp = currRoot.body.findAll('p')[0]

    if foundParen:
        while temp.next and not parenCompleted:
            temp = temp.next
            curr = str(temp)
            if '(' in curr:
                openCount += 1
                if parenOpened is False:
                    parenOpened = True
            if ')' in curr and parenOpened and openCount > 1:
                openCount -= 1
            elif ')' in curr and parenOpened and openCount == 1:
                parenCompleted = True
        try:
            return temp.findNext('a').attrs['href']
        except KeyError:
            print("\nReached article with no main body!\n")
            return None

    try:
        return str(link.attrs['href'])
    except KeyError:
        print("\nReached article with no main body\n")
        return None
I think you are seriously overcomplicating the problem.
There are multiple ways to use the direct parent-child relationship between the elements in BeautifulSoup. One way is the > CSS selector:
In [1]: import requests
In [2]: from bs4 import BeautifulSoup
In [3]: url = "https://en.wikipedia.org/wiki/Sierra_Leone"
In [4]: response = requests.get(url)
In [5]: soup = BeautifulSoup(response.content, "html.parser")
In [6]: [a.get_text() for a in soup.select("#mw-content-text > p > a")]
Out[6]:
['West Africa',
'Guinea',
'Liberia',
...
'Allen Iverson',
'Magic Johnson',
'Victor Oladipo',
'Frances Tiafoe']
Here we've found the a elements that are located directly under the p elements that are directly under the element with id="mw-content-text" - from what I understand, this is where the main Wikipedia article text is located.
If you need a single element, use select_one() instead of select().
Also, if you want to solve it via find*(), pass the recursive=False argument.
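A minimal sketch of that find_*() route, assuming the same container id and structure as in the selector above:
content = soup.find(id="mw-content-text")
first_p = content.find("p", recursive=False)  # only direct <p> children of the container
first_link = first_p.find("a", recursive=False) if first_p else None  # only direct-child <a> tags of that <p>
print(first_link.get("href") if first_link else None)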

Extract Information out of html

How can I extract the information from the appended HTML and save the following to a text file:
Paragraph-ID \t TokenID \t TokenCoordinates \t TokenContent
So, for example, the first lines should look like this:
T102633 1 109,18,110,18 IV
T102634 1 527,29,139,16 Seit
...
I'd like to use Python. At the moment, I have the following:
import lxml.html
import lxml.etree

root = lxml.html.parse('html-file').getroot()
tables = root.cssselect('table.main')
tables = root.xpath('//table[@class="main" and not(ancestor::table[@class="main"])]')
for elem in root.xpath("//span[@class='finereader']"):
    text = (elem.text or "") + (elem.tail or "")
    if elem.getprevious() is not None:  # If there's a previous node
        previous = elem.getprevious()
        previous.tail = (previous.tail or "") + text  # append to its tail
    else:
        parent = elem.getparent()  # Otherwise use the parent
        parent.text = (parent.text or "") + text  # and append to its text
    elem.getparent().remove(elem)
txt = []
txt += ([lxml.etree.tostring(t, method="html", encoding="utf-8") for t in tables])
text = "\n".join(el for el in txt)
output.write(text.decode("utf-8"))
This gives me something like this:
[:T102633-1 coord="109,18,110,18":]IV[:/T102633-1:]
Now, it's clear that I could use the string find method to extract the information I want. But is there no more elegant solution, with ".attrib" or something like that?
Thanks for any help!
Here, one can find the html: http://tinyurl.com/qjvsp4n
This code using BeautifulSoup gives all the spans you are interested in:
from bs4 import BeautifulSoup

html_file = open('html_file')
soup = BeautifulSoup(html_file)
table = soup.find('table', attrs={'class': 'main'})
# The first two tr's don't seem to contain the info you need,
# so get rid of them
rows = table.find_all('tr')[2:]
for row in rows:
    data = row.find_all('td')[1]
    span_element = data.find_all('span')
    for ele in span_element:
        print(ele.text)
Once you have the data in the format [:T102639-3 coord="186,15,224,18":]L.[:/T102639-3:], use the Python re module to get the content.
import re

pattern = re.compile(r'\[:(.*):\](.*)\[:\/(.*):\]')
data = '[:T102639-3 coord="186,15,224,18":]L.[:/T102639-3:]'
res = re.search(pattern, data)
# res.group(1).split()[0] then gives 'T102639-3'
# res.group(1).split()[1] gives 'coord="186,15,224,18"'
# res.group(2) gives 'L.'
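To produce the tab-separated lines asked for in the question (Paragraph-ID, TokenID, coordinates, content), a rough sketch building on that regex might look like this (splitting the 'T102633-1' style ID on '-' and stripping the coord="..." wrapper are my assumptions about the format):
import re

pattern = re.compile(r'\[:(.*):\](.*)\[:\/(.*):\]')

def to_tsv_line(data):
    # '[:T102633-1 coord="109,18,110,18":]IV[:/T102633-1:]' -> 'T102633\t1\t109,18,110,18\tIV'
    res = re.search(pattern, data)
    if res is None:
        return None
    full_id, coord = res.group(1).split()
    paragraph_id, token_id = full_id.split('-')  # 'T102633-1' -> 'T102633', '1'
    coord_value = coord.split('"')[1]            # coord="109,18,110,18" -> 109,18,110,18
    return '\t'.join([paragraph_id, token_id, coord_value, res.group(2)])

print(to_tsv_line('[:T102633-1 coord="109,18,110,18":]IV[:/T102633-1:]'))
# T102633	1	109,18,110,18	IV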
