I am extracting plaintext from HTML emails using BeautifulSoup. I've got everything working nicely except for one issue: my emails often have earlier replies quoted below the message at the top, so with threaded emails I end up capturing the same text repeatedly. In most cases, I just want to get rid of everything after the first <div> tag I find. If I print soup.contents, it outputs the following:
p
None
p
None
p
None
p
None
p
None
p
None
p
None
p
None
p
None
p
None
p
None
div
None
meta
None
style
None
div
None
p
I am looking to return a BeautifulSoup object with everything past the first div tag removed.
HTML-wise, here's the before and after I'm going for:
Before:
<p> Hi Joe </p>
<p> I will be at the meeting tonight</p>
<p> Allison </p>
<div style='border-width: 1pt medium medium; border-style: solid none none; border-color: rgb(181, 196, 223) currentColor currentColor; font-family: "Arial","sans-serif";'>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
<b>From: </b>John Doe <jdoe@example.com></p>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
<b>Sent: </b>Wednesday, May 30, 2018 6:48 AM</p>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
<b>To: </b>Allison <allison@example.com></p>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
<b>Subject: </b>RE: meeting tonight</p>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
</p>
</div>
<p>Will you be at the meeting tonight?</p>
After:
<p> Hi Joe </p>
<p> I will be at the meeting tonight</p>
<p> Allison </p>
In BeautifulSoup 4, you can use the find_all_next method to delete everything after the tag, including the tag itself. This only works if the elements that come afterwards are actual elements, e.g. they can't just be loose text belonging directly to the body element.
target = soup.find('div')
for e in target.find_all_next():
    e.extract()   # remove every element that follows the first <div>
target.extract()  # remove the <div> itself as well
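For example, here is a minimal sketch (using a shortened version of the "Before" HTML above) that drops the first <div> and everything after it:
from bs4 import BeautifulSoup

html = """<p> Hi Joe </p>
<p> I will be at the meeting tonight</p>
<p> Allison </p>
<div><p><b>From: </b>John Doe</p></div>
<p>Will you be at the meeting tonight?</p>"""

soup = BeautifulSoup(html, 'html.parser')
target = soup.find('div')
for e in target.find_all_next():
    e.extract()   # everything after the first <div>
target.extract()  # the <div> itself

print(soup.get_text(" ", strip=True))
# Hi Joe I will be at the meeting tonight Allison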
The easiest way in this case is to just use re and remove everything after the first <div> tag:
s = """<p> Hi Joe </p>
<p> I will be at the meeting tonight</p>
<p> Allison </p>
<div style='border-width: 1pt medium medium; border-style: solid none none; border-color: rgb(181, 196, 223) currentColor currentColor; font-family: "Arial","sans-serif";'>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
<b>From: </b>John Doe <jdoe@example.com></p>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
<b>Sent: </b>Wednesday, May 30, 2018 6:48 AM</p>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
<b>To: </b>Allison <allison@example.com></p>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
<b>Subject: </b>RE: meeting tonight</p>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
</p>
</div>
<p>Will you be at the meeting tonight?</p>"""
import re
new_s = re.sub(r'<div.*', '', s, flags=re.DOTALL).strip()
print(new_s)
Prints:
<p> Hi Joe </p>
<p> I will be at the meeting tonight</p>
<p> Allison </p>
Then you can feed this new string to BeautifulSoup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(new_s, 'lxml')
print(soup.prettify())
Outputs:
<html>
<body>
<p>
Hi Joe
</p>
<p>
I will be at the meeting tonight
</p>
<p>
Allison
</p>
</body>
</html>
Input Element:
<input type="text" auto_complete_item_format="<b xmlns="none">[_full_name]</b>" auto_complete_display_field="_full_name" auto_complete_id_field="suburb_id" auto_complete_search_fields="_full_name,suburb" auto_complete_minimum_characters="1" auto_complete_data_source_web_method="" auto_complete_data_source_object="" auto_complete_data_source_function="get_suburb_suggestions" auto_complete_width="300px" onblur="smart_update(this);" onkeypress="smart_keypress(this, event); " onchange="smart_date_onchange(this); " selected_id="-1" original_value="" value="3000" smart_field_type="auto_complete" class="smart_auto_complete" id="suburb" name="suburb" field="jims.addresses.smart_table.suburb" record_id="-1" bind="True" auto_complete_cursor="2" autocomplete="off" fdprocessedid="5bithf" auto_complete_last_search_string="300" selected_row_index="-1" style="width: 300px;">
Suggestions List:
<div id="suburb_list" name="suburb_list" style="display: none; position: absolute; border: 1px solid rgb(0, 0, 0); z-index: 100; left: 533px; height: 400px; width: 300px; background-color: white; opacity: 0.93; overflow: auto;"><div id="suburb_list_item_0" list_index="0" row_index="0" style="border-width: 1px; border-color: rgb(170, 170, 170); border-style: solid; background-color: rgb(221, 221, 221); padding: 3px;"><b xmlns="none">150 Lonsdale Street, Melbourne 3000 VIC</b></div><div id="suburb_list_item_1" list_index="1" row_index="1" style="border-width: 1px; border-color: rgb(170, 170, 170); border-style: solid; background-color: rgb(255, 255, 255); padding: 3px;"><b xmlns="none">1 Elizabeth Street, Melbourne 3000 VIC</b></div><div id="suburb_list_item_2" list_index="2" row_index="4" style="border-width: 1px; border-color: rgb(170, 170, 170); border-style: solid; background-color: lightblue; padding: 3px;"><b xmlns="none">Carlton 3000 VIC</b></div><div id="suburb_list_item_3" list_index="3" row_index="5" style="border-width: 1px; border-color: rgb(170, 170, 170); border-style: solid; background-color: rgb(255, 255, 255); padding: 3px;"><b xmlns="none">Docklands 3000 VIC</b></div><div id="suburb_list_item_4" list_index="4" row_index="10" style="border-width: 1px; border-color: rgb(170, 170, 170); border-style: solid; background-color: rgb(221, 221, 221); padding: 3px;"><b xmlns="none">East Melbourne 3000 VIC</b></div><div id="suburb_list_item_5" list_index="5" row_index="12" style="border-width: 1px; border-color: rgb(170, 170, 170); border-style: solid; background-color: rgb(255, 255, 255); padding: 3px;"><b xmlns="none">Footscray 3000 VIC</b></div><div id="suburb_list_item_6" list_index="6" row_index="15" style="border-width: 1px; border-color: rgb(170, 170, 170); border-style: solid; background-color: rgb(221, 221, 221); padding: 3px;"><b xmlns="none">Melbourne 3000 VIC</b></div><div id="suburb_list_item_7" list_index="7" row_index="19" style="border-width: 1px; border-color: rgb(170, 170, 170); border-style: solid; background-color: rgb(255, 255, 255); padding: 3px;"><b xmlns="none">Roxburgh Park 3000 VIC</b></div><div id="suburb_list_item_8" list_index="8" row_index="28" style="border-width: 1px; border-color: rgb(170, 170, 170); border-style: solid; background-color: rgb(221, 221, 221); padding: 3px;"><b xmlns="none">West Melbourne 3000 VIC</b></div></div>
My Code:
for num in range(30):
    try:
        Element = browser.find_element(By.XPATH, f'//*[@id="suburb_list_item_{num}"]')
        print(Element)
    except:
        break
The two HTML snippets at the top are the input field and its suggestion list on a website I am trying to scrape. The Python code above should print all of the list elements, but it doesn't, because the suggestion list only shows up once the text box has been clicked. How can I click the text box? This didn't work:
Element = browser.find_element(By.XPATH, '//*[@id="suburb"]')
The element could not be found.
I am not allowed to share the website link.
You can try:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# <insert other code here for setting up the browser>

wait = WebDriverWait(browser, 10)

# wait for the text box to be visible, then click it so the suggestion list renders
input_element = wait.until(EC.visibility_of_element_located((By.ID, "suburb")))
input_element.click()

# now wait for the suggestion list itself
element = wait.until(EC.visibility_of_element_located((By.ID, "suburb_list")))
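If the suggestions only render after something is typed, a minimal follow-up sketch could look like this (continuing from the code above; the "3000" value is just taken from the pasted input element, so adjust as needed):
# type a value to trigger the autocomplete, then wait for the list to show
input_element.send_keys("3000")
wait.until(EC.visibility_of_element_located((By.ID, "suburb_list")))

# collect every suggestion row by its id prefix
items = browser.find_elements(By.CSS_SELECTOR, "[id^='suburb_list_item_']")
for item in items:
    print(item.text)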
I have a free-text column in a pandas DataFrame that contains HTML tags.
ID Free text field
1 <p><span style="background-color: rgb(255, 255, 255); color: rgb(37, 36, 35); font-family:
Arial; font-size: 10.5pt;">TExt1:</span></p><p><span style="background-color: rgb(255, 255,
255); color: rgb(37, 36, 35); font-family: Arial; font-size: 10.5pt;">Score: 5</span></p><p>
<span style="background-color: rgb(255, 255, 255); color: rgb(37, 36, 35); font-family: Arial;
font-size: 10.5pt;">B - </span><span style="background-color: rgb(255, 255, 255); color:
rgb(36, 36, 36); font-family: Arial; font-size: 10.5pt;">TExt2</span></p><p><span
style="background-color: rgb(255, 255, 255); color: rgb(37, 36, 35); font-family: Arial;
font-size: 10.5pt;">Text6</span></p><p><span style="background-color: rgb(255, 255, 255);
color: rgb(37, 36, 35); font-family: Arial; font-size: 10.5pt;">Text3</span></p><p><span
style="background-color: rgb(255, 255, 255); color: rgb(37, 36, 35); font-family: Arial;
font-size: 10.5pt;">Text4</span></p>
2 <p>Text10</p>
3 <p>Sky is blue</p>
4 <p>Text3</p><p><br></p><p>Text19</p>
5 <p> Complaint1</p><p><br></p><p>Text1</p><p>hospo 2</p><p>Tes45</p><p><br></p><p>test</p>
6 <p>Test44</p>
7 <p>Test54</p>
I tried using this:
from bs4 import BeautifulSoup
df['free text'].apply(
    lambda x: list(BeautifulSoup(x, "html.parser").stripped_strings)
)
but I am getting this error:
TypeError: object of type 'NoneType' has no len()
What am I doing incorrectly?
Any help would be appreciated.
Thanks
Make sure to remove all missing data with df.dropna() before applying the lambda function; otherwise, if your data frame has missing values, you will get an error such as TypeError: object of type 'float' has no len().
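A minimal sketch of that approach (assuming the column is named 'free text' as in the code above, and using a couple of the rows from the table as sample data):
import pandas as pd
from bs4 import BeautifulSoup

df = pd.DataFrame({'free text': ['<p>Sky is blue</p>', '<p>Text3</p><p><br></p><p>Text19</p>', None]})

# drop rows where the HTML column is missing, then strip the tags
df = df.dropna(subset=['free text'])
df['free text'] = df['free text'].apply(
    lambda x: list(BeautifulSoup(x, "html.parser").stripped_strings)
)
print(df)
#          free text
# 0    [Sky is blue]
# 1  [Text3, Text19]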
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, '//*[@id="dataviscontainer"]/div/iframe')))
span_texts = [item.text for item in driver.find_elements_by_css_selector('span')]
print(span_texts)
My objective here is to scrape text from multiple span tags that are in a dropdown list inside an iframe. The iframe does not have any name, class, or id, so I used an XPath. After running this code I get an empty list. Below are 2 of the span tags that contain the text:
<span class="slicerText" title="Albury (C)" style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255); border-style: solid; border-color: rgb(96, 94, 92); border-width: 0px; font-size: 13.3333px; font-family: Arial; line-height: 17px;">Albury (C)</span>
<span class="slicerText" title="Armidale Regional (A)" style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255); border-style: solid; border-color: rgb(96, 94, 92); border-width: 0px; font-size: 13.3333px; font-family: Arial; line-height: 17px;">Armidale Regional (A)</span>
You should first grab the iframe element into a variable, then run the span selection through that iframe reference.
This answer should help you.
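A rough sketch of that idea (assuming driver is an already-initialised webdriver and the spans carry the slicerText class shown above):
from selenium.webdriver.common.by import By

# switch into the iframe first, then query the spans inside it
iframe = driver.find_element(By.TAG_NAME, "iframe")
driver.switch_to.frame(iframe)

span_texts = [s.text for s in driver.find_elements(By.CSS_SELECTOR, "span.slicerText")]
print(span_texts)

driver.switch_to.default_content()  # switch back out of the iframe when done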
Are you looking for something like this:
spanText = driver.find_elements_by_xpath(".//span[@class='slicerText']")
for e in spanText:
    print(e.get_attribute("title"))
If the spans sit inside the iframe, switch into it first:
iframe = driver.find_element_by_tag_name("iframe")
driver.switch_to.frame(iframe)
or, with an explicit wait for the frame:
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, '//*[@id="dataviscontainer"]/div/iframe')))
span_texts = [item.text for item in driver.find_elements_by_css_selector('span')]
print(span_texts)
How can I get the URL from this output of Selenium in Python?
<div style="z-index: 999; overflow: hidden; background-position: 0px 0px; text-align: center; background-color: rgb(255, 255, 255); width: 480px; height: 672.172px; float: left; background-size: 1054px 1476px; display: none; border: 0px solid rgb(136, 136, 136); background-repeat: no-repeat; position: absolute; background-image: url("https://photo.venus.com/im/19230307.jpg?preset=zoom");" class="zoomWindow"> </div>
I got the above output from the following command:
driver.find_element_by_class_name('zoomWindowContainer')
First, get the style attribute:
div = driver.find_element_by_class_name('zoomWindow')
style = div.get_attribute("style") # str
Then use a regex to find the URL in the style string:
import re
urls = re.findall(r"https?://.+\.jpg", style) # list
print(urls[0])
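If the image won't always be a .jpg, a slightly more general sketch (same idea, but capturing whatever sits inside url("...") in the style string) would be:
import re

style = 'background-image: url("https://photo.venus.com/im/19230307.jpg?preset=zoom");'
match = re.search(r'url\("([^"]+)"\)', style)
if match:
    print(match.group(1))  # https://photo.venus.com/im/19230307.jpg?preset=zoom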
https://niioa.immigration.gov.tw/NIA_OnlineApply_inter/visafreeApply/visafreeApplyForm.action
Something pops up after I select the first item and I cannot handle the popup. I do not know what it is; it is not an alert, and I can't find any frame to use with switch to frame.
It's a Chinese website, so I have pasted the elements that get loaded after I selected the first item (the popup text roughly says: except for Hong Kong residents holding a BNO passport and Macau residents holding a Portuguese passport obtained before 1999, holders of foreign passports are not eligible for this permit):
<div class="blockUI" style="display:none"></div>
<div class="blockUI blockOverlay" style="z-index: 1000; border: none; margin: 0px; padding: 0px; width: 100%; height: 100%; top: 0px; left: 0px; background-color: rgb(0, 0, 0); opacity: 0.6; cursor: wait; position: fixed;"></div>
<div class="blockUI blockMsg blockPage" style="z-index: 1011; position: fixed; padding: 0px; margin: 0px; width: 450px; top: 539.5px; left: 119.5px; text-align: center; color: rgb(0, 0, 0); border: 3px solid rgb(170, 170, 170); background-color: rgb(255, 255, 255); height: 140px; overflow: hidden;"><div id="showWarnMessage1" style="">
<table class="application" style="margin: 10px;">
<tbody><tr>
<td>
<p class="Prompt" style="text-align: center">注意</p>
<p>除香港居民持有BNO護照及澳門居民持有1999年前取得之葡萄牙護照外,持有外國護照,不適合辦理本許可。</p>
</td>
</tr>
</tbody></table>
<div>
<input class="btn" value="確認" type="button" onclick="$.unblockUI();">
</div>
</div></div>
This worked for me to get past the pop-up:
import os
from selenium import webdriver

chromedriver = "your_path"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
driver.implicitly_wait(15)
driver.get('https://niioa.immigration.gov.tw/NIA_OnlineApply_inter/visafreeApply/visafreeApplyForm.action')
driver.find_element_by_xpath('//*[@id="isHKMOVisaN"]').click()
And then this last line is what gets rid of the pop-up:
driver.find_element_by_xpath('//*[@id="showWarnMessage1"]/div/input').click()
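If the popup takes a moment to appear, an explicit wait on the confirm button is a slightly more robust sketch (same locator as above, just wrapped in a WebDriverWait):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait until the 確認 (confirm) button in the warning box is clickable, then click it
wait = WebDriverWait(driver, 15)
confirm = wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="showWarnMessage1"]/div/input')))
confirm.click()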