I just found the Mammoth Python package a couple of days ago, and it's a great tool that creates really clean HTML from a Word document. It's nearly perfect. There is just one artifact I don't understand: the heading elements (h1-h6) it creates from the Word headings contain several <a> elements with strange TOC ids. It looks like this:
<h1><a id="_Toc48228035"></a><a id="_Toc48288791"></a><a id="_Toc48303673"></a><a id="_Toc48306159"></a><a id="_Toc48308644"></a><a id="_Toc48311128"></a><a id="_Toc48313611"></a>Arteriosklerose</h1>
Does anybody know how to get rid of these?
Thanks in advance
Cheers,
Peter
This is just a guess, but I hope it helps:
TOC most probably stands for "Table of Contents". When you want to jump to an element in a page (like a certain chapter), you give the element an id and append #id to your URL; that way the browser scrolls directly to that point.
I guess your document contains a table of contents with links in it, and when you inspect those links you will find something like <a href="#_Toc48228035">Arteriosklerose</a>.
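If you just want the anchors gone, one option is to post-process Mammoth's output and strip the empty bookmark anchors. A minimal sketch, assuming the anchors look exactly like the ones above (the file name is a placeholder):

import re
import mammoth

with open("document.docx", "rb") as f:  # placeholder file name
    result = mammoth.convert_to_html(f)

# Word stores one empty bookmark anchor per TOC entry; drop them all
html = re.sub(r'<a id="_Toc\d+"></a>', "", result.value)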
I'm using the docx package in Python, and I have many small tables which I want to put into a Word docx file. However, a table sometimes crosses the page boundary (for example, half the table will be on page 1 and the other half on page 2). Is there a way to prevent this, so that the full table is printed on a single page?
I tried things like table.autofit = False and table.allow_autofit = False, but I don't think they're what I need, so if anyone can help me out I would greatly appreciate it. Thanks!
There is a pagination property called "keep_together". This is its description in the documentation:
keep_together causes the entire paragraph to appear on the same page, issuing a page break before the paragraph if it would otherwise be broken across two pages.
I have not tried this in code, but it is how I would format the tables if I were doing it in Word.
This is a link to the documentation:
https://python-docx.readthedocs.io/en/latest/user/text.html?highlight=break#pagination-properties
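Untested, but a minimal sketch of applying those properties to every paragraph inside a table with python-docx (the file name is a placeholder; keep_with_next additionally ties each paragraph to the next one so the rows stay together):

from docx import Document

doc = Document("report.docx")  # placeholder file name

for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                # keep_together keeps each paragraph on one page;
                # keep_with_next binds it to the following paragraph so
                # the table is not broken across a page boundary
                paragraph.paragraph_format.keep_together = True
                paragraph.paragraph_format.keep_with_next = True

doc.save("report.docx")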
I'm trying to get the number in all the <b> tags on this website. I want every single "qid" (question id), so I think I have to use qids = driver.find_elements_by_tag_name("b"), and based on other questions I've found I also need to implement a for loop and then print(qids.get_attribute("text")), but my code can't even find elements with the <b> tag: I keep getting a NoSuchElementException. The appearance of the website leads me to believe the content I'm looking for is inside an iframe, but I'm not sure whether that affects my code.
Here's a screencap of the website for reference
The html isn't of much use because the tag is its only defining trait:
<b>13570etc...</b>
Any help is much appreciated.
You could try searching by XPath:
driver.find_elements_by_xpath("//b")
Where // means "find all matching elements regardless of where they are in the document/current scope." Check out the XPath syntax here and mess around with a few different options.
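Since you suspect the content is inside an iframe, you will also need to switch into that frame before searching. A minimal sketch, with the URL and the frame locator as assumptions:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/questions")  # placeholder URL

# the main document does not contain elements that live inside an
# iframe, so switch into the frame before searching
driver.switch_to.frame(driver.find_element_by_tag_name("iframe"))

for qid in driver.find_elements_by_xpath("//b"):
    print(qid.text)

driver.switch_to.default_content()  # switch back to the main document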
Okay, so. The heading might make it seem like this question has already been asked, but I had no luck finding an answer for it.
I need help making a link-extracting program in Python.
Actually, it works: it finds all <a> elements on a webpage, takes their href attributes and puts them in an array, and then exports it to a CSV file, which is what I want.
But I can't get a hold of one thing.
The website is dynamic so I am using the Selenium webdriver to get JavaScript results.
The code for the program is pretty simple. I open a website with the webdriver and then get its content. Then I get all the links with
results = driver.find_elements_by_tag_name('a')
Then I loop through results with for loop and get href with
result.get_attribute("href")
I store results in an array and then print them out.
But the problem is that I can't get the name of the links.
<a href="https://www.google.com">This leads to Google</a>
Is there any way to get the 'This leads to Google' string?
I need it for every link that is stored in an array.
Thank you for your time
UPDATE!!!!!
It seems it only gets dynamically generated links. I just noticed this, and it is really strange: for hard-coded items it returns an empty string, but for a dynamic link it returns its name.
Okay, so the answer is that instead of using .text you should use get_attribute("textContent"). It works better than get_attribute("innerHTML").
Thanks KunduK for this answer. You saved my day :)
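For anyone landing here later, a minimal sketch of the fix (assuming driver is already on the page):

results = driver.find_elements_by_tag_name('a')
links = []
for result in results:
    href = result.get_attribute("href")
    # .text is empty for elements Selenium considers hidden;
    # "textContent" returns the DOM text regardless of visibility
    name = result.get_attribute("textContent").strip()
    links.append((href, name))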
Python and Selenium beginner here. I'm trying to scrape the titles of the sections of a Udemy class. I've tried using find_elements_by_class_name and others, but for some reason it only brings back partial data.
page I'm scraping: https://www.udemy.com/selenium-webdriver-with-python3/
1) I want to get the titles of the sections. They are the bold titles.
2) I want to get the title of the subsections.
from selenium import webdriver
driver = webdriver.Chrome()
url = 'https://www.udemy.com/selenium-webdriver-with-python3/'
driver.get(url)
main_titles = driver.find_elements_by_class_name("lecture-title-text")
sub_titles = driver.find_elements_by_class_name("title")
Problem
1) Using main_titles, I got a length of only 10. It only goes from Introduction to Modules; Working With Files and everything after it doesn't come out, even though the class names are exactly the same. Modules / Working With Files is basically the cutoff point, and the elements look different in the inspector from there on. They all have the same span class tag, so I'm not sure why only part of them is returned:
<span class="lecture-title-text">
Element Inspection between Modules title and WorkingWithFiles title
At this point the webscrape breaks down. Not sure why.
2) Using sub_titles, I got a length of 58 items, but when I print them out I only get the top two:
Introduction
How to reach me anytime and ask questions? *** MUST WATCH ***
After this, it's all blank lines. Not sure why it's only pulling the top two and not the rest when all the tags have
<div class='title'>
Maybe I could try using BeautifulSoup, but currently I'm trying to get better with Selenium. Is there dynamic content throwing off the Selenium scrape, or am I not scraping it in the proper way?
Thank you guys for the input, and sorry for the long post; I wanted to make sure I described the problem correctly.
The reason you're only getting the first 10 sections is that only the first ten are shown. You might be logged in in your browser, so when you go to check it out it shows every section, but for me and your scraper it only shows the first 10. You'll need to click that .section-container--more-sections button before looking for the titles.
As for the weird case of the titles not being scraped properly: when an element is hidden, its text attribute will always be undefined, which is why it only works for the first section. I'd try using WebElement.get_attribute('textContent') to scrape the text.
Okay, I've gone through the suggestions in the comments and solved it. I'm writing it up here in case anyone wants to see how the solution went.
1) Using the suggestions, I made a command to click on the '24 more sections' button to expand the tab and then scraped it, which worked perfectly!
driver.find_element_by_class_name("js-load-more").click()
titles = driver.find_elements_by_class_name("lecture-title-text")
for each in titles:
    print(each.text)
This pulled all 34 section titles.
2) Using Matt's suggestion, I found the WebElement and used get_attribute('textContent') to pull out the text data. There was a bunch of whitespace around the strings, so I used strip() to get the text only.
sub_titles = driver.find_elements_by_class_name("title")
for each in sub_titles:
    print(each.get_attribute('textContent').strip())
This pulled all 210 subsection titles!
I am using the Python version of the lxml library. I am currently trying to parse the text from a table, but am encountering a problem in that some of the text is links.
For example, one of the cells may look something like this:
<td>
Can I kick it, <a>to all the people</a> who can quest like a <a>tribe</a> does
</td>
Say that after parsing the HTML, the td element is stored as foo. Then foo.text will not give the whole text, only the text before the first <a> element. Moreover, if I collect the link text using [i.text for i in foo.getchildren()], I no longer know the order in which to interleave the non-link text and the link text.
Is there an easy way to get around this?
Well after searching for an hour, within 2 minutes of posting this question I have found the solution.
Use the method foo.text_content(); it returns the complete text of the element, links included, in document order.
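For example, with the cell above (a minimal sketch; the wrapping table markup is only there so the HTML parser keeps the <td>):

from lxml import html

doc = html.fromstring(
    "<table><tr><td>Can I kick it, <a>to all the people</a> "
    "who can quest like a <a>tribe</a> does</td></tr></table>"
)
foo = doc.xpath("//td")[0]

print(foo.text)            # "Can I kick it, " -- stops at the first <a>
print(foo.text_content())  # the full cell text, links included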