I'm working on a Python project that takes a CSV output file, re-formats the data, and puts it into a Word document using python-docx. Everything so far works great, but working with multiple hyperlinks in the same field causes all the links to point to just the first link of the set.
Currently this is the code that is causing the issue:
p7 = document.add_paragraph()
hyperlink = add_hyperlink(p7, row['See Also'], str(row['See Also']))
As you can see, the blank paragraph is initialised and then the hyperlink is assigned to it. row['See Also'] is the field that contains the links I need to work with. Some entries contain a single link and some contain a lot.
This (https://github.com/python-openxml/python-docx/issues/74) is the function that adds the hyperlink, as per the method documented for python-docx:
import docx

def add_hyperlink(paragraph, url, text):
    # Get access to the document.xml.rels file and generate a new relationship id value
    part = paragraph.part
    r_id = part.relate_to(
        url, docx.opc.constants.RELATIONSHIP_TYPE.HYPERLINK,
        is_external=True
    )
    # Create the w:hyperlink tag and set the needed attribute
    hyperlink = docx.oxml.shared.OxmlElement('w:hyperlink')
    hyperlink.set(docx.oxml.shared.qn('r:id'), r_id)
    # Create a w:r element
    new_run = docx.oxml.shared.OxmlElement('w:r')
    # Create a new w:rPr element
    rPr = docx.oxml.shared.OxmlElement('w:rPr')
    # Join the xml elements together and add the required text to the w:r element
    new_run.append(rPr)
    new_run.text = text
    hyperlink.append(new_run)
    paragraph._p.append(hyperlink)
    return hyperlink
The way I thought to do it was to use a for loop to iterate through each hyperlink in the field and assign each one to its own paragraph; that way the hyperlinks should work just fine. I tried the following, but this just creates thousands of links which do not work right.
for x in row['See Also']:
    p = document.add_paragraph()
    hyperlink = add_hyperlink(p, row['See Also'], row['See Also'])
I'm currently testing with a very small CSV file with just two sets of data as follows:
https://www.openssl.org/blog/blog/2016/08/24/sweet32/
This of course causes no issue and the hyperlink works as expected; however, the following causes all links to point to the first address.
https://downloads.avaya.com/elmodocs2/security/ASA-2006-217.htm
http://www.kb.cert.org/vuls/id/JARL-5ZQR4D
http://www-01.ibm.com/support/docview.wss?uid=isg1IY55949
http://www-01.ibm.com/support/docview.wss?uid=isg1IY55950
http://www-01.ibm.com/support/docview.wss?uid=isg1IY62006
http://www.juniper.net/support/security/alerts/niscc-236929.txt
http://technet.microsoft.com/en-us/security/bulletin/ms05-019
http://technet.microsoft.com/en-us/security/bulletin/ms06-064
http://www.kb.cert.org/vuls/id/JARL-5YGQ9G
http://www.kb.cert.org/vuls/id/JARL-5ZQR7H
http://www.kb.cert.org/vuls/id/JARL-5YGQAJ
http://www.nessus.org/u?cf64c2ca
https://isc.sans.edu/diary.html?date=2004-04-20
The fix is probably quite straightforward; any help with this issue would be appreciated.
You haven't provided enough of the context code to show the specifics, but I suspect your problem is in the line:
for x in row['See Also']:
If you run:
for x in row['See Also']:
    print(x)
I think you'll get:
h
t
t
p
s
:
...
As you can see, using a string as the iterable in a for loop iterates over the characters of the string.
What I think you need instead is something like:
for row in csv_rows:
    p = document.add_paragraph()
    hyperlink = add_hyperlink(p, row['See Also'], row['See Also'])
Figured the issue out and the following code solves the problem:
for row in csv_rows:
    links = row['See Also'].split("\n")
    for item in links:
        p = document.add_paragraph()
        hyperlink = add_hyperlink(p, item, item)
This splits the 'See Also' field on newlines into a list of URLs, then iterates over that list, turning each item into its own hyperlink in its own paragraph.
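For reference, here is a minimal end-to-end sketch of the whole flow, assuming the CSV is read with csv.DictReader and using hypothetical file names (results.csv, output.docx):

import csv
import docx

document = docx.Document()

with open('results.csv', newline='') as f:
    for row in csv.DictReader(f):
        # one paragraph, and therefore one relationship id, per URL
        for url in row['See Also'].split('\n'):
            url = url.strip()
            if url:
                p = document.add_paragraph()
                add_hyperlink(p, url, url)

document.save('output.docx')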
Background: I am pretty new to Python and decided to practice by making a web scraper for https://www.marketwatch.com/tools/markets/stocks/a-z which would allow me to pull the company name, ticker, origin, and sector. I could then use this in another scraper to combine it with more complex information. The page is separated by 2 indexing methods: one to select the first letter of the company name (at the top of the page) and another for the number of pages within that letter index (at the bottom of the page). These two tags have the class="pagination" identifier, but when I scrape based on that criterion, I get two separate strings which are not a delimited list separated by a comma.
Does anyone know how to get the strings as a list, or individually? I really only care about the second one.
from bs4 import BeautifulSoup
import requests

# open the source code of the website as text
source = 'https://www.marketwatch.com/tools/markets/stocks/a-z/x'
page = requests.get(source).text
soup = BeautifulSoup(page, 'lxml')

for tags in soup.find_all('ul', class_='pagination'):
    tags_text = tags.text
    print(tags_text)
Which returns:
0-9ABCDEFGHIJKLMNOPQRSTUVWX (current)YZOther
«123»
When I try to split on /n:
tags_text = tags.text.split('/n')
print(tags_text)
The return is:
['\n0-9ABCDEFGHIJKLMNOPQRSTUVWX (current)YZOther']
['«123»']
Neither seems to form a list. I have found many ways to get the first string, but I really only need the second.
Also, please note that I am using the X index as my current tab. If you build it from scratch, you might have more numbers in the second string, and the word (current) might be in a different place in the first string.
THANK YOU!!!!
Edit: Cleaned old, commented-out code from the post and realized I did not show the result of trying to index the second element, despite the lack of a comma in the split output:
tags_text = tags.text.split('/n')[1]
print(tags_text)
Returns:
File, "C:\.....", line 22, in <module>
tags_text = tags.text.split('/n')[1]
IndexError: list index out of range
Never mind, I was using print() when I should have been using return, .append(), or some other operation to actually do something with the value...
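For the record, the newline escape sequence is '\n'; '/n' is just a literal slash followed by an n, so the split never matches anything. Here is a sketch that sidesteps string-splitting entirely, assuming each page number of the second pagination bar sits in its own li tag:

pagination = soup.find_all('ul', class_='pagination')
if len(pagination) > 1:
    # collect each entry of the second pagination bar as its own string
    page_numbers = [li.get_text(strip=True) for li in pagination[1].find_all('li')]
    print(page_numbers)  # e.g. ['«', '1', '2', '3', '»']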
This question already has answers here:
how to create bookmarks in a word document, then create internal hyperlinks to the bookmark w/ python
(2 answers)
Closed 1 year ago.
I am creating a file which contains text data on the first 4 pages and all images from page 5 onwards.
There is a table with page numbers in one column. I want to add a link to each page number in that column, so that clicking it takes me to the image page referenced by that page number.
I am creating this document using python-docx.
While searching on Google I found a solution for creating a hyperlink using python-docx. Clicking on the hyperlinked text takes me to the URL it references.
The code for hyperlink is as follows:
import docx
from docx.enum.dml import MSO_THEME_COLOR_INDEX

def add_hyperlink(paragraph, text, url):
    # Get access to the document.xml.rels file and generate a new relationship id value
    part = paragraph.part
    r_id = part.relate_to(url, docx.opc.constants.RELATIONSHIP_TYPE.HYPERLINK, is_external=True)
    # Create the w:hyperlink tag and set the needed attribute
    hyperlink = docx.oxml.shared.OxmlElement('w:hyperlink')
    hyperlink.set(docx.oxml.shared.qn('r:id'), r_id)
    # Create a w:r element and a new w:rPr element
    new_run = docx.oxml.shared.OxmlElement('w:r')
    rPr = docx.oxml.shared.OxmlElement('w:rPr')
    # Join the xml elements together and add the required text to the w:r element
    new_run.append(rPr)
    new_run.text = text
    hyperlink.append(new_run)
    # Create a new Run object and add the hyperlink into it
    r = paragraph.add_run()
    r._r.append(hyperlink)
    # A workaround for the lack of a hyperlink style (doesn't go purple after using the link)
    # Delete this if using a template that has the hyperlink style in it
    r.font.color.theme_color = MSO_THEME_COLOR_INDEX.HYPERLINK
    r.font.underline = True
    return hyperlink

document = docx.Document()
p = document.add_paragraph('A plain paragraph having some ')
add_hyperlink(p, 'Link to Google', "http://google.com")
document.save('demo_hyperlink.docx')
I want that link to point to a page inside the document instead.
I had the exact same question.
I got the answer from here: hyperlink bookmarks.
You can find an example of the implementation here: hyperlink to bookmark.
Enjoy, T
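For completeness, a minimal sketch of the bookmark mechanism those links describe (add_bookmark, add_internal_hyperlink, and the bookmark name 'page5' are illustrative helpers, not part of the python-docx API): an internal link is a w:hyperlink element whose w:anchor attribute names a bookmark wrapped around the target paragraph.

import docx
from docx.oxml.shared import OxmlElement, qn

def add_bookmark(paragraph, name, bookmark_id='0'):
    # wrap the paragraph contents in w:bookmarkStart / w:bookmarkEnd tags
    start = OxmlElement('w:bookmarkStart')
    start.set(qn('w:id'), bookmark_id)
    start.set(qn('w:name'), name)
    end = OxmlElement('w:bookmarkEnd')
    end.set(qn('w:id'), bookmark_id)
    paragraph._p.insert(0, start)
    paragraph._p.append(end)

def add_internal_hyperlink(paragraph, text, bookmark_name):
    # an internal link uses w:anchor (a bookmark name) instead of r:id
    hyperlink = OxmlElement('w:hyperlink')
    hyperlink.set(qn('w:anchor'), bookmark_name)
    run = OxmlElement('w:r')
    run.text = text
    hyperlink.append(run)
    paragraph._p.append(hyperlink)

document = docx.Document()
target = document.add_paragraph('Image page content')
add_bookmark(target, 'page5')
p = document.add_paragraph('Jump: ')
add_internal_hyperlink(p, 'Go to page 5', 'page5')
document.save('demo_bookmark.docx')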
I am trying to extract the most recent headlines from the following news site:
http://news.sina.com.cn/hotnews/
import time
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

# save ids of relevant buttons that need to be clicked on the site
buttons_ids = ['Tab21', 'Tab22', 'Tab32']

# save ids of relevant subsections
con_ids = ['Con11']

# start webdriver, go to site, hover over buttons
driver = webdriver.Chrome()
driver.get("http://news.sina.com.cn/hotnews/")
time.sleep(3)

for button_id in buttons_ids:
    button = driver.find_element_by_id(button_id)
    ActionChains(driver).move_to_element(button).perform()
Then I iterate through each section that I am interested in, and within each section through all the headlines, which are rows in an HTML table. However, on every iteration it returns the first element:
for con_id in con_ids:
    for news_id in range(2, 10):
        print(news_id)
        headline = driver.find_element_by_xpath("//div[@id='" + con_id + "']/table/tbody/tr[" + str(news_id) + "]")
        text = headline.find_element_by_xpath("//td[2]/a")
        print(text.get_attribute("innerText"))
        print(text.get_attribute("href"))
        com_no = headline.find_element_by_xpath("//td[3]/a")
        print(com_no.get_attribute("innerText"))
I also tried the following approach by essentially saving the table as a list and then iterating through the rows:
for con_id in con_ids:
    table = driver.find_elements_by_xpath("//div[@id='" + con_id + "']/table/tbody/tr")
    for headline in table:
        text = headline.find_element_by_xpath("//td[2]/a")
        print(text.get_attribute("innerText"))
        print(text.get_attribute("href"))
        com_no = headline.find_element_by_xpath("//td[3]/a")
        print(com_no.get_attribute("innerText"))
In the second case I get exactly the number of headlines in the section, so it apparently picks up the number of rows correctly. However, it still only returns the first row on every iteration. Where am I going wrong? I know a similar question has been asked here (Selenium Python iterate over a table of rows it is stopping at the first row), but I am still unable to figure out my mistake.
In XPath, queries that begin with // will search relative to the document root; so even though you're calling find_element_by_xpath() on the correct container element, you're breaking out of that scope, thereby performing the same global search and yielding the same result every time.
To constrain your query to descendants of the current element, begin your query with .//, e.g.,:
text = headline.find_element_by_xpath(".//td[2]/a")
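Applied to the second approach from the question, the whole loop becomes (a sketch under the same assumptions as the code above):

for con_id in con_ids:
    rows = driver.find_elements_by_xpath("//div[@id='" + con_id + "']/table/tbody/tr")
    for headline in rows:
        # the leading dot keeps the search scoped to the current row
        text = headline.find_element_by_xpath(".//td[2]/a")
        print(text.get_attribute("innerText"))
        print(text.get_attribute("href"))
        com_no = headline.find_element_by_xpath(".//td[3]/a")
        print(com_no.get_attribute("innerText"))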
try this:
for con_id in con_ids:
    for news_id in range(2, 10):
        print(news_id)
        print("(//div[@id='" + con_id + "']/table/tbody/tr)[" + str(news_id) + "]")
        headline = driver.find_element_by_xpath("(//div[@id='" + con_id + "']/table/tbody/tr)[" + str(news_id) + "]")
        value = headline.find_element_by_xpath(".//td[2]/a")
        print(value.get_attribute("innerText").encode('utf-8'))
I am able to get the headlines with the above code.
I was able to solve it by specifying the entire XPath in one go like this:
headline = driver.find_element_by_xpath("(//*[@id='" + con_id + "']/table/tbody/tr[" + str(news_id) + "]/td[2]/a)")
print(headline.get_attribute("innerText"))
print(headline.get_attribute("href"))
rather than splitting it into two parts.
My only explanation for why it only prints the first row repeatedly is that there is some odd JavaScript at work that doesn't let you iterate properly when splitting the request.
Or my first version had a syntax error which I am not aware of.
If anyone has a better explanation, I'd be glad to hear it!
I'm crawling a professor's webpage.
Under her research description, there are two hyperlinks: "TEDx UCL" and "here".
I use an XPath like '//div[@class="group"]//p/text()' to get the first 3 paragraphs, and '//div[@class="group"]/text()' to get the last paragraph along with some newlines, but those can be cleaned easily.
The problem is that the last paragraph then contains only text; the hyperlinks are lost. Though I can extract them separately, it is tedious to put them back in their corresponding positions.
How can I get all the text and keep the hyperlinks?
You can use html2text. Note that handle() expects an HTML string, so select the element's markup rather than bare text nodes:

import html2text

# select the div's markup; html2text needs HTML, not extracted text nodes
sample = response.xpath('//div[@class="group"]').get()
converter = html2text.HTML2Text()
converter.ignore_links = False  # keep the link targets in the converted output
print(converter.handle(sample))
Try this:
'//div[@class="group"]/p//text()[normalize-space(.)]'
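Used from a Scrapy callback, that might look like this (a quick sketch; getall() returns every non-blank text node under the paragraphs, anchor text included, in document order):

texts = response.xpath('//div[@class="group"]/p//text()[normalize-space(.)]').getall()
print(' '.join(t.strip() for t in texts))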
The aim of this is to run through the contents of an HTML file and scout out all the img, input, and area tags, sorting them by whether or not they have "alt" as one of their attributes. I've written the following for it. The libraries I'm using in Python are BeautifulSoup for extraction and urllib for opening the URL. Posting only the relevant part.
alttrue = altfalse = []
multimedialist = ['img', 'input', 'area']

for tag in multimedialist:
    for incodetag in soup.findAll(tag):
        if incodetag.get('alt') is None:
            altfalse.append(incodetag)
        else:
            alttrue.append(incodetag)

print(alttrue)
print(altfalse)
In the end, the code is able to find all the img, input, and area tags, but when I print out alttrue and altfalse, both contain the same img/input/area tags, even when there isn't an alt attribute in them!
Also, another question: in Django, I'm returning these two lists to a calling function in my views.py. I'm putting those 2 lists, as well as a bunch of other lists, into a variable and passing that variable to an HTML page using the render function. In my HTML file, I'm using a for loop to iterate over all the lists I received from views.py and print them out. However, these 2 lists in particular show up as blank lists ([]) on the HTML page. But if I print the variable on the page without using a for loop for each element, then it prints. There isn't a problem with how I'm passing the lists from views.py to the HTML page, because the others are working just fine. Why is it happening with these two?
The alttrue and altfalse variables are both pointing at the same list, so appends to one will affect the other as well. You should create two separate lists:
alttrue = []
altfalse = []
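A quick illustration of the aliasing, in plain Python:

a = b = []            # one list object, two names
a.append('img tag')
print(b)              # ['img tag']: b sees the append made through a
print(a is b)         # True, they are the same object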