What I have is a column in Excel with a list of URLs like this:
first link
second link
...
What I would like is to separate the "text" from the "URL" like this:
first link | http://www.example.com/1
second link | http://www.example.com/2
...
I'm using LibreOffice, but I'd also accept an answer for Google Sheets or even a Python script.
Simple three-step solution (for Google Sheets):
1. Copy and 'Paste values only' to get the text (or use concatenate-with-nothing).
2. Check this answer for a custom function to get the URL.
3. Concatenate the text and the URL.
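Since the question also accepts a Python script, here is a minimal sketch with openpyxl, assuming the links are real hyperlink cells in column A (the filename file.xlsx is a placeholder):
from openpyxl import load_workbook

wb = load_workbook("file.xlsx")  # hypothetical filename
ws = wb.active

for row in ws.iter_rows(min_col=1, max_col=1):
    cell = row[0]
    if cell.hyperlink:  # only cells that actually carry a hyperlink
        print(cell.value, "|", cell.hyperlink.target)
Note this reads hyperlinks attached to cells; links built with the HYPERLINK() formula would need different handling.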
I have a problem accessing some values on a website while web scraping. The text I want to extract is in a class that contains several pieces of text separated by <b> tags (the text around those tags is also important to me).
So first I tried to look for the <b> tag with the text I needed ('Category' in this case) and then extract the exact category from the text that follows it. A precise XPath won't work here, because other pages I need to scrape have a different number of rows in this sidebar, so the locations, and therefore the XPaths, differ.
The expected output is 'utility' - the category in the sidebar.
The website and the text I need to extract look like this (see the sidebar containing 'Category'):
The element looks like this:
And the code I tried:
from selenium import webdriver

driver = webdriver.Safari()
driver.get('https://www.statsforsharks.com/entry/MC_Squares')
element = driver.find_elements_by_xpath("//b[contains(text(), 'Category')]/following-sibling")
for value in element:
    print(value.text)
driver.close()
The link to the page with the data is https://www.statsforsharks.com/entry/MC_Squares.
Thank you!
You might be better off using regex here, as the whole text comes under the 'company-sidebar-body' class, where only some of the text is between <b> tags and some is not.
So, you can get the text of the element first:
sidebartext = driver.find_element_by_class_name("company-sidebar-body").text
That will give you the following:
"EOY Proj Sales: $1,000,000\r\nSales Prev Year: $200,000\r\nCategory: Utility\r\nAsking Deal\r\nEquity: 10%\r\nAmount: $300,000\r\nValue: $3,000,000\r\nEquity Deal\r\nSharks: Kevin O'Leary\r\nEquity: 25%\r\nAmount: $300,000\r\nValue: $1,200,000\r\nBite: -$1,800,000"
You can then use regex to target the category:
import re
c = re.search(r"Category:\s\w+", sidebartext).group()
print(c)
c will result in 'Category: Utility' which you can then work with. This will also work if the value of the category ('Utility') is different on other pages.
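If you only want the value itself, a capture group on the same search keeps the follow-up minimal (this reuses sidebartext from above):
m = re.search(r"Category:\s(\w+)", sidebartext)
if m:
    print(m.group(1))  # 'Utility'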
There are easier ways when it's a MediaWiki website. You could, for instance, access the page data through the API with a JSON request and parse it with a much more limited DOM.
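For illustration, a sketch of that idea, assuming the site exposes the standard MediaWiki api.php endpoint (the exact endpoint path is an unverified assumption):
import requests

api = "https://www.statsforsharks.com/api.php"  # hypothetical endpoint location
params = {"action": "parse", "page": "MC_Squares", "format": "json"}
data = requests.get(api, params=params).json()
html = data["parse"]["text"]["*"]  # rendered page HTML, ready for a lightweight parser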
Any particular reason you want to scrape my website?
I'm crawling a professor's webpage.
Under her research description, there are two hyperlinks, "TEDx UCL" and "here".
I use an XPath like '//div[@class="group"]//p/text()' to get the first three paragraphs, and '//div[@class="group"]/text()' to get the last paragraph, which comes with some newlines that can be cleaned easily.
The problem is that the last paragraph contains only text; the hyperlinks are lost. Though I can extract them separately, it is tedious to put them back in their corresponding positions.
How can I get all the text and keep the hyperlinks?
You can use html2text.
import html2text

# Take the HTML of the node (not just its text nodes) so the <a> tags survive
sample = response.xpath('//div[@class="group"]').get()
converter = html2text.HTML2Text()
converter.ignore_links = False  # keep hyperlinks as [text](url)
print(converter.handle(sample))
Try this:
'//div[@class="group"]/p//text()[normalize-space(.)]'
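In Scrapy that could be used like this (a sketch, assuming the response object from the question's crawl):
# Every non-empty text node under the paragraphs, anchor text included
parts = response.xpath('//div[@class="group"]/p//text()[normalize-space(.)]').getall()
print(" ".join(p.strip() for p in parts))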
I want to know how I can collect mailto links using Selenium in Python. The emails contain an @ sign. On the contact page I tried the following XPath, but it works on some pages and not on others:
//*[contains(text(),"@")]
The email formats differ: sometimes it is <p>Email: name@domain.com</p> or <span>Email: name@domain.com</span>, or just name@domain.com.
Is there any way to collect them all with one statement?
Thanks
Here is the XPath you are looking for my friend.
//*[contains(text(),"#")]|//*[contains(#href,"#")]
You could create a collection of the link text values that contain @ on the page and then iterate through them to format. You are going to have to format spans like <span>Email: name@domain.com</span> anyway.
Use find_elements_by_partial_link_text to make the collection.
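A minimal sketch of that approach, reusing the driver from the sketch above (note that find_elements_by_partial_link_text only matches <a> elements, so plain-text emails in <p> or <span> tags will be missed):
# Collect only links whose visible text contains "@"
for link in driver.find_elements_by_partial_link_text("@"):
    print(link.text, "->", link.get_attribute("href"))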
I think you need two XPaths: a first XPath to find elements whose text contains "Email:", and a second for elements whose href attribute contains "mailto:".
//*[contains(text(),"Email:")]|//*[contains(#href,"mailto:")]
I wish to use ElementMaker in lxml to build an XML representation of an Excel spreadsheet with the corresponding nesting. I would like:
<excelbook>
  workbook info
  <excelsheet>
    <sheetname>Sheet1</sheetname>
    <exceltable>
      <numrows>10</numrows>
      <numcols>13</numcols>
    </exceltable>
    <exceltable>
      <numrows>10</numrows>
      <numcols>13</numcols>
    </exceltable>
  </excelsheet>
  <excelsheet>
    <sheetname>Sheet2</sheetname>
    <exceltable>
      <numrows>10</numrows>
      <numcols>13</numcols>
    </exceltable>
    <exceltable>
      <numrows>10</numrows>
      <numcols>13</numcols>
    </exceltable>
  </excelsheet>
</excelbook>
My Python code looks like the following:
for excelSheet in excelBook.excelSheets:
    for excelTable in excelSheet.excelTables:
        exceltable = E.exceltable(
            E.num_rows(str(excelTable.num_rows)),
            E.num_cols(str(excelTable.num_cols)),
        )
    excelsheet = E.excelsheet(
        exceltable,
        E.sheetname(excelSheet.sheetName),
    )
excelbook = E.excelbook(
    excelsheet,
    E.bookname(fullpathname),
    E.numSheets(str(excelBook.numSheets)))
root = E.XML(excelbook)
The problem is that I can only nest one sheet inside each book and one table inside each sheet. How do I change the code to allow multiple sheets in each book and multiple tables inside each sheet?
You can't nest a tag with the same tag name twice under the same root and expect them to be appended one after another rather than the second one overriding the first, because this is not very coherent.
For example if you have:
<root>
<elem1>some text 1</elem1>
</root>
If you try to append another "elem1" tag to root, it will override the existing one.
I'd suggest you use indexing ("elem_0", "elem_1", "elem_2", etc.), as in the sketch below.
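A minimal sketch of that indexing idea with lxml's ElementMaker, using a toy dict as a stand-in for the question's excelBook data (the dict and its values are hypothetical):
from lxml import etree
from lxml.builder import ElementMaker

E = ElementMaker()

# sheet name -> list of (num_rows, num_cols) per table
book = {"Sheet1": [(10, 13), (10, 13)], "Sheet2": [(10, 13), (10, 13)]}

sheets = []
for i, (name, tables) in enumerate(book.items()):
    children = [
        E("exceltable_%d" % j, E.numrows(str(r)), E.numcols(str(c)))
        for j, (r, c) in enumerate(tables)
    ]
    sheets.append(E("excelsheet_%d" % i, E.sheetname(name), *children))

excelbook = E.excelbook(*sheets)
print(etree.tostring(excelbook, pretty_print=True).decode())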
I have a set of names in a text file. My purpose is to search for each name (each row in the text file) on Google and record the very first link that appears in the Google search results. Is there any way to automatically execute this process with a script? Otherwise I have to type 1000 names one by one into Google and list the first link :(
Is there a way? Yes. Is there a super quick and easy way? I'm not too sure about that.
What I would do, given the task, is use BeautifulSoup4 for web scraping. You could easily iterate through each line in your text file with a loop and then convert each phrase into a Google-URL-friendly form.
For example: take the phrase "This is a test sentence", replace the spaces with "+", and then add it to the end of a default Google search URL, like this:
https://www.google.com/?gws_rd=ssl#q=This+is+a+test+sentence
After that, you find the id or class of the first result link on the Google results page and code your program to return that information.
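A rough sketch of that whole loop with requests and BeautifulSoup4 (the names.txt filename and the div.g selector are assumptions; Google's result markup changes often, and it may block or rate-limit scripted queries):
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}  # Google tends to block non-browser user agents

with open("names.txt") as f:  # hypothetical input file, one name per line
    for line in f:
        name = line.strip()
        url = "https://www.google.com/search?q=" + name.replace(" ", "+")
        soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
        first = soup.select_one("div.g a")  # fragile guess at the first-result link
        if first:
            print(name, "->", first.get("href"))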