I have a set of names in a text file. My goal is to search for each name (one row in the text file) on Google and record the very first link that appears in the search results. Is there any way to automate this process with a script? Otherwise I have to type 1000 names into Google one by one and list the first link :(
Is there a way? Yes. Is there a super quick and easy way? I'm not too sure about that.
What I would do, given the task, is use BeautifulSoup4 for the web scraping. You could easily iterate through each line in your text file with a loop and convert each phrase into a Google-URL-friendly form.
EX: Take the phrase "This is a test sentence", replace the spaces with "+", and append it to the end of a default Google search URL. Like this:
https://www.google.com/?gws_rd=ssl#q=This+is+a+test+sentence
After that, find the id or class of the first result link on a Google results page and code your program to return that information.
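For what it's worth, here is a minimal sketch of that idea with requests and BeautifulSoup4. The file name, the User-Agent header, and especially the result selector are assumptions: Google changes its markup often and may block or rate-limit automated queries, so verify the selector in your browser first.

import urllib.parse
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0"}  # bare requests are usually blocked

with open("names.txt") as f:  # one name per row, as described
    for line in f:
        name = line.strip()
        if not name:
            continue
        url = "https://www.google.com/search?q=" + urllib.parse.quote_plus(name)
        soup = BeautifulSoup(requests.get(url, headers=HEADERS).text, "html.parser")
        first = soup.select_one("div#search a[href]")  # assumed selector
        print(name, "->", first["href"] if first else "nothing found")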
I am trying to automate a process using Selenium on Python.
I need to search for an account in the search bar. The value to be entered in the search bar will be stored in an Excel file and updated every time a new account is to be searched.
inputElement = driver.find_element_by_id("ctl00_tbxAcctSearch")
inputElement.send_keys('Value to be added here to search which is stored in excel')
I have the idea that I need to get the value stored in a variable and then pass that variable to inputElement.send_keys(...), but I do not know how to do it.
I think if you have to search for an account periodically, you have to iterate over the entries in that Excel file.
So in the idea you mentioned, that variable is really a list. Store the entries from your Excel file in a list, say account_to_search_list, then iterate over it:
for account in account_to_search_list:
    # do your standard Selenium element clicking/existence checking here, e.g.:
    inputElement = driver.find_element_by_id("ctl00_tbxAcctSearch")
    inputElement.send_keys(account)  # pass the value itself, not a string literal
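To build that list from the Excel file, openpyxl is one option. A minimal sketch, assuming the accounts sit in the first column of the active sheet of accounts.xlsx (the file name, sheet, and column are all assumptions):

from openpyxl import load_workbook

wb = load_workbook("accounts.xlsx")  # hypothetical file name
ws = wb.active
account_to_search_list = [
    row[0]
    for row in ws.iter_rows(min_row=2, values_only=True)  # skip a header row
    if row[0] is not None
]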
I'm crawling a professor's webpage.
Under her research description, there are two hyperlinks, "TEDx UCL" and "here".
I use an XPath like '//div[@class="group"]//p/text()' to get the first three paragraphs, and '//div[@class="group"]/text()' to get the last paragraph with some newlines, but those can be cleaned easily.
The problem is that the last paragraph contains only text; the hyperlinks are lost. Though I can extract them separately, it is tedious to put them back in their corresponding positions.
How can I get all the text and keep the hyperlinks?
You can use html2text. Feed it the HTML of the node rather than bare text nodes, so the links survive:

import html2text

sample = response.xpath('//div[@class="group"]').extract_first()
converter = html2text.HTML2Text()
converter.ignore_links = False  # keep the hyperlinks in the output
text = converter.handle(sample)
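html2text emits Markdown, so with links enabled each hyperlink should come back inline as [anchor text](url), in its original position in the paragraph, instead of having to be re-inserted by hand.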
Try this:
'//div[@class="group"]/p//text()[normalize-space(.)]'
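In the scrapy shell that looks like the line below; it should return every non-blank text node under the paragraphs, including the anchor text of the hyperlinks, in document order:

response.xpath('//div[@class="group"]/p//text()[normalize-space(.)]').extract()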
I'm new to scrapy and XPath but have been programming in Python for some time. I would like to get the email, the name of the person making the offer, and the phone number from the page https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ using scrapy. As you see, the email and phone are provided as text inside the <p> tag, which makes them hard to extract.
My idea is to first get the text inside the Job Overview, or at least all the text describing this particular job, and use a regex to get the email, the phone number and, if possible, the name of the person.
So I fired up the scrapy shell using the command scrapy shell https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ and got the response from there.
Now I try to get all the text from the div job_description, but I actually get nothing. I used
full_des = response.xpath('//div[@class="job_description"]/text()').extract()
It returns [u'\t\t\t\n\t\t ']
How do I get all the text from the page mentioned? Obviously, the task afterwards will be to extract the attributes mentioned before, but first things first.
Update: This selection only returns []: response.xpath('//div[@class="job_description"]/div[@class="container"]/div[@class="row"]/text()').extract()
You were close with
full_des = response.xpath('//div[@class="job_description"]/text()').extract()
The div-tag actually does not have any text besides what you get.
<div class="job_description" (...)>
"This is the text you are getting"
<p>"This is the text you want"</p>
</div>
As you see, the text you are getting with response.xpath('//div[@class="job_description"]/text()').extract() is the text directly inside the div tag, not the text inside the tags nested within it. For that you would need:
response.xpath('//div[@class="job_description"]//*/text()').extract()
What this does is select all the child nodes of div[@class="job_description"] and return their text (see here for what the different XPath expressions do).
You will see that this also returns a lot of useless text, since you are still getting all the \n characters and such. I suggest narrowing your XPath down to the element you want instead of taking such a broad approach.
For example the entire job description would be in
response.xpath('//div[@class="col-sm-5 justify-text"]//*/text()').extract()
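From there, a rough regex pass along the lines you described could look like this. The patterns are deliberately loose assumptions; real-world email and phone formats vary a lot:

import re

text = " ".join(response.xpath('//div[@class="col-sm-5 justify-text"]//*/text()').extract())
email = re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)
phone = re.search(r"\+?\d[\d ()/-]{6,}\d", text)
print(email.group() if email else "no email found")
print(phone.group() if phone else "no phone found")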
What I have is a column in Excel with a list of URLs like this:
first link
second link
...
What I would like is to separate the "text" from the "URL" like this:
first link | http://www.example.com/1
second link | http://www.example.com/2
...
I'm using LibreOffice, but I'd also accept an answer for Google Spreadsheets or even a Python script.
A simple three-step solution (for Google Sheets):
just copy and 'paste values only' to get the text (or use concatenate-with-nothing)
check this answer for a custom function to get the url
concatenate text and url.
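Since you said a Python script would also be acceptable: if the file is (or can be saved as) .xlsx, openpyxl exposes each cell's hyperlink target directly. A minimal sketch, assuming the links sit in column A of links.xlsx (the file name and column are assumptions):

from openpyxl import load_workbook

wb = load_workbook("links.xlsx")  # hypothetical file name
for cell in wb.active["A"]:
    if cell.hyperlink is not None:
        print(cell.value, "|", cell.hyperlink.target)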
I want to know how I can collect mailto links using Selenium in Python. The emails contain an @ sign on the contact page. I tried the following XPath, but it works on some pages and not on others:
//*[contains(text(),"@")]
The email formats differ: sometimes it is <p>Email: name@domain.com</p>, or <span>Email: name@domain.com</span>, or just name@domain.com.
Is there any way to collect them all with one statement?
Thanks
Here is the XPath you are looking for, my friend.
//*[contains(text(),"@")]|//*[contains(@href,"@")]
You could create a collection of the link text values on the page that contain @ and then iterate through it to format them. You are going to have to format the <span> that has Email: name@domain.com anyway.
Use find_elements_by_partial_link_text to make the collection.
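A minimal sketch of that approach, combining the XPath above with a regex cleanup (the URL is a placeholder and the driver setup is an assumption):

import re
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://example.com/contact")  # placeholder URL

# elements whose text or href contains "@" (covers <p>, <span>, and mailto links)
elements = driver.find_elements_by_xpath('//*[contains(text(),"@")] | //*[contains(@href,"@")]')

emails = set()
for el in elements:
    source = (el.get_attribute("href") or "") + " " + el.text
    emails.update(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", source))
print(emails)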
I think you need two XPaths: the first to find elements that contain the text "Email:", and the second for elements whose href attribute contains "mailto:".
//*[contains(text(),"Email:")]|//*[contains(@href,"mailto:")]