Generate a list of web queries - python

I'm pretty new at this and I'm trying to figure out a way to look up a list of websites automatically. I have a very large list of companies and essentially I'd want the algorithm to type the company into Google, click the first link (most likely the company website) and figure out whether the company matches the target industry (ice cream distributors) or has anything to do with the industry. The way I'd want to check for this is by seeing if the home page contains any of the key words in a given dictionary (let's say, 'chocolate, vanilla, ice cream, etc'). I would really appreciate some help with this - thank you so much.

I recommend using a combination of requests and lxml. To accomplish this, you could do something like the following.
import requests
from lxml.cssselect import CSSSelector
from lxml import html
Use requests (or grequests) to get the HTML of a Google results page for each query:
queries = ['cats', 'dogs']
responses = [requests.get('https://www.google.com/search?q=' + q) for q in queries]
data = [r.text for r in responses]
Parse the HTML with lxml and extract the first result link on each page:
data = [html.document_fromstring(x) for x in data]
sel = CSSSelector('h3.r a')
links = [sel(x)[0] for x in data]
Finally, grab the HTML from each of those first results:
pages = [requests.get(a.attrib['href']) for a in links]
This will give you an HTML string for each of the pages you want. From there you should be able to simply search for the words you want in each page's HTML. You might find collections.Counter helpful.
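A minimal sketch of that last keyword-counting step, assuming pages holds the responses from above and the short keyword list stands in for your industry dictionary:

from collections import Counter

keywords = ['chocolate', 'vanilla', 'ice cream']

for page in pages:
    text = page.text.lower()
    # count occurrences of each keyword in the page's HTML
    counts = Counter({kw: text.count(kw) for kw in keywords})
    if any(counts.values()):
        print(page.url, 'looks like a match:', dict(counts))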


How to access a text element in Selenium if it is split by b tags

I have a problem trying to access some values on a website while web scraping. The text I want to extract is in a class that contains several pieces of text separated by b tags (those b tags also hold text that is important to me).
So first, I tried to look for the b tag with the text I needed ('Category' in this case) and then extract the exact category from the text following it. I could use a precise XPath, but that won't work here because the other pages I need to scrape have a different number of rows in this sidebar, so the locations, and therefore the XPaths, differ.
The expected output is 'Utility', the category shown in the page's right-hand sidebar.
And the code I tried:
from selenium import webdriver

driver = webdriver.Safari()
driver.get('https://www.statsforsharks.com/entry/MC_Squares')
element = driver.find_elements_by_xpath("//b[contains(text(), 'Category')]/following-sibling")
for value in element:
    print(value.text)
driver.close()
The link to the page with the data is https://www.statsforsharks.com/entry/MC_Squares.
Thank you!
You might be better off using regex here, since all the text sits in the 'company-sidebar-body' class, where only some of it is between b tags and some is not.
So, you can get the text of that element first:
sidebartext = driver.find_element_by_class_name("company-sidebar-body").text
That will give you the following:
"EOY Proj Sales: $1,000,000\r\nSales Prev Year: $200,000\r\nCategory: Utility\r\nAsking Deal\r\nEquity: 10%\r\nAmount: $300,000\r\nValue: $3,000,000\r\nEquity Deal\r\nSharks: Kevin O'Leary\r\nEquity: 25%\r\nAmount: $300,000\r\nValue: $1,200,000\r\nBite: -$1,800,000"
You can then use regex to target the category:
import re

c = re.search(r"Category:\s\w+", sidebartext).group()
print(c)
c will result in 'Category: Utility', which you can then work with. This will also work if the value of the category ('Utility') is different on other pages.
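As a small follow-up using the same pattern, a capture group gets you just the value without any string splitting:

m = re.search(r"Category:\s(\w+)", sidebartext)
print(m.group(1))  # 'Utility'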
There are easier ways when it's a MediaWiki website. You could, for instance, fetch the page data through the API with a JSON request and parse that instead of walking the full DOM.
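A hypothetical sketch of that approach, assuming the site exposes the standard MediaWiki api.php endpoint (the exact path is an assumption and may differ):

import requests

resp = requests.get(
    'https://www.statsforsharks.com/api.php',  # assumed endpoint location
    params={'action': 'parse', 'page': 'MC_Squares', 'format': 'json'},
)
page_html = resp.json()['parse']['text']['*']  # rendered HTML of the page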
Any particular reason you want to scrape my website?

Extract information from Twitter's anonymous search page

Considering the call limits of the Twitter API, I am looking for ways to get search results without having an account/app. I have realized that this URL
https://twitter.com/search?f=tweets&q=<keyWord1>%20<keyWord2>%20<keyWord3>&src=typd&lang=en
where <keyWord1>%20<keyWord2>%20<keyWord3> are the search terms, indeed returns a page (for example, this one) with the information embedded in the HTML:
<div class="js-tweet-text-container">
<p class="TweetTextSize js-tweet-text tweet-text" lang="en" data-aria-label-part="0">tweetText..</p>
</div>
I can extract the page using this snippet:
import requests

def srch(*keyWords):
    string = "%20".join(keyWords)
    url = 'https://twitter.com/search?f=tweets&q=' + string + '&src=typd&lang=en'
    return requests.get(url)
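For example, a call like this (the keywords are just placeholders) returns the response whose .text holds the HTML:

resp = srch('python', 'scraping')
html_text = resp.text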
Now my questions are:
What is the best way to extract this information: regular expressions (the re module), BeautifulSoup, ...?
What information can be extracted? The tweet's text, user ID/name, date and time, and the number of likes/retweets/comments are visible on that page, so they should be extractable?
How many tweets can be extracted in one request or in a certain time span? Is there any rate limit, for example, on the requests module calling that page and extracting the HTML? Is it possible that they block certain IPs?
I would appreciate it if you could give an example of how this should be done.
Try Kenneth Reitz's package twitter-scraper (https://github.com/kennethreitz/twitter-scraper). You get to scrape Twitter without the fuss.
Btw: Kenneth is the author of the requests package. Everything he makes is awesome.
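A quick sketch along the lines of the package's documented usage; get_tweets takes a username here (whether it accepts arbitrary search queries is an assumption worth checking):

from twitter_scraper import get_tweets

# iterate over recent tweets from one account, one page of results
for tweet in get_tweets('kennethreitz', pages=1):
    print(tweet['text'])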
It's easy using BeautifulSoup, and faster using re, but the regex route may be harder to get right.
To see what information you can get, look inside each li.js-stream-item element.
You can extract 20 tweets per request without pagination.
Example code:
from bs4 import BeautifulSoup

# build a soup object from the response returned by srch() above
soup = BeautifulSoup(srch('python', 'scraping').text, 'html.parser')

tweets = soup.select('li.js-stream-item')
for tweet in tweets:
    name = tweet.select_one('.FullNameGroup strong')
    text = tweet.select_one('p.TweetTextSize')
    timeStamp = tweet.select_one('a.tweet-timestamp').get('title')

Python Web Scraping with lxml

I am trying to scrape the column names (Player, Cost, Sel., Form, Pts) from the page below:
https://fantasy.premierleague.com/a/statistics/total_points
However, I am failing to do so.
Before I go further, let me show you what I have done.
from lxml import html
import requests
page = 'https://fantasy.premierleague.com/a/statistics/total_points'
#Take site and structure html
page = requests.get(page)
tree = html.fromstring(page.content)
#Using the page's CSS classes, extract all links pointing to a team
Location = tree.cssselect('.ism-thead-bold tr .ism-table--el-stats__name')
When I do this, Location should be a list that contains a string "Player".
However, it returns an empty list which means cssselect did not capture anything.
Though each column name has a different th class, I used one of them (ism-table--el-stats__name) for this particular trial just to keep it simple.
Once this problem is fixed, I want to use regex, since every class has a different suffix after the two underscores.
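For what it's worth, a minimal sketch of that second task without regex, using cssselect's attribute prefix selector; note it will still come back empty if the table is rendered client-side by JavaScript, which the empty result above suggests:

# match every th whose class starts with the shared stats prefix
headers = tree.cssselect('th[class^="ism-table--el-stats__"]')
names = [th.text_content().strip() for th in headers]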
If anyone can help me with these two tasks, I would really appreciate it! Thank you.

Advice about using a loop through parameters of a get request

I am trying to get each runner's information from this 2017 marathon. The problem is that to get the information I want, I would have to click on each runner's name to get their partial splits.
I know that I can use a GET request to get each runner's information. For example, for the runner Josh Griffiths I can call requests.get with the parameters in the URL.
My problem is that I don't know how to figure out the idp term, because it changes with every runner.
My questions are the following:
Is it possible to use a loop to get all runners' information using a GET request? How can I solve the issue with the idp? I mean, I don't know how this term is determined, so I don't know how to define a loop around it.
Is there a better method to get each runner's information? I thought about using Selenium WebDriver, but this would be very slow.
Any advice would be appreciated!
You will need to use something like BeautifulSoup to parse the HTML for the links you need; that way there is no need to figure out how to construct the request.
import requests
from bs4 import BeautifulSoup

base_url = "http://results-2017.virginmoneylondonmarathon.com/2017/"
r = requests.get(base_url + "?pid=list")
soup = BeautifulSoup(r.content, "html.parser")
tbody = soup.find('tbody')

for tr in tbody.find_all('tr'):
    for a in tr.find_all('a', href=True, class_=None):
        # Print the runner's name; [1:] drops a leading character from the cell text
        print()
        print(a.parent.get_text(strip=True)[1:])
        # Follow the runner's link and parse their page
        r_runner = requests.get(base_url + a['href'])
        soup_runner = BeautifulSoup(r_runner.content, "html.parser")
        # Find the start of the splits
        for h2 in soup_runner.find_all('h2'):
            if "Splits" in h2.text:
                splits_table = h2.find_next('table')
                splits = []
                for tr_split in splits_table.find_all('tr'):
                    splits.append([td.text for td in tr_split.find_all('td')])
                for row in splits:
                    print('  {}'.format(', '.join(row)))
                break
For each link, the script then follows it and parses the splits from the returned HTML. The output starts as follows:
Boniface, Anna (GBR)
5K, 10:18:05, 00:17:55, 17:55, 03:35, 16.74, -
10K, 10:36:23, 00:36:13, 18:18, 03:40, 16.40, -
15K, 10:54:53, 00:54:44, 18:31, 03:43, 16.21, -
20K, 11:13:25, 01:13:15, 18:32, 03:43, 16.19, -
Half, 11:17:31, 01:17:21, 04:07, 03:45, 16.04, -
25K, 11:32:00, 01:31:50, 14:29, 03:43, 16.18, -
30K, 11:50:44, 01:50:34, 18:45, 03:45, 16.01, -
35K, 12:09:34, 02:09:24, 18:51, 03:47, 15.93, -
40K, 12:28:43, 02:28:33, 19:09, 03:50, 15.67, -
Finish, 12:37:17, 02:37:07, 08:35, 03:55, 15.37, 1
Griffiths, Josh (GBR)
5K, 10:15:52, 00:15:48, 15:48, 03:10, 18.99, -
10K, 10:31:42, 00:31:39, 15:51, 03:11, 18.94, -
....
To better understand how this works, first take a look at the HTML source of each of the pages. The idea is to find something unique in the page's structure about what you are looking for, so a script can extract it.
Next, I would recommend reading through the documentation for BeautifulSoup. This assumes you understand the basic structure of an HTML document. The library gives you many tools to help you search for and extract elements from the HTML, for example, finding where the links are. Not all webpages can be parsed like this, as the information is often generated by JavaScript; in those cases you would need something like Selenium, but here requests and BeautifulSoup do the job nicely.

Python - Unable to retrieve complete text data for 1 or more pages

I'm a newbie in Python programming and I am facing the following issue:
Objective: I need to scrape the Freelancer website and store the list of users, along with their attributes (score, ratings, reviews, details, rate, etc.), in a file. I have the following code, but I am not able to get all the users. Also, the output changes between runs of the program.
import requests
from bs4 import BeautifulSoup

pages = 1
url = 'https://www.freelancer.com/freelancers/skills/all/' + str(pages) + '/'
r = requests.get(url)

# Parse the HTML content into a soup object
soup = BeautifulSoup(r.content, "html.parser")
links = soup.findAll("a")  # (unused here)

# Find the freelancer-details nodes
c_data = soup.findAll("div", {"class": "freelancer-details"})

with open('freelancers.txt', 'w') as fileWriter:
    for item in c_data:
        print(item.text)
        # Write the result into the text file
        fileWriter.write('Freelancers Details:' + item.text + '\t')
I need the details grouped under each specific user, but so far the output looks dispersed.
Sample Output:
Freelancers Details:
thetechie13
507 Reviews
$20 USD/hr
Top Skills:
Website Design,
HTML,
PHP,
eCommerce,
Volusion
Dear Customer - We are a team of 75 Most Creative People and proud to be
Preferred Freelancer on Freelancer.com. We offer wide range of web
solutions and IT services that are bespoke in nature, can best fit our
clients' business needs and provide them cost benefits.
If you want each individual text component on its own (each assigned a different name), I would advise you to parse the text from the HTML separately. However, if you want it all grouped together, you could join the strings:
print(' '.join(item.text.split()))
This places a single space between each word.
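For example, assuming item.text came back with the embedded newlines shown in the sample output above:

text = "thetechie13\n507 Reviews\n  $20 USD/hr"
print(' '.join(text.split()))
# -> thetechie13 507 Reviews $20 USD/hr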
