Considering the call limits of the Twitter API, I am looking for ways to get search results without having an account/app. I have realized that this URL
https://twitter.com/search?f=tweets&q=<keyWord1>%20<keyWord2>%20<keyWord3>&src=typd&lang=en
where <keyWord1>%20<keyWord2>%20<keyWord3> is the search query, does return a page (for example this) with the information embedded in the HTML like this:
<div class="js-tweet-text-container">
<p class="TweetTextSize js-tweet-text tweet-text" lang="en" data-aria-label-part="0">tweetText..</p>
</div>
I can extract the page using this snippet:
#%%
import requests
def srch(*keyWords):
    string = "%20".join(keyWords)
    url = 'https://twitter.com/search?f=tweets&q=' + string + '&src=typd&lang=en'
    return requests.get(url)
Now my questions are:
What is the best way to extract this information: regular expressions with the re module, or BeautifulSoup?
What information can be extracted? The tweet's text, user ID/name, date and time, and the number of likes, retweets and comments are visible on that page and should presumably be extractable.
How many tweets can be extracted in one request or within a certain time span? Is there a rate limit, for example when the requests module calls that page and I extract the HTML? Could they block certain IPs?
I would appreciate it if you could give an example of how this should be done.
Try Kenneth Reitz's package twitter-scraper (https://github.com/kennethreitz/twitter-scraper). You get to scrape Twitter without the fuss.
Btw: Kenneth is the author of the requests package. Everything he makes is awesome.
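A minimal usage sketch, based on the package's README as I recall it; the get_tweets signature and the dictionary keys below are assumptions, so double-check them against the repository:
from twitter_scraper import get_tweets

# get_tweets takes a username and yields one dict per tweet; the keys used
# below are assumed from the README and may change, since the package scrapes
# Twitter's HTML rather than calling the API.
for tweet in get_tweets('kennethreitz', pages=1):
    print(tweet['text'], tweet['likes'], tweet['retweets'])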
It is easy with BeautifulSoup; re is faster but may be harder to get right.
To see what information you can get, just look inside each li.js-stream-item element.
It can extract 20 tweets per request without pagination.
Example code:
tweets = soup.select('li.js-stream-item')
for tweet in tweets:
    name = tweet.select_one('.FullNameGroup strong')  # note the leading dot: FullNameGroup is a class
    text = tweet.select_one('p.TweetTextSize')
    timeStamp = tweet.select_one('a.tweet-timestamp').get('title')
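Putting the above together with the srch() function from the question, a minimal end-to-end sketch (this assumes the legacy twitter.com HTML search page and these class names are still being served, which is not guaranteed):
import requests
from bs4 import BeautifulSoup

def srch_tweets(*keyWords):
    url = ('https://twitter.com/search?f=tweets&q=' +
           '%20'.join(keyWords) + '&src=typd&lang=en')
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    results = []
    for tweet in soup.select('li.js-stream-item'):
        name = tweet.select_one('.FullNameGroup strong')
        text = tweet.select_one('p.TweetTextSize')
        stamp = tweet.select_one('a.tweet-timestamp')
        results.append({
            'name': name.get_text(strip=True) if name else None,
            'text': text.get_text(strip=True) if text else None,
            'time': stamp.get('title') if stamp else None,
        })
    return results

print(srch_tweets('python', 'scraping'))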
Related
I am working on a project where I am crawling thousands of websites to extract text data; the end use case is natural language processing.
EDIT: Since I am crawling hundreds of thousands of websites I cannot tailor scraping code for each one, which means I cannot search for specific element IDs. The solution I am looking for is a general one.
I am aware of solutions such as the .get_text() function from Beautiful Soup. The issue with this method is that it gets all the text from the website, much of it irrelevant to the main topic of that particular page. For the most part a page will be dedicated to a single main topic, but on the sides, top and bottom there may be links or text about other subjects, promotions or other content.
The .get_text() function returns all the text on the page in one go; the problem is that it combines everything (the relevant parts with the irrelevant ones). Is there another function similar to .get_text() that returns all the text, but as a list where every list item is a specific section of the text? That way it can be known where new subjects start and end.
As a bonus, is there a way to identify the main body of text on a web page?
Below are some snippets you could use to query the data in the desired way using BeautifulSoup 4 and Python 3:
import requests
from bs4 import BeautifulSoup
response = requests.get('https://yoursite/page')
soup = BeautifulSoup(response.text, 'html.parser')
# Print the body contents as a list
print(soup.body.contents)
# Print the first div found on the html page
print(soup.find('div'))
# Print all divs on the html page as a list
print(soup.find_all('div'))
# Print the element with id 'required_element_id'
print(soup.find(id='required_element_id'))
# Print all html elements (as a list) that match the CSS selector
print(soup.select('your-css-selector'))  # e.g. 'div.some-class'
# Print an attribute's value (for 'class' this is a list)
print(soup.find(id='someid').get("attribute-name"))
# You can also break your one large query into multiple queries
parent = soup.find(id='someid')
# getText() returns the text between the opening and closing tags
print(parent.select(".some-class")[0].getText())
For more advanced requirements you can also check out Scrapy. Let me know if you face any challenge implementing this or if your requirement is something else.
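As for getting the text back as a list of sections rather than one blob, here is a rough, general-purpose sketch; it simply assumes that block-level tags such as headings and paragraphs delimit sections, which will not hold for every site:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://yoursite/page')  # placeholder URL
soup = BeautifulSoup(response.text, 'html.parser')

# One list entry per block-level element, empty strings dropped.
blocks = [el.get_text(' ', strip=True)
          for el in soup.find_all(['h1', 'h2', 'h3', 'p', 'li'])]
blocks = [b for b in blocks if b]
print(blocks)

# For the bonus question: on pages that use semantic HTML, the main body is
# often wrapped in <article> or <main>, so restricting the search to it helps.
main = soup.find('article') or soup.find('main') or soup.body
print([el.get_text(' ', strip=True) for el in main.find_all('p')])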
I'm trying to extract the text from an element whose class value contains compText. The problem is that it extracts everything but the text that I want.
The CSS selector identifies the element correctly when I use it in the developer tools.
I'm trying to scrape the text that appears in Yahoo SERP when the query entered doesn't have results.
If my query is (quotes included) "klsf gl glkjgsdn lkgsdg", nothing is displayed except the complementary text "We did not find results blabla", and the selector extracts the data correctly.
If my query is (quotes included) "based specialty. Blocks. Organosilicone. Reference", Yahoo will add ads because of the keyword "Organosilicone", and that triggers the behavior described in the first paragraph.
Here is the code:
import requests
from bs4 import BeautifulSoup
url = "http://search.yahoo.com/search?p="
query = '"based specialty chemicals. Blocks. Organosilicone. Reference"'
r = requests.get(url + query)
soup = BeautifulSoup(r.text, "html.parser")
for EachPart in soup.select('div[class*="compText"]'):
    print(EachPart.text)
What could be wrong?
Thx,
EDIT: The text extracted seems to be the definition of the word "Organosilicone", which I can find on the SERP.
EDIT2: This is a snippet of the text I get: "The products created and produced by ‘Specialty Chemicals’ member companies, many of which are Small and Medium Enterprises, stem from original and continuous innovation. They drive the low-carbon, resource-efficient and knowledge based economy of the future." and a screenshot of the SERP when I use my browser
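One thing worth checking: the HTML that requests receives is not necessarily the HTML the browser's developer tools show, since Yahoo may vary the markup by user agent and render some blocks with JavaScript. A small sketch that dumps the fetched page so the classes that are actually present can be inspected (the User-Agent value is just an arbitrary example):
import requests
from bs4 import BeautifulSoup

url = "https://search.yahoo.com/search?p="
query = '"based specialty chemicals. Blocks. Organosilicone. Reference"'

# A browser-like User-Agent sometimes changes which markup is served.
headers = {"User-Agent": "Mozilla/5.0"}
r = requests.get(url + query, headers=headers)

# Save what requests actually received, to compare with the dev-tools view.
with open("serp.html", "w", encoding="utf-8") as f:
    f.write(r.text)

soup = BeautifulSoup(r.text, "html.parser")
# List the class names of the divs that were really returned.
print({c for div in soup.find_all("div") for c in (div.get("class") or [])})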
I am trying to write a script which performs a Google search for the input keyword and returns only the content from the top 10 URLs.
Note: Content specifically refers to the content that is being requested by the searched term and is found in the body of the returned URLs.
I am done with the search and top 10 URL retrieval part. Here is the script:
from google import search
top_10_links = search(keyword, tld='com.in', lang='en', stop=10)
However, I am unable to retrieve only the content from the links without knowing their structure. I can scrape content from a particular site by finding the class etc. of the tags using dev tools, but I am unable to figure out how to get content from the top 10 result URLs, since every searched term yields different URLs (different sites have different CSS selectors) and it would be pretty hard to find the CSS class of the required content. Here is sample code to extract content from a particular site:
content_dict = {}
i = 1
for page in links:
    print(i, ' # link: ', page)
    article_html = get_page(page)  # get_page() returns the page's html
    soup = BeautifulSoup(article_html, 'lxml')
    content = soup.find('div', {'class': 'entry-content'}).get_text()
    content_dict[page] = content
    i += 1
However, the CSS class changes across sites. Is there some way I can get this script working and get the desired content?
You can't do scraping without knowing the structure of what you're scraping. But there is a package that does something similar: take a look at newspaper.
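A minimal sketch of how that could look with the newspaper (newspaper3k) package; the Article download/parse workflow below follows its documented API, but treat it as an illustration rather than a drop-in replacement for your script:
from newspaper import Article  # pip install newspaper3k

content_dict = {}
for page in top_10_links:      # the links returned by the google search above
    article = Article(page)
    article.download()         # fetch the raw HTML
    article.parse()            # heuristically extract the main body text
    content_dict[page] = article.text

for url, text in content_dict.items():
    print(url, text[:100])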
I'm pretty new at this and I'm trying to figure out a way to look up a list of websites automatically. I have a very large list of companies and essentially I'd want the algorithm to type the company into Google, click the first link (most likely the company website) and figure out whether the company matches the target industry (ice cream distributors) or has anything to do with the industry. The way I'd want to check for this is by seeing if the home page contains any of the key words in a given dictionary (let's say, 'chocolate, vanilla, ice cream, etc'). I would really appreciate some help with this - thank you so much.
I recommend using a combination of requests and lxml. To accomplish this you could do something similar to this.
import requests
from lxml.cssselect import CSSSelector
from lxml import html
Use requests or grequests to get the HTML from all the pages.
queries = ['cats', 'dogs']
responses = [requests.get('https://www.google.com/search?q=' + q) for q in queries]
data = [r.text for r in responses]
Parse the HTML with lxml and extract the first link on each page.
data = [html.document_fromstring(x) for x in data]
sel = CSSSelector('h3.r a')
links = [sel(x)[0] for x in data]
Finally, grab the HTML from all of the first results.
pages = [requests.get(a.attrib['href']) for a in links]
This will give you an HTML string for each of the pages you want. From there you should be able to simply search for the words you want in each page's HTML. You might find a Counter helpful.
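For the keyword check itself, a small sketch using collections.Counter over the fetched home pages; the keyword list is just the example one from the question:
from collections import Counter
import re

keywords = ['chocolate', 'vanilla', 'ice cream']

for response in pages:  # 'pages' from the snippet above
    text = response.text.lower()
    counts = Counter(re.findall(r'[a-z]+', text))
    # Count single-word keywords via the Counter; multi-word ones by substring.
    hits = {k: (counts[k] if ' ' not in k else text.count(k)) for k in keywords}
    print(response.url, hits, any(hits.values()))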
I'd like to grab all the index words and their definitions from here. Is it possible to scrape web content with Python?
Firebug exploration shows that the following URL returns my desired content, including both the index word and its definition, for 'a':
http://pali.hum.ku.dk/cgi-bin/cpd/pali?acti=xart&arid=14179&sphra=undefined
Which modules should be used? Is there any tutorial available?
I do not know how many words are indexed in the dictionary. I'm an absolute beginner in programming.
You should use urllib2 for getting the URL contents and BeautifulSoup for parsing the HTML/XML.
Example - retrieving all questions from the StackOverflow.com main page:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://stackoverflow.com")
soup = BeautifulSoup(page)
for incident in soup('h3'):
    print [i.decode('utf8') for i in incident.contents]
    print
This code sample was adapted from the BeautifulSoup documentation.
You can get data from the web using the built-in urllib or urllib2, but the parsing itself is the most important part. May I suggest the wonderful BeautifulSoup? It can deal with just about anything.
http://www.crummy.com/software/BeautifulSoup/
The documentation is built like a tutorial. Sorta:
http://www.crummy.com/software/BeautifulSoup/documentation.html
In your case, you probably need to use wildcards to see all entries in the dictionary. You can do something like this:
import urllib2
def getArticles(query, start_index, count):
    xml = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
                          'acti=xsea&tsearch=%s&rfield=entr&recf=%d&recc=%d' %
                          (query, start_index, count))
    # TODO:
    # Parse the XML here (using BeautifulSoup or an XML parser like Python's
    # own xml.etree). We should at least have the name and ID for each article:
    # article = (article_name, article_id)
    return articles  # a list of (article_name, article_id) tuples parsed from the XML

def getArticleContent(article_id):
    xml = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
                          'acti=xart&arid=%d&sphra=undefined' % article_id)
    # TODO: parse xml
    return parsed_article
Now you can loop over things. For instance, to get all articles starting with 'ana', use the wildcard 'ana*' and loop until you get no results:
query = 'ana*'
article_dict = {}
i = 0
while True:
    new_articles = getArticles(query, i, 100)
    if len(new_articles) == 0:
        break
    i += 100
    for article_name, article_id in new_articles:
        article_dict[article_name] = getArticleContent(article_id)
Once done, you'll have a dictionary of the content of all articles, referenced by names. I omitted the parsing itself, but it's quite simple in this case, since everything is XML. You might not even need to use BeautifulSoup (even though it's still handy and easy to use for XML).
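For completeness, a rough sketch of how the omitted parsing step might look with xml.etree; the element names used below ('result', 'name', 'id') are placeholders, since the actual schema of the CGI response is not shown here:
import xml.etree.ElementTree as ET

def parseArticles(xml_string):
    # Placeholder tag names; adjust them to whatever the response really uses.
    root = ET.fromstring(xml_string)
    articles = []
    for entry in root.findall('.//result'):
        name = entry.findtext('name')
        article_id = int(entry.findtext('id'))
        articles.append((name, article_id))
    return articles

# Inside getArticles you would then do something like:
#     return parseArticles(xml.read())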
A word of warning, though:
You should check the site's usage policy (and maybe robots.txt) before trying to heavily scrape articles. If you're just getting a few articles for yourself they may not care (though the dictionary's copyright owner, if it's not public domain, might), but if you're going to scrape the entire dictionary, that is going to be some heavy usage.