SerializationError while scraping data and pushing to Elasticsearch - Python

Below is the code. I am trying to scrape the data and push it to Elasticsearch.
import re
import time
import requests
from bs4 import BeautifulSoup
from elasticsearch import Elasticsearch

es_client = Elasticsearch(['http://localhost:9200'])

#drop_index = es_client.indices.create(index='blog-sysadmins', ignore=400)
create_index = es_client.indices.delete(index='blog-sysadmins', ignore=[400, 404])

def urlparser(title, url):
    # scrape title
    p = {}
    post = title
    page = requests.get(post).content
    soup = BeautifulSoup(page, 'lxml')
    title_name = soup.title.string
    # scrape tags
    tag_names = []
    desc = soup.findAll(attrs={"property": "article:tag"})
    for x in range(len(desc)):
        tag_names.append(desc[x-1]['content'].encode('utf-8'))
    print(tag_names)
    # payload for elasticsearch
    doc = {
        'date': time.strftime("%Y-%m-%d"),
        'title': title_name,
        'tags': tag_names,
        'url': url
    }
    # ingest payload into elasticsearch
    res = es_client.index(index="blog-sysadmins", doc_type="docs", body=doc)
    time.sleep(0.5)

sitemap_feed = 'https://sysadmins.co.za/sitemap-posts.xml'
page = requests.get(sitemap_feed)
sitemap_index = BeautifulSoup(page.content, 'html.parser')
urlss = [element.text for element in sitemap_index.findAll('loc')]
urls = urlss[0:2]
print('urls', urls)

for x in urls:
    urlparser(x, x)
My error:
SerializationError: ({'date': '2020-07-04', 'title': 'Persistent Storage with OpenEBS on Kubernetes', 'tags': [b'Cassandra', b'Kubernetes', b'Civo', b'Storage'], 'url': 'http://sysadmins.co.za/persistent-storage-with-openebs-on-kubernetes/'}, TypeError("Unable to serialize b'Cassandra' (type: <class 'bytes'>)",))

The serialization error appears when you try to index data that is not one of the primitive datatypes JSON supports (JSON was derived from JavaScript). It is a JSON error, not an Elasticsearch one: the JSON format only accepts those datatypes - for more explanation please read here. In your case the tags field contains bytes values, as shown in your error stack:
TypeError("Unable to serialize b'Cassandra' (type: <class 'bytes'>)
To solve the problem, simply cast the tag content to a string. Just change this line:
tag_names.append(desc[x-1]['content'].encode('utf-8'))
to:
tag_names.append(str(desc[x-1]['content']))
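For illustration only, here is a small standalone sketch (not part of the original scraper) showing the same constraint with the standard json module, which is roughly what the Elasticsearch client's serializer enforces under the hood:

import json

tags_bytes = [b'Cassandra', b'Kubernetes']            # what .encode('utf-8') produces
tags_text = [t.decode('utf-8') for t in tags_bytes]   # back to plain str values

# json.dumps(tags_bytes) would raise a TypeError: bytes are not JSON serializable
print(json.dumps(tags_text))  # ["Cassandra", "Kubernetes"]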

Related

Store web scraping data in MongoDB Atlas

I'm a beginner. I successfully scraped data from a website and put it into a JSON file, and I would like to put the data into MongoDB Atlas. I hid the 'user-agent' and my connection string. I created an account on Atlas and tried to find the related code to import data into Atlas. How can I use the code at the bottom correctly?
import requests
from bs4 import BeautifulSoup
import json
import pymongo

mystocks = ['^GSPC', 'QQQ', 'TSLA', 'AAPL']
stockdata = []

def getData(symbol):
    headers = {
        'User-Agent': 'my-user-agent'
    }
    url = f'https://finance.yahoo.com/quote/{symbol}'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    # created a dictionary
    stock = {
        'symbol': symbol,
        'price': soup.find('fin-streamer', {'class': "Fw(b) Fz(36px) Mb(-4px) D(ib)"}).text,
        'change': soup.find('fin-streamer', {'class': 'Fw(500) Pstart(8px) Fz(24px)'}).text,
        'change_percentage': soup.find('div', {'class': 'D(ib) Mend(20px)'}).find_all('span')[1].text,
    }
    return stock

for i in mystocks:
    stockdata.append(getData(i))
    print('Getting: ', i)

# with open('stockdata.json', 'w') as f:
#     json.dump(stockdata, f)
#     print('Finish')

client = pymongo.MongoClient('my connection string')
db = client.db.stock

try:
    db.insert_many(stock)
    print(f'inserted {len(stock)} articles')
except:
    print('an error occurred quotes were not stored to db')
I am not sure how to use the code at the bottom.
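No answer is shown here, but as a rough sketch of the insert step (the database and collection names below are assumptions, and 'my connection string' stays a placeholder for the real Atlas URI), the scraped stockdata list could be written like this:

import pymongo

client = pymongo.MongoClient('my connection string')  # placeholder Atlas URI
db = client['stocks']          # assumed database name
collection = db['quotes']      # assumed collection name

# insert the whole list of scraped dicts (stockdata), not the single stock dict
result = collection.insert_many(stockdata)
print(f'inserted {len(result.inserted_ids)} documents')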

How to scrape multiple RSS feeds and store results respectively in their CSVs?

Is there a way to scrape data from multiple RSS feeds and store the results?
I'm currently scraping multiple RSS feeds and storing each in its own CSV in the worst way possible - a separate .py file for each feed, and I run all the .py files in the folder.
I have multiple .py files like this in a folder, with only the URL different. I'm not sure how to run them in a loop and store the results in their respective CSVs.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'anyRSSFeedLink.com'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')

output = []
for entry in soup.find_all('entry'):
    item = {
        'Title': entry.find('title', {'type': 'html'}).text,
        'Pubdate': entry.find('published').text,
        'Content': entry.find('content').text,
        'Link': entry.find('link')['href']
    }
    output.append(item)

df = pd.DataFrame(output)
df.to_csv('results/results_feed01.csv', index=False)
How can I read from a CSV that has all the RSS feed links, and run them in a single scraping script while storing the results in their respective CSVs?
Is there a way one can scrape data from multiple RSS feeds and store results?
Yes, it is - simply read your URLs into a list or iterate directly over each line in your CSV.
feeds = ['http://feeds.bbci.co.uk/news/rss.xml', 'http://www.cbn.com/cbnnews/us/feed/']

for url in feeds:
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'xml')
    output = []
    for entry in soup.find_all('item'):
        item = {
            'Title': entry.find('title').text,
            'Pubdate': e.text if (e := entry.find('pubDate')) else None,
            'Content': entry.find('description').text,
            'Link': entry.find('link').text
        }
        output.append(item)
In each iteration you scrape the feed and save it to its own CSV, which could, for example, be named by domain:
df.to_csv(f'results_feed_{url.split("/")[2]}.csv', index=False)
or use a counter if you like:
for enum, url in enumerate(feeds):
    ...
    df.to_csv(f'results_feed{enum}.csv', index=False)
Be aware - this will only work if all feeds follow the same structure; otherwise you have to make some adjustments. You should also check whether the elements you try to find are available before calling methods or properties:
'Pubdate': e.text if(e := entry.find('pubDate')) else None
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd

feeds = ['http://feeds.bbci.co.uk/news/rss.xml', 'http://www.cbn.com/cbnnews/us/feed/']

for url in feeds:
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'xml')
    output = []
    for entry in soup.find_all('item'):
        item = {
            'Title': entry.find('title').text,
            'Pubdate': e.text if (e := entry.find('pubDate')) else None,
            'Content': entry.find('description').text,
            'Link': entry.find('link').text
        }
        output.append(item)
    df = pd.DataFrame(output)
    df.to_csv(f'results_feed_{url.split("/")[2]}.csv', index=False)
output = [] has to be outside the loop, otherwise your result will only contain the last URL. If you check your CSV file, you will only have the feed of cbnnews; the earlier result gets overwritten.
The code should be:
import requests
from bs4 import BeautifulSoup
import pandas as pd

feeds = ['http://feeds.bbci.co.uk/news/rss.xml', 'http://www.cbn.com/cbnnews/us/feed/']

output = []
for url in feeds:
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'xml')
    for entry in soup.find_all('item'):
        item = {
            'Title': entry.find('title').text,
            'Pubdate': e.text if (e := entry.find('pubDate')) else None,
            'Content': entry.find('description').text,
            'Link': entry.find('link').text
        }
        output.append(item)

df = pd.DataFrame(output)
df.to_csv(f'results_feed_{url.split("/")[2]}.csv', index=False)
Now if you check your CSV file, you will have both URL feeds.
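As for the part of the question about reading the feed links from a CSV instead of hard-coding them, a minimal sketch (assuming a file named feeds.csv with a column called url; both names are assumptions, so adjust them to your file) could be:

import pandas as pd

# feeds.csv and its 'url' column are assumed names
feeds = pd.read_csv('feeds.csv')['url'].tolist()
print(feeds)  # list of RSS feed URLs to loop over exactly as shown above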

JSON loads returning string

I have the following code:
import requests
import json
from bs4 import BeautifulSoup

url = 'https://finance.yahoo.com/quote/SPY'
result = requests.get(url)
c = result.content
html = BeautifulSoup(c, 'html.parser')
scripts = html.find_all('script')

sl = []
for s in scripts:
    sl.append(s)

s = sl[-3]
s = s.contents
s = str(s)
s = s[119:-16]
s = json.dumps(s)
json_data = json.loads(s)
When I check the data type of json_data I get a string. I am assuming there are some text encoding errors in the JSON data and it cannot properly be recognized as a JSON object.
However, when I dump the data into a file and enter it into an online JSON parser, the parser reads the JSON properly and recognizes keys and values.
How can I fix this so that I can properly access the data within the JSON object?
You have to change [119:-16] into [112:-12] (and load the slice directly with json.loads(), skipping the json.dumps() step, as in the code below) to get the JSON as a dictionary:
import requests
from bs4 import BeautifulSoup
import json
url = 'https://finance.yahoo.com/quote/SPY'
result = requests.get(url)
html = BeautifulSoup(result.content, 'html.parser')
script = html.find_all('script')[-3].text
data = script[112:-12]
json_data = json.loads(data)
print(type(json_data))
#print(json_data)
print(json_data.keys())
print(json_data['context'].keys())
print(json_data['context']['dispatcher']['stores']['PageStore']['currentPageName'])
Result:
<class 'dict'>
dict_keys(['context', 'plugins'])
dict_keys(['dispatcher', 'options', 'plugins'])
quote
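One detail the answer does not spell out: the original code returned a string because it called json.dumps() on a Python str and then json.loads() on the result; dumping a str just wraps it in a JSON string literal, so loading it gives the same str back. A small illustration:

import json

raw = '{"context": {"plugins": {}}}'      # a JSON document held in a Python str
roundtrip = json.loads(json.dumps(raw))
print(type(roundtrip))                    # <class 'str'> - dumps produced a JSON string literal

parsed = json.loads(raw)                  # parse the str directly instead
print(type(parsed))                       # <class 'dict'>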

Scraping AJAX-loaded content with Python?

So I have a function that is called when I click a button; it goes as below:
var min_news_id = "68feb985-1d08-4f5d-8855-cb35ae6c3e93-1";

function loadMoreNews(){
    $("#load-more-btn").hide();
    $("#load-more-gif").show();
    $.post("/en/ajax/more_news", {'category': '', 'news_offset': min_news_id}, function(data){
        data = JSON.parse(data);
        min_news_id = data.min_news_id || min_news_id;
        $(".card-stack").append(data.html);
    })
    .fail(function(){ alert("Error : unable to load more news"); })
    .always(function(){ $("#load-more-btn").show(); $("#load-more-gif").hide(); });
}
jQuery.scrollDepth();
Now I don't have much experience with JavaScript, but I assume it's returning some JSON data from some sort of API at "en/ajax/more_news".
Is there a way I could directly call this API and get the JSON data from my Python script? If yes, how?
If not, how do I scrape the content that is being generated?
You need to POST the news id that you see inside the script to https://www.inshorts.com/en/ajax/more_news; this is an example using requests:
from bs4 import BeautifulSoup
import requests
import re

# pattern to extract min_news_id
patt = re.compile('var min_news_id\s+=\s+"(.*?)"')

with requests.Session() as s:
    soup = BeautifulSoup(s.get("https://www.inshorts.com/en/read").content)
    new_id_scr = soup.find("script", text=re.compile("var\s+min_news_id"))
    print(new_id_scr.text)
    news_id = patt.search(new_id_scr.text).group(1)
    js = s.post("https://www.inshorts.com/en/ajax/more_news", data={"news_offset": news_id})
    print(js.json())
The response gives you all the HTML; you just have to access js.json()["html"].
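As a short continuation of the snippet above (the selectors are borrowed from the script further down, so treat them as assumptions about the page markup):

# js is the response from the POST above
data = js.json()
cards = BeautifulSoup(data["html"], "lxml")
for card in cards.find_all("div", {"class": "news-card"}):
    body = card.find("div", {"itemprop": "articleBody"})  # article summary inside each card
    print(body.text.strip())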
Here is a script that will automatically loop through all the pages on inshorts.com:
from bs4 import BeautifulSoup
from newspaper import Article
import requests
import sys
import re
import json

patt = re.compile('var min_news_id\s+=\s+"(.*?)"')
i = 0

while(1):
    with requests.Session() as s:
        if(i==0):
            soup = BeautifulSoup(s.get("https://www.inshorts.com/en/read").content, "lxml")
        new_id_scr = soup.find("script", text=re.compile("var\s+min_news_id"))
        news_id = patt.search(new_id_scr.text).group(1)
        js = s.post("https://www.inshorts.com/en/ajax/more_news", data={"news_offset": news_id})
        jsn = json.dumps(js.json())
        jsonToPython = json.loads(jsn)
        news_id = jsonToPython["min_news_id"]
        data = jsonToPython["html"]
        i += 1
        soup = BeautifulSoup(data, "lxml")
        for tag in soup.find_all("div", {"class": "news-card"}):
            main_text = tag.find("div", {"itemprop": "articleBody"})
            summ_text = main_text.text
            summ_text = summ_text.replace("\n", " ")
            result = tag.find("a", {"class": "source"})
            art_url = result.get('href')
            if 'www.youtube.com' in art_url:
                print("Nothing")
            else:
                art_url = art_url[:-1]
                #print("Hello", art_url)
                article = Article(art_url)
                article.download()
                if article.is_downloaded:
                    article.parse()
                    article_text = article.text
                    article_text = article_text.replace("\n", " ")
                    print(article_text + "\n")
                    print(summ_text + "\n")
It gives both the summary from inshorts.com and the complete news story from the respective news channel.

Beautiful Soup parses in some cases but not in others. Why?

I am using Beautiful Soup to parse some JSON out of an HTML file.
Basically I am using it to get all employee profiles out of a LinkedIn search result.
However, it does not work with companies that have more than 10 employees, for some reason.
Here is my code:
import requests, json
from bs4 import BeautifulSoup

s = requests.session()

def get_csrf_tokens():
    url = "https://www.linkedin.com/"
    req = s.get(url).text
    csrf_token = req.split('name="csrfToken" value=')[1].split('" id="')[0]
    login_csrf_token = req.split('name="loginCsrfParam" value="')[1].split('" id="')[0]
    return csrf_token, login_csrf_token

def login(username, password):
    url = "https://www.linkedin.com/uas/login-submit"
    csrfToken, loginCsrfParam = get_csrf_tokens()
    data = {
        'session_key': username,
        'session_password': password,
        'csrfToken': csrfToken,
        'loginCsrfParam': loginCsrfParam
    }
    req = s.post(url, data=data)
    print "success"

login(USERNAME, PASSWORD)

def get_all_json(company_link):
    r = s.get(company_link)
    html = r.content
    soup = BeautifulSoup(html)
    html_file = open("html_file.html", 'w')
    html_file.write(html)
    html_file.close()
    Json_stuff = soup.find('code', id="voltron_srp_main-content")
    print Json_stuff
    return remove_tags(Json_stuff)

def remove_tags(p):
    p = str(p)
    return p[62:-10]

def list_of_employes():
    jsons = get_all_json('https://www.linkedin.com/vsearch/p?f_CC=2409087')
    print jsons
    loaded_json = json.loads(jsons.replace(r'\u002d', '-'))
    employes = loaded_json['content']['page']['voltron_unified_search_json']['search']['results']
    return employes

def get_employee_link(employes):
    profiles = []
    for employee in employes:
        print employee['person']['link_nprofile_view_3']
        profiles.append(employee['person']['link_nprofile_view_3'])
    return profiles, len(profiles)

print get_employee_link(list_of_employes())
It will not work for the link that is in place; however it will work for this company search: https://www.linkedin.com/vsearch/p?f_CC=3003796
EDIT:
I am pretty sure that this is an error with the get_all_json() function. If you take a look, it does not correctly fetch the JSON for companies with more than 10 employees.
This is because the results are paginated. You need to go over all the pages defined inside the JSON data at:
data['content']['page']['voltron_unified_search_json']['search']['baseData']['resultPagination']['pages']
pages is a list, for the company 2409087 it is:
[{u'isCurrentPage': True, u'pageNum': 1, u'pageURL': u'http://www.linkedin.com/vsearch/p?f_CC=2409087&page_num=1'},
{u'isCurrentPage': False, u'pageNum': 2, u'pageURL': u'http://www.linkedin.com/vsearch/p?f_CC=2409087&page_num=2', u'page_number_i18n': u'Page 2'},
{u'isCurrentPage': False, u'pageNum': 3, u'pageURL': u'http://www.linkedin.com/vsearch/p?f_CC=2409087&page_num=3', u'page_number_i18n': u'Page 3'}]
This is basically a list of URLs you need to loop over to get the data.
Here's what you need to do (omitting the code for login):
def get_results(json_code):
    return json_code['content']['page']['voltron_unified_search_json']['search']['results']

url = "https://www.linkedin.com/vsearch/p?f_CC=2409087"

soup = BeautifulSoup(s.get(url).text)
code = soup.find('code', id="voltron_srp_main-content").contents[0].replace(r'\u002d', '-')
json_code = json.loads(code)

results = get_results(json_code)

pages = json_code['content']['page']['voltron_unified_search_json']['search']['baseData']['resultPagination']['pages']
for page in pages[1:]:
    soup = BeautifulSoup(s.get(page['pageURL']).text)
    code = soup.find('code', id="voltron_srp_main-content").contents[0].replace(r'\u002d', '-')
    json_code = json.loads(code)
    results += get_results(json_code)

print len(results)
It prints 25 for https://www.linkedin.com/vsearch/p?f_CC=2409087 - exactly how many you see in the browser.
Turns out it was a problem with the default BeautifulSoup parser.
I changed it to html5lib by doing this:
Install it from the console:
pip install html5lib
And change the parser you choose when first creating the soup object:
soup = BeautifulSoup(html, 'html5lib')
This is documented in the BeautifulSoup docs here
