Formatting Results from Scraper - python

I'm getting an error while trying to format a simple Amazon scraper.
I'm trying to scrape Amazon and then create a tweet using the Twitter API. After scraping Amazon I want to format the results so I can pass them to my Twitter API call.
While trying to format I get this error:
ERROR:
File "/Users/user/Coding/TestRequests/amazonscraper.py", line 32, in <module>
deals = tvprices[0].replace("'title'", "Product: ")
AttributeError: 'dict' object has no attribute 'replace'
CODE:
from requests_html import HTMLSession

urls = ['https://amzn.to/3PUatLc']

def getPrice(url):
    s = HTMLSession()
    r = s.get(url)
    r.html.render(sleep=1)
    try:
        product = {
            'title': r.html.xpath('//*[@id="productTitle"]', first=True).text,
            'price': r.html.xpath('//*[@id="corePriceDisplay_desktop_feature_div"]/div[1]/span[2]/span[1]', first=True).text,
            'discount': r.html.xpath('//*[@id="corePriceDisplay_desktop_feature_div"]/div[1]/span[1]', first=True).text.replace('-', '')
        }
        print(product)
    except:
        product = {
            'title': r.html.xpath('//*[@id="productTitle"]', first=True).text,
            'price': 'item unavailable'
        }
        print(product)
    return product

tvprices = []
for url in urls:
    tvprices.append(getPrice(url))

deals = tvprices[0].replace("'title'", "Product: ")
print(deals)
Any help would be appreciated. I'm just learning, so this might be much simpler than I'm making it.
Thanks all!

You can't call replace on a dictionary. If you really wanted something along those lines, you could delete the existing key and insert a new one named Product: , but that's not the best approach.
You might want to build another list with the formatted data instead:
from typing import List

formatted_deals: List[str] = []
for tvprice in tvprices:
    formatted_deals.append(f"Product: {tvprice['title']}")
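Taking that a step further, the whole dict can be turned into one tweet-ready string. A minimal sketch (the sample product dict below is made up for illustration; real data would come from getPrice()):

```python
# Sample data standing in for what getPrice() returns (keys as in the question).
tvprices = [
    {'title': 'Example 55-inch TV', 'price': '$299.99', 'discount': '25%'},
]

def format_deal(product):
    # .get() keeps this safe when the except-branch omitted 'discount'
    return (f"Product: {product['title']}\n"
            f"Price: {product['price']}\n"
            f"Discount: {product.get('discount', 'n/a')}")

deal = format_deal(tvprices[0])
print(deal)
```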

Related

I want to post a search query in a search box using python requests

Regarding: https://www.casamundo.com/search/
In this link html there is
<input
placeholder="Where are you going?"
class="c-gray-extra-dark autocomplete-search bdn posr w100p olfn bg-transparent bdtrrsn-xs bdbrrsn-xs mb0 h100p pv4 pr8 bdrss fw400"
readonly=""
data-test="autocomplete-input"
value="England"
>
I want to change the value myself.
I have tried data={'value': 'England'} but it's not working for me. Any ideas?
I don't want to use Selenium.
There are two separate requests.
Let's try entering New York.
The first request is GET https://www.casamundo.com/api/v2/autocomplete?limit=6&q=new%20york
It returns a list of suggestions and their ids:
{"suggestions":[{"id":"5460aeb030147","shortTitle":"New York",...
We need to take that id and place it in the second request: GET https://www.casamundo.com/search/5460aeb030147
It's also possible to get the offers in JSON format by appending ?_format=json to the last request.
Here is the Python code:
import requests

query = 'New York'
params = {
    'limit': 1,
    'q': query,
}
r = requests.get(
    'https://www.casamundo.com/api/v2/autocomplete',
    params=params,
)
suggestion_id = r.json()['suggestions'][0]['id']
final_url = f'https://www.casamundo.com/search/{suggestion_id}'
print(final_url)
# => https://www.casamundo.com/search/5460aeb030147
r = requests.get(final_url, params={'_format': 'json'})
offers = r.json()
print(offers)
# => {'offers': [{'id': 'd264ad61e1139f56', 'title': 'Apartment', 'imageLin...

How Do I Find Amazon Product Names with requests-html?

I've been trying to write a program in Python that can return a list of all the product names on the first page. I have a function that builds the URL based on what you want to search for:
def get_url(search_term):
    template = 'https://www.amazon.com/s?k={}&ref=nb_sb_noss_1'
    search_term = search_term.replace(' ', '+')
    url = template.format(search_term)
    print(url)
    return url
Then I pass the URL into another function and here is where I need help. Right now my function to retrieve the title and number of reviews is this:
from requests_html import HTMLSession

def getInfo(url):
    r = HTMLSession().get(url)
    r.html.render()
    product = {
        'title': r.html.find('.a-size-medium' '.a-color-base' '.a-text-normal', first=True).text,
        'reviews': r.html.find('.a-size-base', first=True).text
    }
    print(product)
However, the r.html.find part isn't getting the info I need; it either returns [] or, if I add first=True, None. I've tried different approaches, like using XPath and CSS selectors, but none of them seemed to work. Can anyone help me use the html.find method to find all the product names and save them under title in the product dictionary?
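One detail worth checking first: the three adjacent string literals in the find() call are concatenated by Python into a single compound CSS selector, so it matches only elements carrying all three classes at once. That is easy to verify without any scraping:

```python
# Adjacent string literals concatenate into one compound CSS selector.
selector = '.a-size-medium' '.a-color-base' '.a-text-normal'
print(selector)  # -> .a-size-medium.a-color-base.a-text-normal
```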

SerializationError while scraping data and pushing it to Elasticsearch

Below is the code. I am trying to scrape the data and push it to Elasticsearch.
import re
import time
import requests
from bs4 import BeautifulSoup
from elasticsearch import Elasticsearch

es_client = Elasticsearch(['http://localhost:9200'])

#drop_index = es_client.indices.create(index='blog-sysadmins', ignore=400)
create_index = es_client.indices.delete(index='blog-sysadmins', ignore=[400, 404])

def urlparser(title, url):
    # scrape title
    p = {}
    post = title
    page = requests.get(post).content
    soup = BeautifulSoup(page, 'lxml')
    title_name = soup.title.string
    # scrape tags
    tag_names = []
    desc = soup.findAll(attrs={"property": "article:tag"})
    for x in range(len(desc)):
        tag_names.append(desc[x-1]['content'].encode('utf-8'))
    print(tag_names)
    # payload for elasticsearch
    doc = {
        'date': time.strftime("%Y-%m-%d"),
        'title': title_name,
        'tags': tag_names,
        'url': url
    }
    # ingest payload into elasticsearch
    res = es_client.index(index="blog-sysadmins", doc_type="docs", body=doc)
    time.sleep(0.5)

sitemap_feed = 'https://sysadmins.co.za/sitemap-posts.xml'
page = requests.get(sitemap_feed)
sitemap_index = BeautifulSoup(page.content, 'html.parser')
urlss = [element.text for element in sitemap_index.findAll('loc')]
urls = urlss[0:2]
print('urls', urls)
for x in urls:
    urlparser(x, x)
my error:
SerializationError: ({'date': '2020-07-04', 'title': 'Persistent Storage with OpenEBS on Kubernetes', 'tags': [b'Cassandra', b'Kubernetes', b'Civo', b'Storage'], 'url': 'http://sysadmins.co.za/persistent-storage-with-openebs-on-kubernetes/'}, TypeError("Unable to serialize b'Cassandra' (type: <class 'bytes'>)",))
The JSON serialization error appears when you try to index data that is not a primitive datatype of JavaScript, the language JSON originated from. It is a JSON error, not an Elasticsearch one. The only rule of the JSON format is that it accepts only those datatypes inside itself (see the JSON specification for more detail). In your case the tags field holds a bytes datatype, as shown in your error stack:
TypeError("Unable to serialize b'Cassandra' (type: <class 'bytes'>)
To solve your problem you should simply store your tags as strings. So just change this line:
tag_names.append(desc[x-1]['content'].encode('utf-8'))
to:
tag_names.append(str(desc[x-1]['content']))
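The failure is easy to reproduce without Elasticsearch at all, since it is the JSON encoder that rejects bytes:

```python
import json

# What .encode('utf-8') produced: bytes, which JSON cannot serialize.
tags_bytes = [b'Cassandra', b'Kubernetes']
# Plain strings serialize fine.
tags_str = ['Cassandra', 'Kubernetes']

try:
    json.dumps({'tags': tags_bytes})
except TypeError as err:
    print('bytes fail:', err)

print(json.dumps({'tags': tags_str}))
```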

Scrape Text After Specific Text and Before Specific Text

<script type="text/javascript">
'sku': 'T3246B5',
'Name': 'TAS BLACKY',
'Price': '111930',
'categories': 'Tas,Wanita,Sling Bags,Di bawah Rp 200.000',
'brand': '',
'visibility': '4',
'instock': "1",
'stock': "73.0000"
</script>
I want to scrape the text between 'stock': " and .0000", so the desired result is 73.
What I used to do is something like this:
for url2 in urls2:
    req2 = Request(url2, headers={'User-Agent': 'Chrome/39.0.2171.95'})
    html2 = uReq(req2).read()
    page_soup2 = soup(html2, "html.parser")
    # Grab text
    stock = page_soup2.findAll("p", {"class": "stock"})
    stocks = stock[0].text
I used something like this in my previous code; it worked before the site changed its code.
But now there is more than one ("script", {"type": "text/javascript"}) on the page I want to scrape, so I don't know how to find the right one.
I also don't know how to get the specific text before and after the text I want.
I have googled all day but can't find the solution. Please help.
I found that the strings 'stock': " and .0000" are unique on the entire page - there is only one 'stock': and only one .0000" - so I think they could mark the location of the text I want to scrape.
Please help, and thank you for your kindness.
I also apologize for my limited English; I am unfamiliar with programming and just trying to learn from Google. Thank you for your understanding.
the url = view-source:sophieparis.com/blacky-bag.html
Since you are sure 'stock' only shows up in the script tag you want, you can pull out the text that contains 'stock'. Once you have that, it's a matter of trimming off the excess and changing the single quotes to double quotes to get it into valid JSON format, and then simply reading it in with json.loads().
import requests
from bs4 import BeautifulSoup
import json

url2 = 'https://www.sophieparis.com/blacky-bag.html'
req2 = requests.get(url2, headers={'User-Agent': 'Chrome/39.0.2171.95'})
page_soup2 = BeautifulSoup(req2.text, "html.parser")
scripts = page_soup2.find_all('script')
for script in scripts:
    if 'stock' in script.text:
        jsonStr = script.text

jsonStr = jsonStr.split('productObject = ')[-1].strip()
jsonStr = jsonStr.rsplit('}', 1)[0].strip() + '}'
jsonData = json.loads(jsonStr.replace("'", '"'))
print(jsonData['stock'].split('.')[0])
Output:
71
You could also do this without the loop and just grab the script that has the string stock in it using 1 line:
jsonStr = page_soup2.find('script', text=re.compile(r'stock')).text
Full code would look something like:
import requests
from bs4 import BeautifulSoup
import json
import re
url2 = 'https://www.sophieparis.com/blacky-bag.html'
req2 = requests.get(url2, headers={'User-Agent': 'Chrome/39.0.2171.95'})
page_soup2 = BeautifulSoup(req2.text, "html.parser")
jsonStr = page_soup2.find('script', text=re.compile(r'stock')).text
jsonStr = jsonStr.split('productObject = ')[-1].strip()
jsonStr = jsonStr.rsplit('}',1)[0].strip() + '}'
jsonData = json.loads(jsonStr.replace("'",'"'))
print (jsonData['stock'].split('.')[0])
I would write a regex that targets the javascript dictionary variable that houses the values of interest. You can apply this direct to response.text with no need for bs4.
The dictionary variable is called productObject, and you want the non-empty dictionary, which is the second occurrence of productObject = {..}, i.e. not the one preceded by 'var '. You can use a negative lookbehind to specify this requirement.
Use hjson to handle property names enclosed in single quotes.
Python:
import requests, re, hjson
r = requests.get('https://www.sophieparis.com/blacky-bag.html')
p = re.compile(r'(?<!var\s)productObject = ([\s\S]*?})')
data = hjson.loads(p.findall(r.text)[0])
print(data)
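The lookbehind can be checked offline against a small mock of the page (the mock text below is made up; json is used instead of hjson here because the mock parses as JSON once the quotes are normalised):

```python
import json
import re

# Mock of the page: the first productObject is a "var" declaration,
# the second holds the data we want.
page = """
var productObject = {};
productObject = {'sku': 'T3246B5', 'stock': "73.0000"}
"""

# (?<!var\s) rejects matches preceded by "var ", skipping the declaration.
p = re.compile(r'(?<!var\s)productObject = ([\s\S]*?})')
raw = p.findall(page)[0]

# Normalise single quotes so the fragment parses as plain JSON.
data = json.loads(raw.replace("'", '"'))
print(data['stock'].split('.')[0])  # -> 73
```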
If you want to provide me with the webpage you wish to scrape the data from, I'll see if I can fix the code to pull the information.

Combining BeautifulSoup and json into one output

I have probably not explained my question well, but as this is new to me... Anyway, I need to combine these two pieces of code.
I can get the BeautifulSoup part working, but it uses the wrong image. To get the right fields and the right image, I have to parse the JSON part of the website, and therefore BS won't work.
The JSON parsing is here:
import json
import urllib

r = urllib.urlopen('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
data = json.loads(r.read().decode('utf-8'))
for post in data['posts']:
    print post['episodeNumber']
    print post['title']
    print post['audioSource']
    print post['image']['medium']
    print post['content']
And replace the try / BS part here:
def get_playable_podcast(soup):
    """
    :param soup: parsed html page
    """
    subjects = []
    for content in soup.find_all('item'):
        try:
            link = content.find('enclosure')
            link = link.get('url')
            print "\n\nLink: ", link
            title = content.find('title')
            title = title.get_text()
            desc = content.find('itunes:subtitle')
            desc = desc.get_text()
            thumbnail = content.find('itunes:image')
            thumbnail = thumbnail.get('href')
        except AttributeError:
            continue
        item = {
            'url': link,
            'title': title,
            'desc': desc,
            'thumbnail': thumbnail
        }
        subjects.append(item)
    return subjects

def compile_playable_podcast(playable_podcast):
    """
    :param playable_podcast: list containing dicts of key/value pairs for playable podcasts
    """
    items = []
    for podcast in playable_podcast:
        items.append({
            'label': podcast['title'],
            'thumbnail': podcast['thumbnail'],
            'path': podcast['url'],
            'info': podcast['desc'],
            'is_playable': True,
        })
    return items
I have tried all sorts of variations of passing the output through to the items section, but the most common error I get is below. I just have no idea how to pass the data from the JSON through.
Error Type: <type 'exceptions.NameError'>
Error Contents: name 'title' is not defined
Traceback (most recent call last):
  File ".../addon.py", line 6, in <module>
    from resources.lib import thisiscriminal
  File "....resources/lib/thisiscriminal.py", line 132, in <module>
    'title': title,
NameError: name 'title' is not defined
Your JSON request should contain all the information you need. You should print json_data, take a look at what is returned, and decide which parts you need.
Based on what your other code was looking for, the following code shows how you could extract some of the fields:
import requests

r = requests.get('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
json_data = r.json()
items = []
for post in json_data['posts']:
    items.append([
        post['title'].encode('utf-8'),
        post['image']['thumb'],
        post['excerpt']['long'],
        post['permalink'],
    ])

for item in items:
    print item
This would give you output starting:
['Stowaway', u'https://thisiscriminal.com/wp-content/uploads/2019/07/Stowaway_art-150x150.png', u'One day in 1969, Paulette Cooper decided to see what she could get away with.', u'https://thisiscriminal.com/episode-118-stowaway-7-5-2019/']
['The Lake', u'https://thisiscriminal.com/wp-content/uploads/2019/06/Lake_art-150x150.png', u'Amanda Hamm and her boyfriend Maurice LaGrone drove to Clinton Lake one night in 2003. The next day, DeWitt County Sheriff Roger Massey told a local newspaper, \u201cWe don\u2019t want to blow this up into something that it\u2019s not. But on the other side, we\u2019ve got three children...', u'https://thisiscriminal.com/episode-117-the-lake-6-21-2019/']
['Jessica and the Bunny Ranch', u'https://thisiscriminal.com/wp-content/uploads/2019/06/Bunny_art-150x150.png', u'In our\xa0last episode\xa0we spoke Cecilia Gentili, a trans Latina who worked for many years as an undocumented sex worker. Today, we get two more views of sex work in America. We speak with a high-end escort in New York City, and take a trip to one of the...', u'https://thisiscriminal.com/episode-116-jessica-and-the-bunny-ranch-6-7-2019/']
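For the step the question is actually stuck on (feeding the JSON posts into compile_playable_podcast's item shape), here is a Python 3 sketch with a made-up sample post; the field names mirror the question's snippets, and audioSource is assumed to be the playable URL:

```python
# Sample post standing in for one entry of r.json()['posts'];
# the URLs here are placeholders, not real feed data.
sample_post = {
    'title': 'Stowaway',
    'audioSource': 'https://example.com/stowaway.mp3',
    'excerpt': {'long': 'One day in 1969, Paulette Cooper...'},
    'image': {'thumb': 'https://example.com/stowaway-150x150.png'},
}

def to_item(post):
    # Map one JSON post onto the dict shape compile_playable_podcast() builds.
    return {
        'label': post['title'],
        'thumbnail': post['image']['thumb'],
        'path': post['audioSource'],
        'info': post['excerpt']['long'],
        'is_playable': True,
    }

print(to_item(sample_post)['label'])
```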
