text substitution {} does not work at scrapinghub - python

I build a URL with {} placeholders and str.format so that I can change it on the fly.
It works fine on my PC.
But once I upload it and run it on Scrapinghub, one of the substitutions (state) fails while the others work: the URL contains %7B%7D, which is the URL-encoded form of curly braces.
Why does this happen? What am I missing when referencing the state variable?
This is the code that builds the URL:
def __init__(self):
    self.state = 'AL'
    self.zip = '35204'
    self.tax_rate = 0
    self.years = [2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017]
def parse_m(self, response):
    r = json.loads(response.text)
    models = r['models']
    year = response.meta['year']
    make = response.meta['make']
    for model in models:
        for milage in [40000,50000,60000,70000,80000,90000,100000]:
            url = '****/vehicles/?year={}&make={}&model={}&state={}&mileage={}&zip={}'.format(year,make, model, self.state, milage, self.zip)
And this is the URL I see in the Scrapinghub log:
***/vehicles/?year=2010&make=LOTUS&model=EXIGE%20S&state=%7B%7D&mileage=100000&zip=35204

This is not a Scrapinghub issue; the cause has to be in your code. One way to get encoded braces is for self.state to be an empty dict rather than a string:
>>> "state={}".format({})
'state={}'
Once the URL is encoded, this ends up as
state=%7B%7D
I would add
assert type(self.state) is str
to the code so that, if this situation occurs, you get an AssertionError instead of a malformed URL.
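The effect described above can be reproduced locally with the standard library. This is only a sketch: build_state_param is a hypothetical helper, not part of the question's spider, and it assumes the request URL is percent-encoded in roughly the way urllib.parse.quote does it.

```python
from urllib.parse import quote


def build_state_param(state):
    """Format a 'state' query parameter and percent-encode its value.

    Hypothetical helper for illustration only.
    """
    return "state=" + quote("{}".format(state))


good = build_state_param("AL")  # a string formats as expected
bad = build_state_param({})     # an empty dict formats as '{}' and is then encoded

print(good)
print(bad)
```

If the second line of output matches what shows up in the log, checking where self.state is reassigned on the remote run would be the next step.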


Having Trouble Returning a value from a smart contract, using flask on python anywhere

I am using flask and web3.eth on pythonanywhere and connecting to a contract, but am having issues returning a value for one of the smart contract functions. No errors are being logged. I have placed several print statements to see where the app is stopping and found that it stops when I call a smart contract function.
Also I should mention that I am able to run this exact code locally without issues.
This is the function that is most likely failing:
def getDataFromTokenID(tokenid, contract):
    print('getting uri')
    uri = contract.functions.tokenURI(tokenid).call() # This is where it stops printing
    print('PRINT:',uri)
    html = requests.get(uri)
    name, img_url = html.json()['name'], html.json()['image']
    code = name[-5:]
    return name, img_url, code
The function above is called in the following blueprint:
@TokenInfo.route('/rarity/<int:tokenid>', methods=['GET'])
def sendTokenInfo(tokenid):
    contract_address = ' ' # left empty for posting purposes
    w3 = Web3(Web3.WebsocketProvider(' ')) # left empty purposefully as well
    contract = w3.eth.contract(address=contract_address, abi=contract_abi.abi)
    model = Shape_classifier()
    model.load_state_dict(load(os.getcwd()+'/mysite/app/state_dict.pth'))
    uri = current_app.config['MONGO_URI']
    mongo.init_app(current_app, uri)
    gs = mongo.db.gantomstone_info
    try:
        id_exists = [{"$match": {'_id': tokenid}}, {"$count": "count"}]
        list(gs.aggregate(id_exists))[0]
    except:
        print('getting data from token id')
        name, img_url, serial = getDataFromTokenID(tokenid, contract) ## Stops printing here
        print('opening image')
        img = Image.open(requests.get(img_url, stream=True).raw)
        shape = getImageShape(img, model)
        colors = getColors(getCounts(img))
        rgb_count = getCounts(img)
        serialTF = getCodeInfo(serial)
        to_db = {'_id': tokenid, 'name': name, 'img_url': img_url, 'serial': serial,
                 'shape': shape, 'colors': colors, 'serialTF': serialTF, 'rgb_count': rgb_count}
        gs.insert_one(to_db)
    rarity = getRarity(gs, tokenid)
    gs.update_one({'_id': tokenid}, {
        '$set': {'rarity_values': rarity}}, upsert=True)
    to_json = list(gs.find({'_id': tokenid}))[0]
    return jsonify(to_json)
I have tried moving contract address around (both out of TokenInfo view function and into the functions file) to no avail. I have also tried changing the function inputs to receive the get request args instead of the int in the URL, which made no difference either.
If the code uses websockets it won't currently work in web apps on PythonAnywhere.

how to make full path for url in python?

I am a little confused about how to build a full URL.
I have this code:
def flats(self):
    return [JsonFlatPage(property_data = flat, url = flat['propertyUrl'])
            for flat in self.data['properties']]
flat['propertyUrl'] contains '/properties/75599853', but I need a URL like this one:
'https://www.rightmove.co.uk/properties/75599853#/'
with the full path and #/ at the end.
I know that I should define the base URL as a constant in a settings file, but how do I then combine the two? Should I use f-strings?
Since the base URL https://www.rightmove.co.uk is fixed, you can do something like the following to get what you need. Note that flat['propertyUrl'] already starts with a slash, so the base URL must not end with one:
def flats(self):
    baseUrl = 'https://www.rightmove.co.uk'
    return [JsonFlatPage(property_data = flat, url = baseUrl + flat['propertyUrl'] + "#/")
            for flat in self.data['properties']]
You can also use f-strings, as you mentioned:
def flats(self):
    baseUrl = 'https://www.rightmove.co.uk'
    return [JsonFlatPage(property_data = flat, url = f"{baseUrl}{flat['propertyUrl']}#/")
            for flat in self.data['properties']]
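As an alternative to manual concatenation, the standard library's urllib.parse.urljoin resolves the slash between base and path. A minimal sketch follows; BASE_URL and full_property_url are hypothetical names, and BASE_URL is assumed to come from a settings module.

```python
from urllib.parse import urljoin

# Assumed constant; in the question's project it would live in a settings file.
BASE_URL = 'https://www.rightmove.co.uk/'


def full_property_url(property_path):
    """Join the site root with a path such as '/properties/75599853' and append '#/'."""
    return urljoin(BASE_URL, property_path) + '#/'


result = full_property_url('/properties/75599853')
print(result)
```

Because urljoin normalizes the join, the result is the same whether or not BASE_URL ends with a slash.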

Eliminate unwanted characters from JSON file using different threads (Python)

In my python file, I have created a class called Download. The code where the class is:
import requests, json, os, pytube, threading
class Download:
    def __init__(self, url, json=False, get=False, post=False, put=False, unwanted="", wanted="", unwanted2="", wanted2="", unwanted3="", wanted3=""):
        self.url = url
        self.json = json
        self.get = get
        self.post = post
        self.put = put
        self.unwanted = unwanted
        self.wanted = wanted
        self.unwanted2 = unwanted2
        self.wanted2 = wanted2
        self.unwanted3 = unwanted3
        self.wanted3 = wanted3
    def downloadJson(self):
        if self.get is True:
            downloadJson = requests.get(self.url)
            downloadJson = str(downloadJson.content)
            downloadJsonS = str(downloadJson) # This saves the downloaded JSON file as string
            if self.json is True:
                with open("downloadedJson.json", "w") as writeDownloadedJson:
                    writeDownloadedJson.write(json.dumps(downloadJson))
                    writeDownloadedJson.close()
                with open("downloadedJson.json", "r") as replaceUnwanted:
                    a = replaceUnwanted.read()
                    x = a.replace(self.unwanted, self.wanted)
                    # y = a.replace(self.unwanted2, self.wanted2)
                    # z = a.replace(self.unwanted3, self.wanted3)
                    print(x)
                with open("downloadedJson.json", "w") as writeUnwanted:
                    # writeUnwanted.write(y)
                    # writeUnwanted.write(z)
                    writeUnwanted.write(x)
            else:
                # with open("downloadedJson.json", "w")as j:
                # j.write(downloadJsonS)
                # j.close()
                pass
I wrote all of this myself and understand how it works. My goal is to remove the unwanted character sequences that appear in the JSON file once it is downloaded, such as \\n, \' or \n. That is why __init__() takes so many arguments: unwanted="", wanted="", unwanted2="", and so on.
When a sequence such as \\n is passed as the unwanted parameter, all of its occurrences should be replaced by a space. This works properly for one pair. The commented-out lines are what I tried for the other pairs, but that did not work: only the replacement from a single argument was applied.
Is there a way to apply the replacements for all of the arguments, using threads? If that is not possible with threads, is there an alternative?
By the way, this is the file in which I use the class (main.py):
from downloader import Download
with open("url.txt", "r") as url:
    x = Download(url.read(), get=True, json=True, unwanted="\\n")
    x.downloadJson()
Thanks
You could apply the replacements one after another:
x = a.replace(self.unwanted, self.wanted)
x = x.replace(self.unwanted2, self.wanted2)
x = x.replace(self.unwanted3, self.wanted3)
You could also chain the replacement together, but that would quickly become hard to read:
x = a.replace(...).replace(...).replace(...)
By the way, instead of having multiple unwantedN and wantedN arguments, it would probably be a lot easier to use a list of (unwanted, wanted) pairs, something like this:
def __init__(self, url, json=False, get=False, post=False, put=False, replacements=None):
    self.url = url
    self.json = json
    self.get = get
    self.post = post
    self.put = put
    # Use None as the default so instances do not share one mutable list
    self.replacements = replacements if replacements is not None else []
And then you could perform the replacements in a loop:
x = a
for unwanted, wanted in self.replacements:
    x = x.replace(unwanted, wanted)
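The loop above can be tried on its own, without the download step. In this sketch apply_replacements is a hypothetical helper and the sample string is made up. Threads are not needed here, because each replacement works on the result of the previous one.

```python
def apply_replacements(text, replacements):
    """Apply each (unwanted, wanted) pair in order and return the cleaned text."""
    for unwanted, wanted in replacements:
        text = text.replace(unwanted, wanted)
    return text


# Made-up sample containing a literal backslash-n, an escaped quote and a real newline.
raw = "line one\\nline two\\'s end\nline three"
pairs = [("\\n", " "), ("\\'", "'"), ("\n", " ")]

cleaned = apply_replacements(raw, pairs)
print(cleaned)
```

The order of the pairs matters when one unwanted sequence contains another, so more specific sequences are listed first.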

Python Praw ways to store data for calling later?

Is a dictionary the correct way to do this? Ideally the structure will be five or more levels deep. Sorry, my only language experience is PowerShell, where I would just make an array of objects. I'm not looking for someone to write the code; I just want to know if there is a better way.
Thanks
Cody
My Powershell way:
[$title1,$title2,$title3]
$titleX.comment = "comment here"
$titleX.comment.author = "bob"
$titleX.comment.author.karma = "200"
$titleX.comment.reply = "Hey Bob love your comment."
$titleX.comment.reply.author = "Alex"
$titleX.comment.reply.reply = "I disagree"
#
Broken Python code:
import praw
d = {}
reddit = praw.Reddit(client_id='XXXX',
                     client_secret='XXXX',
                     user_agent='android:com.example.myredditapp:'
                                'v1.2.3 (by /u/XXX)')
for submission in reddit.subreddit('redditdev').hot(limit=2):
    d[submission.id] = {}
    d[submission.id]['comment'] = {}
    d[submission.id]['title']= {}
    d[submission.id]['comment']['author']={}
    d[submission.id]['title'] = submission.title
    mySubmission = reddit.submission(id=submission.id)
    mySubmission.comments.replace_more(limit=0)
    for comment in mySubmission.comments.list():
        d[submission.id]['comment'] = comment.body
        d[submission.id]['comment']['author'] = comment.author.name
        print(submission.title)
        print(comment.body)
        print(comment.author.name)
print(d)
  File "C:/git/tensorflow/Reddit/pull.py", line 23, in <module>
    d[submission.id]['comment']['author'] = comment.author.name
TypeError: 'str' object does not support item assignment
#
{'6xg24v': {'comment': 'Locking this version. Please comment on the [original post](https://www.reddit.com/r/changelog/comments/6xfyfg/an_update_on_the_state_of_the_redditreddit_and/)!', 'title': 'An update on the state of the reddit/reddit and reddit/reddit-mobile repositories'}}
I think your approach using a dictionary is okay, but you might also solve this by using a data structure for your posts: Instead of writing
d[submission.id] = {}
d[submission.id]['comment'] = {}
d[submission.id]['title']= {}
d[submission.id]['comment']['author']={}
d[submission.id]['title'] = submission.title
you could create a class Submission like this:
class Submission(object):
    def __init__(self, id, author, title, content):
        self.id = id
        self.author = author
        self.title = title
        self.content = content
        self.subSubmissions = {}
    def addSubSubmission(self, submission):
        # Key each child by its own id
        self.subSubmissions[submission.id] = submission
    def getSubSubmission(self, id):
        return self.subSubmissions[id]
Using this class, you could change your code to this:
submissions = {}
for sm in reddit.subreddit('redditdev').hot(limit=2):
    # selftext is the PRAW attribute that holds the body of a self post
    submissions[sm.id] = Submission(sm.id, sm.author, sm.title, sm.selftext)
    # I am not quite sure what these lines are supposed to do, so you might be able to improve these, too
    mySubmission = reddit.submission(id=sm.id)
    mySubmission.comments.replace_more(limit=0)
    for cmt in mySubmission.comments.list():
        # Comments have no title, so None is passed in its place
        submissions[sm.id].addSubSubmission(Submission(cmt.id, cmt.author, None, cmt.body))
With this approach you can also move the code that reads the comments/subSubmissions into a separate function that calls itself recursively, so that you can read comment trees of arbitrary depth.
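The recursive idea could look roughly like the following. This is a sketch that runs without PRAW: the Submission class is a trimmed copy of the one above, collect_bodies is a hypothetical function name, and the sample authors and texts are borrowed from the question's PowerShell example.

```python
class Submission:
    """Trimmed-down node: a post or comment with nested children keyed by id."""

    def __init__(self, id, author, title, content):
        self.id = id
        self.author = author
        self.title = title
        self.content = content
        self.subSubmissions = {}

    def addSubSubmission(self, submission):
        self.subSubmissions[submission.id] = submission


def collect_bodies(submission, depth=0):
    """Return one indented 'author: content' line per node, depth first."""
    lines = ["  " * depth + "{}: {}".format(submission.author, submission.content)]
    for child in submission.subSubmissions.values():
        lines.extend(collect_bodies(child, depth + 1))
    return lines


# Hand-built sample tree standing in for data fetched from Reddit.
post = Submission("p1", "cody", "A title", "post body")
comment = Submission("c1", "bob", None, "comment here")
reply = Submission("r1", "alex", None, "Hey Bob love your comment.")
comment.addSubSubmission(reply)
post.addSubSubmission(comment)

lines = collect_bodies(post)
print("\n".join(lines))
```

Each level of nesting adds one recursive call, so the same function handles replies to replies without extra code.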

Python OOP Project Organization

I'm a bit new to Python dev -- I'm creating a larger project for some web scraping. I want to approach this as "Pythonically" as possible, and would appreciate some help with the project structure. Here's how I'm doing it now:
Basically, I have a base class for an object whose purpose is to go to a website and parse some specific data on it into its own array, jobs[]
minion.py
class minion:
    # Empty getJobs() function to be defined by object pre-instantiation
    def getJobs(self):
        pass
    # Constructor for a minion that requires site authorization
    # Ex: minCity1 = minion('http://portal.com/somewhere', 'user', 'password')
    # or minCity2 = minion('http://portal.com/somewhere')
    def __init__(self, title, URL, user='', password=''):
        self.title = title
        self.URL = URL
        self.user = user
        self.password = password
        self.jobs = []
        if (user == '' and password == ''):
            self.reqAuth = 0
        else:
            self.reqAuth = 1
    def displayjobs(self):
        for j in self.jobs:
            j.display()
I'm going to have about 100 different data sources. The way I'm doing it now is to just create a separate module for each "Minion", which defines (and binds) a more tailored getJobs() function for that object
Example: minCity1.py
from minion import minion
from BeautifulSoup import BeautifulSoup
import urllib2
from job import job
# MINION CONFIG
minTitle = 'Some city'
minURL = 'http://www.somewebpage.gov/'
# Here we define a function that will be bound to this object's getJobs function
def getJobs(self):
    page = urllib2.urlopen(self.URL)
    soup = BeautifulSoup(page)
    # For each row
    for tr in soup.findAll('tr'):
        tJob = job()
        span = tr.findAll(['span', 'class="content"'])
        # If row has 5 spans, pull data from span 2 and 3 ( [1] and [2] )
        if len(span) == 5:
            tJob.title = span[1].a.renderContents()
            tJob.client = 'Some City'
            tJob.source = minURL
            tJob.due = span[2].div.renderContents().replace('<br />', '')
            self.jobs.append(tJob)
# Don't forget to bind the function to the object!
minion.getJobs = getJobs
# Instantiate the object
mCity1 = minion(minTitle, minURL)
I also have a separate module which simply contains a list of all the instantiated minion objects (which I have to update each time I add one):
minions.py
from minion_City1 import mCity1
from minion_City2 import mCity2
from minion_City3 import mCity3
from minion_City4 import mCity4
minionList = [mCity1,
              mCity2,
              mCity3,
              mCity4]
main.py references minionList for all of its activities for manipulating the aggregated data.
This seems a bit chaotic to me, and was hoping someone might be able to outline a more Pythonic approach.
Thank you, and sorry for the long post!
Instead of creating functions and assigning them to objects (or whatever minion is, I'm not really sure), you should definitely use classes instead. Then you'll have one class for each of your data sources.
If you want, you can even have these classes inherit from a common base class, but that isn't absolutely necessary.
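A sketch of what one class per data source might look like is shown below. One reason to prefer it: assigning minion.getJobs = getJobs in each module rebinds the method on the shared class, so the module imported last wins for every instance. All names here (Minion, CityOneMinion, CityTwoMinion, get_jobs) are hypothetical, and the hard-coded rows are made-up stand-ins for the real fetch-and-parse step.

```python
class Minion:
    """Base class: shared state lives here, subclasses supply site-specific parsing."""

    title = None
    url = None

    def __init__(self, user='', password=''):
        self.user = user
        self.password = password
        self.jobs = []
        # Authorization is required when either credential is given
        self.req_auth = bool(user or password)

    def get_jobs(self):
        raise NotImplementedError("each data source overrides get_jobs()")


class CityOneMinion(Minion):
    title = 'Some city'
    url = 'http://www.somewebpage.gov/'

    def get_jobs(self):
        # Made-up rows standing in for the downloaded and parsed page
        rows = [('Road repair', '2020-01-31')]
        for job_title, due in rows:
            self.jobs.append({'title': job_title, 'client': self.title,
                              'source': self.url, 'due': due})


class CityTwoMinion(Minion):
    title = 'Another city'
    url = 'http://www.example.org/'

    def get_jobs(self):
        # This source has nothing to report in the sketch
        self.jobs.extend([])


minion_list = [CityOneMinion(), CityTwoMinion()]
for m in minion_list:
    m.get_jobs()

print([(m.title, len(m.jobs)) for m in minion_list])
```

Each subclass keeps its own get_jobs, so adding a data source means adding a class rather than patching the base class at import time.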
