How to check many variables in Python if not null? - python

I'm writing a python scraper code for OpenData and I have one question about : how to check if all values aren't filled in site and if it is null change value to null.
My scraper is here.
Currently I'm working on it to optimalize.
My variables now look like:
evcisloval = soup.find_all('td')[3].text.strip()
prinalezival = soup.find_all('td')[5].text.strip()
popisfaplnenia = soup.find_all('td')[7].text.replace('\"', '')
hodnotafaplnenia = soup.find_all('td')[9].text[:-1].replace(",", ".").replace(" ", "")
datumdfa = soup.find_all('td')[11].text
datumzfa = soup.find_all('td')[13].text
formazaplatenia = soup.find_all('td')[15].text
obchmenonazov = soup.find_all('td')[17].text
sidlofirmy = soup.find_all('td')[19].text
pravnaforma = soup.find_all('td')[21].text
sudregistracie = soup.find_all('td')[23].text
ico = soup.find_all('td')[25].text
dic = soup.find_all('td')[27].text
cislouctu = soup.find_all('td')[29].text
And Output :
scraperwiki.sqlite.save(unique_keys=["invoice_id"],
data={ "invoice_id":number,
"invoice_price":hodnotafaplnenia,
"evidence_no":evcisloval,
"paired_with":prinalezival,
"invoice_desc":popisfaplnenia,
"date_received":datumdfa,
"date_payment":datumzfa,
"pay_form":formazaplatenia,
"trade_name":obchmenonazov,
"trade_form":pravnaforma,
"company_location":sidlofirmy,
"court":sudregistracie,
"ico":ico,
"dic":dic,
"accout_no":cislouctu,
"invoice_attachment":urlfa,
"invoice_url":url})
I googled it but without success.

First, write a configuration dict of your variables in the form:
conf = {'evidence_no': (3, str.strip),
'trade_form': (21, None),
...}
i.e. key is the output key, value is a tuple of id from soup.find_all('td') and of an optional function that has to be applied to the result, None otherwise. You don't need those Slavic variable names that may confuse other SO members.
Then iterate over conf and fill the data dict.
Also, run soup.find_all('td') before the loop.
tds = soup.find_all('td')
data = {}
for name, (num, func) in conf.iteritems():
text = tds[num].text
# replace text with None or "NULL" or whatever if needed
...
if func is None:
data[name] = text
else:
data[name] = func(text)
This will remove a lot of duplicated code. Easier to maintain.
Also, I am not sure the strings "NULL" are the best way to write missing data. Doesn't sqlite support Python's real None objects?

Just read your attached link, and it seems what you want is
evcisloval = soup.find_all('td')[3].text.strip() or "NULL"
But be careful. You should only do this with strings. If the part before or is either empty or False or None, or 0, they will all be replaced with "NULL"

Related

how can i fill my key's json automatically

I have a json template that I would like to autofill.
currently i do it manually like this:
for i in range(len(self.dict_trie_csv_infos)):
self.template_json[i]['name'] = self.dict_trie_csv_infos[i]['name']
self.template_json[i]['title'] = self.dict_trie_csv_infos[i]['title']
self.template_json[i]['startDateTime'] = self.dict_trie_csv_infos[i]['startDateTime']
self.template_json[i]['endDateTime'] = self.dict_trie_csv_infos[i]['endDateTime']
self.template_json[i]['address'] = self.dict_trie_csv_infos[i]['address']
self.template_json[i]['locationName'] = self.dict_trie_csv_infos[i]['locationName']
self.template_json[i]['totalTicketsCount'] = self.dict_trie_csv_infos[i]
.....etc..
as you can see they have the same key's name,
i tried to do it with a loop but it didn't worked.
How can i fill it automatically (is it possible)?
Thanks for your answer
If you're matching names you could do something like:
for i in range(len(self.dict_trie_csv_infos)):
for key in list(self.dict_trie_csv_infos.keys()):
self.template_json[i][key] = self.dict_trie_csv_infos[i][key]

Getting wrong result from JSON - Python 3

Im working on a small project of retrieving information about books from the Google Books API using Python 3. For this i make a call to the API, read out the variables and store those in a list. For a search like "linkedin" this works perfectly. However when i enter "Google", it reads the second title from the JSON input. How can this happen?
Please find my code below (Google_Results is the class I use to initialize the variables):
import requests
def Book_Search(search_term):
parms = {"q": search_term, "maxResults": 3}
r = requests.get(url="https://www.googleapis.com/books/v1/volumes", params=parms)
print(r.url)
results = r.json()
i = 0
for result in results["items"]:
try:
isbn13 = str(result["volumeInfo"]["industryIdentifiers"][0]["identifier"])
isbn10 = str(result["volumeInfo"]["industryIdentifiers"][1]["identifier"])
title = str(result["volumeInfo"]["title"])
author = str(result["volumeInfo"]["authors"])[2:-2]
publisher = str(result["volumeInfo"]["publisher"])
published_date = str(result["volumeInfo"]["publishedDate"])
description = str(result["volumeInfo"]["description"])
pages = str(result["volumeInfo"]["pageCount"])
genre = str(result["volumeInfo"]["categories"])[2:-2]
language = str(result["volumeInfo"]["language"])
image_link = str(result["volumeInfo"]["imageLinks"]["thumbnail"])
dict = Google_Results(isbn13, isbn10, title, author, publisher, published_date, description, pages, genre,
language, image_link)
gr.append(dict)
print(gr[i].title)
i += 1
except:
pass
return
gr = []
Book_Search("Linkedin")
I am a beginner to Python, so any help would be appreciated!
It does so because there is no publisher entry in volumeInfo of the first entry, thus it raises a KeyError and your except captures it. If you're going to work with fuzzy data you have to account for the fact that it will not always have the expected structure. For simple cases you can rely on dict.get() and its default argument to return a 'valid' default entry if an entry is missing.
Also, there are a few conceptual problems with your function - it relies on a global gr which is bad design, it shadows the built-in dict type and it captures all exceptions guaranteeing that you cannot exit your code even with a SIGINT... I'd suggest you to convert it to something a bit more sane:
def book_search(search_term, max_results=3):
results = [] # a list to store the results
parms = {"q": search_term, "maxResults": max_results}
r = requests.get(url="https://www.googleapis.com/books/v1/volumes", params=parms)
try: # just in case the server doesn't return valid JSON
for result in r.json().get("items", []):
if "volumeInfo" not in result: # invalid entry - missing volumeInfo
continue
result_dict = {} # a dictionary to store our discovered fields
result = result["volumeInfo"] # all the data we're interested is in volumeInfo
isbns = result.get("industryIdentifiers", None) # capture ISBNs
if isinstance(isbns, list) and isbns:
for i, t in enumerate(("isbn10", "isbn13")):
if len(isbns) > i and isinstance(isbns[i], dict):
result_dict[t] = isbns[i].get("identifier", None)
result_dict["title"] = result.get("title", None)
authors = result.get("authors", None) # capture authors
if isinstance(authors, list) and len(authors) > 2: # you're slicing from 2
result_dict["author"] = str(authors[2:-2])
result_dict["publisher"] = result.get("publisher", None)
result_dict["published_date"] = result.get("publishedDate", None)
result_dict["description"] = result.get("description", None)
result_dict["pages"] = result.get("pageCount", None)
genres = result.get("authors", None) # capture genres
if isinstance(genres, list) and len(genres) > 2: # since you're slicing from 2
result_dict["genre"] = str(genres[2:-2])
result_dict["language"] = result.get("language", None)
result_dict["image_link"] = result.get("imageLinks", {}).get("thumbnail", None)
# make sure Google_Results accepts keyword arguments like title, author...
# and make them optional as they might not be in the returned result
gr = Google_Results(**result_dict)
results.append(gr) # add it to the results list
except ValueError:
return None # invalid response returned, you may raise an error instead
return results # return the results
Then you can easily retrieve as much info as possible for a term:
gr = book_search("Google")
And it will be far more tolerant of data omissions, provided that your Google_Results type makes most of the entries optional.
Following #Coldspeed's recommendation it became clear that missing information in the JSON file caused the exception to run. Since I only had a "pass" statement there it skipped the entire result. Therefore I will have to adapt the "Try and Except" statements so errors do get handled properly.
Thanks for the help guys!

Using Keys as Variables in Python

There is probably a term for what I'm attempting to do, but it escapes me. I'm using peewee to set some values in a class, and want to iterate through a list of keys and values to generate the command to store the values.
Not all 'collections' contain each of the values within the class, so I want to just include the ones that are contained within my data set. This is how far I've made it:
for value in result['response']['docs']:
for keys in value:
print keys, value[keys] # keys are "identifier, title, language'
#for value in result['response']['docs']:
# collection = Collection(
# identifier = value['identifier'],
# title = value['title'],
# language = value['language'],
# mediatype = value['mediatype'],
# description = value['description'],
# subject = value['subject'],
# collection = value['collection'],
# avg_rating = value['avg_rating'],
# downloads = value['downloads'],
# num_reviews = value['num_reviews'],
# creator = value['creator'],
# format = value['format'],
# licenseurl = value['licenseurl'],
# publisher = value['publisher'],
# uploader = value['uploader'],
# source = value['source'],
# type = value['type'],
# volume = value['volume']
# )
# collection.save()
for value in result['response']['docs']:
Collection(**value).save()
See this question for an explanation on how **kwargs work.
Are you talking about how to find out whether a key is in a dict or not?
>>> somedict = {'firstname': 'Samuel', 'lastname': 'Sample'}
>>> if somedict.get('firstname'):
>>> print somedict['firstname']
Samuel
>>> print somedict.get('address', 'no address given'):
no address given
If there is a different problem you'd like to solve, please clarify your question.

append in a list causing value overwrite

I am facing a peculiar problem. I will describe in brief bellow
Suppose i have this piece of code -
class MyClass:
__postBodies = []
.
.
.
for the_file in os.listdir("/dir/path/to/file"):
file_path = os.path.join(folder, the_file)
params = self.__parseFileAsText(str(file_path)) #reads the file and gets some parsed data back
dictData = {'file':str(file_path), 'body':params}
self.__postBodies.append(dictData)
print self.__postBodies
dictData = None
params = None
Problem is, when i print the params and the dictData everytime for different files it has different values (the right thing), but as soon as the append occurs, and I print __postBodies a strange thing happens. If there are thee files, suppose A,B,C, then
first time __postBodies has the content = [{'body':{A dict with some
data related to file A}, 'file':'path/of/A'}]
second time it becomes = [{'body':{A dict with some data relaed to
file B}, 'file':'path/of/A'}, {'body':{A dict with some data relaed to
file B}, 'file':'path/of/B'}]
AND third time = [{'body':{A dict with some data relaed to file C},
'file':'path/of/A'}, {'body':{A dict with some data relaed to file C},
'file':'path/of/B'}, {'body':{A dict with some data relaed to file C},
'file':'path/of/C'}]
So, you see the 'file' key is working very fine. Just strangely the 'body' key is getting overwritten for all the entries with the one last appended.
Am i making any mistake? is there something i have to? Please point me to a direction.
Sorry if I am not very clear.
EDIT ------------------------
The return from self.__parseFileAsText(str(file_path)) call is a dict that I am inserting as 'body' in the dictData.
EDIT2 ----------------------------
as you asked, this is the code, but i have checked that params = self.__parseFileAsText(str(file_path)) call is returning a diff dict everytime.
def __parseFileAsText(self, fileName):
i = 0
tempParam = StaticConfig.PASTE_PARAMS
tempParam[StaticConfig.KEY_PASTE_PARAM_NAME] = ""
tempParam[StaticConfig.KEY_PASTE_PARAM_PASTEFORMAT] = "text"
tempParam[StaticConfig.KEY_PASTE_PARAM_EXPIREDATE] = "N"
tempParam[StaticConfig.KEY_PASTE_PARAM_PRIVATE] = ""
tempParam[StaticConfig.KEY_PASTE_PARAM_USER] = ""
tempParam[StaticConfig.KEY_PASTE_PARAM_DEVKEY] = ""
tempParam[StaticConfig.KEY_PASTE_FORMAT_PASTECODE] = ""
for line in fileinput.input([fileName]):
temp = str(line)
temp2 = temp.strip()
if i == 0:
postValues = temp2.split("|||")
if int(postValues[(len(postValues) - 1)]) == 0 or int(postValues[(len(postValues) - 1)]) == 2:
tempParam[StaticConfig.KEY_PASTE_PARAM_NAME] = str(postValues[0])
if str(postValues[1]) == '':
tempParam[StaticConfig.KEY_PASTE_PARAM_PASTEFORMAT] = 'text'
else:
tempParam[StaticConfig.KEY_PASTE_PARAM_PASTEFORMAT] = postValues[1]
if str(postValues[2]) != "N":
tempParam[StaticConfig.KEY_PASTE_PARAM_EXPIREDATE] = str(postValues[2])
tempParam[StaticConfig.KEY_PASTE_PARAM_PRIVATE] = str(postValues[3])
tempParam[StaticConfig.KEY_PASTE_PARAM_USER] = StaticConfig.API_USER_KEY
tempParam[StaticConfig.KEY_PASTE_PARAM_DEVKEY] = StaticConfig.API_KEY
else:
tempParam[StaticConfig.KEY_PASTE_PARAM_USER] = StaticConfig.API_USER_KEY
tempParam[StaticConfig.KEY_PASTE_PARAM_DEVKEY] = StaticConfig.API_KEY
i = i+1
else:
if tempParam[StaticConfig.KEY_PASTE_FORMAT_PASTECODE] != "" :
tempParam[StaticConfig.KEY_PASTE_FORMAT_PASTECODE] = str(tempParam[StaticConfig.KEY_PASTE_FORMAT_PASTECODE])+"\n"+temp2
else:
tempParam[StaticConfig.KEY_PASTE_FORMAT_PASTECODE] = temp2
return tempParam
You are likely returning the same dictionary with every call to MyClass.__parseFileAsText(), a couple of common ways this might be happening:
__parseFileAsText() accepts a mutable default argument (the dict that you eventually return)
You modify an attribute of the class or instance and return that instead of creating a new one each time
Making sure that you are creating a new dictionary on each call to __parseFileAsText() should fix this problem.
Edit: Based on your updated question with the code for __parseFileAsText(), your issue is that you are reusing the same dictionary on each call:
tempParam = StaticConfig.PASTE_PARAMS
...
return tempParam
On each call you are modifying StaticConfig.PASTE_PARAMS, and the end result is that all of the body dictionaries in your list are actually references to StaticConfig.PASTE_PARAMS. Depending on what StaticConfig.PASTE_PARAMS is, you should change that top line to one of the following:
# StaticConfig.PASTE_PARAMS is an empty dict
tempParam = {}
# All values in StaticConfig.PASTE_PARAMS are immutable
tempParam = dict(StaticConfig.PASTE_PARAMS)
If any values in StaticConfig.PASTE_PARAMS are mutable, you could use copy.deepcopy but it would be better to populate tempParam with those default values on your own.
What if __postBodies wasn't a class attribute, as it is defined now, but just an instance attribute?

Python - is there an elegant way to avoid dozens try/except blocks while getting data out of a json object?

I'm looking for ways to write functions like get_profile(js) but without all the ugly try/excepts.
Each assignment is in a try/except because occasionally the json field doesn't exist. I'd be happy with an elegant solution which defaulted everything to None even though I'm setting some defaults to [] and such, if doing so would make the overall code much nicer.
def get_profile(js):
""" given a json object, return a dict of a subset of the data.
what are some cleaner/terser ways to implement this?
There will be many other get_foo(js), get_bar(js) functions which
need to do the same general type of thing.
"""
d = {}
try:
d['links'] = js['entry']['gd$feedLink']
except:
d['links'] = []
try:
d['statisitcs'] = js['entry']['yt$statistics']
except:
d['statistics'] = {}
try:
d['published'] = js['entry']['published']['$t']
except:
d['published'] = ''
try:
d['updated'] = js['entry']['updated']['$t']
except:
d['updated'] = ''
try:
d['age'] = js['entry']['yt$age']['$t']
except:
d['age'] = 0
try:
d['name'] = js['entry']['author'][0]['name']['$t']
except:
d['name'] = ''
return d
Replace each of your try catch blocks with chained calls to the dictionary get(key [,default]) method. All calls to get before the last call in the chain should have a default value of {} (empty dictionary) so that the later calls can be called on a valid object, Only the last call in the chain should have the default value for the key that you are trying to look up.
See the python documentation for dictionairies http://docs.python.org/library/stdtypes.html#mapping-types-dict
For example:
d['links'] = js.get('entry', {}).get('gd$feedLink', [])
d['published'] = js.get('entry', {}).get('published',{}).get('$t', '')
Use get(key[, default]) method of dictionaries
Code generate this boilerplate code and save yourself even more trouble.
Try something like...
import time
def get_profile(js):
def cas(prev, el):
if hasattr(prev, "get") and prev:
return prev.get(el, prev)
return prev
def getget(default, *elements):
return reduce(cas, elements[1:], js.get(elements[0], default))
d = {}
d['links'] = getget([], 'entry', 'gd$feedLink')
d['statistics'] = getget({}, 'entry', 'yt$statistics')
d['published'] = getget('', 'entry', 'published', '$t')
d['updated'] = getget('', 'entry', 'updated', '$t')
d['age'] = getget(0, 'entry', 'yt$age', '$t')
d['name'] = getget('', 'entry', 'author', 0, 'name' '$t')
return d
print get_profile({
'entry':{
'gd$feedLink':range(4),
'yt$statistics':{'foo':1, 'bar':2},
'published':{
"$t":time.strftime("%x %X"),
},
'updated':{
"$t":time.strftime("%x %X"),
},
'yt$age':{
"$t":"infinity years",
},
'author':{0:{'name':{'$t':"I am a cow"}}},
}
})
It's kind of a leap of faith for me to assume that you've got a dictionary with a key of 0 instead of a list but... You get the idea.
You need to familiarise yourself with dictionary methods Check here for how to handle what you're asking.
Two possible solutions come to mind, without knowing more about how your data is structured:
if k in js['entry']:
something = js['entry'][k]
(though this solution wouldn't really get rid of your redundancy problem, it is more concise than a ton of try/excepts)
or
js['entry'].get(k, []) # or (k, None) depending on what you want to do
A much shorter version is just something like...
for k,v in js['entry']:
d[k] = v
But again, more would have to be said about your data.

Categories