Python JSON scraping - how can I handle missing values?

I'm pretty new to coding, so I'm learning a lot as I go. This problem has me stumped, and even though I can find several similar questions on here, I can't find one that works or uses syntax I recognize.
I'm trying to scrape various user data from a JSON API and then store those values in a MySQL database I've set up.
The code seems to run fine for the most part, but some users do not have the attributes I'm trying to scrape in the JSON, so I'm left with NoneType errors that I can't seem to get around.
If possible, I'd like to just store "0" in the database where the JSON does not contain the attribute.
In the snippet below, this works fine for users that have a job, but users without a job raise a NoneType error on jobposition, which apparently breaks the loop.
response = requests.get("URL")
json_obj = json.loads(response.text)
timer = json_obj['timestamp']
jobposition = json_obj['job']['position']
query = "INSERT INTO users (timer, jobposition) VALUES (%s, %s)"
values = (timer, jobposition)
cursor = db.cursor()
cursor.execute(query, values)
db.commit()
Thanks in advance!

You can use the dictionary's get() method for that, as follows:
timer = json_obj.get('timestamp', 0)
Here 0 is the default value: if there is no 'timestamp' attribute, get() returns 0.
For the job position, you can do:
jobposition = json_obj['job'].get('position', 0) if 'job' in json_obj else 0
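The NoneType error in the question suggests that 'job' may be present but null for some users, in which case the 'job' in json_obj check passes and .get() is then called on None. A minimal sketch of a lookup that tolerates both a missing key and a null value follows; get_nested and the sample payloads are hypothetical, not part of the original code or the real API.

```python
import json


def get_nested(data, keys, default=0):
    """Walk nested dicts along keys; return default if any step is missing or None."""
    current = data
    for key in keys:
        if not isinstance(current, dict):
            return default
        current = current.get(key)
    return default if current is None else current


# Hypothetical payloads, not taken from the real API.
with_job = json.loads('{"timestamp": 1, "job": {"position": 3}}')
null_job = json.loads('{"timestamp": 2, "job": null}')
no_job = json.loads('{"timestamp": 3}')

jobposition = get_nested(null_job, ["job", "position"])  # falls back to 0
```

The same helper could be used for top-level keys such as 'timestamp' by passing a one-element key list.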

Try this:
try:
    jobposition = json_obj['job']['position']
except (KeyError, TypeError):
    # KeyError: the key is missing; TypeError: 'job' is None
    jobposition = 0

You can more clearly declare the data schema using dataclasses:
from dataclasses import dataclass
from validated_dc import ValidatedDC
@dataclass
class Job(ValidatedDC):
    name: str
    position: int = 0
@dataclass
class Workers(ValidatedDC):
    timer: int
    job: Job
input_data = {
    'timer': 123,
    'job': {'name': 'driver'}
}
workers = Workers(**input_data)
assert workers.job.position == 0
https://github.com/EvgeniyBurdin/validated_dc

Related

Problems querying with python to BigQuery (Python String Format)

I am trying to run a query against BigQuery from Python to modify all the values of a row. When I use a plain string for the query, I have no problems. However, when I introduce string formatting, the query does not work. Below is the same query, reduced to fewer of the columns I am modifying.
I have already set up the connection to BigQuery by defining the Client, etc., and it works properly.
I tried:
"UPDATE `riscos-dev.survey_test.data-test-bdrn` SET informaci_meteorol_gica = {inf}, risc = {ri} WHERE objectid = {obj_id}".format(inf = df.informaci_meteorol_gica[index], ri = df.risc[index], obj_id = df.objectid[index])
The input values passed to format are:
df.informaci_meteorol_gica[index] = 'Neu' (df.risc[index] is also a string) and df.objectid[index] = 3
I get the following error message:
BadRequest: 400 Braced constructors are not supported at [1:77]

Instead of the string format method, I suggest another approach using Python f-string formatting:
def build_query():
    inf = "'test_inf'"
    ri = "'test_ri'"
    obj_id = "'test_obj_id'"
    return f"UPDATE `riscos-dev.survey_test.data-test-bdrn` SET informaci_meteorol_gica = {inf}, risc = {ri} WHERE objectid = {obj_id}"
if __name__ == '__main__':
    query = build_query()
    print(query)
The result is:
UPDATE `riscos-dev.survey_test.data-test-bdrn` SET informaci_meteorol_gica = 'test_inf', risc = 'test_ri' WHERE objectid = 'test_obj_id'
I mocked the query parameters in my example with:
inf = "'test_inf'"
ri = "'test_ri'"
obj_id = "'test_obj_id'"
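The "Braced constructors" error suggests that a literal {…} placeholder reached BigQuery, which could happen if format() was applied to only part of the string; that is an assumption, since the full calling code is not shown. Note also that string values need quotes in the SQL text while numbers do not. The sketch below applies format() to the whole template and quotes values by type; sql_literal, build_update, and the sample risc value "Alt" are hypothetical.

```python
def sql_literal(value):
    """Render a Python value as a SQL literal: quote strings, leave numbers bare."""
    if isinstance(value, str):
        # Escape backslashes and single quotes so the value stays inside the literal.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return f"'{escaped}'"
    return str(value)


TEMPLATE = (
    "UPDATE `riscos-dev.survey_test.data-test-bdrn` "
    "SET informaci_meteorol_gica = {inf}, risc = {ri} "
    "WHERE objectid = {obj_id}"
)


def build_update(inf, ri, obj_id):
    # format() is applied to the whole template, so no literal braces remain.
    return TEMPLATE.format(
        inf=sql_literal(inf), ri=sql_literal(ri), obj_id=sql_literal(obj_id)
    )


# "Alt" is a made-up value for risc; 'Neu' and 3 come from the question.
query = build_update("Neu", "Alt", 3)
```

Building SQL from values by hand is easy to get wrong; the BigQuery client library also supports query parameters, which avoid manual quoting altogether.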

How to add a local variable to pymongo.find if it's not None?

I run a web service with an API function that uses a method I created to interact with MongoDB through pymongo.
The JSON data that comes with the POST may or may not include a firm field. I don't want to create a new method for posts that do not include a firm field.
So I want to use firm in pymongo's find if it exists, or just skip it if it doesn't. How can I do this with one API function and one pymongo method?
API function:
@app.route(f'/{API_PREFIX}/wordcloud', methods=['POST'])
def generate_wc():
    request_ = request.get_json()
    firm = request_.get("firm").lower()
    source = request_["source"]
    since = datetime.strptime(request_["since"], "%Y-%m-%d")
    until = datetime.strptime(request_["until"], "%Y-%m-%d")
    items = mongo.get_tweets(firm, since, until)
    ...
The pymongo method:
def get_tweets(self, firm: str, since: datetime, until: datetime):
    tweets = self.DB.tweets.find(
        {
            # use firm here if it exists (I mean not None), else just get items by date
            'date': {'$gte': since, '$lte': until}
        })
    ...
See the comment line inside find in the second snippet.
Thanks.

Since this involves two different queries, {date: ...} and {date: ..., firm: ...}, depending on whether firm is present in the input, you have to check whether firm is not None in get_tweets and run the appropriate query.
For example:
def get_tweets(self, since, until, firm=None):
    query = {'date': {'$gte': since, '$lte': until}}
    if firm is not None:
        query['firm'] = firm
    tweets = self.DB.tweets.find(query)
    ....
Note that since firm has a default value, it needs to be last in the get_tweets parameter list.
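One more detail on the calling side: request_.get("firm").lower() in the question raises AttributeError when firm is absent, because get() returns None. A minimal sketch that keeps the optional handling in one place is shown here; build_tweet_filter, parse_request, and the sample request bodies are hypothetical names and data, and the resulting dict would be what gets passed to find().

```python
from datetime import datetime


def build_tweet_filter(since, until, firm=None):
    """Build the filter dict for find(); add 'firm' only when one was supplied."""
    query = {"date": {"$gte": since, "$lte": until}}
    if firm:  # skips both None and an empty string
        query["firm"] = firm.lower()
    return query


def parse_request(payload):
    """Extract (since, until, firm) from a request body; firm may be absent."""
    since = datetime.strptime(payload["since"], "%Y-%m-%d")
    until = datetime.strptime(payload["until"], "%Y-%m-%d")
    return since, until, payload.get("firm")


# Hypothetical request bodies.
with_firm = build_tweet_filter(*parse_request(
    {"since": "2020-01-01", "until": "2020-01-31", "firm": "ACME"}))
without_firm = build_tweet_filter(*parse_request(
    {"since": "2020-01-01", "until": "2020-01-31"}))
```

Keeping the filter construction in a pure function also makes it easy to test without a database connection.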

I have an Error with python flask cause of an API result (probably cause of my list) and my Database

I use Flask, an API, and SQLAlchemy with SQLite.
I'm a beginner in Python and Flask, and I have a problem with a list.
My application works; now I'm trying to add a new function.
I need to know whether the information from my JSON is already in my database.
The function find_current_project_team() gets information from the API.
def find_current_project_team():
    headers = {"Authorization" : "bearer "+session['token_info']['access_token']}
    user = requests.get("https://my.api.com/users/xxxx/", headers = headers)
    user = user.json()
    ids = [x['id'] for x in user]
    return(ids)
I use ids = [x['id'] for x in user], which is the same as:
ids = []
for x in user:
    ids.append(x['id'])
to get the ids. The ids are the id values from the API, and I need them.
I get this result:
[2766233, 2766237, 2766256]
I want to check the values one by one in my database.
If a value doesn't exist, I want to add it.
If one or all of the values exist, I want to return "impossible sorry, the ids already exists".
For that I wrote a new function:
def test():
    test = find_current_project_team()
    for find_team in test:
        find_team_db = User.query.filter_by(
            login=session['login'], project_session=test
        ).first()
I have absolutely no idea how to check the values one by one.
If someone can help me, thank you :)
Currently I get this error:
sqlalchemy.exc.InterfaceError: (InterfaceError) Error binding
parameter 1 - probably unsupported type. 'SELECT user.id AS user_id,
user.login AS user_login, user.project_session AS user_project_session
\nFROM user \nWHERE user.login = ? AND user.project_session = ?\n
LIMIT ? OFFSET ?' ('my_tab_login', [2766233, 2766237, 2766256], 1, 0)

It looks to me like you are passing the list directly into the database query:
def test():
    test = find_current_project_team()
    for find_team in test:
        find_team_db = User.query.filter_by(login=session['login'], project_session=test).first()
Instead, you should pass in the ID only:
def test():
    test = find_current_project_team()
    for find_team in test:
        find_team_db = User.query.filter_by(login=session['login'], project_session=find_team).first()
Aside from that, I think you could improve the naming:
def test():
    project_teams = find_current_project_team()
    for project_team in project_teams:
        project_team_result = User.query.filter_by(login=session['login'], project_session=project_team).first()

Everything works, thanks.
My code:
project_teams = find_current_project_team()
for project_team in project_teams:
    project_team_result = User.query.filter_by(project_session=project_team).first()
    print(project_team_result)
    if project_team_result is not None:
        print("not none")
    else:
        project_team_result = User(login=session['login'], project_session=project_team)
        db.session.add(project_team_result)
        db.session.commit()
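The check-then-insert loop can be illustrated without Flask. The sketch below uses the standard-library sqlite3 module rather than SQLAlchemy, with a hypothetical table that mirrors the login and project_session columns from the error message; add_missing_ids is a made-up helper name. The key point is that each query binds a single integer, never the whole list.

```python
import sqlite3


def add_missing_ids(conn, login, ids):
    """Insert each id not yet stored; return (added, already_present)."""
    added, existing = [], []
    for project_id in ids:
        # Bind a single integer per query, never the whole list.
        row = conn.execute(
            "SELECT 1 FROM user WHERE project_session = ? LIMIT 1",
            (project_id,),
        ).fetchone()
        if row is None:
            conn.execute(
                "INSERT INTO user (login, project_session) VALUES (?, ?)",
                (login, project_id),
            )
            added.append(project_id)
        else:
            existing.append(project_id)
    conn.commit()  # one commit after all inserts
    return added, existing


# In-memory database with a hypothetical schema mirroring the question's columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user (id INTEGER PRIMARY KEY, login TEXT, project_session INTEGER)"
)
conn.execute(
    "INSERT INTO user (login, project_session) VALUES ('my_tab_login', 2766233)"
)
added, existing = add_missing_ids(
    conn, "my_tab_login", [2766233, 2766237, 2766256]
)
```

Returning both lists lets the caller decide whether to report "already exists" for some or all of the ids.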

How to update object returned in query

So I'm a Flask/SQLAlchemy newbie, but this seems like it should be pretty simple. Yet for the life of me I can't get it to work, and I can't find any documentation for this anywhere online. I have a somewhat complex query that returns a list of database objects.
items = db.session.query(X, func.count(Y.x_id).label('total')).filter(X.size >= size).outerjoin(Y, X.x_id == Y.x_id).group_by(X.x_id).order_by('total ASC')\
.limit(20).all()
After I get this list of items, I want to loop through it and update a property on each item.
for it in items:
    it.some_property = 'xyz'
    db.session.commit()
However, I'm getting an error:
it.some_property = 'xyz'
AttributeError: 'result' object has no attribute 'some_property'
I'm not crazy. I'm positive that the property does exist on model X which is subclassed from db.Model. Something about the query is preventing me from accessing the attributes even though I can clearly see they exist in the debugger. Any help would be appreciated.
class X(db.Model):
    x_id = db.Column(db.Integer, primary_key=True)
    size = db.Column(db.Integer, nullable=False)
    oords = db.relationship('Oords', lazy=True, backref=db.backref('x', lazy='joined'))
    def __init__(self, capacity):
        self.size = size

Given your example your result objects do not have the attribute some_property, just like the exception says. (Neither do model X objects, but I hope that's just an error in the example.)
They have the explicitly labeled total as second column and the model X instance as the first column. If you mean to access a property of the X instance, access that first from the result row, either using index, or the implicit label X:
items = db.session.query(X, func.count(Y.x_id).label('total')).\
    filter(X.size >= size).\
    outerjoin(Y, X.x_id == Y.x_id).\
    group_by(X.x_id).\
    order_by('total ASC').\
    limit(20).\
    all()
# Unpack a result object
for x, total in items:
    x.some_property = 'xyz'
# Please commit after *all* the changes.
db.session.commit()
As noted in the other answer you could use bulk operations as well, though your limit(20) will make that a lot more challenging.
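To make the row shape concrete, here is a small stand-in that runs without a database. It is not SQLAlchemy's row class: Row is a plain namedtuple and X is a dataclass substitute for the model, both hypothetical, and some_property is assumed to exist on the model. It only shows why assigning on the row fails while unpacking, indexing, or using the label reaches the entity.

```python
from collections import namedtuple
from dataclasses import dataclass


@dataclass
class X:
    """Stand-in for the model; some_property is assumed to exist on it."""
    x_id: int
    some_property: str = ""


# Stand-in for a result row: first column is the entity, second the labeled count.
Row = namedtuple("Row", ["X", "total"])
items = [Row(X(1), 4), Row(X(2), 0)]

row_error = None
try:
    items[0].some_property = "xyz"  # assigning on the row itself fails
except AttributeError as exc:
    row_error = str(exc)

for x, total in items:  # tuple unpacking reaches the entity
    x.some_property = "xyz"

first_by_index = items[0][0]  # same object as items[0].X
```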

You should use the update function, like this:
from sqlalchemy import update
stmt = update(users).where(users.c.id==5).\
    values(name='user #5')
Or:
session = self.db.get_session()
session.query(Organisation).filter_by(id_organisation = organisation.id_organisation).\
    update(
        {
            "name" : organisation.name,
            "type" : organisation.type,
        }, synchronize_session = False)
session.commit()
session.close()
The SQLAlchemy documentation: http://docs.sqlalchemy.org/en/latest/core/dml.html

Mongoengine, retrieving only some of a MapField

For example, in MongoDB:
> db.test.findOne({}, {'mapField.FREE':1})
{
    "_id" : ObjectId("4fb7b248c450190a2000006a"),
    "mapField" : {
        "BOXFLUX" : {
            "a" : "f",
        }
    }
}
The 'mapField' field is a MongoEngine MapField.
The 'mapField' field has a lot of keys and data, but here I retrieved only 'BOXFLUX'.
This query does not work in MongoEngine.
For example:
BoxfluxDocument.objects( ~~ querying ~~ ).only('mapField.BOXFLUX')
As you can see, neither only('mapField.BOXFLUX') nor only('mapField__BOXFLUX') works.
Both retrieve all of the 'mapField' data, including 'BOXFLUX'.
How can I retrieve only one key of a MapField?

I see there is a ticket for this: https://github.com/hmarr/mongoengine/issues/508
It works for me; here's an example test case:
def test_only_with_mapfields(self):
    class BlogPost(Document):
        content = StringField()
        author = MapField(field=StringField())
    BlogPost.drop_collection()
    post = BlogPost(content='Had a good coffee today...',
                    author={'name': "Ross", "age": "20"}).save()
    obj = BlogPost.objects.only('author__name',).get()
    self.assertEqual(obj.author['name'], "Ross")
    self.assertEqual(obj.author.get("age", None), None)

Try this:
query = BlogPost.objects({your: query})
if name:
    query = query.only('author__'+name)
else:
    query = query.only('author')

I found my mistake! I used only twice.
For example:
BlogPost.objects.only('author').only('author__name')
I spent a whole day trying to find out what was wrong with MongoEngine.
So my wrong conclusion was:
BlogPost.objects()._collection.find_one(~~ filtering query ~~, {'author.'+ name:1})
But as you know, that returns just raw data, not a MongoEngine query.
After this code, I cannot call any MongoEngine methods.
In my case, I have to build the query depending on some conditions,
so in my humble opinion it would be great if a later 'only' call overrode the 'only' calls made before it.
I hope this feature gets integrated in the next version. Right now, I have to write duplicated code.
Not this code:
query = BlogPost.objects()
query( query~~).only('author')
if name:
    query = query.only('author__'+name)
But this code:
query = BlogPost.objects()
query( query~~).only('author')
if name:
    query = BlogPost.objects().only('author__'+name)
I think the second one looks dirtier than the first one.
Of course, the first version returns all of the data,
because only('author') is applied rather than only('author__name').
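One way to avoid both the duplicated query and the double only() call is to decide on the field path first and call only() exactly once. A minimal sketch of that idea, assuming the MapField is called author as in the examples above; author_projection is a hypothetical helper name.

```python
def author_projection(name=None):
    """Return the single field path to pass to only(): one map key, or the whole map."""
    return f"author__{name}" if name else "author"


# Intended use (not executed here, since it needs a MongoEngine connection):
#     query = BlogPost.objects(...).only(author_projection(name))
path_for_key = author_projection("name")
path_for_all = author_projection()
```

Because the condition is resolved before the queryset is built, only() never has to override an earlier call.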
