I was using automatic collection creation in pymongo:
client = MongoClient(url)
db = client.get_default_database()
my_collection = db['my_collection']
I assumed the collection was created automatically by the last statement, so I added:
index_name = 'my_index'
if index_name not in my_collection.index_information():
    my_collection.create_index(..., name=index_name, unique=False)
Unfortunately, this led to the error:
pymongo.errors.OperationFailure: Collection ... doesn't exist
This made me think that the collection is only created on the first save, which leaves me no place to put the index-creation code.
So the question is: how do I create a collection with an index, but only if it doesn't already exist?
I have read this answer https://stackoverflow.com/a/9826294/258483 but don't like that it implies writing the existence check twice:
client = MongoClient(url)
db = client.get_default_database()
if 'my_collection' not in db.collection_names():
    db.create_collection('my_collection')
my_collection = db['my_collection']
index_name = 'my_index'
if index_name not in my_collection.index_information():
    my_collection.create_index(..., name=index_name, unique=False)
As you can see, calling index_information() on a collection that doesn't exist yet throws an OperationFailure.
Just call create_index without checking first:
client = MongoClient(url)
db = client.get_default_database()
my_collection = db['my_collection']
index_name = 'my_index'
my_collection.create_index(..., name=index_name, unique=False)
If the index already exists, the MongoDB server ignores the command. If the index does not exist, MongoDB creates the collection (if necessary) and the index on the collection.
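For example, a minimal sketch with a placeholder key specification (the field 'last_name' and the key direction here are just stand-ins for whatever you actually index):
from pymongo import ASCENDING, MongoClient

client = MongoClient(url)
db = client.get_default_database()
my_collection = db['my_collection']

# Safe to run on every startup: creates the collection and index if missing,
# and is a no-op if an identical index already exists.
my_collection.create_index(
    [('last_name', ASCENDING)],  # placeholder key; use your real field(s)
    name='my_index',
    unique=False,
)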
I was using the parallel_apply function from the pandarallel library; the snippet below ("Function call") iterates over rows and fetches data from MongoDB. Executing it throws an EOFError and a MongoClient warning, as given below.
Mongo function:
def fetch_final_price(model_name, time, col_name):
    collection = database['col_name']
    price = collection.find({"$and": [{"Model": model_name}, {'time': time}]})
    price = price[0]['price']
    return price
Function call:
final_df['Price'] = df1.parallel_apply(lambda x: fetch_final_price(x['model_name'], x['purchase_date'], collection_name), axis=1)
MongoClient config:
client = pymongo.MongoClient(host=host, username=username, port=port, password=password, tlsCAFile=sslCAFile, retryWrites=False)
Error:
EOFError: Ran out of input
Mongo client warning:
"MongoClient opened before fork. Create MongoClient only "
How can I make DB calls in parallel_apply?
First of all, the "MongoClient opened before fork" warning also provides a link to the documentation, from which you can learn that under multiprocessing (which pandarallel is based on) you should create the MongoClient inside your function (fetch_final_price); otherwise it will likely lead to a deadlock:
def fetch_final_price(model_name, time, col_name):
    client = pymongo.MongoClient(
        host=host,
        username=username,
        port=port,
        password=password,
        tlsCAFile=sslCAFile,
        retryWrites=False
    )
    database = client['database_name']  # use your actual database name here
    collection = database[col_name]
    price = collection.find({"$and": [{"Model": model_name}, {'time': time}]})
    price = price[0]['price']
    return price
The second mistake, which leads to the exception inside the function and the subsequent EOFError, is that you apply the bracket operator to the result of find, which is a cursor (an iterator), not a list. Consider using find_one if you only need the first matching document (alternatively, you could call next(price) instead of indexing, but that's not a good way to do this):
def fetch_final_price(model_name, time, col_name):
    client = pymongo.MongoClient(
        host=host,
        username=username,
        port=port,
        password=password,
        tlsCAFile=sslCAFile,
        retryWrites=False
    )
    database = client['database_name']  # use your actual database name here
    collection = database[col_name]
    price = collection.find_one({"$and": [{"Model": model_name}, {'time': time}]})
    price = price['price']
    return price
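As a possible refinement (a sketch, not something required by the fix above): opening a new MongoClient for every row is relatively expensive, so you can create the client lazily, once per worker process, and reuse it. This assumes the same host/credential variables as your config and a placeholder database name:
import pymongo

_client = None  # one MongoClient per worker process

def get_collection(col_name):
    global _client
    if _client is None:  # first call inside this worker process
        _client = pymongo.MongoClient(
            host=host,
            username=username,
            port=port,
            password=password,
            tlsCAFile=sslCAFile,
            retryWrites=False,
        )
    return _client['database_name'][col_name]  # placeholder database name

def fetch_final_price(model_name, time, col_name):
    doc = get_collection(col_name).find_one({"Model": model_name, "time": time})
    return doc['price']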
I have run into a situation like the one below. In fact, I am trying to update the schema of my partitioned table (partitioned by a time-unit column). I use this article and this example as my references, and the doc says that:
schemaUpdateOptions[]: Schema update options are supported in two cases: when writeDisposition is WRITE_APPEND; when writeDisposition is WRITE_TRUNCATE and the destination table is a partition of a table, specified by partition decorators. For normal tables, WRITE_TRUNCATE will always overwrite the schema.
So what I understand is, with LoadJobConfig().schema_update_options = [bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION]:
For normal tables, depending on LoadJobConfig().write_disposition:
WRITE_APPEND: the schema is updated successfully and the new data is appended to the table
WRITE_TRUNCATE: the schema is updated successfully BUT the table is overwritten with the new data
For partitioned tables, depending on LoadJobConfig().write_disposition:
WRITE_APPEND: NOT ALLOWED - Error message: "invalid: Schema update options should only be specified with WRITE_APPEND disposition, or with WRITE_TRUNCATE disposition on a table partition."
WRITE_TRUNCATE: the schema is updated successfully BUT the table is overwritten with the new data
This is always the case when I'm using LoadJobConfig(), but if I use QueryJobConfig() instead, things change.
In fact, it's still true for normal tables, but for partitioned tables, even when write_disposition=WRITE_APPEND, the schema is updated successfully and the new data is appended to the table!
How do we explain this situation, please? Is there something special about QueryJobConfig(), or did I misunderstand something?
Many thanks!
There are some slight differences between the two classes. I would recommend paying attention to the default configuration of each one, which can solve your problem; if one of them ends up returning errors, it may be due to an incorrect initialization of the configuration and its functionality.
QueryJobConfig
QueryJobConfig(**kwargs)
Configuration options for query jobs.
All properties in this class are optional. Values which are None use the server defaults. Set properties on the constructed configuration by using the property name as the name of a keyword argument.
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the destination table.
# table_id = "your-project.your_dataset.your_table_name"

# Retrieves the destination table and checks the length of the schema.
table = client.get_table(table_id)  # Make an API request.
print("Table {} contains {} columns".format(table_id, len(table.schema)))

# Configures the query to append the results to a destination table,
# allowing field addition.
job_config = bigquery.QueryJobConfig(
    destination=table_id,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Start the query, passing in the extra configuration.
query_job = client.query(
    # In this example, the existing table contains only the 'full_name' and
    # 'age' columns, while the results of this query will contain an
    # additional 'favorite_color' column.
    'SELECT "Timmy" as full_name, 85 as age, "Blue" as favorite_color;',
    job_config=job_config,
)  # Make an API request.
query_job.result()  # Wait for the job to complete.

# Checks the updated length of the schema.
table = client.get_table(table_id)  # Make an API request.
print("Table {} now contains {} columns".format(table_id, len(table.schema)))
LoadJobConfig
LoadJobConfig(**kwargs)
Configuration options for load jobs.
Set properties on the constructed configuration by using the property name as the name of a keyword argument. Values which are unset or None use the BigQuery REST API default values. See the BigQuery REST API reference documentation for a list of default values.
Required options differ based on the source_format value. For example, the BigQuery API's default value for source_format is "CSV". When loading a CSV file, either schema must be set or autodetect must be set to True.
# from google.cloud import bigquery
# client = bigquery.Client()
# project = client.project
# dataset_ref = bigquery.DatasetReference(project, 'my_dataset')
# filepath = 'path/to/your_file.csv'

# Retrieves the destination table and checks the length of the schema
table_id = "my_table"
table_ref = dataset_ref.table(table_id)
table = client.get_table(table_ref)
print("Table {} contains {} columns.".format(table_id, len(table.schema)))

# Configures the load job to append the data to the destination table,
# allowing field addition
job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
job_config.schema_update_options = [
    bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION
]
# In this example, the existing table contains only the 'full_name' column.
# 'REQUIRED' fields cannot be added to an existing schema, so the
# additional column must be 'NULLABLE'.
job_config.schema = [
    bigquery.SchemaField("full_name", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("age", "INTEGER", mode="NULLABLE"),
]
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1

with open(filepath, "rb") as source_file:
    job = client.load_table_from_file(
        source_file,
        table_ref,
        location="US",  # Must match the destination dataset location.
        job_config=job_config,
    )  # API request

job.result()  # Waits for table load to complete.
print(
    "Loaded {} rows into {}:{}.".format(
        job.output_rows, dataset_ref.dataset_id, table_ref.table_id
    )
)

# Checks the updated length of the schema
table = client.get_table(table)
print("Table {} now contains {} columns.".format(table_id, len(table.schema)))
It should be noted that google.cloud.bigquery.job.SchemaUpdateOption is shared by both classes; it specifies which updates to the destination table schema are allowed as a side effect of the job.
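As an illustration of the partition-decorator case mentioned in the quoted documentation (WRITE_TRUNCATE against a single partition), a rough sketch might look like the following; the project, dataset, table, the 20240101 suffix, and the file path are all placeholders, not values from your setup:
from google.cloud import bigquery

client = bigquery.Client()

# Target a single partition via a partition decorator ("table$YYYYMMDD").
dataset_ref = bigquery.DatasetReference("your-project", "your_dataset")
partition_ref = dataset_ref.table("your_table$20240101")

job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

with open("path/to/your_file.csv", "rb") as source_file:
    job = client.load_table_from_file(source_file, partition_ref, job_config=job_config)
job.result()  # only the targeted partition is truncated and reloaded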
I would like the data to be inserted into mycollection, but the code below literally inserts into a collection called 'collection', even though I defined the collection variable before calling insert_one.
client = MongoClient()
db = client['mydb']
collection = db['mycollection']
db.collection.insert_one({"id": "hello"})
I didn't realize I had to remove the db part. This worked:
collection = db['mycollection']
collection.insert_one({"id": "hello"})
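For anyone else hitting this: db.collection is attribute-style access, so pymongo uses the literal attribute name "collection" as the collection name; the Python variable name never comes into play. Attribute access and bracket access refer to the same collection only when the names match:
db.mycollection.insert_one({"id": "hello"})     # collection named "mycollection"
db['mycollection'].insert_one({"id": "hello"})  # the same collection, bracket style
db.collection.insert_one({"id": "hello"})       # a different collection, literally named "collection"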
I am using AppEngine with the Python runtime environment to host a dashboard for my team. The data for the dashboard is stored in Memcache and/or Cloud Datastore. New data is pulled into the application using the BigQuery API.
class ExampleForStackOverflow(webapp2.RequestHandler):
    def get(self):
        credentials = GoogleCredentials.get_application_default()
        bigquery_service = build('bigquery', 'v2', credentials=credentials)
        query = """SELECT field1, field2
                   FROM [table_name];"""
        try:
            timeout = 10000
            num_retries = 5
            query_request = bigquery_service.jobs()
            query_data = {
                'query': query,
                'timeoutMs': timeout,
            }
            query_response = query_request.query(
                projectId='project_name',
                body=query_data).execute(num_retries=num_retries)
            # Insert query response into datastore
            for row in query_response['rows']:
                parent_key = ndb.Key(MyModel, 'default')
                item = MyModel(parent=parent_key)
                item.field1 = row['f'][0]['v']
                item.field2 = row['f'][1]['v']
                item.put()
        except HttpError as err:
            print('Error: {}'.format(err.content))
            raise err
These queries will return an indeterminate number of records. I want the dashboard to display the results of the queries regardless of the number of records, so ordering by created and then using fetch() to pull a certain number of records won't help.
Is it possible to write a query to return everything from the last put() operation?
So far I have tried returning all records that have been written within a certain time window (e.g. How to query all entries from past 6 hours (datetime) in GQL?).
That isn't working reliably for me because every so often the cron job that queries for the new data fails, and I'm left with a blank graph until the cron job runs the following day.
I need a resilient query that will always return data. Thanks in advance.
You could add a DateTimeProperty to MyModel, let's call it last_put, with the auto_now option set to True. The datetime of the most recent update of each such entity would then be captured in its last_put property.
In your get() method you'd start with an ancestor query on the MyModel entities, sorted by last_put and fetching only one item - it will be the most recently updated one.
The last_put property value of that MyModel entity will give the datetime of the last put() you're seeking. Which you can then use in your bigquery query, as mentioned in the post you referenced, to get the entities since that datetime.
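A rough sketch of what this could look like (the field names mirror the question; last_put and the 'default' ancestor key are assumptions based on the description above):
from google.appengine.ext import ndb

class MyModel(ndb.Model):
    field1 = ndb.StringProperty(indexed=True)
    field2 = ndb.StringProperty(indexed=True)
    last_put = ndb.DateTimeProperty(auto_now=True)  # refreshed on every put()

# Most recently updated entity under the 'default' parent
parent_key = ndb.Key(MyModel, 'default')
latest = MyModel.query(ancestor=parent_key).order(-MyModel.last_put).get()
if latest:
    cutoff = latest.last_put  # datetime of the last put(), usable in the BigQuery query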
Dan's answer led me down the right path but I used a variation of what he suggested (mostly because I don't have a good understanding of ancestor queries). I know this isn't the most efficient way to do this but it'll work for now. Thanks, Dan!
My model:
class MyModel(ndb.Model):
    field1 = ndb.StringProperty(indexed=True)
    field2 = ndb.StringProperty(indexed=True)
    # auto_now_add stamps each entity when it is written; a default of
    # datetime.datetime.now() would be evaluated only once, at class definition.
    created = ndb.DateTimeProperty(auto_now_add=True)
My query:
query = MyModel.query().order(-MyModel.created)
query = query.fetch(1, projection=[MyModel.created])
for a in query:
    time_created = a.created
query = MyModel.query()
query = query.filter(MyModel.created == time_created)
I'm connecting to my mongodb using pymongo:
mongo = MongoClient('localhost', 27017)
mongo_db = mongo['test']
mongo_coll = mongo_db['test']  # tweets collection
I have a cursor and am looping through every record:
cursor = mongo_coll.find()
for record in cursor:  # for all the tweets in the database
    try:
        msgurl = record["entities"]["urls"]  # look for URLs in the tweets
    except:
        continue
The reason for the try/except is that if ["entities"]["urls"] does not exist, the lookup raises an error.
How can I determine whether ["entities"]["urls"] exists?
record is a dictionary in which the key "entities" maps to another dictionary, so just check whether "urls" is in that dictionary:
if "urls" in record["entities"]:
If you just want to proceed in any case, you can also use get.
msgurl = record["entities"].get("urls")
This will cause msgurl to equal None if there is no such key.
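If "entities" itself could be missing as well, chaining get calls with an empty dict as the default covers both levels:
msgurl = record.get("entities", {}).get("urls")
if msgurl:
    print(msgurl)  # only runs when the tweet actually has URLs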
I'm not familiar with pymongo, but why don't you change your query so it only returns results that contain "urls"? Something like:
mongo_coll.find({"entities.urls": {$exists:1}})
http://docs.mongodb.org/manual/reference/operator/exists/
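Applied to the original loop, this would look roughly like the sketch below; the try/except becomes unnecessary because only documents that have "entities.urls" are returned:
cursor = mongo_coll.find({"entities.urls": {"$exists": True}})
for record in cursor:
    msgurl = record["entities"]["urls"]  # present in every returned document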