MongoDB trigger => Python

I'm currently building a pipeline that reads data from MongoDB every time a new document gets inserted and sends it to an external data source after some preprocessing. The preprocessing and the sending to the external data source both work the way I designed them.
The problem, however, is that I can't read the data from MongoDB. I'm trying to build a trigger that reads data from MongoDB when a certain collection gets updated and then sends it to Python. I'm not considering polling MongoDB since it's too resource-intensive.
I've found the library mongotriggers (https://github.com/drorasaf/mongotriggers/) and am now taking a look at it.
In summary, how can I build a trigger that sends data from MongoDB to Python when a new document gets inserted into a specific collection?
Any comment or feedback would be appreciated.
Thanks in advance.
Best
Gee

In MongoDB v3.6+, you can now use MongoDB Change Streams. Change streams allow applications to access real-time data changes without the complexity and risk of tailing the oplog. Applications can use change streams to subscribe to all data changes on a single collection, a database, or an entire deployment, and immediately react to them.
For example, to listen to the change stream when a new document gets inserted:
import logging
import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client.test  # change streams require a replica set or sharded cluster

try:
    with db.collection.watch([{'$match': {'operationType': 'insert'}}]) as stream:
        for insert_change in stream:
            # Do something with the inserted document
            print(insert_change)
except pymongo.errors.PyMongoError:
    # The ChangeStream encountered an unrecoverable error or the
    # resume attempt failed to recreate the cursor.
    logging.error('...')
pymongo.collection.Collection.watch() is available from PyMongo 3.6.0+.
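If the stream does fail, you can pick up where you left off instead of starting over. A minimal sketch, assuming a newer PyMongo version (the ChangeStream.resume_token property arrived after the initial 3.6 release; resume_after itself has been available since 3.6):

resume_token = None
pipeline = [{'$match': {'operationType': 'insert'}}]

try:
    with db.collection.watch(pipeline) as stream:
        for insert_change in stream:
            print(insert_change)
            resume_token = stream.resume_token   # remember the last seen token
except pymongo.errors.PyMongoError:
    if resume_token is not None:
        # Reopen the stream just after the last event that was processed.
        with db.collection.watch(pipeline, resume_after=resume_token) as stream:
            for insert_change in stream:
                print(insert_change)
    else:
        logging.error('Change stream failed before any event was received')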

Related

How to be informed that some database information has been changed in Python

I'm working on code written in Python 2.7 that connects to a MariaDB database to read data.
This database receives data from different external resources. My code only reads it.
My service reads the data once at startup and keeps everything in memory to avoid I/O.
I would like to know if there is some way to create a 'function callback' in my code to be alerted of new updates/inserts, so I can reload my in-memory data from the database every time an external resource changes or saves new data.
I have thought of creating an SQL trigger on a new table that inserts a "flag" there, and having my service periodically check that table for the flag.
If the flag is present, reload the data and delete the flag.
But it sounds like the wrong workaround...
I'm using:
Python 2.7
MariaDB Ver 15.1 Distrib 10.3.24-MariaDB
lib mysql-connector 2.1.6
A better solution for MariaDB is streaming with the CDC API: https://mariadb.com/resources/blog/how-to-stream-change-data-through-mariadb-maxscale-using-cdc-api/
The plan you have now, using a flag table, means your client has to poll that table for the presence of the flag. You have to run a query against it at intervals, and keep doing it 24/7. Depending on how quickly your client needs to be notified of a change, you might need to run this polling query very frequently, which puts a burden on the MariaDB server just to answer the polling queries, even when there is no change to report.
The CDC solution is better because the client simply asks to be notified the next time a change occurs and then waits. It puts no more extra load on the MariaDB server than adding a replica server would.
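As a rough illustration only (the host, port, and especially the authentication/registration handshake are assumptions; see the blog post above and the cdc.py client shipped with MaxScale for the exact protocol), the consuming side could look something like this: connect to the MaxScale CDC listener, request the table, then block reading JSON change events and reload the in-memory cache as they arrive.

import json
import socket

HOST, PORT = 'maxscale.example.com', 4001        # hypothetical CDC listener address

sock = socket.create_connection((HOST, PORT))
try:
    # ... authenticate and request your database.table here, following the
    # MaxScale CDC protocol docs (handshake omitted in this sketch) ...
    buf = b''
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        buf += chunk
        while b'\n' in buf:                      # events assumed newline-delimited JSON
            line, buf = buf.split(b'\n', 1)
            event = json.loads(line)
            reload_cache(event)                  # hypothetical: refresh the in-memory data
finally:
    sock.close()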

Data ingestion using kafka from crawlers

I am trying to work with Kafka for data ingestion, but being new to this, I am pretty confused. I have multiple crawlers that extract data for me from a web platform. Now, the issue is that I want to ingest that extracted data into Hadoop using Kafka without any middle scripts/service file. Is it possible?
without any middle scripts/service file. Is it possible?
Unfortunately, no.
You need some service that writes into Kafka (your scraper). Whether you produce the HTTP links into Kafka (and then write an intermediate consumer/producer that generates the scraped results) or only produce the final scraped results is up to you.
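For instance, a minimal sketch of the producing service using kafka-python (the broker address and topic name are made up here):

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',                      # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

def publish_scraped(record):
    # Called by the crawler for each scraped item.
    producer.send('scraped-results', record)                 # hypothetical topic

# ... crawler calls publish_scraped({...}) for every item ...
producer.flush()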
You also need a second service consuming those topic(s) that writes to HDFS. This could be Kafka Connect (via Confluent's HDFS Connector), PySpark (code you'd have to write yourself), or other options that also count as "middle scripts/services".
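If you went the PySpark route, a rough sketch with Structured Streaming might look like the following (topic and HDFS paths are assumptions, and the spark-sql-kafka connector package must be on the classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scraped-to-hdfs").getOrCreate()

# Read the scraped results from Kafka as a stream.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "scraped-results")
      .load())

# Write the message values to HDFS as Parquet files.
(df.selectExpr("CAST(value AS STRING) AS value")
   .writeStream
   .format("parquet")
   .option("path", "hdfs:///data/scraped")
   .option("checkpointLocation", "hdfs:///checkpoints/scraped")
   .start()
   .awaitTermination())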
If you'd like to combine both options, I would suggest taking a look at Apache NiFi or StreamSets, which can perform HTTP lookups, (X)HTML parsing, and Kafka+HDFS connectors, all configured via a centralized GUI. Note: I believe any Python code would have to be rewritten in a JVM language to support major custom parsing logic in this pipeline.

BigQuery API for access log - I'm losing data

I have been writing an access log to a MySQL table, but recently it became too much for MySQL, so I decided to save it in Google BigQuery instead. I don't know if it is the best option, but it seems viable. Does anyone have comments about that? Okay...
I started integrating with Google BigQuery and made a small application with Flask (a Python framework). I created endpoints to receive data and send it to BigQuery. Now my main application sends data to a URL that points to my Flask application, which in turn sends it to BigQuery. Any observations or suggestions here?
Finally, my problem: sometimes I'm losing data. I wrote a script to test my main application and check the results; I ran it many times and noticed that I lost some data, because sometimes the same data gets saved and sometimes it doesn't. Does anyone have an idea of what could be happening? And most importantly, how can I prevent losing data in that case? How can my application notice that data wasn't saved to Google BigQuery and then handle it, e.g. by trying again?
I am using google-cloud-python library (reference: https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html#tables).
My code:
from google.cloud import bigquery
from google.cloud.bigquery import SchemaField

client = bigquery.Client(project=project_id)
table_ref = client.dataset(dataset_id).table(table_id)
SCHEMA = [SchemaField(**field) for field in schema]
errors = client.create_rows(table_ref, [row], SCHEMA)
That is all
As suspected, you don't handle errors. Make sure you handle them and understand how streaming inserts work. If you stream 1000 rows and 56 fail, you get that information back in the response, and you need to retry only those 56 rows. The insertId is also important, since BigQuery uses it to deduplicate retried rows.
Streaming Data into BigQuery
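A minimal sketch of handling the return value of the code above, assuming the documented format for this client version (a list of mappings, each with the failed row's 'index' and its 'errors'):

rows = [row]                                    # the row(s) you are streaming
errors = client.create_rows(table_ref, rows, SCHEMA)

if errors:
    # Each entry identifies a failed row by index; retry only those rows.
    failed_indexes = [e['index'] for e in errors]
    retry_rows = [rows[i] for i in failed_indexes]
    errors = client.create_rows(table_ref, retry_rows, SCHEMA)
    # For safe retries, also look into passing a per-row insertId for
    # deduplication, e.g. derived from a unique key in your log record.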

Writing a Django backend program that runs indefinitely -- what to keep in mind?

I am trying to write a Django app that queries a remote database for some data, performs some calculations on a portion of this data and stores the results (in the local database using Django models). It also filters another portion and stores the result separately. My front end then queries my Django database for these processed data and displays them to the user.
My questions are:
How do I write an agent program that continuously runs in the backend, downloads data from the remote database, does calculations/filtering, and stores the result in the local Django database? In particular, what are the most important things to keep in mind when writing a program that runs indefinitely?
Is using cron for this purpose a good idea?
The data retrieved from the remote database belongs to multiple users, and each user's data must be kept/stored separately in my local database as well. How do I achieve that? Using row-level/class-instance-level permissions, maybe? Remember that the backend agent does the storage, updates, and deletes; the front end only reads data (through HTTP requests).
And finally, I allow creation of new users. If a new user has valid credentials for the remote database, the user should be allowed to use my app. In that case, my backend will download this particular user's data from the remote database, perform calculations/filtering, and present the results to the user. How can I handle the dynamic creation of objects/database tables for the new users? And how can I differentiate between users' data when retrieving it?
Would very much appreciate answers from experienced programmers with knowledge of Django. Thank you.
Taking your questions in order:
1) The standard go-to solution for timed and background tasks is Celery, which has Django integration. There are others, like Huey (https://github.com/coleifer/huey).
2) The usual solution is that each row contains a user_id column identifying which user the data belongs to. This maps to the User model using Django ORM's ForeignKey field (see the sketch after this list). Do your users need to query the database directly, or do they have direct database accounts? If not, then this solution should be enough. It sounds like your front end has one database connection and all permission logic is handled by the front end, not the database itself.
3) See 2
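A rough sketch of points 1 and 2 together, using a Celery task scheduled via celery beat and a per-user ForeignKey. Model, task, and helper names (users_with_remote_credentials, fetch_remote_data, compute) are placeholders for your own logic, not anything from the question:

from celery import shared_task
from django.conf import settings
from django.db import models


class ProcessedResult(models.Model):
    # Each stored row is tied to the user it was computed for (point 2).
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    value = models.FloatField()
    created_at = models.DateTimeField(auto_now_add=True)


@shared_task
def refresh_results():
    # Point 1: schedule this with celery beat instead of cron; Celery also
    # gives you retries, logging, and monitoring for the long-running agent.
    for user in users_with_remote_credentials():        # placeholder helper
        raw = fetch_remote_data(user)                    # placeholder helper
        ProcessedResult.objects.create(user=user, value=compute(raw))

The front end then reads only the logged-in user's rows, e.g. ProcessedResult.objects.filter(user=request.user), so no per-user tables are needed.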

Is it possible to do a nonblocking bulk insert into Mongo from Python?

I have to read some data coming across the network, and then insert it into a MongoDB. All of this is done using a Python script. Is there any way to do a nonblocking insert into Mongo from Python so I can continue to process the data coming over the network while the Mongo server handles the insert?
Yes. There are three possibilities:
Do an unacknowledged insert: collection.insert(documents, w=0). The call to insert() returns as soon as the data has been written to the socket's output buffer, and your code does not block awaiting acknowledgment from MongoDB. You'll get no error reporting to your application. More info about write concerns here and here.
Add batches of documents to a Queue, and have a separate thread pop from the queue and insert into MongoDB: collection.insert(q.get()) (see the sketch after this list).
Port your application to Tornado and Motor as James Mills suggested; this seems to require the most learning of the three and is probably overkill.
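A minimal sketch of the second option (the database, collection, and batch source are made up here, and it uses insert_many(), the current name for bulk inserts in PyMongo):

import queue
import threading

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
collection = client.mydb.events                  # hypothetical database/collection

q = queue.Queue()

def inserter():
    while True:
        batch = q.get()                          # blocks until a batch arrives
        if batch is None:                        # sentinel: shut the worker down
            break
        collection.insert_many(batch)
        q.task_done()

threading.Thread(target=inserter, daemon=True).start()

# The network-reading loop never blocks on MongoDB; it just enqueues batches:
# q.put(list_of_documents)
# ...and finally q.put(None) when done.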
Give Motor a try. It is an asynchronous MongoDB driver for the Tornado web framework.
Otherwise, there are other options:
https://pypi.python.org/pypi/asyncmongo
https://github.com/fiorix/mongo-async-python-driver
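For reference, a minimal sketch of a nonblocking bulk insert with Motor on Tornado's event loop, loosely following Motor's tutorial (database and collection names are made up):

import motor.motor_tornado
from tornado import ioloop

client = motor.motor_tornado.MotorClient('mongodb://localhost:27017')
db = client.test_database

async def do_insert():
    # The await suspends this coroutine until MongoDB acknowledges the write,
    # while other coroutines keep running on the event loop in the meantime.
    result = await db.events.insert_many([{'i': i} for i in range(1000)])
    print('inserted %d docs' % len(result.inserted_ids))

ioloop.IOLoop.current().run_sync(do_insert)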
