I am trying to work with Kafka for data ingestion, but being new to this I am fairly confused. I have multiple crawlers that extract data for me from a web platform. Now, the issue is that I want to ingest that extracted data into Hadoop using Kafka, without any middle scripts/service files. Is that possible?
Unfortunately, no.
You need some service that writes into Kafka (your scraper). Whether you produce the HTTP links into Kafka (and then write an intermediate consumer/producer that generates the scraped results), or only produce the final scraped results, is up to you.
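A producer of that kind can be quite small. Here is a minimal sketch assuming the kafka-python client; the broker address and topic name are placeholders of my own choosing:

```python
import json


def build_record(url, payload):
    # Wrap a scraped result with its source URL so downstream
    # consumers know where the data came from.
    return json.dumps({"url": url, "data": payload}).encode("utf-8")


if __name__ == "__main__":
    # Requires the kafka-python package and a reachable broker;
    # "localhost:9092" and "scraped-results" are assumptions.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("scraped-results",
                  build_record("https://example.com", {"title": "Example"}))
    producer.flush()
```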
You also need a second service consuming those topic(s) that writes to HDFS. That could be Kafka Connect (via Confluent's HDFS Connector), PySpark code you'd have to write yourself, or other options, all of which count as "middle scripts/services".
If you'd like to combine both parts in one tool, I would suggest taking a look at Apache NiFi or StreamSets, which can perform HTTP lookups and (X)HTML parsing and provide Kafka and HDFS connectors, all configured via a centralized GUI. Note that I believe any Python code would have to be rewritten in a JVM language to support non-trivial custom parsing logic in this pipeline.
I have a Kafka consumer in Python that processes log data (stack-trace analysis and automatic issue creation), and we are also running the ELK stack in parallel. Is there any way to stream logs to Python via ELK so we can get rid of Kafka? I have no experience with ELK and can't find anything about streaming from it. It seems I can only poll for log data periodically, but that doesn't seem like an ideal solution.
No, you cannot stream data out of Elasticsearch on its own.
If your input is something else, you can use Logstash's various output plugins (or write your own) to feed events into a Python process.
For example, the pipe, tcp, websocket/http, and exec plugins are all generic enough to be used from any language.
However, Logstash does not persist events the way Kafka does, so if you want something that can handle back-pressure and doesn't drop events, you'd keep Kafka around.
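As an illustration of the tcp route, a small Python receiver for Logstash's tcp output (with the json_lines codec) could look like the sketch below; the host, port, and field names are assumptions:

```python
import json
import socketserver


def parse_event(raw):
    # Decode one newline-delimited JSON event; return None for
    # malformed lines rather than crashing the receiver.
    try:
        return json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None


class EventHandler(socketserver.StreamRequestHandler):
    # Reads newline-delimited JSON events, as emitted by Logstash's
    # tcp output with the json_lines codec.
    def handle(self):
        for raw in self.rfile:
            event = parse_event(raw)
            if event is not None:
                print(event.get("message"))


if __name__ == "__main__":
    # Matching Logstash side (an assumption):
    # output { tcp { host => "myhost" port => 5050 codec => json_lines } }
    with socketserver.TCPServer(("0.0.0.0", 5050), EventHandler) as server:
        server.serve_forever()
```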
I have a python script on my local machine that reads a CSV file and outputs some metrics. The end goal is to create a web interface where the user uploads the CSV file and the metrics are displayed, while all being hosted on Azure.
I want to use a VM on Azure to run this python script.
The script takes the CSV file and outputs metrics which are stored in CosmosDB.
A web interface reads from this DB and displays graphs from the data generated by the script.
Can someone elaborate on the steps I need to follow to achieve this? Detailed steps are not strictly required; a brief overview with links to relevant learning resources would be helpful.
There's an article that lists the primary options for hosting sites in Azure: https://learn.microsoft.com/en-us/azure/developer/python/quickstarts-app-hosting
As Sadiq mentioned, Functions is probably your best choice: it will likely be cheaper, lower-maintenance, and can handle both the script and the web interface. Here is a Python tutorial for that method: https://learn.microsoft.com/en-us/azure/developer/python/tutorial-vs-code-serverless-python-01
Option 2 would be to run a traditional website on an App Service plan, with background tasks handled either by Functions or a WebJob (they both use the WebJobs SDK, so the code is very similar): https://learn.microsoft.com/en-us/learn/paths/deploy-a-website-with-azure-app-service/
VMs are an option if neither of those works, but they come with significantly more administration. This learning path has info on how to do that; the website in it is built on the MEAN stack, but the approach applies to Python as well: https://learn.microsoft.com/en-us/learn/paths/deploy-a-website-with-azure-virtual-machines/
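Whichever hosting option you pick, the metrics computation itself is plain Python. A minimal sketch, assuming a hypothetical CSV with a numeric "score" column; a Function's HTTP trigger would run this on the uploaded file and write the resulting dict to Cosmos DB:

```python
import csv
import io
import statistics


def compute_metrics(csv_text, column):
    # Compute simple summary metrics for one numeric column.
    # The column name is a placeholder for whatever the real file contains.
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in rows]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "max": max(values),
    }
```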
At my company we have developed a few games; for some of them the events are sent to Elasticsearch, and for others to RabbitMQ. We have a local CLI that grabs the data from both and compiles the messages into compressed (gzip) JSON files, after which another CLI converts them to SQL statements and loads them into a local SQL Server. We now want to scale up, but the current setup is painful and nowhere near real-time for analysis.
I recently built an application in Python that I was planning to publish to a Docker container on AWS. The script grabs data from Elasticsearch, compiles it into small compressed JSONs, and publishes to an S3 bucket. From there the data is ingested into Snowflake for analysis. So far I've been able to get the data in quite quickly, and it looks promising as an alternative.
I was planning to do something similar with RabbitMQ, but I wanted to find an even better alternative that would let this ingestion happen seamlessly and spare me from implementing all sorts of exception handling in the Python code.
I've researched a bit and found there might be a way to link RabbitMQ to Amazon Kinesis Firehose. My question would be: How would I send the stream from RabbitMQ to Kinesis?
For Elasticsearch, what is the best way to achieve this? I've read about the logstash plugin for S3 (https://www.elastic.co/guide/en/logstash/current/plugins-outputs-s3.html) and about logstash plugin for kinesis (https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kinesis.html). Which approach would be ideal for real-time ingestion?
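For context, the compress-and-publish step described above can be done with the standard library alone; this sketch leaves out the S3 upload itself (e.g. via boto3) and uses a JSON-lines layout as an assumption:

```python
import gzip
import json


def write_batch(events, path):
    # Compress a batch of event dicts into one gzipped JSON-lines file,
    # ready to upload to S3.
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for event in events:
            fh.write(json.dumps(event) + "\n")


def read_batch(path):
    # Read a gzipped JSON-lines batch back into a list of dicts.
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh]
```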
My answer is mostly theoretical; it would need to be tested in the real world and adapted to your use case.
For near-real-time behaviour, I would use Logstash with:
an elasticsearch input on a short cron schedule (this post can help: https://serverfault.com/questions/946237/logstashs-elasticsearch-input-plugin-should-be-used-to-output-to-elasticsearch)
an s3 output (it supports gzip)
maybe a jdbc output to your DB
the rabbitmq output plugin
You can build a more scalable architecture by outputting to RabbitMQ and using separate pipelines that listen to the queue and execute further tasks:
From Logstash: ES -> RabbitMQ
From Logstash: RabbitMQ -> SQL
From Logstash: RabbitMQ -> Kinesis
From Logstash: RabbitMQ -> AWS
etc.
I have about 100 million json files (10 TB), each with a particular field containing a bunch of text, for which I would like to perform a simple substring search and return the filenames of all the relevant json files. They're all currently stored on Google Cloud Storage. Normally for a smaller number of files I might just spin up a VM with many CPUs and run multiprocessing via Python, but alas this is a bit too much.
I want to avoid spending too much time setting up infrastructure like a Hadoop server, or loading all of that into some MongoDB database. My question is: what would be a quick and dirty way to perform this task? My original thoughts were to set up something on Kubernetes with some parallel processing running Python scripts, but I'm open to suggestions and don't really have a clue how to go about this.
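For reference, the multiprocessing approach mentioned above might be sketched like this for a single VM; the directory, field name, and worker count are placeholders:

```python
import glob
import json
from multiprocessing import Pool


def file_matches(args):
    # Return the filename if the given field contains the needle, else None.
    path, field, needle = args
    with open(path, encoding="utf-8") as fh:
        doc = json.load(fh)
    return path if needle in doc.get(field, "") else None


def search(paths, field, needle, workers=8):
    # Fan the per-file scans out over a pool of worker processes.
    with Pool(workers) as pool:
        hits = pool.map(file_matches, [(p, field, needle) for p in paths])
    return [h for h in hits if h]


if __name__ == "__main__":
    paths = glob.glob("data/*.json")  # hypothetical local copy of the corpus
    if paths:
        print(search(paths, "text", "needle"))
```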
Easier would be to just load the GCS data into BigQuery and run your query from there.
Send your data to AWS S3 and use Amazon Athena.
The Kubernetes option would be to set up a cluster in GKE and install Presto on it with a lot of workers, use a Hive metastore with GCS, and query from there (Presto doesn't have a direct GCS connector yet, AFAIK). This option is more elaborate.
Hope it helps!
I am currently trying to develop something using Google App Engine. I am using Python as my runtime and need some advice on setting up the following.
I am running a webserver that provides JSON data to clients. The data comes from an external service from which I have to pull it.
What I need is a background process that checks memcache for any required IDs; if there is an ID, I need to fetch some data for that ID from the external source and place the data in memcache.
If there are multiple IDs (say more than 30), I need to be able to make all 30 requests as quickly and efficiently as possible.
I am new to Python development and App Engine, so any advice you could give would be great.
Thanks.
You can use "backends" or "task queues" to run processes in the background. Tasks have a 10-minute run time limit, and backends have no run time limit. There's also a cronjob mechanism which can trigger requests at regular intervals.
You can fetch the data from external servers with the "URLFetch" service.
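To pull many IDs quickly, the fan-out pattern looks roughly like this generic sketch; on first-generation App Engine you would swap the thread pool for URLFetch's asynchronous calls, and fetch_one stands in for whatever function performs the actual HTTP request:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(ids, fetch_one, workers=10):
    # Fan out one fetch per ID and collect the results in a dict
    # keyed by ID. fetch_one is the caller-supplied HTTP function.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(ids, pool.map(fetch_one, ids)))
```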
Note that using memcache as the communication mechanism between front-end and back-end is unreliable -- the contents of memcache may be partially or fully erased at any time (and it does happen from time to time).
Also note that you can't query memcache if you don't know the exact keys ahead of time. It's probably better to use the task queue to queue up requests instead of memcache, or to use the datastore as the storage mechanism.