I am trying to build a big aggregated table with googles tools but I am a bit lost on the 'how to do it'.
Here is what I would like to create: I have a big table in bigquery. Its updated daily with about 1.2M events for evert user of the application. I would like to have an auto updating aggregate table(udpated once every day) built upon that with all user data broken by userID. But how do I continiously update the data inside of it?
I read a bit about firebase and bigquery but since they are very new to me I cant figure out if this is possible to do serverlessly?
I know how to do it with a jenkins process that queries the big events table for the last day, gets all userIDs, joins with the data from the existing aggregate values for the userIDs, takes whatever is changed and deletes from the aggregate in order to insert the updated data. (In python)
The issue is I want to do this entirely within the structure of google. Is firebase able to do that? Is bigquery able? How? What tools? Could this be solved using the serverless functions available?
I am more familiar with Redshift.
You can use a pretty new BigQuery feature that to schedule queries.
I use it to create rollup tables. If you need more custom stuff you can use cloud scheduler to call any Google product that can be triggered by an HTTP request such as cloud function, cloud run, or app engine.
Related
There is an external API that has some data that I need to have a copy of inside a dynamoDB table.
The external API gives me an user_id and some user_data. There are many user_ids that need to be updated on a regular basis (every few hours).
I can easily invoke the function to get the data initially into the table and record the first batch (as each user_id is added manually).
But how can I schedule the lambda function, to run again, with the user ID as a parameter, to update the record automatically, 3 hours after it's been added to the table?
I was thinking about some 'scheduler_table' to which to save a timestamp with when the next update should be and for which user_id, and then having a trigger that invokes a lambda function every 10 minutes to scan the table, to see if there are any 'expired' timestamps, but this seems sub-optimal and I was wondering if there are any AWS resources, design decisions or methods I can use, that will allow me to do this in a more straight-forward and easy to manage manner?
If the dynamoDB scheduler_table solution is reasonable, what would be the best way to organize the primary/sort keys and how would I go about querying them? (as I've found out that scanning wouldn't be a great solution)
PS: Solutions are not too hard on my pokets are preferred.
Thanks!
I have a bigquery table about 200 rows, i need to insert,delete and update values in this through a web interface(the table cannot be migrated to any other relational or non-relational database).
The web application will be deployed in google-cloud on app-engine and the user who acts as admin and owner privileges on Bigquery will be able to create and delete records and the other users with view permissions on the dataset in bigquery will be able to view records only.
I am planning to use the scripting language as python,
server(django or flask or any other)-> not sure which one is better
The web application should be displayed as a data-grid like appearance with buttons create,delete or view visiblility according to their roles.
I have not done anything like this in python,bigquery and django. I am already familiar with calling bigquery from python-client but to call in a web interface and in a transactional way, i am totally new.
I am seeing examples only related to django with their inbuilt model and not with big-query.
Can anyone please help me and clarify whether this is possible to implement and how?
I was able to achieve all of "C R U D" on Bigquery with the help of SQLAlchemy, though I had make a lot of concessions like if i use sqlalchemy class i needed to use a false primary key as Bigquery does not use any primary key and for storing sessions i needed to use file based session On Django for updates and create sqlalchemy does not allow without primary key, so i used raw sql part of SqlAlchemy. Thanks to the #mhawke who provided the hint for me to carry out this exericse
No, at most you could achieve the "R" of "CRUD." BigQuery isn't a transactional database, it's for querying vast amounts of data and preparing the results as an immutable view.
It doesn't provide a method to modify the source data directly and even if you did you'd need to run the query again. Also important to note are that queries are asynchronous and require much longer to perform than traditional databases.
The only reasonable solution would be to export the table data to GCS and then import it into a normal database for querying. Alternatively if you can't use another database and since you said there are only 1,000 rows you could perform your CRUD actions directly on that exported CSV.
I am trying to find a way to automatically update a big query table using this link: https://www6.sos.state.oh.us/ords/f?p=VOTERFTP:DOWNLOAD::FILE:NO:2:P2_PRODUCT_NUMBER:1
This link is updated with new data every week and I want to be able to replace the Big Query table with this new data. I have researched that you can export spreadsheets to Big Query, but that is not a streamlined approach.
How would I go about submitting a script that imports the data and having that data be fed to Big Query?
I assume you already have a working script that parses the content of the URL and places the contents in BigQuery. Based on that I would recommend the following workflow:
Upload the script as a Google Cloud Function. If your script isn't written in a compatible language (i.e. Python, Node, Go), you can use Google Cloud Run instead. Set the Cloud Function to be triggered by a Pub/Sub message. In this scenario, the content of your Pub/Sub message doesn't matter.
Set up a Google Cloud Scheduler job to (a) run at 12am every Saturday (or whatever time you wish) and (b) send a dummy message to the Pub/Sub topic that your Cloud Function is subscribed to.
You can try using a HTTP request to the page using a programming language like Python with the Request library, save the data into a Pandas Dataframe or a CSV file, and then using the BigQuery libraries you can push that data into a BigQuery table.
I have a basic personal project website that I am looking to learn some web dev fundamentals with and database (SQL) fundamentals as well (If SQL is even the right technology to use??).
I have the basic skeleton up and running but as I am new to this, I want to make sure I am doing it in the most efficient and "correct" way possible.
Currently the site has a main index (landing) page and from there the user can select one of a few subpages. For the sake of understanding, each of these sub pages represents a different surf break and they each display relevant info about that particular break i.e. wave height, wind, tide.
As I have already been able to successfully scrape this data, my main questions revolve around how would I go about inserting this data into a database for future use (historical graphs, trends)? How would I ensure data is added to this database in a continuous manner (once/day)? How would I use data that was scraped from an earlier time, say at noon, to be displayed/used at 12:05 PM rather than scraping it again?
Any other tips, guidance, or resources you can point me to are much appreciated.
This kind of data is called time series. There are specialized database engines for time series, but with a not-extreme volume of observations - (timestamp, wave heigh, wind, tide, which break it is) tuples - a SQL database will be perfectly fine.
Try to model your data as a table in Postgres or MySQL. Start by making a table and manually inserting some fake data in a GUI client for your database. When it looks right, you have your schema. The corresponding CREATE TABLE statement is your DDL. You should be able to write SELECT queries against your table that yield the data you want to show on your webapp. If these queries are awkward, it's a sign that your schema needs revision. Save your DDL. It's (sort of) part of your source code. I imagine two tables: a listing of surf breaks, and a listing of observations. Each row in the listing of observations would reference the listing of surf breaks. If you're on a Mac, Sequel Pro is a decent tool for playing around with a MySQL database, and playing around is probably the best way to learn to use one.
Next, try to insert data to the table from a Python script. Starting with fake data is fine, but mold your Python script to read from your upstream source (the result of scraping) and insert into the table. What does your scraping code output? Is it a function you can call? A CSV you can read? That'll dictate how this script works.
It'll help if this import script is idempotent: you can run it multiple times and it won't make a mess by inserting duplicate rows. It'll also help if this is incremental: once your dataset grows large, it will be very expensive to recompute the whole thing. Try to deal with importing a specific interval at a time. A command-line tool is fine. You can specify the interval as a command-line argument, or figure out out from the current time.
The general problem here, loading data from one system into another on a regular schedule, is called ETL. You have a very simple case of it, and can use very simple tools, but if you want to read about it, that's what it's called. If instead you could get a continuous stream of observations - say, straight from the sensors - you would have a streaming ingestion problem.
You can use the Linux subsystem cron to make this script run on a schedule. You'll want to know whether it ran successfully - this opens a whole other can of worms about monitoring and alerting. There are various open-source systems that will let you emit metrics from your programs, basically a "hey, this happened" tick, see these metrics plotted on graphs, and ask to be emailed/texted/paged if something is happening too frequently or too infrequently. (These systems are, incidentally, one of the main applications of time-series databases). Don't get bogged down with this upfront, but keep it in mind. Statsd, Grafana, and Prometheus are some names to get you started Googling in this direction. You could also simply have your script send an email on success or failure, but people tend to start ignoring such emails.
You'll have written some functions to interact with your database engine. Extract these in a Python module. This forms the basis of your Data Access Layer. Reuse it in your Flask application. This will be easiest if you keep all this stuff in the same Git repository. You can use your chosen database engine's Python client directly, or you can use an abstraction layer like SQLAlchemy. This decision is controversial and people will have opinions, but just pick one. Whatever database API you pick, please learn what a SQL injection attack is and how to use user-supplied data in queries without opening yourself up to SQL injection. Your database API's documentation should cover the latter.
The / page of your Flask application will be based on a SQL query like SELECT * FROM surf_breaks. Render a link to the break-specific page for each one.
You'll have another page like /breaks/n where n identifies a surf break (an integer that increments as you insert surf break rows is customary). This page will be based on a query like SELECT * FROM observations WHERE surf_break_id = n. In each case, you'll call functions in your Data Access Layer for a list of rows, and then in a template, iterate through those rows and render some HTML. There are various Javascript and Python graphing libraries you can feed this list of rows into and get graphs out of (client side or server side). If you're interested in something like a week-over-week change, you should be able to express that in one SQL query and get that dataset directly from the database engine.
For performance, try not to get in a situation where more than one SQL query happens during a page load. By default, you'll be doing some unnecessary work by going back to the database and recomputing the page every time someone requests it. If this becomes a problem, you can add a reverse proxy cache in front of your Flask app. In your case this is easy, since nothing users do to the app cause its content to change. Simply invalidate the cache when you import new data.
New to Pandas & SQL. Haven't found an answer specific to this config, and not sure if standard SQL wisdom applies when introducing pandas to the mix.
Doing a school project that involves ~300 gb of data in ~6gb .csv chunks.
School advised syncing data via dropbox, but this seemed impractical for a 4-person team.
So, current solution is AWS EC2 & RDS instance (MySQL, I think it'll be, 1 table).
What I wanted to confirm before we start setting it up:
If multiple users are working with (and occasionally modifying) the data, can this arrangement manage conflicts? e.g., if user A uses pandas to construct a dataframe from a query, are the records in that query frozen if user B tries to work with them?
My assumption is that the data in the frame are in memory, and the records in the SQL database are free to be modified by others until the dataframe is written back to the db, but I'm hoping that either I'm wrong or there's a simple solution here (like a random sample query for each user or something).
A pandas DataFrame object does not interact directly with the db. Once you read it in it sits in memory locally. You would have to use a method like DataFrame.to_sql to write your changes back to the MySQL DB. For more information on reading and writing to SQL tables, see the pandas documentation here.