How to fetch data as a .zip using cx_Oracle? - python

I would like to fetch data but receive a .zip containing all the data instead of a list of tuples. That is, the client makes the specified query, the database server compresses the result data into a .zip, and then sends this .zip to the client.
By doing this I expect to greatly reduce the time spent sending data, because there are lots of repeated fields.
I know Advanced Compression exists in Oracle; however, I am not able to achieve this using cx_Oracle.
Any help or workaround is appreciated.

Advanced Network Compression can be enabled as described here, using sqlnet.ora and/or tnsnames.ora:
https://cx-oracle.readthedocs.io/en/latest/user_guide/initialization.html#optnetfiles
https://www.oracle.com/technetwork/database/enterprise-edition/advancednetworkcompression-2141325.pdf
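For example, a minimal client-side sqlnet.ora enabling compression might look like the sketch below (the threshold value is just an illustration; tune it for your environment). Note that this compresses the network traffic transparently rather than handing you a literal .zip file, but it addresses the repeated-fields bandwidth concern.

    # Sketch: enable Advanced Network Compression for Oracle Net traffic
    SQLNET.COMPRESSION = on
    # Preferred compression level(s)
    SQLNET.COMPRESSION_LEVELS = (high)
    # Only compress payloads larger than this many bytes (example value)
    SQLNET.COMPRESSION_THRESHOLD = 1024

You can point cx_Oracle at the directory containing this file with cx_Oracle.init_oracle_client(config_dir=...) or the TNS_ADMIN environment variable, as described in the first link.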

Related

csv->Oracle DB autoscheduled import

I have a basic CSV report that is produced by another team on a daily basis; each report has 50k rows, and the reports are saved to a shared drive every day. I also have an Oracle DB.
I need to create an automatically scheduled (or at least less manual) process to import those CSV reports into the Oracle DB. What solution would you recommend?
I did not find such a solution in SQL Developer, since this is an upload from a file and not a query. I was thinking about a Python cron script that would run automatically on a daily basis, transform the CSV report into a .txt file with the needed SQL syntax (insert into ...), and then have Python connect to the Oracle DB and run the .txt file as SQL commands to insert the data.
But this looks complicated.
Maybe you know another solution that you would recommend?
Create an external table to allow you to access the content of the CSV as if it were a regular table. This assumes the file name does not change day-to-day.
Create a scheduled job to import the data in that external table and do whatever you want with it.
One common blocking issue that prevents using external tables is that they require the data to be on the machine hosting the database, and not everyone has access to those servers. Sometimes the transfer of the data to that machine plus the load into the DB is also slower than doing a direct path load from the remote machine.
SQL*Loader with direct path load may be an option: https://docs.oracle.com/en/database/oracle/oracle-database/19/sutil/oracle-sql-loader.html#GUID-8D037494-07FA-4226-B507-E1B2ED10C144 This will be faster than Python.
If you do want to use Python, then read the cx_Oracle manual Batch Statement Execution and Bulk Loading. There is an example of reading from a CSV file.
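As a rough sketch of that cx_Oracle approach (the connection string, table and column names below are placeholders, not taken from the cx_Oracle example), it could look like this:

    # Sketch: bulk-load a daily CSV report into Oracle with executemany()
    import csv
    import cx_Oracle

    connection = cx_Oracle.connect("user/password@dbhost/orclpdb1")  # placeholder DSN
    cursor = connection.cursor()

    sql = "insert into daily_report (col1, col2, col3) values (:1, :2, :3)"
    batch = []
    with open("daily_report.csv", newline="") as f:
        reader = csv.reader(f)
        next(reader)                      # skip the header row
        for row in reader:
            batch.append((row[0], row[1], row[2]))
            if len(batch) == 10000:       # insert in chunks to limit memory use
                cursor.executemany(sql, batch)
                batch = []
    if batch:
        cursor.executemany(sql, batch)
    connection.commit()

A script like this can then be run from cron (or a scheduler of your choice) to make the import hands-off.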

Pull data from Tableau Server into Pandas Dataframe

My goal is to join three datasources that are only available to me through Tableau Server (no direct database access). The data is too large to efficiently use Tableau's Data Blending.
One way forward is to pull the data from the three Tableau Server Datasources into a Pandas dataframe, do the necessary manipulations, and save down an Excel File to use as a datasource for a visualization in Tableau.
I have found lots of information on the TabPy module that allows one to convert a Pandas dataframe to a Tableau Data Extract but have not found much re: how to pull data from Tableau server in an automated fashion.
I have also read about tabcmd as a way of automating tasks, but do not have the necessary admin permissions.
Let me know if you need further information.
Tabcmd does not require admin privileges. Anyone with permissions to Server can use it, but it will respect the privileges you do have. You can install tabcmd on computers other than your server without needing extra license keys.
That being said, it's very simple to automate data downloading. Take the URL to your workbook view and add ".csv" to the end of it. The .csv goes at the end of the view URL itself, not after any query parameters you have.
For example: http://[Tableau Server Location]/views/[Workbook Name]/[View Name].csv
Using URL parameters, you can customize the data filters and how it looks. Just make sure you put the .csv before the ? for any query parameters.
More info for this plus a few others hacks at http://www.vizwiz.com/2014/03/the-greatest-tableau-tip-ever-exporting.html.
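A rough Python sketch of that trick (the URL, filter name and credentials are placeholders, and how you authenticate depends on how your Tableau Server is configured):

    # Sketch: pull a Tableau Server view as CSV straight into a pandas DataFrame
    import io

    import pandas as pd
    import requests

    url = "http://tableau.example.com/views/MyWorkbook/MyView.csv"  # placeholder
    params = {"Region": "EMEA"}   # optional view filters; they go after the .csv

    # Authentication is assumed to be handled for you (e.g. a session cookie or
    # trusted ticket); basic auth here is only a stand-in.
    resp = requests.get(url, params=params, auth=("user", "password"))
    resp.raise_for_status()

    df = pd.read_csv(io.StringIO(resp.text))
    print(df.head())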
You can use pantab to both read and write Hyper extracts: https://pantab.readthedocs.io/en/latest/

Get the data from a REST API and store it in HDFS/HBase

I'm new to Big Data. I learned that HDFS is more for storing structured data and HBase for storing unstructured data. I have a REST API from which I need to get the data and load it into the data warehouse (HDFS/HBase). The data is in JSON format. So which one would be better to load the data into, HDFS or HBase? Also, can you please direct me to some tutorial for doing this? I came across a tutorial on streaming data, but I'm not sure if it fits my use case.
It would be of great help if you can guide me to a particular resource/ technology to solve this issue.
There are several questions you have to think about:
Do you want to work with batch files or streaming? It depends on the rate at which your REST API will be requested.
For storage there is not just HDFS and HBase; you have a lot of other solutions such as Cassandra, MongoDB, or Neo4j. It all depends on the way you want to use the data (random access vs. full scan, updates with versioning vs. writing new lines, concurrent access). For example, HBase is good for random access, Neo4j for graph storage, and so on. If you are receiving JSON files, MongoDB can be a good choice as it stores objects as documents.
What is the size of your data?
Here is a good article on the questions to think about when you start a big data project.
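If you end up going the HDFS route from Python, a very rough sketch (the API endpoint, WebHDFS address, user and target path are all placeholders, and it assumes the requests and hdfs packages with a reachable WebHDFS endpoint):

    # Sketch: fetch JSON from a REST API and write it to HDFS via WebHDFS
    import json

    import requests
    from hdfs import InsecureClient

    api_response = requests.get("https://api.example.com/records")  # placeholder API
    api_response.raise_for_status()
    records = api_response.json()

    client = InsecureClient("http://namenode.example.com:9870", user="hadoop")
    with client.write("/data/raw/records.json", encoding="utf-8",
                      overwrite=True) as writer:
        for record in records:
            writer.write(json.dumps(record) + "\n")   # one JSON object per line

From there the file can be queried with Hive or Spark, or loaded into HBase if you need random access.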

What is the best strategy for a web application to create temporary .csv files?

My application would create .csv files temporarily to store some data rows. What is the best strategy for managing this kind of temporary file creation and deleting the files after the user logs out of the app?
I think creating temporary .csv files on the server isn't a good idea.
Is there any simple way to manage temporary file creation on the client machine (browser)?
These .csv files contain table records, which would later be used as the source for d3.js visualization charts/elements.
Please share your experience with real-world applications for this scenario.
I'm using the Django framework (Python) for this.
Why create on-disk files at all? For smaller files, use an in-memory file object like StringIO.
If your CSV file sizes can potentially get large, use a tempfile.SpooledTemporaryFile() object; these dynamically swap out data to an on-disk file if you write enough data to them. Once closed, the file is cleared from disk automatically.
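A small sketch of the in-memory approach in a Django view (the model, fields and filename are placeholders):

    # Sketch: build a CSV in memory and return it from a Django view,
    # so nothing is ever written to disk
    import csv
    import io

    from django.http import HttpResponse

    from myapp.models import MyModel   # placeholder model


    def export_rows(request):
        buffer = io.StringIO()                      # in-memory file object
        writer = csv.writer(buffer)
        writer.writerow(["id", "name", "value"])
        for row in MyModel.objects.values_list("id", "name", "value"):
            writer.writerow(row)

        response = HttpResponse(buffer.getvalue(), content_type="text/csv")
        response["Content-Disposition"] = 'attachment; filename="data.csv"'
        return response

Since the buffer only lives for the duration of the request, there is nothing to clean up when the user logs out.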
If your preference is to store the data client side, "HTML local storage" could be an option for you. This lets you store string data in key value pairs in the user's browser subject to a same origin policy (so data for one origin (domain) is visible only to that origin). There is also a 5MB limit on data size which must be strings - OK for CSV.
Provided that your visualisation pages/code is served from the same domain, the data in local storage will be accessible to it.

Loading a Lot of Data into Google Bigquery from Python

I've been struggling to load big chunks of data into BigQuery for a little while now. In Google's docs, I see the insertAll method, which seems to work fine but gives me 413 "Entity too large" errors when I try to send anything over about 100k of data in JSON. Per Google's docs, I should be able to send up to 1TB of uncompressed data in JSON. What gives? The example on the previous page has me building the request body manually instead of using insertAll, which is uglier and more error-prone. I'm also not sure what format the data should be in for that case.
So, all of that said, what is the clean/proper way of loading lots of data into Bigquery? An example with data would be great. If at all possible, I'd really rather not build the request body myself.
Note that for streaming data to BQ, anything above 10k rows/sec requires talking to a sales rep.
If you'd like to send large chunks directly to BQ, you can send it via POST. If you're using a client library, it should handle making the upload resumable for you. To do this, you'll need to make a call to jobs.insert() instead of tabledata.insertAll(), and provide a description of a load job. To actually push the bytes using the Python client, you can create a MediaFileUpload or MediaInMemoryUpload and pass it as the media_body parameter.
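A rough sketch of the jobs.insert() route with the discovery-based Python client (the project, dataset, table, schema and file name are placeholders, and it assumes default application credentials are already configured):

    # Sketch: load newline-delimited JSON into BigQuery via a load job
    from googleapiclient.discovery import build
    from googleapiclient.http import MediaFileUpload

    bigquery = build("bigquery", "v2")   # assumes default credentials

    job_body = {
        "configuration": {
            "load": {
                "sourceFormat": "NEWLINE_DELIMITED_JSON",
                "destinationTable": {
                    "projectId": "my-project",
                    "datasetId": "my_dataset",
                    "tableId": "my_table",
                },
                "schema": {
                    "fields": [
                        {"name": "id", "type": "INTEGER"},
                        {"name": "payload", "type": "STRING"},
                    ]
                },
            }
        }
    }

    media = MediaFileUpload("rows.json", mimetype="application/octet-stream",
                            resumable=True)
    job = bigquery.jobs().insert(projectId="my-project", body=job_body,
                                 media_body=media).execute()
    print(job["jobReference"])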
The other option is to stage the data in Google Cloud Storage and load it from there.
The example here uses a resumable upload to upload a CSV file. While the file used is small, it should work for virtually any size upload since it uses a robust media upload protocol. It sounds like you want JSON, which means you'd need to tweak the code slightly (there is a JSON example in load_json.py in the same directory). If you have a stream you want to upload instead of a file, you can use a MediaInMemoryUpload instead of the MediaFileUpload that is used in the example.
BTW ... Craig's answer is correct, I just thought I'd chime in with links to sample code.
