Pull data from Tableau Server into Pandas Dataframe - python

My goal is to join three datasources that are only available to me through Tableau Server (no direct database access). The data is too large to efficiently use Tableau's Data Blending.
One way forward is to pull the data from the three Tableau Server data sources into a Pandas DataFrame, do the necessary manipulations, and save an Excel file to use as a data source for a visualization in Tableau.
I have found lots of information on the TabPy module, which allows one to convert a Pandas DataFrame to a Tableau Data Extract, but have not found much on how to pull data from Tableau Server in an automated fashion.
I have also read about tabcmd as a way of automating tasks, but do not have the necessary admin permissions.
Let me know if you need further information.

Tabcmd does not require admin privileges. Anyone with permissions to Server can use it, but it will respect the privileges you do have. You can install tabcmd on computers other than your server without needing extra license keys.
That being said, it's very simple to automate the data download. Take the URL to your view and add ".csv" to the end of it.
For example: http://[Tableau Server Location]/views/[Workbook Name]/[View Name].csv
Using URL parameters, you can filter the data and customize how it looks. Just make sure you put the .csv before the ? of any query parameters.
More info on this plus a few other hacks at http://www.vizwiz.com/2014/03/the-greatest-tableau-tip-ever-exporting.html.
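Once the view is exposed as a .csv URL, pulling it into pandas from Python is straightforward. Here is a minimal sketch; the URL is a placeholder for your own view, and how you authenticate (basic credentials, a session cookie, or a trusted ticket) depends on your server setup:

    import io
    import requests
    import pandas as pd

    # Placeholder URL -- substitute your server, workbook and view names
    url = "http://tableau.example.com/views/MyWorkbook/MyView.csv"

    # Any view filters go after the .csv as query parameters, e.g. ?Region=East
    response = requests.get(url, params={"Region": "East"}, auth=("user", "password"))
    response.raise_for_status()

    df = pd.read_csv(io.StringIO(response.text))
    print(df.head())

Repeating this for each of the three data sources, you can join them with pandas.merge and write the result out with DataFrame.to_excel.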

You can use pantab to both read from and write to Hyper extracts: https://pantab.readthedocs.io/en/latest/
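A small sketch of the pantab round trip, with made-up file and table names:

    import pandas as pd
    import pantab

    df = pd.DataFrame({"region": ["East", "West"], "sales": [100, 200]})

    # Write the DataFrame to a Hyper extract
    pantab.frame_to_hyper(df, "sales.hyper", table="sales")

    # Read it back into pandas
    df2 = pantab.frame_from_hyper("sales.hyper", table="sales")
    print(df2.head())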

Related

csv->Oracle DB autoscheduled import

I have a basic CSV report that is produced by another team on a daily basis; each report has 50k rows, and the reports are saved to a shared drive every day. I also have an Oracle DB.
I need to create an auto-scheduled (or at least less manual) process to import those CSV reports into the Oracle DB. What solution would you recommend for it?
I did not find such a solution in SQL Developer, since that is an upload from a file and not a query. I was thinking about a Python cron script that would run automatically on a daily basis, transform the CSV report into a txt file with the needed SQL syntax (insert into ...), and then connect to the Oracle DB and run that txt file as a SQL command to insert the data.
But this looks complicated.
Maybe you know another solution that you would recommend?
Create an external table to allow you to access the content of the CSV as if it were a regular table. This assumes the file name does not change day-to-day.
Create a scheduled job to import the data in that external table and do whatever you want with it.
One common blocking issue with external tables is that they require the data file to be on the machine hosting the database. Not everyone has access to those servers. Or sometimes transferring the data to that machine plus loading it into the DB is slower than doing a direct path load from the remote machine.
SQL*Loader with direct path load may be an option: https://docs.oracle.com/en/database/oracle/oracle-database/19/sutil/oracle-sql-loader.html#GUID-8D037494-07FA-4226-B507-E1B2ED10C144 This will be faster than Python.
If you do want to use Python, then read the cx_Oracle manual Batch Statement Execution and Bulk Loading. There is an example of reading from a CSV file.
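As a rough sketch of the Python route, you can read the CSV and batch the rows through cx_Oracle's executemany(); the connection details, table and column names below are placeholders:

    import csv
    import cx_Oracle

    connection = cx_Oracle.connect("user", "password", "dbhost/service_name")
    cursor = connection.cursor()

    sql = "INSERT INTO daily_report (col1, col2, col3) VALUES (:1, :2, :3)"
    batch = []
    with open("daily_report.csv", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            batch.append((row[0], row[1], row[2]))
            if len(batch) == 10000:  # insert in chunks to keep memory use bounded
                cursor.executemany(sql, batch)
                batch = []
    if batch:
        cursor.executemany(sql, batch)

    connection.commit()
    cursor.close()
    connection.close()

Scheduling it can then be as simple as a daily cron entry (or a Windows scheduled task) that runs the script.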

How to fetch data as a .zip using Cx_Oracle?

I would like to fetch the data but receive a .zip with all the data instead of a list of tuples. That is, the client makes the specified query, the database server compresses the result data as a .zip, and then sends this .zip to the client.
By doing this I expect to greatly reduce the time spent sending data, because there are lots of repeated fields.
I know Advanced Data Compression exists in Oracle; however, I am not able to achieve this using cx_Oracle.
Any help or workaround is appreciated.
Advanced Network Compression can be enabled as described here, using sqlnet.ora and/or tnsnames.ora:
https://cx-oracle.readthedocs.io/en/latest/user_guide/initialization.html#optnetfiles
https://www.oracle.com/technetwork/database/enterprise-edition/advancednetworkcompression-2141325.pdf
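As a sketch, the relevant sqlnet.ora parameters are SQLNET.COMPRESSION and SQLNET.COMPRESSION_LEVELS; from Python you point cx_Oracle at the directory containing that file (paths and connection details below are placeholders). Note that the data is compressed on the wire and transparently decompressed by the client library, so you still get tuples back rather than a literal .zip file:

    # Example sqlnet.ora entries (see the Oracle documents linked above):
    #   SQLNET.COMPRESSION = on
    #   SQLNET.COMPRESSION_LEVELS = (high)
    import cx_Oracle

    cx_Oracle.init_oracle_client(config_dir="/opt/oracle/network/admin")

    connection = cx_Oracle.connect("user", "password", "dbhost/service_name")
    cursor = connection.cursor()
    cursor.execute("SELECT * FROM some_large_table")
    rows = cursor.fetchall()  # compressed in transit, decompressed by the client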

Django data storage - SQL or something else?

I am building a Django web app which will essentially serve static data to the users. By static, I mean that admins will be able to upload new datasets but no data entries will be made by users. Effectively, once the data is uploaded, it will be read-only on request by a user.
Given that these are quite large datasets (200k+ rows), I figured that SQL would be the best way to store the data - this avoids reading large datasets into memory (as you'd have to with a pickle or json?). This has the added bonus of using Django models to access the data.
However, I am not sure of the best way to do this, or if there is a better alternative to SQL. I currently have an admin page that allows you to upload .xlsx files which are then parsed and added as model entries row-by-row. It takes FOREVER (30+ minutes for 100K rows). Perhaps I should be creating a whole new db outside of Django and then importing that somehow, but I can't find much documentation on how this could/should be done. Any ideas would be greatly appreciated! Thanks in advance for any wisdom.
You can try to use the .csv file format instead of .xlsx. Python has libraries that make it easy to write .csv (comma separated values) data to a SQL database. This answer could be of further assistance. I hope you find what you're looking for, and happy coding!
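If you stay with the Django upload path, the usual culprit for the slowness is issuing one INSERT per row; batching the rows through bulk_create is one way to cut the time down. A hypothetical sketch, with the model and field names standing in for your own dataset:

    import csv
    from myapp.models import DataRow  # hypothetical model

    def import_csv(path, batch_size=5000):
        rows = []
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            for record in reader:
                rows.append(DataRow(name=record["name"], value=record["value"]))
        # One INSERT per batch instead of one per row
        DataRow.objects.bulk_create(rows, batch_size=batch_size)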

Get the data from a REST API and store it in HDFS/HBase

I'm new to big data. I learned that HDFS is for storing more structured data and HBase is for storing unstructured data. I have a REST API from which I need to get the data and load it into the data warehouse (HDFS/HBase). The data is in JSON format. So which one would be better to load the data into: HDFS or HBase? Also, can you please direct me to some tutorial on how to do this? I came across this Tutorial with Streaming Data, but I'm not sure if it will fit my use case.
It would be of great help if you could guide me to a particular resource/technology to solve this issue.
There are several questions you have to think about:
Do you want to work with batch files or streaming? It depends on the rate at which your REST API will be requested.
For storage there is not just HDFS and HBase; you have a lot of other solutions such as Cassandra, MongoDB, and Neo4j. It all depends on the way you want to use it (random access vs. full scan, updates with versioning vs. writing new lines, concurrent access). For example, HBase is good for random access, Neo4j for graph storage, etc. If you are receiving JSON files, MongoDB can be a good choice as it stores objects as documents.
What is the size of your data?
Here is a good article on questions to think about when you start a big data project: documentation
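If you end up landing the raw JSON in HDFS, here is a rough sketch using requests and the third-party hdfs (WebHDFS) package; the endpoint, namenode address and target path are all placeholders:

    import json
    from datetime import datetime

    import requests
    from hdfs import InsecureClient  # pip install hdfs

    client = InsecureClient("http://namenode:9870", user="hadoop")

    response = requests.get("https://api.example.com/data")  # hypothetical endpoint
    response.raise_for_status()
    records = response.json()

    # One file per run; Hive/Spark can read the JSON from this directory later
    target = "/data/raw/api/{}.json".format(datetime.utcnow().strftime("%Y%m%d%H%M%S"))
    with client.write(target, encoding="utf-8") as writer:
        json.dump(records, writer)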

Insert users into Active Directory

I am trying to determine how best to insert users into Active Directory from a SQL Server table.
I figured I could use the LDAP server to do an insert, but the research I've done would suggest otherwise, and that I could only pull data from Active Directory into SQL Server.
Then I thought I could use a Python program to query the table and spit out a CSV file to then do a bulk insert, but I am not sure if this would modify existing users if data changes.
Any insight would be appreciated.
Here's a general idea of the algorithm:
Load user data from SQL Server
Convert it into an LDIF (LDAP Data Interchange Format) file
Import the LDIF file into Active Directory using the LDIFDE command-line tool
Python, or any other programming language, can help you with step 2. Notice that the details of the conversion are very specific to how your data is represented. You'll have to carefully map each database field to an LDAP attribute and determine the classes to be used in the LDAP objects.
Will the above modify existing users? Yes, of course. You could write the LDIF in such a way that it updates the existing data, or, if that's a problem, you could first verify whether a user already exists in Active Directory and skip those entries in the LDIF file.
Alternatively
You could use CSVDE for importing data in CSV format, but either way you'll have to design a mapping strategy for each of the fields that you want to import into Active Directory.
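As a sketch of step 2, here is one way to turn rows from SQL Server into an LDIF file that LDIFDE can import. The DN layout, object class and attribute names are examples only and must be mapped to your own Active Directory schema; the DSN is a placeholder:

    import pyodbc  # assumes a SQL Server ODBC driver is configured

    conn = pyodbc.connect("DSN=HRDatabase;UID=user;PWD=password")  # placeholder DSN
    cursor = conn.cursor()
    cursor.execute("SELECT username, first_name, last_name, email FROM users")

    with open("new_users.ldif", "w", encoding="utf-8") as ldif:
        for username, first, last, email in cursor.fetchall():
            ldif.write(f"dn: CN={first} {last},OU=Staff,DC=example,DC=com\n")
            ldif.write("changetype: add\n")
            ldif.write("objectClass: user\n")
            ldif.write(f"sAMAccountName: {username}\n")
            ldif.write(f"givenName: {first}\n")
            ldif.write(f"sn: {last}\n")
            ldif.write(f"mail: {email}\n")
            ldif.write("\n")  # blank line separates LDIF entries

The file can then be imported with ldifde -i -f new_users.ldif on a domain-joined machine.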
