Best approach to extract 1000 Gbs of Data from Teradata - python

What should be the best approach to download data from Teradata using python ? I have more than 3000 GBs of data to be migrated from teradata to another code service. Any approach or advise would be great.
So far I have explored -
Using Pandas it is slow and inefficient. I have also used {fn teradata_try_fastexport} but it seems to be no difference.
Using tdload I am not able to partition the file and no support for downloading file with byte,clob,blob data type.
Using dask, not able to connect with teradata and get the data using dask dataframe. I have read the sql using pandas and then use dask to write the data. However, that seems to be of little help in comparison to huge amount of data.
Pyspark - I am exploring but it seems difficult with too much file configuration and dependency issue.
Any advise or suggested approach that I should follow for the efficient and effective data extraction.

Related

Extract webscraped python data to SQLite, excel or xml?

I'm kinda new to Python and webscraping but I'm currently at a point where I need to extract data to a database. Can someone tell me the pros and cons by using sqlite, excel or xml?
I've read that sqlite should be the fastest, so I may go for that database structure, but can someone then tell me what IDE you use to handle sqlite data after I've extracted it from python?
Edit: I hope my post makes sense. I'm currently trying to use a web scraper from here: https://github.com/gingeleski/odds-portal-scraper
Thanks in advance.
For the short term, Excel is a good way to examine your data and prototype analysis and visualizations. It gets old using it for very large datasets, or multiple similar datasets. Basically as soon as you start doing the same thing more than twice or writing VB code you should switch to the pandas/matplotlib solution.
It looks like the scraper you are using already puts the results in an SQLITE database, but if you have your data in a list or dictionary, I'd suggest using pandas to do calculations and matplotlib for visualizations, as that will give you a robust, extensible solution over the long term. It is very easy to read and write data between an SQLITE database and pandas.
A good way of viewing the data in the DB is a must. I'm currently using SQLiteStudio.
When you say IDE, I'm assuming you're looking for a way to view the SQLite data? If so, DBeaver is a free, open source SQL client. You could use this to view the data quite easily.

writing bulk data to big Query

I would like to write the bulk data to BQ using software API.
My restrictions are:
I am going to use the max size of BQ, columns 10,000 and ~35000 rows (this can be bigger)
Schema autodetect is required
If possible, I would like to use some kind of parallelism to write many tables at the same time asynchronously (for that Apache-beam & dataflow might be the solution)
When using Pandas library for BQ, there is a limit on the size of the dataframe that can be written. this requires partitioning of the data
What would be the best way to do so?
Many thanks for any advice / comment,
eilalan
Apache beam would be the right component as it supports huge volume data processing in batch and streaming mode.
I don't think Beam as "Schema auto-detect". But, you can use BigQuery API to fetch the schema if the table already exists.

Python pandas large database using excel

I am comfortable using python / excel / pandas for my dataFrames . I do not know sql or database languages .
I am about to start on a new project that will include around 4,000 different excel files I have. I will call to have the file opened saved as a dataframe for all 4000 files and then do my math on them. This will include many computations such a as sum , linear regression , and other normal stats.
My question is I know how to do this with 5-10 files no problem. Am I going to run into a problem with memory or the programming taking hours to run? The files are around 300-600kB . I don't use any functions in excel only holding data. Would I be better off have 4,000 separate files or 4,000 tabs. Or is this something a computer can handle without a problem? Thanks for looking into have not worked with a lot of data before and would like to know if I am really screwing up before I begin.
You definitely want to use a database. At nearly 2GB of raw data, you won't be able to do too much to it without choking your computer, even reading it in would take a while.
If you feel comfortable with python and pandas, I guarantee you can learn SQL very quickly. The basic syntax can be learned in an hour and you won't regret learning it for future jobs, its a very useful skill.
I'd recommend you install PostgreSQL locally and then use SQLAlchemy to connect to create a database connection (or engine) to it. Then you'll be happy to hear that Pandas actually has df.to_sql and pd.read_sql making it really easy to push and pull data to and from it as you need it. Also SQL can do any basic math you want like summing, counting etc.
Connecting and writing to a SQL database is as easy as:
from sqlalchemy import create_engine
my_db = create_engine('postgresql+psycopg2://username:password#localhost:5432/database_name')
df.to_sql('table_name', my_db, if_exists='append')
I add the last if_exists='append' because you'll want to add all 4000 to one table most likely.

pandas.read_csv vs other csv libraries for loading CSV into a Postgres Database

I am a relatively new user of Python. What is the best way of parsing and processing a CSV and loading it into a local Postgres Database (in Python)?
It was recommended to me to use the CSV library to parse and process the CSV. In particular, the task at hand says:
The data might have errors (some rows may be not be parseable), the
data might be duplicated, the data might be really large.
Is there a reason why I wouldn't be able to just use pandas.read_csv here? Does using the CSV library make parsing and loading it into a local Postgres database easier? In particular, if I just use pandas will I run into problems if rows are unparseable, if the data is really big, or if data is duplicated? (For the last bit, I know that pandas offers some relatively clean solutions for de-dupping).
I feel like pandas.read_csv and pandas.to_sql can do a lot of work for me here, but I'm not sure if using the CSV library offers other advantages.
Just in terms of speed, this post: https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file seems to suggest that pandas.read_csv performs the best?
A quick googling didn't reveal any serious drawbacks in pandas.read_csv regarding its functionality (parsing correctness, supported types etc.). Moreover, since you appear to be using pandas to load the data into the DB, too, reading directly into a DataFrame is a huge boost in both performance and memory (no redundant copies).
There are only memory issues for very large datasets - but these are not library's fault. How to read a 6 GB csv file with pandas has instructions on how to process a large .csv in chunks with pandas.
Regarding "The data might have errors", read_csv has a few facilities like converters, error_bad_lines and skip_blank_lines (specific course of action depends on if and how much corruption you're supposed to be able to recover).
I had a school project just last week that required me to load data from a csv and insert it into a postgres database. So believe me when I tell you this: it's way harder than it has to be unless you use pandas. The issue is sniffing out the data types. Okay, so if your database is all a string datatype, forget what I said, you're golden. But if you have a csv with an assortment of datatypes, either you get to sniff them yourself or you can use pandas which does it efficiently and automatically. Plus pandas has a nifty write to sql method which can be easily adapted to work with postgres via a sql alchemy connection, too.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html

Big Data with Blaza and Pandas

I want to know if this approach would be an overkill for a project.
I have a 4gb file that obviously my computer cant handle. Would using Blaze to split the file into more manageable file sizes and open with pandas and visualize with Bokeh be an overkill?
I know Pandas has a "chunk" function but the reason i want to split them is because there are specific rows related to specific names that i need to analyze.
is there a different approach you would take that wont crash my laptop and doesnt require setting up Hadoop or any AWS service?
Pandas chunking with pd.read_csv(..., chunksize=...) works well.
Alternatively dask.dataframe mimics the Pandas interface and handles the chunking for you.

Categories