Currently I am using Python to connect to a REST API and extract a huge volume of data into a CSV file. The number of rows is almost 80 million. Now I want to load this data into an Oracle database table. I tried loading it with SQL*Loader and also with the ODI tool, but it took hours.
I want to try PySpark, as it is good for loading large datasets. But as I am new to PySpark, I am not sure whether, as a first approach, it will be performance-efficient to load such a huge CSV into an Oracle database table.
As a second approach, would it be performance-efficient to skip the CSV file and instead keep the data from the REST API in memory and load it into the database table?
Which approach will be better?
Below is how my CSV data looks.
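For reference, a minimal sketch of what the first approach (reading the CSV with PySpark and writing it over JDBC) could look like; the driver jar path, JDBC URL, credentials, and table name below are placeholders, not values from any real setup:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("csv_to_oracle")
         .config("spark.jars", "/path/to/ojdbc8.jar")  # Oracle JDBC driver jar
         .getOrCreate())

# Read the extracted CSV; inferSchema avoids loading every column as a string.
df = spark.read.csv("/path/to/extract.csv", header=True, inferSchema=True)

# Append into the target Oracle table over JDBC in batches.
(df.write.format("jdbc")
   .option("url", "jdbc:oracle:thin:@//dbhost:1521/service_name")
   .option("dbtable", "TARGET_TABLE")
   .option("user", "db_user")
   .option("password", "db_password")
   .option("driver", "oracle.jdbc.OracleDriver")
   .option("batchsize", 10000)
   .mode("append")
   .save())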
Let me show you an example of a control file I use to load a very big file (120 million records each day):
OPTIONS (SKIP=0, ERRORS=500, PARALLEL=TRUE, MULTITHREADING=TRUE, DIRECT=TRUE, SILENT=(ALL))
UNRECOVERABLE
LOAD DATA
CHARACTERSET WE8ISO8859P1
INFILE '/path_to_your_file/name_of_the_file.txt'
BADFILE '/path_to_your_file/name_of_the_file.bad'
DISCARDFILE '/path_to_your_file/name_of_the_file.dsc'
APPEND
INTO TABLE yourtablename
TRAILING NULLCOLS
(
COLUMN1 POSITION(1:4) CHAR
,COLUMN2 POSITION(5:8) CHAR
,COLUMN3 POSITION(9:11) CHAR
,COLUMN4 POSITION(12:18) CHAR
....
....)
Some considerations
Loading by position is generally faster than loading with delimiters.
Use the PARALLEL, MULTITHREADING and DIRECT options to optimize loading performance.
UNRECOVERABLE is also good advice as long as you keep the source file: if you ever need to recover the database, you would have to load that data again.
Use the appropriate character set.
The TRAILING NULLCOLS clause tells SQL*Loader to treat any relatively positioned columns that are not present in the record as null columns.
Position means that each row contains data without any delimiter, so you know where each field starts and ends by its fixed length, for example:
AAAAABBBBBBCCCCC19828733UUUU
If your TXT or CSV file has a field separator, say a semicolon, then you need to use the FIELDS TERMINATED BY clause instead.
This is stored in a control file, normally a text file with the .ctl extension. Then you invoke it from the command line:
sqlldr userid=youruser/pwd@tns_string control=/path_to_control_file/control_file.ctl
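If you want to kick this off from your Python script rather than a shell, a minimal sketch using subprocess (with the same placeholder connect string and paths as above):
import subprocess

# Run SQL*Loader with the control file; check=True raises if sqlldr fails.
subprocess.run(
    ["sqlldr", "userid=youruser/pwd@tns_string",
     "control=/path_to_control_file/control_file.ctl"],
    check=True,
)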
Related
I am currently in the process of getting data from my stakeholder, who has a database from which he is going to extract it as a CSV file.
From there he is going to upload it to a shared drive, and I am going to download the data and use it as a local source to import into a pandas DataFrame.
The approximate size will be 40 million rows. I was wondering whether the data can be exported as a single CSV file from the SQL database and that CSV used as the source for a Python DataFrame, or whether it should be in chunks, as I am not sure what the row limitation of a CSV file is.
I don't think RAM and processing should be an issue at this time.
Your help is much appreciated. Cheers!
If you can't connect directly to the database, you might need the .db file. I'm not sure a csv will even be able to handle more than a million or so rows.
"as I am not sure what the row limitation of csv file is."
There is no such limit inherent in the CSV format, if you understand CSV as the format defined by RFC 4180, which stipulates that a CSV file is
file = [header CRLF] record *(CRLF record) [CRLF]
where [...] denotes an optional part, CRLF denotes a carriage return followed by a line feed (\r\n), and *(...) denotes a part repeated zero or more times.
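So the practical limit is your memory, not the CSV format itself. If RAM ever does become a concern, pandas can read the file in chunks; a minimal sketch (the file name, chunk size, and per-chunk processing are placeholders):
import pandas as pd

# Read the 40M-row CSV one chunk at a time instead of all at once.
for chunk in pd.read_csv("export.csv", chunksize=1_000_000):
    # process or aggregate each chunk here (filter, append to a store, etc.)
    print(len(chunk))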
I need to compare two TSV files and save the needed data to another file.
tsv_file is an IMDb data file which contains id-rating-votes separated by tabs, and tsv_file2 is my file, which contains id-year separated by tabs.
All ids from tsv_file2 are in tsv_file. I need to save to the zapis file data in the format id-year-rating, where the id from tsv_file2 matches the id from tsv_file.
The problem is that the code below works, but it saves only one line into the zapis file.
What can I improve to save all records?
# Assumed setup: read_tsv and read_tsv2 are csv.reader objects over the two
# tab-separated files, and zapis is the output file opened for writing.
for linia in read_tsv2:
    for linia2 in read_tsv:
        if linia[0] == linia2[0]:
            zapis.write(linia[0]+'\t')
            zapis.write(linia[1]+'\t')
            zapis.write(linia2[1])
It would have made life much simpler if you had provided actual examples of these TSV files instead of describing them. Here is a guide on how to ask a good question.
As I understand it, you have
123456 4 15916
123888 1 151687
115945 5 35051
vs
123456 1993
123888 2013
and you want
123456 1993 4
123888 2013 1
There are multiple ways to cut this. I would use the SQLite support of whatever language you choose, load the data into two temp tables, and then make a query to get the data you want. It should be fairly trivial to join the two tables.
Edit: If the SQLite path is taken, there are plenty of good tutorials around. Working with SQLite in Python, How to import a CSV in SQLite (with Python).
I would do the following, in broad terms:
create a :memory: database in SQLite
create tables
(these steps can be omitted, and a database + table can be created with an SQLite editor in advance)
connect to DB
import each TSV file into a table
execute a query on the database
output the result
Python should have libraries for most of these tasks.
The exact same result can be done with other tools.
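A minimal sketch of those steps with Python's built-in sqlite3 and csv modules; the file names and column names are assumptions based on the description above:
import csv
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ratings (id TEXT, rating TEXT, votes TEXT)")
con.execute("CREATE TABLE years (id TEXT, year TEXT)")

# Import each TSV file into its table.
with open("tsv_file.tsv", newline="") as f:
    con.executemany("INSERT INTO ratings VALUES (?, ?, ?)",
                    csv.reader(f, delimiter="\t"))
with open("tsv_file2.tsv", newline="") as f:
    con.executemany("INSERT INTO years VALUES (?, ?)",
                    csv.reader(f, delimiter="\t"))

# Join the tables and write id-year-rating to the output file.
with open("zapis.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerows(con.execute(
        "SELECT y.id, y.year, r.rating FROM years y "
        "JOIN ratings r ON r.id = y.id"))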
How can I read a CSV file from S3 without a few values?
E.g.: list [a, b]
Excluding the values a and b, I need to read all the other values in the CSV. I know how to read the whole CSV from S3 with sqlContext.read.csv(s3_path, header=True), but how do I exclude these two values and read the rest of the file?
You don't. A file is a sequential storage medium. A CSV file is a form of text file: it's character-indexed. Therefore, to exclude columns, you have to first read and process the characters to find the column boundaries.
Even if you could magically find those boundaries, you would have to seek past those locations; this would likely cost you more time than simply reading and ignoring the characters, since you would be interrupting the usual, smooth block-transfer instructions that drive most file buffering.
As the comments tell you, simply read the file as is and discard the unwanted data as part of your data cleansing. If you need the file repeatedly, then cleanse it once, and use that version for your program.
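For example, a rough sketch of the read-then-discard approach in PySpark, continuing from the read call in the question (value_col is a placeholder for whichever column holds those values):
df = sqlContext.read.csv(s3_path, header=True)

# Keep every row whose value_col is neither 'a' nor 'b'.
df_clean = df.filter(~df["value_col"].isin("a", "b"))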
If you want to get just a few rows, you could use S3 Select and Glacier Select – Retrieving Subsets of Objects | AWS News Blog. This is a way to run SQL against an S3 object without downloading it.
Alternatively, you could use Amazon Athena to query a CSV file using SQL.
However, it might simply be easier to download the whole file and do the processing locally in your Python app.
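A rough sketch of the S3 Select route with boto3; the bucket, key, and column name are placeholders, and this assumes the CSV has a header row:
import boto3

s3 = boto3.client("s3")
resp = s3.select_object_content(
    Bucket="my-bucket",
    Key="path/to/file.csv",
    ExpressionType="SQL",
    # Return only rows whose value_col is neither 'a' nor 'b'.
    Expression="SELECT * FROM s3object s "
               "WHERE s.value_col <> 'a' AND s.value_col <> 'b'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")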
I want to append about 700 million rows and 2 columns to a database, using the code below:
import pandas as pd
from sqlalchemy import create_engine

disk_engine = create_engine('sqlite:///screen-user.db')
chunksize = 1000000
j = 0

# Read the TSV in chunks and append each chunk to the 'data' table.
for df in pd.read_csv('C:/Users/xxx/Desktop/jjj.tsv', chunksize=chunksize,
                      header=None, names=['screen', 'user'], sep='\t',
                      encoding='utf-8'):
    df.to_sql('data', disk_engine, if_exists='append')
    j += 1
    print(j * chunksize)  # rows loaded so far
It is taking a really long time (I estimate it would take days). Is there a more efficient way to do this? In R, I have been using the data.table package to load large datasets and it takes only about a minute. Is there a similar package in Python? As a tangential point, I also want to physically store this file on my Desktop. Right now, I am assuming 'data' is being stored as a temporary file. How would I do that?
Also assuming I load the data into a database, I want the queries to execute in a minute or less. Here is some pseudocode of what I want to do using Python + SQL:
# load data (600 million rows * 2 columns) into database
# def count(screen):
#     return count of distinct list of users for a given set of screens
Essentially, I am returning the number of screens for a given set of users. Is the data too big for this task? I also want to merge this table with another table. Is there a reason why the fread function in R is much faster?
If your goal is to import data from your TSV file into SQLite, you should try the native import functionality in SQLite itself. Just open the sqlite3 console program and do something like this:
sqlite> .separator "\t"
sqlite> .import C:/Users/xxx/Desktop/jjj.tsv screen-user
Don't forget to build appropriate indexes before doing any queries.
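For example, a minimal sqlite3 sketch of adding such an index and running the distinct-count query from the pseudocode above; the table and column names follow the to_sql call in the question and are otherwise assumptions:
import sqlite3

con = sqlite3.connect("screen-user.db")

# Index the filter column before querying.
con.execute("CREATE INDEX IF NOT EXISTS idx_screen ON data (screen)")

def count_users(screens):
    # Number of distinct users seen on the given set of screens.
    placeholders = ",".join("?" * len(screens))
    sql = ("SELECT COUNT(DISTINCT user) FROM data "
           "WHERE screen IN (" + placeholders + ")")
    return con.execute(sql, list(screens)).fetchone()[0]

print(count_users(["screen_a", "screen_b"]))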
As @John Zwinck has already said, you should probably use the RDBMS's native tools for loading such an amount of data.
First of all, I think SQLite is not a proper tool/DB for 700 million rows, especially if you want to join/merge this data afterwards.
Depending on what kind of processing you want to do with your data after loading, I would either use free MySQL or, if you can afford a cluster, Apache Spark SQL, and parallelize the processing of your data across multiple cluster nodes.
For loading your data into a MySQL DB you can and should use the native LOAD DATA tool.
Here is a great article showing how to optimize the data load process for MySQL (for different MySQL versions, options, and storage engines: MyISAM and InnoDB, etc.).
Conclusion: use the DB's native tools for loading large amounts of CSV/TSV data efficiently instead of pandas, especially if your data doesn't fit into memory and you want to process (join/merge/filter/etc.) your data after loading.
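As a rough sketch of driving LOAD DATA from Python with mysql-connector-python (the connection details, table, and column names are placeholders, and LOCAL INFILE must be enabled on both client and server):
import mysql.connector

cnx = mysql.connector.connect(user="me", password="secret",
                              host="localhost", database="mydb",
                              allow_local_infile=True)
cur = cnx.cursor()
# Bulk-load the TSV straight into the target table.
cur.execute(
    "LOAD DATA LOCAL INFILE 'C:/Users/xxx/Desktop/jjj.tsv' "
    "INTO TABLE data "
    "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n' "
    "(screen, `user`)"
)
cnx.commit()
cur.close()
cnx.close()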
I am generating load test data in a Python script for Cassandra.
Is it better to insert directly into Cassandra from the script, or to write a CSV file and then load that via Cassandra?
This is for a couple million rows.
For a few million, I'd say just use CSV (assuming the rows aren't huge) and see if it works. If not, inserts it is. :)
For more heavy-duty stuff, you might want to create SSTables and use sstableloader.
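A minimal sketch of the CSV route; the keyspace, table, column names, and row shape are assumptions. Generate the rows with Python's csv module, then load them with cqlsh's COPY command:
import csv
import random
import uuid

# Write a couple million synthetic rows for the load test.
with open("loadtest.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for _ in range(2_000_000):
        writer.writerow([uuid.uuid4(), random.randint(0, 1000)])

# Then, from cqlsh:
#   COPY my_keyspace.my_table (id, value) FROM 'loadtest.csv';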