I am currently in the process of getting data from my stakeholder, who has a database from which he is going to extract a CSV file.
From there he is going to upload it to a shared drive, and I will download the data and use it as a local source to import into a pandas DataFrame.
The approximate size will be 40 million rows. I was wondering whether the data can be exported from the SQL database as a single CSV file and used as the source for a Python DataFrame, or whether it should be split into chunks, as I am not sure what the row limit of a CSV file is.
I don't think RAM or processing power should be an issue at this time.
Your help is much appreciated. Cheers!
If you can't connect directly to the database, you might need the .db file. I'm not sure a csv will even be able to handle more than a million or so rows.
as I am not sure what the row limit of a CSV file is.
There is no such limit inherent in the CSV format, if you take CSV to mean the format defined by RFC 4180, which stipulates that a CSV file is
file = [header CRLF] record *(CRLF record) [CRLF]
where [...] denotes an optional part, CRLF denotes a carriage return followed by a line feed (\r\n), and *(...) denotes a part repeated zero or more times.
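In other words, CSV itself imposes no row limit; the practical constraint is memory when you load it. If the 40 million rows turn out to be too much for one read, pandas can stream the file in chunks. A minimal sketch, assuming a hypothetical local file data.csv:

import pandas as pd

# Read the export in pieces instead of loading all 40 million rows at once.
# chunksize is the number of rows per chunk; tune it to the available RAM.
total_rows = 0
for chunk in pd.read_csv("data.csv", chunksize=1000000):
    total_rows += len(chunk)
    # process or aggregate each chunk here before moving on

print(total_rows)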
Related
Currently I am using Python to connect to a REST API and extract a huge volume of data into a CSV file. The number of rows is almost 80 million. Now I want to load this data into an Oracle database table. I tried loading it with SQL*Loader and also with the ODI tool, but it was taking hours.
I want to try PySpark, as it is good for loading large datasets. But since I am new to PySpark, I am not sure whether, as a first approach, it will be efficient to load such a huge CSV into an Oracle database table.
As a second approach, would it be more efficient to skip the CSV file entirely, keep the data from the REST API in memory, and load it straight into the database table?
Which approach would be better?
Below is what my CSV data looks like
Let me show you an example of a control file I use to load a very big file (120 million records each day):
OPTIONS (SKIP=0, ERRORS=500, PARALLEL=TRUE, MULTITHREADING=TRUE, DIRECT=TRUE, SILENT=(ALL))
UNRECOVERABLE
LOAD DATA
CHARACTERSET WE8ISO8859P1
INFILE '/path_to_your_file/name_of_the_file.txt'
BADFILE '/path_to_your_file/name_of_the_file.bad'
DISCARDFILE '/path_to_your_file/name_of_the_file.dsc'
APPEND
INTO TABLE yourtablename
TRAILING NULLCOLS
(
COLUMN1 POSITION(1:4) CHAR
,COLUMN2 POSITION(5:8) CHAR
,COLUMN3 POSITION(9:11) CHAR
,COLUMN4 POSITION(12:18) CHAR
....
....)
Some considerations:
Loading by position is always faster than using delimiters.
Use the PARALLEL, MULTITHREADING and DIRECT options to optimize loading performance.
UNRECOVERABLE is also good advice if you always keep the source file: in case you ever need to recover the database, you would just load the data again.
Use the appropriate character set.
The TRAILING NULLCOLS clause tells SQL*Loader to treat any relatively positioned columns that are not present in the record as null columns.
Loading by position means that each row contains data without any delimiter, so each field is identified by its fixed position and length in the record.
AAAAABBBBBBCCCCC19828733UUUU
If your txt or CSV file has a field separator, say a semicolon, then you need to use the FIELDS TERMINATED BY clause instead.
This is stored in a control file, normally a text file with the extension .ctl. Then you invoke SQL*Loader from the command line:
sqlldr userid=youruser/pwd@tns_string control=/path_to_control_file/control_file.ctl
How can I read a CSV file from S3 while excluding a few values?
E.g.: list [a, b]
I need to read all the values in the CSV except the values a and b. I know how to read the whole CSV from S3 with sqlContext.read.csv(s3_path, header=True), but how do I exclude these 2 values from the file and read the rest of it?
You don't. A file is a sequential storage medium. A CSV file is a form of text file: it's character-indexed. Therefore, to exclude columns, you have to first read and process the characters to find the column boundaries.
Even if you could magically find those boundaries, you would have to seek past those locations; this would likely cost you more time than simply reading and ignoring the characters, since you would be interrupting the usual, smooth block-transfer instructions that drive most file buffering.
As the comments tell you, simply read the file as is and discard the unwanted data as part of your data cleansing. If you need the file repeatedly, then cleanse it once, and use that version for your program.
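A rough sketch of that read-then-discard approach in PySpark, assuming a and b are column names you want to drop (the S3 path is a placeholder, and a filter is shown in a comment in case a and b are values rather than columns):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the whole CSV from S3, then discard the unwanted columns.
df = spark.read.csv("s3://your-bucket/path/file.csv", header=True)
df = df.drop("a", "b")

# If a and b are values to exclude from a particular column instead,
# filter the rows out:
# df = df.filter(~df["some_column"].isin(["a", "b"]))

df.show()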
If you only want to get a few rows, you could use S3 Select and Glacier Select – Retrieving Subsets of Objects | AWS News Blog. This is a way to run SQL against an S3 object without downloading it.
Alternatively, you could use Amazon Athena to query a CSV file using SQL.
However, it might simply be easier to download the whole file and do the processing locally in your Python app.
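If you do want S3 to do the filtering, here is a hedged sketch using boto3's S3 Select; the bucket, key, and column names are placeholders, and it assumes the object is a plain CSV with a header row:

import boto3

s3 = boto3.client("s3")

# Ask S3 to run SQL over the object and return only the columns we want.
response = s3.select_object_content(
    Bucket="your-bucket",
    Key="path/file.csv",
    ExpressionType="SQL",
    Expression='SELECT s."col1", s."col2" FROM s3object s',
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The result comes back as an event stream of CSV-encoded records.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))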
I had a hard time last week getting data out of Spark; in the end I simply went with
df.toPandas().to_csv('mycsv.csv')
out of this answer.
I had tested the more native
df.write.csv('mycsv.csv')
for Spark 2.0+, but as per the comment underneath, it drops a set of CSV files instead of one, which then need to be concatenated, whatever that means in this context. It also dropped an empty file into the directory called something like 'success'. The directory name was /mycsv/, but the CSV itself had an unintelligible name made up of a long string of characters.
This was the first I had heard of such a thing. Well, Excel has multiple tabs, which must somehow be reflected in an .xls file, and NumPy arrays can be multidimensional, but I thought a CSV file was just a header and rows of values separated into columns by commas.
Another answer suggested:
query.repartition(1).write.csv("cc_out.csv", sep='|')
So this drops just one file plus the blank 'success' file, but the file still does not have the name you want; the directory does.
Does anyone know why Spark is doing this, why it will not simply output a CSV, how it names the CSV, what that success file is supposed to contain, and whether concatenating CSV files here means joining them vertically, head to tail?
There are a few reasons why Spark outputs multiple CSVs:
- Spark runs on a distributed cluster. For large datasets, all the data may not be able to fit on a single machine, but it can fit across a cluster of machines. To write one CSV, all the data would presumably have to be on one machine and written by one machine, which one machine may not be able to do.
- Spark is designed for speed. If data lives on 5 partitions across 5 executors, it makes sense to write 5 CSVs in parallel rather than move all data to a single executor and have one executor write the entire dataset.
If you need one CSV, my presumption is that your dataset is not super large. My recommendation is to download all the CSV files into a directory, and run cat *.csv > output.csv in the relevant directory. This will join your CSV files head-to-tail. You may need to do more work to strip headers from each part file if you're writing with headers.
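If you would rather do the concatenation in Python than with cat, a small sketch, assuming the part files were written with headers into a hypothetical local directory named output_dir:

import glob
import pandas as pd

# Read every part file Spark wrote and stack them head-to-tail.
parts = sorted(glob.glob("output_dir/part-*.csv"))
combined = pd.concat((pd.read_csv(p) for p in parts), ignore_index=True)

# Writing back out gives you the single, sensibly named CSV you wanted.
combined.to_csv("output.csv", index=False)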
Does anyone know why Spark is doing this, why it will not simply output a CSV,
Because it is designed for distributed computing where each chunk of data (a.k.a. partition) is written independently of others.
how it names the CSV
The name depends on the partition number.
what that success file is supposed to contain
Nothing. It just indicates success.
This basically happens because Spark writes files based on the number of partitions the data is divided into, so each partition simply dumps its own file separately. You can use coalesce to save them to a single file (a short sketch follows below). Check this link for more info.
However, this method has the disadvantage that it needs to collect all the data onto a single node, so that node must have enough memory. A workaround for this can be seen in this answer.
This link also sheds some more light on this behavior of Spark:
Spark is like Hadoop - uses Hadoop, in fact - for performing actions like outputting data to HDFS. You'll know what I mean the first time you try to save "all-the-data.csv" and are surprised to find a directory named all-the-data.csv/ containing a 0 byte _SUCCESS file and then several part-0000n files for each partition that took part in the job.
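A minimal sketch of the coalesce approach mentioned above, assuming the dataset is small enough to fit in a single partition (the input path is a placeholder standing in for the DataFrame from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("input.csv", header=True)  # stand-in for your DataFrame

# Collapse to one partition so Spark writes a single part file.
# The output path is still a directory containing one part-00000-*.csv
# plus the empty _SUCCESS marker.
df.coalesce(1).write.option("header", True).csv("single_file_out")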
What is the fastest way to import data into MSSQL using Python with these requirements:
Large file (too big for memory, no row based inserts).
Minimal logging.
CSV file.
CSV file columns may contain single and double quotes.
CSV file columns may contain line breaks.
Thanks.
Call bcp on Windows, or freebcp on Linux, from Python code.
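A hedged sketch of driving bcp from Python with subprocess; the server, database, table, credentials, and file path are placeholders, and note that bcp does not parse quoted CSV fields, so embedded quotes or line breaks usually need a format file or some pre-processing:

import subprocess

# Placeholders: adjust the table, file path, server, and credentials.
cmd = [
    "bcp", "MyDatabase.dbo.MyTable", "in", r"C:\data\big_file.csv",
    "-S", "myserver",      # SQL Server instance
    "-U", "myuser",
    "-P", "mypassword",
    "-c",                  # character-mode data file
    "-t,",                 # comma field terminator
    "-b", "100000",        # batch size
    "-h", "TABLOCK",       # table lock hint for a minimally logged bulk load
]

subprocess.run(cmd, check=True)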
I have a very large (>2 million rows) CSV file that is being generated and viewed in an internal web service. The problem is that when users of this system want to export this CSV to run custom queries, they open these files in Excel. Excel formats the numbers as best it can, but there have been requests to have the data in xlsx format with filters and so on.
The question boils down to: using Python 2.7, how can I read a large CSV file (>2 million rows) into Excel (or multiple Excel files) and control the formatting (dates, numbers, autofilters, etc.)?
I am open to Python and Excel-internal solutions.
Without more information about the data types in the CSV, or your exact issue with Excel handling those data types properly, it's hard to give an exact answer.
However, I recommend looking at this module (https://xlsxwriter.readthedocs.org/), which can be used in Python to create xlsx files. I haven't used it, but it seems to have more features than you need.
Especially if you need to split the data between multiple files or workbooks. It also looks like you can pre-create the filters and have total control over the formatting.
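A minimal sketch with xlsxwriter, assuming a hypothetical input.csv whose first column is an ISO date and second column is numeric; it skips the extra step of splitting at Excel's 1,048,576-row-per-sheet limit, which a >2-million-row file will need:

import csv
from datetime import datetime

import xlsxwriter

# constant_memory streams rows to disk instead of keeping the whole sheet in RAM.
workbook = xlsxwriter.Workbook("output.xlsx", {"constant_memory": True})
worksheet = workbook.add_worksheet()

date_fmt = workbook.add_format({"num_format": "yyyy-mm-dd"})
num_fmt = workbook.add_format({"num_format": "#,##0.00"})

row_idx = 0
with open("input.csv") as f:
    reader = csv.reader(f)
    header = next(reader)
    worksheet.write_row(0, 0, header)
    for row_idx, row in enumerate(reader, start=1):
        # Hypothetical layout: column 0 is a date, column 1 is numeric,
        # and the remaining columns are plain text.
        worksheet.write_datetime(row_idx, 0,
                                 datetime.strptime(row[0], "%Y-%m-%d"), date_fmt)
        worksheet.write_number(row_idx, 1, float(row[1]), num_fmt)
        for col_idx, value in enumerate(row[2:], start=2):
            worksheet.write(row_idx, col_idx, value)

# Add an autofilter across the used range.
worksheet.autofilter(0, 0, row_idx, len(header) - 1)
workbook.close()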