Read a csv file from s3 excluding some values - python

How can I read a CSV file from S3 while skipping a few values?
Eg: list [a, b]
I need to read every value in the CSV except a and b. I know how to read the whole CSV from S3: sqlContext.read.csv(s3_path, header=True). But how do I exclude these 2 values from the file and read the rest of it?

You don't. A file is a sequential storage medium. A CSV file is a form of text file: it has no index, just a stream of characters. Therefore, to exclude columns, you first have to read and process the characters to find the column boundaries.
Even if you could magically find those boundaries, you would have to seek past those locations; this would likely cost you more time than simply reading and ignoring the characters, since you would be interrupting the usual, smooth block-transfer instructions that drive most file buffering.
As the comments tell you, simply read the file as is and discard the unwanted data as part of your data cleansing. If you need the file repeatedly, then cleanse it once, and use that version for your program.
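As a minimal sketch of "read as is and discard" (stdlib only; an in-memory string stands in for the S3 object, and columns a–d are hypothetical; with Spark the equivalent is chaining .drop('a', 'b') onto the read):

```python
import csv
import io

# In-memory stand-in for the CSV fetched from S3; columns a..d are hypothetical.
raw = io.StringIO("a,b,c,d\n1,2,3,4\n5,6,7,8\n")

exclude = {"a", "b"}
reader = csv.DictReader(raw)
# Read every row in full, then drop the unwanted keys as a cleansing step.
rows = [{k: v for k, v in row.items() if k not in exclude} for row in reader]

print(rows[0])  # {'c': '3', 'd': '4'}
```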

If you only wanted a few rows, you could use S3 Select (see "S3 Select and Glacier Select – Retrieving Subsets of Objects" on the AWS News Blog). This is a way to run SQL against an S3 object without downloading it.
Alternatively, you could use Amazon Athena to query a CSV file using SQL.
However, it might simply be easier to download the whole file and do the processing locally in your Python app.
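With boto3, an S3 Select request looks roughly like this (a sketch: the bucket, key, and column names are placeholders, and the request is only assembled here, not actually sent):

```python
# Parameters for boto3's s3.select_object_content; bucket, key, and column
# names below are hypothetical placeholders.
select_kwargs = {
    "Bucket": "my-bucket",
    "Key": "data/file.csv",
    "ExpressionType": "SQL",
    # S3 Select runs this SQL against the object server-side.
    "Expression": "SELECT s.colA, s.colB FROM s3object s LIMIT 10",
    # USE tells S3 to treat the first line as a header row.
    "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
    "OutputSerialization": {"CSV": {}},
}

# With a real client you would then run:
#   s3 = boto3.client("s3")
#   response = s3.select_object_content(**select_kwargs)
# and read the record events from response["Payload"].
print(select_kwargs["Expression"])
```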

Related

Read columns from Parquet in GCS without reading the entire file?

Reading a parquet file from disk, I can choose to read only a few columns (I assume it scans the header/footer, then decides). Is it possible to do this remotely (such as via Google Cloud Storage)?
We have 100 MB parquet files with about 400 columns and we have a use-case where we want to read 3 of them, and show them to the user. The user can choose which columns.
Currently we download the entire file, and then filter it but this takes time.
Long term we will be putting it into Google BigQuery and the problem will be solved
More specifically, we use Python with either pandas or PyArrow, and ideally would like to stick with those (either with a GCS backend or by manually fetching the specific data we need via a wrapper). This runs in Cloud Run, so we would prefer not to use FUSE, although that is certainly possible. Data size matters here because a 100 MB download to "disk" in Cloud Run actually means 100 MB downloaded to RAM.
We use pyarrow.parquet.read_parquet with to_pandas() or pandas.read_parquet.
The pandas.read_parquet function has a columns argument to read only a subset of columns.

How to read and process large csv objects from S3 using Python boto3?

I'm downloading csv files and processing the content by using Python 3.8.
I faced a memory error when downloading a large file, so I need to download a certain number of rows (let's say 10k), process them, and then read the next 10k rows until the entire CSV is processed. So far, I read the entire CSV and decode it by converting it into a dictionary that preserves the headers and the values of each row:
data = s3.get_object(Bucket=config.BUCKET_NAME, Key=source_file)
contents = data['Body'].read().decode("utf-8")
csv_reader = csv.DictReader(contents.splitlines(True))
I've been reading the documentation: download_fileobj can read an object in chunks and uses a callback method to process them, but the object is divided into byte chunks, and I need to divide it into rows so that no row is split in the middle.
I'd prefer not to download the entire file to disk, because I don't have much space and that would require deleting the file after processing; I'd rather do it directly in RAM, using a library, method, etc.
Ideas?
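One way to sketch row-wise batching (stdlib only; the BytesIO below stands in for the streaming data['Body'] returned by get_object, which likewise yields complete lines without loading the whole object, and the batch size of 2 stands in for 10k):

```python
import csv
import io
from itertools import islice

# Stand-in for data['Body'] from s3.get_object(); iterating it yields
# complete lines, so no row is ever split in the middle.
body = io.BytesIO(b"id,name\n1,alice\n2,bob\n3,carol\n")

lines = (line.decode("utf-8") for line in body)
reader = csv.DictReader(lines)

batches = []
while True:
    # Pull the next batch of up to 2 rows (10_000 in the real use case).
    batch = list(islice(reader, 2))
    if not batch:
        break
    batches.append(batch)  # process the batch here instead of storing it

print([len(b) for b in batches])  # [2, 1]
```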

Limit of csv file importing in python

I am currently in the process of getting data from my stakeholder, who has a database from which he is going to extract it as a CSV file.
From there he is going to upload it to a shared drive; I will download the data and use it as a local source to import into a pandas dataframe.
The approximate size will be 40 million rows. I was wondering whether the data can be exported as a single CSV file from the SQL database and used as a source for a Python dataframe, or whether it should be in chunks, as I am not sure what the row limitation of a CSV file is.
I don't think RAM and processing should be an issue at this time.
Your help is much appreciated. Cheers!
If you can't connect directly to the database, you might need the .db file. I'm not sure a csv will even be able to handle more than a million or so rows.
as I am not sure what the row limitation of csv file is.
There is no such limit inherent in the CSV format, if you understand CSV as the format defined by RFC 4180, which stipulates that a CSV file is
file = [header CRLF] record *(CRLF record) [CRLF]
where [...] denotes an optional part, CRLF denotes carriage return-line feed (\r\n), and *(...) denotes a part repeated zero or more times.
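On the "should it be in chunks" part of the question: pandas can consume a large CSV incrementally via read_csv's chunksize parameter, so even a single 40-million-row file never has to be fully resident. A sketch with an in-memory stand-in:

```python
import io
import pandas as pd

# In-memory stand-in for the exported CSV; a 40-million-row file would
# stream through the same loop chunk by chunk.
raw = io.StringIO("id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10)))

total = 0
for chunk in pd.read_csv(raw, chunksize=4):  # DataFrames of up to 4 rows
    total += chunk["value"].sum()

print(total)  # 90
```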

different pipelines based on files in compressed file

I have a compressed file in a google cloud storage bucket. This file contains a big csv file and a small xml based metadata file. I would like to extract both files and determine the metadata and process the csv file. I am using the Python SDK, and the pipeline will run on Google Dataflow at some point.
The current solution is to use Google Cloud Functions to extract both files and start the pipeline with the parameters parsed from the xml file.
I would like to eliminate the Google Cloud Function and process the compressed file in Apache Beam itself. The pipeline should process the XML file and then process the csv file.
However, I am stuck at extracting the two files into separate collections. I would like to understand if my solution is flawed, or if not, an example on how to deal with different files in a single compressed file.
In my understanding, this is not achievable through any existing text IO in Beam.
The problem with your design is that you are enforcing a file reading order (the metadata XML must be read before the CSV file is processed) together with logic to interpret the CSV. Neither is supported by any concrete text IO.
If you do want to have this flexibility, I would suggest that you take a look at vcfio. You might want to write your own reader that inherits from filebasedsource.FileBasedSource too. There is some similarity in the implementation of vcfio to your case, in that there is always a header that explains how to interpret the CSV part in a VCF-formatted file.
Actually, if you can somehow rewrite your XML metadata and add it as a header to the CSV file, you can probably use vcfio instead.
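For reference, the extraction step itself (what the Cloud Function does today) is simple with the standard library; a sketch over an in-memory ZIP, where the member names are hypothetical and a real archive would first be downloaded from GCS:

```python
import io
import zipfile

# Build an in-memory archive standing in for the compressed GCS object.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("metadata.xml", "<meta><delimiter>,</delimiter></meta>")
    zf.writestr("data.csv", "a,b\n1,2\n")

with zipfile.ZipFile(buf) as zf:
    names = sorted(zf.namelist())
    # Read the metadata first, then process the CSV according to it.
    metadata = zf.read("metadata.xml").decode("utf-8")
    csv_text = zf.read("data.csv").decode("utf-8")

print(names)  # ['data.csv', 'metadata.xml']
```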

Why does Spark output a set of csv's instead of just one?

I had a hard time last week getting data out of Spark, in the end I had to simply go with
df.toPandas().to_csv('mycsv.csv')
out of this answer.
I had tested the more native
df.write.csv('mycsv.csv')
for Spark 2.0+, but as per the comment underneath, it drops a set of CSV files instead of one, which "need to be concatenated", whatever that means in this context. It also dropped an empty file into the directory called something like 'success'. The directory name was /mycsv/, but the CSV itself had an unintelligible name: a long string of characters.
This was the first I had heard of such a thing. Well, Excel has multiple tabs which must somehow be reflected in an .xls file, and NumPy arrays can be multidimensional, but I thought a csv file was just a header, values separated into columns by commas in rows.
Another answer suggested:
query.repartition(1).write.csv("cc_out.csv", sep='|')
So this drops just one file plus the blank 'success' file; still, the file does not have the name you want, only the directory does.
Does anyone know why Spark is doing this: why it will not simply output one CSV, how it names the CSV, what that success file is supposed to contain, and whether concatenating CSV files here means joining them vertically, head to tail?
There are a few reasons why Spark outputs multiple CSVs:
- Spark runs on a distributed cluster. For large datasets, all the data may not be able to fit on a single machine, but it can fit across a cluster of machines. To write one CSV, all the data would presumably have to be on one machine and written by one machine, which one machine may not be able to do.
- Spark is designed for speed. If data lives on 5 partitions across 5 executors, it makes sense to write 5 CSVs in parallel rather than move all data to a single executor and have one executor write the entire dataset.
If you need one CSV, my presumption is that your dataset is not super large. My recommendation is to download all the CSV files into a directory, and run cat *.csv > output.csv in the relevant directory. This will join your CSV files head-to-tail. You may need to do more work to strip headers from each part file if you're writing with headers.
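That header-stripping join can also be done in Python rather than with cat; a stdlib sketch, where the part-file names mimic Spark's output but the paths are hypothetical:

```python
import glob
import os
import tempfile

def concat_csv_parts(parts, out_path):
    """Join CSV part files head-to-tail, keeping only the first header."""
    with open(out_path, "w") as out:
        for i, path in enumerate(parts):
            with open(path) as part:
                lines = part.readlines()
                # Skip the header line on every file after the first.
                out.writelines(lines if i == 0 else lines[1:])

# Demo with two fake part files (names mimic Spark's part-0000n output).
tmp = tempfile.mkdtemp()
for n, rows in enumerate(["h1,h2\n1,2\n", "h1,h2\n3,4\n"]):
    with open(os.path.join(tmp, f"part-0000{n}.csv"), "w") as f:
        f.write(rows)

out = os.path.join(tmp, "output.csv")
concat_csv_parts(sorted(glob.glob(os.path.join(tmp, "part-*.csv"))), out)
print(open(out).read())
```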
Does anyone know why Spark is doing this, why will it not simply output a csv,
Because it is designed for distributed computing where each chunk of data (a.k.a. partition) is written independently of others.
how does it name the csv
Name depends on the partition number.
what is that success file supposed to contain
Nothing. It just indicates success.
This basically happens because Spark dumps files based on the number of partitions the data is divided between, so each partition simply dumps its own file separately. You can use the coalesce option to save them to a single file. Check this link for more info.
However, this method has the disadvantage that it needs to collect all the data on the master node, so the master node must have enough memory. A workaround can be seen in this answer.
This link also sheds some more information about this behavior of Spark:
Spark is like Hadoop - uses Hadoop, in fact - for performing actions like outputting data to HDFS. You'll know what I mean the first time you try to save "all-the-data.csv" and are surprised to find a directory named all-the-data.csv/ containing a 0 byte _SUCCESS file and then several part-0000n files for each partition that took part in the job.
