Read columns from Parquet in GCS without reading the entire file? - python

When reading a Parquet file from disk, I can choose to read only a few columns (I assume it scans the header/footer, then decides). Is it possible to do this remotely (such as via Google Cloud Storage)?
We have 100 MB parquet files with about 400 columns and we have a use-case where we want to read 3 of them, and show them to the user. The user can choose which columns.
Currently we download the entire file and then filter it, but this takes time.
Long term we will be putting the data into Google BigQuery and the problem will be solved.
More specifically we use Python with either pandas or PyArrow and ideally would like to use those (either with a GCS backend or manually getting the specific data we need via a wrapper). This runs in Cloud Run so we would prefer to not use Fuse, although that is certainly possible.
I intend to use Python and pandas/PyArrow as the backend for this, running in Cloud Run (which is why data size matters: a 100 MB download to disk actually means 100 MB downloaded to RAM).
We use pyarrow.parquet.read_table with to_pandas() or pandas.read_parquet.

The pandas.read_parquet function has a columns argument for reading a subset of columns.
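A minimal sketch of that call against a GCS object, assuming the gcsfs package is installed so that pandas can resolve gs:// URLs through fsspec. The helper names, bucket, object path and column names are placeholders, not taken from the question. Because Parquet keeps its schema and column chunk offsets in the file footer, the reader can in principle request only the byte ranges for the selected columns, but how much is actually transferred depends on the engine and the file layout, so it is worth measuring.

```python
from typing import Sequence


def gcs_uri(bucket: str, blob_path: str) -> str:
    """Build a gs:// URI from a bucket name and an object path."""
    return f"gs://{bucket}/{blob_path.lstrip('/')}"


def read_selected_columns(uri: str, columns: Sequence[str]):
    """Read only the requested columns of a Parquet file into a DataFrame."""
    # Imported lazily so this module still loads where pandas is not installed.
    import pandas as pd

    # `columns` limits the read to the named columns instead of all of them.
    # Assumption: gcsfs is installed, otherwise gs:// paths cannot be opened.
    return pd.read_parquet(uri, columns=list(columns))
```

Usage would look like read_selected_columns(gcs_uri("my-bucket", "exports/data.parquet"), ["col_a", "col_b", "col_c"]), where the three column names come from the user's selection.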

Related

How to export a huge table from BigQuery into a Google cloud bucket as one file

I am trying to export a huge table (2,000,000,000 rows, roughly 600GB in size) from BigQuery into a google bucket as a single file. All tools suggested in Google's Documentation are limited in export size and will create multiple files.
Is there a Pythonic way to do it without needing to hold the entire table in memory?
While there may be other ways to script this, the recommended solution is to merge the files using the Google Cloud Storage compose action.
What you have to do is:
export in CSV format
this produces many files
run the compose action in batches of up to 32 files, repeating on the intermediate results until everything is merged into the final big file
All this can be combined in a cloud Workflow, there is a tutorial here.
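The batching in the steps above can be sketched in plain Python. This only plans which objects get composed together in each round; it does not call GCS. The function name and the intermediate object naming scheme are made up for illustration.

```python
from typing import List

MAX_COMPOSE_SOURCES = 32  # a single GCS compose call accepts at most 32 sources


def plan_compose_rounds(
    names: List[str], batch_size: int = MAX_COMPOSE_SOURCES
) -> List[List[List[str]]]:
    """Return, for each round, the batches of object names to compose."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    rounds: List[List[List[str]]] = []
    current = list(names)
    round_no = 0
    while len(current) > 1:
        # Split the current objects into batches of at most `batch_size`.
        batches = [
            current[i:i + batch_size] for i in range(0, len(current), batch_size)
        ]
        rounds.append(batches)
        # Each batch yields one intermediate object (hypothetical names).
        current = [f"tmp/round{round_no}-part{j}" for j in range(len(batches))]
        round_no += 1
    return rounds


exported = [f"export/part-{i:05d}.csv" for i in range(100)]
plan = plan_compose_rounds(exported)
print(len(plan), [len(batches) for batches in plan])
```

With the google-cloud-storage client, each batch would correspond to one Blob.compose call on the destination blob, and the intermediate objects can be deleted once the final file exists.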

Read a csv file from s3 excluding some values

How can I read a CSV file from S3 while excluding a few values?
E.g. the list [a, b].
I need to read all the other values in the CSV, everything except a and b. I know how to read the whole CSV from S3 with sqlContext.read.csv(s3_path, header=True), but how do I exclude these two values and read the rest of the file?
You don't. A file is a sequential storage medium. A CSV file is a form of text file: it's character-indexed. Therefore, to exclude columns, you have to first read and process the characters to find the column boundaries.
Even if you could magically find those boundaries, you would have to seek past those locations; this would likely cost you more time than simply reading and ignoring the characters, since you would be interrupting the usual, smooth block-transfer instructions that drive most file buffering.
As the comments tell you, simply read the file as is and discard the unwanted data as part of your data cleansing. If you need the file repeatedly, then cleanse it once, and use that version for your program.
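As a rough sketch of "read everything, then discard", here is the idea with the standard csv module on an in-memory sample. It assumes that "values" means rows whose key column holds a or b; the column name and sample data are invented. In Spark, the same step would be a filter on the DataFrame after reading it.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, Set


def rows_excluding(
    lines: Iterable[str], column: str, excluded: Set[str]
) -> Iterator[Dict[str, str]]:
    """Yield every CSV row whose `column` value is not in `excluded`."""
    reader = csv.DictReader(lines)
    for row in reader:
        # Every row is still read and parsed; unwanted ones are just skipped.
        if row[column] not in excluded:
            yield row


# Invented sample data standing in for the file read from S3.
sample = io.StringIO("key,amount\na,1\nc,2\nb,3\nd,4\n")
kept = list(rows_excluding(sample, "key", {"a", "b"}))
print([row["key"] for row in kept])
```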
If you wanted to get just a few rows, you could use S3 Select (see "S3 Select and Glacier Select – Retrieving Subsets of Objects" on the AWS News Blog). This is a way to run SQL against an S3 object without downloading it.
Alternatively, you could use Amazon Athena to query a CSV file using SQL.
However, it might simply be easier to download the whole file and do the processing locally in your Python app.

Get a massive csv file from GCS to BQ

I have a very large CSV file (let's say 1TB) that I need to get from GCS onto BQ. While BQ does have a CSV-loader, the CSV files that I have are pretty non-standard and don't end up loading properly to BQ without formatting it.
Normally I would download the CSV file onto a server to 'process it' and save it either directly to BQ or to an Avro file that can be ingested easily by BQ. However, the file(s) are quite large and it's quite possible (and probable) that I wouldn't have the storage/memory to do the batch processing without writing a lot of code to optimize/stream it.
Is this a good use case for Cloud Dataflow? Are there any tutorials or ways to go about getting a file of format "X" from GCS into BQ? Any tutorial pointers or example scripts would be great.
I'd personally use Dataflow (not Dataprep) and write a simple pipeline to read the file in parallel, clean/transform it, and finally write it to BigQuery. It's pretty straightforward. Here's an example of one in my GitHub repo. Although it's in Java, you could easily port it to Python. Note: it uses the "templates" feature in Dataflow, but this can be changed with one line of code.
If Dataflow is off the table, another option could be to use a weird/unused delimiter and read the entire row into BigQuery. Then use SQL/Regex/UDFs to clean/transform/parse it. See here (suggestion from Felipe). We've done this lots of times in the past, and because you're in BigQuery it scales really well.
I would consider using Cloud Dataprep.
Dataprep can import data from GCS, clean / modify the data and export to BigQuery. One of the features that I like is that everything can be done visually / interactively so that I can see how the data transforms.
Start with a subset of your data to see what transformations are required and to give yourself some practice before loading and processing a TB of data.
You can always transfer from a storage bucket directly into a BQ table:
bq --location=US load --[no]replace --source_format=CSV dataset.table gs://bucket/file.csv [schema]
Here, [schema] can be an inline schema of your csv file (like id:int,name:string,..) or a path to a JSON schema file (available locally).
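For illustration, an inline schema string can be converted into the JSON schema file form with a few lines of Python. The sketch assumes the JSON layout is a list of objects with name and type keys and that every field is written as name:type; the function name is hypothetical and the type names are passed through as given, only upper-cased.

```python
import json
from typing import Dict, List


def inline_schema_to_json(inline: str) -> List[Dict[str, str]]:
    """Turn 'id:integer,name:string' into a list of field definitions."""
    fields: List[Dict[str, str]] = []
    for item in inline.split(","):
        name, _, type_name = item.strip().partition(":")
        if not name or not type_name:
            raise ValueError(f"expected name:type, got {item!r}")
        # Types are not validated here, only normalized to upper case.
        fields.append({"name": name.strip(), "type": type_name.strip().upper()})
    return fields


schema = inline_schema_to_json("id:integer,name:string")
print(json.dumps(schema, indent=2))
```

The printed JSON could be saved to a local file and passed in place of [schema].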
As per BQ documentation, they try to parallelize large CSV loads into tables. Of course, there is an upper-bound involved: maximum size of an uncompressed (csv) file to be loaded from GCS to BQ should be <= 5TB, which is way above your requirements. I think you should be good with this.

Pyspark data distribution

I have 1000 csv files that are to be processed in parallel using map function available in spark. I have two desktops connected in a cluster and I'm using the pyspark shell for computation. I am passing the name of csv files into the map function and the function accesses the files based on name. However, I need to copy files to the slave for the process to function properly. This means there has to be a copy of all the csv files on the other system. Kindly suggest an alternative storage while avoiding data transfer latency.
I also tried storing these files into a 3-d array and generating an RDD by using parallelize command. But that gives out of memory error.
You can use spark-csv to load the files:
https://github.com/databricks/spark-csv
Then you can use DataFrames to pre-process the files.
Since there are 1000 CSV files, if there is some relationship among them, use Spark SQL to run operations on them, and then extract the output for your final computation.
If that doesn't work, you can try loading the data into HBase or Hive and then use Spark to compute on it. I tested this with 100 GB of CSV content on my single-node cluster.
It may help

How can I reduce the access time on large Excel files?

I would like to process a large data set from a mechanical testing device with Python. The device's software only allows exporting the data as an Excel file. Therefore, I use the xlrd package, which works fine for small *.xlsx files.
The problem is that when I open a typical data set (3-5 MB) with
xlrd.open_workbook(path_wb)
the access time is about 30 s to 60 s. Is there a more effective and faster way to access Excel files?
You could access the file as a database via PyPyODBC instead, which may (or may not) be faster - you'd have to try it out and compare the results.
This method should work for both .xls and .xlsx files. Unfortunately, it comes with a couple of caveats:
As far as I am aware, this will only work on Windows machines, since you're relying on the Microsoft Jet database driver.
The Microsoft Jet database driver can be rather buggy, especially with dates.
It's not possible to create or modify Excel files (a note in the PyPyODBC exceltests.py file says: I have not been able to successfully create or modify Excel files.). Your question seems to indicate that you're only interested in reading files, though, so hopefully this will not be a problem.
I just figured out that the access time wasn't actually the problem: I was also creating an object in the same step. Now that I create the object separately, everything works nice and fast.
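One way to spot this kind of thing is to time each step on its own. The sketch below uses only the standard library; the two stub functions are stand-ins for the real xlrd.open_workbook call and for the object creation, which the post does not show.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator

timings: Dict[str, float] = {}


@contextmanager
def timed(label: str) -> Iterator[None]:
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


def open_workbook_stub() -> dict:
    """Stand-in for xlrd.open_workbook(path_wb)."""
    time.sleep(0.01)
    return {"rows": list(range(1000))}


def build_object_stub(workbook: dict) -> list:
    """Stand-in for the object created from the workbook."""
    time.sleep(0.02)
    return [value * 2 for value in workbook["rows"]]


with timed("open_workbook"):
    workbook = open_workbook_stub()
with timed("build_object"):
    result = build_object_stub(workbook)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```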
