I have a .parquet file, and would like to use Python to quickly and efficiently query that file by a column.
For example, I might have a column name in that .parquet file and want to get back the first (or all of) the rows with a chosen name.
How can I query a parquet file like this in the Polars API, or possibly FastParquet (whichever is faster)?
I thought pl.scan_parquet might be helpful, but it didn't seem to be, or perhaps I just didn't understand it. Preferably, though it is not essential, the entire file would not have to be read into memory first, to keep memory and CPU usage down.
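For context, this is roughly the kind of lazy query I have been trying to express with scan_parquet (the file path, column name and value below are just placeholders):

import polars as pl

# Build a lazy query - nothing is read from disk yet.
lazy = pl.scan_parquet("data.parquet")

# All rows with the chosen name; Polars can push the filter down so that
# only matching data needs to be materialised.
matches = lazy.filter(pl.col("name") == "chosen_name").collect()

# Or just the first matching row.
first_match = lazy.filter(pl.col("name") == "chosen_name").limit(1).collect()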
Thank you for your help.
This is the CSV file to be parsed in our program; it is one of many such CSV files:
ID|TITLE|COMPANY|DATE|REV|VIEW_TIME
id1|title1|company 1|2014-04-01|4.00|1:30
id1|title3|company 2|2014-04-03|6.00|2:05
id2|title4|company 1|2014-04-02|8.00|2:45
id3|title2|company 1|2014-04-02|4.00|1:05
The catch, as given in the assignment, is as follows:
Your first task is to parse and import the file into a simple datastore. You may use any file format that you want to implement to store the data. For this assignment, you must write your own datastore instead of reusing an existing one such as a SQL database. Also, you must assume that after importing many files the entire datastore will be too large to fit in memory. Records in the datastore should be unique by ID, TITLE and DATE. Subsequent imports with the same logical record should overwrite the earlier records.
Since the data structure cannot hold all the data in memory, I have to look into a more permanent storage solution. Writing to a file seems more suitable than any other solution, but herein lies the catch: if I have to overwrite records on the basis of ID, TITLE and DATE, then I would have to load the entire contents into memory before overwriting them, which is not possible according to the precondition.
WHAT I AM LOOKING FOR
What approach should I take? I am not looking for a code sample, but I am hoping that someone has an idea of which data structure or file structure to use. Any suggestion, such as "use a stack", "use a list" or a particular file layout, is appreciated.
I'm manipulating several files via NFS due to security concerns. Processing anything is very painful because of the slow file I/O. The following describes the issue.
I use pandas in Python to do simple processing on data. So I use read_csv() and to_csv() frequently.
Currently, writing a 10 GB CSV file takes nearly 30 minutes, whereas reading it takes about 2 minutes.
I have enough CPU cores (> 20 cores) and memory (50G~100G).
It is hard to ask for more bandwidth.
I need to access the data in a column-oriented manner, and frequently. For example, there might be 100M records with 20 columns (most of them numeric). I frequently read all 100M records, but only the values of 3~4 columns.
I've tried HDF5, but it produces a larger file and takes a similar amount of time to write, and it does not provide column-oriented I/O, so I've discarded this option.
I cannot store the files locally; that would violate several security requirements. I'm actually working on a virtual machine, and the file system is mounted via NFS.
I repeatedly read the same few columns; the other columns, no. The task is something like data analysis.
Which approaches can I consider?
In several cases I use sqlite3 to manipulate data in a simple way and export the results to CSV files. Can I accelerate these I/O tasks by using sqlite3 in Python? If it provided column-wise operations, it would be a good solution, I reckon.
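For reference, my current access pattern looks roughly like this (the file paths and column names below are just placeholders):

import pandas as pd

# Read the full file over NFS, keeping only the few columns I need.
# usecols limits memory, but the whole file still has to be parsed.
df = pd.read_csv("/mnt/nfs/data/big_table.csv", usecols=["col_a", "col_b", "col_c"])

result = df.groupby("col_a")["col_b"].mean()

# Writing results back over NFS is the slow part.
result.to_csv("/mnt/nfs/data/summary.csv")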
Two options: pandas HDF5 or dask.
You can look into the HDF5 format with format='table'.
HDFStore supports another PyTables format on disk, the table format. Conceptually a table is shaped very much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions. In addition, delete and query type operations are supported. This format is specified by format='table' or format='t' to append or put or to_hdf.
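A rough sketch of what that looks like in practice (the file name, key and column names are made up for illustration; data_columns controls which columns can be used in where queries):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1_000_000, 20),
                  columns=[f"col{i}" for i in range(20)])

# Write in the queryable 'table' format, with compression.
df.to_hdf("data.h5", key="records", format="table",
          complevel=5, complib="blosc", data_columns=["col1"])

# Read back only the columns (and rows) you need, not the whole file.
subset = pd.read_hdf("data.h5", key="records",
                     columns=["col1", "col2", "col3"],
                     where="col1 > 0")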
You can also use dask's read_csv; it only reads the data when you call compute().
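A minimal sketch of that, assuming dask is installed (the file pattern and column names are placeholders):

import dask.dataframe as dd

# Nothing is read from disk yet - this only builds a lazy task graph.
ddf = dd.read_csv("/mnt/nfs/data/big_table_*.csv")

# Keep only the columns of interest.
subset = ddf[["col_a", "col_b"]]

# The data is actually read (in parallel, chunk by chunk) only here.
means = subset.groupby("col_a")["col_b"].mean().compute()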
Purely for improving I/O performance, I think HDF with a compressed format is best.
I'm new to Azure and Python and was creating a notebook in Databricks to output the results of a piece of SQL. The code below produces the expected output, but with a default filename that's about 100 characters long. I'd like to be able to give the output a sensible name and add a date/time to create uniqueness, something like testfile20191001142340.csv. I've searched high and low and can't find anything that helps; hoping somebody in the community can point me in the right direction.
%python
try:
    dfsql = spark.sql("select * from dbsmets1mig02_technical_build.tbl_Temp_Output_CS_Firmware_Final order by record1")  # Replace with your SQL
except:
    print("Exception occurred")
if dfsql.count() == 0:
    print("No data rows")
else:
    dfsql.coalesce(1).write.format("com.databricks.spark.csv").option("header","false").option("delimiter","|").mode("overwrite").option("quote","\u0000").save(
        "/mnt/publisheddatasmets1mig/metering/smets1mig/cs/system_data_build/firmware/outbound/")
The issue with naming a single file is that it pretty much goes against the philosophy of Spark. To enable quick processing, Spark has to be able to parallelise writes. For Parquet files or other outputs that naturally support parallelism this is not a problem; in the case of .csv files we are used to working with single files, hence a lot of confusion.
Long story short, if you did not use .coalesce(1), Spark would write your data to multiple .csv files in one folder. Since there is only one partition, there will be only one file, but with a generated name. So you have two options here:
rename/move the file afterwards using Databricks utilities or regular Python libraries (see the sketch after this list)
.collect the result and save it using other libraries (the default would be the csv package)
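For the first option, a rough sketch of the rename step (the output path is taken from your question, the part- prefix is what Spark typically generates, and dbutils is only available inside a Databricks notebook):

from datetime import datetime

out_dir = "/mnt/publisheddatasmets1mig/metering/smets1mig/cs/system_data_build/firmware/outbound/"
new_name = "testfile" + datetime.now().strftime("%Y%m%d%H%M%S") + ".csv"

# Find the single part-* file Spark generated and rename/move it.
part_file = [f.path for f in dbutils.fs.ls(out_dir) if f.name.startswith("part-")][0]
dbutils.fs.mv(part_file, out_dir + new_name)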
The obvious question you may have is why it is so hard to do something as simple as saving to a single file, and the answer is: because it's a problem for Spark. The issue with your approach of saving a single partition is that if you have more data than can fit in your driver/executor memory, repartitioning to 1 partition or collecting the data on one executor will simply fail and explode with an exception.
For safely saving to a single .csv file you can use the toLocalIterator method, which loads only one partition into memory at a time, and within its iterator save your results to a single file using the csv package.
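A minimal sketch of that approach, assuming the dfsql DataFrame from your snippet and a local path on the driver (the delimiter and quoting options would need adjusting to match your original write):

import csv

with open("/tmp/testfile.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="|")
    writer.writerow(dfsql.columns)  # optional header row
    # Only one partition is pulled to the driver at a time.
    for row in dfsql.toLocalIterator():
        writer.writerow(list(row))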
I had a hard time last week getting data out of Spark, in the end I had to simply go with
df.toPandas().to_csv('mycsv.csv')
out of this answer.
I had tested the more native
df.write.csv('mycsv.csv')
for Spark 2.0+, but as per the comment underneath, it drops a set of CSV files instead of one, which then need to be concatenated, whatever that means in this context. It also dropped an empty file into the directory called something like 'success'. The directory name was /mycsv/, but the CSV itself had an unintelligible name made up of a long string of characters.
This was the first I had heard of such a thing. Well, Excel has multiple tabs, which must somehow be reflected in an .xls file, and NumPy arrays can be multidimensional, but I thought a CSV file was just a header and rows of values separated into columns by commas.
Another answer suggested:
query.repartition(1).write.csv("cc_out.csv", sep='|')
So this drops just one file plus the blank 'success' file, but the file still does not have the name you want; the directory does.
Does anyone know why Spark is doing this, why it will not simply output a CSV, how it names the CSV, what that success file is supposed to contain, and whether concatenating CSV files here means joining them vertically, head to tail?
There are a few reasons why Spark outputs multiple CSVs:
- Spark runs on a distributed cluster. For large datasets, all the data may not be able to fit on a single machine, but it can fit across a cluster of machines. To write one CSV, all the data would presumably have to be on one machine and written by one machine, which one machine may not be able to do.
- Spark is designed for speed. If data lives on 5 partitions across 5 executors, it makes sense to write 5 CSVs in parallel rather than move all data to a single executor and have one executor write the entire dataset.
If you need one CSV, my presumption is that your dataset is not super large. My recommendation is to download all the CSV files into a directory, and run cat *.csv > output.csv in the relevant directory. This will join your CSV files head-to-tail. You may need to do more work to strip headers from each part file if you're writing with headers.
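If you prefer to do the concatenation in Python rather than with cat, a small sketch (the directory and file names are placeholders) that also drops the repeated header from every part file after the first:

import glob

part_files = sorted(glob.glob("cc_out.csv/part-*.csv"))

with open("output.csv", "w") as out:
    for i, path in enumerate(part_files):
        with open(path) as part:
            lines = part.readlines()
        # Keep the header only from the first part file.
        out.writelines(lines if i == 0 else lines[1:])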
Does anyone know why Spark is doing this, why will it not simply output a csv,
Because it is designed for distributed computing where each chunk of data (a.k.a. partition) is written independently of others.
how does it name the csv
The name depends on the partition number.
what is that success file supposed to contain
Nothing. It just indicates success.
This basically happens because Spark dumps files based on the number of partitions the data is divided into, so each partition simply dumps its own file separately. You can use coalesce to save them to a single file. Check this link for more info.
However, this method has the disadvantage that it needs to collect all the data on a single node, so that node must have enough memory. A workaround for this can be seen in this answer.
This link also sheds some more information about this behavior of Spark:
Spark is like Hadoop - uses Hadoop, in fact - for performing actions like outputting data to HDFS. You'll know what I mean the first time you try to save "all-the-data.csv" and are surprised to find a directory named all-the-data.csv/ containing a 0 byte _SUCCESS file and then several part-0000n files for each partition that took part in the job.
I am a relatively new user of Python. What is the best way of parsing and processing a CSV and loading it into a local Postgres Database (in Python)?
It was recommended to me to use the CSV library to parse and process the CSV. In particular, the task at hand says:
The data might have errors (some rows may not be parseable), the data might be duplicated, the data might be really large.
Is there a reason why I wouldn't be able to just use pandas.read_csv here? Does using the CSV library make parsing and loading it into a local Postgres database easier? In particular, if I just use pandas will I run into problems if rows are unparseable, if the data is really big, or if data is duplicated? (For the last bit, I know that pandas offers some relatively clean solutions for de-dupping).
I feel like pandas.read_csv and pandas.to_sql can do a lot of work for me here, but I'm not sure if using the CSV library offers other advantages.
Just in terms of speed, this post: https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file seems to suggest that pandas.read_csv performs the best?
A quick Google search didn't reveal any serious drawbacks in pandas.read_csv regarding its functionality (parsing correctness, supported types, etc.). Moreover, since you appear to be using pandas to load the data into the DB too, reading directly into a DataFrame is a huge boost in both performance and memory (no redundant copies).
There are only memory issues for very large datasets, but these are not the library's fault. How to read a 6 GB csv file with pandas has instructions on how to process a large .csv in chunks with pandas.
Regarding "The data might have errors", read_csv has a few facilities like converters, error_bad_lines and skip_blank_lines (specific course of action depends on if and how much corruption you're supposed to be able to recover).
I had a school project just last week that required me to load data from a CSV and insert it into a Postgres database. So believe me when I tell you this: it's way harder than it has to be unless you use pandas. The issue is sniffing out the data types. OK, if your database columns are all string types, forget what I said, you're golden. But if you have a CSV with an assortment of data types, either you get to sniff them yourself or you can use pandas, which does it efficiently and automatically. Plus pandas has a nifty write-to-SQL method which can easily be adapted to work with Postgres via a SQLAlchemy connection, too.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html
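For completeness, a minimal sketch of that pandas + SQLAlchemy route (the connection string, table name and file name are placeholders):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/mydb")

df = pd.read_csv("data.csv")
df = df.drop_duplicates()

# pandas infers the column dtypes and creates/loads the table for us.
df.to_sql("my_table", engine, if_exists="append", index=False)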