I am reading a parquet file and transforming it into a DataFrame.
from fastparquet import ParquetFile
pf = ParquetFile('file.parquet')
df = pf.to_pandas()
Is there a way to read a parquet file from a variable (one that was previously read and now holds the parquet data)?
Thanks.
In pandas there is a method for reading parquet. Here is a reference to the docs. Something like this:
import pandas as pd
pd.read_parquet('file.parquet')
should work. Also please read this post for engine selection.
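For example, a minimal sketch of choosing the engine explicitly (assuming the chosen engine's package, pyarrow or fastparquet, is installed):
import pandas as pd
# pandas delegates parquet reading to an engine; pass engine='pyarrow' or
# engine='fastparquet' explicitly instead of relying on the default 'auto'
df = pd.read_parquet('file.parquet', engine='fastparquet')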
You can also read a parquet file from a variable with pandas.read_parquet, using the following code. I tested this with the pyarrow backend, but it should also work with the fastparquet backend.
import pandas as pd
import io
with open("file.parquet", "rb") as f:
data = f.read()
buf = io.BytesIO(data)
df = pd.read_parquet(buf)
I am looking to read in a .csv.gz file that is in the same directory as my Python script, using only the gzip and pandas modules.
So far I have,
import gzip
import pandas as pd
data = gzip.open('test_data.csv.gz', mode='rb')
How do I proceed with converting/reading this file in as a DataFrame without using the csv module, as seen in similarly answered questions?
You can use pandas.read_csv directly:
import pandas as pd
df = pd.read_csv('test_data.csv.gz', compression='gzip')
If you must use gzip:
import gzip
with gzip.open('test_data.csv.gz', mode='rb') as gz_file:
    df = pd.read_csv(gz_file)
I have a 20 GB .ndjson file that I want to open with Python. The file is too big, so I found a way to split it into 50 pieces with an online tool. This is the tool: https://pinetools.com/split-files
Now I get one file with the extension .ndjson.000 (and I do not know what that is).
I'm trying to open it as JSON or as a CSV file to read it into pandas, but it does not work.
Do you have any idea how to solve this?
import json
import pandas as pd
First approach:
df = pd.read_json('dump.ndjson.000', lines=True)
Error: ValueError: Unmatched ''"' when decoding 'string'
Second approach:
with open('dump.ndjson.000', 'r') as f:
    my_data = f.read()
    print(my_data)
Error: json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 104925061 (char 104925060)
I think the problem is that I have some emojis in my file, and I do not know how to encode them.
ndjson is now supported out of the box with the argument lines=True:
import pandas as pd
df = pd.read_json('/path/to/records.ndjson', lines=True)
df.to_json('/path/to/export.ndjson', orient='records', lines=True)
I think pandas.read_json cannot handle ndjson correctly.
According to this issue, you can do something like this to read it:
import ujson as json
import pandas as pd
records = map(json.loads, open('/path/to/records.ndjson'))
df = pd.DataFrame.from_records(records)
P.S.: All credit for this code goes to KristianHolsheimer from the GitHub issue.
ndjson (newline-delimited JSON) is a JSON Lines format, that is, each line is a JSON document. It is ideal for a dataset lacking rigid structure ('non-SQL') where the file size is large enough to warrant splitting into multiple files.
You can use pandas:
import pandas as pd
data = pd.read_json('dump.ndjson.000', lines=True)
In case your JSON strings do not contain embedded newlines, you can alternatively use:
import json
with open("dump.ndjson.000") as f:
    data = [json.loads(line) for line in f]
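For a file as large as the 20 GB original, a rough sketch using the chunksize argument of pandas.read_json (only valid together with lines=True), which yields smaller DataFrames instead of loading everything at once:
import pandas as pd
# read_json returns an iterator of DataFrames when chunksize is given
for chunk in pd.read_json('dump.ndjson.000', lines=True, chunksize=100000):
    # process each chunk here, e.g. filter rows and append the result to a list
    print(len(chunk))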
Here is the data I am interested in.
http://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Crops_E_All_Data.zip
It consists of 3 files: Production_Crops_E_All_Data.csv, Production_Crops_E_All_Data_NOFLAG.csv and Production_Crops_E_Flags.csv.
I want to download the zip with pandas and create a DataFrame from the one file called Production_Crops_E_All_Data.csv:
import pandas as pd
url="http://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Crops_E_All_Data.zip"
df=pd.read_csv(url)
pandas can download files, it can work with zips, and of course it can work with CSV files. But how can I work with one specific file in an archive with many files?
Now I get this error:
ValueError: ('Multiple files found in compressed zip file %s)
This post doesn't answer my question because I have multiple files in one zip:
Read a zipped file as a pandas DataFrame
From this link, try this:
from zipfile import ZipFile
import io
from urllib.request import urlopen
import pandas as pd
r = urlopen("http://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Crops_E_All_Data.zip").read()
file = ZipFile(io.BytesIO(r))
data_df = pd.read_csv(file.open("Production_Crops_E_All_Data.csv"), encoding='latin1')
data_df_noflags = pd.read_csv(file.open("Production_Crops_E_All_Data_NOFLAG.csv"), encoding='latin1')
data_df_flags = pd.read_csv(file.open("Production_Crops_E_Flags.csv"), encoding='latin1')
Hope this helps!
EDIT: updated for Python 3, StringIO to io.StringIO.
EDIT: updated the import of urllib and changed the usage of StringIO to BytesIO. Also, your CSV files are not UTF-8 encoded; I tried latin1 and that worked.
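If you are not sure which member names an archive contains, you can list them first; a small sketch reusing the ZipFile object (file) from the code above:
# list all member names inside the downloaded archive
print(file.namelist())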
You could use Python's datatable, which is a reimplementation of R's data.table in Python.
Read in the data:
from datatable import fread
# The exact file to be extracted is known, so simply append it to the zip name:
url = "Production_Crops_E_All_Data.zip/Production_Crops_E_All_Data.csv"
df = fread(url)
# convert to pandas
df_pandas = df.to_pandas()
You can equally work within datatable; note, however, that it is not as feature-rich as pandas, though it is a powerful and very fast tool.
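As a rough sketch of filtering inside datatable before converting (the column name Area is an assumption about the FAO file and may differ):
from datatable import f
# filter rows with datatable's f-expression syntax, then convert to pandas
subset = df[f.Area == "Albania", :]
subset_pd = subset.to_pandas()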
Update: You can use the zipfile module as well:
from zipfile import ZipFile
from io import BytesIO
import pandas as pd

with ZipFile("Production_Crops_E_All_Data.zip") as myzip:
    with myzip.open("Production_Crops_E_All_Data.csv") as myfile:
        data = myfile.read()

# read data into pandas
# had to toy a bit with the encoding,
# thankfully it is a known issue on SO
# https://stackoverflow.com/a/51843284/7175713
df = pd.read_csv(BytesIO(data), encoding="iso-8859-1", low_memory=False)
I have a parquet file and I want to read the first n rows from the file into a pandas DataFrame.
What I tried:
df = pd.read_parquet(path= 'filepath', nrows = 10)
It did not work and gave me this error:
TypeError: read_table() got an unexpected keyword argument 'nrows'
I did try the skiprows argument as well, but that also gave me the same error.
Alternatively, I could read the complete parquet file and take only the first n rows, but that requires more computation, which I want to avoid.
Is there any way to achieve it?
The accepted answer is out of date. It is now possible to read only the first few rows of a parquet file into pandas, though it is a bit messy and backend dependent.
To read using PyArrow as the backend, do the following:
from pyarrow.parquet import ParquetFile
import pyarrow as pa
pf = ParquetFile('file_name.pq')
first_ten_rows = next(pf.iter_batches(batch_size = 10))
df = pa.Table.from_batches([first_ten_rows]).to_pandas()
Change the line batch_size = 10 to match however many rows you want to read in.
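If you also need only a subset of columns, iter_batches accepts a columns argument as well (the column names below are placeholders):
first_ten_rows = next(pf.iter_batches(batch_size = 10, columns = ['col_a', 'col_b']))
df = pa.Table.from_batches([first_ten_rows]).to_pandas()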
After exploring around and getting in touch with the pandas dev team, the conclusion is that pandas does not support the nrows or skiprows arguments when reading a parquet file.
The reason is that pandas uses the pyarrow or fastparquet engines to process parquet files, and pyarrow has no support for reading a file partially or skipping rows (not sure about fastparquet). Below is the link to the issue on the pandas GitHub for discussion.
https://github.com/pandas-dev/pandas/issues/24511
As an alternative you can use the S3 Select functionality from the AWS SDK for pandas, as proposed by Abdel Jaidi in this answer.
pip install awswrangler
import awswrangler as wr
df = wr.s3.select_query(
    sql="SELECT * FROM s3object s limit 5",
    path="s3://filepath",
    input_serialization="Parquet",
    input_serialization_params={},
    use_threads=True,
)
Parquet is column-oriented storage, designed for exactly that, so it is normal to have to load the whole file just to access a single row.
Using a pyarrow dataset scanner:
import pyarrow.dataset as ds

n = 10
src_path = "/parquet/path"
df = ds.dataset(src_path).scanner().head(n).to_pandas()
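Because parquet is columnar, you can also restrict the scan to specific columns; a small sketch with a placeholder column name:
df = ds.dataset(src_path).scanner(columns=["col_a"]).head(n).to_pandas()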
The most straightforward option for me seems to be using the dask library:
import dask.dataframe as dd
df = dd.read_parquet(path= 'filepath').head(10)
I'm trying to deploy a training script on Google Cloud ML. Of course, I've uploaded my datasets (CSV files) to a bucket on GCS.
I used to import my data with read_csv from pandas, but it doesn't seem to work with a GCS path.
How should I proceed (I would like to keep using pandas)?
import pandas as pd
data = pd.read_csv("gs://bucket/folder/file.csv")
Output:
ERROR 2018-02-01 18:43:34 +0100 master-replica-0 IOError: File gs://bucket/folder/file.csv does not exist
You will need to use file_io from tensorflow.python.lib.io to do that, as demonstrated below:
from tensorflow.python.lib.io import file_io
from pandas.compat import StringIO
from pandas import read_csv
# read csv file from google cloud storage
def read_data(gcs_path):
    file_stream = file_io.FileIO(gcs_path, mode='r')
    csv_data = read_csv(StringIO(file_stream.read()))
    return csv_data
Now call the above function:
gcs_path = 'gs://bucket/folder/file.csv' # change the path according to your bucket, folder and file
df = read_data(gcs_path)
# print(df.head()) # displays top 5 rows including headers as default
Pandas does not have native GCS support. There are two alternatives:
1. copy the file to the VM using the gsutil CLI (see the sketch after this list)
2. use the TensorFlow file_io library to open the file, and pass the file object to pd.read_csv(). Please refer to the detailed answer here.
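A minimal sketch of option 1, assuming the gsutil CLI is available on the machine:
import subprocess
import pandas as pd
# copy the object from the bucket to local disk, then read it as a normal file
subprocess.run(["gsutil", "cp", "gs://bucket/folder/file.csv", "file.csv"], check=True)
df = pd.read_csv("file.csv")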
You could also use Dask to extract and then load the data into, let's say, a Jupyter Notebook running on GCP.
Make sure you have Dask installed.
conda install dask #conda
pip install dask[complete] #pip
import dask.dataframe as dd #Import
dataframe = dd.read_csv('gs://bucket/datafile.csv') #Read CSV data
dataframe2 = dd.read_csv('gs://bucket/path/*.csv') #Read multiple CSV files
This is all you need to load the data.
You can filter and manipulate data with Pandas syntax now.
dataframe['z'] = dataframe.x + dataframe.y
dataframe_pd = dataframe.compute()