ignore columns not present in parquet with pyarrow in pandas - python

I am trying to read a parquet file with pyarrow==1.0.1 as the engine.
Given:
columns = ['a', 'b', 'c']
pd.read_parquet(x, columns=columns, engine="pyarrow")
if the file x does not contain the column c, it raises:
/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset._scanner()
/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.from_dataset()
/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset._populate_builder()
/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: Field named 'c' not found or not unique in the schema.
There is no argument to ignore the missing columns and simply read them as NaN.
The error handling is also pretty bad.
pyarrow.lib.ArrowInvalid("Field named 'c' not found or not unique in the schema.")
It is pretty hard to extract the field name that was missing, so that it can be removed from the columns passed in the next try.
Is there a way to do this?

You can read the metadata from your parquet file to figure out which columns are available.
Bear in mind though that pandas won't be able to guess the type of the missing column (c in the example below), which may cause issues when you concatenate tables later.
import pandas as pd
import pyarrow.parquet as pq
all_columns = ['a', 'b', 'c']
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'z']})
file_name = '/tmp/my_df.pq'
df.to_parquet(file_name)
parquet_file = pq.ParquetFile(file_name)
columns_in_file = [c for c in all_columns if c in parquet_file.schema.names]
df = (
    parquet_file
    .read(columns=columns_in_file)
    .to_pandas()
    .reindex(columns=all_columns)
)
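The columns filled in by reindex typically come back as all-NaN float64 columns. If you know the intended dtypes up front, you can cast them explicitly so later concatenations stay consistent; a minimal sketch, where the expected_dtypes mapping is hypothetical:
# hypothetical mapping of the dtypes you expect for columns that may be missing
expected_dtypes = {'c': 'Int64'}
missing = [c for c in all_columns if c not in columns_in_file]
df = df.astype({c: expected_dtypes[c] for c in missing if c in expected_dtypes})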

Related

Get only the name of a DataFrame - Python - Pandas

I'm actually working on an ETL project with crappy data that I'm trying to get right.
For this, I'm trying to create a function that takes the names of my DFs and exports them to CSV files that will be easy for me to deal with in Power BI.
I've started with a function that will take my DFs and clean the dates:
import pandas as pd

df_liste = []

def facture(x):
    x = pd.DataFrame(x)
    for s in x.columns.values:
        if s.__contains__("Fact"):
            x.rename(columns={s: 'periode_facture'}, inplace=True)
            x['periode_facture'] = x[['periode_facture']].apply(lambda x: pd.to_datetime(x, format='%Y%m'))
If I don't set 'x' as a DataFrame, it doesn't work, but that's not my problem.
As you can see, I have set up a list variable which I would like to fill with the names of the DFs, and the names only. Unfortunately, after a lot of tries, I haven't succeeded yet, so... there it is, my first question on Stack ever!
Just in case, this is the first version of the function I would like to have:
def export(x):
    for df in x:
        df.to_csv(f'{df}.csv', encoding='utf-8')
You'd have to set the name of your dataframe first using df.name (probably when you are creating them / reading data into them).
Then you can access the name like a normal attribute:
import pandas as pd
df = pd.DataFrame(data=[1, 2, 3])
df.name = 'my df'
and can then use:
df.to_csv(f'{df.name}.csv', encoding='utf-8')
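Applied to the export function from the question, a minimal sketch (assuming every DataFrame in the list has had its .name attribute set as above) would be:
def export(dfs):
    # dfs: list of DataFrames whose .name attribute has been set beforehand
    for df in dfs:
        df.to_csv(f'{df.name}.csv', encoding='utf-8')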

How to filter different partition in Dask read_parquet function

I've encountered a problem with loading Dask dataframes from a parquet file.
Basically, I stored my parquet file partitioned into categories: aircraft name (AIRCRAFT=name_aircraft), progressive number (a number which identifies each mission of the aircraft: PROGRESSIVE=number), year, month and day.
When I try to read the parquet file into a Dask dataframe, I succeed in filtering on the year and progressive windows but fail to select only some aircraft.
Here is the call I use to read the parquet file:
ddf = dd.read_parquet(path, engine="pyarrow", index=False, filters=filters)
where path is the path to the .parquet file and filters is a list of tuples with the conditions I want to filter on, for example:
filters = [('PROGRESSIVE', '>=', 0), ('PROGRESSIVE', '<=', 999), ('year', '>', 1999), ('year', '<', 2021), ('AIRCRAFT', '=', 'Aircraft-5')]
Now with this kind of filters everything is OK, but if I want to select multiple aircraft or, for example, progressive numbers which are not in the same range window (let's say only 753, 800 and 883), I cannot load the dataframe properly.
For example, if I set
filters = [('PROGRESSIVE', '>=', 0), ('PROGRESSIVE', '<=', 999), ('year', '>', 1999), ('year', '<', 2021), ('AIRCRAFT', '=', 'Aircraft-4'), ('AIRCRAFT', '=', 'Aircraft-5')]
then the loaded dataframe is empty: len(ddf_filtered_demo.index) is 0, while selecting only one aircraft gives a non-empty, correct result.
The problem is that I can select a range of values (< or >) but cannot select only some specific elements.
What is the correct way of loading a Dask dataframe from a parquet file, selecting only some partitions that do not belong to a single range of values?
The fastparquet interface supports in, so you could do
filters = [('PROGRESSIVE', 'in', [753, 800, 883]), ...]
I don't know whether arrow supports this syntax; you can try.
The specific non-working example for you sounds like a bug, and you should report it. Ideally, you can recreate it with a minimal example dataset that you create in code.
The following is a simple example which works with the current fastparquet main branch (which is what I happen to have installed locally).
import pandas as pd

# make data
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': list("abcd")})
df.to_parquet("out.parq", partition_on='a', engine="fastparquet")
# read
pd.read_parquet("out.parq", filters=[('a', 'in', [1, 2, 4])], engine='fastparquet')
Since the OP's question used pyarrow: note that the Dask documentation on the filters kwarg of dd.read_parquet says:
"List of filters to apply, like [[('col1', '==', 0), ...], ...]. Using this argument will NOT result in row-wise filtering of the final partitions unless engine="pyarrow-dataset" is also specified. For other engines, filtering is only performed at the partition level, i.e., to prevent the loading of some row-groups and/or files."
In other words, when using a non-pyarrow engine, filters cannot pull out specific rows satisfying the filter condition unless the parquet file was created with the right partition key (partition_on='a' in the case of the answer above).
So while this code works for both fastparquet and pyarrow engines:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame(
{"letter": ["a", "b", "c", "a", "a", "d"], "number": [1, 2, 3, 4, 5, 6]}
)
ddf = dd.from_pandas(df, npartitions=3)
ddf.to_parquet("tmp/partition_on.parquet", engine="fastparquet", partition_on="letter")
# write dummy df to parquet using partition_on kwarg
ddf_1 = dd.read_parquet(
"tmp/partition/1", engine="fastparquet", filters=[("letter", "==", "a")]
)
This code works only for pyarrow:
# write dummy df to parquet without partition_on kwarg
ddf.to_parquet("tmp/without_partition_on.parquet", engine="pyarrow")
ddf_2 = dd.read_parquet(
"tmp/non_partitioned_on.parquet", engine="pyarrow", filters=[("letter", "==", "a"), ("number", ">=", 2)]
)
The example above is an MCVE version of the OP's code and it works, which suggests to me that either something is up with your data quality or you should report a bug.
Modifying the example from mdurant to be specific to dask would be:
import dask.dataframe as dd
import pandas as pd
# make data
df = dd.from_pandas(pd.DataFrame({'a': [1, 2, 3, 4], 'b': list("abcd")}), npartitions=1)
df.to_parquet("out.parq", partition_on='a', engine="fastparquet")
# read
dd.read_parquet("out.parq", filters=[('a', 'in', (1, 2, 4))], engine='fastparquet').compute()
It looks like dask performs a flattening of the filters (link), so you'll get a TypeError if you have a list in your filter (the _flatten_filter() function attempts to build a set, and lists are not hashable). But if you provide the values as a tuple rather than a list, the filter works properly.

DataFrame.lookup requires unique index and columns with a recent version of Pandas

I am working with python3.7, and I am facing an issue with a recent version of pandas.
Here is my code.
import pandas as pd
import numpy as np
data = {'col_1': [9087.6000, 9135.8000, np.nan, 9102.1000],
        'col_2': [0.1648, 0.1649, '', 5.3379],
        'col_nan': [np.nan, np.nan, np.nan, np.nan],
        'col_name': ['col_nan', 'col_1', 'col_2', 'col_nan']
        }
df = pd.DataFrame(data, index=[101, 102, 102, 104])
col_lookup = 'results'
col_result = 'col_name'
df[col_lookup] = df.lookup(df.index, df[col_result])
The code works fine with pandas version 1.0.3, but
when I try with version 1.1.1 the following error occurs:
"ValueError: DataFrame.lookup requires unique index and columns"
The dataframe indeed includes a duplication of the index "102".
For different reasons, I have to work with version 1.1.1 of pandas. Is there a solution with the "lookup" command to support index duplication with this version of pandas?
Thanks in advance for your help.
Put a unique index in place then restore the old index...
import pandas as pd
import numpy as np
data = {'col_1': [9087.6000, 9135.8000, np.nan, 9102.1000],
        'col_2': [0.1648, 0.1649, '', 5.3379],
        'col_nan': [np.nan, np.nan, np.nan, np.nan],
        'col_name': ['col_nan', 'col_1', 'col_2', 'col_nan']
        }
df = pd.DataFrame(data, index=[101, 102, 102, 104])
col_lookup = 'results'
col_result = 'col_name'
df.reset_index(inplace=True)
df[col_lookup] = df.lookup(df.index, df[col_result])
df = df.set_index(df["index"]).drop(columns="index")
Non-unique index was a bug: Github link
"look up" method in pandas 1.1.1 does not allows you to pass non-unique index as input argument.
following code has been added at the beginning of "lookup" method in "frame.py" which for me is in(line 3836):
C:\Users\Sajad\AppData\Local\Programs\Python\Python38\Lib\site-packages\pandas\core\frame.py
if not (self.index.is_unique and self.columns.is_unique):
    # GH#33041
    raise ValueError("DataFrame.lookup requires unique index and columns")
However, if this check did not exist, the procedure in this method would eventually end up in a for loop. Substituting the last line of your code with this equivalent loop gives you the same result as previous pandas versions:
result = np.empty(len(df.index), dtype="O")
for i, (r, c) in enumerate(zip(df.index, df[col_result])):
    result[i] = df._get_value(r, c)
df[col_lookup] = result
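If you would rather not rely on the private _get_value API, a vectorized sketch of the same idea (not part of the original answers, and assuming every value in col_name is an existing column) is:
import numpy as np

# positional index of the column to pick in each row, then fancy-index the values
col_idx = df.columns.get_indexer(df[col_result])
df[col_lookup] = df.to_numpy()[np.arange(len(df)), col_idx]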

Save and load correctly pandas dataframe in csv while preserving freq of datetimeindex

I was trying to save a DataFrame and load it back. If I print the resulting df, I see they are (almost) identical. The freq attribute of the DatetimeIndex is not preserved, though.
My code looks like this
import datetime
import os
import numpy as np
import pandas as pd
def test_load_pandas_dataframe():
    idx = pd.date_range(start=datetime.datetime.now(),
                        end=(datetime.datetime.now()
                             + datetime.timedelta(hours=3)),
                        freq='10min')
    a = pd.DataFrame(np.arange(2*len(idx)).reshape((len(idx), 2)), index=idx,
                     columns=['first', 2])
    a.to_csv('test_df')
    b = load_pandas_dataframe('test_df')
    os.remove('test_df')
    assert np.all(b == a)


def load_pandas_dataframe(filename):
    '''Correctly loads dataframe but freq is not maintained'''
    df = pd.read_csv(filename, index_col=0,
                     parse_dates=True)
    return df


if __name__ == '__main__':
    test_load_pandas_dataframe()
And I get the following error:
ValueError: Can only compare identically-labeled DataFrame objects
It is not a big issue for my program, but it is still annoying.
Thanks!
The issue here is that the dataframe you save has columns
Index(['first', 2], dtype='object')
but the dataframe you load has columns
Index(['first', '2'], dtype='object').
In other words, the columns of your original dataframe had the integer 2, but upon saving it with to_csv and loading it back with read_csv, it is parsed as the string '2'.
The easiest fix that passes your assertion is to change line 13 to:
columns=['first', '2'])
To complement @jfaccioni's answer: since the freq attribute is not preserved, there are two options here.
Fast and simple: use pickle, which will preserve everything:
a.to_pickle('test_df')
b = pd.read_pickle('test_df')
a.equals(b) # True
Or you can use the inferred_freq attribute from a DatetimeIndex:
a.to_csv('test_df')
b = pd.read_csv('test_df', index_col=0, parse_dates=True)
b.index.freq = b.index.inferred_freq
print(b.index.freq)  # <10 * Minutes>
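Putting both fixes together (and assuming, per the first answer, that the column is named with the string '2' from the start), a sketch of the full round trip that passes the original assertion:
import numpy as np
import pandas as pd

idx = pd.date_range(start='2021-01-01', periods=19, freq='10min')
a = pd.DataFrame(np.arange(2 * len(idx)).reshape((len(idx), 2)),
                 index=idx, columns=['first', '2'])
a.to_csv('test_df')

b = pd.read_csv('test_df', index_col=0, parse_dates=True)
b.index.freq = b.index.inferred_freq  # restore the 10-minute freq

assert np.all(b == a)
assert b.index.freq == a.index.freq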

Collecting data in loop using Pythons Pandas Dataframe

I am trying to extract data from a CSV file using Python's pandas module. The experiment data has 6 columns (let's say a, b, c, d, e, f) and I have a list of model directories. Not every model has all 6 'species' (columns), so I need to split the data specifically for each model. Here is my code:
def read_experimental_data(self, experiment_path):
    [path, fle] = os.path.split(experiment_path)
    os.chdir(path)
    data_df = pandas.read_csv(experiment_path)
    # print data_df
    experiment_species = data_df.keys()  # (a, b, c, d, e, f)
    # print experiment_species
    for i in self.all_models_dirs:  # iterate through a list of model directories
        [path, fle] = os.path.split(i)
        model_specific_data = pandas.DataFrame()
        species_dct = self.get_model_species(i + '.xml')  # gives all the species (columns) in this particular model
        # print species_dct
        # gives me only species that are included in model dir i
        for l in species_dct.keys():
            for m in experiment_species:
                if l == m:
                    # how do I collate these pandas series into a single dataframe?
                    print data_df[m]
The above code gives me the correct data, but I'm having trouble collecting it in a usable format. I've tried to merge and concatenate them, but no joy. Does anybody know how to do this?
Thanks
You can create a new DataFrame from data_df by passing it a list of the columns you want:
import pandas as pd
df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6], 'c': [7,8,9]})
df_filtered = df[['a', 'c']]
or an example using some of your variable names,
import pandas as pd
data_df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6],
                        'd': [7, 8], 'e': [9, 10], 'f': [11, 12]})
experiment_species = data_df.keys()
species_dct = ['b', 'd', 'e', 'x', 'y', 'z']
good_columns = list(set(experiment_species).intersection(species_dct))
df_filtered = data_df[good_columns]
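Applied back to the loop in the question, a rough outline (reusing the question's own names such as self.all_models_dirs and self.get_model_species, so treat it as an untested sketch) could collect one filtered DataFrame per model instead of printing single columns:
model_data = {}
for i in self.all_models_dirs:
    species_dct = self.get_model_species(i + '.xml')
    good_columns = [c for c in data_df.keys() if c in species_dct]
    model_data[i] = data_df[good_columns]  # one frame per model with only its own species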
