Extract partition list from pyarrow.dataset.ParquetFileFragment object - python

I have a pyarrow.dataset.ParquetFileFragment object like this:
<pyarrow.dataset.ParquetFileFragment path=pq-test/Location=US-California/Industry=HT-SoftWare/dce9900c46f94ec3a8dca094cf62bd34-0.parquet partition=[Industry=HT-SoftWare, Location=US-California]>
I can get the path using .path, but I can't find a .partition attribute that gives me the partition list. Is there any way to grab it?

There is an open PR that would expose ds.get_partition_keys publicly: https://github.com/apache/arrow/pull/33862/files. It would give you a nice dict from the partition_expression attribute of a ds.ParquetFileFragment.
Note that you have to pass the partitioning parameter when you read the dataset in order to get a valid expression:
>>> import pyarrow as pa
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.dataset as ds
>>> ds.write_dataset(table, "dataset_name_fragments", format="parquet",
...                  partitioning=["year"], partitioning_flavor="hive")
>>> dataset = ds.dataset('dataset_name_fragments/', format="parquet", partitioning="hive")
>>> fragments = dataset.get_fragments()
>>> fragment = next(fragments)
>>> fragment.partition_expression
<pyarrow.compute.Expression (year == 2019)>
It would also be great to have an attribute that returns the partition list directly; that is planned to be added to the mentioned PR. Until then, a workaround is sketched below.
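In the meantime, a minimal workaround sketch, assuming a recent pyarrow where the private module-level helper ds._get_partition_keys() exists (that is the function the PR makes public, so its name and availability can vary by version):

import pyarrow.dataset as ds

dataset = ds.dataset('dataset_name_fragments/', format="parquet", partitioning="hive")
fragment = next(dataset.get_fragments())

# Private helper: converts the fragment's partition expression into a dict,
# e.g. {'year': 2019}; being private, it may change between pyarrow versions.
print(ds._get_partition_keys(fragment.partition_expression))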

Related

How to handle a list which contains elements having leading zeros?

I have a data which is coming from another source in the form of a nested list which looks as below:
data = [
["store1", 50, 02132020],
["store2", 20, 02112020],
["store3", 25, 02172020]
]
Here, 50 is the price, and 02132020 is the date.
When I print the data, I get the error below:
leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers
How to deal with list whose elements may contain "leading zeros"?
My suggestion is to cast the third element in each list to a string, and then use datetime to parse it into a date object, since it is a date.
I don't know the exact source of the data. First, it cannot be a vanilla Python list literal, since it would not pass the interpreter's syntax check. Second, it also makes no sense as a JSON object: something like 02132020 is not allowed to appear in JSON.
import datetime

data = [
    ["store1", 50, "02132020"],
    ["store2", 20, "02112020"],
    ["store3", 25, "02172020"]
]

for i in data:
    print({
        "store_name": i[0],
        "price": i[1],
        "date": datetime.datetime.strptime(i[2], '%m%d%Y').date()
    })
Output:
{'store_name': 'store1', 'price': 50, 'date': datetime.date(2020, 2, 13)}
{'store_name': 'store2', 'price': 20, 'date': datetime.date(2020, 2, 11)}
{'store_name': 'store3', 'price': 25, 'date': datetime.date(2020, 2, 17)}
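As a small follow-up, if the zero-padded string form is needed again later (for display or export), strftime restores it from the parsed date; a brief sketch:

import datetime

d = datetime.datetime.strptime("02132020", '%m%d%Y').date()
print(d.strftime('%m%d%Y'))  # '02132020' -- the leading zero survives because this is a string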

Change string format into date time format or separate them via Python?

I have a column of data with values like 20181, 20182, 20183, 20184. I am wondering how to change them to the 2018Q1, 2018Q2 format, or how to separate them into two columns, one for the year and the other for the quarter.
Since years are, for the foreseeable future, always 4 characters long, you can simply split at the 4th character like this:
s = '20181'
year, quarter = s[:4], s[4:]
# You can now use year and quarter separately or merge them back in the given format:
print(year + 'Q' + quarter)
You can try this:
Adding a character:

def subby(values):
    # Build '2018Q1'-style strings by splitting each number at the 4th digit
    lista = []
    for each in values:
        string = str(each)
        part1, part2 = string[:4], string[-1:]
        lista.append(part1 + 'Q' + part2)
    return lista

print(subby([20181, 20182, 20183, 20184]))

Output:
['2018Q1', '2018Q2', '2018Q3', '2018Q4']
Converting to a dictionary:

def dicty(values):
    # Split each number into its year (first 4 digits) and quarter (last digit)
    result = {"year": [],
              "quarter": []}
    for each in values:
        string = str(each)
        result['year'].append(int(string[:4]))
        result['quarter'].append(int(string[-1:]))
    return result

print(dicty([20181, 20182, 20183, 20184]))

Output:
{'year': [2018, 2018, 2018, 2018], 'quarter': [1, 2, 3, 4]}
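If the column actually lives in a pandas DataFrame (the question speaks of "a column of data"), here is a minimal sketch; the column name period is an assumption, not from the question:

import pandas as pd

df = pd.DataFrame({'period': [20181, 20182, 20183, 20184]})  # 'period' is a hypothetical column name
s = df['period'].astype(str)
df['year'] = s.str[:4].astype(int)          # 2018
df['quarter'] = s.str[4:].astype(int)       # 1, 2, 3, 4
df['label'] = s.str[:4] + 'Q' + s.str[4:]   # '2018Q1', ...
print(df)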

Efficiently find "duplicates" between two lists with dict-elements comparing just a subset of the dict-fields

What makes my question different...
...from similar ones here on StackOverflow: in the MWE below, the dict has 4 fields, but only 2 of them are relevant for the comparison.
The problem...
...is easy, and I already provide a working solution in my MWE below, but I would like to know whether there is a more efficient, pythonic way to handle it -- efficient in terms of CPU time.
Two lists with dict elements need to be checked for "duplicates". A duplicate is defined by the fields title and date being equal; the content of the other fields (flag and misc) does not matter. To add another factor: the dict elements can have a different number of fields, but title and date are always present.
My solution
I simply translate my human words into python words (dupli and data are the two lists):
for d in dupli[:]:
    for e in data:
        # compare by 'title' and 'date'
        if e['title'] == d['title'] and e['date'] == d['date']:
            print('Duplicate found: {}'.format(d))
            # remove duplicates
            dupli.remove(d)
My full MWE
This MWE generates the list data with 100 random elements. The second list dupli has 10 elements, 3 of which have "duplicates" (matching title and date, but not misc or flag) in the first list.
#!/usr/bin/env python3
import random
import string


# helper function
def random_string(n):
    return ''.join([random.choice(string.ascii_letters) for i in range(n)])


# Create persistent data
def create_data(n):
    container = []
    for idx in range(n):
        # create an element with title, date and some misc data
        element = {
            'title': random_string(random.randrange(35)),
            'date': [2020, 5, random.randint(1, 30)],
            'flag': (random.random() < 0.5),  # bool
            'misc': random_string(random.randrange(5)),
        }
        container.append(element)
    return container


# persistent data
data = create_data(100)

# new data elements with possible duplicates
dupli = create_data(10)

# generate 3 duplicates
for idx in [2, 5, 8]:
    dupli[idx]['title'] = data[idx*10]['title']
    dupli[idx]['date'] = data[idx*10]['date']

# MY APPROACH to check for duplicates
for d in dupli[:]:
    for e in data:
        # compare by 'title' and 'date'
        if e['title'] == d['title'] and e['date'] == d['date']:
            print('Duplicate found: {}'.format(d))
            # remove duplicates
            dupli.remove(d)

# add the rest to persistent data
data.extend(dupli)
Background information
This is only an MWE. The real-world data is more complex, which is why efficiency matters to me. The following numbers depend on my users of course, but just to give you an idea:
A dict element has 4 to 12 fields.
A list has 200 to 20,000 elements.
There are 30 to 300 lists in my application.
On the other side, for each list there is a list with new elements: 10 to 40 elements.
I hope I understood your question. You can first prepare a set() of (<title>, <date>) tuples and then just check whether each element from the data list is in this set - this will be O(n).
For example:
elems = set((d['title'], tuple(d['date'])) for d in dupli)

for e in data:
    if (e['title'], tuple(e['date'])) in elems:
        print('Duplicate found: {}'.format(e))
Prints:
Duplicate found: {'title': 'ENhIdksxeNKqCgbbg', 'date': [2020, 5, 5], 'flag': True, 'misc': ''}
Duplicate found: {'title': 'MRyXAmfJjSNjrXYTNpPRQFP', 'date': [2020, 5, 3], 'flag': True, 'misc': 'oTmr'}
Duplicate found: {'title': 'IyeSazUnquqTwYXGnTjHelFGr', 'date': [2020, 5, 12], 'flag': True, 'misc': 'CdhG'}
EDIT:
In set((d['title'], tuple(d['date'])) for d in dupli) I must first create a tuple from d['date'], because its original type is list (unhashable), and I cannot add a list to a set.
It would be better to store d['date'] as a tuple in the first place (to skip this conversion step and speed things up):
'date': (2020, 5, random.randint(1, 30)), # <-- tuple, not list
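To match the original goal of removing the duplicates from dupli (rather than only printing the matches found in data), the same set-based lookup can be turned around; a minimal sketch reusing the MWE's variable names:

# Build the lookup set from the persistent data once,
# then keep only the new elements whose (title, date) key is not already present.
seen = set((e['title'], tuple(e['date'])) for e in data)
dupli = [d for d in dupli if (d['title'], tuple(d['date'])) not in seen]

# add the remaining (non-duplicate) elements to the persistent data
data.extend(dupli)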

Make a nested JSON from online zipped CSV

I'm fairly new to Python and I need to make a nested JSON out of an online zipped CSV file, using standard libraries only and specifically in Python 2.7. I've figured out accessing and unzipping the file but am having some trouble with the parsing. Basically, I need to make a JSON output that contains three high-level elements for each primary key:
The primary key (which is made up of columns 0, 2, 3 & 4)
A dictionary that is a time series of the observed values for that PK (i.e. date: observed value)
A dictionary of metadata (the product, flow type, units, and ideally a nested time series of the quality for each observed point).
from StringIO import StringIO
from urllib import urlopen
from zipfile import ZipFile
from datetime import datetime
import itertools as it
import csv
import sys
url = urlopen("https://www.jodidata.org/_resources/files/downloads/gas-data/jodi_gas_csv_beta.zip")
myzip = ZipFile(StringIO(url.read()))
with myzip.open('jodi_gas_beta.csv', 'r') as myCSV:
    # Read the data
    reader = csv.DictReader(myCSV)
    # Sort the data by PK + Time for timeseries
    reader = sorted(reader, key=lambda row: row['REF_AREA'], row['ENERGY_PRODUCT'], row['FLOW_BREAKDOWN'], row['UNIT_MEASURE'], row['TIME_PERIOD']))
    # initialize dictionaries for output
    myData = []
    keys = []
    groups = []
    # limiting to first 200 rows for testing ONLY
    for k, g in it.groupby(list(it.islice(reader, 200)), key=lambda row: row['REF_AREA'], row['ENERGY_PRODUCT'], row['FLOW_BREAKDOWN'], row['UNIT_MEASURE'])):
        keys.append(k)
        groups.append(list(g))
        myData.append({'MyPK': ''.join(k),  # captures the PKs
                       'TimeSeries': dict((zip(e['TIME_PERIOD'], e['OBS_VALUE']))) for e in g],  # Not working properly, want a time series dictionary here
                       # TODO: Dictionary of metadata here (with nested time series, if possible)})
# TODO: Output as a JSON string
So, the result should look something like this:
{
    "myPK": "AENATGASEXPLNGM3",
    "TimeSeries": [
        ["2015-01", 756],
        ["2015-02", 572],
        ["2015-03", 654]
    ],
    "Metadata": {
        "Country": "AE",
        "Product": "NATGAS",
        "Flow": "EXPLNG",
        "Unit": "M3",
        "Quality": [
            ["2015-01", 3],
            ["2015-02", 3],
            ["2015-03", 3]
        ]
    }
}
Although you don't appear to have put much effort into solving the problem yourself, here's something I think does what you want. It makes use of the operator.itemgetter() function to simplify retrieving a series of different items from the various containers (such as lists and dicts).
I also modified the code to more closely follow the PEP 8 - Style Guide for Python Code.
import datetime
import csv
from operator import itemgetter
import itertools as it
import json
from StringIO import StringIO
import sys
from urllib import urlopen
from zipfile import ZipFile


# Utility.
def typed_itemgetter(items, callables):
    """ Like operator.itemgetter() but also applies corresponding callable to
        each retrieved value if it's not None. Creates and returns a function.
    """
    return lambda row: [f(value) if f else value
                        for value, f in zip(itemgetter(*items)(row), callables)]


url = urlopen("https://www.jodidata.org/_resources/files/downloads/gas-data/jodi_gas_csv_beta.zip")
myzip = ZipFile(StringIO(url.read()))
with myzip.open('jodi_gas_beta.csv', 'r') as myCSV:
    reader = csv.DictReader(myCSV)
    primary_key = itemgetter('REF_AREA', 'ENERGY_PRODUCT', 'FLOW_BREAKDOWN', 'UNIT_MEASURE',
                             'TIME_PERIOD')
    reader = sorted(reader, key=primary_key)

    # Limit to first 200 rows for TESTING.
    reader = [row for row in it.islice(reader, 200)]

    # Group the data by designated keys (aka "primary key").
    keys, groups = [], []
    keyfunc = itemgetter('REF_AREA', 'ENERGY_PRODUCT', 'FLOW_BREAKDOWN', 'UNIT_MEASURE')
    for k, g in it.groupby(reader, key=keyfunc):
        keys.append(k)
        groups.append(list(g))

    # Create corresponding JSON-like Python data-structure.
    myData = []
    for i, group in enumerate(groups):
        result = {'myPK': ''.join(keys[i]),
                  'TimeSeries': [
                      typed_itemgetter(('TIME_PERIOD', 'OBS_VALUE'),
                                       (None, lambda x: int(float(x))))(row)
                      for row in group]
                  }
        metadata = dict(zip(("Country", "Product", "Flow", "Unit"), keys[i]))
        metadata['Quality'] = [typed_itemgetter(
                                   ('TIME_PERIOD', 'ASSESSMENT_CODE'), (None, int))(row)
                               for row in group]
        result['Metadata'] = metadata
        myData.append(result)

# Display the data to be turned into JSON.
from pprint import pprint
print('myData:')
pprint(myData)

# To create JSON format output, use something like:
import json
with open('myData.json', 'w') as fp:
    json.dump(myData, fp, indent=2)
Beginning portion of the output printed:
myData:
[{'Metadata': {'Country': 'AE',
'Flow': 'EXPLNG',
'Product': 'NATGAS',
'Quality': [['2015-01', 3],
['2015-02', 3],
['2015-03', 3],
['2015-04', 3],
['2015-05', 3],
['2015-06', 3],
['2015-07', 3],
['2015-08', 3],
['2015-09', 3],
['2015-10', 3],
['2015-11', 3],
['2015-12', 3],
['2016-01', 3],
['2016-02', 3],
['2016-04', 3],
['2016-05', 3]],
'Unit': 'M3'},
'TimeSeries': [['2015-01', 756],
['2015-02', 572],
['2015-03', 654],
['2015-04', 431],
['2015-05', 681],
['2015-06', 683],
['2015-07', 751],
['2015-08', 716],
['2015-09', 830],
['2015-10', 580],
['2015-11', 659],
['2015-12', 659],
['2016-01', 742],
['2016-02', 746],
['2016-04', 0],
['2016-05', 0]],
'myPK': 'AENATGASEXPLNGM3'},
{'Metadata': {'Country': 'AE',
'Flow': 'EXPPIP',
'Product': 'NATGAS',
'Quality': [['2015-01', 3],
['2015-02', 3],
['2015-03', 3],
['2015-04', 3],
['2015-05', 3],
['2015-06', 3],
['2015-07', 3],
['2015-08', 3],
['2015-09', 3],
['2015-10', 3],
['2015-11', 3],
['2015-12', 3],
['2016-01', 3],
['2016-02', 3],
['2016-03', 3],
['2016-04', 3],
# etc, etc...
]
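To make the typed_itemgetter() helper easier to follow in isolation, here is a small standalone sketch; the example row dict is made up, not taken from the real CSV:

from operator import itemgetter

def typed_itemgetter(items, callables):
    # Same helper as in the answer: fetch several keys from a row and convert
    # each value with its paired callable, unless that callable is None.
    return lambda row: [f(value) if f else value
                        for value, f in zip(itemgetter(*items)(row), callables)]

row = {'TIME_PERIOD': '2015-01', 'OBS_VALUE': '756.0'}
getter = typed_itemgetter(('TIME_PERIOD', 'OBS_VALUE'), (None, lambda x: int(float(x))))
print(getter(row))  # ['2015-01', 756]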

Converting a set to a list with Pandas groupby agg function causes 'ValueError: Function does not reduce'

Sometimes, it seems that the more I use Python (and Pandas), the less I understand. So I apologise if I'm just not seeing the wood for the trees here but I've been going round in circles and just can't see what I'm doing wrong.
Basically, I have an example script (that I'd like to implement on a much larger dataframe) but I can't get it to work to my satisfaction.
The dataframe consists of columns of various datatypes. I'd like to group the dataframe on 2 columns and then produce a new dataframe that contains lists of all the unique values for each variable in each group. (Ultimately, I'd like to concatenate the list items into a single string – but that's a different question.)
The initial script I used was:
import numpy as np
import pandas as pd

def tempFuncAgg(tempVar):
    tempList = set(tempVar.dropna())  # Drop NaNs and create set of unique values
    print(tempList)
    return tempList

# Define dataframe
tempDF = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                       'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
                       'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
                       'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})

# Groupby based on 2 categorical variables
tempGroupby = tempDF.groupby(['gender','age'])

# Aggregate for each variable in each group using function defined above
dfAgg = tempGroupby.agg(lambda x: tempFuncAgg(x))

print(dfAgg)
The output from this script is as expected: a series of lines containing the sets of values and a dataframe containing the returned sets:
{'09/04/2015 23:03', '21/04/2015 12:59', '06/04/2015 12:34'}
{'01/06/2015 11:09', '12/05/2015 14:19', '27/05/2015 22:31', '19/06/2015 05:37'}
{'15/04/2015 07:12', '19/05/2015 19:22', '06/05/2015 11:12', '04/06/2015 12:57', '15/06/2015 03:23', '12/04/2015 01:00'}
{'02/04/2015 02:34', '10/05/2015 08:52'}
{2, 3, 6}
{18, 11, 13, 14}
{4, 5, 9, 12, 15, 17}
{1, 10}
date \
gender age
female old set([09/04/2015 23:03, 21/04/2015 12:59, 06/04...
young set([01/06/2015 11:09, 12/05/2015 14:19, 27/05...
male old set([15/04/2015 07:12, 19/05/2015 19:22, 06/05...
young set([02/04/2015 02:34, 10/05/2015 08:52])
id
gender age
female old set([2, 3, 6])
young set([18, 11, 13, 14])
male old set([4, 5, 9, 12, 15, 17])
young set([1, 10])
The problem occurs when I try to convert the sets to lists. Bizarrely, it produces 2 duplicated rows containing identical lists but then fails with a 'ValueError: Function does not reduce' error.
def tempFuncAgg(tempVar):
    tempList = list(set(tempVar.dropna()))  # This is the only difference
    print(tempList)
    return tempList

tempDF = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                       'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
                       'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
                       'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})

tempGroupby = tempDF.groupby(['gender','age'])
dfAgg = tempGroupby.agg(lambda x: tempFuncAgg(x))
print(dfAgg)
But now the output is:
['09/04/2015 23:03', '21/04/2015 12:59', '06/04/2015 12:34']
['09/04/2015 23:03', '21/04/2015 12:59', '06/04/2015 12:34']
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
...
ValueError: Function does not reduce
Any help to troubleshoot this problem would be appreciated and I apologise in advance if it's something obvious that I'm just not seeing.
EDIT
Incidentally, converting the set to a tuple rather than a list works with no problem.
Lists can sometimes cause weird problems in pandas. You can either:
Use tuples (as you've already noticed), or
If you really need lists, do the conversion in a second operation, like this:
dfAgg = dfAgg.applymap(lambda x: list(x))
Full example:
import numpy as np
import pandas as pd

def tempFuncAgg(tempVar):
    tempList = set(tempVar.dropna())  # Drop NaNs and create set of unique values
    print(tempList)
    return tempList

# Define dataframe
tempDF = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                       'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
                       'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
                       'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})

# Groupby based on 2 categorical variables
tempGroupby = tempDF.groupby(['gender','age'])

# Aggregate for each variable in each group using function defined above
dfAgg = tempGroupby.agg(lambda x: tempFuncAgg(x))

# Transform the sets into lists (applymap returns a new DataFrame, so reassign it)
dfAgg = dfAgg.applymap(lambda x: list(x))

print(dfAgg)
There are many such bizarre behaviours in pandas; it is generally better to go with a workaround like this than to hunt for a perfect solution.
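One hedged side note on pandas versions: pandas 2.1 deprecated DataFrame.applymap in favour of DataFrame.map, so on newer installations the second step would read:

dfAgg = dfAgg.map(list)        # pandas >= 2.1
# dfAgg = dfAgg.applymap(list) # older pandas versions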
