Apache Beam Find Top N elements Python SDK - python

My requirement is to read from a BQ table, do some processing, select the top N rows based on the "score" column, write them to a BQ table, and also publish the rows as Pub/Sub messages.
I made the sample below to create a PCollection and select the top 5 rows from it based on the value of "score".
import apache_beam as beam

with beam.Pipeline() as p:
    elements = (
        p
        | beam.Create([{"name": "Bob", "score": 5},
                       {"name": "Adam", "score": 5},
                       {"name": "John", "score": 2},
                       {"name": "Jose", "score": 1},
                       {"name": "Tim", "score": 1},
                       {"name": "Bryan", "score": 1},
                       {"name": "Kim", "score": 1}])
        | "Filter Top 5 scores" >> beam.combiners.Top.Of(5, key=lambda t: t['score'])
        | "Print" >> beam.Map(print)
    )
    # Write to BQ
    # Publish to PubSub
This returns a single list instead of a PCollection of rows, so I'm unable to write it back to the BQ table or publish it to PubSub in this format.
What is the best way to select the top N elements but keep them as a PCollection?
In my real use case I might have around 2 million rows, and I need to select 800k records from them based on a column.
Also, in one of the Apache Beam Summit videos I remember hearing that the Top transform keeps its results on one worker node rather than spreading them across workers. That's why I assume the result is a list. What is the maximum number of rows before this could break?

To solve your issue, you can add a FlatMap transform after the Top operator:
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline

def test_pipeline(self):
    with TestPipeline() as p:
        elements = (
            p
            | beam.Create([{"name": "Bob", "score": 5},
                           {"name": "Adam", "score": 5},
                           {"name": "John", "score": 2},
                           {"name": "Jose", "score": 1},
                           {"name": "Tim", "score": 1},
                           {"name": "Bryan", "score": 1},
                           {"name": "Kim", "score": 1}])
            | "Filter Top 5 scores" >> beam.combiners.Top.Of(5, key=lambda t: t['score'])
            | "Flatten" >> beam.FlatMap(lambda e: e)
            | "Print" >> beam.Map(print)
        )
In this case the result is a PCollection of Dict instead of a PCollection of List[Dict].
You have to be careful with high volumes, because the beam.combiners.Top.Of operation is performed on a single worker.
You can also optimize and make your Dataflow job more powerful with Dataflow Prime if needed (Right Fitting, Vertical Autoscaling within a worker, and so on).
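The question also asked about writing the result to BigQuery and publishing it to Pub/Sub, which the answer above does not cover. A rough sketch is shown below; the project, dataset, table, schema, and topic names are placeholders, and note that Pub/Sub IO is intended for streaming pipelines, so a batch job may need to publish through the Pub/Sub client library instead:
import json
import apache_beam as beam

with beam.Pipeline() as p:
    top_rows = (
        p
        | beam.Create([{"name": "Bob", "score": 5},
                       {"name": "Adam", "score": 5},
                       {"name": "John", "score": 2}])
        | beam.combiners.Top.Of(5, key=lambda t: t["score"])
        | beam.FlatMap(lambda rows: rows)  # back to one element per row
    )

    # Write the rows to BigQuery (placeholder table and schema).
    top_rows | beam.io.WriteToBigQuery(
        "my-project:my_dataset.top_scores",
        schema="name:STRING,score:INTEGER",
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    )

    # Publish the same rows to Pub/Sub as JSON-encoded bytes (placeholder topic).
    (
        top_rows
        | beam.Map(lambda row: json.dumps(row).encode("utf-8"))
        | beam.io.WriteToPubSub(topic="projects/my-project/topics/top-scores")
    )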

Related

How to normalize complex JSON structure with 4 levels of nested arrays?

As I'm fairly new to Python, I've tried various ways based on answers found here, but I can't manage to normalize my JSON file.
As I checked in Postman, it has 4 levels of nested arrays. For suppliers I want to expand all the levels of data. The problem I have is with the score_card subtree, from which I want to pull out all risk_groups (with their name and risk_score), then all risks (again with name and risk_score), and finally the indicators at the deepest level, which again contain name and risk_score.
I'm also not sure whether it is possible in one go, given the ambiguous (repeated) names between levels.
Below I'm sharing a reproducible example:
data = {"suppliers": [
{
"id": 1260391,
"name": "2712270 MANITOBA LTD",
"duns": "rm-071-7291",
"erp_number": "15189067;15189067",
"material_group_ids": [
176069
],
"business_unit_ids": [
13728
],
"responsible_user_ids": [
37761,
37760,
37759,
37758,
37757,
37756,
36520,
37587,
36494,
22060,
37742,
36446,
36289
],
"address": {
"address1": "BOX 550 NE 16-20-26",
"address2": None,
"zip_code": "R0J 1W0",
"city": "RUSSELL",
"state": None,
"country_code": "ca",
"latitude": 50.7731176,
"longitude": -101.2862461
},
"score_card": {
"risk_score": 26.13,
"risk_groups": [
{
"name": "Viability",
"risk_score": 43.33,
"risks": [
{
"name": "Financial stability supplier",
"risk_score": None,
"indicators": [
{
"name": "Acquisitions",
"risk_score": None
}
]
}
]
},
]
}
}
]
}
And here is how it should look:
import pandas as pd

expected = [[1260391, '2712270 MANITOBA LTD', 'rm-071-7291', 176069, 13728,
             [37761, 37760, 37759, 37758, 37757, 37756, 36520, 37587, 36494,
              22060, 37742, 36446, 36289],
             'BOX 550 NE 16-20-26', 'None', 'R0J 1W0', 'RUSSELL', 'None', 'ca',
             50.7731176, -101.2862461, 26.13, 'Viability', 43.33,
             'Financial stability supplier', 'None', 'Acquisitions', 'None']]

df = pd.DataFrame(expected, columns=['id', 'name', 'duns', 'material_groups_ids',
                                     'business_unit_ids', 'responsible_user_ids',
                                     'address.address1', 'address.address2',
                                     'address.zip_code', 'address.city',
                                     'address.state', 'address.country_code',
                                     'address.latitude', 'address.longitude',
                                     'risk_score', 'risk_groups.name',
                                     'risk_groups.risk_score', 'risk_groups.risks.name',
                                     'risk_groups.risks.risk_score',
                                     'risk_groups.risks.indicators.name',
                                     'risk_groups.risks.indicators.risk_score'])
What I have tried for the last 3 days is to use json_normalize:
example = pd.json_normalize(data['suppliers'],
                            record_path=['score_card', 'risk_groups', 'risks', 'indicators'],
                            meta=['id', 'name', 'duns', 'erp_number', 'material_group_ids',
                                  'business_unit_ids', 'responsible_user_ids', 'address',
                                  ['score_card', 'risk_score']],
                            record_prefix='_')
but when I specify the fields for record_path, which seems to be the vital parameter here, it always returns this error:
TypeError: {'name': 'Acquisitions', 'risk_score': None} has non list value Acquisitions for path name. Must be list or null.
What I want to get is a table with the columns presented above.
So it seems that json_normalize doesn't know how to treat the most granular level, which contains null values.
I've tried the approach from here: Use json_normalize to normalize json with nested arrays,
but without success.
I would like to get the same view that I was able to 'unpack' in Power BI, but that way of working with the data does not satisfy me.
All risk indicators are components of risks, which in turn are part of risk groups.
Going to the most granular level, the hierarchy is: risk_groups -> risks -> indicators, and I would like to unpack it all.
Any help is more than desperately needed. Thank you kindly for any support here.
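For reference, here is one plain-Python way to flatten the structure that sidesteps the json_normalize error. This is only a sketch and assumes every level is a list exactly as in the sample above (it keeps the id lists as lists rather than scalars):
import pandas as pd

rows = []
for supplier in data["suppliers"]:
    # fields repeated on every output row
    base = {
        "id": supplier["id"],
        "name": supplier["name"],
        "duns": supplier["duns"],
        "material_group_ids": supplier["material_group_ids"],
        "business_unit_ids": supplier["business_unit_ids"],
        "responsible_user_ids": supplier["responsible_user_ids"],
        **{"address." + k: v for k, v in supplier["address"].items()},
        "risk_score": supplier["score_card"]["risk_score"],
    }
    # walk risk_groups -> risks -> indicators and emit one row per indicator
    for group in supplier["score_card"]["risk_groups"]:
        for risk in group["risks"]:
            for indicator in risk["indicators"]:
                rows.append({
                    **base,
                    "risk_groups.name": group["name"],
                    "risk_groups.risk_score": group["risk_score"],
                    "risk_groups.risks.name": risk["name"],
                    "risk_groups.risks.risk_score": risk["risk_score"],
                    "risk_groups.risks.indicators.name": indicator["name"],
                    "risk_groups.risks.indicators.risk_score": indicator["risk_score"],
                })

df = pd.DataFrame(rows)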

Get unique of a column ordered by date in 800 million rows

Input: multiple CSVs with the same columns (800 million rows): [Time Stamp, User ID, Col1, Col2, Col3]
Memory available: 60 GB of RAM and a 24-core CPU
Input/output example: (screenshot omitted)
Problem: I want to group by User ID, sort by Time Stamp, and take the unique values of Col1, dropping duplicates while retaining the order based on the Time Stamp.
Solutions tried:
Tried using joblib to load the CSVs in parallel and pandas to sort and write to CSV (I get an error at the sorting step).
Used Dask (new to Dask):
LocalCluster(dashboard_address=f':{port}', n_workers=4, threads_per_worker=4, memory_limit='7GB') ## Cannot use the full 60 gigs as there are others on the server
ddf = read_csv("/path/*.csv")
ddf = ddf.set_index("Time Stamp")
ddf.to_csv("/outdir/")
Questions:
Assuming Dask will use disk to sort and write the multipart output, will it preserve the order after I read the output using read_csv?
How do I achieve the second part of the problem in Dask? In pandas, I'd just apply and gather the results in a new dataframe:
def getUnique(user_group):  ## assuming the rows for each user are sorted by timestamp
    res = list()
    for val in user_group["Col1"]:
        if val not in res:
            res.append(val)
    return res
Please point me to a better alternative to Dask if there is one.
So, I think I would approach this with two passes. In the first pass, I would run through all the CSV files and build a data structure keyed by user_id and col1 that holds the "best" timestamp. In this case, "best" will be the lowest.
Note: the use of dictionaries here only serves to clarify what we are attempting to do; if performance or memory were an issue, I would first look to reimplement this without them where possible.
So, starting with CSV data like:
[
    {"user_id": 1, "col1": "a", "timestamp": 1},
    {"user_id": 1, "col1": "a", "timestamp": 2},
    {"user_id": 1, "col1": "b", "timestamp": 4},
    {"user_id": 1, "col1": "c", "timestamp": 3},
]
After processing all the CSV files, I hope to have an interim representation of:
{
    1: {'a': 1, 'b': 4, 'c': 3}
}
Note that a representation like this could be created in parallel for each CSV and then re-distilled into a final interim representation via a pass 1.5, if you wanted to do that.
Now we can create a final representation based on the keys of this nested structure, sorted by the innermost value, giving us:
[
    {'user_id': 1, 'col1': ['a', 'c', 'b']}
]
Here is how I might first approach this task before tweaking things for performance.
import csv

all_csv_files = [
    "some.csv",
    "bunch.csv",
    "of.csv",
    "files.csv",
]

data = {}
for csv_file in all_csv_files:
    # with open(csv_file, "r") as file_in:
    #     rows = csv.DictReader(file_in)

    ## ----------------------------
    ## demo data
    ## ----------------------------
    rows = [
        {"user_id": 1, "col1": "a", "timestamp": 1},
        {"user_id": 1, "col1": "a", "timestamp": 2},
        {"user_id": 1, "col1": "b", "timestamp": 4},
        {"user_id": 1, "col1": "c", "timestamp": 3},
    ]
    ## ----------------------------

    ## ----------------------------
    ## First pass to determine the "best" timestamp
    ## for a user_id/col1
    ## ----------------------------
    for row in rows:
        user_id = row['user_id']
        col1 = row['col1']
        ts_new = row['timestamp']

        ts_old = (
            data
            .setdefault(user_id, {})
            .setdefault(col1, ts_new)
        )

        if ts_new < ts_old:
            data[user_id][col1] = ts_new
    ## ----------------------------

print(data)

## ----------------------------
## second pass to set order of col1 for a given user_id
## ----------------------------
data_out = [
    {
        "user_id": outer_key,
        "col1": [
            inner_kvp[0]
            for inner_kvp
            in sorted(outer_value.items(), key=lambda v: v[1])
        ]
    }
    for outer_key, outer_value
    in data.items()
]
## ----------------------------

print(data_out)
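Since the question specifically asks about Dask, the same two-pass idea can also be expressed with Dask dataframes. This is only a rough sketch, assuming the columns are literally named "Time Stamp", "User ID" and "Col1":
import dask.dataframe as dd

ddf = dd.read_csv("/path/*.csv")

# pass 1 (distributed): earliest ("best") timestamp per (User ID, Col1) pair
best = ddf.groupby(["User ID", "Col1"])["Time Stamp"].min().compute()

# pass 2 (in memory -- the deduplicated result should be far smaller than the input):
result = (
    best.reset_index()
        .sort_values(["User ID", "Time Stamp"])
        .groupby("User ID")["Col1"]
        .apply(list)
)
print(result)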

Most Efficiently Iterating Large List of Dictionaries in Python

I have seen some answers for similar questions but I am not sure that they were the best way to fix my problem.
I have a very large table (100,000+ rows of 20+ columns) being handled as a list of dictionaries. I need to do a partial deduplication of this list using a comparison. I have simplified an example of what I am doing now below.
table = [
    {"serial": "111", "time": "1000", "name": "jon"},
    {"serial": "222", "time": "0900", "name": "sal"},
    {"serial": "333", "time": "1100", "name": "tim"},
    {"serial": "444", "time": "1300", "name": "ron"},
    {"serial": "111", "time": "1300", "name": "pam"}
]

for row in table:
    for row2 in table:
        if row != row2:
            if row['serial'] == row2['serial']:
                if row['time'] > row2['time']:
                    action
This method does work (it is obviously simplified, and I just wrote "action" for that part), but my question is whether there is a more efficient way to get to the "row" I want without having to double-iterate the entire table. I don't have a way to predict where in the list matching rows would be located, but they would be listed under the same "serial" in this case.
I'm relatively new to Python and efficiency is the goal here. With the number of rows currently being iterated it takes a long time to complete, and I'm sure there is a more efficient way to do this; I'm just not sure where to start.
Thanks for any help!
You can sort the table with serial as the primary key and time as the secondary key, in reverse order (so that the later of the duplicate items takes precedence), then iterate through the sorted list and take action only on the first dict of every distinct serial:
from operator import itemgetter

table = [
    {"serial": "111", "time": "1000", "name": "jon"},
    {"serial": "222", "time": "0900", "name": "sal"},
    {"serial": "333", "time": "1100", "name": "tim"},
    {"serial": "444", "time": "1300", "name": "ron"},
    {"serial": "111", "time": "1300", "name": "pam"}
]

last_serial = ''
for d in sorted(table, key=itemgetter('serial', 'time'), reverse=True):
    if d['serial'] != last_serial:
        action(d)
        last_serial = d['serial']
A list of dictionaries is always going to be fairly slow for this much data. Instead, look into whether Pandas is suitable for your use case - it is already optimised for this kind of work.
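For illustration only (this sketch is not from the original answer), the pandas route could look something like the following, assuming the goal is to keep the row with the latest time for each serial, mirroring the sort-based approach above:
import pandas as pd

df = pd.DataFrame(table)

# keep, for each serial, only the row with the latest time
latest = (
    df.sort_values("time")
      .drop_duplicates(subset="serial", keep="last")
)
print(latest)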
It may not be the most efficient, but one thing you can do is get a list of the serial numbers and sort it. Let's call that list serialNumbersList. The serial numbers that appear only once cannot possibly be duplicates, so we remove them from serialNumbersList. Then you can use that list to reduce the number of rows to process. Again, I am sure there are better solutions, but this is a good starting point.
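A minimal sketch of that pre-filtering idea, using collections.Counter instead of an explicit sorted list (an assumption about the intent, not code from the answer):
from collections import Counter

# count how many times each serial appears
serial_counts = Counter(row["serial"] for row in table)

# only rows whose serial occurs more than once can be duplicates
candidates = [row for row in table if serial_counts[row["serial"]] > 1]
print(candidates)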
#GiraffeMan91 Just to clarify what I mean (typed directly here, do not copy-paste):
import collections
import multiprocess

serials = collections.defaultdict(list)
for d in table:
    serials[d.pop('serial')].append(d)

def process_serial(entry):
    serial, values = entry
    # remove duplicates, take action based on time
    # return serial, processed values

results = dict(
    multiprocess.Pool(10).imap(process_serial, serials.items())
)

Looping json data dynamically in Python

I have two JSON files on a server. The first JSON file is a dataframe in JSON format with 21 columns.
The second JSON object is a collection of different filters to be applied to the first JSON (the data file), and I want to calculate the reduction in the amount column dynamically after applying each filter.
Both JSONs are on the server. A sample of the criteria file is below:
[{
    "criteria_no.": 1,
    "expression": "!=",
    "attributes": "Industry_name",
    "value": "Clasentrix"
}, {
    "criteria_no.": 2,
    "expression": "=",
    "attributes": "currency",
    "value": ["EUR", "GBP", "INR"]
}, {
    "criteria_no.": 3,
    "expression": ">",
    "attributes": "Industry_Rating",
    "value": "A3"
}, {
    "criteria_no.": 4,
    "expression": "<",
    "attributes": "Due_date",
    "value": "01/01/2025"
}]
When loaded in Python, it looks like this:
import urllib2, json
url = urllib2.urlopen('http://.../server/criteria_sample.json')
obj = json.load(url)
print obj
[{u'attributes': u'Industry_name', u'expression': u'!=', u'value': u'Clasentrix', u'criteria_no.': 1}, {u'attributes': u'currency', u'expression': u'=', u'value': [u'EUR', u'GBP', u'INR'], u'criteria_no.': 2}, {u'attributes': u'Industry_Rating', u'expression': u'>', u'value': u'A3', u'criteria_no.': 3}, {u'attributes': u'Due_date', u'expression': u'<', u'value': u'01/01/2025', u'criteria_no.': 4}]
Now, in the sample JSON we can see "attributes", which are nothing but the columns present in the first data file. I mentioned it has 21 columns; "Industry_name", "currency", "Industry_Rating", and "Due_date" are four of them. "loan_amount" is another column present in the data file along with all of them.
Now, as this criteria list is only a sample, we have n such criteria or filters. I want these filters to be applied dynamically to the data file, and I would like to calculate the reduction in loan amount. Consider the first filter: it says the "Industry_name" column should not contain "Clasentrix". So from the data file I want to filter "Industry_name" so that it has no 'Clasentrix' entries. Now let's say 11 observations out of 61 in the data file had 'Clasentrix'. Then we take the sum of the entire loan amount (61 rows) and subtract from it the sum of the loan amount for the 11 rows containing 'Clasentrix'. This number is considered the reduction after applying the first filter.
Now, for each of the n criteria, I want to calculate the reduction dynamically in Python. Inside the loop, the filter JSON should produce a filter from its attribute, expression, and value, just as the first filter is "Industry_name != 'Clasentrix'". This should be repeated for each set of rows in the JSON object; for the second criterion (filter) it should be "currency = ['EUR','GBP','INR']", and so on. I also want to calculate the reduction accordingly.
I am struggling to write the Python code for the above exercise. My post is too long, apologies for that, but please advise how I can calculate the reduction dynamically for each of the n criteria.
Thanks in advance!!
UPDATE: for the first data file, here are some sample rows:
[{
    "industry_id.": 1234,
    "loan_id": 1113456,
    "Industry_name": "Clasentrix",
    "currency": "EUR",
    "Industry_Rating": "Ba3",
    "Due_date": "20/02/2020",
    "loan_amount": 563332790,
    "currency_rate": 0.67,
    "country": "USA"
}, {
    "industry_id.": 6543,
    "loan_id": 1125678,
    "Industry_name": "Wolver",
    "currency": "GBP",
    "Industry_Rating": "Aa3",
    "Due_date": "23/05/2020",
    "loan_amount": 33459087,
    "currency_rate": 0.8,
    "country": "UK"
}, {
    "industry_id.": 1469,
    "loan_id": "8876548",
    "Industry_name": "GroupOn",
    "currency": "EUR",
    "Industry_Rating": "Aa1",
    "Due_date": "16/09/2021",
    "loan_amount": 66543278,
    "currency_rate": 0.67,
    "country": "UK"
}, {
    "industry_id.": 1657,
    "loan_id": "6654321",
    "Industry_name": "Clasentrix",
    "currency": "EUR",
    "Industry_Rating": "Ba3",
    "Due_date": "15/07/2020",
    "loan_amount": 5439908765,
    "currency_rate": 0.53,
    "country": "USA"
}]
You can use Pandas to turn the JSON data into a DataFrame and turn the criteria into query strings. Some processing is needed to turn the criteria JSON into a valid query. In the code below, dates are still treated as strings; you may need to handle date criteria explicitly by converting the string into a date first.
import pandas as pd
import json
# ...
criteria = json.load(url)
df = pd.DataFrame(json.load(data_url))  # data_url is the handle of the data file

print("Loan total without filters is {}".format(df["loan_amount"].sum()))

for c in criteria:
    if c["expression"] == "=":
        c["expression"] = "=="
    # If the value is a string we need to surround it in quotation marks
    # Note this can break if any values contain "
    if isinstance(c["value"], basestring):
        query = '{attributes} {expression} "{value}"'.format(**c)
    else:
        query = '{attributes} {expression} {value}'.format(**c)
    loan_total = df.query(query)["loan_amount"].sum()
    print "With criterion {}, {}, loan total is {}".format(c["criteria_no."], query, loan_total)
Alternatively you can turn each criterion into an indexing vector like this:
def criterion_filter(s, expression, value):
    if type(value) is list:
        if expression == "=":
            return s.isin(value)
        elif expression == "!=":
            return ~s.isin(value)
    else:
        if expression == "=":
            return s == value
        elif expression == "!=":
            return s != value
        elif expression == "<":
            return s < value
        elif expression == ">":
            return s > value

for c in criteria:
    filt = criterion_filter(df[c["attributes"]], c["expression"], c["value"])
    loan_total = df[filt]["loan_amount"].sum()
    print "With criterion {}, loan total is {}".format(c["criteria_no."], loan_total)
EDIT: To calculate the cumulative reduction in loan total, you can combine the indexing vectors using the & operator.
loans = [df["loan_amount"].sum()]
print("Loan total without filters is {}".format(loans[0]))

filt = True
for c in criteria:
    filt &= criterion_filter(df[c["attributes"]], c["expression"], c["value"])
    loans.append(df[filt]["loan_amount"].sum())
    print "Adding criterion {} reduces the total by {}".format(c["criteria_no."],
                                                               loans[-2] - loans[-1])

print "The cumulative reduction is {}".format(loans[0] - loans[-1])

Map Reduce on timestamps

Mongodb Database:
{"thread": "abc", "message": "hjhjh", "Date": (2010,4,5,0,0,0)}
{"thread": "abc", "message": "hjhjh", "Date": (2009,3,5,0,0,0)}
{"thread": "efg", "message": "hjhjh", "Date": (2010,3,7,0,0,0)}
{"thread": "efg", "message": "hjhjh", "Date": (2011,4,5,0,0,0)}
How can I Map-Reduce or aggregate on the above data to generate an output as:
{"thread": "abc", "messages_per_month": 5}
{"thread": "efg", "messages_per_month": 4}
I am trying to write some code, but it's difficult to average over unsorted dates. Are there any built-in functions for averaging over time that would do this?
You could do min/max on the dates, as described here for numbers, and just add a counter for the messages. The result would be a list of threads with the number of messages and the min/max dates. From the dates you can calculate the number of months, which you can then use to average the number of messages.
Edit: updated the link to point to an archived version.
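A minimal sketch of that idea using pymongo's aggregation framework; the database name (forum), the collection name (messages), and the assumption that "Date" is stored as a real datetime are all placeholders, not details from the question:
from pymongo import MongoClient

db = MongoClient().forum

pipeline = [
    {"$group": {
        "_id": "$thread",
        "count": {"$sum": 1},
        "first": {"$min": "$Date"},
        "last": {"$max": "$Date"},
    }},
]

for doc in db.messages.aggregate(pipeline):
    # number of calendar months spanned by the thread, inclusive
    months = (doc["last"].year - doc["first"].year) * 12 + (doc["last"].month - doc["first"].month) + 1
    print({"thread": doc["_id"], "messages_per_month": doc["count"] / months})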
