MongoDB database:
{"thread": "abc", "message": "hjhjh", "Date": (2010,4,5,0,0,0)}
{"thread": "abc", "message": "hjhjh", "Date": (2009,3,5,0,0,0)}
{"thread": "efg", "message": "hjhjh", "Date": (2010,3,7,0,0,0)}
{"thread": "efg", "message": "hjhjh", "Date": (2011,4,5,0,0,0)}
How can I Map-Reduce or aggregate on the above data to generate an output as:
{"thread": "abc", "messages_per_month": 5}
{"thread": "efg", "messages_per_month": 4}
I am trying to write some code, but it's difficult to average on unsorted dates. Are there any built-in functions for averaging on time to do this?
You could do min/max on the dates as described here for numbers, and just add a counter for the messages. The result would be a list of threads with the number of messages and the min/max date. From the dates you can calculate the number of months, which you can then use to average the number of messages.
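A minimal sketch of that idea with PyMongo (the database/collection names forum.messages are placeholders, and Date is assumed to be stored as a real BSON date rather than the tuple notation shown above):

from pymongo import MongoClient

# Hypothetical connection and collection; adjust to your deployment.
coll = MongoClient()["forum"]["messages"]

pipeline = [
    {"$group": {
        "_id": "$thread",
        "count": {"$sum": 1},        # messages per thread
        "first": {"$min": "$Date"},  # earliest message date
        "last": {"$max": "$Date"},   # latest message date
    }}
]

for doc in coll.aggregate(pipeline):
    first, last = doc["first"], doc["last"]
    # inclusive number of months spanned by the thread
    months = (last.year - first.year) * 12 + (last.month - first.month) + 1
    print({"thread": doc["_id"], "messages_per_month": doc["count"] / months})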
Edit: updated link with link to archived version
As I'm fairly new to Python, I've tried various approaches based on answers found here, but I can't manage to normalize my JSON file.
As I checked in Postman, it has 4 levels of nested arrays. For suppliers I want to expand all levels of the data. The problem I have is with the score_card subtree, from which I want to pull out all risk_groups, then their name and risk_score, all risks together with their name and risk_score, and the indicators at the deepest level, again containing name and risk_score.
I'm also not sure whether it is possible in one go, given the ambiguous names between levels.
Below I'm sharing a reproducible example:
data = {"suppliers": [
{
"id": 1260391,
"name": "2712270 MANITOBA LTD",
"duns": "rm-071-7291",
"erp_number": "15189067;15189067",
"material_group_ids": [
176069
],
"business_unit_ids": [
13728
],
"responsible_user_ids": [
37761,
37760,
37759,
37758,
37757,
37756,
36520,
37587,
36494,
22060,
37742,
36446,
36289
],
"address": {
"address1": "BOX 550 NE 16-20-26",
"address2": None,
"zip_code": "R0J 1W0",
"city": "RUSSELL",
"state": None,
"country_code": "ca",
"latitude": 50.7731176,
"longitude": -101.2862461
},
"score_card": {
"risk_score": 26.13,
"risk_groups": [
{
"name": "Viability",
"risk_score": 43.33,
"risks": [
{
"name": "Financial stability supplier",
"risk_score": None,
"indicators": [
{
"name": "Acquisitions",
"risk_score": None
}
]
}
]
},
]
}
}
]
}
And here is how it should look:
expected = [[1260391,'2712270 MANITOBA LTD','rm-071-7291',176069,13728,[37761,37760,
37759,37758,37757,37756,36520,37587,36494,22060,37742,36446,36289],
'BOX 550 NE 16-20-26','None','R0J 1W0','RUSSELL','None','ca',50.7731176,
-101.2862461,26.13,'Viability',43.33,'Financial stability supplier','None',
'Acquisitions','None']]
df = pd.DataFrame(expected,columns=['id','name','duns','material_groups_ids',
'business_unit_ids','responsible_user_ids',
'address.address1','address.address2','address.zip_code',
'address.city','address.state','address.country_code','address.latitude',
'address.longitude','risk_score','risk_groups.name',
'risk_groups.risk_score','risk_groups.risks.name',
'risk_groups.risks.risk_score','risk_groups.risks.indicators.name',
'risk_groups.risks.indicators.risk_score'])
What I have tried for the last 3 days is to use json_normalize:
example = pd.json_normalize(data['suppliers'],
record_path=['score_card','risk_groups','risks',
'indicators'],meta = ['id','name','duns','erp_number','material_group_ids',
'business_unit_ids','responsible_user_ids','address',['score_card','risk_score']],record_prefix='_')
but when I specify the fields for record_path, which seems to be a vital parameter here, it always returns this error:
TypeError: {'name': 'Acquisitions', 'risk_score': None} has non list value Acquisitions for path name. Must be list or null.
What I want to get is a table with the columns presented above.
So it seems that json_normalize doesn't know how to treat the most granular level, which contains null values.
I've tried the approach from here: Use json_normalize to normalize json with nested arrays, but unsuccessfully.
I would like to get the same view I was able to 'unpack' in Power BI, but that way of having the data does not satisfy me.
All risk indicators are components of risks, which are part of risk groups, as you can see in the attached pictures.
Going to the most granular level it works this way: Risk groups -> Risks -> Indicators, so I would like to unpack it all.
Any help is more than desperately needed. Thank you for any support here.
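One direction that might work (a sketch only, not verified against the full file): normalize at the deepest indicators level and recover the parent levels through nested meta paths. The record_prefix/meta_prefix below are only there to keep the repeated name/risk_score keys apart, and errors="ignore" tolerates missing keys:

import pandas as pd

flat = pd.json_normalize(
    data["suppliers"],
    record_path=["score_card", "risk_groups", "risks", "indicators"],
    meta=[
        "id", "name", "duns",
        ["score_card", "risk_score"],
        ["score_card", "risk_groups", "name"],
        ["score_card", "risk_groups", "risk_score"],
        ["score_card", "risk_groups", "risks", "name"],
        ["score_card", "risk_groups", "risks", "risk_score"],
    ],
    record_prefix="indicators.",
    meta_prefix="meta.",
    errors="ignore",
)

The remaining flat columns (erp_number, the id lists, the address fields) could be added as extra meta entries in the same way.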
My problem is as follows:
I have a txt-file that holds nothing but a dictionary with one single key. The value to that one, single key is a huge list containing dictionaries as list entries. First key:value pair for comparison:
"data": [{"type": "utl", "id": "53150", "attributes": {"timestamp": "T13:00:00Z", "count": 0.0}}, [...etc.]
I tried the following to convert the value of the single-keyed dictionary into a list, by calling the .values() method and then using list():
list_variable = list(dict_variable.values())
But it seems that this just converts the value into a list with a single element: when I try to access index 0 the program crashes (the list is too big), and if I try to access index 1 I get an error stating that the index is out of range. (My current idea is to first convert it into a list and THEN into a DataFrame.)
I'm a bloody beginner and have no idea what else I could try. What am I missing?
Thanks a lot in advance for your helpful comments!
Looks like JSON to me. Try using pandas.json_normalize:
d = {"data": [{"type": "utl", "id": "53150", "attributes": {"timestamp": "T13:00:00Z", "count": 0.0}}]}
pd.json_normalize(d['data'])
type id attributes.timestamp attributes.count
0 utl 53150 T13:00:00Z 0.0
Does the code below help you?
test.txt
"data": [{"type": "utl", "id": "53150", "attributes": {"timestamp": "T13:00:00Z", "count": 0.0}}, {"type": "utl2", "id": "53151", "attributes": {"timestamp": "T12:00:00Z", "count": 1.0}}]
from re import findall
from pandas import json_normalize

with open("test.txt") as f:
    # findall grabs everything between the first "{" and the last "}";
    # eval then turns that text into a tuple of dicts for json_normalize
    print(json_normalize(eval(findall("{.+}", f.read())[0])))
Output:
type id attributes.timestamp attributes.count
0 utl 53150 T13:00:00Z 0.0
1 utl2 53151 T12:00:00Z 1.0
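If the file content really is just the body of a JSON object without the outer braces, an alternative sketch that avoids eval (assuming the same test.txt) is to re-add the braces and parse it with the json module:

import json
import pandas as pd

with open("test.txt") as f:
    obj = json.loads("{" + f.read() + "}")  # restore the missing outer braces

print(pd.json_normalize(obj["data"]))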
I have one JSON payload which is used for one service request. After processing, that payload (JSON) will be stored in S3, and through Athena we can download the data in CSV format. Now, in the actual scenario, there are more than 100 fields. I want to verify their values through some automated script instead of manually.
Say my sample payload is similar to the following:
{
"BOOK": {
"serialno": "123",
"author": "xyz",
"yearofpublish": "2015",
"price": "16"
}, "Author": [
{
"isbn": "xxxxx", "title": "first", "publisher": "xyz", "year": "2020"
}, {
"isbn": "yyyy", "title": "second", "publisher": "zmy", "year": "2019"
}
]
}
The sample CSV will have the book fields and the author fields flattened into one row per author, with renamed columns.
Can anyone please help me with how exactly I can do this in Python? Maybe a library or a dictionary approach?
It looks like you just want to flatten out the JSON structure. It'll be easiest to loop over the "Author" list. Since the CSV has renamed the columns, you'll need some way to represent that mapping. Based only on your example, this works:
import json
fin=open(some_json_file, 'r')
j=json.load(fin)
result=[]
for author in j['Author']:
val = {'book_serialno': j['BOOK']['serialno'],
'book_author': j['BOOK']['author'],
'book_yearofpublish': j['BOOK']['yearofpublish'],
'book_price': j['BOOK']['price'],
'author_isbn': author['isbn'],
'author_title': author['title'],
'author_publisher': author['publisher'],
'author_year': author['year']}
result.append(val)
This is using a dictionary to show the mapping of data points to the new column names. You might be able to get away with using a list as well. Depends how you want to use it later on. To write to a CSV:
import csv
fout=open(some_csv_file, 'w')
writer=csv.writer(fout)
writer.writerow(result[0].keys())
writer.writerows(r.values() for r in result)
This writes the column names in the first row, then the data. If you don't want the column names, just leave out the writerow(...) line.
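Equivalently, since every dict in result has the same keys, csv.DictWriter can take care of the header and the column order (a small variation on the snippet above; newline='' avoids blank lines on Windows):

import csv

with open(some_csv_file, 'w', newline='') as fout:
    writer = csv.DictWriter(fout, fieldnames=result[0].keys())
    writer.writeheader()
    writer.writerows(result)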
I have seen some answers for similar questions but I am not sure that they were the best way to fix my problem.
I have a very large table (100,000+ rows of 20+ columns) being handled as a list of dictionaries. I need to do a partial deduplication of this list using a comparison. I have simplified an example of what I am doing now below.
table = [
    { "serial": "111", "time": "1000", "name": "jon" },
    { "serial": "222", "time": "0900", "name": "sal" },
    { "serial": "333", "time": "1100", "name": "tim" },
    { "serial": "444", "time": "1300", "name": "ron" },
    { "serial": "111", "time": "1300", "name": "pam" }
]
for row in table:
for row2 in table:
if row != row2:
if row['serial'] == row2['serial']:
if row['time'] > row2['time']:
action
This method does work (obviously simplified, and I just wrote "action" for that part), but my question is whether there is a more efficient way to get to the "row" I want without having to double-iterate the entire table. I don't have a way to predict where in the list matching rows are located, but they will be listed under the same "serial" in this case.
I'm relatively new to Python and efficiency is the goal here. As of now, with the number of rows being iterated, it takes a long time to complete, and I'm sure there is a more efficient way to do this; I'm just not sure where to start.
Thanks for any help!
You can sort the table with serial as the primary key and time as the secondary key, in reverse order (so that the later of the duplicate items takes precedence), then iterate through the sorted list and take action only on the first dict of every distinct serial:
from operator import itemgetter
table = [
{ "serial": "111", "time": "1000", "name": "jon" },
{ "serial": "222", "time": "0900", "name": "sal" },
{ "serial": "333", "time": "1100", "name": "tim" },
{ "serial": "444", "time": "1300", "name": "ron" },
{ "serial": "111", "time": "1300", "name": "pam" }
]
last_serial = ''
for d in sorted(table, key=itemgetter('serial', 'time'), reverse=True):
if d['serial'] != last_serial:
action(d)
last_serial = d['serial']
A list of dictionaries is always going to be fairly slow for this much data. Instead, look into whether Pandas is suitable for your use case - it is already optimised for this kind of work.
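For example, a rough sketch with Pandas, assuming the goal is to keep only the latest entry per serial as in the loop above:

import pandas as pd

df = pd.DataFrame(table)
# keep only the row with the largest "time" for each "serial"
latest = df.sort_values("time").drop_duplicates("serial", keep="last")
print(latest)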
It may not be the most efficient, but one thing you can do is get a list of the serial numbers and sort it. Let's call that list serialNumbersList. Serial numbers that only appear once cannot possibly be duplicates, so we remove them from serialNumbersList. Then you can use that list to reduce the number of rows to process, as sketched below. Again, I am sure there are better solutions, but this is a good starting point.
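A rough sketch of that idea with collections.Counter:

from collections import Counter

counts = Counter(row["serial"] for row in table)
# serials that appear only once cannot have duplicates
duplicated_serials = {s for s, n in counts.items() if n > 1}
rows_to_process = [row for row in table if row["serial"] in duplicated_serials]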
@GiraffeMan91 Just to clarify what I mean (typed directly here, do not copy-paste):
import collections
import multiprocessing

serials = collections.defaultdict(list)
for d in table:
    serials[d.pop('serial')].append(d)

def process_serial(entry):
    serial, values = entry
    # remove duplicates, take action based on time
    # return serial, processed values
    return serial, values  # placeholder so the sketch runs

results = dict(
    multiprocessing.Pool(10).imap(process_serial, serials.items())
)
I have two JSON files on a server. The first JSON file is a dataframe in JSON format with 21 columns.
The second JSON object is a collection of different filters to be applied to the first JSON (the data file), and I want to calculate the reduction in the amount column dynamically after applying each filter.
Both JSONs are on the server. A sample of the filter file is like below:
[{
"criteria_no.": 1,
"expression": "!=",
"attributes": "Industry_name",
"value": "Clasentrix"
},{
"criteria_no.": 2,
"expression": "=",
"attributes": "currency",
"value": ["EUR","GBP","INR"]
},{
"criteria_no.": 3,
"expression": ">",
"attributes": "Industry_Rating",
"value": "A3"
},{
"criteria_no.": 4,
"expression": "<",
"attributes": "Due_date",
"value": "01/01/2025"
}
]
When loaded in Python, it looks like below:
import urllib2, json
url = urllib2.urlopen('http://.../server/criteria_sample.json')
obj = json.load(url)
print obj
[{u'attributes': u'Industry_name', u'expression': u'!=', u'value': u'Clasentrix', u'criteria_no.': 1}, {u'attributes': u'currency', u'expression': u'=', u'value': [u'EUR', u'GBP', u'INR'], u'criteria_no.': 2}, {u'attributes': u'Industry_Rating', u'expression': u'>', u'value': u'A3', u'criteria_no.': 3}, {u'attributes': u'Due_date', u'expression': u'<', u'value': u'01/01/2025', u'criteria_no.': 4}]
Now, in the sample JSON, we can see "attributes", which are nothing but the columns present in the first data file. I mentioned it has 21 columns; "Industry_name", "currency", "Industry_Rating" and "Due_date" are four of them. "loan_amount" is another column present in the data file along with them.
Now, as this criteria list is only a sample, we have n such criteria or filters. I want these filters to be applied dynamically to the data file, and I would like to calculate the reduction in loan amount. Consider the first filter: it says the "Industry_name" column should not have "Clasentrix". So from the data file I want to filter "Industry_name" so that it has no 'Clasentrix' entries. Say 11 observations out of 61 in the data file have 'Clasentrix'. Then we take the sum of the entire loan amount (61 rows) and subtract from it the sum of the loan amount for the 11 rows containing 'Clasentrix'. This number is the reduction after applying the first filter.
Now, for each of the n criteria, I want to calculate the reduction dynamically in Python. So inside the loop, each entry of the filter JSON should build a filter from its attribute, expression and value, just like "Industry_name != 'Clasentrix'" for the first filter. The same should happen for every criterion, e.g. "currency = ['EUR','GBP','INR']" for the second one, and so on, and I want to calculate the reduction accordingly.
I am struggling to create the Python code for the above exercise. My post is too long, apologies for that, but please advise on how I can calculate the reduction dynamically for each of the n criteria.
Thanks in advance!!
UPDATE: for the first data file, find some sample rows below:
[{
"industry_id.": 1234,
"loan_id": 1113456,
"Industry_name": "Clasentrix",
"currency": "EUR",
"Industry_Rating": "Ba3",
"Due_date": "20/02/2020",
"loan_amount": 563332790,
"currency_rate": 0.67,
"country": "USA"
},{
"industry_id.": 6543,
"loan_id": 1125678,
"Industry_name": "Wolver",
"currency": "GBP",
"Industry_Rating": "Aa3",
"Due_date": "23/05/2020",
"loan_amount": 33459087,
"currency_rate": 0.8,
"country": "UK"
},{
"industry_id.": 1469,
"loan_id": "8876548",
"Industry_name": "GroupOn",
"currency": "EUR",
"Industry_Rating": "Aa1",
"Due_date": "16/09/2021",
"loan_amount": 66543278,
"currency_rate": 0.67,
"country": "UK"
},{
"industry_id.": 1657,
"loan_id": "6654321",
"Industry_name": "Clasentrix",
"currency": "EUR",
"Industry_Rating": "Ba3",
"Due_date": "15/07/2020",
"loan_amount": 5439908765,
"currency_rate": 0.53,
"country": "USA"
}
]
You can use Pandas to turn the JSON data into a dataframe and turn the criteria into query strings. Some processing is needed to turn the criteria JSON into valid queries. In the code below dates are still treated as strings; you may need to explicitly convert the date strings to dates first for the date query.
import pandas as pd
import json
# ...
criteria = json.load(url)
df = pd.DataFrame(json.load(data_url)) # data_url is the handle of the data file
print("Loan total without filters is {}".format(df["loan_amount"].sum()))
for c in criteria:
    if c["expression"] == "=":
        c["expression"] = "=="
    # If the value is a string we need to surround it in quotation marks.
    # Note this can break if any values contain "
    if isinstance(c["value"], basestring):  # use str instead of basestring on Python 3
        query = '{attributes} {expression} "{value}"'.format(**c)
    else:
        query = '{attributes} {expression} {value}'.format(**c)
    loan_total = df.query(query)["loan_amount"].sum()
    print("With criterion {}, {}, loan total is {}".format(c["criteria_no."], query, loan_total))
Alternatively you can turn each criterion into an indexing vector like this:
def criterion_filter(s, expression, value):
if type(value) is list:
if expression == "=":
return s.isin(value)
elif expression == "!=":
return ~s.isin(value)
else:
if expression == "=":
return s == value
elif expression == "!=":
return s != value
elif expression == "<":
return s < value
elif expression == ">":
return s > value
for c in criteria:
    filt = criterion_filter(df[c["attributes"]], c["expression"], c["value"])
    loan_total = df[filt]["loan_amount"].sum()
    print("With criterion {}, loan total is {}".format(c["criteria_no."], loan_total))
EDIT: To calculate the cumulative reduction in loan total, you can combine the indexing vectors using the & operator.
loans = [df["loan_amount"].sum()]
print("Loan total without filters is {}".format(loans[0]))
filt = True
for c in criteria:
    filt &= criterion_filter(df[c["attributes"]], c["expression"], c["value"])
    loans.append(df[filt]["loan_amount"].sum())
    print("Adding criterion {} reduces the total by {}".format(
        c["criteria_no."], loans[-2] - loans[-1]))
print("The cumulative reduction is {}".format(loans[0] - loans[-1]))