I have two JSON files on a server. The first is a dataframe in JSON format with 21 columns.
The second is a collection of filters to be applied to the first file (the data file), and I want to calculate the reduction in the amount column dynamically after applying each filter.
Both JSONs are on the server. A sample of the criteria file looks like this:
[{
"criteria_no.": 1,
"expression": "!=",
"attributes": "Industry_name",
"value": "Clasentrix"
},{
"criteria_no.": 2,
"expression": "=",
"attributes": "currency",
"value": ["EUR","GBP","INR"]
},{
"criteria_no.": 3,
"expression": ">",
"attributes": "Industry_Rating",
"value": "A3"
},{
"criteria_no.": 4,
"expression": "<",
"attributes": "Due_date",
"value": "01/01/2025"
}
]
When coded in Python, it looks like this:
import json
import urllib.request

url = urllib.request.urlopen('http://.../server/criteria_sample.json')
obj = json.load(url)
print(obj)
[{'attributes': 'Industry_name', 'expression': '!=', 'value': 'Clasentrix', 'criteria_no.': 1}, {'attributes': 'currency', 'expression': '=', 'value': ['EUR', 'GBP', 'INR'], 'criteria_no.': 2}, {'attributes': 'Industry_Rating', 'expression': '>', 'value': 'A3', 'criteria_no.': 3}, {'attributes': 'Due_date', 'expression': '<', 'value': '01/01/2025', 'criteria_no.': 4}]
In the sample JSON, the "attributes" values are simply columns of the first data file. As mentioned, it has 21 columns; "Industry_name", "currency", "Industry_Rating" and "Due_date" are four of them, and "loan_amount" is another column present in the data file alongside them.
This criteria list is only a sample; there are n such criteria (filters). I want these filters applied dynamically to the data file, and I would like to calculate the reduction in loan amount for each one. Consider the first filter: it says the "Industry_name" column should not contain "Clasentrix", so from the data file I want to keep only the rows whose "Industry_name" is not 'Clasentrix'. Say 11 observations out of 61 had 'Clasentrix'. Then we take the sum of the entire loan amount (61 rows) and subtract from it the sum of the loan amount for the 11 'Clasentrix' rows; that number is the reduction after applying the first filter.
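For a single, hard-coded filter, that arithmetic looks like the minimal sketch below. It assumes the data file has been loaded into a pandas DataFrame (two of the sample rows stand in for the real data here); the general, dynamic version is in the answers that follow.

import pandas as pd

# Hypothetical stand-in for the real data file: two of the sample rows.
df = pd.DataFrame([
    {"Industry_name": "Clasentrix", "loan_amount": 563332790},
    {"Industry_name": "Wolver", "loan_amount": 33459087},
])

total = df["loan_amount"].sum()  # sum over all rows
removed = df.loc[df["Industry_name"] == "Clasentrix", "loan_amount"].sum()
reduction = total - removed      # total minus the 'Clasentrix' rows, per the definition above
print(reduction)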
For each of the n criteria I want to calculate the reduction dynamically in Python. So inside a loop, each row of the filter JSON should be turned into a filter from its attribute, expression and value, just as the first one becomes "Industry_name != 'Clasentrix'". The same should happen for each set of rows in the JSON object, e.g. the second criterion (filter) becomes "currency = ['EUR','GBP','INR']", and so on, and the reduction should be calculated accordingly.
I am struggling to write the Python code for the exercise described above. My post is too long, apologies for that, but please advise how I can calculate the reduction dynamically for each of the n criteria.
Thanks in advance!!
UPDATE: here are some sample rows from the first data file:
[{
"industry_id.": 1234,
"loan_id": 1113456,
"Industry_name": "Clasentrix",
"currency": "EUR",
"Industry_Rating": "Ba3",
"Due_date": "20/02/2020",
"loan_amount": 563332790,
"currency_rate": 0.67,
"country": "USA"
},{
"industry_id.": 6543,
"loan_id": 1125678,
"Industry_name": "Wolver",
"currency": "GBP",
"Industry_Rating": "Aa3",
"Due_date": "23/05/2020",
"loan_amount": 33459087,
"currency_rate": 0.8,
"country": "UK"
},{
"industry_id.": 1469,
"loan_id": "8876548",
"Industry_name": "GroupOn",
"currency": "EUR",
"Industry_Rating": "Aa1",
"Due_date": "16/09/2021",
"loan_amount": 66543278,
"currency_rate": 0.67,
"country": "UK"
},{
"industry_id.": 1657,
"loan_id": "6654321",
"Industry_name": "Clasentrix",
"currency": "EUR",
"Industry_Rating": "Ba3",
"Due_date": "15/07/2020",
"loan_amount": 5439908765,
"currency_rate": 0.53,
"country": "USA"
}
]
You can use pandas to turn the JSON data into a DataFrame and turn the criteria into query strings. Some processing is needed to turn the criteria JSON into valid queries. In the code below dates are still treated as strings; for date criteria you may need to convert the strings to dates explicitly first (a sketch of that follows the code).
import pandas as pd
import json
# ...
criteria = json.load(url)
df = pd.DataFrame(json.load(data_url))  # data_url is the handle of the data file
print("Loan total without filters is {}".format(df["loan_amount"].sum()))
for c in criteria:
    if c["expression"] == "=":
        c["expression"] = "=="
    # If the value is a string we need to surround it in quotation marks.
    # Note this can break if any values contain "
    if isinstance(c["value"], str):
        query = '{attributes} {expression} "{value}"'.format(**c)
    else:
        query = '{attributes} {expression} {value}'.format(**c)
    loan_total = df.query(query)["loan_amount"].sum()
    print("With criterion {}, {}, loan total is {}".format(c["criteria_no."], query, loan_total))
Alternatively you can turn each criterion into an indexing vector like this:
def criterion_filter(s, expression, value):
    if isinstance(value, list):
        if expression == "=":
            return s.isin(value)
        elif expression == "!=":
            return ~s.isin(value)
    else:
        if expression == "=":
            return s == value
        elif expression == "!=":
            return s != value
        elif expression == "<":
            return s < value
        elif expression == ">":
            return s > value

for c in criteria:
    filt = criterion_filter(df[c["attributes"]], c["expression"], c["value"])
    loan_total = df[filt]["loan_amount"].sum()
    print("With criterion {}, loan total is {}".format(c["criteria_no."], loan_total))
EDIT: To calculate the cumulative reduction in loan total, you can combine the indexing vectors using the & operator.
loans = [df["loan_amount"].sum()]
print("Loan total without filters is {}".format(loans[0]))
filt = True
for c in criteria:
    filt &= criterion_filter(df[c["attributes"]], c["expression"], c["value"])
    loans.append(df[filt]["loan_amount"].sum())
    print("Adding criterion {} reduces the total by {}".format(c["criteria_no."],
                                                               loans[-2] - loans[-1]))
print("The cumulative reduction is {}".format(loans[0] - loans[-1]))
Related
I need to extract only the content of the operations from the following JSON:
{"entries":[{"description":"Text transform on 101 cells in column Column 2: value.toLowercase()","operation":{"op":"core/text-transform","engineConfig":{"facets":[],"mode":"row-based"},"columnName":"Column 2","expression":"value.toLowercase()","onError":"keep-original","repeat":false,"repeatCount":10,"description":"Text transform on cells in column Column 2 using expression value.toLowercase()"}},{"description":"Text transform on 101 cells in column Column 6: value.toDate()","operation":{"op":"core/text-transform","engineConfig":{"facets":[],"mode":"row-based"},"columnName":"Column 6","expression":"value.toDate()","onError":"keep-original","repeat":false,"repeatCount":10,"description":"Text transform on cells in column Column 6 using expression value.toDate()"}}]}
It should look like this:
[{"op": "core/text-transform", "engineConfig": {"facets": [], "mode": "row-based"}, "columnName": "Column 2", "expression": "value.toLowercase()", "onError": "keep-original", "repeat": "false", "repeatCount": 10, "description": "Text transform on cells in column Column 2 using expression value.toLowercase()"}, {"op": "core/text-transform", "engineConfig": {"facets": [], "mode": "row-based"}, "columnName": "Column 6", "expression": "value.toDate()", "onError": "keep-original", "repeat": "false", "repeatCount": 10, "description": "Text transform on cells in column Column 6 using expression value.toDate()"}]
I tried to use this code:
import json
operations = [{"description":"Text transform on 101 cells in column Column 2: value.toLowercase()","operation":{"op":"core/text-transform","engineConfig":{"facets":[],"mode":"row-based"},"columnName":"Column 2","expression":"value.toLowercase()","onError":"keep-original","repeat":"false","repeatCount":10,"description":"Text transform on cells in column Column 2 using expression value.toLowercase()"}},{"description":"Text transform on 101 cells in column Column 6: value.toDate()","operation":{"op":"core/text-transform","engineConfig":{"facets":[],"mode":"row-based"},"columnName":"Column 6","expression":"value.toDate()","onError":"keep-original","repeat":"false","repeatCount":10,"description":"Text transform on cells in column Column 6 using expression value.toDate()"}}]
new_operations = []
for operation in operations:
    new_operations.append(operation["operation"])
x = json.dumps(new_operations)
print(x)
However, I must manually place quotes around values like false and also remove the enclosing "entries" wrapper for it to work. Does anyone know how to do it automatically?
IIUC you can do it like this: read it as JSON data, extract the parts you want, and dump it back to JSON.
import json

with open('your-json-data.json') as j:
    data = json.load(j)

new_data = []
for dic in data['entries']:
    for key, value in dic.items():
        if key == 'operation':
            dic = {k: v for k, v in value.items()}
    new_data.append(dic)

x = json.dumps(new_data)
print(x)
Output:
[
{
"op":"core/text-transform",
"engineConfig":{
"facets":[
],
"mode":"row-based"
},
"columnName":"Column 2",
"expression":"value.toLowercase()",
"onError":"keep-original",
"repeat":false,
"repeatCount":10,
"description":"Text transform on cells in column Column 2 using expression value.toLowercase()"
},
{
"op":"core/text-transform",
"engineConfig":{
"facets":[
],
"mode":"row-based"
},
"columnName":"Column 6",
"expression":"value.toDate()",
"onError":"keep-original",
"repeat":false,
"repeatCount":10,
"description":"Text transform on cells in column Column 6 using expression value.toDate()"
}
]
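Since every entry in the sample has an "operation" key, the same extraction can also be written as a one-line comprehension:

new_data = [entry["operation"] for entry in data["entries"]]
x = json.dumps(new_data, indent=2)
print(x)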
So I am using the json package in Python to extract data from a generated JSON file; the data was fetched from a Firebase database and then exported as JSON.
Within the given dataset I want to extract all of the data corresponding to bills in each entry of the JSON file. For that I created a separate list to collect all of the bill elements in the dataset.
When converted to CSV, each entry has a bills column alongside the other fields.
So I have the following code to do the above operation. But certain entries have null values, designated as [], in the bills column. I collect the bills entries into a list, intending to keep only those that actually contain data (skipping all the null entries). But the required output is only getting stored in the first index of the new list or array. Please see the code below.
My code is below:
import json
import numpy as np

filedata = open('requireddataset.json', 'r')
data = json.load(filedata)

listoffields = []  # to produce a list of the 'bills' field from each dic
for dic in data:
    try:
        listoffields.append(dic['bills'])  # only non-essential bill categories
    except KeyError:
        pass

#print(listoffields[3])  # this would return the first payment entry within
                         # the JSON array of objects

for val in listoffields:
    if val != []:
        x = val[0]  # only val[0] would contain data
        #print(x)
        myarray = np.array(val)
        print(myarray[0])  # all of the data stored in only one index, any way to change this?
Essentially my question is this: the list listoffields contains fields from the JSON file, bills being one of them. And within the bills column, each entry again contains id, value, role and many other keys. Is there any way to extract only the values from this and produce a sum?
In the JSON file, one entry looks like this:
[{"goal_savings": 0.0, "social_id": "", "score": 0, "country": "BR", "photo": "http://graph.facebook", "id": "", "plates": 3, "rcu": null, "name": "", "email": ".", "provider": "facebook", "phone": "", "savings": [], "privacyPolicyAccepted": true, "currentRole": "RoleType.PERSONAL", "empty_lives_date": null, "userId": "", "authentication_token": "-------", "onboard_status": "ONBOARDING_WIZARD", "fcmToken": ----------", "level": 1, "dni": "", "social_token": "", "lives": 10, "bills": [{"date": "2020-12-10", "role": "RoleType.PERSONAL", "name": "Supermercado", "category": "feeding", "periodicity": "PeriodicityType.NONE", "value": 100.0"}], "payments": [], "goals": [], "goalTransactions": [], "incomes": [], "achievements": [{"created_at":", "name": ""}]}]
I have a sizable JSON file and I need to get the index of a certain value inside it. Here's what my JSON file looks like:
data.json
[{...many more elements here...
},
{
"name": "SQUARED SOS",
"unified": "1F198",
"non_qualified": null,
"docomo": null,
"au": "E4E8",
"softbank": null,
"google": "FEB4F",
"image": "1f198.png",
"sheet_x": 0,
"sheet_y": 28,
"short_name": "sos",
"short_names": [
"sos"
],
"text": null,
"texts": null,
"category": "Symbols",
"sort_order": 167,
"added_in": "0.6",
"has_img_apple": true,
"has_img_google": true,
"has_img_twitter": true,
"has_img_facebook": true
},
{...many more elements here...
}]
How can I get the index of the value "FEB4F" whose key is "google", for example?
My only idea was this but it doesn't work:
print(data.index('FEB4F'))
Your basic data structure is a list, so there's no way to avoid looping over it.
Loop through all the items, keeping track of the current position. If the current item has the desired key/value, print the current position.
position = 0
for item in data:
    if item.get('google') == 'FEB4F':
        print('position is:', position)
        break
    position += 1
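The same search can be written as a one-liner with enumerate and next, returning None when there is no match:

position = next(
    (i for i, item in enumerate(data) if item.get('google') == 'FEB4F'),
    None,
)
print('position is:', position)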
Assuming your data can fit in a table, I recommend using pandas for that. Here is the summary:
Read the data using pandas.read_json
Identify which column to filter
Filter using pandas.DataFrame.loc
i.e.:
import pandas as pd
data = pd.read_json("path_to_json.json")
print(data)
#lets assume you want to filter using the 'unified' column
filtered = data.loc[data['unified'] == 'something']
print(filtered)
Of course the steps would be different depending on the JSON structure
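For the concrete question above, the DataFrame index of a matching row doubles as its position in the original list, since read_json on a list of records produces a default RangeIndex. A sketch:

matches = data.loc[data['google'] == 'FEB4F']
if not matches.empty:
    print('position is:', matches.index[0])  # position in the original JSON array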
I have seen some answers for similar questions but I am not sure that they were the best way to fix my problem.
I have a very large table (100,000+ rows of 20+ columns) being handled as a list of dictionaries. I need to do a partial deduplication of this list using a comparison. I have simplified an example of what I am doing now below.
table = [
    { "serial": "111", "time": "1000", "name": "jon" },
    { "serial": "222", "time": "0900", "name": "sal" },
    { "serial": "333", "time": "1100", "name": "tim" },
    { "serial": "444", "time": "1300", "name": "ron" },
    { "serial": "111", "time": "1300", "name": "pam" }
]

for row in table:
    for row2 in table:
        if row != row2:
            if row['serial'] == row2['serial']:
                if row['time'] > row2['time']:
                    action
This method does work (obviously simplified and just wrote "action" for that part) but the question I have is whether there is a more efficient method to get to the "row" that I want without having to double iterate the entire table. I don't have a way to necessarily predict where in the list matching rows would be located, but they would be listed under the same "serial" in this case.
I'm relatively new to Python and efficiency is the goal here. As of now with the amount of rows that are being iterated it is taking a long time to complete and I'm sure there is a more efficient way to do this, I'm just not sure where to start.
Thanks for any help!
You can sort the table with serial as the primary key and time as the secondary key, in reverse order (so that the latter of the duplicate items take precedence), then iterate through the sorted list and take action only on the first dict of every distinct serial:
from operator import itemgetter
table = [
{ "serial": "111", "time": "1000", "name": "jon" },
{ "serial": "222", "time": "0900", "name": "sal" },
{ "serial": "333", "time": "1100", "name": "tim" },
{ "serial": "444", "time": "1300", "name": "ron" },
{ "serial": "111", "time": "1300", "name": "pam" }
]
last_serial = ''
for d in sorted(table, key=itemgetter('serial', 'time'), reverse=True):
    if d['serial'] != last_serial:
        action(d)
    last_serial = d['serial']
A list of dictionaries is always going to be fairly slow for this much data. Instead, look into whether Pandas is suitable for your use case - it is already optimised for this kind of work.
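A sketch of that pandas route, assuming table is the question's list of dicts and the latest time per serial should win, as in the sort-based answer above:

import pandas as pd

df = pd.DataFrame(table)
# Sort so the latest time per serial comes last, then keep only that row.
latest = df.sort_values(['serial', 'time']).drop_duplicates('serial', keep='last')
print(latest)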
It may not be the most efficient, but one thing you can do is get a list of the serial numbers, then sort them. Let's call that list serialNumbersList. Any serial number that appears only once cannot possibly be a duplicate, so we remove it from serialNumbersList. You can then use that list to reduce the number of rows to process. Again, I am sure there are better solutions, but this is a good starting point.
#GiraffeMan91 Just to clarify what I mean (typed directly here, do not copy-paste):
import collections
import multiprocessing

serials = collections.defaultdict(list)
for d in table:
    serials[d.pop('serial')].append(d)

def process_serial(entry):
    serial, values = entry
    # remove duplicates, take action based on time
    # return serial, processed values

results = dict(
    multiprocessing.Pool(10).imap(process_serial, serials.items())
)
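A possible body for process_serial, assuming (as in the sort-based answer) that the record with the latest time should win; this helper is hypothetical:

def process_serial(entry):
    serial, values = entry
    # Hypothetical policy: keep only the record with the latest time per serial.
    latest = max(values, key=lambda v: v['time'])
    return serial, latest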
I'm trying to format a column of numbers in Google Sheets using the API (Sheets API v.4 and Python 3.6.1, specifically). A portion of my non-functional code is below. I know it's executing, as the background color of the column gets set, but the numbers still show as text, not numbers.
Put another way, I'm trying to get the equivalent of clicking on a column header (A, B, C, or whatever) then choosing the Format -> Number -> Number menu item in the GUI.
def sheets_batch_update(SHEET_ID, data):
    print("Sheets: Batch update")
    service.spreadsheets().batchUpdate(spreadsheetId=SHEET_ID, body=data).execute()  #, valueInputOption='RAW'

data = {
    "requests": [
        {
            "repeatCell": {
                "range": {
                    "sheetId": all_sheets['Users'],
                    "startColumnIndex": 19,
                    "endColumnIndex": 20
                },
                "cell": {
                    "userEnteredFormat": {
                        "numberFormat": {
                            "type": "NUMBER",
                            "pattern": "#,##0"
                        },
                        "backgroundColor": {
                            "red": 0.0,
                            "green": 0.4,
                            "blue": 0.4
                        }
                    }
                },
                "fields": "userEnteredFormat(numberFormat,backgroundColor)"
            }
        }
    ]
}

sheets_batch_update(SHEET_ID, data)
The problem is likely that your data is currently stored as strings and therefore not affected by the number format.
"userEnteredValue": {
"stringValue": "1000"
},
"formattedValue": "1000",
"userEnteredFormat": {
"numberFormat": {
"type": "NUMBER",
"pattern": "#,##0"
}
},
When you set a number format via the UI (Format > Number > ...) it's actually doing two things at once:
1. Setting the number format.
2. Converting string values to number values, if possible.
Your API call is only doing #1, so any cell that is currently set with a string value will remain a string and will therefore be unaffected by the number format. One solution would be to go through the affected values and move the stringValue to a numberValue if the cell contains a number.
To flesh out the answer from Eric Koleda a bit more, I ended up solving this two ways, depending on how I was getting the data for the Sheet:
First, if I was appending cells to the sheet, I used a function:
import re

def set_cell_type(cell_contents):
    current_cell_contents = str(cell_contents).replace(',', '')  # strip thousands separators
    float_cell = re.compile(r"^\d+\.\d+$")
    int_cell = re.compile(r"^\d+$")
    if int_cell.search(current_cell_contents):
        data = {"userEnteredValue": {"numberValue": int(current_cell_contents)}}
    elif float_cell.search(current_cell_contents):
        data = {"userEnteredValue": {"numberValue": float(current_cell_contents)}}
    else:
        data = {"userEnteredValue": {"stringValue": str(cell_contents)}}
    return data
This formats the cells properly. Here's the call that actually did the appending:
rows = [{"values": [set_cell_type(cell) for cell in row]} for row in daily_data_output]
data = { "requests": [ { "appendCells": { "sheetId": all_sheets['Daily record'], "rows": rows, "fields": "*", } } ], }
sheets_batch_update(SHEET_ID,data)
Second, if I was replacing a whole sheet, I did:
# convert the ints to ints and floats to floats
float_cell = re.compile(r"^\d+\.\d+$")
int_cell = re.compile(r"^\d+$")
row_list = error_message.split("\t")
i = 0
while i < len(row_list):
    current_cell = row_list[i].replace(',', '')  # remove the commas from any numbers
    if int_cell.search(current_cell):
        row_list[i] = int(current_cell)
    elif float_cell.search(current_cell):
        row_list[i] = float(current_cell)
    i += 1
error_output.append(row_list)
then the following to actually save error_output to the sheet:
data = {'values': [row for row in error_output]}
sheets_update(SHEET_ID,data,'Errors!A1')
Those two techniques, coupled with the formatting calls I had already figured out in my initial question, did the trick.