How to Compute Sum of Dictionary Field in Spark with Python?

My data is stored in a Spark RDD and is structured like this:
survivors.take(3)
Out[45]:
[{'Age': '38',
'Cabin': 'C85',
'Embarked': 'C',
'Fare': '71.2833',
'Name': 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
'Parch': '0',
'PassengerId': '2',
'Pclass': '1',
'Sex': 'female',
'SibSp': '1',
'Survived': '1',
'Ticket': 'PC 17599'},
{'Age': '26',
'Cabin': '',
'Embarked': 'S',
'Fare': '7.925',
'Name': 'Heikkinen, Miss. Laina',
'Parch': '0',
'PassengerId': '3',
'Pclass': '3',
'Sex': 'female',
'SibSp': '0',
'Survived': '1',
'Ticket': 'STON/O2. 3101282'},
{'Age': '35',
'Cabin': 'C123',
'Embarked': 'S',
'Fare': '53.1',
'Name': 'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
'Parch': '0',
'PassengerId': '4',
'Pclass': '1',
'Sex': 'female',
'SibSp': '1',
'Survived': '1',
'Ticket': '113803'}]
I would like to calculate the sum of the "Age" field across the dictionaries above, using reduce. I am trying to do it like this:
survivors.reduce(lambda row, acc: acc + float(row['Age']))
However, I am not having any luck. I am no Python expert, so perhaps this is a Python problem.

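I would map each row to its age and sum the result instead. The ages are stored as strings, so convert them to float first:
sum(map(lambda row: float(row['Age']), survivors.take(3)))
Note that take(3) only covers the first three rows. To sum over the whole RDD, let Spark do the work:
survivors.map(lambda row: float(row['Age'])).sum()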

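You've got the arguments of the reduce function the wrong way around: by convention the accumulator comes first. More importantly, reduce feeds the result of one call back in as an argument to the next, so the function has to return the same kind of value it receives. Here that means the accumulator must be a dict with an 'Age' key:
survivors.reduce(lambda acc, row: {'Age': float(acc['Age']) + float(row['Age'])})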
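The same behaviour can be reproduced without a cluster. The following is a minimal sketch that uses functools.reduce on a plain list as a local stand-in for the RDD (the `rows` list is an assumption made for illustration and keeps only the 'Age' field of the three rows shown in the question). It contrasts keeping the accumulator as a dict with converting every row to a float before reducing.

```python
from functools import reduce

# Local stand-in for the first three rows of the RDD (assumption for this sketch).
rows = [{'Age': '38'}, {'Age': '26'}, {'Age': '35'}]

# Option 1: the accumulator has the same shape as a row (a dict with 'Age').
total_as_dict = reduce(
    lambda acc, row: {'Age': float(acc['Age']) + float(row['Age'])},
    rows,
)

# Option 2: convert each row to a float first, then add plain numbers.
total_as_float = reduce(lambda a, b: a + b, (float(r['Age']) for r in rows))

print(total_as_dict['Age'], total_as_float)  # 99.0 99.0
```

The second form is usually easier to read, because the reducing function only ever sees numbers.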

Related

Nested dictionary parsing error JSON- TypeError: string indices must be integers

I'm trying to pull the key-value pairs of the dictionary associated with "awayBattingTotals". However, I'm encountering the error below and do not know how to fix it.
A snippet of the JSON response is below:
{
'namefield': '9 Lopez, N SS',
'ab': '3',
'r': '0',
'h': '1',
'doubles': '0',
'triples': '0',
'hr': '0',
'rbi': '0',
'sb': '0',
'bb': '0',
'k': '0',
'lob': '2',
'avg': '.248',
'ops': '.599',
'personId': 670032,
'battingOrder': '900',
'substitution': False,
'note': '',
'name': 'Lopez, N',
'position': 'SS',
'obp': '.305',
'slg': '.294'
}],
'awayBattingTotals': {
'namefield': 'Totals',
'ab': '33',
'r': '2',
'h': '7',
'hr': '1',
'rbi': '2',
'bb': '0',
'k': '8',
'lob': '13',
'avg': '',
'ops': '',
'obp': '',
'slg': '',
'name': 'Totals',
'position': '',
'note': '',
'substitution': False,
'battingOrder': '',
'personId': 0
},
'homeBattingTotals': {
'namefield': 'Totals',
'ab': '34',
'r': '4',
'h': '9',
'hr': '2',
'rbi': '4',
'bb': '1',
'k': '7',
'lob': '13',
'avg': '',
'ops': '',
'obp': '',
'slg': '',
'name': 'Totals',
'position': '',
'note': '',
'substitution': False,
'battingOrder': '',
'personId': 0
},
The above is obtained via
statsapi.boxscore_data(662647)
summary = statsapi.boxscore(662647)
From the above I'm trying to run
summary["awayBattingTotals"]["Totals"]
to pull the values below:
'awayBattingTotals': {'namefield': 'Totals', 'ab': '33', 'r': '2', 'h': '7', 'hr': '1', 'rbi': '2', 'bb': '0', 'k': '8', 'lob': '13',
but I keep getting the error below:
TypeError: string indices must be integers

As Barmar mentioned, it seems the data is not behaving as JSON.
Switching the single quotes to double quotes in the JSON-like text of the response lets me reach into it with json.loads(), like so:
import json
mysecond = '''{"awayBattingTotals": {"namefield": "Totals", "ab": "33", "r": "2"}}'''
myload = json.loads(mysecond)
print(myload)
Result:
{'awayBattingTotals': {'namefield': 'Totals', 'ab': '33', 'r': '2'}}
Parsing fails when I cut and paste the single-quoted response you included in your question, because JSON requires double quotes:
myjson = """{'awayBattingTotals': { 'namefield': 'Totals',
'ab': '33',
'r': '2'}}"""
print(json.loads(myjson))
Result:
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
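In Python 3, two different failures are easy to confuse here, so a small sketch may help. The error in the question suggests that `summary` is a str rather than a dict; the `text` variable below is a shortened, hand-copied stand-in for such a string (an assumption for illustration, not actual library output). The sketch shows that indexing a string with a key raises TypeError, that json.loads rejects single-quoted text, and that ast.literal_eval can parse a Python-literal string.

```python
import ast
import json

# Shortened stand-in for the single-quoted response text (assumption).
text = "{'awayBattingTotals': {'namefield': 'Totals', 'ab': '33', 'r': '2'}}"

# 1) Indexing a str with a str key raises the TypeError from the question.
try:
    text["awayBattingTotals"]
except TypeError as exc:
    index_error = type(exc).__name__

# 2) json.loads rejects single-quoted keys with a JSONDecodeError.
try:
    json.loads(text)
except json.JSONDecodeError as exc:
    json_error = type(exc).__name__

# 3) A string holding a Python literal can be parsed with ast.literal_eval.
parsed = ast.literal_eval(text)
totals = parsed["awayBattingTotals"]

print(index_error, json_error, totals["ab"])  # TypeError JSONDecodeError 33
```

If the library already offers a call that returns a dict, indexing that result directly avoids any parsing step.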

How to transform index values into columns using Pandas?

I have a dictionary like this:
my_dict = {'RuleSet': {'0': {'RuleSetID': '0',
'RuleSetName': 'Allgemein',
'Rules': [{'RulesID': '10',
'RuleName': 'Gemeinde Seiten',
'GroupHits': '2',
'KeyWordGroups': ['100', '101', '102']}]},
'1': {'RuleSetID': '1',
'RuleSetName': 'Portale Berlin',
'Rules': [{'RulesID': '11',
'RuleName': 'Portale Berlin',
'GroupHits': '4',
'KeyWordGroups': ['100', '101', '102', '107']}]},
'6': {'RuleSetID': '6',
'RuleSetName': 'Zwangsvollstr. Berlin',
'Rules': [{'RulesID': '23',
'RuleName': 'Zwangsvollstr. Berlin',
'GroupHits': '1',
'KeyWordGroups': ['100', '101']}]}}}
When using this code snippet it can be transformed into a dataframe:
rules_pd = pd.DataFrame(my_dict['RuleSet'])
rules_pd
The result has the keys '0', '1' and '6' as columns.
I would like them to be the rows instead, with RuleSetID, RuleSetName and Rules as the columns.
Does anyone know how to tackle this challenge?

Use from_dict with orient='index':
out = pd.DataFrame.from_dict(my_dict['RuleSet'], orient='index')
Out[692]:
RuleSetID ... Rules
0 0 ... [{'RulesID': '10', 'RuleName': 'Gemeinde Seite...
1 1 ... [{'RulesID': '11', 'RuleName': 'Portale Berlin...
6 6 ... [{'RulesID': '23', 'RuleName': 'Zwangsvollstr....
[3 rows x 3 columns]
#out.columns
#Out[693]: Index(['RuleSetID', 'RuleSetName', 'Rules'], dtype='object')

You could try transpose():
rules_pd = pd.DataFrame(my_dict['RuleSet']).transpose()
print(rules_pd)

Finding missing value in JSON using python

I want to separate the records in my dataset that are complete from those that are not.
To do that, I want to add a flag such as 'completed' to each JSON record, as in the example output below.
This is the data that I have:
data=[{'id': 'abc001',
'demo':{'gender':'1',
'job':'6',
'area':'3',
'study':'3'},
'ex_data':{'fam':'small',
'scholar':'2'}},
{'id': 'abc002',
'demo':{'gender':'1',
'edu':'6',
'qual':'3',
'living':'3'},
'ex_data':{'fam':'',
'scholar':''}},
{'id': 'abc003',
'demo':{'gender':'1',
'edu':'6',
'area':'3',
'sal':'3'}
'ex_data':{'fam':'big',
'scholar':NaN}}]
How can I add the flag and also detect NaN and NULL in the JSON?
Desired output:
Output=[{'id': 'abc001',
'completed':'yes',
'demo':{'gender':'1',
'job':'6',
'area':'3',
'study':'3'},
'ex_data':{'fam':'small',
'scholar':'2'}},
{'id': 'abc002',
'completed':'no',
'demo':{'gender':'1',
'edu':'6',
'qual':'3',
'living':'3'},
'ex_data':{'fam':'',
'scholar':''}},
{'id': 'abc003',
'completed':'no',
'demo':{'gender':'1',
'edu':'6',
'area':'3',
'sal':'3'}
'ex_data':{'fam':'big',
'scholar':NaN}}]

Something like this should work for you:
data = [
{
'id': 'abc001',
'demo': {
'gender': '1',
'job': '6',
'area': '3',
'study': '3'},
'ex_data': {'fam': 'small',
'scholar': '2'}
},
{
'id': 'abc002',
'demo': {
'gender': '1',
'edu': '6',
'qual': '3',
'living': '3'},
'ex_data': {'fam': '',
'scholar': ''}},
{
'id': 'abc003',
'demo': {
'gender': '1',
'edu': '6',
'area': '3',
'sal': '3'},
'ex_data': {'fam': 'big',
'scholar': None}
}
]
def browse_dict(dico):
    empty_values = 0
    for key in dico:
        if dico[key] is None or dico[key] == "":
            empty_values += 1
        if isinstance(dico[key], dict):
            for k in dico[key]:
                if dico[key][k] is None or dico[key][k] == "":
                    empty_values += 1
    if empty_values == 0:
        dico["completed"] = "yes"
    else:
        dico["completed"] = "no"

for d in data:
    browse_dict(d)
    print(d)
Output :
{'id': 'abc001', 'demo': {'gender': '1', 'job': '6', 'area': '3', 'study': '3'}, 'ex_data': {'fam': 'small', 'scholar': '2'}, 'completed': 'yes'}
{'id': 'abc002', 'demo': {'gender': '1', 'edu': '6', 'qual': '3', 'living': '3'}, 'ex_data': {'fam': '', 'scholar': ''}, 'completed': 'no'}
{'id': 'abc003', 'demo': {'gender': '1', 'edu': '6', 'area': '3', 'sal': '3'}, 'ex_data': {'fam': 'big', 'scholar': None}, 'completed': 'no'}
Note that I changed NaN to None, because what you show is most likely a Python dictionary rather than a JSON file, since you are using data =
A bare NaN is not a valid name in Python, while a JSON null becomes None once the JSON is loaded.
If you have to convert your JSON to a dictionary, refer to the json module documentation.
Also, please check your dictionary syntax: several commas separating entries are missing.
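If the records really do arrive as JSON text, NaN and null can both show up after parsing. The following is a minimal sketch of that case, assuming the standard json module with its default settings (which accept the NaN token and turn it into a float NaN). The helper names `is_missing` and `is_complete` and the `raw` sample are hypothetical and only serve the illustration.

```python
import json
import math

def is_missing(value):
    """Return True for None, empty strings, and float NaN."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)

def is_complete(record):
    """Recursively check that no value in a (nested) dict is missing."""
    for value in record.values():
        if isinstance(value, dict):
            if not is_complete(value):
                return False
        elif is_missing(value):
            return False
    return True

# Hypothetical JSON text: one NaN, one null, one complete record.
raw = (
    '[{"id": "a", "ex_data": {"fam": "big", "scholar": NaN}},'
    ' {"id": "b", "ex_data": {"fam": "big", "scholar": null}},'
    ' {"id": "c", "ex_data": {"fam": "small", "scholar": "2"}}]'
)
records = json.loads(raw)
for rec in records:
    rec["completed"] = "yes" if is_complete(rec) else "no"

print([rec["completed"] for rec in records])  # ['no', 'no', 'yes']
```

The recursion means deeper nesting is handled without adding another loop level.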

You could try the following.
The input is:
data = [{'demo': {'gender': '1', 'job': '6', 'study': '3', 'area': '3'}, 'id': 'abc001', 'ex_data': {'scholar': '2', 'fam': 'small'}}, {'demo': {'living': '3', 'gender': '1', 'qual': '3', 'edu': '6'}, 'id': 'abc002', 'ex_data': {'scholar': '', 'fam': ''}}, {'demo': {'gender': '1', 'area': '3', 'sal': '3', 'edu': '6'}, 'id': 'abc003', 'ex_data': {'scholar': None, 'fam': 'big'}}]
NaN is not a defined name in Python, so None is used instead.
for item in data:
    item["completed"] = 'yes'
    for key in item.keys():
        if isinstance(item[key], dict):
            # nested dict: any empty or None value marks the record incomplete
            for inner_key in item[key].keys():
                if not item[key][inner_key]:
                    item["completed"] = "no"
                    break
        else:
            if not item[key]:
                item["completed"] = "no"
                break
The Output will be
data = [{'demo': {'gender': '1', 'job': '6', 'study': '3', 'area': '3'}, 'completed': 'yes', 'id': 'abc001', 'ex_data': {'scholar': '2', 'fam': 'small'}}, {'demo': {'living': '3', 'edu': '6', 'qual': '3', 'gender': '1'}, 'completed': 'no', 'id': 'abc002', 'ex_data': {'scholar': '', 'fam': ''}}, {'demo': {'edu': '6', 'gender': '1', 'sal': '3', 'area': '3'}, 'completed': 'no', 'id': 'abc003', 'ex_data': {'scholar': None, 'fam': 'big'}}]

How to remove duplicate elements of, list of dictionaries in python

I have a list of campuses:
campus = [{'id': '1', 'dlin': '1'}, {'id': '2', 'dlin': '1'},{'id': '3', 'dlin': '1'},{'id': '4', 'dlin': '2'},{'id': '5', 'dlin': '2'},{'id': '6', 'dlin': '1'}, ]
Each campus belongs to a school identified by a unique dlin. I want a list of lists in which each inner list holds the campus dictionaries of one school.
I run the code below:
schools = []
for i in campus:
    ls = []
    for j in campus:
        if i['dlin'] == j['dlin']:
            ls.append(j)
            # campus_copy.remove(j)
    schools.append(ls)
[print(item) for item in schools]
The result is:
[{'id': '1', 'dlin': '1'}, {'id': '2', 'dlin': '1'}, {'id': '3', 'dlin': '1'}, {'id': '6', 'dlin': '1'}]
[{'id': '1', 'dlin': '1'}, {'id': '2', 'dlin': '1'}, {'id': '3', 'dlin': '1'}, {'id': '6', 'dlin': '1'}]
[{'id': '1', 'dlin': '1'}, {'id': '2', 'dlin': '1'}, {'id': '3', 'dlin': '1'}, {'id': '6', 'dlin': '1'}]
[{'id': '4', 'dlin': '2'}, {'id': '5', 'dlin': '2'}]
[{'id': '4', 'dlin': '2'}, {'id': '5', 'dlin': '2'}]
[{'id': '1', 'dlin': '1'}, {'id': '2', 'dlin': '1'}, {'id': '3', 'dlin': '1'}, {'id': '6', 'dlin': '1'}]
I have to either remove the duplicate members from schools or modify the code so that I do not get duplicates.
When I try to remove duplicates from schools, I find that dict items are not hashable, so I cannot do it that way.
Two existing questions are somewhat similar to my problem:
Remove duplicates from list of dictionaries within list of dictionaries
Remove duplicate dict in list in Python
However, I cannot figure out what to do.
Does anybody know how to solve this problem?
What I expect to get is:
[{'id': '1', 'dlin': '1'}, {'id': '2', 'dlin': '1'}, {'id': '3', 'dlin': '1'}, {'id': '6', 'dlin': '1'}]
[{'id': '4', 'dlin': '2'}, {'id': '5', 'dlin': '2'}]

One possible solution is to use the dlin as a dictionary key (a dictionary cannot hold the same key twice), rather than removing duplicates explicitly afterwards:
campus = [{'id': '1', 'dlin': '1'}, {'id': '2', 'dlin': '1'},{'id': '3', 'dlin': '1'},{'id': '4', 'dlin': '2'},{'id': '5', 'dlin': '2'},{'id': '6', 'dlin': '1'}, ]
schools = {}
for c in campus:
    schools.setdefault(c['dlin'], []).append(c)
for s in schools.values():
    print(s)
Prints:
[{'id': '1', 'dlin': '1'}, {'id': '2', 'dlin': '1'}, {'id': '3', 'dlin': '1'}, {'id': '6', 'dlin': '1'}]
[{'id': '4', 'dlin': '2'}, {'id': '5', 'dlin': '2'}]
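For the "dict is not hashable" obstacle mentioned in the question, one common workaround is to build a hashable key from each dictionary. The following is a minimal sketch under the assumption that all values are themselves hashable (strings here); the `records` sample is made up for illustration and contains one exact duplicate.

```python
# Sample data with one exact duplicate (assumption for this sketch).
records = [
    {'id': '1', 'dlin': '1'},
    {'id': '2', 'dlin': '1'},
    {'id': '1', 'dlin': '1'},  # duplicate of the first entry
]

seen = set()
unique = []
for rec in records:
    # A dict cannot go into a set, but a tuple of its sorted items can,
    # as long as every value is hashable.
    key = tuple(sorted(rec.items()))
    if key not in seen:
        seen.add(key)
        unique.append(rec)

print(unique)  # [{'id': '1', 'dlin': '1'}, {'id': '2', 'dlin': '1'}]
```

Sorting the items makes the key independent of the order in which the dictionary was built.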

Based on Andrej's answer, I solved another part of my problem and wanted to share it here.
My follow-up question:
I am now dealing with another issue related to the previous one.
I have this list of dictionaries, each holding the information of one campus. Multiple campuses might belong to the same school. I have to distinguish and cluster them based on the similarity of their names.
campus = [
{'id': '1', 'name': 'seneca - york'},
{'id': '2', 'name': 'seneca college - north gate campus'},
{'id': '3', 'name': 'humber college - toronto campus'},
{'id': '4', 'name': 'humber college'},
{'id': '5', 'name': 'humber collge - waterloo campus'},
{'id': '6', 'name': 'university of waterloo toronto campus'},
]
My expected result can be reached with this small, neat piece of code, which groups the campuses by the first four characters of their names:
schools = {}
for c in campus:
    schools.setdefault(c['name'][:4], []).append(c)
print(schools)

How can I loop through a Python list and perform math calculations on elements of the list?

I am attempting to create a contract bridge match point scoring system. In the list below the 1st, 3rd, etc. numbers are the pair numbers (players) and the 2nd, 4th etc. numbers are the scores achieved by each pair. So pair 2 scored 430, pair 3 scored 420 and so on.
I want to loop through the list and score as follows:
For each pair whose score pair 2 beats, pair 2 receives 2 points; for each tie, 1 point; and for each pair that beats them, 0 points. The loop then continues and scores every other pair in the same way. In the example below, pair 2 gets 7 points (beating 3 other pairs and tying with 1), pair 7 gets 0 points, and pair 6 gets 12 points for beating every other pair.
My list (generated from an elasticsearch json object) is:
['2', '430', '3', '420', '4', '460', '5', '400', '7', '0', '1', '430', '6', '480']
The python code I have tried (after multiple variations) is:
nsp_mp = 0
ewp_mp = 0
ns_list = []
for row in arr["hits"]["hits"]:
    nsp = row["_source"]["nsp"]
    nsscore = row["_source"]["nsscore"]
    ns_list.append(nsp)
    ns_list.append(nsscore)
print(ns_list)
x = ns_list[1]
for i in range(6): #number of competing pairs
    if x > ns_list[1::2][i]:
        nsp_mp = nsp_mp + 2
    elif x == ns_list[1::2][i]:
        nsp_mp = nsp_mp
    else:
        nsp_mp = nsp_mp + 1
print(nsp_mp)
which produces:
['2', '430', '3', '420', '4', '460', '5', '400', '7', '0', '1', '430', '6', '480']
7
which as per calculation above is correct. But when I try to execute a loop it does not return the correct results.
Maybe the approach is wrong. What is the correct way to do this?
The elasticsearch json object is:
arr = {'took': 0, 'timed_out': False, '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 7, 'max_score': 1.0, 'hits': [{'_index': 'match', '_type': 'score', '_id': 'L_L122cBjpp4O0gQG0qd', '_score': 1.0, '_source': {'tournament_id': 1, 'board_number': '1', 'nsp': '2', 'ewp': '9', 'contract': '3NT', 'by': 'S', 'tricks': '10', 'nsscore': '430', 'ewscore': '0', 'timestamp': '2018-12-23T16:45:32.896151'}}, {'_index': 'match', '_type': 'score', '_id': 'MPL122cBjpp4O0gQHEog', '_score': 1.0, '_source': {'tournament_id': 1, 'board_number': '1', 'nsp': '3', 'ewp': '10', 'contract': '4S', 'by': 'N', 'tricks': '10', 'nsscore': '420', 'ewscore': '0', 'timestamp': '2018-12-23T16:45:33.027631'}}, {'_index': 'match', '_type': 'score', '_id': 'MfL122cBjpp4O0gQHEqk', '_score': 1.0, '_source': {'tournament_id': 1, 'board_number': '1', 'nsp': '4', 'ewp': '11', 'contract': '3NT', 'by': 'N', 'tricks': '11', 'nsscore': '460', 'ewscore': '0', 'timestamp': '2018-12-23T16:45:33.158060'}}, {'_index': 'match', '_type': 'score', '_id': 'MvL122cBjpp4O0gQHUoj', '_score': 1.0, '_source': {'tournament_id': 1, 'board_number': '1', 'nsp': '5', 'ewp': '12', 'contract': '3NT', 'by': 'S', 'tricks': '10', 'nsscore': '400', 'ewscore': '0', 'timestamp': '2018-12-23T16:45:33.285460'}}, {'_index': 'match', '_type': 'score', '_id': 'NPL122cBjpp4O0gQHkof', '_score': 1.0, '_source': {'tournament_id': 1, 'board_number': '1', 'nsp': '7', 'ewp': '14', 'contract': '3NT', 'by': 'S', 'tricks': '8', 'nsscore': '0', 'ewscore': '50', 'timestamp': '2018-12-23T16:45:33.538710'}}, {'_index': 'match', '_type': 'score', '_id': 'LvL122cBjpp4O0gQGkqt', '_score': 1.0, '_source': {'tournament_id': 1, 'board_number': '1', 'nsp': '1', 'ewp': '8', 'contract': '3NT', 'by': 'N', 'tricks': '10', 'nsscore': '430', 'ewscore': '0', 'timestamp': '2018-12-23T16:45:32.405998'}}, {'_index': 'match', '_type': 'score', '_id': 'M_L122cBjpp4O0gQHUqg', '_score': 1.0, '_source': {'tournament_id': 1, 
'board_number': '1', 'nsp': '6', 'ewp': '13', 'contract': '4S', 'by': 'S', 'tricks': '11', 'nsscore': '480', 'ewscore': '0', 'timestamp': '2018-12-23T16:45:33.411104'}}]}}

A list appears to be a poor data structure for this; I think you are making everything worse by flattening your Elasticsearch object.
Note there are a few minor mistakes in the listings below, to make sure I'm not solving someone's homework for free. I also realize this is not the most efficient way of doing so.
Try with dicts:
1) Convert the Elasticsearch JSON you have to a dict with a better structure:
scores = {}
for row in arr["hits"]["hits"]:
    nsp = row["_source"]["nsp"]
    nsscore = row["_source"]["nsscore"]
    scores[nsp] = nsscore
This will give you something like this:
{'1': '430',
'2': '430',
'3': '420',
'4': '460',
'5': '400',
'6': '480',
'7': '0'}
2) write a function to calculate pair score:
def calculate_score(pair, scores):
    score = 0
    for p in scores:
        if p == pair:
            continue
        if scores[p] < scores[pair]:
            score += 2 # win
        elif scores[p] == scores[pair]:
            score += 1 # tie
    return score
This should give you something like this:
In [13]: calculate_score('1', scores)
Out[13]: 7
In [14]: calculate_score('7', scores)
Out[14]: 0
3) Loop over all pairs, calculating the score of each one. I'll leave this as an exercise.
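For readers who want to see the three steps put together, here is one possible sketch. It assumes the scores have been converted to int (the listing above keeps them as strings) and hard-codes the values from the question instead of reading them from the Elasticsearch response; `match_points` is a name introduced only for this example.

```python
# Scores from the question, converted to int (assumption of this sketch).
scores = {'2': 430, '3': 420, '4': 460, '5': 400, '7': 0, '1': 430, '6': 480}

def calculate_score(pair, scores):
    """Match points for one pair: 2 per pair beaten, 1 per tie."""
    score = 0
    for other, other_score in scores.items():
        if other == pair:
            continue  # a pair is never compared with itself
        if other_score < scores[pair]:
            score += 2
        elif other_score == scores[pair]:
            score += 1
    return score

# Step 3: score every pair.
match_points = {pair: calculate_score(pair, scores) for pair in scores}
print(match_points)
# {'2': 7, '3': 4, '4': 10, '5': 2, '7': 0, '1': 7, '6': 12}
```

This compares every pair with every other pair, which is quadratic in the number of pairs but perfectly adequate for a bridge session.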

The main problem with your code is that the loop is one short: you have 7 entries. You should also convert the numbers to int so that the comparison is correct. In your code, ties get 0 points and losses get 1, which is the wrong way around.
Instead of a list of flattened pairs, you should use tuples, and skip the comparison of a pair with itself.
ns_list = []
for row in arr["hits"]["hits"]:
    nsp = int(row["_source"]["nsp"])
    nsscore = int(row["_source"]["nsscore"])
    ns_list.append((nsp, nsscore))
print(ns_list)
pair, x = ns_list[0]
nsp_mp = 0
for nsp, nsscore in ns_list:
    if nsp == pair:
        continue  # do not score a pair against itself
    if x > nsscore:
        nsp_mp += 2
    elif x == nsscore:
        nsp_mp += 1
print(nsp_mp)
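The reason the int conversion matters can be shown in a few lines. In this sketch the score '1000' is a hypothetical value chosen to expose the problem; it does not appear in the question's data, where the string comparison happens to give the right answer.

```python
# Scores as they arrive from the JSON document: strings.
a, b = '1000', '430'  # '1000' is a made-up score for illustration

# Strings compare character by character, so '1' < '4' decides the result.
string_says_a_wins = a > b

# Integers compare by numeric value.
int_says_a_wins = int(a) > int(b)

print(string_says_a_wins, int_says_a_wins)  # False True
```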

So we can do it like so:
import itertools
d = [(i['_source']['nsp'], i['_source']['nsscore']) for i in arr['hits']['hits']]
d
[('2', '430'),
('3', '420'),
('4', '460'),
('5', '400'),
('7', '0'),
('1', '430'),
('6', '480')]
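c = itertools.combinations(d, 2)
counts = {pair: 0 for pair, _ in d}
for (p1, s1), (p2, s2) in c:
    # each pairing is visited once, so award points to both sides here
    if int(s1) > int(s2):
        counts[p1] += 2
    elif int(s1) < int(s2):
        counts[p2] += 2
    else:
        counts[p1] += 1
        counts[p2] += 1
counts
{'2': 7, '3': 4, '4': 10, '5': 2, '7': 0, '1': 7, '6': 12}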

I first convert your flat list into a dictionary using itertools, then iterate over each key and compare its value with every value in the dictionary, adding points according to the rules you gave.
Since each pair is also compared with itself, which always adds 1 point for the tie, I subtract 1 from the final score. There may be a better approach for this.
import itertools
ls = ['2', '430', '3', '420', '4', '460', '5', '400', '7', '0', '1', '430', '6', '480']
d = dict(itertools.zip_longest(*[iter(ls)] * 2, fillvalue=""))
values = d.values()
for item in d.keys():
    score = 0
    for i in values:
        # compare as numbers, not as strings
        if int(d[item]) > int(i):
            score += 2
        elif int(d[item]) == int(i):
            score += 1
    print(item, ":", score - 1)
Output:
2 : 7
3 : 4
4 : 10
5 : 2
7 : 0
1 : 7
6 : 12
