Python pandas DataFrame groupby error: no 'apply' member

I am using pandas to get the average price and total quantity from a dataset.
The code works, but I get the following message for the line summary_Report = df.groupby('name').apply(f1):
Instance of 'DataFrameGroupBy' has no 'apply' member
import pandas as pd

dataset = [
    {'name': 'A', 'Quantity': 37, 'Price': 10},
    {'name': 'B', 'Quantity': 20, 'Price': 10.5},
    {'name': 'A', 'Quantity': 17, 'Price': 9},
    {'name': 'D', 'Quantity': 19, 'Price': 5},
    {'name': 'E', 'Quantity': 30, 'Price': 6}
]

def f1(x):
    d = {}
    d['Total_Quantity'] = x['Quantity'].sum()
    d['Average_Price'] = (x['Quantity'] * x['Price']).sum() / x['Quantity'].sum()
    return pd.Series(d, index=['Total_Quantity', 'Average_Price'])

df = pd.DataFrame(dataset)
summary_Report = df.groupby('name').apply(f1)
print(summary_Report)
Do you know how to solve this error message?
I am using Visual Studio Code with pandas version 1.1.1, which is the latest version.

The code is running absolutely fine for me, but there are a few potential issues:
You may have used groupby as a variable name and overwritten the pandas method. Restart the kernel and run again, and it should not throw this error anymore.
You may be working with an experimental or outdated version of pandas. Use pip install --upgrade pandas to update to a stable version.
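Also, the "has no ... member" wording suggests the message comes from the VS Code linter (pylint) rather than from Python itself; pylint often cannot infer members that pandas generates dynamically. A minimal workaround, assuming the linter really is pylint, is to suppress the check on that one line:
# pylint: disable=no-member silences the false positive on this line only,
# without changing runtime behavior
summary_Report = df.groupby('name').apply(f1)  # pylint: disable=no-member
Either way, the snippet itself runs cleanly: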
import pandas as pd

dataset = [
    {'name': 'A', 'Quantity': 37, 'Price': 10},
    {'name': 'B', 'Quantity': 20, 'Price': 10.5},
    {'name': 'A', 'Quantity': 17, 'Price': 9},
    {'name': 'D', 'Quantity': 19, 'Price': 5},
    {'name': 'E', 'Quantity': 30, 'Price': 6}
]

def f1(x):
    d = {}
    d['Total_Quantity'] = x['Quantity'].sum()
    d['Average_Price'] = (x['Quantity'] * x['Price']).sum() / x['Quantity'].sum()
    return pd.Series(d, index=['Total_Quantity', 'Average_Price'])

df = pd.DataFrame(dataset)
summary_Report = df.groupby('name').apply(f1)
print(summary_Report)
      Total_Quantity  Average_Price
name
A               54.0       9.685185
B               20.0      10.500000
D               19.0       5.000000
E               30.0       6.000000
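As a side note, the same summary can be computed without apply, using named aggregation (a sketch assuming pandas >= 0.25, where named aggregation was added; the intermediate weighted column is my own):
# per-row weighted price, so the group-level mean can be quantity-weighted
df['weighted'] = df['Quantity'] * df['Price']
summary = df.groupby('name').agg(Total_Quantity=('Quantity', 'sum'),
                                 weighted=('weighted', 'sum'))
summary['Average_Price'] = summary['weighted'] / summary['Total_Quantity']
summary = summary.drop(columns='weighted')
print(summary)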

Related

In List of dicts, find dicts with most values [duplicate]

I have a list of python dicts like this:
[{'name': 'A', 'score': 12},
{'name': 'B', 'score': 20},
{'name': 'C', 'score': 11},
{'name': 'D', 'score': 20},
{'name': 'E', 'score': 9}]
How do I select the three dicts with the highest score values? [D, B, A]
Sort using the score as a key, then take the top 3 elements:
>>> sorted([{'name': 'A', 'score': 12},
... {'name': 'B', 'score': 20},
... {'name': 'C', 'score': 11},
... {'name': 'D', 'score': 20},
... {'name': 'E', 'score': 9}], key=lambda d: d['score'])[-3:]
[{'name': 'A', 'score': 12}, {'name': 'B', 'score': 20}, {'name': 'D', 'score': 20}]
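If the list is large, heapq.nlargest from the standard library avoids a full sort and returns the results highest-first (a sketch on the same data):
>>> import heapq
>>> heapq.nlargest(3, [{'name': 'A', 'score': 12},
...                    {'name': 'B', 'score': 20},
...                    {'name': 'C', 'score': 11},
...                    {'name': 'D', 'score': 20},
...                    {'name': 'E', 'score': 9}], key=lambda d: d['score'])
[{'name': 'B', 'score': 20}, {'name': 'D', 'score': 20}, {'name': 'A', 'score': 12}]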

Correlation in Apache Spark and groupBy with Python

I'm new to Python and Apache Spark, and I'm trying to understand how the function pyspark.sql.functions.corr(val1, val2) works.
I have a big dataframe with auto brand, age, and price. I want to get the correlation between age and price for each auto brand.
I have 2 solutions:
from pyspark.sql import functions as F

# get all brands
get_all_maker = data.groupBy("brand").agg(F.count("*").alias("counts")).collect()
for row in get_all_maker:
    print(row["brand"], ": ", data.filter(data["brand"] == row["brand"]).corr("age", "price"))
This solution is slow, because it calls corr once per brand.
So I tried to do it in one aggregation:
get_all_maker_corr = data.groupBy("brand").agg(
    F.count("*").alias("counts"),
    F.corr("age", "price").alias("correlation")).collect()
for row in get_all_maker_corr:
    print(row["brand"], ": ", row["correlation"])
When I compare the results, they are different. But why?
I tried it with simple examples. Here I generate a simple data frame:
d = [
    {'name': 'a', 'age': 1, 'price': 2},
    {'name': 'a', 'age': 2, 'price': 4},
    {'name': 'b', 'age': 1, 'price': 1},
    {'name': 'b', 'age': 2, 'price': 2}
]
b = spark.createDataFrame(d)
Let's test two methods:
# first version
get_all_maker = b.groupBy("name").agg(F.count("*").alias("counts")).collect()
print("Correlation (1st)")
for row in get_all_maker:
    print(row["name"], "(", row["counts"], "):", b.filter(b["name"] == row["name"]).corr("age", "price"))

# second version
get_all_maker_corr = b.groupBy("name").agg(
    F.count("*").alias("counts"),
    F.corr("age", "price").alias("correlation")).collect()
print("Correlation (2nd)")
for row in get_all_maker_corr:
    print(row["name"], "(", row["counts"], "):", row["correlation"])
Both of them give me the same answer:
Correlation (1st)
b ( 2 ): 1.0
a ( 2 ): 1.0
Let's add another entry to the data frame, with a None value:
d = [
    {'name': 'a', 'age': 1, 'price': 2},
    {'name': 'a', 'age': 2, 'price': 4},
    {'name': 'a', 'age': 3, 'price': None},
    {'name': 'b', 'age': 1, 'price': 1},
    {'name': 'b', 'age': 2, 'price': 2}
]
b = spark.createDataFrame(d)
With the first version you get these results:
Correlation (1st)
b ( 2 ): 1.0
a ( 3 ): -0.5
and the second version gives you different results:
Correlation (2nd)
b ( 2 ): 1.0
a ( 3 ): 1.0
I think that DataFrame.filter followed by corr treats the None value as 0, while DataFrame.groupBy with F.corr inside agg ignores the None value.
So the two methods are not equivalent. I don't know whether this is a bug or a feature of Spark, but if you want to compute correlation values, the data should first be cleaned of None values.
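Following that conclusion, a minimal sketch (reusing the b data frame from above) that drops the null rows up front, so that both methods agree:
# keep only rows where both correlated columns are non-null
clean = b.dropna(subset=["age", "price"])
for row in clean.groupBy("name").agg(F.count("*").alias("counts")).collect():
    # with the nulls gone, filter+corr and F.corr-in-agg give the same value
    print(row["name"], "(", row["counts"], "):", clean.filter(clean["name"] == row["name"]).corr("age", "price"))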

Python find duplicated dicts in list and separate them with counting

I have dicts in a list, and some of the dicts are identical. I want to find the duplicated ones and add them to a new list or dictionary together with a count of how many duplicates there are.
import itertools

myListCombined = list()
for a, b in itertools.combinations(myList, 2):
    is_equal = set(a.items()) - set(b.items())
    if len(is_equal) == 0:
        a.update(count=2)
        myListCombined.append(a)
    else:
        a.update(count=1)
        b.update(count=1)
        myListCombined.append(a)
        myListCombined.append(b)
myListCombined = [i for n, i in enumerate(myListCombined) if i not in myListCombined[n + 1:]]
This code kind of works, but only for 2 duplicated dicts in the list; a.update(count=2) won't work for more than that.
I'm also deleting the duplicated dicts after separating them in the last line, but I'm not sure whether that will work well.
Input:
[{'name': 'Mary', 'age': 25, 'salary': 1000},
{'name': 'John', 'age': 25, 'salary': 2000},
{'name': 'George', 'age': 30, 'salary': 2500},
{'name': 'John', 'age': 25, 'salary': 2000},
{'name': 'John', 'age': 25, 'salary': 2000}]
Desired output:
[{'name': 'Mary', 'age': 25, 'salary': 1000, 'count': 1},
 {'name': 'John', 'age': 25, 'salary': 2000, 'count': 3},
 {'name': 'George', 'age': 30, 'salary': 2500, 'count': 1}]
You could try the following, which first converts each dictionary to a frozenset of (key, value) tuples (so that they are hashable, as required by collections.Counter):
import collections
a = [{'a':1}, {'a':1},{'b':2}]
print(collections.Counter(map(lambda x: frozenset(x.items()),a)))
Edit to reflect your desired input/output:
from copy import deepcopy

def count_duplicate_dicts(list_of_dicts):
    cpy = deepcopy(list_of_dicts)
    for d in list_of_dicts:
        d['count'] = cpy.count(d)
    return list_of_dicts

x = [{'a': 1}, {'a': 1}, {'c': 3}]
print(count_duplicate_dicts(x))
If your dict data is well structured and its contents are simple data types (e.g., numbers and strings), and you have further data analysis to do, I would suggest you use pandas, which provides rich functionality. Here is sample code for your case:
In [32]: import pandas as pd
...: data = [{'name': 'Mary', 'age': 25, 'salary': 1000},
...: {'name': 'John', 'age': 25, 'salary': 2000},
...: {'name': 'George', 'age': 30, 'salary': 2500},
...: {'name': 'John', 'age': 25, 'salary': 2000},
...: {'name': 'John', 'age': 25, 'salary': 2000}]
...:
...: df = pd.DataFrame(data)
...: df['counts'] = 1
...: df = df.groupby(df.columns.tolist()[:-1]).sum().reset_index(drop=False)
...:
In [33]: df
Out[33]:
   age    name  salary  counts
0   25    John    2000       3
1   25    Mary    1000       1
2   30  George    2500       1
In [34]: df.to_dict(orient='records')
Out[34]:
[{'age': 25, 'counts': 3, 'name': 'John', 'salary': 2000},
{'age': 25, 'counts': 1, 'name': 'Mary', 'salary': 1000},
{'age': 30, 'counts': 1, 'name': 'George', 'salary': 2500}]
The logic is:
(1) First build the DataFrame from your data.
(2) groupby runs the aggregate function on each group.
(3) To get back to dicts, call DataFrame.to_dict(orient='records').
pandas is a big package and takes some time to learn, but it is worth knowing: it is powerful enough to make your data analysis much faster and more elegant.
Thanks.
You can try this:
import collections
d = [{'name': 'Mary', 'age': 25, 'salary': 1000},
{'name': 'John', 'age': 25, 'salary': 2000},
{'name': 'George', 'age': 30, 'salary': 2500},
{'name': 'John', 'age': 25, 'salary': 2000},
{'name': 'John', 'age': 25, 'salary': 2000}]
count = dict(collections.Counter([i["name"] for i in d]))
a = list(set(map(tuple, [i.items() for i in d])))
final_dict = [dict(list(i) + [("count", count[dict(i)["name"]])]) for i in a]
Note that this keys the counts on "name" alone, which works here only because records sharing a name are fully identical.
Output:
[{'salary': 2000, 'count': 3, 'age': 25, 'name': 'John'},
 {'salary': 2500, 'count': 1, 'age': 30, 'name': 'George'},
 {'salary': 1000, 'count': 1, 'age': 25, 'name': 'Mary'}]
You can take the count values using collections.Counter and then rebuild the dicts after adding the count value from the Counter to each frozenset:
from collections import Counter
l = [dict(d | {('count', c)})
     for d, c in Counter(frozenset(d.items()) for d in myList).items()]
print(l)
# [{'salary': 1000, 'name': 'Mary', 'age': 25, 'count': 1},
# {'name': 'John', 'salary': 2000, 'age': 25, 'count': 3},
# {'salary': 2500, 'name': 'George', 'age': 30, 'count': 1}]

How to combine values in python list of dictionaries

I have a list of dictionaries that look like this:
l = [{'name': 'john', 'amount': 50}, {'name': 'al', 'amount': 20}, {'name': 'john', 'amount': 80}]
Is there any way to combine/merge the dictionaries with matching name values and also sum their amounts?
You can use a collections.Counter() object to map names to amounts, summing them as you go along:
from collections import Counter

summed = Counter()
for d in l:
    summed[d['name']] += d['amount']
result = [{'name': name, 'amount': amount} for name, amount in summed.most_common()]
The result is then also sorted by amount (highest first):
>>> from collections import Counter
>>> l = [{'name': 'john', 'amount': 50}, {'name': 'al', 'amount': 20}, {'name': 'john', 'amount': 80}]
>>> summed = Counter()
>>> for d in l:
...     summed[d['name']] += d['amount']
...
>>> summed
Counter({'john': 130, 'al': 20})
>>> [{'name': name, 'amount': amount} for name, amount in summed.most_common()]
[{'amount': 130, 'name': 'john'}, {'amount': 20, 'name': 'al'}]
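A minimal variant (my own naming) that preserves first-seen order instead of most_common()'s ordering by total:
merged = {}
for d in l:
    # accumulate amounts per name; plain dicts keep insertion order in Python 3.7+
    merged[d['name']] = merged.get(d['name'], 0) + d['amount']
result = [{'name': name, 'amount': amount} for name, amount in merged.items()]
# [{'name': 'john', 'amount': 130}, {'name': 'al', 'amount': 20}]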

sort a list of dicts by x then by y

I want to sort this info (name, points, and time):
list = [
    {'name': 'JOHN', 'points': 30, 'time': '0:02:2'},
    {'name': 'KARL', 'points': 50, 'time': '0:03:00'}
]
What I want is the list sorted first by points made, then by time played (when points tie, the player with the lower time goes first). Any help?
I'm trying this:
import operator
list.sort(key=operator.itemgetter('points', 'time'))
but I get a TypeError: list indices must be integers, not str.
Your example works for me. I would advise you not to use list as a variable name, since it is a builtin type.
You could try doing something like this also:
list.sort(key=lambda item: (item['points'], item['time']))
edit:
example list:
>>> a = [
... {'name':'JOHN', 'points' : 30, 'time' : '0:02:20'},
... {'name':'LEO', 'points' : 30, 'time': '0:04:20'},
... {'name':'KARL','points':50,'time': '0:03:00'},
... {'name':'MARK','points':50,'time': '0:02:00'},
... ]
descending 'points':
using sort() for in-place sorting:
>>> import pprint
>>> a.sort(key=lambda x: (-x['points'],x['time']))
>>> pprint.pprint(a)
[{'name': 'MARK', 'points': 50, 'time': '0:02:00'},
{'name': 'KARL', 'points': 50, 'time': '0:03:00'},
{'name': 'JOHN', 'points': 30, 'time': '0:02:20'},
{'name': 'LEO', 'points': 30, 'time': '0:04:20'}]
>>>
using sorted to return a sorted list:
>>> pprint.pprint(sorted(a, key=lambda x: (-x['points'],x['time'])))
[{'name': 'MARK', 'points': 50, 'time': '0:02:00'},
{'name': 'KARL', 'points': 50, 'time': '0:03:00'},
{'name': 'JOHN', 'points': 30, 'time': '0:02:20'},
{'name': 'LEO', 'points': 30, 'time': '0:04:20'}]
>>>
ascending 'points':
>>> a.sort(key=lambda x: (x['points'],x['time']))
>>> pprint.pprint(a)
[{'name': 'JOHN', 'points': 30, 'time': '0:02:20'},
{'name': 'LEO', 'points': 30, 'time': '0:04:20'},
{'name': 'MARK', 'points': 50, 'time': '0:02:00'},
{'name': 'KARL', 'points': 50, 'time': '0:03:00'}]
>>>
itemgetter with multiple keys throws this error on Python versions up to 2.4.
If you are stuck on 2.4, you will need to use the lambda:
my_list.sort(key=lambda x: (x['points'], x['time']))
It would be preferable to upgrade to a newer Python if possible.
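One more caveat, since the question's data mixes paddings like '0:02:2' and '0:03:00': these time strings compare lexicographically, so for example '0:02:2' would sort after '0:02:10'. A small sketch (the to_seconds helper is my own) that makes the tie-break numeric:
def to_seconds(t):
    # parse an 'H:MM:SS'-style string into a total number of seconds
    h, m, s = (int(part) for part in t.split(':'))
    return h * 3600 + m * 60 + s

a.sort(key=lambda x: (-x['points'], to_seconds(x['time'])))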
