Correlation in Apache Spark and groupBy with Python

I'm new to Python and Apache Spark, and I'm trying to understand how the function pyspark.sql.functions.corr(val1, val2) works.
I have a big dataframe with auto brand, age and price. I want to get the correlation between age and price for each auto brand.
I have two solutions:
# get all brands
get_all_maker = data.groupBy("brand").agg(F.count("*").alias("counts")).collect()
for row in get_all_maker:
    print(row["brand"], ": ", data.filter(data["brand"] == row["brand"]).corr("age", "price"))
This solution is slow because it calls corr once per brand.
So I tried to do it with a single aggregation:
get_all_maker_corr = data.groupBy("brand").agg(
    F.count("*").alias("counts"),
    F.corr("age", "price").alias("correlation")).collect()
for row in get_all_maker_corr:
    print(row["brand"], ": ", row["correlation"])
When I compare the results, they are different. But why?

I tried it with a simple example. Here I generate a small data frame:
d = [
    {'name': 'a', 'age': 1, 'price': 2},
    {'name': 'a', 'age': 2, 'price': 4},
    {'name': 'b', 'age': 1, 'price': 1},
    {'name': 'b', 'age': 2, 'price': 2}
]
b = spark.createDataFrame(d)
Let's test two methods:
# first version
get_all_maker = b.groupBy("name").agg(F.count("*").alias("counts")).collect()
print("Correlation (1st)")
for row in get_all_maker:
    print(row["name"], "(", row["counts"], "):", b.filter(b["name"] == row["name"]).corr("age", "price"))

# second version
get_all_maker_corr = b.groupBy("name").agg(
    F.count("*").alias("counts"),
    F.corr("age", "price").alias("correlation")).collect()
print("Correlation (2nd)")
for row in get_all_maker_corr:
    print(row["name"], "(", row["counts"], "):", row["correlation"])
Both of them give me the same answer:
Correlation (1st)
b ( 2 ): 1.0
a ( 2 ): 1.0
Let's add another entry with a None value to the data frame:
d = [
    {'name': 'a', 'age': 1, 'price': 2},
    {'name': 'a', 'age': 2, 'price': 4},
    {'name': 'a', 'age': 3, 'price': None},
    {'name': 'b', 'age': 1, 'price': 1},
    {'name': 'b', 'age': 2, 'price': 2}
]
b = spark.createDataFrame(d)
The first version gives these results:
Correlation (1st)
b ( 2 ): 1.0
a ( 3 ): -0.5
and the second version gives different results:
Correlation (2nd)
b ( 2 ): 1.0
a ( 3 ): 1.0
I think that DataFrame.corr applied after filter treats the None value as 0,
while F.corr inside groupBy().agg() ignores rows containing None.
So the two methods are not equivalent. I don't know whether this is a bug or a feature of Spark, but if you want to compute correlations, the data should contain no None values.
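As a practical workaround (a minimal sketch based on the toy data frame above, not taken from the original question), the incomplete rows can be dropped before aggregating, so both approaches see the same input:
from pyspark.sql import functions as F

# Keep only rows where both columns are present, then aggregate once per group.
clean = b.dropna(subset=["age", "price"])
clean.groupBy("name").agg(
    F.count("*").alias("counts"),
    F.corr("age", "price").alias("correlation")).show()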

Related

Python pandas dataframe groupby columns error no apply member

I am using pandas to get the average price and total quantity of a dataset.
The code works, but I get the following error message for the line summary_Report = df.groupby('name').apply(f1):
Instance of 'DataFrameGroupBy' has no 'apply' member
import pandas as pd

dataset = [
    {'name': 'A', 'Quantity': 37, 'Price': 10},
    {'name': 'B', 'Quantity': 20, 'Price': 10.5},
    {'name': 'A', 'Quantity': 17, 'Price': 9},
    {'name': 'D', 'Quantity': 19, 'Price': 5},
    {'name': 'E', 'Quantity': 30, 'Price': 6}
]

def f1(x):
    d = {}
    d['Total_Quantity'] = x['Quantity'].sum()
    d['Average_Price'] = ((x['Quantity'] * x['Price']).sum()) / x['Quantity'].sum()
    return pd.Series(d, index=['Total_Quantity', 'Average_Price'])

df = pd.DataFrame(dataset)
summary_Report = df.groupby('name').apply(f1)
print(summary_Report)
Do you know how to solve this error message?
I am using Visual Studio Code with the latest pandas version, 1.1.1.
The code runs absolutely fine for me, but there are a few potential issues:
You may have used groupby as a variable name and overwritten the pandas method. Restart the kernel and run again, and it should not throw this error any more.
You may be working with an experimental or outdated version of pandas. Use pip install --upgrade pandas to update to a stable version.
import pandas as pd

dataset = [
    {'name': 'A', 'Quantity': 37, 'Price': 10},
    {'name': 'B', 'Quantity': 20, 'Price': 10.5},
    {'name': 'A', 'Quantity': 17, 'Price': 9},
    {'name': 'D', 'Quantity': 19, 'Price': 5},
    {'name': 'E', 'Quantity': 30, 'Price': 6}
]

def f1(x):
    d = {}
    d['Total_Quantity'] = x['Quantity'].sum()
    d['Average_Price'] = ((x['Quantity'] * x['Price']).sum()) / x['Quantity'].sum()
    return pd.Series(d, index=['Total_Quantity', 'Average_Price'])

df = pd.DataFrame(dataset)
summary_Report = df.groupby('name').apply(f1)
print(summary_Report)
      Total_Quantity  Average_Price
name
A               54.0       9.685185
B               20.0      10.500000
D               19.0       5.000000
E               30.0       6.000000
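As a side note, here is a minimal sketch (not part of the original answer) of the same summary computed with agg instead of groupby().apply(); the Weighted helper column is introduced only for this sketch:
import pandas as pd

dataset = [
    {'name': 'A', 'Quantity': 37, 'Price': 10},
    {'name': 'B', 'Quantity': 20, 'Price': 10.5},
    {'name': 'A', 'Quantity': 17, 'Price': 9},
    {'name': 'D', 'Quantity': 19, 'Price': 5},
    {'name': 'E', 'Quantity': 30, 'Price': 6}
]
df = pd.DataFrame(dataset)

# Sum Quantity*Price per group, then divide by the group's total Quantity.
df['Weighted'] = df['Quantity'] * df['Price']
g = df.groupby('name').agg(Total_Quantity=('Quantity', 'sum'),
                           Weighted=('Weighted', 'sum'))
g['Average_Price'] = g['Weighted'] / g['Total_Quantity']
print(g[['Total_Quantity', 'Average_Price']])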

find cells with specific value and replace its value

Using pandas, I have created a CSV file containing 2 columns and saved my data into these columns, something like this:
first                                    second
{'value': 2}                             {'name': 'f'}
{'value': 2}                             {'name': 'h'}
{"value": {"data": {"n": 2, "m":"f"}}}   {'name': 'h'}
...
Is there any way to look for all the rows whose first column contains "data" and, if so, keep only that value in the cell? I mean, is it possible to change my third row from:
{"value": {"data": {"n": 2, "m":"f"}}} {'name': 'h'}
to something like this:
{"data": {"n": 2, "m":"f"}} {'name': 'h'}
and to delete or replace the value of all other cells that do not contain "data" with something like -?
So my CSV file will look like this:
first                          second
-                              {'name': 'f'}
-                              {'name': 'h'}
{"data": {"n": 2, "m":"f"}}    {'name': 'h'}
...
Here is my code:
import json
import pandas as pd

result = []
for line in open('file.json', 'r'):
    result.append(json.loads(line))

df = pd.DataFrame(result)
print(df)
df.to_csv('document.csv')

f = pd.read_csv("document.csv")
keep_col = ['first', 'second']
new_f = f[keep_col]
new_f.to_csv("newFile.csv", index=False)
Here is my short example:
df = pd.DataFrame({
    'first': [{'value': 2}, {'value': 2}, {"value": {"data": {"n": 2, "m": "f"}}}],
    'second': [{'name': 'f'}, {'name': 'h'}, {'name': 'h'}]
})
a = pd.DataFrame(df["first"].tolist())
a[~a["value"].str.contains("data", regex=False).fillna(False)] = "-"
df["first"] = a.value
The first step is to unpack the 'value' field into its own data frame.
Then the new field is checked for the word "data": matching fields become True and all others False; numeric fields give NaN, which is replaced with False. The whole mask is negated and the selected rows are replaced with "-".
The last step is to overwrite the column in the original data frame.
Something like this might work.
import numpy as np
import pandas as pd

first = [{'value': 2}, {'value': 2}, {"value": {"data": {"n": 2, "m": "f"}}}, {"data": {"n": 2, "m": "f"}}]
second = [{'name': 'f'}, {'name': 'h'}, {'name': 'h'}, {'name': 'h'}]
df = pd.DataFrame({'first': first,
                   'second': second})

f = lambda x: x.get('value', x) if isinstance(x, dict) else np.nan
df['first'] = df['first'].apply(f)
df['first'][~df["first"].str.contains("data", regex=False).fillna(False)] = "-"
print(df)
                          first         second
0                             -  {'name': 'f'}
1                             -  {'name': 'h'}
2  {'data': {'n': 2, 'm': 'f'}}  {'name': 'h'}
3  {'data': {'n': 2, 'm': 'f'}}  {'name': 'h'}
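For completeness, a minimal sketch that applies the same masking idea to the CSV round trip from the question (assuming the file and column names used there, and that to_csv has already serialized the dictionaries as plain strings). Note that it keeps a matching cell as-is; stripping the outer "value" wrapper from the string would need an extra parsing step:
import pandas as pd

f = pd.read_csv("document.csv")
# After to_csv, the cells are strings, so str.contains works directly.
mask = f["first"].str.contains("data", regex=False).fillna(False)
f.loc[~mask, "first"] = "-"
f[["first", "second"]].to_csv("newFile.csv", index=False)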

iterable from pandas dataframe

I need to create an iterable of the form (id, {feature name: feature weight}) for use with a Python package.
My data are stored in a pandas dataframe; here is an example:
data = pd.DataFrame({"id": [1, 2, 3],
                     "gender": [1, 0, 1],
                     "age": [25, 23, 40]})
For the {feature name: feature weight} part, I know I can use this:
fe = data.to_dict(orient='records')
Out[28]:
[{'age': 25, 'gender': 1, 'id': 1},
{'age': 23, 'gender': 0, 'id': 2},
{'age': 40, 'gender': 1, 'id': 3}]
I know I can also iterate over the dataframe to get the id, like this:
(row[1] for row in data.itertuples())
But I can't get these two together into one iterable (generator object).
I tried:
((row[1] for row in data.itertuples()), fe[i] for i in range(len(data)))
but the syntax is wrong.
Do you guys know how to do this?
pd.DataFrame.itertuples returns named tuples. You can iterate and convert each row to a dictionary via the purpose-built method _asdict. You can wrap this in a generator function to create a lazy reader:
data = pd.DataFrame({"id": [1, 2, 3],
                     "gender": [1, 0, 1],
                     "age": [25, 23, 40]})

def gen_rows(df):
    for row in df.itertuples(index=False):
        yield row._asdict()

G = gen_rows(data)
print(next(G))  # OrderedDict([('age', 25), ('gender', 1), ('id', 1)])
print(next(G))  # OrderedDict([('age', 23), ('gender', 0), ('id', 2)])
print(next(G))  # OrderedDict([('age', 40), ('gender', 1), ('id', 3)])
Note that the results will be OrderedDict objects; since OrderedDict is a subclass of dict, this should be sufficient for most purposes.
I think you need to first set_index on column id and then call to_dict with orient='index':
fe = data.set_index('id', drop=False).to_dict(orient='index')
print(fe)

{1: {'id': 1, 'gender': 1, 'age': 25},
 2: {'id': 2, 'gender': 0, 'age': 23},
 3: {'id': 3, 'gender': 1, 'age': 40}}
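To get the exact (id, {feature name: feature weight}) pairs the question asks for, with the id removed from the feature dictionary, here is a minimal sketch building on the itertuples approach above (it assumes the column is literally named id, as in the example):
import pandas as pd

data = pd.DataFrame({"id": [1, 2, 3],
                     "gender": [1, 0, 1],
                     "age": [25, 23, 40]})

def id_feature_pairs(df):
    # Lazily yield (id, {feature: value}) for each row.
    for row in df.itertuples(index=False):
        features = dict(row._asdict())
        row_id = features.pop("id")
        yield row_id, features

pairs = id_feature_pairs(data)
print(next(pairs))  # (1, {'gender': 1, 'age': 25})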

List of dictionaries - stack one value of dictionary

I have trouble adding up one value of a dictionary when conditions are met. For example, I have this list of dictionaries:
[{'plu': 1, 'price': 150, 'quantity': 2, 'stock': 5},
{'plu': 2, 'price': 150, 'quantity': 7, 'stock': 10},
{'plu': 1, 'price': 150, 'quantity': 6, 'stock': 5},
{'plu': 1, 'price': 200, 'quantity': 4, 'stock': 5},
{'plu': 2, 'price': 150, 'quantity': 3, 'stock': 10}
]
Then output should look like this:
[{'plu': 1, 'price': 150, 'quantity': 8, 'stock': 5},
{'plu': 1, 'price': 200, 'quantity': 4, 'stock': 5},
{'plu': 2, 'price': 150, 'quantity': 10, 'stock': 10}
]
Quantity should be added only if plu and price are the same; other key/value pairs (e.g. stock) should be ignored. What is the most efficient way to do that?
Edit:
I tried:
import itertools as it

keyfunc = lambda x: x['plu']
groups = it.groupby(sorted(new_data, key=keyfunc), keyfunc)
x = [{'plu': k, 'quantity': sum(x['quantity'] for x in g)} for k, g in groups]
But it groups only on plu, and when I build an HTML table in Django I get only the quantity value; the other fields are empty.
You need to sort/groupby the combined key, not just one key. Easiest/most efficient way to do this is with operator.itemgetter. To preserve an arbitrary stock value, you'll need to use the group twice, so you'll need to convert it to a sequence:
from operator import itemgetter

keyfunc = itemgetter('plu', 'price')
# Unpack key and listify g so it can be reused
groups = ((plu, price, list(g))
          for (plu, price), g in it.groupby(sorted(new_data, key=keyfunc), keyfunc))
x = [{'plu': plu, 'price': price, 'stock': g[0]['stock'],
      'quantity': sum(x['quantity'] for x in g)}
     for plu, price, g in groups]
Alternatively, if stock is guaranteed to be the same for each unique plu/price pair, you can include it in the key to simplify matters, so you don't need to listify the groups:
keyfunc = itemgetter('plu', 'price', 'stock')
groups = it.groupby(sorted(new_data, key=keyfunc), keyfunc)
x = [{'plu': plu, 'price': price, 'stock': stock,
      'quantity': sum(x['quantity'] for x in g)}
     for (plu, price, stock), g in groups]
Optionally, you could create getquantity = itemgetter('quantity') at top level (like the keyfunc) and change sum(x['quantity'] for x in g) to sum(map(getquantity, g)) which pushes work to the C layer in CPython, and can be faster if your groups are large.
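A minimal sketch of that variant, with the sample data from the question filled in (the name new_data comes from the asker's code):
import itertools as it
from operator import itemgetter

new_data = [{'plu': 1, 'price': 150, 'quantity': 2, 'stock': 5},
            {'plu': 2, 'price': 150, 'quantity': 7, 'stock': 10},
            {'plu': 1, 'price': 150, 'quantity': 6, 'stock': 5},
            {'plu': 1, 'price': 200, 'quantity': 4, 'stock': 5},
            {'plu': 2, 'price': 150, 'quantity': 3, 'stock': 10}]

getquantity = itemgetter('quantity')  # defined at top level, like keyfunc
keyfunc = itemgetter('plu', 'price', 'stock')

groups = it.groupby(sorted(new_data, key=keyfunc), keyfunc)
x = [{'plu': plu, 'price': price, 'stock': stock,
      'quantity': sum(map(getquantity, g))}  # map pushes the lookups to the C layer
     for (plu, price, stock), g in groups]
print(x)
# [{'plu': 1, 'price': 150, 'stock': 5, 'quantity': 8},
#  {'plu': 1, 'price': 200, 'stock': 5, 'quantity': 4},
#  {'plu': 2, 'price': 150, 'stock': 10, 'quantity': 10}]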
The other approach is to avoid sorting entirely using collections.Counter (or collections.defaultdict(int), though Counter makes the intent more clear here):
from collections import Counter

grouped = Counter()
for plu, price, stock, quantity in map(itemgetter('plu', 'price', 'stock', 'quantity'), new_data):
    grouped[plu, price, stock] += quantity
then convert back to your preferred form with:
x = [{'plu': plu, 'price': price, 'stock': stock, 'quantity': quantity}
     for (plu, price, stock), quantity in grouped.items()]
This should be faster for large inputs, since it replaces O(n log n) sorting work with O(n) dict operations (which are roughly O(1) cost).
Using pandas will make this a trivial problem:
import pandas as pd

data = [{'plu': 1, 'price': 150, 'quantity': 2, 'stock': 5},
        {'plu': 2, 'price': 150, 'quantity': 7, 'stock': 10},
        {'plu': 1, 'price': 150, 'quantity': 6, 'stock': 5},
        {'plu': 1, 'price': 200, 'quantity': 4, 'stock': 5},
        {'plu': 2, 'price': 150, 'quantity': 3, 'stock': 10}]
df = pd.DataFrame.from_records(data)
# df
#
#    plu  price  quantity  stock
# 0    1    150         2      5
# 1    2    150         7     10
# 2    1    150         6      5
# 3    1    200         4      5
# 4    2    150         3     10

new_df = df.groupby(['plu', 'price', 'stock'], as_index=False).sum()
new_df = new_df[['plu', 'price', 'quantity', 'stock']]  # Optional: reorder the columns
# new_df
#
#    plu  price  quantity  stock
# 0    1    150         8      5
# 1    1    200         4      5
# 2    2    150        10     10
And finally, if you want to, port it back to dicts (though I would argue pandas gives you a lot more functionality for handling the data):
new_data = new_df.to_dict(orient='records')
# new_data
#
# [{'plu': 1, 'price': 150, 'quantity': 8, 'stock': 5},
#  {'plu': 1, 'price': 200, 'quantity': 4, 'stock': 5},
#  {'plu': 2, 'price': 150, 'quantity': 10, 'stock': 10}]

Python dict group and sum multiple values [duplicate]

This question already has answers here:
Group by multiple keys and summarize/average values of a list of dictionaries
(8 answers)
Closed 5 years ago.
I have a set of data in list-of-dicts format, like below:
data = [
    {'name': 'A', 'tea': 5, 'coffee': 6},
    {'name': 'A', 'tea': 2, 'coffee': 3},
    {'name': 'B', 'tea': 7, 'coffee': 1},
    {'name': 'B', 'tea': 9, 'coffee': 4},
]
I'm trying to group by 'name' and sum 'tea' and 'coffee' separately.
The final grouped data must be in this format:
grouped_data = [
    {'name': 'A', 'tea': 7, 'coffee': 9},
    {'name': 'B', 'tea': 16, 'coffee': 5},
]
I tried some steps:
from collections import Counter

c = Counter()
for v in data:
    c[v['name']] += v['tea']

my_data = [{'name': name, 'tea': tea} for name, tea in c.items()]
for e in my_data:
    print(e)
The above step returned the following output:
{'name': 'A', 'tea': 7}
{'name': 'B', 'tea': 16}
I can only sum the key 'tea'; I'm not able to also sum the key 'coffee'. Can you please help me get the grouped_data format?
Using pandas:
df = pd.DataFrame(data)
df

   coffee name  tea
0       6    A    5
1       3    A    2
2       1    B    7
3       4    B    9

g = df.groupby('name', as_index=False).sum()
g

  name  coffee  tea
0    A       9    7
1    B       5   16
And, the final step, df.to_dict:
d = g.to_dict('r')
d
[{'coffee': 9, 'name': 'A', 'tea': 7}, {'coffee': 5, 'name': 'B', 'tea': 16}]
You can try this:
import itertools

data = [
    {'name': 'A', 'tea': 5, 'coffee': 6},
    {'name': 'A', 'tea': 2, 'coffee': 3},
    {'name': 'B', 'tea': 7, 'coffee': 1},
    {'name': 'B', 'tea': 9, 'coffee': 4},
]
final_data = [(a, list(b)) for a, b in itertools.groupby([i.items() for i in data], key=lambda x: dict(x)["name"])]
new_final_data = [{i[0][0]: sum(c[-1] for c in i if isinstance(c[-1], int)) if i[0][0] != "name" else i[0][-1] for i in zip(*b)} for a, b in final_data]
Output:
[{'tea': 7, 'coffee': 9, 'name': 'A'}, {'tea': 16, 'coffee': 5, 'name': 'B'}]
Using pandas, this is pretty easy to do:
import pandas as pd

data = [
    {'name': 'A', 'tea': 5, 'coffee': 6},
    {'name': 'A', 'tea': 2, 'coffee': 3},
    {'name': 'B', 'tea': 7, 'coffee': 1},
    {'name': 'B', 'tea': 9, 'coffee': 4},
]
df = pd.DataFrame(data)
gb = df.groupby(['name']).sum()
gb

      coffee  tea
name
A          9    7
B          5   16
Here's one way to get it into your dict format:
grouped_data = []
for idx in gb.index:
d = {'name': idx}
d = {**d, **{col: gb.loc[idx, col] for col in gb}}
grouped_data.append(d)
grouped_data
Out[15]: [{'coffee': 9, 'name': 'A', 'tea': 7}, {'coffee': 5, 'name': 'B', 'tea': 16}]
But COLDSPEED got the native pandas solution with the as_index=False config...
import pandas as pd

df = pd.DataFrame(data)
df2 = df.groupby('name').sum()
df2.to_dict('r')
Here is a method I created; you can input the key you want to group by:
def group_sum(key, list_of_dicts):
    d = {}
    for dct in list_of_dicts:
        if dct[key] not in d:
            d[dct[key]] = {}
        for k, v in dct.items():
            if k != key:
                if k not in d[dct[key]]:
                    d[dct[key]][k] = v
                else:
                    d[dct[key]][k] += v
    final_list = []
    for k, v in d.items():
        temp_d = {key: k}
        for k2, v2 in v.items():
            temp_d[k2] = v2
        final_list.append(temp_d)
    return final_list

data = [
    {'name': 'A', 'tea': 5, 'coffee': 6},
    {'name': 'A', 'tea': 2, 'coffee': 3},
    {'name': 'B', 'tea': 7, 'coffee': 1},
    {'name': 'B', 'tea': 9, 'coffee': 4},
]
grouped_data = group_sum("name", data)
print(grouped_data)
result:
[{'coffee': 5, 'name': 'B', 'tea': 16}, {'coffee': 9, 'name': 'A', 'tea': 7}]
I guess this would be slower than pandas when summing thousands of dicts, but maybe not; I don't know. It also doesn't maintain order unless you use OrderedDict or Python 3.6+.
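For reference, here is one more minimal sketch (not from any of the answers above) that extends the Counter attempt from the question so that both keys are summed at once:
from collections import Counter, defaultdict

data = [
    {'name': 'A', 'tea': 5, 'coffee': 6},
    {'name': 'A', 'tea': 2, 'coffee': 3},
    {'name': 'B', 'tea': 7, 'coffee': 1},
    {'name': 'B', 'tea': 9, 'coffee': 4},
]

totals = defaultdict(Counter)
for row in data:
    # Counter.update adds counts instead of replacing them.
    totals[row['name']].update(tea=row['tea'], coffee=row['coffee'])

grouped_data = [{'name': name, **counts} for name, counts in totals.items()]
print(grouped_data)
# [{'name': 'A', 'tea': 7, 'coffee': 9}, {'name': 'B', 'tea': 16, 'coffee': 5}]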
