I am trying to learn how to use pyspark.pandas and I am coming across an issue that I don't know how to solve. I have a df of about 700k rows and 7 columns. Here is a sample of my data:
import pyspark.pandas as ps
import pandas as pd
data = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia'],
        'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan'],
        'Product': ['ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF'],
        'Year': [2016, 2018, 2019, 2016, 2017, 2018, 2019, 2016, 2017, 2019],
        'Price': [500, 0, 450, 750, 0, 0, 890, 19, 120, 3],
        'Quantity': [1200, 0, 330, 500, 190, 70, 120, 300, 50, 80],
        'Value': [600000, 0, 148500, 350000, 0, 29100, 106800, 74300, 5500, 20750]}
df = ps.DataFrame(data)
Even when I run the simplest of operations like df.head(), I get the following warning and I'm not sure how to fix it:
WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
I know how to work around this with plain PySpark DataFrames (sketched below), but I'm not sure how to define a partition for the window operation using the pandas API on Spark.
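For context, this is roughly the workaround I mean on a plain PySpark DataFrame: give the window an explicit partitionBy (a sketch using the sample data above; the partition and ordering columns are arbitrary examples):

from pyspark.sql import Window
import pyspark.sql.functions as F

sdf = df.to_spark()  # back to a plain Spark DataFrame

# An explicitly partitioned window avoids the "No Partition Defined" warning
w = Window.partitionBy("Region").orderBy("Year")
sdf.withColumn("row_num", F.row_number().over(w)).show()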
Does anyone have any suggestions?
For Koalas, repartition seems to take only a number of partitions: https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.spark.repartition.html
I think the goal here is to run Pandas functions on a Spark DataFrame. One option you can use is Fugue. Fugue can take a Python function and apply it on Spark per partition. Example code below.
from typing import List, Dict, Any
import pandas as pd

df = pd.DataFrame({"date": ["2021-01-01", "2021-01-02", "2021-01-03"] * 3,
                   "id": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
                   "value": [3, 4, 2, 1, 2, 5, 3, 2, 3]})

def count(df: pd.DataFrame) -> pd.DataFrame:
    # this assumes the data is already partitioned
    id = df.iloc[0]["id"]
    count = df.shape[0]
    return pd.DataFrame({"id": [id], "count": [count]})
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df)
from fugue import transform
# Pandas
pdf = transform(df.copy(),
                count,
                schema="id:str, count:int",
                partition={"by": "id"})
print(pdf.head())
# Spark
transform(sdf,
          count,
          schema="id:str, count:int",
          partition={"by": "id"},
          engine=spark).show()
You just need to annotate your function with input and output types, and then you can use it with the Fugue transform function. Schema is a requirement for Spark, so you need to pass it. If you supply spark as the engine, the execution will happen on Spark; otherwise, it will run on Pandas by default.
Single-level DataFrame:
data1 = {'Sr.No.': Sr_no,
         'CompanyNames': Company_Names,
         'YourChoice1': Your_Choice,
         'YourChoice2': Your_Choice}
df1 = pd.DataFrame(data1, columns=pd.Index(['Sr.No.', 'CompanyNames', 'YourChoice1', 'YourChoice2'], name='key'))
Output of single-level dataframe in csv file:
3-level dataframe:
form = {'I1': {'F1': {'PD': ['1','2','3','4','5','6','7','8','9'],
                      'CD': ['1','2','3','4','5','6','7','8','9']},
               'F2': {'PD': ['1','2','3','4','5','6','7','8','9'],
                      'CD': ['1','2','3','4','5','6','7','8','9']},
               'F3': {'PD': ['1','2','3','4','5','6','7','8','9'],
                      'CD': ['1','2','3','4','5','6','7','8','9']}},
        'I2': {'F1': {'PD': ['1','2','3','4','5','6','7','8','9'],
                      'CD': ['1','2','3','4','5','6','7','8','9']},
               'F2': {'PD': ['1','2','3','4','5','6','7','8','9'],
                      'CD': ['1','2','3','4','5','6','7','8','9']}}}
headers, values, data = CSV_trial.DATA(form)
cols = pd.MultiIndex.from_tuples(headers, names=['ind','field','data'])
df2 = pd.DataFrame(data, columns=cols)
Output of 3-level dataframe in csv file:
I want to merge these dataframes, with df1 on the left and df2 on the right...
Desired Output:
Can anyone help me with this?
An easy way is to transform the single-level df into a 3-level one, then concat two df's of the same structure.
Importing necessary packages:
import pandas as pd
import numpy as np
Creating a native 3-level index. You can read it from a csv, xml, etc.
native_lvl_3_index_tup = [('A', 'foo1', 1), ('A', 'foo2', 3),
                          ('B', 'foo1', 1), ('B', 'foo2', 3),
                          ('C', 'foo1', 1), ('C', 'foo2', 3)]
variables = [33871648, 37253956,
             18976457, 19378102,
             20851820, 25145561]
native_lvl_3_index = pd.MultiIndex.from_tuples(native_lvl_3_index_tup)
Function, converting native single-level index to a 3-level:
def single_to_3_lvl(single_index_list, val_lvl_0, val_lvl_1):
    multiindex_tuple = [(val_lvl_0, val_lvl_1, i) for i in single_index_list]
    return pd.MultiIndex.from_tuples(multiindex_tuple)
Use this function to get an artificial 3-level index:
single_index = [1,2,3,4,5,6]
artificial_multiindex = single_to_3_lvl(single_index,'A','B')
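To see what the helper produced (iterating a MultiIndex yields its tuples):

print(list(artificial_multiindex))
# [('A', 'B', 1), ('A', 'B', 2), ('A', 'B', 3), ('A', 'B', 4), ('A', 'B', 5), ('A', 'B', 6)]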
Creating dataframes, transposing to move multiindex to columns (as in the question):
df1 = pd.DataFrame(variables, artificial_multiindex).T
df2 = pd.DataFrame(variables, native_lvl_3_index).T
I used the same variables in both dataframes. You can control the concatenation by setting join='outer' or join='inner' in pd.concat():
result = pd.concat([df1, df2], axis=1)
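For example, with axis=1 the join parameter controls which row labels are kept; a minimal illustration (both toy frames here share the single row label 0, so outer and inner give the same result):

# join='outer' (the default) keeps the union of row labels;
# join='inner' keeps only the row labels present in both frames
inner_result = pd.concat([df1, df2], axis=1, join='inner')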
The variable result contains the concatenated dataframes. If you have a dataframe with a single-level index, you can reindex it:
single_level_df = pd.DataFrame(single_index, variables)
reindexed = single_level_df.reindex(artificial_multiindex).T
Again, I transpose (.T) to work with columns. It can be set up differently when creating the dataframes.
Hope my answer helped.
I used some code from the link: https://jakevdp.github.io/PythonDataScienceHandbook/03.05-hierarchical-indexing.html
I have a list like this:
a = [{'name': 'xyz', 'inv_name': 'asd', 'quant': 300, 'amt': 20000, 'current': 30000},
     {'name': 'xyz', 'inv_name': 'asd', 'quant': 200, 'amt': 2000, 'current': 3000}]
I fetched this list using itertools.groupby.
I want to add up the quant, amt, and current fields for entries with the same name and inv_name, and form a list like: [{'name': 'xyz', 'inv_name': 'asd', 'quant': 500, 'amt': 22000, 'current': 33000}]
Any suggestions on how to achieve this?
If you are happy using a 3rd party library, pandas accepts a list of dictionaries:
import pandas as pd
a = [{'name': 'xyz', 'inv_name': 'asd', 'quant': 300, 'amt': 20000, 'current': 30000},
     {'name': 'xyz', 'inv_name': 'asd', 'quant': 200, 'amt': 2000, 'current': 3000}]
df = pd.DataFrame(a)
res = df.groupby(['name', 'inv_name'], as_index=False).sum().to_dict(orient='records')
# [{'amt': 22000,
# 'current': 33000,
# 'inv_name': 'asd',
# 'name': 'xyz',
# 'quant': 500}]
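If you'd rather stay in the standard library, the same aggregation can be sketched with itertools.groupby, which the question already mentions; note that groupby only groups consecutive items, so the list must be sorted by the grouping keys first:

from itertools import groupby
from operator import itemgetter

key = itemgetter('name', 'inv_name')
result = []
for (name, inv_name), grp in groupby(sorted(a, key=key), key=key):
    grp = list(grp)
    result.append({'name': name,
                   'inv_name': inv_name,
                   'quant': sum(d['quant'] for d in grp),
                   'amt': sum(d['amt'] for d in grp),
                   'current': sum(d['current'] for d in grp)})
# [{'name': 'xyz', 'inv_name': 'asd', 'quant': 500, 'amt': 22000, 'current': 33000}]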
I have a pandas df containing 'features' for stocks, which looks like this:
I am now trying to create a dictionary with unique sector as key, and a python list of tickers for that unique sector as values, so I end up having something that looks like this:
{'consumer_discretionary': ['AAP',
'AMZN',
'AN',
'AZO',
'BBBY',
'BBY',
'BWA',
'KMX',
'CCL',
'CBS',
'CHTR',
'CMG',
etc.
I could iterate over the pandas df rows to create the dictionary, but I prefer a more pythonic solution. Thus far, this code is a partial solution:
df.set_index('sector')['ticker'].to_dict()
Any feedback is appreciated.
UPDATE:
The solution by @wrwrwr
df.set_index('ticker').groupby('sector').groups
partially works, but it returns a pandas series as the value, instead of a python list. Any ideas about how to transform the pandas series into a python list in the same line and w/o having to iterate the dictionary?
Wouldn't f.set_index('ticker').groupby('sector').groups be what you want?
For example:
import pandas as pd

f = pd.DataFrame({
    'ticker': ('t1', 't2', 't3'),
    'sector': ('sa', 'sb', 'sb'),
    'name': ('n1', 'n2', 'n3')})

groups = f.set_index('ticker').groupby('sector').groups
# {'sa': Index(['t1']), 'sb': Index(['t2', 't3'])}
To ensure that they have the type you want:
{k: list(v) for k, v in f.set_index('ticker').groupby('sector').groups.items()}
or:
f.set_index('ticker').groupby('sector').apply(lambda g: list(g.index)).to_dict()
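Either one-liner yields plain Python lists as values; a quick check with the toy frame above:

print({k: list(v) for k, v in groups.items()})
# {'sa': ['t1'], 'sb': ['t2', 't3']}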
Sometimes, it seems that the more I use Python (and Pandas), the less I understand. So I apologise if I'm just not seeing the wood for the trees here but I've been going round in circles and just can't see what I'm doing wrong.
Basically, I have an example script (that I'd like to implement on a much larger dataframe) but I can't get it to work to my satisfaction.
The dataframe consists of columns of various datatypes. I'd like to group the dataframe on 2 columns and then produce a new dataframe that contains lists of all the unique values for each variable in each group. (Ultimately, I'd like to concatenate the list items into a single string – but that's a different question.)
The initial script I used was:
import numpy as np
import pandas as pd
def tempFuncAgg(tempVar):
    tempList = set(tempVar.dropna())  # Drop NaNs and create set of unique values
    print(tempList)
    return tempList
# Define dataframe
tempDF = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                       'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
                       'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
                       'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
# Groupby based on 2 categorical variables
tempGroupby = tempDF.groupby(['gender','age'])
# Aggregate for each variable in each group using function defined above
dfAgg = tempGroupby.agg(lambda x: tempFuncAgg(x))
print(dfAgg)
The output from this script is as expected: a series of lines containing the sets of values and a dataframe containing the returned sets:
{'09/04/2015 23:03', '21/04/2015 12:59', '06/04/2015 12:34'}
{'01/06/2015 11:09', '12/05/2015 14:19', '27/05/2015 22:31', '19/06/2015 05:37'}
{'15/04/2015 07:12', '19/05/2015 19:22', '06/05/2015 11:12', '04/06/2015 12:57', '15/06/2015 03:23', '12/04/2015 01:00'}
{'02/04/2015 02:34', '10/05/2015 08:52'}
{2, 3, 6}
{18, 11, 13, 14}
{4, 5, 9, 12, 15, 17}
{1, 10}
date \
gender age
female old set([09/04/2015 23:03, 21/04/2015 12:59, 06/04...
young set([01/06/2015 11:09, 12/05/2015 14:19, 27/05...
male old set([15/04/2015 07:12, 19/05/2015 19:22, 06/05...
young set([02/04/2015 02:34, 10/05/2015 08:52])
id
gender age
female old set([2, 3, 6])
young set([18, 11, 13, 14])
male old set([4, 5, 9, 12, 15, 17])
young set([1, 10])
The problem occurs when I try to convert the sets to lists. Bizarrely, it produces 2 duplicated rows containing identical lists but then fails with a 'ValueError: Function does not reduce' error.
def tempFuncAgg(tempVar):
    tempList = list(set(tempVar.dropna()))  # This is the only difference
    print(tempList)
    return tempList
tempDF = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                       'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
                       'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
                       'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
tempGroupby = tempDF.groupby(['gender','age'])
dfAgg = tempGroupby.agg(lambda x: tempFuncAgg(x))
print(dfAgg)
But now the output is:
['09/04/2015 23:03', '21/04/2015 12:59', '06/04/2015 12:34']
['09/04/2015 23:03', '21/04/2015 12:59', '06/04/2015 12:34']
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
...
ValueError: Function does not reduce
Any help to troubleshoot this problem would be appreciated and I apologise in advance if it's something obvious that I'm just not seeing.
EDIT
Incidentally, converting the set to a tuple rather than a list works with no problem.
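For reference, the working tuple variant is a one-line change to the aggregation function (a sketch of the variant described above):

def tempFuncAgg(tempVar):
    # Returning a tuple instead of a list aggregates without errors
    return tuple(set(tempVar.dropna()))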
Lists can sometimes cause weird problems in pandas. You can either:
Use tuples (as you've already noticed), or
If you really need lists, do the conversion in a second operation, like this:
dfAgg = dfAgg.applymap(lambda x: list(x))
Full example:
import numpy as np
import pandas as pd
def tempFuncAgg(tempVar):
    tempList = set(tempVar.dropna())  # Drop NaNs and create set of unique values
    print(tempList)
    return tempList
# Define dataframe
tempDF = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                       'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
                       'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
                       'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
# Groupby based on 2 categorical variables
tempGroupby = tempDF.groupby(['gender','age'])
# Aggregate for each variable in each group using function defined above
dfAgg = tempGroupby.agg(lambda x: tempFuncAgg(x))
# Transform the sets into lists (applymap returns a new frame, so reassign)
dfAgg = dfAgg.applymap(lambda x: list(x))
print(dfAgg)
There are many such bizarre behaviours in pandas; it is generally better to go with a workaround (like this one) than to search for a perfect solution.