I am iterating over a groupby column in a pandas dataframe in Python 3.6 with the help of a for loop. The problem with this is that it becomes slow if I have a lot of data. This is my code:
import pandas as pd
dataDict = {}
for metric, df_metric in frontendFrame.groupby('METRIC'):  # Creates a frame for each metric
    dataDict[metric] = df_metric.to_dict('records')  # Converts the group's dataframe to a list of record dicts
frontendFrame is a dataframe containing two columns: VALUE and METRIC. My end goal is basically to create a dictionary with a key for each metric containing all data connected to it. I know this should be possible to do with lambda or map, but I can't get it working with multiple arguments:
frontendFrame.groupby('METRIC').apply(lambda x: print(x))
How can I solve this and make my script faster?
If you do not need any calculation after the groupby, do not group the data; you can use .loc to get what you need:
s = frontendFrame.METRIC.unique()
frontendFrame.loc[frontendFrame.METRIC == s[0]]
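If the end goal is still a dictionary keyed by metric, a minimal sketch built on the same .loc idea (assuming frontendFrame has the METRIC and VALUE columns from the question):
dataDict = {
    metric: frontendFrame.loc[frontendFrame.METRIC == metric].to_dict('records')
    for metric in frontendFrame.METRIC.unique()
}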
In a complex chained method using pandas, one of the steps is grouping the data by a column and then calculating some metrics. This is a simplified example of the procedure I want to achieve. I have many more assignments in the workflow, but it is failing miserably at this first one.
import pandas as pd
import numpy as np
data = pd.DataFrame({'Group':['A','A','A','B','B','B'],'first':[1,12,4,5,4,3],'last':[5,3,4,5,2,7,]})
data.groupby('Group').assign(average_ratio=lambda x: np.mean(x['first']/x['last']))
>>>> AttributeError: 'DataFrameGroupBy' object has no attribute 'assign'
I know I could use apply this way:
data.groupby('Group').apply(lambda x: np.mean(x['first']/x['last']))
Group
A 1.733333
B 1.142857
dtype: float64
or much better, renaming the column in the same step:
data.groupby('Group').apply(lambda x: pd.Series({'average_ratio':np.mean(x['first']/x['last'])}))
average_ratio
Group
A 1.733333
B 1.142857
Is there any way of using .assign to obtain the same?
To answer your last question: no, for your needs you cannot. DataFrame.assign simply adds new columns or replaces existing ones, and it returns a DataFrame with the same index plus the new/adjusted columns.
You are attempting a grouped aggregation that reduces the rows to group level, thereby changing the index and the DataFrame's granularity from unit level to aggregated group level. Therefore you need to run your groupby operations without assign.
To encapsulate multiple assigned, aggregated columns in a way that fits a chained process, use a defined method and then apply it accordingly:
def aggfunc(grp):
    # grp is the sub-DataFrame for each group, not a single row
    grp['first_mean'] = np.mean(grp['first'])
    grp['last_mean'] = np.mean(grp['last'])
    grp['average_ratio'] = np.mean(grp['first'].div(grp['last']))
    return grp

agg_data = data.groupby('Group').apply(aggfunc)
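If a result with one row per group is acceptable, a sketch using named aggregation (available in pandas 0.25+; the intermediate ratio column name is made up here) keeps everything in one chain:
agg_data = (
    data.assign(ratio=lambda x: x['first'] / x['last'])
        .groupby('Group')
        .agg(first_mean=('first', 'mean'),
             last_mean=('last', 'mean'),
             average_ratio=('ratio', 'mean'))
)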
Venturing into Python using some pandas. I'm trying to do some very simple transformations on a DataFrame. I have read through the docs and am not quite sussing out how this works. I want a simple subtraction on cells in a row, and I want to return a new DataFrame with the computed column. Like so:
def geMeWhatIwant(data):
    # data is a DataFrame organized like index : col1 : col2
    return log(data['col1'] - data['col2'])
It feels like this can be done under the hood without iterating over the DataFrame in my function. If I need to iterate then so be it; I could iterrows through, do the computations, and append to the returned DataFrame. I'm simply looking for the most efficient and elegant way to do this in Python/pandas.
Thanks
This should work:
import numpy as np
np.log(df['col1'] - df['col2'])
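A self-contained sketch that also wraps the result into a new DataFrame, as the question asks (the sample data and the log_diff column name are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [10.0, 20.0, 30.0], 'col2': [1.0, 2.0, 3.0]})

def geMeWhatIwant(data):
    # Vectorised column arithmetic: no row iteration needed
    return pd.DataFrame({'log_diff': np.log(data['col1'] - data['col2'])})

result = geMeWhatIwant(df)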
I am trying to improve a pandas iteration with a purely vectorized approach. I am a little new to vectorization and am having trouble getting it to work.
Within one dataframe field, I am finding all the unique string-based address records. I need to search the dataframe for each unique address individually and assign a single unique identifier to the returned records. In this way, I can have one UID for each address regardless of multiple occurrences in the dataframe.
I have developed an approach that utilizes vectorization with the pandas .apply method.
def addr_id(x):
    global df
    df['Unq_ID'][df['address'] == x] = uuid.uuid4()

pd.DataFrame(df['address'].unique(), columns=["column1"]).apply(lambda x: addr_id(x["column1"]), axis=1)
However, I am trying to do away with the .apply method completely. This is where I am stuck.
df['Unq_ID'][df['address'] == (pd.DataFrame(df['address'].unique(), columns=["column1"]))["column1"]] = uuid.uuid4()
I keep getting a ValueError: Can only compare identically-labeled Series objects
You want to get rid of the Pandas apply due to performance reasons, right?
May I suggest a different approach to your problem?
You can construct a dict with the unique addresses as keys and the uuids as values, and then map it onto the DataFrame:
uuid_dict = {key: uuid.uuid4() for key in df['address'].unique()}
df['Unq_ID'] = df['address'].map(uuid_dict)
This would be very fast because it avoids looping in Python (which Pandas apply does under the hood).
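A minimal self-contained sketch of the idea (the sample addresses here are made up):
import uuid
import pandas as pd

df = pd.DataFrame({'address': ['1 Main St', '2 Oak Ave', '1 Main St']})
uuid_dict = {key: uuid.uuid4() for key in df['address'].unique()}
df['Unq_ID'] = df['address'].map(uuid_dict)  # repeated addresses share the same UID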
I have just started using Databricks/PySpark, with Python/Spark 2.1. I have uploaded data to a table. This table is a single column full of strings. I wish to apply a mapping function to each element in the column. I load the table into a dataframe:
df = spark.table("mynewtable")
The only way I could see others suggesting was to convert it to an RDD, apply the mapping function, and then convert back to a dataframe to show the data. But this throws a "Job aborted due to stage failure" error:
df2 = df.select("_c0").rdd.flatMap(lambda x: x.append("anything")).toDF()
All I want to do is apply some sort of map function to my data in the table. For example, append something to each string in the column, or perform a split on a character, and then put that back into a dataframe so I can .show() or display it.
You cannot use flatMap, because it will flatten the Row. You cannot use append either, because tuple and Row have no append method, and append (where it exists on a collection) is executed for its side effects and returns None.
I would use withColumn:
from pyspark.sql.functions import lit

df.withColumn("foo", lit("anything"))
but map should work as well:
df.select("_c0").rdd.map(lambda x: x + ("anything", )).toDF()
Edit (given the comment):
You probably want a udf:
from pyspark.sql.functions import udf
def iplookup(s):
    return ...  # Some lookup logic

iplookup_udf = udf(iplookup)

df.withColumn("foo", iplookup_udf("_c0"))
Default return type is StringType, so if you want something else you should adjust it.
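For the concrete examples in the question (appending to each string, or splitting on a character), built-in column functions avoid the RDD round trip entirely; a sketch, assuming the column is named "_c0":
from pyspark.sql.functions import concat, lit, split

df.withColumn("appended", concat(df["_c0"], lit("anything"))).show()
df.withColumn("parts", split(df["_c0"], ",")).show()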
I have a data set with columns Dist, Class, and Count.
I want to group that data set by Dist and divide the Count column of each group by the sum of the counts for that group (i.e. normalize it to one).
The following MWE demonstrates my approach thus far. But I wonder: is there a more compact/pandaific way of writing this?
import pandas as pd
import numpy as np
a = np.random.randint(0,4,(10,3))
s = pd.DataFrame(a,columns=['Dist','Class','Count'])
def manipcolumn(x):
    csum = x['Count'].sum()
    x['Count'] = x['Count'].apply(lambda c: c / csum)
    return x

s.groupby('Dist').apply(manipcolumn)
One alternative way to get the normalised 'Count' column could be to use groupby and transform to get the sums for each group and then divide the returned Series by the 'Count' column. You can reassign this Series back to your DataFrame:
s['Count'] = s['Count'] / s.groupby('Dist')['Count'].transform(np.sum)
This avoids the need for a bespoke Python function and the use of apply. Testing it for the small example DataFrame in your question showed that it was around 8 times faster.
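As a quick sanity check (a sketch, not part of the original answer), each group's normalised counts should now sum to 1; groups whose original total was 0 will instead show NaN because of the division by zero:
s.groupby('Dist')['Count'].sum()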