I have been running the code below for about 45 minutes now and it is still going. Can someone please suggest how I can make it faster?
df4 is a pandas DataFrame. A small sample that reproduces its structure looks like this:
df4 = pd.DataFrame({
    'track_id': np.random.randn(3000000),
    'sentiment_score': np.random.choice([0, 1], 3000000),
    'user_id': np.random.choice(['11', '12', '13'], 3000000),
})
What I am aiming to have is a new column called rating.
len(df4.index) is 3,037,321.
ratings = []
for index in df4.index:
    rowUserID = df4['user_id'][index]
    rowTrackID = df4['track_id'][index]
    rowSentimentScore = df4['sentiment_score'][index]
    condition = ((df4['user_id'] == rowUserID) & (df4['sentiment_score'] == rowSentimentScore))
    allRows = df4[condition]
    totalSongListendForContext = len(allRows.index)
    rows = df4[(condition & (df4['track_id'] == rowTrackID))]
    songListendForContext = len(rows.index)
    rating = songListendForContext/totalSongListendForContext
    ratings.append(rating)
Overall, you'll need groupby. You can either:
use two groupby calls with transform to get the size of what you called condition and the size of condition & (df4['track_id'] == rowTrackID), then divide the second by the first:
df4['ratings'] = (df4.groupby(['user_id', 'sentiment_score','track_id'])['track_id'].transform('size')
/ df4.groupby(['user_id', 'sentiment_score'])['track_id'].transform('size'))
Or use groupby with value_counts, passing normalize=True, and merge the result back onto df4:
df4 = df4.merge(df4.groupby(['user_id', 'sentiment_score'])['track_id']
.value_counts(normalize=True)
.rename('ratings').reset_index(),
how='left')
In both cases, you will get the same result as your list ratings (which I assume you wanted as a column). I would say the second option is faster, but it depends on the number of groups you have in your real data.
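If you want to sanity-check that the two approaches agree before running on the full 3 million rows, here is a minimal sketch on a small random frame (column names taken from your loop; the sample values are made up):

import numpy as np
import pandas as pd

# Small reproducible sample with the columns the loop uses.
df4 = pd.DataFrame({
    'track_id': np.random.choice(['a', 'b', 'c'], 10000),
    'sentiment_score': np.random.choice([0, 1], 10000),
    'user_id': np.random.choice(['11', '12', '13'], 10000),
})

# Option 1: two transforms.
r1 = (df4.groupby(['user_id', 'sentiment_score', 'track_id'])['track_id'].transform('size')
      / df4.groupby(['user_id', 'sentiment_score'])['track_id'].transform('size'))

# Option 2: normalized value_counts merged back.
vc = (df4.groupby(['user_id', 'sentiment_score'])['track_id']
         .value_counts(normalize=True)
         .rename('ratings').reset_index())
r2 = df4.merge(vc, how='left')['ratings']

# Both should match up to floating-point noise.
assert np.allclose(r1.to_numpy(), r2.to_numpy())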
I have a Pandas dataframe with ~50,000 rows and I want to randomly select a proportion of rows from that dataframe based on a number of conditions. Specifically, I have a column called 'type of use' and, for each distinct value in that column, I want to select a different proportion of rows.
For instance:
df[df['type of use'] == 'housing'].sample(frac=0.2)
This code returns 20% of all the rows which have 'housing' as their 'type of use'. The problem is I do not know how to do this for the remaining values in a way that is 'idiomatic'. I also do not know how I could take the result of this sampling and form a new dataframe from it.
You can build a list of the unique values in the column with list(df['type of use'].unique()) and iterate over it as below:
for i in list(df['type of use'].unique()):
    print(df[df['type of use'] == i].sample(frac=0.2))
or
i = 0
while i < len(list(df['type of use'].unique())):
    df1 = df[(df['type of use']==list(df['type of use'].unique())[i])].sample(frac=0.2)
    print(df1.head())
    i = i + 1
For storing you can create a dictionary:
dfs = ['df' + str(x) for x in list(df['type of use'].unique())]
dicdf = dict()
i = 0
while i < len(dfs):
    dicdf[dfs[i]] = df[(df['type of use']==list(df['type of use'].unique())[i])].sample(frac=0.2)
    i = i + 1
print(dicdf)
This will print a dictionary of the dataframes. You can then print whichever one you want to see, for example the housing sample: print(dicdf['dfhousing'])
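To get the single new dataframe you asked about, one option is to concatenate the samples stored in the dictionary; a minimal sketch, assuming the dicdf built above:

import pandas as pd

# Combine the per-value samples back into one dataframe.
sampled_df = pd.concat(dicdf.values(), ignore_index=True)
print(sampled_df.shape)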
Sorry this is coming in 2+ years late, but I think you can do this without iterating, based on help I received on a similar question here. Applying it to your data:
import pandas as pd
import math
percentage_to_flag = 0.2 #I'm assuming you want the same %age for all 'types of use'?
#First, create a new 'helper' dataframe:
random_state = 41 # Change to get different random values.
df_sample = df.groupby("type of use").apply(lambda x: x.sample(n=(math.ceil(percentage_to_flag * len(x))),random_state=random_state))
df_sample = df_sample.reset_index(level=0, drop=True) #may need this to simplify multi-index dataframe
# Now, mark the random sample in a new column in the original dataframe:
df["marked"] = False
df.loc[df_sample.index, "marked"] = True
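If you are on pandas 1.1 or newer, groupby followed by sample should also be able to do the per-group sampling in one call (a minimal sketch, reusing the column name from the question and taking a fraction rather than a row count):

# Requires pandas >= 1.1; samples 20% of the rows within each 'type of use' group.
df_sample = df.groupby("type of use").sample(frac=0.2, random_state=41)

# Mark the sampled rows in the original dataframe, as above.
df["marked"] = df.index.isin(df_sample.index)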
I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('Group').apply(my_agg).reset_index(), so now I have DataFrames with information about groups of the previous elements, say:
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group only the first group should be left, while in my_df2_Group only the second group should remain.
As I don't know how to derive my_df1_Group and my_df2_Group programmatically from my_df1 and my_df2 in Python (I know other languages where it would simply be name+"_Group" with name looping over [my_df1, my_df2], but how do you do that in Python?), I build a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max=16
Bad=[]
for Sample in SampleList:
    for n in Sample[1]['Group']:
        # This is inelegant, but trying to work with Sample[1] in the for doesn't work
        df=Sample[0].loc[Sample[0]['Group']==n]
        if (df['Value'].max()>my_max):
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
This runs without errors, but doesn't work. In particular, it doesn't add the column Bad_Row to my DataFrame, nor modify it (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a single DataFrame (i.e. not in a loop), it works.
How should I do this?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion on the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your grouped data frame, it subsets your initial data and uses boolean logic to see if any of the 'Values' in your input data meet your specified criterion.
def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > 16).any():
        return 'Bad Row'
    else:
        return 'Good Row'
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x,grouped_df1), axis=1)
Returns:
   Group  Group_Value   Bad_Row
0      1           57  Good Row
1      2           63   Bad Row
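If you prefer to avoid the row-wise apply entirely, the same check can be expressed with a per-group maximum mapped back onto the aggregated frame; a minimal sketch along those lines (same variable names as above):

# Compute each group's maximum Value once, then map it onto the aggregated frame.
group_max = my_df1.groupby('Group')['Value'].max()
is_bad = my_df1_Group['Group'].map(group_max) > 16
my_df1_Group['Bad_Row'] = is_bad.map({True: 'Bad Row', False: 'Good Row'})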
Based on dubbbdan's idea, here is code that works:
my_max = 16

def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > my_max).any():
        return 1
    else:
        return 0

SampleList = [[my_df1, my_df1_Group], [my_df2, my_df2_Group]]
for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x, grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row'] != 0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis=1, inplace=True)
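For what it's worth, the same filtering can also be written without apply, using a per-group maximum and boolean indexing; a rough sketch under the assumption that every Group in Sample[1] also appears in Sample[0]:

my_max = 16
for Sample in SampleList:
    # True for groups whose maximum element Value exceeds the threshold.
    bad_groups = Sample[0].groupby('Group')['Value'].max() > my_max
    # Keep only rows of the aggregated frame whose Group is not flagged.
    Sample[1] = Sample[1][~Sample[1]['Group'].map(bad_groups)]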
I have 3 dataframes (df1, df2, df3) which are identically structured (# and labels of rows/columns), but populated with different values.
I want to populate df3 based on values in the associated column/rows in df1 and df2. I'm doing this with a FOR loop and a custom function:
for x in range(len(df3.columns)):
    df3.iloc[:, x] = customFunction(x)
I want to populate df3 using this custom IF/ELSE function:
def customFunction(y):
    if df1.iloc[:, y] != 1 and df2.iloc[:, y] == 0:
        return "NEW"
    elif df2.iloc[:, y] == 2:
        return "OLD"
    else:
        return "NEITHER"
I understand why I get an error message when I run this, but I can't figure out how to apply this function to a series. I could do it row by row with more complex code, but I'm hoping there's a more efficient solution. I fear my approach is flawed.
v1 = df1.values
v2 = df2.values
df3.loc[:] = np.where(
(v1 != 1) & (v2 == 0), 'NEW',
np.where(v2 == 2, 'OLD', 'NEITHER'))
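A small worked example of the idea above, with toy frames (the shapes and values here are made up purely for illustration):

import numpy as np
import pandas as pd

# Three identically structured frames.
df1 = pd.DataFrame({'a': [1, 2, 1], 'b': [3, 1, 2]})
df2 = pd.DataFrame({'a': [0, 0, 2], 'b': [2, 5, 0]})
df3 = pd.DataFrame('', index=df1.index, columns=df1.columns)

v1 = df1.values
v2 = df2.values
df3.loc[:] = np.where((v1 != 1) & (v2 == 0), 'NEW',
                      np.where(v2 == 2, 'OLD', 'NEITHER'))
print(df3)
#          a        b
# 0  NEITHER      OLD
# 1      NEW  NEITHER
# 2      OLD      NEW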
Yeah, try to avoid loops in pandas; they're inefficient, and pandas is built to be used with the underlying NumPy vectorization.
You want to use the apply function.
Something like:
df3['new_col'] = df3.apply(lambda x: customFunction(x))
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
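Note that customFunction as posted compares whole columns, so it won't drop straight into apply as-is; a rough sketch of a column-wise variant that should work (reusing the df1/df2/df3 names, and still slower than the full-frame np.where answer above):

import numpy as np
import pandas as pd

def customFunction(col):
    # col is one column of df3; col.name is the shared column label in df1/df2.
    out = np.where((df1[col.name] != 1) & (df2[col.name] == 0), 'NEW',
                   np.where(df2[col.name] == 2, 'OLD', 'NEITHER'))
    return pd.Series(out, index=col.index)

# Apply column by column (the default axis=0).
df3 = df3.apply(customFunction)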
I am quite new to pandas and I have a pandas dataframe of about 500,000 rows filled with numbers. I am using Python 2.x and am currently defining and calling the method shown below on it. It sets a predicted value to be equal to the corresponding value in series 'B' if two adjacent values in series 'A' are the same. However, it is running extremely slowly, at about 5 rows output per second, and I want to find a way to accomplish the same result more quickly.
def myModel(df):
    A_series = df['A']
    B_series = df['B']
    seriesLength = A_series.size
    # Make a new empty column in the dataframe to hold the predicted values
    df['predicted_series'] = np.nan
    # Make a new empty column to store whether or not
    # the prediction matches B
    df['wrong_prediction'] = np.nan
    prev_B = B_series[0]
    for x in range(1, seriesLength):
        prev_A = A_series[x-1]
        prev_B = B_series[x-1]
        # set the predicted value to equal B if A has two equal values in a row
        if A_series[x] == prev_A:
            if df['predicted_series'][x] > 0:
                df['predicted_series'][x] = df['predicted_series'][x-1]
            else:
                df['predicted_series'][x] = B_series[x-1]
Is there a way to vectorize this or just make it run faster? Under the current circumstances, it is projected to take many hours. Should it really be taking this long? It doesn't seem like 500,000 rows should give my program that much trouble.
Something like this should work as you described:
df['predicted_series'] = np.where(A_series.shift() == A_series, B_series, df['predicted_series'])
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B
This will get rid of the for loop and set predicted_series to the value of B when A is equal to previous A.
edit:
per your comment, change your initialization of predicted_series to be all NaN and then forward-fill the values:
df['predicted_series'] = np.nan
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B
df.predicted_series = df.predicted_series.fillna(method='ffill')
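A tiny demonstration of those three lines on toy data (values made up; assumes a default integer index):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 2, 3],
                   'B': [10, 20, 30, 40, 50, 60]})

df['predicted_series'] = np.nan
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B
df.predicted_series = df.predicted_series.fillna(method='ffill')
print(df)
#    A   B  predicted_series
# 0  1  10               NaN
# 1  1  20              20.0
# 2  2  30              20.0
# 3  2  40              40.0
# 4  2  50              50.0
# 5  3  60              50.0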
For the fastest speed, modifying ayhan's answer a bit will perform best:
df['predicted_series'] = np.where(df.A.shift() == df.A, df.B, df['predicted_series'].shift())
That will give you your forward-filled values and run faster than my original recommendation.
Solution
df.loc[df.A == df.A.shift(), 'predicted_series'] = df.B.shift()
I am attempting to perform multiple operations on a large dataframe (~3 million rows).
Using a small test set representative of my data, I've come up with a solution. However, the script runs extremely slowly when using the large dataset as input.
Here is the main loop of the application:
def run():
    df = pd.DataFrame(columns=['CARD_NO','CUSTOMER_ID','MODIFIED_DATE','STATUS','LOYALTY_CARD_ENROLLED'])
    foo = input.groupby('CARD_NO', as_index=False, sort=False)
    for name, group in foo:
        if len(group) == 1:
            df = df.append(group)
        else:
            dates = group['MODIFIED_DATE'].values
            if all_same(dates):
                df = df.append(group[group.STATUS == '1'])
            else:
                df = df.append(group[group.MODIFIED_DATE == most_recent(dates)])
    path = ''
    df.to_csv(path, sep=',', index=False)
The logic is as follows:
For each CARD_NO
- if there is only 1 CARD_NO, add row to new dataframe
- if there are > 1 of the same CARD_NO, check MODIFIED_DATE,
- if MODIFIED_DATEs are different, take the row with most recent date
- if all MODIFIED_DATES are equal, take whichever row has STATUS = 1
The slow-down occurs at each iteration over the groups returned by
input.groupby('CARD_NO', as_index=False, sort=False)
I am currently trying to parallelize the loop by splitting the groups returned by the above statement, but I'm not sure if this is the correct approach...
Am I overlooking a core functionality of Pandas?
Is there a better, more Pandas-esque way of solving this problem?
Any help is greatly appreciated.
Thank you.
Two general tips:
For looping over a groupby object, you can try apply. For example,
grouped = input.groupby('CARD_NO', as_index=False, sort=False)
grouped.apply(example_function)
Here, example_function is called for each group in your groupby object. You could write example_function to append to a data structure yourself, or if it has a return value, pandas will try to concatenate the return values into one dataframe.
Appending rows to dataframes is slow. You might be better off building some other data structure with each iteration of the loop, and then building your dataframe at the end. For example, you could make a list of dicts.
data = []
grouped = input.groupby('CARD_NO', as_index=False, sort=False)
def example_function(row, data_list):
    # 'row' here is actually one group (a DataFrame) passed in by apply.
    row_dict = {}
    row_dict['length'] = len(row)
    row_dict['has_property_x'] = pandas.notnull(row['property_x']).any()
    data_list.append(row_dict)
grouped.apply(example_function, data_list=data)
pandas.DataFrame(data)
I ended up with a significant improvement in running time. Returning the selected rows from the applied function lets pandas assemble the result itself, bringing the run down to about 30 minutes for 3 million rows and avoiding the need for secondary data structures.
def foo(df_of_grouped_data):
    group_length = len(df_of_grouped_data)
    if group_length == 1:
        return df_of_grouped_data
    else:
        dates = df_of_grouped_data['MODIFIED_DATE'].values
        if all_same(dates):
            return df_of_grouped_data[df_of_grouped_data.STATUS == '1']
        else:
            return df_of_grouped_data[df_of_grouped_data.MODIFIED_DATE == most_recent(dates)]
result = card_groups.apply(foo)
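The helpers all_same and most_recent are not shown above; minimal stand-ins along these lines (my guess at their intended behavior) make the snippet runnable, with card_groups being the same grouping used earlier:

def all_same(dates):
    # True if every MODIFIED_DATE in the group is identical (assumed behavior).
    return len(set(dates)) == 1

def most_recent(dates):
    # Latest MODIFIED_DATE in the group (assumed behavior).
    return max(dates)

card_groups = input.groupby('CARD_NO', as_index=False, sort=False)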