I have two columns hours and minutes stored separately in columns: a and b
I want to calculate the sum of both in terms of minutes(DURATION)
To convert the hour into minutes I have used the following:
train['duration']=train.a.apply(lambda x:x*60)
Now I want to add the minutes to the newly created duration column.
So that my final value is duration=(a*60)+b
I am unable to perform this operation using lambda and for loop takes forever to execute In pandas.
You can do it using lambda as follows.
import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
df["sum"] = df.apply(lambda row: row.x + row.y, axis=1)
print(df)
It will give following output:
x y sum
0 1 4 5
1 2 5 7
2 3 6 9
Hope it helps. You can do it using List Comprehension as that will be fast also or as #Karl Olufsen suggested.
Pandas is a very powerful module with neat features. If you want to simply add values of 1 column to values in another respectively you can simply do + operation on the columns.
Here is my example:
import pandas as pd
df = pd.DataFrame({"col1":[1,2,3,4], "col2":[10,20,30,40]})
df["sum of 2 columns"] = df["col1"] + df["col2"]
print(df)
And this is the output:
col1 col2 sum of 2 columns
0 1 10 11
1 2 20 22
2 3 30 33
3 4 40 44
Best would be to use vectorized code which would be faster than apply
df['duration_in_min'] = df['hour'] * 60 + df['min']
The time complexity of various operations from best(fastest) to worst(slowest) is:
Cython procedures < Vectorized code < apply < itertuples < iterrows
Related
fruits.xlsx
I'm trying to apply a function that:
takes the value of each cell in a column divided by the mean of its respective column.
Then create a column called ['Score'] that has the sum of each cell value in a row computed from step 1.
My code so far:
import pandas as pd
df = pd.DataFrame(pd.read_excel('fruits.xlsx'))
def func(column):
out = df[column].values / df[column].mean()
return out
Im really unsure of how to execute this with pandas properly.
Try this one will calculate exactly what you need in one single line:
df['Score'] = df.apply(lambda x: sum([x[i]/df[i].mean() for i in df.columns]),axis=1)
You can do it like this
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['a', 'b', 'c'])
df['score'] = df.div(df.mean()).sum(axis=1)
Output
a b c score
0 1 2 3 1.15
1 4 5 6 3.00
2 7 8 9 4.85
you can make the output as a column in the dataframe
df["Score"] = df[<col_name>] / df[<col_name>].mean()
and you can use
df["Score"] = df[<col_name>].values / df[<col_name>].mean()
I tested both and both gave me the same output in the dataframe
I'm trying to get the correlation between a single column and the rest of the numerical columns of the dataframe, but I'm stuck.
I'm trying with this:
corr = IM['imdb_score'].corr(IM)
But I get the error
operands could not be broadcast together with shapes
which I assume is because I'm trying to find a correlation between a vector (my imdb_score column) with the dataframe of several columns.
How can this be fixed?
The most efficient method it to use corrwith.
Example:
df.corrwith(df['A'])
Setup of example data:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(10, size=(5, 5)), columns=list('ABCDE'))
# A B C D E
# 0 7 2 0 0 0
# 1 4 4 1 7 2
# 2 6 2 0 6 6
# 3 9 8 0 2 1
# 4 6 0 9 7 7
output:
A 1.000000
B 0.526317
C -0.209734
D -0.720400
E -0.326986
dtype: float64
I think you can you just use .corr which returns all correlations between all columns and then select just the column you are interested in.
So, something like
IM.corr()['imbd_score']
should work.
Rather than calculating all correlations and keeping the ones of interest, it can be computationally more efficient to compute the subset of interesting correlations:
import pandas as pd
df = pd.DataFrame()
df['a'] = range(10)
df['b'] = range(10)
df['c'] = range(10)
pd.DataFrame([[c, df['a'].corr(df[c])] for c in df.columns if c!='a'], columns=['var', 'corr'])
Say I have one dataframe
import pandas as pd
input_df = pd.DataFrame(dict(a=[1, 2], b=[2, 3]))
Also I have a function f that maps each row to another dataframe. Here's an example of such a function. Note that in general the function could take any form so I'm not looking for answers that use agg to reimplement the f below.
def f(row):
return pd.DataFrame(dict(x=[row['a'] * row['b'], row['a'] + row['b']],
y=[row['a']**2, row['b']**2]))
I want to create one dataframe that is the concatenation of the function applied to each of the first dataframe's rows. What is the idiomatic way to do this?
output_df = pd.concat([f(row) for _, row in input_df.iterrows()])
I thought I should be able to use apply or similar for this purpose but nothing seemed to work.
x y
0 2 1
1 3 4
0 6 4
1 5 9
You can use DataFrame.agg to calucalate sum and prod and numpy.ndarray.reshape, df.pow(2)/np.sqaure for calculating sqaure.
out = pd.DataFrame({'x': df.agg(['prod', 'sum'],axis=1).to_numpy().reshape(-1),
'y': np.square(df).to_numpy().reshape(-1)})
out
x y
0 2 1
1 3 4
2 6 4
3 5 9
Yoy should avoid iterating rows (How to iterate over rows in a DataFrame in Pandas).
Instead try:
df = df.assign(product=df.a*df.b, sum=df.sum(axis=1),
asq=df.a**2, bsq=df.b**2)
Then:
df = [[[p, s], [asq, bsq]] for p, s, asq, bsq in df.to_numpy()]
I have a situation where I am creating a pivot table in PANDAS where it makes more sense to calculate the fields separately and just use .pivot_table() for the pivot step. However, I am running into some difficultly trying to calculate the denominator for my percentages. Essentially, due to the data format I appear to need to do something like "groupby transform unique sum" on the second line below (which is where I am stuck):
df['numerator'] = df.groupby(['category1','category2'])['customer_id'].transform('nunique')
df['denominator'] = df.groupby(['category2'])['numerator'].nunique().transform('sum')
df['percentage'] = (df['numerator'] / df['denominator'])
df_pivot = df.pivot_table(index='category1',
columns=['category2'],
values=['numerator','percentage']) \
swaplevel(0,1,axis=1)
df_pivot.loc['total', :] = df_pivot.sum().values
My apologies for not being able to provide any dummy data, but I would appreciate any tips if I have hopefully provided enough detail to reason about.
I believe need lambda function with unique and sum:
df = pd.DataFrame({'numerator':[3,1,1,9,2,2],
'category2':list('aaabbb')})
#print (df)
df['denominator']=df.groupby(['category2'])['numerator'].transform(lambda x: x.unique().sum())
Alternative solution with sets and sums:
df['denominator']=df.groupby(['category2'])['numerator'].transform(lambda x: sum(set(x)))
print (df)
category2 numerator denominator
0 a 3 4
1 a 1 4
2 a 1 4
3 b 9 11
4 b 2 11
5 b 2 11
I am trying to transform DataFrame, such that some of the rows will be replicated a given number of times. For example:
df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count':[1,0,2]})
class count
0 A 1
1 B 0
2 C 2
should be transformed to:
class
0 A
1 C
2 C
This is the reverse of aggregation with count function. Is there an easy way to achieve it in pandas (without using for loops or list comprehensions)?
One possibility might be to allow DataFrame.applymap function return multiple rows (akin apply method of GroupBy). However, I do not think it is possible in pandas now.
You could use groupby:
def f(group):
row = group.irow(0)
return DataFrame({'class': [row['class']] * row['count']})
df.groupby('class', group_keys=False).apply(f)
so you get
In [25]: df.groupby('class', group_keys=False).apply(f)
Out[25]:
class
0 A
0 C
1 C
You can fix the index of the result however you like
I know this is an old question, but I was having trouble getting Wes' answer to work for multiple columns in the dataframe so I made his code a bit more generic. Thought I'd share in case anyone else stumbles on this question with the same problem.
You just basically specify what column has the counts in it in and you get an expanded dataframe in return.
import pandas as pd
df = pd.DataFrame({'class 1': ['A','B','C','A'],
'class 2': [ 1, 2, 3, 1],
'count': [ 3, 3, 3, 1]})
print df,"\n"
def f(group, *args):
row = group.irow(0)
Dict = {}
row_dict = row.to_dict()
for item in row_dict: Dict[item] = [row[item]] * row[args[0]]
return pd.DataFrame(Dict)
def ExpandRows(df,WeightsColumnName):
df_expand = df.groupby(df.columns.tolist(), group_keys=False).apply(f,WeightsColumnName).reset_index(drop=True)
return df_expand
df_expanded = ExpandRows(df,'count')
print df_expanded
Returns:
class 1 class 2 count
0 A 1 3
1 B 2 3
2 C 3 3
3 A 1 1
class 1 class 2 count
0 A 1 1
1 A 1 3
2 A 1 3
3 A 1 3
4 B 2 3
5 B 2 3
6 B 2 3
7 C 3 3
8 C 3 3
9 C 3 3
With regards to speed, my base df is 10 columns by ~6k rows and when expanded is ~100,000 rows takes ~7 seconds. I'm not sure in this case if grouping is necessary or wise since it's taking all the columns to group form, but hey whatever only 7 seconds.
There is even a simpler and significantly more efficient solution.
I had to make similar modification for a table of about 3.5M rows, and the previous suggested solutions were extremely slow.
A better way is to use numpy's repeat procedure for generating a new index in which each row index is repeated multiple times according to its given count, and use iloc to select rows of the original table according to this index:
import pandas as pd
import numpy as np
df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
spread_ixs = np.repeat(range(len(df)), df['count'])
spread_ixs
array([0, 2, 2])
df.iloc[spread_ixs, :].drop(columns='count').reset_index(drop=True)
class
0 A
1 C
2 C
This question is very old and the answers do not reflect pandas modern capabilities. You can use iterrows to loop over every row and then use the DataFrame constructor to create new DataFrames with the correct number of rows. Finally, use pd.concat to concatenate all the rows together.
pd.concat([pd.DataFrame(data=[row], index=range(row['count']))
for _, row in df.iterrows()], ignore_index=True)
class count
0 A 1
1 C 2
2 C 2
This has the benefit of working with any size DataFrame.