Multiply two columns by following a pattern using pandas - python

I would like to multiply two columns of a df following a specific pattern, without using a loop. I have this df:
num m d
0 8 5
1 2 3
2 17 8
The idea is to multiply, for each row, the value in 'm' by every value in 'd' except the one with the same 'num'. The resulting df would be:
num1 num2 mult
0 1 8x3 = 24
0 2 8x8 = 64
1 2 2x8 = 16
Is there a way to do that?
Thanks for your help.

You can try:
df = df.set_index('num')
((df[['m']].rename(columns={'m': 'd'}) @ df[['d']].T)
   .rename_axis('num2', axis=1)
   .stack()
   .reset_index(name='mult')
)
Or use broadcasting:
# using the original df, with 'num' still a column
(pd.DataFrame(df['m'].values[:, None] * df['d'].values,
              index=df['num'],
              columns=df['num'].rename('num2'))
   .stack()
   .reset_index(name='mult')
)
num num2 mult
0 0 0 40
1 0 1 24
2 0 2 64
3 1 0 10
4 1 1 6
5 1 2 16
6 2 0 85
7 2 1 51
8 2 2 136

You could use the following (this assumes df2 already holds the num1/num2 pairs and df1 is the original frame indexed by 'num'):
product = df1['m'][df2['num1']].values*df1['d'][df2['num2']].values
df2['mult'] = pd.Series(product, index=df2.index)
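The snippet above assumes df2 (the frame of num1/num2 pairs) already exists. A minimal sketch of how it could be built for the three-row example from the question, keeping only unordered pairs as in the expected output (the construction itself is my assumption, not part of the original answer):
import itertools
import pandas as pd

df1 = pd.DataFrame({'num': [0, 1, 2], 'm': [8, 2, 17], 'd': [5, 3, 8]}).set_index('num')

# every unordered pair of distinct 'num' labels
df2 = pd.DataFrame(list(itertools.combinations(df1.index, 2)), columns=['num1', 'num2'])

product = df1['m'][df2['num1']].values * df1['d'][df2['num2']].values
df2['mult'] = pd.Series(product, index=df2.index)
print(df2)  # mult -> 24, 64, 16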

I'd recommend first creating a frame with all possible permutations of the two columns, then filtering out the rows which don't correspond to the required pattern. Something like this:
df = df.set_index('num')
out = ((df[['m']].rename(columns={'m': 'd'}) @ df[['d']].T)
       .rename_axis('num2', axis=1)
       .stack()
       .reset_index(name='mult')
       )
out[out['num'] != out['num2']]
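A cross join builds the same set of pairs without transposing; a sketch under the assumption that pandas >= 1.2 is available for how='cross' (this variant is mine, not part of the original answer):
import pandas as pd

df = pd.DataFrame({'num': [0, 1, 2], 'm': [8, 2, 17], 'd': [5, 3, 8]})

# all ordered pairs (num1, num2), i.e. the permutations mentioned above
pairs = df[['num', 'm']].merge(df[['num', 'd']], how='cross', suffixes=('1', '2'))
out = (pairs[pairs['num1'] != pairs['num2']]
       .assign(mult=lambda x: x['m'] * x['d'])
       [['num1', 'num2', 'mult']])
# keep only num1 < num2 to reproduce the three rows shown in the question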

Related

Count number of consecutive rows that are greater than current row value but less than the value from other column

Say I have the following sample dataframe (there are about 25k rows in the real dataframe)
df = pd.DataFrame({'A' : [0,3,2,9,1,0,4,7,3,2], 'B': [9,8,3,5,5,5,5,8,0,4]})
df
A B
0 0 9
1 3 8
2 2 3
3 9 5
4 1 5
5 0 5
6 4 5
7 7 8
8 3 0
9 2 4
For column A, I need to know how many of the next and previous rows are greater than the current row's value but less than their corresponding value in column B.
So my expected output is:
A B next count previous count
0 9 2 0
3 8 0 0
2 3 0 1
9 5 0 0
1 5 0 0
0 5 2 1
4 5 1 0
7 8 0 0
3 0 0 2
2 4 0 0
Explanation:
First row: the next count is 2, since the next values 3 and 2 are greater than 0 and less than their corresponding B values 8 and 3.
Second row: the next count is 0, since the next value 2 is not greater than 3.
Third row: the next count is 0, since 9 is greater than 2 but not less than its corresponding B value 5.
The previous count is calculated in the same way, scanning backwards.
Note: I know how to solve this problem by looping with a list comprehension or with the pandas apply method, but I wouldn't mind a clearer and more concise apply approach; ideally I was looking for a more idiomatic, vectorized pandas approach.
My Solution
Here is my apply solution, which I think is inefficient. As people have said, there might be no fully vectorized solution for this question, so a more efficient apply-based solution will also be accepted.
This is what I have tried.
This function gets the number of previous/next rows that satisfy the condition.
def get_prev_next_count(row):
    next_nrow = df.loc[row['index']+1:, ['A', 'B']]
    prev_nrow = df.loc[:row['index']-1, ['A', 'B']][::-1]
    if (next_nrow.size == 0):
        return 0, ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin()
    if (prev_nrow.size == 0):
        return ((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), 0
    return (((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(),
            ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin())
Generating the output:
df[['next count', 'previous count']] = df.reset_index().apply(get_prev_next_count, axis=1, result_type="expand")
Output:
This gives us the expected output:
df
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
I made some optimizations:
You don't need to reset_index(); you can access the index with .name.
If you only pass df[['A']] instead of the whole frame, that may help.
prev_nrow.empty is the same as (prev_nrow.size == 0).
I applied different logic to get the desired value via first_false, which speeds things up significantly.
def first_false(val1, val2, A):
    i = 0
    for x, y in zip(val1, val2):
        if A < x < y:
            i += 1
        else:
            break
    return i

def get_prev_next_count(row):
    A = row['A']
    next_nrow = df.loc[row.name+1:, ['A', 'B']]
    prev_nrow = df2.loc[row.name-1:, ['A', 'B']]
    if next_nrow.empty:
        return 0, first_false(prev_nrow.A, prev_nrow.B, A)
    if prev_nrow.empty:
        return first_false(next_nrow.A, next_nrow.B, A), 0
    return (first_false(next_nrow.A, next_nrow.B, A),
            first_false(prev_nrow.A, prev_nrow.B, A))

df2 = df[::-1].copy()  # Shave a tiny bit of time by only reversing it once~
df[['next count', 'previous count']] = df[['A']].apply(get_prev_next_count, axis=1, result_type='expand')
print(df)
Output:
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
Timing
Expanding the data:
df = pd.concat([df]*(10000//4), ignore_index=True)
# df.shape == (25000, 2)
Original Method:
Gave up at 15 minutes.
New Method:
1m 20sec
Throw pandarallel at it:
from pandarallel import pandarallel
pandarallel.initialize()
df[['A']].parallel_apply(get_prev_next_count, axis=1, result_type='expand')
26sec
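If compiling the loop is acceptable, the same first_false logic can be pushed into numba. This is a sketch of my own (numba is an added dependency, not used in the original answer), operating on the question's A/B columns:
import numba
import numpy as np

@numba.njit
def next_prev_counts(A, B):
    n = len(A)
    nxt = np.zeros(n, dtype=np.int64)
    prv = np.zeros(n, dtype=np.int64)
    for i in range(n):
        # consecutive following rows with A[i] < A[j] < B[j]
        for j in range(i + 1, n):
            if A[i] < A[j] < B[j]:
                nxt[i] += 1
            else:
                break
        # consecutive preceding rows, scanned backwards
        for j in range(i - 1, -1, -1):
            if A[i] < A[j] < B[j]:
                prv[i] += 1
            else:
                break
    return nxt, prv

nxt, prv = next_prev_counts(df['A'].to_numpy(), df['B'].to_numpy())
df['next count'], df['previous count'] = nxt, prv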

pandas - take last N rows from one subgroup

Let's suppose we have a dataframe that can be generated using this code:
import numpy as np
import pandas as pd

d = {'p1': np.random.rand(32),
     'a1': np.random.rand(32),
     'phase': [0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3, 0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3],
     'file_number': [1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1, 2,2,2,2, 2,2,2,2, 2,2,2,2, 2,2,2,2]}
df = pd.DataFrame(d)
For each file number I want to take only the last N rows of phase number 3, so that the result for N == 2 looks like this:
Currently I'm doing it in this way:
def phase_3_last_n_observations(df, n):
    result = []
    for fn in df['file_number'].unique():
        file_df = df[df['file_number'] == fn]
        for phase in [0, 1, 2, 3]:
            phase_df = file_df[file_df['phase'] == phase]
            if phase == 3:
                phase_df = phase_df[-n:]
            result.append(phase_df)
    df = pd.concat(result, axis=0)
    return df
phase_3_last_n_observations(df, 2)
However, it is very slow and I have terabytes of data, so I need to worry about performance. Does anyone have any idea how to speed my solution up? Thanks!
Filter the rows where phase is 3, then groupby file_number and use tail to select the last two rows per group; finally append them to the remaining rows and sort by index to get the result:
m = df['phase'].eq(3)
df[~m].append(df[m].groupby('file_number').tail(2)).sort_index()
p1 a1 phase file_number
0 0.223906 0.164288 0 1
1 0.214081 0.748598 0 1
2 0.567702 0.226143 0 1
3 0.695458 0.567288 0 1
4 0.760710 0.127880 1 1
5 0.592913 0.397473 1 1
6 0.721191 0.572320 1 1
7 0.047981 0.153484 1 1
8 0.598202 0.203754 2 1
9 0.296797 0.614071 2 1
10 0.961616 0.105837 2 1
11 0.237614 0.640263 2 1
14 0.500415 0.220355 3 1
15 0.968630 0.351404 3 1
16 0.065283 0.595144 0 2
17 0.308802 0.164214 0 2
18 0.668811 0.826478 0 2
19 0.888497 0.186267 0 2
20 0.199129 0.241900 1 2
21 0.345185 0.220940 1 2
22 0.389895 0.761068 1 2
23 0.343100 0.582458 1 2
24 0.182792 0.245551 2 2
25 0.503181 0.894517 2 2
26 0.144294 0.351350 2 2
27 0.157116 0.847499 2 2
30 0.194274 0.143037 3 2
31 0.542183 0.060485 3 2
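Note that DataFrame.append was removed in pandas 2.0; on recent versions the same idea can be written with pd.concat (a small adaptation, not part of the original answer):
import pandas as pd

m = df['phase'].eq(3)
out = pd.concat([df[~m], df[m].groupby('file_number').tail(2)]).sort_index()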
I use the idea from a deleted answer: get the indices of all but the last N phase-3 rows per file_number with GroupBy.cumcount, and remove them with DataFrame.drop:
def phase_3_last_n_observations(df, N):
    df1 = df[df['phase'].eq(3)]
    idx = df1[df1.groupby('file_number').cumcount(ascending=False).ge(N)].index
    return df.drop(idx)

# the index is reset to the default first, because it is used to remove rows
df = phase_3_last_n_observations(df.reset_index(drop=True), 2)
As an alternative to what already exists: you can calculate the last elements for all phase groups and afterwards just use .loc to get the needed group. I have written the code for N == 2; for N == 3, use [-1, -2, -3]. Note that this groups by 'phase' only, so it mixes the two files; to respect file_number you would group by ['file_number', 'phase'] instead.
result = df.groupby(['phase']).nth([-1, -2])
PHASE = 3
result.loc[PHASE]

Is there a way to avoid while loops using pandas in order to speed up my code?

I'm writing code to merge several dataframes together using pandas.
Here is my first table:
Index Values Intensity
1 11 98
2 12 855
3 13 500
4 24 140
and here is the second one:
Index Values Intensity
1 21 1000
2 11 2000
3 24 0.55
4 25 500
With these two dfs, I concatenate and drop_duplicates on the Values column, which gives me the following df:
Index Values Intensity_df1 Intensity_df2
1 11 0 0
2 12 0 0
3 13 0 0
4 24 0 0
5 21 0 0
6 25 0 0
I would like to recover the intensity of each value in each dataframe. For this purpose, I'm iterating through each line of each df, which is very inefficient. Here is the code I use:
m = 0
while m < len(num_df):
    n = 0
    while n < len(df3):
        temp_intens_abs = df[m]['Intensity'][df3['Values'][n] == df[m]['Values']]
        if temp_intens_abs.empty:
            merged.at[n, "Intensity_df%s" % df[m]] = 0
        else:
            merged.at[n, "Intensity_df%s" % df[m]] = pandas.to_numeric(temp_intens_abs, errors='coerce')
        n = n + 1
    m = m + 1
The resulting df3 looks like this at the end:
Index Values Intensity_df1 Intensity_df2
1 11 98 2000
2 12 855 0
3 13 500 0
4 24 140 0.55
5 21 0 1000
6 25 0 500
My question is: is there a way to directly recover "present" values in a df by comparing two columns using pandas? I've tried several solutions using numpy but without success. Thanks in advance for your help.
You can try joining these dataframes: df3 = df1.merge(df2, on="Values")
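To reproduce the expected output above, which keeps values present in only one frame and fills the gaps with 0, the join can be made outer. A minimal sketch assuming the two example frames from the question (this expands on the answer and is not part of it):
import pandas as pd

df1 = pd.DataFrame({'Values': [11, 12, 13, 24],
                    'Intensity': [98, 855, 500, 140]})
df2 = pd.DataFrame({'Values': [21, 11, 24, 25],
                    'Intensity': [1000, 2000, 0.55, 500]})

# outer join keeps every value; suffixes name the two intensity columns
df3 = (df1.merge(df2, on='Values', how='outer', suffixes=('_df1', '_df2'))
          .fillna(0))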

Pandas: return the occurrences of the most frequent value for each group (possibly without apply)

Let's assume the input dataset:
test1 = [[0,7,50], [0,3,51], [0,3,45], [1,5,50],[1,0,50],[2,6,50]]
df_test = pd.DataFrame(test1, columns=['A','B','C'])
that corresponds to:
A B C
0 0 7 50
1 0 3 51
2 0 3 45
3 1 5 50
4 1 0 50
5 2 6 50
I would like to obtain a dataset grouped by 'A', together with the most common value of 'B' in each group and the number of occurrences of that value:
A most_freq freq
0 3 2
1 5 1
2 6 1
I can obtain the first 2 columns with:
grouped = df_test.groupby("A")
out_df = pd.DataFrame(index=grouped.groups.keys())
out_df['most_freq'] = df_test.groupby('A')['B'].apply(lambda x: x.value_counts().idxmax())
but I am having problems with the last column.
Also: is there a faster way that doesn't involve apply? This solution doesn't scale well with larger inputs (I also tried dask).
Thanks a lot!
Use SeriesGroupBy.value_counts, which sorts by count in descending order by default, then add DataFrame.drop_duplicates after Series.reset_index to keep only the top value per group:
df = (df_test.groupby('A')['B']
             .value_counts()
             .rename_axis(['A', 'most_freq'])
             .reset_index(name='freq')
             .drop_duplicates('A'))
print(df)
A most_freq freq
0 0 3 2
2 1 0 1
4 2 6 1
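For completeness, an apply-free variant built on DataFrame.value_counts (requires pandas >= 1.1); this sketch is mine, not part of the original answer, and ties between equally frequent values may resolve differently:
out = (df_test.value_counts(['A', 'B'])
              .reset_index(name='freq')
              .drop_duplicates('A')
              .rename(columns={'B': 'most_freq'})
              .sort_values('A', ignore_index=True))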

Python pandas, multiindex, slicing

I have got a pd.DataFrame
Time Value
a 1 1 1
2 2 5
3 5 7
b 1 1 5
2 2 9
3 10 11
I want to multiply the column Value with the column Time - Time(t-1) and write the result to a column Product, starting with row b, but separately for each top-level index.
For example, Product('1','b') should be (Time('1','b') - Time('1','a')) * Value('1','b'). To do this, I would need a "shifted" version of column Time "starting" at row b, so that I could do df["Product"] = (df["Time"].shifted - df["Time"]) * df["Value"]. The result should look like this:
Time Value Product
a 1 1 1 0
2 2 5 5
3 5 7 21
b 1 1 5 0
2 2 9 9
3 10 11 88
This should do it:
>>> time_shifted = df['Time'].groupby(level=0).apply(lambda x: x.shift())
>>> df['Product'] = ((df.Time - time_shifted)*df.Value).fillna(0)
>>> df
Time Value Product
a 1 1 1 0
2 2 5 5
3 5 7 21
b 1 1 5 0
2 2 9 9
3 10 11 88
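As a side note (my addition, not part of the original answer), groupby objects expose shift directly, so the apply can be avoided:
>>> time_shifted = df['Time'].groupby(level=0).shift()
>>> df['Product'] = ((df['Time'] - time_shifted) * df['Value']).fillna(0)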
Hey this should do what you need it to. Comment if I missed anything.
import pandas as pd
import numpy as np

df = pd.DataFrame({'Time': [1, 2, 5, 1, 2, 10], 'Value': [1, 5, 7, 5, 9, 11]},
                  index=[['a', 'a', 'a', 'b', 'b', 'b'], [1, 2, 3, 1, 2, 3]])

def product(x):
    x['Product'] = (x['Time'] - x.shift()['Time']) * x['Value']
    return x

df = df.groupby(level=0).apply(product)
df['Product'] = df['Product'].replace(np.nan, 0)
print(df)
