I have a data frame of roughly 6 million rows, which I need to repeatedly analyse for simulations. The following is a very simple representation of the data.
For rows where Action = 1, I am trying to devise an efficient way to do this:

For index, row in df.iterrows():
    Result = the first later row where (Price2 >= row.Price1 + 4) and index > row.index,
    or, if no such row exists,
    return index + 100 (i.e. the activity times out).
import pandas as pd
df = pd.DataFrame({'Action(y/n)' : [0,1,0,0,1,0,1,0,0,0], 'Price1' : [1,8,3,1,7,3,8,2,3,1], 'Price2' : [2,1,1,5,3,1,2,11,12,1]})
print(df)
Action(y/n) Price1 Price2
0 0 1 2
1 1 8 1
2 0 3 1
3 0 1 5
4 1 7 3
5 0 3 1
6 1 8 2
7 0 2 11
8 0 3 12
9 0 1 1
Resulting in something like this:
Action(y/n) Price1 Price2 ExitRow(IndexOfRowWhereCriteriaMet)
0 0 14 2 9
1 1 8 1 8
2 0 3 1 102
3 0 1 5 103
4 1 7 3 7
5 0 3 1 105
6 1 8 2 8
7 0 2 11 107
8 0 3 12 108
9 0 1 1 109
I have tried a few methods, which are all really slow. The best one maps a function over the index, but it is really not fast enough.
TimeOut = 100

def ATestFcn(dfIx, dfPrice1):
    # first later row inside the timeout window where Price2 >= Price1 + 4
    ExitRow = df[(df.Price2 >= (dfPrice1 + 4)) & (df.index > dfIx) & (df.index <= dfIx + TimeOut)].index.min()
    if pd.isnull(ExitRow):
        return dfIx + TimeOut
    else:
        return ExitRow

df['ExitRow'] = list(map(ATestFcn, df.index, df.Price1))
I also tested this with a loop; it was about 25% slower, but conceptually it was essentially the same.
I'm thinking there must be a smarter or faster way to do this. A mask could have been useful, except that you can't simply fill down with this data, since the matching Price2 for one row might be thousands of rows after the matching Price2 for another row, and I can't find a way to turn a merge into something like a cross apply in T-SQL.
To find the index of the first row which meets your criterion, you could use:
cur_row_idx = 100 # we want the row after 100
next_row_idx = (df[cur_row_idx:].Price2 >= df[cur_row_idx:].Price1 + 4).argmax()
Then you want to set a cutoff, say the maximum value you can get is cur_row_idx + TimeOut, so it could be:
import numpy as np

next_row_idx = np.min(((df[cur_row_idx:].Price2 >= df[cur_row_idx:].Price1 + 4).argmax(), cur_row_idx + TimeOut))
(Note that in recent pandas versions Series.argmax returns the integer position within the slice rather than the row label, so you may want .idxmax() there instead.)
I did not check the performance on large datasets, but I hope it helps.
If you wish, you also can wrap it into a function:
def ATestFcn(dfIx, df, TimeOut):
    return np.min(((df[dfIx:].Price2 >= df[dfIx:].Price1 + 4).argmax(), dfIx + TimeOut))
Edit: I just tested it; it is quite fast, see the results below:
import numpy as np

df = pd.DataFrame()
Price1 = np.random.randint(100, size=10_000_000)
Price2 = np.random.randint(100, size=10_000_000)
df["Price1"] = Price1
df["Price2"] = Price2
timeit ATestFcn(np.random.randint(1e6), df, 100)
Out[62]: 1 loops, best of 3: 289 ms per loop
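For completeness, here is a minimal end-to-end sketch of the task as stated in the question (my own, not the answerer's code), using the column names from the question and assuming TimeOut = 100. For each row where Action(y/n) == 1 it searches only within the timeout window for the first later row whose Price2 >= Price1 + 4, and falls back to index + TimeOut otherwise.

import numpy as np
import pandas as pd

df = pd.DataFrame({'Action(y/n)': [0, 1, 0, 0, 1, 0, 1, 0, 0, 0],
                   'Price1': [1, 8, 3, 1, 7, 3, 8, 2, 3, 1],
                   'Price2': [2, 1, 1, 5, 3, 1, 2, 11, 12, 1]})
TimeOut = 100

def exit_row(ix, price1):
    # look only at rows after ix, inside the timeout window
    window = df['Price2'].iloc[ix + 1: ix + 1 + TimeOut]
    hits = window[window >= price1 + 4]
    return hits.index[0] if len(hits) else ix + TimeOut  # first hit, or time out

df['ExitRow'] = [exit_row(ix, p1) if act == 1 else ix + TimeOut
                 for ix, act, p1 in zip(df.index, df['Action(y/n)'], df['Price1'])]
print(df)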
Related
Say I have the following sample dataframe (there are about 25k rows in the real dataframe)
df = pd.DataFrame({'A' : [0,3,2,9,1,0,4,7,3,2], 'B': [9,8,3,5,5,5,5,8,0,4]})
df
A B
0 0 9
1 3 8
2 2 3
3 9 5
4 1 5
5 0 5
6 4 5
7 7 8
8 3 0
9 2 4
For column A, I need to know how many of the following and preceding rows (counting consecutively and stopping at the first row that fails the condition) have an A value greater than the current row's A but less than their own B value.
So my expected output is:
A B next count previous count
0 9 2 0
3 8 0 0
2 3 0 1
9 5 0 0
1 5 0 0
0 5 2 1
4 5 1 0
7 8 0 0
3 0 0 2
2 4 0 0
Explanation:
First row: the next count is 2, since the next values 3 and 2 are greater than 0 and less than their corresponding B values 8 and 3.
Second row: the next count is 0, since the next value 2 is not greater than 3.
Third row: the next count is 0, since 9 is greater than 2 but not less than its corresponding B value.
The previous counts are calculated in the same way.
Note: I know how to solve this problem by looping with a list comprehension or the pandas apply method, but I wouldn't mind a clearer and more concise apply approach. I was really looking for a more pandas-idiomatic (vectorized) approach.
My Solution
Here is my apply solution, which I think is inefficient. Also, as people have said, there may be no vectorized solution to this question, so a more efficient apply-based solution will also be accepted.
This is what I have tried.
This function gets the number of previous/next rows that satisfy the condition.
def get_prev_next_count(row):
    next_nrow = df.loc[row['index']+1:, ['A', 'B']]
    prev_nrow = df.loc[:row['index']-1, ['A', 'B']][::-1]
    if (next_nrow.size == 0):
        return 0, ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin()
    if (prev_nrow.size == 0):
        return ((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), 0
    return (((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(),
            ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin())
Generating output :
df[['next count', 'previous count']] = df.reset_index().apply(get_prev_next_count, axis=1, result_type="expand")
Output :
This gives us the expected output
df
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
I made some optimizations:
You don't need to reset_index(); you can access the index with .name
If you only pass df[['A']] instead of the whole frame, that may help.
prev_nrow.empty is the same as (prev_nrow.size == 0)
I applied different logic to get the desired value via first_false; this speeds things up significantly.
def first_false(val1, val2, A):
    # count consecutive rows where A < val1 < val2, stopping at the first failure
    i = 0
    for x, y in zip(val1, val2):
        if A < x < y:
            i += 1
        else:
            break
    return i

def get_prev_next_count(row):
    A = row['A']
    next_nrow = df.loc[row.name+1:, ['A', 'B']]
    prev_nrow = df2.loc[row.name-1:, ['A', 'B']]
    if next_nrow.empty:
        return 0, first_false(prev_nrow.A, prev_nrow.B, A)
    if prev_nrow.empty:
        return first_false(next_nrow.A, next_nrow.B, A), 0
    return (first_false(next_nrow.A, next_nrow.B, A),
            first_false(prev_nrow.A, prev_nrow.B, A))
df2 = df[::-1].copy() # Shave a tiny bit of time by only reversing it once~
df[['next count', 'previous count']] = df[['A']].apply(get_prev_next_count, axis=1, result_type='expand')
print(df)
Output:
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
Timing
Expanding the data:
df = pd.concat([df]*(10000//4), ignore_index=True)
# df.shape == (25000, 2)
Original Method:
Gave up at 15 minutes.
New Method:
1m 20sec
Throw pandarallel at it:
from pandarallel import pandarallel
pandarallel.initialize()
df[['A']].parallel_apply(get_prev_next_count, axis=1, result_type='expand')
26sec
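One small note from me (an assumption on my part, since the timing only shows the call itself): to keep the parallel result you would assign it back just as before.

df[['next count', 'previous count']] = df[['A']].parallel_apply(get_prev_next_count, axis=1, result_type='expand')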
I'm writing code to merge several dataframes together using pandas.
Here is my first table:
Index Values Intensity
1 11 98
2 12 855
3 13 500
4 24 140
and here is the second one:
Index Values Intensity
1 21 1000
2 11 2000
3 24 0.55
4 25 500
With these two dataframes, I concatenate them and drop_duplicates on the Values column, which gives me the following df:
Index Values Intensity_df1 Intensity_df2
1 11 0 0
2 12 0 0
3 13 0 0
4 24 0 0
5 21 0 0
6 25 0 0
I would like to recover the intensity of each value in each dataframe. For this purpose, I'm iterating through each line of each df, which is very inefficient. Here is the code I use:
m = 0
while m < len(num_df):
    n = 0
    while n < len(df3):
        temp_intens_abs = df[m]['Intensity'][df3['Values'][n] == df[m]['Values']]
        if temp_intens_abs.empty:
            merged.at[n, "Intensity_df%s" % df[m]] = 0
        else:
            merged.at[n, "Intensity_df%s" % df[m]] = pandas.to_numeric(temp_intens_abs, errors='coerce')
        n = n + 1
    m = m + 1
The resulting df3 looks like this at the end:
Index Values Intensity_df1 Intensity_df2
1 11 98 2000
2 12 855 0
3 13 500 0
4 24 140 0.55
5 21 0 1000
6 25 0 500
My question is: is there a way to directly recover the "present" values in a df by comparing the two columns directly using pandas? I've tried several solutions using numpy but without success. Thanks in advance for your help.
You can try joining these dataframes: df3 = df1.merge(df2, on="Values")
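A hedged sketch of that suggestion (mine, assuming the two tables above are called df1 and df2): an outer merge keeps values that appear in only one frame, suffixes separate the two Intensity columns, and fillna(0) reproduces the zeros in the desired output.

import pandas as pd

df1 = pd.DataFrame({'Values': [11, 12, 13, 24], 'Intensity': [98, 855, 500, 140]})
df2 = pd.DataFrame({'Values': [21, 11, 24, 25], 'Intensity': [1000, 2000, 0.55, 500]})

df3 = df1.merge(df2, on='Values', how='outer', suffixes=('_df1', '_df2')).fillna(0)
print(df3)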
Let's assume the input dataset:
test1 = [[0,7,50], [0,3,51], [0,3,45], [1,5,50],[1,0,50],[2,6,50]]
df_test = pd.DataFrame(test1, columns=['A','B','C'])
that corresponds to:
A B C
0 0 7 50
1 0 3 51
2 0 3 45
3 1 5 50
4 1 0 50
5 2 6 50
I would like to obtain a dataset grouped by 'A', together with the most common value for 'B' in each group and the number of occurrences of that value:
A most_freq freq
0 3 2
1 5 1
2 6 1
I can obtain the first 2 columns with:
grouped = df_test.groupby("A")
out_df = pd.DataFrame(index=grouped.groups.keys())
out_df['most_freq'] = df_test.groupby('A')['B'].apply(lambda x: x.value_counts().idxmax())
but I am having problems with the last column.
Also: is there a faster way that doesn't involve apply? This solution doesn't scale well with larger inputs (I also tried dask).
Thanks a lot!
Use SeriesGroupBy.value_counts, which sorts by count in descending order by default, then add DataFrame.drop_duplicates after Series.reset_index to keep the top value per group:
df = (df_test.groupby('A')['B']
.value_counts()
.rename_axis(['A','most_freq'])
.reset_index(name='freq')
.drop_duplicates('A'))
print (df)
A most_freq freq
0 0 3 2
2 1 0 1
4 2 6 1
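If you prefer to avoid value_counts, here is an alternative sketch of the same idea (my own, not part of the answer): count the (A, B) pairs with groupby(...).size(), then keep the most frequent B per group by sorting on the count and dropping duplicates.

out = (df_test.groupby(['A', 'B']).size()
              .reset_index(name='freq')
              .sort_values(['A', 'freq'], ascending=[True, False])
              .drop_duplicates('A')
              .rename(columns={'B': 'most_freq'})
              .reset_index(drop=True))
print(out)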
I am working through a pandas dataframe with three relevant columns and 2.7 million rows. The structure is:
key VisitLink dx_filter time
0 1 ddcde14 1 100
1 2 abcde11 1 140
2 3 absdf12 1 50
3 4 ddcde14 0 125
4 5 ddcde14 1 140
data = [[1,'ddcde14',1,100],[2,'abcde11',1,140],[3,'absdf12',1,50],[4,'ddcde14',0,125],[5,'ddcde14',1,140]]
df_example = pd.DataFrame(data,columns = ['key','VisitLink','dx_filter','time'])
I need 3 things to be true:
- VisitLink: matches between the two rows
- dx_filter: is 1 for the first event
- Time: the second event happens within 30 days of the first event
Example: Key 1 will generate Key 4 as a matching record, as it meets all qualifications, but Key 4 will not generate Key 5 because its dx_filter = 0.
I ran a trial where I predicted my method would take 120+ hours to complete and am wondering if there is a way to shorten this to <10 hours or if that is not possible.
import numpy as np
import pandas as pd

def add_readmit_id(df):
    df['readmit_id'] = np.nan

    def set_id(row):
        if row['dx_filter'] == 0:
            return np.nan
        else:
            # same person, later visit, within 30 days of the current event
            relevant_df = df.loc[df['VisitLink'] == row['VisitLink']]
            timeframe_df = relevant_df.loc[(relevant_df['time'] > row['time']) & (relevant_df['time'] <= row['time'] + 30)]
            next_timeframe = timeframe_df['time'].min()
            id_row = timeframe_df.loc[timeframe_df['time'] == next_timeframe]
            if not id_row.empty:
                return id_row.iloc[0]['key']
            else:
                return np.nan

    df['readmit_id'] = df.apply(set_id, axis=1)
    return df

df_example = add_readmit_id(df_example)
The code above is what I used to run it (minimal reproducible example).
Here's my approach with groupby:
groups = df.groupby('VisitLink')
# diff(-1) is time minus the next time within each group (rows are assumed to be
# in chronological order per VisitLink), so "next event within 30 days" means
# the difference is >= -30; also require dx_filter on the current row
s = groups['time'].diff(-1).ge(-30) & df['dx_filter']
# where the condition holds, take the key of the next row in the group
df['shifted'] = np.where(s, groups['key'].shift(-1), np.nan)
Output:
key VisitLink dx_filter time shifted
0 1 ddcde14 1 100 4.0
1 2 abcde11 1 140 NaN
2 3 absdf12 1 50 NaN
3 4 ddcde14 0 125 NaN
4 5 ddcde14 1 140 NaN
I'm using Pandas to come up with a new column that searches through an entire column (values in the range 1-100) and counts the values that are less than the current row's value.
See the df example below:
A  NewCol
1  0
3  2
2  1
5  4
8  5
3  2
Essentially, for each row I need to look at the entire Column A, and count how many values are less than the current row. So for Value 5, there are 4 values that are less (<) than 5 (1,2,3,3).
What would be the easiest way of doing this?
Thanks!
One way to do it is with rank using method='min': each value's minimum rank is the count of strictly smaller values plus one (ties share the same rank), so subtracting 1 gives the desired count:
df['NewCol'] = (df['A'].rank(method='min') - 1).astype(int)
Output:
A NewCol
0 1 0
1 3 2
2 2 1
3 5 4
4 8 5
5 3 2
I am using numpy broadcasting:
s=df.A.values
(s[:,None]>s).sum(1)
Out[649]: array([0, 2, 1, 4, 5, 2])
#df['NewCol']=(s[:,None]>s).sum(1)
timing
df=pd.concat([df]*1000)
%%timeit
s=df.A.values
(s[:,None]>s).sum(1)
10 loops, best of 3: 83.7 ms per loop
%timeit (df['A'].rank(method='min') - 1).astype(int)
1000 loops, best of 3: 479 µs per loop
Try this code:
A = [Your numbers]
less_than = []
for element in A:
    counter = 0
    for number in A:
        if number < element:
            counter += 1
    less_than.append(counter)
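To attach that to the frame (my addition, assuming A was taken from df['A'] in the same order):

df['NewCol'] = less_than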
You can do it this way:
import pandas as pd
df = pd.DataFrame({'A': [1,3,2,5,8,3]})
df['NewCol'] = 0
for idx, row in df.iterrows():
    df.loc[idx, 'NewCol'] = (df.loc[:, 'A'] < row.A).sum()
print(df)
A NewCol
0 1 0
1 3 2
2 2 1
3 5 4
4 8 5
5 3 2
Another way is to sort and reset the index. Note that this gives each value's position in sorted order rather than the requested count, so it still needs duplicates handled and the result mapped back to the original row order (see the sketch after the output below):
m=df.A.sort_values().reset_index(drop=True).reset_index()
m.columns=['new','A']
print(m)
new A
0 0 1
1 1 2
2 2 3
3 3 3
4 4 5
5 5 8
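A hedged completion (my own, reusing m and df from above) that turns those sorted positions into the requested count of strictly smaller values and maps them back to the original rows:

counts = m.groupby('A')['new'].min()  # smallest sorted position per value = number of strictly smaller values
df['NewCol'] = df['A'].map(counts).astype(int)
print(df)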
You didn't specify whether speed or memory usage was important (or whether you have a very large dataset). The "easiest" way to do it is straightforward: calculate how many values are less than i for each entry in the column and collect those into a new column:
df=pd.DataFrame({'A': [1,3,2,5,8,3]})
col=df['A']
df['new_col']=[ sum(col<i) for i in col ]
print(df)
Result:
A new_col
0 1 0
1 3 2
2 2 1
3 5 4
4 8 5
5 3 2
There might be more efficient ways to do this on large datasets, such as sorting your column first.
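For what it's worth, a small sketch of that sort-first idea (my own, on the same example df): np.searchsorted against the sorted values gives, for each entry, the number of strictly smaller values in O(n log n).

import numpy as np

sorted_vals = np.sort(df['A'].to_numpy())
df['new_col'] = np.searchsorted(sorted_vals, df['A'].to_numpy(), side='left')
print(df)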