Apply arithmetic calculations on specific rows of a large dataframe - python

Suppose we have a dataframe (df) with a large number of rows (1600000 rows × 4 columns). We also have a list of lists such as this one:
inx = [[1,2],[4,5], [8,9,10], [15,16]]
For every list in inx, we need to calculate the average of the first and third columns of this dataframe and the median of the second and fourth columns. For example, for the first list of inx we should do this for the first and second rows, and replace those rows with a single new row containing the output of these calculations. What is the fastest way to do this?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3, 3], [4, 5, 6, 1], [7, 8, 9, 3], [1, 1, 1, 1]]), columns=['a', 'b', 'c', 'd'])
   a  b  c  d
0  1  2  3  3
1  4  5  6  1
2  7  8  9  3
3  1  1  1  1
The output for just the first list in inx ([1,2]) will be something like this:
     a    b    c  d
0    1    2    3  3
1  5.5  6.5  7.5  2
3    1    1    1  1
As you can see, we don't change the first row (0) because it's not in the list. After that, we do the same for [4,5]. Row 3 is left unchanged as well, because it isn't in any list. inx is a large list of lists (more than 100000 elements).

EDIT: NEW APPROACH AVOIDING LOOPS
Here below you find an approach relying on pandas and avoiding loops.
After generating some fake data of the same size as yours, I basically create a list of group labels from your inx list of rows; i.e., with your inx being:
[[2,3], [5,6,7], [10,11], ...]
the created list is:
[[1,1], [2,2,2], [3,3],...]
After that, this list is flattened and added to the original dataframe to mark various groups of rows to operate on.
After proper calculations, the resulting dataframe is joined back with original rows which don't need calculations (in my example above, rows: [0, 1, 4, 8, 9, ...]).
You'll find more comments in the code.
At the end of the answer I also leave my previous approach, for the record.
On my box, the old algo involving a loop takes more than 18 minutes... unbearable!
Using pandas only, it takes less than half second!! Pandas is great!
import pandas as pd
import numpy as np
import random
# Prepare some fake data to test
data = np.random.randint(0, 9, size=(160000, 4))
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])
inxl = random.sample(range(1, 160000), 140000)
inxl.sort()
inx=[]
while len(inxl) > 3:
    i = random.randint(2,3)
    l = inxl[0:i]
    inx.append(l)
    inxl = inxl[i:]
inx.append(inxl)
# flatten inx (used below)
flat_inx = [item for sublist in inx for item in sublist]
# for each element (list) in inx create equivalent list (same length)
# of increasing ints. They'll be used to group corresponding rows
gr=[len(sublist) for sublist in inx]
t = list(zip(gr, range(1, len(inx)+1)))
group_list = [a*[b] for (a,b) in t]
# the group labels are flattened as well
flat_group_list = [item for sublist in group_list for item in sublist]
# create a new dataframe to mark rows to group retaining
# original index for each row
df_groups = pd.DataFrame({'groups': flat_group_list}, index=flat_inx)
# and join the group dataframe to the original df
df['groups'] = df_groups
# rows not belonging to a group are marked with 0
df['groups']=df['groups'].fillna(0)
# save rows not belonging to a group for later
df_untouched = df[df['groups'] == 0]
df_untouched = df_untouched.drop('groups', axis=1)
# new dataframe containing only rows belonging to a group
df_to_operate = df[df['groups']>0]
df_to_operate = df_to_operate.assign(ind=df_to_operate.index)
# at last, we group the rows according to original inx
df_grouped = df_to_operate.groupby('groups')
# calculate mean and median
# for each group we retain the index of first row of group
df_operated = df_grouped.agg({'a': 'mean',
                              'b': 'median',
                              'c': 'mean',
                              'd': 'median',
                              'ind': 'first'})
# set correct index on dataframe
df_operated = df_operated.set_index('ind')
# finally, join the previous dataframe with saved
# dataframe of rows which don't need calculations
df_final = df_operated.combine_first(df_untouched)
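As a quick check that this pipeline reproduces the expected output from the question, here is a condensed re-run of the same steps on the small 4-row example with inx = [[1, 2]] (just a sketch; df_small, inx_small and the inline list comprehensions are names I introduce here):
import numpy as np
import pandas as pd
df_small = pd.DataFrame(np.array([[1, 2, 3, 3], [4, 5, 6, 1],
                                  [7, 8, 9, 3], [1, 1, 1, 1]]),
                        columns=['a', 'b', 'c', 'd'])
inx_small = [[1, 2]]
flat_inx_small = [i for sub in inx_small for i in sub]
flat_groups_small = [g for g, sub in enumerate(inx_small, 1) for _ in sub]
# mark grouped rows; rows outside every list keep group 0 and stay untouched
df_small['groups'] = pd.Series(flat_groups_small, index=flat_inx_small)
df_small['groups'] = df_small['groups'].fillna(0)
untouched = df_small[df_small['groups'] == 0].drop('groups', axis=1)
operated = (df_small[df_small['groups'] > 0]
            .assign(ind=lambda d: d.index)
            .groupby('groups')
            .agg({'a': 'mean', 'b': 'median',
                  'c': 'mean', 'd': 'median', 'ind': 'first'})
            .set_index('ind'))
print(operated.combine_first(untouched))
#      a    b    c    d
# 0  1.0  2.0  3.0  3.0
# 1  5.5  6.5  7.5  2.0
# 3  1.0  1.0  1.0  1.0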
OLD ALGO, TOO SLOW FOR SO MUCH DATA
This algo involving a loop, though giving a correct result, takes too long for such a large amount of data:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3, 3], [4, 5, 6, 1], [7, 8, 9, 3], [1, 1, 1, 1]]), columns=['a', 'b', 'c', 'd'])
inx = [[1,2]]
for l in inx:
    means = df.iloc[l][['a', 'c']].mean()
    medians = df.iloc[l][['b', 'd']].median()
    df.iloc[l[0]] = pd.DataFrame([means, medians]).fillna(method='bfill').iloc[0]
    df.drop(index=l[1:], inplace=True)

Related

How to create multiple rows of a data frame based on some original values

I am a Python newbie and have a question.
As a simple example, I have three variables:
a = 3
b = 10
c = 1
I'd like to create a data frame with three columns ('a', 'b', and 'c'), where each column takes the original value +/- a certain constant, restricted to values >0 and <=10.
If the constant is 1 then:
the possible values of 'a' will be 2, 3, 4
the possible values of 'b' will be 9, 10
the possible values of 'c' will be 1, 2
The final data frame will consist of all possible combination of a, b and c.
Do you know any Python code to do so?
Here is some code to start:
import pandas as pd
data = [[3 , 10, 1]]
df1 = pd.DataFrame(data, columns=['a', 'b', 'c'])
You may use itertools.product for this.
Create 3 separate lists with the accepted values. This can be done by calling a helper function that returns the list of possible values.
def list_of_values(n):
    if 1 < n < 10:
        return [n - 1, n, n + 1]
    elif n == 1:
        return [1, 2]
    elif n == 10:
        return [9, 10]
    return []
So you will have the following:
a = [2, 3, 4]
b = [9, 10]
c = [1,2]
Next, do the following:
from itertools import product
l = product(a,b,c)
data = list(l)
pd.DataFrame(data, columns =['a', 'b', 'c'])
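For completeness, a minimal end-to-end sketch tying the pieces together (it assumes the list_of_values helper above and the original values a=3, b=10, c=1; values and df_all are names introduced here):
from itertools import product
import pandas as pd
values = [list_of_values(v) for v in (3, 10, 1)]   # [[2, 3, 4], [9, 10], [1, 2]]
df_all = pd.DataFrame(list(product(*values)), columns=['a', 'b', 'c'])
print(len(df_all))   # 3 * 2 * 2 = 12 rows, one per combination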

How to order columns of old/new values such that the ith old value = the (i-1)th new value

Edit: title suggestions welcome. This probably has a name, but I don't know what it is and could not find something similar.
Edit2: I've rewritten the problem to try and explain it more clearly. In both versions, I think I've met the site standards by putting forth an explanation, reproducible example, and my own solution... if you could suggest improvements before downvoting, that would be appreciated.
I have user entered data from a system containing these three columns:
date: timestamps in %Y-%m-%d %H:%M:%S format; however %S=00 for all cases
old: the old value of this observation
new: the new value of this observation
If the user entered data within the same minute, then sorting by the timestamp alone is insufficient. We end up with a "chunk" of entries that may or may not be in the correct order. To illustrate, I've replaced dates with integers and present a correct and jumbled case:
How do we know the data is in the correct order? When each row's value for old equals the previous row's value for new (ignoring the first/last row where we have nothing to compare to). Put another way: old_i = new_(i-1). This creates the matching diagonal colors on the left, which are jumbled on the right.
Other comments:
there may be multiple solutions, as two rows may have the same values for old and new and thus are interchangeable
if an ambiguous chunk occurs by itself (imagine the data is only the rows where date=1 above), any solution will suffice
if the ambiguous chunk occurs with a unique date before and/or after, these serve as additional constraints and must be considered to achieve the solution
consider the case with back to back ambiguous chunks as bonus; I plan to ignore these and am not sure they even exist in the data
My data set is much larger, so my end solution will involve using pandas.groupby() to feed a function chunks like the above. The right side would be passed to the function, and I need the left side returned (or some index/order to get me to the left side).
Here's a reproducible example, using the same data as above, but adding a group column and another chunk so you can see my groupby() solution.
Setup and input jumbled data:
import pandas as pd
import itertools
df = pd.DataFrame({'group': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b'],
                   'date': [0, 1, 1, 1, 1, 2, 3, 4, 4],
                   'old': [1, 8, 2, 2, 5, 5, 4, 10, 7],
                   'new': [2, 5, 5, 8, 2, 4, 7, 1, 10]})
print(df)
### jumbled: the `new` value of a row is not the same as the next row's `old` value
# group date old new
# 0 a 0 1 2
# 1 a 1 8 5
# 2 a 1 2 5
# 3 a 1 2 8
# 4 a 1 5 2
# 5 a 2 5 4
# 6 b 3 4 7
# 7 b 4 10 1
# 8 b 4 7 10
I wrote a kludgy solution that begs for a more elegant approach. See my gist here for the code behind the order_rows function I call below. The output is correct:
df1 = df.copy()
df1 = df1.groupby(['group'], as_index=False, sort=False).apply(order_rows).reset_index(drop=True)
print(df1)
### correct: the `old` value in each row equals the `new` value of the previous row
# group date old new
# 0 a 0 1 2
# 1 a 1 2 5
# 2 a 1 5 2
# 3 a 1 2 8
# 4 a 1 8 5
# 5 a 2 5 4
# 6 b 3 4 7
# 7 b 4 7 10
# 8 b 4 10 1
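To make the ordering requirement concrete, a small sanity check (just a sketch; is_ordered is a helper I introduce here, not part of the gist) confirms that each row's old equals the previous row's new within every group of the output:
def is_ordered(chunk):
    # each row's 'old' must equal the previous row's 'new'
    return (chunk['old'].iloc[1:].values == chunk['new'].iloc[:-1].values).all()
print(df1.groupby('group').apply(is_ordered))
# group
# a    True
# b    True
# dtype: bool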
Update based on networkx suggestion
Note that bullet #2 above suggests that these ambiguous chunks can occur without a prior reference row. In that case, feeding the starting point as df.iloc[0] is not safe. In addition, I found that when seeding the graph with an incorrect starting point, it appears to only output the nodes it could successfully order. Note that 5 rows were passed, but only 4 values were returned.
Example:
import networkx as nx
import numpy as np
df = pd.DataFrame({'group': ['a', 'a', 'a', 'a', 'a'],
                   'date': [1, 1, 1, 1, 1],
                   'old': [8, 1, 2, 2, 5],
                   'new': [5, 2, 5, 8, 2]})
g = nx.from_pandas_edgelist(df[['old', 'new']],
                            source='old',
                            target='new',
                            create_using=nx.DiGraph)
ordered = np.asarray(list(nx.algorithms.traversal.edge_dfs(g, df.old[0])))
ordered
# array([[8, 5],
# [5, 2],
# [2, 5],
# [2, 8]])
This is a graph problem. You can use networkx to create your graph, and then use numpy for manipulation. A simple traversal algorithm, like depth-first search, will visit all your edges starting from a source.
The source is simply your first node (i.e. df.old[0]).
For your example:
import networkx as nx
g = nx.from_pandas_edgelist(df[['old', 'new']],
                            source='old',
                            target='new',
                            create_using=nx.DiGraph)
ordered = np.asarray(list(nx.algorithms.traversal.edge_dfs(g, df.old[0])))
>>> ordered
array([[ 1, 2],
[ 2, 5],
[ 5, 2],
[ 2, 8],
[ 8, 5],
[ 5, 4],
[ 4, 7],
[ 7, 10],
[10, 1]])
You may just assign the result back to your data frame: df[['old', 'new']] = ordered. You might have to change some small details, e.g. if your groups are not inter-connected. But if the starting point is a df sorted on group and date, and the old_i = new_(i-1) dependencies are respected across groups, then you should be fine just assigning back the ordered array.
I still believe, however, that you should investigate your timestamps. I believe this is a simpler problem that can be solved by just sorting the timestamps. Make sure you are not losing precision on your timestamps when reading/writing to files.
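Regarding the update about unsafe starting points, here is a rough sketch (pick_start is a helper I introduce, and this is a heuristic rather than a guarantee): a node whose out-degree exceeds its in-degree by one is the natural start of a trail covering all the old/new edges, so prefer such a node and fall back to df.old[0] when every node is balanced:
def pick_start(g, fallback):
    # a node with one more outgoing than incoming edge must start the trail
    for node in g.nodes:
        if g.out_degree(node) - g.in_degree(node) == 1:
            return node
    return fallback
start = pick_start(g, df.old[0])
ordered = np.asarray(list(nx.algorithms.traversal.edge_dfs(g, start)))
# on the 5-row example from the update this starts at node 1 and returns all
# 5 edges in a valid order: [[1, 2], [2, 5], [5, 2], [2, 8], [8, 5]]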

Merge (or concat) two dataframes by index with duplicates index

I have two dataframes, A and B, with common indexes. These common indexes can appear several times (duplicated) in both A and B.
I want to merge A and B in these 3 ways :
Case 0: If index i appears one time in A (i1) and one time in B (i1), I want my merged-by-index dataframe to add the row A(i1), B(i1).
Case 1: If index i appears one time in A (i1) and two times in B (in this order: i1 and i2), I want my merged-by-index dataframe to add the rows A(i1), B(i1) and A(i1), B(i2).
Case 2: If index i appears two times in A (in this order: i1, i2) and two times in B (in this order: i1 and i2), I want my merged-by-index dataframe to add the rows A(i1), B(i1) and A(i2), B(i2).
These 3 cases are all the possible cases that can appear in my data.
When using pandas.merge, case 0 and case 1 work. But for case 2, the returned dataframe will add the rows A(i1), B(i1) and A(i1), B(i2) and A(i2), B(i1) and A(i2), B(i2) instead of A(i1), B(i1) and A(i2), B(i2).
I could use the pandas.merge method and then delete the undesired merged rows, but is there a way to handle these 3 cases at the same time?
A = pd.DataFrame([[1, 2], [4, 2], [5,5], [5,5], [1,1]], index=['a', 'a', 'b', 'c', 'c'])
B = pd.DataFrame([[1, 5], [4, 8], [7,7], [5,5]], index=['b', 'c', 'a', 'a'])
pd.merge(A,B, left_index=True, right_index=True, how='inner')
For example, I want exactly the result above, but without the second and third merged rows for index 'a'.
Basically, your 3 cases can be summarized into 2 cases:
Index i occurs the same number of times (1 or 2 times) in A and B: merge according to the order.
Index i occurs 2 times in one dataframe and 1 time in the other: merge using the single row's content for all resulting rows.
Prep code:
def add_secondary_index(df):
    df.index.name = 'Old'
    df['Order'] = df.groupby(df.index).cumcount()
    df.set_index('Order', append=True, inplace=True)
    return df
import pandas as pd
A = pd.DataFrame([[1, 2], [4, 2], [5,5], [5,5], [1,1]], index=['a', 'a', 'b', 'c', 'c'])
B = pd.DataFrame([[1, 5], [4, 8], [7,7], [5,5]], index=['b', 'c', 'a', 'a'])
index_times = A.groupby(A.index).count() == B.groupby(B.index).count()
Case 1 is easy to solve: you just need to add the secondary index:
same_times_index = index_times[index_times[0].values].index
A_same = A.loc[same_times_index].copy()
B_same = B.loc[same_times_index].copy()
add_secondary_index(A_same)
add_secondary_index(B_same)
result_merge_same = pd.merge(A_same,B_same,left_index=True,right_index=True)
For case 2, you need to consider it separately:
not_same_times_index = index_times[~index_times.index.isin(same_times_index)].index
A_notsame = A.loc[not_same_times_index].copy()
B_notsame = B.loc[not_same_times_index].copy()
result_merge_notsame = pd.merge(A_notsame,B_notsame,left_index=True,right_index=True)
You could consider whether to add secondary index for result_merge_notsame, or drop it for result_merge_same.
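If you want a single final frame, one possible last step (a sketch: drop the helper 'Order' level from result_merge_same so both parts share the plain index again, then concatenate) is:
# drop the secondary index added earlier, then stack both partial results
result_merge_same = result_merge_same.reset_index(level='Order', drop=True)
final = pd.concat([result_merge_same, result_merge_notsame]).sort_index()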

obtaining indices of n max absolute values in dataframe row

Suppose I create a Pandas DataFrame as below:
import pandas as pd
import numpy as np
np.random.seed(0)
x = 10*np.random.randn(5,5)
df = pd.DataFrame(x)
As an example, this can generate the below:
For each row, I am looking for a way to readily obtain the indices corresponding to the largest n (say 3) values in absolute value terms. For example, for the first row, I would expect [0, 3, 4]. We can assume that the results don't need to be ordered.
I tried searching for solutions similar to idxmax and argmax, but it seems these do not readily handle multiple values.
You can use np.argsort(axis=1)
Given the dataset:
x = 10*np.random.randn(5,5)
df = pd.DataFrame(x)
            0          1         2          3          4
0   17.640523   4.001572  9.787380  22.408932  18.675580
1   -9.772779   9.500884 -1.513572  -1.032189   4.105985
2    1.440436  14.542735  7.610377   1.216750   4.438632
3    3.336743  14.940791 -2.051583   3.130677  -8.540957
4  -25.529898   6.536186  8.644362  -7.421650  22.697546
df.abs().values.argsort(1)[:, -3:][:, ::-1]
array([[3, 4, 0],
[0, 1, 4],
[1, 2, 4],
[1, 4, 0],
[0, 4, 2]])
Try this (this is not the optimal code):
idx_nmax = {}
n = 3
for index, row in df.iterrows():
    # take absolute values so the largest entries in absolute terms are picked
    idx_nmax[index] = list(row.abs().nlargest(n).index)
At the end of that you will have a dictionary with:
the index of the row as the key
the indices of the n largest values (in absolute terms) of that row as the value

What's the most efficient way to get a variable length of rows w.r.t each group of a dataframe

To illustrate my question clearly, for a dummy dataframe like this:
df = pd.DataFrame({'X' : ['B', 'B', 'A', 'A', 'A'], 'Y' : [1, 2, 3, 4, 5]})
How can I get the top 1 row of group A and the top 2 rows of group B, and get rid of the remaining rows of each group? By the way, the real dataset is big, with hundreds of thousands of rows and thousands of groups.
And the output looks like this:
pd.DataFrame({'X' : ['B', 'B', 'A'], 'Y' : [1, 2, 3]})
My main gripe is that .groupby().head() only gives me a fixed number of rows for each group, and I want a different number of rows for different groups.
One way to do this is to create a dictionary that contains the number of rows each group should keep; then, in the groupby.apply, use g.name as the key to look up that number in the dictionary and pass it to head, so each group keeps a different number of rows:
rows_per_group = {"A": 1, "B": 2}
df.groupby("X", group_keys=False).apply(lambda g: g.head(rows_per_group[g.name]))
# X Y
#2 A 3
#0 B 1
#1 B 2
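With thousands of groups, groupby.apply can be slow; an apply-free variant of the same idea (a sketch, limits is a name introduced here) compares each row's position within its group against the per-group limit and keeps the rows in their original order:
limits = df['X'].map(rows_per_group)
df[df.groupby('X').cumcount() < limits]
#    X  Y
# 0  B  1
# 1  B  2
# 2  A  3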
