Group dataframe according to start and stop columns - python

I'd like to cut/group a pandas DataFrame according to a start and a stop column, but only for start->stop pairs.
Running through the indices from top to bottom, I want the range of indexes from a non-zero 'start' value to the next non-zero 'stop' value, but only if that 'stop' is the next non-zero value after the 'start' (i.e. no other 'start' occurs in between).
I attached some code creating a simplified version of the problem and a corresponding image.
import numpy as np
import pandas as pd

col1 = np.zeros(10)
col2 = np.zeros(10)
col1[[0, 1, 5, 8]] = 1
col2[[3, 6, 7, 9]] = 1
df = pd.DataFrame({'start': col1, 'stop': col2})
The desired output would group the indexes somewhat like:
[(1,2,3), (5,6), (8,9)]
Additional info in case this would simplify things:
Merging the columns would be fine.
My original data frame has a pd.TimedeltaIndex.
Visual Clarification of the desired result:

First we need to look at the start and stop values and find out which are "valid" interval starts and ends:
>>> ends = df.index.to_series().where(df['stop'].ne(0))
>>> starts = df.index.to_series().where(df['start'].ne(0))
>>> ends
0 NaN
1 NaN
2 NaN
3 3.0
4 NaN
5 NaN
6 6.0
7 7.0
8 NaN
9 9.0
dtype: float64
>>> starts
0 0.0
1 1.0
2 NaN
3 NaN
4 NaN
5 5.0
6 NaN
7 NaN
8 8.0
9 NaN
dtype: float64
Now we can try to get for each valid start the next valid end:
>>> next_end = ends.bfill().rename('end')
>>> valid_starts = starts.dropna().rename('start')
>>> candidates = valid_starts.to_frame().join(next_end, how='left')
>>> candidates
   start  end
0    0.0  3.0
1    1.0  3.0
5    5.0  6.0
8    8.0  9.0
Here we see that there is an issue with the interval starting at 0: another interval starts later (at 1) so [0, 3] is not valid and we should only keep [1, 3]. This could be done with groupby + max for example:
>>> intervals = candidates.groupby('end')['start'].max().reset_index().astype(int)
>>> intervals
   end  start
0    3      1
1    6      5
2    9      8
Finally generating the list of indexes from the endpoints is easy:
>>> intervals.agg(lambda s: list(range(s['start'], s['end'] + 1)), axis='columns')
0 [1, 2, 3]
1 [5, 6]
2 [8, 9]
dtype: object
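For reuse, the steps above can be folded into a single helper. This is just a sketch: the function name is mine, and it assumes an integer index like the example's (for a pd.TimedeltaIndex you would keep the start/end pairs and slice with them instead of calling range):
import numpy as np
import pandas as pd

def start_stop_groups(df):
    """Return one list of indexes per valid start->stop interval."""
    ends = df.index.to_series().where(df['stop'].ne(0))
    starts = df.index.to_series().where(df['start'].ne(0))
    # for each valid start, the next valid end
    candidates = (starts.dropna().rename('start').to_frame()
                        .join(ends.bfill().rename('end'), how='left'))
    # keep only the last start before each end
    intervals = candidates.groupby('end')['start'].max().reset_index().astype(int)
    return [list(range(s, e + 1)) for s, e in zip(intervals['start'], intervals['end'])]

col1 = np.zeros(10)
col2 = np.zeros(10)
col1[[0, 1, 5, 8]] = 1
col2[[3, 6, 7, 9]] = 1
print(start_stop_groups(pd.DataFrame({'start': col1, 'stop': col2})))
# [[1, 2, 3], [5, 6], [8, 9]]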

Related

Apply different mathematical function in table in Python

I have two columns, Column A and Column B, with some values like below.
Now I want to apply a normal arithmetic function to each row and add the result in the next column, but a different arithmetic operator should be applied to each row, like:
A+B for the first row
A-B for the second row
A*B for the third row
A/B for the fourth row
and so on until the nth row, with the same mathematical functions repeating cyclically.
Can someone please help me with this code in Python?
python-3.x
pandas
We can use:
row.name to access the index when using apply on a row
a dictionary to map indexes to operations
Code
import operator as _operator
import pandas as pd

# Data
d = {"A": [5, 6, 7, 8, 9, 10, 11],
     "B": [1, 2, 3, 4, 5, 6, 7]}
df = pd.DataFrame(d)
print(df)

# Mapping from index to mathematical operation
operator_map = {
    0: _operator.add,
    1: _operator.sub,
    2: _operator.mul,
    3: _operator.truediv,
}

# use row.name % 4 so the operators cycle with a period of 4
df['new'] = df.apply(lambda row: operator_map[row.name % 4](*row), axis=1)
Output
Initial df
A B
0 5 1
1 6 2
2 7 3
3 8 4
4 9 5
5 10 6
6 11 7
New df
A B new
0 5 1 6.0
1 6 2 4.0
2 7 3 21.0
3 8 4 2.0
4 9 5 14.0
5 10 6 4.0
6 11 7 77.0
IIUC, you can try DataFrame.apply on rows with the operator module:
import operator
operators = [operator.add, operator.sub, operator.mul, operator.truediv]
df['C'] = df.apply(lambda row: operators[row.name](*row), axis=1)
print(df)
A B C
0 5 1 6.0
1 6 2 4.0
2 7 3 21.0
3 8 4 2.0
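Note that operators[row.name] only works while the frame has no more rows than there are operators. For a longer frame the index can be cycled with a modulo, as in the answer above; a minimal sketch:
import operator
import pandas as pd

df = pd.DataFrame({'A': [5, 6, 7, 8, 9, 10, 11],
                   'B': [1, 2, 3, 4, 5, 6, 7]})
operators = [operator.add, operator.sub, operator.mul, operator.truediv]
# row.name % len(operators) restarts the operator cycle every four rows
df['C'] = df.apply(lambda row: operators[row.name % len(operators)](*row), axis=1)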

Pandas: dynamically add rows before & after groups

I have a Pandas data frame:
pd.DataFrame(data={"Value": [5, 0, 0, 8, 0, 3, 0, 2, 7, 0], "Group": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2]})
Value Group
0 5.0 1
1 0.0 1
2 0.0 1
3 8.0 1
4 0.0 2
5 3.0 2
6 0.0 2
7 2.0 2
8 7.0 2
9 0.0 2
Also I calculated the cumulative sum with respect to two rows for each group:
{"2-cumsum Group 1": array([5., 8.]), "2-cumsum Group 2": array([7., 5.])}
E.g. array([5., 8.]) because array([5.0, 0.0]) (rows 0 and 1) + array([0.0, 8.0]) (rows 2 and 3) = array([5., 8.]).
What I now need is to append exactly two rows at the beginning of df, in-between each group and at the end of df so that I get the following data frame (gaps are for illustration purposes):
Value Group
0 10.0 0 # Initialize with 10.0
1 10.0 0 # Initialize with 10.0
2 5.0 1
3 0.0 1
4 0.0 1
5 8.0 1
6 5.0 0 # 5.0 ("2-cumsum Group 1"[0])
7 8.0 0 # 8.0 ("2-cumsum Group 1"[1])
8 0.0 2
9 3.0 2
10 0.0 2
11 2.0 2
12 7.0 2
13 0.0 2
14 7.0 0 # 7.0 ("2-cumsum Group 2"[0])
15 5.0 0 # 5.0 ("2-cumsum Group 2"[1])
Please consider that the original data frame is much larger, has more than just two columns and I need to dynamically append rows with varying entries. E.g. the rows to append should have an additional column with "10.0" entries. Also, calculating the cumulative sum with respect to some integer (in this case 2) is variable (could be 8).
There are so many occasions where I need to generate rows based on other rows in data frames but I didn't find any effective solutions other than using for-loops and some temporary cache lists that save values from previous iterations.
I would appreciate some help.
Thank you in advance and kind regards.
My original code applied to the exemplary data, in case anybody needs it. It's very convoluted and inefficient, so only consider it if you really need to:
import pandas as pd
import numpy as np
# Some stuff
df = pd.DataFrame(data={"Group1": ["a", "a", "b", "b", "b"],
"Group2": [1, 2, 1, 2, 3],
"Group3": [1, 9, 2, 1, 1],
"Value": [5, 8, 3, 2, 7]})
length = 2
max_value = 20
g = df['Group1'].unique()
h = df["Group2"].unique()
i = range(1,df['Group3'].max()+1)
df2 = df.set_index(['Group1','Group2','Group3']).reindex(pd.MultiIndex.from_product([g,h,i])).assign(cc = lambda x: (x.groupby(level=0).cumcount())//length).rename_axis(['Group1','Group2','Group3'],axis=0)
df2 = df2.loc[~df2['Value'].isna().groupby([pd.Grouper(level=0),df2['cc']]).transform('all')].reset_index().fillna(0).drop('cc',axis=1)
values = df2["Value"].copy().to_numpy()
values = np.array_split(values, len(values)/length)
stock = df2["Group1"].copy().to_numpy()
stock = np.array_split(stock, len(stock)/length)
# Generate the "Group" column and generate the "2-cumsum" arrays
# stored in the "volumes" variable
k = 0
a_groups = []
values_cache = []
volumes = []
for e, i in enumerate(values):
if any(stock[e] == stock[e-1]):
if np.any(i + values_cache >= max_value):
k += 1
volumes.append(values_cache)
values_cache = i
else:
values_cache += i
a_groups.extend([k] * length)
else:
k += 1
if e:
volumes.append(values_cache)
values_cache = i
a_groups.extend([k] * length)
volumes.append(values_cache)
df2["Group"] = a_groups
print(df2[["Value", "Group"]])
print("\n")
print(f"2-cumsums: {volumes}")
"""Output
Value Group
0 5.0 1
1 0.0 1
2 0.0 1
3 8.0 1
4 0.0 2
5 3.0 2
6 0.0 2
7 2.0 2
8 7.0 2
9 0.0 2
2-cumsums: [array([5., 8.]), array([7., 5.])]"""
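One way to build the interleaved frame without manual index bookkeeping is to collect the pieces group by group and concatenate once at the end. A sketch under the question's assumptions (each group's length is a multiple of length, helper rows get Group 0, and the leading block is initialized with 10.0):
import numpy as np
import pandas as pd

df = pd.DataFrame(data={"Value": [5, 0, 0, 8, 0, 3, 0, 2, 7, 0],
                        "Group": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2]})
length = 2  # bucket size for the cumulative sums; variable, could be 8

pieces = [pd.DataFrame({"Value": [10.0] * length, "Group": 0})]  # leading rows
for _, g in df.groupby("Group", sort=True):
    pieces.append(g)
    # fold the group's values into `length` columns and sum them column-wise
    csum = g["Value"].to_numpy(dtype=float).reshape(-1, length).sum(axis=0)
    pieces.append(pd.DataFrame({"Value": csum, "Group": 0}))

out = pd.concat(pieces, ignore_index=True)
print(out)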

Find duplicate values in two arrays, Python

I have two arrays (A and B) with about 50 000 values in each. Every value represents an ID. I want to create a pandas dataframe with three columns, col1: values from array A, col2: values from array B, col3: a string with the labels "unique" or "duplicate". Within each array the IDs are unique.
The arrays are of different lengths, so I can't do something like this to get started:
a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 6, 7, 8, 9, 10])
pd.DataFrame({'a': a, 'b': b})
I was then thinking of creating a different pandas dataframe, also with three columns: one for the ID, another for which array the ID comes from (a or b), and then grouping on ID and counting occurrences; if >= 2, we have a duplicate.
But I couldn't figure out how to stack two numpy arrays one after another in the same column (like rbind in R) and at the same time create the other column based on which array each value comes from.
Most likely there are far better solutions than those I have suggested above. Any ideas?
For finding duplicate elements in two arrays, use numpy.intersect1d:
In [458]: a = np.array([1, 2, 3, 4, 5])
In [459]: b = np.array([5, 6, 7, 8, 9, 10])
In [462]: np.intersect1d(a,b)
Out[462]: array([5])
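The rbind-like stacking described in the question also works: concatenate both arrays into one column with a source marker, then count occurrences per ID. A sketch of that idea (column names are mine):
import numpy as np
import pandas as pd

a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 6, 7, 8, 9, 10])

stacked = pd.concat([pd.DataFrame({'id': a, 'source': 'a'}),
                     pd.DataFrame({'id': b, 'source': 'b'})],
                    ignore_index=True)
# an ID occurring >= 2 times appears in both arrays
stacked['label'] = np.where(stacked.groupby('id')['id'].transform('size') >= 2,
                            'duplicate', 'unique')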
Convert the arrays into Series and then concat them to create the dataframe:
a = np.array([1, 2, 3, 4, 5,])
b = np.array([5, 6, 7, 8, 9, 10])
s1 = pd.Series(a, name = 'a')
s2 = pd.Series(b, name = 'b')
pd.concat([s1, s2], axis = 1)
a b
0 1.0 5
1 2.0 6
2 3.0 7
3 4.0 8
4 5.0 9
5 NaN 10
Try with merge + indicator
out = pd.DataFrame({'a':a}).merge(pd.DataFrame({'b':b}), left_on='a',right_on='b',indicator=True,how='outer')
Out[210]:
a b _merge
0 1.0 NaN left_only
1 2.0 NaN left_only
2 3.0 NaN left_only
3 4.0 NaN left_only
4 5.0 5.0 both
5 NaN 6.0 right_only
6 NaN 7.0 right_only
7 NaN 8.0 right_only
8 NaN 9.0 right_only
9 NaN 10.0 right_only
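To turn the _merge flag into the "unique"/"duplicate" label the question asks for, it can be mapped in one more step (a sketch building on the merge result above):
import numpy as np

out['label'] = np.where(out['_merge'].eq('both'), 'duplicate', 'unique')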

Set Pandas Dataframe Value with another list of Value

I have a data frame with column A, df['A']
df is something like
index A
1 nan
2 nan
3 nan
4 nan
5 nan
I have a list of True/False values which is a mask for the data frame, where True means the value should be replaced.
mask = [False, True, False, True, True]
I have another list of values which I want to use to fill df['A'] at the masked positions (starting at index 2):
value = [1, 3, 2]
The result I want is -
index A
1 nan
2 1
3 nan
4 3
5 2
I tried to use df['A'][mask] = value, but it's not working.
Anyone can help? Thank you!
Use DataFrame.loc, which works on the DataFrame itself rather than on a copy:
df.loc[mask, 'A'] = value
print (df)
A
1 NaN
2 1.0
3 NaN
4 3.0
5 2.0
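The original attempt fails because df['A'][mask] chains two indexing calls, so the assignment may land on a temporary copy instead of df. A minimal reproduction of the fix, assuming the setup from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan] * 5}, index=range(1, 6))
mask = [False, True, False, True, True]
value = [1, 3, 2]

# df['A'][mask] = value   # chained indexing: may write to a copy, not df
df.loc[mask, 'A'] = value  # one indexer call writes to df itself
print(df)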

Filling Pandas columns with lists of unequal lengths

I am having trouble filling Pandas dataframes with values from lists of unequal lengths.
nx_lists_into_df is a list of lists of numpy arrays.
I get the following error:
ValueError: Length of values does not match length of index
The code is below:
from itertools import zip_longest
from numpy import array
import pandas as pd

# Column headers
df_cols = ["f1", "f2"]

# Create one dataframe for each sheet
df1 = pd.DataFrame(columns=df_cols)
df2 = pd.DataFrame(columns=df_cols)

# Create list of dataframes to iterate through
df_list = [df1, df2]

# Lists to be put into the dataframes
nx_lists_into_df = [[array([0, 1, 3, 4, 7]),
                     array([2, 5, 6, 8])],
                    [array([0, 1, 2, 6, 7]),
                     array([3, 4, 5, 8])]]

# Loop through each sheet (i.e. each round of k folds)
for df, test_index_list in zip_longest(df_list, nx_lists_into_df):
    counter = -1
    # Loop through each column in that sheet (i.e. each fold)
    for col in df_cols:
        print(col)
        counter += 1
        # Add 1 to each index value to start indexing at 1
        df[col] = test_index_list[counter] + 1
Thank you for your help.
Edit: This is how the result should hopefully look:
print(df1)
f1 f2
0 0 2
1 1 5
2 3 6
3 4 8
4 7 NaN
print(df2)
f1 f2
0 0 3
1 1 4
2 2 5
3 6 8
4 7 NaN
We'll leverage pd.Series, which attaches an appropriate index and lets us use the pd.DataFrame constructor without it complaining about unequal lengths.
df1, df2 = (
    pd.DataFrame(dict(zip(df_cols, map(pd.Series, d))))
    for d in nx_lists_into_df
)
print(df1)
f1 f2
0 0 2.0
1 1 5.0
2 3 6.0
3 4 8.0
4 7 NaN
print(df2)
f1 f2
0 0 3.0
1 1 4.0
2 2 5.0
3 6 8.0
4 7 NaN
Setup
from numpy import array

nx_lists_into_df = [[array([0, 1, 3, 4, 7]),
                     array([2, 5, 6, 8])],
                    [array([0, 1, 2, 6, 7]),
                     array([3, 4, 5, 8])]]

# Column headers
df_cols = ["f1", "f2"]
You could predefine the size of your DataFrames (by setting the index range to the length of the longest column you want to add [or any size bigger than the longest column]) like so:
df1 = pd.DataFrame(columns=df_cols, index=range(5))
df2 = pd.DataFrame(columns=df_cols, index=range(5))
print(df1)
f1 f2
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
(df2 is the same)
The DataFrame will be filled with NaNs automatically.
Then you use .loc to access each entry separately like so:
for x in range(len(nx_lists_into_df)):
    for col_idx, y in enumerate(nx_lists_into_df[x]):
        df_list[x].loc[range(len(y)), df_cols[col_idx]] = y
print(df1)
f1 f2
0 0 2
1 1 5
2 3 6
3 4 8
4 7 NaN
print(df2)
f1 f2
0 0 3
1 1 4
2 2 5
3 6 8
4 7 NaN
The first loop iterates over the first dimension of your array (or the number of DataFrames you want to create).
The second loop iterates over the column values for the DataFrame, where y are the values for the current column and df_cols[col_idx] is the respective column (f1 or f2).
Since the row & col indices are the same size as y, you don't get the length mismatch.
Also check out the enumerate(iterable, start=0) function to get around those "counter" variables.
Hope this helps.
If I understand correctly, this is possible via pd.concat.
But see #pir's solution for an extendable version.
# Lists to be put into the dataframes
nx_lists_into_df = [[array([0, 1, 3, 4, 7]),
                     array([2, 5, 6, 8])],
                    [array([0, 1, 2, 6, 7]),
                     array([3, 4, 5, 8])]]

df1 = pd.concat([pd.DataFrame({'A': nx_lists_into_df[0][0]}),
                 pd.DataFrame({'B': nx_lists_into_df[0][1]})],
                axis=1)
#    A    B
# 0  0  2.0
# 1  1  5.0
# 2  3  6.0
# 3  4  8.0
# 4  7  NaN

df2 = pd.concat([pd.DataFrame({'C': nx_lists_into_df[1][0]}),
                 pd.DataFrame({'D': nx_lists_into_df[1][1]})],
                axis=1)
#    C    D
# 0  0  3.0
# 1  1  4.0
# 2  2  5.0
# 3  6  8.0
# 4  7  NaN
