Row-wise DataFrame segmentation - python

Given the following dataframe:
df = pd.DataFrame(data={'item': [1, 2, 3, 4], 'start':[0.0, 2.0, 8.0, 6.0],
'end': [2.0, 6.0, 8.0, 14.0]})
How do I quickly expand the above dataframe row-wise by segmenting each 'start'–'end' interval into consecutive segments of length 2?
For the above example, the resulting dataframe should be
Out:
item  start   end
   1    0.0   2.0
   2    2.0   4.0
   2    4.0   6.0
   3    8.0   8.0
   4    6.0   8.0
   4    8.0  10.0
   4   10.0  12.0
   4   12.0  14.0
Performance is of utmost importance for me, as I have millions of lines to check.
I had already filtered the entire dataframe with boolean indexing to set aside the rows that do not need segmenting, which is a great speed-up. However, on the remaining rows I used a for loop that built a dataframe of the correct length for each row and kept appending them. Unfortunately, that performance is not sufficient for millions of rows.
Looking forward to expert solutions!

You can write a function that returns a DataFrame of the expanded start and end times. In this example, I group by item, as I'm not sure you can return a DataFrame from apply without grouping first.
import numpy as np
import pandas as pd

def convert(row):
    # 'row' is actually the one-row group DataFrame for a single item
    start = row.start.values[0]
    end = row.end.values[0]
    if start == end:
        # zero-length interval: keep the row as-is
        return pd.DataFrame([[start, end]], columns=['start', 'end'])
    else:
        # split the interval into consecutive chunks of width 2
        return pd.DataFrame({'start': np.arange(start, end, 2),
                             'end': np.arange(start + 2, end + 2, 2)},
                            columns=['start', 'end'])

df1 = df.groupby('item').apply(convert)
df1.index = df1.index.droplevel(1)
df1.reset_index()
item start end
0 1 0.0 2.0
1 2 2.0 4.0
2 2 4.0 6.0
3 3 8.0 8.0
4 4 6.0 8.0
5 4 8.0 10.0
6 4 10.0 12.0
7 4 12.0 14.0

Start from the original dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'item': [1, 2, 3, 4], 'start':[0.0, 2.0, 8.0, 6.0],
'end': [2.0, 6.0, 10.0, 14.0]})
Then, run the following code:
lengths = pd.Series([1, 2, 1, 4])  # For the example I just created this array, but in
                                   # practice I would derive the number of segments to
                                   # create for each row (see the sketch further below)
# The line below elongates the dataframe according to 'lengths'
df = df.reindex(np.repeat(df.index.values, lengths), method='ffill')
df['start'] += pd.Series(df.groupby(level=0).cumcount() * 2.0)
df['end'] = df['start'] + 2.0
print(df)
Note that the initial dataframe contained an error. Item '3' required a 'start=8.0' and 'end=10.0'.
I believe this method is extremely quick due to the use of pandas Cython functions. Of course, still open to other possibilities.
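As a minimal sketch, the 'lengths' array could also be derived from the original (pre-expansion) dataframe rather than hard-coded, for example by counting how many width-2 segments fit in each interval, with a minimum of one for zero-length intervals:
import numpy as np
# Number of width-2 segments per row; zero-length intervals still get one row.
lengths = np.maximum(np.ceil((df['end'] - df['start']) / 2.0), 1).astype(int)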

Related

How to execute a function on a group of rows in pandas dataframe and concatenate the results row wise on axis 0?

I have one DataFrame like below:
cell_id col1 col2
en_1 2.0 3.0
en_2 8.0 9.0
.
.
en_2 9.0 8.0
en_1 9.0 8.0
.
.
en_n 4.0 6.7
I want to send this DataFrame to some function one cell_id at a time, as below, and concatenate the results row-wise (axis 0):
def func(df):
    # do_some_process
    return df

result1 = func(df[df.cell_id.eq('en_1')])
result2 = func(df[df.cell_id.eq('en_2')])
.
.
result_n = func(df[df.cell_id.eq('en_n')])
result = pd.concat([result1, result2, ....., result_n], axis=0)
You can simply use df.apply() as follows:
def func(x):
    # perform your operation on the pd.Series (one row at a time)
    return x

df.apply(func, axis=1)
If the value you need depends on each row, you can simply use apply and create a new column like this:
df['new_col'] = df.apply(func, axis=1)
Then, if you want these values laid out as a row, you can assign a reshaped version of that column to a variable:
t = df['new_col'].values.reshape(1, -1)
t will be a row-oriented version of that column, if that is what you need.
Example
data = {'cell_id': {0: 'en_1', 1: 'en_2', 2: 'en_2', 3: 'en_1', 4: 'en_3'},
'col1': {0: 2.0, 1: 8.0, 2: 9.0, 3: 9.0, 4: 4.0},
'col2': {0: 3.0, 1: 9.0, 2: 8.0, 3: 8.0, 4: 6.7}}
df = pd.DataFrame(data)
df
cell_id col1 col2
0 en_1 2.0 3.0
1 en_2 8.0 9.0
2 en_2 9.0 8.0
3 en_1 9.0 8.0
4 en_3 4.0 6.7
Code
You can split df by cell_id value:
g = df.groupby('cell_id')
[g.get_group(i) for i in g.groups]
result:
[ cell_id col1 col2
0 en_1 2.0 3.0
3 en_1 9.0 8.0,
cell_id col1 col2
1 en_2 8.0 9.0
2 en_2 9.0 8.0,
cell_id col1 col2
4 en_3 4.0 6.7]
This gives a list of DataFrames, one per group. Then you can apply your func to each group and concatenate the results:
pd.concat([func(g.get_group(i)) for i in g.groups])
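Equivalently, pandas can do the splitting and concatenation for you in one step, assuming func accepts and returns a DataFrame as in the question:
result = df.groupby('cell_id', group_keys=False).apply(func)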

Filling pandas DataFrame with values from another DataFrame of different shape

I have a dataframe df2 containing four columns: A, B, C, D. I want to fill this dataframe with the values from another DataFrame, temp = (1, 2, 6.5, 8, 3, 4, 6.6, 7.8, 5, 6, 5, 4).
What I want to obtain is those values laid out row by row across the four columns (the expected output is shown in the answer below).
Any idea on how to do this?
If the length of the values modulo 4 equals 0, select the first row as a Series with DataFrame.iloc, convert it to a NumPy array, and reshape it with -1 (which lets NumPy infer the number of rows) and 4 (the number of columns):
print (len(df.iloc[0]) % 4)
0
df2 = pd.DataFrame(df.iloc[0].to_numpy().reshape(-1, 4), columns=list('ABCD'))
print (df2)
A B C D
0 1.0 2.0 6.5 8.0
1 3.0 4.0 6.6 7.8
2 5.0 6.0 5.0 4.0
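For reference, a minimal self-contained version of the above, assuming the temp values live in a single-row DataFrame (the question doesn't show how temp is actually stored):
import pandas as pd

temp = (1, 2, 6.5, 8, 3, 4, 6.6, 7.8, 5, 6, 5, 4)
df = pd.DataFrame([temp])  # single-row DataFrame holding the 12 values

# 12 values split evenly into 4 columns -> 3 rows
df2 = pd.DataFrame(df.iloc[0].to_numpy().reshape(-1, 4), columns=list('ABCD'))
print(df2)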

Replace missing data based on certain conditions

Let's say I have data:
a b
0 1.0 NaN
1 6.0 1
2 3.0 NaN
3 1.0 NaN
I would like to iterate over this data and check: if df['b'] is NaN and df['a'] == 1.0 in that row, replace the NaN with 4, rather than replacing every NaN with 4. How shall I go about it? I tried various for/if combinations and it didn't work. I also tried
for i in df.itertuples():
but the problem is that itertuples() offers no way to replace values, and the other methods I've seen do it one element at a time.
End result I'm looking for:
a b
0 1.0 4
1 6.0 1
2 3.0 NaN
3 1.0 4
import numpy as np
import pandas as pd

def func(x):
    # x is one row; only fill 'b' when 'a' equals 1 and 'b' is missing
    if x['a'] == 1 and pd.isna(x['b']):
        x['b'] = 4
    return x

df = pd.DataFrame.from_dict({'a': [1.0, 6.0, 3.0, 1.0], 'b': [np.nan, 1, np.nan, np.nan]})
df = df.apply(func, axis=1)
Instead of iterrows(), apply() may be a better option.
You can create a mask and then fill in the intended NaNs using that mask:
df = pd.DataFrame({'a': [1,6,3,1], 'b': [np.nan, 1, np.nan, np.nan]})
mask = df[['a', 'b']].apply(lambda x: (x[0] == 1) and (pd.isna(x[1])), axis=1)
df['b'] = df['b'].mask(mask, df['b'].fillna(4))
print(df)
a b
0 1 4.0
1 6 1.0
2 3 NaN
3 1 4.0
Can this help you?
df2 = df[df['a']==1.0].fillna(4.0)
df2.combine_first(df)
This selects the rows where a equals 1.0, fills their NaNs with 4.0, and combine_first then patches those values back into the original frame.
Like you said, you can achieve this by combining two conditions: a == 1 and b is NaN.
To combine two conditions in pandas you can use &.
In your example:
import pandas as pd
import numpy as np
# Create sample data
d = {'a': [1, 6, 3, 1], 'b': [np.nan, 1, np.nan, np.nan]}
df = pd.DataFrame(data=d)
# Convert to numeric
df = df.apply(pd.to_numeric, errors='coerce')
print(df)
# Replace Nans
df.loc[(df['a'] == 1) & np.isnan(df['b']), 'b'] = 4  # only touch column 'b'
print(df)
Should do the trick.

Python Pandas Next Sequence Number

I have a dataframe
df = pd.DataFrame([1,5,8, np.nan,np.nan], columns = ["UserID"])
I want to fill the NaN values with the next sequence numbers, starting from the highest existing value + 1.
Expected result of df.UserID:
[1, 5, 8, 9, 10]
Use Series.isna with Series.cumsum as a running counter, and add it to the original data with missing values forward-filled:
df['UserID'] = df['UserID'].isna().cumsum().add(df['UserID'].ffill(), fill_value=0)
print (df)
UserID
0 1.0
1 5.0
2 8.0
3 9.0
4 10.0
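A quick breakdown of what each piece of that one-liner produces for this example, split into named steps purely for illustration:
import numpy as np
import pandas as pd

df = pd.DataFrame([1, 5, 8, np.nan, np.nan], columns=['UserID'])

counter = df['UserID'].isna().cumsum()  # 0, 0, 0, 1, 2 -- running count of NaNs seen so far
filled = df['UserID'].ffill()           # 1, 5, 8, 8, 8 -- last valid value carried forward
df['UserID'] = counter.add(filled, fill_value=0)
print(df)                               # 1, 5, 8, 9, 10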

Faster method to multiply column lookup values with vectorization

I have two Dataframes, one contains values and is the working dataset (postsolutionDF), while the other is simply for reference as a lookup table (factorimportpcntDF). The goal is to add a column to postsolutionDF that contains the product of the lookup values from each row of postsolutionDF (new column name = num_predict). That product is then multiplied by 2700. For example, on first row, the working values are 0.5, 2, -6. The equivalent lookup values for these are 0.1182, 0.2098, and 0.8455. The product of those is 0.0209, which when multiplied by 2700 is 56.61 as shown in output.
The code below works for this simplified example, but it is very slow on the real data (1.6MM rows x 15 numbered columns). I'm sure there is a better way that removes the 'for k in range' loop, but I'm struggling with how, since I'm already using apply on the rows. I've found many tangential solutions but nothing that has worked for my situation yet. Thanks for any help.
import pandas as pd
import numpy as np
postsolutionDF = pd.DataFrame({'SCRN': ['2019-01-22-0000001', '2019-01-22-0000002', '2019-01-22-0000003'],
                               '1': 0.5, '2': 2, '3': [-6, 1.0, 8.0]})
postsolutionDF = postsolutionDF[['SCRN', '1', '2', '3']]
print('printing initial postsolutionDF..')
print(postsolutionDF)

factorimportpcntDF = pd.DataFrame({'F1_Val': [0.5, 1, 1.5, 2], 'F1_Pcnt': [0.1182, 0.2938, 0.4371, 0.5433],
                                   'F2_Val': [2, 3, np.nan, np.nan], 'F2_Pcnt': [0.2098, 0.7585, np.nan, np.nan],
                                   'F3_Val': [-6, 1, 8, np.nan], 'F3_Pcnt': [0.8455, 0.1753, 0.072, np.nan]})
print('printing factorimportpcntDF..')
print(factorimportpcntDF)

def zero_filter(row):  # row is a Series
    inner_value = 1
    for k in range(1, 4):  # number of numeric columns in postsolutionDF; dynamic in the actual code
        inner_value *= factorimportpcntDF.loc[factorimportpcntDF['F' + str(k) + '_Val'] == row[0 + k],
                                              'F' + str(k) + '_Pcnt'].values[0]
    inner_value *= 2700
    return inner_value

postsolutionDF['num_predict'] = postsolutionDF.apply(zero_filter, axis=1)
print('printing new postsolutionDF..')
print(postsolutionDF)
Print Output:
printing initial postsolutionDF..
SCRN 1 2 3
0 2019-01-22-0000001 0.5 2 -6.0
1 2019-01-22-0000002 0.5 2 1.0
2 2019-01-22-0000003 0.5 2 8.0
printing factorimportpcntDF..
F1_Pcnt F1_Val F2_Pcnt F2_Val F3_Pcnt F3_Val
0 0.1182 0.5 0.2098 2.0 0.8455 -6.0
1 0.2938 1.0 0.7585 3.0 0.1753 1.0
2 0.4371 1.5 NaN NaN 0.0720 8.0
3 0.5433 2.0 NaN NaN NaN NaN
printing new postsolutionDF..
SCRN 1 2 3 num_predict
0 2019-01-22-0000001 0.5 2 -6.0 56.610936
1 2019-01-22-0000002 0.5 2 1.0 11.737312
2 2019-01-22-0000003 0.5 2 8.0 4.820801
I'm not sure how to do this in native pandas, but if you go back to numpy, it is pretty easy.
The numpy.interp function is designed to interpolate between values in the lookup table, but if the input values exactly match the values in the lookup table (like yours do), it becomes just a simple lookup instead of an interpolation.
postsolutionDF['1new'] = np.interp(postsolutionDF['1'].values, factorimportpcntDF['F1_Val'], factorimportpcntDF['F1_Pcnt'])
postsolutionDF['2new'] = np.interp(postsolutionDF['2'].values, factorimportpcntDF['F2_Val'], factorimportpcntDF['F2_Pcnt'])
postsolutionDF['3new'] = np.interp(postsolutionDF['3'].values, factorimportpcntDF['F3_Val'], factorimportpcntDF['F3_Pcnt'])
postsolutionDF['num_predict'] = postsolutionDF['1new'] * postsolutionDF['2new'] * postsolutionDF['3new'] * 2700
postsolutionDF.drop(columns=['1new', '2new', '3new'], inplace=True)
Gives the output:
In [167]: postsolutionDF
Out[167]:
SCRN 1 2 3 num_predict
0 2019-01-22-0000001 0.5 2 -6.0 56.610936
1 2019-01-22-0000002 0.5 2 1.0 11.737312
2 2019-01-22-0000003 0.5 2 8.0 4.820801
I had to pad out the factorimportpcntDF so all the columns had 4 values, otherwise looking up the highest value for a column wouldn't work. You can just use the same value multiple times, or split it into 3 lookup tables if you prefer, then the columns could be different lengths.
factorimportpcntDF = pd.DataFrame({'F1_Val' : [0.5, 1, 1.5, 2], 'F1_Pcnt' : [0.1182, 0.2938, 0.4371, 0.5433],
'F2_Val' : [2, 3, 3, 3], 'F2_Pcnt' : [0.2098, 0.7585, 0.7585, 0.7585],
'F3_Val' : [-6, 1, 8, 8], 'F3_Pcnt' : [0.8455, 0.1753, 0.072, 0.072]})
Note that the documentation specifies that your F1_Val etc. columns need to be in increasing order (yours already are, just an FYI). Otherwise np.interp will still run, but it won't necessarily give good results.
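If you would rather stay in pandas, one possible alternative (a sketch I haven't benchmarked, assuming every value in the numbered columns has an exact match in the corresponding _Val column) is to build a dict per factor and use Series.map for the lookup:
num_predict = pd.Series(2700.0, index=postsolutionDF.index)
for k in range(1, 4):
    lookup = dict(zip(factorimportpcntDF['F%d_Val' % k], factorimportpcntDF['F%d_Pcnt' % k]))
    num_predict *= postsolutionDF[str(k)].map(lookup)
postsolutionDF['num_predict'] = num_predict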
