I have a pandas DataFrame:
 id  colA  colB  colC
194     1     0     1
194     1     1     0
194     2     1     3
195     1     1     2
195     0     1     0
197     1     1     2
I would like to calculate the occurrence of each value, grouped by id. In my case, the expected result is:
 id  countOfValue0  countOfValue1  countOfValue2  countOfValue3
194              2              3              1              1
195              1              2              1              0
197              0              1              1              0
If a value appears more than once in the same row, count it only once for that row (this is why, for id=194, the count of value 1 is 3).
I thought about splitting the data into 3 DataFrames grouped by id-colA, id-colB, id-colC,
something like df.groupby(['id', 'colA']), but I can't find a proper way to calculate those DataFrame values based on id. There is probably a more efficient way of doing this.
Try:
res = df.set_index("id", append=True).stack()\
        .reset_index(level=0).reset_index(level=1, drop=True)\
        .drop_duplicates().assign(_dummy=1)\
        .rename(columns={0: "countOfValue"})\
        .pivot_table(index="id", columns="countOfValue", values="_dummy", aggfunc="sum")\
        .fillna(0).astype(int)
res = res.add_prefix("countOfValue")
res.columns.name = None  # clear the leftover "countOfValue" columns name (del res.columns.name fails on newer pandas)
Outputs:
     countOfValue0  ...  countOfValue3
id                  ...
194              2  ...              1
195              1  ...              0
197              0  ...              0
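For comparison, here is a rough alternative sketch of the same idea (not part of the answer above): melt the frame to long form, keep each value at most once per original row, then cross-tabulate id against value. Column names follow the example data.

import pandas as pd

long_df = (df.reset_index()  # keep the original row label as a key
             .melt(id_vars=["index", "id"], value_vars=["colA", "colB", "colC"]))
long_df = long_df.drop_duplicates(subset=["index", "value"])  # count a value once per row
res = (pd.crosstab(long_df["id"], long_df["value"])
         .add_prefix("countOfValue")
         .rename_axis(columns=None))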
I have a pandas data frame (about 500,000 rows) with a datetime index and 3 columns (a, b, c):
                          a    b     c
2016-03-30 09:59:36.619   0   55     0
2016-03-30 09:59:41.979   0   20     0
2016-03-30 09:59:41.986   0    1     0
2016-03-30 09:59:45.853   0    1     3
2016-03-30 09:59:51.265   0   20     9
2016-03-30 10:00:03.273   0   55    26
2016-03-30 10:00:05.658   0   55    28
2016-03-30 10:00:17.416   0  156     0
2016-03-30 10:00:17.928   0  122  1073
2016-03-30 10:00:21.933   0  122     0
2016-03-30 10:00:31.937   0  122    10
2016-03-30 10:00:40.941   0  122     0
2016-03-30 10:00:51.147  10    2     0
2016-03-30 10:01:27.060   0  156     0
I want to search within a 10 minute rolling window and remove duplicate items from one of the columns (column b), to get something like this:
                          a    b     c
2016-03-30 09:59:36.619   0   55     0
2016-03-30 09:59:41.979   0   20     0
2016-03-30 09:59:41.986   0    1     0
2016-03-30 09:59:51.265   0   20     9
2016-03-30 10:00:03.273   0   55    26
2016-03-30 10:00:17.416   0  156     0
2016-03-30 10:00:17.928   0  122  1073
2016-03-30 10:00:51.147  10    2     0
2016-03-30 10:01:27.060   0  156     0
Using drop_duplicates with rolling_apply comes to mind, but these two functions don't play well together, i.e.:
pd.rolling_apply(df, '10T', lambda x:x.drop_duplicates(subset='b'))
raises an error, since the applied function must return a single value, not a DataFrame.
So this is what I have so far:
import datetime as dt
import numpy

windows = []
for ind in range(len(df)):
    t0 = df.index[ind]
    t1 = df.index[ind] + dt.timedelta(minutes=10)
    windows.append(df[numpy.logical_and(t0 < df.index,
                                        df.index <= t1)].drop_duplicates(subset='b'))
Here I end up with a list of 10 min dataframes with duplicates removed, but there are a lot of overlapping values as the window rolls on to the next 10 min segment. To keep the unique values, I've tried something like:
new_df = []
for ind in range(len(windows)-1):
    new_df.append(pd.unique(pd.concat([pd.Series(windows[ind].index),
                                       pd.Series(windows[ind+1].index)])))
But this doesn't work, and it's already starting to get messy. Does anyone have any bright ideas how to solve this as efficiently as possible?
Thanks in advance.
I hope this is useful. I roll a function that checks if the last value is a duplicate of an earlier element over a 10 minute window. The result can be used with boolean indexing.
import pandas as pd

# Simple example
dates = pd.date_range('2017-01-01', periods=5, freq='4min')
col1 = [1, 2, 1, 3, 2]
df = pd.DataFrame({'col1': col1}, index=dates)

# Make a function that checks if the last element is a duplicate
def last_is_duplicate(a):
    if len(a) > 1:
        return a[-1] in a[:len(a)-1]
    else:
        return False

# Roll over a 10 minute window to find duplicates of recent elements
# (raw=True hands the function a NumPy array, so the membership test checks values)
dup = df.col1.rolling('10T').apply(last_is_duplicate, raw=True).astype('bool')

# Keep only those rows for which col1 is not a recent duplicate
df[~dup]
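Applied to the question's data, the same pattern looks something like this (a sketch; df_orig stands in for the original 500,000-row frame with the datetime index and column b):

# Flag rows whose value in column 'b' already appeared within the preceding 10 minutes
dup_b = df_orig['b'].rolling('10T').apply(last_is_duplicate, raw=True).astype(bool)
deduped = df_orig[~dup_b]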
I have a Series:
350    0
254    1
490    0
688    0
393    1
30     1
and a DataFrame:
     0  outcome
0  350        1
1  254        1
2  490        0
3  688        0
4  393        0
5   30        1
The code below counts the total number of matches between the Series and the outcome column in the DataFrame, which is what I intended.
Is there a better way to do this than the loop below?
i = 0
match = 0
for pred in results['outcome']:
    if test.values[i] == pred:
        match += 1
    i += 1
print(match)
I tried using results['Survived'].eq(labels_test).sum(), but the answer is wrong.
I also tried using a lambda, but the syntax was wrong.
You can compare by mapping the Series over the id column, i.e.
(df['0'].map(s) == df['outcome']).sum()
4
First, align the dataframe and series using align.
df, s = df.set_index('0').align(s, axis=0)
Next, compare the outcome column with the values in s and count the number of True values -
df.outcome.eq(s).sum()
4
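For reference, a self-contained sketch reproducing both answers on the data shown above (the Series/DataFrame construction here is an assumption based on the snippets in the question):

import pandas as pd

# Predictions, indexed by id (mirrors the Series in the question)
s = pd.Series([0, 1, 0, 0, 1, 1], index=[350, 254, 490, 688, 393, 30])
# Ids in column '0' plus the true outcome (mirrors the DataFrame in the question)
df = pd.DataFrame({'0': [350, 254, 490, 688, 393, 30],
                   'outcome': [1, 1, 0, 0, 0, 1]})

# Approach 1: map each id onto its prediction, then compare
print((df['0'].map(s) == df['outcome']).sum())   # 4

# Approach 2: align both objects on the id index, then compare
df_a, s_a = df.set_index('0').align(s, axis=0)
print(df_a.outcome.eq(s_a).sum())                # 4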
I have some data imported from a CSV; to create something similar, I used this:
data = pd.DataFrame([[1,0,2,3,4,5],[0,1,2,3,4,5],[1,1,2,3,4,5],[0,0,2,3,4,5]], columns=['split','sex', 'group0Low', 'group0High', 'group1Low', 'group1High'])
means = data.groupby(['split','sex']).mean()
so the dataframe looks something like this:
           group0Low  group0High  group1Low  group1High
split sex
0     0            2           3          4           5
      1            2           3          4           5
1     0            2           3          4           5
      1            2           3          4           5
You'll notice that each column actually contains 2 variables (the group number and the level, Low/High). (It was set up this way for running repeated measures ANOVA in SPSS.)
I want to split the columns up, so I can also groupby "group", like this (I actually screwed up the order of the numbers, but hopefully the idea is clear):
                 low  high
split sex group
0     0   0       95   265
          1      123    54
      1   0      120   220
          1       98   111
1     0   0      150   190
          1      211   300
      1   0      139    86
          1      132   250
How do I achieve this?
The first trick is to gather the columns into a single column using stack:
In [6]: means
Out[6]:
           group0Low  group0High  group1Low  group1High
split sex
0     0            2           3          4           5
      1            2           3          4           5
1     0            2           3          4           5
      1            2           3          4           5

In [13]: stacked = means.stack().reset_index(level=2)

In [14]: stacked.columns = ['group_level', 'mean']

In [15]: stacked.head(2)
Out[15]:
          group_level  mean
split sex
0     0     group0Low     2
      0    group0High     3
Now we can do whatever string operations we want on group_level using pd.Series.str as follows:
In [18]: stacked['group'] = stacked.group_level.str[:6]

In [21]: stacked['level'] = stacked.group_level.str[6:]

In [22]: stacked.head(2)
Out[22]:
          group_level  mean   group level
split sex
0     0     group0Low     2  group0   Low
      0    group0High     3  group0  High
Now you're in business and you can do whatever you want. For example, sum each group/level:
In [31]: stacked.groupby(['group', 'level']).sum()
Out[31]:
              mean
group  level
group0 High     12
       Low       8
group1 High     20
       Low      16
How do I group by everything?
If you want to group by split, sex, group and level you can do:
In [113]: stacked.reset_index().groupby(['split', 'sex', 'group', 'level']).sum().head(4)
Out[113]:
                        mean
split sex group  level
0     0   group0 High      3
                 Low       2
          group1 High      5
                 Low       4
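To land back on the layout sketched in the question (one column per Low/High level), the grouped sum can be unstacked afterwards; a small sketch building on stacked:

# Sum per split/sex/group, then move 'level' back into the columns
wide = (stacked.reset_index()
               .groupby(['split', 'sex', 'group', 'level'])['mean'].sum()
               .unstack('level'))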
What if the split is not always at location 6?
This SO answer will show you how to do the splitting more intelligently.
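For instance, a regular expression copes with names where the split point moves around; a hedged sketch, assuming every column name matches group<digits> followed by Low or High:

# Split 'group0Low' -> ('group0', 'Low') without relying on a fixed character position
parts = stacked.group_level.str.extract(r'(?P<group>group\d+)(?P<level>Low|High)')
stacked['group'] = parts['group']
stacked['level'] = parts['level']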
This can be done by first constructing a multi-level index on the column names and then reshaping the dataframe with stack.
import pandas as pd
import numpy as np
# some artificial data
# ==================================
multi_index = pd.MultiIndex.from_arrays([[0,0,1,1], [0,1,0,1]], names=['split', 'sex'])
np.random.seed(0)
df = pd.DataFrame(np.random.randint(50,300, (4,4)), columns='group0Low group0High group1Low group1High'.split(), index=multi_index)
df
           group0Low  group0High  group1Low  group1High
split sex
0     0          222          97        167         242
      1          117         245        153          59
1     0          261          71        292          86
      1          137         120        266         138
# processing
# ==============================
level_group = np.where(df.columns.str.contains('0'), 0, 1)
# output: array([0, 0, 1, 1])
level_low_high = np.where(df.columns.str.contains('Low'), 'low', 'high')
# output: array(['low', 'high', 'low', 'high'], dtype='<U4')
multi_level_columns = pd.MultiIndex.from_arrays([level_group, level_low_high], names=['group', 'val'])
df.columns = multi_level_columns
df.stack(level='group')
val              high  low
split sex group
0     0   0        97  222
          1       242  167
      1   0       245  117
          1        59  153
1     0   0        71  261
          1        86  292
      1   0       120  137
          1       138  266
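From here the usual groupby machinery works on the new index level; for example, a small sketch of the mean of low/high per group:

stacked = df.stack(level='group')
stacked.groupby(level='group').mean()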
I have an N x 3 DataFrame called A that looks like this:
   _Segment  _Article  Binaire
0       550   5568226        1
1       550   5612047        1
2       550   5909228        1
3       550   5924375        1
4       550   5924456        1
5       550   6096557        1
....
The variable _Article is uniquely defined in A (there are N unique values of _Article in A).
I do a pivot:
B=A.pivot(index='_Segment', columns='_Article')
then replace the missing values (NaN) with zeros:
B[np.isnan(B)]=0
and get:
          Binaire                                                         \
_Article  2332299  2332329  2332337  2932377  2968223  3195643  3346080
_Segment
550             0        0        0        0        0        0        0
551             0        0        0        0        0        0        0
552             0        0        0        0        0        0        0
553             1        1        1        0        0        0        1
554             0        0        0        1        0        1        0
where columns were sorted lexicographically during the pivot.
My question is: how do I retain the sort order of _Article in A in the columns of B?
Thanks!
I think I got it. This works:
First, store the column _Article:
order_art = A['_Article']
In the pivot, add the "values" argument to avoid hierarchical columns (see http://pandas.pydata.org/pandas-docs/stable/reshaping.html), which prevent reindex from working properly:
B = A.pivot(index='_Segment', columns='_Article', values='Binaire')
then, as before, replace the NaNs with zeros:
B[np.isnan(B)]=0
and finally use reindex to restore the original order of variable _Article across columns:
B=B.reindex(columns=order_art)
Are there more elegant solutions?
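One way to fold the same steps into a single chain (a sketch; since _Article is unique, .unique() simply preserves its original order of appearance):

B = (A.pivot(index='_Segment', columns='_Article', values='Binaire')
      .reindex(columns=A['_Article'].unique())
      .fillna(0))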