I have a pandas DataFrame:
 id  colA  colB  colC
194     1     0     1
194     1     1     0
194     2     1     3
195     1     1     2
195     0     1     0
197     1     1     2
I would like to calculate the occurrence of each value, grouped by id. In my case, the expected result is:
 id  countOfValue0  countOfValue1  countOfValue2  countOfValue3
194              2              3              1              1
195              1              2              1              0
197              0              1              1              0
If a value appears more than once in the same row, count it only once for that row (this is why, for id=194, the count of value 1 is 3).
I thought about splitting the data into 3 DataFrames grouped by id-colA, id-colB, id-colC,
something like df.groupby(['id', 'colA']), but I can't find a proper way to calculate those DataFrame values based on id. There is probably a more efficient way of doing this.
Try:
res = df.set_index("id", append=True).stack()\
        .reset_index(level=0).reset_index(level=1, drop=True)\
        .drop_duplicates().assign(_dummy=1)\
        .rename(columns={0: "countOfValue"})\
        .pivot_table(index="id", columns="countOfValue", values="_dummy", aggfunc="sum")\
        .fillna(0).astype(int)
res = res.add_prefix("countOfValue")
res.columns.name = None  # clear the leftover "countOfValue" columns name (del res.columns.name fails on newer pandas)
Outputs:
     countOfValue0  ...  countOfValue3
id                  ...
194              2  ...              1
195              1  ...              0
197              0  ...              0
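For comparison, here is a rough alternative sketch of the same idea (not part of the answer above): melt the frame to long form, keep each value at most once per original row, then cross-tabulate id against value. Column names follow the example data.

import pandas as pd

long_df = (df.reset_index()  # keep the original row label as a key
             .melt(id_vars=["index", "id"], value_vars=["colA", "colB", "colC"]))
long_df = long_df.drop_duplicates(subset=["index", "value"])  # count a value once per row
res = (pd.crosstab(long_df["id"], long_df["value"])
         .add_prefix("countOfValue")
         .rename_axis(columns=None))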
I have a pandas data frame (about 500,000 rows) with a datetime index and 3 columns (a, b, c):
                          a    b     c
2016-03-30 09:59:36.619   0   55     0
2016-03-30 09:59:41.979   0   20     0
2016-03-30 09:59:41.986   0    1     0
2016-03-30 09:59:45.853   0    1     3
2016-03-30 09:59:51.265   0   20     9
2016-03-30 10:00:03.273   0   55    26
2016-03-30 10:00:05.658   0   55    28
2016-03-30 10:00:17.416   0  156     0
2016-03-30 10:00:17.928   0  122  1073
2016-03-30 10:00:21.933   0  122     0
2016-03-30 10:00:31.937   0  122    10
2016-03-30 10:00:40.941   0  122     0
2016-03-30 10:00:51.147  10    2     0
2016-03-30 10:01:27.060   0  156     0
I want to search within a 10 minute rolling window and remove duplicate items from one of the columns (column b), to get something like this:
                          a    b     c
2016-03-30 09:59:36.619   0   55     0
2016-03-30 09:59:41.979   0   20     0
2016-03-30 09:59:41.986   0    1     0
2016-03-30 09:59:51.265   0   20     9
2016-03-30 10:00:03.273   0   55    26
2016-03-30 10:00:17.416   0  156     0
2016-03-30 10:00:17.928   0  122  1073
2016-03-30 10:00:51.147  10    2     0
2016-03-30 10:01:27.060   0  156     0
Using drop_duplicates with rolling_apply comes to mind, but these two functions don't play well together, i.e.:
pd.rolling_apply(df, '10T', lambda x:x.drop_duplicates(subset='b'))
raises an error, since the applied function must return a single value, not a DataFrame.
So this is what I have so far:
import datetime as dt
import numpy

windows = []
for ind in range(len(df)):
    t0 = df.index[ind]
    t1 = df.index[ind] + dt.timedelta(minutes=10)
    windows.append(df[numpy.logical_and(t0 < df.index,
                                        df.index <= t1)].drop_duplicates(subset='b'))
Here I end up with a list of 10 min dataframes with duplicates removed, but there are a lot of overlapping values as the window rolls on to the next 10 min segment. To keep the unique values, I've tried something like:
new_df = []
for ind in range(len(windows)-1):
    new_df.append(pd.unique(pd.concat([pd.Series(windows[ind].index),
                                       pd.Series(windows[ind+1].index)])))
But this doesn't work, and it's already starting to get messy. Does anyone have any bright ideas how to solve this as efficiently as possible?
Thanks in advance.
I hope this is useful. I roll a function that checks if the last value is a duplicate of an earlier element over a 10 minute window. The result can be used with boolean indexing.
import pandas as pd

# Simple example
dates = pd.date_range('2017-01-01', periods=5, freq='4min')
col1 = [1, 2, 1, 3, 2]
df = pd.DataFrame({'col1': col1}, index=dates)

# Make a function that checks if the last element is a duplicate
def last_is_duplicate(a):
    if len(a) > 1:
        return a[-1] in a[:len(a)-1]
    else:
        return False

# Roll over a 10 minute window to find duplicates of recent elements
# (raw=True hands the function a NumPy array, so the membership test checks values)
dup = df.col1.rolling('10T').apply(last_is_duplicate, raw=True).astype('bool')

# Keep only those rows for which col1 is not a recent duplicate
df[~dup]
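Applied to the question's data, the same pattern looks something like this (a sketch; df_orig stands in for the original 500,000-row frame with the datetime index and column b):

# Flag rows whose value in column 'b' already appeared within the preceding 10 minutes
dup_b = df_orig['b'].rolling('10T').apply(last_is_duplicate, raw=True).astype(bool)
deduped = df_orig[~dup_b]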
I have a Series:
350    0
254    1
490    0
688    0
393    1
30     1
and a DataFrame:
     0  outcome
0  350        1
1  254        1
2  490        0
3  688        0
4  393        0
5   30        1
The code below counts the total number of matches between the Series and the outcome column in the DataFrame, which is what I intended.
Is there a better way to do this than the loop below?
i = 0
match = 0
for pred in results['outcome']:
    if test.values[i] == pred:
        match += 1
    i += 1
print(match)
I tried using results['Survived'].eq(labels_test).sum(), but the answer is wrong.
I also tried using a lambda, but the syntax was wrong.
You can compare by mapping the Series over the id column, i.e.
(df['0'].map(s) == df['outcome']).sum()
4
First, align the dataframe and series using align.
df, s = df.set_index('0').align(s, axis=0)
Next, compare the outcome column with the values in s and count the number of True values -
df.outcome.eq(s).sum()
4
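For reference, a self-contained sketch reproducing both answers on the data shown above (the Series/DataFrame construction here is an assumption based on the snippets in the question):

import pandas as pd

# Predictions, indexed by id (mirrors the Series in the question)
s = pd.Series([0, 1, 0, 0, 1, 1], index=[350, 254, 490, 688, 393, 30])
# Ids in column '0' plus the true outcome (mirrors the DataFrame in the question)
df = pd.DataFrame({'0': [350, 254, 490, 688, 393, 30],
                   'outcome': [1, 1, 0, 0, 0, 1]})

# Approach 1: map each id onto its prediction, then compare
print((df['0'].map(s) == df['outcome']).sum())   # 4

# Approach 2: align both objects on the id index, then compare
df_a, s_a = df.set_index('0').align(s, axis=0)
print(df_a.outcome.eq(s_a).sum())                # 4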
I have some data imported from a CSV; to create something similar, I used this:
data = pd.DataFrame([[1,0,2,3,4,5],[0,1,2,3,4,5],[1,1,2,3,4,5],[0,0,2,3,4,5]], columns=['split','sex', 'group0Low', 'group0High', 'group1Low', 'group1High'])
means = data.groupby(['split','sex']).mean()
so the dataframe looks something like this:
           group0Low  group0High  group1Low  group1High
split sex
0     0            2           3          4           5
      1            2           3          4           5
1     0            2           3          4           5
      1            2           3          4           5
You'll notice that each column actually contains 2 variables (the group number and the level, Low/High). (It was set up this way for running repeated measures ANOVA in SPSS.)
I want to split the columns up, so I can also groupby "group", like this (I actually screwed up the order of the numbers, but hopefully the idea is clear):
                 low  high
split sex group
0     0   0       95   265
          1      123    54
      1   0      120   220
          1       98   111
1     0   0      150   190
          1      211   300
      1   0      139    86
          1      132   250
How do I achieve this?
The first trick is to gather the columns into a single column using stack:
In [6]: means
Out[6]:
           group0Low  group0High  group1Low  group1High
split sex
0     0            2           3          4           5
      1            2           3          4           5
1     0            2           3          4           5
      1            2           3          4           5

In [13]: stacked = means.stack().reset_index(level=2)

In [14]: stacked.columns = ['group_level', 'mean']

In [15]: stacked.head(2)
Out[15]:
          group_level  mean
split sex
0     0     group0Low     2
      0    group0High     3
Now we can do whatever string operations we want on group_level using pd.Series.str as follows:
In [18]: stacked['group'] = stacked.group_level.str[:6]

In [21]: stacked['level'] = stacked.group_level.str[6:]

In [22]: stacked.head(2)
Out[22]:
          group_level  mean   group level
split sex
0     0     group0Low     2  group0   Low
      0    group0High     3  group0  High
Now you're in business and you can do whatever you want. For example, sum each group/level:
In [31]: stacked.groupby(['group', 'level']).sum()
Out[31]:
              mean
group  level
group0 High     12
       Low       8
group1 High     20
       Low      16
How do I group by everything?
If you want to group by split, sex, group and level you can do:
In [113]: stacked.reset_index().groupby(['split', 'sex', 'group', 'level']).sum().head(4)
Out[113]:
                        mean
split sex group  level
0     0   group0 High      3
                 Low       2
          group1 High      5
                 Low       4
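To land back on the layout sketched in the question (one column per Low/High level), the grouped sum can be unstacked afterwards; a small sketch building on stacked:

# Sum per split/sex/group, then move 'level' back into the columns
wide = (stacked.reset_index()
               .groupby(['split', 'sex', 'group', 'level'])['mean'].sum()
               .unstack('level'))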
What if the split is not always at location 6?
This SO answer will show you how to do the splitting more intelligently.
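For instance, a regular expression copes with names where the split point moves around; a hedged sketch, assuming every column name matches group<digits> followed by Low or High:

# Split 'group0Low' -> ('group0', 'Low') without relying on a fixed character position
parts = stacked.group_level.str.extract(r'(?P<group>group\d+)(?P<level>Low|High)')
stacked['group'] = parts['group']
stacked['level'] = parts['level']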
This can be done by first constructing a multi-level index on the column names and then reshaping the dataframe with stack.
import pandas as pd
import numpy as np
# some artificial data
# ==================================
multi_index = pd.MultiIndex.from_arrays([[0,0,1,1], [0,1,0,1]], names=['split', 'sex'])
np.random.seed(0)
df = pd.DataFrame(np.random.randint(50,300, (4,4)), columns='group0Low group0High group1Low group1High'.split(), index=multi_index)
df
           group0Low  group0High  group1Low  group1High
split sex
0     0          222          97        167         242
      1          117         245        153          59
1     0          261          71        292          86
      1          137         120        266         138
# processing
# ==============================
level_group = np.where(df.columns.str.contains('0'), 0, 1)
# output: array([0, 0, 1, 1])
level_low_high = np.where(df.columns.str.contains('Low'), 'low', 'high')
# output: array(['low', 'high', 'low', 'high'], dtype='<U4')
multi_level_columns = pd.MultiIndex.from_arrays([level_group, level_low_high], names=['group', 'val'])
df.columns = multi_level_columns
df.stack(level='group')
val              high  low
split sex group
0     0   0        97  222
          1       242  167
      1   0       245  117
          1        59  153
1     0   0        71  261
          1        86  292
      1   0       120  137
          1       138  266
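From here the usual groupby machinery works on the new index level; for example, a small sketch of the mean of low/high per group:

stacked = df.stack(level='group')
stacked.groupby(level='group').mean()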
I have an N x 3 DataFrame called A that looks like this:
   _Segment  _Article  Binaire
0       550   5568226        1
1       550   5612047        1
2       550   5909228        1
3       550   5924375        1
4       550   5924456        1
5       550   6096557        1
....
The variable _Article is uniquely defined in A (there are N unique values of _Article in A).
I do a pivot:
B=A.pivot(index='_Segment', columns='_Article')
then replace the missing values (NaN) with zeros:
B[np.isnan(B)]=0
and get:
          Binaire                                                         \
_Article  2332299  2332329  2332337  2932377  2968223  3195643  3346080
_Segment
550             0        0        0        0        0        0        0
551             0        0        0        0        0        0        0
552             0        0        0        0        0        0        0
553             1        1        1        0        0        0        1
554             0        0        0        1        0        1        0
where columns were sorted lexicographically during the pivot.
My question is: how do I retain the sort order of _Article in A in the columns of B?
Thanks!
I think I got it. This works:
First, store the column _Article:
order_art = A['_Article']
In the pivot, add the "values" argument to avoid hierarchical columns (see http://pandas.pydata.org/pandas-docs/stable/reshaping.html), which prevent reindex from working properly:
B = A.pivot(index='_Segment', columns='_Article', values='Binaire')
then, as before, replace the NaNs with zeros:
B[np.isnan(B)]=0
and finally use reindex to restore the original order of variable _Article across columns:
B=B.reindex(columns=order_art)
Are there more elegant solutions?
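One way to fold the same steps into a single chain (a sketch; since _Article is unique, .unique() simply preserves its original order of appearance):

B = (A.pivot(index='_Segment', columns='_Article', values='Binaire')
      .reindex(columns=A['_Article'].unique())
      .fillna(0))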