Aloha,
I have the following DataFrame:
import numpy as np
import pandas as pd
stores = [1,2,3,4,5]
weeks = [1,1,1,1,1]
df = pd.DataFrame({'Stores' : stores,
                   'Weeks' : weeks})
df = pd.concat([df]*53)
df['Weeks'] = df['Weeks'].add(df.groupby('Stores').cumcount())
df['Target'] = np.random.randint(400,600,size=len(df))
df['Actual'] = np.random.randint(350,800,size=len(df))
df['Variance %'] = (df['Target'] - df['Actual']) / df['Target']
df.loc[df['Variance %'] >= 0.01, 'Status'] = 'underTarget'
df.loc[df['Variance %'] <= 0.01, 'Status'] = 'overTarget'
df['Status'] = df['Status'].fillna('atTarget')
df.sort_values(['Stores','Weeks'],inplace=True)
This gives me the following:
print(df.head(7))
Stores Weeks Target Actual Variance % Status
0 1 1 430 605 -0.406977 overTarget
0 1 2 549 701 -0.276867 overTarget
0 1 3 471 509 -0.080679 overTarget
0 1 4 549 378 0.311475 underTarget
0 1 5 569 708 -0.244288 overTarget
0 1 6 574 650 -0.132404 overTarget
0 1 7 466 623 -0.336910 overTarget
What I'm trying to do is a cumulative count per store of consecutive weeks with the same status (over or under target), with the counter resetting whenever the status changes.
I thought the following (and many variants of it) would be the best way to do this, but it does not reset the counter.
s = df.groupby(['Stores','Weeks','Status'])['Status'].shift().ne(df['Status'])
df['Count'] = s.groupby(df['Stores']).cumsum()
My logic was to group by the relevant columns and compare each row with its shifted value (!=) to reset the cumsum.
Naturally I've scoured lots of different questions, but I can't seem to figure this out. Would anyone be so kind as to explain the best way to tackle this problem?
I hope everything here is clear and reproducible. Please let me know if you need any additional information.
Expected Output
Stores Weeks Target Actual Variance % Status Count
0 1 1 430 605 -0.406977 overTarget 1
0 1 2 549 701 -0.276867 overTarget 2
0 1 3 471 509 -0.080679 overTarget 3
0 1 4 549 378 0.311475 underTarget 1 # Reset here as status changes
0 1 5 569 708 -0.244288 overTarget 1 # Reset again.
0 1 6 574 650 -0.132404 overTarget 2
0 1 7 466 623 -0.336910 overTarget 3
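Try pd.Series.groupby() after creating a run key with cumsum. The key increases by one each time Status changes within a store; transform keeps the original index, so the key lines up with df:
s = df.groupby('Stores')['Status'].transform(lambda x: x.ne(x.shift()).cumsum())
df['Count'] = df.groupby([df.Stores, s]).cumcount() + 1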
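To make the idea behind the run key easier to follow, here is a minimal sketch on a small hand-made frame. The sample rows and the names changed and run_id are only illustrative and are not from the original post. A row starts a new run when its Status differs from the previous row of the same store; the running total of those change flags labels each run, and cumcount then numbers the rows inside each run.

```python
import pandas as pd

# Small illustrative frame (values are made up for the example)
df = pd.DataFrame({
    "Stores": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
    "Status": ["overTarget", "overTarget", "overTarget", "underTarget",
               "overTarget", "overTarget", "overTarget",
               "underTarget", "underTarget", "overTarget"],
})

# True where the status differs from the previous row of the same store
# (the first row of every store compares against NaN, so it is True)
changed = df["Status"].ne(df.groupby("Stores")["Status"].shift())

# Running total of changes within each store: one id per consecutive run
run_id = changed.astype(int).groupby(df["Stores"]).cumsum()

# Number the rows inside each (store, run) pair, starting at 1
df["Count"] = df.groupby(["Stores", run_id]).cumcount() + 1
print(df)
```

The rows are assumed to be sorted by store and week already, as in the question after sort_values.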
Related
I have these two dataframes:
sp_client
ConnectionID Value
0 CN01493292 495
1 CN01492424 440
2 CN01491959 403
3 CN01493200 312
4 CN01493278 282
.. ... ...
110 CN01492864 1
111 CN01492513 1
112 CN01492899 1
113 CN01493010 1
114 CN01493032 1
[115 rows x 2 columns]
sp_server
ConnectionID Value
1 CN01491920 2
1 CN01491920 2
3 CN01491922 2
3 CN01491922 2
5 CN01491928 2
.. ... ...
595 CN01493166 3
595 CN01493166 3
595 CN01493166 3
597 CN01493163 2
597 CN01493163 2
[673 rows x 2 columns]
I would like to merge them so that sp_client['Value'] is incremented by sp_server['Value'], but only for rows where sp_server['ConnectionID'] == sp_client['ConnectionID'].
This was a little complicated for me. I tried the following, but I am missing the condition part. Maybe the frames do not need to be merged in the first place; I am happy to hear suggestions.
As per my comment, try appending the tables and grouping them by ID while summing the Value column, as in this example:
all_data = pd.concat([sp_server,sp_client])
all_data = all_data.groupby('ConnectionID')['Value'].sum().reset_index()
out:
ConnectionID Value
0 CN01491920 4
1 CN01491922 4
2 CN01491928 2
3 CN01491959 403
4 CN01492424 440
5 CN01493200 312
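Note that concatenating and summing also keeps IDs that appear only in sp_server. If only the sp_client rows should survive, one possible variant is to total the server values per ID and map them onto the client frame. This is a sketch with made-up IDs and values; the names server_totals and result are illustrative.

```python
import pandas as pd

# Illustrative data only; the real frames have many more rows
sp_client = pd.DataFrame({"ConnectionID": ["CN01", "CN02", "CN03"],
                          "Value": [495, 440, 403]})
sp_server = pd.DataFrame({"ConnectionID": ["CN01", "CN01", "CN04"],
                          "Value": [2, 2, 3]})

# Total server value per ConnectionID (duplicate server rows are summed)
server_totals = sp_server.groupby("ConnectionID")["Value"].sum()

# Add the server total only where the client ID has a match; others get +0
result = sp_client.copy()
increment = result["ConnectionID"].map(server_totals).fillna(0).astype(int)
result["Value"] = result["Value"] + increment
print(result)
```

Here CN01 is incremented by 4, the unmatched client IDs are unchanged, and the server-only ID CN04 does not appear in the result.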
I am trying to group a dataframe into chunks of 3 rows and get, for each group, the row with the highest value in one column, but the max method is applied to every column independently. How can I achieve this?
What I do:
In [69]: fr
Out[69]:
ping delta
0 1516190798773 161
1 1516191845372 143
2 1516192904988 144
3 1516193952748 295
4 1516195008033 233
5 1516196049407 252
In [70]: fr.groupby(fr.index // 3).max()
Out[70]:
ping delta
0 1516192904988 161
1 1516196049407 295
Result I want to get:
ping delta
0 1516190798773 161
1 1516193952748 295
If you want the first value in the ping column and the max value in delta:
df = fr.groupby(fr.index // 3).agg({'delta':'max','ping':'first'})
print (df)
ping delta
0 1516190798773 161
1 1516193952748 295
If you want the max value in delta and the whole corresponding row:
df = fr.loc[fr.groupby(fr.index // 3)['delta'].idxmax()]
print (df)
ping delta
0 1516190798773 161
3 1516193952748 295
A better sample to show the difference:
print (fr)
ping delta
0 1516190798773 161
1 1516191845372 143
2 1516192904988 144
3 1516193952748 233 <-swapped values 233
4 1516195008033 295 <-swapped values 295
5 1516196049407 252
df = fr.groupby(fr.index // 3).agg({'delta':'max','ping':'first'})
print (df)
ping delta
0 1516190798773 161
1 1516193952748 295
df = fr.loc[fr.groupby(fr.index // 3)['delta'].idxmax()]
print (df)
ping delta
0 1516190798773 161
4 1516195008033 295
Suppose I construct a multi-index dataframe like the one shown here:
import numpy as np
import pandas as pd
prim_ind = np.array(range(0,1000))
for i in range(0,1000):
    prim_ind[i] = round(i/4)
d = {'prim_ind' : prim_ind,
     'sec_ind' : np.array(range(1,1001)),
     'a' : np.array(range(325,1325)),
     'b' : np.array(range(8318,9318))}
df = pd.DataFrame(d).set_index(['prim_ind','sec_ind'])
The sec_ind runs sequentially from 1 upwards, but I want to reset this second index so that for each of the prim_ind levels the sec_ind always starts at 1. I have been trying to work out if I can use reset index to do this but am failing miserably.
I know I could iterate over the dataframe to get this outcome, but that would be a horrible way to do it and there must be a more pythonic way. Can anyone help?
Note: the dataframe I'm working with is actually imported from CSV; the code above is just to illustrate the question.
You can use cumcount to number the rows within each category.
df.index = [df.index.get_level_values(0), df.groupby(level=0).cumcount() + 1]
Or, better, use MultiIndex.from_arrays if you also want to keep the index names:
df.index = pd.MultiIndex.from_arrays([df.index.get_level_values(0),
                                      df.groupby(level=0).cumcount() + 1],
                                     names=df.index.names)
print (df)
a b
prim_ind sec_ind
0 1 325 8318
2 326 8319
3 327 8320
1 1 328 8321
2 329 8322
3 330 8323
2 1 331 8324
So the column sec_ind is not necessary; you can also use:
d = {'prim_ind' : prim_ind,
     'a' : np.array(range(325,1325)),
     'b' : np.array(range(8318,9318))}
df = pd.DataFrame(d)
print (df.head(8))
a b prim_ind
0 325 8318 0
1 326 8319 0
2 327 8320 0
3 328 8321 1
4 329 8322 1
5 330 8323 1
6 331 8324 2
7 332 8325 2
df = df.set_index(['prim_ind', df.groupby('prim_ind').cumcount() + 1]) \
.rename_axis(('first','second'))
print (df.head(8))
a b
first second
0 1 325 8318
2 326 8319
3 327 8320
1 1 328 8321
2 329 8322
3 330 8323
2 1 331 8324
2 332 8325
I am aligning two dataframes which look like the following:
Dataframe 1
Timestamp L_x L_y L_a R_x R_y R_a
2403950 621.3 461.3 313 623.3 461.8 260
2404050 622.5 461.3 312 623.3 462.6 260
2404150 623.1 461.5 311 623.4 464 261
2404250 623.6 461.7 310 623.7 465.4 261
2404350 623.8 461.5 309 623.9 466.1 261
Dataframe 2
This dataframe contains the timestamps at which a particular event occurred.
Timestamp
0 2404030
1 2404050
2 2404250
3 2404266
4 2404282
5 2404298
6 2404314
7 2404330
8 2404350
9 2404382
All timestamps are in milliseconds. As you can see, the first dataframe is resampled to 100 milliseconds. What I want to do is align the two dataframes based on count, that is, on how many events occur during each 100-millisecond bin. For example, in the first 100-millisecond bin of dataframe 1 (2403950 - 2404049), only one event occurs according to the second dataframe, at 2404030, and so on. The aligned table should look like the following:
Timestamp L_x L_y L_a R_x R_y R_a count
2403950 621.3 461.3 313 623.3 461.8 260 1
2404050 622.5 461.3 312 623.3 462.6 260 1
2404150 623.1 461.5 311 623.4 464 261 0
2404250 623.6 461.7 310 623.7 465.4 261 6
2404350 623.8 461.5 309 623.9 466.1 261 2
Thank you for your help and suggestion.
You want to perform integer division on the timestamp (i.e. a // b), but first need to add 50 to it given your bucketing. Then convert it back into the correct units by multiplying by 100 and subtracting 50.
Now, group on this new index and perform a count.
You then merge these counts to your original dataframe and do some formatting operations to get the data in the desired shape. Make sure to fill NaNs with zero.
df2['idx'] = (df2.Timestamp + 50) // 100 * 100 - 50
counts = df2.groupby('idx').count()
>>> counts
Timestamp
idx
2403950 1
2404050 1
2404250 6
2404350 2
df_new =df.merge(counts, how='left', left_on='Timestamp', right_index=True, suffixes=['', '_'])
columns = list(df_new)
columns[-1] = 'count'
df_new.columns = columns
df_new['count'] = df_new['count'].fillna(0).astype(int)
>>> df_new
Timestamp L_x L_y L_a R_x R_y R_a count
0 2403950 621.3 461.3 313 623.3 461.8 260 1
1 2404050 622.5 461.3 312 623.3 462.6 260 1
2 2404150 623.1 461.5 311 623.4 464.0 261 0
3 2404250 623.6 461.7 310 623.7 465.4 261 6
4 2404350 623.8 461.5 309 623.9 466.1 261 2
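As a compact check of the bucketing arithmetic, here is a self-contained sketch that uses only the Timestamp column of the first frame and the event timestamps from the question. The names events and bin_start are illustrative, and map is used here in place of the merge above. For instance, (2404030 + 50) // 100 * 100 - 50 gives 2403950, the start of the first bin.

```python
import pandas as pd

# Bin starts from dataframe 1 (other columns omitted for brevity)
df = pd.DataFrame({"Timestamp": [2403950, 2404050, 2404150, 2404250, 2404350]})

# Event timestamps from dataframe 2
events = pd.Series([2404030, 2404050, 2404250, 2404266, 2404282,
                    2404298, 2404314, 2404330, 2404350, 2404382])

# Shift by 50 so bins that start at ...50 align with multiples of 100,
# floor to the bin, then shift back to get the bin's start timestamp
bin_start = (events + 50) // 100 * 100 - 50

# Events per bin, looked up by bin start; bins with no events become 0
counts = bin_start.value_counts()
df["count"] = df["Timestamp"].map(counts).fillna(0).astype(int)
print(df)
```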
Problem:
I'm trying to merge two relatively small datasets, but the merge raises a MemoryError. Both datasets hold aggregates of country trade data, and I'm merging on the keys year and country, so the rows have to be matched on those keys rather than simply stacked. Unfortunately, this rules out concat and its performance benefits, as seen in the answer to this question: MemoryError on large merges with pandas in Python.
Here's the setup:
The attempted merge:
df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"])
Basic data structure:
i:
Year Reporter_Code Trade_Flow_Code Partner_Code Classification Commodity Code Quantity Unit Code Supplementary Quantity Netweight (kg) Value Estimation Code
0 2003 381 2 36 H2 070951 8 1274 1274 13810 0
1 2003 381 2 36 H2 070930 8 17150 17150 30626 0
2 2003 381 2 36 H2 0709 8 20493 20493 635840 0
3 2003 381 1 36 H2 0507 8 5200 5200 27619 0
4 2003 381 1 36 H2 050400 8 56439 56439 683104 0
df:
mporter cod CC ComTrade_CC Distance_miles
0 110 215 215 757 428.989
1 110 215 215 757 428.989
2 110 215 215 757 428.989
3 110 215 215 757 428.989
4 110 215 215 757 428.989
Error Traceback:
MemoryError Traceback (most recent call last)
<ipython-input-10-8d6e9fb45de6> in <module>()
1 for i in c_list:
----> 2 df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"])
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
36 right_index=right_index, sort=sort, suffixes=suffixes,
37 copy=copy)
---> 38 return op.get_result()
39 if __debug__:
40 merge.__doc__ = _merge_doc % '\nleft : DataFrame'
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self)
193 copy=self.copy)
194
--> 195 result_data = join_op.get_result()
196 result = DataFrame(result_data)
197
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self)
693 if klass in mapping:
694 klass_blocks.extend((unit, b) for b in mapping[klass])
--> 695 res_blk = self._get_merged_block(klass_blocks)
696
697 # if we have a unique result index, need to clear the _ref_locs
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _get_merged_block(self, to_merge)
706 def _get_merged_block(self, to_merge):
707 if len(to_merge) > 1:
--> 708 return self._merge_blocks(to_merge)
709 else:
710 unit, block = to_merge[0]
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _merge_blocks(self, merge_chunks)
728 # Should use Fortran order??
729 block_dtype = _get_block_dtype([x[1] for x in merge_chunks])
--> 730 out = np.empty(out_shape, dtype=block_dtype)
731
732 sofar = 0
MemoryError:
Thanks for your thoughts!
In case anyone coming across this question still has similar trouble with merge, you can probably get concat to work by renaming the relevant columns in the two dataframes to the same names, setting them as a MultiIndex (i.e. df = df.set_index(['A','B'])), and then using concat with axis=1 to join them.
UPDATE
Example:
df1 = pd.DataFrame({'A':[1, 2], 'B':[2, 3], 'C':[3, 4]})
df2 = pd.DataFrame({'A':[1, 2], 'B':[2, 3], 'D':[7, 8]})
both = pd.concat([df1.set_index(['A','B']), df2.set_index(['A','B'])], axis=1).reset_index()
df1
A B C
0 1 2 3
1 2 3 4
df2
A B D
0 1 2 7
1 2 3 8
both
A B C D
0 1 2 3 7
1 2 3 4 8
I haven't benchmarked this approach, but it did not raise the MemoryError and it worked for my applications.