I'm trying to find a more efficient way of transferring information from one DataFrame to another than iterating over rows. I have two DataFrames: one contains unique values in an 'id' column and a corresponding value in a 'region' column:
dfkey = DataFrame({'id': [1122, 3344, 3467, 1289, 7397, 1209, 5678, 1792, 1928, 4262, 9242],
                   'region': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]})
id region
0 1122 1
1 3344 2
2 3467 3
3 1289 4
4 7397 5
5 1209 6
6 5678 7
7 1792 8
8 1928 9
9 4262 10
10 9242 11
The other DataFrame contains the same ids, but possibly repeated and in no particular order:
df2 = DataFrame({'id': [1792, 1122, 3344, 1122, 3467, 1289, 7397, 1209, 5678],
                 'other': [3, 2, 3, 4, 3, 5, 7, 3, 1]})
id other
0 1792 3
1 1122 2
2 3344 3
3 1122 4
4 3467 3
5 1289 5
6 7397 7
7 1209 3
8 5678 1
I want to use the dfkey DataFrame as a lookup key to fill in the region of each id in the df2 DataFrame. I already found a way to do this with iterrows(), but it involves nested loops:
df2['region'] = 0
for i, rowk in dfkey.iterrows():
    for j, rowd in df2.iterrows():
        if rowk['id'] == rowd['id']:
            # write back via df2.loc; iterrows yields copies, so assigning to rowd would not stick
            df2.loc[j, 'region'] = rowk['region']
id other region
0 1792 3 8
1 1122 2 1
2 3344 3 2
3 1122 4 1
4 3467 3 3
5 1289 5 4
6 7397 7 5
7 1209 3 6
8 5678 1 7
The actual dfkey I have has 43K rows and df2 has 600K rows. The code has been running for an hour now, so I'm wondering if there's a more efficient way of doing this...
pandas.merge could be another solution.
newdf = pandas.merge(df2, dfkey, on='id')
In [22]: newdf
Out[22]:
id other region
0 1792 3 8
1 1122 2 1
2 1122 4 1
3 3344 3 2
4 3467 3 3
5 1289 5 4
6 7397 7 5
7 1209 3 6
8 5678 1 7
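Note that an inner merge drops any df2 rows whose id is missing from dfkey, and (as the output above shows) it may reorder the result. To keep every row of df2 in its original order, a left merge can be used (a sketch, not part of the original answer):
newdf = pandas.merge(df2, dfkey, on='id', how='left')  # unmatched ids get NaN in 'region'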
I would use the map() method:
In [268]: df2['region'] = df2['id'].map(dfkey.set_index('id').region)
In [269]: df2
Out[269]:
id other region
0 1792 3 8
1 1122 2 1
2 3344 3 2
3 1122 4 1
4 3467 3 3
5 1289 5 4
6 7397 7 5
7 1209 3 6
8 5678 1 7
Timing for a 900K-row df2 DataFrame:
In [272]: df2 = pd.concat([df2] * 10**5, ignore_index=True)
In [273]: df2.shape
Out[273]: (900000, 3)
In [274]: dfkey.shape
Out[274]: (11, 2)
In [275]: %timeit df2['id'].map(dfkey.set_index('id').region)
10 loops, best of 3: 176 ms per loop
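An equivalent sketch (my variant, not from the answer above) builds a plain dict once and maps with it, which avoids repeating the set_index step if the lookup is reused:
region_by_id = dict(zip(dfkey['id'], dfkey['region']))  # id -> region lookup
df2['region'] = df2['id'].map(region_by_id)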
I have a dataframe that has segments of consecutive values appearing in column a (the value in column b does not matter):
import pandas as pd
import numpy as np
np.random.seed(150)
df = pd.DataFrame(data={'a':[1,2,3,4,5,15,16,17,18,203,204,205],'b':np.random.randint(50000,size=(12))})
>>> df
a b
0 1 27066
1 2 28155
2 3 49177
3 4 496
4 5 2354
5 15 23292
6 16 9358
7 17 19036
8 18 29946
9 203 39785
10 204 15843
11 205 21917
I would like to add a column c whose values count sequentially within each run of consecutive values in column a, as shown below:
a b c
1 27066 1
2 28155 2
3 49177 3
4 496 4
5 2354 5
15 23292 1
16 9358 2
17 19036 3
18 29946 4
203 39785 1
204 15843 2
205 21917 3
How to do this?
One solution:
df["c"] = (s := df["a"] - np.arange(len(df))).groupby(s).cumcount() + 1
print(df)
Output
a b c
0 1 27066 1
1 2 28155 2
2 3 49177 3
3 4 496 4
4 5 2354 5
5 15 23292 1
6 16 9358 2
7 17 19036 3
8 18 29946 4
9 203 39785 1
10 204 15843 2
11 205 21917 3
The original idea - grouping on the value minus its position - comes from an old recipe in the Python itertools docs for finding runs of consecutive numbers with groupby.
The walrus operator (:=, i.e. assignment expressions) requires Python 3.8+; on earlier versions you can do the same thing in two steps:
s = df["a"] - np.arange(len(df))
df["c"] = s.groupby(s).cumcount() + 1
print(df)
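To see why grouping on s works: within each run of consecutive values in a, the value minus its position is constant, so it serves as the group key. A quick illustration (added sketch, not part of the original answer):
print((df["a"] - np.arange(len(df))).tolist())
# [1, 1, 1, 1, 1, 10, 10, 10, 10, 194, 194, 194]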
A simple solution is to flag the rows that continue a consecutive run, take a cumulative sum to get a running count, and then subtract the count accumulated before each run started.
a = df['a'].add(1).shift(1).eq(df['a'])
df['c'] = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0).astype(int) + 1
df
Result:
a b c
0 1 27066 1
1 2 28155 2
2 3 49177 3
3 4 496 4
4 5 2354 5
5 15 23292 1
6 16 9358 2
7 17 19036 3
8 18 29946 4
9 203 39785 1
10 204 15843 2
11 205 21917 3
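Another common idiom for the same task (a sketch, not taken from the answers above): start a new group whenever the step from the previous row is not exactly 1, then number the rows within each group.
group_id = df['a'].diff().ne(1).cumsum()        # new id each time the run breaks
df['c'] = df.groupby(group_id).cumcount() + 1   # 1-based position within each run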
I have a dataframe sorted by serial number and then by date, which I then group by serial number:
df = df.sort_values(by=["serial_num" , "date"])
df = df.groupby(df['serial_num'])['date'].some_function()
Dataframe
serial_num date
0 1 2001-01-01
1 1 2001-02-01
2 1 2001-03-01
3 1 2001-04-01
4 3 2003-05-01
5 3 2003-06-01
6 3 2003-07-01
7 7 2005-07-01
8 7 2005-08-01
9 7 2005-09-01
10 7 2005-10-01
11 7 2005-11-01
12 7 2005-12-01
Each unique serial_num group will be a line on a line graph.
The way it is graphed now, each line is a time series that starts at a different point, because the first date is different for every serial_num group.
I need the x-axis of the graph to represent elapsed time instead of calendar date, so that all lines start at the same point - the origin of the x-axis of my graph.
I think the easiest way to do this would be to add a consecutive index that starts at 0 for each group, like this:
serial_num date new_index
0 1 2001-01-01 0
1 1 2001-02-01 1
2 1 2001-03-01 2
3 1 2001-04-01 3
4 3 2003-05-01 0
5 3 2003-06-01 1
6 3 2003-07-01 2
7 7 2005-07-01 0
8 7 2005-08-01 1
9 7 2005-09-01 2
10 7 2005-10-01 3
11 7 2005-11-01 4
12 7 2005-12-01 5
Then, I think I will be able to graph (in Plotly) with all lines starting at the same point (the 0 index will be the first data point for each serial_num).
NOTE: each serial_num group has a different number of data points.
I'm unsure how to index with groupby this way. Please help! Or if you know another method that will accomplish the same goal, please share. Thanks!
Use cumcount:
df["new_index"] = df.groupby("serial_num").cumcount()
print(df)
==>
serial_num date new_index
0 1 2001-01-01 0
1 1 2001-02-01 1
2 1 2001-03-01 2
3 1 2001-04-01 3
4 3 2003-05-01 0
5 3 2003-06-01 1
6 3 2003-07-01 2
7 7 2005-07-01 0
8 7 2005-08-01 1
9 7 2005-09-01 2
10 7 2005-10-01 3
11 7 2005-11-01 4
12 7 2005-12-01 5
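With new_index in place, a minimal Plotly Express sketch would look something like the following; the 'value' column is hypothetical, since the sample frame only shows serial_num and date:
import plotly.express as px

# 'value' is a stand-in for whatever metric you actually plot per period
fig = px.line(df, x='new_index', y='value', color='serial_num',
              labels={'new_index': 'periods since first observation'})
fig.show()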
I have a dataframe like this:
action_type value
0 0 link_click 1
1 mobile_app_install 5
2 video_view 181
3 omni_view_content 2
1 0 post_reaction 32
1 link_click 124
2 mobile_app_install 190
3 video_view 6162
4 omni_custom 2420
5 omni_activate_app 4525
2 0 comment 1
1 link_click 53
2 post_reaction 23
3 video_view 2246
4 mobile_app_install 87
5 omni_view_content 24
6 post_engagement 2323
7 page_engagement 2323
I want to transpose it so that each action_type becomes its own column holding the corresponding value.
It looks like you can try:
(df.set_index('action_type', append=True)
.reset_index(level=1, drop=True)['value']
.unstack('action_type')
)
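A hedged alternative (my sketch, not the answer's code) keys a pivot_table on the outer index level; aggfunc='first' is only safe here because each (group, action_type) pair occurs once, so no real aggregation happens:
out = df.pivot_table(index=df.index.get_level_values(0),
                     columns='action_type',
                     values='value',
                     aggfunc='first')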
I'm trying to get all records where the mean of the last 3 rows is greater than the overall mean for all rows in a filtered set.
_filtered_d_all = _filtered_d.iloc[:, 0:50].loc[:, _filtered_d.mean()>0.05]
_last_n_records = _filtered_d.tail(3)
My attempt looks something like this:
_filtered_growing = _filtered_d.iloc[:, 0:50].loc[:, _last_n_records.mean() > _filtered_d.mean()]
However, this fails because the compared Series have different lengths. Any tips?
ValueError: Series lengths must match to compare
Sample Data
This has a MultiIndex on year and month, and 2 columns.
Col1 Col2
year month
2005 12 0.533835 0.170679
12 0.494733 0.198347
2006 3 0.440098 0.202240
6 0.410285 0.188421
9 0.502420 0.200188
12 0.522253 0.118680
2007 3 0.378120 0.171192
6 0.431989 0.145158
9 0.612036 0.178097
12 0.519766 0.252196
2008 3 0.547705 0.202163
6 0.560985 0.238591
9 0.617320 0.199537
12 0.343939 0.253855
Why not just boolean index directly on your filtered DataFrame with
df[df.tail(3).mean() > df.mean()]
Demo
>>> df
0 1 2 3 4
0 4 8 2 4 6
1 0 0 0 2 8
2 5 3 0 9 3
3 7 5 5 1 2
4 9 7 8 9 4
>>> df[df.tail(3).mean() > df.mean()]
0 1 2 3 4
0 4 8 2 4 6
1 0 0 0 2 8
2 5 3 0 9 3
3 7 5 5 1 2
Update: example for the MultiIndex edit
The same approach works fine for your MultiIndex sample; we just have to apply the mask a bit differently, selecting columns with .loc.
>>> df
col1 col2
2005 12 -0.340088 -0.574140
12 -0.814014 0.430580
2006 3 0.464008 0.438494
6 0.019508 -0.635128
9 0.622645 -0.824526
12 -1.674920 -1.027275
2007 3 0.397133 0.659467
6 0.026170 -0.052063
9 0.835561 0.608067
12 0.736873 -0.613877
2008 3 0.344781 -0.566392
6 -0.653290 -0.264992
9 0.080592 -0.548189
12 0.585642 1.149779
>>> df.loc[:,df.tail(3).mean() > df.mean()]
col2
2005 12 -0.574140
12 0.430580
2006 3 0.438494
6 -0.635128
9 -0.824526
12 -1.027275
2007 3 0.659467
6 -0.052063
9 0.608067
12 -0.613877
2008 3 -0.566392
6 -0.264992
9 -0.548189
12 1.149779
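Applied to the question's variable names (a sketch), the key point is to compute both means from the same frame so the compared Series have matching lengths and align on the same columns:
_filtered_growing = _filtered_d.loc[:, _filtered_d.tail(3).mean() > _filtered_d.mean()]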
I am new to pandas. I am trying to sort a column and group the rows by their values.
df = pd.read_csv("12Patients150526 mutations-ORIGINAL.txt", sep="\t", header=0)
samp=df["SAMPLE"]
samp
Out[3]:
0 11
1 2
2 9
3 1
4 8
5 2
6 1
7 3
8 10
9 4
10 5
..
53157 12
53158 3
53159 2
53160 10
53161 2
53162 3
53163 4
53164 11
53165 12
53166 11
Name: SAMPLE, dtype: int64
#sorting
grp=df.sort(samp)
This code does not work. Can somebody help me with my problem, please?
How can I sort the rows and group them by their values?
To sort df based on a particular column, use df.sort() and pass the column name as a parameter (note that in pandas 0.17+ this method is called sort_values()).
import pandas as pd
import numpy as np
# data
# ===========================
np.random.seed(0)
df = pd.DataFrame(np.random.randint(1,10,1000), columns=['SAMPLE'])
df
SAMPLE
0 6
1 1
2 4
3 4
4 8
5 4
6 6
7 3
.. ...
992 3
993 2
994 1
995 2
996 7
997 4
998 5
999 4
[1000 rows x 1 columns]
# sort
# ======================
df.sort('SAMPLE')
SAMPLE
310 1
710 1
935 1
463 1
462 1
136 1
141 1
144 1
.. ...
174 9
392 9
386 9
382 9
178 9
772 9
890 9
307 9
[1000 rows x 1 columns]
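Since DataFrame.sort() was deprecated in pandas 0.17 and removed in 0.20, a sketch of the modern equivalent, including the grouping the question asks about:
df_sorted = df.sort_values('SAMPLE')          # sort rows by the SAMPLE column
counts = df_sorted.groupby('SAMPLE').size()   # e.g. number of rows per SAMPLE value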