Pandas column aggregation for duplicate values using custom function - python

I have a dataframe and I want to aggregate the duplicate ids in a column.
X_train['freq_qd1'] = X_train.groupby('qid1')['qid1'].transform('count')
X_train['freq_qd2'] = X_train.groupby('qid2')['qid2'].transform('count')
I understand the code above, but I want to build a custom function that I can apply to multiple columns.
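For example, on a toy set of ids (made up here, not my real data), transform('count') gives each row the size of its id group:
import pandas as pd

toy = pd.DataFrame({'qid2': [24, 318831, 318831, 318831]})
toy['freq_qd2'] = toy.groupby('qid2')['qid2'].transform('count')
print(toy)
#      qid2  freq_qd2
# 0      24         1
# 1  318831         3
# 2  318831         3
# 3  318831         3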
I have attached a snapshot of the dataframe for reference. On this dataframe I tried to apply a custom function to qid1 and qid2.
I tried the code below:
def frequency(qid):
    freq = []
    for i in str(qid):
        if i not in freq:
            freq.append(i)
        ids = set()
        if i not in ids:
            ids.add(i)
            freq.append(ids)
    return freq
def extract_simple_feat(fe):
    fe['question1'] = fe['question1'].fillna(' ')
    fe['question2'] = fe['question2'].fillna(' ')
    fe['qid1'] = fe['qid1']
    fe['qid2'] = fe['qid2']
    token_feat = fe.apply(lambda x: get_simple_features(x['question1'],
                                                        x['question2']), axis=1)
    fe['q1_len'] = list(map(lambda x: x[0], token_feat))
    fe['q2_len'] = list(map(lambda x: x[1], token_feat))
    fe['freq_qd1'] = fe.apply(lambda x: frequency(x['qid1']), axis=1)
    fe['freq_qd2'] = fe.apply(lambda x: frequency(x['qid2']), axis=1)
    fe['q1_n_words'] = list(map(lambda x: x[2], token_feat))
    fe['q2_n_words'] = list(map(lambda x: x[3], token_feat))
    fe['word_common'] = list(map(lambda x: x[4], token_feat))
    fe['word_total'] = list(map(lambda x: x[5], token_feat))
    fe['word_share'] = list(map(lambda x: x[6], token_feat))
    return fe
X_train = extract_simple_feat(X_train)
After applying my own implementation, I am not getting the desired result. I am attaching a snapshot of the result I got.
The desired result is shown below.
I would appreciate some help, because I am really stuck and unable to fix it properly.
Here's a small test input:
qid1 qid2
23 24
25 26
27 28
318830 318831
359558 318831
384105 318831
413505 318831
451953 318831
530151 318831
I want the aggregated output to be:
  qid1    qid2    freq_qid1  freq_qid2
  23      24      1          1
  25      26      1          1
  27      28      1          1
  318830  318831  1          6
  359558          1          6
  384105          1          6
  413505          1          6
  451953          1          6
  530151          1          6

Given: (I added an extra row for an edge case)
qid1 qid2
0 23 24
1 25 26
2 27 28
3 318830 318831
4 359558 318831
5 384105 318831
6 413505 318831
7 451953 318831
8 530151 318831
9 495894 4394
Doing:
def get_freqs(df, cols):
    temp_df = df.copy()
    for col in cols:
        temp_df['freq_' + col] = temp_df.groupby(col)[col].transform('count')
        temp_df.loc[temp_df[col].duplicated(), col] = ''
    return temp_df
df = get_freqs(df, ['qid1', 'qid2'])
print(df)
Output:
     qid1    qid2  freq_qid1  freq_qid2
0      23      24          1          1
1      25      26          1          1
2      27      28          1          1
3  318830  318831          1          6
4  359558                  1          6
5  384105                  1          6
6  413505                  1          6
7  451953                  1          6
8  530151                  1          6
9  495894    4394          1          1
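The temp_df[col].duplicated() mask blanks every repeated id after its first appearance, which is what produces the empty qid2 cells in rows 4-8 while freq_qid2 still counts all six occurrences.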
If I wanted to do more of what you're doing...
Given:
id qid1 qid2 question1 question2 is_duplicate
0 0 1 2 Why is the sky blue? Why isn't the sky blue? 0
1 1 3 4 Why is the sky blue and green? Why isn't the sky pink? 0
2 2 5 6 Where are we? Moon landing a hoax? 0
3 3 7 8 Am I real? Chickens aren't real. 0
4 4 9 10 If this Fake, surely it is? Oops I did it again. 0
Doing:
def do_stuff(df):
    t_df = df.copy()
    quids = [x for x in t_df.columns if 'qid' in x]
    questions = [x for x in t_df.columns if 'question' in x]
    for col in quids:
        t_df['freq_' + col] = t_df.groupby(col)[col].transform('count')
        t_df.loc[t_df[col].duplicated(), col] = ''
    for i, col in enumerate(questions):
        t_df[f'q{i+1}_len'] = t_df[col].str.len()
        t_df[f'q{i+1}_n_words'] = t_df[col].str.split(' ').apply(lambda x: len(x))
    return t_df
df = do_stuff(df)
print(df)
Output:
id qid1 qid2 question1 question2 is_duplicate freq_qid1 freq_qid2 q1_len q1_n_words q2_len q2_n_words
0 0 1 2 Why is the sky blue? Why isn't the sky blue? 0 1 1 20 5 23 5
1 1 3 4 Why is the sky blue and green? Why isn't the sky pink? 0 1 1 30 7 23 5
2 2 5 6 Where are we? Moon landing a hoax? 0 1 1 13 3 20 4
3 3 7 8 Am I real? Chickens aren't real. 0 1 1 10 3 21 3
4 4 9 10 If this Fake, surely it is? Oops I did it again. 0 1 1 27 6 20 5

Related

Create a Dataframe priority/rank column based on conditions over multiple columns

Suppose I have a pandas DataFrame:
Name thick prio
Aabc1 20 1
Babc2 21 1
Cabc3 22 1
Aabc4 23 1
Axyz1 20 2
Bxyz2 21 2
Axyz3 22 2
I need to create a dataframe column in such a way that the expected output will be:
Name thick prio newPrio
Aabc1 20 1 1
Babc2 21 1 3
Cabc3 22 1 4
Aabc4 23 1 2
Axyz1 20 2 5
Bxyz2 21 2 7
Axyz3 22 2 6
The logic behind this is:
First, group the names by thickness (ascending order) and priority. Then look at the prio column: within prio 1 there are multiple names, so give a name starting with A first priority, B second, and C third. Then move on to prio 2 and do the same thing. In this way I would like to create the newPrio column.
I have tried this, and it works only partially:
x['newPrio'] = x.sort_values(['Name', 'thick', 'prio'])['thick'].index + 1
You can use sort_values by prio then Name and thick:
rank = df.sort_values(['prio', 'Name', 'thick']).index
df['newPrio'] = pd.Series(range(1, len(df)+1), index=rank)
print(df)
# Output
Name thick prio newPrio
0 Aabc1 20 1 1
1 Babc2 21 1 3
2 Cabc3 22 1 4
3 Aabc4 23 1 2
4 Axyz1 20 2 5
5 Bxyz2 21 2 7
6 Axyz3 22 2 6
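A brief note on why this works, shown with toy values (not the question's data): assigning a Series aligns on its index, so each rank is written back to the row it was computed for, regardless of the sorted order.
import pandas as pd

s = pd.Series([10, 30, 20])
order = s.sort_values().index           # original positions in sorted order: [0, 2, 1]
ranks = pd.Series(range(1, len(s) + 1), index=order)
# alignment on the index puts rank 3 on row 1 and rank 2 on row 2
print(pd.DataFrame({'val': s, 'rank': ranks}))
#    val  rank
# 0   10     1
# 1   30     3
# 2   20     2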
Use DataFrame.sort_values with Index.argsort, which gives the rank of each original row in the sorted order:
df['newPrio'] = df.sort_values(['prio', 'Name', 'thick']).index.argsort() + 1
print (df)
Name thick prio newPrio
0 Aabc1 20 1 1
1 Babc2 21 1 3
2 Cabc3 22 1 4
3 Aabc4 23 1 2
4 Axyz1 20 2 5
5 Bxyz2 21 2 7
6 Axyz3 22 2 6
If you want to sort by the first letter only:
df['newPrio'] = (df.assign(Name=df['Name'].str[0])
                   .sort_values(['prio', 'Name', 'thick']).index.argsort() + 1)
Or, if the first letter can be converted to a number:
df['newPrio'] = (df.assign(Name=df['Name'].str[0].astype(int))
                   .sort_values(['prio', 'Name', 'thick']).index.argsort() + 1)

How do I make bins of equal number of observations in a pandas dataframe?

I'm trying to make a column in a dataframe depicting a group or bin that observation belongs to. The idea is to sort the dataframe according to some column, then develop another column denoting which bin that observation belongs to. If I want deciles, then I should be able to tell a function I want 10 equal (or close to equal) groups.
I tried pandas qcut, but that just gives tuples of the upper and lower limits of the bins. I would like just 1, 2, 3, 4, etc. Take the following for example:
import numpy as np
import pandas as pd
x = [1,2,3,4,5,6,7,8,5,45,64545,65,6456,564]
y = np.random.rand(len(x))
df_dict = {'x': x, 'y': y}
df = pd.DataFrame(df_dict)
This gives a df of 14 observations. How could I split it into 5 equal (or nearly equal) bins?
The desired result would be the following:
x y group
0 1 0.926273 1
1 2 0.678101 1
2 3 0.636875 1
3 4 0.802590 2
4 5 0.494553 2
5 6 0.874876 2
6 7 0.607902 3
7 8 0.028737 3
8 5 0.493545 3
9 45 0.498140 4
10 64545 0.938377 4
11 65 0.613015 4
12 6456 0.288266 5
13 564 0.917817 5
Group every N consecutive rows together, then use ngroup (here N = 3, the rows per bin for 14 rows and 5 bins):
df['group'] = df.groupby(np.arange(len(df.index)) // 3).ngroup() + 1
x y group
0 1 0.548801 1
1 2 0.096620 1
2 3 0.713771 1
3 4 0.922987 2
4 5 0.283689 2
5 6 0.807755 2
6 7 0.592864 3
7 8 0.670315 3
8 5 0.034549 3
9 45 0.355274 4
10 64545 0.239373 4
11 65 0.156208 4
12 6456 0.419990 5
13 564 0.248278 5
Another option is to build the group labels from the list of bin sizes returned by near_split (for 14 rows and 5 bins it returns [3, 3, 3, 3, 2]):
def near_split(base, num_bins):
    quotient, remainder = divmod(base, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)

bins = 5
df['group'] = [i + 1 for i, v in enumerate(near_split(len(df), bins)) for _ in range(v)]
print(df)
Output:
x y group
0 1 0.313614 1
1 2 0.765079 1
2 3 0.153851 1
3 4 0.792098 2
4 5 0.123700 2
5 6 0.239107 2
6 7 0.133665 3
7 8 0.979318 3
8 5 0.781948 3
9 45 0.264344 4
10 64545 0.495561 4
11 65 0.504734 4
12 6456 0.766627 5
13 564 0.428423 5
You can split evenly with np.array_split(), assign the groups, then recombine with pd.concat():
bins = 5
splits = np.array_split(df, bins)
for i in range(len(splits)):
    splits[i]['group'] = i + 1
df = pd.concat(splits)
Or as a one-liner using assign():
df = pd.concat([d.assign(group=i+1) for i, d in enumerate(np.array_split(df, bins))])
x y group
0 1 0.145781 1
1 2 0.262097 1
2 3 0.114799 1
3 4 0.275054 2
4 5 0.841606 2
5 6 0.187210 2
6 7 0.582487 3
7 8 0.019881 3
8 5 0.847115 3
9 45 0.755606 4
10 64545 0.196705 4
11 65 0.688639 4
12 6456 0.275884 5
13 564 0.579946 5
Here is an approach that "manually" computes the extent of the bins, based on the requested number of bins:
bins = 5
l = len(df)
minbinlen = l // bins
remainder = l % bins
repeats = np.repeat(minbinlen, bins)
repeats[:remainder] += 1
group = np.repeat(range(bins), repeats) + 1
df['group'] = group
Result:
x y group
0 1 0.205168 1
1 2 0.105466 1
2 3 0.545794 1
3 4 0.639346 2
4 5 0.758056 2
5 6 0.982090 2
6 7 0.942849 3
7 8 0.284520 3
8 5 0.491151 3
9 45 0.731265 4
10 64545 0.072668 4
11 65 0.601416 4
12 6456 0.239454 5
13 564 0.345006 5
This seems to follow the splitting logic of np.array_split (i.e. try to evenly split the bins, but add onto earlier bins if that isn't possible).
While the code is less concise, it doesn't use any loops, so it theoretically should be faster with larger amounts of data.
Just because I was curious, going to leave this perfplot testing here...
import numpy as np
import pandas as pd
import perfplot
def make_data(n):
    x = np.random.rand(n)
    y = np.random.rand(n)
    df_dict = {'x': x, 'y': y}
    df = pd.DataFrame(df_dict)
    return df

def repeat(df, bins=5):
    l = len(df)
    minbinlen = l // bins
    remainder = l % bins
    repeats = np.repeat(minbinlen, bins)
    repeats[:remainder] += 1
    group = np.repeat(range(bins), repeats) + 1
    return group

def near_split(base, num_bins):
    quotient, remainder = divmod(base, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)

def array_split(df, bins=5):
    splits = np.array_split(df, bins)
    for i in range(len(splits)):
        splits[i]['group'] = i + 1
    return pd.concat(splits)

perfplot.show(
    setup=lambda n: make_data(n),
    kernels=[
        lambda df: repeat(df),
        lambda df: [i + 1 for i, v in enumerate(near_split(len(df), 5)) for _ in range(v)],
        lambda df: df.groupby(np.arange(len(df.index)) // 3).ngroup() + 1,
        lambda df: array_split(df)
    ],
    labels=['repeat', 'near_split', 'groupby', 'array_split'],
    n_range=[2 ** k for k in range(25)],
    equality_check=None)
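For completeness, a sketch of an alternative that stays closer to the original qcut idea: with labels=False, qcut returns integer bin codes instead of interval edges, and applying it to the row positions gives near-equal positional bins (the sizes may differ slightly from np.array_split's).
import numpy as np
import pandas as pd

bins = 5
# bin codes 0..bins-1 over the row positions, shifted to start at 1
df['group'] = pd.qcut(np.arange(len(df)), bins, labels=False) + 1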

Skip rows in CSV file

kf = 10
sets = 90
chunk = {}
for i in range(0, kf):
    chunk[i] = pd.read_csv("Dataset.csv", skiprows=(i*sets), nrows=sets)
When printing, I always get the first 90 rows instead of rows 0 to 89, then 90 to 179, etc. How can I correct the call so that it first skips the lines and then starts reading the file?
Output with kf = 100 and sets = 9.
X1 X2 X3 ... X29 X30 Target
0 -2.335543 -2.325887 -2.367347 ... 2.001746 3.102024 1
1 -0.132771 0.463992 -0.282286 ... 3.003794 2.473191 1
2 -1.000121 -1.512276 -3.326958 ... 0.155254 5.855211 1
3 -1.170981 -3.493062 -2.241450 ... 3.228326 3.301115 1
4 -1.449553 -1.428624 -1.401973 ... 1.547833 2.008935 1
5 -1.657024 -1.567071 -1.784387 ... 0.606907 -2.135309 1
6 -0.323730 -1.237250 -2.679961 ... -1.365039 3.101155 1
7 -1.011255 -0.706056 -1.583983 ... -0.678562 -1.950106 1
8 0.388855 0.359412 0.037113 ... -3.413041 -4.051897 1
[9 rows x 31 columns]
X1 X2 X3 ... X29 X30 Target
0 -2.335543 -2.325887 -2.367347 ... 2.001746 3.102024 1
1 -0.132771 0.463992 -0.282286 ... 3.003794 2.473191 1
2 -1.000121 -1.512276 -3.326958 ... 0.155254 5.855211 1
3 -1.170981 -3.493062 -2.241450 ... 3.228326 3.301115 1
4 -1.449553 -1.428624 -1.401973 ... 1.547833 2.008935 1
5 -1.657024 -1.567071 -1.784387 ... 0.606907 -2.135309 1
6 -0.323730 -1.237250 -2.679961 ... -1.365039 3.101155 1
7 -1.011255 -0.706056 -1.583983 ... -0.678562 -1.950106 1
8 0.388855 0.359412 0.037113 ... -3.413041 -4.051897 1
[9 rows x 31 columns]
I believe you need the chunksize parameter in read_csv:
for df in pd.read_csv("Dataset.csv", chunksize=sets):
    print(df)
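A minimal usage sketch (assuming "Dataset.csv" exists) that keeps each chunk around, mirroring the original chunk[i] idea:
import pandas as pd

sets = 90
chunks = {}
# each iteration yields the next `sets` rows: 0-89, then 90-179, and so on
for i, chunk in enumerate(pd.read_csv("Dataset.csv", chunksize=sets)):
    chunks[i] = chunk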
EDIT:
I created sample data to test your code. The problem is that the column values are parsed incorrectly, because each later chunk skips the header row and treats its first data row as the header. So the names parameter is needed, with an if-else that passes None for the first group:
import pandas as pd
#original data
temp=u"""colA,colB
A,1
B,2
A,3
C,4
B,5
A,6
C,7
B,8
A,9
C,10
B,11
A,12
C,13
D,14
B,15
C,16"""
kf = 3
sets = 6
#after testing replace 'pd.compat.StringIO(temp)' to 'Dataset.csv'
cols = pd.read_csv(pd.compat.StringIO(temp), nrows=0).columns
print (cols)
Index(['colA', 'colB'], dtype='object')
for i in range(0, kf):
    if i == 0:
        val = 0
        names = None
    else:
        val = 1
        names = cols
    df = pd.read_csv(pd.compat.StringIO(temp),
                     skiprows=(i*sets) + val,
                     nrows=sets,
                     names=names)
    print(df)
colA colB
0 A 1
1 B 2
2 A 3
3 C 4
4 B 5
5 A 6
colA colB
0 C 7
1 B 8
2 A 9
3 C 10
4 B 11
5 A 12
colA colB
0 C 13
1 D 14
2 B 15
3 C 16

Looping through a groupby and adding a new column

I need to write a small script to get through some data (around 50k rows/file) and my original file looks like this:
Label ID TRACK_ID QUALITY POSITION_X POSITION_Y POSITION_Z POSITION_T FRAME RADIUS VISIBILITY MANUAL_COLOR MEAN_INTENSITY MEDIAN_INTENSITY MIN_INTENSITY MAX_INTENSITY TOTAL_INTENSITY STANDARD_DEVIATION ESTIMATED_DIAMETER CONTRAST SNR
ID1119 1119 9 6.672 384.195 122.923 0 0 0 5 1 -10921639 81.495 0 0 255 7905 119.529 5.201 1 0.682
ID2237 2237 9 7.078 381.019 122.019 0 1 1 5 1 -10921639 89.381 0 0 255 8670 122.301 5.357 1 0.731
ID2512 2512 9 7.193 377.739 120.125 0 2 2 5 1 -10921639 92.01 0 0 255 8925 123.097 5.356 1 0.747
(...)
ID1102 1102 18 4.991 808.857 59.966 0 0 0 5 1 -10921639 52.577 0 0 255 5100 103.7 4.798 1 0.507
(...)
It's a rather big table with up to 50k rows. Not all the data is important to me; I mainly need TRACK_ID and the X and Y positions.
So I create a dataframe using the excel file and only access the corresponding columns
IN df = pd.read_excel('.../sample.xlsx', 'Sheet1',parse_cols="D, F,G")
And this works as expected. Each TRACK_ID is basically one set of data that needs to be analyzed, so the straightforward way is to group the dataframe by TRACK_ID:
IN Grouping = df.groupby("TRACK_ID")
This also works as intended. Now I need to grab the first POSITION_X value of each group and subtract it from the other POSITION_X values in that group.
Now, I already read that looping is probably not the best way to go about it, but I have no idea how else to do it.
for name, group in Grouping:
    first_X = group.iloc[0, 1]
    vect = group.iloc[1:, 1] - first_X
This stores the values in vect, and if I print them out they are correct. However, I do not know how to add them to a new column.
Maybe someone could guide me into the correct direction. Thanks in advance.
EDIT
This was suggested by chappers
def f(grouped):
    grouped.iloc[1:] = 0
    return grouped

grouped = df.groupby('TRACK_ID')
df['Calc'] = grouped['POSITION_X'].apply(lambda x: x - x.iloc[0])
grouped['POSITION_X'].apply(f)
for name, group in grouped:
    print(name)
    print(group)
Input:
TRACK_ID POSITION_X POSITION_Y
0 9 384.195 122.923
1 9 381.019 122.019
2 9 377.739 120.125
3 9 375.211 117.224
4 9 373.213 113.938
5 9 371.625 110.161
6 9 369.803 106.424
7 9 367.717 103.239
8 18 808.857 59.966
9 18 807.715 61.032
10 18 808.165 63.133
11 18 810.147 64.853
12 18 812.084 65.084
13 18 812.880 63.683
14 18 812.083 62.203
15 18 810.041 61.188
16 18 808.568 62.260
Output for group == 9
TRACK_ID POSITION_X POSITION_Y Calc
0 9 384.195 122.923 384.195
1 9 381.019 122.019 -3.176
2 9 377.739 120.125 -6.456
3 9 375.211 117.224 -8.984
4 9 373.213 113.938 -10.982
5 9 371.625 110.161 -12.570
6 9 369.803 106.424 -14.392
7 9 367.717 103.239 -16.478
So the expected output would have the very first Calc value of every group be 0.
Here is one way of approaching it, using the apply method to subtract the first item from all the other observations.
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo',
                         'bar', 'bar', 'bar', 'bar'],
                   'C': [1, 2, 3, 4, 4, 3, 2, 1]})
grouped = df.groupby('A')
df['C1'] = grouped['C'].apply(lambda x: x - x.iloc[0])
This would have input:
A C
0 foo 1
1 foo 2
2 foo 3
3 foo 4
4 bar 4
5 bar 3
6 bar 2
7 bar 1
and output
A C C1
0 foo 1 0
1 foo 2 1
2 foo 3 2
3 foo 4 3
4 bar 4 0
5 bar 3 -1
6 bar 2 -2
7 bar 1 -3
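An equivalent variant (a sketch, not part of the original answer) uses transform('first') to broadcast the first value of each group back onto its rows, which avoids apply entirely:
df['C1'] = df['C'] - df.groupby('A')['C'].transform('first')
On the question's frame the same pattern would be df['Calc'] = df['POSITION_X'] - df.groupby('TRACK_ID')['POSITION_X'].transform('first').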

Python pandas, multindex, slicing

I have got a pd.DataFrame
Time Value
a 1 1 1
2 2 5
3 5 7
b 1 1 5
2 2 9
3 10 11
I want to multiply the column Value by the difference Time - Time(t-1) and write the result to a column Product, starting from the second row of each group, but separately for each top-level index.
For example, Product('a','2') should be (Time('a','2') - Time('a','1')) * Value('a','2'). To do this, I would need a "shifted" version of the Time column within each group so that I could do df["Product"] = (df["Time"] - df["Time"].shifted) * df["Value"]. The result should look like this:
Time Value Product
a 1 1 1 0
2 2 5 5
3 5 7 21
b 1 1 5 0
2 2 9 9
3 10 11 88
This should do it:
>>> time_shifted = df['Time'].groupby(level=0).apply(lambda x: x.shift())
>>> df['Product'] = ((df.Time - time_shifted)*df.Value).fillna(0)
>>> df
Time Value Product
a 1 1 1 0
2 2 5 5
3 5 7 21
b 1 1 5 0
2 2 9 9
3 10 11 88
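A shorter variant of the same idea (a sketch on the same frame): groupby(level=0).diff() computes Time - Time(t-1) within each top-level group directly, so the explicit shift can be skipped:
df['Product'] = (df.groupby(level=0)['Time'].diff() * df['Value']).fillna(0)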
Hey this should do what you need it to. Comment if I missed anything.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Time': [1, 2, 5, 1, 2, 10], 'Value': [1, 5, 7, 5, 9, 11]},
                  index=[['a', 'a', 'a', 'b', 'b', 'b'], [1, 2, 3, 1, 2, 3]])

def product(x):
    x['Product'] = (x['Time'] - x.shift()['Time']) * x['Value']
    return x

df = df.groupby(level=0).apply(product)
df['Product'] = df['Product'].replace(np.nan, 0)
print(df)
