Filter data with groupby in pandas - python

I have a DataFrame with the following data. Each row represents a word appearing in an episode of a TV series; if a word appears 3 times in an episode, the DataFrame has 3 rows for it. Now I need to filter the words so that I only get words which appear two or more times. I can do this with groupby, but if a word appears 2 (or 3, 4, 5) times, I need 2 (or 3, 4, 5) rows for it.
With groupby I only get the unique entry and its count, but I need the entry repeated as many times as it appears in the dialogue. Is there a one-liner to do this?
dialogue episode
0 music 1
1 corrections 1
2 somnath 1
3 yadav 5
4 join 2
5 instagram 1
6 wind 2
7 music 1
8 whimpering 2
9 music 1
10 wind 3
So here I should ideally get:
dialogue episode
0 music 1
6 wind 2
7 music 1
9 music 1
10 wind 3
These are the only two words that appear two or more times.
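For anyone who wants to reproduce this, the sample frame can be built with something like the following (a small sketch of the data shown above):
import pandas as pd

df = pd.DataFrame({
    'dialogue': ['music', 'corrections', 'somnath', 'yadav', 'join', 'instagram',
                 'wind', 'music', 'whimpering', 'music', 'wind'],
    'episode': [1, 1, 1, 5, 2, 1, 2, 1, 2, 1, 3],
})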

You can use groupby's filter:
In [11]: df.groupby("dialogue").filter(lambda x: len(x) > 1)
Out[11]:
dialogue episode
0 music 1
6 wind 2
7 music 1
9 music 1
10 wind 3

Answer for the updated question:
In [208]: df.groupby('dialogue')['episode'].transform('size') >= 3
Out[208]:
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 False
9 True
10 False
dtype: bool
In [209]: df[df.groupby('dialogue')['episode'].transform('size') >= 3]
Out[209]:
dialogue episode
0 music 1
7 music 1
9 music 1
Answer for the original question:
You can use the duplicated() method:
In [202]: df[df.duplicated(subset=['dialogue'], keep=False)]
Out[202]:
dialogue episode
0 music 1
6 wind 2
7 music 1
9 music 1
10 wind 3
If you want to sort the result:
In [203]: df[df.duplicated(subset=['dialogue'], keep=False)].sort_values('dialogue')
Out[203]:
dialogue episode
0 music 1
7 music 1
9 music 1
6 wind 2
10 wind 3

I'd use value_counts
vc = df.dialogue.value_counts() >= 2
vc = vc[vc]
df[df.dialogue.isin(vc.index)]
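If you prefer a single expression, an equivalent formulation (a sketch of the same idea, not part of the timed code below) maps the counts back onto the column:
# keep rows whose word occurs at least twice overall
df[df.dialogue.map(df.dialogue.value_counts()) >= 2]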
Timing
Keep in mind, this is completely over the top; however, I'm sharpening up my timing skills.
Code:
from timeit import timeit
from string import ascii_lowercase as lowercase

import numpy as np
import pandas as pd

def pirsquared(df):
    vc = df.dialogue.value_counts() > 1
    vc = vc[vc]
    return df[df.dialogue.isin(vc.index)]

def maxu(df):
    return df[df.groupby('dialogue')['episode'].transform('size') > 1]

def andyhayden(df):
    return df.groupby("dialogue").filter(lambda x: len(x) > 1)

rows = ['pirsquared', 'maxu', 'andyhayden']
cols = ['OP_Given', '10000_3_letters']
summary = pd.DataFrame([], rows, cols)
iterations = 10

df = pd.DataFrame({'dialogue': {0: 'music', 1: 'corrections', 2: 'somnath', 3: 'yadav', 4: 'join', 5: 'instagram', 6: 'wind', 7: 'music', 8: 'whimpering', 9: 'music', 10: 'wind'}, 'episode': {0: 1, 1: 1, 2: 1, 3: 5, 4: 2, 5: 1, 6: 2, 7: 1, 8: 2, 9: 1, 10: 3}})
summary.loc['pirsquared', 'OP_Given'] = timeit(lambda: pirsquared(df), number=iterations)
summary.loc['maxu', 'OP_Given'] = timeit(lambda: maxu(df), number=iterations)
summary.loc['andyhayden', 'OP_Given'] = timeit(lambda: andyhayden(df), number=iterations)

df = pd.DataFrame(
    pd.DataFrame(np.random.choice(list(lowercase), (10000, 3))).sum(1),
    columns=['dialogue'])
df['episode'] = 1
summary.loc['pirsquared', '10000_3_letters'] = timeit(lambda: pirsquared(df), number=iterations)
summary.loc['maxu', '10000_3_letters'] = timeit(lambda: maxu(df), number=iterations)
summary.loc['andyhayden', '10000_3_letters'] = timeit(lambda: andyhayden(df), number=iterations)

summary

Related

Creating a function in Python for creating buckets from pandas dataframe values based on multiple conditions

I asked this question and it helped me, but now my task is more complex.
My dataframe has ~100 columns, and their values are on 14 different scales.
{'Diseasetype': {0: 'Oncology',
1: 'Oncology',
2: 'Oncology',
3: 'Nononcology',
4: 'Nononcology',
5: 'Nononcology'},
'Procedures1': {0: 100, 1: 300, 2: 500, 3: 200, 4: 400, 5: 1000},
'Procedures2': {0: 1, 1: 3, 2: 5, 3: 2, 4: 4, 5: 10},
'Procedures100': {0: 1000, 1: 3000, 2: 5000, 3: 2000, 4: 4000, 5: 10000}}
I want to convert each value in each column of the dataframe into a bucket value.
My current solution is:
def encoding(col, labels):
    return np.select([col < 200, col.between(200, 500), col.between(500, 1000), col > 1000], labels, 0)
onc_labels = [1,2,3,4]
nonc_labels = [11,22,33,44]
msk = df['Therapy_area'] == 'Oncology'
df[cols] = pd.concat((df.loc[msk, cols].apply(encoding, args=(onc_labels,)), df.loc[msk, cols].apply(encoding, args=(nonc_labels,)))).reset_index(drop=True)
It works well if all the columns of the dataframe have the same scale, but they do not. Remember, I have 14 different scales.
I would like to update the code above (or get another solution) so that it buckets the data correctly; I cannot use the same range of values for bucketing everything.
My logic is the following:
If Disease == Oncology and Procedures1 is on its scale, convert the values to these buckets (1, 2, 3)
If Disease == Oncology and Procedures2 is on its scale, convert the values to these buckets (1, 2, 3)
If Disease != Oncology and Procedures77 is on its scale, convert the values to these buckets (4, 5, 6)
Example of a scale and buckets:
Procedures1 for Oncology: < 200 = 1, 200-400 = 2, >400 = 3
Procedures2 for Oncology: < 2 = 1, 2-4 = 2, >4 = 3
Procedures3 for Oncology: < 2000 = 1, 2000-4000 = 2, >4000 = 3
Procedures1 for nonOncology: < 200 = 4, 200-400 = 5, >400 = 6
Procedures2 for nonOncology: < 2 = 4, 2-4 = 5, >4 = 6
Procedures3 for nonOncology: < 2000 = 4, 2000-4000 = 5, >4000 = 6
Expected output (happy to provide more info!)
Diseasetype Procedures1 Procedures2 Procedures100
Oncology 1 1 1
Oncology 2 2 2
Oncology 3 3 3
Nononcology 4 4 4
Nononcology 5 5 5
Nononcology 6 6 6
Link with rules:
I used a helper file with all the scales (source at the end of this answer):
Use melt to flatten your dataframe, then filter out rows with query, and finally use pivot_table to reshape it. You can execute each line independently to see the transformations:
scales = pd.read_csv('scales.csv').fillna({'Start': -np.inf, 'End': np.inf})
out = (
    df.melt('Diseasetype', var_name='Procedure', ignore_index=False).reset_index()
      .merge(scales, on=['Diseasetype', 'Procedure'], how='left')
      .query("value.between(Start, End)")
      .pivot_table('Label', ['index', 'Diseasetype'], 'Procedure').astype(int)
      .droplevel(0).rename_axis(columns=None).reset_index()
)
Output:
>>> out
Diseasetype Procedures1 Procedures100 Procedures2
0 Oncology 1 1 1
1 Oncology 2 2 2
2 Oncology 3 3 3
3 Nononcology 4 4 4
4 Nononcology 5 5 5
5 Nononcology 6 6 6
Content of scales.csv:
Diseasetype,Procedure,Start,End,Label
Oncology,Procedures1,,200,1
Oncology,Procedures1,200,400,2
Oncology,Procedures1,400,,3
Oncology,Procedures2,,2,1
Oncology,Procedures2,2,4,2
Oncology,Procedures2,4,,3
Oncology,Procedures100,,2000,1
Oncology,Procedures100,2000,4000,2
Oncology,Procedures100,4000,,3
Nononcology,Procedures1,,200,4
Nononcology,Procedures1,200,400,5
Nononcology,Procedures1,400,,6
Nononcology,Procedures2,,2,4
Nononcology,Procedures2,2,4,5
Nononcology,Procedures2,4,,6
Nononcology,Procedures100,,2000,4
Nononcology,Procedures100,2000,4000,5
Nononcology,Procedures100,4000,,6
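If you'd rather not keep a separate CSV, the same lookup table can be built in code. Here is a sketch with the thresholds copied from the rules above; it only covers the three procedures shown (extend the dict for the rest) and replaces the read_csv/fillna line:
import numpy as np
import pandas as pd

# upper boundaries per procedure, copied from the question's scale examples
bounds = {'Procedures1': (200, 400),
          'Procedures2': (2, 4),
          'Procedures100': (2000, 4000)}

rows = []
for disease, base in [('Oncology', 1), ('Nononcology', 4)]:
    for proc, (low, high) in bounds.items():
        rows.append((disease, proc, -np.inf, low, base))       # below low
        rows.append((disease, proc, low, high, base + 1))      # low to high
        rows.append((disease, proc, high, np.inf, base + 2))   # above high
scales = pd.DataFrame(rows, columns=['Diseasetype', 'Procedure', 'Start', 'End', 'Label'])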

python panda apply compare to external list and remove part of list

I have a parking lot with cars of different models (nr), and the cars are so closely packed that in order for one to get out, others might need to be moved first. A little like a 15-puzzle, only I can take one or more cars out of the parking lot. Ordered_car_List contains the cars that will be picked up today, and they need to be taken out of the lot while moving as few non-ordered cars as possible. There are more columns to this DataFrame, but this is the part I can't figure out.
I have a program that works well for small sets of data, but it seems that this is not the way of the PANDAS :-)
I have this:
import pandas as pd

cars = pd.DataFrame({'x': [1, 1, 1, 1, 1, 2, 2, 2, 2],
                     'y': [1, 2, 3, 4, 5, 1, 2, 3, 4],
                     'order_number': [6, 6, 7, 6, 7, 9, 9, 10, 12]})
cars['order_number_no_dublicates_down'] = None
Ordered_car_List = [6, 9, 9, 10, 28]

i = 0
while i < len(cars):
    temp_val = cars.at[i, 'order_number']
    if temp_val in Ordered_car_List:
        cars.at[i, 'order_number_no_dublicates_down'] = temp_val
        Ordered_car_List.remove(temp_val)
    i += 1
If I use cars.apply(lambda..., how can I change the Ordered_car_List in each iteration?
Is there another approach that I can take?
I found this page, and it made me want to be faster. The Lambda approach is in the middle when it comes to speed, but it still is so much faster than what I am doing now.
https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06
Updating cars
We can vectorize this based on two counters:
cumcount() to cumulatively count each unique value in cars['order_number']
collections.Counter() to count each unique value in Ordered_car_List
cumcount = cars.groupby('order_number').cumcount().add(1)
maxcount = cars['order_number'].map(Counter(Ordered_car_List))
# order_number cumcount maxcount
# 0 6 1 1
# 1 6 2 1
# 2 7 1 0
# 3 6 3 1
# 4 7 2 0
# 5 9 1 2
# 6 9 2 2
# 7 10 1 1
# 8 12 1 0
So then we only want to keep cars['order_number'] where cumcount <= maxcount:
either use DataFrame.loc[]
cars.loc[cumcount <= maxcount, 'nodup'] = cars['order_number']
or Series.where()
cars['nodup'] = cars['order_number'].where(cumcount <= maxcount)
or Series.mask() with the condition inverted
cars['nodup'] = cars['order_number'].mask(cumcount > maxcount)
Updating Ordered_car_List
The final Ordered_car_List is a Counter() difference:
Used_car_List = cars.loc[cumcount <= maxcount, 'order_number']
# [6, 9, 9, 10]
Ordered_car_List = list(Counter(Ordered_car_List) - Counter(Used_car_List))
# [28]
Final output
cumcount = cars.groupby('order_number').cumcount().add(1)
maxcount = cars['order_number'].map(Counter(Ordered_car_List))
cars['nodup'] = cars['order_number'].where(cumcount <= maxcount)
# x y order_number nodup
# 0 1 1 6 6.0
# 1 1 2 6 NaN
# 2 1 3 7 NaN
# 3 1 4 6 NaN
# 4 1 5 7 NaN
# 5 2 1 9 9.0
# 6 2 2 9 9.0
# 7 2 3 10 10.0
# 8 2 4 12 NaN
Used_car_List = cars.loc[cumcount <= maxcount, 'order_number']
Ordered_car_List = list(Counter(Ordered_car_List) - Counter(Used_car_List))
# [28]
Timings
Note that your loop is still very fast with small data, but the vectorized counter approach just scales much better:
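The original benchmark chart isn't reproduced here, but a rough harness along these lines shows the gap (a sketch, not the exact benchmark; it simply tiles the sample rows to scale the data up):
from collections import Counter
from timeit import timeit

import pandas as pd

def loop_version(cars, ordered):
    # the original row-by-row approach, run on copies so repeated runs are fair
    cars = cars.copy()
    ordered = list(ordered)
    cars['nodup'] = None
    i = 0
    while i < len(cars):
        temp_val = cars.at[i, 'order_number']
        if temp_val in ordered:
            cars.at[i, 'nodup'] = temp_val
            ordered.remove(temp_val)
        i += 1
    return cars

def counter_version(cars, ordered):
    # the vectorized two-counter approach from this answer
    cars = cars.copy()
    cumcount = cars.groupby('order_number').cumcount().add(1)
    maxcount = cars['order_number'].map(Counter(ordered))
    cars['nodup'] = cars['order_number'].where(cumcount <= maxcount)
    return cars

base = pd.DataFrame({'x': [1, 1, 1, 1, 1, 2, 2, 2, 2],
                     'y': [1, 2, 3, 4, 5, 1, 2, 3, 4],
                     'order_number': [6, 6, 7, 6, 7, 9, 9, 10, 12]})
big = pd.concat([base] * 1000, ignore_index=True)
ordered = [6, 9, 9, 10, 28] * 1000

print(timeit(lambda: loop_version(big, ordered), number=3))
print(timeit(lambda: counter_version(big, ordered), number=3))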

Grouping a list of values based on max value

I'm working on a k-means algorithm to cluster a list of numbers. If I have an array (X)
X=array([[0.85142858],[0.85566274],[0.85364912],[0.81536489],[0.84929932],[0.85042336],[0.84899714],[0.82019115], [0.86112067],[0.8312496 ]])
then I run the following code
from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
cluster.fit_predict(X)
for i in range(len(X)):
    print("%4d " % cluster.labels_[i], end="")
    print(X[i])
I got these results:
1 1 [0.85142858]
2 3 [0.85566274]
3 3 [0.85364912]
4 0 [0.81536489]
5 1 [0.84929932]
6 1 [0.85042336]
7 1 [0.84899714]
8 0 [0.82019115]
9 4 [0.86112067]
10 2 [0.8312496]
How do I get the max number in each cluster together with its row number (i)? Like this:
0: 0.82019115 8
1: 0.85142858 1
2: 0.8312496 10
3: 0.85566274 2
4: 0.86112067 9
First group them together as pairs using zip, then sort by value (the second element of each pair) in increasing order and create a dict out of it; later (larger) values overwrite earlier ones for the same key, so each label ends up mapped to its maximum.
Try:
res = list(zip(cluster.labels_, X))
max_num = dict(sorted(res, key=lambda x: x[1], reverse=False))
max_num:
{0: array([0.82019115]),
2: array([0.8312496]),
1: array([0.85142858]),
3: array([0.85566274]),
4: array([0.86112067])}
Edit:
Do you want this?
elem = list(zip(res, range(1,len(X)+1)))
e = sorted(elem, key=lambda x: x[0][1], reverse=False)
final_dict = {k[0]:(k[1], v) for (k,v) in e}
for key in sorted(final_dict):
    print(f"{key}: {final_dict[key][0][0]} {final_dict[key][1]}")
0: 0.82019115 8
1: 0.85142858 1
2: 0.8312496 10
3: 0.85566274 2
4: 0.86112067 9
OR
import pandas as pd
df = pd.DataFrame(zip(cluster.labels_,X))
df[1] = df[1].str[0]
df = df.sort_values(1).drop_duplicates([0],keep='last')
df.index = df.index+1
df = df.sort_values(0)
df:
0 1
8 0 0.820191
1 1 0.851429
10 2 0.831250
2 3 0.855663
9 4 0.861121
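A groupby-based variant of the same idea (a sketch, assuming X and cluster from the question are already defined): group the values by cluster label, take idxmax, and print each maximum with its 1-based row number.
import pandas as pd

clusters = pd.DataFrame({'label': cluster.labels_, 'value': X.ravel()},
                        index=range(1, len(X) + 1))
# row (1-based position) holding the maximum value within each cluster
best = clusters.loc[clusters.groupby('label')['value'].idxmax()]
for pos, row in best.iterrows():
    print(f"{int(row['label'])}: {row['value']} {pos}")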

Iterate through Pandas DataFrame, use condition and add column

I have purchasing data and want to label it with a new column, which provides information about the time of day of the purchase. For that I'm using the hour of each purchase's timestamp column.
Labels should work like this:
hour 4 - 7 => 'morning'
hour 8 - 11 => 'before midday'
...
I've already extracted the hours from the timestamps. Now I have a DataFrame with 50 million records which looks as follows.
user_id timestamp hour
0 11 2015-08-21 06:42:44 6
1 11 2015-08-20 13:38:58 13
2 11 2015-08-20 13:37:47 13
3 11 2015-08-21 06:59:05 6
4 11 2015-08-20 13:15:21 13
At the moment my approach is to use 6x .iterrows(), each with a different condition:
for index, row in basket_times[(basket_times['hour'] >= 4) & (basket_times['hour'] < 8)].iterrows():
    basket_times['periode'] = 'morning'
then:
for index, row in basket_times[(basket_times['hour'] >= 8) & (basket_times['hour'] < 12)].iterrows():
    basket_times['periode'] = 'before midday'
and so on.
However, one of those six loops over 50 million records already takes about an hour. Is there a better way to do this?
You can try loc with boolean masks. I changed df for testing:
print basket_times
user_id timestamp hour
0 11 2015-08-21 06:42:44 6
1 11 2015-08-20 13:38:58 13
2 11 2015-08-20 09:37:47 9
3 11 2015-08-21 06:59:05 6
4 11 2015-08-20 13:15:21 13
#create boolean masks
morning = (basket_times['hour'] >= 4) & (basket_times['hour'] < 8)
beforemidday = (basket_times['hour'] >= 8) & (basket_times['hour'] < 11)
aftermidday = (basket_times['hour'] >= 11) & (basket_times['hour'] < 15)
print morning
0 True
1 False
2 False
3 True
4 False
Name: hour, dtype: bool
print beforemidday
0 False
1 False
2 True
3 False
4 False
Name: hour, dtype: bool
print aftermidday
0 False
1 True
2 False
3 False
4 True
Name: hour, dtype: bool
basket_times.loc[morning, 'periode'] = 'morning'
basket_times.loc[beforemidday, 'periode'] = 'before midday'
basket_times.loc[aftermidday, 'periode'] = 'after midday'
print basket_times
user_id timestamp hour periode
0 11 2015-08-21 06:42:44 6 morning
1 11 2015-08-20 13:38:58 13 after midday
2 11 2015-08-20 09:37:47 9 before midday
3 11 2015-08-21 06:59:05 6 morning
4 11 2015-08-20 13:15:21 13 after midday
Timings - len(df) = 500k:
In [87]: %timeit a(df)
10 loops, best of 3: 34 ms per loop
In [88]: %timeit b(df1)
1 loops, best of 3: 490 ms per loop
Code for testing:
import pandas as pd
import io
temp=u"""user_id;timestamp;hour
11;2015-08-21 06:42:44;6
11;2015-08-20 10:38:58;10
11;2015-08-20 09:37:47;9
11;2015-08-21 06:59:05;6
11;2015-08-20 10:15:21;10"""
#after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", index_col=None, parse_dates=[1])
df = pd.concat([df]*100000).reset_index(drop=True)
print df.shape
#(500000, 3)
df1 = df.copy()
def a(basket_times):
    morning = (basket_times['hour'] >= 4) & (basket_times['hour'] < 8)
    beforemidday = (basket_times['hour'] >= 8) & (basket_times['hour'] < 11)
    basket_times.loc[morning, 'periode'] = 'morning'
    basket_times.loc[beforemidday, 'periode'] = 'before midday'
    return basket_times

def b(basket_times):
    def get_periode(hour):
        if 4 <= hour <= 7:
            return 'morning'
        elif 8 <= hour <= 11:
            return 'before midday'
    basket_times['periode'] = basket_times['hour'].map(get_periode)
    return basket_times
print a(df)
print b(df1)
You can define a function that maps a time period to the string you want, and then use map.
def get_periode(hour):
    if 4 <= hour <= 7:
        return 'morning'
    elif 8 <= hour <= 11:
        return 'before midday'

basket_times['periode'] = basket_times['hour'].map(get_periode)
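As a side note, pd.cut can assign all the period labels in one vectorized call. Here is a sketch: only the morning and before-midday boundaries come from the question, the remaining bin edges and label names are assumptions.
import pandas as pd

# (-1, 3] -> hours 0-3, (3, 7] -> hours 4-7, (7, 11] -> hours 8-11, and so on
bins = [-1, 3, 7, 11, 15, 19, 23]
labels = ['night', 'morning', 'before midday', 'after midday', 'evening', 'late evening']
basket_times['periode'] = pd.cut(basket_times['hour'], bins=bins, labels=labels)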

Quickly Find Non-Zero Intervals

I am writing an algorithm to determine the intervals of the "mountains" on a density plot. The plot is generated from the depth data of a Kinect, if anyone is interested. Here is a quick visual example of what this algorithm finds (with the small mountains removed):
My current algorithm:
def find_peak_intervals(data):
    previous = 0
    peak = False
    ranges = []
    begin_range = 0
    end_range = 0
    for current in xrange(len(data)):
        if (not peak) and ((data[current] - data[previous]) > 0):
            peak = True
            begin_range = current
        if peak and (data[current] == 0):
            peak = False
            end_range = current
            ranges.append((begin_range, end_range))
        previous = current
    return np.array(ranges)
The function works, but it takes nearly 3 milliseconds on my laptop, and I need to be able to run my entire program at at least 30 frames per second. The function is rather ugly and I have to run it 3 times per frame, so I would appreciate any hints on how to simplify and optimize it (maybe something from numpy or scipy that I missed).
Assuming a pandas dataframe like so:
Value
0 0
1 3
2 2
3 2
4 1
5 2
6 3
7 0
8 1
9 3
10 0
11 0
12 0
13 1
14 0
15 3
16 2
17 3
18 1
19 0
You can get the contiguous non-zero ranges by using df["Value"].shift(x) where x could either be 1 or -1 so you can check if it's bounded by zeroes. Once you get the boundaries, you can just store their index pairs and use them later on when filtering the data.
The following code is based on the excellent answer here by @behzad.nouri.
import pandas as pd
df = pd.read_csv("data.csv")
# Or you can use df = pd.DataFrame.from_dict({'Value': {0: 0, 1: 3, 2: 2, 3: 2, 4: 1, 5: 2, 6: 3, 7: 0, 8: 1, 9: 3, 10: 0, 11: 0, 12: 0, 13: 1, 14: 0, 15: 3, 16: 2, 17: 3, 18: 1, 19: 0}})
# --
# from https://stackoverflow.com/questions/24281936
# credits to #behzad.nouri
df['tag'] = df['Value'] > 0
fst = df.index[df['tag'] & ~ df['tag'].shift(1).fillna(False)]
lst = df.index[df['tag'] & ~ df['tag'].shift(-1).fillna(False)]
pr = [(i, j) for i, j in zip(fst, lst)]
# --
for i, j in pr:
    print df.loc[i:j, "Value"]
This gives the result:
1 3
2 2
3 2
4 1
5 2
6 3
Name: Value, dtype: int64
8 1
9 3
Name: Value, dtype: int64
13 1
Name: Value, dtype: int64
15 3
16 2
17 3
18 1
Name: Value, dtype: int64
Timing it in IPython gives the following:
%timeit find_peak_intervals(df)
1000 loops, best of 3: 1.49 ms per loop
This is not too far from your attempt speed-wise. An alternative is to convert the pandas Series to a numpy array and operate from there. Let's take another excellent answer, this one by @Warren Weckesser, and modify it to suit your needs. Let's time it as well.
In [22]: np_arr = np.array(df["Value"])
In [23]: def greater_than_zero(a):
...: isntzero = np.concatenate(([0], np.greater(a, 0).view(np.int8), [0]))
...: absdiff = np.abs(np.diff(isntzero))
...: ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
...: return ranges
In [24]: %timeit greater_than_zero(np_arr)
100000 loops, best of 3: 17.1 µs per loop
Not so bad at 17.1 microseconds, and it gives the same ranges as well.
[1 7] # Basically same as indices 1-6 in pandas.
[ 8 10] # 8, 9
[13 14] # 13, 13
[15 19] # 15, 18
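If the rest of your code expects the (begin, end) pairs that find_peak_intervals returned, the numpy ranges can be consumed directly as half-open slices (a short usage sketch):
# each pair is a half-open [start, stop) range of non-zero values
segments = [np_arr[start:stop] for start, stop in greater_than_zero(np_arr)]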
