I have purchasing data and want to label it with a new column that provides information about the time of day of the purchase. For that I'm using the hour of each purchase's timestamp.
Labels should work like this:
hour 4 - 7 => 'morning'
hour 8 - 11 => 'before midday'
...
I've already extracted the hours from the timestamps. Now I have a DataFrame with 50 million records which looks as follows.
user_id timestamp hour
0 11 2015-08-21 06:42:44 6
1 11 2015-08-20 13:38:58 13
2 11 2015-08-20 13:37:47 13
3 11 2015-08-21 06:59:05 6
4 11 2015-08-20 13:15:21 13
At the moment my approach is to use .iterrows() six times, each with a different condition:
for index, row in basket_times[(basket_times['hour'] >= 4) & (basket_times['hour'] < 8)].iterrows():
basket_times['periode'] = 'morning'
then:
for index, row in basket_times[(basket_times['hour'] >= 8) & (basket_times['hour'] < 12)].iterrows():
basket_times['periode'] = 'before midday'
and so on.
However, just one of those six loops over 50 million records already takes about an hour. Is there a better way to do this?
You can try loc with boolean masks. I changed df for testing:
print(basket_times)
user_id timestamp hour
0 11 2015-08-21 06:42:44 6
1 11 2015-08-20 13:38:58 13
2 11 2015-08-20 09:37:47 9
3 11 2015-08-21 06:59:05 6
4 11 2015-08-20 13:15:21 13
#create boolean masks
morning = (basket_times['hour'] >= 4) & (basket_times['hour'] < 8)
beforemidday = (basket_times['hour'] >= 8) & (basket_times['hour'] < 12)
aftermidday = (basket_times['hour'] >= 12) & (basket_times['hour'] < 16)
print(morning)
0 True
1 False
2 False
3 True
4 False
Name: hour, dtype: bool
print(beforemidday)
0 False
1 False
2 True
3 False
4 False
Name: hour, dtype: bool
print(aftermidday)
0 False
1 True
2 False
3 False
4 True
Name: hour, dtype: bool
basket_times.loc[morning, 'periode'] = 'morning'
basket_times.loc[beforemidday, 'periode'] = 'before midday'
basket_times.loc[aftermidday, 'periode'] = 'after midday'
print(basket_times)
user_id timestamp hour periode
0 11 2015-08-21 06:42:44 6 morning
1 11 2015-08-20 13:38:58 13 after midday
2 11 2015-08-20 09:37:47 9 before midday
3 11 2015-08-21 06:59:05 6 morning
4 11 2015-08-20 13:15:21 13 after midday
Timings - len(df) = 500k:
In [87]: %timeit a(df)
10 loops, best of 3: 34 ms per loop
In [88]: %timeit b(df1)
1 loops, best of 3: 490 ms per loop
Code for testing:
import pandas as pd
import io
temp=u"""user_id;timestamp;hour
11;2015-08-21 06:42:44;6
11;2015-08-20 10:38:58;10
11;2015-08-20 09:37:47;9
11;2015-08-21 06:59:05;6
11;2015-08-20 10:15:21;10"""
#after testing, replace io.StringIO(temp) with a filename
df = pd.read_csv(io.StringIO(temp), sep=";", index_col=None, parse_dates=[1])
df = pd.concat([df]*100000).reset_index(drop=True)
print(df.shape)
#(500000, 3)
df1 = df.copy()
def a(basket_times):
    morning = (basket_times['hour'] >= 4) & (basket_times['hour'] < 8)
    beforemidday = (basket_times['hour'] >= 8) & (basket_times['hour'] < 12)
    basket_times.loc[morning, 'periode'] = 'morning'
    basket_times.loc[beforemidday, 'periode'] = 'before midday'
    return basket_times

def b(basket_times):
    def get_periode(hour):
        if 4 <= hour <= 7:
            return 'morning'
        elif 8 <= hour <= 11:
            return 'before midday'
    basket_times['periode'] = basket_times['hour'].map(get_periode)
    return basket_times
print(a(df))
print(b(df1))
You can define a function that maps an hour to the period string you want, and then use map.
def get_periode(hour):
    if 4 <= hour <= 7:
        return 'morning'
    elif 8 <= hour <= 11:
        return 'before midday'

basket_times['periode'] = basket_times['hour'].map(get_periode)
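Another vectorized option, offered as a hedged sketch: pd.cut can assign all the labels in a single call. The bin edges and label names beyond 'morning' and 'before midday' are assumptions, since the question only spells out the first two ranges.
import pandas as pd

# assumed 4-hour bins continuing the question's pattern; hours 0-3
# fall outside the bins and come back as NaN
bins = [4, 8, 12, 16, 20, 24]
labels = ['morning', 'before midday', 'after midday', 'evening', 'night']
basket_times['periode'] = pd.cut(basket_times['hour'], bins=bins,
                                 labels=labels, right=False)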
Related
I have a data frame that consists of a time-series of integers. I'm trying to group the data frame by year and then for each year count the number of times that the sum of the absolute value of consecutive entries with the same sign is greater than or equal to 5.
>>> import pandas as pd
>>> l = [1, -1, -4, 2, 2, 4, 5, 1, -3, -4]
>>> idx1 = pd.date_range('2019-01-01',periods=5)
>>> idx2 = pd.date_range('2020-01-01',periods=5)
>>> idx = idx1.union(idx2)
>>> df = pd.DataFrame(l, index=idx, columns=['a'])
>>> df
a
2019-01-01 1
2019-01-02 -1
2019-01-03 -4 \\ 2019 count = 1: abs(-1) + abs(-4) >= 5
2019-01-04 2
2019-01-05 2
2020-01-01 4
2020-01-02 5 \\ 2020 count = 1: abs(4) + abs(5) + abs(1) = 10 >=5
2020-01-03 1
2020-01-04 -3
2020-01-05 -4 \\ 2020 count = 2: abs(-3) + abs(-4) = 7 >= 5
The desired output is:
2019 1
2020 2
My approach to solving this problem is to chain groupby and apply. Below are the implementations of the functions I created to pass to groupby and apply, respectively.
>>> import numpy as np
>>> def get_year(x):
...     return x.year
>>> def count(group, t=5):
...     c = 0  # counter
...     s = 0  # sum of consec vals w same sign
...     for i in range(1, len(group)):
...         if np.sign(group['a'].iloc[i-1]) == np.sign(group['a'].iloc[i]):
...             if s == 0:
...                 s = group['a'].iloc[i-1] + group['a'].iloc[i]
...             else:
...                 s += group['a'].iloc[i]
...             if i == (len(group) - 1) and abs(s) >= t:
...                 return c + 1
...         elif (np.sign(group['a'].iloc[i-1]) != np.sign(group['a'].iloc[i])) and (abs(s) >= t):
...             # if consec streak of vals w same sign is broken and abs(s) >= t, inc c and reset s
...             c += 1
...             s = 0
...         elif (np.sign(group['a'].iloc[i-1]) != np.sign(group['a'].iloc[i])) and (abs(s) < t):
...             # if consec streak of vals w same sign is broken and abs(s) < t, reset s
...             s = 0
...     return c
>>> by_year = df.groupby(get_year)
>>> by_year.apply(count)
2019 1
2020 2
My question is:
Is there a more "pythonic" implementation of the above count function that produces the desired result but doesn't rely on for loops?
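One loop-free direction, sketched under the assumption that runs reset at year boundaries (which the groupby-apply version enforces implicitly): label each run of consecutive same-sign values with a cumulative sum, then aggregate per (year, run).
import numpy as np

# a hedged sketch, not a drop-in replacement: np.sign(0) is 0, so zero
# entries would split runs; the sample data contains none
sign = np.sign(df['a'])
run_id = (sign != sign.shift()).cumsum()  # new id at every sign flip
runs = df['a'].abs().groupby([df.index.year, run_id]).agg(['sum', 'size'])
# a qualifying run has at least two entries and an absolute sum >= 5
result = runs[(runs['size'] > 1) & (runs['sum'] >= 5)].groupby(level=0).size()
print(result)  # years with no qualifying runs are simply absent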
I have a dataframe like this:
ID, Values
1 10, 11, 12, 13
2 14
3 15, 16, 17, 18
I want to create a new dataframe like this:
ID COl1 Col2
1 10 11
1 11 12
1 12 13
2 14
3 15 16
3 16 17
3 17 18
How can I do this?
Note: the values in the Values column of the input df are of str type.
Use a list comprehension with flattening, and one small change to the linked chunks recipe (if i > 0: becomes if i == 2:) so that one-element values are handled correctly:
from collections import deque

# https://stackoverflow.com/a/36586925
def chunks(iterable, chunk_size=2, overlap=1):
    # we'll use a deque to hold the values because it automatically
    # discards any extraneous elements if it grows too large
    if chunk_size < 1:
        raise Exception("chunk size too small")
    if overlap >= chunk_size:
        raise Exception("overlap too large")
    queue = deque(maxlen=chunk_size)
    it = iter(iterable)
    i = 0
    try:
        # start by filling the queue with the first group
        for i in range(chunk_size):
            queue.append(next(it))
        while True:
            yield tuple(queue)
            # after yielding a chunk, get enough elements for the next chunk
            for i in range(chunk_size - overlap):
                queue.append(next(it))
    except StopIteration:
        # if the iterator is exhausted, yield any remaining elements
        i += overlap
        if i == 2:
            yield tuple(queue)[-i:]
L = [[x] + list(z) for x, y in zip(df['ID'], df['Values'])
                   for z in chunks(y.split(', '))]
df = pd.DataFrame(L, columns=['ID', 'Col1', 'Col2']).fillna('')
print(df)
ID Col1 Col2
0 1 10 11
1 1 11 12
2 1 12 13
3 2 14
4 3 15 16
5 3 16 17
6 3 17 18
I tried a slightly different approach: I created a function that returns the numbers paired up from the initial comma-separated string.
def pairup(mystring):
    """Function to return paired up list from string"""
    mylist = mystring.split(',')
    if len(mylist) == 1:
        return [mylist]
    splitlist = []
    for index, item in enumerate(mylist):
        try:
            splitlist.append([mylist[index], mylist[index+1]])
        except IndexError:
            pass
    return splitlist
Now let's create the new data frame.
# https://stackoverflow.com/a/39955283/3679377
new_df = df[['ID']].join(
    df.Values.apply(lambda x: pd.Series(pairup(x)))
      .stack()
      .apply(lambda x: pd.Series(x))
      .fillna("")
      .reset_index(level=1, drop=True),
    how='left').reset_index(drop=True)
new_df.columns = ['ID', 'Col 1', 'Col 2']
Here's the output of print(new_df).
ID Col 1 Col 2
0 1 10 11
1 1 11 12
2 1 12 13
3 2 14
4 3 15 16
5 3 16 17
6 3 17 18
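A plainer sketch of the same pairing idea, assuming the ID/Values columns from the question: zip the split list against a shifted copy of itself and pad the single-element case by hand.
import pandas as pd

def pairs(s):
    vals = s.split(', ')
    # zip stops at the shorter argument, so one-element lists need padding
    return list(zip(vals, vals[1:])) or [(vals[0], '')]

rows = [(i, a, b) for i, v in zip(df['ID'], df['Values']) for a, b in pairs(v)]
print(pd.DataFrame(rows, columns=['ID', 'Col1', 'Col2']))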
Here is my code:
pizzarequests = pd.Series(open('pizza_requests.txt').read().splitlines())
line = "unix_timestamp_of_request_utc"
lines = pizzarequests[pizzarequests.str.contains(line)].str.split(",").str[1]
print(lines)
dts = pd.to_datetime(lines, unit='s')
hours = dts.dt.hour
print(hours)
pizzarequests = pd.Series(open('pizza_requests.txt').read().splitlines())
line = "requester_received_pizza"
lines = pizzarequests[pizzarequests.str.contains(line)].str.split(",").str[1]
data = pd.DataFrame({'houroftheday' : hours.values, 'successpizza' : lines})
print(data)
Which gives me:
houroftheday successpizza
23 18 true
67 2 true
105 14 true
166 23 true
258 20 true
297 1 true
340 2 true
385 22 true
...
304646 21 false
304686 12 false
304746 1 false
304783 3 false
304840 20 false
304907 17 false
304948 1 false
305023 4 false
How can I sum only the hours that correspond to true?
First filter the rows where the column successpizza equals 'true', then sum the column houroftheday:
sum_hour = data.loc[data['successpizza'] == 'true', 'houroftheday'].sum()
print(sum_hour)
102
If you want the size, you only need to count the True values; when you use sum, each True is treated as 1:
len_hour = (data['successpizza'] == 'true').sum()
print(len_hour)
8
Or if you need the count for each houroftheday:
mask = (data['successpizza'] == 'true').astype(int)
out = mask.groupby(data['houroftheday']).sum()
print(out)
houroftheday
1 1
2 2
3 0
12 0
14 1
18 1
20 1
21 0
22 1
23 1
Name: successpizza, dtype: int32
The solution for removing trailing whitespace is str.strip:
line = "requester_received_pizza"
lines = pizzarequests[pizzarequests.str.contains(line)].str.split(",").str[1].str.strip()
I think you want a count of the occurrences of each hour where successpizza is true. If so, you will want to slice the data frame using successpizza, then group by the houroftheday column and aggregate with a count.
It also looks like you are reading the true/false values in from a file, so they are strings; you will need to convert them first.
data.successpizza = data.successpizza.apply(lambda x: x=='true')
data[data.successpizza].groupby('houroftheday').count()
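For reference, a minimal end-to-end sketch that chains the pieces above (stripping whitespace, converting to booleans, then aggregating); the column names are taken from the question:
# assumes `data` holds the string values 'true'/'false' as in the question
data['successpizza'] = data['successpizza'].str.strip() == 'true'
sum_of_hours = data.loc[data['successpizza'], 'houroftheday'].sum()
hits_per_hour = data[data['successpizza']].groupby('houroftheday').size()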
First Part
I have a dataframe with finance data (33023 rows; here is the link to the data: https://mab.to/Ssy3TelRs); df.open is the opening price of the title and df.close is the closing price.
I have been trying to see how many times in a row the title closed with a gain and with a loss.
The result that I'm looking for should tell me that the title was positive 2 days in a row x times, 3 days in a row y times, 4 days in a row z times, and so forth.
I started with a for loop:
for x in range(1,df.close.count()): y = df.close[x]-df.open[x]
and then an unsuccessful series of if statements...
Thank you for your help.
CronosVirus00
EDITS:
>>> df.head(7)
data ora open max min close Unnamed: 6
0 20160801 0 1.11781 1.11781 1.11772 1.11773 0
1 20160801 100 1.11774 1.11779 1.11773 1.11777 0
2 20160801 200 1.11779 1.11800 1.11779 1.11795 0
3 20160801 300 1.11794 1.11801 1.11771 1.11771 0
4 20160801 400 1.11766 1.11772 1.11763 1.11772 0
5 20160801 500 1.11774 1.11798 1.11774 1.11796 0
6 20160801 600 1.11796 1.11796 1.11783 1.11783 0
Ifs:
for x in range(1, df.close.count()):
    y = df.close[x] - df.open[x]
    if y > 0:
        green += 1
        y = df.close[x+1] - df.close[x+1]
        twotimes += 1
        if y > 0:
            green += 1
            y = df.close[x+2] - df.close[x+2]
            threetimes += 1
            if y > 0:
                green += 1
                y = df.close[x+3] - df.close[x+3]
                fourtimes += 1
FINAL SOLUTION
Thank you all! In the end I did this:
df['test'] = df.close - df.open > 0
green = df.test  # days that it was positive

def gg(z):
    tot = green.count()
    giorni = range(1, z+1)  # days in a row I want to check
    for x in giorni:
        y = (green.rolling(x).sum() > x-1).sum()
        print(x, " ", y, " ", round((y/tot)*100, 1), "%")

gg(5)
1 14850 45.0 %
2 6647 20.1 %
3 2980 9.0 %
4 1346 4.1 %
5 607 1.8 %
If I understood your question correctly, you can do it this way:
In [76]: df.groupby((df.close.diff() < 0).cumsum()).cumcount()
Out[76]:
0 0
1 1
2 2
3 0
4 1
5 2
6 0
7 0
dtype: int64
The result that I'm looking for should tell me that the title was positive 2 days in a row x times, 3 days in a row y times, 4 days in a row z times, and so forth.
In [114]: df.groupby((df.close.diff() < 0).cumsum()).cumcount().value_counts().to_frame('count')
Out[114]:
count
0 4
2 2
1 2
Data set:
In [78]: df
Out[78]:
data ora open max min close
0 20160801 0 1.11781 1.11781 1.11772 1.11773
1 20160801 100 1.11774 1.11779 1.11773 1.11777
2 20160801 200 1.11779 1.11800 1.11779 1.11795
3 20160801 300 1.11794 1.11801 1.11771 1.11771
4 20160801 400 1.11766 1.11772 1.11763 1.11772
5 20160801 500 1.11774 1.11798 1.11774 1.11796
6 20160801 600 1.11796 1.11796 1.11783 1.11783
7 20160801 700 1.11783 1.11799 1.11783 1.11780
In [80]: df.close.diff()
Out[80]:
0 NaN
1 0.00004
2 0.00018
3 -0.00024
4 0.00001
5 0.00024
6 -0.00013
7 -0.00003
Name: close, dtype: float64
It sounds like what you want to do is:
compute the difference of two series (open & close), e.g. diff = df.open - df.close
apply a condition to the result to get a boolean series diff > 0
pass the resulting boolean series to the DataFrame to get a subset of the DataFrame where the condition is true: df[diff > 0]
find all contiguous subsequences by applying a column-wise function to identify and count them
I need to board a plane, but I will provide a sample of what the last step looks like when I can.
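Until then, here is a hedged sketch of what that last step could look like: label each contiguous run of gains with a cumulative sum, then count the runs by length. It assumes a df with open and close columns as in the question.
gain = (df['close'] - df['open']) > 0
run_id = (gain != gain.shift()).cumsum()  # new id at every gain/loss switch
run_lengths = gain.groupby(run_id).sum()  # length of each gain run, 0 for loss runs
counts = run_lengths[run_lengths > 0].value_counts().sort_index()
print(counts)  # index = days in a row, value = how many times that happened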
If I understood you correctly, you want the number of days that have at least n positive days in a row before them, the day itself included.
Similarly to what @Thang suggested, you can use rolling:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 2), columns=["open", "close"])
# This just sets up random test data, for example:
# open close
# 0 0.997986 0.594789
# 1 0.052712 0.401275
# 2 0.895179 0.842259
# 3 0.747268 0.919169
# 4 0.113408 0.253440
# 5 0.199062 0.399003
# 6 0.436424 0.514781
# 7 0.180154 0.235816
# 8 0.750042 0.558278
# 9 0.840404 0.139869
positiveDays = df["close"]-df["open"] > 0
# This will give you a series that is True for positive days:
# 0 False
# 1 True
# 2 False
# 3 True
# 4 True
# 5 True
# 6 True
# 7 True
# 8 False
# 9 False
# dtype: bool
daysToCheck = 3
positiveDays.rolling(daysToCheck).sum()>daysToCheck-1
This will now give you a series, indicating for every day, whether it has been positive for daysToCheck number of days in a row:
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 False
9 False
dtype: bool
Now you can use (positiveDays.rolling(daysToCheck).sum()>daysToCheck-1).sum() to get the number of days (in the example 3) that obey this, which is what you want, as far as I understand.
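As a usage sketch, the same rolling check can be repeated for several run lengths to build the "n days in a row happened x times" table the question asks for (each day that ends such a window is counted, the same convention as above):
for n in range(2, 6):
    hits = (positiveDays.rolling(n).sum() > n - 1).sum()
    print(n, 'days in a row:', hits, 'times')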
This should work:
import pandas as pd
import numpy as np
test = pd.DataFrame(np.random.randn(100,2), columns = ['open','close'])
test['gain?'] = (test['open']-test['close'] < 0)
test['cumulative'] = 0
for i in test.index[1:]:
    if test['gain?'][i]:
        test['cumulative'][i] = test['cumulative'][i-1] + 1
        test['cumulative'][i-1] = 0
results = test['cumulative'].value_counts()
Ignore the '0' row in the results. The code can be modified without too much trouble if you want to, for example, count both days in a run of two as runs of one as well.
Edit: without the chained-assignment warnings:
import pandas as pd
import numpy as np
test = pd.DataFrame(np.random.randn(100,2), columns = ['open','close'])
test['gain?'] = (test['open']-test['close'] < 0)
test['cumulative'] = 0
for i in test.index[1:]:
    if test['gain?'][i]:
        test.loc[i, 'cumulative'] = test.loc[i-1, 'cumulative'] + 1
        test.loc[i-1, 'cumulative'] = 0
results = test['cumulative'].value_counts()
I have a DataFrame with the following data. Each row represents a word appearing in an episode of a TV series; if a word appears 3 times in an episode, the DataFrame has 3 rows for it. Now I need to filter the words so that I only get the words which appear two or more times. I can do this with groupby, but if a word appears 2 (or 3, 4, 5) times, I need 2 (or 3, 4, 5) rows for it.
With groupby I only get the unique entry and its count, but I need the entry repeated as many times as it appears in the dialogue. Is there a one-liner to do this?
dialogue episode
0 music 1
1 corrections 1
2 somnath 1
3 yadav 5
4 join 2
5 instagram 1
6 wind 2
7 music 1
8 whimpering 2
9 music 1
10 wind 3
So here I should ideally get:
dialogue episode
0 music 1
6 wind 2
7 music 1
9 music 1
10 wind 3
As these are the only two words that appear two or more times.
You can use groupby's filter:
In [11]: df.groupby("dialogue").filter(lambda x: len(x) > 1)
Out[11]:
dialogue episode
0 music 1
6 wind 2
7 music 1
9 music 1
10 wind 3
Answer for the updated question:
In [208]: df.groupby('dialogue')['episode'].transform('size') >= 3
Out[208]:
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 False
9 True
10 False
dtype: bool
In [209]: df[df.groupby('dialogue')['episode'].transform('size') >= 3]
Out[209]:
dialogue episode
0 music 1
7 music 1
9 music 1
Answer for the original question:
You can use the duplicated() method:
In [202]: df[df.duplicated(subset=['dialogue'], keep=False)]
Out[202]:
dialogue episode
0 music 1
6 wind 2
7 music 1
9 music 1
10 wind 3
if you want to sort the result:
In [203]: df[df.duplicated(subset=['dialogue'], keep=False)].sort_values('dialogue')
Out[203]:
dialogue episode
0 music 1
7 music 1
9 music 1
6 wind 2
10 wind 3
I'd use value_counts
vc = df.dialogue.value_counts() >= 2
vc = vc[vc]
df[df.dialogue.isin(vc.index)]
Timing
Keep in mind, this is completely over the top; however, I'm sharpening up my timing skills.
Code:
from timeit import timeit
from string import ascii_lowercase as lowercase
import numpy as np
import pandas as pd

def pirsquared(df):
    vc = df.dialogue.value_counts() > 1
    vc = vc[vc]
    return df[df.dialogue.isin(vc.index)]

def maxu(df):
    return df[df.groupby('dialogue')['episode'].transform('size') > 1]

def andyhayden(df):
    return df.groupby("dialogue").filter(lambda x: len(x) > 1)
rows = ['pirsquared', 'maxu', 'andyhayden']
cols = ['OP_Given', '10000_3_letters']
summary = pd.DataFrame([], rows, cols)
iterations = 10
df = pd.DataFrame({'dialogue': {0: 'music', 1: 'corrections', 2: 'somnath', 3: 'yadav', 4: 'join', 5: 'instagram', 6: 'wind', 7: 'music', 8: 'whimpering', 9: 'music', 10: 'wind'}, 'episode': {0: 1, 1: 1, 2: 1, 3: 5, 4: 2, 5: 1, 6: 2, 7: 1, 8: 2, 9: 1, 10: 3}})
summary.loc['pirsquared', 'OP_Given'] = timeit(lambda: pirsquared(df), number=iterations)
summary.loc['maxu', 'OP_Given'] = timeit(lambda: maxu(df), number=iterations)
summary.loc['andyhayden', 'OP_Given'] = timeit(lambda: andyhayden(df), number=iterations)
df = pd.DataFrame(
    pd.DataFrame(np.random.choice(list(lowercase), (10000, 3))).sum(1),
    columns=['dialogue'])
df['episode'] = 1
summary.loc['pirsquared', '10000_3_letters'] = timeit(lambda: pirsquared(df), number=iterations)
summary.loc['maxu', '10000_3_letters'] = timeit(lambda: maxu(df), number=iterations)
summary.loc['andyhayden', '10000_3_letters'] = timeit(lambda: andyhayden(df), number=iterations)
summary