Here is my code:
import pandas as pd

pizzarequests = pd.Series(open('pizza_requests.txt').read().splitlines())
line = "unix_timestamp_of_request_utc"
lines = pizzarequests[pizzarequests.str.contains(line)].str.split(",").str[1]
print(lines)
dts = pd.to_datetime(lines, unit='s')
hours = dts.dt.hour
print(hours)
pizzarequests = pd.Series(open('pizza_requests.txt').read().splitlines())
line = "requester_received_pizza"
lines = pizzarequests[pizzarequests.str.contains(line)].str.split(",").str[1]
data = pd.DataFrame({'houroftheday' : hours.values, 'successpizza' : lines})
print(data)
Which gives me:
houroftheday successpizza
23 18 true
67 2 true
105 14 true
166 23 true
258 20 true
297 1 true
340 2 true
385 22 true
...
304646 21 false
304686 12 false
304746 1 false
304783 3 false
304840 20 false
304907 17 false
304948 1 false
305023 4 false
How can I sum the hours that only correspond to the trues?
First filter all the rows where successpizza is 'true', then sum the houroftheday column:
sum_hour = data.loc[data['successpizza'] == 'true', 'houroftheday'].sum()
print (sum_hour)
102
If you want the size, it is only necessary to count the Trues; sum works here because each True is processed as 1:
len_hour = (data['successpizza'] == 'true').sum()
print (len_hour)
8
Or if you need the count of Trues for each houroftheday:
mask = (data['successpizza'] == 'true').astype(int)
out = mask.groupby(data['houroftheday']).sum()
print (out)
houroftheday
1 1
2 2
3 0
12 0
14 1
18 1
20 1
21 0
22 1
23 1
Name: successpizza, dtype: int32
To remove trailing whitespace, the solution is str.strip:
line = "requester_received_pizza"
lines = pizzarequests[pizzarequests.str.contains(line)].str.split(",").str[1].str.strip()
I think you want a count of the occurrences of each hour where successpizza is true. If so, you will want to slice the data frame using successpizza, then group by the houroftheday column and aggregate using a count.
It also looks like you are reading in the true/false values from a file, so they are strings. You will need to convert them first.
data.successpizza = data.successpizza.apply(lambda x: x=='true')
data[data.successpizza].groupby('houroftheday').count()
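Putting it together, here is a minimal end-to-end sketch on a small hand-built frame (the values are illustrative, not from the real file):

import pandas as pd

# small hand-built frame standing in for the parsed file (illustrative values)
data = pd.DataFrame({'houroftheday': [18, 2, 14, 21, 12, 1],
                     'successpizza': ['true', 'true', 'true', 'false', 'false', 'true']})

# vectorized equivalent of the apply above
data.successpizza = data.successpizza == 'true'

# count of successful requests per hour
print(data[data.successpizza].groupby('houroftheday').count())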
Related
I have a float column in a dataframe, and I want to add a boolean column that turns True once a condition holds on two consecutive values and stays True until another condition holds on the next two consecutive values.
For example, I have a data-frame which looks like this:
index  Values %
0          0
1          5
2         11
3          9
4         14
5         18
6         30
7         54
8         73
9        100
10       100
11       100
12       100
13       100
Now I want to mark True from the point where two consecutive values satisfy the condition df['Values %'] >= 10, until the next two consecutive values satisfy the next condition, i.e. df['Values %'] == 100.
So the final result will look something like this:
index  Values %   Flag
0          0      False
1          5      False
2         11      False
3          9      False
4         14      False
5         18      True
6         30      True
7         54      True
8         73      True
9        100      True
10       100      True
11       100      False
12       100      False
13       100      False
Not sure exactly how the second part of your question is supposed to work, but here is how to achieve the first part.
example data
s = pd.Series([0,5,11,9,14,18,2,14,16,18])
solution
# create true/false series for first condition and take cumulative sum
x = (s >= 10).cumsum()
# compare each element of x with the element 2 positions before; the difference
# is exactly 2 for elements that belong to a streak of 2 or more Trues
condition = x - x.shift(2) == 2
condition looks like this
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 True
9 True
dtype: bool
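For the second part, here is a minimal sketch that extends the same shift idea, assuming a single start/stop episode as in your example (start and stop are names I introduce here):

import pandas as pd

df = pd.DataFrame({'Values %': [0, 5, 11, 9, 14, 18, 30, 54, 73,
                                100, 100, 100, 100, 100]})
s = df['Values %']
start = s.ge(10) & s.shift(1).ge(10)   # two consecutive values >= 10
stop = s.eq(100) & s.shift(1).eq(100)  # two consecutive values == 100
# True once a start has been seen, False starting one row after the first stop
df['Flag'] = (start.cumsum() > 0) & (stop.cumsum().shift(1, fill_value=0) == 0)
print(df)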
I have a rather inefficient way of doing this. It's not vectorised, so not ideal, but it works:
import numpy as np

# Convert the values column to a 1D NumPy array for ease of use.
values = df["Values %"].tolist()
values_np = np.array(values)

# Initialize a flags 1D array of the same size as values_np, all 0s.
# Uses the int form of booleans, i.e. 0 = False and 1 = True.
flags = np.zeros((values_np.shape[0]), dtype=int)

# Iterate from the 1st (not 0th) row to the last row.
for i in range(1, values_np.shape[0]):
    # First set the flag to 1 (True) if two consecutive values are both >= 10.
    if values_np[i] >= 10 and values_np[i-1] >= 10:
        flags[i] = 1
    # Then if two consecutive values are both >= 100, set the flag to 0 (False).
    if values_np[i] >= 100 and values_np[i-1] >= 100:
        flags[i] = 0

# Turn flags into boolean form (i.e. convert 0 and 1 to False and True).
flags = flags.astype(bool)

# Add flags as a new column in df.
df["Flags"] = flags
One thing: my method gives False for row 10, because row 9 and row 10 are both >= 100. If this is not what you wanted, let me know and I can change it so that the flag is set to False only when the previous two values and the current value (3 consecutive values) are all >= 100.
Business Problem: For each row in a Pandas data frame where a condition is true, set a value in a column. When successive rows meet the condition, increase the value by one. The end goal is to create a column containing integers (e.g., 1, 2, 3, 4, ..., n) upon which a pivot table can be made. As a side note, there will be a second index upon which the pivot will be made.
Below is my attempt, but I'm new to using Pandas.
sales_data_cleansed_2.loc[sales_data_cleansed_2['Duplicate'] == 'FALSE', 'sales_index'] = 1
j = 2

# loop through whether a duplicate exists.
for i in range(0, len(sales_data_cleansed_2)):
    while sales_data_cleansed_2.loc[i, 'Duplicate'] == 'TRUE':
        sales_data_cleansed_2.loc[i, 'sales_index'] = j
        j = j + 1
        break
    j = 2
You can try:
import numpy as np
import pandas as pd

# sample DataFrame
df = pd.DataFrame(np.random.randint(0, 2, 15).astype(str), columns=["Duplicate"])
df = df.replace({'1': 'TRUE', '0': 'FALSE'})
df['sales_index'] = ((df['Duplicate'] == 'TRUE')
                     .groupby((df['Duplicate'] != 'TRUE').cumsum())
                     .cumsum() + 1)
print(df)
This gives:
Duplicate sales_index
0 FALSE 1
1 FALSE 1
2 TRUE 2
3 TRUE 3
4 TRUE 4
5 TRUE 5
6 TRUE 6
7 TRUE 7
8 TRUE 8
9 FALSE 1
10 FALSE 1
11 TRUE 2
12 TRUE 3
13 TRUE 4
14 FALSE 1
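The one-liner may be easier to follow broken into steps; here is a sketch of the same logic (is_dup and block are names I introduce here):

is_dup = df['Duplicate'] == 'TRUE'  # rows that should get an increasing index
block = (~is_dup).cumsum()          # a new block starts at every FALSE row
# running count of TRUEs within each block, offset by 1 so FALSE rows get 1
df['sales_index'] = is_dup.groupby(block).cumsum() + 1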
I have a dataframe like the one given below:
df = pd.DataFrame({
'subject_id' :[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],
'day':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'PEEP' :[7,5,10,10,11,11,14,14,17,17,21,21,23,23,25,25,22,20,26,26,5,7,8,8,9,9,13,13,15,15,12,12,15,15,19,19,19,22,22,15]
})
df['fake_flag'] = ''
I perform the operation shown in the code below. It works and produces the expected output, but I can't use this approach for a real dataset, which has more than a million records.
t1 = df['PEEP']
for i in t1.index:
    if i >= 2:
        print("current value is ", t1[i])
        print("preceding 1st (n-1) ", t1[i-1])
        print("preceding 2nd (n-2) ", t1[i-2])
        if t1[i-1] == t1[i-2] or t1[i-2] >= t1[i-1]:
            # the max of the two preceding values; when they are constant,
            # t1[i-2] and t1[i-1] have the same value anyway
            r1_output = t1[i-2]
            print("rule 1 output is ", r1_output)
            if t1[i] >= r1_output + 3:
                print("found a value for rule 2", t1[i])
                print("check if the next value is the same as the current value", t1[i+1])
                if t1[i] == t1[i+1]:
                    print("fake flag is being set")
                    df['fake_flag'][i] = 'fake_vac'
However, I can't apply this to the real data, which has more than a million records. I am learning Python; can you help me understand how to vectorize this code?
You can refer to the related post to understand the logic. As I have the logic right, I have created this post mainly to seek help in vectorizing and speeding up my code.
I expect my output to be as shown below (expected output for subject_id = 1 and subject_id = 2 not reproduced here).
Is there an efficient and elegant way to speed up this operation for a million-record dataset?
Not sure what the story is behind this, but you can certainly vectorize the three ifs independently and combine them together:
con1 = t1.shift(2).ge(t1.shift(1))
con2 = t1.ge(t1.shift(2).add(3))
con3 = t1.eq(t1.shift(-1))
df['fake_flag']=np.where(con1 & con2 & con3,'fake VAC','')
Edit (Groupby SubjectID)
con = lambda x: (x.shift(2).ge(x.shift(1))) & (x.ge(x.shift(2).add(3))) & (x.eq(x.shift(-1)))
df['fake_flag'] = df.groupby('subject_id')['PEEP'].transform(con).map({True:'fake VAC',False:''})
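As a quick sanity check (my addition, not part of the original answer), the grouped version should differ from the ungrouped one only near subject_id boundaries, where shift would otherwise leak values across subjects:

import numpy as np

# ungrouped flags, reusing con1/con2/con3 from above
global_flag = np.where(con1 & con2 & con3, 'fake VAC', '')
# count the rows where grouping changed the result
print((df['fake_flag'] != global_flag).sum())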
Does this work?
df.groupby('subject_id')\
  .rolling(3)['PEEP']\
  .apply(lambda x: (x[-1] - x[:2].max()) >= 3, raw=True)\
  .fillna(0).astype(bool)
Output:
subject_id
1 0 False
1 False
2 True
3 False
4 False
5 False
6 True
7 False
8 True
9 False
10 True
11 False
12 False
13 False
14 False
15 False
16 False
17 False
18 True
19 False
2 20 False
21 False
22 False
23 False
24 False
25 False
26 True
27 False
28 False
29 False
30 False
31 False
32 True
33 False
34 True
35 False
36 False
37 True
38 False
39 False
Name: PEEP, dtype: bool
Details:
Use groupby to break the data up by 'subject_id'.
Apply rolling with n=3, i.e. a window size of three.
Look at the last value in each window using -1 indexing and subtract the maximum of the first two values in that window using index slicing.
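To write the result back into the frame, a sketch (my addition; the reset_index step realigns the grouped result with df, and the fake_vac label follows the question):

import numpy as np

mask = (df.groupby('subject_id')
          .rolling(3)['PEEP']
          .apply(lambda x: (x[-1] - x[:2].max()) >= 3, raw=True)
          .fillna(0).astype(bool)
          .reset_index(level=0, drop=True))  # drop the subject_id level added by groupby
df['fake_flag'] = np.where(mask, 'fake_vac', '')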
I have a dataframe like this:
ID, Values
1 10, 11, 12, 13
2 14
3 15, 16, 17, 18
I want to create a new dataframe like this:
ID Col1 Col2
1 10 11
1 11 12
1 12 13
2 14
3 15 16
3 16 17
3 17 18
How can I do this?
Note: The values in the Values column of the input df are of str type.
Use a list comprehension with flattening, with one small change: if i > 0: becomes if i == 2: so that single-element values are handled correctly:
from collections import deque

# https://stackoverflow.com/a/36586925
def chunks(iterable, chunk_size=2, overlap=1):
    # we'll use a deque to hold the values because it automatically
    # discards any extraneous elements if it grows too large
    if chunk_size < 1:
        raise Exception("chunk size too small")
    if overlap >= chunk_size:
        raise Exception("overlap too large")
    queue = deque(maxlen=chunk_size)
    it = iter(iterable)
    i = 0
    try:
        # start by filling the queue with the first group
        for i in range(chunk_size):
            queue.append(next(it))
        while True:
            yield tuple(queue)
            # after yielding a chunk, get enough elements for the next chunk
            for i in range(chunk_size - overlap):
                queue.append(next(it))
    except StopIteration:
        # if the iterator is exhausted, yield any remaining elements
        i += overlap
        if i == 2:
            yield tuple(queue)[-i:]
L = [[x] + list(z) for x, y in zip(df['ID'], df['Values']) for z in (chunks(y.split(', ')))]
df = pd.DataFrame(L, columns=['ID','Col1','Col2']).fillna('')
print (df)
ID Col1 Col2
0 1 10 11
1 1 11 12
2 1 12 13
3 2 14
4 3 15 16
5 3 16 17
6 3 17 18
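A quick check of how chunks behaves on its own (my addition, not part of the original answer):

print(list(chunks(['10', '11', '12', '13'])))
# [('10', '11'), ('11', '12'), ('12', '13')]
print(list(chunks(['14'])))
# [('14',)]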
I tried a slightly different approach: I created a function that returns the numbers paired up from the initial comma-separated string.
def pairup(mystring):
    """Function to return a paired-up list from a string"""
    mylist = mystring.split(',')
    if len(mylist) == 1:
        return [mylist]
    splitlist = []
    for index, item in enumerate(mylist):
        try:
            splitlist.append([mylist[index], mylist[index + 1]])
        except IndexError:
            pass
    return splitlist
Now let's create the new data frame.
# https://stackoverflow.com/a/39955283/3679377
new_df = df[['ID']].join(
    df.Values.apply(lambda x: pd.Series(pairup(x)))
      .stack()
      .apply(lambda x: pd.Series(x))
      .fillna("")
      .reset_index(level=1, drop=True),
    how='left').reset_index(drop=True)
new_df.columns = ['ID', 'Col 1', 'Col 2']
Here's the output of print(new_df).
ID Col 1 Col 2
0 1 10 11
1 1 11 12
2 1 12 13
3 2 14
4 3 15 16
5 3 16 17
6 3 17 18
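For reference, here is how pairup behaves on its own; note that splitting on ',' keeps the leading space on every element after the first (my addition, not part of the original answer):

print(pairup('10, 11, 12, 13'))
# [['10', ' 11'], [' 11', ' 12'], [' 12', ' 13']]
print(pairup('14'))
# [['14']]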
I am trying to extend my current pattern to accommodate an extra condition of +- a percentage of the last value, rather than a strict match against the previous value.
import numpy as np
import pandas as pd

data = np.array([[2,30],[2,900],[2,30],[2,30],[2,30],[2,1560],[2,30],
                 [2,300],[2,30],[2,450]])
df = pd.DataFrame(data)
df.columns = ['id','interval']
UPDATE 2 (id fix): Updated Data 2 with more data:
data2 = np.array([[2,30],[2,900],[2,30],[2,29],[2,31],[2,30],[2,29],[2,31],
                  [2,1560],[2,30],[2,300],[2,30],[2,450],
                  [3,40],[3,900],[3,40],[3,39],[3,41],[3,40],[3,39],[3,41],
                  [3,1560],[3,40],[3,300],[3,40],[3,450]])
df2 = pd.DataFrame(data2)
df2.columns = ['id','interval']
for i, g in df.groupby([(df.interval != df.interval.shift()).cumsum()]):
    if len(g.interval.tolist()) >= 3:
        print(g.interval.tolist())
This results in [30, 30, 30].
However, I really want to catch near-number conditions, say when a number is within +-10% of the previous number. So, looking at df2, I would like to pick up the series [30, 29, 31]:
for i, g in df2.groupby([(df2.interval != <??? +-10% magic ???>).cumsum()]):
    if len(g.interval.tolist()) >= 3:
        print(g.interval.tolist())
UPDATE: Here is the downstream processing code where I store the gathered lists in a dictionary with the ID as the key:
leak_intervals = {}
final_leak_intervals = {}
serials = []

for i, g in df.groupby([(df.interval != df.interval.shift()).cumsum()]):
    if len(g.interval.tolist()) >= 3:
        print(g.interval.tolist())
        serial = g.id.values[0]
        if serial not in serials:
            serials.append(serial)
        if serial not in leak_intervals:
            leak_intervals[serial] = g.interval.tolist()
        else:
            leak_intervals[serial] = leak_intervals[serial] + g.interval.tolist()
UPDATE:
In [116]: df2.groupby(df2.interval.pct_change().abs().gt(0.1).cumsum()) \
              .filter(lambda x: len(x) >= 3)
Out[116]:
id interval
2 2 30
3 2 29
4 2 31
5 2 30
6 2 29
7 2 31
15 3 40
16 3 39
17 3 41
18 3 40
19 3 39
20 3 41
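Broken into steps, the grouping key works like this (a sketch of the same logic; pct, breaks and key are names I introduce here):

pct = df2.interval.pct_change().abs()  # relative change from the previous row
breaks = pct.gt(0.1)                   # True where the change exceeds 10%
key = breaks.cumsum()                  # rows between breaks share a group key
print(df2.groupby(key).filter(lambda x: len(x) >= 3))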