I have a table like the following but approximately 7 million rows. What I am trying to find out is how many cases is each user working on simultaneously? I would like to groupby the username and then get an average count of how many references are open concurrently between the two times.
Reference
starttime
stoptime
Username
1
2020-07-28 06:41:56.000
2020-07-28 07:11:25.000
Arthur
2
2020-07-18 13:24:02.000
2020-07-18 13:38:42.000
Arthur
3
2020-07-03 09:27:03.000
2020-07-03 10:35:24.000
Arthur
4
2020-07-05 19:42:38.000
2020-07-05 20:07:52.000
Bob
5
2020-07-04 10:22:48.000
2020-07-04 10:24:32.000
Bob
Any ideas?
Someone asked a similar question just yesterday so here it is:
ends = df['starttime'].values < df['endtime'].values[:, None]
starts = df['starttime'].values > df['starttime'].values[:, None]
same_name = (df['Username'].values == df['Username'].values[:, None])
# check for rows where all three conditions are met
# count the nubmer of matches by sum across axis=1 !!!
df['overlap'] = (ends & starts & same_name).sum(1)
df
To answer your final question for the mean value you would then run:
df['overlap'].mean()
I would use Pandas groupby function as you suggested in your tag already, by username. Let me describe the general workflow below per grouped user:
Collect all start times and stop times as 'moments of change in activities'.
Loop over all of them in your grouped dataframe
Use e.g. Pandas.DataFrame.loc to check how many cases are 'active' at moments of changes.
Save these in a list to compute the average count of cases
I don't have your code, but in pseudo-code it would look something like:
df = ... # your raw df
grouped = df.groupby(by='Username')
for user, user_df in grouped:
cases = []
user_starts_cases = user_df['starttime'].to_numpy()
user_stops_cases = user_df['stoptime'].to_numpy()
times_of_activity_changes = np.concatenate(user_starts_cases, user_stops_cases)
for xs in times_of_activity_changes:
num_activities = len(user_df.loc[(user_df['starttime'] <= xs) & (user_df['stoptime'] >= xs)]) # mind the brackets
active_cases.append(num_activities)
print(sum(active_cases)/len(active_cases))
It depends a bit what you would call 'on average' but with this you could sample the amount of active cases at the times of your interest and compute an average.
I'm looking to add a field or two into my data set that represents the difference in sales from the last week to current week and from current week to the next week.
My dataset is about 4.5 million rows so I'm looking to find an efficient way of doing this, currently I'm getting into a lot of iteration and for loops and I'm quite sure I'm going about this the wrong way. but Im trying to write code that will be reusable on other datasets and there are situations where you might have nulls or no change in sales week to week (therefore there is no record)
The dataset looks like the following:
Store Item WeekID WeeklySales
1 1567 34 100.00
2 2765 34 86.00
3 1163 34 200.00
1 1567 35 160.00
. .
. .
. .
I have each week as its own dictionary and then each store sales for that week in a dictionary within. So I can use the week as a key and then within the week I access the store's dictionary of item sales.
weekly_sales_dict = {}
for i in df['WeekID'].unique():
store_items_dict = {}
subset = df[df['WeekID'] == i]
subset = subset.groupby(['Store', 'Item']).agg({'WeeklySales':'sum'}).reset_index()
for j in subset['Store'].unique():
storeset = subset[subset['Store'] == j]
store_items_dict.update({str(j): storeset})
weekly_sales_dict.update({ str(i) : store_items_dict})
Then I iterate through each week in the weekly_sales_dict and compare each store/item within it to the week behind it (I planned to do the same for the next week as well). The 'lag_list' I create can be indexed by week, store, and Item so I was going to iterate through and add the values to my df as a new lag column but I feel I am way overthinking this.
count = 0
key_list = list(df['WeekID'].unique())
lag_list = []
for k,v in weekly_sales_dict.items():
if count != 0 and count != len(df['WeekID'].unique())-1:
prev_wk = weekly_sales_dict[str(key_list[(count - 1)])]
current_wk = weekly_sales_dict[str(key_list[count])
for i in df['Store'].unique():
prev_df = prev_wk[str(i)]
current_df = current_wk[str(i)]
for j in df['Item'].unique():
print('in j')
if j in list(current_df['Item'].unique()) and j in list(prev_df['Item'].unique()):
item_lag = current_df[current_df['Item'] == int(j)]['WeeklySales'].values - prev_df[prev_df['Item'] == int(j)]['WeeklySales'].values
df[df['Item'] == j][df['Store'] == i ][df['WeekID'] == key_list[count]]['lag'] = item_lag[0]
lag_list.append((str(i),str(j),item_lag[0]))
elif j in list(current_df['Item'].unique()):
item_lag = current_df[current_df['Item'] == int(j)]['WeeklySales'].values
lag_list.append((str(i),str(j),item_lag[0]))
else:
pass
count += 1
else:
count += 1
Using pd.diff() the problem was solved. I sorted all rows by week, then created a subset with a multi-index by grouping on store,items,and week. Finally I used pd.diff() with a period of 1 and I ended up with the sales difference from the current week to the week prior.
df = df.sort_values(by = 'WeekID')
subset = df.groupby(['Store', 'Items', 'WeekID']).agg({''WeeklySales'':'sum'})
subset['lag'] = subset[['WeeklySales']].diff(1)
Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I have having problems with and what I have tried already.
I am trying to join (merge) together two dataframe objects on multiple conditions. I know how to do this if the conditions to be met are all 'equals' operators, however, I need to make use of LESS THAN and MORE THAN.
The dataframes represent genetic information: one is a list of mutations in the genome (referred to as SNPs) and the other provides information on the locations of the genes on the human genome. Performing df.head() on these returns the following:
SNP DataFrame (snp_df):
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 752721
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
This shows the SNP reference ID and their locations. 'BP' stands for the 'Base-Pair' position.
Gene DataFrame (gene_df):
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
This dataframe shows the locations of all the genes of interest.
What I want to find out is all of the SNPs which fall within the gene regions in the genome, and discard those that are outside of these regions.
If I wanted to merge together two dataframes based on multiple (equals) conditions, I would do something like the following:
merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])
However, in this instance - I need to find the SNPs where the chromosome values match those in the Gene dataframe, and the BP value falls between 'chr_start' and 'chr_stop'. What makes this challenging is that these dataframes are quite large. In this current dataset the snp_df has 6795021 rows, and the gene_df has 34362.
I have tried to tackle this by either looking at chromosomes or genes seperately. There are 22 different chromosome values (ints 1-22) as the sex chromosomes are not used. Both methods are taking an extremely long time. One uses the pandasql module, while the other approach is to loop through the separate genes.
SQL method
import pandas as pd
import pandasql as psql
pysqldf = lambda q: psql.sqldf(q, globals())
q = """
SELECT s.SNP, g.feature_id
FROM this_snp s INNER JOIN this_genes g
WHERE s.BP >= g.chr_start
AND s.BP <= g.chr_stop;
"""
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
this_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
genic_snps = pysqldf(q)
all_dfs.append(genic_snps)
all_genic_snps = pd.concat(all_dfs)
Gene iteration method
all_dfs = []
for line in gene_df.iterrows():
info = line[1] # Getting the Series object
this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
(snp_df['BP'] >= info['chr_start']) & (snp_df['BP'] <= info['chr_stop'])]
if this_snp.shape[0] != 0:
this_snp = this_snp[['SNP']]
this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
all_dfs.append(this_snp)
all_genic_snps = pd.concat(all_dfs)
Can anyone give any suggestions of a more effective way of doing this?
I've just thought of a way to solve this - by combining my two methods:
First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This also doesn't have to make use of any SQL queries either. I've also included a section to immediately identify any redundant genes that don't have any SNPs that fall within their range. This makes use of a double for-loop which I normally try to avoid - but in this case it works quite well.
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
# Getting rid of redundant genes
min_bp = this_chr_snp['BP'].min()
max_bp = this_chr_snp['BP'].max()
this_genes = this_genes.loc[~(this_genes['chr_start'] >= max_bp) &
~(this_genes['chr_stop'] <= min_bp)]
for line in this_genes.iterrows():
info = line[1]
this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
(this_chr_snp['BP'] <= info['chr_stop'])]
if this_snp.shape[0] != 0:
this_snp = this_snp[['SNP']]
this_snp.insert(1, 'feature_id', info['feature_id'])
all_dfs.append(this_snp)
all_genic_snps = pd.concat(all_dfs)
While this doesn't run spectacularly quickly - it does run so that I can actually get some answers. I'd still like to know if anyone has any tips to make it run more efficiently though.
You can use the following to accomplish what you're looking for:
merged_df=snp_df.merge(gene_df,on=['chromosome'],how='inner')
merged_df=merged_df[(merged_df.BP>=merged_df.chr_start) & (merged_df.BP<=merged_df.chr_stop)][['SNP','feature_id']]
Note: your example dataframes do not meet your join criteria. Here is an example using modified dataframes:
snp_df
Out[193]:
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 30400
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
gene_df
Out[194]:
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
merged_df
Out[195]:
SNP feature_id
8 rs3131972 GeneID:100302278
I have a data frame with a column of start dates and a column of end dates. I want to check the integrity of the dates by ensuring that the start date is before the end date (i.e. start_date < end_date).I have over 14,000 observations to run through.
I have data in the form of:
Start End
0 2008-10-01 2008-10-31
1 2006-07-01 2006-12-31
2 2000-05-01 2002-12-31
3 1971-08-01 1973-12-31
4 1969-01-01 1969-12-31
I have added a column to write the result to, even though I just want to highlight whether there are incorrect ones so I can delete them:
dates['Correct'] = " "
And have began to check each date pair using the following, where my dataframe is called dates:
for index, row in dates.iterrows():
if dates.Start[index] < dates.End[index]:
dates.Correct[index] = "correct"
elif dates.Start[index] == dates.End[index]:
dates.Correct[index] = "same"
elif dates.Start[index] > dates.End[index]:
dates.Correct[index] = "incorrect"
Which works, it is just taking a really really long-time (about over 15 minutes). I need a more efficiently running code - is there something I am doing wrong or could improve?
Why not just do it in a vectorized way:
is_correct = dates['Start'] < dates['End']
is_incorrect = dates['Start'] > dates['End']
is_same = ~is_correct & ~is_incorrect
Since the list doesn't need to be compared sequentially, you can gain performance by splitting your dataset and then using multiple processes to perform the comparison simultaneously. Take a look at the multiprocessing module for help.
Something like the following may be quicker:
import pandas as pd
import datetime
df = pd.DataFrame({
'start': ["2008-10-01", "2006-07-01", "2000-05-01"],
'end': ["2008-10-31", "2006-12-31", "2002-12-31"],
})
def comparison_check(df):
start = datetime.datetime.strptime(df['start'], "%Y-%m-%d").date()
end = datetime.datetime.strptime(df['end'], "%Y-%m-%d").date()
if start < end:
return "correct"
elif start == end:
return "same"
return "incorrect"
In [23]: df.apply(comparison_check, axis=1)
Out[23]:
0 correct
1 correct
2 correct
dtype: object
Timings
In [26]: %timeit df.apply(comparison_check, axis=1)
1000 loops, best of 3: 447 µs per loop
So by my calculations, 14,000 rows should take (447/3)*14,000 = (149 µs)*14,000 = 2.086s, so a might shorter than 15 minutes :)
I am trying to sum (and plot) a total from functions which change states at different times using Python's Pandas.DataFrame. For example:
Suppose we have 3 people whose states can be a) holding nothing, b) holding a 5 pound weight, and c) holding a 10 pound weight. Over time, these people pick weights up and put them down. I want to plot the total amount of weight being held. So, given:
My brute forece attempt:
import pandas as ps
import math
import numpy as np
person1=[3,0,10,10,10,10,10]
person2=[4,0,20,20,25,25,40]
person3=[5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(zip(person1,person2,person3)).T)
allPeopleDf.columns=['count','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDfNoCount=allPeopleDf[['start1', 'end1', 'start2', 'end2', 'start3','end3']]
uniqueTimes=sorted(ps.unique(allPeopleDfNoCount.values.ravel()))
possibleStates=[-1,0,1,2] #extra state 0 for initialization
stateData={}
comboStates={}
#initialize dict to add up all of the stateData
for time in uniqueTimes:
comboStates[time]=0.0
allPeopleDf['track']=-1
allPeopleDf['status']=-1
numberState=len(possibleStates)
starti=-1
endi=0
startState=0
for i in range(3):
starti=starti+2
print starti
endi=endi+2
for time in uniqueTimes:
def helper(row):
start=row[starti]
end=row[endi]
track=row[7]
if start <= time and time < end:
return possibleStates[i+1]
else:
return possibleStates[0]
def trackHelp(row):
status=row[8]
track=row[7]
if track<=status:
return status
else:
return track
def Multiplier(row):
x=row[8]
if x==0:
return 0.0*row[0]
if x==1:
return 5.0*row[0]
if x==2:
return 10.0*row[0]
if x==-1:#numeric place holder for non-contributing
return 0.0*row[0]
allPeopleDf['status']=allPeopleDf.apply(helper,axis=1)
allPeopleDf['track']=allPeopleDf.apply(trackHelp,axis=1)
stateData[time]=allPeopleDf.apply(Multiplier,axis=1).sum()
for k,v in stateData.iteritems():
comboStates[k]=comboStates.get(k,0)+v
print allPeopleDf
print stateData
print comboStates
Plots of weight being held over time might look like the following:
And the sum of the intensities over time might look like the black line in the following:
with the black line defined with the Cartesian points: (0,0 lbs),(5,0 lbs),(5,5 lbs),(15,5 lbs),(15,10 lbs),(20,10 lbs),(20,15 lbs),(25,15 lbs),(25,20 lbs),(40,20 lbs). However, I'm flexible and don't necessarily need to define the combined intensity line as a set of Cartesian points. The unique times can be found with:
print list(set(uniqueTimes).intersection(allNoCountT[1].values.ravel())).sort()
,but I can't come up with a slick way of getting the corresponding intensity values.
I started out with a very ugly function to break apart each "person's" graph so that all people had start and stop times (albeit many stop and start times without state change) at the same time, and then I could add up all the "chunks" of time. This was cumbersome; there has to be a slick pandas way of handling this. If anyone can offer a suggestion or point me to another SO like that I might have missed, I'd appreciate the help!
In case my simplified example isn't clear, another might be plotting the intensity of sound coming from a piano: there are many notes being played for different durations with different intensities. I would like the sum of intensity coming from the piano over time. While my example is simplistic, I need a solution that is more on the scale of a piano song: thousands of discrete intensity levels per key, and many keys contributing over the course of a song.
Edit--Implementation of mgab's provided solution:
import pandas as ps
import math
import numpy as np
person1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
person2=['person2',4,0,20,20,25,25,40]
person3=['person3',5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(zip(person1,person2,person3)).T)
allPeopleDf.columns=['id','intensity','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDf=ps.melt(allPeopleDf,id_vars=['intensity','id'])
allPeopleDf.columns=['intensity','id','timeid','time']
df=ps.DataFrame(allPeopleDf).drop('timeid',1)
df[df.id=='person1'].drop('id',1) #easier to visualize one id for check
df['increment']=df.groupby('id')['intensity'].transform( lambda x: x.sub(x.shift(), fill_value= 0 ))
TypeError: unsupported operand type(s) for -: 'str' and 'int'
End Edit
Going for the piano keys example, lets assume you have three keys, with 30 levels of intensity.
I would try to keep the data in this format:
import pandas as pd
df = pd.DataFrame([[10,'A',5],
[10,'B',7],
[13,'C',10],
[15,'A',15],
[20,'A',7],
[23,'C',0]], columns=["time", "key", "intensity"])
time key intensity
0 10 A 5
1 10 B 7
2 13 C 10
3 15 A 15
4 20 A 7
5 23 C 0
where you record every change in intensity of any of the keys. From here you can already get the Cartesian coordinates for each individual key as (time,intensity) pairs
df[df.key=="A"].drop('key',1)
time intensity
0 10 5
3 15 15
4 20 7
Then, you can easily create a new column increment that will indicate the change in intensity that occurred for that key at that time point (intensity indicates just the new value of intensity)
df["increment"]=df.groupby("key")["intensity"].transform(
lambda x: x.sub(x.shift(), fill_value= 0 ))
df
time key intensity increment
0 10 A 5 5
1 10 B 7 7
2 13 C 10 10
3 15 A 15 10
4 20 A 7 -8
5 23 C 0 -10
And then, using this new column, you can generate the (time, total_intensity) pairs to use as Cartesian coordinates
df.groupby("time").sum()["increment"].cumsum()
time
10 12
13 22
15 32
20 24
23 14
dtype: int64
EDIT: applying the specific data presented in question
Assuming the data comes as a list of values, starting with the element id (person/piano key), then a factor multiplying the measured weight/intensities for this element, and then pairs of time values indicating the start and end of a series of known states (weight being carried/intensity being emitted). Not sure if I got the data format right. From your question:
data1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
data2=['person2',4,0,20,20,25,25,40]
data3=['person3',5,0,5,5,15,15,40]
And if we know the weight/intensity of each one of the states, we can define:
known_states = [5, 10, 15]
DF_columns = ["time", "id", "intensity"]
Then, the easiest way I came up to load the data includes this function:
import pandas as pd
def read_data(data, states, columns):
id = data[0]
factor = data[1]
reshaped_data = []
for i in xrange(len(states)):
j += 2+2*i
if not data[j] == data[j+1]:
reshaped_data.append([data[j], id, factor*states[i]])
reshaped_data.append([data[j+1], id, -1*factor*states[i]])
return pd.DataFrame(reshaped_data, columns=columns)
Notice that the if not data[j] == data[j+1]: avoids loading data to the dataframe when start and end times for a given state are equal (seems uninformative, and wouldn't appear in your plots anyway). But take it out if you still want these entries.
Then, you load the data:
df = read_data(data1, known_states, DF_columns)
df = df.append(read_data(data2, known_states, DF_columns), ignore_index=True)
df = df.append(read_data(data3, known_states, DF_columns), ignore_index=True)
# and so on...
And then you're right at the beginning of this answer (substituting 'key' by 'id' and the ids, of course)
Appears to be what .sum() is for:
In [10]:
allPeopleDf.sum()
Out[10]:
aStart 0
aEnd 35
bStart 35
bEnd 50
cStart 50
cEnd 90
dtype: int32