Pandas For Loop Optimization - python

I'm new to python and although I can write for loops with no issue, I'm finding they're horrendously slow. Here's my code:
perc_match is a function that runs a calculation between two vectors, which in this case are rows of a dataframe.
def perc_match(customer_id,bait_name):
    score = int(df_master.loc[customer_id,:].dot(df_pim.loc[bait_name,:].values))
    perfect = int(df_master.loc[customer_id,:].dot(df_perf.iloc[0,:].values))
    if perfect == 0:
        return 0
    elif (score / perfect)*100 < 0:
        return 0
    else:
        percent = round((score / perfect)*100,3)
        percent = float(percent)
        return percent
match_maker calls perc_match for every row in two dataframes and places the output in its respective cell in df_match.
def match_maker(df_match):
    for i in df_match.index:
        for j in df_match.columns:
            df_match.loc[i,j] = perc_match(i,j)
for reference:
df_master.shape = (122905, 33)
df_pim.shape = (36, 33)
df_perf.shape = (1, 33)
df_match.shape = (122905, 36)
This all works fine - except when I test how long it takes...
5.49 s ± 72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Not good when I'm running this on 100,000s of rows. I know there are ways to optimize the code, but I'm having a hard time understanding it. What's the best way I can slim this code down?
EDIT:
The inputs look something like this:
df_master:
Customer ID Email Technique 1 ... Technique 33
12345 i#me.com 1 ... 0
...
df_pim:
Product ID Technique 1 ... Technique 33
Product 1 1 0
...
df_perc (all values are 1):
index Technique 1 ... Technique 33
1 1
df_match:
Customer ID Email Product 1 ... Product N
12345 i#me.com 0 ... 0
...
I want the function to edit df_match to look like this:
df_match (gives a % match based on comparison between technique values):
Customer ID Email Product 1 ... Product N
12345 i#me.com 12.842 ... 44.312
...

Assumptions:
I'm assuming df_perf in perc_match() line 3 is a typo and you meant df_perc.
You are treating this as individual values to be calculated one pair at a time, but the .dot operation you are using can handle 2-dimensional inputs as well as single rows.
In your perc_match() you have:
score = int(df_master.loc[customer_id,:].dot(df_pim.loc[bait_name,:].values))
this operates on one row at a time against one other row. How about building a whole score DataFrame instead:
columns = ["Technique "+str(a) for a in range(1,34)]
score_df = df_master[columns].dot(df_pim[columns].T)
The perfect line is mostly unnecessary if you are multiplying by a row of all ones; it reduces to a row sum. So how about something like this:
perfect = df_master[columns].sum(axis=1)
This will give you some thoughts to ponder for a while. I'll finish this answer later or someone can pick this up while I'm away.
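Putting those two pieces together (a rough sketch only, not part of the answer above; it assumes df_pim is indexed by product name so that score_df's columns line up with df_match's product columns, and that the technique values are non-negative so a zero perfect score implies a zero raw score):
# combine score_df and perfect from above into df_match in one shot
percentages = (score_df.div(perfect, axis=0) * 100).clip(lower=0).round(3)
percentages = percentages.fillna(0)  # customers whose perfect score is 0

df_match[percentages.columns] = percentages
The .clip(lower=0) and .fillna(0) are meant to reproduce the two early returns of the original perc_match.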

Related

Number of concurrent events per username in Pandas

I have a table like the following but approximately 7 million rows. What I am trying to find out is how many cases is each user working on simultaneously? I would like to groupby the username and then get an average count of how many references are open concurrently between the two times.
Reference  starttime                stoptime                 Username
1          2020-07-28 06:41:56.000  2020-07-28 07:11:25.000  Arthur
2          2020-07-18 13:24:02.000  2020-07-18 13:38:42.000  Arthur
3          2020-07-03 09:27:03.000  2020-07-03 10:35:24.000  Arthur
4          2020-07-05 19:42:38.000  2020-07-05 20:07:52.000  Bob
5          2020-07-04 10:22:48.000  2020-07-04 10:24:32.000  Bob
Any ideas?
Someone asked a similar question just yesterday so here it is:
ends = df['starttime'].values < df['stoptime'].values[:, None]
starts = df['starttime'].values > df['starttime'].values[:, None]
same_name = (df['Username'].values == df['Username'].values[:, None])
# check for rows where all three conditions are met
# count the number of matches by summing across axis=1
df['overlap'] = (ends & starts & same_name).sum(1)
df
To answer your final question for the mean value you would then run:
df['overlap'].mean()
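With roughly 7 million rows, the broadcast above builds an n x n boolean matrix over the whole frame, which may not fit in memory. As a hedged variant (not part of the answer above, and the helper name count_overlaps is made up for illustration), the same comparison can be applied within each Username group so each matrix only covers one user's cases:
import pandas as pd

def count_overlaps(group):
    # same start/stop comparison as above, but restricted to one user's rows
    starts = group['starttime'].values
    stops = group['stoptime'].values
    overlap = (starts > starts[:, None]) & (starts < stops[:, None])
    return pd.Series(overlap.sum(1), index=group.index)

df['overlap'] = df.groupby('Username', group_keys=False).apply(count_overlaps)
df.groupby('Username')['overlap'].mean()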
I would use Pandas groupby function as you suggested in your tag already, by username. Let me describe the general workflow below per grouped user:
Collect all start times and stop times as 'moments of change in activities'.
Loop over all of them in your grouped dataframe
Use e.g. Pandas.DataFrame.loc to check how many cases are 'active' at moments of changes.
Save these in a list to compute the average count of cases
I don't have your code, but in pseudo-code it would look something like:
import numpy as np

df = ... # your raw df
grouped = df.groupby(by='Username')
for user, user_df in grouped:
    active_cases = []
    user_starts_cases = user_df['starttime'].to_numpy()
    user_stops_cases = user_df['stoptime'].to_numpy()
    times_of_activity_changes = np.concatenate((user_starts_cases, user_stops_cases))
    for xs in times_of_activity_changes:
        num_activities = len(user_df.loc[(user_df['starttime'] <= xs) & (user_df['stoptime'] >= xs)])  # mind the brackets
        active_cases.append(num_activities)
    print(sum(active_cases) / len(active_cases))
It depends a bit what you would call 'on average' but with this you could sample the amount of active cases at the times of your interest and compute an average.

How can I make my python program run faster?

I am reading in a .csv file and creating a pandas dataframe. The file is a file of stocks. I am only interested in the date, the company, and the closing cost. I want my program to find the max profit with the starting date, the ending date and the company. It needs to use the divide and conquer algorithm. I only know how to use for loops but it takes forever to run. The .csv file is 200,000 rows. How can I get this to run fast?
import pandas as pd
import numpy as np
import math

def cleanData(file):
    df = pd.read_csv(file)
    del df['open']
    del df['low']
    del df['high']
    del df['volume']
    return np.array(df)

df = cleanData('prices-split-adjusted.csv')

bestStock = [None, None, None, float(-math.inf)]

def DAC(data):
    global bestStock
    if len(data) > 1:
        mid = len(data)//2
        left = data[:mid]
        right = data[mid:]
        DAC(left)
        DAC(right)
        for i in range(len(data)):
            for j in range(i+1,len(data)):
                if data[i,1] == data[j,1]:
                    profit = data[j,2] - data[i,2]
                    if profit > bestStock[3]:
                        bestStock[0] = data[i,0]
                        bestStock[1] = data[j,0]
                        bestStock[2] = data[i,1]
                        bestStock[3] = profit
                        print(bestStock)
                        print('\n')
    return bestStock

print(DAC(df))
I've got two things for your consideration (my answer tries not to change your algorithmic approach, i.e. the nested loops and recursive functions, and tackles the low-hanging fruit first):
Unless you are debugging, try to avoid print() inside a loop (in your case .. print(bestStock) ..). The I/O overhead can add up, especially if you are looping across large datasets and printing to screen often. Once you are OK with your code, comment it out and run on your full dataset, uncommenting it only during debugging sessions. You can expect some improvement in speed without having to print to screen in the loop.
If you are after even more ways to speed it up, I found in my case (similar to yours, which I often encounter especially in search/sort problems) that simply switching the expensive part (the Python for loops) to Cython (and statically defining variable types .. this is KEY to SPEED!) gives several orders of magnitude speed-up even before optimizing the implementation. Check out Cython: https://cython.readthedocs.io/en/latest/index.html. If that's not enough, then parallelism is your next best friend, which would require rethinking your code implementation.
The main problems causing slow system performance are:
You manually iterate over the rows in nested loops, accessing individual columns, without using pandas operations which make use of fast ndarray functions;
you use recursive calls, which look nice and simple but are slow.
Setting the sample data as follows:
Date Company Close
0 2019-12-31 AAPL 73.412498
1 2019-12-31 FB 205.250000
2 2019-12-31 NFLX 323.570007
3 2020-01-02 AAPL 75.087502
4 2020-01-02 FB 209.779999
... ... ... ...
184 2020-03-30 FB 165.949997
185 2020-03-30 NFLX 370.959991
186 2020-03-31 AAPL 63.572498
187 2020-03-31 FB 166.800003
188 2020-03-31 NFLX 375.500000
189 rows × 3 columns
Then use the following codes (modify the column labels to your labels if different):
df_result = df.groupby('Company').agg(
    Start_Date=pd.NamedAgg(column='Date', aggfunc="first"),
    End_Date=pd.NamedAgg(column='Date', aggfunc="last"),
    bestGain=pd.NamedAgg(column='Close', aggfunc=lambda x: x.max() - x.iloc[0]))
Resulting output:
Start_Date End_Date bestGain
Company
AAPL 2019-12-31 2020-03-31 8.387505
FB 2019-12-31 2020-03-31 17.979996
NFLX 2019-12-31 2020-03-31 64.209991
To get the entry with greatest gain:
df_result.loc[df_result['bestGain'].idxmax()]
Resulting output:
Start_Date 2019-12-31 00:00:00
End_Date 2020-03-31 00:00:00
bestGain 64.209991
Name: NFLX, dtype: object
Execution time comparison
With my scaled-down data of 3 stocks over 3 months, the code making use of pandas functions takes 8.9 ms, which is about half the execution time of the original code that manually iterates over the numpy array with nested loops and recursive calls (16.9 ms), even after the majority of the print() calls were removed.
Your codes with print() inside DAC() function removed:
%%timeit
"""
def cleanData(df):
    # df = pd.read_csv(file)
    del df['Open']
    del df['Low']
    del df['High']
    del df['Volume']
    return np.array(df)
"""
# df = cleanData('prices-split-adjusted.csv')
# df = cleanData(df0)
df = np.array(df0)

bestStock = [None, None, None, float(-math.inf)]

def DAC(data):
    global bestStock
    if len(data) > 1:
        mid = len(data)//2
        left = data[:mid]
        right = data[mid:]
        DAC(left)
        DAC(right)
        for i in range(len(data)):
            for j in range(i+1,len(data)):
                if data[i,1] == data[j,1]:
                    profit = data[j,2] - data[i,2]
                    if profit > bestStock[3]:
                        bestStock[0] = data[i,0]
                        bestStock[1] = data[j,0]
                        bestStock[2] = data[i,1]
                        bestStock[3] = profit
                        # print(bestStock)
                        # print('\n')
    return bestStock

print(DAC(df))
[Timestamp('2020-03-16 00:00:00'), Timestamp('2020-03-31 00:00:00'), 'NFLX', 76.66000366210938]
16.9 ms ± 303 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
New simplified codes in pandas' way of coding:
%%timeit
df_result = df.groupby('Company').agg(
    Start_Date=pd.NamedAgg(column='Date', aggfunc="first"),
    End_Date=pd.NamedAgg(column='Date', aggfunc="last"),
    bestGain=pd.NamedAgg(column='Close', aggfunc=lambda x: x.max() - x.iloc[0]))
df_result.loc[df_result['bestGain'].idxmax()]
8.9 ms ± 195 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Solution using recursive function:
The main problem of your recursive function lies in that you did not make use of the results of recursive calls of reduced size data.
To properly use recursive function as a divide-and-conquer approach, you should take 3 major steps:
Divide the whole set of data into smaller pieces and handle the smaller pieces by recursive calls each taking one of the smaller pieces
Handle the end-point case (the easiest case most of the time) in each recursive call
Consolidate the results of all recursive calls of smaller pieces
The beauty of recursive calls is that you can solve a complicated problem by replacing the processing with 2 much easier steps. The 1st step is to handle the end-point case, where most of the time you deal with only ONE data item (which is usually easy). The 2nd step is just another easy step to consolidate the results of the reduced-size calls.
You managed to take the first step but not the other 2 steps. In particular, you did not take advantage of simplifying the processing by making use of the results of the smaller pieces. Instead, you handle the whole set of data in each call by looping over all rows of the 2-dimensional numpy array. The nested-loop logic is just like a "Bubble Sort" [with complexity of order(n squared) instead of order(n)]. Hence, your recursive calls are just wasting time without adding value!
I suggest modifying your recursive function as follows:
def DAC(data):
    # global bestStock                 # define bestStock as a local variable instead
    bestStock = [None, None, None, float(-math.inf)]  # init bestStock
    if len(data) == 1:                 # End-point case: data = 1 row
        bestStock[0] = data[0,0]
        bestStock[1] = data[0,0]
        bestStock[2] = data[0,1]
        bestStock[3] = 0.0
    elif len(data) == 2:               # End-point case: data = 2 rows
        bestStock[0] = data[0,0]
        bestStock[1] = data[1,0]
        bestStock[2] = data[0,1]       # Enhance here to allow stock break
        bestStock[3] = data[1,2] - data[0,2]
    elif len(data) >= 3:               # Recursive calls and consolidate results
        mid = len(data)//2
        left = data[:mid]
        right = data[mid:]
        bestStock_left = DAC(left)
        bestStock_right = DAC(right)
        # Now make use of the results of divide-and-conquer and consolidate the results
        bestStock[0] = bestStock_left[0]
        bestStock[1] = bestStock_right[1]
        bestStock[2] = bestStock_left[2]   # Enhance here to allow stock break
        bestStock[3] = bestStock_left[3] if bestStock_left[3] >= bestStock_right[3] else bestStock_right[3]
    # print(bestStock)
    # print('\n')
    return bestStock
Here we need to handle 2 kinds of end-point cases: 1 row and 2 rows. The reason is that with only 1 row we cannot calculate a gain and can only set the gain to zero; a gain can only be calculated from 2 rows onward. If we did not split into these 2 end-point cases, we could end up propagating a zero gain all the way up.
This is a demo of how you should code the recursive calls to take advantage of them. There are limitations in the code that you still need to fine-tune: you have to enhance it further to handle the stock-break case. The branches for 2 rows and >= 3 rows currently assume no stock break.
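The stock-break case mentioned above is essentially the situation where the cheapest buy falls in the left half and the best sell in the right half. Purely as an illustrative sketch (not part of the answer above; the helper name cross_gain is made up), a cross-boundary consolidation could look something like this, assuming the same (date, company, close) row layout with rows sorted by date:
def cross_gain(left, right):
    """Best gain from buying in the left half and selling in the right half.

    Assumes rows are (date, company, close) and all dates in `left` precede
    those in `right`. Returns -inf if no company appears in both halves.
    """
    best = float('-inf')
    companies = set(left[:, 1]) & set(right[:, 1])
    for company in companies:
        buy = left[left[:, 1] == company][:, 2].min()     # cheapest buy on the left
        sell = right[right[:, 1] == company][:, 2].max()  # best sell on the right
        best = max(best, sell - buy)
    return best

# Inside the len(data) >= 3 branch, the consolidated gain would then be
# max(bestStock_left[3], bestStock_right[3], cross_gain(left, right)).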

Pandas merge - combination of and and or conditions [duplicate]

Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I am having problems with and what I have tried already.
I am trying to join (merge) together two dataframe objects on multiple conditions. I know how to do this if the conditions to be met are all 'equals' operators, however, I need to make use of LESS THAN and MORE THAN.
The dataframes represent genetic information: one is a list of mutations in the genome (referred to as SNPs) and the other provides information on the locations of the genes on the human genome. Performing df.head() on these returns the following:
SNP DataFrame (snp_df):
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 752721
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
This shows the SNP reference ID and their locations. 'BP' stands for the 'Base-Pair' position.
Gene DataFrame (gene_df):
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
This dataframe shows the locations of all the genes of interest.
What I want to find out is all of the SNPs which fall within the gene regions in the genome, and discard those that are outside of these regions.
If I wanted to merge together two dataframes based on multiple (equals) conditions, I would do something like the following:
merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])
However, in this instance - I need to find the SNPs where the chromosome values match those in the Gene dataframe, and the BP value falls between 'chr_start' and 'chr_stop'. What makes this challenging is that these dataframes are quite large. In this current dataset the snp_df has 6795021 rows, and the gene_df has 34362.
I have tried to tackle this by looking at either chromosomes or genes separately. There are 22 different chromosome values (ints 1-22) as the sex chromosomes are not used. Both methods are taking an extremely long time. One uses the pandasql module, while the other approach is to loop through the separate genes.
SQL method
import pandas as pd
import pandasql as psql

pysqldf = lambda q: psql.sqldf(q, globals())

q = """
SELECT s.SNP, g.feature_id
FROM this_snp s INNER JOIN this_genes g
WHERE s.BP >= g.chr_start
AND s.BP <= g.chr_stop;
"""

all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    genic_snps = pysqldf(q)
    all_dfs.append(genic_snps)

all_genic_snps = pd.concat(all_dfs)
Gene iteration method
all_dfs = []
for line in gene_df.iterrows():
    info = line[1]  # Getting the Series object
    this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
                          (snp_df['BP'] >= info['chr_start']) &
                          (snp_df['BP'] <= info['chr_stop'])]
    if this_snp.shape[0] != 0:
        this_snp = this_snp[['SNP']]
        this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
        all_dfs.append(this_snp)

all_genic_snps = pd.concat(all_dfs)
Can anyone give any suggestions of a more effective way of doing this?
I've just thought of a way to solve this - by combining my two methods:
First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This also doesn't require any SQL queries. I've included a section to immediately discard any redundant genes that don't have any SNPs falling within their range. This makes use of a double for-loop, which I normally try to avoid, but in this case it works quite well.
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]

    # Getting rid of redundant genes
    min_bp = this_chr_snp['BP'].min()
    max_bp = this_chr_snp['BP'].max()
    this_genes = this_genes.loc[~(this_genes['chr_start'] >= max_bp) &
                                ~(this_genes['chr_stop'] <= min_bp)]

    for line in this_genes.iterrows():
        info = line[1]
        this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                                    (this_chr_snp['BP'] <= info['chr_stop'])]
        if this_snp.shape[0] != 0:
            this_snp = this_snp[['SNP']]
            this_snp.insert(1, 'feature_id', info['feature_id'])
            all_dfs.append(this_snp)

all_genic_snps = pd.concat(all_dfs)
While this doesn't run spectacularly quickly - it does run so that I can actually get some answers. I'd still like to know if anyone has any tips to make it run more efficiently though.
You can use the following to accomplish what you're looking for:
merged_df = snp_df.merge(gene_df, on=['chromosome'], how='inner')
merged_df = merged_df[(merged_df.BP >= merged_df.chr_start) &
                      (merged_df.BP <= merged_df.chr_stop)][['SNP', 'feature_id']]
Note: your example dataframes do not meet your join criteria. Here is an example using modified dataframes:
snp_df
Out[193]:
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 30400
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
gene_df
Out[194]:
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
merged_df
Out[195]:
SNP feature_id
8 rs3131972 GeneID:100302278
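As a rough sketch only (my own combination of this answer with the per-chromosome splitting already used in the question, not something stated above), the merge-then-filter step can be run one chromosome at a time so the intermediate frame never holds the full cross product of ~6.8M SNPs and ~34k genes:
import pandas as pd

all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]

    # inner join within the chromosome, then keep SNPs inside the gene range
    merged = this_snp.merge(this_genes, on='chromosome', how='inner')
    merged = merged[(merged.BP >= merged.chr_start) & (merged.BP <= merged.chr_stop)]
    all_dfs.append(merged[['SNP', 'feature_id']])

all_genic_snps = pd.concat(all_dfs, ignore_index=True)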

Compare each pair of dates in two columns in python efficiently

I have a data frame with a column of start dates and a column of end dates. I want to check the integrity of the dates by ensuring that the start date is before the end date (i.e. start_date < end_date). I have over 14,000 observations to run through.
I have data in the form of:
Start End
0 2008-10-01 2008-10-31
1 2006-07-01 2006-12-31
2 2000-05-01 2002-12-31
3 1971-08-01 1973-12-31
4 1969-01-01 1969-12-31
I have added a column to write the result to, even though I just want to highlight whether there are incorrect ones so I can delete them:
dates['Correct'] = " "
And have began to check each date pair using the following, where my dataframe is called dates:
for index, row in dates.iterrows():
    if dates.Start[index] < dates.End[index]:
        dates.Correct[index] = "correct"
    elif dates.Start[index] == dates.End[index]:
        dates.Correct[index] = "same"
    elif dates.Start[index] > dates.End[index]:
        dates.Correct[index] = "incorrect"
This works, it is just taking a really long time (over 15 minutes). I need more efficient code - is there something I am doing wrong or could improve?
Why not just do it in a vectorized way:
is_correct = dates['Start'] < dates['End']
is_incorrect = dates['Start'] > dates['End']
is_same = ~is_correct & ~is_incorrect
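If the goal is still to write the labels into the Correct column as in the question, one way (sketched here with numpy.select, which is not part of the answer above) is:
import numpy as np

dates['Correct'] = np.select(
    [dates['Start'] < dates['End'], dates['Start'] > dates['End']],
    ['correct', 'incorrect'],
    default='same',  # start equals end
)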
Since the list doesn't need to be compared sequentially, you can gain performance by splitting your dataset and then using multiple processes to perform the comparison simultaneously. Take a look at the multiprocessing module for help.
Something like the following may be quicker:
import pandas as pd
import datetime

df = pd.DataFrame({
    'start': ["2008-10-01", "2006-07-01", "2000-05-01"],
    'end': ["2008-10-31", "2006-12-31", "2002-12-31"],
})

def comparison_check(df):
    start = datetime.datetime.strptime(df['start'], "%Y-%m-%d").date()
    end = datetime.datetime.strptime(df['end'], "%Y-%m-%d").date()
    if start < end:
        return "correct"
    elif start == end:
        return "same"
    return "incorrect"
In [23]: df.apply(comparison_check, axis=1)
Out[23]:
0 correct
1 correct
2 correct
dtype: object
Timings
In [26]: %timeit df.apply(comparison_check, axis=1)
1000 loops, best of 3: 447 µs per loop
So by my calculations, 14,000 rows should take (447 µs / 3) * 14,000 = (149 µs) * 14,000 ≈ 2.086 s, quite a bit shorter than 15 minutes :)

Inexpensive way to add time series intensity in python pandas dataframe

I am trying to sum (and plot) a total from functions which change states at different times using Python's Pandas.DataFrame. For example:
Suppose we have 3 people whose states can be a) holding nothing, b) holding a 5 pound weight, and c) holding a 10 pound weight. Over time, these people pick weights up and put them down. I want to plot the total amount of weight being held. So, given:
My brute force attempt:
import pandas as ps
import math
import numpy as np

person1=[3,0,10,10,10,10,10]
person2=[4,0,20,20,25,25,40]
person3=[5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(zip(person1,person2,person3)).T)
allPeopleDf.columns=['count','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDfNoCount=allPeopleDf[['start1', 'end1', 'start2', 'end2', 'start3','end3']]
uniqueTimes=sorted(ps.unique(allPeopleDfNoCount.values.ravel()))
possibleStates=[-1,0,1,2] #extra state 0 for initialization
stateData={}
comboStates={}
#initialize dict to add up all of the stateData
for time in uniqueTimes:
    comboStates[time]=0.0
allPeopleDf['track']=-1
allPeopleDf['status']=-1
numberState=len(possibleStates)
starti=-1
endi=0
startState=0
for i in range(3):
    starti=starti+2
    print starti
    endi=endi+2
    for time in uniqueTimes:
        def helper(row):
            start=row[starti]
            end=row[endi]
            track=row[7]
            if start <= time and time < end:
                return possibleStates[i+1]
            else:
                return possibleStates[0]
        def trackHelp(row):
            status=row[8]
            track=row[7]
            if track<=status:
                return status
            else:
                return track
        def Multiplier(row):
            x=row[8]
            if x==0:
                return 0.0*row[0]
            if x==1:
                return 5.0*row[0]
            if x==2:
                return 10.0*row[0]
            if x==-1:#numeric place holder for non-contributing
                return 0.0*row[0]
        allPeopleDf['status']=allPeopleDf.apply(helper,axis=1)
        allPeopleDf['track']=allPeopleDf.apply(trackHelp,axis=1)
        stateData[time]=allPeopleDf.apply(Multiplier,axis=1).sum()
    for k,v in stateData.iteritems():
        comboStates[k]=comboStates.get(k,0)+v
print allPeopleDf
print stateData
print comboStates
Plots of weight being held over time might look like the following:
And the sum of the intensities over time might look like the black line in the following:
with the black line defined with the Cartesian points: (0,0 lbs),(5,0 lbs),(5,5 lbs),(15,5 lbs),(15,10 lbs),(20,10 lbs),(20,15 lbs),(25,15 lbs),(25,20 lbs),(40,20 lbs). However, I'm flexible and don't necessarily need to define the combined intensity line as a set of Cartesian points. The unique times can be found with:
print list(set(uniqueTimes).intersection(allNoCountT[1].values.ravel())).sort()
,but I can't come up with a slick way of getting the corresponding intensity values.
I started out with a very ugly function to break apart each "person's" graph so that all people had start and stop times at the same moments (albeit many stop and start times without a state change), and then I could add up all the "chunks" of time. This was cumbersome; there has to be a slick pandas way of handling this. If anyone can offer a suggestion or point me to another SO question that I might have missed, I'd appreciate the help!
In case my simplified example isn't clear, another might be plotting the intensity of sound coming from a piano: there are many notes being played for different durations with different intensities. I would like the sum of intensity coming from the piano over time. While my example is simplistic, I need a solution that is more on the scale of a piano song: thousands of discrete intensity levels per key, and many keys contributing over the course of a song.
Edit--Implementation of mgab's provided solution:
import pandas as ps
import math
import numpy as np
person1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
person2=['person2',4,0,20,20,25,25,40]
person3=['person3',5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(zip(person1,person2,person3)).T)
allPeopleDf.columns=['id','intensity','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDf=ps.melt(allPeopleDf,id_vars=['intensity','id'])
allPeopleDf.columns=['intensity','id','timeid','time']
df=ps.DataFrame(allPeopleDf).drop('timeid',1)
df[df.id=='person1'].drop('id',1) #easier to visualize one id for check
df['increment']=df.groupby('id')['intensity'].transform( lambda x: x.sub(x.shift(), fill_value= 0 ))
TypeError: unsupported operand type(s) for -: 'str' and 'int'
End Edit
Going for the piano keys example, let's assume you have three keys, with 30 levels of intensity.
I would try to keep the data in this format:
import pandas as pd
df = pd.DataFrame([[10,'A',5],
                   [10,'B',7],
                   [13,'C',10],
                   [15,'A',15],
                   [20,'A',7],
                   [23,'C',0]], columns=["time", "key", "intensity"])
time key intensity
0 10 A 5
1 10 B 7
2 13 C 10
3 15 A 15
4 20 A 7
5 23 C 0
where you record every change in intensity of any of the keys. From here you can already get the Cartesian coordinates for each individual key as (time,intensity) pairs
df[df.key=="A"].drop('key',1)
time intensity
0 10 5
3 15 15
4 20 7
Then, you can easily create a new column increment that will indicate the change in intensity that occurred for that key at that time point (intensity indicates just the new value of intensity)
df["increment"]=df.groupby("key")["intensity"].transform(
lambda x: x.sub(x.shift(), fill_value= 0 ))
df
time key intensity increment
0 10 A 5 5
1 10 B 7 7
2 13 C 10 10
3 15 A 15 10
4 20 A 7 -8
5 23 C 0 -10
And then, using this new column, you can generate the (time, total_intensity) pairs to use as Cartesian coordinates
df.groupby("time").sum()["increment"].cumsum()
time
10 12
13 22
15 32
20 24
23 14
dtype: int64
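If the end goal is the plot (or the explicit Cartesian points) asked about in the question, the cumulative totals above translate directly into a step plot. A minimal sketch, assuming matplotlib is available:
import matplotlib.pyplot as plt

totals = df.groupby("time")["increment"].sum().cumsum()
points = list(zip(totals.index, totals.values))  # (time, total_intensity) pairs

plt.step(totals.index, totals.values, where="post")  # intensity holds until the next change
plt.xlabel("time")
plt.ylabel("total intensity")
plt.show()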
EDIT: applying the specific data presented in question
Assuming the data comes as a list of values, starting with the element id (person/piano key), then a factor multiplying the measured weight/intensities for this element, and then pairs of time values indicating the start and end of a series of known states (weight being carried/intensity being emitted). Not sure if I got the data format right. From your question:
data1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
data2=['person2',4,0,20,20,25,25,40]
data3=['person3',5,0,5,5,15,15,40]
And if we know the weight/intensity of each one of the states, we can define:
known_states = [5, 10, 15]
DF_columns = ["time", "id", "intensity"]
Then, the easiest way I came up with to load the data includes this function:
import pandas as pd

def read_data(data, states, columns):
    id = data[0]
    factor = data[1]
    reshaped_data = []
    for i in xrange(len(states)):
        j = 2 + 2*i
        if not data[j] == data[j+1]:
            reshaped_data.append([data[j], id, factor*states[i]])
            reshaped_data.append([data[j+1], id, -1*factor*states[i]])
    return pd.DataFrame(reshaped_data, columns=columns)
Notice that the if not data[j] == data[j+1]: avoids loading data to the dataframe when start and end times for a given state are equal (seems uninformative, and wouldn't appear in your plots anyway). But take it out if you still want these entries.
Then, you load the data:
df = read_data(data1, known_states, DF_columns)
df = df.append(read_data(data2, known_states, DF_columns), ignore_index=True)
df = df.append(read_data(data3, known_states, DF_columns), ignore_index=True)
# and so on...
And then you're right at the beginning of this answer (substituting 'key' by 'id' and the ids, of course)
Appears to be what .sum() is for:
In [10]:
allPeopleDf.sum()
Out[10]:
aStart 0
aEnd 35
bStart 35
bEnd 50
cStart 50
cEnd 90
dtype: int32
