I have multiple timeseries that are outputs of various algorithms. Each algorithm takes various parameters and produces a timeseries as its result:
timestamp1 = 1
value1 = 5
timestamp2 = 2
value2 = 8
timestamp3 = 3
value3 = 4
timestamp4 = 4
value4 = 12

resultsOfAlgorithms = [
    {
        'algorithm': 'minmax',
        'param-a': '12',
        'param-b': '200',
        'result-of-algorithm': [[timestamp1, value1], [timestamp2, value2]]
    },
    {
        'algorithm': 'minmax',
        'param-a': '12',
        'param-b': '30',
        'result-of-algorithm': [[timestamp1, value1], [timestamp3, value3]]
    },
    {
        'algorithm': 'minmax',
        'param-a': '12',
        'param-b': '30',
        'result-of-algorithm': [[timestamp2, value2], [timestamp4, value4]]
    },
    {
        'algorithm': 'delta',
        'param-a': '12',
        'param-b': '50',
        'result-of-algorithm': [[timestamp2, value2], [timestamp4, value4]]
    }
]
I would like to be able to filter the timeseries by algorithm and parameters and plot the filtered timeseries to see how given parameters affect the output. To do that I need to know all the occurring values for a given parameter, and then be able to select the timeseries with the desired parameters. E.g. I would like to plot all results of the minmax algorithm with param-b==30. Two results were produced with the minmax algorithm and param-b==30, so I would like a plot with 2 timeseries in it.
Is this possible with pandas, or is it outside of pandas' functionality? How could this be implemented?
Edit:
Searching the internet some more, I think I am looking for a way to use hierarchical indexing. Also, the timeseries should stay separated: each result is an individual timeseries and should not be merged together with the other results. I need to filter the results of the algorithms by the parameters used, and the result of the filter should still be a list of timeseries.
Edit 2:
There are multiple sub-problems:
1. Find all existing values for each parameter (the user does not know all the values, since parameters can be auto-generated by the system).
2. The user selects some of the values for filtering. One way the user could provide this is a dictionary (but more user-friendly ideas are welcome):
filter = {
    'param-b': [30, 50],
    'algorithm': 'minmax'
}
The timeseries from resultsOfAlgorithms[1:3] (the 2nd and 3rd results) are returned by this filtering, since those results were produced by the minmax algorithm with param-b equal to 30. Thus in this case:
[
    [[timestamp1, value1], [timestamp3, value3]],
    [[timestamp2, value2], [timestamp4, value4]]
]
The filtering will return multiple timeseries, which I want to plot and compare.
3. The user wants to try various filters to see how they affect the results.
I am doing all this in a Jupyter notebook, and I would like to let the user try various filters with as little hassle as possible.
Timestamps are not necessarily shared between results. E.g. all the timeseries might occur between 1pm and 3pm and have roughly the same number of values, but neither the timestamps nor the number of values are identical across results.
There are two options here: one is to clean up the dict first and then easily convert it to a dataframe; the other is to convert it to a dataframe first and then clean up the column holding the nested lists. For the first approach, you can just restructure the dict like this:
import pandas as pd
from collections import defaultdict

# Flatten every result into long format: one row per (timestamp, value) pair.
data = defaultdict(list)
for roa in resultsOfAlgorithms:
    for i in range(len(roa['result-of-algorithm'])):
        data['algorithm'].append(roa['algorithm'])
        data['param-a'].append(roa['param-a'])
        data['param-b'].append(roa['param-b'])
        data['time'].append(roa['result-of-algorithm'][i][0])
        data['value'].append(roa['result-of-algorithm'][i][1])

df = pd.DataFrame(data)
In [31]: df
Out[31]:
  algorithm param-a param-b  time  value
0    minmax      12     200     1      5
1    minmax      12     200     2      8
2    minmax      12      30     1      5
3    minmax      12      30     3      4
4    minmax      12      30     2      8
5    minmax      12      30     4     12
6     delta      12      50     2      8
7     delta      12      50     4     12
And from here you can do whatever analysis you need with it, whether that's plotting, making the time column the index, grouping and aggregating, and so on. You can compare this to making the dataframe first in this link:
Splitting a List inside a Pandas DataFrame
where they do essentially the same thing, splitting a column of lists into multiple rows. I think fixing the dictionary will be easier, though, depending on how representative your fairly simple example is of the real data.
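For the filtering and plotting you describe, a minimal sketch could look like the following. Two assumptions go beyond the code above: a result_id column is added during flattening so that two runs with identical parameters stay separate timeseries, and the parameter values are compared as strings because that is how they appear in the original dicts:
import pandas as pd
import matplotlib.pyplot as plt
from collections import defaultdict

# Same flattening as above, plus a result_id to keep each run distinct.
data = defaultdict(list)
for result_id, roa in enumerate(resultsOfAlgorithms):
    for t, v in roa['result-of-algorithm']:
        data['result_id'].append(result_id)
        data['algorithm'].append(roa['algorithm'])
        data['param-a'].append(roa['param-a'])
        data['param-b'].append(roa['param-b'])
        data['time'].append(t)
        data['value'].append(v)
df = pd.DataFrame(data)

# Filter: all minmax results with param-b == '30', plotted as separate lines.
subset = df[(df['algorithm'] == 'minmax') & (df['param-b'] == '30')]
fig, ax = plt.subplots()
for rid, grp in subset.groupby('result_id'):
    grp.plot(x='time', y='value', ax=ax, label='result {}'.format(rid))
plt.show()
The first sub-problem, finding all occurring values of a parameter, is then just df['param-b'].unique().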
Edit: If you wanted to turn this into a multi-index, you can add one more line:
df_mi = df.set_index(['algorithm', 'param-a', 'param-b'])
In [25]: df_mi
Out[25]:
                           time  value
algorithm param-a param-b
minmax    12      200         1      5
                  200         2      8
                  30          1      5
                  30          3      4
                  30          2      8
                  30          4     12
delta     12      50          2      8
                  50          4     12
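With the multi-index in place, the filtering from the question becomes a single lookup. A small sketch (assuming the parameters are strings, as they are in the original dicts):
# All minmax rows with param-b == '30', selected by index level.
df_mi.xs(('minmax', '30'), level=['algorithm', 'param-b'])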
Related
Hello, I want to store a dataframe in another dataframe's cell.
I have daily data consisting of date, steps, and calories. In addition, I have minute-by-minute HR data for a specific date. Obviously it would be easy to put the minute-by-minute data in a 2-dimensional list, but I fear that would make it harder to analyze later.
What would be the best practice when I want both datasets in one dataframe? Is it even possible to nest dataframes?
Any better ideas? Thanks!
Yes, it is possible to nest dataframes, but I would recommend instead rethinking how you want to structure your data, which depends on your application and the analyses you want to run on it afterwards.
How to "nest" dataframes into another dataframe
Your dataframe containing your nested "sub-dataframes" won't be displayed very nicely. However, just to show that it is possible to nest your dataframes, take a look at this mini-example:
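For reference, the three random dataframes below could be created like this (a sketch; the exact values will differ since they are random):
import numpy as np
import pandas as pd

df1, df2, df3 = (pd.DataFrame(np.random.rand(3, 3)) for _ in range(3))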
Here we have 3 random dataframes:
>>> df1
0 1 2
0 0.614679 0.401098 0.379667
1 0.459064 0.328259 0.592180
2 0.916509 0.717322 0.319057
>>> df2
0 1 2
0 0.090917 0.457668 0.598548
1 0.748639 0.729935 0.680409
2 0.301244 0.024004 0.361283
>>> df3
0 1 2
0 0.200375 0.059798 0.665323
1 0.086708 0.320635 0.594862
2 0.299289 0.014134 0.085295
We can make a main dataframe that includes these dataframes as values in individual "cells":
df = pd.DataFrame({'idx':[1,2,3], 'dfs':[df1, df2, df3]})
We can then access these nested dataframes as we would access any value in any other dataframe:
>>> df['dfs'].iloc[0]
0 1 2
0 0.614679 0.401098 0.379667
1 0.459064 0.328259 0.592180
2 0.916509 0.717322 0.319057
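As for restructuring instead of nesting: one flat alternative, sketched under the assumption that the sub-dataframes share the same columns, is to concatenate them with keys so the source ends up as an index level rather than a nested cell:
# One row group per original dataframe, distinguished by the 'source' level.
combined = pd.concat({'df1': df1, 'df2': df2, 'df3': df3}, names=['source'])
This keeps everything in one queryable dataframe and displays properly, at the cost of requiring compatible columns across the pieces.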
I am trying to create a dictionary from a dataframe in the following way.
My dataframe contains a column called station_id. The station_id values are unique, that is, each row corresponds to one station. There is another column called trip_id (see the example below), and many stations can be associated with a single trip_id. For example:
l1 = [1, 1, 2, 2]
l2 = [34, 45, 66, 67]
df1 = pd.DataFrame(list(zip(l1, l2)), columns=['trip_id', 'station_name'])
df1.head()
   trip_id  station_name
0        1            34
1        1            45
2        2            66
3        2            67
I am trying to get a dictionary d={1:[34,45],2:[66,67]}.
I solved it with a for loop in the following fashion.
from tqdm import tqdm

Trips_Stations = {}
Trips = set(df['trip_id'])
T = list(Trips)
for i in tqdm(range(len(Trips))):
    c_id = T[i]
    Values = list(df[df.trip_id == c_id].stop_id)
    Trips_Stations.update({c_id: Values})

Trips_Stations
My actual dataset has about 65000 rows. The above takes about 2 minutes to run. While this is acceptable for my application, I was wondering if there is a faster way to do it using base pandas.
Thanks
Somehow Stack Overflow suggested that I look at groupby. This is much faster:
d = df.groupby('trip_id')['stop_id'].apply(list)

from collections import OrderedDict
o_d = d.to_dict(OrderedDict)
o_d = dict(o_d)
It took about 30 secs for the dataframe with 65000 rows.
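Applied to the small example frame from the question, the same idea looks like this (a sketch; note the example frame uses station_name where the real data uses stop_id):
d = df1.groupby('trip_id')['station_name'].apply(list).to_dict()
# d == {1: [34, 45], 2: [66, 67]}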
I have a data frame with particular readouts as indexes (different types of measurements for a given sample), and each column is a sample for which these readouts were taken. A treatment group is assigned as the column name of each sample. You can see the example below.
What I need to do: for a given readout (row), group the samples by treatment (column name) and perform a t-test (Welch's t-test) on each group (each treatment). The t-test must be done as a comparison with one fixed treatment (the control treatment). I do not care about tracking the sample ids (they were required; I dropped them on purpose), and I'm not going to do paired tests.
For example here, for readout1 I need to compare treatment1 vs treatment3 and treatment2 vs treatment3 (it's OK if I also get treatment3 vs treatment3).
Example of data frame:
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(27).reshape((3, 9)),
                     index=['readout1', 'readout2', 'readout3'],
                     columns=['treatment1', 'treatment1', 'treatment1',
                              'treatment2', 'treatment2', 'treatment2',
                              'treatment3', 'treatment3', 'treatment3'])
frame
Out[757]:
          treatment1  treatment1  ...  treatment3  treatment3
readout1           0           1  ...           7           8
readout2           9          10  ...          16          17
readout3          18          19  ...          25          26

[3 rows x 9 columns]
I've been fighting with this for several days now. I tried unstacking/stacking the data, transposing the data frame, then grouping by index, removing NaNs and applying a lambda. I tried other strategies but none worked. I will appreciate any help.
Thank you!
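One possible approach, sketched under the assumptions that scipy is available and that treatment3 is the control: iterate over the rows of frame, slice each row by the duplicated column labels, and run scipy.stats.ttest_ind with equal_var=False, which is Welch's t-test:
import pandas as pd
from scipy import stats

control = 'treatment3'  # assumption: treatment3 is the control group

pvalues = {}
for readout, row in frame.iterrows():
    # Duplicated column labels slice all replicates of a treatment at once.
    control_values = row[control].values
    for treatment in row.index.unique().drop(control):
        t_stat, p = stats.ttest_ind(row[treatment].values, control_values,
                                    equal_var=False)
        pvalues[(readout, treatment)] = p

result = pd.Series(pvalues)  # indexed by (readout, treatment) pairs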
I am new to Python and I need to plot a graph of the correlation coefficient of each attribute against the target value. I have an input dataset with a huge number of values; a sample is provided below. We need to predict whether a particular consumer will leave the company or not, hence the RESULT column is the target variable.
SALARY  DUE  RENT    CALLSPERDAY  CALL DURATION  RESULT
238790    7  109354            0              6  YES
 56004    0  204611           28             15  NO
671672   27  371953            0              4  NO
786035    1  421999           19             11  YES
 89684    2  503335           25              8  NO
904285    3  522554            0             13  YES
 12072    4  307649            4             11  NO
 23621   19  389157            0              4  YES
 34769   11  291214            1             13  YES
945835   23  515777            0              5  NO
Here, as you can see, the RESULT column is a string whereas the rest of the columns are integers. Similar to RESULT, I also have a few other columns (not in the sample) which hold string values. I need to include these string-valued columns in the computation alongside the integer ones.
Using a dictionary, I assigned a numeric value to each string value.
Example: the RESULT column has YES or NO, hence I assigned values as below:
D = {'NO': 0, 'YES': 1}
and, using a lambda function, I looped through each such column of the dataset and replaced NO with 0 and YES with 1.
I tried to calculate the correlation coefficient using the formula:
pearsonr(S.SALARY,targetVarible)
Where S is the dataframe which holds all values.
Similarly, I will loop through all the columns of the dataset and calculate the correlation coefficient of each column against the target variable.
Is this an efficient way of calculating correlation coefficients? I am getting values like
(0.088327739664096655, 1.1787456108540725e-25)
and 1.18e-25 seems too small.
Is there any other way to calculate it? Would you suggest any other way to encode string values so they can be treated as integers when compared with the other integer columns (other than the dictionaries and lambdas I used)?
I also need to plot a bar graph using the same code. I am planning to use matplotlib (from matplotlib import pyplot as plt). Would you suggest any other function to plot a bar graph? Mostly I'm using sklearn, numpy and pandas and their existing functions.
It would be great if someone could help. Thanks.
As mentioned in the comments you can use df.corr() to get the correlation matrix of your data. Assuming the name of your DataFrame is df you can plot the correlation with:
df_corr = df.corr()
df_corr[['RESULT']].plot(kind='hist')
Pandas DataFrames have a plot function that uses matplotlib. You can learn more about it here: http://pandas.pydata.org/pandas-docs/stable/visualization.html
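Two side notes, offered as a sketch rather than a definitive recipe. First, pearsonr returns a (correlation, p-value) tuple, so the 1.18e-25 above is a p-value, not part of the coefficient. Second, since a bar graph was requested, kind='bar' with one bar per attribute may be closer to the goal; the file name below is hypothetical:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('consumers.csv')                 # hypothetical input file
df['RESULT'] = df['RESULT'].map({'NO': 0, 'YES': 1})

# Correlation of every attribute with the target, excluding the target itself.
corr_with_target = df.corr()['RESULT'].drop('RESULT')
corr_with_target.plot(kind='bar')
plt.ylabel('Pearson correlation with RESULT')
plt.show()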
I've been reading a huge (5 GB) gzip file in the form:
   User1  User2  W
0     11     12  1
1     12     11  2
2     13     14  1
3     14     13  2
which is basically a directed-graph representation of connections among users, each with a certain weight W. Since the file is so big, I tried to read it with networkx, building a directed graph and then converting it to undirected, but that took too much time. So I was thinking of doing the same analysis with a pandas dataframe. I would like to transform the previous dataframe into the form:
   User1  User2  W
0     11     12  3
1     13     14  3
where the links common to the two directions have been merged into one, with W being the sum of the individual weights. Any help would be appreciated.
There is probably a more concise way, but this works. The main trick is to normalize the data so that User1 always holds the lower user ID. Then you can use groupby, since (11,12) and (12,11) are now recognized as the same pair.
In [330]: df = pd.DataFrame({"User1":[11,12,13,14],"User2":[12,11,14,13],"W":[1,2,1,2]})
In [331]: df['U1'] = df[['User1','User2']].min(axis=1)
In [332]: df['U2'] = df[['User1','User2']].max(axis=1)
In [333]: df = df.drop(['User1','User2'],axis=1)
In [334]: df.groupby(['U1','U2'])['W'].sum()
Out[334]:
U1 U2
11 12 3
13 14 3
Name: W, dtype: int64
For more concise code that avoids creating new variables, you could replace the middle 3 steps with:
In [400]: df.loc[df.User1 > df.User2, ['User1','User2']] = df.loc[df.User1 > df.User2, ['User2','User1']].values
Note that column switching can be trickier than you'd think, see here: What is correct syntax to swap column values for selected rows in a pandas data frame using just one line?
As far as making this code fast in general, it will depend on your data. I don't think the code above will matter as much as other things you might do. For example, your problem should be amenable to a chunking approach, where you iterate over sections of the file and gradually shrink the data on each pass. In that case, the main thing to think about is sorting the data before chunking, so as to minimize how many passes you need. Doing it that way should allow you to do all the work in memory.
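A minimal sketch of the chunked variant, assuming a whitespace-separated, gzip-compressed file named edges.txt.gz (the name is hypothetical; pandas decompresses .gz transparently):
import pandas as pd

merged = None
for chunk in pd.read_csv('edges.txt.gz', sep=r'\s+', chunksize=1_000_000):
    # Normalize each chunk so User1 <= User2, then aggregate within the chunk.
    swap = chunk['User1'] > chunk['User2']
    chunk.loc[swap, ['User1', 'User2']] = chunk.loc[swap, ['User2', 'User1']].values
    partial = chunk.groupby(['User1', 'User2'])['W'].sum()
    # Combine with the running total; pairs seen in both are summed.
    merged = partial if merged is None else merged.add(partial, fill_value=0)

result = merged.reset_index()
This keeps only the aggregated pair counts in memory, which shrinks quickly as duplicate directed edges collapse into single undirected ones.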