I'm starting to learn Pandas and am trying to find the most Pythonic (or panda-thonic?) ways to do certain tasks.
Suppose we have a DataFrame with columns A, B, and C.
Column A contains boolean values: each row's A value is either true or false.
Column B has some important values we want to plot.
What we want to discover is the subtle distinctions between B values for rows where A is false versus B values for rows where A is true.
In other words, how can I group by the value of column A (either true or false), then plot the values of column B for both groups on the same graph? The two datasets should be colored differently to be able to distinguish the points.
Next, let's add another feature to this program: before graphing, we want to compute another value for each row and store it in column D. This value is the mean of all data stored in B for the entire five minutes before a record - but we only include rows that have the same boolean value stored in A.
In other words, if I have a row where A=True and time=t, I want to compute a value for column D that is the mean of B for all records from time t-5 to t that have the same A=True.
In this case, how can we execute the groupby on values of A, then apply this computation to each individual group, and finally plot the D values for the two groups?
I think @herrfz hit all the high points. I'll just flesh out the details:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
sin = np.sin
cos = np.cos
pi = np.pi
N = 100
x = np.linspace(0, pi, N)
a = sin(x)
b = cos(x)
df = pd.DataFrame({
    'A': [True]*N + [False]*N,
    'B': np.hstack((a, b))
})
for key, grp in df.groupby('A'):
    grp = grp.copy()  # work on a copy so adding column D does not trigger SettingWithCopyWarning
    plt.plot(grp['B'], label=key)
    grp['D'] = grp['B'].rolling(window=5).mean()  # pd.rolling_mean was removed; use Series.rolling
    plt.plot(grp['D'], label='rolling ({k})'.format(k=key))
plt.legend(loc='best')
plt.show()
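Note that the question asks for the mean of B over the five minutes before each record, while the snippet above uses a fixed 5-row window. If the frame is indexed by timestamps, a time-based variant could look like the following sketch (made-up timestamps, continuing from the variables defined above):
idx = pd.date_range('2024-01-01', periods=2*N, freq='1min')  # hypothetical timestamps
df_t = pd.DataFrame({'A': [True]*N + [False]*N,
                     'B': np.hstack((a, b))}, index=idx)
# For each boolean group, D is the mean of B over the current record and the preceding 5 minutes.
df_t['D'] = df_t.groupby('A')['B'].transform(lambda s: s.rolling('5min').mean())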
I have a Pandas dataframe with two columns I am interested in: A categorical label and a timestamp. Presumably what I'm trying to do would also work with ordered numerical data. The dataframe is already sorted by timestamps in ascending order. I want to find out which label spans the longest time-window and select only the values associated with it in the original dataframe.
I have successfully tried grouping the df by label, calculating the difference, and selecting the maximum (longest time-window); however, I'm having trouble finding an expression that selects the corresponding rows in the original df using this information.
Consider this example with numerical values:
d = {'cat': ['A','A','A','A','A','A','B','B','B','B','C','C','C','C','C','C','C'],
     'val': [1,3,5,6,8,9,0,5,10,20,4,5,6,7,8,9,10]}
df = pd.DataFrame(data=d)
Here I would expect something equivalent to df.loc[df.cat == 'B'] since B has the maximum difference of all the categories.
df.groupby('cat').val.apply(lambda x: x.max() - x.min()).max()
gives me the correct difference, but I have no idea how to use this to select the correct category in the original df.
You can use idxmax to get the category with the maximum peak-to-peak value within groups (np.ptp computes maximum minus minimum). Then you can index with loc as you said, or use query:
>>> max_cat = df.groupby("cat").val.apply(np.ptp).idxmax()
>>> max_cat
"B"
>>> df.query("cat == @max_cat")  # or df.loc[df.cat == max_cat]
cat val
6 B 0
7 B 5
8 B 10
9 B 20
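A one-liner sketch of the same idea uses transform to broadcast each category's peak-to-peak value back onto its rows, then keeps the rows of the winning category:
>>> ptp = df.groupby("cat").val.transform(np.ptp)  # each row gets its own category's max minus min
>>> df[ptp == ptp.max()]
  cat  val
6   B    0
7   B    5
8   B   10
9   B   20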
I have a dataset with several columns (time series) and I would like to synchronize them; 'col2' should be the reference.
With the code below (which also generates an example df) I am able to synchronize only one column, 'col3', according to the reference 'col2'.
import pandas as pd
import numpy as np
# pip install fastdtw
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean

df = pd.DataFrame({'ID': range(0, 25),
                   'col2': np.random.randn(25) + 3,
                   'col3': np.random.randn(25) + 3,
                   'col4': np.random.randn(25) + 3,
                   'col5': np.random.randn(25) + 3})

x = np.array(df['col2'].fillna(0))
y = np.array(df['col3'].fillna(0))
distance, path = fastdtw(x, y, dist=euclidean)
result = []
for i, j in path:
    result.append([df['ID'].iloc[i],
                   df['col2'].iloc[i],
                   df['col3'].iloc[j]])
df_synchronized = pd.DataFrame(data=result,columns=['ID','col2','col3']).dropna()
df_synchronized = df_synchronized.drop_duplicates(subset=['ID'])
df_synchronized = df_synchronized.sort_values(by='ID')
df_synchronized = df_synchronized.reset_index(drop=True)
df_synchronized.head(n=3)
This produces a df_synchronized with the aligned 'ID', 'col2' and 'col3' columns.
I would like to iterate over all columns in the DataFrame and do the same for 'col4' and 'col5' as was done for 'col3'.
Simply, 'col3' needs to be replaced in a loop with 'col4' and 'col5'.
The goal would be to have the df_synchronized with all columns from df.
Is there any way to do this?
distance, path = fastdtw(x, y, dist=euclidean)
can't simply be changed to distance, path = fastdtw(x, y, z, aa, dist=euclidean).
The 'synchronization' needs to be done on one column, then saved into df_synchronized, then on the next column...
This can be done by picking one time series as a "reference" and then running distance, path = fastdtw(ref, x) for each of the other time series, collecting the alignment path (path) from each run.
With all of these time series aligned to a common reference you can create a global alignment that allows a data point from any one of the time series to be matched to its corresponding data point in all of the other time series.
This will work very well as long as all of the time series are somewhat similar to each other. Ideally the "reference" time series will be the most average/normal one (but this is not required). Finding the most "average" time series is possible by aligning each time series to all/most of the others; the one with the smallest average distance is the most "average" time series in the set.
An example of this was performed in this paper. See section 6.2 for a description and page 104 has a picture showing the results of multiple time series aligned together. That paper took an extra step of "merging" the time series together after the global alignment.
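For example, a minimal sketch of that loop on the question's frame could look like this: 'col2' is the reference and each remaining column gets its own fastdtw run against it, with a first-match mapping playing the role of the drop_duplicates step in the question's code.
import numpy as np
import pandas as pd
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean

df = pd.DataFrame({'ID': range(0, 25),
                   'col2': np.random.randn(25) + 3,
                   'col3': np.random.randn(25) + 3,
                   'col4': np.random.randn(25) + 3,
                   'col5': np.random.randn(25) + 3})

ref = np.array(df['col2'].fillna(0))
df_synchronized = df[['ID', 'col2']].copy()

for col in ['col3', 'col4', 'col5']:
    y = np.array(df[col].fillna(0))
    distance, path = fastdtw(ref, y, dist=euclidean)
    # Keep the first matched index of `col` for each reference index.
    mapping = {}
    for ref_idx, col_idx in path:
        mapping.setdefault(ref_idx, col_idx)
    df_synchronized[col] = [df[col].iloc[mapping[i]] for i in range(len(df))]

df_synchronized.head(n=3)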
Using np.interp(query, x, y) sometimes produces the same results as I calculate in Excel. Here is a case where np.interp() and Excel agree:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{'x': [-9.210,-6.908,-4.605,-2.303,0.000,2.303],
'y': [-1.867,-1.867,-2.027,-3.667,-7.850,-21.112]}
)
val = -7.313
test1 = np.interp(val, df['x'], df['y'])
And print(test1) yields -1.867. This is exactly what I calculate in Excel, and it looks right (our query value falls between the two values I highlighted in my spreadsheet).
However, test2 = np.interp(val, df['y'], df['x']) yields 2.303. In Excel I calculate -0.2956, which looks right because our query value is again between the highlighted values.
Is there some kind of weird behavior in numpy where it gets confused going from negative to zero to positive when trying to interpolate? I have tried this with a much more discretized dataframe (50 rows instead of these 6), where the values are always in increasing order, and I get the same issue.
The values in the predictor column (the second argument to np.interp) must be in increasing order. Your y column runs from -1.867 down to -21.112, so it is decreasing (remember that -21 is less than -1.8 on the number line). Use sort_values to sort the data frame in ascending order by column y, and then the output matches your Excel output.
df1 = df.sort_values(by="y")
test3 = np.interp(val, df1["y"], df1["x"])
print(test3)
-0.29565168539325837
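Note that np.interp neither sorts nor validates its inputs, so with non-increasing x-coordinates it silently returns meaningless values. Since column y happens to be in descending order in this example, simply reversing both columns is equivalent to the sort above (a small check, reusing df and val from the question):
test4 = np.interp(val, df['y'][::-1], df['x'][::-1])  # reversed y is in increasing order
print(test4)  # ~ -0.2956, matching the Excel value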
I'm looking for help with the Pandas .corr() method.
As is, I can use the .corr() method to calculate the correlation between every possible pair of columns and plot it as a heatmap:
corr = data.corr()
sns.heatmap(corr)
Which, on my dataframe of 23,000 columns, may terminate near the heat death of the universe.
I can also do the more reasonable correlation between a subset of the columns:
data2 = data[list_of_column_names]
corr = data2.corr(method="pearson")
sns.heatmap(corr)
That gives me something that I can actually use.
What I would like to do is compare a list of 20 columns with the whole dataset. The normal .corr() function can give me a 20x20 or 23,000x23,000 heatmap, but essentially I would like a 20x23,000 heatmap.
How can I add more specificity to my correlations?
Thanks for the help!
Make a list of the subset that you want (in this example it is A, B, and C), create an empty dataframe, then fill it with the desired values using a nested loop.
df = pd.DataFrame(np.random.randn(50, 7), columns=list('ABCDEFG'))

# initiate empty dataframe
corr = pd.DataFrame()
for a in list('ABC'):
    for b in df.columns:
        corr.loc[a, b] = df[a].corr(df[b])  # correlate just this pair rather than building the full matrix
corr
Out[137]:
A B C D E F G
A 1.000000 0.183584 -0.175979 -0.087252 -0.060680 -0.209692 -0.294573
B 0.183584 1.000000 0.119418 0.254775 -0.131564 -0.226491 -0.202978
C -0.175979 0.119418 1.000000 0.146807 -0.045952 -0.037082 -0.204993
sns.heatmap(corr)
After working through this last night, I came to the following answer:
import scipy.stats

# datatable imported earlier as 'data'
# Create a new dictionary
plotDict = {}
# Loop across each of the two lists that contain the items you want to compare
for gene1 in list_1:
    for gene2 in list_2:
        # Do a pearsonr comparison between the two items you want to compare
        tempDict = {(gene1, gene2): scipy.stats.pearsonr(data[gene1], data[gene2])}
        # Update the dictionary each time you do a comparison
        plotDict.update(tempDict)
# Unstack the dictionary into a DataFrame
dfOutput = pd.Series(plotDict).unstack()
# Optional: take just the pearsonr value out of the output tuple
dfOutputPearson = dfOutput.apply(lambda col: col.apply(lambda t: t[0]))
# Optional: generate a heatmap
sns.heatmap(dfOutputPearson)
Much like the other answers, this generates a heatmap, but it can be scaled to a 20,000x30 matrix without computing the correlation for the entire 20,000x20,000 set of combinations (and therefore finishes much more quickly).
Usually, calculating the correlation coefficients pairwise for all variables makes the most sense; DataFrame.corr() is a convenience function that does exactly that (for every pair).
You can also do it with scipy for specified pairs only, within a loop.
Example:
d=pd.DataFrame([[1,5,8],[2,5,4],[7,3,1]], columns=['A','B','C'])
One pair in pandas could be:
d.corr().loc['A','B']
-0.98782916114726194
Equivalent in scipy:
import scipy.stats
scipy.stats.pearsonr(d['A'].values,d['B'].values)[0]
-0.98782916114726194
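For the 20-columns-versus-everything case from the question, a vectorized sketch could use DataFrame.corrwith, which correlates every column of a frame with a single Series. The frame below is a small synthetic stand-in for the real 23,000-column data:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.standard_normal((100, 50)),
                    columns=['col{}'.format(i) for i in range(50)])
list_of_column_names = list(data.columns[:20])  # the columns of interest

# Each corrwith call correlates every column of `data` with one chosen column,
# so this builds an (all columns) x (20 columns) matrix without the full NxN corr.
cross_corr = pd.DataFrame({col: data.corrwith(data[col])
                           for col in list_of_column_names})
cross_corr.shape  # (50, 20) here; (23000, 20) on the real data
# sns.heatmap(cross_corr)  # plot as before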
I've run into an odd problem yet again.
Suppose I have the following dummy data frame (by way of demonstrating my problem):
import numpy as np
import pandas as pd
import string
# Test data frame
N = 3
col_ids = string.ascii_uppercase[:N]  # string.letters only exists in Python 2
df = pd.DataFrame(
    np.random.randn(5, 3 * N),
    columns=['{}_{}'.format(letter, coord) for letter in col_ids for coord in 'xyz'])
df
This produces:
A_x A_y A_z B_x B_y B_z C_x C_y C_z
0 -1.339040 0.185817 0.083120 0.498545 -0.569518 0.580264 0.453234 1.336992 -0.346724
1 -0.938575 0.367866 1.084475 1.497117 0.349927 -0.726140 -0.870142 -0.371153 -0.881763
2 -0.346819 -1.689058 -0.475032 -0.625383 -0.890025 0.929955 0.683413 0.819212 0.102625
3 0.359540 -0.125700 -0.900680 -0.403000 2.655242 -0.607996 1.117012 -0.905600 0.671239
4 1.624630 -1.036742 0.538341 -0.682000 0.542178 -0.001380 -1.126426 0.756532 -0.701805
Now I would like to use scipy.spatial.distance.pdist on this pandas data frame. This turns out to be a rather non-trivial process. What pdist does is to compute the distance between m points using Euclidean distance (2-norm) as the distance metric between the points. The points are arranged as m n-dimensional row vectors in the matrix X (source).
So there are a couple of things one has to do to create a function that operates on a pandas data frame so that the pdist function can be used (pdist is convenient when the number of points gets very large). I've tried making my own, which works for a one-row data frame, but I cannot get it to work on the whole data frame at once.
Here's my attempt:
from scipy.spatial.distance import pdist, squareform
import numpy as np
import pandas as pd

def Euclidean_distance(df):
    EcDist = pd.DataFrame(index=df.index)  # results container
    arr = df.values  # store the data frame values in a numpy array
    tag_list = [num for elem in arr for num in elem]  # flatten the array into a single list
    tag_list_3D = list(zip(*[iter(tag_list)] * 3))  # split into length-3 sub-lists that pdist() can work with
    EcDist = pdist(tag_list_3D)  # distances between m points using Euclidean distance (2-norm)
    return EcDist
First I create a results container in pandas form to store the result in. Second, I save the pandas data frame as a numpy array in order to get it into list form in the next step; it has to be in list form because the pdist function only operates on lists. When the data frame is saved into an array, it is stored as a list within a list, so it has to be flattened, and that flattened version is saved in the 'tag_list' variable. Third, tag_list is further split into sub-lists of length three, so that the x, y and z coordinates can be obtained for each point, which can then be used to find the Euclidean distance between all of these points (in this example there are three points, A, B and C, each being three dimensional).
As said, the function works if the data frame is a single row, but when using the function on the given example it calculates the Euclidean distance for all 5x3 points, which yields a total of 105 distances. What I want it to do is to calculate the distances per row (so pdist should only operate on one row's three points at a time). My final result, for this example, would then look something like this:
dist_1 dist_2 dist_3
0 0.807271 0.142495 1.759969
1 0.180112 0.641855 0.257957
2 0.196950 1.334812 0.638719
3 0.145780 0.384268 0.577387
4 0.044030 0.735428 0.549897
(these are just dummy numbers to show the desired shape)
Hence how do I get my function to apply to the data frame in a row-wise fashion?
Or better yet, how can I get it to perform the function on the entire data frame at once, and then store the result in a new data frame?
Any help would be very appreciated. Thanks.
If I understand correctly, you have "groups" of points. In your example each group has three points, which you call A, B and C. A is represented by three columns A_x, A_y, A_z, and likewise for B and C.
What I suggest is that you restructure your "wide-form" data into a "long" form in which each row contains only one point. Each row will then have just three columns for the coordinates, plus an additional column identifying which group the point belongs to. Here's an example:
>>> d = pandas.DataFrame(np.random.randn(12, 3), columns=["X", "Y", "Z"])
>>> d["Group"] = np.repeat([1, 2, 3, 4], 3)
>>> d
X Y Z Group
0 -0.280505 0.888417 -0.936790 1
1 0.823741 -0.428267 1.483763 1
2 -0.465326 0.005103 -1.107431 1
3 -1.009077 -1.618600 -0.443975 2
4 0.535634 0.562617 1.165269 2
5 1.544621 -0.858873 -0.349492 2
6 0.839795 0.720828 -0.973234 3
7 -2.273654 0.125304 0.469443 3
8 -0.179703 0.962098 -0.179542 3
9 -0.390777 -0.715896 -0.897837 4
10 -0.030338 0.746647 0.250173 4
11 -1.886581 0.643817 -2.658379 4
The three points with Group==1 correspond to A, B and C in your first row; the three points with Group==2 correspond to A, B, and C in your second row; etc.
With this structure, computing the pairwise distances by group using pdist becomes straightforward:
>>> d.groupby('Group')[["X", "Y", "Z"]].apply(lambda g: pandas.Series(distance.pdist(g), index=["D1", "D2", "D3"]))
D1 D2 D3
Group
1 2.968517 0.918435 2.926395
2 3.119856 2.665986 2.309370
3 3.482747 1.314357 2.346495
4 1.893904 2.680627 3.451939
It is possible to do a similar thing with your existing setup, but it will be more awkward. The problem with the way you set it up is that you have encoded critical information in a difficult-to-extract way. The information about which columns are X coordinates and which are Y or Z coordinates, as well as the information about which columns refer to point A versus B or C, in your setup, is encoded in the textual names of the columns. You as a human can see which columns are X values just by looking at them, but specifying that programmatically requires parsing the string names of the columns.
You can see this in how you made the column names with your '{}_{}'.format(letter, coord) business. This means that in order to use pdist on your data, you will have to do the reverse operation of parsing the column names as strings to decide which columns to compare. Needless to say, this will be awkward. On the other hand, if you put the data into "long" form, there is no such difficulty: the X coordinates of all points line up in one column, and likewise for Y and Z, and the information about which points are to be compared is also contained in one column (the "Group" column).
When you want to do large-scale operations on subsets of data, it's usually better to split out things into separate rows. This allows you to leverage the power of groupby, and is also usually what is expected by scipy tools.
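As a sketch of that reshaping step, the wide frame from the question can be converted to the long form by splitting the '<letter>_<coord>' column names:
import numpy as np
import pandas as pd
import string

# Rebuild the question's wide frame.
N = 3
col_ids = string.ascii_uppercase[:N]
wide = pd.DataFrame(
    np.random.randn(5, 3 * N),
    columns=['{}_{}'.format(letter, coord) for letter in col_ids for coord in 'xyz'])

# Split each '<letter>_<coord>' name into (Point, Coord) and stack the Point level into rows.
wide.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split('_')) for c in wide.columns], names=['Point', 'Coord'])
long_df = wide.stack(level='Point').reset_index().rename(columns={'level_0': 'Group'})
# long_df now has columns Group, Point, x, y, z (one row per point), so the
# groupby('Group') + pdist recipe above applies, with lowercase x, y, z names.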