Pandas dataframe: how to cluster together groups by values without machine learning? - python

I have the following pandas DataFrame.
import pandas as pd
df = pd.read_csv('filename.csv')
print(df)
     A   B   C         D
0    2   0  11  0.053095
1    2   0  11  0.059815
2    0  35  11  0.055268
3    0  35  11  0.054573
4    0   1  11  0.054081
5    0   2  11  0.054426
6    0   1  11  0.054426
7    0   1  11  0.054426
8   42   7   3  0.048208
9   42   7   3  0.050765
10  42   7   3  0.053250
....
The problem is, the data is naturally "clustered" into groups, but the group labels are not given. From the above, rows 0-1 are one group, rows 2-3 are a group, rows 4-7 are a group, and rows 8-10 are a group.
I need to impute this information. One could use machine learning; however, is it possible to do this only using pandas?
Could one group by the values of the columns to create these groups? The problem is that the values are not exact: for the third group, column B has the values 1, 2, 1, 1.

A pure pandas solution would involve binning, assuming that values within a cluster are close to each other and that your bin size is large enough to absorb within-cluster variation while staying smaller than the distance between clusters. Whether that assumption holds depends on your data.
The binning approach uses the cut function in pandas: you pass it a series (or array) and the number of bins you want, it evenly subdivides the range of the series into that many bins, and it reports which bin each input value falls into. That bin label is what you can group by, following your original train of thought.
In practice, for bins of size ~5, this would look like:
import numpy as np

for col in list(df.columns):  # iterate over the original columns only
    binned_name = col + '_binned'
    num_bins = int(np.ceil(df[col].max() / 5))  # pd.cut needs an integer bin count
    df[binned_name] = pd.cut(df[col], num_bins, labels=False)
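From there, one way (not the only one) to turn the binned columns into group labels is to group on them, assuming the loop above has been run:
binned_cols = [c for c in df.columns if c.endswith('_binned')]
df['cluster'] = df.groupby(binned_cols).ngroup()
Rows that fall into the same combination of bins receive the same cluster id.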

Why do we need to add : when defining a new column using .iloc function

When we make a new column in a dataset in pandas
df["Max"] = df.iloc[:, 5:7].sum(axis=1)
If we are only taking the columns from index 5 to index 7, why do we need to pass : for the rows as well?
pandas.DataFrame.iloc is used purely for integer-location based indexing, i.e. selection by position (see the pandas documentation). The : means all rows in the selected columns, here column indices 5 and 6 (iloc does not include the stop index).
You are using .iloc to take a slice out of the dataframe and apply an aggregate function across the columns of that slice.
Consider an example:
df = pd.DataFrame({"a":[0,1,2],"b":[2,3,4],"c":[4,5,6]})
df
would produce the following dataframe
a b c
0 0 2 4
1 1 3 5
2 2 4 6
You are using iloc to avoid dealing with named columns, so that
df.iloc[:,1:3]
would look as follows
b c
0 2 4
1 3 5
2 4 6
Now a slight modification of your code gets you the row-wise sums across those columns:
df.iloc[:,1:3].sum(axis=1)
0 6
1 8
2 10
Alternatively you could use function application:
df.apply(lambda x: x.iloc[1:3].sum(), axis=1)
0 6
1 8
2 10
This explicitly tells pandas to apply sum across the selected columns of each row. However, your original syntax is more succinct and preferable to explicit function application; the result is the same either way.
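For completeness, assigning the result back creates the new column, mirroring your original line (here with column positions 1:3 for this small frame):
df["Max"] = df.iloc[:, 1:3].sum(axis=1)
df
   a  b  c  Max
0  0  2  4    6
1  1  3  5    8
2  2  4  6   10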

Efficiency: Dropping rows with the same timestamp while keeping the median of the second column for that timestamp

What I wanna do:
Column 'angle' has about 20 tracked angles per second (this can vary), but my 'Time' timestamp only has an accuracy of 1s, so roughly ~20 rows share the same timestamp (the dataframe has over 1 million rows in total).
My result should be a new dataframe with a distinct timestamp per row. The angle for each timestamp should be the median of the ~20 angle values in that interval.
My Idea:
I iterate through the rows and check if the timestamp has changed.
If so, I select all timestamps until it changes, calculate the median, and append it to a new dataframe.
Nevertheless, I have many big data files and I am wondering if there is a faster way to achieve my goal.
Right now my code is the following (see below).
It is not fast and I think there must be a better way to do that with pandas/numpy (or something else?).
a = 0
for i in range(1, len(df1.index)):
    if df1.iloc[[a], [1]].iloc[0][0] == df1.iloc[[i], [1]].iloc[0][0]:
        continue
    else:
        if a == 0:
            df_result = df1[a:i-1].median()
        else:
            df_result = df_result.append(df1[a:i-1].median(), ignore_index=True)
        a = i
You can use groupby here. Below, I made a simple dummy dataframe.
import pandas as pd
df1 = pd.DataFrame({'time': [1,1,1,1,1,1,2,2,2,2,2,2],
                    'angle': [8,9,7,1,4,5,11,4,3,8,7,6]})
df1
time angle
0 1 8
1 1 9
2 1 7
3 1 1
4 1 4
5 1 5
6 2 11
7 2 4
8 2 3
9 2 8
10 2 7
11 2 6
Then, we group by the timestamp and take the median of the angle column within that group, and convert the result to a pandas dataframe.
df2 = pd.DataFrame(df1.groupby('time')['angle'].median())
df2 = df2.reset_index()
df2
time angle
0 1 6.0
1 2 6.5
Alternatively, you can use .agg after grouping to pick the operation per column (here using the 'Time' column name from the original data):
df1.groupby('Time', as_index=False).agg({"angle":"median"})
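On the dummy frame above (note the lowercase column names), the equivalent would be
df1.groupby('time', as_index=False).agg({'angle': 'median'})
   time  angle
0     1    6.0
1     2    6.5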

How to select columns that are highly correlated with one specific column in a dataframe

I have a dataframe with over 100 columns, with which I am trying to build a model. One column (A) in this dataframe is treated as the response and all the other columns (B, C, D, etc.) are predictors. I am trying to select all the columns that are correlated with column A above some correlation factor (say >0.2). I already generated a heatmap with the correlation factors between each pair of columns. But is there a quick way in pandas to get all the columns whose correlation with column A exceeds 0.2 (a threshold I will adjust as needed)? Thanks in advance!
Use DataFrame.corr() to calculate the correlation matrix, then select the columns that meet your cut-off condition with a Boolean mask.
import pandas as pd
df = pd.DataFrame({'A': [1,2,3,4,5,6,7,8,9,10],
                   'B': [1,2,4,3,5,7,6,8,10,11],
                   'C': [15,-1,17,-10,-10,-13,-99,-101,0,0],
                   'D': [0,10,0,0,-10,0,0,-10,0,10]})
df.loc[:, df.corr()['A'] > 0.2]
A B
0 1 1
1 2 2
2 3 4
3 4 3
4 5 5
5 6 7
6 7 6
7 8 8
8 9 10
9 10 11
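If negative correlations should count as well, a small variation is to filter on the absolute value of the correlation:
df.loc[:, df.corr()['A'].abs() > 0.2]
Column A itself will always pass this filter (its correlation with itself is 1.0), so you may want to drop it from the result afterwards, e.g. with .drop(columns='A').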

Pandas compute means by group using value closest to current date

I'm attempting to compute the mean by date for all categories. However, each category (called mygroup in the example) does not have a value for each date. I would like to use an apply in pandas to compute the mean at each date, filling in the value using the closest date less than or equal to the current date. For instance if I have:
df = pd.DataFrame({'date': ['1','2','3','6','1','3','4','5','1','2','3','4'],
                   'mygroup': ['a','a','a','a','b','b','b','b','c','c','c','c'],
                   'myval': [10,20,30,40,50,60,70,80,90,100,110,120]})
date mygroup myval
0 1 a 10
1 2 a 20
2 3 a 30
3 6 a 40
4 1 b 50
5 3 b 60
6 4 b 70
7 5 b 80
8 1 c 90
9 2 c 100
10 3 c 110
11 4 c 120
Computing the mean for date == 1 should be equal to (10 + 50 + 90)/3 = 50 which can be done with a typical mean apply groupby date. However, for date == 6 I would like to use the last known values for each mygroup. The average then for date == 6 would be calculated as
(40 + 80 + 120)/3 = 80: a has a value of 40 at date == 6; b has no value at date == 6, so its last known value is used (80, at date == 5); and c's last known value is 120, at date == 4. The final result should look like:
date meanvalue
1 50
2 56.67
3 66.67
4 73.33
5 76.67
6 80
Is it possible to compute the mean by date with a groupby and apply in this manner, using each mygroup and filling in with the last known value if there is no value for the current date? This will have to be done for thousands of dates and tens of thousands of categories, so for loops are to be avoided.
df.set_index(['mygroup', 'date']).unstack().ffill(axis=1) \
    .stack().groupby(level=1).mean()
myval
date
1 50.000000
2 56.666667
3 66.666667
4 73.333333
5 76.666667
6 80.000000
- set your index to the key columns
- unstack the date level into columns
- fill the gaps horizontally: you now have a dense matrix you can calculate against
- stack the date level back into the index
- group by date, which is your expected output
- apply the math, here a mean
The key point to remember, useful for a number of problems, is that stacking / unstacking / pivoting, etc. ("rubikscubing" your dataframe) turns a sparse format (like the columnar format you begin with) into a dense one full of NAs whose gaps you can then fill.
So if you're able to do the calculation easily with a full dense matrix, then I encourage you to always focus first on obtaining that dense matrix, so that you can do the easy math afterwards.
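For reference, here is roughly what the intermediate dense matrix looks like for the sample data (after the unstack and the horizontal ffill, before stacking back):
df.set_index(['mygroup', 'date']).unstack().ffill(axis=1)
        myval
date        1      2      3      4      5      6
mygroup
a        10.0   20.0   30.0   30.0   30.0   40.0
b        50.0   50.0   60.0   70.0   80.0   80.0
c        90.0  100.0  110.0  120.0  120.0  120.0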
You can convert all implicit missing values to explicit ones, fill them in with a forward-fill scheme, and then do a normal groupby average:
from itertools import product
import pandas as pd
# get all combinations of date and mygroup using product function from itertools
all_combinations = list(product(df.date.drop_duplicates(), df.mygroup.drop_duplicates()))
# convert implicit missing values to explicit missing values by merging all combinations
# with original data frame
df1 = pd.merge(df, pd.DataFrame.from_records(all_combinations,
                                             columns=['date', 'mygroup']), 'outer')
# fill missing date values with previous date values within each group
df1.sort_values(['mygroup', 'date']).ffill().groupby('date').mean()
# myval
#date
#1 50.000000
#2 56.666667
#3 66.666667
#4 73.333333
#5 76.666667
#6 80.000000
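A small variation, in case you want to be strict about never forward-filling across group boundaries (harmless here, because every group has a value at the first date, but not true in general), is to do the ffill within each group:
df1 = df1.sort_values(['mygroup', 'date'])
df1['myval'] = df1.groupby('mygroup')['myval'].ffill()
df1.groupby('date')['myval'].mean()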

What is the best way to create new pandas dataframe consisting of specific rows of an existing dataframe that match criteria?

I have a pandas dataframe with 6 columns and several rows, each row being data from a specific participant in an experiment. Each column is a particular scale that the participant responded to and contains their scores. I want to create a new dataframe that has only data from those participants whose score for one particular measure matches a criteria.
The criteria is that it has to match one of the items from a list that I have generated separately.
To paraphrase, I have the data in a dataframe and I want to isolate participants who scored a certain score in one of the 6 measures that matches a list of scores that are of interest. I want to have all the 6 columns in the new dataframe with just the rows of participants of interest. Hope this is clear.
I tried using the groupby function but it doesn't offer enough specificity in specifying the criteria, or at least I don't know the syntax if such methods exist. I'm fairly new to pandas.
You could use isin() and any() to isolate the participants getting a particular score in the tests.
Here's a small example DataFrame showing the scores of five participants in three tests:
>>> df = pd.DataFrame(np.random.randint(1,6,(5,3)), columns=['Test1','Test2','Test3'])
>>> df
Test1 Test2 Test3
0 3 3 5
1 5 5 2
2 5 3 4
3 1 3 3
4 2 1 1
If you want a DataFrame with the participants scoring a 1 or 2 in any of the three tests, you could do the following:
>>> score = [1, 2]
>>> df[df.isin(score).any(axis=1)]
Test1 Test2 Test3
1 5 5 2
3 1 3 3
4 2 1 1
Here df.isin(score) creates a boolean DataFrame showing whether each value of df was in the list score or not. any(axis=1) checks each row for at least one True value, creating a boolean Series. This Series is then used to index the DataFrame df.
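If the criterion only applies to a subset of the measures rather than all of them, one option is to run the same check on just those columns (using the column names from the example above):
>>> df[df[['Test1', 'Test2']].isin(score).any(axis=1)]
   Test1  Test2  Test3
3      1      3      3
4      2      1      1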
If I understood your question correctly, you want to query a dataframe for rows whose entries appear in a list.
Say you have a "results" df like
df = pd.DataFrame({'score1': np.random.randint(0,10,5),
                   'score2': np.random.randint(0,10,5)})
score1 score2
0 7 2
1 9 9
2 9 3
3 9 3
4 0 4
and a set of positive outcomes
positive_outcomes = [1,5,7,3]
then you can query the df like
df_final = df[df.score1.isin(positive_outcomes) | df.score2.isin(positive_outcomes)]
to get
score1 score2
0 7 2
2 9 3
3 9 3
