New to Pandas and I'm wondering if there's a better way to accomplish the following -
Set up:
import pandas as pd
import numpy as np
x = np.arange(0, 1, .01)
y = np.random.binomial(10, x, 100)
bins = 50
df = pd.DataFrame({'x':x, 'y':y})
print(df.head())
    x  y
0  -1  1
1  38  1
2  56  0
3  42  0
4  41  0
I would like to group the x values into equal size bins, and for each bin take the average value of both x and y.
my_bins = pd.cut(x, bins=20)
data = df[['x', 'y']].groupby(my_bins).agg(['mean', 'size'])
print(data.head())
                        x                y
                     mean   size      mean   size
age
(-1.101, 4.05]  -1.000000  87990  0.768428  87990
(4.05, 9.1]           NaN      0       NaN      0
(9.1, 14.15]          NaN      0       NaN      0
(14.15, 19.2]   18.512286   1872  0.493590   1872
(19.2, 24.25]   22.768022   8906  0.496968   8906
Well that works. But from here, how do I plot x's mean vs y's mean? I know I can do something like
data.columns = data.columns.droplevel() # remove the multiple levels that were created
data.columns = ['x_mean', 'x_size', 'y_mean', 'y_size'] # manually set new column names
data.plot.scatter(x='x_mean', y='y_mean') # plot
But this feels wrong and clunky as I have to drop the column levels (which removes useful structure from my data) and I have to manually rename the columns. Is there a better way?
You can specify the x and y parameters as tuples that point at the multi-level columns:
data.plot.scatter(x=('x', 'mean'), y=('y', 'mean'))
This way, you don't need to rename the columns in order to plot it.
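Alternatively, if you'd rather have flat columns from the start, named aggregation (available since pandas 0.25) builds them directly, so there is nothing to drop or rename. A minimal sketch, assuming the same df and my_bins as above:
data = df.groupby(my_bins).agg(x_mean=('x', 'mean'),
                               y_mean=('y', 'mean'),
                               size=('y', 'size'))
data.plot.scatter(x='x_mean', y='y_mean')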
I have a very large data file (tens of thousands of rows and columns) formatted similarly to this.
name x y gh_00hr_bio_rep1 gh_00hr_bio_rep2 gh_00hr_bio_rep3 gh_06hr_bio_rep1
gene1 x y 2 3 2 1
gene2 x y 5 7 6 2
My goal for each gene is to find the mean of each set of repetitions.
At the end I would like to only have columns of mean values titled something like "00hr_bio" and delete all the individual repetitions.
My thinking right now is to use something like this:
for row in df:
    df[avg] = df.iloc[3:].rolling(window=3, axis=1).mean()
But I have no idea how to actually make this work.
The df.iloc[3:] is my way of trying to start from the 3rd column, but I am fairly certain doing it this way does not work.
I don't even know where to begin in terms of "merging" the 3 columns into only 1.
Any suggestions you have will be greatly appreciated as I obviously have no idea what I am doing.
I would first build a Series of final names indexed by the original columns:
names = pd.Series(['_'.join(i.split('_')[:-1]) for i in df.columns[3:]],
                  index=df.columns[3:])
I would then use it to take the mean of a groupby along axis 1:
tmp = df.iloc[:, 3:].groupby(names, axis=1).agg('mean')
It gives a new dataframe indexed like the original one and having the averaged columns:
gh_00hr_bio gh_06hr_bio
0 2.333333 1.0
1 6.000000 2.0
You can then concatenate it horizontally to the original dataframe, or just to its first 3 columns:
result = pd.concat([df.iloc[:, :3], tmp], axis=1)
to get:
name x y gh_00hr_bio gh_06hr_bio
0 gene1 x y 2.333333 1.0
1 gene2 x y 6.000000 2.0
You're pretty close.
df['avg'] = df.iloc[:, 2:].mean(axis=1)
will get you this:
x y gh_00hr_bio_rep1 gh_00hr_bio_rep2 gh_00hr_bio_rep3 gh_06hr_bio_rep1 avg
gene1 x y 2 3 2 1 2.0
gene2 x y 5 7 6 2 5.0
If you wish to get the mean from different sets of columns, you could do something like this:
for col in range(10):
    df['avg%i' % col] = df.iloc[:, 2+col*5:7+col*5].mean(axis=1)
That works if you have the same number of columns per average. Otherwise you'd probably want to group by the names of the rep columns, depending on what your data looks like; see the sketch below.
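Here is a minimal sketch of that name-based approach, assuming every rep column ends in _repN so the suffix can be stripped with a regex:
import pandas as pd

# Toy frame shaped like the question's data.
df = pd.DataFrame({
    'name': ['gene1', 'gene2'], 'x': ['x', 'x'], 'y': ['y', 'y'],
    'gh_00hr_bio_rep1': [2, 5], 'gh_00hr_bio_rep2': [3, 7],
    'gh_00hr_bio_rep3': [2, 6], 'gh_06hr_bio_rep1': [1, 2],
})

rep_cols = df.columns[3:]
# Strip the trailing "_repN" so replicates of the same condition share a label.
labels = rep_cols.str.replace(r'_rep\d+$', '', regex=True)
means = df[rep_cols].groupby(labels, axis=1).mean()
result = pd.concat([df[df.columns[:3]], means], axis=1)
print(result)
#     name  x  y  gh_00hr_bio  gh_06hr_bio
# 0  gene1  x  y     2.333333          1.0
# 1  gene2  x  y     6.000000          2.0
On recent pandas versions, where groupby(..., axis=1) is deprecated, the same result comes from df[rep_cols].T.groupby(labels.values).mean().T.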
I have a dataframe with 2 columns in python. I want to enter the dataframe with one column and obtain the value of the 2nd column. Sometimes the values can be exact, but they can also be values between 2 rows.
I have this example dataframe:
x y
0 0 0
1 10 100
2 20 200
I want to find the value of y if I check the dataframe with the value of x. For example, if I enter in the dataframe with the value of 10, I obtain the value of 100. But if I check with 15, I need to interpolate between the two values of y. Is there any function to do it?
numpy.interp is probably the simplest way here for linear interpolation:
def interpolate(xval, df, xcol, ycol):
    # Return the linear interpolation at xval, where df[xcol] holds the
    # x coordinates and df[ycol] the y coordinates. df[xcol] must be sorted.
    return np.interp(xval, df[xcol], df[ycol])
With your example data it gives:
>>> interpolate(10, df, 'x', 'y')
100.0
>>> interpolate(15, df, 'x', 'y')
150.0
You can even directly do:
>>> np.interp([10, 15], df.x, df.y)
array([100., 150.])
You can also have a look at the interpolate method provided in the pandas module (doc), but I'm not sure that answers your question as asked.
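For completeness, a minimal sketch of how that could work: reindex onto the query points first, then let interpolate(method='index') fill them in (the query values 10 and 15 are just the example's):
import pandas as pd

df = pd.DataFrame({'x': [0, 10, 20], 'y': [0, 100, 200]})
queries = [10, 15]

s = df.set_index('x')['y']
# Add the query points to the index, then interpolate against the index values.
s = s.reindex(s.index.union(queries)).interpolate(method='index')
print(s.loc[queries])
# 10    100.0
# 15    150.0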
You can do it with interp1d from the scipy module. Several types of interpolation are possible: ‘linear’, ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’... You can find the full list on the doc page.
The interpolation process can be summarised as three steps:
Split your data between missing and non missing values. I use isna (doc)
Create the interpolation function using the data without missing values. I use interp1d (doc)
Interpolate (predict the missing values). Just call the function found in step 2 on the missing data (column x).
Here is the code:
# Import modules
import pandas as pd
import numpy as np
from scipy.interpolate import interp1d
# Data
df = pd.DataFrame(
    [[0, 0],
     [10, 100],
     [11, np.NaN],
     [15, np.NaN],
     [17, np.NaN],
     [20, 200]],
    columns=["x", "y"])
print(df)
# x y
# 0 0 0.0
# 1 10 100.0
# 2 11 NaN
# 3 15 NaN
# 4 17 NaN
# 5 20 200.0
# Split data in training (not NaN values) and missing (NaN values)
missing = df.isna().any(axis=1)
df_training = df[~missing]
df_missing = df[missing].reset_index(drop=True)
# Create function that interpolate missing value (from our training values)
f = interp1d(df_training.x, df_training.y)
# Interpolate the missing values
df_missing["y"] = f(df_missing.x)
print(df_missing)
# x y
# 0 11 110.0
# 1 15 150.0
# 2 17 170.0
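Note that interp1d raises a ValueError for query points outside the training range by default. If you expect such points, scipy lets you relax that, for example:
# Allow queries outside [min(x), max(x)] by extrapolating linearly.
f = interp1d(df_training.x, df_training.y, bounds_error=False, fill_value='extrapolate')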
You can find other work on the topic at this link.
How can I randomly introduce NaN values into my dataset for each column, taking into account the null values already present in my starting data?
I want to have, for example, 20% NaN values per column.
For example:
If my dataset has 3 columns "A", "B" and "C", each with an existing NaN rate, how do I randomly introduce NaN values per column to reach 20% in each:
A: 10% nan
B: 15% nan
C: 8% nan
For the moment I tried this code, but it degrades my dataset too much and I don't think it is the right way:
df = df.mask(np.random.choice([True, False], size=df.shape, p=[.20,.80]))
I am not sure what you mean by the last part ("degrades too much"), but here is a rough way to do it.
import numpy as np
import pandas as pd
A = pd.Series(np.arange(99))
# Original missing rate (for illustration)
nanidx = A.sample(frac=0.1).index
A[nanidx] = np.NaN
###
# Complementing to 20%
# Original ratio
ori_rat = A.isna().mean()
# Adjusting for the dataframe without missing values
add_miss_rat = (0.2 - ori_rat) / (1 - ori_rat)
nanidx2 = A.dropna().sample(frac=add_miss_rat).index
A[nanidx2] = np.NaN
A.isna().mean()
Obviously, it will not always be exactly 20%...
Update
Applying it for the whole dataframe
for col in df:
    ori_rat = df[col].isna().mean()
    if ori_rat >= 0.2:
        continue
    add_miss_rat = (0.2 - ori_rat) / (1 - ori_rat)
    vals_to_nan = df[col].dropna().sample(frac=add_miss_rat).index
    df.loc[vals_to_nan, col] = np.NaN
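A quick sanity check after running the loop (assuming the df from the question):
# Every column should now sit at roughly 20% NaN or above.
print(df.isna().mean())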
Update 2
I made a correction to also take into account the effect of dropping NaN values when calculating the ratio.
Unless you have a giant DataFrame and speed is a concern, the easy-peasy way to do it is by iteration.
import pandas as pd
import numpy as np
import random
df = pd.DataFrame({'A':list(range(100)),'B':list(range(100)),'C':list(range(100))})
#before adding nan
print(df.head(10))
nan_percent = {'A':0.10, 'B':0.15, 'C':0.08}
for col in df:
    for i, row_value in df[col].items():
        if random.random() <= nan_percent[col]:
            df.loc[i, col] = np.nan
#after adding nan
print(df.head(10))
Here is a way to get as close to 20% nan in each column as possible:
def input_nan(x, pct):
    # Number of additional NaNs needed to reach pct, given those already present.
    n = int(len(x) * (pct - x.isna().mean()))
    # Sample n positions uniformly from the entries that are not already NaN.
    idxs = np.random.choice(len(x), max(n, 0), replace=False,
                            p=x.notna() / x.notna().sum())
    x.iloc[idxs] = np.nan

df.apply(input_nan, pct=.2)
It first takes the difference between the NaN percentage you want and the percentage of NaN values already in your dataset, then multiplies it by the column length, which gives the number of NaN values to insert (n). It then uses np.random.choice, which randomly chooses n indexes that don't already hold NaN values.
Example:
df = pd.DataFrame({'y':np.random.randn(10), 'x1':np.random.randn(10), 'x2':np.random.randn(10)})
df.loc[1, 'y'] = np.nan
df.loc[8, 'y'] = np.nan
df.loc[5, 'x2'] = np.nan
# y x1 x2
# 0 2.635094 0.800756 -1.107315
# 1 NaN 0.055017 0.018097
# 2 0.673101 -1.053402 1.525036
# 3 0.246505 0.005297 0.289559
# 4 0.883769 1.172079 0.551917
# 5 -1.964255 0.180651 NaN
# 6 -0.247067 0.431622 -0.846953
# 7 0.603750 0.475805 0.524619
# 8 NaN -0.452400 -0.191480
# 9 -0.583601 -0.446071 0.029515
df.apply(input_nan, pct=.2)
# y x1 x2
# 0 2.635094 0.800756 -1.107315
# 1 NaN 0.055017 0.018097
# 2 0.673101 -1.053402 1.525036
# 3 0.246505 0.005297 NaN
# 4 0.883769 1.172079 0.551917
# 5 -1.964255 NaN NaN
# 6 -0.247067 0.431622 -0.846953
# 7 0.603750 NaN 0.524619
# 8 NaN -0.452400 -0.191480
# 9 -0.583601 -0.446071 0.029515
I have applied it to the whole dataset, but you can apply it to any column you want. For example, if you wanted 15% NaNs in columns y and x1, you could call df[['y','x1']].apply(input_nan, pct=.15)
I guess I am a little late to the party, but if someone needs a faster solution that takes the target percentage into account when introducing null values, here's the code:
nan_percent = {'A':0.15, 'B':0.05, 'C':0.23}
for col, perc in nan_percent.items():
    df['null'] = np.random.choice([0, 1], size=df.shape[0], p=[1 - perc, perc])
    df.loc[df['null'] == 1, col] = np.nan
df.drop(columns=['null'], inplace=True)
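A variant of the same idea that skips the helper column entirely, sketched with a seeded generator (the seed and the example frame are my additions):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(100), 'B': range(100), 'C': range(100)}, dtype=float)
nan_percent = {'A': 0.15, 'B': 0.05, 'C': 0.23}
rng = np.random.default_rng(0)

for col, perc in nan_percent.items():
    # Each row is masked with probability perc, independently per column.
    mask = rng.random(len(df)) < perc
    df.loc[mask, col] = np.nan

print(df.isna().mean())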
Let us assume we created a dataframe df using the code below. I have created a bin frequency count based on the 'value' column in df. Now how do I get the frequency count of the label == 1 samples within those same bins? Obviously, I should not run qcut again on just the label == 1 samples, since the bin edges would no longer be the same as before.
import numpy as np
import pandas as pd
mu, sigma = 0, 0.1
theta = 0.3
s = np.random.normal(mu, sigma, 100)
group = np.random.binomial(1, theta, 100)
df = pd.DataFrame(np.vstack([s,group]).transpose())
df.columns = ['value','label']
factor = pd.qcut(df['value'], 5)
factor_bin_count = pd.value_counts(factor)
Update: I took the solution from Jeff:
df.groupby(['label',factor]).value.count()
If I understand your question, you want to take a grouping factor (e.g. the bins you created using qcut on the continuous values) and another grouper (e.g. 'label'), then perform an operation, count in this case.
In [36]: df.groupby(['label',factor]).value.count()
Out[36]:
label value
0 [-0.248, -0.0864] 14
(-0.0864, -0.0227] 15
(-0.0227, 0.0208] 15
(0.0208, 0.0718] 17
(0.0718, 0.24] 13
1 [-0.248, -0.0864] 6
(-0.0864, -0.0227] 5
(-0.0227, 0.0208] 5
(0.0208, 0.0718] 3
(0.0718, 0.24] 7
Name: value, dtype: int64
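If you would rather see the two labels side by side, unstacking the label level turns the result into a small table (a sketch using the df and factor from above):
counts = df.groupby(['label', factor]).value.count().unstack('label')
print(counts)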
Suppose I have a pandas data frame df:
I want to calculate the column wise mean of a data frame.
This is easy:
df.apply(np.mean)
then the column wise range max(col) - min(col). This is easy again:
df.apply(max) - df.apply(min)
Now for each element I want to subtract its column's mean and divide by its column's range. I am not sure how to do that.
Any help/pointers are much appreciated.
In [92]: df
Out[92]:
a b c d
A -0.488816 0.863769 4.325608 -4.721202
B -11.937097 2.993993 -12.916784 -1.086236
C -5.569493 4.672679 -2.168464 -9.315900
D 8.892368 0.932785 4.535396 0.598124
In [93]: df_norm = (df - df.mean()) / (df.max() - df.min())
In [94]: df_norm
Out[94]:
a b c d
A 0.085789 -0.394348 0.337016 -0.109935
B -0.463830 0.164926 -0.650963 0.256714
C -0.158129 0.605652 -0.035090 -0.573389
D 0.536170 -0.376229 0.349037 0.426611
In [95]: df_norm.mean()
Out[95]:
a -2.081668e-17
b 4.857226e-17
c 1.734723e-17
d -1.040834e-17
In [96]: df_norm.max() - df_norm.min()
Out[96]:
a 1
b 1
c 1
d 1
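If you want each column on a fixed [0, 1] scale instead of centered on its mean, the same vectorized pattern works with the column minimum in place of the mean:
df_minmax = (df - df.min()) / (df.max() - df.min())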
If you don't mind importing the sklearn library, I would recommend the method described in this blog post.
import pandas as pd
from sklearn import preprocessing

data = {'score': [234, 24, 14, 27, -74, 46, 73, -18, 59, 160]}
df = pd.DataFrame(data)
cols = df.columns
df

min_max_scaler = preprocessing.MinMaxScaler()
np_scaled = min_max_scaler.fit_transform(df)
df_normalized = pd.DataFrame(np_scaled, columns=cols)
df_normalized
You can use apply for this, and it's a bit neater:
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame(np.random.randn(4,4)* 4 + 3)
0 1 2 3
0 9.497381 0.552974 0.887313 -1.291874
1 6.461631 -6.206155 9.979247 -0.044828
2 4.276156 2.002518 8.848432 -5.240563
3 1.710331 1.463783 7.535078 -1.399565
df.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
0 1 2 3
0 0.515087 0.133967 -0.651699 0.135175
1 0.125241 -0.689446 0.348301 0.375188
2 -0.155414 0.310554 0.223925 -0.624812
3 -0.484913 0.244924 0.079473 0.114448
Also, it works nicely with groupby, if you select the relevant columns:
df['grp'] = ['A', 'A', 'B', 'B']
0 1 2 3 grp
0 9.497381 0.552974 0.887313 -1.291874 A
1 6.461631 -6.206155 9.979247 -0.044828 A
2 4.276156 2.002518 8.848432 -5.240563 B
3 1.710331 1.463783 7.535078 -1.399565 B
df.groupby(['grp'])[[0,1,2,3]].apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
0 1 2 3
0 0.5 0.5 -0.5 -0.5
1 -0.5 -0.5 0.5 0.5
2 0.5 0.5 0.5 -0.5
3 -0.5 -0.5 -0.5 0.5
Slightly modified from Python Pandas Dataframe: Normalize data between 0.01 and 0.99?, but some of the comments made me think it was relevant here (sorry if it's considered a repost though...).
I wanted customized normalization, in that a regular percentile of datum or z-score was not adequate. Sometimes I knew what the feasible max and min of the population were, and therefore wanted to define them from something other than my sample, or a different midpoint, or whatever! This can often be useful for rescaling and normalizing data for neural nets, where you may want all inputs between 0 and 1, but some of your data may need to be scaled in a more customized way... because percentiles and stdevs assume your sample covers the population, but sometimes we know this isn't true. It was also very useful for me when visualizing data in heatmaps. So I built a custom function (with extra steps in the code to make it as readable as possible):
import numpy as np

def NormData(s, low='min', center='mid', hi='max', insideout=False, shrinkfactor=0.):
    # Resolve the low / high endpoints.
    if low == 'min':
        low = min(s)
    elif low == 'abs':
        low = max(abs(min(s)), abs(max(s))) * -1.
    if hi == 'max':
        hi = max(s)
    elif hi == 'abs':
        hi = max(abs(min(s)), abs(max(s))) * 1.
    # Resolve the center point.
    if center == 'mid':
        center = (max(s) + min(s)) / 2
    elif center == 'avg':
        center = np.mean(s)
    elif center == 'median':
        center = np.median(s)
    # Shift everything so the chosen center sits at 0.
    s2 = [x - center for x in s]
    hi = hi - center
    low = low - center
    center = 0.
    # Map [low, center] to [0, 0.5] and [center, hi] to [0.5, 1], clipping values outside.
    r = []
    for x in s2:
        if x < low:
            r.append(0.)
        elif x > hi:
            r.append(1.)
        elif x >= center:
            r.append((x - center) / (hi - center) * 0.5 + 0.5)
        else:
            r.append((x - low) / (center - low) * 0.5 + 0.)
    # Optionally turn the scale inside out: values nearest the center map highest.
    if insideout:
        r = [(1. - abs(z - 0.5) * 2.) for z in r]
    # Shrink toward 0.5 to keep results away from the exact endpoints 0 and 1.
    return [x - (x - 0.5) * shrinkfactor for x in r]
This will take in a pandas Series, or even just a list, and normalize it to your specified low, center, and high points. There is also a shrink factor! It lets you scale the data away from the endpoints 0 and 1 (I had to do this when combining colormaps in matplotlib: Single pcolormesh with more than one colormap using Matplotlib). So you can likely see how the code works, but basically: say you have values [-5, 2, 10] in a sample, but want to normalize based on a range of -7 to 7 (so anything above 7, like our "10", is effectively treated as a 7) with a midpoint of 1, and shrink it to fit a 256-color RGB colormap:
#In[1]
NormData([-5,2,10],low=-7,center=1,hi=7,shrinkfactor=2./256)
#Out[1]
[0.1279296875, 0.5826822916666667, 0.99609375]
It can also turn your data inside out... this may seem odd, but I found it useful for heatmapping. Say you want a darker color for values closer to 0 rather than hi/low. You could heatmap based on normalized data where insideout=True:
#In[2]
NormData([-5,2,10],low=-7,center=1,hi=7,insideout=True,shrinkfactor=2./256)
#Out[2]
[0.251953125, 0.8307291666666666, 0.00390625]
So now "2" which is closest to the center, defined as "1" is the highest value.
Anyways, I thought my application was relevant if you're looking to rescale data in other ways that could have useful applications to you.
This is how you do it column-wise:
for col in df.columns:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())