I have a very large data file (tens of thousands of rows and columns) formatted similarly to this:
name x y gh_00hr_bio_rep1 gh_00hr_bio_rep2 gh_00hr_bio_rep3 gh_06hr_bio_rep1
gene1 x y 2 3 2 1
gene2 x y 5 7 6 2
My goal, for each gene, is to find the mean of each set of replicates.
At the end I would like to have only columns of mean values, titled something like "00hr_bio", and to delete all the individual rep columns.
My thinking right now is to use something like this:
for row in df:
    df[avg] = df.iloc[3:].rolling(window=3, axis=1).mean()
But I have no idea how to actually make this work.
The df.iloc[3:] is my way of trying to start from the 3rd column, but I am fairly certain doing it this way does not work.
I don't even know where to begin in terms of "merging" the 3 columns into only 1.
Any suggestions you have will be greatly appreciated as I obviously have no idea what I am doing.
I would first build a Series of final names indexed by the original columns:
names = pd.Series(['_'.join(i.split('_')[:-1]) for i in df.columns[3:]],
index = df.columns[3:])
I would then use it to take the mean of a groupby along axis 1:
tmp = df.iloc[:, 3:].groupby(names, axis=1).agg('mean')
This gives a new dataframe, indexed like the original one, with the averaged columns:
gh_00hr_bio gh_06hr_bio
0 2.333333 1.0
1 6.000000 2.0
You can then concatenate it horizontally with the original dataframe, or with just its first 3 columns:
result = pd.concat([df.iloc[:, :3], tmp], axis=1)
to get:
name x y gh_00hr_bio gh_06hr_bio
0 gene1 x y 2.333333 1.0
1 gene2 x y 6.000000 2.0
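Note that groupby with axis=1 is deprecated on newer pandas (2.x). If your version warns about it, the same result can be had via a transpose (a sketch):

# group the transposed frame's rows by the names Series, then transpose back
tmp = df.iloc[:, 3:].T.groupby(names).mean().T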
You're pretty close.
df['avg'] = df.iloc[:, 2:].mean(axis=1)
will get you this:
x y gh_00hr_bio_rep1 gh_00hr_bio_rep2 gh_00hr_bio_rep3 gh_06hr_bio_rep1 avg
gene1 x y 2 3 2 1 2.0
gene2 x y 5 7 6 2 5.0
If you wish to take the mean over different sets of columns, you could do something like this:
for col in range(10):
    df['avg%i' % col] = df.iloc[:, 2+col*5:7+col*5].mean(axis=1)
That works if you have the same number of columns per average. Otherwise you'd probably want to key off the names of the rep columns, depending on what your data looks like; see the sketch below.
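For instance, a rough sketch that derives the group from each column name, assuming every rep column ends in "_repN" (the helper names here are illustrative):

rep_cols = [c for c in df.columns if '_rep' in c]
for base in sorted({c.rsplit('_', 1)[0] for c in rep_cols}):
    cols = [c for c in rep_cols if c.startswith(base + '_rep')]
    df[base] = df[cols].mean(axis=1)   # average the replicates for this base name
df = df.drop(columns=rep_cols)         # drop the individual rep columns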
I am doing some computation on a dataset using loops. Based on a random event, I then compute some float numbers, which means I don't know in advance how many floats I am going to retrieve. I want to collect these results in some kind of list and then save them to a dataframe column, one column per loop iteration, so I can compare the iterations.
example:
for y in range(1, 10):
    for x in range(1, 100):
        if x > random_number and x < y:
            result = 2 * x
I want to save all the results in dataframe columns keyed by the (x, y) combination: for example, the results for x=1, y=2 in one column, then x=2, y=2 in another, and so on. The result sets are not all the same size, so I guess I'll use fillna.
Now, I know I can create an empty dataframe with the maximal index and fill it result by result, but I think there's a better way to do it!
Thanks in advance.
You want to take advantage of the efficiency that numpy and pandas give you. If you use numpy.where, you can set the value to nan when the condition is False, and otherwise execute your formula:
import numpy as np
import pandas as pd
np.random.seed(0) # so you can reproduce my result, you can remove this in practice
x = list(range(10))
y = list(range(1, 11))
random_nums = 10 * np.random.random(10)
df = pd.DataFrame({'x' : x, 'y': y})
# the first argument is your if condition
df['new_col'] = np.where((df['x'] > random_nums) & (df['x'] < df['y']), 2*df['x'], np.nan)
print(df)
Here, random_nums is an entire np.ndarray of random numbers to compare against. This gives
x y new_col
0 0 1 NaN
1 1 2 NaN
2 2 3 NaN
3 3 4 NaN
4 4 5 NaN
5 5 6 NaN
6 6 7 12.0
7 7 8 NaN
8 8 9 NaN
9 9 10 18.0
This is especially faster if your formula (here, 2*x) is relatively quick to compute.
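If you do need one column per (x, y) combination, as in your example, here is a sketch along the same lines (the column labels are just illustrative):

import numpy as np
import pandas as pd

np.random.seed(0)
xs = np.arange(1, 100)
random_nums = 100 * np.random.random(xs.size)  # one random number per x

cols = {}
for y in range(1, 10):
    mask = (xs > random_nums) & (xs < y)       # your if condition, vectorized
    cols['y=%d' % y] = np.where(mask, 2 * xs, np.nan)

result = pd.DataFrame(cols, index=pd.Index(xs, name='x'))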
I have this dataframe and my goal is to remove any columns that have fewer than 1000 entries.
Prior to pivoting the df, I know I have 880 unique well_ids with entry counts ranging from 4 to 60k+. I know I should end up with 102 well_ids.
I tried to accomplish this in a very naïve way by collecting the wells I want to remove in an array and deleting them in a loop, but I keep getting a 'TypeError: Level type mismatch'. When I just use del without a for loop, it works.
#this works
del df[164301.0]
del df['TB-0071']
# this doesn't work
for id in unwanted_id:
    del df[id]
Any help is appreciated, Thanks.
You can use the dropna method:
df.dropna(axis=1, thresh=1000)  # thresh = how many non-NA values a column needs to be kept
The advantage of this method is that you don't need to create a list.
Also don't forget to add the usual inplace=True if you want the changes to be made in place.
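For example, on a toy frame (threshold shrunk to fit):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, np.nan, np.nan]})
# keep only columns with at least 2 non-NA values -> 'b' is dropped
print(df.dropna(axis=1, thresh=2))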
You can use pandas' drop method:
df.drop(columns=['colName'], inplace=True)
You can actually pass a list of column names:
unwanted_ids = [164301.0, 'TB-0071']
df.drop(columns=unwanted_ids, inplace=True)
Sample:
df[:5]
from to freq
0 A X 20
1 B Z 9
2 A Y 2
3 A Z 5
4 A X 8
df.drop(columns=['from', 'to'])
freq
0 20
1 9
2 2
3 5
4 8
And to get those column names with more than 1000 unique values, you can use something like this:
counts = df.nunique()[df.nunique()>1000].to_frame('uCounts').reset_index().rename(columns={'index':'colName'})
counts
colName uCounts
0 to 1001
1 freq 1050
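Alternatively, if "entries" means non-null values rather than unique ones, df.count() gives per-column non-NA counts, so a boolean mask avoids building a list at all (a sketch):

df = df.loc[:, df.count() >= 1000]  # keep columns with at least 1000 non-null entries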
Say I have some data in a pandas dataframe that I want to work with.
>>> df = pd.DataFrame([['a',10,5],['a',12,6],['b',4,2],['b',5,10]],
... columns=['id','val','val2'])
So the dataframe looks something like this:
>>> df
id val val2
0 a 10 5
1 a 12 6
2 b 4 2
3 b 5 10
What I want to achieve is a dataframe with the id values as column names and val and val2 as row names, where the values are computed in the following way:
First, build the mean of the value columns per id, giving something like
id mean-val mean-val2
a 11 5.5
b 4.5 6
Then calculate the percentage of mean-val and mean-val2 relative to the sum of both values per id (e.g. 11 / (11 + 5.5) * 100 = 66.67), giving
id perc-val perc-val2
a 66.67 33.33
b 42.86 57.14
The final dataframe shall look like this:
>>> new_df
a b
val 66.67 42.86
val2 33.33 57.14
My approach
I'm quite inexperienced with pandas, so it took me a while to arrive at this unsatisfying approach.
>>> idx = ['val','val2']
>>> lst = [df.groupby('id')[index].mean() for index in idx]
>>> df_new = pd.DataFrame(
... [[x/y*100 for x, y in zip(lst2,sum(lst))] for lst2 in lst],
... index=idx, columns=df['id'].unique())
This works, but I'm not sure whether the columns and rows are guaranteed to be named in the right order, or whether it's possible that, e.g., the a column ends up holding b's values and vice versa.
So my actual question is if there is a nicer, cleaner, safer and maybe more efficient way of doing this.
Yes, there is.
If you're taking the mean over every column, you don't have to specify the column names
You can vectorize your division using DataFrame.div (or the / operator)
v = df.groupby('id').mean()
v.T / v.sum(1) * 100 # thanks to #fuglede
# v.div(v.sum(1), axis=0).T # thanks to #Scott Boston
id a b
val 66.666667 42.857143
val2 33.333333 57.142857
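Put together, a self-contained version of the above, using the toy data from the question:

import pandas as pd

df = pd.DataFrame([['a', 10, 5], ['a', 12, 6], ['b', 4, 2], ['b', 5, 10]],
                  columns=['id', 'val', 'val2'])

v = df.groupby('id').mean()         # rows: ids, columns: val/val2
new_df = v.T / v.sum(axis=1) * 100  # each mean as a percentage of its id's row sum
print(new_df.round(2))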
If I have a pandas dataframe such as:
timestamp label value new
etc.      a     1     3.5
          b     2     5
          a     5     ...
          b     6     ...
          a     2     ...
          b     4     ...
I want the new column to be the average of the last two a's and the last two b's; so for the first row it would be the average of 5 and 2, giving 3.5. It will be sorted by the timestamp. I know I could use a groupby to get the average of all the a's or all the b's, but I'm not sure how to get an average of just the last two. I'm kind of new to Python and coding, so this might not be possible, idk.
Edit: I should also mention this is not for a class or anything; it's just something I'm doing on my own, and it will run on a very large dataset; I'm just using this as an example. Also, I want each a and each b to have its own last-two average, so the new column will have the same dimension as the others. For the third line it would be the average of 2 and whatever the next a is in the dataset.
IIUC one way (among many) to do that:
In [139]: df.groupby('label').tail(2).groupby('label').mean().reset_index()
Out[139]:
label value
0 a 3.5
1 b 5.0
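And if you then want that value repeated on every matching row, you can map it back (a sketch):

means = df.groupby('label').tail(2).groupby('label')['value'].mean()
df['new'] = df['label'].map(means)  # broadcast each label's mean onto its rows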
Edited to reflect a change in the question: it now asks for the last two values rather than the ones following the first, and for the same dimensionality with values repeated.
import pandas as pd
data = {'label': ['a','b','a','b','a','b'], 'value':[1,2,5,6,2,4]}
df = pd.DataFrame(data)
grouped = df.groupby('label')
results = {'label':[], 'tail_mean':[]}
for item, grp in grouped:
    subset_mean = grp['value'].tail(2).mean()  # mean of the last two values in this group
    results['label'].append(item)
    results['tail_mean'].append(subset_mean)
res_df = pd.DataFrame(results)
df = df.merge(res_df, on='label', how='left')
Outputs:
>>> res_df
label tail_mean
0 a 3.5
1 b 5.0
>>> df
label value tail_mean
0 a 1 3.5
1 b 2 5.0
2 a 5 3.5
3 b 6 5.0
4 a 2 3.5
5 b 4 5.0
Now you have a dataframe of your results only, if you need it, plus a column with them merged back into the main dataframe. Someone else posted a more succinct way to get to the results dataframe; there's probably no reason to do it the longer way shown here unless you also need to perform more operations of this kind inside the same loop.
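For reference, the same broadcast can likely be done in one line with transform, which repeats each group's scalar result on every row of that group:

df['tail_mean'] = df.groupby('label')['value'].transform(lambda s: s.tail(2).mean())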
I have a dataframe with a number of columns, two of which are grouping variables.
>>> df2
Groupvar1 Groupvar2 x y z
0 A 1 0.726317 0.574514 0.700475
1 A 2 0.422089 0.798931 0.191157
2 A 3 0.888318 0.658061 0.686496
....
13 B 2 0.978920 0.764266 0.673941
14 B 3 0.759589 0.162488 0.698958
and I want to make a new dataframe which holds the difference between each datapoint in the original df and the mean corresponding to its subgroup.
So to begin with, I make the new df with the grouped averages:
>>> grp_vars = ['Groupvar1','Groupvar2']
>>> df2_grp = df2.groupby(grp_vars)
>>> df2_grp_avg = df2_grp.mean()
>>> df2_grp_avg
x y z
Groupvar1 Groupvar2
A 1 0.364533 0.645237 0.886286
2 0.325533 0.500077 0.246287
3 0.796326 0.496950 0.510085
4 0.774854 0.688732 0.487547
B 1 0.743783 0.452482 0.612006
2 0.575687 0.396902 0.446126
3 0.473152 0.476379 0.508060
4 0.434320 0.406458 0.382187
and in the new dataframe I want to keep the deltas, defined as:
delta = individual value - average value of the subgroup this individual is a member of
Now, it's clear to me how to do this the hard way (a for loop), but I suppose there must be a more elegant solution. I'd appreciate any advice on finding that more elegant solution. TIA.
Use the .groupby(...).transform function:
>>> demean = lambda df: df - df.mean()
>>> df.groupby(['Groupvar1', 'Groupvar2']).transform(demean)
and then pd.concat the result with the original dataframe.
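A minimal sketch of that, assuming df2 is the frame above with value columns x, y and z:

import pandas as pd

grp_vars = ['Groupvar1', 'Groupvar2']
# subtract each subgroup's mean from its members, column by column
deltas = df2.groupby(grp_vars)[['x', 'y', 'z']].transform(lambda g: g - g.mean())
result = pd.concat([df2[grp_vars], deltas.add_prefix('delta_')], axis=1)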