Subsetting column from pandas dataframe with indexing - python

I have a large data set in a pandas dataframe I would like to subset in order to further manipulate.
For example, I have a df that looks like this:
Sample Group AMP ADP ATP
1 A 0.3840396 0.55635504 0.5844648
2 A 0.3971521 0.57851902 -0.24603208
3 A 0.4578926 0.68118957 0.19129746
4 B 0.400222 0.58370811 0.01782915
5 B 0.4110945 0.60208593 -0.6285537
6 B 0.3307011 -0.82615087 -0.25354715
7 C 0.3485679 -0.79597002 -0.17294609
8 C 0.3408411 -0.8090222 0.76138965
9 C 0.3856457 -0.73333568 0.27364299
Let's say I want to make a new dataframe df2 from df that contains only the samples in group B and only the corresponding values for ATP. I should be able to do this with indexing alone, right?
I would like to do something like this:
df2 = df[(df['Group']=='B') & (df['ATP'])]
I know df['ATP'] is obviously not the correct way to do this. Printing df2 yields this:
Sample Group AMP ADP ATP
4 B 0.400222 0.583708 0.017829
5 B 0.411094 0.602086 -0.628554
6 B 0.330701 -0.826151 -0.253547
Ideally, df2 would look like this:
Sample Group ATP
4 B 0.017829
5 B -0.628554
6 B -0.253547
Is there a way to do this without resorting to some convoluted looping or simply manually deleting the unwanted columns and their values?
Thanks!!!

df2 = df.loc[df['Group'] == 'B', ['Sample', 'Group', 'ATP']]
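As a self-contained sketch of the .loc selection above: the frame is rebuilt by hand from the values in the question (how df was originally created is an assumption). The first argument to .loc is a boolean row mask, the second is the list of column labels to keep.

```python
import pandas as pd

# Rebuild the example frame from the question (assumed construction).
df = pd.DataFrame({
    "Sample": range(1, 10),
    "Group": list("AAABBBCCC"),
    "AMP": [0.3840396, 0.3971521, 0.4578926, 0.400222, 0.4110945,
            0.3307011, 0.3485679, 0.3408411, 0.3856457],
    "ADP": [0.55635504, 0.57851902, 0.68118957, 0.58370811, 0.60208593,
            -0.82615087, -0.79597002, -0.8090222, -0.73333568],
    "ATP": [0.5844648, -0.24603208, 0.19129746, 0.01782915, -0.6285537,
            -0.25354715, -0.17294609, 0.76138965, 0.27364299],
})

# Rows: boolean mask on Group. Columns: explicit list of labels.
df2 = df.loc[df["Group"] == "B", ["Sample", "Group", "ATP"]]
print(df2)
```

Doing both selections in one .loc call avoids chained indexing, which matters if df2 is modified later.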

Related

Saving small sub-dataframes containing all values associated to a specific 'key' string

I need a suggestion on a procedure using pandas. I have a 2-column dataset that looks like this:
A 0.4533
B 0.2323
A 1.2343
A 1.2353
B 4.3521
C 3.2113
C 2.1233
.. ...
where the first column contains strings and the second one floats. I would like to save the minimum value for each group of unique strings, so that I have the minimum associated with A, B and C. Does anybody have any suggestions on that? It would also help to store all the values associated with each string.
Many thanks,
James
Input data:
>>> df
0 1
0 A 0.4533
1 B 0.2323
2 A 1.2343
3 A 1.2353
4 B 4.3521
5 C 3.2113
6 C 2.1233
Use groupby before min:
out = df.groupby(0).min()
Output result:
>>> out
1
0
A 0.4533
B 0.2323
C 2.1233
Update:
To keep only the values in the original dataset that are within 20% of their group's minimum (transform returns the group minimum aligned to the original rows, so the mask can index df directly):
out = df[df[1] <= df.groupby(0)[1].transform('min') * 1.2]
>>> out
0 1
0 A 0.4533
1 B 0.2323
6 C 2.1233
You can simply do it by
min_A=min(df[df["column_1"]=="A"]["value"])
min_B=min(df[df["column_1"]=="B"]["value"])
min_C=min(df[df["column_1"]=="C"]["value"])
where df is the DataFrame, and column_1 and value are the names of its columns.
You can also do it with the built-in pandas function groupby():
>> df.groupby(["column_1"]).min()
This gives the same results.
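The question also asks about storing all values per string. A minimal sketch, assuming the columns are named column_1 and value as in the answer above (the names minima and values_by_key are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "column_1": ["A", "B", "A", "A", "B", "C", "C"],
    "value": [0.4533, 0.2323, 1.2343, 1.2353, 4.3521, 3.2113, 2.1233],
})

# One minimum per key, as a Series indexed by the key.
minima = df.groupby("column_1")["value"].min()

# All values per key, collected into a plain dict of lists.
values_by_key = df.groupby("column_1")["value"].apply(list).to_dict()

print(minima)
print(values_by_key)
```

The dict is convenient for lookups by key; if the per-key sub-dataframes themselves are needed, iterating over df.groupby("column_1") yields (key, sub-frame) pairs.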

Split Pandas Dataframe Column According To a Value

I searched and couldn't find a problem like mine. If there is one and I somehow missed it, please let me know so I can delete this post.
I am stuck on a problem: splitting a pandas dataframe into different data frames (df) by a value.
I have a dataset inside a text file and I store it as a pandas dataframe that has only one column. The dataset contains more than one set of information, and a certain value marks the end of each set. You can see a sample below:
The Sample Input
In [8]: df
Out[8]:
var1
0 a
1 b
2 c
3 d
4 endValue
5 h
6 f
7 b
8 w
9 endValue
So I want to split this df into different data frames. I couldn't find a way to do that, but I'm sure there must be an easy way. The format I show in the sample output may not be the right one, so if you have a better idea I'd love to see it. Thank you for your help.
The sample output I'd like
var1
{[0 a
1 b
2 c
3 d
4 endValue]},
{[0 h
1 f
2 b
3 w
4 endValue]}
You could check where var1 is endValue, take the cumsum, shift it by one row so that each endValue stays with the block it closes, and use the result as a custom grouper. Then group by it and build a dictionary from the result:
d = dict(tuple(df.groupby(df.var1.eq('endValue').cumsum().shift(fill_value=0.))))
Or for a list of dataframes (effectively indexed in the same way):
l = [v for _,v in df.groupby(df.var1.eq('endValue').cumsum().shift(fill_value=0.))]
print(l[0])
var1
0 a
1 b
2 c
3 d
4 endValue
Another idea, which requires unique index values: replace the non-matching index values with NaN, backfill them, and then loop over the groupby object to get a list of DataFrames:
g = df.index.to_series().where(df['var1'].eq('endValue')).bfill()
dfs = [a for i, a in df.groupby(g, sort=False)]
print (dfs)
[ var1
0 a
1 b
2 c
3 d
4 endValue, var1
5 h
6 f
7 b
8 w
9 endValue]
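To make the cumsum grouper easier to follow, here is a sketch that prints the intermediate steps on the sample data and also resets each piece's index to 0..4, as in the desired output. The names is_end, block_id and pieces are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {"var1": ["a", "b", "c", "d", "endValue", "h", "f", "b", "w", "endValue"]}
)

# True on the rows that close a block.
is_end = df["var1"].eq("endValue")

# cumsum alone would put each endValue row into the NEXT block
# (0,0,0,0,1,1,1,1,1,2); shifting by one row keeps it with its own block.
block_id = is_end.cumsum().shift(fill_value=0)
print(block_id.tolist())

# One DataFrame per block, each re-indexed from 0.
pieces = [g.reset_index(drop=True) for _, g in df.groupby(block_id)]
for piece in pieces:
    print(piece)
```

Dropping reset_index(drop=True) keeps the original row labels, which is what the answers above return.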

What's the fastest way to select values from columns based on keys in another columns in pandas?

I need a fast way to extract the right values from a pandas dataframe:
Given a dataframe with (a lot of) data in several named columns and an additional column whose values are only names of the other columns, how do I select values from the data columns using the additional column as keys?
It's simple to do with an explicit loop, but this is extremely slow with something like .iterrows() directly on the DataFrame. Converting to numpy arrays first is faster, but still not fast. Can I combine methods from pandas to do it even faster?
Example: This is the kind of DataFrame structure, where columns A and B contain data and column keys contains the keys to select from:
import pandas
df = pandas.DataFrame(
{'A': [1,2,3,4],
'B': [5,6,7,8],
'keys': ['A','B','B','A']},
)
print(df)
output:
Out[1]:
A B keys
0 1 5 A
1 2 6 B
2 3 7 B
3 4 8 A
Now I need some fast code that returns a DataFrame like
Out[2]:
val_keys
0 1
1 6
2 7
3 4
I was thinking something along the lines of this:
tmp = df.melt(id_vars=['keys'], value_vars=['A','B'])
out = tmp.loc[tmp['keys']==tmp['variable']]
which produces:
Out[2]:
keys variable value
0 A A 1
3 A A 4
5 B B 6
6 B B 7
but doesn't have the right order or index. So it's not quite a solution.
Any suggestions?
See if either of these works for you:
import numpy as np
df['val_keys']= np.where(df['keys'] =='A', df['A'],df['B'])
or
df['val_keys']= np.select([df['keys'] =='A', df['keys'] =='B'], [df['A'],df['B']])
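The code below needs no hard-coded column names; it looks up each row's value by its own label and key (note that apply with axis=1 runs row by row, so it will be slow on large frames):
def value(row):
    a = row.name        # row label
    b = row['keys']     # name of the column to read from
    c = df.loc[a, b]
    return c
df.apply(value, axis=1)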
Have you tried filtering then mapping? Key the dictionaries on the row index rather than on the key itself; otherwise all rows sharing a key collapse onto a single value:
df_A = df[df['keys'].isin(['A'])]
df_B = df[df['keys'].isin(['B'])]
A_dict = dict(zip(df_A.index, df_A['A']))
B_dict = dict(zip(df_B.index, df_B['B']))
df['val_keys'] = df.index.to_series().map(A_dict)
df['val_keys'] = df.index.to_series().map(B_dict).fillna(df['val_keys']) # non-exhaustive mapping for the second one
Your df['val_keys'] column will now contain the values from your val_keys output (as floats, because of the intermediate NaNs).
If you want you can just retain that column as in your expected output by:
df = df[['val_keys']]
Hope this helps :))
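For speed on large frames, one option is to do the lookup with NumPy integer indexing instead of row-wise Python code. This is a sketch under the assumption that the data columns are known up front (data_cols) and share a numeric dtype; the names col_pos, values and picked are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4],
     "B": [5, 6, 7, 8],
     "keys": ["A", "B", "B", "A"]}
)

data_cols = ["A", "B"]  # assumed: the columns that 'keys' may refer to

# Position of each row's key within data_cols (-1 if the key is unknown).
col_pos = pd.Index(data_cols).get_indexer(df["keys"])
if (col_pos < 0).any():
    raise KeyError("some values in 'keys' are not data columns")

# Pick one element per row: row i, column col_pos[i].
values = df[data_cols].to_numpy()
picked = values[np.arange(len(df)), col_pos]

out = pd.DataFrame({"val_keys": picked}, index=df.index)
print(out)
```

Unlike the np.where and np.select answers, this does not need one condition per column, so it scales to many data columns; the original index and row order are preserved.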

Restore hierarchical column index when using groupby in pandas

I am using groupby in pandas to compute some aggregate statistics on data where the columns of a data frame are organized with a hierarchical index.
For the computed statistics I want to get back to a table form in the end, where the groups are reconverted to columns with the group values, e.g. like:
import numpy as np
import pandas as pd
index = pd.MultiIndex.from_tuples([('A', 'a'), ('B', 'b')])
df = pd.DataFrame(np.random.randn(8,2), columns=index)
which results in e.g. this data frame
A B
a b
0 0.511157 0.334748
1 0.031113 -0.477456
2 0.288080 -0.258238
3 0.138467 -0.955547
4 -0.087873 0.017494
5 -0.667393 1.190039
6 -0.068245 -1.282864
7 -0.996982 0.589667
Now I compute the statistics using groupby and reset the index to recreate a flat data frame:
df.groupby([('A','a')]).mean().reset_index()
(A, a) B
b
0 -0.996982 0.589667
1 -0.667393 1.190039
2 -0.087873 0.017494
3 -0.068245 -1.282864
4 0.031113 -0.477456
5 0.138467 -0.955547
6 0.288080 -0.258238
7 0.511157 0.334748
How can I achieve that ('A', 'a') becomes a part of the multi index again, hopefully in an automatic fashion? Or stated otherwise: is there a way to preserve the hierarchical column structure during the groupby operation?
For me it works to add the parameter as_index=False to groupby:
print(df.groupby([('A','a')], as_index=False).mean())
A B
a b
0 -0.765088 -0.556601
1 -0.628040 2.074559
2 -0.516396 -2.028387
3 -0.152027 0.389853
4 0.450218 1.474989
5 0.718040 -0.882018
6 1.932556 -0.977316
7 2.028468 -0.875167
The simplest thing to do is to assign the original columns back:
In [182]:
df1 = df.groupby([('A','a')]).mean().reset_index()
df1.columns = df.columns
df1
Out[182]:
A B
a b
0 -0.857465 -0.761948
1 -0.263677 0.538251
2 0.067710 -1.038906
3 0.345584 -0.425514
4 0.478200 0.119345
5 0.639305 0.047526
6 1.528260 1.956677
7 3.114834 -0.532462
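A small deterministic sketch of the column reassignment, using made-up data with repeated group values so that the mean actually aggregates something. Note the assumption it relies on: reset_index puts the group column first, which matches the original order only because ('A', 'a') is the first column of df.

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples([("A", "a"), ("B", "b")])
# Illustrative data: two groups (1 and 2) in column ('A', 'a').
df = pd.DataFrame([[1, 10], [1, 20], [2, 30], [2, 40]], columns=columns)

# Group on the hierarchical column, aggregate, and flatten the index.
df1 = df.groupby([("A", "a")]).mean().reset_index()

# Restore the two-level column index from the original frame.
# Valid here because the column count and order are the same.
df1.columns = df.columns
print(df1)
```

If the grouped column is not the first one, reorder before assigning (for example with df1[df.columns] once the labels match) rather than overwriting the labels blindly.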

Selecting a subset of values in python

I have a pandas dataframe, df, which contains a feature ('alpha') whose values are letters from {'A','B',...,'G'}.
I'd like to select from df all rows whose 'alpha' value belongs to a subset of these letters, say {'A','B','C'}.
What's the most 'pythonic' way to do this?
I was thinking something along the lines of:
subset = {'A','B','C'}
df1 = df[df['alpha'] == subset]
...but this generates an error:
"need more than 0 values to unpack"
I think you want to use isin to test for membership, example:
In [79]:
subset = {'a','b','c'}
df = pd.DataFrame({'a':list('abasbvggcgasgfdasgcdce')})
df[df['a'].isin(subset)]
Out[79]:
a
0 a
1 b
2 a
4 b
8 c
10 a
15 a
18 c
20 c
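The same idea applied to the column name from the question; the sample data and the extra column x are made up for illustration. Negating the mask with ~ gives the complementary rows.

```python
import pandas as pd

# Illustrative frame with an 'alpha' column of letters A..G.
df = pd.DataFrame({"alpha": list("ABCDEFGAB"), "x": range(9)})

subset = {"A", "B", "C"}

# Rows whose 'alpha' value is in the subset.
df1 = df[df["alpha"].isin(subset)]

# Rows whose 'alpha' value is NOT in the subset.
rest = df[~df["alpha"].isin(subset)]

print(df1)
print(rest)
```

Comparing with == against a set fails because == expects a scalar or an aligned array-like, whereas isin tests each element for membership.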
