I'm trying to set up a DataFrame for some data analysis and would really benefit from one that can handle regular indexing and MultiIndexing together.
For each patient, I have 6 slices of various types of data (T1avg, T2avg, etc.). Let's call this dataframe1 (from an IPython notebook):
import numpy
import pandas

dat0 = numpy.zeros(6)
dat1 = numpy.zeros(6)
pat0 = ['NecS3Hs05'] * 6
pat1 = ['NecS3Hs06'] * 6
slc = ['Slice ' + str(x) for x in range(dat0.shape[-1])]
ind = list(zip(pat0 + pat1, slc + slc))
named_ind = pandas.MultiIndex.from_tuples(ind, names=['Patients', 'Slices'])
ser = pandas.Series(numpy.append(dat0, dat1), index=named_ind)
df = pandas.DataFrame(data=ser, columns=['T1avg'])
Image of output: df1
I also have, for each patient, various strings of information (tumour type, number of imaging sessions, treatment type):
pats = ['NecS3Hs05', 'NecS3Hs06']
tx = ['Control', 'Treated']
Ttype = ['subcutaneous', 'orthotopic']
NSessions = ['2', '3']
cols = ['Tx Group', 'Tumour Type', 'Imaging Sessions']
dat = numpy.array([tx, Ttype, NSessions]).T
df2 = pandas.DataFrame(dat, index=pats, columns=cols)
Ideally, I want to have a dataframe that looks as follows (I sketched it out in an image editor, sorry):
Image of desired output: df-desired
But when I try to use the append command,
com = df.append(df2)
I get something undesired: the MultiIndex that I set up in df is gone, replaced with a simple index of tuples (('NecS3Hs05', 'Slice 0'), etc.), while the indices from df2 remain plain 'NecS3Hs05' labels.
Is this possible to do with PANDAS, or am I barking up the wrong tree here? Also, is this even a recommended way of storing Patient attributes in a dataframe (i.e. is this unpandas)? I think what I would really like is to keep everything a simple index, but instead store N-d arrays inside the elements of the data frame.
For instance, if I try something like:
com['NecS3Hs05','T1avg']
I want to get an array/tuple of shape/len 6
and when I try to get the tumour type:
com['NecS3Hs05','Tumour Type']
I get the string 'subcutaneous'. Obviously I also want to retain the cool features of data frames as well; it looks like pandas is the right way to go here, I just need to understand a bit more about how to set up my dataframe.
I hope this is a sensible question; if not, I'd be happy to rephrase it.
Your problem can be solved, I believe, if you drop the MultiIndex business. Imagine df only has the (non-unique) 'Patients' as its index; 'Slices' then becomes a simple column.
ind = pandas.Index(pat0 + pat1, name='Patients')
ser = pandas.Series(numpy.append(dat0, dat1), index=ind)
df = pandas.DataFrame({'T1avg': ser})
df['Slice'] = pandas.Series(slc + slc, index=df.index)
If you had to select on the slice, you can still do that:
df[df['Slice']=='Slice 4']
This will give you Slice 4 for all patients, without needing a slice level in the index.
As long as your new dataframe (df2) uses the same index, you can now join on it quite simply:
df.join(df2)
and you'll get
T1avg Slice Tx Group Tumour Type Imaging Sessions
Patients
NecS3Hs05 0 Slice 0 Control subcutaneous 2
NecS3Hs05 0 Slice 1 Control subcutaneous 2
NecS3Hs05 0 Slice 2 Control subcutaneous 2
NecS3Hs05 0 Slice 3 Control subcutaneous 2
NecS3Hs05 0 Slice 4 Control subcutaneous 2
NecS3Hs05 0 Slice 5 Control subcutaneous 2
NecS3Hs06 0 Slice 0 Treated orthotopic 3
NecS3Hs06 0 Slice 1 Treated orthotopic 3
NecS3Hs06 0 Slice 2 Treated orthotopic 3
NecS3Hs06 0 Slice 3 Treated orthotopic 3
NecS3Hs06 0 Slice 4 Treated orthotopic 3
NecS3Hs06 0 Slice 5 Treated orthotopic 3
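With this layout, the access patterns from the question map onto plain .loc calls. A minimal sketch (assuming the joined frame is stored as com, the name used in the question):

com = df.join(df2)

# all six T1avg values for one patient, as a length-6 array
com.loc['NecS3Hs05', 'T1avg'].values

# the tumour type (one value per slice row, so take the first)
com.loc['NecS3Hs05', 'Tumour Type'].iloc[0]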
Hello, I want to store a dataframe in another dataframe's cell.
My data looks like this: I have daily data consisting of date, steps, and calories. In addition, I have minute-by-minute HR data for a specific date. Obviously it would be easy to put the minute-by-minute data in a 2-dimensional list, but I fear that would be harder to analyze later.
What would be the best practice when I want to have both datasets in one dataframe? Is it even possible to nest dataframes?
Any better ideas? Thanks!
Yes, it seems possible to nest dataframes, but I would recommend instead rethinking how you want to structure your data, which depends on your application and the analyses you want to run on it afterwards.
How to "nest" dataframes into another dataframe
Your dataframe containing your nested "sub-dataframes" won't be displayed very nicely. However, just to show that it is possible to nest your dataframes, take a look at this mini-example:
Here we have 3 random dataframes:
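(For illustration, assume they were created like this; the 3x3 shape and np.random.rand are assumptions, not from the original post:)

import numpy as np
import pandas as pd

# three small dataframes filled with random values
df1 = pd.DataFrame(np.random.rand(3, 3))
df2 = pd.DataFrame(np.random.rand(3, 3))
df3 = pd.DataFrame(np.random.rand(3, 3))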
>>> df1
0 1 2
0 0.614679 0.401098 0.379667
1 0.459064 0.328259 0.592180
2 0.916509 0.717322 0.319057
>>> df2
0 1 2
0 0.090917 0.457668 0.598548
1 0.748639 0.729935 0.680409
2 0.301244 0.024004 0.361283
>>> df3
0 1 2
0 0.200375 0.059798 0.665323
1 0.086708 0.320635 0.594862
2 0.299289 0.014134 0.085295
We can make a main dataframe that includes these dataframes as values in individual "cells":
df = pd.DataFrame({'idx':[1,2,3], 'dfs':[df1, df2, df3]})
We can then access these nested dataframes as we would access any value in any other dataframe:
>>> df['dfs'].iloc[0]
0 1 2
0 0.614679 0.401098 0.379667
1 0.459064 0.328259 0.592180
2 0.916509 0.717322 0.319057
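As a follow-up note, you can still map operations over the nested frames even though they display awkwardly; a minimal sketch:

# mean of all values inside each nested dataframe
df['dfs'].apply(lambda d: d.values.mean())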
I have a "large" DataFrame table with index being country codes (alpha-3) and columns being years (1900 to 2000) imported via a pd.read_csv(...) [as I understand, these are actually string so I need to pass it as '1945' for example].
The values are 0,1,2,3.
I need to "spread" these values until the next non-0 for each row.
example : 0 0 1 0 0 3 0 0 2 1
becomes: 0 0 1 1 1 3 3 3 2 1
I understand that I should not use iteration. My current implementation is something like this; as you can see, using 2 loops is not optimal, and I guess I could get rid of one of them by using apply on each row:
def spread_values(df):
    # min_year and max_year bound the year columns, e.g. 1900 and 2000
    for idx in df.index:
        previous_v = 0
        for t_year in range(min_year, max_year):
            current_v = df.loc[idx, str(t_year)]
            if current_v == 0 and previous_v != 0:
                df.loc[idx, str(t_year)] = previous_v
            else:
                previous_v = current_v
However, I am told I should use the apply() function, vectorisation, or a list comprehension, because this is not optimal.
The apply function, however, regardless of the axis, does not let me dynamically get the index/column (which I need to conditionally update a cell), and I think the core reason I can't make the vectorised or list-comprehension options work is that I don't have a small, fixed set of column names but rather a wide range (all the examples I've seen use a handful of named columns).
What would be a more optimal / more elegant solution here?
Or are DataFrames not suited to my data at all? What should I use instead?
You can use df.replace(to_replace=0, method='ffill'). This will fill all zeros in your dataframe (except for zeros occurring at the start of your dataframe) with the previous non-zero value per column.
If you want to do it row-wise, unfortunately the .replace() function does not accept an axis argument. But you can transpose your dataframe, replace the zeros, and transpose it back: df.T.replace(0, method='ffill').T
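If you would rather avoid the double transpose, a vectorized row-wise alternative (a sketch not in the original answer, assuming the values are numeric) is to mask the zeros to NaN, forward-fill along the rows, and turn any leading NaNs back into 0:

# 0 0 1 0 0 3 0 0 2 1  ->  0 0 1 1 1 3 3 3 2 1
out = df.mask(df.eq(0)).ffill(axis=1).fillna(0).astype(int)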
I have some data in a pandas dataframe which looks like this:
gene VIM
time:2|treatment:TGFb|dose:0.1 -0.158406
time:2|treatment:TGFb|dose:1 0.039158
time:2|treatment:TGFb|dose:10 -0.052608
time:24|treatment:TGFb|dose:0.1 0.157153
time:24|treatment:TGFb|dose:1 0.206030
time:24|treatment:TGFb|dose:10 0.132580
time:48|treatment:TGFb|dose:0.1 -0.144209
time:48|treatment:TGFb|dose:1 -0.093910
time:48|treatment:TGFb|dose:10 -0.166819
time:6|treatment:TGFb|dose:0.1 0.097548
time:6|treatment:TGFb|dose:1 0.026664
time:6|treatment:TGFb|dose:10 -0.008032
where the left is the index. This is just a subsection of the data, which is actually much larger. The index is composed of three components: time, treatment and dose. I want to reorganize this data so that I can access it easily by slicing. The way to do this is with a pandas MultiIndex, but I don't know how to convert my DataFrame with a single index into one with three levels. Does anybody know how to do this?
To clarify, the desired output here is the same data with a three-level index, the outer being treatment, the middle being dose and the inner being time. This would be useful so that I could access the data with something like df['time']['dose'] or df[0] (or something to that effect, at least).
You can first replace the unnecessary strings (the index has to be converted to a Series by to_series, because replace doesn't work on an Index yet) and then use split. Finally, set the index names with rename_axis (new in pandas 0.18.0):
df.index = df.index.to_series().replace({'time:': '', 'treatment:': '', 'dose:': ''}, regex=True)
df.index = df.index.str.split('|', expand=True)
df = df.rename_axis(('time','treatment','dose'))
print(df)
VIM
time treatment dose
2 TGFb 0.1 -0.158406
1 0.039158
10 -0.052608
24 TGFb 0.1 0.157153
1 0.206030
10 0.132580
48 TGFb 0.1 -0.144209
1 -0.093910
10 -0.166819
6 TGFb 0.1 0.097548
1 0.026664
10 -0.008032
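If you also want the level order described in the question (treatment outer, dose middle, time inner), a small follow-up sketch (not part of the original answer):

# reorder the index levels and sort so slicing works efficiently
df = df.reorder_levels(['treatment', 'dose', 'time']).sort_index()

df.loc['TGFb']              # everything for one treatment
df.xs('0.1', level='dose')  # everything at one dose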
This is a problem I've encountered in various contexts, and I'm curious if I'm doing something wrong, or if my whole approach is off. The particular data/functions are not important here, but I'll include a concrete example in any case.
It's not uncommon to want a groupby/apply that does various operations on each group, and returns a new dataframe. An example might be something like this:
def patch_stats(df):
    first = df.iloc[0]
    diversity = (len(df['artist_id'].unique()) / float(len(df))) * df['dist'].mean()
    start = first['ts']
    return pd.DataFrame({'diversity': [diversity], 'start': [start]})
So, this is a grouping function that generates a new DataFrame with two columns, each derived from a different operation on the input data. Again, the specifics aren't too important here, but this is the issue:
When I look at the output, I get something like this:
result = df.groupby('patch_idx').apply(patch_stats)
print(result)
diversity start
patch_idx
0 0 0.876161 2007-02-24 22:54:28
1 0 0.588997 2007-02-25 01:55:39
2 0 0.655306 2007-02-25 04:27:05
3 0 0.986047 2007-02-25 05:37:58
4 0 0.997020 2007-02-25 06:27:08
5 0 0.639499 2007-02-25 17:40:56
6 0 0.687874 2007-02-26 05:24:11
7 0 0.003714 2007-02-26 07:07:20
8 0 0.065533 2007-02-26 09:01:11
9 0 0.000000 2007-02-26 19:23:52
10 0 0.068846 2007-02-26 20:43:03
...
It's all good, except I have an extraneous, unnamed index level that I don't want:
print(result.index.names)
FrozenList([u'patch_idx', None])
Now, this isn't a huge deal; I can always get rid of the extraneous index level with something like:
result = result.reset_index(level=1,drop=True)
But seeing as this comes up any time I have a grouping function that returns a DataFrame, I'm wondering if there's a better approach. Is it bad form to have a grouping function that returns a DataFrame? If so, what's the right way to get the same kind of result? (Again, this is a general question for problems of this type.)
In your grouping function, return a Series instead of a DataFrame. Specifically, replace the last line of patch_stats with:
return pd.Series({'diversity':diversity, 'start':start})
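For reference, the full function then becomes (same logic as in the question; only the return type changes):

import pandas as pd

def patch_stats(df):
    first = df.iloc[0]
    diversity = (len(df['artist_id'].unique()) / float(len(df))) * df['dist'].mean()
    start = first['ts']
    # a Series makes each group contribute exactly one row, so the result
    # keeps a plain 'patch_idx' index with no extra level
    return pd.Series({'diversity': diversity, 'start': start})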
I've encountered this same issue.
Solution
result = df.groupby('patch_idx', group_keys=False).apply(patch_stats)
print(result)
While I generally understand the warnings, and many posts deal with this, I don't understand why I am getting a warning only when I reach the groupby line (the last one):
grouped = data.groupby(['group'])
for name, group in grouped:
    data2 = group.loc[data['B-values'] > 0]
    data2["unique_A-values"] = data2.groupby(["A-values"])["A-values"].transform('count')
EDIT:
Here is my dataframe (data):
group A-values B-values
human 1 -1
human 1 5
human 1 4
human 3 4
human 2 10
bird 7 8
....
For B-values > 0 (data2 = group.loc[data['B-values'] > 0]):
human has two A-values equal to 1, one equal to 3 and one equal to 2, which is what data2["unique_A-values"] = data2.groupby(["A-values"])["A-values"].transform('count') counts.
You get the warning because you take a reference to a chunk of your groupby and then try to add a column to it, so pandas is warning you that if your intention is to update the original df, this may or may not work.
If you are just modifying a local copy then take a copy using copy() so it's explicit and the warning will go away:
for name, group in grouped:
    data2 = group.loc[data['B-values'] > 0].copy()  # <- add .copy() here
    data2["unique_A-values"] = data2.groupby(["A-values"])["A-values"].transform('count')
FYI the pandas groupby user guide says:
Group chunks should be treated as immutable, and changes to a group chunk may produce unexpected results.
for name, group in grouped:
    # making a reference to the group chunk
    data2 = group.loc[data['B-values'] > 0]
    # trying to make a change to that group chunk reference
    data2["unique_A-values"] = data2.groupby(["A-values"])["A-values"].transform('count')
That said, it looks like you just want to count the values in the data frame so you may be better off using value_counts():
>>> data[data['B-values']>0].groupby('group')['A-values'].value_counts()
group A-values
bird 7 1
human 1 2
2 1
3 1
Name: A-values, dtype: int64