Pandas Multiindex df - slicing multiple sub-ranges of an index - python

I have a dataframe that looks likes this:
Sweep      Index
Sweep0001  0       -70.434570
           1       -67.626953
           2       -68.725586
           3       -70.556641
           4       -71.899414
           5       -69.946289
           6       -63.964844
           7       -73.974609
...
Sweep0039  79985   -63.964844
           79986   -66.406250
           79987   -67.993164
           79988   -68.237305
           79989   -66.894531
           79990   -71.411133
I want to slice out different ranges of Sweeps.
So for example, I want Sweep0001 : Sweep0003, Sweep0009 : Sweep0015, etc.
I know I can do this in separate lines with ix, i.e.:
df.ix['Sweep0001':'Sweep0003']
df.ix['Sweep0009':'Sweep0015']
And then put those back together into one dataframe (I'm doing this so I can average sweeps together, but I need to select some of them and remove others).
Is there a way to do that selection in one line though? I.e. without having to slice each piece separately, followed by recombining all of it into one dataframe.

Use Pandas IndexSlice
import pandas as pd
idx = pd.IndexSlice
df.loc[idx[["Sweep0001", "Sweep0002", ..., "Sweep0003", "Sweep0009", ..., "Sweep0015"]]]
You can retrieve the labels you want this way:
list1 = df.index.get_level_values(0).unique()
list2 = list(list1)
list3 = list2[1:4]           # for your Sweep0001:Sweep0003
list3.extend(list2[9:16])    # for your Sweep0009:Sweep0015
df.loc[idx[list3]]  # Note that you need one fewer set of "[]" around
                    # "list3", as this list comes with its own set of "[]".
In case you want to also slice by columns you can use:
df.loc[idx[list3],:] #Same as above to include all columns.
df.loc[idx[list3],:"column label"] #Returns data up to that "column label".
More information on slicing is on the Pandas website (http://pandas.pydata.org/pandas-docs/stable/advanced.html#using-slicers) or in this similar Stackoverflow Q/A: Python Pandas slice multiindex by second level index (or any other level)
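Here is a runnable sketch of the label-list approach above, on hypothetical data shaped like the question's (sweep names and sample counts are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data mimicking the question: five sweeps, four samples each.
sweeps = ["Sweep%04d" % i for i in range(1, 6)]
mux = pd.MultiIndex.from_product([sweeps, range(4)], names=["Sweep", "Index"])
df = pd.DataFrame({"mV": np.arange(20.0)}, index=mux)

# Build the label list for two sub-ranges, then select both in one line
# by filtering the first index level with isin().
labels = df.index.levels[0]
wanted = list(labels[0:2]) + list(labels[3:5])
subset = df[df.index.get_level_values(0).isin(wanted)]
print(subset.index.get_level_values(0).unique().tolist())
# ['Sweep0001', 'Sweep0002', 'Sweep0004', 'Sweep0005']
```

The `isin()` mask avoids `IndexSlice` entirely, which some may find simpler when the ranges are not contiguous.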

How to separate tuple into independent pandas columns?

I am working with matching two separate dataframes on first name using HMNI's fuzzymerge.
On output each row returns a key like: (May, 0.9905315373004635)
I am trying to separate the Name and Score into their own columns. I tried the below code but don't quite get the right output - every row ends up with the same exact name/score in the new columns.
for i, v in enumerate(matched.key):
    matched['MatchedNameFinal'] = matched.key[i][0][0]
    matched['MatchedNameScore'] = matched.key[i][0][1]
matched[['consumer_name_first', 'key', 'MatchedNameFinal', 'MatchedNameScore']]
First, when going over rows in pandas it is better to use apply:
matched['MatchedNameFinal'] = matched.key.apply(lambda x: x[0][0])
matched['MatchedNameScore'] = matched.key.apply(lambda x: x[0][1])
And in your case, I think you are missing indentation in the for loop:
for i, v in enumerate(matched.key):
    matched['MatchedNameFinal'] = matched.key[i][0][0]
    matched['MatchedNameScore'] = matched.key[i][0][1]
Generally, you want to avoid using enumerate for pandas because pandas functions are vectorized and much faster to execute.
So this solution won't iterate using enumerate.
First, turn each single-element list into a bare tuple per row:
matched.key.explode()
Then use zip to split the tuples into two columns:
matched['col1'], matched['col2'] = zip(*tuples)
Or do it all in one line:
matched['MatchedNameFinal'], matched['MatchedNameScore'] = zip(*matched.key.explode())
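A self-contained sketch of the explode/zip approach, using made-up names and scores in place of the real fuzzymerge output:

```python
import pandas as pd

# Hypothetical fuzzymerge-style output: each key is a list holding one
# (name, score) tuple, as described in the question.
matched = pd.DataFrame({
    "consumer_name_first": ["May", "Ann"],
    "key": [[("May", 0.9905)], [("Anne", 0.8712)]],
})

# explode() unwraps each single-element list into a bare tuple, and
# zip(*...) transposes the tuples into two column-length sequences.
matched["MatchedNameFinal"], matched["MatchedNameScore"] = zip(*matched.key.explode())
print(matched["MatchedNameFinal"].tolist())  # ['May', 'Anne']
```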

Given a list of strings, search a specific column for matching strings and return index value

The goal: I have a excel sheet with three columns. "BigList" containing about ~1000 genes. "Expression" with numeric gene expression values. "SmallList" containing small list of ~10 genes I am interested in.
For each gene in "SmallList", I want to search for its index in BigList and use that index to retrieve the expression value.
Here is what I tried so far. I used Pandas to read my excel file.
import pandas as pd
df = pd.read_excel(r'C:\Users\Me\Box\Excel.xlsx', sheet_name='Genes')
Then I save my small list of genes "SmallList" into a new variable without NA values.
SmallList = df["SmallList"].dropna().tolist()
When I try to use this chunk of code, I get the following error message:
df.loc[df['BigList'] == SmallList[1]].index[0]
IndexError: index 0 is out of bounds for axis 0 with size 0
The code returns an index if I use SmallList[0], but going higher than 0 gives me the error above. I'm going insane trying to figure this out.
I tried to manually type my "SmallList"
SmallList = ["GeneA", "GeneB", "GeneC"]
and I was able to avoid the error. I got indices for SmallList[1] and SmallList[2]. I don't understand why this is happening? I hope someone can explain this to me.
Using my manually created SmallList, I was able to get a list of indices the way I wanted.
indices = []
for i in range(len(SmallList)):
    indices += df.index[df['BigList'] == SmallList[i]].tolist()
indices
[315, 148, 165]
Here is a solution (if I understand the goal correctly). First, create test data:
import pandas as pd
big = pd.DataFrame({
    'gene': ['GeneA', 'GeneB', 'GeneC', 'GeneD', 'GeneE'],
    'expression': ['exp-A', 'exp-B', 'exp-C', 'exp-D', 'exp-E']})
small = pd.DataFrame({'gene': ['GeneA', 'GeneC', 'GeneE']})
Second, do a left join (keeping all items in the small list, and pulling in items from the big list with matching genes). Note that big.reset_index() will put the big index values into the result set.
result = (pd.merge(left=small,
                   right=big.reset_index(),
                   on='gene',
                   how='left')
          .rename(columns={'index': 'big_list_index'}))
print(result)
    gene  big_list_index expression
0  GeneA               0      exp-A
1  GeneC               2      exp-C
2  GeneE               4      exp-E

How to drop element from a list inside a pandas column in Python?

I have a column in a dataframe that contains a list in each cell. My dataframe column is:
[],
['NORM'],
['NORM'],
['NORM'],
['NORM'],
['MI', 'STTC'],
As you can see, I have an empty list and also a list with two elements. How can I change the lists with two elements so that only one element is kept (I don't care which one)?
I tried df.column.explode(), but this just adds more rows, and I don't want more rows; I just need to keep one element.
Thank you so much.
You can use Series.map with a custom mapping function which truncates each list to at most one element:
df['col'] = df['col'].map(lambda l: l[:1])
Result:
# print(df['col'])
0 []
1 [NORM]
2 [NORM]
3 [NORM]
4 [NORM]
5 [MI]
If i, j is the location of the cell you need to access, this will give the first element of the list:
list_ = df.loc[i][j]
if len(list_) > 0:
    print(list_[0])
Since you store lists in a pandas column, I assume you are not worried about vectorization. So you could just use a list comprehension:
df[col] = [i[:1] for i in df[col]]
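For completeness, a runnable version of the map approach on sample data matching the question (the column name "col" is assumed):

```python
import pandas as pd

# Hypothetical column of lists matching the question's data.
df = pd.DataFrame({"col": [[], ["NORM"], ["NORM"], ["MI", "STTC"]]})

# l[:1] keeps at most the first element, so empty lists stay empty
# while two-element lists are trimmed to one.
df["col"] = df["col"].map(lambda l: l[:1])
print(df["col"].tolist())  # [[], ['NORM'], ['NORM'], ['MI']]
```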

Converting several 2D DataFrames into one 3D DataFrame

I am trying to convert a list of 2D dataframes into one large dataframe. Let's assume I have the following example, where I create a set of dataframes, each one having the same columns / index:
import pandas as pd
import numpy as np
frames = []
names = []
frame_columns = ['DataPoint1', 'DataPoint2']
for i in range(5):
    names.append("DataSet{0}".format(i))
    frames.append(pd.DataFrame(np.random.randn(3, 2), columns=frame_columns))
I would like to convert this set of dataframes into one dataframe df which I can access using df['DataSet0']['DataPoint1'].
This dataset would have to have a multi-index consisting of the product of ['DataPoint1', 'DataPoint2'] and the index of the individual dataframes (which is of course the same for all individual frames).
Conversely, the columns would be given as the product of ['Dataset0', ...] and ['DataPoint1', 'DataPoint2'].
In either case, I can create a corresponding MultiIndex and derive an (empty) dataframe based on that:
mux = pd.MultiIndex.from_product([names, frames[0].columns])
frame = pd.DataFrame(index=mux).T
However, I would like to have the contents of the dataframes present rather than having to then add them.
Note that a similar question has been asked here. However, the answers seem to revolve around the Panel class, which is, as of now, deprecated.
Similarly, this thread suggests a join, which is not really what I need.
You can use concat with keys:
total_frame = pd.concat(frames, keys=names)
Output:
            DataPoint1  DataPoint2
DataSet0 0   -0.656758    1.776027
         1   -0.940759    1.355495
         2    0.173670    0.274525
DataSet1 0   -0.744456   -1.057482
         1    0.186901    0.806281
         2    0.148567   -1.065477
DataSet2 0   -0.980312   -0.487479
         1    2.117227   -0.511628
         2    0.093718   -0.514379
DataSet3 0    0.046963   -0.563041
         1   -0.663800   -1.130751
         2   -1.446891    0.879479
DataSet4 0    1.586213    1.552048
         1    0.196841    1.933362
         2   -0.545256    0.387289
Then you can extract Dataset0 by:
total_frame.loc['DataSet0']
If you really want to use MultiIndex columns instead, you can add axis=1 to concat:
total_frame = pd.concat(frames, axis=1, keys=names)
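A sketch of the axis=1 variant, with deterministic stand-in data in place of the question's random values so the output is reproducible:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the question's random frames.
frame_columns = ["DataPoint1", "DataPoint2"]
names = ["DataSet{0}".format(i) for i in range(3)]
frames = [pd.DataFrame(np.arange(6).reshape(3, 2) + 10 * i,
                       columns=frame_columns) for i in range(3)]

# With axis=1, the keys become the top level of a column MultiIndex,
# which enables the df['DataSet0']['DataPoint1'] access pattern.
total_frame = pd.concat(frames, axis=1, keys=names)
print(total_frame["DataSet0"]["DataPoint1"].tolist())  # [0, 2, 4]
```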

Pandas dataframe, empty or with 3 column to pickle

I'm not used to pandas at all, hence the several questions about my problem.
I have a function computing a list called solutions. This list can either be made of tuples of 3 values (a, b, c) or empty.
solutions = [(a,b,c), (d,e,f), (g,h,i)]
To save it, I first turn it into a numpy array, and then I save it with pandas after naming the columns.
solutions = np.asarray(solutions)
df = pd.DataFrame(solutions)
df.columns = ["Name1", "Name2", "Name3"]
df.to_pickle(path)
My issue is that I sometimes have an empty solutions list: solutions = []. Then the df.columns line raises an error. To bypass it, I currently check the size of solutions, and if it is empty, I do:
pickle.dump([], open(path, "wb"))
I would like to be more consistent with my data types, and to save the SAME format in both scenarios.
=> If the list is empty, I would like to save the 3 column names with an empty data frame. The ultimate goal is to reopen the file with pd.read_pickle() and to easily access the data in it.
Second issue, I would like to reopen the files pickled, and to add a column. Could you show me the right way to do so?
And third question, how can I select a part of the dataframe. For instance, I want all lines in which the column Name1 value % 0.25 == 0.
Thanks
Create your dataframe using:
df = pandas.DataFrame(data=solutions, columns=['name1', 'name2', 'name3'])
If solutions is empty, it will nevertheless create a dataframe with 3 columns and 0 rows.
In [2]: pd.DataFrame(data=[(1,2,3), (4,5,6)], columns=['a','b','c'])
Out[2]:
   a  b  c
0  1  2  3
1  4  5  6
In [3]: pd.DataFrame(data=[], columns=['a','b','c'])
Out[3]:
Empty DataFrame
Columns: [a, b, c]
Index: []
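For the second question (reopening a pickled file and adding a column), a minimal sketch, assuming a temporary file path rather than your real one:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame(data=[(1, 2, 3)], columns=["Name1", "Name2", "Name3"])

# Round-trip through a pickle file, then add a column to the reloaded frame.
path = os.path.join(tempfile.mkdtemp(), "solutions.pkl")
df.to_pickle(path)
reloaded = pd.read_pickle(path)
reloaded["Name4"] = reloaded["Name1"] + reloaded["Name2"]
print(reloaded.columns.tolist())  # ['Name1', 'Name2', 'Name3', 'Name4']
```

Because the empty frame keeps its column names, the same code works whether or not solutions was empty.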
For your third question:
df["Name1"] % 0.25 == 0
computes a series of booleans which are true where the value in the first column can be divided by 0.25. You can use it to select the rows of your dataframe:
df[ df["Name1"] % 0.25 == 0 ]
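A runnable version of the boolean-mask selection, on a hypothetical frame in the three-column format described above:

```python
import pandas as pd

# Hypothetical three-column frame matching the pickled format.
df = pd.DataFrame({"Name1": [0.25, 0.3, 0.5, 0.6],
                   "Name2": [1, 2, 3, 4],
                   "Name3": [5, 6, 7, 8]})

# The modulo test builds a boolean Series; indexing with it keeps only
# the rows where Name1 is an exact multiple of 0.25.
selected = df[df["Name1"] % 0.25 == 0]
print(selected["Name1"].tolist())  # [0.25, 0.5]
```

Note that floating-point modulo is only exact for values that are exactly representable in binary (0.25 and 0.5 are; 0.3 and 0.6 are not).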