Pandas: align MultiIndex dataframe with another with a regular index - python

I have one dataframe, let's call it df1, with a MultiIndex (just a snippet, there are many more columns and rows):
                                             M1_01  M1_02  M1_03  M1_04  M1_05
Eventloc                  Exonloc
chr10:52619746-52623793|- 52622648-52622741      0      0      0      0      0
chr19:58859211-58865080|+ 58864686-58864827      0      0      0      0      0
                          58864686-58864840      0      0      0      0      0
                          58864744-58864840      0      0      0      0      0
chr19:58863054-58863649|- 58863463-58863550      0      0      0      0      0
And another dataframe, let's go with the creative name df2, like this (these are the results of different algorithms, which is why they have different indices). The columns are the same, though in the first df they are not sorted.
M1_01 M1_02 M1_03 M1_04 M1_05
chr3:53274267:53274364:-#chr3:53271813:53271836:-#chr3:53268999:53269190:- 0.02 NaN NaN NaN NaN
chr2:9002720:9002852:-#chr2:9002401:9002452:-#chr2:9000743:9000894:- 0.04 NaN NaN NaN NaN
chr1:160192441:160192571:-#chr1:160190249:160190481:-#chr1:160188639:160188758:- NaN NaN NaN NaN NaN
chr7:100473194:100473333:+#chr7:100478317:100478390:+#chr7:100478906:100479034:+ NaN NaN NaN NaN NaN
chr11:57182088:57182204:-#chr11:57177408:57177594:-#chr11:57176648:57176771:- NaN NaN NaN NaN NaN
And I have this dataframe, again let's be creative and call it df3, which unifies the indices of df1 and df2:
Eventloc Exonloc
event_id
chr3:53274267:53274364:-#chr3:53271813:53271836:-#chr3:53268999:53269190:- chr3:53269191-53274267|- 53271812-53271836
chr2:9002720:9002852:-#chr2:9002401:9002452:-#chr2:9000743:9000894:- chr2:9000895-9002720|- 9002400-9002452
chr1:160192441:160192571:-#chr1:160190249:160190481:-#chr1:160188639:160188758:- chr1:160188759-160192441|- 160190248-160190481
chr7:100473194:100473333:+#chr7:100478317:100478390:+#chr7:100478906:100479034:+ chr7:100473334-100478906|+ 100478316-100478390
chr4:55124924:55124984:+#chr4:55127262:55127579:+#chr4:55129834:55130094:+ chr4:55124985-55129834|+ 55127261-55127579
I need to do a 1:1 comparison of these results, so I tried doing both
df1.ix[df3.head().values]
and
df1.ix[pd.MultiIndex.from_tuples(df3.head().values.tolist(), names=['Eventloc', 'Exonloc'])]
But they both give me dataframes of NAs. The only thing that works is:
event_id = df2.index[0]
df1.ix[df3.ix[event_id]]
But this is obviously suboptimal, as it is not vectorized and very slow. I think I'm missing some critical concept of MultiIndexes.
Thanks,
Olga

If I understand what you are doing, you need to either explicitly construct the tuples (they must be fully qualified tuples, e.g. have a value for EACH level) or, easier, construct a boolean indexer.
In [9]: df1 = DataFrame(0, index=MultiIndex.from_product([list('abc'), list(range(2))]), columns=['A'])

In [10]: df1
Out[10]:
     A
a 0  0
  1  0
b 0  0
  1  0
c 0  0
  1  0

[6 rows x 1 columns]
In [11]: df3 = DataFrame(0, index=['a','b'], columns=['A'])

In [12]: df3
Out[12]:
   A
a  0
b  0

[2 rows x 1 columns]
These are all the values of level 0 in the first frame
In [13]: df1.index.get_level_values(level=0)
Out[13]: Index([u'a', u'a', u'b', u'b', u'c', u'c'], dtype='object')
Construct a boolean indexer of the result
In [14]: df1.index.get_level_values(level=0).isin(df3.index)
Out[14]: array([ True, True, True, True, False, False], dtype=bool)
In [15]: df1.loc[df1.index.get_level_values(level=0).isin(df3.index)]
Out[15]:
     A
a 0  0
  1  0
b 0  0
  1  0

[4 rows x 1 columns]
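Applied to the frames in the question, the same idea looks roughly like this (a sketch, assuming df3 holds one fully qualified (Eventloc, Exonloc) pair per row as shown above; on modern pandas, prefer .loc/.reindex over the deprecated .ix):

import pandas as pd

# Build fully qualified tuples from df3's two columns
wanted = pd.MultiIndex.from_tuples(
    [tuple(x) for x in df3[['Eventloc', 'Exonloc']].values],
    names=['Eventloc', 'Exonloc'])

# Boolean indexer: keep only the df1 rows whose full (Eventloc, Exonloc) pair appears in df3
subset = df1.loc[df1.index.isin(wanted)]

# For a row-for-row comparison with df2, reindex df1 into df3's order instead;
# pairs missing from df1 come back as NaN rows
aligned = df1.reindex(wanted)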

Related

Making a matrix format in python

I have the following data in my dataframe B:
F1 F2 Count
A C 5
B C 2
B U 6
C A 1
I want to make a square matrix out of them so the results will be:
A B C U
A 0 0 6 0
B 0 0 2 6
C 6 2 0 0
U 0 6 0 0
I initially used pd.crosstab() but some variables in F1/F2 are missing from the matrix.
A→C = 5 and C→A = 1, therefore the output should be 6.
Also, pd.crosstab() does not recognize that BU = UB, etc.
Anyone who could help? I am basically new to python.
Btw, this is my code:
wow = pd.crosstab(B.F1,
                  B.F2,
                  values=B.Count,
                  aggfunc='sum',
                  ).rename_axis(None).rename_axis(None, axis=1)
You can pd.concat wow and wow.T, then groupby the index and sum again:
>>> wow = pd.crosstab(B.F1,
...                   B.F2,
...                   values=B.Count,
...                   aggfunc='sum',
...                   ).rename_axis(None).rename_axis(None, axis=1)
>>> wow
A C U
A NaN 5.0 NaN
B NaN 2.0 6.0
C 1.0 NaN NaN
>>> pd.concat([wow, wow.T], sort=True).fillna(0, downcast='infer').groupby(level=0).sum()
A B C U
A 0 0 6 0
B 0 0 2 6
C 6 2 0 0
U 0 6 0 0
You can make columns F1 and F2 categorical and let crosstab do the work.
FDtype = pd.CategoricalDtype(list("ABCU"))
df[["F1", "F2"]] = df[["F1", "F2"]].astype(FDtype)
count = pd.crosstab(df["F1"], df["F2"], df["Count"], aggfunc='sum', dropna=False)
count.fillna(0, inplace=True, downcast="infer")
count += count.T
Remark: it is more efficient to specify the column dtypes while the DataFrame is constructed
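For instance, a minimal sketch of doing that up front (data taken from the question):

import pandas as pd

FDtype = pd.CategoricalDtype(list("ABCU"))
df = pd.DataFrame({"F1": ["A", "B", "B", "C"],
                   "F2": ["C", "C", "U", "A"],
                   "Count": [5, 2, 6, 1]}).astype({"F1": FDtype, "F2": FDtype})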
You can append the DataFrame where 'F1' and 'F2' are swapped to the original DataFrame.
df1 = df.append(df.rename({'F1': 'F2', 'F2': 'F1'}, axis=1), sort=False)
Then you can use pivot_table:
res = pd.pivot_table(df1, values='Count', index='F1', columns='F2', aggfunc='sum', fill_value=0)
or crosstab:
res = pd.crosstab(df1.F1, df1.F2, df1.Count, aggfunc='sum').fillna(0)
Finally, remove the column and index names (the same rename_axis idiom as above):
res = res.rename_axis(None).rename_axis(None, axis=1)
Result:
A B C U
A 0 0 6 0
B 0 0 2 6
C 6 2 0 0
U 0 6 0 0

How to replace a value in a pandas dataframe with column name based on a condition?

I have a dataframe (constructed below) with 1's scattered across columns A to D. I want to replace every 1 in the range A:D with the name of its column, so that row 0, for instance, becomes A B C D. How can I do that?
You can recreate my dataframe with this:
dfz = pd.DataFrame({'A': [1, 0, 0, 1, 0, 0],
                    'B': [1, 0, 0, 1, 0, 1],
                    'C': [1, 0, 0, 1, 3, 1],
                    'D': [1, 0, 0, 1, 0, 0],
                    'E': [22.0, 15.0, None, 10., None, 557.0]})
One way could be to use replace and pass in a Series mapping column labels to values (those same labels in this case):
>>> dfz.loc[:, 'A':'D'].replace(1, pd.Series(dfz.columns, dfz.columns))
A B C D
0 A B C D
1 0 0 0 0
2 0 0 0 0
3 A B C D
4 0 0 3 0
5 0 B C 0
To make the change permanent, you'd assign the returned DataFrame back to dfz.loc[:, 'A':'D'].
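For example, a minimal sketch of that assignment:

dfz.loc[:, 'A':'D'] = dfz.loc[:, 'A':'D'].replace(1, pd.Series(dfz.columns, dfz.columns))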
Solutions aside, it's useful to keep in mind that you may lose a lot of performance benefits when you mix numeric and string types in columns, as pandas is forced to use the generic 'object' dtype to hold the values.
A solution using where:
>>> dfz.where(dfz != 1, dfz.columns.to_series(), axis=1)
A B C D E
0 A B C D 22.0
1 0 0 0 0 15.0
2 0 0 0 0 NaN
3 A B C D 10.0
4 0 0 3 0 NaN
5 0 B C 0 557.0
Maybe it's not so elegant, but... just loop through the columns and replace:
for i in dfz[['A','B','C','D']].columns:
    dfz[i].replace(1, i, inplace=True)
I do prefer the very elegant solution from @ajcr.
In case you have column names that you can't use that easily for slicing, here is my solution:
dfz.ix[:, dfz.filter(regex=r'(A|B|C|D)').columns.tolist()] = (
    dfz[dfz != 1].ix[:, dfz.filter(regex=r'(A|B|C|D)').columns.tolist()]
       .apply(lambda x: x.fillna(x.name))
)
Output:
In [207]: dfz
Out[207]:
A B C D E
0 A B C D 22.0
1 0 0 0 0 15.0
2 0 0 0 0 NaN
3 A B C D 10.0
4 0 0 3 0 NaN
5 0 B C 0 557.0

Why does a pandas Series of DataFrame mean() fail, but sum() does not, and how to make it work?

There may be a smarter way to do this in Python pandas, but the following example should work, yet doesn't:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[1, 0], [1, 2], [2, 0]], columns=['a', 'b'])
df2 = df1.copy()
df3 = df1.copy()
idx = pd.date_range("2010-01-01", freq='H', periods=3)
s = pd.Series([df1, df2, df3], index=idx)
# This causes an error
s.mean()
I won't post the whole traceback, but the main error message is interesting:
TypeError: Could not convert melt T_s
0 6 12
1 0 6
2 6 10 to numeric
It looks like the dataframes were successfully summed, but the result was not divided by the length of the series.
However, we can take the sum of the dataframes in the series:
s.sum()
... returns:
a b
0 6 12
1 0 6
2 6 10
Why wouldn't mean() work when sum() does? Is this a bug or a missing feature? This does work:
(df1 + df2 + df3)/3.0
... and so does this:
s.sum()/3.0
a b
0 2 4.000000
1 0 2.000000
2 2 3.333333
But this of course is not ideal.
You could (as suggested by @unutbu) use a hierarchical index, but when you have a three-dimensional array you should consider using a "pandas Panel", especially when one of the dimensions represents time, as in this case.
The Panel is oft overlooked, but it is after all where the name pandas comes from (panel data, or something like that).
Data slightly different from your original so there are not two dimensions with the same length:
df1 = pd.DataFrame([[1, 0], [1, 2], [2, 0], [2, 3]], columns=['a', 'b'])
df2 = df1 + 1
df3 = df1 + 10
Panels can be created a couple of different ways but one is from a dict. You can create the dict from your index and the dataframes with:
s = pd.Panel(dict(zip(idx,[df1,df2,df3])))
The mean you are looking for is simply a matter of operating on the correct axis (axis=0 in this case):
s.mean(axis=0)
Out[80]:
a b
0 4.666667 3.666667
1 4.666667 5.666667
2 5.666667 3.666667
3 5.666667 6.666667
With your data, sum(axis=0) returns the expected result.
EDIT: OK, too late for Panels as the hierarchical index approach is already "accepted". I will say that that approach is preferable if the data is known to be "ragged", with an unknown and different number of rows in each grouping. For "square" data, the panel is absolutely the way to go and will be significantly faster, with more built-in operations. Pandas 0.15 has many improvements for multi-level indexing but still has limitations and dark edge cases in real-world apps.
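Panel was deprecated in pandas 0.20 and removed in 0.25; for "square" data, a rough modern equivalent is to stack the values with NumPy and average (a sketch, assuming the three frames share the same index and columns):

import numpy as np
import pandas as pd

stacked = np.stack([df1.values, df2.values, df3.values])  # shape (3, nrows, ncols)
mean_df = pd.DataFrame(stacked.mean(axis=0), index=df1.index, columns=df1.columns)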
When you define s with
s = pd.Series([df1, df2, df3], index=idx)
you get a Series with DataFrames as items:
In [77]: s
Out[77]:
2010-01-01 00:00:00 a b
0 1 0
1 1 2
2 2 0
2010-01-01 01:00:00 a b
0 1 0
1 1 2
2 2 0
2010-01-01 02:00:00 a b
0 1 0
1 1 2
2 2 0
Freq: H, dtype: object
The sum of the items is a DataFrame:
In [78]: s.sum()
Out[78]:
a b
0 3 0
1 3 6
2 6 0
but when you take the mean, nanops.nanmean is called:
def nanmean(values, axis=None, skipna=True):
values, mask, dtype, dtype_max = _get_values(values, skipna, 0)
the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_max))
...
Notice that _ensure_numeric (source code) is called on the resultant sum.
An error is raised because a DataFrame is not numeric.
Here is a workaround. Instead of making a Series with DataFrames as items,
you can concatenate the DataFrames into a new DataFrame with a hierarchical index:
In [79]: s = pd.concat([df1, df2, df3], keys=idx)
In [80]: s
Out[80]:
a b
2010-01-01 00:00:00 0 1 0
1 1 2
2 2 0
2010-01-01 01:00:00 0 1 0
1 1 2
2 2 0
2010-01-01 02:00:00 0 1 0
1 1 2
2 2 0
Now you can take the sum and the mean:
In [82]: s.sum(level=1)
Out[82]:
a b
0 3 0
1 3 6
2 6 0
In [84]: s.mean(level=1)
Out[84]:
a b
0 1 0
1 1 2
2 2 0
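On newer pandas, where the level= keyword to sum/mean has been removed, the equivalent is a groupby on the index level (a sketch):

s = pd.concat([df1, df2, df3], keys=idx)
s.groupby(level=1).sum()
s.groupby(level=1).mean()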

Non-reducing variant of the ANY() function that respects NaN

Hard to explain in words, but the example should make it clear:
df = pd.DataFrame({'x': [0, 1], 'y': [np.nan, 0], 'z': [0, np.nan]}, index=['a', 'b'])
x y z
a 0 NaN 0
b 1 0 NaN
I want to replace all non-NaN values with a '1', if there is a '1' anywhere in that row. Just like this:
x y z
a 0 NaN 0
b 1 1 NaN
This sort of works, but unfortunately overwrites the NaN
df[ df.any(1) ] = 1
x y z
a 0 NaN 0
b 1 1 1
I thought there might be some non-reducing form of any (like cumsum is a non-reducing form of sum), but I can't find anything like that so far...
You could combine a multiplication by zero (to give a frame of zeros which remembers the NaN locations) with an add on axis=0:
>>> df
x y z
a 0 NaN 0
b 1 0 NaN
>>> (df * 0).add(df.any(1), axis=0)
x y z
a 0 NaN 0
b 1 1 NaN
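An alternative sketch using mask, which also leaves rows without any 1 untouched (the intermediate names are just illustrative):

row_has_one = df.any(axis=1)                                # True for rows containing a 1
to_set = df.notna().mul(row_has_one, axis=0).astype(bool)   # non-NaN cells in those rows
result = df.mask(to_set, 1)                                 # set those cells to 1, leave NaN alone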

Find missing data in pandas dataframe and fill with NA

I have a dataframe in pandas with company name and date as multi-index.
companyname date emp1 emp2 emp3..... emp80
where emp1, emp2, ... are the counts of phone calls made by each employee on that date. There are dates when no employee made a call, which means there are rows where all the column values are 0. I want to fill these values with NA.
Should I manually write the names of all the columns in some function? Any suggestions on how to achieve this?
You can check that the entire row is 0 with all:
In [11]: df = pd.DataFrame([[1, 2], [0, 4], [0, 0], [7, 8]])
In [12]: df
Out[12]:
0 1
0 1 2
1 0 4
2 0 0
3 7 8
In [13]: (df == 0).all(1)
Out[13]:
0 False
1 False
2 True
3 False
dtype: bool
Now you can assign all the entries in these rows to NaN using loc:
In [14]: df.loc[(df == 0).all(1)] = np.nan
In [15]: df
Out[15]:
0 1
0 1 2
1 0 4
2 NaN NaN
3 7 8
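If the frame also has non-count columns and you only want to blank the employee columns, a sketch using filter to pick up emp1 ... emp80 by name, so you don't have to type them all out:

emp_cols = df.filter(like='emp').columns        # matches emp1 ... emp80
mask = df[emp_cols].eq(0).all(axis=1)           # dates on which nobody called
df.loc[mask, emp_cols] = np.nan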
