Pandas MultiIndex with one column not having multiple levels

I have a complex MultiIndex:
import pandas as pd
import numpy as np

metrics = ['PT', 'TF', 'AF']
n_replicates = 3
n_nodes = 6
cols = [(r, m, n) for r in range(n_replicates) for m in metrics for n in range(n_nodes)]
cols = pd.MultiIndex.from_tuples(cols, names=['Replicates', 'Metrics', 'Nodes'])
ind = range(5)
df = pd.DataFrame(columns=cols, index=ind)
df = df.sort_index(axis=1, level=0)
And it's giving me some problems. One of them is: I would like to add a column that doesn't have all the levels of the MultiIndex:
df[r, 'Graph'] = ....
However, I ultimately end up needing to make:
df[r, 'Graph', 0] = ....
And when I reference that column I also need to use df[r, 'Graph', 0], which is clunky since there's not actually anything happening on that third level. Is there a way around this?
Edit: more examples
Adding the column:
df[0, 'Graph'] = np.arange(5)
ValueError: invalid entry
df.ix[:, [0, 'Graph']] = np.arange(5)
KeyError: "['Graph'] not in index"
df.xs[0, 'Graph'] = np.arange(5)
TypeError: 'instancemethod' object does not support item assignment
df[0, 'Graph', 0] = np.arange(5) #Works! But I have to reference a lower level that doesn't mean anything for this 'Graph' column.
Reading from the column:
df.ix[:, [0,'Graph']] #Gives the whole of df[0], not just the [0,'Graph'] column
df[0, 'Graph']
KeyError: 'MultiIndex lexsort depth 0, key was length 2'
df = df.sort_index(axis=1, level=0)
df[0, 'Graph'] #Works! Though if one is making many manipulations to the dataframe, this re-sort needs to be called a lot.
Further edit from Jeff's second comment:
I appreciate that a given column needs to have all the levels of the MultiIndex. I have thought about just having another frame, though I would like to keep the data all together in one unit, for storing in HDF5s, etc. The answer for this would be Panel. However, I will ultimately have several levels above this one, and I'm leery of panels of panels of panels, if that's even possible. I'm definitely open to other angles of attack that I haven't thought of that obviate all these issues.
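For what it's worth, one workaround is to pad the short key with a placeholder label so the column technically has all three levels. This is a sketch, assuming an empty string is acceptable as a placeholder and never collides with a real Nodes label; it continues from the setup above:
df[0, 'Graph', ''] = np.arange(5)  # '' is an inert placeholder for the Nodes level

# Sort only the first two levels: the Nodes level now mixes ints and '',
# which cannot be compared, so it has to stay unsorted.
df = df.sort_index(axis=1, level=['Replicates', 'Metrics'], sort_remaining=False)

df[0, 'Graph']      # one-column DataFrame; the only Nodes label is ''
df[0, 'Graph', '']  # the same data as a Series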

Related

Subtract 2 dataframes with different column length if key matches

I have a dataframe which has columns with different lengths. I want to subtract the VIEWS columns from each other where the URL fields match.
This is my code, which gives me completely false results: almost exclusively NaN values and floats, neither of which makes sense to me. Is there a better solution for this, or an obvious mistake in my code?
a = a.loc[:, ['VIEWS', 'URL']]
b = b.loc[:, ['VIEWS', 'URL']]
df = pd.concat([a,b], ignore_index=True)
df['VIEWS'] = pd.to_numeric(df['VIEWS'], errors='coerce').fillna(0).astype(int)
df['VIEWS'] = df.groupby(['URL'])['VIEWS'].diff().abs()
Great question!
Let's start with a possible solution.
I assume you want to deduct the total of the first from the total of the second per group. Taking your cleaning as the basis, here's a small, (hopefully) complete example, which uses .sum() and multiplies the views from b by -1 prior to grouping:
import pandas as pd
import numpy as np

a = pd.DataFrame(data=[
    [100, 'x.me'], [200, 'y.me'], [50, 'x.me'], [np.nan, 'y.me']
], columns=['VIEWS', 'URL'])
b = pd.DataFrame(data=[
    [90, 'x.me'], [200, 'z.me'],
], columns=['VIEWS', 'URL'])

for x in [a, b]:
    x['VIEWS'] = pd.to_numeric(x['VIEWS'], errors='coerce').fillna(0).astype(int)

# sum per URL and frame, flip the sign for b (cnt == 1), then add the totals
df = pd.concat(
    [x.groupby(['URL'])['VIEWS'].apply(lambda y: y.sum() * (1 - 2 * cnt)).reset_index(drop=False)
     for (cnt, x) in enumerate([a, b])],
    ignore_index=True)
df = df.groupby(['URL'])['VIEWS'].sum().abs().reset_index()
A few words on why your approach is currently not working
diff(): there is a diff() function on the SeriesGroupBy class, but it takes the difference of each row from the previous row within the group. Check this out for a valid usage of diff() in this context: Pandas groupby multiple fields then diff
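A minimal illustration of that row-to-previous-row behaviour, on made-up numbers:
import pandas as pd

tmp = pd.DataFrame({'URL':   ['x.me', 'x.me', 'y.me', 'y.me'],
                    'VIEWS': [100, 90, 200, 150]})
print(tmp.groupby('URL')['VIEWS'].diff())
# 0      NaN   <- the first row of each group has no previous row
# 1    -10.0
# 2      NaN
# 3    -50.0
# Name: VIEWS, dtype: float64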
NaNs appear in your last operation because you're trying to assign a Series whose index is the URLs onto a frame with a completely different index.
So if anything, an operation such as the following could work
df['VIEWS'] = df.groupby(['URL'])['VIEWS'].sum().reset_index(drop=True)
although this still assumes that df does not change in size and that the indices on the left side match the ones after the reset on the right side.
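For completeness, a shorter route to the same per-URL difference, sketched on the toy frames a and b from above: build one total per URL and frame, then subtract with fill_value=0 so URLs missing on one side count as zero.
totals_a = a.groupby('URL')['VIEWS'].sum()
totals_b = b.groupby('URL')['VIEWS'].sum()

# align on URL, treat missing URLs as 0, then take absolute differences
result = totals_b.sub(totals_a, fill_value=0).abs().reset_index()
print(result)
#     URL  VIEWS
# 0  x.me   60.0
# 1  y.me  200.0
# 2  z.me  200.0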

Pandas interval multiindex

I need to create a data structure allowing indexing via a tuple of floats. Each dimension of the tuple represents one parameter. Each parameter spans a continuous range, and to be able to perform my work, I binned each range into categories.
Then I want to create a dataframe with a MultiIndex, each dimension of the index referring to a parameter with the defined categories:
import pandas as pd
import numpy as np

index = pd.interval_range(start=0, end=10, periods=5, closed='both')
index2 = pd.interval_range(start=20, end=30, periods=3, closed='both')
index3 = pd.MultiIndex.from_product([index, index2])
dataStructure = pd.DataFrame(np.zeros((5 * 3, 1)), index=index3)
print(dataStructure)
I checked that the interval_range provides me with the necessary methods, e.g.
index.get_loc(2.5)
would give me the right answer. However, I can't extend this to the DataFrame or the MultiIndex:
index3.get_loc((2.5, 21))
does not work. Any ideas? I managed to get this working yesterday somehow, therefore I am 99% convinced there is a simple way to make it work. But my Jupyter notebook was in the cloud, the server crashed, and the notebook has been lost. I became dumber overnight, apparently.
I think selecting by a tuple is not implemented yet. A possible solution is to get the position for each level separately with Index.get_level_values plus get_loc, take the intersection with np.intersect1d, and finally select by iloc:
import numpy as np

idx1 = dataStructure.index.get_level_values(0).get_loc(2.5)
idx2 = dataStructure.index.get_level_values(1).get_loc(21)
df1 = dataStructure.iloc[np.intersect1d(idx1, idx2)]
print (df1)
                                     0
[2, 4] [20.0, 23.333333333333332]  0.0
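Note that get_loc's return type depends on the index (an int, a slice, or a boolean mask, depending on uniqueness and monotonicity), so np.intersect1d may not apply cleanly on every pandas version. A more defensive sketch (interval_lookup is a hypothetical helper, not a pandas API) builds one boolean mask per level with IntervalIndex.contains and combines them:
import numpy as np

def interval_lookup(frame, values):
    # one mask per level: does that level's interval contain the value?
    mask = np.ones(len(frame), dtype=bool)
    for level, value in enumerate(values):
        mask &= frame.index.get_level_values(level).contains(value)
    return frame[mask]

print(interval_lookup(dataStructure, (2.5, 21)))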

Why do loc and iloc work differently for slicing rows of a pandas DataFrame?

I want a DataFrame where the top rows of one column (called 'cat') have the value "LOW", and the mid and bottom parts of the frame have the values "MID" and "HI". So, for a frame of 1,200 rows, the value counts for the cat column should result in:
LOW 400
MID 400
HI 400
This should be easy. But apparently it is not. To no avail I tried to select and change the bottom rows using df.loc[-400:,["cat"]] = "HI"
But this approach does work for the top rows: df.loc[:399,["cat"]] = "LOW"
The sample below shows a working example, and note that it requires both loc and iloc. Is this where pandas can improve?
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random([1200, 4]), columns=['A', 'B', 'C', 'D'])
df["cat"] = "MID"
df.loc[:399,["cat"]] = "LOW"
df.iloc[-400:,-1] = "HI" # The -1 selects the last column ('cat') - not ideal.
df.cat.value_counts()
Use get_loc to find the position of the column cat if you want to select by positions with iloc - it needs positions for both index and columns:
df = pd.DataFrame(np.random.random([1200, 4]), columns=['A', 'B', 'C', 'D'])
df["cat"] = "MID"
df.iloc[:400,df.columns.get_loc('cat')] = "LOW"
df.iloc[-400:,df.columns.get_loc('cat')] = "HI"
Detail:
print (df.columns.get_loc('cat'))
4
An alternative is to use loc for selecting by labels - then you need to select 400 index labels by indexing:
df.loc[df.index[:400],"cat"] = "LOW"
df.loc[df.index[-400:],"cat"] = "HI"
a = df.cat.value_counts()
print (a)
MID 400
HI 400
LOW 400
Name: cat, dtype: int64
Other ways to set 400 values at a time are numpy.repeat or repeated lists:
df["cat"] = np.array(["LOW", "MID", "HI"]).repeat(400)
df["cat"] = ["LOW"] * 400 + ["MID"] * 400 + ["HI"] * 400
# thanks @Quickbeam2k1
df = df.assign(cat=['LOW'] * 400 + ['MID'] * 400 + ['HI'] * 400)
Answering the question whether pandas can improve here:
In the documentation it's clearly stated what loc does:
.loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found.
So -400 is simply not a label in your index. Thus the behavior is as intended.
What one often wants is an accessor for iloc-based row access combined with loc-based column access. For this, the .get_loc function comes into play.
You could also use the deprecated .ix indexer, but its behavior caused some confusion. See examples and methods using the .loc and .iloc accessors here.
Essentially, @jezrael's solutions are also found in the link above.
To summarize: pandas had a solution to your problem in place, but it confused users. So in order to provide a more consistent API, it was decided to remove that feature in the future.

Pandas dataFrame.nunique() : ("unhashable type : 'list'", 'occured at index columns')

I want to apply the .nunique() function to a full DataFrame.
The dataframe contains 130 features (screenshot of its shape and columns omitted).
The goal is to get the number of different values per feature.
I use the following code (which worked on another DataFrame):
def nbDifferentValues(data):
    total = data.nunique()
    total = total.sort_values(ascending=False)
    percent = total / data.shape[0] * 100
    return pd.concat([total, percent], axis=1, keys=['Total', 'Pourcentage'])

diffValues = nbDifferentValues(dataFrame)
And the code fails at the first line of the function with the following error, which I don't know how to solve: ("unhashable type : 'list'", 'occured at index columns'). (Traceback screenshot omitted.)
You probably have a column whose contents are lists.
Since lists in Python are mutable, they are unhashable.
import pandas as pd
df = pd.DataFrame([
(0, [1,2]),
(1, [2,3])
])
# raises "unhashable type : 'list'" error
df.nunique()
SOLUTION: Don't use mutable structures (like lists) in your dataframe:
df = pd.DataFrame([
(0, (1,2)),
(1, (2,3))
])
df.nunique()
# 0 2
# 1 2
# dtype: int64
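If the data already arrives as lists, one sketch of a fix is to cast the offending column to tuples in place, since tuples are hashable:
df = pd.DataFrame([
    (0, [1, 2]),
    (1, [2, 3])
])
df[1] = df[1].apply(tuple)  # lists -> tuples
df.nunique()                # no longer raises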
To get nunique or unique on a pandas.Series, my preferred approaches are:
Quick Approach
NOTE: This also works if the column mixes lists and plain strings. Nested lists, however, might need to be flattened first.
_unique_items = df.COL_LIST.explode().unique()
or
_unique_count = df.COL_LIST.explode().nunique()
Alternate Approach
Alternatively, if I wish not to explode the items,
# If col values are strings
_unique_items = df.COL_STR_LIST.apply("|".join).unique()
# A lambda helps if the col values are non-strings
_unique_items = df.COL_LIST.apply(lambda _l: "|".join([str(_y) for _y in _l])).unique()
Bonus
df.COL.apply(json.dumps) might handle all the cases.
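For instance (with COL standing in for any list-valued column), serialising each cell to a JSON string makes it hashable:
import json
import pandas as pd

df = pd.DataFrame({'COL': [[1, 2], [2, 3], [1, 2]]})
print(df['COL'].apply(json.dumps).nunique())  # 2 - the duplicate [1, 2] counts once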
OP's solution
df['uniqueness'] = df.apply(lambda _x: json.dumps(_x.to_list()), axis=1)
...
# Plug more code
...
I have come across this problem with .nunique() when converting results from a REST API from dict (or list) to a pandas dataframe. The problem is that one of the columns is stored as a list or dict (a common situation with nested JSON results). Here is sample code to identify the columns causing the error.
# this is the dataframe that is causing your issues
df = data.copy()
print(f"Rows and columns: {df.shape} \n")
print(f"Null values per column: \n{df.isna().sum()} \n")

# check which columns error when counting the number of uniques
ls_cols_nunique = []
ls_cols_error_nunique = []
for each_col in df.columns:
    try:
        df[each_col].nunique()
        ls_cols_nunique.append(each_col)
    except TypeError:  # unhashable contents such as lists or dicts
        ls_cols_error_nunique.append(each_col)

print(f"Unique values per column: \n{df[ls_cols_nunique].nunique()} \n")
print(f"Columns error nunique: \n{ls_cols_error_nunique} \n")
This code should split your dataframe columns into 2 lists:
Columns on which .nunique() works
Columns on which .nunique() raises an error
Then just calculate .nunique() on the columns without errors.
As for converting the columns with errors, there are other resources that address that with .apply(pd.Series), sketched below.
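As a sketch of that conversion (bad_col stands in for one of the columns collected in ls_cols_error_nunique), .apply(pd.Series) expands a list column into scalar columns that .nunique() can handle:
# expand the list column into one scalar column per list position
expanded = df['bad_col'].apply(pd.Series).add_prefix('bad_col_')
df = pd.concat([df.drop(columns=['bad_col']), expanded], axis=1)
df.nunique()  # works column by column now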

Pandas MultiIndex with integer labels

I have a MultiIndex with some levels labeled with strings, and others with integers:
import pandas as pd

metrics = ['PT', 'TF', 'AF']
n_replicates = 3
n_nodes = 6
cols = [(r, m, n) for r in range(n_replicates) for m in metrics for n in range(n_nodes)]
cols = pd.MultiIndex.from_tuples(cols, names=['Replicates', 'Metrics', 'Nodes'])
ind = range(5)
df = pd.DataFrame(columns=cols, index=ind)
df = df.sort_index(axis=1, level=0)
If I want to select a single column with an integer label, no problem:
df[2, 'AF', 5]
If I try to select a range, though:
df[1:4, 'AF', 5]
TypeError:
(No message given)
If I leave out the last level, I get a different error:
df = df.sort_index(axis=1, level=0)
df[1:4,'AF']
TypeError: unhashable type
I suspect I'm playing with fire when I'm using integers as column labels. Is the "safe" route to simply have them all as strings? Or are there other ways of indexing MultiIndex dataframes with integer labels?
Edit:
It's now clear to me that I should be using .loc. Good. However, it's still not clear to me how to interact with the lower levels of the MultiIndex.
df.loc[:,:] #Good
df.loc[:,1:2] #Good
df.loc[:,[1:2, 'AF']]
SyntaxError: invalid syntax
df.loc[:,1:2].xs('AF', level='Metrics', axis=1) #Good
Is the last line just what I need to use? If so, fine. It's just sufficiently long that it makes me feel I'm ignorant of a better way. Thanks for the help!
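One shorter spelling worth knowing is pd.IndexSlice, which lets .loc take a slice on any level (the columns must be lexsorted first, as above):
idx = pd.IndexSlice
df.loc[:, idx[1:2, 'AF']]       # Replicates 1-2, Metrics 'AF', all Nodes
df.loc[:, idx[1:2, 'AF', :3]]   # ... restricted to Nodes 0-3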
