I have a MultiIndex with some levels labeled with strings, and others with integers:
import pandas as pd
metrics = ['PT', 'TF', 'AF']
n_replicates = 3
n_nodes = 6
cols = [(r, m, n) for r in range(n_replicates) for m in metrics for n in range(n_nodes)]
cols = pd.MultiIndex.from_tuples(cols, names=['Replicates', 'Metrics', 'Nodes'])
ind = range(5)
df = pd.DataFrame(columns=cols, index=ind)
df.sort_index(axis=1, inplace=True)
If I want to select a single column with an integer label, no problem:
df[2, 'AF', 5]
If I try to select a range, though:
df[1:4, 'AF', 5]
TypeError (no message given)
If I leave out the last level, I get a different error:
df.sort_index(axis=1, inplace=True)
df[1:4,'AF']
TypeError: unhashable type
I suspect I'm playing with fire when I'm using integers as column labels. Is the "safe" route to simply have them all as strings? Or are there other ways of indexing MultiIndex dataframes with integer labels?
Edit:
It's now clear to me that I should be using .loc. Good. However, it's still not clear to me how to interact with the lower levels of the MultiIndex.
df.loc[:,:] #Good
df.loc[:,1:2] #Good
df.loc[:,[1:2, 'AF']]
SyntaxError: invalid syntax
df.loc[:,1:2].xs('AF', level='Metrics', axis=1) #Good
Is the last line just what I need to use? If so, fine. It's just sufficiently long that it makes me feel I'm ignorant of a better way. Thanks for the help!
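A shorter spelling of the same selection appears to be pd.IndexSlice (the columns must be sorted, as above); a sketch:
idx = pd.IndexSlice
df.loc[:, idx[1:2, 'AF', :]]  # replicates 1-2, metric 'AF', all nodes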
I have a function that I'm using to log-transform individual values of a dataframe, based on a source dataframe and a list of columns that are passed in.
def split(columns, start_df):
    df = start_df[columns].copy()
    numeric_features = df.select_dtypes(exclude=["object"]).columns
    for cols in numeric_features:
        for rows in range(0, train_num.shape[0]):  # train_num is defined elsewhere in my code
            # Offending row
            train_num[cols][rows] = np.log(train_num[cols][rows])
Since the df and the list of columns will be unknown and the columns may come from another df as .columns.tolist(), is there a way to work around this warning without the column index (because it may not match)?
It's the only thing I can think of that's messing up the model I'm making.
I have tried the below, but I'm still getting the warning, and I'm out of ideas.
train_num.loc[cols][rows] = np.log(train_num.loc[cols][rows])
This gives me an error: 'numpy.float64' object has no attribute 'where'
train_num[cols][rows].where(train_num[cols][rows] > 0,
                            np.log(train_num[cols][rows],
                                   train_num[cols][rows]))
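A note on why the .where attempt fails: train_num[cols][rows] is a scalar (a numpy.float64), and .where is a Series/DataFrame method. A hedged sketch that works on the whole column instead:
import numpy as np

# operate on the Series, which does have .where; non-positive values are
# masked to NaN before the log (so numpy doesn't warn), then restored
s = train_num[cols]
train_num[cols] = np.where(s > 0, np.log(s.where(s > 0)), s)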
What's strange is this section in the same function is throwing the same warning as well, hopefully it's the same fix!
X_train.loc[:, numeric_features] = scaler.fit_transform(X_train.loc[:, numeric_features])
X_val.loc[:, numeric_features] = scaler.transform(X_val.loc[:, numeric_features])
Any help is much appreciated!
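For what it's worth, the usual way to avoid SettingWithCopyWarning in both places is to take an explicit copy of the subset and assign whole columns at once; a minimal sketch, reusing the names from the question:
import numpy as np

# an explicit copy tells pandas this frame owns its data
train_num = start_df[columns].copy()
numeric_features = train_num.select_dtypes(exclude=["object"]).columns
# vectorized: transforms every numeric column in one step, no row loop
train_num[numeric_features] = np.log(train_num[numeric_features])
The same idea applies to X_train and X_val: if they were produced by slicing another frame, create them with .copy() first.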
How can I label my x-axis with multiple columns? Here's an example that works:
df = pd.DataFrame({"player_name": ["Alan","Bob","Carl","Dan","Earl"],
"jersey_number": ['1','2','3','4','5'],
"hits" : [2,3,1,2,4],
"at_bats" : [7,6,8,7,8]
})
df["label"] = df["player_name"]+"-"+df["jersey_number"]
df.plot(x="label", y=["hits", "at_bats"])
plt.show()
But this has a couple of weaknesses. First, the line that creates the label column is tedious. Second, string concat is finicky: if the jersey_numbers aren't strings (e.g. ints instead), the concat fails. I could write a subroutine that takes a list of columns, casts them all to strings, and concats them, but that seems like it should be unnecessary; there should be some built-in way to do this, something like:
df = pd.DataFrame({"player_name": ["Alan","Bob","Carl","Dan","Earl"],
"jersey_number": ['1','2','3','4','5'],
"hits" : [2,3,1,2,4],
"at_bats" : [7,6,8,7,8]
})
df.plot(x=["player_name","jersey_number"], y=["hits", "at_bats"])
plt.show()
This doesn't work; it throws ValueError: x must be a label or position.
My googlefu hasn't been strong enough to discover the correct syntax. Does it exist, and if yes what is it? Thanks
One option is to set those columns as the index, then plot:
df.set_index(["player_name", "jersey_number"]).plot(y=["hits", "at_bats"])
which gives a plot with the (player_name, jersey_number) pairs along the x-axis.
Although I would prefer your first approach, since it gives a better representation:
df["label"] = df[["player_name","jersey_number"]].astype(str).agg('-'.join)
or
df['label'] = [f'{x}-{y}' for x, y in zip(df["player_name"], df["jersey_number"])]
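Putting the label-column version together as one runnable sketch (note that .astype(str) means it works even when jersey_number holds ints):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"player_name": ["Alan", "Bob", "Carl", "Dan", "Earl"],
                   "jersey_number": [1, 2, 3, 4, 5],  # ints, not strings
                   "hits": [2, 3, 1, 2, 4],
                   "at_bats": [7, 6, 8, 7, 8]})
df["label"] = df[["player_name", "jersey_number"]].astype(str).agg("-".join, axis=1)
df.plot(x="label", y=["hits", "at_bats"])
plt.show()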
I need to create a data structure that allows indexing via a tuple of floats. Each dimension of the tuple represents one parameter. Each parameter spans a continuous range, and to be able to perform my work, I binned each range into categories.
Then I want to create a dataframe with a MultiIndex, each dimension of the index referring to a parameter with the defined categories:
import pandas as pd
import numpy as np
index = pd.interval_range(start=0, end=10, periods=5, closed='both')
index2 = pd.interval_range(start=20, end=30, periods=3, closed='both')
index3 = pd.MultiIndex.from_product([index, index2])
dataStructure = pd.DataFrame(np.zeros((5*3, 1)), index=index3)
print(dataStructure)
I checked that the interval_range provides me with the necessary methods, e.g.
index.get_loc(2.5)
would provide me the right answer. However, I can't extend this to the DataFrame or the MultiIndex:
index3.get_loc((2.5,21))
does not work. Any ideas? I managed to get this working yesterday somehow, so I am 99% convinced there is a simple way to make it work. But my Jupyter notebook was in the cloud, the server crashed, and the notebook has been lost. I became dumber overnight, apparently.
I think selecting by tuple is not implemented yet. A possible solution is to get the positions for each level separately with Index.get_level_values, take their intersection with np.intersect1d, and select by iloc at the end:
idx1 = dataStructure.index.get_level_values(0).get_loc(2.5)
idx2 = dataStructure.index.get_level_values(1).get_loc(21)
df1 = dataStructure.iloc[np.intersect1d(idx1, idx2)]
print(df1)
                                     0
[2, 4] [20.0, 23.333333333333332]  0.0
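If the per-level IntervalIndex objects are still in scope, another possible sketch is to resolve each coordinate against its own level and then look up the resulting interval pair directly (assumes the index is built as in the question):
i1 = index.get_loc(2.5)    # position of the interval containing 2.5
i2 = index2.get_loc(21)    # position of the interval containing 21
print(dataStructure.loc[(index[i1], index2[i2])])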
I want to apply the .nunique() function to a full DataFrame.
It contains 130 features. [Screenshot: the DataFrame's shape and columns]
The goal is to get the number of different values per feature.
I use the following code (which worked on another DataFrame):
def nbDifferentValues(data):
    total = data.nunique()
    total = total.sort_values(ascending=False)
    percent = (total / data.shape[0] * 100)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Pourcentage'])
diffValues = nbDifferentValues(dataFrame)
And the code fails at the first line with the following error, which I don't know how to solve: ("unhashable type: 'list'", 'occurred at index columns').
[Trace of the error]
You probably have a column whose content are lists.
Since lists in Python are mutable, they are unhashable.
import pandas as pd
df = pd.DataFrame([
(0, [1,2]),
(1, [2,3])
])
# raises "unhashable type: 'list'" error
df.nunique()
SOLUTION: Don't use mutable structures (like lists) in your dataframe:
df = pd.DataFrame([
(0, (1,2)),
(1, (2,3))
])
df.nunique()
# 0 2
# 1 2
# dtype: int64
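Alternatively, if the DataFrame already exists with lists (as in the first snippet), you can convert the offending column to hashable tuples in place:
# starting from the list-valued frame in the first snippet above:
df = pd.DataFrame([(0, [1, 2]), (1, [2, 3])])
df[1] = df[1].apply(tuple)  # lists -> hashable tuples
df.nunique()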
To get nunique or unique in a pandas.Series, my preferred approaches are:
Quick Approach
NOTE: This works even if the column values are a mix of lists and strings. Nested lists might need to be flattened first.
_unique_items = df.COL_LIST.explode().unique()
or
_unique_count = df.COL_LIST.explode().nunique()
Alternate Approach
Alternatively, if I wish not to explode the items,
# If col values are strings
_unique_items = df.COL_STR_LIST.apply("|".join).unique()
# A lambda is needed if the col values are non-strings
_unique_items = df.COL_LIST.apply(lambda _l: "|".join([str(_y) for _y in _l])).unique()
Bonus
df.COL.apply(json.dumps) might handle all the cases.
OP's solution
df['uniqueness'] = df.apply(lambda _x: json.dumps(_x.to_list()), axis=1)
...
# Plug more code
...
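A runnable sketch of the json.dumps idea (the column name COL here is just illustrative):
import json
import pandas as pd

df = pd.DataFrame({"COL": [[1, 2], [2, 3], [1, 2]]})
# serializing each list makes it hashable, so nunique works
print(df["COL"].apply(json.dumps).nunique())  # 2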
I have come across this problem with .nunique() when converting results from a REST API from dict (or list) to a pandas DataFrame. The problem is that one of the columns is stored as a list or dict (a common situation with nested JSON results). Here is some sample code to split out the columns causing the error.
# this is the dataframe that is causing your issues
df = data.copy()
print(f"Rows and columns: {df.shape} \n")
print(f"Null values per column: \n{df.isna().sum()} \n")
# check which columns error when counting number of uniques
ls_cols_nunique = []
ls_cols_error_nunique = []
for each_col in df.columns:
    try:
        df[each_col].nunique()
        ls_cols_nunique.append(each_col)
    except TypeError:  # unhashable values (e.g. lists) raise TypeError
        ls_cols_error_nunique.append(each_col)
print(f"Unique values per column: \n{df[ls_cols_nunique].nunique()} \n")
print(f"Columns error nunique: \n{ls_cols_error_nunique} \n")
This code should split your dataframe's columns into 2 lists:
Columns that can calculate .nunique()
Columns that error when running .nunique()
Then just calculate the .nunique() on the columns without errors.
As far as converting the columns with errors, there are other resources that address that with .apply(pd.Series).
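For completeness, a minimal sketch of that .apply(pd.Series) expansion, with list_col as a hypothetical list-valued column:
# expand each list into its own set of scalar columns, then replace the original
expanded = df["list_col"].apply(pd.Series).add_prefix("list_col_")
df = pd.concat([df.drop(columns="list_col"), expanded], axis=1)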
I have a complex MultiIndex:
import pandas as pd
import numpy as np
metrics = ['PT', 'TF', 'AF']
n_replicates = 3
n_nodes = 6
cols = [(r, m, n) for r in range(n_replicates) for m in metrics for n in range(n_nodes)]
cols = pd.MultiIndex.from_tuples(cols, names=['Replicates', 'Metrics', 'Nodes'])
ind = range(5)
df = pd.DataFrame(columns=cols, index=ind)
df.sort_index(axis=1, inplace=True)
And it's giving me some problems. One of them is: I would like to add a column that doesn't have all the levels of the MultiIndex:
df[r, 'Graph'] = ....
However, I ultimately end up needing to make:
df[r, 'Graph', 0] = ....
And when I reference that column I also need to use df[r, 'Graph', 0], which is clunky since there's not actually anything happening on that third level. Is there a way around this?
Edit: more examples
Adding the column:
df[0, 'Graph'] = np.arange(5)
ValueError: invalid entry
df.ix[:, [0, 'Graph']] = np.arange(5)
KeyError: "['Graph'] not in index"
df.xs[0, 'Graph'] = np.arange(5)
TypeError: 'instancemethod' object does not support item assignment
df[0, 'Graph', 0] = np.arange(5)  # Works! But I have to reference a lower level that doesn't mean anything for this 'Graph' column.
Reading from the column:
df.ix[:, [0, 'Graph']]  # Gives the whole of df[0], not just the [0, 'Graph'] column
df[0, 'Graph']
KeyError: 'MultiIndex lexsort depth 0, key was length 2'
df.sort_index(axis=1, inplace=True)
df[0, 'Graph']  # Works! Though if one is making many manipulations to the dataframe, this sort needs to be called a lot.
Further edit from Jeff's second comment:
I appreciate that a given column needs to have all the levels of the MultiIndex. I have thought about just having another frame, though I would like to keep all the data together in one unit, for storing in HDF5s, etc. The answer for this would be Panel. However, I will ultimately have several levels above this one, and I'm leery of panels of panels of panels, if that's even possible. I'm definitely open to other angles of attack that I haven't thought of that obviate all these issues.
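One workaround (a sketch, not necessarily the canonical answer): pick a placeholder value for the unused Nodes level, so the column has a full key but you never have to think about what the third level means:
import numpy as np

df[0, 'Graph', -1] = np.arange(5)  # -1 is an arbitrary placeholder for Nodes
df = df.sort_index(axis=1)
# read it back without spelling out the placeholder
df.xs((0, 'Graph'), axis=1, level=['Replicates', 'Metrics'])
An integer placeholder keeps the Nodes level all-ints, which avoids type-comparison problems when the columns are sorted.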