Proper way to extract value from DataFrame with composite index? - python

I have a dataframe, call it current_data. This dataframe is generated by running statistical functions over another dataframe, current_data_raw. It has a compound index on the columns "Method" and "Request.Name":
current_data = current_data_raw.groupby(['Name', 'Request.Method']).size().reset_index().set_index(['Name', 'Request.Method'])
I then run a bunch of statistical functions over current_data_raw, adding new columns to current_data.
I then need to query that dataframe for specific values of columns. I would love to do something like:
val = df['Request.Name' == some_name, 'Method' = some_method]['Average']
However, this isn't working, nor are the variants I have attempted. .xs returns a Series. I could grab the only row in the Series, but that doesn't seem proper.

To select from a MultiIndex, you can pass a tuple of labels in the order of the levels. Note that this form does not name the index levels (such as 'Request.Name'):
val = df.loc[(some_name, some_method), 'Average']
Another way is to use DataFrame.query. If a level name contains spaces or a ., it is necessary to wrap it in backticks, and Python variables are referenced with an @ prefix:
val = df.query("`Request.Name` == @some_name & `Request.Method` == @some_method")['Average']
If the level names are single words, no backticks are needed:
val = df.query("Name == @some_name & Method == @some_method")['Average']
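The tuple-based lookup above can be sketched end to end as follows. The raw data, the "Duration" column and the "Average" statistic are made up for illustration; only the level names come from the groupby call in the question. The second lookup is one possible way to select by level name rather than by position, not necessarily the canonical one.

```python
import pandas as pd

# Hypothetical raw data; the level names follow the groupby call in the question.
current_data_raw = pd.DataFrame({
    "Name": ["login", "login", "search", "search", "search"],
    "Request.Method": ["GET", "POST", "GET", "GET", "POST"],
    "Duration": [10.0, 30.0, 5.0, 15.0, 40.0],  # assumed column
})

# One row per (Name, Request.Method) pair, indexed by both levels.
current_data = (
    current_data_raw
    .groupby(["Name", "Request.Method"])["Duration"]
    .mean()
    .to_frame("Average")
)

some_name, some_method = "search", "GET"

# A full tuple of labels selects one row; naming the column yields a scalar.
val = current_data.loc[(some_name, some_method), "Average"]

# Selecting by level name: build a boolean mask from the named levels.
idx = current_data.index
mask = (
    (idx.get_level_values("Name") == some_name)
    & (idx.get_level_values("Request.Method") == some_method)
)
# .item() raises unless exactly one row matched, which guards the scalar lookup.
val_by_name = current_data.loc[mask, "Average"].item()
```

Both lookups return a scalar rather than a one-row Series, which is what the question asks for.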

Related

How to add a column with values depending on existing rows with lower index in pandas?

Is there a fast way of adding a column to a data frame df with values depending on all the rows of df with smaller index? A very simple example where the new column only depends on the value of one other column would be df["new_col"] = df["old_col"].cumsum() (if df is ordered), but I have something more complicated in mind. Ideally, I'd like to write something like
df["new_col"] = df.[some function here](f),
where [some function] sets the i-th value of df["new_col"] to f(df[df.index <= df.index[i]]). (Ideally [some function] can also be applied to groupby() objects.)
At the moment I loop through rows, add a temporary column containing a dict of relevant values and then apply a function, but this is very slow, memory-inefficient, etc.
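One possible starting point, sketched under the assumption that df is already sorted by its index so that "rows with smaller index" means "rows above". The function f and the column values here are hypothetical. The first form calls f on each prefix of the frame; the second uses expanding(), which covers the case where f depends on a single column.

```python
import pandas as pd

df = pd.DataFrame({"old_col": [3.0, 1.0, 4.0, 1.0, 5.0]})


def f(prefix: pd.DataFrame) -> float:
    # Hypothetical statistic over all rows up to and including the current one.
    return prefix["old_col"].max() - prefix["old_col"].min()


# General form: call f on each positional prefix of the frame.
# This does O(n^2) work overall, so it is only a baseline, not a fast path.
df["new_col"] = [f(df.iloc[: i + 1]) for i in range(len(df))]

# Single-column form: expanding() hands each growing window to the function.
df["new_col_expanding"] = (
    df["old_col"].expanding().apply(lambda s: s.max() - s.min())
)
```

For grouped data, groupby objects also expose expanding(), so the single-column form may carry over, though that is not shown here.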

How can I use value_counts on many columns in an efficient way?

I want to use value_counts on many columns. I know that I can use df.island.value_counts(), but I want a loop or something more efficient so that I don't have to write out the name of each column of the DataFrame. Note that I know the specific columns to which I want to apply this function. The code that I'm using is:
import pandas as pd
data_url = "https://raw.githubusercontent.com/cmdlinetips/data/master/palmer_penguins.csv"
df = pd.read_csv(data_url)
df.select_dtypes(object).head() # I use this to see which columns hold categorical variables; those are the ones I want to run value_counts on
df.island.value_counts()
categories = df.select_dtypes(object).head() # this is a dataframe object
categorical_columns = categories.columns # this is an iterable which contains strings of column headers
counter_dict = {}
# loop through each column, add values to dict
for column in categorical_columns:
    counter_dict[column] = df[column].value_counts()
# each item in dict has column title as key and value_counts series as value
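Since the question says the columns of interest are known in advance, the same loop can be sketched as a dict comprehension over an explicit list of names. The small frame below is a made-up stand-in for the penguins data, not the real file.

```python
import pandas as pd

# Made-up stand-in for the penguins data; values are for illustration only.
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Biscoe", "Dream"],
    "body_mass_g": [3750, 3800, 5000, 3500],
})

# The columns to count are listed explicitly, as the question allows.
columns = ["species", "island"]

# Map each column name to its value_counts Series.
counts = {col: df[col].value_counts() for col in columns}

for col, series in counts.items():
    print(col)
    print(series)
```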

Sort value issue on categorical column

I would like to take a categorical column, group by each category, and sum each group.
I am using the following Python code, and its result is what I want:
data2 = data.groupby(['service_type']).sum().unstack()
popular_ser2 = data2.sort_values(ascending = False).head(10).droplevel(0)
popular_ser2
I would like to confirm whether my code is sound, since needing unstack and droplevel is uncommon when using groupby and sort_values.
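A possible simplification, sketched with made-up data: selecting a single numeric column before calling sum() gives a Series indexed by service_type directly, so unstack and droplevel are not needed. The column name "amount" is an assumption, since the question does not show which columns data contains.

```python
import pandas as pd

# Hypothetical data: 'service_type' is from the question, 'amount' is assumed.
data = pd.DataFrame({
    "service_type": ["wash", "repair", "wash", "paint", "repair", "wash"],
    "amount": [10, 50, 15, 25, 20, 5],
})

# Group, sum one column, then sort the resulting Series in descending order.
popular_ser = (
    data.groupby("service_type")["amount"]
    .sum()
    .sort_values(ascending=False)
    .head(10)
)
```

If data has several numeric columns that all need summing, the original unstack approach produces one long Series across all of them, which is a different result from the single-column version here.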

How can I create a pandas dataframe column for each part-of-speech tag?

I have a dataset that consists of tokenized, POS-tagged phrases as one column of a dataframe:
Current Dataframe
I want to create a new column in the dataframe, consisting only of the proper nouns in the previous column:
Desired Solution
Right now, I'm trying something like this for a single row:
if 'NNP' in df['Description_POS'][96][0:-1]:
    df['Proper Noun'] = df['Description_POS'][96]
But then I don't know how to loop this for each row, and how to obtain the tuple which contains the proper noun.
I'm very new right now and at a loss for what to use, so any help would be really appreciated!
Edit: I tried the solution recommended, and it seems to work, but there is an issue.
this was my dataframe:
Original dataframe
After implementing the code recommended
df['Proper Nouns'] = df['POS_Description'].apply(
lambda row: [i[0] for i in row if i[1] == 'NNP'])
it looks like this:
Dataframe after creating a proper nouns column
You can use the apply method, which, as the name suggests, applies the given function to every element of the series (or every row or column of a dataframe). This returns a series, which you can add as a new column to your dataframe:
df['Proper Nouns'] = df['POS_Description'].apply(
    lambda row: [i[0] for i in row if i[1] == 'NNP'])
I am assuming that each value in POS_Description is a list of tuples.
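The apply call above can be tried on a small made-up frame. The rows below are invented, and the (token, tag) layout of each tuple is the same assumption the answer states.

```python
import pandas as pd

# Made-up rows; each cell is a list of (token, tag) tuples.
df = pd.DataFrame({
    "POS_Description": [
        [("Alice", "NNP"), ("visited", "VBD"), ("Paris", "NNP")],
        [("the", "DT"), ("cat", "NN"), ("slept", "VBD")],
    ]
})


def proper_nouns(tagged):
    # Keep only the tokens whose tag marks a singular proper noun.
    return [token for token, tag in tagged if tag == "NNP"]


df["Proper Nouns"] = df["POS_Description"].apply(proper_nouns)
```

Rows without any proper noun get an empty list, which may or may not be the desired placeholder.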

accessing specific columns of a dataframe, index specified by idxmax()

I have a dataframe row from which I would like to access specific columns. The index for this row comes from an idxmax call:
idx_interest=(df['colA']==matchingstring).idxmax()
Using this index, I want to access specific columns, namely colB and colD, of df at index idx_interest:
df.loc[idx_interest,['colB','colD']].reset_index().values.tolist()
However, doing so gave me the error: cannot perform reduce on flexible type. How do I go about accessing columns of a df at an index given by an idxmax call?
You first need to apply your filter to your dataframe df correctly in order to obtain idx_interest. If your original dataframe has a MultiIndex, be mindful that this will return a tuple:
idx_interest = df[df['colA']==matchingstring].idxmax()
Now that you have idx_interest, you can limit your dataframe to the columns you want and then use .iloc[] to select the row by position:
df[['colB','colD']].iloc[idx_interest].values.tolist()
The code you provided above will also work, assuming that idx_interest is an int:
df.loc[idx_interest,['colB','colD']].reset_index().values.tolist()
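A minimal sketch of one way to do the lookup, using made-up data. It relies on idxmax returning an index label when called on a boolean Series, which is why .loc (label-based) is used rather than .iloc (position-based). The guard on mask.any() is an addition here: without a match, idxmax on an all-False mask still returns the first label.

```python
import pandas as pd

# Made-up frame; column names follow the question.
df = pd.DataFrame({
    "colA": ["x", "target", "y"],
    "colB": [1, 2, 3],
    "colC": [0.1, 0.2, 0.3],
    "colD": ["a", "b", "c"],
})
matchingstring = "target"

# Boolean Series marking the rows where colA matches.
mask = df["colA"] == matchingstring

if mask.any():
    # Label of the first True entry in the mask.
    idx_interest = mask.idxmax()
    # Select the row by label and the two columns by name.
    values = df.loc[idx_interest, ["colB", "colD"]].tolist()
else:
    values = []
```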
