Accessing groups in Pandas lambda function - python

I have a Pandas dataframe with a multiindex. Level 0 is 'Strain' and level 1 is 'JGI library.' Each 'Strain' has several 'JGI library' columns associated with it. I would like to use a lambda function to apply a t-test to compare two different strains. To troubleshoot, I have been taking one row of my dataframe using the .iloc[0] command.
row = pvalDf.iloc[0]
parent = 'LL1004'
child = 'LL345'
ttest_ind(row.groupby(level='Strain').get_group(parent), row.groupby(level='Strain').get_group(child))[1]
This works as expected. Now I try to apply it to my whole dataframe
parent = 'LL1004'
child = 'LL345'
pvalDf = countsDf4.apply(lambda row: ttest_ind(row.groupby(level='Strain').get_group(parent), row.groupby(level='Strain').get_group(child))[1])
Now I get an error message saying, "ValueError: ('level name Strain is not the name of the index', 'occurred at index (LL1004, BCHAC)')"
'LL1004' is a 'Strain,' but Pandas doesn't seem to be aware of this. It looks like maybe the multiindex was not passed to the lambda function correctly? Is there a better way to troubleshoot lambda functions than using .iloc[0]?
I put a copy of my Jupyter notebook and an excel file with the countsDf4 dataframe on Github https://github.com/danolson1/pandas_ttest
Thanks,
Dan

How about, more simply:
pvalDf = countsDf4.apply(lambda row: ttest_ind(row[parent], row[child]), axis=1)
I've tested it on your notebook and it works.
Your problem is that DataFrame.apply() by default applies the function to each column, not to each row. So, you need to specify the axis=1 parameter to override the default behavior and apply the function row by row.
Also, there's no reason to use row.groupby(level='Strain').get_group(x) when you could simply index the group of columns by row[x]. :)
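To see why the axis matters, here is a minimal sketch. It uses a small made-up frame in place of countsDf4 (the gene names, library names and values are hypothetical), and a plain difference of means stands in for ttest_ind so that the example only needs pandas:

```python
import pandas as pd

# Hypothetical stand-in for countsDf4: rows are genes, columns are a
# two-level MultiIndex of (Strain, JGI library).
columns = pd.MultiIndex.from_tuples(
    [("LL1004", "A"), ("LL1004", "B"), ("LL345", "C"), ("LL345", "D")],
    names=["Strain", "JGI library"],
)
counts = pd.DataFrame(
    [[1.0, 3.0, 5.0, 7.0], [2.0, 4.0, 2.0, 4.0]],
    index=["gene1", "gene2"],
    columns=columns,
)

parent = "LL1004"
child = "LL345"

# With the default axis=0, apply() hands the function one column at a time.
# A column is indexed by the row labels, so it has no 'Strain' level.
column_level_names = list(counts.iloc[:, 0].index.names)

# With axis=1, apply() hands the function one row at a time.
# A row is indexed by the column MultiIndex, so 'Strain' is available.
row_level_names = list(counts.iloc[0].index.names)

# row[parent] selects every library belonging to that strain.
mean_diff = counts.apply(
    lambda row: row[child].mean() - row[parent].mean(), axis=1
)
print(column_level_names, row_level_names)
print(mean_diff)
```

Inspecting counts.iloc[:, 0] alongside counts.iloc[0] is also a quick way to check what a lambda will receive under each axis setting before calling apply on the whole frame.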

Related

How can I get a single column out of a spark dataframe?

I would like to take a single column out of my spark dataframe.
And I would like to put the latitude in a variable, and the longitude.
When I do this;
I only get the column name.
A more reliable way to select a column is to use the col() function, which tells Spark that the argument is a column rather than a plain string and does not tie the expression to one particular DataFrame variable. (Here F is assumed to be pyspark.sql.functions.)
from pyspark.sql import functions as F
df = df.select(F.col("some_column_name"))
Likewise, in a filter operation, use lit() so that Spark treats the value as a string literal:
df = df.filter(F.col("some_column_name") == F.lit("a_string"))
Well all you need to do is :
Lats = [row[0] for row in df.select('latitude').collect()]
print(Lats)

How do I rename a DataFrame column using a lambda?

I'm doing:
df.apply(lambda x: x.rename(x.name + "_something"))
I think this should return the column with _something appended to all columns, but it just returns the same df.
What am I doing wrong?
EDIT: I need to act on the series column by column, not on the dataframe object, as I'll be applying other transformations to x in the lambda, not shown here.
EDIT 2 Full Context:
I've got a time series dataframe, and I'm trying to generate features from the data.
I've written a bunch of primitive functions like:
def sumn(n, s):
    return s.rolling(n).sum().rename(s.name + "_sum_" + str(n))
When I apply those to Series, it renames them well.
When I apply them to columns in a DataFrame, the numerical transformation goes through, but the rename doesn't work.
(I suppose it implies that a DataFrame isn't just a collection of Series, which means in all likelihood, I now have to explicitly rename things on the df)
I think you can do this using pd.concat:
pd.concat([df[e].rename(df[e].name + '_Something') for e in df], axis=1)
Inside the list comprehension, you can add your other logic:
df[e].rename(df[e].name+'_Something').apply(...)
If you use df.apply directly, you can't change the column names this way: as far as I can tell, the result is reassembled under the original column labels. I can't think of a way around that.
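As a rough sketch of the pd.concat approach combined with the sumn helper from the question (the sample frame and its column names a and b are made up for illustration):

```python
import pandas as pd


def sumn(n, s):
    """Rolling sum over n rows, renamed to '<original name>_sum_<n>'."""
    return s.rolling(n).sum().rename(f"{s.name}_sum_{n}")


# Hypothetical sample data.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

# As reported in the question, df.apply keeps the original column labels,
# so the rename done inside sumn is not visible in the result.
applied = df.apply(lambda s: sumn(2, s))
print(list(applied.columns))

# Building each renamed Series separately and concatenating them
# column-wise keeps the new names.
features = pd.concat([sumn(2, df[col]) for col in df], axis=1)
print(list(features.columns))
```

Since each feature is built as its own Series, several helpers can be chained in the same comprehension before the final concat.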

apply function in pandas to create two columns

I have a Pandas DataFrame called ebola, as seen below. The variable column holds two pieces of information: a status (either Cases or Deaths) and a country name. I am trying to create two new columns, status and country, from that variable column using the .apply() function. However, since I am trying to extract two values at once, it does not work.
# let's create a splitter function
def splitter(column):
    status, country = column.split("_")
    return status, country
# apply this function to that column and assign to two new columns
ebola[['status', 'country']] = ebola['variable'].apply(splitter)
The error I get is
ValueError: Must have equal len keys and value when setting with an iterable
I want my output to be like this
Use Series.str.split
ebola[['status','country']]=ebola['variable'].str.split(pat='_',expand=True)
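A small runnable sketch of the difference, using the sample values from the question: Series.apply with a function that returns a tuple yields a single column of tuples, whereas str.split with expand=True yields one column per piece, which is what the two-column assignment needs.

```python
import pandas as pd

ebola = pd.DataFrame({
    "Date": ["3/24/2014", "3/22/2014", "1/15/2015", "1/4/2015"],
    "variable": ["Cases_Guinea", "Cases_Guinea", "Cases_Liberia", "Cases_Liberia"],
})


def splitter(value):
    status, country = value.split("_")
    return status, country


# Series.apply returns ONE Series whose elements are tuples,
# so it cannot be assigned to two columns directly.
tuples = ebola["variable"].apply(splitter)
print(tuples.iloc[0])

# expand=True returns a DataFrame with one column per split piece.
ebola[["status", "country"]] = ebola["variable"].str.split(pat="_", expand=True)
print(ebola)
```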
This is a very late follow-up to my original question. Thanks to @ansev, whose solution worked great. While going back over my question, I tried to develop a solution based on my first approach. I was able to work it out and wanted to share it for anyone who might want a different perspective on this.
update to my code:
# let's create a splitter function
def splitter(column):
    for row in column:
        status, country = row.split("_")
    return status, country
# apply this function to that column and assign to two new columns
ebola[['status', 'country']] = ebola['variable'].to_frame().apply(splitter, axis=1, result_type='expand')
I made two changes to my code to get it working:
Instead of working on the Series, I converted it to a DataFrame using the .to_frame() method.
In my splitter function, I had to iterate over the values in each row, since with axis=1 the function receives a whole row rather than a single string. Therefore, I added the for row in column line.
To replicate all of this:
import numpy as np
import pandas as pd
# create the data
ebola_dict = {'Date': ['3/24/2014', '3/22/2014', '1/15/2015', '1/4/2015'],
              'variable': ['Cases_Guinea', 'Cases_Guinea', 'Cases_Liberia', 'Cases_Liberia']}
ebola = pd.DataFrame(ebola_dict)
print(ebola)
# let's create a splitter function
def splitter(column):
    for row in column:
        status, country = row.split("_")
    return status, country
# apply this function to that column and assign to two new columns
ebola[['status', 'country']] = ebola['variable'].to_frame().apply(splitter, axis=1, result_type='expand')
# check if it worked
print(ebola)

Pandas Group By and Sum , Header being removed

After I run the following code, I seem to lose the headers of my dataframe. If I remove the line below, my headers exist.
unifiedview = unifiedview.groupby(['key','MTM'])['MTM'].sum()
When I use to_csv, my Excel file has no headers.
I've tried:
unifiedview = unifiedview.groupby(['key','MTM'], as_index = False)['MTM'].sum()
unifiedview = unifiedview.reset_index()
Any help would be appreciated.
Calling
unifiedview.groupby(['key','MTM'])['MTM']
selects only the 'MTM' column from the grouped data.
Therefore, the expression
unifiedview.groupby(['key','MTM'])['MTM'].sum()
returns a Series holding the sums of the grouped 'MTM' column, with 'key' and 'MTM' moved into the index rather than kept as column headers.
unifiedview.groupby(['key','MTM']).sum().reset_index()
should instead return a DataFrame with the sums of all int or float columns in unifiedview.
Are you looking to preserve all columns from the original dataframe?
Also, you must place an aggregate function after the groupby clause...
unifiedview.groupby(['key','MTM']) must have a .count(), .sum(), .mean(), ... method to group your columns...
unifiedview.groupby(['key','MTM']).sum()
unifiedview.groupby(['key','MTM']).count()
unifiedview.groupby(['key','MTM']).mean()
Is this helping you get in the right direction?
What version of pandas are you using? If you check the documentation it states:
Write out the column names. If a list of strings is given it is assumed to be aliases for the column names.
Changed in version 0.24.0: Previously defaulted to False for Series
Since you are transforming your dataframe into a series object this might be the cause of your issue.
The documentation can be found here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
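The points above can be sketched on a tiny made-up frame (the values are hypothetical). Note one assumption: the example groups by 'key' only, rather than by both 'key' and 'MTM' as in the question, so that 'MTM' is free to hold the summed values.

```python
import pandas as pd

# Hypothetical sample data standing in for unifiedview.
unifiedview = pd.DataFrame({
    "key": ["a", "a", "b"],
    "MTM": [1.0, 2.0, 3.0],
})

# Selecting one column and aggregating gives a Series;
# the group labels live in its index, not in a column.
summed = unifiedview.groupby("key")["MTM"].sum()
print(type(summed).__name__)

# as_index=False keeps the group labels as a regular column,
# so the result is a DataFrame with both headers.
table = unifiedview.groupby("key", as_index=False)["MTM"].sum()
print(list(table.columns))

# Writing the DataFrame out includes the header row.
csv_text = table.to_csv(index=False)
print(csv_text)
```

Keeping the result as a DataFrame avoids depending on how a given pandas version writes headers for a Series.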

Selecting Various "Pieces" of a List

I have a list of columns from a Pandas DataFrame and am looking to create a list of certain columns without manual entry.
My issue is that I am still learning and not knowledgeable enough yet.
I have tried searching around the internet but nothing was quite my case. I apologize if there is a duplicate.
The list I am trying to cut from looks like this:
['model',
'displ',
'cyl',
'trans',
'drive',
'fuel',
'veh_class',
'air_pollution_score',
'city_mpg',
'hwy_mpg',
'cmb_mpg',
'greenhouse_gas_score',
'smartway']
Here is the code that I wrote on my own: dataframe.columns.tolist()[:6,8:10,11]
In this case, I am trying to select everything except 'air_pollution_score' and 'greenhouse_gas_score'.
My ultimate goal is to understand the syntax and how to select pieces of a list.
You could do that, or you could just use drop to remove the columns you don't want:
dataframe.drop(['air_pollution_score', 'greenhouse_gas_score'], axis=1).columns
Note that you need to specify axis=1 so that pandas knows you want to remove columns, not rows.
Even if you wanted to use list syntax, I would say that it's better to use a list comprehension instead; something like this:
exclude_columns = ['air_pollution_score', 'greenhouse_gas_score']
[col for col in dataframe.columns if col not in exclude_columns]
This gets all the columns in the dataframe unless they are present in exclude_columns.
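On the list syntax itself: a plain Python list accepts a single integer or a single slice inside the brackets, so an index like [:6,8:10,11] is interpreted as a tuple and raises a TypeError. One way to pick several pieces is to slice them separately and join the results with +. A minimal sketch using the column list from the question:

```python
columns = [
    "model", "displ", "cyl", "trans", "drive", "fuel", "veh_class",
    "air_pollution_score", "city_mpg", "hwy_mpg", "cmb_mpg",
    "greenhouse_gas_score", "smartway",
]

# A slice a:b covers positions a up to, but not including, b.
# 'air_pollution_score' is at position 7 and 'greenhouse_gas_score' at 11,
# so take 0-6, then 8-10, then 12 onward.
selected = columns[:7] + columns[8:11] + columns[12:]
print(selected)

# Passing several slices at once is not valid for a list.
try:
    columns[:6, 8:10, 11]
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__
print(error_name)
```

Slicing by position is brittle if the column order changes, which is one reason the name-based approaches above are usually preferable.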
Let's say df is your dataframe. You can actually use filter and a lambda, though it quickly becomes too long. I present this as a "one-liner" alternative to the answer of @gmds.
df[
    list(filter(
        lambda x: ('air_pollution_score' not in x) and ('greenhouse_gas_score' not in x),
        df.columns.values
    ))
]
What's happening here:
filter applies a function to a list and keeps only the elements for which that function returns True.
We defined that function using lambda; it checks that neither 'air_pollution_score' nor 'greenhouse_gas_score' appears in the column name.
We're filtering the df.columns.values list, so the resulting list retains only the elements that weren't the ones we mentioned.
We're using the df[['column1', 'column2']] syntax, which means "make a new dataframe containing only the columns I list."
Simple solution with pandas
import pandas as pd
data = pd.read_csv('path to your csv file')
df = data[['column1', 'column2', 'column3', ....]]
Note: data is the source you have already loaded using pandas, and the column names must be passed as a list (hence the double brackets). The selected columns will be stored in a new dataframe, df.
