Working with the output of groupby and groupby.size() - python

I have a pandas dataframe containing a row for each object manipulated by participants during a user study. Each participant takes part in the study 3 times, once in each of 3 conditions (a, b, c), working with around 300-700 objects in each condition.
When I report the number of objects worked with, I want to make sure that this didn't vary significantly by condition (I don't expect it to have, but I need to confirm this statistically).
I think I want to run an ANOVA to compare the 3 conditions, but I can't figure out how to get the data I need for the ANOVA.
I currently have some pandas code to group the data and count the number of rows per participant per condition (so I can then use mean() and similar to summarise the data). An example with a subset of the data follows:
>>> tmp = df.groupby([FIELD_PARTICIPANT, FIELD_CONDITION]).size()
>>> tmp
participant_id condition
1 a 576
2 b 367
3 a 703
4 c 309
dtype: int64
To calculate the ANOVA I would normally just filter these by the condition column, e.g.
cond1 = tmp[tmp[FIELD_CONDITION] == CONDITION_A]
cond2 = tmp[tmp[FIELD_CONDITION] == CONDITION_B]
cond3 = tmp[tmp[FIELD_CONDITION] == CONDITION_C]
f_val, p_val = scipy.stats.f_oneway(cond1, cond2, cond3)
However, since tmp is a Series rather than the DataFrame I'm used to, I can't figure out how to achieve this in the normal way.
>>> tmp[FIELD_CONDITION]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/pandas/core/series.py", line 583, in __getitem__
result = self.index.get_value(self, key)
File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 626, in get_value
raise e1
KeyError: 'condition'
>>> type(tmp)
<class 'pandas.core.series.Series'>
>>> tmp.index
MultiIndex(levels=[[u'1', u'2', u'3', u'4'], [u'a', u'b', u'c']],
           labels=[[0, 1, 2, 3], [0, 1, 0, 2]],
           names=[u'participant_id', u'condition'])
I feel sure this is a straightforward problem to solve, but I can't seem to get there without some help :)

I think you need reset_index, and then the output is a DataFrame:
tmp = df.groupby([FIELD_PARTICIPANT, FIELD_CONDITION]).size().reset_index(name='count')
Sample:
import pandas as pd
df = pd.DataFrame({'participant_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 3, 8: 4, 9: 4},
                   'condition': {0: 'a', 1: 'a', 2: 'a', 3: 'a', 4: 'b', 5: 'b', 6: 'b', 7: 'a', 8: 'c', 9: 'c'}})
print (df)
condition participant_id
0 a 1
1 a 1
2 a 1
3 a 1
4 b 2
5 b 2
6 b 2
7 a 3
8 c 4
9 c 4
tmp = df.groupby(['participant_id', 'condition']).size().reset_index(name='count')
print (tmp)
participant_id condition count
0 1 a 4
1 2 b 3
2 3 a 1
3 4 c 2
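With tmp as a DataFrame, the filtering from the question then works the usual way. A minimal sketch of the ANOVA step, assuming scipy is installed and that FIELD_CONDITION maps to 'condition' with labels 'a', 'b', 'c':
import scipy.stats

# select the per-participant counts for each condition
cond1 = tmp.loc[tmp['condition'] == 'a', 'count']
cond2 = tmp.loc[tmp['condition'] == 'b', 'count']
cond3 = tmp.loc[tmp['condition'] == 'c', 'count']
f_val, p_val = scipy.stats.f_oneway(cond1, cond2, cond3)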
If you need to keep working with the Series, you can build a boolean mask by selecting the values of the condition level of the MultiIndex with get_level_values:
tmp = df.groupby(['participant_id', 'condition']).size()
print (tmp)
participant_id condition
1 a 4
2 b 3
3 a 1
4 c 2
dtype: int64
print (tmp.index.get_level_values('condition'))
Index(['a', 'b', 'a', 'c'], dtype='object', name='condition')
print (tmp.index.get_level_values('condition') == 'a')
[ True False True False]
print (tmp[tmp.index.get_level_values('condition') == 'a'])
participant_id condition
1 a 4
3 a 1
dtype: int64
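The same masks can feed the ANOVA straight from the Series; a sketch under the same assumptions as above:
import scipy.stats

conditions = tmp.index.get_level_values('condition')
groups = [tmp[conditions == c] for c in ('a', 'b', 'c')]
f_val, p_val = scipy.stats.f_oneway(*groups)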

Related

TypeError: unsupported operand type(s) for &: 'str' and 'bool' for DF filtering

I am trying to filter my dataframe such that when I create a new column, mediumrating, it displays the "medium" rating. My dataframe has str values, so I convert them to numbers based on a ranking system I have, and then I filter out the maximum and minimum rating per row.
I am running into this error:
TypeError: unsupported operand type(s) for &: 'str' and 'bool'
I've created a data frame that pulls str values from my csv file:
df = pd.read_csv('csv path', usecols=['rating1','rating2','rating3'])
And my dataframe looks like this:
rating1 rating2 rating3
0 D D C
1 C B A
2 B B B
I need it to look like this
rating1 rating2 rating3 mediumrating
0 D D C 1
1 C B A 3
2 B B B 3
I have a mapping dictionary that converts the values to numbers.
ranking = {
    'D': 1, 'C': 2, 'B': 3, 'A': 4
}
Below you can find the code I use to determine the "medium rating". Basically, if all the ratings are the same, you can pull the minimum rating. If two of the ratings are the same, pull in the lowest rating. If the three ratings differ, filter out the max rating and the min rating.
if df == df.loc[(['rating1'] == df['rating2'] & df['rating1'] == df['rating3'])]:
df['mediumrating'] = df.replace(ranking).min(axis=1)
elif df == df.loc[(['rating1'] == df['rating2'] | df['rating1'] == df['rating3'] | df['rating2'] == df['rating3'])]:
df['mediumrating'] = df.replace(ranking).min(axis=1)
else:
df['mediumrating'] == df.loc[(df.replace(ranking) > df.replace(ranking).min(axis=1) & df.replace(ranking)
Any help on my formatting or process would be welcomed!!
Use np.where:
For the condition, use df.nunique applied to axis=1 and check if the result equals either 1 (all values are the same) or 2 (two different values) with Series.isin.
If True, we need df.min along axis=1.
If False (all unique values), we need df.median along axis=1.
Finally, use astype to turn resulting floats into integers.
import pandas as pd
import numpy as np
data = {'rating1': {0: 'D', 1: 'C', 2: 'B'},
        'rating2': {0: 'D', 1: 'B', 2: 'B'},
        'rating3': {0: 'C', 1: 'A', 2: 'B'}}
df = pd.DataFrame(data)
ranking = {'D': 1, 'C':2, 'B': 3, 'A' : 4}
df['mediumrating'] = np.where(df.replace(ranking).nunique(axis=1).isin([1, 2]),
                              df.replace(ranking).min(axis=1),
                              df.replace(ranking).median(axis=1)).astype(int)
print(df)
rating1 rating2 rating3 mediumrating
0 D D C 1
1 C B A 3
2 B B B 3
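Since df.replace(ranking) is evaluated three times above, a variant that names the ranked frame once (same logic, purely a readability choice) could be:
ranked = df.replace(ranking)  # numeric view of the ratings
df['mediumrating'] = np.where(ranked.nunique(axis=1).isin([1, 2]),
                              ranked.min(axis=1),
                              ranked.median(axis=1)).astype(int)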
It took me a second to understand what you really meant by "filter". Here is some code that should be self-explanatory and should achieve what you're looking for:
# Import pandas library
import pandas as pd
# initialize list of lists
data = [['D', 'D', 'C'], ['C', 'B', 'A'], ['B', 'B', 'B']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['rating1', 'rating2', 'rating3'])
# dictionary that maps the rating to a number
rating_map = {'D': 1, 'C': 2, 'B': 3, 'A': 4}
def rating_to_number(rating1, rating2, rating3):
    if rating1 == rating2 and rating2 == rating3:
        return rating_map[rating1]
    elif rating1 == rating2 or rating1 == rating3 or rating2 == rating3:
        return min(rating_map[rating1], rating_map[rating2], rating_map[rating3])
    else:
        return rating_map[sorted([rating1, rating2, rating3])[1]]
# create a new column by applying rating_to_number to the three rating columns row by row
df['mediumrating'] = df.apply(lambda x: rating_to_number(x['rating1'], x['rating2'], x['rating3']), axis=1)
print(df)
This prints out:
rating1 rating2 rating3 mediumrating
0 D D C 1
1 C B A 3
2 B B B 3
Edit: updated rating_to_number based on your updated question

Pandas: adjust value of DataFrame that is sliced multiple times

Imagine I have the follow Pandas.DataFrame:
df = pd.DataFrame({
    'type': ['A', 'A', 'A', 'B', 'B', 'B'],
    'value': [1, 2, 3, 4, 5, 6]
})
I want to adjust the first value when type == 'B' to 999, i.e. the fourth row's value should become 999.
Initially I imagined that
df.loc[df['type'] == 'B'].iloc[0, -1] = 999
or something similar would work. But as far as I can see, slicing the df twice does not point to the original df anymore so the value of the df is not updated.
My other attempt is
df.loc[df.loc[df['type'] == 'B'].index[0], df.columns[-1]] = 999
which works, but is quite ugly.
So I'm wondering -- what would be the best approach in such situation?
You can use idxmax, which returns the index of the first occurrence of the maximum value. On a boolean Series the maximum is True, so it gives the label of the first matching row:
df.loc[(df['type'] == 'B').idxmax(), 'value'] = 999
Output:
type value
0 A 1
1 A 2
2 A 3
3 B 999
4 B 5
5 B 6
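To see why this works, a quick check of the boolean mask (idxmax picks out the label of the first True):
mask = df['type'] == 'B'
print(mask.idxmax())
# 3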

Merging multiple dataframe lines into aggregate lines

For the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': {0: "A", 1: "A", 2: "A", 3: "B"},
                   'Spec1': {0: '1', 1: '3', 2: '5', 3: '1'},
                   'Spec2': {0: '2a', 1: np.nan, 2: np.nan, 3: np.nan}},
                  columns=['Name', 'Spec1', 'Spec2'])
Name Spec1 Spec2
0 A 1 2a
1 A 3 NaN
2 A 5 NaN
3 B 1 NaN
I would like to aggregate the columns into:
Name Spec
0 A 1,3,5,2a
1 B 1
Is there a more "pandas" way of doing this than just looping and keeping track of the values?
Or using melt:
(df.melt('Name')
   .groupby('Name').value
   .apply(lambda x: ','.join(x.dropna()))
   .reset_index()
   .rename(columns={'value': 'spec'}))
Out[2226]:
Name spec
0 A 1,3,5,2a
1 B 1
Another way
In [966]: (df.set_index('Name').unstack()
.dropna().reset_index()
.groupby('Name')[0].apply(','.join))
Out[966]:
Name
A 1,3,5,2a
B 1
Name: 0, dtype: object
Group rows by name, combine column values as a list, dropping NaN:
df = df.groupby('Name').agg(lambda x: list(x.dropna()))
Spec1 Spec2
Name
A [1, 3, 5] [2a]
B [1] []
Now merge Spec1 and Spec2 lists. Bring Name back as a column. Name the new Spec column.
df = (df.Spec1 + df.Spec2).reset_index().rename(columns={0:"Spec"})
Name Spec
0 A [1, 3, 5, 2a]
1 B [1]
Finally, convert Spec lists to string representations:
df.Spec = df.Spec.apply(','.join)
Name Spec
0 A 1,3,5,2a
1 B 1

how to re-arrange multiple columns into one column with same index

I'm using Python pandas, and I have multiple columns sharing the same index that I want to rearrange into one column. Where possible, I also want to delete the zero values.
I have this data frame
index A B C
a 8 0 1
b 2 3 0
c 0 4 0
d 3 2 7
I'd like my output to look like this
index data value
a A 8
b A 2
d A 3
b B 3
c B 4
d B 2
a C 1
d C 7
===
I solved this task as below. My original data has 2 indexes, and the zeros in the dataframe were NaN values.
At first I tried to apply the melt function while removing NaN values, following this (How to melt a dataframe in Pandas with the option for removing NA values), but I couldn't, because my original data has several columns ('value_vars'). So I reorganized the dataframe in 2 steps, as sketched below:
Firstly, I turned the multiple columns into one column with the melt function,
then removed the NaN values row by row with the dropna function.
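A minimal sketch of those two steps on the example frame above (treating the zeros as missing values; the column names are illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [8, 2, 0, 3],
                   'B': [0, 3, 4, 2],
                   'C': [1, 0, 0, 7]},
                  index=pd.Index(['a', 'b', 'c', 'd'], name='index'))
# step 1: melt the columns into a single 'value' column, keeping the index as a column
melted = df.reset_index().melt(id_vars='index', var_name='data', value_name='value')
# step 2: treat zeros as missing and drop those rows
# ('value' becomes float once NaN appears; cast back with astype(int) if needed)
melted = melted.replace(0, np.nan).dropna(subset=['value'])
print(melted)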
This looks a little like the melt function in pandas, with the only difference being the index.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html
Here is some code you can run to test:
import pandas as pd
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},'B': {0: 1, 1: 3, 2: 5},'C': {0: 2, 1: 4, 2: 6}})
pd.melt(df)
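For reference, pd.melt(df) on this frame stacks every column into variable/value pairs, giving roughly:
  variable value
0        A     a
1        A     b
2        A     c
3        B     1
4        B     3
5        B     5
6        C     2
7        C     4
8        C     6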
With a little manipulation, you could solve for the indexing issue.
This is not particularly pythonic, but if you have a limited number of columns, you could make do with:
molten = pd.melt(df)
a = molten.merge(df, left_on='value', right_on='A')
b = molten.merge(df, left_on='value', right_on='B')
c = molten.merge(df, left_on='value', right_on='C')
merge = pd.concat([a, b, c])
try this:
a = [['a', 8, 0, 1], ['b', 2, 3, 0] ... ]
cols = ['A', 'B', 'C']
result = [[[a[i][0], cols[j], a[i][j + 1]] for i in range(len(a))] for j in range(len(cols))]
output:
[[['a', 'A', 8], ['b', 'A', 2]], [['a', 'B', 0], ['b', 'B', 3]] ... ]

Python Pandas - Combining Multiple Columns into one Staggered Column

How do you combine multiple columns into one staggered column? For example, if I have data:
Column 1 Column 2
0 A E
1 B F
2 C G
3 D H
And I want it in the form:
Column 1
0 A
1 E
2 B
3 F
4 C
5 G
6 D
7 H
What is a good, vectorized pythonic way to go about doing this? I could probably do some sort of df.apply() hack but I'm betting there is a better way. The application is putting multiple dimensions of time series data into a single stream for ML applications.
First stack the columns and then drop the multiindex:
df.stack().reset_index(drop=True)
Out:
0 A
1 E
2 B
3 F
4 C
5 G
6 D
7 H
dtype: object
To get a dataframe:
pd.DataFrame(df.values.reshape(-1, 1), columns=['Column 1'])
For a Series answering the OP's question:
pd.Series(df.values.flatten(), name='Column 1')
For a Series used in the timing tests below:
pd.Series(get_df(n).values.flatten(), name='Column 1')
Timing
code
def get_df(n=1):
    df = pd.DataFrame({'Column 2': {0: 'E', 1: 'F', 2: 'G', 3: 'H'},
                       'Column 1': {0: 'A', 1: 'B', 2: 'C', 3: 'D'}})
    return pd.concat([df for _ in range(n)])
[Timing plots omitted: the given sample, the given sample × 10,000, and the given sample × 1,000,000.]
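A sketch of how such timings could be re-run with the standard-library timeit (the scale factors mirror the plot titles):
import timeit

for n in (1, 10000):  # the original plots also cover n = 1,000,000
    big = get_df(n)
    elapsed = timeit.timeit(lambda: pd.Series(big.values.flatten(), name='Column 1'),
                            number=10)
    print(n, elapsed / 10)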
