I currently have a dataset with two index levels, year and zip code, but multiple observations (prices) per zip code. How can I get the average price per zip code, so that I only have distinct observations per zip code and year?
Use Series.mean with the level parameter:
df = s.mean(level=[0,1])
Sample:
import pandas as pd

s = pd.DataFrame({
    'B': [5, 5, 4, 5, 5, 4],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
}).set_index(['F', 'B'])['E']
print(s)
F B
a 5 5
5 3
4 6
b 5 9
5 2
4 4
Name: E, dtype: int64
df = s.mean(level=[0,1]).reset_index()
print(df)
F B E
0 a 5 4.0
1 a 4 6.0
2 b 5 5.5
3 b 4 4.0
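Note: the level parameter of mean is deprecated in newer pandas versions and removed in 2.0. An equivalent that works on all versions is grouping by the index levels (row order may differ, since groupby sorts by default; pass sort=False to keep the original order):
df = s.groupby(level=[0, 1]).mean().reset_index()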
I'm currently trying to analyze rolling correlations of a dataset with four compared values, but I only need the output rows containing 'a'.
I got my DataFrame by using the command newdf = df.rolling(3).corr()
Sample input (random numbers)
a b c d
1 a
1 b
1 c
1 d
2 a
2 b
2 c
2 d
3 a
3 b 5 6 3
3 c 4 3 1
3 d 3 4 2
4 a 1 3 5 6
4 b 6 2 4 1
4 c 8 6 6 7
4 d 2 5 4 6
5 a 2 5 4 1
5 b 1 4 6 3
5 c 2 6 3 7
5 d 3 6 3 7
and need the output
a b c d
1 a 1 3 5 6
2 a 2 5 4 1
I've tried filtering it by doing adf = newdf.filter(['a'], axis=0); however, that gets rid of everything, and doing it on the other axis filters by column. Unfortunately the column containing the values a, b, c, d is unnamed, so I can't filter that column individually. This wouldn't be an issue, though, if it's possible to flip the rows and columns so that the values are listed by index, to get the desired output.
Try using loc. Put the column of abcdabcd ... values into the index and just use loc:
df.loc['a']
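On the rolling-correlation result itself, which already carries a (window, letter) MultiIndex, a minimal sketch of the same idea is to move the letter level to the front first:
# bring the letter level to the front, then plain .loc selects by letter
newdf.swaplevel().loc['a']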
The actual source of the problem in your case is that your DataFrame
has a MultiIndex.
So when you execute newdf.filter(['a'], axis=0), you are asking to keep
rows whose index is exactly the string "a".
But since your DataFrame has a MultiIndex, each row with "a" at
level 1 also contains some number at level 0, so nothing matches.
To get your intended result, run:
newdf.filter(like='a', axis=0)
maybe followed by .dropna().
An alternative solution is:
newdf.xs('a', level=1, drop_level=False)
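For illustration, a small self-contained run of both approaches (the column names and random data here are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(6, 4), columns=list('abcd'))
newdf = df.rolling(3).corr()  # rows get a (window, column) MultiIndex

# keep rows whose label tuple contains 'a'; drop the all-NaN first windows
print(newdf.filter(like='a', axis=0).dropna())
# the same rows via a cross-section that keeps the letter level
print(newdf.xs('a', level=1, drop_level=False))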
I have a dataset, df, where I would like to:
filter the values in the 'id' column, group these values, take their average, and then sum these averages
id use free total G_Used G_Free G_Total
a 5 5 10 4 1 5
b 14 6 20 5 1 6
a 10 5 15 9 1 10
c 6 4 10 10 10 20
b 10 5 15 5 5 10
b 5 5 10 1 4 5
c 4 1 5 3 1 4
Desired Output
use free total
12.5 7.5 20
filter only values that contain 'a' or 'c'
group by each id
take the mean of the 'use', 'free' and 'total' columns
sum these values
Intermediate steps:
filter out only the a and c values
id use free total G_Used G_Free G_Total
a 5 5 10 4 1 5
a 10 5 15 9 1 10
c 6 4 10 10 10 20
c 4 1 5 3 1 4
take mean of a
a
use free total
7.5 5 12.5
take mean of c
c
use free total
5 2.5 7.5
sum both a and c values for final desired output
use free total
12.5 7.5 20
This is what I am doing; however, the syntax is not correct for some of the code. I am still researching. Any suggestion is appreciated.
df1 = df[df.id = 'a' | 'b']
df2 = df1.groupby(['id'], as_index=False).agg({'use': 'mean', 'free': 'mean', 'total': 'mean'})
df3= df2.sum(['id'], axis = 0)
Use Series.isin to test membership first, and then filter the columns and take the mean; the output is summed, and the resulting Series is converted to a one-row DataFrame by Series.to_frame and DataFrame.T for the transpose:
df1 = df[df.id.isin(['a','c'])]
df2 = df1.groupby('id')[['use','free','total']].mean().sum().to_frame().T
Your solution is similar; it just uses GroupBy.agg:
df1 = df[df.id.isin(['a','c'])]
df2 = df1.groupby('id').agg({'use': 'mean', 'free': 'mean', 'total': 'mean'}).sum().to_frame().T
print (df2)
use free total
0 12.5 7.5 20.0
I have a dataframe with a bunch of columns labelled in 'YYYY-MM' format, along with several other columns. I need to collapse the date columns into calendar quarters and take the mean; I was able to do it manually, but there are a few hundred date columns in my real data and I'd like not to have to map every single one of them by hand. I'm generating the initial df from a CSV; I didn't see anything in read_csv that seemed like it would help, but if there's anything I can leverage there, that would be great. I found Series.dt.to_period("Q"), which will convert a datetime object to a quarter, but I'm not quite sure how to apply it here, if I can at all.
Here's a sample df (code below):
foo bar 2016-04 2016-05 2016-06 2016-07 2016-08
0 6 5 3 3 5 8 1
1 9 3 6 9 9 7 8
2 8 5 8 1 9 9 4
3 5 8 1 2 3 5 6
4 4 5 1 2 7 2 6
This code will do what I'm looking for, but I had to generate the mapping by hand:
mapping = {'2016-04':'2016q2', '2016-05':'2016q2', '2016-06':'2016q2', '2016-07':'2016q3', '2016-08':'2016q3'}
df = df.set_index(['foo', 'bar']).groupby(mapping, axis=1).mean().reset_index()
New df:
foo bar 2016q2 2016q3
0 6 5 3.666667 4.5
1 9 3 8.000000 7.5
2 8 5 6.000000 6.5
3 5 8 2.000000 5.5
4 4 5 3.333333 4.0
Code to generate the initial df:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 11, size=(5, 7)),
                  columns=('foo', 'bar', '2016-04', '2016-05', '2016-06', '2016-07', '2016-08'))
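One way to avoid writing the mapping by hand is to generate it from the column names; a sketch, assuming every date column is a 'YYYY-MM' string and the non-date columns are 'foo' and 'bar':
mapping = {c: str(pd.Period(c, 'Q')).lower() for c in df.columns if c not in ('foo', 'bar')}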
Use a callable that gets applied to the index values. Use axis=1 to apply it to the column values instead.
(df.set_index(['foo', 'bar'])
.groupby(lambda x: pd.Period(x, 'Q'), axis=1)
.mean().reset_index())
foo bar 2016Q2 2016Q3
0 6 5 3.666667 4.5
1 9 3 8.000000 7.5
2 8 5 6.000000 6.5
3 5 8 2.000000 5.5
4 4 5 3.333333 4.0
The solution is quite short. Start by copying the "monthly" columns to
another DataFrame and converting the column names to a PeriodIndex:
df2 = df.iloc[:, 2:]
df2.columns = pd.PeriodIndex(df2.columns, freq='M')
Then, to get the result, resample the columns by quarter,
compute the mean for each quarter, and join with the two "initial" columns:
df.iloc[:, :2].join(df2.resample('Q', axis=1).agg('mean'))
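Note that the axis=1 argument to resample (and to groupby) is deprecated in recent pandas versions and later removed; under that assumption, an equivalent is to transpose, resample along the rows, and transpose back:
df.iloc[:, :2].join(df2.T.resample('Q').mean().T)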
import pandas as pd

data = [[2,2,2,3,3,3],[1,2,2,3,4,5],[1,2,2,3,4,5],[1,2,2,3,4,5],[1,2,2,3,4,5],[1,2,2,3,4,5]]
df = pd.DataFrame(data, columns=['A','1996-04','1996-05','2000-07','2000-08','2010-10'])
# separate the year columns
df3 = df.iloc[:, 1:]
# separate the other columns
df2 = df.iloc[:, 0]
# group the year columns by quarter using a PeriodIndex
df3 = df3.groupby(pd.PeriodIndex(df3.columns, freq='Q'), axis=1).mean()
final_df = pd.concat([df3, df2], axis=1)
print(final_df)
In pandas and Python:
I have a large dataset with health records where patients have records of diagnoses.
How can I display the most frequent diagnoses, but count only one occurrence of the same diagnosis per patient?
Example ('pid' is patient id. 'code' is the code of a diagnosis):
IN:
pid code
1 A
1 B
1 A
1 A
2 A
2 A
2 B
2 A
3 B
3 C
3 D
4 A
4 A
4 A
4 B
OUT:
B 4
A 3
C 1
D 1
I would like to be able to use .isin and .index if possible.
Example:
Remove all rows with a frequency count of less than 3 in column 'code':
s = df['code'].value_counts().ge(3)
df = df[df['code'].isin(s[s].index)]
You can use groupby + nunique:
df.groupby(by='code').pid.nunique().sort_values(ascending=False)
Out[60]:
code
B 4
A 3
D 1
C 1
Name: pid, dtype: int64
To remove all rows whose 'code' occurs in fewer than 3 distinct patients:
df.groupby(by='code').filter(lambda x: x.pid.nunique()>=3)
Out[55]:
pid code
0 1 A
1 1 B
2 1 A
3 1 A
4 2 A
5 2 A
6 2 B
7 2 A
8 3 B
11 4 A
12 4 A
13 4 A
14 4 B
Since you mention value_counts:
df.groupby('code').pid.value_counts().count(level=0)
Out[42]:
code
A 3
B 4
C 1
D 1
Name: pid, dtype: int64
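Series.count(level=...) is deprecated in newer pandas versions and removed in 2.0; an equivalent form of the same chain is:
df.groupby('code').pid.value_counts().groupby(level=0).count()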
You should be able to use the groupby and nunique() functions to obtain a distinct count of patients that had each diagnosis. This should give you the result you need:
df[['pid', 'code']].groupby(['code']).nunique()
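To match the asker's preference for the .isin/.index pattern, a minimal sketch that keeps only the rows whose code is seen in at least 3 distinct patients (the threshold is assumed for illustration):
# distinct-patient count per code, then filter rows by membership
s = df.groupby('code')['pid'].nunique()
df_filtered = df[df['code'].isin(s[s.ge(3)].index)]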
I have two datasets as follows:
A B
IDs IDs
1 1
2 2
3 5
4 7
How, in pandas/NumPy, can we apply a join which gives me all the data from B that is not present in A?
Something like the following:
B
IDs
5
7
I know it can be done with a for loop, but I don't want that, since my real data is in the millions, and I am really not sure how to use pandas/NumPy here; something like the following:
pd.merge(A, B, on='IDs', how='right')
You can use NumPy's setdiff1d, like so -
np.setdiff1d(B['IDs'],A['IDs'])
Also, np.in1d (np.isin in newer NumPy versions) could be used for the same effect, like so -
B[~np.in1d(B['IDs'],A['IDs'])]
Please note that np.setdiff1d would give us a sorted NumPy array as output.
Sample run -
>>> A = pd.DataFrame([1,2,3,4],columns=['IDs'])
>>> B = pd.DataFrame([1,7,5,2],columns=['IDs'])
>>> np.setdiff1d(B['IDs'],A['IDs'])
array([5, 7])
>>> B[~np.in1d(B['IDs'],A['IDs'])]
IDs
1 7
2 5
You can use merge with the parameter indicator=True and then boolean indexing. Lastly, you can drop the column _merge:
A = pd.DataFrame({'IDs':[1,2,3,4],
'B':[4,5,6,7],
'C':[1,8,9,4]})
print (A)
B C IDs
0 4 1 1
1 5 8 2
2 6 9 3
3 7 4 4
B = pd.DataFrame({'IDs':[1,2,5,7],
'A':[1,8,3,7],
'D':[1,8,9,4]})
print (B)
A D IDs
0 1 1 1
1 8 8 2
2 3 9 5
3 7 4 7
df = pd.merge(A, B, on='IDs', how='outer', indicator=True)
df = df[df._merge == 'right_only']
df = df.drop('_merge', axis=1)
print (df)
B C IDs A D
4 NaN NaN 5.0 3.0 9.0
5 NaN NaN 7.0 7.0 4.0
You could convert the data series to sets and take the difference:
import pandas as pd
df=pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
A=set(df['A'])
B=set(df['B'])
C=pd.DataFrame({'C' : list(B-A)}) # Take difference and convert back to DataFrame
The variable "C" then yields
C
0 5
1 7
You can simply use pandas' .isin() method:
df = pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
df[~df['B'].isin(df['A'])]
If these are separate DataFrames:
a = pd.DataFrame({'IDs' : [1,2,3,4]})
b = pd.DataFrame({'IDs' : [1,2,5,7]})
b[~b['IDs'].isin(a['IDs'])]
Output:
IDs
2 5
3 7