I am trying to calculate how long each store has been open in years. Here is an example of the dataset:
year    store name
2000    Store A
2001    Store A
2002    Store A
2003    Store A
2000    Store B
2001    Store B
2002    Store B
2000    Store C
I'm not sure how to calculate the difference between the max and min year for each store name, since they are all in the same column. Should I put the result into a new column using pandas?
You need to use a groupby:
g = df.groupby('store name')['year']
out = g.max()-g.min()
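For the sample data above, this produces a Series indexed by store name. A minimal, self-contained sketch (column names taken from the example):

import pandas as pd

df = pd.DataFrame({
    'store name': ['Store A'] * 4 + ['Store B'] * 3 + ['Store C'],
    'year': [2000, 2001, 2002, 2003, 2000, 2001, 2002, 2000],
})

g = df.groupby('store name')['year']
out = g.max() - g.min()
print(out)
# store name
# Store A    3
# Store B    2
# Store C    0
# Name: year, dtype: int64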
You can use groupby and transform to create an additional column in the same dataframe.
df["years open"] = df.groupby("store name")["year"].transform(lambda x: x.max()-x.min())
You can also use:
out = df.groupby('store name').agg(['min', 'max']).diff(axis=1).iloc[:, -1]
print(out)
# Output
store name
Store A    3
Store B    2
Store C    0
Name: (year, max), dtype: int64
I have been working with a dataframe in the following format (the actual table has many more rows (ids) and columns (value_3, value_4, etc.)):
For each id, the status column has the value 'new' if this is the first entry for that id, and the value 'modified' if any of the value_1, value_2 columns have changed compared to their previous value. I would like to create a log of any changes made in the table; in particular, for the given data above, I would like the resulting format to be something like this:
Ideally, I would like to avoid using loops, so could you please suggest a more efficient, Pythonic way to achieve the format above?
I have seen the answers posted for the question here: Determining when a column value changes in pandas dataframe
which partly do the job I want (using shift or diff) to identify cases where there was a change, and I was wondering if this is the best approach to build on for my case, or if there is a more efficient way to do it and speed up the process. Ideally, I would like something that works for both numeric and non-numeric values in the value_1, value_2, etc. columns.
Code for creating the sample data of the first pic:
import pandas as pd

data = [[1, 2, 5, 'new'], [1, 1, 5, 'modified'], [1, 0, 5, 'modified'],
        [2, 5, 2, 'new'], [2, 5, 3, 'modified'], [2, 5, 4, 'modified']]
df = pd.DataFrame(data, columns=['id', 'value_1', 'value_2', 'status'])
df
Many thanks in advance for any suggestion/help!
You need melt first, then drop_duplicates, and finally a groupby with shift:
# Reshape to long format and drop repeated values per (id, variable).
s = df.melt(['id', 'status']).drop_duplicates(['id', 'variable', 'value'])
# The shifted value is the previous entry for that (id, variable) pair.
s['new'] = s.groupby(['id', 'variable'])['value'].shift()
s  # s.sort_values('id')
id status variable value new
0 1 new value_1 2 NaN
1 1 modified value_1 1 2.0
2 1 modified value_1 0 1.0
3 2 new value_1 5 NaN
6 1 new value_2 5 NaN
9 2 new value_2 2 NaN
10 2 modified value_2 3 2.0
11 2 modified value_2 4 3.0
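If the goal is a log of only the changes, one possible follow-up (a sketch; the new_value/old_value labels are just illustrative) is to keep the 'modified' rows and relabel the columns:

# 'value' holds the current value; the shifted column holds the previous one.
log = s[s['status'].eq('modified')].rename(
    columns={'value': 'new_value', 'new': 'old_value'})
print(log[['id', 'variable', 'old_value', 'new_value']])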
Is it possible to create a row based on values of a previous row?
Let's say
Name Location Amount
1 xyz london 23423
is a row in a DataFrame, and I want to scan the DataFrame: if Amount > 2000 and Location == 'london', I want to append another row that keeps the location and amount of row 1 but changes the name to EEE.
As noted above, I would like the output to be the same DataFrame but with this added:
Name Location Amount
1 xyz london 23423
2 EEE london 23424
You can slice the dataframe based on the conditions, then change the name.
df2 = df[df.Location.eq('london') & df.Amount.gt(2000)].reset_index(drop=True)
df2.Name = 'EEE'
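If you also want those rows appended back onto the original frame, a minimal sketch (assuming the Name/Location/Amount columns from the question) is:

import pandas as pd

# Append the modified copies after the original rows.
out = pd.concat([df, df2], ignore_index=True)
print(out)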
I have a large dataframe (from 500k to 1M rows) which contains, for example, these 3 numeric columns: ID, A, B.
I want to filter the results in order to obtain a table like the one in the image below, where, for each unique value of the id column, I have the maximum and minimum value of A and B.
How can I do this?
EDIT: I have updated the image below to be clearer: when I get the max or min from a column, I also need the data associated with it from the other columns.
Sample data (note that you posted an image which can't be used by potential answerers without retyping, so I'm making a simple example in its place):
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2, 2],
                   'a': range(8), 'b': range(8, 0, -1)})
The key to this is just using idxmax and idxmin and then futzing with the indexes so that you can merge things in a readable way. Here's the whole answer and you may wish to examine intermediate dataframes to see how this is working.
# Row labels of the max/min of each column within each id group.
df_max = df.groupby('id').idxmax()
df_max['type'] = 'max'
df_min = df.groupby('id').idxmin()
df_min['type'] = 'min'

# Stack into a long Series of row labels keyed by (id, type, column).
# (pd.concat stands in for DataFrame.append, which was removed in pandas 2.0.)
df2 = pd.concat([df_max, df_min]).set_index('type', append=True).stack().rename('index')

# Pull the matching original rows back in and re-index for readability.
df3 = pd.concat([df2.reset_index().drop('id', axis=1).set_index('index'),
                 df.loc[df2.values]], axis=1)
df3.set_index(['id', 'level_2', 'type']).sort_index()
                 a  b
id level_2 type
1  a       max   3  5
           min   0  8
   b       max   0  8
           min   3  5
2  a       max   7  1
           min   4  4
   b       max   4  4
           min   7  1
Note in particular that df2 looks like this:
id  type
1   max   a    3
          b    0
2   max   a    7
          b    4
1   min   a    0
          b    3
2   min   a    4
          b    7
The last column there holds the index values in df that were derived with idxmax & idxmin. So basically all the information you need is in df2. The rest of it is just a matter of merging back with df and making it more readable.
For anyone looking to get min and max values of a specific column where there is a unique ID, this is how I modified the above code:
df_maxA = df.groupby('id').max()[['A']]
df_maxA['type'] = 'max'
df_minA = df.groupby('id').min()[['A']]
df_minA['type'] = 'min'
df_maxB = df.groupby('id').max()[['B']]
df_maxB['type'] = 'max'
df_minB = df.groupby('id').min()[['B']]
df_minB['type'] = 'min'
Then you can merge these together to create a single dataframe.
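One possible way to stitch those four pieces into a single frame (a sketch; it assumes the df_maxA/df_minA/df_maxB/df_minB frames from the snippet above):

import pandas as pd

# Stack max and min for each column, then align the A and B parts on (id, type).
a_part = pd.concat([df_maxA, df_minA]).set_index('type', append=True)
b_part = pd.concat([df_maxB, df_minB]).set_index('type', append=True)
summary = a_part.join(b_part).sort_index()
print(summary)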
I have a dataset like this:
Participant Type Rating
1 A 6
1 A 5
1 B 4
1 B 3
2 A 9
2 A 8
2 B 7
2 B 6
I want to obtain this:
Type MeanRating
A mean(6,9)
A mean(5,8)
B mean(4,7)
B mean(3,6)
So, for each type, I want the mean of the highest value in each group, then the mean of the second-highest value in each group, and so on.
I can't think of a proper way to do this with pandas, since the means always seem to apply within groups, not across them.
First use groupby.rank to create a column that allows you to align the highest values, second highest values, etc. Then perform another groupby using the newly created column to compute the means:
# Get the grouping column.
df['Grouper'] = df.groupby(['Type', 'Participant'])['Rating'].rank(method='first', ascending=False)
# Perform the groupby and format the result.
result = df.groupby(['Type', 'Grouper'])['Rating'].mean().rename('MeanRating')
result = result.reset_index(level=1, drop=True).reset_index()
The resulting output:
Type MeanRating
0 A 7.5
1 A 6.5
2 B 5.5
3 B 4.5
I used the method='first' parameter of groupby.rank to handle the case of duplicate ratings within a ['Type', 'Participant'] group. You can omit it if this is not a possibility within your dataset, but it won't change the output if you leave it and there are no duplicates.
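To reproduce this end to end, a minimal sketch of the sample frame (values copied from the question) would be:

import pandas as pd

df = pd.DataFrame({
    'Participant': [1, 1, 1, 1, 2, 2, 2, 2],
    'Type': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
    'Rating': [6, 5, 4, 3, 9, 8, 7, 6],
})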
I have a data frame like this:
df
time type qty
12:00 A 5
13:00 A 1
14:00 B 9
I need to sum the values of qty grouped by type. This is what I tried, but it isn't working, because I don't know how to add up only qty.
keys = df['type'].unique()
summary = pd.DataFrame()
for k in keys:
    summary[k] = df[df['type'] == k].sum()
GroupBy has a sum method:
In [11]: df.groupby("type").sum()
Out[11]:
      qty
type
A       6
B       9
see the groupby docs.
To make sure you are summing up the column you want to:
df.groupby(by=['type'])['qty'].sum()
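If you prefer the result as a regular two-column DataFrame rather than a Series, a common follow-up is:

summary = df.groupby('type', as_index=False)['qty'].sum()
#   type  qty
# 0    A    6
# 1    B    9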