I have a DataFrame which currently reads as below:
df_new = pd.DataFrame({'Week':['nan',14, 14, 14, 14, 14],
'Date':['NaT','2020-04-01', '2020-04-02', '2020-04-03', '2020-04-04', '2020-04-05'],
'site 1':['entry',0, 0, 0, 0, 0],
'site 1':['exit',0, 0, 0, 0, 0],
'site 2':['entry',1, 0,50, 7, 0],
'site 2':['exit',10, 0, 7, 19, 0],
'site 3':['entry',0, 100, 14, 9, 0],
'site 3':['exit',0, 0, 7, 0, 0],
'site 4':['entry',0, 0, 0, 0, 0],
'site 4':['exit',0, 0, 0, 0, 0],
'site 5':['entry',0, 0, 0, 0, 0],
'site 5':['exit',15, 0, 25, 0, 80],
})
What I want, however, is columns indicating exit/entry per site (the columns came from merged Excel headers).
An example of the desired output is below (ignore the actual values, as I typed them out):
df_target = pd.DataFrame({'Week':[14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14],
'Date':['2020-04-01', '2020-04-02', '2020-04-03', '2020-04-04', '2020-04-05','2020-04-01', '2020-04-02', '2020-04-03', '2020-04-04', '2020-04-05','2020-04-01', '2020-04-02', '2020-04-03', '2020-04-04', '2020-04-05'],
'site':['site 1', 'site 1', 'site 1', 'site 1', 'site 1', 'site 1', 'site 1', 'site 1', 'site 1', 'site 2', 'site 2','site 2','site 2','site 2','site 2'],
'entry/exit':['exit','exit', 'exit', 'entry', 'entry', 'entry', 'entry', 'entry', 'entry', 'exit', 'exit', 'exit', 'exit', 'entry', 'entry'],
'Value':[12 ,1, 0, 50, 7, 0, 12 ,1, 0, 50, 7, 0, 12 ,1, 0]
})
I have tried
df_target = df_new.melt(id_vars=['Week','Date'], var_name="Site", value_name="Value")
but I guess I need to somehow group by the second row too, or treat it as a second header?
First create a MultiIndex from the input DataFrame:
#if possible
#df = pd.read_csv(file, header=[0,1], index_col=[0,1])
df_new.columns = [df_new.columns, df_new.iloc[0]]
df = df_new.iloc[1:]
print (df.columns)
MultiIndex([( 'Week', 'nan'),
( 'Date', 'NaT'),
('site 1', 'exit'),
('site 2', 'exit'),
('site 3', 'exit'),
('site 4', 'exit'),
('site 5', 'exit')],
)
Then convert the first two columns to the index, so it is possible to use DataFrame.unstack for the reshaping, together with Series.rename_axis and
Series.reset_index:
df = (df.set_index(df.columns[:2].tolist())
        .unstack([0, 1])
        .rename_axis(['site', 'entry/exit', 'Week', 'Date'])
        .reset_index(name='Value'))
print (df)
site entry/exit Week Date Value
0 site 1 exit 14 2020-04-01 0
1 site 1 exit 14 2020-04-02 0
2 site 1 exit 14 2020-04-03 0
3 site 1 exit 14 2020-04-04 0
4 site 1 exit 14 2020-04-05 0
5 site 2 exit 14 2020-04-01 10
6 site 2 exit 14 2020-04-02 0
7 site 2 exit 14 2020-04-03 7
8 site 2 exit 14 2020-04-04 19
9 site 2 exit 14 2020-04-05 0
10 site 3 exit 14 2020-04-01 0
11 site 3 exit 14 2020-04-02 0
12 site 3 exit 14 2020-04-03 7
13 site 3 exit 14 2020-04-04 0
14 site 3 exit 14 2020-04-05 0
15 site 4 exit 14 2020-04-01 0
16 site 4 exit 14 2020-04-02 0
17 site 4 exit 14 2020-04-03 0
18 site 4 exit 14 2020-04-04 0
19 site 4 exit 14 2020-04-05 0
20 site 5 exit 14 2020-04-01 15
21 site 5 exit 14 2020-04-02 0
22 site 5 exit 14 2020-04-03 25
23 site 5 exit 14 2020-04-04 0
24 site 5 exit 14 2020-04-05 80
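For reference, once the columns are a (site, entry/exit) MultiIndex, the same reshape can also be written with DataFrame.stack instead of unstack. A minimal, self-contained sketch with made-up values (not the question's data):
import pandas as pd

# Made-up wide frame: columns are a (site, entry/exit) MultiIndex,
# the index holds Week and Date, values are arbitrary.
cols = pd.MultiIndex.from_product([['site 1', 'site 2'], ['entry', 'exit']])
idx = pd.MultiIndex.from_arrays(
    [[14, 14, 14], ['2020-04-01', '2020-04-02', '2020-04-03']],
    names=['Week', 'Date'])
wide = pd.DataFrame([[0, 10, 1, 10],
                     [0, 0, 0, 0],
                     [0, 0, 50, 7]], index=idx, columns=cols)

out = (wide.stack([0, 1])                     # move both column levels into the index
           .rename_axis(['Week', 'Date', 'site', 'entry/exit'])
           .reset_index(name='Value'))
print(out)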
I have a real estate dataframe with many outliers and many observations.
I have these variables: total area, number of rooms (if rooms = 0, then it's a studio apartment), and kitchen_area.
A minimal extract from my dataframe:
dic = [{'area': 40, 'kitchen_area': 10, 'rooms': 1, 'price': 50000 },
{'area': 20, 'kitchen_area': 0, 'rooms': 0, 'price': 50000},
{'area': 60, 'kitchen_area': 0, 'rooms': 2, 'price': 70000},
{'area': 29, 'kitchen_area': 9, 'rooms': 1, 'price': 30000},
{'area': 15, 'kitchen_area': 0, 'rooms': 0, 'price': 25000}]
df = pd.DataFrame(dic, index=['apt1', 'apt2','apt3','apt4', 'apt5'])
My target is to eliminate apt3, because by law the kitchen area cannot be smaller than 5 square meters in non-studio apartments.
In other words, I would like to eliminate all rows for apartments that are non-studio (rooms > 0) but have kitchen_area < 5.
I have tried code like this:
df1 = df.drop(df[(df.rooms > 0) & (df.kitchen_area < 5)].index)
But it just eliminated all data from both the kitchen_area and rooms columns according to the multiple conditions I set.
Clean
mask1 = df.rooms > 0
mask2 = df.kitchen_area < 5
df1 = df[~(mask1 & mask2)]
df1
area kitchen_area rooms price
apt1 40 10 1 50000
apt2 20 0 0 50000
apt4 29 9 1 30000
apt5 15 0 0 25000
pd.DataFrame.query
df1 = df.query('rooms == 0 | kitchen_area >= 5')
df1
area kitchen_area rooms price
apt1 40 10 1 50000
apt2 20 0 0 50000
apt4 29 9 1 30000
apt5 15 0 0 25000
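For completeness, the drop-based attempt from the question also works on this sample data and gives the same result as the masks above; a quick self-contained check:
import pandas as pd

dic = [{'area': 40, 'kitchen_area': 10, 'rooms': 1, 'price': 50000},
       {'area': 20, 'kitchen_area': 0, 'rooms': 0, 'price': 50000},
       {'area': 60, 'kitchen_area': 0, 'rooms': 2, 'price': 70000},
       {'area': 29, 'kitchen_area': 9, 'rooms': 1, 'price': 30000},
       {'area': 15, 'kitchen_area': 0, 'rooms': 0, 'price': 25000}]
df = pd.DataFrame(dic, index=['apt1', 'apt2', 'apt3', 'apt4', 'apt5'])

# Select the offending rows (non-studio with kitchen_area < 5) and drop them by label.
bad = df[(df.rooms > 0) & (df.kitchen_area < 5)].index
df1 = df.drop(bad)
print(df1)   # apt3 is gone; apt1, apt2, apt4, apt5 remain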
I have the following df1 dataframe:
t A
0 23:00 2
1 23:01 1
2 23:02 2
3 23:03 2
4 23:04 6
5 23:05 5
6 23:06 4
7 23:07 9
8 23:08 7
9 23:09 10
10 23:10 8
For each t (increments simplified here; not uniformly distributed in real life), I would like to find, if any, the most recent time tr within the previous 5 min where A(t) - A(tr) >= 4. I want to get:
t A tr
0 23:00 2
1 23:01 1
2 23:02 2
3 23:03 2
4 23:04 6 23:03
5 23:05 5 23:01
6 23:06 4
7 23:07 9 23:06
8 23:08 7
9 23:09 10 23:06
10 23:10 8 23:06
Currently, I can use shift(1) to compare each row to the previous row, like cond = df1['A'] >= df1['A'].shift(1) + 4.
How can I look further back in time?
Assuming your data is continuous by the minute, you can use the usual shift:
df1['t'] = pd.to_timedelta(df1['t'].add(':00'))
df = pd.DataFrame({i:df1.A - df1.A.shift(i) >= 4 for i in range(1,5)})
df1['t'] - pd.to_timedelta('1min') * df.idxmax(axis=1).where(df.any(axis=1))
Output:
0 NaT
1 NaT
2 NaT
3 NaT
4 23:03:00
5 23:01:00
6 NaT
7 23:06:00
8 NaT
9 23:06:00
10 23:06:00
dtype: timedelta64[ns]
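If the timestamps are not evenly spaced, a fixed number of shifts no longer covers a true 5-minute window. One simple fallback is a per-row lookback; a minimal sketch of my own (O(n * window), assuming rows are sorted by time and that points exactly 5 minutes back are excluded, to match the expected output):
import pandas as pd

df1 = pd.DataFrame({'t': ['23:00', '23:01', '23:02', '23:03', '23:04', '23:05',
                          '23:06', '23:07', '23:08', '23:09', '23:10'],
                    'A': [2, 1, 2, 2, 6, 5, 4, 9, 7, 10, 8]})
df1['t'] = pd.to_timedelta(df1['t'] + ':00')

def find_tr(i, window=pd.Timedelta('5min'), gap=4):
    # Latest earlier time tr with t - window < tr < t and A(t) - A(tr) >= gap.
    t_i, a_i = df1.loc[i, 't'], df1.loc[i, 'A']
    earlier = df1[(df1['t'] < t_i) & (df1['t'] > t_i - window) & (df1['A'] <= a_i - gap)]
    return earlier['t'].iloc[-1] if len(earlier) else pd.NaT

df1['tr'] = [find_tr(i) for i in df1.index]
print(df1)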
I added a datetime index and used rolling(), which now supports time-based windows in addition to simple index-based windows.
import pandas as pd
import numpy as np
import datetime
df1 = pd.DataFrame({'t' : [
datetime.datetime(2020, 5, 17, 23, 0, 0),
datetime.datetime(2020, 5, 17, 23, 0, 1),
datetime.datetime(2020, 5, 17, 23, 0, 2),
datetime.datetime(2020, 5, 17, 23, 0, 3),
datetime.datetime(2020, 5, 17, 23, 0, 4),
datetime.datetime(2020, 5, 17, 23, 0, 5),
datetime.datetime(2020, 5, 17, 23, 0, 6),
datetime.datetime(2020, 5, 17, 23, 0, 7),
datetime.datetime(2020, 5, 17, 23, 0, 8),
datetime.datetime(2020, 5, 17, 23, 0, 9),
datetime.datetime(2020, 5, 17, 23, 0, 10)
], 'A' : [2,1,2,2,6,5,4,9,7,10,8]}, columns=['t', 'A'])
df1.index = df1['t']
df2 = df1
cond = df1['A'] >= df1.rolling('5s')['A'].apply(lambda x: x.iloc[0] + 4)
result = df1[cond]
This gives:
t A
2020-05-17 23:00:04 6
2020-05-17 23:00:05 5
2020-05-17 23:00:07 9
2020-05-17 23:00:09 10
2020-05-17 23:00:10 8
Can anyone explain the following statement?
list(diff.sort_values(ascending=False).index.astype(int)[0:5])
Output: Int64Index([24, 26, 17, 2, 1], dtype='int64')
It sorts first, but what is the index doing, and how do I get 24, 26, 17, 2, 1?
diff is a Series:
ipdb> diff
1 0.017647
2 0.311765
3 -0.060000
4 -0.120000
5 -0.040000
6 -0.120000
7 -0.190000
8 -0.200000
9 -0.100000
10 -0.011176
11 -0.130000
12 0.008824
13 -0.060000
14 -0.090000
15 -0.060000
16 0.008824
17 0.341765
18 -0.140000
19 -0.050000
20 -0.060000
21 -0.040000
22 -0.210000
23 0.008824
24 0.585882
25 -0.060000
26 0.555882
27 -0.031176
28 -0.060000
29 -0.170000
30 -0.220000
31 -0.170000
32 -0.040000
dtype: float64
Your code returns a list of the index values of the top 5 values of the Series, sorted in descending order.
The first 'column' printed for a pandas Series is called the index, so after sorting, your code converts the index values to integers and slices the first five by position.
print (diff.sort_values(ascending=False))
24 0.585882
26 0.555882
17 0.341765
2 0.311765
1 0.017647
12 0.008824
23 0.008824
16 0.008824
10 -0.011176
27 -0.031176
32 -0.040000
21 -0.040000
5 -0.040000
19 -0.050000
15 -0.060000
3 -0.060000
13 -0.060000
25 -0.060000
28 -0.060000
20 -0.060000
14 -0.090000
9 -0.100000
6 -0.120000
4 -0.120000
11 -0.130000
18 -0.140000
31 -0.170000
29 -0.170000
7 -0.190000
8 -0.200000
22 -0.210000
30 -0.220000
Name: a, dtype: float64
print (diff.sort_values(ascending=False).index.astype(int))
Int64Index([24, 26, 17, 2, 1, 12, 23, 16, 10, 27, 32, 21, 5, 19, 15, 3, 13,
25, 28, 20, 14, 9, 6, 4, 11, 18, 31, 29, 7, 8, 22, 30],
dtype='int64')
print (diff.sort_values(ascending=False).index.astype(int)[0:5])
Int64Index([24, 26, 17, 2, 1], dtype='int64')
print (list(diff.sort_values(ascending=False).index.astype(int)[0:5]))
[24, 26, 17, 2, 1]
Here's what's happening:
diff.sort_values(ascending=False) - sorts the Series. By default, ascending is True, but you've set it to False, so it returns the Series sorted in descending order.
pandas.Series.index returns the row labels of the Series (the numbers 1 - 32 in your case, now in sorted-value order).
.astype(int) casts the index labels to integers.
[0:5] just picks positions 0 through 4, i.e. the first five labels.
Let me know if this helps!
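As a side note, Series.nlargest gets the same top-5 labels more directly than sorting the whole Series; a small sketch on a made-up Series with a similar shape:
import pandas as pd

# Made-up Series: the index holds the labels, the values get ranked.
diff = pd.Series([0.017647, 0.311765, -0.06, 0.585882, 0.555882, 0.341765],
                 index=[1, 2, 3, 24, 26, 17])

# Equivalent to sort_values(ascending=False).index[:5], but more direct.
top5 = list(diff.nlargest(5).index)
print(top5)   # [24, 26, 17, 2, 1]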
Here's my data:
foo = pd.DataFrame({
'accnt' : [101, 102, 103, 104, 105, 101, 102, 103, 104, 105],
'gender' : [0, 1 , 0, 1, 0, 0, 1 , 0, 1, 0],
'date' : pd.to_datetime(["2019-01-01 00:10:21", "2019-01-05 00:09:18", "2019-01-05 00:09:30", "2019-02-05 00:05:12", "2019-04-01 00:08:46",
"2019-04-01 00:11:31", "2019-02-06 00:01:39", "2019-01-26 00:15:14", "2019-01-21 00:12:36", "2019-03-01 00:09:31"]),
'value' : [10, 20, 30, 40, 50, 5, 2, 6, 48, 96]
})
Which is:
accnt date gender value
0 101 2019-01-01 00:10:21 0 10
1 102 2019-01-05 00:09:18 1 20
2 103 2019-01-05 00:09:30 0 30
3 104 2019-02-05 00:05:12 1 40
4 105 2019-04-01 00:08:46 0 50
5 101 2019-04-01 00:11:31 0 5
6 102 2019-02-06 00:01:39 1 2
7 103 2019-01-26 00:15:14 0 6
8 104 2019-01-21 00:12:36 1 48
9 105 2019-03-01 00:09:31 0 96
I want to do the following:
- Group by accnt, include gender, take latest date as latest_date, count number of transactions as txn_count; resulting in:
accnt gender latest_date txn_count
101 0 2019-04-01 00:11:31 2
102 1 2019-02-06 00:01:39 2
103 0 2019-01-26 00:15:14 2
104 1 2019-02-05 00:05:12 2
105 0 2019-04-01 00:08:46 2
In R, I can do this using group_by and summarise from dplyr:
foo %>% group_by(accnt) %>%
summarise(gender = last(gender), most_recent_order_date = max(date), order_count = n()) %>% data.frame()
I'm taking last(gender) just to include it; since gender is the same throughout for any accnt, I could take min, max, or mean instead.
How can I do the same in Python using pandas?
I've tried:
foo.groupby('accnt').agg({'gender' : ['mean'],
'date': ['max'],
'value': ['count']}).rename(columns = {'gender' : "gender",
'date' : "most_recent_order_date",
'value' : "order_count"})
But this leads to "extra" column names. I'd also like to know the best way to include a non-aggregated column like gender in the result.
In R, summarise is equivalent to agg, and mutate is equivalent to transform.
The reason you get multiple index levels in the columns is that you pass the functions as lists, which means you could do something like {'date': ['mean', 'sum']}.
foo.groupby('accnt').agg({'gender' : 'first',
'date': 'max',
'value': 'count'}).rename(columns = {'date' : "most_recent_order_date",
'value' : "order_count"}).reset_index()
Out[727]:
accnt most_recent_order_date order_count gender
0 101 2019-04-01 00:11:31 2 0
1 102 2019-02-06 00:01:39 2 1
2 103 2019-01-26 00:15:14 2 0
3 104 2019-02-05 00:05:12 2 1
4 105 2019-04-01 00:08:46 2 0
Another example: here I call two functions at the same time for one column, which means there have to be two levels in the column index so that the output column names are not duplicated.
foo.groupby('accnt').agg({'gender' : ['first','mean']})
Out[728]:
gender
first mean
accnt
101 0 0
102 1 1
103 0 0
104 1 1
105 0 0
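If you are on pandas 0.25 or newer, named aggregation avoids the extra column level entirely; a short sketch reusing foo from the question:
# Each keyword becomes an output column: (source column, aggregation function).
out = (foo.groupby('accnt', as_index=False)
          .agg(gender=('gender', 'first'),
               latest_date=('date', 'max'),
               txn_count=('value', 'count')))
print(out)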
Sorry for the late response. Here's a solution I found.
# Pandas operations
foo = foo.groupby('accnt').agg({'gender': ['mean'],
                                'date': ['max'],
                                'value': ['count']})
# Drop the extra column level created by the list-based aggregation
foo.columns = foo.columns.droplevel(1)
# Rename the original column names
foo.rename(columns={'date': 'latest_date',
                    'value': 'txn_count'},
           inplace=True)
If you'd like to include an additional non-aggregated column, you can simply append a new column to the grouped foo dataframe.
I have a long time series that starts in 1963 and ends in 2013. However, from 1963 until 2007 it has an hourly sampling period, while after 2007 the sampling rate changes to 5 minutes. Is it possible to resample only the data after 2007 so that the entire time series has hourly sampling? Data slice below.
yr, m, d, h, m, s, sl
2007, 11, 30, 19, 0, 0, 2180
2007, 11, 30, 20, 0, 0, 2310
2007, 11, 30, 21, 0, 0, 2400
2007, 11, 30, 22, 0, 0, 2400
2007, 11, 30, 23, 0, 0, 2270
2008, 1, 1, 0, 0, 0, 2210
2008, 1, 1, 0, 5, 0, 2210
2008, 1, 1, 0, 10, 0, 2210
2008, 1, 1, 0, 15, 0, 2200
2008, 1, 1, 0, 20, 0, 2200
2008, 1, 1, 0, 25, 0, 2200
2008, 1, 1, 0, 30, 0, 2200
2008, 1, 1, 0, 35, 0, 2200
2008, 1, 1, 0, 40, 0, 2200
2008, 1, 1, 0, 45, 0, 2200
2008, 1, 1, 0, 50, 0, 2200
2008, 1, 1, 0, 55, 0, 2200
2008, 1, 1, 1, 0, 0, 2190
2008, 1, 1, 1, 5, 0, 2190
Thanks!
Give your dataframe proper column names
df.columns = 'year month day hour minute second sl'.split()
Solution
df.groupby(['year', 'month', 'day', 'hour'], as_index=False).first()
year month day hour minute second sl
0 2007 11 30 19 0 0 2180
1 2007 11 30 20 0 0 2310
2 2007 11 30 21 0 0 2400
3 2007 11 30 22 0 0 2400
4 2007 11 30 23 0 0 2270
5 2008 1 1 0 0 0 2210
6 2008 1 1 1 0 0 2190
Option 2
Here is an option that builds off of the column renaming. We'll use pd.to_datetime to cleverly get at our dates, then use resample. However, you have time gaps and will have to address nulls and re-cast dtypes.
df.set_index(
    pd.to_datetime(df.drop(columns='sl'))
).resample('H').first().dropna().astype(df.dtypes)
year month day hour minute second sl
2007-11-30 19:00:00 2007 11 30 19 0 0 2180
2007-11-30 20:00:00 2007 11 30 20 0 0 2310
2007-11-30 21:00:00 2007 11 30 21 0 0 2400
2007-11-30 22:00:00 2007 11 30 22 0 0 2400
2007-11-30 23:00:00 2007 11 30 23 0 0 2270
2008-01-01 00:00:00 2008 1 1 0 0 0 2210
2008-01-01 01:00:00 2008 1 1 1 0 0 2190
Rename the minute column for convenience:
df.columns = ['yr', 'm', 'd', 'h', 'M', 's', 'sl']
Create a datetime column:
from datetime import datetime as dt
df['dt'] = df.apply(axis=1, func=lambda x: dt(x.yr, x.m, x.d, x.h, x.M, x.s))
Resample:
For pandas < 0.19:
df = df.set_index('dt').resample('60T').reset_index('dt')
For pandas >= 0.19:
df = df.resample('60T', on='dt').first()  # resample needs an aggregation, e.g. first() or mean()
It's better to first append a datetime column to your dataframe:
df['datetime'] = pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute', 'second']])
But before that you should rename the columns (the duplicated m makes month and minute ambiguous, and pandas needs recognizable names to assemble the datetime):
df.columns = ['year', 'month', 'day', 'hour', 'minute', 'second', 'sl']
Then you set the datetime column as the dataframe index:
df.set_index('datetime', inplace=True)
Now you can apply the resample method to your dataframe with the preferred sampling rate:
df.resample('60T').mean()
Here I used mean to aggregate. You can use another method based on your needs.
See the pandas documentation for reference.
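To address the original question more directly (downsample only the post-2007, 5-minute block and leave the hourly part untouched), here is one possible sketch; the sample rows and column names are assumptions based on the slice in the question, and mean is just one aggregation choice (first would match the earlier answers):
import pandas as pd
from io import StringIO

# Hypothetical slice mirroring the question's data; column names are assumed.
raw = """2007,11,30,23,0,0,2270
2008,1,1,0,0,0,2210
2008,1,1,0,5,0,2210
2008,1,1,0,55,0,2200
2008,1,1,1,0,0,2190
2008,1,1,1,5,0,2190"""
cols = ['year', 'month', 'day', 'hour', 'minute', 'second', 'sl']
df = pd.read_csv(StringIO(raw), names=cols)

# Assemble a DatetimeIndex and keep only the measurement column.
df.index = pd.to_datetime(df[cols[:-1]])
sl = df['sl']

# Downsample only the 5-minute part (2008 onwards) to hourly; keep the rest as-is.
hourly_part = sl[:'2007-12-31']
downsampled = sl['2008-01-01':].resample('60T').mean()
result = pd.concat([hourly_part, downsampled]).sort_index()
print(result)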