how to zip and also melt any number of columns in python

My table looks like this:
no  type  2020-01-01  2020-01-02  2020-01-03  ...
1   x     1           2           3
2   b     4           3           0
What I want to do is melt the date columns so that the dates and their values end up in two new, separate columns. I have done it, but I had to hard-code which columns to melt, as in the script below:
cols_dict = dict(zip(df.iloc[:, 3:100].columns, df.iloc[:, 3:100].values[0]))
id_vars = [col for col in df.columns if isinstance(col, str)]
df = df.melt(id_vars=id_vars, var_name="date", value_name="value")
The expected result I want is:
no type date value
1 x 2020-01-01 1
1 x 2020-01-02 2
1 x 2020-01-03 3
2 b 2020-01-01 4
2 b 2020-01-02 3
2 b 2020-01-03 0
I assume that date columns will keep being added to the data frame as time goes by, so my script will stop working once there are more than 100 date columns.
How should I write the script so that it handles any number of date columns in the future? My current version can only reach column 100.
Thanks in advance.

>>> df.set_index(["no", "type"]) \
.rename_axis(columns="date") \
.stack() \
.rename("value") \
.reset_index()
no type date value
0 1 x 2020-01-01 1
1 1 x 2020-01-02 2
2 1 x 2020-01-03 3
3 2 b 2020-01-01 4
4 2 b 2020-01-02 3
5 2 b 2020-01-03 0
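An equivalent approach keeps melt itself but selects id_vars dynamically, so any number of date columns is handled; a minimal sketch, assuming the identifier columns are exactly "no" and "type":
# no column range is hard-coded, so newly added date columns are picked up automatically
out = df.melt(id_vars=["no", "type"], var_name="date", value_name="value")
print(out)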

Related

pandas shifting missing months

let's assume the following dataframe and shift operation:
import pandas as pd

d = {'col1': ['2022-01-01','2022-02-01','2022-03-01','2022-05-01'], 'col2': [1,2,3,4]}
df = pd.DataFrame(d)
df['shifted'] = df['col2'].shift(1, fill_value=0)
I want to create a column containing the previous month's value, filling in 0 for months that do not exist, so the desired result would look like:
col1        col2  shifted
2022-01-01  1     0
2022-02-01  2     1
2022-03-01  3     2
2022-05-01  4     0
So in the last line the value is 0 because there is no data for April.
But at the moment it looks like this:
col1        col2  shifted
2022-01-01  1     0
2022-02-01  2     1
2022-03-01  3     2
2022-05-01  4     3
Does anyone know how to achieve this?
One idea is to create a month PeriodIndex, so the shift can be done by months, and then replace the missing values:
df = df.set_index(pd.to_datetime(df['col1']).dt.to_period('m'))
df['shifted'] = df['col2'].shift(1, freq='m').reindex(df.index, fill_value=0)
print (df)
col1 col2 shifted
col1
2022-01 2022-01-01 1 0
2022-02 2022-02-01 2 1
2022-03 2022-03-01 3 2
2022-05 2022-05-01 4 0
Finally, it is possible to remove the PeriodIndex:
df = df.reset_index(drop=True)
print (df)
col1 col2 shifted
0 2022-01-01 1 0
1 2022-02-01 2 1
2 2022-03-01 3 2
3 2022-05-01 4 0
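A compact variant of the same idea, without setting the index; a sketch, assuming col1 holds month-start dates as in the question:
import pandas as pd

d = {'col1': ['2022-01-01', '2022-02-01', '2022-03-01', '2022-05-01'], 'col2': [1, 2, 3, 4]}
df = pd.DataFrame(d)

# col2 keyed by month period
months = pd.to_datetime(df['col1']).dt.to_period('M')
by_month = pd.Series(df['col2'].values, index=months)

# look up the previous month for every row; missing months become 0
df['shifted'] = (months - 1).map(by_month).fillna(0).astype(int)
print(df)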

combine rows with consecutive dates (one day before/after, or the same day) into one [duplicate]

I would like to combine rows with the same id, consecutive dates, and the same feature values.
I have the following dataframe:
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-15 1 1
1 A 2020-01-16 2020-01-30 1 1
2 A 2020-01-31 2020-02-15 0 1
3 A 2020-07-01 2020-07-15 0 1
4 B 2020-01-31 2020-02-15 0 0
5 B 2020-02-16 NaT 0 0
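For reproducibility, the frame above can be built like this; a sketch, with pd.NaT standing in for the missing End date:
import pandas as pd

df = pd.DataFrame({
    'Id': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Start': ['2020-01-01', '2020-01-16', '2020-01-31', '2020-07-01', '2020-01-31', '2020-02-16'],
    'End': ['2020-01-15', '2020-01-30', '2020-02-15', '2020-07-15', '2020-02-15', pd.NaT],
    'Feature1': [1, 1, 0, 0, 0, 0],
    'Feature2': [1, 1, 1, 1, 0, 0],
})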
And the expected result is:
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-30 1 1
1 A 2020-01-31 2020-02-15 0 1
2 A 2020-07-01 2020-07-15 0 1
3 B 2020-01-31 NaT 0 0
I have been trying answers from other posts, but they don't really match my use case.
Thanks in advance!
You can approach by:
Get the day diff between consecutive entries within the same group by subtracting the previous End from the current Start, using GroupBy.shift().
Set a group number group_no such that a new group number is issued when the day diff with the previous entry within the group is greater than 1.
Then, group by Id and group_no and aggregate the Start and End dates for each group using .groupby() and .agg().
As there is NaT data within the grouping, we need to specify dropna=False during grouping. Furthermore, to get the last entry of End within the group, we use x.iloc[-1] instead of last.
# convert to datetime format if not already in datetime
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
# sort by columns `Id` and `Start` if not already in this sequence
df = df.sort_values(['Id', 'Start'])
day_diff = (df['Start'] - df['End'].groupby([df['Id'], df['Feature1'], df['Feature2']]).shift()).dt.days
group_no = (day_diff.isna() | day_diff.gt(1)).cumsum()
df_out = (df.groupby(['Id', group_no], dropna=False, as_index=False)
            .agg({'Id': 'first',
                  'Start': 'first',
                  'End': lambda x: x.iloc[-1],
                  'Feature1': 'first',
                  'Feature2': 'first',
                  }))
Result:
print(df_out)
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-30 1 1
1 A 2020-01-31 2020-02-15 0 1
2 A 2020-07-01 2020-07-15 0 1
3 B 2020-01-31 NaT 0 0
Extract the month from both date columns:
df['sMonth'] = df['Start'].apply(pd.to_datetime).dt.month
df['eMonth'] = df['End'].apply(pd.to_datetime).dt.month
Now group the data frame by ['Id','Feature1','Feature2','sMonth','eMonth'] and we get the result:
(df.groupby(['Id','Feature1','Feature2','sMonth','eMonth'])
   .agg({'Start': 'min', 'End': 'max'})
   .reset_index()
   .drop(['sMonth','eMonth'], axis=1))
Result
Id Feature1 Feature2 Start End
0 A 0 1 2020-01-31 2020-02-15
1 A 0 1 2020-07-01 2020-07-15
2 A 1 1 2020-01-01 2020-01-30
3 B 0 0 2020-01-31 2020-02-15

Pandas filter by date range within each group

I have a df (sample data is shown in the Setup below), and I have to filter it to keep, for each ID, only the values within two weeks of that ID's first date:
so for each ID, I have to look ahead two weeks from the first date and only keep those records.
The expected output contains just those rows.
I tried creating a min date for each ID and filtering with the code below:
df[df.date.between(df['min_date'],df['min_date']+pd.DateOffset(days=14))]
Is there a more efficient way than this? It is taking a lot of time since my dataframe is big.
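For context, the min_date column in that attempt was presumably built with a per-ID groupby transform; a hypothetical reconstruction (the column names ID and date are assumptions taken from the prose and the snippet above):
# hypothetical reconstruction of the unshown step: per-ID minimum date broadcast onto every row
df['min_date'] = df.groupby('ID')['date'].transform('min')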
Setup
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Id': np.repeat([2, 3, 4], [4, 3, 4]),
    'Date': ['12/31/2019', '1/1/2020', '1/5/2020', '1/20/2020',
             '1/5/2020', '1/10/2020', '1/30/2020', '2/2/2020',
             '2/4/2020', '2/10/2020', '2/25/2020'],
    'Value': [*'abcbdeefffg']
})
First, convert Date to Timestamp with to_datetime
df['Date'] = pd.to_datetime(df['Date'])
concat with groupby in a comprehension
pd.concat([
    d[d.Date <= d.Date.min() + pd.offsets.Day(14)]
    for _, d in df.groupby('Id')
])
Id Date Value
0 2 2019-12-31 a
1 2 2020-01-01 b
2 2 2020-01-05 c
4 3 2020-01-05 d
5 3 2020-01-10 e
7 4 2020-02-02 f
8 4 2020-02-04 f
9 4 2020-02-10 f
boolean slice... also with groupby
df[df.Date <= df.Id.map(df.groupby('Id').Date.min() + pd.offsets.Day(14))]
Id Date Value
0 2 2019-12-31 a
1 2 2020-01-01 b
2 2 2020-01-05 c
4 3 2020-01-05 d
5 3 2020-01-10 e
7 4 2020-02-02 f
8 4 2020-02-04 f
9 4 2020-02-10 f
I struggle with pandas.concat, so you can try using merge:
# Convert Date to datetime
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
# Get min Date for each Id and add two weeks (14 days)
s = df.groupby('Id')['Date'].min() + pd.offsets.Day(14)
# Merge df and s
df = df.merge(s, left_on='Id', right_index=True)
# Keep records where Date is less than the allowed limit
df = df.loc[df['Date_x'] <= df['Date_y'], ['Id','Date_x','Value']]
# Rename Date_x to Date (optional)
df.rename(columns={'Date_x':'Date'}, inplace=True)
The result is:
Id Date Value
0 2 2019-12-31 a
1 2 2020-01-01 b
2 2 2020-01-05 c
4 3 2020-01-05 d
5 3 2020-01-10 e
7 4 2020-02-02 f
8 4 2020-02-04 f
9 4 2020-02-10 f

select the last 2 values in the groupby with condition

I need to select the row with the last value for each user_id and date, but when the last value in the metric column is 'leave', select the last 2 rows (if they exist).
My data:
df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2],
    'subscription': [1, 1, 2, 3, 4, 5],
    'metric': ['enter', 'stay', 'leave', 'enter', 'leave', 'enter'],
    'date': ['2020-01-01', '2020-01-01', '2020-03-01', '2020-01-01', '2020-01-01', '2020-01-02']
})
#result
user_id subscription metric date
0 1 1 enter 2020-01-01
1 1 1 stay 2020-01-01
2 1 2 leave 2020-03-01
3 2 3 enter 2020-01-01
4 2 4 leave 2020-01-01
5 2 5 enter 2020-01-02
Expected output:
user_id subscription metric date
1 1 1 stay 2020-01-01
2 1 2 leave 2020-03-01
3 2 3 enter 2020-01-01  # kept because the last metric inside group [user_id, date] is 'leave'
4 2 4 leave 2020-01-01
5 2 5 enter 2020-01-02
What I've tried: drop_duplicates and groupby, both give the same result, only with the last value
df.drop_duplicates(['user_id', 'date'], keep='last')
#or
df.groupby(['user_id', 'date']).tail(1)
You can use boolean masking: build three conditions that are True or False in variables a, b, and c, then keep the rows where any of them is True using the or operator |:
a = df.groupby(['user_id', 'date', df.groupby(['user_id', 'date']).cumcount()])['metric'].transform('last') == 'leave'
b = df.groupby(['user_id', 'date'])['metric'].transform('count') == 1
c = a.shift(-1) & (b == False)
df = df[a | b | c]
print(a, b, c)
df
#a groupby the two required groups plus a group that finds the cumulative count, which is necessary in order to return True for the last "metric" within the group.
0 False
1 False
2 True
3 False
4 True
5 False
Name: metric, dtype: bool
#b if something has a count of one, then you want to keep it.
0 False
1 False
2 True
3 False
4 False
5 True
Name: metric, dtype: bool
#c simply use .shift(-1) to find the row before the 'leave' row. For the condition to be satisfied, the count for that group must be > 1
0 False
1 True
2 False
3 True
4 False
5 False
Name: metric, dtype: bool
Out[18]:
user_id subscription metric date
1 1 1 stay 2020-01-01
2 1 2 leave 2020-03-01
3 2 3 enter 2020-01-01
4 2 4 leave 2020-01-01
5 2 5 enter 2020-01-02
This is one way, but in my opinion it is slow, since we are iterating through the groups:
df["date"] = pd.to_datetime(df["date"])
df = df.assign(metric_is_leave=df.metric.eq("leave"))
pd.concat(
    [
        value.iloc[-2:, :-1] if value.metric_is_leave.any() else value.iloc[-1:, :-1]
        for key, value in df.groupby(["user_id", "date"])
    ]
)
user_id subscription metric date
1 1 1 stay 2020-01-01
2 1 2 leave 2020-03-01
3 2 3 enter 2020-01-01
4 2 4 leave 2020-01-01
5 2 5 enter 2020-01-02
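A non-iterating variant of the same idea is also possible; a sketch, not taken from the original answers: keep the last row of every [user_id, date] group, plus the second-to-last row of groups whose last metric is 'leave'.
g = df.groupby(['user_id', 'date'])
pos_from_end = g.cumcount(ascending=False)               # 0 = last row of the group, 1 = the one before
last_is_leave = g['metric'].transform('last').eq('leave')
out = df[(pos_from_end == 0) | ((pos_from_end == 1) & last_is_leave)]
print(out)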

Elegant way to drop records in pandas based on size/count of a record

This isn't a duplicate. I am not trying to drop rows based on the index.
I have a dataframe like as shown below
df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'time_1': ['2173-04-03 12:35:00', '2173-04-03 12:50:00', '2173-04-05 12:59:00',
               '2173-05-04 13:14:00', '2173-05-05 13:37:00', '2173-07-06 13:39:00',
               '2173-07-08 11:30:00', '2173-04-08 16:00:00', '2173-04-09 22:00:00',
               '2173-04-11 04:00:00', '2173-04-13 04:30:00', '2173-04-14 08:00:00'],
    'val': [5, 2, 3, 1, 1, 6, 5, 5, 8, 3, 4, 6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
I would like to drop records based on subject_id if their count is <=5.
This is what I tried
df1 = df.groupby(['subject_id']).size().reset_index(name='counter')
df1[df1['counter']>5] # this gives the valid subject_id (= 1), which has a count of more than 5
Now using this subject_id, I have to get the base dataframe rows for that subject_id
There might be an elegant way to do this.
I would like to get the output as shown below. I would like to have my base dataframe rows.
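For reference, completing that two-step attempt would look something like this sketch, reusing df1 from above:
valid_ids = df1.loc[df1['counter'] > 5, 'subject_id']
df[df['subject_id'].isin(valid_ids)]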
Use:
df[df.groupby('subject_id')['subject_id'].transform('size')>5]
Output:
subject_id time_1 val day
0 1 2173-04-03 12:35:00 5 3
1 1 2173-04-03 12:50:00 2 3
2 1 2173-04-05 12:59:00 3 5
3 1 2173-05-04 13:14:00 1 4
4 1 2173-05-05 13:37:00 1 5
5 1 2173-07-06 13:39:00 6 6
6 1 2173-07-08 11:30:00 5 8
