pandas - Copy each row 'n' times depending on column value - python

I'd like to copy or duplicate the rows of a DataFrame based on the value of a column, in this case orig_qty. So, given a DataFrame like the following (using pandas==0.24.2):
import pandas as pd

d = {'a': ['2019-04-08', 4, 115.00], 'b': ['2019-04-09', 2, 103.00]}
df = pd.DataFrame.from_dict(
    d,
    orient='index',
    columns=['date', 'orig_qty', 'price']
)
Input
>>> print(df)
         date  orig_qty  price
a  2019-04-08         4  115.0
b  2019-04-09         2  103.0
So in the example above the row with orig_qty=4 should be duplicated 4 times and the row with orig_qty=2 should be duplicated 2 times. After this transformation I'd like a DataFrame that looks like:
Desired Output
>>> print(new_df)
         date  orig_qty  price  fifo_qty
1  2019-04-08         4  115.0         1
2  2019-04-08         4  115.0         1
3  2019-04-08         4  115.0         1
4  2019-04-08         4  115.0         1
5  2019-04-09         2  103.0         1
6  2019-04-09         2  103.0         1
Note I do not really care about the index after the transformation. I can elaborate more on the use case for this, but essentially I'm doing some FIFO accounting where important changes can occur between values of orig_qty.

Use Index.repeat with DataFrame.loc, then DataFrame.assign and DataFrame.reset_index:
new_df = df.loc[df.index.repeat(df['orig_qty'])].assign(fifo_qty=1).reset_index(drop=True)
Output:
         date  orig_qty  price  fifo_qty
0  2019-04-08         4  115.0         1
1  2019-04-08         4  115.0         1
2  2019-04-08         4  115.0         1
3  2019-04-08         4  115.0         1
4  2019-04-09         2  103.0         1
5  2019-04-09         2  103.0         1
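The key step is df.index.repeat(df['orig_qty']), which builds an index with each label repeated by that row's quantity; DataFrame.loc then selects, and thereby duplicates, the matching rows:
>>> df.index.repeat(df['orig_qty'])
Index(['a', 'a', 'a', 'a', 'b', 'b'], dtype='object')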

Use np.repeat:
import numpy as np

new_df = pd.DataFrame({col: np.repeat(df[col], df.orig_qty) for col in df.columns})
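Note that np.repeat dispatches to Series.repeat here, so the result keeps the duplicated original index and still lacks the fifo_qty column. A minimal follow-up to match the desired output might be:
new_df = new_df.reset_index(drop=True).assign(fifo_qty=1)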

Related

Elegant way to drop records in pandas based on size/count of a record

This isn't a duplicate. I am not trying to drop rows based on the index.
I have a dataframe as shown below:
df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'time_1': ['2173-04-03 12:35:00', '2173-04-03 12:50:00', '2173-04-05 12:59:00',
               '2173-05-04 13:14:00', '2173-05-05 13:37:00', '2173-07-06 13:39:00',
               '2173-07-08 11:30:00', '2173-04-08 16:00:00', '2173-04-09 22:00:00',
               '2173-04-11 04:00:00', '2173-04-13 04:30:00', '2173-04-14 08:00:00'],
    'val': [5, 2, 3, 1, 1, 6, 5, 5, 8, 3, 4, 6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
I would like to drop records, based on subject_id, if their count is <= 5.
This is what I tried:
df1 = df.groupby(['subject_id']).size().reset_index(name='counter')
df1[df1['counter'] > 5]  # this gives the valid subject_id (= 1, which has a count of more than 5)
Now, using this subject_id, I have to get the rows of the base dataframe for that subject_id.
There might be an elegant way to do this.
I would like to get the output as shown below, keeping my base dataframe rows.
Use:
df[df.groupby('subject_id')['subject_id'].transform('size') > 5]
Output:
subject_id time_1 val day
0 1 2173-04-03 12:35:00 5 3
1 1 2173-04-03 12:50:00 2 3
2 1 2173-04-05 12:59:00 3 5
3 1 2173-05-04 13:14:00 1 4
4 1 2173-05-05 13:37:00 1 5
5 1 2173-07-06 13:39:00 6 6
6 1 2173-07-08 11:30:00 5 8
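Here transform('size') broadcasts each group's row count back onto every row, so the boolean mask has the full length of df. A more readable, though typically slower, alternative is groupby with filter, which drops whole groups that fail the condition (a sketch):
df.groupby('subject_id').filter(lambda g: len(g) > 5)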

"Rank" DataFrame columns per row

Given a time-series DataFrame, is it possible to create a new DataFrame with the same dimensions, but where the values are the ranking of each row's entries relative to the other columns (ordered smallest value first)?
Example:
ABC DEFG HIJK XYZ
date
2018-01-14 0.110541 0.007615 0.063217 0.002543
2018-01-21 0.007012 0.042854 0.061271 0.007988
2018-01-28 0.085946 0.177466 0.046432 0.069297
2018-02-04 0.018278 0.065254 0.038972 0.027278
2018-02-11 0.071785 0.033603 0.075826 0.073270
The first row would become:
ABC DEFG HIJK XYZ
date
2018-01-14 4 2 3 1
as XYZ has the smallest value in that row and ABC the largest.
numpy.argsort looks like it might help; however, as it outputs the locations themselves, I have not managed to get it to work.
Many thanks
Use a double argsort to rank per row and pass the result to the DataFrame constructor:
df1 = pd.DataFrame(df.values.argsort().argsort() + 1, index=df.index, columns=df.columns)
print (df1)
ABC DEFG HIJK XYZ
date
2018-01-14 4 2 3 1
2018-01-21 1 3 4 2
2018-01-28 3 4 1 2
2018-02-04 1 4 3 2
2018-02-11 2 1 4 3
Or use DataFrame.rank with method='dense' (here every row has unique values, so any tie-breaking method gives the same integer ranks):
df1 = df.rank(axis=1, method='dense').astype(int)
print (df1)
ABC DEFG HIJK XYZ
date
2018-01-14 4 2 3 1
2018-01-21 1 3 4 2
2018-01-28 3 4 1 2
2018-02-04 1 4 3 2
2018-02-11 2 1 4 3
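To see why the double argsort works: the first argsort returns the positions that would sort the row, and the second inverts that permutation so each entry receives its rank. A small sketch on the first row:
import numpy as np

row = np.array([0.110541, 0.007615, 0.063217, 0.002543])
order = row.argsort()        # positions that would sort the row: array([3, 1, 2, 0])
ranks = order.argsort() + 1  # inverting the permutation gives 1-based ranks: array([4, 2, 3, 1])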

Divide columns in df by another df value based on condition

I have a dataframe, df1:
df1 = pd.DataFrame({'date': ['2013-04-01', '2013-04-01', '2013-04-01', '2013-04-02', '2013-04-02'],
                    'month': ['1', '1', '3', '3', '5'],
                    'pmonth': ['1', '1', '2', '5', '5'],
                    'duration': [30, 15, 20, 15, 30],
                    'pduration': ['10', '20', '30', '40', '50']})
I have to divide duration and pduration by the value column of a second dataframe where the date and month of the two dataframes match. The second dataframe, df2, is:
df2 = pd.DataFrame({'date': ['2013-04-01', '2013-04-02', '2013-04-03', '2013-04-04', '2013-04-05'],
                    'month': ['1', '1', '3', '3', '5'],
                    'value': ['1', '1', '2', '5', '5']})
The second df is grouped by date and month, so duplicate combinations of date and month won't be present in it.
First, it is necessary to check that the date and month columns have the same dtypes in both DataFrames, and that the columns used for division are numeric:
#convert to numeric
df1['pduration'] = df1['pduration'].astype(int)
df2['value'] = df2['value'].astype(int)
print (df1.dtypes)
date object
month object
pmonth object
duration int64
pduration int32
print (df2.dtypes)
date object
month object
value int32
dtype: object
Then merge with a left join and divide with DataFrame.div:
df = df1.merge(df2, on=['date', 'month'], how='left')
df[['duration_new','pduration_new']] = df[['duration','pduration']].div(df['value'], axis=0)
print (df)
date month pmonth duration pduration value duration_new \
0 2013-04-01 1 1 30 10 1.0 30.0
1 2013-04-01 1 1 15 20 1.0 15.0
2 2013-04-01 3 2 20 30 NaN NaN
3 2013-04-02 3 5 15 40 NaN NaN
4 2013-04-02 5 5 30 50 NaN NaN
pduration_new
0 10.0
1 20.0
2 NaN
3 NaN
4 NaN
To remove the value column, use DataFrame.pop:
df[['duration_new','pduration_new']] = (df[['duration','pduration']]
                                        .div(df.pop('value'), axis=0))
print (df)
date month pmonth duration pduration duration_new pduration_new
0 2013-04-01 1 1 30 10 30.0 10.0
1 2013-04-01 1 1 15 20 15.0 20.0
2 2013-04-01 3 2 20 30 NaN NaN
3 2013-04-02 3 5 15 40 NaN NaN
4 2013-04-02 5 5 30 50 NaN NaN
You can merge the second df into the first df and then divide. Note that the fillna(1) treats rows without a match as division by 1, so they keep their original values instead of becoming NaN (unlike the answer above).
df1 = df1.merge(df2, on=['date', 'month'], how='left').fillna(1)
df1
date month pmonth duration pduration value
0 2013-04-01 1 1 30 10 1
1 2013-04-01 1 1 15 20 1
2 2013-04-01 3 2 20 30 1
3 2013-04-02 3 5 15 40 1
4 2013-04-02 5 5 30 50 1
df1['duration'] = df1['duration'] / df1['value']
df1['pduration'] = df1['pduration'] / df1['value']
df1.drop('value', axis=1, inplace=True)
You can merge the two dataframes; where the date and month match, the value column will be added to the first dataframe. If there is no match, it will be represented by NaN. You can then do the division operation; see the code below.
Assuming your second dataframe is df2:
df3 = df2.merge(df1, how='right')
for col in ['duration', 'pduration']:
    df3['new_' + col] = df3[col].astype(float) / df3['value'].astype(float)
df3
results in
date month value pmonth duration pduration new_duration new_pduration
0 2013-04-01 1 1 1 30 10 30.0 10.0
1 2013-04-01 1 1 1 15 20 15.0 20.0
2 2013-04-01 3 NaN 2 20 30 NaN NaN
3 2013-04-02 3 NaN 5 15 40 NaN NaN
4 2013-04-02 5 NaN 5 30 50 NaN NaN
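For completeness, a merge-free sketch that looks value up via a (date, month)-indexed Series; this assumes pandas >= 0.24 for MultiIndex.from_frame:
# build a lookup Series keyed by (date, month); unmatched keys become NaN
lookup = df2.set_index(['date', 'month'])['value'].astype(int)
keys = pd.MultiIndex.from_frame(df1[['date', 'month']])
vals = lookup.reindex(keys).to_numpy()
df1['duration_new'] = df1['duration'] / vals
df1['pduration_new'] = df1['pduration'].astype(int) / vals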

How to find rate of change across successive rows using time and data columns after grouping by a different column using pandas?

I have a pandas DataFrame of the form:
df
ID_col time_in_hours data_col
1 62.5 4
1 40 3
1 20 3
2 30 1
2 20 5
3 50 6
What I want to be able to do is find the rate of change of data_col using the time_in_hours column. Specifically,
rate_of_change = (data_col[i+1] - data_col[i]) / abs(time_in_hours[i+1] - time_in_hours[i])
where i is a given row and the rate_of_change is calculated separately for different IDs.
Effectively, I want a new DataFrame of the form:
new_df
ID_col time_in_hours data_col rate_of_change
1 62.5 4 NaN
1 40 3 -0.044
1 20 3 0
2 30 1 NaN
2 20 5 0.4
3 50 6 NaN
How do I go about this?
You can use groupby:
s = df.groupby('ID_col').apply(lambda dft: dft['data_col'].diff() / dft['time_in_hours'].diff().abs())
s.index = s.index.droplevel()
s
returns
0 NaN
1 -0.044444
2 0.000000
3 NaN
4 0.400000
5 NaN
dtype: float64
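Since s shares the original row index once the group level is dropped, you can assign it straight back to get the rate_of_change column from the desired output:
df['rate_of_change'] = s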
You can actually get around the groupby + apply given how your DataFrame is sorted. In this case, you can just check if the ID_col is the same as the shifted row.
So calculate the rate of change for everything, and then only assign the values back if they are within a group.
import numpy as np
mask = df.ID_col == df.ID_col.shift(1)
roc = (df.data_col - df.data_col.shift(1))/np.abs(df.time_in_hours - df.time_in_hours.shift(1))
df.loc[mask, 'rate_of_change'] = roc[mask]
Output:
ID_col time_in_hours data_col rate_of_change
0 1 62.5 4 NaN
1 1 40.0 3 -0.044444
2 1 20.0 3 0.000000
3 2 30.0 1 NaN
4 2 20.0 5 0.400000
5 3 50.0 6 NaN
You can use Series.diff within a groupby apply:
df.groupby('ID_col').apply(
    lambda x: x['data_col'].diff() / x['time_in_hours'].diff().abs())
ID_col
1 0 NaN
1 -0.044444
2 0.000000
2 3 NaN
4 0.400000
3 5 NaN
dtype: float64

Dataframe has every other column as a timestamp, how to get it in one column?

I have a dataframe I import from Excel that is 'n x n' in size and looks like the following (sorry, I do not know how to easily reproduce this with code).
How do I get the timestamps into one column, like the following? (I've tried pivot.)
You may need to extract the data in three column groups, rename the columns, add an "A"/"B"/"C" flag column, and concatenate them together. See the test below:
abc_list = [["2017-10-01", 0, "2017-10-02", 1, "2017-10-03", 8],
            ["2017-11-01", 3, "2017-11-01", 5, "2017-11-05", 10],
            ["2017-12-01", 0, "2017-12-07", 7, "2017-12-07", 12]]
df = pd.DataFrame(abc_list, columns=["Time1", "A", "Time2", "B", "Time3", "C"])
The output:
Time1 A Time2 B Time3 C
0 2017-10-01 0 2017-10-02 1 2017-10-03 8
1 2017-11-01 3 2017-11-01 5 2017-11-05 10
2 2017-12-01 0 2017-12-07 7 2017-12-07 12
Then:
df_a = df.iloc[:, 0:2].rename(columns={'Time1': 'time', 'A': 'value'})
df_a['flag'] = "A"
df_b = df.iloc[:, 2:4].rename(columns={'Time2': 'time', 'B': 'value'})
df_b['flag'] = "B"
df_c = df.iloc[:, 4:].rename(columns={'Time3': 'time', 'C': 'value'})
df_c['flag'] = "C"
df_final = pd.concat([df_a, df_b, df_c])
df_final.reset_index(drop=True)
output:
time value flag
0 2017-10-01 0 A
1 2017-11-01 3 A
2 2017-12-01 0 A
3 2017-10-02 1 B
4 2017-11-01 5 B
5 2017-12-07 7 B
6 2017-10-03 8 C
7 2017-11-05 10 C
8 2017-12-07 12 C
This is quite a bit of code and not a very pythonic way to do it.
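A terser alternative for this column layout might be pd.wide_to_long; the renames below exist only to produce the stub_suffix column names it expects (a sketch):
renamed = df.rename(columns={'Time1': 'time_A', 'A': 'value_A',
                             'Time2': 'time_B', 'B': 'value_B',
                             'Time3': 'time_C', 'C': 'value_C'}).reset_index()
df_long = (pd.wide_to_long(renamed, stubnames=['time', 'value'],
                           i='index', j='flag', sep='_', suffix='\\w+')
           .reset_index(level='flag'))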
Here is another way:
columns = pd.MultiIndex.from_tuples(
    [('A', 'Time'), ('A', 'Value'), ('B', 'Time'),
     ('B', 'Value'), ('C', 'Time'), ('C', 'Value')],
    names=['Group', 'Sub_value'])
df.columns = columns
Output:
Group A B C
Sub_value Time Value Time Value Time Value
0 2017-10-01 0 2017-10-02 1 2017-10-03 8
1 2017-11-01 3 2017-11-01 5 2017-11-05 10
2 2017-12-01 0 2017-12-07 7 2017-12-07 12
Run:
df.stack(level='Group')
Output:
Sub_value Time Value
Group
0 A 2017-10-01 0
B 2017-10-02 1
C 2017-10-03 8
1 A 2017-11-01 3
B 2017-11-01 5
C 2017-11-05 10
2 A 2017-12-01 0
B 2017-12-07 7
C 2017-12-07 12
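To flatten the stacked result into ordinary columns like the first answer's output, move the Group level back into a column (a sketch):
out = (df.stack(level='Group')
         .reset_index(level='Group')
         .rename(columns={'Group': 'flag'}))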
This is one method. It is fairly easy to extend to any number of columns.
import pandas as pd

# read in pairs of columns and assign 'Category' column
dfs = {i: pd.read_excel('file.xlsx', usecols=[2*i, 2*i+1], skiprows=[0],
                        header=None, names=['Date', 'Value']).assign(Category=j)
       for i, j in enumerate(['A', 'B', 'C'])}
# concatenate dataframes
df = pd.concat(list(dfs.values()), ignore_index=True)
