Efficient way to merge two pandas groupby objects - python

I have 2 groupby objects as follows:
df1.last() => returns a pandas dataframe with stock_id and date as index:
                     close
stock_id date
a1       2005-12-31    1.1
         2006-12-31    1.2
         ...
         2017-12-31    1.3
         2018-12-31    1.3
         2019-12-31    1.4
a2       2008-12-31    2.1
         2009-12-31    2.4
         ...
         2018-12-31    3.4
         2019-12-31    3.4
df2 => returns a dataframe with id as index:
   stock_id        date  eps   dps
id
1        a1  2017-12-01  0.2  0.03
2        a1  2018-12-01  0.3  0.02
3        a1  2019-06-01  0.4  0.01
4        a2  2018-12-01  0.5  0.03
5        a2  2019-06-01  0.3  0.04
df2 is supposed to be used as the reference when merging with df1, matching on stock_id and year (df2 covers fewer years than df1). The expected result is as follows:
df2 merged with df1:
    stock_id  date  eps   dps  close  ratio_eps  ratio_dps
id
          a1  2017  0.2  0.03    1.3    0.2/1.3   0.03/1.3
          a1  2018  0.3  0.02    1.3    0.3/1.3   ...
          a1  2019  0.4  0.01    1.4    0.4/1.4   ...
          a2  2018  0.5  0.03    3.4    ...
          a2  2019  0.3  0.04    3.4    ...
The above can be done with a for loop, but it would be inefficient. Is there a pythonic way to achieve it?
How do I drop the day and month from both dataframes and use the year as a key to match and join both tables efficiently?
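One possible approach (a sketch, not a tested answer; it assumes df1 is the groupby object from the question, df2 is already a plain dataframe indexed by id, and the names df1_last, merged, ratio_eps, ratio_dps are only illustrative) is to extract the year on both sides, merge on stock_id and year, and then compute the ratios:

import pandas as pd

# Flatten the grouped result and pull the year out of each date column.
df1_last = df1.last().reset_index()                       # columns: stock_id, date, close
df1_last['year'] = pd.to_datetime(df1_last['date']).dt.year

df2 = df2.copy()
df2['year'] = pd.to_datetime(df2['date']).dt.year

# Merge on stock_id + year, then build the ratio columns.
merged = df2.merge(df1_last[['stock_id', 'year', 'close']],
                   on=['stock_id', 'year'], how='left')
merged['ratio_eps'] = merged['eps'] / merged['close']
merged['ratio_dps'] = merged['dps'] / merged['close']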

Related

Pandas: Is it possible to add new time values, with empty values in the columns, to a csv with a time sequence?

I have a csv file that looks something like this
Time      OI    V
10:00:23  5.4   27
10:00:24  -0.7  1
10:00:28  -0.5  4
10:00:29  0.2   12
Can I somehow add the missing time values with Pandas, filling the columns with zeros or NaN, for the entire csv file?
The result should look something like this:
Time      OI    V
10:00:23  5.4   27
10:00:24  -0.7  1
10:00:25  0     NaN
10:00:26  0     NaN
10:00:27  0     NaN
10:00:28  -0.5  4
10:00:29  0.2   12
Convert the column to datetimes, create a DatetimeIndex, add the missing rows with DataFrame.asfreq, and finally replace the NaNs in the OI column:
df['Time'] = pd.to_datetime(df['Time'])
df = df.set_index('Time').asfreq('S').fillna({'OI':0})
df.index = df.index.time
print (df)
OI V
10:00:23 5.4 27.0
10:00:24 -0.7 1.0
10:00:25 0.0 NaN
10:00:26 0.0 NaN
10:00:27 0.0 NaN
10:00:28 -0.5 4.0
10:00:29 0.2 12.0
If Time should stay a regular column instead of the index:
df['Time'] = pd.to_datetime(df['Time'])
df = df.set_index('Time').asfreq('S').fillna({'OI':0}).reset_index()
df['Time'] = df['Time'].dt.time
print (df)
Time OI V
0 10:00:23 5.4 27.0
1 10:00:24 -0.7 1.0
2 10:00:25 0.0 NaN
3 10:00:26 0.0 NaN
4 10:00:27 0.0 NaN
5 10:00:28 -0.5 4.0
6 10:00:29 0.2 12.0
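For reference, a self-contained run of the first variant (a sketch; it assumes the Time column is read in as plain strings, exactly as it would come from the csv):

import pandas as pd

df = pd.DataFrame({'Time': ['10:00:23', '10:00:24', '10:00:28', '10:00:29'],
                   'OI': [5.4, -0.7, -0.5, 0.2],
                   'V': [27, 1, 4, 12]})

df['Time'] = pd.to_datetime(df['Time'])                  # today's date is attached, only the time matters
df = df.set_index('Time').asfreq('S').fillna({'OI': 0})  # insert the missing seconds
df.index = df.index.time
print(df)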

Pandas dataframe intersection with varying groups

I have a large pandas dataframe with varying rows and columns but looks more or less like:
time id angle ...
0.0 a1 33.67 ...
0.0 b2 35.90 ...
0.0 c3 42.01 ...
0.0 d4 45.00 ...
0.1 a1 12.15 ...
0.1 b2 15.35 ...
0.1 c3 33.12 ...
0.2 a1 65.28 ...
0.2 c3 87.43 ...
0.3 a1 98.85 ...
0.3 c3 100.12 ...
0.4 a1 11.11 ...
0.4 c3 83.22 ...
...
I am trying to group by id and then find ids that have time intervals in common. I have tried pandas groupby and can easily group by id and get each group's information. How can I take it a step further and find ids that also share the same time stamps?
Ideally I'd like to return the intersection over a fixed time interval (2-3 seconds) for the ids that overlap on that interval:
time id angle ...
0.0 a1 33.67 ...
0.1 a1 12.15 ...
0.2 a1 65.28 ...
0.3 a1 98.85 ...
0.0 c3 42.01 ...
0.1 c3 33.12 ...
0.2 c3 87.43 ...
0.3 c3 100.12 ...
Code tried so far:
#create pandas grouped by id
df1 = df.groupby(['id'], as_index=False)
Which outputs:
time id angle ...
(0.0 a1 33.67
...
0.4 a1 11.11)
(0.0 b2 35.90
0.1 b2 15.35)
(0.0 c3 42.01
...
0.4 c3 83.22)
(0.0 d4 45.00)
But I'd like to return only a dataframe where the ids share the same times over a fixed interval, in the above example 0.4 seconds.
Any ideas on a fairly simple way to achieve this with pandas dataframes?
If you need to filter rows by some interval - e.g. here between 0 and 0.4 - and get all ids which overlap, use boolean indexing with Series.between first, then DataFrame.pivot:
df1 = df[df['time'].between(0, 0.4)].pivot(index='time', columns='id', values='angle')
print (df1)
id a1 b2 c3 d4
time
0.0 33.67 35.90 42.01 45.0
0.1 12.15 15.35 33.12 NaN
0.2 65.28 NaN 87.43 NaN
0.3 98.85 NaN 100.12 NaN
0.4 11.11 NaN 83.22 NaN
There are missing values for the ids that do not overlap, so remove columns with any NaN by DataFrame.dropna, then reshape back to 3 columns with DataFrame.unstack and Series.reset_index:
print (df1.dropna(axis=1))
id a1 c3
time
0.0 33.67 42.01
0.1 12.15 33.12
0.2 65.28 87.43
0.3 98.85 100.12
0.4 11.11 83.22
df2 = df1.dropna(axis=1).unstack().reset_index(name='angle')
print (df2)
id time angle
0 a1 0.0 33.67
1 a1 0.1 12.15
2 a1 0.2 65.28
3 a1 0.3 98.85
4 a1 0.4 11.11
5 c3 0.0 42.01
6 c3 0.1 33.12
7 c3 0.2 87.43
8 c3 0.3 100.12
9 c3 0.4 83.22
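The two steps can also be chained into a single expression (same logic as above, only written in one go):

df2 = (df[df['time'].between(0, 0.4)]
         .pivot(index='time', columns='id', values='angle')
         .dropna(axis=1)
         .unstack()
         .reset_index(name='angle'))
print(df2)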
There are many ways to define the filter you're asking for:
df.groupby('id').filter(lambda x: len(x) > 4)
# OR
df.groupby('id').filter(lambda x: x['time'].eq(0.4).any())
# OR
df.groupby('id').filter(lambda x: x['time'].max() == 0.4)
Output:
time id angle
0 0.0 a1 33.67
2 0.0 c3 42.01
4 0.1 a1 12.15
6 0.1 c3 33.12
7 0.2 a1 65.28
8 0.2 c3 87.43
9 0.3 a1 98.85
10 0.3 c3 100.12
11 0.4 a1 11.11
12 0.4 c3 83.22
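If the window itself needs to vary, the same filtering idea can be wrapped in a small helper (a sketch; ids_covering is a hypothetical name and it assumes each id has at most one row per time stamp):

def ids_covering(df, start, end):
    # keep only the rows inside the window
    window = df[df['time'].between(start, end)]
    # an id qualifies if it has an observation at every time step in the window
    n_steps = window['time'].nunique()
    return window.groupby('id').filter(lambda g: g['time'].nunique() == n_steps)

print(ids_covering(df, 0.0, 0.4))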

Creating a new pandas dataframe column by looking up values in other rows

I would like to find a faster way to calculate the "sales 52 weeks ago" column for each product below, without using iterrows or itertuples. Any suggestions? The input is the table without the "sales 52 weeks ago" column and the output is the entire table below.
date sales city product sales 52 weeks ago
0 2020-01-01 1.5 c1 p1 0.6
1 2020-01-01 1.2 c1 p2 0.3
2 2019-05-02 0.5 c1 p1 nan
3 2019-01-02 0.3 c1 p2 nan
4 2019-01-02 0.6 c1 p1 nan
5 2019-01-01 1.2 c1 p2 nan
Example itertuples code, which is really slow:
from datetime import timedelta

for row in df.itertuples(index=True, name='Pandas'):
    try:
        match = df.loc[(df['date'] == row.date - timedelta(weeks=52)) &
                       (df['product'] == row.product), 'sales']
        df.at[row.Index, 'sales 52 weeks ago'] = match.iloc[0]
    except IndexError:
        continue
You need a merge after subtracting the date with Timedelta:
m = df['date'].sub(pd.Timedelta('52W')).to_frame().assign(product=df['product'])
final = df.assign(sales_52_W_ago=m.merge(df, on=['date', 'product'],
                                         how='left').loc[:, 'sales'])
date sales city product sales_52_W_ago
0 2020-01-01 1.5 c1 p1 0.6
1 2020-01-01 1.2 c1 p2 0.3
2 2019-05-02 0.5 c1 p1 NaN
3 2019-01-02 0.3 c1 p2 NaN
4 2019-01-02 0.6 c1 p1 NaN
5 2019-01-01 1.2 c1 p2 NaN
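An equivalent way to spell the same merge (a sketch, not from the original answer; lookup and out are illustrative names and the date column is assumed to be datetime) is to shift a sales lookup table forward by 52 weeks and self-merge, which leaves the original frame untouched:

import pandas as pd

lookup = df[['date', 'product', 'sales']].copy()
lookup['date'] = lookup['date'] + pd.Timedelta(weeks=52)   # label each sale with the date 52 weeks later
out = df.merge(lookup, on=['date', 'product'], how='left',
               suffixes=('', ' 52 weeks ago'))              # adds the 'sales 52 weeks ago' column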

Pandas - Rebasing values based on a specific column

I have the following dataframe:
Name 2018-02-28 2018-01-31 2018-12-31 2017-11-30 2017-10-31 2017-09-30
ID
11 ABC 110 109 108 100 95 90
22 DEF 120 119 118 100 85 80
33 GHI 130 129 128 100 75 70
I would like to obtain the below table, where the resulting data reflects the % chg of each row's values relative to a particular column, in this case the 2017-11-30 values.
Then, create a row at the bottom of the dataframe that provides the average.
Name 2018-02-28 2018-01-31 2018-12-31 2017-11-30 2017-10-31 2017-09-30
ID
11 ABC 10.0% 9.0% 8.0% 0.0% -5.0% -10.0%
22 DEF 20.0% 19.0% 18.0% 0.0% -15.0% -20.0%
33 GHI 30.0% 29.0% 28.0% 0.0% -25.0% -30.0%
Average 20.0% 19.0% 18.0% 0.0% -15.0% -20.0%
My actual dataframe has about 50 columns and 50 rows, and the column used as the "base" value when calculating the % chg is the one from 1 year ago (i.e. column 14). A solution as generic as possible would be really appreciated!
I couldn't resist posting a continuation of jpp's solution, cleaning it up with a MultiIndex. First we recreate the data set from the raw string.
import pandas as pd
import numpy as np
from io import StringIO  # io.StringIO replaces the now-removed pd.compat.StringIO

data = '''\
ID Name 2018-02-28 2018-01-31 2018-12-31 2017-11-30 2017-10-31 2017-09-30
11 ABC 110 109 108 100 95 90
22 DEF 120 119 118 100 85 80
33 GHI 130 129 128 100 75 70'''
df = pd.read_csv(StringIO(data), sep=r'\s+').set_index('ID')
Alternative single-index:
# Pop away the column names and add Average
names = df.pop('Name').tolist() + ['Average']
# Rebase against column index 3 (the 4th column, 2017-11-30), dividing by the base itself
base = df.iloc[:, 3].values
df.loc[:] = (df.values - base[:, None]) / base[:, None]
# Get the mean and append it as an extra row
s = df.mean()
s.name = '99'  # a row name is required (this will be the id)
df = pd.concat([df, s.to_frame().T])  # DataFrame.append was removed in pandas 2.0
# Insert the names back
df.insert(0, 'Name', names)
print(df)
Returns
Name 2018-02-28 2018-01-31 2018-12-31 2017-11-30 2017-10-31 \
ID
11 ABC 0.1 0.09 0.08 0.0 -0.05
22 DEF 0.2 0.19 0.18 0.0 -0.15
33 GHI 0.3 0.29 0.28 0.0 -0.25
99 Average 0.2 0.19 0.18 0.0 -0.15
2017-09-30
ID
11 -0.1
22 -0.2
33 -0.3
99 -0.2
Alternative with multi-index
# Set a dual (ID, Name) index
df = df.set_index([df.index, 'Name'])
# Rebase against column index 3 (the 4th column, 2017-11-30)
base = df.iloc[:, 3].values
df.loc[:] = (df.values - base[:, None]) / base[:, None]
# Get the mean and append it as an 'Average' row
s = df.mean()
s.name = 'Average'
df = pd.concat([df, s.to_frame().T])
print(df)
df output:
2018-02-28 2018-01-31 2018-12-31 2017-11-30 2017-10-31 2017-09-30
(11, ABC) 0.1 0.09 0.08 0.0 -0.05 -0.1
(22, DEF) 0.2 0.19 0.18 0.0 -0.15 -0.2
(33, GHI) 0.3 0.29 0.28 0.0 -0.25 -0.3
Average 0.2 0.19 0.18 0.0 -0.15 -0.2
You can use numpy for this. The output below is in decimals; multiply by 100 if necessary.
df.iloc[:, 1:] = (df.iloc[:, 1:].values / df.iloc[:, 4].values[:, None]) - 1
df.loc[len(df)+1] = ['Average'] + np.mean(df.iloc[:, 1:].values, axis=0).tolist()
Result
Name 2018-02-28 2018-01-31 2018-12-31 2017-11-30 2017-10-31 \
ID
11 ABC 0.1 0.09 0.08 0.0 -0.05
22 DEF 0.2 0.19 0.18 0.0 -0.15
33 GHI 0.3 0.29 0.28 0.0 -0.25
4 Average 0.2 0.19 0.18 0.0 -0.15
2017-09-30
ID
11 -0.1
22 -0.2
33 -0.3
4 -0.2
Explanation
df.iloc[:, 1:] extracts the 2nd column onwards; .values retrieves the numpy array representation from the dataframe.
[:, None] adds an axis to the base-column array so that it broadcasts row-wise and the division is oriented correctly.
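For a more generic version (a sketch, not from the original answers; it assumes df is the original frame from the question with ID as index and a Name column, and base_col, num, rebased are illustrative names), the same rebasing can be written with pandas' own broadcasting so the base column can be any column, e.g. the 14th in the real data:

import pandas as pd

base_col = '2017-11-30'                       # the column to rebase against
num = df.drop(columns='Name')                 # numeric columns only
rebased = num.div(num[base_col], axis=0) - 1  # % change vs. the base column
rebased.loc['Average'] = rebased.mean()       # append the average row
rebased.insert(0, 'Name', df['Name'].reindex(rebased.index).fillna('Average'))
print(rebased)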

How do I convert data to NaN when the flag in the corresponding column (matched on the first 6 characters) is 1?

There are two DataFrames, one with the data and one with the flags.
The order of the columns differs between the two dataframes.
The frames have hundreds of columns and about half a million records.
df
123456.A 123456.B ... 456789.A 456789.B
2016-01-01 00:00 5.6 0.3 ... 6.7 1.1
2016-01-01 00:01 5.4 0.4 ... 6.7 1.3
2016-01-01 00:02 5.1 0.2 ... 6.7 1.5
....
2016-12-31 23:57 5.7 0.4 ... 6.7 1.2
2016-12-31 23:58 5.6 0.3 ... 6.7 1.4
2016-12-31 23:59 5.4 0.4 ... 6.7 1.5
flag_t
456789 123456 ... 342546 821453
2016-01-01 00:00 1 0 ... 0 0
2016-01-01 00:01 0 0 ... 0 0
2016-01-01 00:02 1 1 ... 0 0
....
2016-12-31 23:57 0 1 ... 1 1
2016-12-31 23:58 0 0 ... 0 1
2016-12-31 23:59 0 0 ... 0 1
This is a table that I would like to get:
df
123456.A 123456.B ... 456789.A 456789.B
2016-01-01 00:00 5.6 0.3 ... NaN NaN
2016-01-01 00:01 5.4 0.4 ... 6.7 1.3
2016-01-01 00:02 NaN NaN ... NaN NaN
....
2016-12-31 23:57 NaN NaN ... 6.7 1.2
2016-12-31 23:58 5.6 0.3 ... 6.7 1.4
2016-12-31 23:59 5.4 0.4 ... 6.7 1.5
Split the columns on '.', then add df2.where(df2 == 0) (df2 here is the flags frame, flag_t above), which is zero where the flag is zero and np.nan elsewhere.
I do this because I can add at a specific column level, broadcasting over the rest.
df.columns = df.columns.str.split('.', expand=True)
df = df.add(df2.where(df2==0), level=0)
df.columns = df.columns.map('.'.join)
print(df)
123456.A 123456.B 456789.A 456789.B
2016-01-01 00:00:00 5.6 0.3 NaN NaN
2016-01-01 00:01:00 5.4 0.4 6.7 1.3
2016-01-01 00:02:00 NaN NaN NaN NaN
2016-12-31 23:57:00 NaN NaN 6.7 1.2
2016-12-31 23:58:00 5.6 0.3 6.7 1.4
2016-12-31 23:59:00 5.4 0.4 6.7 1.5
You can use mask, which creates NaN where the mask is True, together with reindex:
#convert columns to MultiIndex
df.columns = df.columns.str.split('.', expand=True)
print (df)
123456 456789
A B A B
2016-01-01 00:00 5.6 0.3 6.7 1.1
2016-01-01 00:01 5.4 0.4 6.7 1.3
2016-01-01 00:02 5.1 0.2 6.7 1.5
2016-12-31 23:57 5.7 0.4 6.7 1.2
2016-12-31 23:58 5.6 0.3 6.7 1.4
2016-12-31 23:59 5.4 0.4 6.7 1.5
#create new MultiIndex with flag_t columns and possible letters
mux = pd.MultiIndex.from_product([flag_t.columns, ['A','B']])
print (mux)
MultiIndex(levels=[['123456', '342546', '456789', '821453'], ['A', 'B']],
labels=[[2, 2, 0, 0, 1, 1, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]])
#reindex flag_t by new MultiIndex mux
flag_t = flag_t.reindex(columns=mux, level=0)
print (flag_t)
456789 123456 342546 821453
A B A B A B A B
2016-01-01 00:00 1 1 0 0 0 0 0 0
2016-01-01 00:01 0 0 0 0 0 0 0 0
2016-01-01 00:02 1 1 1 1 0 0 0 0
2016-12-31 23:57 0 0 1 1 1 1 1 1
2016-12-31 23:58 0 0 0 0 0 0 1 1
2016-12-31 23:59 0 0 0 0 0 0 1 1
#create mask by reindex, cast to bool
mask = flag_t.reindex(columns=df.columns).astype(bool)
print (mask)
123456 456789
A B A B
2016-01-01 00:00 False False True True
2016-01-01 00:01 False False False False
2016-01-01 00:02 True True True True
2016-12-31 23:57 True True False False
2016-12-31 23:58 False False False False
2016-12-31 23:59 False False False False
df1 = df.mask(mask)
#convert MultiIndex to columns
df1.columns = df1.columns.map('.'.join)
print (df1)
123456.A 123456.B 456789.A 456789.B
2016-01-01 00:00 5.6 0.3 NaN NaN
2016-01-01 00:01 5.4 0.4 6.7 1.3
2016-01-01 00:02 NaN NaN NaN NaN
2016-12-31 23:57 NaN NaN 6.7 1.2
2016-12-31 23:58 5.6 0.3 6.7 1.4
2016-12-31 23:59 5.4 0.4 6.7 1.5
Assuming that your second array, flag_t, is a valid mask for your first array, to get the output you desire you can use pandas.DataFrame.where. Here's a small demonstrative example (the mask is cast to bool, since where expects a boolean condition):
>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> mask = pd.DataFrame({'a': [0, 1], 'b': [1, 0]})
>>> df.where(mask.astype(bool))
     a    b
0  NaN  3.0
1  2.0  NaN
In your case the rub is that each flag column corresponds to two data columns, suffixed '.A' and '.B', so the names are not exactly synonymous. Here's one way to handle this:
df_1 = (df[[c for c in df.columns if ".A" in c]]          # take the .A columns...
        .rename(columns={c: c[:-2] for c in df.columns})  # ...strip the ".A" suffix...
        .where(mask))                                      # ...and apply the mask
df_2 = (df[[c for c in df.columns if ".B" in c]]          # same for the .B columns
        .rename(columns={c: c[:-2] for c in df.columns})
        .where(mask))
# Rejoin to get the final result.
masked_df = df_1.join(df_2, lsuffix='.A', rsuffix='.B')
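A more compact route (a sketch, not part of the original answers; flags and masked are illustrative names, it starts again from the original df with flat '<6-char id>.<letter>' column names, and it assumes both frames share the same row index) is to select one flag column per data column by its prefix, relabel positionally, and mask in one go:

flags = flag_t[[c.split('.')[0] for c in df.columns]].copy()  # one flag column per data column
flags.columns = df.columns                                     # align the labels positionally
masked = df.mask(flags.eq(1))                                  # NaN wherever the flag is 1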
