I have a large pandas dataframe with varying rows and columns but looks more or less like:
time id angle ...
0.0 a1 33.67 ...
0.0 b2 35.90 ...
0.0 c3 42.01 ...
0.0 d4 45.00 ...
0.1 a1 12.15 ...
0.1 b2 15.35 ...
0.1 c3 33.12 ...
0.2 a1 65.28 ...
0.2 c3 87.43 ...
0.3 a1 98.85 ...
0.3 c3 100.12 ...
0.4 a1 11.11 ...
0.4 c3 83.22 ...
...
I am trying to group the ids together and then find ids that share common time intervals. I have tried pandas groupby and can easily group the rows by id and inspect each group. How can I take it a step further and find ids that also have the same timestamps?
Ideally I'd like to return the intersection over a certain fixed time interval (2-3 seconds) for the ids whose timestamps overlap across that interval:
time id angle ...
0.0 a1 33.67 ...
0.1 a1 12.15 ...
0.2 a1 65.28 ...
0.3 a1 98.85 ...
0.0 c3 42.01 ...
0.1 c3 33.12 ...
0.2 c3 87.43 ...
0.3 c3 100.12 ...
Code tried so far:
# create a pandas GroupBy object keyed by id
df1 = df.groupby(['id'], as_index=False)
Which outputs:
time id angle ...
(0.0 a1 33.67
...
0.4 a1 11.11)
(0.0 b2 35.90
0.1 b2 15.35)
(0.0 c3 42.01
...
0.4 c3 83.22)
(0.0 d4 45.00)
But I'd like to return only a dataframe where the ids share the same timestamps over a fixed interval, in the above example 0.4 seconds.
Any ideas on a fairly simple way to achieve this with pandas dataframes?
If you need to filter rows by some interval - e.g. here between 0 and 0.4 - and get all ids which overlap, use boolean indexing with Series.between first, then DataFrame.pivot:
df1 = df[df['time'].between(0, 0.4)].pivot(index='time', columns='id', values='angle')
print (df1)
id a1 b2 c3 d4
time
0.0 33.67 35.90 42.01 45.0
0.1 12.15 15.35 33.12 NaN
0.2 65.28 NaN 87.43 NaN
0.3 98.85 NaN 100.12 NaN
0.4 11.11 NaN 83.22 NaN
There are missing values for the ids that do not overlap, so remove columns with any NaNs using DataFrame.dropna, then reshape back to 3 columns with DataFrame.unstack and Series.reset_index:
print (df1.dropna(axis=1))
id a1 c3
time
0.0 33.67 42.01
0.1 12.15 33.12
0.2 65.28 87.43
0.3 98.85 100.12
0.4 11.11 83.22
df2 = df1.dropna(axis=1).unstack().reset_index(name='angle')
print (df2)
id time angle
0 a1 0.0 33.67
1 a1 0.1 12.15
2 a1 0.2 65.28
3 a1 0.3 98.85
4 a1 0.4 11.11
5 c3 0.0 42.01
6 c3 0.1 33.12
7 c3 0.2 87.43
8 c3 0.3 100.12
9 c3 0.4 83.22
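For reference, the two steps can also be chained into a single expression; this is just a compact sketch of the same idea, assuming df holds the sample data from the question:
df2 = (df[df['time'].between(0, 0.4)]
         .pivot(index='time', columns='id', values='angle')
         .dropna(axis=1)            # keep only ids present at every timestamp
         .unstack()
         .reset_index(name='angle'))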
There are many ways to define the filter you're asking for:
df.groupby('id').filter(lambda x: len(x) > 4)
# OR
df.groupby('id').filter(lambda x: x['time'].eq(0.4).any())
# OR
df.groupby('id').filter(lambda x: x['time'].max() == 0.4)
Output:
time id angle
0 0.0 a1 33.67
2 0.0 c3 42.01
4 0.1 a1 12.15
6 0.1 c3 33.12
7 0.2 a1 65.28
8 0.2 c3 87.43
9 0.3 a1 98.85
10 0.3 c3 100.12
11 0.4 a1 11.11
12 0.4 c3 83.22
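If the window is not fixed at 0 to 0.4, the same filter idea generalizes. A hedged sketch, assuming you want every id whose timestamps cover all the times observed inside a chosen window (the bounds here are illustrative):
# timestamps that an id must cover, taken from the chosen window
required = set(df.loc[df['time'].between(0, 0.4), 'time'].unique())
out = df.groupby('id').filter(lambda g: required.issubset(g['time']))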
Related
I have a pandas DataFrame which is of the form :
A B C D
A1 6 7.5 NaN
A1 4 23.8 <D1 0.0 6.5 12 4, D2 1.0 4 3.5 1>
A2 7 11.9 <D1 2.0 7.5 10 2, D3 7.5 4.2 13.5 4>
A3 11 0.8 <D2 2.0 7.5 10 2, D3 7.5 4.2 13.5 4, D4 2.0 7.5 10 2, D5 7.5 4.2 13.5 4>
Column D is a raw-string column with multiple categories in each entry. The value of each category is calculated by dividing its last two values. For example, in the 2nd row:
D1 = 12/4 = 3
D2 = 3.5/1 = 3.5
I need to split column D based on its categories and join them to my DataFrame. The problem is that the column is dynamic and can have nearly 35-40 categories within a single entry. For now, all I'm doing is a brute-force approach of iterating over all rows, which is very slow for large datasets. Can someone please help me?
EXPECTED OUTCOME
A B C D1 D2 D3 D4 D5
A1 6 7.5 NaN NaN NaN NaN NaN
A1 4 23.8 3.0 3.5 NaN NaN NaN
A2 7 11.9 5.0 NaN 3.4 NaN NaN
A3 11 0.8 NaN 5.0 3.4 5.0 3.4
Use:
d = df['D'].str.extractall(r'(D\d+).*?([\d.]+)\s([\d.]+)(?:,|\>)')
d = d.droplevel(1).set_index(0, append=True).astype(float)
d = df.join(d[1].div(d[2]).round(1).unstack()).drop(columns='D')
Details:
Use Series.str.extractall to extract all the capture groups from column D as specified by the regex pattern.
print(d)
0 1 2 # --> capture groups
match
1 0 D1 12 4
1 D2 3.5 1
2 0 D1 10 2
1 D3 13.5 4
3 0 D2 10 2
1 D3 13.5 4
2 D4 10 2
3 D5 13.5 4
Use DataFrame.droplevel + DataFrame.set_index with the optional parameter append=True to drop the unused match level and append the category column to the index of the dataframe.
print(d)
1 2
0
1 D1 12.0 4.0
D2 3.5 1.0
2 D1 10.0 2.0
D3 13.5 4.0
3 D2 10.0 2.0
D3 13.5 4.0
D4 10.0 2.0
D5 13.5 4.0
Use Series.div to divide column 1 by column 2, Series.round to round the values, and Series.unstack to reshape the result, then join it back to df with DataFrame.join:
print(d)
A B C D1 D2 D3 D4 D5
0 A1 6 7.5 NaN NaN NaN NaN NaN
1 A1 4 23.8 3.0 3.5 NaN NaN NaN
2 A2 7 11.9 5.0 NaN 3.4 NaN NaN
3 A3 11 0.8 NaN 5.0 3.4 5.0 3.4
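For completeness, a minimal self-contained sketch of the whole pipeline; the sample frame below is reconstructed from the question, so treat it as illustrative:
import pandas as pd

df = pd.DataFrame({
    'A': ['A1', 'A1', 'A2', 'A3'],
    'B': [6, 4, 7, 11],
    'C': [7.5, 23.8, 11.9, 0.8],
    'D': [None,
          '<D1 0.0 6.5 12 4, D2 1.0 4 3.5 1>',
          '<D1 2.0 7.5 10 2, D3 7.5 4.2 13.5 4>',
          '<D2 2.0 7.5 10 2, D3 7.5 4.2 13.5 4, D4 2.0 7.5 10 2, D5 7.5 4.2 13.5 4>'],
})

# capture the category name plus the last two numbers before each ',' or '>'
d = df['D'].str.extractall(r'(D\d+).*?([\d.]+)\s([\d.]+)(?:,|\>)')
d = d.droplevel(1).set_index(0, append=True).astype(float)
out = df.join(d[1].div(d[2]).round(1).unstack()).drop(columns='D')
print(out)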
I would like to find a faster way to calculate the "sales 52 weeks ago" column for each product below, without using iterrows or itertuples. Any suggestions? The input will be the table below without the "sales 52 weeks ago" column, and the output will be the entire table.
date sales city product sales 52 weeks ago
0 2020-01-01 1.5 c1 p1 0.6
1 2020-01-01 1.2 c1 p2 0.3
2 2019-05-02 0.5 c1 p1 nan
3 2019-01-02 0.3 c1 p2 nan
4 2019-01-02 0.6 c1 p1 nan
5 2019-01-01 1.2 c1 p2 nan
Example itertuples code but really slow:
from datetime import timedelta

for row in df.itertuples(index=True, name='Pandas'):
    try:
        # sale for the same product exactly 52 weeks earlier
        df.at[row.Index, 'sales 52 weeks ago'] = df.loc[
            (df['date'] == row.date - timedelta(weeks=52))
            & (df['product'] == row.product), 'sales'].iloc[0]
    except IndexError:  # no matching sale 52 weeks earlier
        continue
You need a merge after subtracting a 52-week Timedelta from the date:
m = df['date'].sub(pd.Timedelta('52W')).to_frame().assign(product=df['product'])
final = df.assign(sales_52_W_ago=m.merge(df, on=['date', 'product'],
                                         how='left').loc[:, 'sales'])
date sales city product sales_52_W_ago
0 2020-01-01 1.5 c1 p1 0.6
1 2020-01-01 1.2 c1 p2 0.3
2 2019-05-02 0.5 c1 p1 NaN
3 2019-01-02 0.3 c1 p2 NaN
4 2019-01-02 0.6 c1 p1 NaN
5 2019-01-01 1.2 c1 p2 NaN
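An equivalent sketch, possibly easier to read, shifts a copy of the sales forward by 52 weeks and self-merges on date and product. Column names follow the sample above, date is assumed to already be a datetime column, and one sale per (date, product) pair is assumed:
lookup = df[['date', 'product', 'sales']].copy()
lookup['date'] = lookup['date'] + pd.Timedelta(weeks=52)  # align with rows dated 52 weeks later
out = df.merge(lookup, on=['date', 'product'], how='left',
               suffixes=('', ' 52 weeks ago'))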
I have 2 groupby objects as follows:
df1.last() => returns a pandas DataFrame with stock_id and date as index:
close
stock_id, date
a1 2005-12-31 1.1
2006-12-31 1.2
...
2017-12-31 1.3
2018-12-31 1.3
2019-12-31 1.4
a2 2008-12-31 2.1
2009-12-31 2.4
....
2018-12-31 3.4
2019-12-31 3.4
df2 => returns a groupby object with id as index:
stock_id, date, eps, dps,
id
1 a1 2017-12-01 0.2 0.03
2 a1 2018-12-01 0.3 0.02
3 a1 2019-06-01 0.4 0.01
4 a2 2018-12-01 0.5 0.03
5 a2 2019-06-01 0.3 0.04
df2 is supposed to be used as the reference to merge with df1, matching on stock_id and year, since df2 covers fewer years. The expected result is as follows:
df2 merge with df1:
stock_id, date, eps, dps close ratio_eps, ratio_dps
id a1 2017 0.2 0.03 1.3 0.2/1.3 0.03/1.3
a1 2018 0.3 0.02 1.3 0.3/1.3 ...
a1 2019 0.4 0.01 1.4 0.4/1.4 ...
a2 2018 0.5 0.03 3.4 ...
a2 2019 0.3 0.04 3.4 ...
The above can be done with a for loop, but it would be inefficient. Is there any pythonic way to achieve it? How do I remove the day and month from both dataframes and use the year as a key to match and join both tables efficiently?
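A hedged sketch of one way to do the year-based match: extract the year on both sides and merge on stock_id and year. The names closes, ref and merged are illustrative; df1.last() is assumed to return the close table shown above, and df2 is assumed to be available as the plain eps/dps DataFrame shown:
# yearly close prices, one row per (stock_id, year)
closes = df1.last().reset_index()
closes['year'] = pd.to_datetime(closes['date']).dt.year

# eps/dps reference rows, keyed the same way
ref = df2.copy()
ref['year'] = pd.to_datetime(ref['date']).dt.year

merged = ref.merge(closes[['stock_id', 'year', 'close']],
                   on=['stock_id', 'year'], how='left')
merged['ratio_eps'] = merged['eps'] / merged['close']
merged['ratio_dps'] = merged['dps'] / merged['close']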
I have the following dataframes:
df1
C1 C2 C3
0 0 0 0
1 0 0 0
df2
C1 C4 C5
0 1 1 1
1 1 1 1
The result I am looking for is:
df3
C1 C2 C3 C4 C5
0 0.5 0 0 1 1
1 0.5 0 0 1 1
Is there an easy way to accomplish this?
Thanks in advance!
You can use concat and groupby with axis=1:
s = pd.concat([df1, df2], axis=1)
s.groupby(s.columns.values, axis=1).mean()
Out[116]:
C1 C2 C3 C4 C5
0 0.5 0.0 0.0 1.0 1.0
1 0.5 0.0 0.0 1.0 1.0
A nice alternative from #cᴏʟᴅsᴘᴇᴇᴅ
s.groupby(level=0,axis=1).mean()
Out[117]:
C1 C2 C3 C4 C5
0 0.5 0.0 0.0 1.0 1.0
1 0.5 0.0 0.0 1.0 1.0
DataFrame.add
df3 = df2.add(df1, fill_value=0)
df3[df1.columns.intersection(df2.columns)] /= 2
C1 C2 C3 C4 C5
0 0.5 0.0 0.0 1.0 1.0
1 0.5 0.0 0.0 1.0 1.0
concat + groupby
pd.concat([df1, df2], axis=1).groupby(axis=1, level=0).mean()
C1 C2 C3 C4 C5
0 0.5 0.0 0.0 1.0 1.0
1 0.5 0.0 0.0 1.0 1.0
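Note that grouping with axis=1 is deprecated in recent pandas versions. A hedged alternative sketch that gives the same result is to concatenate along the rows and average within each row label, letting pandas align on column names and ignore the NaNs:
pd.concat([df1, df2]).groupby(level=0).mean()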
I have a dataframe of the following structure which is simplified for this question.
A B C D E
0 2014/01/01 nan nan 0.2 nan
1 2014/01/01 0.1 nan nan nan
2 2014/01/01 nan 0.3 nan 0.7
3 2014/01/02 nan 0.4 nan nan
4 2014/01/02 0.5 nan 0.6 0.8
What I have here is a series of readings across several timestamps on single days. The columns B, C, D and E represent different locations. The data I am reading in is set up such that at a specified timestamp it takes data from certain locations and fills in nan values for the other locations.
What I wish to do is group the data by timestamp, which I can easily do with a .groupby() command. From there I wish to have the nan values in the grouped data overwritten with the valid values taken from the other rows, such that the following result is obtained.
A B C D E
0 2014/01/01 0.1 0.3 0.2 0.7
1 2014/01/02 0.5 0.4 0.6 0.8
How do I go about achieving this?
Try df.groupby with DataFrameGroupBy.agg and numpy's nansum (np here is numpy):
In [528]: df.groupby('A', as_index=False, sort=False).agg(np.nansum)
Out[528]:
A B C D E
0 2014/01/01 0.1 0.3 0.2 0.7
1 2014/01/02 0.5 0.4 0.6 0.8
A shorter version with DataFrameGroupBy.sum (thanks MaxU!):
In [537]: df.groupby('A', as_index=False, sort=False).sum()
Out[537]:
A B C D E
0 2014/01/01 0.1 0.3 0.2 0.7
1 2014/01/02 0.5 0.4 0.6 0.8
You can try this using pandas GroupBy.first, which returns the first non-NaN value in each column per group:
df.groupby('A', as_index=False).first()
A B C D E
0 1/1/2014 0.1 0.3 0.2 0.7
1 1/2/2014 0.5 0.4 0.6 0.8