I have a dataframe like this:
dasz_id sector counts
0 0 dasz_id 2011.0
1 NaN wah11 0.0
2 NaN wah21 0.0
3 0 dasz_id 2012.0
4 NaN wah11 0.0
5 NaN wah21 0.0
I'm trying to get the dasz_id value and apply it to all the following rows until a new dasz_id value appears, so the desired output would look like this:
dasz_id sector counts
0 2011 dasz_id 2011.0
1 2011 wah11 0.0
2 2011 wah21 0.0
3 2012 dasz_id 2012.0
4 2012 wah11 0.0
5 2012 wah21 0.0
I've created a function using the apply method, which works for pulling the value over, but I don't know how to apply that value to the rest of the rows.
What am I doing wrong?
def dasz(row):
    if row.sector == "dasz_id":
        return int(row.counts)
    else:
        # get previous dasz_id value

e["dasz_id"] = e.apply(dasz, axis = 1)
I do not know why you have a duplicated index, but here is one way:
import numpy as np

df['dasz_id'] = df['counts']
df['dasz_id'] = df['dasz_id'].replace({0: np.nan}).ffill()
df
Out[84]:
dasz_id sector counts
0 2011.0 dasz_id 2011.0
1 2011.0 wah11 0.0
2 2011.0 wah21 0.0
0 2012.0 dasz_id 2012.0
1 2012.0 wah11 0.0
2 2012.0 wah21 0.0
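For reference, a self-contained sketch of the same approach (the frame is rebuilt from the question with a fresh index, so the duplicated index shown above is not reproduced):

import numpy as np
import pandas as pd

# sample frame rebuilt from the question
df = pd.DataFrame({
    'dasz_id': [0, np.nan, np.nan, 0, np.nan, np.nan],
    'sector':  ['dasz_id', 'wah11', 'wah21', 'dasz_id', 'wah11', 'wah21'],
    'counts':  [2011.0, 0.0, 0.0, 2012.0, 0.0, 0.0],
})

# copy counts over, turn the 0 placeholders into NaN, then forward-fill
df['dasz_id'] = df['counts'].replace({0: np.nan}).ffill()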
Using the dasz function that you created and the ffill function used by Wen, you could also do:
def dasz(row):
    if row.sector == "dasz_id":
        return row.counts

e["dasz_id"] = e.apply(dasz, axis = 1)
e.ffill(inplace=True)
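If you prefer to skip the apply step entirely, the same masking idea can be written with where (a sketch that assumes, as in the sample data, that each block starts with a dasz_id row holding the id in counts):

# keep counts only on the 'dasz_id' rows, NaN elsewhere, then forward-fill
e['dasz_id'] = e['counts'].where(e['sector'] == 'dasz_id').ffill().astype(int)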
I have a data frame of football league stats stored in the dataframe matches:
matches.iloc[: , 20:30]
sot dist fk pk pkatt season team venue_code opp_code hour
1 4.0 16.9 1.0 0.0 0.0 2022 Manchester City 0 16 16
2 4.0 17.3 1.0 0.0 0.0 2022 Manchester City 1 14 15
3 10.0 14.3 0.0 0.0 0.0 2022 Manchester City 1 0 12
4 8.0 14.0 0.0 0.0 0.0 2022 Manchester City 0 9 15
6 1.0 15.7 1.0 0.0 0.0 2022 Manchester City 1 15 15
... ... ... ... ... ... ... ... ... ... ...
38 3.0 20.7 0.0 0.0 0.0 2021 Norwich City 0 1 15
39 2.0 21.5 1.0 0.0 0.0 2021 Norwich City 1 18 14
40 5.0 16.2 0.0 0.0 0.0 2021 Norwich City 0 9 19
41 2.0 13.4 0.0 0.0 0.0 2021 Norwich City 0 19 14
42 0.0 17.1 0.0 0.0 0.0 2021 Norwich City 1 16 16
I'm trying to group by the team column using the below code:
grouped_matches = matches.groupby(by=["team"], dropna=False)
grouped_matches
Out: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ff400036c70>
Is that what I should expect the output of the groupby function to be? Looking at the documentation, I was expecting another dataframe. Am I using this incorrectly?
This leads to my next line of code which returns the error:
group = grouped_matches.get_group("Manchester City")
KeyError: 'Manchester City'
Manchester City is in the original dataframe. I also tried other teams but get the same error. So why would this return a KeyError?
It seems to me that you don't want to groupby, but simply filter your dataframe.
>>> df[df['team'] == 'Manchester City']
Use groupby when you want to apply some logic to each group (i.e. aggregate them, filter them, modify/transform in some way etc). If you just want to list the groups, then all you need is slicing and .loc.
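For example, an aggregation per team (using the sot column visible in the frame above) is the kind of thing groupby is made for:

# one row per team, not a filtered copy of the original frame
matches.groupby('team')['sot'].mean()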
You may also sort your dataframe by team to see it in a "grouped" fashion.
>>> df.sort_values(by='team')
Now, as to why you get a KeyError: your data is likely not clean. The group name must match the values in your DataFrame exactly. For example, if your df contains the value " Manchester City", with an extra space at the beginning, calling get_group('Manchester City') will raise a KeyError. The same is true for \n, \t and other invisible characters.
Make sure you have clean team names first. Usually, doing
>>> df['team'] = df['team'].str.strip()
is a good start.
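Putting it together, a sketch of the cleaning step followed by the lookup that was failing, assuming stray whitespace really is the culprit:

# normalize the team names, then the group lookup should succeed
matches['team'] = matches['team'].str.strip()
grouped_matches = matches.groupby('team')
group = grouped_matches.get_group('Manchester City')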
I am computing a DataFrame with weekly amounts and now I need to fill it with missing weeks from a provided date range.
This is how I'm generating the dataframe with the weekly amounts:
df['date'] = pd.to_datetime(df['date']) - timedelta(days=6)
weekly_data: pd.DataFrame = (df
.groupby([pd.Grouper(key='date', freq='W-SUN')])[data_type]
.sum()
.reset_index()
)
Which outputs:
date sum
0 2020-10-11 78
1 2020-10-18 673
If a date range is given as start='2020-08-30' and end='2020-10-30', then I would expect the following dataframe:
date sum
0 2020-08-30 0.0
1 2020-09-06 0.0
2 2020-09-13 0.0
3 2020-09-20 0.0
4 2020-09-27 0.0
5 2020-10-04 0.0
6 2020-10-11 78
7 2020-10-18 673
8 2020-10-25 0.0
So far, I have managed to just add the missing weeks and set the sum to 0, but it also replaces the existing values:
weekly_data = weekly_data.reindex(pd.date_range('2020-08-30', '2020-10-30', freq='W-SUN')).fillna(0)
Which outputs:
date sum
0 2020-08-30 0.0
1 2020-09-06 0.0
2 2020-09-13 0.0
3 2020-09-20 0.0
4 2020-09-27 0.0
5 2020-10-04 0.0
6 2020-10-11 0.0 # should be 78
7 2020-10-18 0.0 # should be 673
8 2020-10-25 0.0
Remove the reset_index so the result keeps a DatetimeIndex: reindex aligns on the index, and with the default RangeIndex nothing matches the date range, so you get 0 values everywhere:
weekly_data = (df.groupby([pd.Grouper(key='date', freq='W-SUN')])[data_type]
.sum()
)
Then it is possible to use the fill_value=0 parameter, and finally add reset_index:
r = pd.date_range('2020-08-30', '2020-10-30', freq='W-SUN', name='date')
weekly_data = weekly_data.reindex(r, fill_value=0).reset_index()
print (weekly_data)
date sum
0 2020-08-30 0
1 2020-09-06 0
2 2020-09-13 0
3 2020-09-20 0
4 2020-09-27 0
5 2020-10-04 0
6 2020-10-11 78
7 2020-10-18 673
8 2020-10-25 0
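Put together as a single chain (a condensed sketch of the same steps; data_type is the value column name from the question):

r = pd.date_range('2020-08-30', '2020-10-30', freq='W-SUN', name='date')
weekly_data = (df.groupby(pd.Grouper(key='date', freq='W-SUN'))[data_type]
                 .sum()
                 .reindex(r, fill_value=0)
                 .reset_index())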
Sorry for any possible confusion with the title. I will describe my question better with the following code and pictures.
Now I have a dataframe with multiple columns. The first two columns, by which the rows are sorted, are 'Route' and 'ID' (sorry about the formatting; all the rows here have a 'Route' value of 100 and 'ID' values from 1 to 3).
df1.head(9)
Route ID Year Vol Truck_Vol Truck_%
0 100 1 2017.0 7016 635.0 9.1
1 100 1 2014.0 6835 NaN NaN
2 100 1 2011.0 5959 352.0 5.9
3 100 2 2018.0 15828 NaN NaN
4 100 2 2015.0 13114 2964.0 22.6
5 100 2 2009.0 11844 1280.0 10.8
6 100 3 2016.0 15434 NaN NaN
7 100 3 2013.0 18699 2015.0 10.8
8 100 3 2010.0 15903 NaN NaN
What I want to have is
Route ID Year Vol1 Truck_Vol1 Truck_%1 Year2 Vol2 Truck_Vol2 Truck_%2 Year3 Vol3 Truck_Vol3 Truck_%3
0 100 1 2017 7016 635.0 9.1 2014 6835 NaN NaN 2011 5959 352.0 5.9
1 100 2 2018 15828 NaN NaN 2015 13114 2964.0 22.6 2009 11844 1280.0 10.8
2 100 3 2016 15434 NaN NaN 2013 18699 2015.0 10.8 2010 15903 NaN NaN
Again, sorry for the messy formatting. Let me try a simplified version.
Input:
Route ID Year Vol T_%
0 100 1 2017 100 1.0
1 100 1 2014 200 NaN
2 100 1 2011 300 2.0
3 100 2 2018 400 NaN
4 100 2 2015 500 3.0
5 100 2 2009 600 4.0
Desired Output:
Route ID Year Vol T_% Year.1 Vol.1 T_%.1 Year.2 Vol.2 T_%.2
0 100 1 2017 100 1.0 2014 200 NaN 2011 300 2
1 100 2 2018 400 NaN 2015 500 3.0 2009 600 4
So basically, just shift each group's cells up into a single wide row, as in the desired output above.
I am stumped here. The names for the newly generated columns don't matter.
For this current dataframe, I have three rows per 'group' like shown in the code. It will be great if the answer can accommodate any number of rows each group.
Thanks for your time.
with groupby + cumcount + set_index + unstack
df1 = (df.assign(cid=df.groupby(['Route', 'ID']).cumcount())
         .set_index(['Route', 'ID', 'cid'])
         .unstack(-1)
         .sort_index(axis=1, level=1))
df1.columns = [f'{x}{y}' for x, y in df1.columns]
df1 = df1.reset_index()
Output df1:
Route ID T_%0 Vol0 Year0 T_%1 Vol1 Year1 T_%2 Vol2 Year2
0 100 1 1.0 100 2017 NaN 200 2014 2.0 300 2011
1 100 2 NaN 400 2018 3.0 500 2015 4.0 600 2009
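For reference, the simplified input from the question can be rebuilt like this, so the snippets above and below are runnable as-is:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Route': [100] * 6,
                   'ID':    [1, 1, 1, 2, 2, 2],
                   'Year':  [2017, 2014, 2011, 2018, 2015, 2009],
                   'Vol':   [100, 200, 300, 400, 500, 600],
                   'T_%':   [1.0, np.nan, 2.0, np.nan, 3.0, 4.0]})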
melt + pivot_table
v = df.melt(id_vars=['Route', 'ID'])
v['variable'] += v.groupby(['Route', 'ID', 'variable']).cumcount().astype(str)
res = v.pivot_table(index=['Route', 'ID'], columns='variable', values='value')
variable T_%0 T_%1 T_%2 Vol0 Vol1 Vol2 Year0 Year1 Year2
Route ID
100 1 1.0 NaN 2.0 100.0 200.0 300.0 2017.0 2014.0 2011.0
2 NaN 3.0 4.0 400.0 500.0 600.0 2018.0 2015.0 2009.0
If you want to sort these:
c = res.columns.str.extract(r'(\d+)')[0].values.astype(int)
res.iloc[:,np.argsort(c)]
variable T_%0 Vol0 Year0 T_%1 Vol1 Year1 T_%2 Vol2 Year2
Route ID
100 1 1.0 100.0 2017.0 NaN 200.0 2014.0 2.0 300.0 2011.0
2 NaN 400.0 2018.0 3.0 500.0 2015.0 4.0 600.0 2009.0
You asked about why I used cumcount. To explain, here is what v looks like from above:
Route ID variable value
0 100 1 Year 2017.0
1 100 1 Year 2014.0
2 100 1 Year 2011.0
3 100 2 Year 2018.0
4 100 2 Year 2015.0
5 100 2 Year 2009.0
6 100 1 Vol 100.0
7 100 1 Vol 200.0
8 100 1 Vol 300.0
9 100 2 Vol 400.0
10 100 2 Vol 500.0
11 100 2 Vol 600.0
12 100 1 T_% 1.0
13 100 1 T_% NaN
14 100 1 T_% 2.0
15 100 2 T_% NaN
16 100 2 T_% 3.0
17 100 2 T_% 4.0
If I used pivot_table on this DataFrame, you would end up with something like this:
variable T_% Vol Year
Route ID
100 1 1.5 200.0 2014.0
2 3.5 500.0 2014.0
Obviously you are losing data here. cumcount is the solution, as it turns the variable series into this:
Route ID variable value
0 100 1 Year0 2017.0
1 100 1 Year1 2014.0
2 100 1 Year2 2011.0
3 100 2 Year0 2018.0
4 100 2 Year1 2015.0
5 100 2 Year2 2009.0
6 100 1 Vol0 100.0
7 100 1 Vol1 200.0
8 100 1 Vol2 300.0
9 100 2 Vol0 400.0
10 100 2 Vol1 500.0
11 100 2 Vol2 600.0
12 100 1 T_%0 1.0
13 100 1 T_%1 NaN
14 100 1 T_%2 2.0
15 100 2 T_%0 NaN
16 100 2 T_%1 3.0
17 100 2 T_%2 4.0
Here each repeated element gets its own counter per unique Route and ID pair.
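For reference, this is the call that produced the averaged table earlier; pivot_table collapses duplicate (Route, ID, variable) entries with its default aggfunc='mean', which is exactly where the data loss comes from:

# no cumcount suffix: duplicate cells per (Route, ID, variable) get averaged
v_plain = df.melt(id_vars=['Route', 'ID'])
v_plain.pivot_table(index=['Route', 'ID'], columns='variable', values='value')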
I'm trying to restructure a large DataFrame of the following form as a MultiIndex:
date store_nbr item_nbr units snowfall preciptotal event
0 2012-01-01 1 1 0 0.0 0.0 0.0
1 2012-01-01 1 2 0 0.0 0.0 0.0
2 2012-01-01 1 3 0 0.0 0.0 0.0
3 2012-01-01 1 4 0 0.0 0.0 0.0
4 2012-01-01 1 5 0 0.0 0.0 0.0
I want to group by store_nbr (1-45), then within each store_nbr group by item_nbr (1-111), and then for each index pair (e.g., store_nbr=12, item_nbr=109) display the rows in chronological order, so that the ordered rows will look like, for example:
store_nbr=12, item_nbr=109: date=2014-02-06, units=0, snowfall=...
date=2014-02-07, units=0, snowfall=...
date=2014-02-08, units=0, snowfall=...
... ...
store_nbr=12, item_nbr=110: date=2014-02-06, units=0, snowfall=...
date=2014-02-07, units=1, snowfall=...
date=2014-02-08, units=1, snowfall=...
...
It looks like some combination of groupby and set_index might be useful here, but I'm getting stuck after the following line:
grouped = stores.set_index(['store_nbr', 'item_nbr'])
This produces the following MultiIndex:
date units snowfall preciptotal event
store_nbr item_nbr
1 1 2012-01-01 0 0.0 0.0 0.0
2 2012-01-01 0 0.0 0.0 0.0
3 2012-01-01 0 0.0 0.0 0.0
4 2012-01-01 0 0.0 0.0 0.0
5 2012-01-01 0 0.0 0.0 0.0
Does anyone have any suggestions from here? Is there an easy way to do this by manipulating groupby objects?
You can sort your rows with:
df.sort_values(by='date')
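If the goal is the layout described in the question (grouped by store_nbr and item_nbr, with rows in chronological order inside each pair), one way to sketch it, reusing the set_index call from the question:

grouped = (stores.sort_values(['store_nbr', 'item_nbr', 'date'])
                 .set_index(['store_nbr', 'item_nbr']))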
Consider the warehouse stocks on different days
day action quantity symbol
0 1 40 a
1 1 53 b
2 -1 21 a
3 1 21 b
4 -1 2 a
5 1 42 b
Here, day represents the time series, and action represents a buy (+1) or sell (-1) of the given quantity of a specific product (symbol).
For this dataframe, how do I calculate the daily cumulative sum for each product?
Basically, a resultant dataframe as below:
days a b
0 40 0
1 40 53
2 19 53
3 19 64
4 17 64
5 17 106
I have tried cumsum() with groupby and was unsuccessful with it
Using pivot_table
In [920]: dff = df.pivot_table(
index=['day', 'action'], columns='symbol',
values='quantity').reset_index()
In [921]: dff
Out[921]:
symbol day action a b
0 0 1 40.0 NaN
1 1 1 NaN 53.0
2 2 -1 21.0 NaN
3 3 1 NaN 21.0
4 4 -1 2.0 NaN
5 5 1 NaN 42.0
Then multiply by the action column, take the cumulative sum, forward-fill the missing values, and finally replace the remaining NaNs with 0:
In [922]: dff[['a', 'b']].mul(df.action, 0).cumsum().ffill().fillna(0)
Out[922]:
symbol a b
0 40.0 0.0
1 40.0 53.0
2 19.0 53.0
3 19.0 74.0
4 17.0 74.0
5 17.0 116.0
Final result
In [926]: dff[['a', 'b']].mul(df.action, 0).cumsum().ffill().fillna(0).join(df.day)
Out[926]:
a b day
0 40.0 0.0 0
1 40.0 53.0 1
2 19.0 53.0 2
3 19.0 74.0 3
4 17.0 74.0 4
5 17.0 116.0 5
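A compact alternative with the same result, sketched under the assumption that each (day, symbol) pair appears at most once: pivot the signed quantities, fill the gaps with 0, then take the cumulative sum down the days.

out = (df.assign(signed=df['action'] * df['quantity'])
         .pivot_table(index='day', columns='symbol', values='signed', aggfunc='sum')
         .fillna(0)
         .cumsum()
         .reset_index())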
Never mind, I didn't see the pandas tag. This is just plain Python.
Try this:
sums = []
currentsums = {'a': 0, 'b': 0}   # running totals per symbol
for i in data:
    currentsums[i['symbol']] += i['action'] * i['quantity']
    sums.append({'a': currentsums['a'], 'b': currentsums['b']})   # snapshot after each row
Note that it gives a different result than the one you posted, because the expected output in your question was miscalculated.
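For reference, one way to feed and consume that loop, assuming data is a list of per-row dicts and sums is the list built above:

data = df.to_dict('records')   # [{'day': 0, 'action': 1, 'quantity': 40, 'symbol': 'a'}, ...]
# ... run the loop above, then ...
result = pd.DataFrame(sums).assign(days=df['day'])[['days', 'a', 'b']]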