I have a dataframe in which the index is a datetime and columns A and B are objects. I need to see the unique values of A and B per week.
I managed to get the unique value count per week (I am using pd.Grouper for that), but I am struggling to get the unique values themselves per week.
This code gives me the unique value counts per week:
df_unique = pd.DataFrame(df.groupby(pd.Grouper(freq="W"))['A', 'B'].nunique())
However, the code below does not give me the unique values themselves per week:
df_unique_list = pd.DataFrame(df.groupby(pd.Grouper(freq="W"))['A', 'B'].unique())
This code gives me the following error message:
AttributeError: 'DataFrameGroupBy' object has no attribute 'unique'
Use a lambda function with Series.unique and convert the result to a list:
import numpy as np
import pandas as pd

np.random.seed(123)
rng = pd.date_range('2017-04-03', periods=20)
df = pd.DataFrame({'A': np.random.choice([1,2,3,4,5,6], size=20),
                   'B': np.random.choice([1,2,3,4,5,6,7,8], size=20)}, index=rng)
print (df)
A B
2017-04-03 6 1
2017-04-04 3 5
2017-04-05 5 2
2017-04-06 3 8
2017-04-07 2 4
2017-04-08 4 3
2017-04-09 3 5
2017-04-10 4 8
2017-04-11 2 3
2017-04-12 2 5
2017-04-13 1 8
2017-04-14 2 1
2017-04-15 2 6
2017-04-16 1 1
2017-04-17 1 8
2017-04-18 2 2
2017-04-19 4 4
2017-04-20 6 5
2017-04-21 5 5
2017-04-22 1 5
df_unique_list = df.groupby(pd.Grouper(freq="W"))[['A', 'B']].agg(lambda x: list(x.unique()))
print (df_unique_list)
A B
2017-04-09 [6, 3, 5, 2, 4] [1, 5, 2, 8, 4, 3]
2017-04-16 [4, 2, 1] [8, 3, 5, 1, 6]
2017-04-23 [1, 2, 4, 6, 5] [8, 2, 4, 5]
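Note that the AttributeError only occurs on the DataFrameGroupBy object; selecting a single column returns a SeriesGroupBy, which does have unique, so for one column at a time no lambda is needed:
# unique exists on SeriesGroupBy, so this works per column
df.groupby(pd.Grouper(freq="W"))['A'].unique()
Also, newer pandas versions require the double-bracket list selection [['A', 'B']] when picking several columns from a groupby, as used in the agg call above.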
I am trying to create a new column in which the value in the first row is 0, and from the second row onward each value is calculated as follows:
ColumnA[This row] = (ColumnA[Last row] * 13 + ColumnB[This row]) / 14
I am using the pandas shift function, but it doesn't seem to be producing the intended result.
test = np.array([ 1, 5, 3, 20, 2, 6, 9, 8, 7])
test = pd.DataFrame(test, columns = ['ABC'])
test.loc[test['ABC'] == 1, 'a'] = 0
test['a'] = (test['a'].shift()*13 + test['ABC'])/14
I am trying to create a column that looks like this:
ABC       a
  1  0
  5  0.3571
  3  0.5459
 20  1.9355
  2  1.9401
  6  2.2301
  9  2.7137
  8  3.0913
  7  3.3705
But actually what I am getting by running the above code is this:
ABC    a
  1  nan
  2    0
  3  nan
  4  nan
  5  nan
  6  nan
  7  nan
  8  nan
  9  nan
import numpy as np
import pandas as pd

test = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
test = pd.DataFrame(test, columns=['ABC'])
test["res"] = test["ABC"]
test.loc[0, 'res'] = 0  # Initialize the first row as 0 (chained indexing like test.iloc[0]['res'] = 0 would not modify the frame)
test["res"] = test.res + test.res.shift()
test["res"] = test.res.fillna(0).astype(int)  # shift() introduces a NaN in the first row; replace it with 0 and convert the column back to int
Try:
test["a"] = (test["ABC"].shift().cumsum() + test["ABC"].shift()).fillna(0)
print(test)
Prints:
ABC a
0 1 0.0
1 2 2.0
2 3 5.0
3 4 9.0
4 5 14.0
5 6 20.0
6 7 27.0
7 8 35.0
8 9 44.0
Let's try a for loop
import pandas as pd

df = pd.DataFrame({'ABC': [1, 5, 3, 20, 2, 6, 9, 8, 7]})

lst = [0]  # the first row is fixed at 0
res = 0
for i, row in df.iloc[1:].iterrows():
    # res = (previous result * 13 + current ABC) / 14
    res = ((res * 13) + row['ABC']) / 14
    lst.append(res)
df['a'] = pd.Series(lst)
print(df)
Output:
ABC a
0 1 0.000000
1 5 0.357143
2 3 0.545918
3 20 1.935496
4 2 1.940103
5 6 2.230096
6 9 2.713660
7 8 3.091256
8 7 3.370452
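If the loop gets slow on a large frame, the same recursion can be expressed with ewm, since a[i] = (a[i-1] * 13 + ABC[i]) / 14 is an exponentially weighted mean with alpha = 1/14 and adjust=False; seeding the first input with 0 reproduces the 0 in the first row. A minimal sketch on the same toy data (not tested against your real data):
import pandas as pd

df = pd.DataFrame({'ABC': [1, 5, 3, 20, 2, 6, 9, 8, 7]})
x = df['ABC'].copy()
x.iloc[0] = 0  # the recursion starts from 0, so replace the first input value
# adjust=False gives y[i] = (1 - alpha) * y[i-1] + alpha * x[i]
df['a'] = x.ewm(alpha=1/14, adjust=False).mean()
print(df)
This should print the same values as the loop output above.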
I have a DataFrame df:
   A  B  C  date
0  4  5  5  2019-06-2
1  3  5  2  2019-06-2
2  3  2  1  2019-06-2
3  4  4  3  2019-06-3
4  5  4  6  2019-06-3
5  2  3  7  2019-06-3
Now I can groupby one column by using the following code:
df.groupby('date')['A'].apply(list)
   A        date
0  [4,3,3]  2019-06-2
1  [4,5,2]  2019-06-3
but what if I want to group by multiple columns? I've tried something like this but it doesn't seem to be working:
df.groupby('date')[['A','B','C']].apply(list)
The final DataFrame should look like this:
A B C date
0 [4,3,3] [5,5,2] [5,2,1] 2019-06-2
1 [4,5,2] [4,4,3] [3,6,7] 2019-06-3
Use GroupBy.agg instead of GroupBy.apply:
df1 = df.groupby('date')[['A','B','C']].agg(list).reset_index()
print (df1)
date A B C
0 2019-06-2 [4, 3, 3] [5, 5, 2] [5, 2, 1]
1 2019-06-3 [4, 5, 2] [4, 4, 3] [3, 6, 7]
EDIT: If you want to do more aggregations, pass them in a list:
df2 = df.groupby('date')[['A','B','C']].agg(['mean','min','max', list])
print (df2)
A B C \
mean min max list mean min max list mean
date
2019-06-2 3.333333 3 4 [4, 3, 3] 4.000000 2 5 [5, 5, 2] 2.666667
2019-06-3 3.666667 2 5 [4, 5, 2] 3.666667 3 4 [4, 4, 3] 5.333333
min max list
date
2019-06-2 1 5 [5, 2, 1]
2019-06-3 3 7 [3, 6, 7]
Then the MultiIndex columns can be flattened:
df2 = df.groupby('date')[['A','B','C']].agg(['mean','min','max', list])
df2.columns = df2.columns.map(lambda x: f'{x[0]}_{x[1]}')
df2 = df2.reset_index()
print (df2)
date A_mean A_min A_max A_list B_mean B_min B_max \
0 2019-06-2 3.333333 3 4 [4, 3, 3] 4.000000 2 5
1 2019-06-3 3.666667 2 5 [4, 5, 2] 3.666667 3 4
B_list C_mean C_min C_max C_list
0 [5, 5, 2] 2.666667 1 5 [5, 2, 1]
1 [4, 4, 3] 5.333333 3 7 [3, 6, 7]
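On pandas 0.25+ you can also get flat column names directly with named aggregation instead of flattening the MultiIndex afterwards; a sketch for column A only (repeat the pattern for B and C):
df3 = df.groupby('date').agg(A_mean=('A', 'mean'),
                             A_min=('A', 'min'),
                             A_max=('A', 'max'),
                             A_list=('A', list)).reset_index()
print(df3)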
First time posting, newbie to python.
I have a data frame consisting of 3 columns: ['ID', 'date', 'profit_forecast']
'ID': is product ID
'date': start date
'profit_forecast': a list containing 367 items, each item is a profit forecast for date+n
I am looking to create a new data frame that maps each item in profit_forecast to the ID and corresponding date+n for its position in the list.
Not sure how to start.
Thanks in advance!
If I understand you correctly, the following example data captures the essence of your question:
df = pd.DataFrame({'ID': [1, 2, 3],
'date': pd.date_range('2019-01-01', freq='YS', periods=3),
'profit_forecast': [[1, 2, 3], [4, 5], [6, 7, 8, 9]]})
df
ID date profit_forecast
0 1 2019-01-01 [1, 2, 3]
1 2 2020-01-01 [4, 5]
2 3 2021-01-01 [6, 7, 8, 9]
One solution is to make sure you've upgraded to pandas 0.25, and then to explode the profit_forecast column:
res = df.explode('profit_forecast')
res
ID date profit_forecast
0 1 2019-01-01 1
0 1 2019-01-01 2
0 1 2019-01-01 3
1 2 2020-01-01 4
1 2 2020-01-01 5
2 3 2021-01-01 6
2 3 2021-01-01 7
2 3 2021-01-01 8
2 3 2021-01-01 9
At this point, your question is not clear enough on how you need to increment the dates of each ID. If by "date + n" you mean to add one day to each consecutive date within each ID, then something like this should work:
res['date'] = res['date'] + pd.to_timedelta(res.groupby('ID').cumcount(), 'D')
res
ID date profit_forecast
0 1 2019-01-01 1
0 1 2019-01-02 2
0 1 2019-01-03 3
1 2 2020-01-01 4
1 2 2020-01-02 5
2 3 2021-01-01 6
2 3 2021-01-02 7
2 3 2021-01-03 8
2 3 2021-01-04 9
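One small caveat: after explode, the profit_forecast column keeps the object dtype of the original lists, so if you need to do numeric work on it afterwards you may want to convert it, for example:
res['profit_forecast'] = pd.to_numeric(res['profit_forecast'])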
I have two DataFrames:
df1 = pd.DataFrame({'A': [3, 2, 5, 1, 6], 'B': [4, 8, 5, 6, 2], 'C': [4, 8, 3, 8, 0], 'D': [1, 5, 2, 5, 7], 'zebra': [5, 7, 2, 4, 8]})
df2 = pd.DataFrame({'B': [7, 3, 5, 8, 8], 'D': [4, 5, 8, 5, 3]})
print(df1)
print(df2)
A B C D zebra
0 3 4 4 1 5
1 2 8 8 5 7
2 5 5 3 2 2
3 1 6 8 5 4
4 6 2 0 7 8
B D
0 7 4
1 3 5
2 5 8
3 8 5
4 8 3
This is a simple example; in reality df1 has 1000k+ rows and 10+ columns, while df2 has only 24 rows and fewer columns. I would like to go through all rows in df2 and compare specific columns (for example 'B' and 'D') against the same columns in df1. If the values in columns B and D of a df2 row match the values in those columns of some df1 row, the corresponding zebra value from that df1 row should be assigned to a new zebra column in df2. If no match is found, assign 0 or NaN.
B D zebra
0 7 4 nan
1 3 5 nan
2 5 8 nan
3 8 5 7
4 8 3 nan
From the example, only row index 3 in df2 (values 'B': 8 and 'D': 5) matches a row in df1 (the row at index 1; row index itself should not matter in the comparison), so the corresponding 'zebra' value 7 is assigned to that row of df2.
A merge would do
df2.merge(df1[['B', 'D', 'zebra']], on = ['B', 'D'], how = 'left')
B D zebra
0 7 4 NaN
1 3 5 NaN
2 5 8 NaN
3 8 5 7.0
4 8 3 NaN
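Since the question also allows assigning 0 instead of NaN for non-matches, you can fill the missing values after the merge, e.g.:
out = df2.merge(df1[['B', 'D', 'zebra']], on=['B', 'D'], how='left')
out['zebra'] = out['zebra'].fillna(0)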
I'd like to generate a series that's the incremental mean of a time series, meaning that, starting from the first date (index 0), the mean stored in row x is the average of values [0:x].
data
index value mean formula
0 4
1 5
2 6
3 7 5.5 average(0-3)
4 4 5.2 average(0-4)
5 5 5.166666667 average(0-5)
6 6 5.285714286 average(0-6)
7 7 5.5 average(0-7)
I'm hoping there's a way to do this without looping to take advantage of pandas.
Here's an update for newer versions of Pandas (starting with 0.18.0)
df['value'].expanding().mean()
or
s.expanding().mean()
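The second argument of the old pd.expanding_mean call shown below (the 4) corresponds to min_periods in the newer API, so the exact equivalent is:
s.expanding(min_periods=4).mean()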
As #TomAugspurger points out, you can use expanding_mean:
In [11]: s = pd.Series([4, 5, 6, 7, 4, 5, 6, 7])
In [12]: pd.expanding_mean(s, 4)
Out[12]:
0 NaN
1 NaN
2 NaN
3 5.500000
4 5.200000
5 5.166667
6 5.285714
7 5.500000
dtype: float64
Another approach is to use cumsum(), and divide by the cumulative number of items, for example:
In [1]:
s = pd.Series([4, 5, 6, 7, 4, 5, 6, 7])
s.cumsum() / pd.Series(np.arange(1, len(s)+1), s.index)
Out[1]:
0 4.000000
1 4.500000
2 5.000000
3 5.500000
4 5.200000
5 5.166667
6 5.285714
7 5.500000
dtype: float64
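As a quick sanity check, this cumsum-based result should agree (up to floating point) with the expanding mean from the answers above:
import numpy as np
print(np.allclose(s.cumsum() / pd.Series(np.arange(1, len(s) + 1), s.index),
                  s.expanding().mean()))  # expected: True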