I'm selecting some data in Spark like this:
base = spark.sql("""
SELECT
...
...
""")
print(base.count())
base.cache()
base = base.toPandas()
base['yyyy_mm_dd'] = pd.to_datetime(base['yyyy_mm_dd'])
base.set_index("yyyy_mm_dd", inplace=True)
This gives me a dataframe which looks like this:
id aggregated_field aggregated_field2
yyyy_mm_dd
I want to group by yyyy_mm_dd and id and sum the aggregated fields, so that I can see, per day, the total of the aggregated fields for every provider. I'll then want to roll this up to monthly. This is what I've done:
agg = base.groupby(['yyyy_mm_dd', 'id'])[['aggregated_field','aggregated_field2']].sum()
My dataframe now looks like this:
aggregated_field aggregated_field2
yyyy_mm_dd id
Finally, I try to resample() to monthly:
agg = agg.resample('M').sum()
Then I get this error:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'MultiIndex'
I'm not sure why, since I converted yyyy_mm_dd to a datetime index earlier.
Edit: The output I'm looking for is this:
yyyy_mm_dd id aggregated_metric aggregated_metric2
2019-01-01 1 ... ...
2
3
2019-01-02 1
2
3
Maybe you will find this useful:
Solution 1 (using pd.Period, which displays monthly data in its natural format)
>>> import pandas as pd
>>> base = \
pd.DataFrame(
{
'yyyy_mm_dd': ['2012-01-01','2012-01-01','2012-01-02','2012-01-02','2012-02-01','2012-02-01','2012-02-02','2012-02-02'],
'id': [1,2,1,2,1,2,1,2],
'aggregated_field': [0,1,2,3,4,5,6,7],
'aggregated_field2': [100,101,102,103,104,105,106,107]
}
)
>>> base
yyyy_mm_dd id aggregated_field aggregated_field2
0 2012-01-01 1 0 100
1 2012-01-01 2 1 101
2 2012-01-02 1 2 102
3 2012-01-02 2 3 103
4 2012-02-01 1 4 104
5 2012-02-01 2 5 105
6 2012-02-02 1 6 106
7 2012-02-02 2 7 107
>>> base['yyyy_mm_dd'] = pd.to_datetime(base['yyyy_mm_dd'])
>>> base['yyyy_mm'] = base['yyyy_mm_dd'].dt.to_period('M')
>>> agg = base.groupby(['yyyy_mm', 'id'])[['aggregated_field','aggregated_field2']].sum()
>>> agg
aggregated_field aggregated_field2
yyyy_mm id
2012-01 1 2 202
2 4 204
2012-02 1 10 210
2 12 212
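A small optional follow-up, in case a plain DatetimeIndex is needed again later: the Period level can be converted back with to_timestamp, e.g.
>>> agg.index = agg.index.set_levels(agg.index.levels[0].to_timestamp(), level=0)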
Solution 2 (sticking to datetime64)
>>> import pandas as pd
>>> base = \
pd.DataFrame(
{
'yyyy_mm_dd': ['2012-01-01','2012-01-01','2012-01-02','2012-01-02','2012-02-01','2012-02-01','2012-02-02','2012-02-02'],
'id': [1,2,1,2,1,2,1,2],
'aggregated_field': [0,1,2,3,4,5,6,7],
'aggregated_field2': [100,101,102,103,104,105,106,107]
}
)
>>> base
yyyy_mm_dd id aggregated_field aggregated_field2
0 2012-01-01 1 0 100
1 2012-01-01 2 1 101
2 2012-01-02 1 2 102
3 2012-01-02 2 3 103
4 2012-02-01 1 4 104
5 2012-02-01 2 5 105
6 2012-02-02 1 6 106
7 2012-02-02 2 7 107
>>> base['yyyy_mm_dd'] = pd.to_datetime(base['yyyy_mm_dd'])
>>> base['yyyy_mm_dd_month_start'] = base['yyyy_mm_dd'].values.astype('datetime64[M]')
>>> agg = base.groupby(['yyyy_mm_dd_month_start', 'id'])[['aggregated_field','aggregated_field2']].sum()
>>> agg
aggregated_field aggregated_field2
yyyy_mm_dd_month_start id
2012-01-01 1 2 202
2 4 204
2012-02-01 1 10 210
2 12 212
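As a side note on the original error: resample() only works directly on a DatetimeIndex, TimedeltaIndex or PeriodIndex, so with the (yyyy_mm_dd, id) MultiIndex you would instead group on the datetime level with pd.Grouper. A minimal sketch, assuming agg is the daily frame with the (yyyy_mm_dd, id) MultiIndex from the question:
monthly = agg.groupby(
    [pd.Grouper(level='yyyy_mm_dd', freq='M'), 'id']  # month-end bins on the datetime level, id kept as-is
).sum()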
Related
I have a dataframe that I have already done a groupby().agg() using
df = df.groupby(['Date','Time', 'ID_Code']).agg({'ID_Code':['count','nunique']}).reset_index()
now it looks like this
Date Time ID_Code count nunique
0 2021-01-04 10:50:00 CA_031 2 1
1 2021-01-04 12:40:00 CA_021 8 1
2 2021-01-04 13:20:00 CA_044 4 1
3 2021-01-04 13:30:00 CA_045 4 1
4 2021-01-04 13:36:00 CA_040 13 1
.. ... ... ... ... ...
433 2021-12-28 12:12:00 CA_805 3 1
434 2021-12-28 12:40:00 CA_802 3 1
435 2021-12-28 15:35:00 CA_003 22 1
436 2021-12-28 8:29:00 CA_806 3 1
What I now need is, for each ID_Code, the sum of count and the number of times that ID_Code occurs.
# the line below flattens the MultiIndex column header.
df.columns = ['Date', 'Time', 'ID_Code', 'count', 'nunique']
# the line below does not appear to work. Why, and how would I fix it?
df = df.groupby('ID_Code').agg(total_count = ('count','sum'), frequency = ('nunique','sum'),)
What I want is:
ID_Code total_count frequency
0 CA_031 242 12
1 CA_021 89 9
2 CA_044 148 11
3 CA_045 76 7
I'm new to Python and very new to pandas. I've looked through the pandas documentation and tried multiple ways to solve this problem, without success.
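If the column flattening worked, a named aggregation over the flattened frame should give the desired shape. A possible sketch (untested, assuming the column names above) that avoids relying on the nunique column by simply counting rows per ID_Code:
out = (
    df.groupby('ID_Code', as_index=False)
      .agg(total_count=('count', 'sum'),     # total of the per-(Date, Time) counts
           frequency=('ID_Code', 'size'))    # number of rows in which the ID_Code occurs
)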
I have a DataFrame with timestamps in one column and prices in another, such as:
d = {'TimeStamp': [1603822620000, 1603822680000,1603822740000, 1603823040000,1603823100000,1603823160000,1603823220000], 'Price': [101,105,102,108,105,101,106], 'OtherData1': [1,2,3,4,5,6,7], 'OtherData2': [7,6,5,4,3,2,1]}
df= pd.DataFrame(d)
df
TimeStamp Price OtherData1 OtherData2
0 1603822620000 101 1 7
1 1603822680000 105 2 6
2 1603822740000 102 3 5
3 1603823040000 108 4 4
4 1603823100000 105 5 3
5 1603823160000 101 6 2
6 1603823220000 106 7 1
In addition to the two columns of interest, this DataFrame has additional columns whose data are not particularly relevant to the question (represented by the OtherData columns).
I want to create a new column 'Fut2Min' (the price two minutes into the future). There may be missing data, so this problem can't be solved by simply taking the value from two rows below.
I'm trying to find a way to make the value of the Fut2Min column in each row equal to the Price of the row whose timestamp is the current timestamp + 120000 (2 minutes into the future), or NaN if that timestamp doesn't exist.
For the example data, the DF should be updated to:
(Code used to mimic desired result)
d = {'TimeStamp': [1603822620000, 1603822680000, 1603822740000, 1603822800000, 1603823040000,1603823100000,1603823160000,1603823220000],
'Price': [101,105,102,108,105,101,106,111],
'OtherData1': [1,2,3,4,5,6,7,8],
'OtherData2': [8,7,6,5,4,3,2,1],
'Fut2Min':[102,108,'NaN','NaN',106,111,'NaN','NaN']}
df= pd.DataFrame(d)
df
TimeStamp Price OtherData1 OtherData2 Fut2Min
0 1603822620000 101 1 8 102
1 1603822680000 105 2 7 108
2 1603822740000 102 3 6 NaN
3 1603822800000 108 4 5 NaN
4 1603823040000 105 5 4 106
5 1603823100000 101 6 3 111
6 1603823160000 106 7 2 NaN
7 1603823220000 111 8 1 NaN
Assuming that the DataFrame is:
TimeStamp Price OtherData1 OtherData2 Fut2Min
0 1603822620000 101 1 8 0
1 1603822680000 105 2 7 0
2 1603822740000 102 3 6 0
3 1603822800000 108 4 5 0
4 1603823040000 105 5 4 0
5 1603823100000 101 6 3 0
6 1603823160000 106 7 2 0
7 1603823220000 111 8 1 0
Then, if you use pandas.DataFrame.apply, along the column axis:
import pandas as pd
def Fut2MinFunc(row):
futTimeStamp = row.TimeStamp + 120000
if (futTimeStamp in df.TimeStamp.values):
return df.loc[df['TimeStamp'] == futTimeStamp, 'Price'].iloc[0]
else:
return None
df['Fut2Min'] = df.apply(Fut2MinFunc, axis = 1)
You will get exactly what you describe as:
TimeStamp Price OtherData1 OtherData2 Fut2Min
0 1603822620000 101 1 8 102.0
1 1603822680000 105 2 7 108.0
2 1603822740000 102 3 6 NaN
3 1603822800000 108 4 5 NaN
4 1603823040000 105 5 4 106.0
5 1603823100000 101 6 3 111.0
6 1603823160000 106 7 2 NaN
7 1603823220000 111 8 1 NaN
EDIT 2: I have updated the solution since it had some sloppy parts (the list used for index lookup was replaced with a dictionary, and the search for timestamps was restricted).
This (with import numpy as np)
# map each timestamp minus 2 minutes to its row position, so that
# indices[ts] is the row whose timestamp equals ts + 120000
indices = {ts - 120000: i for i, ts in enumerate(df['TimeStamp'])}
df['Fut2Min'] = [
    np.nan
    # only the remaining (later) timestamps from row i onward are searched
    if (ts + 120000) not in df['TimeStamp'].values[i:] else
    df['Price'].iloc[indices[ts]]
    for i, ts in enumerate(df['TimeStamp'])
]
gives you
TimeStamp Price Fut2Min
0 1603822620000 101 102.0
1 1603822680000 105 108.0
2 1603822740000 102 NaN
3 1603822800000 108 NaN
4 1603823040000 105 106.0
5 1603823100000 101 111.0
6 1603823160000 106 NaN
7 1603823220000 111 NaN
But I'm not sure if that is an optimal solution.
EDIT: Inspired by the discussion in the comments, I did some timing:
With the sample frame
from itertools import accumulate
import numpy as np
rng = np.random.default_rng()
n = 10000
timestamps = [1603822620000 + t
for t in accumulate(rng.integers(1, 4) * 60000
for _ in range(n))]
df = pd.DataFrame({'TimeStamp': timestamps, 'Price': n * [100]})
TimeStamp Price
0 1603822680000 100
... ... ...
9999 1605030840000 100
[10000 rows x 2 columns]
and the two test functions
# (1) Other solution
def Fut2MinFunc(row):
futTimeStamp = row.TimeStamp + 120000
if (futTimeStamp in df.TimeStamp.values):
return df.loc[df['TimeStamp'] == futTimeStamp, 'Price'].iloc[0]
else:
return None
def test_1():
df['Fut2Min'] = df.apply(Fut2MinFunc, axis = 1)
# (2) Solution here
def test_2():
indices = list(df['TimeStamp'] - 120000)
df['Fut2Min'] = [
np.nan
if (timestamp + 120000) not in df['TimeStamp'].values else
df['Price'].iloc[indices.index(timestamp)]
for timestamp in df['TimeStamp']
]
I conducted the experiment
from timeit import timeit
t1 = timeit('test_1()', number=100, globals=globals())
t2 = timeit('test_2()', number=100, globals=globals())
print(t1, t2)
with the result
135.98962861 40.306039344
which seems to imply that the version here is faster? (I also measured directly with time() and without the wrapping in functions and the results are virtually identical.)
With my updated version the result looks like
139.33713767799998 14.178187169000012
I finally did one try with a frame with 1,000,000 rows (number=1) and the result was
763.737430931 175.73120002400003
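For reference, a fully vectorized lookup is also possible and would likely be faster still; a minimal sketch, assuming the timestamps are unique:
# build a timestamp -> price lookup, then shift every timestamp 2 minutes forward
lookup = df.set_index('TimeStamp')['Price']
df['Fut2Min'] = (df['TimeStamp'] + 120000).map(lookup)  # NaN where no matching timestamp exists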
I have two dataframes. Say, for example, frame 1 is the student info:
student_id course
1 a
2 b
3 c
4 a
5 f
6 f
Frame 2 records each interaction the student has with the program:
student_id day number_of_clicks
1 4 60
1 5 34
1 7 87
2 3 33
2 4 29
2 8 213
2 9 46
3 2 103
I am trying to add the information from frame 2 to frame 1, i.e. for each student I would like to know the number of different days they accessed the database and the sum of all their clicks on those days, e.g.:
student_id course no_days total_clicks
1 a 3 181
2 b 4 321
3 c 1 103
4 a 0 0
5 f 0 0
6 f 0 0
I've tried to do this with groupby, but I couldn't add the information back into frame 1 or figure out how to sum the number of clicks. Any ideas?
First we aggregate df2 into the desired information using GroupBy.agg, then we merge that information into df1:
agg = df2.groupby('student_id').agg(
no_days=('day', 'size'),
total_clicks=('number_of_clicks', 'sum')
)
df1 = df1.merge(agg, on='student_id', how='left').fillna(0)
student_id course no_days total_clicks
0 1 a 3.0 181.0
1 2 b 4.0 321.0
2 3 c 1.0 103.0
3 4 a 0.0 0.0
4 5 f 0.0 0.0
5 6 f 0.0 0.0
Or, if you like one-liners, here's the same method as above as a single chained expression, more in an SQL kind of style:
df1.merge(
df2.groupby('student_id').agg(
no_days=('day', 'size'),
total_clicks=('number_of_clicks', 'sum')
),
on='student_id',
how='left'
).fillna(0)
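Note that the left merge leaves NaN for students with no interactions, so after fillna(0) the two new columns come out as floats (3.0, 181.0, ...). If integers are preferred, they can be cast back, e.g. (assuming the result was assigned to df1 as in the first snippet):
df1 = df1.astype({'no_days': int, 'total_clicks': int})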
Use merge and fillna for the null values, then aggregate using groupby.agg (this assumes import numpy as np):
df = df1.merge(df2, how='left').fillna(0, downcast='infer')\
        .groupby(['student_id', 'course'], as_index=False)\
        .agg({'day': np.count_nonzero, 'number_of_clicks': np.sum})
print(df)
student_id course day number_of_clicks
0 1 a 3 181
1 2 b 4 321
2 3 c 1 103
3 4 a 0 0
4 5 f 0 0
5 6 f 0 0
I want to write code that cuts a dataframe of weekly prediction data down to an n-week prediction window starting from today's date.
A toy example of my dataframe looks like this:
data4 = pd.DataFrame({'Id' : ['001','002','003'],
'2020-01-01' : [4,5,6],
'2020-01-08':[3,5,6],
'2020-01-15': [2,6,7],
'2020-01-22': [2,6,7],
'2020-01-29': [2,6,7],
'2020-02-5': [2,6,7],
'2020-02-12': [4,4,4]})
Id 2020-01-01 2020-01-08 2020-01-15 2020-01-22 2020-01-29 2020-02-5 \
0 001 4 3 2 2 2 2
1 002 5 5 6 6 6 6
2 003 6 6 7 7 7 7
2020-02-12
0 4
1 4
2 4
I am trying to get:
dataset_for_analysis = pd.DataFrame({'Id' : ['001','002','003'],
'2020-01-15': [2,6,7],
'2020-01-22': [2,6,7],
'2020-01-29': [2,6,7],
'2020-02-5': [2,6,7]})
Id 2020-01-15 2020-01-22 2020-01-29 2020-02-5
0 001 2 2 2 2
1 002 6 6 6 6
2 003 7 7 7 7
I have done this, from what I understood of the datetime documentation:
dataset_for_analysis = data4.datetime.datetime.today+ pd.Timedelta('3 weeks')
and gives me the error:
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'datetime'
I am a bit confused about how to use datetime.today() and Timedelta, especially because I am working with weekly data. Is there a way to get the current week of the year I am in, rather than the day? Could anyone help with this? Thank you!
You can do the following:
today = '2020-01-15'
n_weeks = 10
# get dates by n weeks
cols = [str((pd.to_datetime(today) + pd.Timedelta(weeks=x)).date()) for x in range(n_weeks)]
# pick the columns which exist in cols
use_cols = ['Id'] + [x for x in data4.columns if x in cols]
# select the columns
data4 = data4[use_cols]
Id 2020-01-15 2020-01-22 2020-01-29 2020-02-12
0 001 2 2 2 4
1 002 6 6 6 4
2 003 7 7 7 4
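One caveat: the string comparison above never matches the unpadded column label '2020-02-5' (which is why it is missing from the output while 2020-02-12 appears). Parsing the column labels as dates avoids this; a sketch, assuming the same data4 and a 3-week window from 2020-01-15:
start = pd.Timestamp('2020-01-15')          # or pd.Timestamp.today().normalize()
date_cols = data4.columns.drop('Id')
parsed = pd.to_datetime(date_cols)          # handles '2020-02-5' and '2020-02-05' alike
keep = date_cols[(parsed >= start) & (parsed <= start + pd.Timedelta(weeks=3))]
dataset_for_analysis = data4[['Id'] + list(keep)]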
I have a dataframe like below:
ID Date
111 1.1.2018
222 5.1.2018
333 7.1.2018
444 8.1.2018
555 9.1.2018
666 13.1.2018
and I would like to bin them into 5-day intervals.
The output should be:
ID Date Bin
111 1.1.2018 1
222 5.1.2018 1
333 7.1.2018 2
444 8.1.2018 2
555 9.1.2018 2
666 13.1.2018 3
How can I do this in Python, please?
Looks like groupby + ngroup does it:
df['Date'] = pd.to_datetime(df.Date, errors='coerce', dayfirst=True)
df['Bin'] = df.groupby(pd.Grouper(freq='5D', key='Date')).ngroup() + 1
df
ID Date Bin
0 111 2018-01-01 1
1 222 2018-01-05 1
2 333 2018-01-07 2
3 444 2018-01-08 2
4 555 2018-01-09 2
5 666 2018-01-13 3
If you don't want to mutate the Date column, you can first call assign for a copy-based assignment and then do the groupby:
df['Bin'] = df.assign(
Date=pd.to_datetime(df.Date, errors='coerce', dayfirst=True)
).groupby(pd.Grouper(freq='5D', key='Date')).ngroup() + 1
df
ID Date Bin
0 111 1.1.2018 1
1 222 5.1.2018 1
2 333 7.1.2018 2
3 444 8.1.2018 2
4 555 9.1.2018 2
5 666 13.1.2018 3
One way is to create an array of your date range and use numpy.digitize:
import numpy as np

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
date_ranges = pd.date_range(df['Date'].min(), df['Date'].max(), freq='5D')\
    .astype(np.int64).values
df['Bin'] = np.digitize(df['Date'].astype(np.int64).values, date_ranges)
Result:
ID Date Bin
0 111 2018-01-01 1
1 222 2018-01-05 1
2 333 2018-01-07 2
3 444 2018-01-08 2
4 555 2018-01-09 2
5 666 2018-01-13 3
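If the bin numbers should always count from the earliest date, even across gaps with no rows, plain date arithmetic is another option; a sketch, assuming Date has already been parsed to datetime as above:
# integer-divide the day offset from the earliest date into 5-day buckets
df['Bin'] = (df['Date'] - df['Date'].min()).dt.days // 5 + 1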