Calculating the range of a column's values date-wise in Python

I want to calculate the maximum difference (i.e. the range) of product_mrp for each date.
I tried grouping by date, but I am not able to work out what to do after that.
INPUT:
+-------------+------------+
| product_mrp | order_date |
+-------------+------------+
| 142         | 01-12-2019 |
| 20          | 01-12-2019 |
| 20          | 01-12-2019 |
| 120         | 01-12-2019 |
| 30          | 03-12-2019 |
| 20          | 03-12-2019 |
| 45          | 03-12-2019 |
| 215         | 03-12-2019 |
| 15          | 03-12-2019 |
| 25          | 07-12-2019 |
| 5           | 07-12-2019 |
+-------------+------------+
EXPECTED OUTPUT:
+-------------+------------+
| product_mrp | order_date |
+-------------+------------+
| 122         | 01-12-2019 |
| 200         | 03-12-2019 |
| 20          | 07-12-2019 |
+-------------+------------+
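For reference, the sample input can be rebuilt like this (a minimal sketch; column names and values taken from the tables above):

import pandas as pd

df = pd.DataFrame({
    'product_mrp': [142, 20, 20, 120, 30, 20, 45, 215, 15, 25, 5],
    'order_date': ['01-12-2019'] * 4 + ['03-12-2019'] * 5 + ['07-12-2019'] * 2,
})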

You can use groupby as you said, together with max, min and reset_index:
gr = df.groupby('order_date')['product_mrp']
df_ = (gr.max() - gr.min()).reset_index()
print(df_)
   order_date  product_mrp
0  01-12-2019          122
1  03-12-2019          200
2  07-12-2019           20
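Equivalently, the whole range computation fits in a single agg call (a sketch, assuming the same df):

df_range = (df.groupby('order_date')['product_mrp']
              .agg(lambda s: s.max() - s.min())
              .reset_index())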

Use pandas to load the data, then use groupby to group by the shared index. Note the dates are day-first (DD-MM-YYYY), so parse them with dayfirst=True:
import pandas as pd

dates = ['01-12-2019']*4 + ['03-12-2019']*5 + ['07-12-2019']*2
data = [142, 20, 20, 120, 30, 20, 45, 215, 15, 25, 5]
df = pd.DataFrame(data, columns=['product_mrp'])
df.index = pd.to_datetime(dates, dayfirst=True)  # parse DD-MM-YYYY correctly
grouped = df.groupby(df.index).apply(lambda x: x.max() - x.min())
Output:
            product_mrp
2019-12-01          122
2019-12-03          200
2019-12-07           20

Related

Extract UTC format for a datetime object in a new Python column

Consider the following pandas DataFrame:
| ID | date                      |
|----|---------------------------|
| 0  | 2022-03-02 18:00:20+01:00 |
| 0  | 2022-03-12 17:08:30+01:00 |
| 1  | 2022-04-23 12:11:50+01:00 |
| 1  | 2022-04-04 10:15:11+01:00 |
| 2  | 2022-04-07 08:24:19+02:00 |
| 3  | 2022-04-11 02:33:22+02:00 |
I want to separate the date column into two columns, one for the date in the format "yyyy-mm-dd" and one for the time in the format "hh:mm:ss+tmz".
That is, I want to get the following resulting DataFrame:
| ID | date_only  | time_only      |
|----|------------|----------------|
| 0  | 2022-03-02 | 18:00:20+01:00 |
| 0  | 2022-03-12 | 17:08:30+01:00 |
| 1  | 2022-04-23 | 12:11:50+01:00 |
| 1  | 2022-04-04 | 10:15:11+01:00 |
| 2  | 2022-04-07 | 08:24:19+02:00 |
| 3  | 2022-04-11 | 02:33:22+02:00 |
Right now I am using the following code, but it does not return the time with the UTC offset (+hh:mm).
df['date_only'] = df['date'].apply(lambda a: a.date())
df['time_only'] = df['date'].apply(lambda a: a.time())
| ID  | date_only  | time_only |
|-----|------------|-----------|
| 0   | 2022-03-02 | 18:00:20  |
| 0   | 2022-03-12 | 17:08:30  |
| ... | ...        | ...       |
| 3   | 2022-04-11 | 02:33:22  |
I hope you can help me, thank you in advance.
Convert the column to datetimes, then extract the dates with Series.dt.date and the timezone-aware times with Series.dt.strftime:
df['date'] = pd.to_datetime(df['date'])
df['date_only'] = df['date'].dt.date
df['time_only'] = df['date'].dt.strftime('%H:%M:%S%z')
Or convert the values to strings, split on the space, and take the second part:
df['date'] = pd.to_datetime(df['date'])
df['date_only'] = df['date'].dt.date
df['time_only'] = df['date'].astype(str).str.split().str[1]
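If the column still holds plain strings, the same split works without any datetime conversion (a minimal sketch, assuming the "YYYY-MM-DD HH:MM:SS+hh:mm" format shown above):

df[['date_only', 'time_only']] = df['date'].astype(str).str.split(' ', n=1, expand=True)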

Calculating quarterly growth

I have some daily data in a df, which goes back as far as 1st January 2020. It looks similar to the below but with many id1s on each day.
| yyyy_mm_dd | id1 | id2  | cost  |
|------------|-----|------|-------|
| 2020-01-01 | 23  | 7253 | 5003  |
| 2020-01-01 | 23  | 7743 | 30340 |
| 2020-01-02 | 23  | 7253 | 450   |
| 2020-01-02 | 23  | 7743 | 4500  |
| ...        | ... | ...  | ...   |
| 2021-01-01 | 23  | 7253 | 5675  |
| 2021-01-01 | 23  | 134  | 1030  |
| 2021-01-01 | 23  | 3445 | 564   |
| 2021-01-01 | 23  | 4534 | 345   |
| ...        | ... | ...  | ...   |
I would like to calculate (1) the summed cost grouped by quarter and id1, and (2) the growth % compared to the same quarter in the previous year.
I have grouped and calculated the summed cost like so:
grouped_quarterly = (
    df
    .withColumn('year_quarter', sf.year(sf.col('yyyy_mm_dd')) * 100 + sf.quarter(sf.col('yyyy_mm_dd')))
    .groupby('id1', 'year_quarter')
    .agg(sf.sum('cost').alias('cost'))
)
But I am unsure how to get the growth compared to the same quarter of the previous year. Expected output based on the above sample:
| year_quarter | id1 | cost | cost_growth |
|--------------|-----|------|-------------|
| 202101       | 23  | 7614 | -81         |
(Here Q1 2020 sums to 40293 and Q1 2021 to 7614, so cost_growth = (7614 - 40293) / 40293 * 100 ≈ -81%.)
It would also be nice to set cost_growth to 0 if the id1 has no rows in the previous year's quarter.
Edit: Below is an attempt to make the comparison, but I get an error that there is no attribute prev_value:
grouped_quarterly = (
    df
    .withColumn('year_quarter', sf.year(sf.col('yyyy_mm_dd')) * 100 + sf.quarter(sf.col('yyyy_mm_dd')))
    .groupby('id1', 'year_quarter')
    .agg(sf.sum('cost').alias('cost'))
)
w = Window.partitionBy('id1').orderBy('year_quarter')
growth = (
    grouped_quarterly
    .withColumn('prev_value', sf.lag(grouped_quarterly.cost).over(w))
    .withColumn('diff', sf.when(sf.isnull(grouped_quarterly.cost - grouped_quarterly.prev_value), 0)
                          .otherwise(grouped_quarterly.cost - grouped_quarterly.cost))
)
Edit #2: The window function seems to take the previous quarter, regardless of year. This means my prev_value column is the previous quarter rather than the same quarter from the previous year:
grouped_quarterly.where(sf.col('id1') == 222).sort('year_quarter').show(10,False)
| id1 | year_quarter | cost |
|-----|--------------|------|
| 222 | 202001       | 73   |
| 222 | 202002       | 246  |
| 222 | 202003       | 525  |
| 222 | 202004       | -27  |
| 222 | 202101       | 380  |
w = Window.partitionBy('id1').orderBy('year_quarter')
growth = (
    grouped_quarterly
    .withColumn('prev_value', sf.lag(sf.col('cost')).over(w))
    .withColumn('diff', sf.when(sf.isnull(sf.col('cost') - sf.col('prev_value')), 0)
                          .otherwise(sf.col('cost') - sf.col('prev_value')))
)
growth.where(sf.col('id1') == 222).sort('year_quarter').show(10,False)
| id1 | year_quarter | cost | prev_value | diff |
|-----|--------------|------|------------|------|
| 222 | 202001       | 73   | null       | 0    |
| 222 | 202002       | 246  | 73         | 173  |
| 222 | 202003       | 525  | 246        | 279  |
| 222 | 202004       | -27  | 525        | -522 |
| 222 | 202101       | 380  | -27        | 407  |
Edit #3: Using the quarter in the partitioning results in a null prev_value for all rows:
grouped_quarterly.where(sf.col('id1') == 222).sort('year_quarter').show(10,False)
| id1 | year_quarter | cost |
|-----|--------------|------|
| 222 | 202001       | 73   |
| 222 | 202002       | 246  |
| 222 | 202003       | 525  |
| 222 | 202004       | -27  |
| 222 | 202101       | 380  |
w = Window.partitionBy(sf.col('id1'), sf.expr('substring(string(year_quarter), 2)')).orderBy('year_quarter')
growth = (
    grouped_quarterly
    .withColumn('prev_value', sf.lag(sf.col('cost')).over(w))
    .withColumn('diff', sf.when(sf.isnull(sf.col('cost') - sf.col('prev_value')), 0)
                          .otherwise(sf.col('cost') - sf.col('prev_value')))
)
growth.where(sf.col('id1') == 222).sort('year_quarter').show(10,False)
| id1 | year_quarter | cost | prev_value | diff |
|-----|--------------|------|------------|------|
| 222 | 202001       | 73   | null       | 0    |
| 222 | 202002       | 246  | null       | 0    |
| 222 | 202003       | 525  | null       | 0    |
| 222 | 202004       | -27  | null       | 0    |
| 222 | 202101       | 380  | null       | 0    |
Try using the quarter in the partitioning as well, so that lag gives you the value from the same quarter of the previous year. The attempt in Edit #3 fails because substring(string(year_quarter), 2) takes everything from the second character onward (e.g. '02001'), which is unique per row, so every partition holds a single row and lag returns null. Use a negative start position to take the last two characters, i.e. the quarter:
w = Window.partitionBy(sf.col('id1'), sf.expr('substring(string(year_quarter), -2)')).orderBy('year_quarter')
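Putting the corrected window together with the zero-fill requirement, a sketch (assuming grouped_quarterly from above, and assuming cost_growth is the percentage change versus the same quarter last year, which matches the -81 in the expected output):

from pyspark.sql import functions as sf
from pyspark.sql.window import Window

# Partition by id1 and the quarter digits, so lag() steps back exactly one year.
w = Window.partitionBy(
    sf.col('id1'),
    sf.expr("substring(string(year_quarter), -2)")
).orderBy('year_quarter')

growth = (
    grouped_quarterly
    .withColumn('prev_value', sf.lag('cost').over(w))
    .withColumn(
        'cost_growth',
        sf.coalesce(
            sf.round((sf.col('cost') - sf.col('prev_value')) / sf.col('prev_value') * 100),
            sf.lit(0.0),  # no rows in the same quarter last year -> 0
        )
    )
    .drop('prev_value')
)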

How to group a Pandas DataFrame by url without the query string?

I have a Pandas DataFrame that is structured like this:
+-------+------------+------------------------------------+-------+
| index | Date       | path                               | Count |
+-------+------------+------------------------------------+-------+
| 0     | 2020-06-10 | about/v1/                          | 10865 |
| 1     | 2020-06-10 | about/v1/?status=active            | 2893  |
| 2     | 2020-06-10 | about/v1/?status=active?name=craig | 264   |
| 3     | 2020-06-09 | about/v1/?status=active?name=craig | 182   |
+-------+------------+------------------------------------+-------+
How do I group by the path, and the date without the query string so that the table looks like this?
+-------+------------+-------------------------+-------+
| index | Date       | path                    | Count |
+-------+------------+-------------------------+-------+
| 0     | 2020-06-10 | about/v1/               | 10865 |
| 1     | 2020-06-10 | about/v1/?status=active | 3157  |
| 3     | 2020-06-09 | about/v1/?status=active | 182   |
+-------+------------+-------------------------+-------+
Strip the ?name=... section with a regex replace, then groupby on the Date and path columns (note regex=True, which pandas 2.0+ requires for patterns in str.replace):
result = (df.assign(path=df.path.str.replace(r"\?name=.*", "", regex=True))
            .drop("index", axis=1)
            .groupby(["Date", "path"], sort=False)
            .sum()
          )
result
                                    Count
Date       path
2020-06-10 about/v1/                10865
           about/v1/?status=active   3157
2020-06-09 about/v1/?status=active    182
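If you instead want to drop the whole query string (everything after the first '?'), a similar sketch works; note this goes further than the expected output above, which keeps ?status=active:

df = df.assign(path=df.path.str.split('?', n=1).str[0])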

Filter all rows from groupby object

I have a dataframe like below
+-----------+------------+---------------+------+-----+-------+
| InvoiceNo | CategoryNo | Invoice Value | Item | Qty | Price |
+-----------+------------+---------------+------+-----+-------+
| 1         | 1          | 77            | 128  | 1   | 10    |
| 1         | 1          | 77            | 101  | 1   | 11    |
| 1         | 2          | 77            | 105  | 3   | 12    |
| 1         | 3          | 77            | 129  | 2   | 10    |
| 2         | 1          | 21            | 145  | 1   | 9     |
| 2         | 2          | 21            | 130  | 1   | 12    |
+-----------+------------+---------------+------+-----+-------+
After grouping by 'InvoiceNo' and 'CategoryNo', I want to keep an entire group if any of the items in the list item_list = [128, 129, 130] is present in that group.
My desired output is as below:
+-----------+------------+---------------+------+-----+-------+
| InvoiceNo | CategoryNo | Invoice Value | Item | Qty | Price |
+-----------+------------+---------------+------+-----+-------+
| 1         | 1          | 77            | 128  | 1   | 10    |
| 1         | 1          | 77            | 101  | 1   | 11    |
| 1         | 3          | 77            | 129  | 2   | 10    |
| 2         | 2          | 21            | 130  | 1   | 12    |
+-----------+------------+---------------+------+-----+-------+
I know how to filter a DataFrame using isin(), but I'm not sure how to do it with groupby().
So far I have tried the below:
import pandas as pd
df = pd.read_csv('data.csv')
item_list = [128,129,130]
df.groupby(['InvoiceNo','CategoryNo'])['Item'].isin(item_list)
but nothing happens. Please guide me on how to solve this issue.
You can do something like this:
s = (df['Item'].isin(item_list)
        .groupby([df['InvoiceNo'], df['CategoryNo']])
        .transform('any'))
df[s]
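An equivalent, arguably more readable route is GroupBy.filter, which keeps every group whose Item column hits the list (a sketch, assuming the same df and item_list):

out = (df.groupby(['InvoiceNo', 'CategoryNo'])
         .filter(lambda g: g['Item'].isin(item_list).any()))

The boolean-mask version above avoids the per-group Python call, so it is usually faster on large frames.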

Resampling Pandas time series data with keeping only valid numbers for each row

I have a dataframe which has a list of web pages with summed hourly traffic by unix hour.
Pivoted, it looks like this:
+-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| unix hour | 394533 | 394534 | 394535 | 394536 | 394537 | 394538 | 394539 | 394540 | 394541 | 394542 | 394543 |
+-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| page      |        |        |        |        |        |        |        |        |        |        |        |
| 3530765   | 5791   | 6017   | 5302   |        |        |        |        |        |        |        |        |
| 3563667   |        |        |        | 3481   | 2840   | 2421   |        |        |        |        |        |
| 3579922   |        |        |        |        |        |        | 1816   | 1947   | 1878   | 2013   | 1718   |
+-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
Instead of keeping the actual hours, I would like to left-align each row's values so that it looks like this:
+---------+------+------+------+------+------+
| hour    | 1    | 2    | 3    | 4    | 5    |
+---------+------+------+------+------+------+
| page    |      |      |      |      |      |
| 3530765 | 5791 | 6017 | 5302 |      |      |
| 3563667 | 3481 | 2840 | 2421 |      |      |
| 3579922 | 1816 | 1947 | 1878 | 2013 | 1718 |
+---------+------+------+------+------+------+
What would be the best way to do this in pandas?
*Note - I realize the hours as columns isn't ideal, but my full data set has 7k pages over a span of only 72 hours, so to me, pages as the index and hours as the columns makes the most sense.
Assuming the data is stored as float:
print(df.dtypes)
394533    float64
394534    float64
394535    float64
394536    float64
394537    float64
394538    float64
394539    float64
394540    float64
394541    float64
394542    float64
394543    float64
dtype: object
We will just do:
import numpy as np
import pandas as pd

df2 = df.apply(lambda x: pd.Series(data=x[np.isfinite(x)].values), axis=1)
print(df2)
            0     1     2     3     4
page
3530765  5791  6017  5302   NaN   NaN
3563667  3481  2840  2421   NaN   NaN
3579922  1816  1947  1878  2013  1718
The idea is to get the valid numbers of each row and put them into a Series without the original unix hour as index; the resulting columns therefore become 0, 1, 2, .... If you must, you can relabel them 1, 2, 3, ... with df2.columns = df2.columns + 1, assuming the result is assigned to df2.
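An equivalent formulation simply drops the NaNs per row (a sketch on the same df; note that dropna keeps infinities, whereas np.isfinite would exclude them):

df2 = df.apply(lambda x: pd.Series(x.dropna().values), axis=1)
df2.columns = df2.columns + 1  # relabel the hours as 1, 2, 3, ...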
