PySpark - Applying custom function on GroupedData and output histograms - python

I have a large PySpark DataFrame in a similar structure to this:
city      store   machine_id   numeric_value   time
London    A       1            x               01/01/2021 14:15:00
London    A       2            y               01/01/2021 14:17:00
NY        B       9            z               01/01/2021 16:12:00
London    A       1            w               01/01/2021 14:20:00
London    A       2            q               01/01/2021 14:24:00
...       ...     ...          ...             ...
I would like to split the data into time windows (of 10 minutes, for example), calculate some statistics (mean, variance, number of distinct values, and other custom functions) per machine_id, and output a histogram of each statistic per combination of city and store. For example, for each city/store combination, a histogram of the variance of "numeric_value" within 10-minute windows.
So far I used groupBy to get the data grouped by the columns I need:
import pyspark.sql.functions

interval_window = pyspark.sql.functions.window("time", '10 minutes')
grouped_df = df.groupBy('city', 'store', 'machine_id', interval_window)
From here I applied some pyspark.sql.functions (like variance, mean, ...) using agg, but I would like to know how to apply a custom function on a GroupedData object, and how I can output a histogram of the results per city and store. I don't think I can convert it into a pandas DataFrame, as this DataFrame is very large and won't fit in the driver's memory.
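For reference, that aggregation step looks roughly like this (a sketch only, assuming the column names in the sample above):
import pyspark.sql.functions as F

# Built-in per-window statistics per machine (sketch)
stats_df = grouped_df.agg(
    F.mean("numeric_value").alias("mean_value"),
    F.variance("numeric_value").alias("var_value"),
    F.countDistinct("numeric_value").alias("n_distinct"),
)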
I'm a beginner in spark so if I'm not using the correct objects/functions please let me know.
Thanks!
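One direction that might fit (a sketch, not a tested answer; it assumes Spark >= 3.0, where GroupedData.applyInPandas runs an arbitrary pandas function per group on the executors, so nothing large has to be collected to the driver):
import pandas as pd
import pyspark.sql.functions as F

# Tag each row with the start of its 10-minute window so the grouping keys are
# plain columns (applyInPandas then sees them inside each group's pandas frame).
df_w = (
    df.withColumn("w", F.window("time", "10 minutes"))
      .withColumn("window_start", F.col("w.start"))
      .drop("w")
)

def variance_per_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds one (city, store, machine_id, window_start) group; any custom
    # Python/pandas statistic can be computed here.
    out = pdf.iloc[:1][["city", "store", "machine_id", "window_start"]].copy()
    out["var_value"] = pdf["numeric_value"].var()  # NaN for single-row windows
    return out

# Adjust the types to the real schema (machine_id may be a string, etc.)
schema = "city string, store string, machine_id long, window_start timestamp, var_value double"

per_window = (
    df_w.groupBy("city", "store", "machine_id", "window_start")
        .applyInPandas(variance_per_group, schema=schema)
)

# per_window has one row per machine and window, so it is far smaller than the raw
# data; a histogram of the statistic per (city, store) can then be binned on the
# cluster, for example:
bucket_edges, counts = (
    per_window.filter((F.col("city") == "London") & (F.col("store") == "A"))
              .filter(~F.isnan("var_value"))
              .select("var_value")
              .rdd.flatMap(lambda row: row)
              .histogram(20)
)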

Related

Is there a way to create 10 million rows of random data in Python?

I would like to create a random dataset consisting of 10 million rows. Unfortunately, I could not find a way to create a date column with a specific range (for example, from 01.01.2021 to 31.12.2021).
I tried with Oracle SQL but could not find a way to do it. There is a way to do it in Excel, but Excel cannot handle 10 million rows of data. Therefore, I thought Python might be the best way to do it, but I could not figure it out.
You can use the DBMS_RANDOM package with a hierarchical query:
SELECT DATE '2021-01-01'
+ DBMS_RANDOM.VALUE(0, DATE '2022-01-01' - DATE '2021-01-01')
AS random_date
FROM DUAL
CONNECT BY LEVEL <= 10000000;
Which outputs:
RANDOM_DATE
2021-11-25 00:53:13
2021-08-28 22:33:35
2021-02-11 23:28:50
2021-12-10 05:39:00
2021-01-10 22:02:47
...
2021-01-01 16:39:13
2021-10-30 20:58:21
2021-03-14 06:27:34
2021-10-11 00:24:03
2021-04-20 03:53:54
Use pandas.date_range combined with numpy.random.choice:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'date': np.random.choice(
            pd.date_range('2021-01-01', '2021-12-31', freq='D'), size=10_000_000
        )
    }
)
Example:
date
0 2021-04-05
1 2021-02-01
2 2021-09-22
3 2021-10-17
4 2021-04-28
... ...
9999995 2021-07-24
9999996 2021-03-15
9999997 2021-07-28
9999998 2021-11-01
9999999 2021-03-20
[10000000 rows x 1 columns]
Python's standard library ships with the random module, which you need because no random function is built into the language itself.
To get 10,000,000 rows of data, a loop like the one below should work:
# Import the random module from the standard library
import random

# Loop 10 million times
for i in range(10_000_000):
    # Print a random integer between 0 and 10 (inclusive) on each new row
    print(random.randint(0, 10))
It will take a while, but it will work if this is what you are after.
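If it helps, here is a sketch of the same loop idea adapted to the date column the question actually asks for, still using only the standard library and writing to a CSV file (the filename is just an example) instead of printing 10 million lines:
import csv
import random
from datetime import date, timedelta

start = date(2021, 1, 1)
days_in_2021 = 365  # 2021 is not a leap year

# Write 10 million random dates from 2021 into a CSV file.
with open("random_dates.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["date"])
    for _ in range(10_000_000):
        writer.writerow([start + timedelta(days=random.randint(0, days_in_2021 - 1))])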

How to access components of seasonal_decompose from statsmodels

I have two time series stored in data frames london and scotland, of the same length and with the same columns. One column is Date, which spans from 2009 to 2019 at a daily frequency; the other is Yearly_cost. They look like this:
Date Yearly_cost
0 2009-01-01 230
1 2009-01-02 460
2 2009-01-03 260
3 2009-01-04 250
4 2009-01-05 320
5 2009-01-06 430
I wish to compare the Euclidean distance of only the seasonality components of yearly_cost in the two time series. I have decomposed them using seasonal_decompose() from statsmodels; however, I wish to take only the seasonality component from the resulting object:
result = <statsmodels.tsa.seasonal.DecomposeResult at 0x2b5d7d2add8>
Is it possible to extract this and turn it into a time series in a new_df?
Any help would be appreciated. Thanks
I have worked this out. To obtain the seasonal component, you simply use
new_df = result.seasonal
This gives you only the seasonal result.
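For completeness, a hedged sketch of the full round trip, assuming london and scotland are indexed by Date (e.g. london = london.set_index('Date')), the data is daily, and the seasonal period is one year (365):
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

def seasonal_component(frame: pd.DataFrame) -> pd.Series:
    # period=365 assumes one seasonal cycle per year on daily data
    result = seasonal_decompose(frame["Yearly_cost"], model="additive", period=365)
    return result.seasonal  # a Series aligned to the original DatetimeIndex

london_seasonal = seasonal_component(london)
scotland_seasonal = seasonal_component(scotland)

new_df = pd.DataFrame({"london": london_seasonal, "scotland": scotland_seasonal})

# Euclidean distance between the two seasonal components
distance = np.linalg.norm(new_df["london"] - new_df["scotland"])
print(distance)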

Multiple linear regression using binary, non-binary variables

I'm hoping to get some feedback on the most appropriate method for this task. I have a df that contains revenue data and various related variables, and I'm hoping to determine which variables predict revenue. These variables are both binary and non-binary, though.
I'll display an example df below and talk through my thinking:
import pandas as pd

d = {
    'Date': ['01/01/18','01/01/18','01/01/18','01/01/18','02/01/18','02/01/18','02/01/18','02/01/18'],
    'Country': ['US','US','US','MX','US','US','MX','MX'],
    'State': ['CA','AZ','FL','BC','CA','CA','BC','BC'],
    'Town': ['LA','PO','MI','TJ','LA','SF','EN','TJ'],
    'Occurences': [1,5,3,4,2,5,10,2],
    'Time Started': ['12:03:00 PM','02:17:00 AM','13:20:00 PM','01:25:00 AM','08:30:00 AM','12:31:00 AM','08:35:00 AM','02:45:00 AM'],
    'Medium': [1,2,1,2,1,1,1,2],
    'Revenue': [100000,40000,500000,8000,10000,300000,80000,1000],
}
df = pd.DataFrame(data=d)
Out:
Date Country State Town Occurences Time Medium Revenue
0 01/01/18 US CA LA 1 12:03:00 PM 1 100000
1 01/01/18 US AZ PO 10 02:17:00 AM 2 40000
2 01/01/18 US FL MI 3 13:20:00 PM 1 500000
3 01/01/18 MX BC TJ 4 01:25:00 AM 2 8000
4 02/01/18 US CA LA 2 08:30:00 AM 1 10000
5 02/01/18 US CA SF 5 12:31:00 AM 1 300000
6 02/01/18 MX BC EN 10 08:35:00 AM 1 80000
7 02/01/18 MX BC TJ 2 02:45:00 AM 2 1000
So the specific variables that influence revenue are Medium, Time Started, and Occurrences. I also have location groups that can be used, such as, Country, State, and Town.
Would a multiple linear regression be appropriate here? Should I standardise the independent variables somehow? Medium will always be either 1 or 2, but should I group Time Started and Occurrences? Times fall within a 20-hour period (8 AM - 4 AM), while occurrences fall between 1 and 10. Should these variables be converted to dummy variables?
Some ideas: you could apply a logit transform to Medium, subtract the earliest starting time from all Time Started values, and convert the result to hours. Then standardize all three variables in some way and follow up with a multiple linear regression.
Before going for that kind of complex model, you could try plotting each variable against revenue and against the others, and see if there are any interesting patterns.
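A rough sketch of that preprocessing and fit, under a few assumptions: the example df above, scikit-learn available, Medium recoded as a plain 0/1 indicator instead of the logit transform mentioned, and the malformed '13:20:00 PM' time dropped:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Parse the start times; the malformed '13:20:00 PM' row becomes NaT and is dropped.
times = pd.to_datetime(df['Time Started'], format='%I:%M:%S %p', errors='coerce')
hours = (times - times.min()).dt.total_seconds() / 3600  # hours since earliest start

X = pd.DataFrame({
    'medium_is_2': (df['Medium'] == 2).astype(int),  # 0/1 indicator for Medium
    'hours_since_start': hours,
    'occurences': df['Occurences'],
}).dropna()
y = df.loc[X.index, 'Revenue']

X_scaled = StandardScaler().fit_transform(X)  # standardise the predictors
model = LinearRegression().fit(X_scaled, y)
print(dict(zip(X.columns, model.coef_)))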

Dividing rows for specific columns by date+n in Pandas

I want to divide rows in my dataframe via specific columns.
That is, I have a column named 'ticker', and each ticker has 'date' and 'price' attributes.
I want to divide the price on date[i+2] by the price on date[i], where i and i+2 just mean the DAY and the DAY + 2 for that ticker. The date is in proper datetime format for operations with Pandas.
The data looks like:
date         ticker   price
2002-01-30   A        20
2002-01-31   A        21
2002-02-01   A        21.4
2002-02-02   A        21.3
...
That means I want to select the price based on the ticker and on the DAY and the DAY + 2, specifically for each ticker, to calculate the ratio price[i+2]/price[i].
I've considered using iloc, but I'm not sure how to select specific tickers only to do the math on.
use groupby:
df.groupby('ticker')['price'].transform(lambda x: x / x.shift(2))
0 NaN
1 NaN
2 1.070000
3 1.014286
Name: price, dtype: float64
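If it helps, a self-contained sketch of the same idea that keeps the result as a new column (it just reuses the sample rows shown above and assumes one row per ticker per day, since shift(2) moves by two rows, not two calendar days):
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2002-01-30', '2002-01-31', '2002-02-01', '2002-02-02']),
    'ticker': ['A', 'A', 'A', 'A'],
    'price': [20, 21, 21.4, 21.3],
})

# Ratio of each price to the price two rows earlier, computed per ticker;
# sort by date within ticker first if the data is not already ordered.
df['ratio'] = df.groupby('ticker')['price'].transform(lambda x: x / x.shift(2))
print(df)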

Pandas groupby on many columns with agg()

I've been asked to analyze the DB from a medical record app. A bunch of records would look like this (screenshot not reproduced here):
I have to "resume" (summarize) more than 3 million records from 2011 to 2014 by PX. I know the PX values repeat, since that is the ID for each patient, so a patient may have had many visits to the doctor. How could I group or summarize them by patient?
I don't know what you mean by "resume", but it looks like all you want to do is sort and display the data in a nicer way. You can visually group (= order) the records px- and fecha-wise like this:
df.set_index(['px', 'fecha'], inplace=True)
EDIT:
When you group the data based on some common property, you have to decide what kind of aggregation you are going to use on the data in the other columns. Simply speaking, once you perform a groupby, you only have one field left in each remaining column for each pacient_id, so you must use some aggregation function (e.g. sum, mean, min, max, count, ...) that returns a representative value of the grouped data.
It is hard to work with your data since they are locked in an image, and it is impossible to tell what you mean by "Age", since that column is not visible, but I hope you can achieve what you want by looking at the following example with dummy data:
import random
from datetime import timedelta

import numpy as np
import pandas as pd

def random_datetime_list_generator(start_date, end_date, n):
    # yields n random datetimes between start_date and end_date
    return (start_date + timedelta(seconds=random.randint(0, int((end_date - start_date).total_seconds()))) for i in range(n))

# create a random dataframe with 4 sample columns and 50000 rows
rows = 50000
pacient_id = np.random.randint(100, 200, rows)
dates = random_datetime_list_generator(pd.to_datetime("2011-01-01"), pd.to_datetime("2014-12-31"), rows)
age = np.random.randint(10, 80, rows)
bill = np.random.randint(1, 1000, rows)
df = pd.DataFrame(columns=["pacient_id", "visited", "age", "bill"], data=list(zip(pacient_id, dates, age, bill)))
print(df.head())

# 1. Only keep the statistics of the last visit of each pacient
stats = df.groupby("pacient_id", as_index=False)["visited"].max()
stats.columns = ["pacient_id", "last_visited"]
print(stats)

# 2. Perform more complex statistics per pacient by specifying the desired aggregate function for each column
# (note: this nested-dict renaming syntax only works on pandas < 1.0, where it was removed)
custom_aggregation = {'visited': {"first visit": 'min', "last visit": "max"}, 'bill': {"average bill": "mean"}, 'age': 'mean'}

# perform a group by with custom aggregation and renaming of functions
stats = df.groupby("pacient_id").agg(custom_aggregation)

# round floats
stats = stats.round(1)
print(stats)
Original dummy dataframe looks like so:
pacient_id visited age bill
0 150 2012-12-24 21:34:17 20 188
1 155 2012-10-26 00:34:45 17 672
2 116 2011-11-28 13:15:18 33 360
3 126 2011-06-03 17:36:10 58 167
4 165 2013-07-15 15:39:31 68 815
First aggregate would look like this:
pacient_id last_visited
0 100 2014-12-29 00:01:11
1 101 2014-12-22 06:00:48
2 102 2014-12-26 11:51:41
3 103 2014-12-29 15:01:32
4 104 2014-12-18 15:29:28
5 105 2014-12-30 11:08:29
Second, complex aggregation would look like this:
visited age bill
first visit last visit mean average bill
pacient_id
100 2011-01-06 06:11:33 2014-12-29 00:01:11 45.2 507.9
101 2011-01-01 20:44:55 2014-12-22 06:00:48 44.0 503.8
102 2011-01-02 17:42:59 2014-12-26 11:51:41 43.2 498.0
103 2011-01-01 03:07:41 2014-12-29 15:01:32 43.5 495.1
104 2011-01-07 18:58:11 2014-12-18 15:29:28 45.9 501.7
105 2011-01-01 03:43:12 2014-12-30 11:08:29 44.3 513.0
This example should get you going. Additionally, there is a nice SO question about pandas groupby aggregation which may teach you a lot about this topic.
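Note that the nested-dict renaming syntax above was removed in pandas 1.0; a rough sketch of the equivalent second aggregation on current pandas, using named aggregation (it produces flat column names instead of the two-level header shown above):
# Modern-pandas equivalent of the custom aggregation above
stats = df.groupby("pacient_id").agg(
    first_visit=("visited", "min"),
    last_visit=("visited", "max"),
    average_bill=("bill", "mean"),
    mean_age=("age", "mean"),
).round(1)
print(stats)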
