Apply Function to Groups of Dask DataFrame - python

I have a huge CSV file which I initially converted into a Parquet file. This file contains information from different sensors.
| | Unnamed: 0 | sensor_id | timestamp | P1 | P2 |
|---:|-------------:|------------:|:--------------------|------:|-----:|
| 0 | 0 | 4224 | 2020-05-01T00:00:00 | 0.5 | 0.5 |
| 1 | 1 | 3016 | 2020-05-01T00:00:00 | 0.77 | 0.7 |
| 2 | 2 | 29570 | 2020-05-01T00:00:00 | 0.82 | 0.52 |
In order to process the data I want to create several smaller DataFrames (using resampling etc.) containing the time series of each sensor. These time series should then be written to an HDF5 file.
Is there any faster possibility than looping over every group:
import dask.dataframe as dd
import numpy as np

def parse(d):
    # ... parsing
    return d

# load data
data = dd.read_parquet(fp)
sensor_ids = np.unique(data['sensor_id'].values).compute()  # get array of all ids/groups
groups = data.groupby('sensor_id')

res = []
for idx in sensor_ids:
    d = parse(groups.get_group(idx).compute())
    res.append(d)
# ... loop over res ... store ...
I was thinking about using data.groupby('sensor_id').apply(...), but this results in a single DataFrame, while the solution above calls the compute() method in every iteration, leading to a far too high computation time. The data contains a total of approx. 200_000_000 rows and approx. 11_000 sensors/groups.
Could I put the writing of each sensor's time series to the HDF5 file into a function and call it with apply?
The desired result for one group/sensor looks like this:
parse(data.groupby('sensor_id').get_group(4224).compute()).to_markdown()
| timestamp | sensor_id | P1 | P2 |
|:--------------------|------------:|--------:|--------:|
| 2020-05-01 00:00:00 | 4224 | 2.75623 | 1.08645 |
| 2020-05-02 00:00:00 | 4224 | 5.69782 | 3.21847 |

Looping is not the best way here. If you are happy to save the small datasets as Parquet, you can just use the partition_on option:
import dask.dataframe as dd
data = dd.read_parquet(fp)
data.to_parquet("data_partitioned", partition_on="sensor_id")
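If you later need the data of a single sensor, the partitioned Parquet dataset can be filtered cheaply on read. A minimal sketch of reading one sensor back, parsing it, and writing it to HDF5 (parse is the function from the question; the HDF5 file and key names are made up for illustration):

import dask.dataframe as dd

# read only the rows of one sensor from the partitioned dataset (predicate pushdown)
one_sensor = dd.read_parquet(
    "data_partitioned",
    filters=[("sensor_id", "==", 4224)],
).compute()

ts = parse(one_sensor)                      # resampling/cleaning as in the question
ts.to_hdf("sensors.h5", key="sensor_4224")  # hypothetical HDF5 file and key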

Related

How to efficiently zero pad datasets with different lengths

My aim is to zero-pad my data so that all the subset datasets have equal length. I have data as follows:
|server| users | power | Throughput range | time |
|:----:|:--------------:|:--------------:|:--------------------:|:-----:|
| 0 | [5, 3,4,1] | -4.2974843 | [5.23243, 5.2974843]| 0 |
| 1 | [8, 6,2,7] | -6.4528433 | [6.2343, 7.0974845] | 1 |
| 2 | [9,12,10,11] | -3.5322451 | [4.31240, 4.9073840]| 2 |
| 3 | [14,13,16,17]| -5.9752843 | [5.2243, 5.2974843] | 3 |
| 0 | [22,18,19,21]| -1.2974652 | [3.12843, 4.2474643]| 4 |
| 1 | [22,23,24,25]| -9.884843 | [8.00843, 8.0974843]| 5 |
| 2 | [27,26,28,29]| -2.3984843 | [7.23843, 8.2094845]| 6 |
| 3 | [30,32,31,33]| -4.5654566 | [3.1233, 4.2474643] | 7 |
| 1 | [36,34,37,35]| -1.2974652 | [3.12843, 4.2474643]| 8 |
| 2 | [40,41,38,39]| -3.5322451 | [4.31240, 4.9073840]| 9 |
| 1 | [42,43,45,44]| -5.9752843 | [6.31240, 6.9073840]| 10 |
The aim is to analyze individual servers by their respective data which was done using the code below:
# grp is the full DataFrame shown above; each boolean mask selects one server's rows
c0 = grp['server'].values == 0
c0_new = grp[c0]
server0 = pd.DataFrame(c0_new)
c1 = grp['server'].values == 1
c1_new = grp[c1]
server1 = pd.DataFrame(c1_new)
c2 = grp['server'].values == 2
c2_new = grp[c2]
server2 = pd.DataFrame(c2_new)
c3 = grp['server'].values == 3
c3_new = grp[c3]
server3 = pd.DataFrame(c3_new)
The results of this code provide the different servers and their respective data features. For example, the server0 output becomes:
| server | users | power | Throughput range | time |
|:------:|:--------------:|:--------------:|:--------------------:|:-----:|
| 0 | [5, 3,4,1] | -4.2974843 | [5.23243, 5.2974843]| 0 |
| 0 | [22,18,19,21]| -1.2974652 | [3.12843, 4.2474643]| 1 |
The results obtained for individual servers have different lengths so I tried padding using the code below:
from keras.preprocessing.sequence import pad_sequences

man = [server0, server1, server2, server3]
new = pad_sequences(man)
The results obtained in this case show that the padding has been done and all the servers have equal length, but the problem is that the output no longer contains the column names. I want the final data to keep the columns. Any suggestions?
The aim is to apply machine learning to the data, and I would like to have them concatenated. This is what I later did, and it worked for the application I wanted it for.
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

man = [server0, server1, server2, server3]
for cel in man:
    cel.set_index('time', inplace=True)
    cel.drop(['users'], axis=1, inplace=True)

scl = MinMaxScaler()
vals = [cel.values.reshape(cel.shape[0], 1) for cel in man]
I then applied pad_sequences and it worked as follows:
from keras.preprocessing.sequence import pad_sequences
new = pad_sequences(vals)
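If you want to keep the column names after padding, one option is to wrap each padded array back into a labelled DataFrame. A minimal sketch, assuming all server DataFrames share the same columns after the preprocessing above (variable names follow the snippets in this question):

import pandas as pd
from keras.preprocessing.sequence import pad_sequences

man = [server0, server1, server2, server3]
cols = man[0].columns                      # column names to restore after padding

# pad the per-server 2-D arrays (time steps x columns) to a common length;
# dtype='float64' keeps the float values from being truncated to int32
padded = pad_sequences([cel.values for cel in man], dtype='float64')

# rebuild labelled DataFrames from the padded arrays
padded_frames = [pd.DataFrame(arr, columns=cols) for arr in padded]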

Where am I going wrong when analyzing this data?

Trying to find a trend in attendance. I filtered my existing df to this so I can look at 1 activity at a time.
+---+-----------+-------+----------+-------+---------+
| | Date | Org | Activity | Hours | Weekday |
+---+-----------+-------+----------+-------+---------+
| 0 | 8/3/2020 | Org 1 | Gen Ab | 10.5 | Monday |
| 1 | 8/25/2020 | Org 1 | Gen Ab | 2 | Tuesday |
| 3 | 8/31/2020 | Org 1 | Gen Ab | 8.5 | Monday |
| 7 | 8/10/2020 | Org 2 | Gen Ab | 1 | Monday |
| 8 | 8/14/2020 | Org 3 | Gen Ab | 3.5 | Friday |
+---+-----------+-------+----------+-------+---------+
This code:
gen_ab = att_df.loc[att_df['Activity'] == "Gen Ab"]
sum_gen_ab = gen_ab.groupby(['Date', 'Activity']).sum()
sum_gen_ab.head()
Returns this:
+------------+----------+------------+
| | | Hours |
+------------+----------+------------+
| Date | Activity | |
| 06/01/2020 | Gen Ab | 347.250000 |
| 06/02/2020 | Gen Ab | 286.266667 |
| 06/03/2020 | Gen Ab | 169.583333 |
| 06/04/2020 | Gen Ab | 312.633333 |
| 06/05/2020 | Gen Ab | 317.566667 |
+------------+----------+------------+
How do I make the summed column name 'Hours'? I still get the same result when I do this:
sum_gen_ab['Hours'] = gen_ab.groupby(['Date', 'Activity']).sum()
What I eventually want to do is have a line graph that shows the sum of hours for the activity over time. The time of course would be the dates in my df.
plt.plot(sum_gen_ab['Date'], sum_gen_ab['Hours'])
plt.show()
returns KeyError: Date
Once you've used groupby(['Date', 'Activity']), Date and Activity have been transformed into index levels and can no longer be referenced with sum_gen_ab['Date'].
To avoid turning them into an index, you can use groupby(['Date', 'Activity'], as_index=False) instead.
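Put together, a minimal sketch of the aggregation and plot (assuming att_df is the original DataFrame shown above and the Date column can be parsed as a datetime):

import matplotlib.pyplot as plt
import pandas as pd

gen_ab = att_df.loc[att_df['Activity'] == "Gen Ab"]

# keep Date and Activity as ordinary columns instead of index levels
sum_gen_ab = gen_ab.groupby(['Date', 'Activity'], as_index=False)['Hours'].sum()

plt.plot(pd.to_datetime(sum_gen_ab['Date']), sum_gen_ab['Hours'])
plt.show()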
I will typically use the pandasql library to manipulate my data frames into different datasets. This allows you to manipulate your pandas data frame with SQL code. Pandasql can be used alongside pandas.
EXAMPLE:
import pandas as pd
import pandasql as psql

df = "will be your dataset"

new_dataset = psql.sqldf('''
    SELECT DATE, ACTIVITY, SUM(HOURS) as SUM_OF_HOURS
    FROM df
    GROUP BY DATE, ACTIVITY''')

new_dataset.head()  # shows the first 5 rows of your dataset

Transforming data frame (row to column and count)

Sorry for the dumb question, but I got stuck. I have a dataframe with the following structure:
|   | ID | Cause  | Date                |
|--:|:---|:-------|:--------------------|
| 1 | AR | SGNLss | 10-05-2019 05:01:00 |
| 2 | TD | PTRXX  | 12-05-2019 12:15:00 |
| 3 | GZ | FAIL   | 10-05-2019 05:01:00 |
| 4 | AR | PTRXX  | 12-05-2019 12:15:00 |
| 5 | GZ | SGNLss | 10-05-2019 05:01:00 |
| 6 | AR | FAIL   | 10-05-2019 05:01:00 |
What I want is to convert the Date column values into columns rounded to the day, so that the expected DF has ID plus 10-05-2019, 11-05-2019, 12-05-2019, ... as columns, with the values being the number of events (Causes) that happened for that ID.
It's not a problem to round to the day or to count values separately, but I can't figure out how to do both operations together.
You can use pd.crosstab:
pd.crosstab(df['ID'], df['Date'].dt.date)
Output:
Date 2019-10-05 2019-12-05
ID
AR 2 1
GZ 2 0
TD 0 1
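Note that this assumes the Date column already holds datetimes; if it is still a string, it needs to be parsed first. The dayfirst flag below is an assumption based on the 10-05-2019 format in the question (with dayfirst=True the columns would come out as 2019-05-10 and 2019-05-12 rather than the month-first dates shown above):

import pandas as pd

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
pd.crosstab(df['ID'], df['Date'].dt.date)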

How to join/merge on unequal pandas dataframes

I would like to convert the following SQL statement to the equivalent pandas expression.
select
    a1.country,
    a1.platform,
    a1.url_page as a1_url_page,
    a2.url_page as a2_url_page,
    a1.userid,
    a1.a1_min_time,
    min(a2.dvce_created_tstamp) as a2_min_time
from (
    select country, platform, url_page, userid,
           min(dvce_created_tstamp) as a1_min_time
    from pageviews
    group by 1, 2, 3, 4) as a1
left outer join pageviews as a2 on a1.userid = a2.userid
    and a1.a1_min_time < a2.dvce_created_tstamp
    and a2.url_page <> a1.url_page
group by 1, 2, 3, 4, 5, 6
I am aware of the merge command of pandas; however, in our case we have a composite join clause that also includes an inequality. I haven't found any documentation on how to handle this case.
Of course, as a last resort I could iterate through the dataframes, but I do not think that this is the most efficient way to do it.
For example we can add some sample input data
| country | platform | url_page | userid | dvce_created_tstamp |
|:--------|:---------|:---------|:-------|:--------------------|
| gr      | win      | a        | bar    | 2019-01-01 00:00:00 |
| gr      | win      | b        | bar    | 2019-01-01 00:01:00 |
| gr      | win      | a        | bar    | 2019-01-01 00:02:00 |
| gr      | win      | a        | foo    | 2019-01-01 00:00:00 |
| gr      | win      | a        | foo    | 2019-01-01 01:00:00 |
The SQL query keeps every row of a1, filling the a2 columns with nulls when there is no later pageview on a different page. When I use the dataframe left merge command, those rows are missing from the output: it is obvious that we lose the rows with a null a2_url_page.
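Since pandas merge only supports equality conditions, one way to express this is to join on the equality key (userid) and apply the inequality conditions afterwards, blanking out the right-hand columns for rows that fail them so the left rows survive as in a left outer join. A minimal sketch, assuming the data is in a DataFrame called pageviews with the columns above, dvce_created_tstamp parsed as datetime, and pandas >= 1.1 for dropna=False:

import pandas as pd

# a1: first pageview time per (country, platform, url_page, userid)
a1 = (pageviews
      .groupby(['country', 'platform', 'url_page', 'userid'], as_index=False)
      .agg(a1_min_time=('dvce_created_tstamp', 'min')))

# equality part of the join on userid; overlapping columns from the right get an _a2 suffix
joined = a1.merge(pageviews, on='userid', how='left', suffixes=('', '_a2'))

# inequality conditions; rows failing them keep the a1 side but lose the a2 values,
# which mimics the left outer join semantics
match = ((joined['dvce_created_tstamp'] > joined['a1_min_time'])
         & (joined['url_page_a2'] != joined['url_page']))
joined.loc[~match, 'url_page_a2'] = None
joined.loc[~match, 'dvce_created_tstamp'] = pd.NaT

# final aggregation; dropna=False keeps the groups where a2's url_page is null
result = (joined
          .groupby(['country', 'platform', 'url_page', 'url_page_a2',
                    'userid', 'a1_min_time'], as_index=False, dropna=False)
          ['dvce_created_tstamp'].min()
          .rename(columns={'url_page_a2': 'a2_url_page',
                           'dvce_created_tstamp': 'a2_min_time'}))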

How do you convert two columns of vectors into one PySpark data frame?

After having run PolynomialExpansion on a Pyspark dataframe, I have a data frame (polyDF) that looks like this:
+--------------------+--------------------+
| features| polyFeatures|
+--------------------+--------------------+
|(81,[2,9,26],[13....|(3402,[5,8,54,57,...|
|(81,[4,16,20,27,3...|(3402,[14,19,152,...|
|(81,[4,27],[1.0,1...|(3402,[14,19,405,...|
|(81,[4,27],[1.0,1...|(3402,[14,19,405,...|
The "features" column includes the features included in the original data. Each row represents a different user. There are 81 total possible features for each user in the original data. The "polyFeatures" column includes the features after the polynomial expansion has been run. There are 3402 possible polyFeatures after running PolynomialExpansion. So what each row of both columns contain are:
An integer representing the number of possible features (each user may or may not have had a value in each of the features).
A list of integers that contains the feature indexes for which that user had a value.
A list of numbers that contains the values of each of the features mentioned in #2 above.
My question is, how can I take these two columns, create two sparse matrices, and subsequently join them together to get one, full, sparse Pyspark matrix? Ideally it would look like this:
+---+----+----+----+------+----+----+----+----+---+---
| 1 | 2 | 3 | 4 | ... |405 |406 |407 |408 |409|...
+---+----+----+----+------+----+----+----+----+---+---
| 0 | 13 | 0 | 0 | ... | 0 | 0 | 0 | 6 | 0 |...
| 0 | 0 | 0 | 9 | ... | 0 | 0 | 0 | 0 | 0 |...
| 0 | 0 | 0 | 1.0| ... | 3 | 0 | 0 | 0 | 0 |...
| 0 | 0 | 0 | 1.0| ... | 3 | 0 | 0 | 0 | 0 |...
I have reviewed the Spark documentation for PolynomialExpansion located here but it doesn't cover this particular issue. I have also tried to apply the SparseVector class which is documented here, but this seems to be useful for only one vector rather than a data frame of vectors.
Is there an effective way to accomplish this?
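One possible direction, assuming Spark 3.0+ where pyspark.ml.functions.vector_to_array is available, is to turn each vector column into a plain array column and then expand the array elements into individual columns. A minimal sketch under that assumption (the f_/pf_ column names are made up for illustration; expanding all 3402 polyFeatures yields a very wide DataFrame, so it may only be practical for a subset):

from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

# convert the SparseVector columns into plain array columns
arr_df = polyDF.select(
    vector_to_array(F.col('features')).alias('features_arr'),
    vector_to_array(F.col('polyFeatures')).alias('poly_arr'),
)

# expand each array element into its own column (81 + 3402 columns in total)
wide_df = arr_df.select(
    *[F.col('features_arr')[i].alias(f'f_{i}') for i in range(81)],
    *[F.col('poly_arr')[i].alias(f'pf_{i}') for i in range(3402)],
)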
