My aim is to zero-pad my data so that all the subset datasets have equal length. I have data as follows:
| server | users | power | Throughput range | time |
|:------:|:----------------:|:----------:|:--------------------:|:----:|
| 0 | [5, 3, 4, 1] | -4.2974843 | [5.23243, 5.2974843] | 0 |
| 1 | [8, 6, 2, 7] | -6.4528433 | [6.2343, 7.0974845] | 1 |
| 2 | [9, 12, 10, 11] | -3.5322451 | [4.31240, 4.9073840] | 2 |
| 3 | [14, 13, 16, 17] | -5.9752843 | [5.2243, 5.2974843] | 3 |
| 0 | [22, 18, 19, 21] | -1.2974652 | [3.12843, 4.2474643] | 4 |
| 1 | [22, 23, 24, 25] | -9.884843 | [8.00843, 8.0974843] | 5 |
| 2 | [27, 26, 28, 29] | -2.3984843 | [7.23843, 8.2094845] | 6 |
| 3 | [30, 32, 31, 33] | -4.5654566 | [3.1233, 4.2474643] | 7 |
| 1 | [36, 34, 37, 35] | -1.2974652 | [3.12843, 4.2474643] | 8 |
| 2 | [40, 41, 38, 39] | -3.5322451 | [4.31240, 4.9073840] | 9 |
| 1 | [42, 43, 45, 44] | -5.9752843 | [6.31240, 6.9073840] | 10 |
The aim is to analyze each server using its respective data, which was done with the code below:
# Boolean masks select the rows belonging to each server.
server0 = grp[grp['server'].values == 0].copy()
server1 = grp[grp['server'].values == 1].copy()
server2 = grp[grp['server'].values == 2].copy()
server3 = grp[grp['server'].values == 3].copy()
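A more concise equivalent (a sketch, assuming grp is the full DataFrame above and the server ids are 0 through 3) splits the frame with groupby:

servers = {sid: sub.copy() for sid, sub in grp.groupby('server')}  # one sub-frame per server id
server0, server1, server2, server3 = (servers[i] for i in range(4))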
This code yields the individual servers with their respective data features. For example, the server0 output becomes:
| server | users | power | Throughput range | time |
|:------:|:----------------:|:----------:|:--------------------:|:----:|
| 0 | [5, 3, 4, 1] | -4.2974843 | [5.23243, 5.2974843] | 0 |
| 0 | [22, 18, 19, 21] | -1.2974652 | [3.12843, 4.2474643] | 4 |
The results obtained for the individual servers have different lengths, so I tried padding using the code below:
from keras.preprocessing.sequence import pad_sequences
man = [server0, server1, server2, server3]
new = pad_sequences(man)
The result shows that the padding has been done, with all the servers having equal length, but the output no longer contains the column names. I want the final data to keep the columns. Any suggestions?
The aim is to apply machine learning to the data, so I would like to have the servers concatenated. This is what I later did, and it worked for my application:
from sklearn.preprocessing import MinMaxScaler

man = [server0, server1, server2, server3]
for cel in man:
    # Index by time and drop the list-valued users column.
    cel.set_index('time', inplace=True)
    cel.drop(['users'], axis=1, inplace=True)

scl = MinMaxScaler()  # scaler for the remaining feature columns
# Collect each server's values as a 2-D (rows, features) array.
vals = [cel.values.reshape(cel.shape[0], -1) for cel in man]
I then applied pad_sequences and it worked as follows:
from keras.preprocessing.sequence import pad_sequences

# pad_sequences defaults to dtype='int32', which would truncate floats.
new = pad_sequences(vals, dtype='float32')
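pad_sequences returns a plain NumPy array, which is why the column names disappear. A minimal sketch for getting labeled frames back (assuming each padded block lines up with the corresponding frame in man and its remaining columns):

import pandas as pd

# new has shape (n_servers, max_len, n_features); reattach each
# server's column names and stack into one labeled frame.
padded = [pd.DataFrame(arr, columns=man[i].columns) for i, arr in enumerate(new)]
combined = pd.concat(padded, keys=range(len(padded)))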
I am trying to find a trend in attendance. I filtered my existing df down to this so I can look at one activity at a time.
+---+-----------+-------+----------+-------+---------+
| | Date | Org | Activity | Hours | Weekday |
+---+-----------+-------+----------+-------+---------+
| 0 | 8/3/2020 | Org 1 | Gen Ab | 10.5 | Monday |
| 1 | 8/25/2020 | Org 1 | Gen Ab | 2 | Tuesday |
| 3 | 8/31/2020 | Org 1 | Gen Ab | 8.5 | Monday |
| 7 | 8/10/2020 | Org 2 | Gen Ab | 1 | Monday |
| 8 | 8/14/2020 | Org 3 | Gen Ab | 3.5 | Friday |
+---+-----------+-------+----------+-------+---------+
This code:
gen_ab = att_df.loc[att_df['Activity'] == "Gen Ab"]
sum_gen_ab = gen_ab.groupby(['Date', 'Activity']).sum()
sum_gen_ab.head()
Returns this:
+------------+----------+------------+
| | | Hours |
+------------+----------+------------+
| Date | Activity | |
| 06/01/2020 | Gen Ab | 347.250000 |
| 06/02/2020 | Gen Ab | 286.266667 |
| 06/03/2020 | Gen Ab | 169.583333 |
| 06/04/2020 | Gen Ab | 312.633333 |
| 06/05/2020 | Gen Ab | 317.566667 |
+------------+----------+------------+
How do I make the summed column name 'Hours'? I still get the same result when I do this:
sum_gen_ab['Hours'] = gen_ab.groupby(['Date', 'Activity']).sum()
What I eventually want to do is have a line graph that shows the sum of hours for the activity over time. The time of course would be the dates in my df.
plt.plot(sum_gen_ab['Date'], sum_gen_ab['Hours'])
plt.show()
returns KeyError: Date
Once you've used groupby(['Date', 'Activity']), Date and Activity have been turned into index levels and can no longer be referenced with sum_gen_ab['Date'].
To keep them as regular columns, use groupby(['Date', 'Activity'], as_index=False) instead.
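Putting that together, a minimal sketch (assuming att_df is the frame shown above; the dates are parsed so the x-axis sorts correctly):

import matplotlib.pyplot as plt
import pandas as pd

gen_ab = att_df.loc[att_df['Activity'] == "Gen Ab"]
# as_index=False keeps Date and Activity as ordinary columns.
sum_gen_ab = gen_ab.groupby(['Date', 'Activity'], as_index=False)['Hours'].sum()
sum_gen_ab['Date'] = pd.to_datetime(sum_gen_ab['Date'])

plt.plot(sum_gen_ab['Date'], sum_gen_ab['Hours'])
plt.show()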
I typically use the pandasql library to turn my data frames into different datasets. It lets you manipulate a pandas data frame with SQL code, and it can be used alongside regular pandas.
EXAMPLE:
import pandas as pd
import pandasql as psql

df = "will be your dataset"  # replace with your pandas DataFrame
new_dataset = psql.sqldf('''
    SELECT Date, Activity, SUM(Hours) AS SUM_OF_HOURS
    FROM df
    GROUP BY Date, Activity''', locals())
new_dataset.head()  # shows the first 5 rows of the result
I have the following dataframe
+-------+------------+
| index | keep       |
+-------+------------+
| 0     | not useful |
| 1     | start_1    |
| 2     | useful     |
| 3     | end_1      |
| 4     | not useful |
| 5     | start_2    |
| 6     | useful     |
| 7     | useful     |
| 8     | end_2      |
+-------+------------+
There are two pairs of marker strings (start_1/end_1 and start_2/end_2) indicating that the rows between them are the only relevant ones in the data. Hence, for the dataframe below, the output would consist only of the rows at index 2, 6, and 7 (2 is between start_1 and end_1; 6 and 7 are between start_2 and end_2):
import pandas as pd

d = {'keep': ["not useful", "start_1", "useful", "end_1", "not useful", "start_2", "useful", "useful", "end_2"]}
df = pd.DataFrame(data=d)
What is the most Pythonic/Pandas approach to this problem?
Thanks
Here's one way to do that (in a couple of steps, for clarity). There might be others:
df["sections"] = 0
df.loc[df.keep.str.startswith("start"), "sections"] = 1
df.loc[df.keep.str.startswith("end"), "sections"] = -1
df["in_section"] = df.sections.cumsum()
res = df[(df.in_section == 1) & ~df.keep.str.startswith("start")]
Output:
index keep sections in_section
2 2 useful 0 1
6 6 useful 0 1
7 7 useful 0 1
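An equivalent, slightly more compact variant of the same idea, offered as one alternative: compare running counts of the start and end markers directly.

# True exactly while more sections have started than have ended.
inside = (df.keep.str.startswith("start").cumsum()
          > df.keep.str.startswith("end").cumsum())
res = df[inside & ~df.keep.str.startswith("start")]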
I have this code that takes a CSV file, filters it by one column, and then plots the values of another column.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv(r'C:/Desktop/Plot/dataframe.csv', delimiter=";", encoding='unicode_escape')

# num_2 uses a comma as the decimal separator; keep the integer part.
df['num_1'] = df['num_2'].str.split(',').str[0]
df['num_1'] = df['num_1'].astype('int64', copy=False)

# Filter by the Describe column and take the values to plot.
dev_x = df[df['Describe'] == 'The Start of Journey']['num_2'].values

# Set figure size
plt.figure(figsize=(10, 5))
plt.hist(dev_x, bins=5)
plt.title('Data')
plt.show()
This is the dataset
+-----+-------+-----------------------+--------+--------+
| | name | Describe | num_1 | num_2 |
+-----+-------+-----------------------+--------+--------+
| 0 | er | The Start of Journey | 17 | 249,5 |
| 1 | NaN | NaN | 58 | 51,0 |
| 2 | NaN | NaN | 14 | 66,5 |
| 3 | NaN | NaN | 526 | 84,0 |
| 4 | be | The end of journey | 3 | 13,0 |
| 5 | tg | Levels | 342 | 34,0 |
| 6 | NaN | NaN | 231 | 55,6 |
| 7 | NaN | NaN | 23 | 75,0 |
| 8 | tf | counts | 54 | 34,6 |
| 9 | sf | The Start of Journey | 52 | 4324,0 |
| 10 | gd | The Start of Journey | 352 | 54.0 |
+-----+-------+-----------------------+--------+--------+
I want to modify the code so that it does the following:
- Prompts the user for the CSV file path
- Prompts the user for the name of the column to filter on (in this case the Describe column)
- Prompts the user for the string to filter by (in this case The Start of Journey)
- Prompts the user for the name of the column whose data to plot (in this case num_2)
I have checked other sources, but due to the structure of the code I am having trouble with this.
Use the input() function. You can have a variable like x and do x = input("Enter the CSV path>>> ") (or something similar); x will then be a string containing whatever the user typed, and you can use it later. For example, instead of 'Describe' you could just put x.
x = input("Enter the csv path>>>") # returns answer in string form
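A minimal sketch combining all four prompts (assuming the CSV layout shown above; the variable names are illustrative):

import pandas as pd
import matplotlib.pyplot as plt

csv_path = input("Enter the CSV path>>> ")
filter_col = input("Enter the column to filter on>>> ")   # e.g. Describe
filter_val = input("Enter the value to filter by>>> ")    # e.g. The Start of Journey
plot_col = input("Enter the column to plot>>> ")          # e.g. num_2

df = pd.read_csv(csv_path, delimiter=";", encoding='unicode_escape')
dev_x = df[df[filter_col] == filter_val][plot_col].values

plt.figure(figsize=(10, 5))
plt.hist(dev_x, bins=5)
plt.title('Data')
plt.show()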
I am trying to aggregate data in a PySpark dataframe on a particular criterion. I am trying to align accounts by matching switchOUT amounts to switchIN amounts, so that the account money switches out of becomes the from-account and the other accounts become to-accounts.
This is the data I am getting in the dataframe to begin with:
+--------+------+-----------+----------+----------+-----------+
| person | acct | close_amt | open_amt | switchIN | switchOUT |
+--------+------+-----------+----------+----------+-----------+
| A | 1 | 125 | 50 | 75 | 0 |
+--------+------+-----------+----------+----------+-----------+
| A | 2 | 100 | 75 | 25 | 0 |
+--------+------+-----------+----------+----------+-----------+
| A | 3 | 200 | 300 | 0 | 100 |
+--------+------+-----------+----------+----------+-----------+
To this table
+--------+-----------+---------+----------+-----------+
| person | from_acct | to_acct | switchIN | switchOUT |
+--------+-----------+---------+----------+-----------+
| A      | 3         | 1       | 75       | 100       |
+--------+-----------+---------+----------+-----------+
| A      | 3         | 2       | 25       | 100       |
+--------+-----------+---------+----------+-----------+
Also, how can I make it work for N rows (not just 3 accounts)?
So far I have used this code:
import operator
from pyspark.sql import functions as F

# define udfs
def sorter(l):
    res = sorted(l, key=operator.itemgetter(1))
    return [item[0] for item in res]

def list_to_string(l):
    res = 'from_fund_' + str(l[0]) + '_to_fund_' + str(l[1])
    return res

def listfirstAcc(l):
    res = str(l[0])
    return res

def listSecAcc(l):
    res = str(l[1])
    return res

sort_udf = F.udf(sorter)
list_str = F.udf(list_to_string)
extractFirstFund = F.udf(listfirstAcc)
extractSecondFund = F.udf(listSecAcc)

# Add additional columns
df = df.withColumn("move", sort_udf("list_col").alias("sorted_list"))
df = df.withColumn("move_string", list_str("move"))
df = df.withColumn("From_Acct", extractFirstFund("move"))
df = df.withColumn("To_Acct", extractSecondFund("move"))
Current outcome I am getting:
+--------+-----------+---------+----------+-----------+
| person | from_acct | to_acct | switchIN | switchOUT |
+--------+-----------+---------+----------+-----------+
| A      | 3         | 1,2     | 75       | 100       |
+--------+-----------+---------+----------+-----------+
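One way to reach the desired shape that also scales to N rows (a sketch of a join-based alternative to the UDFs, assuming a single switch-out account per person as in the sample):

from pyspark.sql import functions as F

# Accounts money moves out of become from-accounts.
outs = (df.filter(F.col("switchOUT") > 0)
          .select("person", F.col("acct").alias("from_acct"), "switchOUT"))
# Accounts money moves into become to-accounts.
ins = (df.filter(F.col("switchIN") > 0)
         .select("person", F.col("acct").alias("to_acct"), "switchIN"))

# Pair each to-account with the person's from-account.
result = (outs.join(ins, "person")
              .select("person", "from_acct", "to_acct", "switchIN", "switchOUT"))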
I want to calculate APRU for several countries.
country_list = ['us','gb','ca','id']

count = {}
for i in country_list:
    count[i] = df_day_country[df_day_country.isin([i])]
    count[i+'_reverse'] = count[i].iloc[::-1]
    for j in range(1, len(count[i+'_reverse'])):
        count[i+'_reverse']['count'].iloc[j] = count[i+'_reverse']['count'][j-1:j+1].sum()
    for k in range(1, len(count[i])):
        # revenue_sum is presumably a variable holding the revenue column name
        count[i][revenue_sum].iloc[k] = count[i][revenue_sum][k-1:k+1].sum()
    count[i]['APRU'] = count[i][revenue_sum] / count[i]['count'][0] / 100
After that, I will create 4 dataframes (df_us, df_gb, df_ca, df_id), each showing that country's APRU.
But the dataset is large, and the running time becomes extremely slow as the country list grows. Is there a way to decrease the running time?
Consider using numba. Your code thus becomes:
from numba import njit

country_list = ['us','gb','ca','id']

@njit
def count(country_list):
    count = {}
    for i in country_list:
        count[i] = df_day_country[df_day_country.isin([i])]
        count[i+'_reverse'] = count[i].iloc[::-1]
        for j in range(1, len(count[i+'_reverse'])):
            count[i+'_reverse']['count'].iloc[j] = count[i+'_reverse']['count'][j-1:j+1].sum()
        for k in range(1, len(count[i])):
            count[i][revenue_sum].iloc[k] = count[i][revenue_sum][k-1:k+1].sum()
        count[i]['APRU'] = count[i][revenue_sum] / count[i]['count'][0] / 100
    return count
Numba makes Python loops a lot faster and is in the process of being integrated into heavier-duty Python libraries like SciPy. Definitely give it a look. (Note that numba shines on numeric NumPy loops; pandas objects are not supported in nopython mode, so the frame operations above would need to be rewritten over plain arrays first.)
IIUC, from your code and variable names, it looks like you are trying to compute an average:
import numpy as np
import pandas as pd

# toy data set:
country_list = ['us','gb']
np.random.seed(1)
datalen = 10
df_day_country = pd.DataFrame({'country': np.random.choice(country_list, datalen),
                               'count': np.random.randint(0, 100, datalen),
                               'revenue_sum': np.random.uniform(0, 100, datalen)})

# Per country, divide each row's revenue by the country's total count.
df_day_country['APRU'] = (df_day_country.groupby('country', group_keys=False)
                                        .apply(lambda x: x['revenue_sum'] / x['count'].sum()))
Output:
+---+---------+-------+-------------+----------+
|   | country | count | revenue_sum | APRU     |
+---+---------+-------+-------------+----------+
| 0 | gb      | 16    | 20.445225   | 0.150333 |
| 1 | gb      | 1     | 87.811744   | 0.645675 |
| 2 | us      | 76    | 2.738759    | 0.011856 |
| 3 | us      | 71    | 67.046751   | 0.290246 |
| 4 | gb      | 6     | 41.730480   | 0.306842 |
| 5 | gb      | 25    | 55.868983   | 0.410801 |
| 6 | gb      | 50    | 14.038694   | 0.103226 |
| 7 | gb      | 20    | 19.810149   | 0.145663 |
| 8 | gb      | 18    | 80.074457   | 0.588783 |
| 9 | us      | 84    | 96.826158   | 0.419161 |
+---+---------+-------+-------------+----------+
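As an aside on the original loops: the in-place pairwise sums are effectively cumulative sums, so if running totals per country are what's needed, they can be vectorized too (a sketch, assuming 'revenue_sum' names the revenue column):

# Equivalent of the inner loops: running totals per country, no Python-level loops.
df_day_country['revenue_cum'] = df_day_country.groupby('country')['revenue_sum'].cumsum()
df_day_country['count_cum'] = df_day_country.groupby('country')['count'].cumsum()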