I am currently training a neural network and would like to split my training and validation data with an 80:20 ratio. I want to keep each purchase intact (a complete purchase = all rows with the same purchaseid), so I always split on purchase boundaries.
Unfortunately, I get an IndexError: column index (12) out of range. The error occurs at the line mat[purchaseid, itemid] = 1.0. How can I fix this?
Dataframe:
import pandas as pd

d = {'purchaseid': [0, 0, 0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9, 9, 9, 9],
     'itemid': [3, 8, 2, 10, 3, 10, 4, 12, 3, 12, 3, 4, 8, 6, 3, 0, 5, 12, 9, 9, 13, 1, 7, 11, 11]}
df = pd.DataFrame(data=d)
print(df.head(20))
Methods:
import random

import numpy as np
import scipy.sparse as sp

PERCENTAGE_SPLIT = 20

def splitter(df):
    sum_purchase = df['purchaseid'].nunique()
    amount = round((sum_purchase / 100) * PERCENTAGE_SPLIT)
    random_list = random.sample(df['purchaseid'].unique().tolist(), amount)
    df_ = df.loc[df['purchaseid'].isin(random_list)]
    df_reduced = df.loc[~df['purchaseid'].isin(random_list)]
    return [df_reduced, df_]

def generate_matrix(dataframe, name):
    mat = sp.dok_matrix((dataframe.shape[0], len(dataframe['itemid'].unique())), dtype=np.float32)
    for purchaseid, itemid in zip(dataframe['purchaseid'], dataframe['itemid']):
        mat[purchaseid, itemid] = 1.0  # the IndexError is raised here
    return mat
Call:
dfs = splitter(df)
df_tr = dfs[0].copy(deep=True)
df_val = dfs[1].copy(deep=True)
train_mat_ = generate_matrix(df_tr, 'train')
val_mat_ = generate_matrix(df_val, 'val')
Error:
IndexError: column index (12) out of range
Dataframe:
#df
purchaseid itemid
0 0 3
1 0 8
2 0 2
3 1 10
4 2 3
5 2 10
6 3 4
7 3 12
8 3 3
9 4 12
10 4 3
11 4 4
12 5 8
13 5 6
14 5 3
15 6 0
16 6 5
17 6 12
18 7 9
19 7 9
# df_tr
purchaseid itemid
0 0 3
1 0 8
2 0 2
3 1 10
4 2 3
5 2 10
6 3 4
7 3 12
8 3 3
9 4 12
10 4 3
11 4 4
12 5 8
13 5 6
14 5 3
18 7 9
19 7 9
20 8 13
# df_val
purchaseid itemid
15 6 0
16 6 5
17 6 12
21 9 1
22 9 7
23 9 11
24 9 11
Try this instead. sp.dok_matrix needs the dimensions of the target matrix up front, and your original call sizes it as (number of rows, number of unique itemids), so any itemid larger than the count of distinct itemids in that split (e.g. 12) falls out of range. Looking at your data, I have assumed purchaseid ranges over [0, max(purchaseid)] and itemid over [0, max(itemid)].
def generate_matrix(dataframe, name):
    mat = sp.dok_matrix((dataframe['purchaseid'].max() + 1, dataframe['itemid'].max() + 1), dtype=np.float32)
    for purchaseid, itemid in zip(dataframe['purchaseid'], dataframe['itemid']):
        mat[purchaseid, itemid] = 1.0  # no longer raises: the indices now fit the matrix shape
    return mat
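One caveat remains: because the fixed function computes the shape from whichever dataframe it receives, train_mat_ and val_mat_ can still end up with different dimensions (each split may be missing the highest purchaseid or itemid). A minimal sketch that passes an explicit shape computed from the full df instead (replacing the unused name argument with a shape parameter is my change, not part of the original code):

import numpy as np
import scipy.sparse as sp

def generate_matrix(dataframe, shape):
    # An explicit shape keeps the train and validation matrices
    # dimensionally compatible, regardless of which ids each split contains.
    mat = sp.dok_matrix(shape, dtype=np.float32)
    for purchaseid, itemid in zip(dataframe['purchaseid'], dataframe['itemid']):
        mat[purchaseid, itemid] = 1.0
    return mat

shape = (df['purchaseid'].max() + 1, df['itemid'].max() + 1)  # computed from the full df
train_mat_ = generate_matrix(df_tr, shape)
val_mat_ = generate_matrix(df_val, shape)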
I want to reshape the data so that the values in the index column become the columns
My Data frame:
Gender_Male Gender_Female Location_london Location_North Location_South
Cat
V 5 4 4 2 3
W 15 12 12 7 8
X 11 15 16 4 6
Y 22 18 21 9 9
Z 8 7 7 4 4
Desired data frame:
                  V   W   X   Y   Z
Gender   Male     5  15  11  22   8
         Female   4  12  15  18   7
Location london   4  12  16  21   7
         North    2   7   4   9   4
         South    3   8   6   9   4
Is there an easy way to do this? I also have 9 other categorical variables in my data set in addition to the Gender and Location variables. I have only included two variables to keep the example simple.
Code to create the example dataframe:
import pandas as pd

df1 = pd.DataFrame({
    'Cat': ['V', 'W', 'X', 'Y', 'Z'],
    'Gender_Male': [5, 15, 11, 22, 8],
    'Gender_Female': [4, 12, 15, 18, 7],
    'Location_london': [4, 12, 16, 21, 7],
    'Location_North': [2, 7, 4, 9, 4],
    'Location_South': [3, 8, 6, 9, 4]
}).set_index('Cat')
df1
You can transpose the dataframe and then split and set the new index:
Transpose
dft = df1.T
print(dft)
Cat V W X Y Z
Gender_Male 5 15 11 22 8
Gender_Female 4 12 15 18 7
Location_london 4 12 16 21 7
Location_North 2 7 4 9 4
Location_South 3 8 6 9 4
Split and set the new index
dft.index = dft.index.str.split('_', expand=True)
dft.columns.name = None
print(dft)
V W X Y Z
Gender Male 5 15 11 22 8
Female 4 12 15 18 7
Location london 4 12 16 21 7
North 2 7 4 9 4
South 3 8 6 9 4
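If you would rather keep Cat as the row index, the same split trick works directly on the column labels; a small sketch (df2 is my name, not from the question):

import pandas as pd

df2 = df1.copy()
# Split 'Gender_Male' etc. into a two-level column MultiIndex
# ('Gender', 'Male'), keeping the original orientation.
df2.columns = df2.columns.str.split('_', expand=True)
print(df2)

This gives Gender/Location as the outer column level and Male/Female/london/North/South as the inner one, which is just the transpose of the result above.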
I am trying to create a new column in which the value in the first row is 0 and, from the second row on, each value is calculated as
ColumnA[this row] = (ColumnA[last row] * 13 + ColumnB[this row]) / 14
I am using the pandas shift function, but it doesn't seem to produce the intended result.
import numpy as np
import pandas as pd

test = np.array([1, 5, 3, 20, 2, 6, 9, 8, 7])
test = pd.DataFrame(test, columns=['ABC'])
test.loc[test['ABC'] == 1, 'a'] = 0
test['a'] = (test['a'].shift() * 13 + test['ABC']) / 14
I am trying to create a column that looks like this:
ABC        a
  1   0
  5   0.3571
  3   0.5459
 20   1.9355
  2   1.9401
  6   2.2301
  9   2.7137
  8   3.0913
  7   3.3705
But actually, what I am getting by running the above code is this:
ABC    a
  1   nan
  2   0
  3   nan
  4   nan
  5   nan
  6   nan
  7   nan
  8   nan
  9   nan
test = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
test = pd.DataFrame(test, columns=['ABC'])
test["res"] = test["ABC"]
test.loc[0, "res"] = 0  # initialize the first row as 0 (avoids the chained assignment test.iloc[0]['res'], which may silently fail)
test["res"] = test.res + test.res.shift()
test["res"] = test.res.fillna(0).astype(int)  # shift() introduces a NaN; replace it with 0 and convert the column back to int
Try:
test["a"] = (test["ABC"].shift().cumsum() + test["ABC"].shift()).fillna(0)
print(test)
Prints:
ABC a
0 1 0.0
1 2 2.0
2 3 5.0
3 4 9.0
4 5 14.0
5 6 20.0
6 7 27.0
7 8 35.0
8 9 44.0
Let's try a for loop
import pandas as pd

df = pd.DataFrame({'ABC': [1, 5, 3, 20, 2, 6, 9, 8, 7]})

lst = [0]
res = 0
for i, row in df.iloc[1:].iterrows():
    res = ((res * 13) + row['ABC']) / 14
    lst.append(res)
df['a'] = pd.Series(lst)
print(df)
Output:
ABC a
0 1 0.000000
1 5 0.357143
2 3 0.545918
3 20 1.935496
4 2 1.940103
5 6 2.230096
6 9 2.713660
7 8 3.091256
8 7 3.370452
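For completeness, the recursion a[i] = (a[i-1] * 13 + ABC[i]) / 14 is exactly an exponentially weighted mean with alpha = 1/14, so the loop can also be vectorized. A minimal sketch (s is my name; the forced 0 in the first row is the only special case):

import pandas as pd

df = pd.DataFrame({'ABC': [1, 5, 3, 20, 2, 6, 9, 8, 7]})

# ewm(alpha=1/14, adjust=False) applies y[i] = (13 * y[i-1] + x[i]) / 14.
# pandas seeds the recursion with the first element of the series, so
# overwrite it with the required starting value of 0 before calling ewm.
s = df['ABC'].astype(float)
s.iloc[0] = 0
df['a'] = s.ewm(alpha=1/14, adjust=False).mean()
print(df)

This reproduces the loop output above (0, 0.357143, 0.545918, ...) without iterating row by row.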
I have this pandas series:
ts = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
What I would like to get is a dataframe with an additional column holding, for the even rows, the sum of rows 0, 2, 4 and 6, and for the odd rows the sum of rows 1, 3, 5 and 7 (that is, every other row is left out when creating each sum).
In this case, this means a new dataframe should look like this one:
index ts sum
0 1 16
1 2 20
2 3 16
3 4 20
4 5 16
5 6 20
6 7 16
7 8 20
How could I do this?
Group by the row index modulo k to sum every kth row:
k = 2
df = ts.to_frame('ts')
df['sum'] = df.groupby(ts.index % k).transform('sum')
# if the index is not a default RangeIndex:
# df['sum'] = df.groupby(np.arange(len(ts)) % k).transform('sum')
print(df)
ts sum
0 1 16
1 2 20
2 3 16
3 4 20
4 5 16
5 6 20
6 7 16
7 8 20
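For the k = 2 case specifically, a minimal groupby-free sketch (assuming the default RangeIndex) gets the same result with plain slicing:

import numpy as np
import pandas as pd

ts = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
df = ts.to_frame('ts')
# Even-positioned rows get the sum of the even slice, odd rows the odd slice.
df['sum'] = np.where(df.index % 2 == 0, ts[::2].sum(), ts[1::2].sum())
print(df)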
Good day!
There is the following time series dataset:
Time Value
1 1
2 1
3 1
4 2
5 2
6 2
7 2
8 3
9 3
10 4
11 4
12 5
I need to split and group data by value like this:
Value  Time start  Time end
1 1 3
2 4 7
3 8 9
4 10 11
5 12 12
How can I do this quickly, and in the most functional style possible in Python? Various libraries can be used, for example pandas or numpy.
Try with pandas (group by Value and aggregate the Time column):
df.groupby('Value')['Time'].agg(['min', 'max'])
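A quick check on the sample data (a minimal sketch; renaming to the requested column names is left out):

import pandas as pd

df = pd.DataFrame({'Time': range(1, 13),
                   'Value': [1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 5]})
print(df.groupby('Value')['Time'].agg(['min', 'max']))
#        min  max
# Value
# 1        1    3
# 2        4    7
# 3        8    9
# 4       10   11
# 5       12   12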
We can use pandas for this:
Solution:
import pandas as pd

data = {'Time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
        'Value': [1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 5]}
df = pd.DataFrame(data, columns=['Time', 'Value'])

res = df.groupby('Value').agg(['min', 'max'])
f_res = res.rename(columns={'min': 'Start Time', 'max': 'End Time'}, inplace=False)
print(f_res)
Output:
Time
Start Time End Time
Value
1 1 3
2 4 7
3 8 9
4 10 11
5 12 12
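Note that f_res keeps a two-level column header, with 'Time' as the outer level. If flat column names are wanted, one option (my addition, not part of the original answer) is to drop that level:

f_res.columns = f_res.columns.droplevel(0)  # drop the outer 'Time' level
print(f_res)
#        Start Time  End Time
# Value
# 1               1         3
# 2               4         7
# 3               8         9
# 4              10        11
# 5              12        12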
First, get the count of each Value:
result = df.groupby('Value').agg(['count'])
result.columns = result.columns.get_level_values(1)  # drop the MultiIndex level
result
count
Value
1 3
2 4
3 2
4 2
5 1
Then use cumcount to get the time start:
s = df.groupby('Value').cumcount()
result["time start"] = s[s == 0].index.tolist()
result
count time start
Value
1 3 0
2 4 3
3 2 7
4 2 9
5 1 11
Finally, convert to 1-based time and derive the time end:
result["time start"] += 1
result["time end"] = result["time start"] + result['count'] - 1
result
count time start time end
Value
1 3 1 3
2 4 4 7
3 2 8 9
4 2 10 11
5 1 12 12
I have a data frame like the following:
import random
import pandas as pd

values = random.sample(range(1, 101), 15)
df = pd.DataFrame({'x': [3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4],
                   'n': [100, 100, 100, 'reference', 'reference', 'reference', 500, 500, 500,
                         100, 100, 100, 'reference', 'reference', 'reference'],
                   'value': values})
The values labeled 'reference' in the n column are reference values, which I will eventually plot against. To help with this, I need to make a data frame that carries the reference values in a separate column, so columns = ['x', 'n', 'value', 'value_reference'].
A reference value applies to every value of n that shares the same x. Therefore, I want to make a data frame like the following:
desired_df = pd.DataFrame({'x': [3, 3, 3, 3, 3, 3, 4, 4, 4], 'n': [100, 100, 100, 500, 500, 500, 100, 100, 100], 'value': [values[i] for i in [0, 1, 2, 6, 7, 8, 9, 10, 11]], 'value_reference':[values[i] for i in [3, 4, 5, 3, 4, 5, 12, 13, 14]]})
I hard-coded the result here just to give a reproducible example of exactly what I want. However, I am looking for the correct way of doing this operation.
How can this be done?
Thanks,
Jack
One way might be this:
df["tick"] = df.groupby(["x", "n"]).cumcount()
numbers = df.loc[df["n"] != "reference"]
ref = df.loc[df["n"] == "reference"]
ref = ref.drop("n", axis=1).rename(columns={"value": "reference"})
out = numbers.merge(ref).drop("tick", axis=1)
out = out.sort_values(["x", "n"])
which gives me
In [282]: out
Out[282]:
x n value reference
0 3 100 6 67
2 3 100 9 29
4 3 100 34 51
1 3 500 42 67
3 3 500 36 29
5 3 500 12 51
6 4 100 74 5
7 4 100 48 37
8 4 100 7 70
Step by step, first we add a tick column so we know which row of value matches with which row of reference:
In [290]: df
Out[290]:
x n value tick
0 3 100 6 0
1 3 100 9 1
2 3 100 34 2
3 3 reference 67 0
4 3 reference 29 1
5 3 reference 51 2
6 3 500 42 0
7 3 500 36 1
8 3 500 12 2
9 4 100 74 0
10 4 100 48 1
11 4 100 7 2
12 4 reference 5 0
13 4 reference 37 1
14 4 reference 70 2
Then we separate out the value and reference parts of the table:
In [291]: numbers = df.loc[df["n"] != "reference"]
...: ref = df.loc[df["n"] == "reference"]
...: ref = ref.drop("n", axis=1).rename(columns={"value": "reference"})
...:
...:
In [292]: numbers
Out[292]:
x n value tick
0 3 100 6 0
1 3 100 9 1
2 3 100 34 2
6 3 500 42 0
7 3 500 36 1
8 3 500 12 2
9 4 100 74 0
10 4 100 48 1
11 4 100 7 2
In [293]: ref
Out[293]:
x reference tick
3 3 67 0
4 3 29 1
5 3 51 2
12 4 5 0
13 4 37 1
14 4 70 2
and then we merge; the merge aligns on the shared columns, which are "x" and "tick". A sort to clean things up, and we're done.
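As a footnote, here is a merge-free sketch that broadcasts the reference values by index alignment instead, using the same tick idea (the names ref_map and out2 are mine; the data values are random, so the numbers will differ from the run above):

import random
import pandas as pd

# Same setup as the question.
values = random.sample(range(1, 101), 15)
df = pd.DataFrame({'x': [3]*9 + [4]*6,
                   'n': [100]*3 + ['reference']*3 + [500]*3 + [100]*3 + ['reference']*3,
                   'value': values})

df["tick"] = df.groupby(["x", "n"]).cumcount()

# Reference values keyed by (x, tick); this index is unique, so it can be
# aligned against the duplicated (x, tick) index of the non-reference rows.
ref_map = df[df["n"] == "reference"].set_index(["x", "tick"])["value"]

out2 = df[df["n"] != "reference"].set_index(["x", "tick"])
out2["value_reference"] = ref_map  # assignment aligns on the (x, tick) index
out2 = out2.reset_index().drop(columns="tick").sort_values(["x", "n"])
print(out2)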