I have a data frame like the following:
import random
import pandas as pd
values = random.sample(range(1, 101), 15)
df = pd.DataFrame({'x': [3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4], 'n': [100, 100, 100, 'reference', 'reference', 'reference', 500, 500, 500, 100, 100, 100, 'reference', 'reference', 'reference'], 'value': values})
The rows labeled 'reference' in the n column hold reference values, which I will eventually plot against. To help with this, I need a data frame that has the reference values in a separate column, so columns = ['x', 'n', 'value', 'value_reference']
The reference values apply to every value of n that shares the same x. Therefore, I want to make a data frame like the following:
desired_df = pd.DataFrame({'x': [3, 3, 3, 3, 3, 3, 4, 4, 4], 'n': [100, 100, 100, 500, 500, 500, 100, 100, 100], 'value': [values[i] for i in [0, 1, 2, 6, 7, 8, 9, 10, 11]], 'value_reference':[values[i] for i in [3, 4, 5, 3, 4, 5, 12, 13, 14]]})
I hard-coded this result only to make the example reproducible. I am looking for the proper way of doing this operation.
How can this be done?
Thanks,
Jack
One way might be this:
df["tick"] = df.groupby(["x", "n"]).cumcount()
numbers = df.loc[df["n"] != "reference"]
ref = df.loc[df["n"] == "reference"]
ref = ref.drop("n", axis=1).rename(columns={"value": "reference"})
out = numbers.merge(ref).drop("tick", axis=1)
out = out.sort_values(["x", "n"])
which gives me
In [282]: out
Out[282]:
x n value reference
0 3 100 6 67
2 3 100 9 29
4 3 100 34 51
1 3 500 42 67
3 3 500 36 29
5 3 500 12 51
6 4 100 74 5
7 4 100 48 37
8 4 100 7 70
Step by step, first we add a tick column so we know which row of value matches with which row of reference:
In [290]: df
Out[290]:
x n value tick
0 3 100 6 0
1 3 100 9 1
2 3 100 34 2
3 3 reference 67 0
4 3 reference 29 1
5 3 reference 51 2
6 3 500 42 0
7 3 500 36 1
8 3 500 12 2
9 4 100 74 0
10 4 100 48 1
11 4 100 7 2
12 4 reference 5 0
13 4 reference 37 1
14 4 reference 70 2
Then we separate out the value and reference parts of the table:
In [291]: numbers = df.loc[df["n"] != "reference"]
...: ref = df.loc[df["n"] == "reference"]
...: ref = ref.drop("n", axis=1).rename(columns={"value": "reference"})
...:
...:
In [292]: numbers
Out[292]:
x n value tick
0 3 100 6 0
1 3 100 9 1
2 3 100 34 2
6 3 500 42 0
7 3 500 36 1
8 3 500 12 2
9 4 100 74 0
10 4 100 48 1
11 4 100 7 2
In [293]: ref
Out[293]:
x reference tick
3 3 67 0
4 3 29 1
5 3 51 2
12 4 5 0
13 4 37 1
14 4 70 2
Then we merge. With no on argument, merge joins on the columns the two frames share, which here are "x" and "tick". A sort to clean things up and we're done.
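The steps above can be condensed into one self-contained sketch. The values are the ones from the sample output rather than a fresh random draw, the join keys are passed explicitly via on=, and sort=False is used only because the n column mixes integers and strings. These are choices made for this illustration, not part of the original answer.

```python
import pandas as pd

# Fixed values taken from the sample output, so the result is reproducible.
values = [6, 9, 34, 67, 29, 51, 42, 36, 12, 74, 48, 7, 5, 37, 70]
df = pd.DataFrame({
    "x": [3] * 9 + [4] * 6,
    "n": [100] * 3 + ["reference"] * 3 + [500] * 3 + [100] * 3 + ["reference"] * 3,
    "value": values,
})

# Position of each row within its (x, n) group: 0, 1, 2, ...
df["tick"] = df.groupby(["x", "n"], sort=False).cumcount()

# Split into the measured rows and the reference rows.
is_ref = df["n"] == "reference"
numbers = df.loc[~is_ref]
ref = df.loc[is_ref].drop(columns="n").rename(columns={"value": "value_reference"})

# Pair every measured row with the reference row that has the same x and tick.
out = (
    numbers.merge(ref, on=["x", "tick"])
    .astype({"n": int})
    .sort_values(["x", "n", "tick"])
    .drop(columns="tick")
    .reset_index(drop=True)
)
print(out)
```

Sorting on "tick" as a third key before dropping it keeps the rows inside each (x, n) block in their original order.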
I have a dataframe with three columns. The first column specifies a group into which each row is classified. Each group normally consists of 3 data points (rows), but it is possible for the last group to be "cut off," and contain fewer than three data points. In the real world, this could be due to the experiment or data collection process being cut off prematurely. In the below example, group 3 is cut off and only contains one data point.
import pandas as pd
data = {
"group_id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3],
"valueA": [420, 380, 390, 500, 270, 220, 150, 400, 330, 170],
"valueB": [50, 40, 45, 22, 20, 50, 10, 60, 90, 10]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
I also have two lists with additional values.
x_list = [1, 3, 5]
y_list = [2, 4, 6]
I want to add these lists to my dataframe as new columns, and have the values repeat for each group. In other words, I want my output to look like this.
group_id valueA valueB x y
0 0 420 50 1 2
1 0 380 40 3 4
2 0 390 45 5 6
3 1 500 22 1 2
4 1 270 20 3 4
5 1 220 50 5 6
6 2 150 10 1 2
7 2 400 60 3 4
8 2 330 90 5 6
9 3 170 10 1 2
Notice that even though the length of a column is not divisible by the length of the shorter lists, the number of rows in the dataframe does not change.
How do I achieve this without losing dataframe rows or adding new rows with NaN values?
You can use GroupBy.cumcount to generate an indexer, then use it to repeat the values in order within each group:
new = pd.DataFrame({'x': x_list, 'y': y_list})
idx = df.groupby('group_id').cumcount()
df[['x', 'y']] = new.reindex(idx).to_numpy()
Output:
group_id valueA valueB x y
0 0 420 50 1 2
1 0 380 40 3 4
2 0 390 45 5 6
3 1 500 22 1 2
4 1 270 20 3 4
5 1 220 50 5 6
6 2 150 10 1 2
7 2 400 60 3 4
8 2 330 90 5 6
9 3 170 10 1 2
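As your lists have the same length, you can combine them into one frame and reindex it by each row's position within its group (the modulo wraps around if a group has more rows than the lists):
df[['x', 'y']] = (pd.DataFrame({'x': x_list, 'y': y_list})
                  .reindex(df.groupby('group_id').cumcount().mod(len(x_list))).values)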
print(df)
# Output
group_id valueA valueB x y
0 0 420 50 1 2
1 0 380 40 3 4
2 0 390 45 5 6
3 1 500 22 1 2
4 1 270 20 3 4
5 1 220 50 5 6
6 2 150 10 1 2
7 2 400 60 3 4
8 2 330 90 5 6
9 3 170 10 1 2
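Let's use np.resize, which repeats the list until it reaches the requested length. Note that it ignores group_id, so it matches the desired output only when every group except possibly the last has exactly len(x_list) rows: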
import pandas as pd
import numpy as np
data = {
"group_id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3],
"valueA": [420, 380, 390, 500, 270, 220, 150, 400, 330, 170],
"valueB": [50, 40, 45, 22, 20, 50, 10, 60, 90, 10]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
df['x'] = np.resize(x_list, len(df))
df['y'] = np.resize(y_list, len(df))
df
Output:
group_id valueA valueB x y
0 0 420 50 1 2
1 0 380 40 3 4
2 0 390 45 5 6
3 1 500 22 1 2
4 1 270 20 3 4
5 1 220 50 5 6
6 2 150 10 1 2
7 2 400 60 3 4
8 2 330 90 5 6
9 3 170 10 1 2
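One caveat with np.resize is worth sketching: it tiles the list over the whole frame and does not restart at group boundaries. The small frame below is hypothetical (not from the question) and has a short group in the middle of the data, which is where the two approaches diverge.

```python
import numpy as np
import pandas as pd

x_list = [1, 3, 5]
# Hypothetical data: group 0 is cut short, group 1 is complete.
df = pd.DataFrame({"group_id": [0, 0, 1, 1, 1]})

# Global tiling: ignores where the groups start.
tiled = np.resize(x_list, len(df))

# Per-group tiling: restart the list at the first row of every group.
pos = df.groupby("group_id").cumcount().to_numpy()
per_group = np.asarray(x_list)[pos % len(x_list)]

print(tiled.tolist())      # [1, 3, 5, 1, 3]
print(per_group.tolist())  # [1, 3, 1, 3, 5]
```

If only the last group can ever be incomplete, as in the question, both give the same result.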
An alternative for lists of different sizes:
def repeat_to_length(values, length):
    # Repeat the list enough times to cover `length`, then truncate.
    return (values * (length // len(values) + 1))[:length]

df['x'] = repeat_to_length(x_list, len(df))
df['y'] = repeat_to_length(y_list, len(df))
I want to reshape the data so that the values in the index column become the columns
My Data frame:
Gender_Male Gender_Female Location_london Location_North Location_South
Cat
V 5 4 4 2 3
W 15 12 12 7 8
X 11 15 16 4 6
Y 22 18 21 9 9
Z 8 7 7 4 4
Desired Data frame:
Is there an easy way to do this? I also have 9 other categorical variables in my data set in addition to the Gender and Location variables. I have only included two variables to keep the example simple.
Code to create the example dataframe:
df1 = pd.DataFrame({
'Cat' : ['V','W', 'X', 'Y', 'Z'],
'Gender_Male' : [5, 15, 11, 22, 8],
'Gender_Female' : [4, 12, 15, 18, 7],
'Location_london': [4,12, 16, 21, 7],
'Location_North' : [2, 7, 4, 9, 4],
'Location_South' : [3, 8, 6, 9, 4]
}).set_index('Cat')
df1
You can transpose the dataframe and then split and set the new index:
Transpose
dft = df1.T
print(dft)
Cat V W X Y Z
Gender_Male 5 15 11 22 8
Gender_Female 4 12 15 18 7
Location_london 4 12 16 21 7
Location_North 2 7 4 9 4
Location_South 3 8 6 9 4
Split and set the new index
dft.index = dft.index.str.split('_', expand=True)
dft.columns.name = None
print(dft)
V W X Y Z
Gender Male 5 15 11 22 8
Female 4 12 15 18 7
Location london 4 12 16 21 7
North 2 7 4 9 4
South 3 8 6 9 4
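As a self-contained sketch of the same idea, the snippet below also shows how the resulting two-level index can be queried. Passing n=1 and naming the levels "variable" and "level" are assumptions added here: n=1 splits on the first underscore only, which may matter if some of the other categorical variables contain underscores in their values.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Cat": ["V", "W", "X", "Y", "Z"],
    "Gender_Male": [5, 15, 11, 22, 8],
    "Gender_Female": [4, 12, 15, 18, 7],
    "Location_london": [4, 12, 16, 21, 7],
    "Location_North": [2, 7, 4, 9, 4],
    "Location_South": [3, 8, 6, 9, 4],
}).set_index("Cat")

dft = df1.T
# Split "Gender_Male" into ("Gender", "Male"); n=1 splits on the first
# underscore only, so "Some_Var_Name" would become ("Some", "Var_Name").
dft.index = dft.index.str.split("_", n=1, expand=True)
dft.index.names = ["variable", "level"]  # hypothetical level names
dft.columns.name = None

gender = dft.loc["Gender"]                     # all rows of one variable
north_w = dft.loc[("Location", "North"), "W"]  # a single cell
print(gender)
print(north_w)
```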
I have this pandas series:
ts = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
What I would like to get is a dataframe which contains another column with the sum of rows 0, 2, 4, 6 and for 1, 3, 5 and 7 (that means, one row is left out when creating the sum).
In this case, this means a new dataframe should look like this one:
index ts sum
0 1 16
1 2 20
2 3 16
3 4 20
4 5 16
5 6 20
6 7 16
7 8 20
How could I do this?
Group the rows by their position modulo k, so that every k-th row falls into the same group:
k = 2
df = ts.to_frame('ts')
df['sum'] = df.groupby(ts.index % k)['ts'].transform('sum')
# if the index is not the default RangeIndex (needs import numpy as np):
# df['sum'] = df.groupby(np.arange(len(ts)) % k)['ts'].transform('sum')
print (df)
ts sum
0 1 16
1 2 20
2 3 16
3 4 20
4 5 16
5 6 20
6 7 16
7 8 20
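For an index that is not the default RangeIndex, the positional variant might look like the following sketch. The letter index is made up for illustration; the grouping key is built from row positions, so the labels do not matter.

```python
import numpy as np
import pandas as pd

# Hypothetical non-default index to show that labels are not used.
ts = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=list("abcdefgh"))

k = 2
df = ts.to_frame("ts")

# Row positions 0, 1, 2, ... modulo k give the group of each row.
position = np.arange(len(df)) % k
df["sum"] = df.groupby(position)["ts"].transform("sum")
print(df)
```

With k = 2 the even positions sum to 16 and the odd positions to 20, matching the expected output in the question.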
I am currently training a neural network and would like to split my data into training and validation sets with an 80:20 ratio. Each purchase should stay whole, so I only split between purchases (a complete purchase is all rows with the same purchaseid).
Unfortunately, I get IndexError: column index (12) out of range. The error occurs at the line mat[purchaseid, itemid] = 1.0. How can I fix this?
Dataframe:
d = {'purchaseid': [0, 0, 0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9, 9, 9, 9],
'itemid': [ 3, 8, 2, 10, 3, 10, 4, 12, 3, 12, 3, 4, 8, 6, 3, 0, 5, 12, 9, 9, 13, 1, 7, 11, 11]}
df = pd.DataFrame(data=d)
print(df.head(20))
Methods:
import random
import numpy as np
import scipy.sparse as sp

PERCENTAGE_SPLIT = 20

def splitter(df):
    df_ = pd.DataFrame()
    sum_purchase = df['purchaseid'].nunique()
    amount = round((sum_purchase / 100) * PERCENTAGE_SPLIT)
    random_list = random.sample(df['purchaseid'].unique().tolist(), amount)
    df_ = df.loc[df['purchaseid'].isin(random_list)]
    df_reduced = df.loc[~df['purchaseid'].isin(random_list)]
    return [df_reduced, df_]

def generate_matrix(dataframe, name):
    mat = sp.dok_matrix((dataframe.shape[0], len(dataframe['itemid'].unique())), dtype=np.float32)
    for purchaseid, itemid in zip(dataframe['purchaseid'], dataframe['itemid']):
        mat[purchaseid, itemid] = 1.0 # At this position is the error
    return mat
Call:
dfs = splitter(df)
df_tr = dfs[0].copy(deep=True)
df_val = dfs[1].copy(deep=True)
train_mat_ = generate_matrix(df_tr, 'train')
val_mat_ = generate_matrix(df_val, 'val')
Error:
IndexError: column index (12) out of range
Dataframe:
#df
purchaseid itemid
0 0 3
1 0 8
2 0 2
3 1 10
4 2 3
5 2 10
6 3 4
7 3 12
8 3 3
9 4 12
10 4 3
11 4 4
12 5 8
13 5 6
14 5 3
15 6 0
16 6 5
17 6 12
18 7 9
19 7 9
# df_tr
purchaseid itemid
0 0 3
1 0 8
2 0 2
3 1 10
4 2 3
5 2 10
6 3 4
7 3 12
8 3 3
9 4 12
10 4 3
11 4 4
12 5 8
13 5 6
14 5 3
18 7 9
19 7 9
20 8 13
# df_val
purchaseid itemid
15 6 0
16 6 5
17 6 12
21 9 1
22 9 7
23 9 11
24 9 11
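Try this instead. sp.dok_matrix needs the shape of the target matrix, and that shape must cover the largest row and column index you assign to. Your version sizes the matrix by the number of rows and the number of unique itemid values, so an itemid such as 12 falls outside a matrix with fewer than 13 columns. Looking at your data, I have assumed that purchaseid lies within [0, max(purchaseid)] and itemid within [0, max(itemid)].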
def generate_matrix(dataframe, name):
    # Size the matrix by the largest index on each axis, not by the row count.
    mat = sp.dok_matrix((dataframe['purchaseid'].max() + 1, dataframe['itemid'].max() + 1), dtype=np.float32)
    for purchaseid, itemid in zip(dataframe['purchaseid'], dataframe['itemid']):
        mat[purchaseid, itemid] = 1.0
    return mat
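One possible refinement, sketched here under assumptions: if the training and validation matrices should have the same dimensions, the shape can be computed once from the full frame and passed in. The shape parameter and the fixed val_ids list (standing in for random.sample so the run is repeatable) are hypothetical and not part of the original code.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

d = {'purchaseid': [0, 0, 0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9, 9, 9, 9],
     'itemid': [3, 8, 2, 10, 3, 10, 4, 12, 3, 12, 3, 4, 8, 6, 3, 0, 5, 12, 9, 9, 13, 1, 7, 11, 11]}
df = pd.DataFrame(data=d)

# Shape taken from the full data, so every split fits into it.
shape = (int(df['purchaseid'].max()) + 1, int(df['itemid'].max()) + 1)

def generate_matrix(dataframe, shape):
    # Hypothetical variant: the caller supplies the shared shape.
    mat = sp.dok_matrix(shape, dtype=np.float32)
    for purchaseid, itemid in zip(dataframe['purchaseid'], dataframe['itemid']):
        mat[purchaseid, itemid] = 1.0
    return mat

val_ids = [6, 9]  # fixed here instead of random.sample, for a repeatable run
in_val = df['purchaseid'].isin(val_ids)
train_mat = generate_matrix(df.loc[~in_val], shape)
val_mat = generate_matrix(df.loc[in_val], shape)
print(train_mat.shape, val_mat.shape)
```

Both matrices then share row and column indices, which is usually what a model expects when it sees the two sets.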
Suppose one has a dataframe created as such:
tdata = {('A', 50): [1, 2, 3, 4],
('A', 55): [5, 6, 7, 8],
('B', 10): [10, 20, 30, 40],
('B', 20): [50, 60, 70, 80],
('B', 50): [2, 4, 6, 8],
('B', 55): [10, 12, 14, 16]}
tdf = pd.DataFrame(tdata, index=range(0,4))
A B
50 55 10 20 50 55
0 1 5 10 50 2 10
1 2 6 20 60 4 12
2 3 7 30 70 6 14
3 4 8 40 80 8 16
How would one drop specific columns, say ('B', 10) and ('B', 20) from the dataframe?
Is there a way to drop the columns in one command such as tdf.drop(['B', [10,20]])? Note, I know that my example of the command is by no means close to what it should be, but I hope that it gets the gist across.
Is there a way to drop the columns through some logical expression? For example, say I want to drop columns having the sublevel indices less than 50 (again, the 10, 20 columns). Can I do some general command that would encompass column 'A', even though the 10,20 sublevel indices don't exist or must I specifically reference column 'B'?
You can use drop by list of tuples:
print (tdf.drop([('B',10), ('B',20)], axis=1))
A B
50 55 50 55
0 1 5 2 10
1 2 6 4 12
2 3 7 6 14
3 4 8 8 16
To remove columns by a condition on one level, build a boolean mask from that level:
mask = tdf.columns.get_level_values(1) >= 50
print (mask)
[ True True False False True True]
print (tdf.loc[:, mask])
A B
50 55 50 55
0 1 5 2 10
1 2 6 4 12
2 3 7 6 14
3 4 8 8 16
If you need to remove columns by their labels on one level, you can pass only that level:
print (tdf.drop([50,55], axis=1, level=1))
B
10 20
0 10 50
1 20 60
2 30 70
3 40 80
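The mask approach also extends to conditions on both levels at once. The sketch below drops only the 'B' columns whose sublevel is below 50. The names top, sub and to_drop are illustrative, and the condition itself is an assumption about what "drop through a logical expression" might look like.

```python
import pandas as pd

tdata = {('A', 50): [1, 2, 3, 4],
         ('A', 55): [5, 6, 7, 8],
         ('B', 10): [10, 20, 30, 40],
         ('B', 20): [50, 60, 70, 80],
         ('B', 50): [2, 4, 6, 8],
         ('B', 55): [10, 12, 14, 16]}
tdf = pd.DataFrame(tdata, index=range(4))

# One array of labels per column level.
top = tdf.columns.get_level_values(0)
sub = tdf.columns.get_level_values(1)

# True for the columns to remove: top level 'B' and sublevel below 50.
to_drop = (top == 'B') & (sub < 50)

# Keep everything else by inverting the mask ...
result = tdf.loc[:, ~to_drop]

# ... or hand the selected column labels to drop.
same = tdf.drop(columns=tdf.columns[to_drop])
print(result)
```

Leaving out the top == 'B' part applies the sublevel condition to every top-level column, including 'A', whether or not matching sublevels exist there.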