Pandas: How to structure data in more than two dimensions? - python

I have a dataframe of quantities for each of a number of products (below labeled 'A', 'B', 'C', etc.) for each month...
import pandas as pd
import numpy as np
np.random.seed(0)
range = pd.date_range('2020-01-31', periods=12, freq='M')  # note: 'range' shadows the built-in; a name like 'dates' would be safer
column_names = list('ABCDEFGH')
quantities = pd.DataFrame(np.random.randint(0,100,size=(12, 8)), index=range, columns=column_names)
quantities
# Output
# A B C D E F G H
# 2020-01-31 44 47 64 67 67 9 83 21
# 2020-02-29 36 87 70 88 88 12 58 65
# 2020-03-31 39 87 46 88 81 37 25 77
# 2020-04-30 72 9 20 80 69 79 47 64
# 2020-05-31 82 99 88 49 29 19 19 14
# 2020-06-30 39 32 65 9 57 32 31 74
# 2020-07-31 23 35 75 55 28 34 0 0
# 2020-08-31 36 53 5 38 17 79 4 42
# 2020-09-30 58 31 1 65 41 57 35 11
# 2020-10-31 46 82 91 0 14 99 53 12
# 2020-11-30 42 84 75 68 6 68 47 3
# 2020-12-31 76 52 78 15 20 99 58 23
I also have a dataframe of the unit costs for each product for each month. And from these, I have calculated a third dataframe of the costs (quantity x unit cost) for each product for each month.
unit_costs = pd.DataFrame(np.random.rand(12, 8), index=range, columns=column_names)
costs = quantities*unit_costs
The code below produces a dataframe of the bill for the first month (bill0)...
bill0 = pd.DataFrame({'quantity': quantities.iloc[0],'unit_cost': unit_costs.iloc[0],'cost': costs.iloc[0]})
bill0
# Output
# quantity unit_cost cost
# A 44 0.338008 14.872335
# B 47 0.674752 31.713359
# C 64 0.317202 20.300911
# D 67 0.778345 52.149147
# E 67 0.949571 63.621261
# F 9 0.662527 5.962742
# G 83 0.013572 1.126446
# H 21 0.622846 13.079768
I would like to efficiently produce a dataframe of the bill for any specific month. It seems that a 3D data structure is required, and I'm too new to python to know how to approach it.
Perhaps an array of bill dataframes - one for each month? (If so, how?)
Or perhaps the quantities, unit_costs, and costs dataframes should first be combined into a multi-indexed dataframe, and then that could be filtered (or otherwise manipulated) to produce the bill dataframe for whichever month I'm after? (If so, how?)
Or is there a more elegant way of doing this?
Thanks so much for your time!

IIUC, you can use MultiIndex column headers:
pd.concat(
    [quantities, unit_costs, costs], keys=["Quantity", "Unit Cost", "Cost"], axis=1
).swaplevel(0, 1, axis=1).sort_index(level=0, axis=1)
Output (only products A and B shown; the full dataframe has all products):
A B
Cost Quantity Unit Cost Cost Quantity Unit Cost
2020-01-31 14.872335 44 0.338008 31.713359 47 0.674752
2020-02-29 24.251747 36 0.673660 84.559215 87 0.971945
2020-03-31 38.203882 39 0.979587 31.271668 87 0.359444
2020-04-30 62.287384 72 0.865103 4.580721 9 0.508969
2020-05-31 53.068279 82 0.647174 83.297226 99 0.841386
2020-06-30 22.215118 39 0.569618 22.519593 32 0.703737
2020-07-31 20.505752 23 0.891554 23.801945 35 0.680056
2020-08-31 8.992666 36 0.249796 16.600572 53 0.313218
2020-09-30 35.890897 58 0.618809 14.720893 31 0.474868
2020-10-31 4.731714 46 0.102863 7.574659 82 0.092374
2020-11-30 5.933084 42 0.141264 8.169834 84 0.097260
2020-12-31 35.662937 76 0.469249 43.739287 52 0.841140
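To then pull out the bill for any single month, you can select that row and unstack it, which reproduces the layout of bill0. A minimal sketch (using month-label strings for the index rather than the original date_range, and lowercase keys, purely for brevity):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
months = [f'2020-{m:02d}' for m in range(1, 13)]
cols = list('ABCDEFGH')
quantities = pd.DataFrame(np.random.randint(0, 100, size=(12, 8)), index=months, columns=cols)
unit_costs = pd.DataFrame(np.random.rand(12, 8), index=months, columns=cols)
costs = quantities * unit_costs

# One frame, with measure and product on a two-level column index
combined = pd.concat([quantities, unit_costs, costs],
                     keys=['quantity', 'unit_cost', 'cost'], axis=1)

# Bill for any one month: select the row, then unstack measure vs product
bill = combined.loc['2020-01'].unstack(level=0)
```

`bill` is an 8x3 dataframe with one row per product and one column per measure, just like bill0; `combined.xs('2020-06')` followed by the same `unstack` gives any other month.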

Related

How to create a column that contains the penultimate value of each row?

I have a DataFrame and I need to create a new column which contains the second largest value of each row in the original Dataframe.
Sample:
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
Desired output:
0 1 2 3 4 5 6 7 8 9 penultimate
0 52 69 62 7 20 69 38 10 57 17 62
1 52 94 49 63 1 90 14 76 20 84 90
2 78 37 58 7 27 41 27 26 48 51 58
3 6 39 99 36 62 90 47 25 60 84 90
4 37 36 91 93 76 69 86 95 69 6 93
5 5 54 73 61 22 29 99 27 46 24 73
6 71 65 45 9 63 46 4 93 36 18 71
7 85 7 76 46 65 97 64 52 28 80 85
How can this be done in as little code as possible?
You could use NumPy for this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1, 100, 80).reshape(8, -1))
df['penultimate'] = np.sort(df.values, 1)[:, -2]
print(df)
Using NumPy is faster.
Here is a simple lambda function! Note that .unique()[-2] picks the second-largest distinct value, which matches the desired output above but differs from np.sort(...)[:, -2] when a row's maximum is repeated.
# Input
df = pd.DataFrame(np.random.randint(1, 100, 80).reshape(8, -1))
# Output
out = df.apply(lambda x: x.sort_values().unique()[-2], axis=1)
df['penultimate'] = out
print(df)
Cheers!
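A vectorized middle ground, assuming the "second largest, duplicates included" reading (the same one the np.sort answer uses): np.partition moves the two largest values of each row into the last two slots without fully sorting, which scales better on wide frames.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(1, 100, 80).reshape(8, -1))

# Partition each row so the two largest values land in the final two
# positions, then take the second-to-last slot.
values = df.to_numpy()
df['penultimate'] = np.partition(values, -2, axis=1)[:, -2]
```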

Shuffle rows in pandas dataframe, keeping duplicates together

I have a data like this:
A B C D E F
35 1 2 35 25 65
40 5 7 47 57 67
20 1 8 74 58 63
35 1 2 37 28 69
40 5 7 49 58 69
20 1 8 74 58 63
35 1 2 47 29 79
40 5 7 55 77 87
20 1 8 74 58 63
Here we can see that Columns A,B and C have replicas that are repeated in various rows. I want to shuffle all the rows and have the replicas in consecutive rows, without deleting any of them. The output should look like this:
A B C D E F
35 1 2 35 25 65
35 1 2 37 28 69
35 1 2 47 29 79
40 5 7 47 57 67
40 5 7 49 58 69
40 5 7 55 77 87
20 1 8 74 58 63
20 1 8 74 58 63
20 1 8 74 58 63
When I use pandas.DataFrame.duplicated, it can give me duplicated rows. How can I keep all the identical rows using groupby?
Here is code that achieves the result you asked for (which doesn't require either explicit shuffling or sorting, but merely grouping your existing df by columns A,B,C):
df_shuf = pd.concat(group for _, group in df.groupby(['A', 'B', 'C'], sort=False))
print(df_shuf.to_string(index=False))
A B C D E F
35 1 2 35 25 65
35 1 2 37 28 69
35 1 2 47 29 79
40 5 7 47 57 67
40 5 7 49 58 69
40 5 7 55 77 87
20 1 8 74 58 63
20 1 8 74 58 63
20 1 8 74 58 63
Notes:
I couldn't figure out how to do df.reindex in-place on the grouped object. But we can get by without it.
You don't need pandas.DataFrame.duplicated, since df.groupby(['A','B','C']) already puts all duplicates in the same group.
df.groupby(..., sort=False) is faster; use it whenever you don't need the groups sorted by key.
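If you additionally want the group order randomized (the question does say "shuffle"), a sketch on a cut-down version of the sample data: collect the groups into a list, shuffle the list, and concatenate, so replicas stay together but the blocks come out in random order.

```python
import random
import pandas as pd

df = pd.DataFrame({
    'A': [35, 40, 20, 35, 40, 20, 35, 40, 20],
    'B': [1, 5, 1, 1, 5, 1, 1, 5, 1],
    'C': [2, 7, 8, 2, 7, 8, 2, 7, 8],
    'D': [35, 47, 74, 37, 49, 74, 47, 55, 74],
})

# Each group holds all rows sharing one (A, B, C) combination.
groups = [g for _, g in df.groupby(['A', 'B', 'C'], sort=False)]
random.seed(0)
random.shuffle(groups)  # randomize block order, keep replicas together
df_shuf = pd.concat(groups, ignore_index=True)
```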

how to multiply multiple columns by another column pandas

I have a Dataframe of 100 columns, and I want to multiply one column's value ('Count') into the columns at positions 6 to 74 (inclusive). Please tell me how to do that. I have been trying
df = df.ix[0, 6:74].multiply(df["Count"], axis="index")
df = df[df.columns[6:74]]*df["Count"]
Neither of them works.
The resulting Dataframe should keep all 100 original columns, with columns 6 to 74 holding the multiplied values in every row.
Assuming the same dataframe provided by @MaxU:
Not simpler, but a different perspective using other API elements: pd.DataFrame.update and pd.DataFrame.mul.
df.update(df.iloc[:, 3:7].mul(df.Count, 0))
df
0 1 2 3 4 5 6 7 8 9 Count
0 89 38 89 15.366436 1.355862 7.231264 4.971494 12 70 69 0.225977
1 49 1 38 1.004190 1.095480 2.829990 0.273870 57 93 64 0.030430
2 2 53 49 49.749460 50.379200 54.157640 16.373240 22 31 41 0.629740
3 38 44 23 28.437516 73.545300 41.185368 73.545300 19 99 57 0.980604
4 45 2 60 10.093230 4.773825 10.502415 6.274170 43 63 55 0.136395
5 65 97 15 10.375760 57.066680 38.260615 14.915155 68 5 21 0.648485
6 95 90 45 52.776000 16.888320 22.517760 50.664960 76 32 75 0.703680
7 60 31 65 63.242210 2.976104 26.784936 38.689352 72 73 94 0.744026
8 64 96 96 7.505370 37.526850 11.007876 10.007160 68 56 39 0.500358
9 78 54 74 8.409275 25.227825 16.528575 9.569175 97 63 37 0.289975
Demo:
Sample DF:
In [6]: df = pd.DataFrame(np.random.randint(100,size=(10,10))) \
.assign(Count=np.random.rand(10))
In [7]: df
Out[7]:
0 1 2 3 4 5 6 7 8 9 Count
0 89 38 89 68 6 32 22 12 70 69 0.225977
1 49 1 38 33 36 93 9 57 93 64 0.030430
2 2 53 49 79 80 86 26 22 31 41 0.629740
3 38 44 23 29 75 42 75 19 99 57 0.980604
4 45 2 60 74 35 77 46 43 63 55 0.136395
5 65 97 15 16 88 59 23 68 5 21 0.648485
6 95 90 45 75 24 32 72 76 32 75 0.703680
7 60 31 65 85 4 36 52 72 73 94 0.744026
8 64 96 96 15 75 22 20 68 56 39 0.500358
9 78 54 74 29 87 57 33 97 63 37 0.289975
Let's multiply columns 3-6 by df['Count']:
In [8]: df.iloc[:, 3:6+1]
Out[8]:
3 4 5 6
0 68 6 32 22
1 33 36 93 9
2 79 80 86 26
3 29 75 42 75
4 74 35 77 46
5 16 88 59 23
6 75 24 32 72
7 85 4 36 52
8 15 75 22 20
9 29 87 57 33
In [9]: df.iloc[:, 3:6+1] = df.iloc[:, 3:6+1].mul(df['Count'], axis=0)
In [10]: df
Out[10]:
0 1 2 3 4 5 6 7 8 9 Count
0 89 38 89 15.366436 1.355862 7.231264 4.971494 12 70 69 0.225977
1 49 1 38 1.004190 1.095480 2.829990 0.273870 57 93 64 0.030430
2 2 53 49 49.749460 50.379200 54.157640 16.373240 22 31 41 0.629740
3 38 44 23 28.437516 73.545300 41.185368 73.545300 19 99 57 0.980604
4 45 2 60 10.093230 4.773825 10.502415 6.274170 43 63 55 0.136395
5 65 97 15 10.375760 57.066680 38.260615 14.915155 68 5 21 0.648485
6 95 90 45 52.776000 16.888320 22.517760 50.664960 76 32 75 0.703680
7 60 31 65 63.242210 2.976104 26.784936 38.689352 72 73 94 0.744026
8 64 96 96 7.505370 37.526850 11.007876 10.007160 68 56 39 0.500358
9 78 54 74 8.409275 25.227825 16.528575 9.569175 97 63 37 0.289975
The easiest thing to do here would be to extract the values, multiply with broadcasting, and assign back:
u = df.iloc[:, 6:75].to_numpy()
v = df[['Count']].to_numpy()  # shape (n, 1), broadcasts across the columns
df.iloc[:, 6:75] = u * v
By using combine_first
df.iloc[:, 3:6+1].mul(df['Count'],axis=0).combine_first(df)
You need to concatenate the multiplied columns with the remaining columns, selecting columns (not rows) with the second axis of .iloc:
df = pd.concat([df.iloc[:, 0:6], df.iloc[:, 6:74+1].multiply(df['Count'], axis=0), df.iloc[:, 75:]], axis=1)
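For completeness, the most direct form multiplies the positional slice in place with .mul(..., axis=0), leaving every other column untouched. A sketch on a made-up 100-column frame (the asker's real column names are unknown); note the slice end is 75 because iloc is end-exclusive:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
# 99 data columns plus 'Count' gives the 100-column shape from the question;
# float dtype up front so the in-place write keeps dtypes consistent
df = pd.DataFrame(np.random.randint(100, size=(10, 99)).astype(float))
df['Count'] = np.random.rand(10)

# Row-wise: multiply the columns at positions 6..74 by each row's Count
df.iloc[:, 6:75] = df.iloc[:, 6:75].mul(df['Count'], axis=0)
```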

How to delete first value in the last column of a pandas dataframe and then delete the remaining last row?

Below I am using pandas to read my csv file in the following format:
dataframe = pandas.read_csv("test.csv", header=None, usecols=range(2,62), skiprows=1)
dataset = dataframe.values
How can I delete the first value in the very last column in the dataframe and then delete the last row in the dataframe?
Any ideas?
You can shift the last column up to get rid of the first value, then drop the last row.
df.assign(E=df.E.shift(-1)).drop(df.index[-1])
MVCE:
import numpy as np
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0, 100, (10, 5)), columns=list('ABCDE'))
Output:
A B C D E
0 91 83 40 17 94
1 61 5 43 87 48
2 3 69 73 15 85
3 99 53 18 95 45
4 67 30 69 91 28
5 25 89 14 39 64
6 54 99 49 44 73
7 70 41 96 51 68
8 36 3 15 94 61
9 51 4 31 39 0
df.assign(E=df.E.shift(-1)).drop(df.index[-1]).astype(int)
Output:
A B C D E
0 91 83 40 17 48
1 61 5 43 87 85
2 3 69 73 15 45
3 99 53 18 95 28
4 67 30 69 91 64
5 25 89 14 39 73
6 54 99 49 44 68
7 70 41 96 51 61
8 36 3 15 94 0
or in two steps:
df[df.columns[-1]] = df[df.columns[-1]].shift(-1)
df = df[:-1]
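The same two-step idea works without hardcoding the column name, since df.columns[-1] always names the last column. A sketch on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(4, 5), columns=list('ABCDE'))

# Shift the last column up one row (discarding its first value),
# then drop the now-incomplete final row.
last = df.columns[-1]
df[last] = df[last].shift(-1)
df = df.iloc[:-1]
```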

Object Similarity Pandas and Scikit Learn

Is there a way to find and rank rows in a Pandas Dataframe by their similarity to a row from another Dataframe?
My understanding of your question: you have two data frames, hopefully with the same column count. You want to rate the members of the first (subject) data frame by how close, i.e. similar, they are to any member of the target data frame.
I am not aware of a built-in method.
It is probably not the most efficient way, but here is how I'd approach it:
#!/usr/bin/python3
import pandas as pd
import numpy as np
import pprint

pp = pprint.PrettyPrinter(indent=4)

# Simulate data
df_subject = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))  # the one we iterate over to check similarity to target
df_target = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))   # the one we measure distance to

# This will hold the min distances.
distances = []

# Iterate over the subject DF
for ix1, subject in df_subject.iterrows():
    distances_cur = []
    # Iterate over the target DF
    for ix2, target in df_target.iterrows():
        distances_cur.append(np.linalg.norm(target - subject))
    # Get the minimum distance for this subject member.
    distances.append(min(distances_cur))

# Distances to df
distances = pd.DataFrame(distances)
# Normalize.
distances = 0.5 - (distances - distances.mean(axis=0)) / distances.max(axis=0)
# Column renaming, joining and ordering.
Proximity_Ratings_name = 'Proximity Ratings'
distances = distances.rename(columns={0: Proximity_Ratings_name})
df_subject = df_subject.join(distances)
pp.pprint(df_subject.sort_values(Proximity_Ratings_name, ascending=False))
It should yield something like the table below. A higher rating means there is a similar member in the target data frame:
A B C D Proximity Ratings
55 86 21 91 78 0.941537
38 91 31 35 95 0.901638
43 49 89 49 6 0.878030
98 28 98 98 36 0.813685
77 67 23 78 84 0.809324
35 52 16 36 58 0.802223
54 2 25 61 44 0.788591
95 76 3 60 46 0.766896
5 55 39 88 37 0.756049
52 79 71 90 70 0.752520
66 52 27 82 82 0.751353
41 45 67 55 33 0.739919
76 12 93 50 62 0.720323
94 99 84 39 63 0.716123
26 62 6 97 60 0.715081
40 64 50 37 27 0.714042
68 70 21 8 82 0.698824
47 90 54 60 65 0.676680
7 85 95 45 71 0.672036
2 14 68 50 6 0.661113
34 62 63 83 29 0.659322
8 87 90 28 74 0.647873
75 14 61 27 68 0.633370
60 9 91 42 40 0.630030
4 46 46 52 35 0.621792
81 94 19 82 44 0.614510
73 67 27 34 92 0.608137
30 92 64 93 51 0.608137
11 52 25 93 50 0.605770
51 17 48 57 52 0.604984
.. .. .. .. .. ...
64 28 56 0 9 0.397054
18 52 84 36 79 0.396518
99 41 5 32 34 0.388519
27 19 54 43 94 0.382714
92 69 56 73 93 0.382714
59 1 29 46 16 0.374878
58 2 36 8 96 0.362525
69 58 92 16 48 0.361505
31 27 57 80 35 0.349887
10 59 23 47 24 0.345891
96 41 77 76 33 0.345891
78 42 71 87 65 0.344398
93 12 31 6 27 0.329152
23 6 5 10 42 0.320445
14 44 6 43 29 0.319964
6 81 51 44 15 0.311840
3 17 60 13 22 0.293066
70 28 40 22 82 0.251549
36 95 72 35 5 0.249354
49 78 10 30 18 0.242370
17 79 69 57 96 0.225168
46 42 95 86 81 0.224742
84 58 81 59 86 0.221346
9 9 62 8 30 0.211659
72 11 51 74 8 0.159265
90 74 26 80 1 0.138993
20 90 4 6 5 0.117652
50 3 12 5 53 0.077088
42 90 76 42 1 0.075284
45 94 46 88 14 0.054244
Hope I understood correctly. Don't use this if performance matters; I'm sure there's an algebraic way to approach this (matrix multiplication) that would run much faster.
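The "algebraic way" hinted at above can be sketched with NumPy broadcasting: compute all pairwise Euclidean distances in one shot and take the row-wise minimum (scipy.spatial.distance.cdist would do the same; this sticks to NumPy):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df_subject = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
df_target = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

s = df_subject.to_numpy(dtype=float)[:, None, :]  # (n_subject, 1, n_cols)
t = df_target.to_numpy(dtype=float)[None, :, :]   # (1, n_target, n_cols)

# Pairwise distances, then the closest target per subject row
min_dist = np.sqrt(((s - t) ** 2).sum(axis=2)).min(axis=1)
df_subject['Min Distance'] = min_dist
```

This replaces the double iterrows loop with one broadcasted expression; the same normalization as above can then be applied to min_dist to get the proximity ratings.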
