I have a matrix with three columns (origin, destination, distance) and I want to convert it to an origin/destination matrix with pandas. Is there a fast way (lambda, map) to do this without for loops?
For example (what I have):
a b 10
a c 20
b c 30
What I need:
a b c
a 0 10 20
b 10 0 30
c 20 30 0
Here is an option: first duplicate the data frame information by concatenating the original values with the origin/destination-swapped values, then pivot:
pd.DataFrame(np.concatenate([df.values, df.values[:, [1, 0, 2]]])).pivot(0, 1, 2).fillna(0)
(np here is numpy; the old pd.np alias was removed in pandas 1.0.)
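A self-contained sketch of this approach, with named columns for readability (the column names are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

# the three-column input: origin, destination, distance
df = pd.DataFrame([["a", "b", 10], ["a", "c", 20], ["b", "c", 30]],
                  columns=["origin", "dest", "dist"])

# duplicate each row with origin and destination swapped, then pivot;
# the missing diagonal (a->a, etc.) becomes NaN and is filled with 0
both = pd.DataFrame(np.concatenate([df.values, df.values[:, [1, 0, 2]]]),
                    columns=["origin", "dest", "dist"])
matrix = both.pivot(index="origin", columns="dest", values="dist").fillna(0)
print(matrix)
```

The swap `df.values[:, [1, 0, 2]]` simply reorders the columns to (destination, origin, distance), which makes the matrix symmetric.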
I would like to sort a pandas dataframe by the rows which have the most even distribution but also high values. For example:
Row Attribute1 Attribute2 Attribute3
a 1 1 108
b 10 2 145
c 50 60 55
d 100 90 120
e 20 25 23
f 1000 30 0
Rows d and c should rank the highest, ideally d followed by c.
I considered using standard deviation to identify the most even distribution and then mean to get the highest average values but I'm unsure as to how I can combine these together.
As the notion of "even distribution" you mention is quite subjective, here is a way to implement the coefficient of variation mentioned by @ALollz:
df.std(axis=1) / df.mean(axis=1)
Row 0
a 1.6848130582715446
b 1.535375387727906
c 0.09090909090909091
d 0.14782502241793033
e 0.11102697698927574
f 1.6569547684031352
This metric is the standard deviation expressed as a fraction of the mean. If a row has a mean of 10 and a standard deviation of 1, the ratio is 10%, or 0.1.
In this example, the row that could be considered most 'evenly distributed' is row c: its mean is 55 and its standard deviation is 5, so the ratio is about 9%.
This way, you can have a decent overview of the homogeneity of the distribution.
If you want the ranking, you can apply .sort_values:
(df.std(axis=1) / df.mean(axis=1)).sort_values()
Row 0
c 0.09090909090909091
e 0.11102697698927574
d 0.14782502241793033
b 1.535375387727906
f 1.6569547684031352
a 1.6848130582715446
A final word of caution: don't rely on visual perception alone, as our brains are easily tricked by statistics.
Now, if you also want to favour rows with higher values, you can divide this coefficient by the mean once more: the higher the mean, the lower the resulting score.
(df.std(axis=1) / df.mean(axis=1)**2).sort_values()
Row 0
d 0.0014305647330767452
c 0.001652892561983471
f 0.004826081849717869
e 0.004898248984820989
b 0.029338383204991835
a 0.045949447043769395
And now we obtain the desired ranking: d first, then c, f, e, b and a.
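The whole ranking can be reproduced with this short sketch (same data as the question; row labels used as the index):

```python
import pandas as pd

df = pd.DataFrame(
    {"Attribute1": [1, 10, 50, 100, 20, 1000],
     "Attribute2": [1, 2, 60, 90, 25, 30],
     "Attribute3": [108, 145, 55, 120, 23, 0]},
    index=list("abcdef"))

# coefficient of variation: std as a fraction of the mean (low = even)
cv = df.std(axis=1) / df.mean(axis=1)

# divide by the mean once more to also reward high-valued rows
score = (df.std(axis=1) / df.mean(axis=1) ** 2).sort_values()
print(score)  # d ranks first, then c
```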
I have two dataframes and I want to find the difference between dataframe1 and dataframe2 based on the condition. What I mean is the following:
df_ref_well:
zone depth
a 34
b 23
c 11
d 35
e -9999
df_well
zone depth
a 17
c 15
d 25
f 11
What I want is to generate df3 with the zone name and the difference between the depths of matching zones in df_ref_well and df_well:
df3 = ref_well - well (for the same zones)
zone depth
a 17
b -9999
c -4
d 10
e -9999
I have tried iterating through both data frames separately and, where the zones match, computing the difference:
ref_well_zone_count = len(df_ref_well.iloc[:, 0])
well_zone_count = len(df_well.iloc[:, 0])
delta_depth = []
for ref_zone in range(ref_well_zone_count):
    for well_zone in range(well_zone_count):
        if df_ref_well.iloc[ref_zone, 0] == df_well.iloc[well_zone, 0]:
            delta_depth.append(df_well.iloc[well_zone, 1] - df_ref_well.iloc[ref_zone, 1])
The problem is that I can't insert the results as a new column; when I try adding delta_depth as a column it raises:
ValueError: Length of values does not match length of index
but if I print the results, the differences themselves are calculated correctly.
You didn't specify what should happen when there is no match, so I will assume no match means depth = 0.
Link the two data frames together using merge; zones without a match are then filled with 0:
df3 = pd.merge(ref_well,df_well, on=['zone'], how='outer').fillna(0)
Calculate the difference and put it back
df3['diff'] = df3.depth_x - df3.depth_y
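Putting the two steps together in a runnable sketch (the `depth_x`/`depth_y` names come from merge's default column suffixes; the frame names mirror the question):

```python
import pandas as pd

ref_well = pd.DataFrame({"zone": list("abcde"),
                         "depth": [34, 23, 11, 35, -9999]})
df_well = pd.DataFrame({"zone": list("acdf"),
                        "depth": [17, 15, 25, 11]})

# outer merge keeps zones that appear in only one frame;
# fillna(0) treats a missing depth as 0
df3 = pd.merge(ref_well, df_well, on=["zone"], how="outer").fillna(0)
df3["diff"] = df3.depth_x - df3.depth_y
print(df3[["zone", "diff"]])
```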
I have 4 columns, where the 4th column is the sum of the first 3.
It looks like this:
A B C D
4 3 3 10
I want to convert the above into this:
E F G H
40% 30% 30% 100%
How can I do this in Python?
You can use numpy arrays for this operation. Let's consider that all the columns are represented by vectors; then the first 3 columns are represented by 3 vectors:
A = numpy.array([4])
B = numpy.array([3])
C = numpy.array([3])
Then you can add them as normal vectors (in your case columns)
D = A + B + C
All the numerical values will work as you expect, but to my knowledge the problem is the letters: those can't be added the way you show. If we consider
A = 1
B = 2
C = 3
then the answer would be 10, or J, not D; the same goes for the second set.
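For the percentage conversion the question actually asks about, here is a short pandas sketch (the column names E–H are just chosen to mirror the question's expected output):

```python
import pandas as pd

df = pd.DataFrame({"A": [4], "B": [3], "C": [3]})
df["D"] = df.sum(axis=1)            # D = A + B + C

# express each column as a percentage of the row total
pct = df.div(df["D"], axis=0) * 100
pct.columns = list("EFGH")
print(pct)  # E=40.0, F=30.0, G=30.0, H=100.0
```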
Say that I have a dataframe (df) with lots of values, including two columns, X and Y. I want to create a stacked histogram where each bin is a categorical value in X (say A and B), and inside each bin are stacks by values in Y (say a, b, c, ...).
I can run df.groupby(["X","Y"]).size() to get output like below, but how can I make the stacked histogram from this?
A a 14
b 41
c 4
d 2
e 2
f 15
g 1
h 3
B a 18
b 37
c 1
d 3
e 1
f 17
g 2
So, I think I figured this out. First one needs to unstack the data using .unstack(level=-1).
This turns it into an n-by-m table, where n is the number of X entries and m is the number of Y entries. From this form you can follow the outline given here:
http://pandas.pydata.org/pandas-docs/stable/visualization.html
So in total the command will be:
df.groupby(["X","Y"]).size().unstack(level=-1).plot(kind='bar',stacked=True)
Kinda unwieldy looking though!
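A self-contained sketch with small synthetic data (the Agg backend is used so it also runs headless; assumes matplotlib is installed):

```python
import matplotlib
matplotlib.use("Agg")               # non-interactive backend, no display needed
import pandas as pd

# synthetic data: categorical X with sub-categories Y
df = pd.DataFrame({"X": ["A", "A", "A", "B", "B"],
                   "Y": ["a", "a", "b", "a", "b"]})

# rows = X bins, columns = Y stacks; missing combinations become 0
counts = df.groupby(["X", "Y"]).size().unstack(level=-1).fillna(0)
ax = counts.plot(kind="bar", stacked=True)
ax.figure.savefig("stacked.png")
```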
I have an almost embarrassingly simple question, which I cannot figure out for myself.
Here's a toy example to demonstrate what I want to do, suppose I have this simple data frame:
df = pd.DataFrame([[1,2,3,4,5,6],[7,8,9,10,11,12]],index=range(2),columns=list('abcdef'))
a b c d e f
0 1 2 3 4 5 6
1 7 8 9 10 11 12
What I want is to stack it so that it takes the following form, where the columns identifiers have been changed (to X and Y) so that they are the same for all re-stacked values:
X Y
0 1 2
3 4
5 6
1 7 8
9 10
11 12
I am pretty sure you can do it with pd.stack() or pd.pivot_table(), but even after reading the documentation I cannot figure out how. Instead of appending all columns to the end of the next, I just want to stack pairs (or actually triplets) of values from each row.
Just to add some more flesh to the bones of what I want to do;
df = pd.DataFrame(np.random.randn(3,6),index=range(3),columns=list('abcdef'))
a b c d e f
0 -0.168636 -1.878447 -0.985152 -0.101049 1.244617 1.256772
1 0.395110 -0.237559 0.034890 -1.244669 -0.721756 0.473696
2 -0.973043 1.784627 0.601250 -1.718324 0.145479 -0.099530
I want this to re-stacked into this form (where column labels have been changed again, to the same for all values):
X Y Z
0 -0.168636 -1.878447 -0.985152
-0.101049 1.244617 1.256772
1 0.395110 -0.237559 0.034890
-1.244669 -0.721756 0.473696
2 -0.973043 1.784627 0.601250
-1.718324 0.145479 -0.099530
Yes, one could just make a for-loop applying the following logic to each row:
row.reshape(df.shape[1] // 2, 2)
But then you would have to process each row individually, and my actual data has tens of thousands of rows.
So I want to stack each individual row selectively (e.g. by pairs of values or triplets), and then stack that row-stack, for the entire data frame, basically. Preferably done on the entire data frame at once (if possible).
Apologies for such a trivial question.
Use numpy.reshape to reshape the underlying data in the DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3,6),index=range(3),columns=list('abcdef'))
print(df)
# a b c d e f
# 0 -0.889810 1.348811 -1.071198 0.091841 -0.781704 -1.672864
# 1 0.398858 0.004976 1.280942 1.185749 1.260551 0.858973
# 2 1.279742 0.946470 -1.122450 -0.355737 1.457966 0.034319
result = pd.DataFrame(df.values.reshape(-1, 3),
                      index=df.index.repeat(2), columns=list('XYZ'))
print(result)
yields
X Y Z
0 -0.889810 1.348811 -1.071198
0 0.091841 -0.781704 -1.672864
1 0.398858 0.004976 1.280942
1 1.185749 1.260551 0.858973
2 1.279742 0.946470 -1.122450
2 -0.355737 1.457966 0.034319
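The same pattern handles the two-column (X, Y) case from the first example: reshape into rows of length 2 and repeat each index label three times (deterministic values here, just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 13).reshape(2, 6),
                  index=range(2), columns=list("abcdef"))

# each row of 6 values becomes 3 rows of 2, keeping the original index label
pairs = pd.DataFrame(df.values.reshape(-1, 2),
                     index=df.index.repeat(df.shape[1] // 2),
                     columns=list("XY"))
print(pairs)
```

In general, splitting rows of width n into chunks of k is `reshape(-1, k)` with `index.repeat(n // k)`, as long as k divides n.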