I have a matrix with three columns (origin, destination, distance) and I want to convert it to an origin/destination matrix with pandas. Is there a fast way (lambda, map) to do this without for loops?
For example (what I have):
a b 10
a c 20
b c 30
What I need:
a b c
a 0 10 20
b 10 0 30
c 20 30 0
Here is an option: first duplicate the data frame information by concatenating the original values with the origin/destination-swapped values, then pivot:
pd.DataFrame(np.concatenate([df.values, df.values[:, [1, 0, 2]]])).pivot(0, 1, 2).fillna(0)
(np here is numpy; the old pd.np alias was removed in pandas 1.0.)
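A self-contained sketch of this approach, with named columns for readability (the column names are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

# the three-column input: origin, destination, distance
df = pd.DataFrame([["a", "b", 10], ["a", "c", 20], ["b", "c", 30]],
                  columns=["origin", "dest", "dist"])

# duplicate each row with origin and destination swapped, then pivot;
# the missing diagonal (a->a, etc.) becomes NaN and is filled with 0
both = pd.DataFrame(np.concatenate([df.values, df.values[:, [1, 0, 2]]]),
                    columns=["origin", "dest", "dist"])
matrix = both.pivot(index="origin", columns="dest", values="dist").fillna(0)
print(matrix)
```

The swap `df.values[:, [1, 0, 2]]` simply reorders the columns to (destination, origin, distance), which makes the matrix symmetric.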
I would like to sort a pandas dataframe by the rows which have the most even distribution but also high values. For example:
Row Attribute1 Attribute2 Attribute3
a 1 1 108
b 10 2 145
c 50 60 55
d 100 90 120
e 20 25 23
f 1000 30 0
Rows d and c should rank the highest, ideally d followed by c.
I considered using standard deviation to identify the most even distribution and then mean to get the highest average values but I'm unsure as to how I can combine these together.
As the notion of "even distribution" you mention is quite subjective, here is a way to implement the coefficient of variation mentioned by @ALollz:
df.std(axis=1) / df.mean(axis=1)
Row 0
a 1.6848130582715446
b 1.535375387727906
c 0.09090909090909091
d 0.14782502241793033
e 0.11102697698927574
f 1.6569547684031352
This metric is the standard deviation expressed as a fraction of the mean. If a row has a mean of 10 and a standard deviation of 1, the ratio is 10%, or 0.1.
In this example, the row that could be considered most 'evenly distributed' is row c: its mean is 55 and its standard deviation is 5, so the ratio is about 9%.
This way, you can have a decent overview of the homogeneity of the distribution.
If you want the ranking, you can apply .sort_values:
(df.std(axis=1) / df.mean(axis=1)).sort_values()
Row 0
c 0.09090909090909091
e 0.11102697698927574
d 0.14782502241793033
b 1.535375387727906
f 1.6569547684031352
a 1.6848130582715446
A final word of caution: don't rely on visual perception alone, as our brains are easily tricked by statistics.
Now, if you also want to favour rows with higher values, you can divide this coefficient by the mean once more: the higher the mean, the lower the resulting score.
(df.std(axis=1) / df.mean(axis=1)**2).sort_values()
Row 0
d 0.0014305647330767452
c 0.001652892561983471
f 0.004826081849717869
e 0.004898248984820989
b 0.029338383204991835
a 0.045949447043769395
And now we obtain the desired ranking: d first, then c, f, e, b and a.
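The whole ranking can be reproduced with this short sketch (same data as the question; row labels used as the index):

```python
import pandas as pd

df = pd.DataFrame(
    {"Attribute1": [1, 10, 50, 100, 20, 1000],
     "Attribute2": [1, 2, 60, 90, 25, 30],
     "Attribute3": [108, 145, 55, 120, 23, 0]},
    index=list("abcdef"))

# coefficient of variation: std as a fraction of the mean (low = even)
cv = df.std(axis=1) / df.mean(axis=1)

# divide by the mean once more to also reward high-valued rows
score = (df.std(axis=1) / df.mean(axis=1) ** 2).sort_values()
print(score)  # d ranks first, then c
```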
I have two dataframes and I want to find the difference between dataframe1 and dataframe2 based on the condition. What I mean is the following:
df_ref_well:
zone depth
a 34
b 23
c 11
d 35
e -9999
df_well
zone depth
a 17
c 15
d 25
f 11
What I want is to generate df3 with the zone name and the difference between the depths of matching zones in df_ref_well and df_well:
df3 = ref_well - well (for the same zones)
zone depth
a 17
b -9999
c -4
d 10
e -9999
I have tried iterating through both data frames separately and, where the zones match, computing the difference:
ref_well_zone_count = len(df_ref_well.iloc[:, 0])
well_zone_count = len(df_well.iloc[:, 0])
delta_depth = []
for ref_zone in range(ref_well_zone_count):
    for well_zone in range(well_zone_count):
        if df_ref_well.iloc[ref_zone, 0] == df_well.iloc[well_zone, 0]:
            delta_depth.append(df_well.iloc[well_zone, 1] - df_ref_well.iloc[ref_zone, 1])
The problem is that I can't insert the results as a new column; when I try adding delta_depth as a column it raises:
ValueError: Length of values does not match length of index
but if I print the results, the differences themselves are calculated correctly.
You didn't specify what should happen when there is no match, so I will assume no match means depth = 0.
Link the two data frames together using merge; zones without a match are then filled with 0:
df3 = pd.merge(ref_well,df_well, on=['zone'], how='outer').fillna(0)
Calculate the difference and put it back
df3['diff'] = df3.depth_x - df3.depth_y
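Putting the two steps together in a runnable sketch (the `depth_x`/`depth_y` names come from merge's default column suffixes; the frame names mirror the question):

```python
import pandas as pd

ref_well = pd.DataFrame({"zone": list("abcde"),
                         "depth": [34, 23, 11, 35, -9999]})
df_well = pd.DataFrame({"zone": list("acdf"),
                        "depth": [17, 15, 25, 11]})

# outer merge keeps zones that appear in only one frame;
# fillna(0) treats a missing depth as 0
df3 = pd.merge(ref_well, df_well, on=["zone"], how="outer").fillna(0)
df3["diff"] = df3.depth_x - df3.depth_y
print(df3[["zone", "diff"]])
```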
I have 4 columns, where the 4th column is the sum of the first 3.
It looks like this:
A B C D
4 3 3 10
I want to convert the above into this:
E F G H
40% 30% 30% 100%
How can I do this in Python?
You can use numpy arrays for this operation. Let's consider that all the columns are represented by vectors; then the first 3 columns are represented by 3 vectors:
A = numpy.array([4])
B = numpy.array([3])
C = numpy.array([3])
Then you can add them as normal vectors (in your case columns)
D = A + B + C
All the numerical values will work as you expect, but to my knowledge the problem is the letters: those can't be added the way you show. If we consider
A = 1
B = 2
C = 3
then the answer would be 10, or J, not D; the same goes for the second set.
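For the percentage conversion the question actually asks about, here is a short pandas sketch (the column names E–H are just chosen to mirror the question's expected output):

```python
import pandas as pd

df = pd.DataFrame({"A": [4], "B": [3], "C": [3]})
df["D"] = df.sum(axis=1)            # D = A + B + C

# express each column as a percentage of the row total
pct = df.div(df["D"], axis=0) * 100
pct.columns = list("EFGH")
print(pct)  # E=40.0, F=30.0, G=30.0, H=100.0
```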
Say that I have a dataframe (df) with lots of values, including two columns, X and Y. I want to create a stacked histogram where each bin is a categorical value in X (say A and B), and inside each bin are stacks by values in Y (say a, b, c, ...).
I can run df.groupby(["X","Y"]).size() to get output like below, but how can I make the stacked histogram from this?
A a 14
b 41
c 4
d 2
e 2
f 15
g 1
h 3
B a 18
b 37
c 1
d 3
e 1
f 17
g 2
So, I think I figured this out. First one needs to unstack the data using .unstack(level=-1).
This turns it into an n-by-m table, where n is the number of X entries and m is the number of Y entries. From this form you can follow the outline given here:
http://pandas.pydata.org/pandas-docs/stable/visualization.html
So in total the command will be:
df.groupby(["X","Y"]).size().unstack(level=-1).plot(kind='bar',stacked=True)
Kinda unwieldy looking though!
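A self-contained sketch with small synthetic data (the Agg backend is used so it also runs headless; assumes matplotlib is installed):

```python
import matplotlib
matplotlib.use("Agg")               # non-interactive backend, no display needed
import pandas as pd

# synthetic data: categorical X with sub-categories Y
df = pd.DataFrame({"X": ["A", "A", "A", "B", "B"],
                   "Y": ["a", "a", "b", "a", "b"]})

# rows = X bins, columns = Y stacks; missing combinations become 0
counts = df.groupby(["X", "Y"]).size().unstack(level=-1).fillna(0)
ax = counts.plot(kind="bar", stacked=True)
ax.figure.savefig("stacked.png")
```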
I have an almost embarrassingly simple question, which I cannot figure out for myself.
Here's a toy example to demonstrate what I want to do, suppose I have this simple data frame:
df = pd.DataFrame([[1,2,3,4,5,6],[7,8,9,10,11,12]],index=range(2),columns=list('abcdef'))
a b c d e f
0 1 2 3 4 5 6
1 7 8 9 10 11 12
What I want is to stack it so that it takes the following form, where the columns identifiers have been changed (to X and Y) so that they are the same for all re-stacked values:
X Y
0 1 2
3 4
5 6
1 7 8
9 10
11 12
I am pretty sure you can do it with pd.stack() or pd.pivot_table(), but even after reading the documentation I cannot figure out how. Instead of appending all columns to the end of the next, I just want to stack pairs (or actually triplets) of values from each row.
Just to add some more flesh to the bones of what I want to do;
df = pd.DataFrame(np.random.randn(3,6),index=range(3),columns=list('abcdef'))
a b c d e f
0 -0.168636 -1.878447 -0.985152 -0.101049 1.244617 1.256772
1 0.395110 -0.237559 0.034890 -1.244669 -0.721756 0.473696
2 -0.973043 1.784627 0.601250 -1.718324 0.145479 -0.099530
I want this to re-stacked into this form (where column labels have been changed again, to the same for all values):
X Y Z
0 -0.168636 -1.878447 -0.985152
-0.101049 1.244617 1.256772
1 0.395110 -0.237559 0.034890
-1.244669 -0.721756 0.473696
2 -0.973043 1.784627 0.601250
-1.718324 0.145479 -0.099530
Yes, one could just make a for-loop applying the following logic to each row:
row.reshape(df.shape[1] // 2, 2)
But then you would have to process each row individually, and my actual data has tens of thousands of rows.
So I want to stack each individual row selectively (e.g. by pairs of values or triplets), and then stack that row-stack, for the entire data frame, basically. Preferably done on the entire data frame at once (if possible).
Apologies for such a trivial question.
Use numpy.reshape to reshape the underlying data in the DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3,6),index=range(3),columns=list('abcdef'))
print(df)
# a b c d e f
# 0 -0.889810 1.348811 -1.071198 0.091841 -0.781704 -1.672864
# 1 0.398858 0.004976 1.280942 1.185749 1.260551 0.858973
# 2 1.279742 0.946470 -1.122450 -0.355737 1.457966 0.034319
result = pd.DataFrame(df.values.reshape(-1, 3),
                      index=df.index.repeat(2), columns=list('XYZ'))
print(result)
yields
X Y Z
0 -0.889810 1.348811 -1.071198
0 0.091841 -0.781704 -1.672864
1 0.398858 0.004976 1.280942
1 1.185749 1.260551 0.858973
2 1.279742 0.946470 -1.122450
2 -0.355737 1.457966 0.034319
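The same pattern handles the two-column (X, Y) case from the first example: reshape into rows of length 2 and repeat each index label three times (deterministic values here, just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 13).reshape(2, 6),
                  index=range(2), columns=list("abcdef"))

# each row of 6 values becomes 3 rows of 2, keeping the original index label
pairs = pd.DataFrame(df.values.reshape(-1, 2),
                     index=df.index.repeat(df.shape[1] // 2),
                     columns=list("XY"))
print(pairs)
```

In general, splitting rows of width n into chunks of k is `reshape(-1, k)` with `index.repeat(n // k)`, as long as k divides n.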