compute the difference between all possible pairs of rows - python

Based on a selection ds of a dataframe d with:
{'x': d.x, 'y': d.y, 'a': d.a, 'b': d.b, 'c': d.c, 'n': d.n}
The selection has n rows, indexed from 0 to n-1. The column n is needed since ds is a selection and the original row indices must be kept for a later query.
How do you efficiently compute the difference between each pair of rows (e.g. a_0, a_1, etc.) for each column (a, b, c) without losing the row information (e.g. a new column with the indices of the rows that were used)?
MWE
Sample selection ds:
x y a b c n
554.607085 400.971878 9789 4151 6837 146
512.231450 405.469524 8796 3811 6596 225
570.427284 694.369140 1608 2019 2097 291
Desired output:
dist: the Euclidean distance, math.hypot(x2 - x1, y2 - y1)
da, db, dc: the column differences, e.g. for da: np.abs(a1 - a2)
ns: a string combining the n values of the two rows used
The result would look like:
dist da db dc ns
42.61365102824963 993 340 241 146-225
293.82347069813255 8181 2132 4740 146-291
.. .. .. .. 225-291

You can use itertools.combinations() to generate the pairs:
Read data first:
import pandas as pd
from io import StringIO
import numpy as np
text = """ x y a b c n
554.607085 400.971878 9789 4151 6837 146
512.231450 405.469524 8796 3811 6596 225
570.427284 694.369140 1608 2019 2097 291"""
df = pd.read_csv(StringIO(text), delim_whitespace=True)
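(Side note, not part of the original answer: delim_whitespace has been deprecated in recent pandas releases; on such versions the assumed equivalent is a regex separator.)
df = pd.read_csv(StringIO(text), sep=r"\s+")  # assumed equivalent where delim_whitespace is deprecated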
Create the index and calculate the results:
from itertools import combinations
index = np.array(list(combinations(range(df.shape[0]), 2)))
df1, df2 = [df.iloc[idx].reset_index(drop=True) for idx in index.T]
res = pd.concat([
    np.hypot(df1.x - df2.x, df1.y - df2.y),
    df1[["a", "b", "c"]] - df2[["a", "b", "c"]],
    df1.n.astype(str) + "-" + df2.n.astype(str)
], axis=1)
res.columns = ["dist", "da", "db", "dc", "ns"]
res
the output:
dist da db dc ns
0 42.613651 993 340 241 146-225
1 293.823471 8181 2132 4740 146-291
2 294.702805 7188 1792 4499 225-291
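As a side note (not from the answer above), the same pair indices can also be built directly in NumPy; a minimal equivalent sketch:
i, j = np.triu_indices(len(df), k=1)  # k=1 skips the diagonal, same pairs as combinations(range(n), 2)
df1 = df.iloc[i].reset_index(drop=True)
df2 = df.iloc[j].reset_index(drop=True)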

This approach makes good use of pandas and the underlying NumPy capabilities, but the matrix manipulations are a little hard to keep track of (note that pd.Panel was removed in pandas 1.0, so this only runs on older versions):
import pandas as pd, numpy as np
ds = pd.DataFrame(
    [
        [554.607085, 400.971878, 9789, 4151, 6837, 146],
        [512.231450, 405.469524, 8796, 3811, 6596, 225],
        [570.427284, 694.369140, 1608, 2019, 2097, 291]
    ],
    columns=['x', 'y', 'a', 'b', 'c', 'n']
)
def concat_str(*arrays):
    result = arrays[0]
    for arr in arrays[1:]:
        result = np.core.defchararray.add(result, arr)
    return result
# Make a panel with one item for each column, with a square data frame for
# each item, showing the differences between all row pairs.
# This creates perpendicular matrices of values based on the underlying numpy arrays;
# then numpy broadcasts them along the missing axis when calculating the differences
p = pd.Panel(
    (ds.values[np.newaxis, :, :] - ds.values[:, np.newaxis, :]).transpose(),
    items=['d'+c for c in ds.columns], major_axis=ds.index, minor_axis=ds.index
)
# calculate euclidian distance
p['dist'] = np.hypot(p['dx'], p['dy'])
# create strings showing row relationships
p['ns'] = concat_str(ds['n'].values.astype(str)[:,np.newaxis], '-', ds['n'].values.astype(str)[np.newaxis,:])
# remove unneeded items
del p['dx'], p['dy'], p['dn']
# convert to frame
diffs = p.to_frame().reindex_axis(['dist', 'da', 'db', 'dc', 'ns'], axis=1)
diffs
This gives:
dist da db dc ns
major minor
0 0 0.000000 0 0 0 146-146
1 42.613651 993 340 241 146-225
2 293.823471 8181 2132 4740 146-291
1 0 42.613651 -993 -340 -241 225-146
1 0.000000 0 0 0 225-225
2 294.702805 7188 1792 4499 225-291
2 0 293.823471 -8181 -2132 -4740 291-146
1 294.702805 -7188 -1792 -4499 291-225
2 0.000000 0 0 0 291-291
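Since pd.Panel and reindex_axis no longer exist in current pandas, here is a rough sketch of the same broadcasting idea that stacks the pairwise differences onto a MultiIndex instead (an assumed modern equivalent, not the original answer's code):
# broadcast all pairwise differences into a (n, n, 5) array, then flatten onto a MultiIndex
vals = ds[['x', 'y', 'a', 'b', 'c']].values
diff = vals[:, np.newaxis, :] - vals[np.newaxis, :, :]
idx = pd.MultiIndex.from_product([ds.index, ds.index], names=['major', 'minor'])
out = pd.DataFrame(diff.reshape(-1, 5), index=idx, columns=['dx', 'dy', 'da', 'db', 'dc'])
out['dist'] = np.hypot(out.dx, out.dy)
n_str = ds.n.astype(str)
out['ns'] = [f"{a}-{b}" for a in n_str for b in n_str]  # same major/minor order as the index
diffs = out[['dist', 'da', 'db', 'dc', 'ns']]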

Related

Can I combine these tables via pandas.crosstab?

I have three data frames (as the result of .mean()) like this:
A 533.9
B 691.9
C 611.5
D 557.8
I want to concatenate them into three columns like this:
all X Y
A 533.9 558.0 509.8
B 691.9 613.2 770.6
C 611.5 618.4 604.6
D 557.8 591.0 524.6
My MWE below does work, but I wonder if I can use .crosstab() or another fancier, easier pandas function for that.
The initial data frame:
group A B C D
0 X 844 908 310 477
1 X 757 504 729 865
2 X 420 281 898 260
3 X 258 755 683 805
4 X 511 618 472 548
5 Y 404 250 100 14
6 Y 783 909 434 719
7 Y 303 982 610 398
8 Y 476 810 913 824
9 Y 583 902 966 668
And this is the MWE using dict and pandas.concat() to solve the problem.
#!/usr/bin/env python3
import random as rd
import pandas as pd
import statistics
rd.seed(0)
df = pd.DataFrame({
    'group': ['X'] * 5 + ['Y'] * 5,
    'A': rd.choices(range(1000), k=10),
    'B': rd.choices(range(1000), k=10),
    'C': rd.choices(range(1000), k=10),
    'D': rd.choices(range(1000), k=10),
})
cols = list('ABCD')
result = {
    'all': df.loc[:, cols].mean(),
    'X': df.loc[df.group.eq('X'), cols].mean(),
    'Y': df.loc[df.group.eq('Y'), cols].mean()
}
tab = pd.concat(result, axis=1)
print(tab)
You can do this with melt followed by pivot_table:
out = df.melt('group').pivot_table(
    index='variable',
    columns='group',
    values='value',
    aggfunc='mean',
    margins=True).drop(['All'])
Out[207]:
group X Y All
variable
A 558.0 509.8 533.9
B 613.2 770.6 691.9
C 618.4 604.6 611.5
D 591.0 524.6 557.8
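To use crosstab directly, as the question asks: pd.crosstab also accepts values, aggfunc and margins, so a hedged equivalent of the pivot_table call (same melt step, variable names are just for this sketch) would be:
m = df.melt('group')
out = pd.crosstab(index=m.variable, columns=m.group, values=m.value,
                  aggfunc='mean', margins=True, margins_name='all').drop(['all'])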
Solution:
res = df.groupby('group').mean().T
res['all'] = (res.X + res.Y) / 2  # note: equals the overall mean only because both groups have the same size
print(res)
Output
group X Y all
A 558.0 509.8 533.9
B 613.2 770.6 691.9
C 618.4 604.6 611.5
D 591.0 524.6 557.8

Calculating Distances and Filtering and Summing Values Pandas

I currently have data which contains a location name, latitude, longitude and a number value associated with each location. The final goal is to get a dataframe with the sum of the values of the locations within specific distance ranges of each location. A sample dataframe is below:
IDVALUE,Latitude,Longitude,NumberValue
ID1,44.968046,-94.420307,1
ID2,44.933208,-94.421310,10
ID3,33.755787,-116.359998,15
ID4,33.844843,-116.54911,207
ID5,44.92057,-93.44786,133
ID6,44.240309,-91.493619,52
ID7,44.968041,-94.419696,39
ID8,44.333304,-89.132027,694
ID9,33.755783,-116.360066,245
ID10,33.844847,-116.549069,188
ID11,44.920474,-93.447851,3856
ID12,44.240304,-91.493768,189
Firstly, I managed to get the distances between each of them using the haversine function. Using the code below I turned the latlongs into radians and then created a matrix where the diagonals are infinite values.
# imports assumed for this snippet
import math
import numpy as np
import pandas as pd
from sklearn.neighbors import DistanceMetric

df_latlongs['LATITUDE'] = np.radians(df_latlongs['LATITUDE'])
df_latlongs['LONGITUDE'] = np.radians(df_latlongs['LONGITUDE'])
dist = DistanceMetric.get_metric('haversine')
latlong_df = pd.DataFrame(dist.pairwise(df_latlongs[['LATITUDE', 'LONGITUDE']].to_numpy()) * 6373,
                          columns=df_latlongs.IDVALUE.unique(), index=df_latlongs.IDVALUE.unique())
np.fill_diagonal(latlong_df.values, math.inf)
This distance matrix is then in kilometres. What I'm struggling with next is to be able to filter the distances of each of the locations and get the total number of values within a range and link this to the original dataframe.
Below is the code I have used to filter the distance matrix to get all of the locations within 500 meters:
latlong_df_rows = latlong_df[latlong_df < 0.5]
latlong_df_rows = latlong_df_rows.dropna(how='all', axis=0)
latlong_df_rows = latlong_df_rows.dropna(how='all', axis=1)
My attempt was to then get, for each location, a list of the locations within this distance using the code below:
within_range_df = latlong_df_rows.apply(lambda row: row[row < 0.05].index.tolist(), axis=1)
within_range_df = within_range_df.to_frame()
within_range_df = within_range_df.dropna(how='all', axis=0)
within_range_df = within_range_df.dropna(how='all', axis=1)
From here I was going to try and get the NumberValue from the original dataframe by looping through the list of values to obtain another column for the number for that location. Then sum all of them. The final dataframe would ideally look like the following:
IDVALUE,<500m,500-1000m,>1000m
ID1,x1,y1,z1
ID2,x2,y2,z2
ID3,x3,y3,z3
ID4,x4,y4,z4
ID5,x5,y5,z5
ID6,x6,y6,z6
ID7,x7,y7,z7
ID8,x8,y8,z8
ID9,x9,y9,z9
ID10,x10,y10,z10
ID11,x11,y11,z11
ID12,x12,y12,z12
Where x, y and z are the total NumberValues of the nearby locations for the different distance bands. I know this is probably really weird and overcomplicated, so if any tips to change the question or anything else are needed, I'll be happy to provide them. Cheers
I would define a helper function, making use of BallTree, e.g.
from sklearn.neighbors import BallTree
import pandas as pd
import numpy as np
df = pd.read_csv('input.csv')
We use query_radius() to get the IDs, then use a list comprehension to get the values and sum them:
locations_radians = np.radians(df[["Latitude","Longitude"]].values)
tree = BallTree(locations_radians, leaf_size=12, metric='haversine')
def summed_numbervalue_for_radius(radius_in_m=100):
    distance_in_meters = radius_in_m
    earth_radius = 6371000
    radius = distance_in_meters / earth_radius
    ids_within_radius = tree.query_radius(locations_radians, r=radius, count_only=False)
    values_as_array = np.array(df.NumberValue)
    summed_values = [values_as_array[ix].sum() for ix in ids_within_radius]
    return np.array(summed_values)
With the helper function you can do, for instance:
df = df.assign( sum_100=summed_numbervalue_for_radius(100))
df = df.assign( sum_500=summed_numbervalue_for_radius(500))
df = df.assign( sum_1000=summed_numbervalue_for_radius(1000))
df = df.assign( sum_1000_to_5000=summed_numbervalue_for_radius(5000)-summed_numbervalue_for_radius(1000))
This will give you:
IDVALUE Latitude Longitude NumberValue sum_100 sum_500 sum_1000 \
0 ID1 44.968046 -94.420307 1 40 40 40
1 ID2 44.933208 -94.421310 10 10 10 10
2 ID3 33.755787 -116.359998 15 260 260 260
3 ID4 33.844843 -116.549110 207 395 395 395
4 ID5 44.920570 -93.447860 133 3989 3989 3989
5 ID6 44.240309 -91.493619 52 241 241 241
6 ID7 44.968041 -94.419696 39 40 40 40
7 ID8 44.333304 -89.132027 694 694 694 694
8 ID9 33.755783 -116.360066 245 260 260 260
9 ID10 33.844847 -116.549069 188 395 395 395
10 ID11 44.920474 -93.447851 3856 3989 3989 3989
11 ID12 44.240304 -91.493768 189 241 241 241
sum_1000_to_5000
0 10
1 40
2 0
3 0
4 0
5 0
6 10
7 0
8 0
9 0
10 0
11 0
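If you want the three bands from the question (<500m, 500-1000m, >1000m) as columns, one hedged way is to take differences of these cumulative sums (a sketch; note that query_radius includes the query point itself, so each band containing the point still counts its own NumberValue):
# banded sums built from the cumulative helper above
total = df.NumberValue.sum()
within_500 = summed_numbervalue_for_radius(500)
within_1000 = summed_numbervalue_for_radius(1000)
df['<500m'] = within_500
df['500-1000m'] = within_1000 - within_500
df['>1000m'] = total - within_1000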

How to calculate and plot accuracy between two columns

I want to plot the accuracy rate for each letter as a bar graph using matplotlib.
Example Dataset
data = {'Actual Letter': ['U', 'A', 'X', 'P', 'C', 'R', 'C', 'U', 'J', 'D'], 'Predicted Letter': ['U', 'A', 'X', 'P', 'C', 'R', 'C', 'U', 'J', 'D']}
df = pd.DataFrame(data, index=[10113, 19164, 12798, 12034, 17719, 17886, 4624, 6047, 15608, 11815])
Actual Letter Predicted Letter
10113 U U
19164 A A
12798 X X
12034 P P
17719 C C
17886 R R
4624 C C
6047 U U
15608 J J
11815 D D
df.plot(kind='bar')
Error
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-14-a5f21be4f14b> in <module>
3 df = pd.DataFrame(data, index=[10113, 19164, 12798, 12034, 17719, 17886, 4624, 6047, 15608, 11815])
4
----> 5 df.plot(kind='bar')
e:\Anaconda3\lib\site-packages\pandas\plotting\_core.py in __call__(self, *args, **kwargs)
970 data.columns = label_name
971
--> 972 return plot_backend.plot(data, kind=kind, **kwargs)
973
974 __call__.__doc__ = __doc__
e:\Anaconda3\lib\site-packages\pandas\plotting\_matplotlib\__init__.py in plot(data, kind, **kwargs)
69 kwargs["ax"] = getattr(ax, "left_ax", ax)
70 plot_obj = PLOT_CLASSES[kind](data, **kwargs)
---> 71 plot_obj.generate()
72 plot_obj.draw()
73 return plot_obj.result
e:\Anaconda3\lib\site-packages\pandas\plotting\_matplotlib\core.py in generate(self)
284 def generate(self):
285 self._args_adjust()
--> 286 self._compute_plot_data()
287 self._setup_subplots()
288 self._make_plot()
e:\Anaconda3\lib\site-packages\pandas\plotting\_matplotlib\core.py in _compute_plot_data(self)
451 # no non-numeric frames or series allowed
452 if is_empty:
--> 453 raise TypeError("no numeric data to plot")
454
455 self.data = numeric_data.apply(self._convert_to_ndarray)
TypeError: no numeric data to plot
I wanted a bar graph like this, but I don't know how to do it.
Imports and Sample DataFrame
import pandas as pd
import numpy as np # for sample data only
import string # for sample data only
# create sample dataframe for testing
np.random.seed(365)
rows = 1100
data = {'Actual': np.random.choice(list(string.ascii_uppercase), size=rows),
        'Predicted': np.random.choice(list(string.ascii_uppercase), size=rows)}
df = pd.DataFrame(data)
Calculations and Plotting
Updated
The following implementation is more succinct; unnecessary steps have been removed.
Create a Boolean 'Match' column depending on whether 'Predicted' matches 'Actual'
.groupby on 'Actual', aggregate .mean(), multiply by 100, and round, to get the percent.
The group for each letter will sum the Booleans and divide by the count. For 'A', the sum is 1, because there is 1 True, which is divided by the total count of the group, 33. Therefore, 1/33 = 0.030303030303030304
Plot the bar for the selected data with pandas.DataFrame.plot
Note that steps (1) and (2) can be combined into the following:
dfa = df.Predicted.eq(df.Actual).groupby(df.Actual).mean().mul(100).round(2)
# determine where Predicted equals Actual
df['Match'] = df.Predicted.eq(df.Actual)
# display(df.head())
Actual Predicted Match
0 S Z False
1 U J False
2 B L False
3 M V False
4 F C False
# groupby and get percent
dfa = df.groupby('Actual').Match.mean().mul(100).round(2)
# display(dfa.head())
Actual
A 3.03
B 2.63
C 4.44
D 6.82
E 5.77
Name: Match, dtype: float64
# plot
ax = dfa.plot(kind='bar', rot=0, legend=False, grid=True, figsize=(8, 5),
              ylabel='Percent %', xlabel='Letter', title='Accuracy Rate % per letter')
Original Code
This works as well
# determine where Predicted equals Actual and convert to an int; True = 1 and False = 0
df['Match'] = df.Predicted.eq(df.Actual).astype(int)
# get the normalized value counts
dfg = df.groupby('Actual').Match.value_counts(normalize=True).mul(100).round(2).reset_index(name='%')
# get the accuracy scores where there is a Match
df_accuracy = dfg[dfg.Match.eq(1)]
# display(df_accuracy.head())
Actual Match %
1 A 1 3.03
3 B 1 2.63
5 C 1 4.44
7 D 1 6.82
9 E 1 5.77
# plot
ax = df_accuracy.plot(kind='bar', x='Actual', y='%', rot=0, legend=False, grid=True, figsize=(8, 5),
                      ylabel='Percent %', xlabel='Letter', title='Accuracy Rate % per letter')
I have simulated data like you describe. The graph is exceptionally simple if you calculate the percentages first:
import numpy as np
import pandas as pd
# simulate some data...
df = pd.DataFrame(
    {"Actual Letter": np.random.choice(list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"), 200)}
).assign(
    **{
        "Predicted Letter": lambda d: d["Actual Letter"].apply(
            lambda l: np.random.choice(
                [l] + list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"), 1, p=tuple([0.95] + [0.05/26]*26)
            )[0]
        )
    }
)
# now just calc percentage of where actual and predicted are the same
# graph it...
df.groupby("Actual Letter").apply(lambda d: (d["Actual Letter"]==d["Predicted Letter"]).sum()/len(d)).plot(kind="bar")

How to split several column's data in pandas?

I have a dataframe which looks like this:
df = pd.DataFrame({'hard': [['525', '21']], 'soft': [['1525', '221']], 'set': [['5245', '271']], 'purch': [['925', '201']],
                   'mont': [['555', '621']], 'gest': [['536', '251']], 'memo': [['825', '241']], 'raw': [['532', '210']]})
df
Out:
gest hard memo mont purch raw set soft
0 [536, 251] [525, 21] [825, 241] [555, 621] [925, 201] [532, 210] [5245, 271] [1525, 221]
I need to split all of the columns like this:
df1 = pd.DataFrame()
df1['gest_pos'] = df.gest.str[0].astype(int)
df1['gest_size'] = df.gest.str[1].astype(int)
df1['hard_pos'] = df.hard.str[0].astype(int)
df1['hard_size'] = df.hard.str[1].astype(int)
df1
gest_pos gest_size hard_pos hard_size
0 536 251 525 21
I have more than 70 columns and my method takes a lot of space and time. Is there an easier way to do this?
Thanks!
Different approach:
df2 = pd.DataFrame()
for column in df:
    df2['{}_pos'.format(column)] = df[column].str[0].astype(int)
    df2['{}_size'.format(column)] = df[column].str[1].astype(int)
print(df2)
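The same loop also fits into a single DataFrame constructor; a minimal sketch, assuming every cell is a two-element list of strings:
# build all '<col>_pos' / '<col>_size' columns in one constructor call
df2 = pd.DataFrame({f'{col}_{name}': df[col].str[i].astype(int)
                    for col in df.columns
                    for i, name in enumerate(('pos', 'size'))})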
You can use a nested list comprehension to flatten the rows and then create a new DataFrame with the constructor:
L = [[y for x in z for y in x] for z in df.values.tolist()]
#if want filter first 2 values per each list
#L = [[y for x in z for y in x[:2]] for z in df.values.tolist()]
#https://stackoverflow.com/a/45122198/2901002
def mygen(lst):
    for item in lst:
        yield item + '_pos'
        yield item + '_size'
df = pd.DataFrame(L, columns = list(mygen(df.columns))).astype(int)
print (df)
hard_pos hard_size soft_pos soft_size set_pos set_size purch_pos purch_size \
0 525 21 1525 221 5245 271 925 201
mont_pos mont_size gest_pos gest_size memo_pos memo_size raw_pos raw_size
0 555 621 536 251 825 241 532 210
You can use NumPy operations to construct your list of columns and flatten out your series of lists:
import numpy as np
from itertools import chain
# create column label array
cols = np.repeat(df.columns, 2).values
cols[::2] += '_pos'
cols[1::2] += '_size'
# create data array
arr = np.array([list(chain.from_iterable(i)) for i in df.values]).astype(int)
# combine with pd.DataFrame constructor
res = pd.DataFrame(arr, columns=cols)
Result:
print(res)
gest_pos gest_size hard_pos hard_size memo_pos memo_size mont_pos \
0 536 251 525 21 825 241 555
mont_size purch_pos purch_size raw_pos raw_size set_pos set_size \
0 621 925 201 532 210 5245 271
soft_pos soft_size
0 1525 221

Apply column operations to get a new column in pandas

I have data like this:
ID 8-Jan 15-Jan 22-Jan 29-Jan 5-Feb 12-Feb LowerBound UpperBound
001 618 720 645 573 503 447 - -
002 62 80 67 94 81 65 - -
003 32 10 23 26 26 31 - -
004 22 13 1 28 19 25 - -
005 9 7 9 6 8 4 - -
I want to create two columns with lower and upper bounds for each product using 95% confidence intervals. I know the manual way of writing a function which loops through each product ID:
import numpy as np
import scipy as sp
import scipy.stats
# Method copied from http://stackoverflow.com/questions/15033511/compute-a-confidence-interval-from-sample-data
def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * sp.stats.t._ppf((1 + confidence) / 2., n - 1)
    return m - h, m + h
Is there an efficient way in Pandas or (one liner kind of thing) ?
Of course, you want df.apply. Note you need to modify mean_confidence_interval to return pd.Series([m-h, m+h]).
df[['LowerBound','UpperBound']] = df.apply(mean_confidence_interval, axis=1)
Standard error of the mean is pretty straightforward to calculate so you can easily vectorize this:
import scipy.stats as ss
df.mean(axis=1) + ss.t.ppf(0.975, df.shape[1]-1) * df.std(axis=1)/np.sqrt(df.shape[1])
will give you the upper bound. Use - ss.t.ppf for the lower bound.
Also, pandas seems to have a sem method. If you have a large dataset, I don't suggest using apply over rows. It is pretty slow. Here are some timings:
df = pd.DataFrame(np.random.randn(100, 10))
%timeit df.apply(mean_confidence_interval, axis=1)
100 loops, best of 3: 18.2 ms per loop
%%timeit
dist = ss.t.ppf(0.975, df.shape[1]-1) * df.sem(axis=1)
mean = df.mean(axis=1)
mean - dist, mean + dist
1000 loops, best of 3: 598 µs per loop
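Putting the vectorized pieces together (a sketch, assuming the frame contains only the numeric weekly columns, i.e. the ID and the placeholder bound columns are excluded):
import scipy.stats as ss

h = ss.t.ppf(0.975, df.shape[1] - 1) * df.sem(axis=1)  # half-width of the 95% CI per row
m = df.mean(axis=1)
df['LowerBound'], df['UpperBound'] = m - h, m + h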
Since you already created a function for calculating the confidence interval, simply apply it to each row of your data:
def mean_confidence_interval(data):
    confidence = 0.95
    m = data.mean()
    se = scipy.stats.sem(data)
    h = se * sp.stats.t._ppf((1 + confidence) / 2, data.shape[0] - 1)
    return pd.Series((m - h, m + h))
interval = df.apply(mean_confidence_interval, axis=1)
interval.columns = ("LowerBound", "UpperBound")
pd.concat([df, interval], axis=1)
