Match two numpy arrays to find the same elements - python

I have a task kind of like SQL search. I have a "table" which contains the following 1D arrays (about 1 million elements) identified by ID1:
ID1, z, e, PA, n
Another "table" which contains the following 1D arrays (about 1.5 million elements) identified by ID2:
ID2, RA, DEC
I want to match ID1 against ID2 and keep the common IDs, forming another "table" that contains ID, z, e, PA, n, RA, DEC. Most elements of ID1 can be found in ID2, but not all; otherwise numpy.in1d(ID1, ID2) alone would do the job. Does anyone have a fast way to accomplish this task?
For example:
ID1, z, e, PA, n
101, 1.0, 1.2, 1.5, 1.8
104, 1.5, 1.8, 2.2, 3.1
105, 1.4, 2.0, 3.3, 2.8
ID2, RA, DEC
101, 4.5, 10.5
107, 90.1, 55.5
102, 30.5, 3.3
103, 60.1, 40.6
104, 10.8, 5.6
The output should be
ID, z, e, PA, n, RA, DEC
101, 1.0, 1.2, 1.5, 1.8, 4.5, 10.5
104, 1.5, 1.8, 2.2, 3.1, 10.8, 5.6

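You can call np.in1d twice on the ID columns (the first columns of the two arrays), swapping the argument order, to get one boolean mask per table. Index each array with its mask and stack the results. This assumes the IDs are unique within each table and that the common IDs appear in the same relative order in both (for example, both tables sorted by ID) -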
mask1 = np.in1d(a[:,0], b[:,0])
mask2 = np.in1d(b[:,0], a[:,0])
out = np.column_stack(( a[mask1], b[mask2,1:] ))
Sample run -
In [44]: a
Out[44]:
array([[ 101. , 1. , 1.2, 1.5, 1.8],
[ 104. , 1.5, 1.8, 2.2, 3.1],
[ 105. , 1.4, 2. , 3.3, 2.8]])
In [45]: b
Out[45]:
array([[ 101. , 4.5, 10.5],
[ 102. , 30.5, 3.3],
[ 103. , 60.1, 40.6],
[ 104. , 10.8, 5.6],
[ 107. , 90.1, 55.5]])
In [46]: mask1 = np.in1d(a[:,0], b[:,0])
In [47]: mask2 = np.in1d(b[:,0], a[:,0])
In [48]: np.column_stack(( a[mask1], b[mask2,1:] ))
Out[48]:
array([[ 101. , 1. , 1.2, 1.5, 1.8, 4.5, 10.5],
[ 104. , 1.5, 1.8, 2.2, 3.1, 10.8, 5.6]])
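If the ordering assumption cannot be guaranteed, one possible variant (a sketch, not part of the original answer) is np.intersect1d with return_indices=True, which returns the matching row positions in both arrays. The arrays a and b below reuse the sample data, with b left in the unsorted order given in the question:

```python
import numpy as np

a = np.array([[101, 1.0, 1.2, 1.5, 1.8],
              [104, 1.5, 1.8, 2.2, 3.1],
              [105, 1.4, 2.0, 3.3, 2.8]])
b = np.array([[101, 4.5, 10.5],
              [107, 90.1, 55.5],
              [102, 30.5, 3.3],
              [103, 60.1, 40.6],
              [104, 10.8, 5.6]])

# common: sorted IDs present in both tables
# ia, ib: row positions of those IDs in a and in b
common, ia, ib = np.intersect1d(a[:, 0], b[:, 0], return_indices=True)

# Rows are paired by ID, so the order of b does not matter
out = np.column_stack((a[ia], b[ib, 1:]))
print(out)
```

The result is ordered by ID. If an ID is duplicated, only its first occurrence in each array is used.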

Assuming your second table, table B, is sorted by ID, you can do a sorted lookup, then check whether the element at each returned index is actually a match:
idx = np.searchsorted(B[:-1, 0], A[:, 0])
found = A[:, 0] == B[idx, 0]
np.hstack((A[found, :], B[idx[found], 1:]))
Result:
array([[ 101. , 1. , 1.2, 1.5, 1.8, 4.5, 10.5],
[ 104. , 1.5, 1.8, 2.2, 3.1, 10.8, 5.6]])
The search runs over B[:-1, 0], that is, without the last ID of B, so the largest index searchsorted can return is len(B) - 1, which is still a valid row of B. Searching all of B could return len(B) for an item in A that lies beyond the final element of B, and B[idx, 0] would then raise an indexing error.
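When B is not already sorted by ID, the same lookup can be run through an argsort. The following is a minimal sketch under that assumption; the names order, keys and pos are illustrative, and np.clip is used here instead of the B[:-1] trick to keep indices in range:

```python
import numpy as np

A = np.array([[101, 1.0, 1.2, 1.5, 1.8],
              [104, 1.5, 1.8, 2.2, 3.1],
              [105, 1.4, 2.0, 3.3, 2.8]])
B = np.array([[101, 4.5, 10.5],
              [107, 90.1, 55.5],
              [102, 30.5, 3.3],
              [103, 60.1, 40.6],
              [104, 10.8, 5.6]])  # unsorted, as in the question

order = np.argsort(B[:, 0])           # row order that sorts B by ID
keys = B[order, 0]                    # sorted IDs of B
pos = np.searchsorted(keys, A[:, 0])  # insertion points into the sorted IDs
pos = np.clip(pos, 0, len(keys) - 1)  # keep indices valid for IDs beyond max(B)
found = keys[pos] == A[:, 0]          # True where the ID really exists in B

# order[pos[found]] maps back from sorted positions to original rows of B
out = np.hstack((A[found], B[order[pos[found]], 1:]))
print(out)
```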

Use pandas:
import pandas as pd
id1 = pd.read_csv('id1.txt')
id2 = pd.read_csv('id2.txt')
df = id1.merge(id2.sort_values(by='ID2').drop_duplicates('ID2').rename(columns={'ID2':'ID1'}))
print(df)
Produces:
ID1 z e PA n RA DEC
0 101 1.0 1.2 1.5 1.8 4.5 10.5
1 104 1.5 1.8 2.2 3.1 10.8 5.6
With large datasets you may need to do things in place:
# [Optional] sort locations and drop duplicates
id2.sort_values(by='ID2', inplace=True)
id2.drop_duplicates('ID2', inplace=True)
# columns that you are merging must have the same name
id2.rename(columns={'ID2':'ID1'}, inplace=True)
# perform the merge
df = id1.merge(id2)
Without drop_duplicates, an ID that occurs more than once in id2 produces one output row per occurrence:
df = id1.merge(id2.rename(columns={'ID2':'ID1'}))
print(id2)
print(df)
Giving:
ID2 RA DEC
0 101 4.5 10.5
1 107 90.1 55.5
2 102 30.5 3.3
3 103 60.1 40.6
4 104 10.8 5.6
5 103 60.1 40.6
6 104 10.9 5.6
ID1 z e PA n RA DEC
0 101 1.0 1.2 1.5 1.8 4.5 10.5
1 104 1.5 1.8 2.2 3.1 10.8 5.6
2 104 1.5 1.8 2.2 3.1 10.9 5.6
Note that this solution preserves the different types for the columns:
>>> id1.ID1.dtype
dtype('int64')
>>> id1[' z'].dtype
dtype('float64')
Since the header row has spaces after the commas, those spaces became part of the column names, so the second column has to be referred to as id1[' z']. Passing skipinitialspace=True to the read statement makes this unnecessary:
>>> id1 = pd.read_csv('id1.txt', skipinitialspace=True)
>>> id1.z.dtype
dtype('float64')
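As an alternative to renaming ID2, merge also accepts left_on and right_on. The sketch below builds the two tables in memory from the question's sample values instead of reading id1.txt and id2.txt, which is an assumption made only to keep the example self-contained:

```python
import pandas as pd

id1 = pd.DataFrame({'ID1': [101, 104, 105],
                    'z': [1.0, 1.5, 1.4],
                    'e': [1.2, 1.8, 2.0],
                    'PA': [1.5, 2.2, 3.3],
                    'n': [1.8, 3.1, 2.8]})
id2 = pd.DataFrame({'ID2': [101, 107, 102, 103, 104],
                    'RA': [4.5, 90.1, 30.5, 60.1, 10.8],
                    'DEC': [10.5, 55.5, 3.3, 40.6, 5.6]})

# Inner join on differently named key columns, then drop the redundant key
df = (id1.merge(id2, how='inner', left_on='ID1', right_on='ID2')
         .drop(columns='ID2'))
print(df)
```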

Related

fillna with max value of each group in python

Dataframe
df=pd.DataFrame({"sym":["a","a","aa","aa","aa","a","ab","ab","ab"],
"id_h":[2.1, 2.2 , 2.5 , 3.1 , 2.5, 3.8 , 2.5, 5,6],
"pm_h":[np.nan, 2.3, np.nan , 2.8, 2.7, 3.7, 2.4, 4.9,np.nan]})
I want to fill the NaN values in pm_h with the maximum id_h value of each "sym" group, i.e. (a, aa, ab).
Required output:
df1=pd.DataFrame({"sym":["a","a","aa","aa","aa","a","ab","ab","ab"],
"id_h":[2.1, 2.2 , 2.5 , 3.1 , 2.5, 3.8 , 2.5, 5,6],
"pm_h":[3.8, 2.3, 3.1 , 2.8, 2.7, 3.7, 2.4, 4.9, 6]})
Use Series.fillna together with GroupBy.transform('max'). transform returns a new Series of per-group maxima that has the same index as the original, so fillna can align it row by row:
df['pm_h'] = df['pm_h'].fillna(df.groupby('sym')['id_h'].transform('max'))
print (df)
sym id_h pm_h
0 a 2.1 3.8
1 a 2.2 2.3
2 aa 2.5 3.1
3 aa 3.1 2.8
4 aa 2.5 2.7
5 a 3.8 3.7
6 ab 2.5 2.4
7 ab 5.0 4.9
8 ab 6.0 6.0
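To see why this works, it can help to look at the intermediate Series. The sketch below is a self-contained version of the answer with the imports added; group_max is an illustrative name, not something from the original post:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sym": ["a", "a", "aa", "aa", "aa", "a", "ab", "ab", "ab"],
                   "id_h": [2.1, 2.2, 2.5, 3.1, 2.5, 3.8, 2.5, 5, 6],
                   "pm_h": [np.nan, 2.3, np.nan, 2.8, 2.7, 3.7, 2.4, 4.9, np.nan]})

# One value per row: the maximum id_h of the group that row belongs to
group_max = df.groupby("sym")["id_h"].transform("max")
print(group_max.tolist())

# fillna only touches rows where pm_h is NaN; other rows keep their value
df["pm_h"] = df["pm_h"].fillna(group_max)
print(df)
```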

Python increase the number of elements in a list with same distribution

I want to increase the number of elements in a list by following same distribution.
My code:
# Presently I have 5 elements
x_now = [4,4.5,4.6,5.4,6]
# I want to produce 13 elements. My expected output
x_exp = [4,4.5,4.6,5.4,6,4,4.5,4.6,5.4,6,4,4.5] # I got it by copying and pasting the existing list elements
# Is it possible to randomly sample between min and max here and produce n elements, like this:
x_exp1 = [4, 4.2, 4.6, 4.9, 5.5, 5.9, 4.3, 4.7, 4.8, 5.6, 6, 4.1, 4.6]
Option 1 (with x = x_now)
(x * 2)+(x[:13%len(x)])
Option 2
[x[i%len(x)] for i in range(13)]
Something like this:
In [1431]: l = x_now * 3
In [1432]: l[:13]
Out[1432]: [4, 4.5, 4.6, 5.4, 6, 4, 4.5, 4.6, 5.4, 6, 4, 4.5, 4.6]
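The options above repeat the list. For the second part of the question (random sampling), here is a hedged sketch using NumPy's random generator. Whether resampling the existing values or drawing uniformly between min and max better matches "the same distribution" depends on the data, so both are shown; the seed 0 is arbitrary:

```python
from itertools import cycle, islice

import numpy as np

x_now = [4, 4.5, 4.6, 5.4, 6]
n = 13

# Deterministic repetition: cycle through the list until n items are taken
repeated = list(islice(cycle(x_now), n))

rng = np.random.default_rng(0)

# Resample the existing values with replacement (keeps the empirical distribution)
resampled = rng.choice(x_now, size=n, replace=True)

# Draw new values uniformly between the observed minimum and maximum
uniform = rng.uniform(min(x_now), max(x_now), size=n)

print(repeated)
print(resampled)
print(uniform)
```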

Is there a function that gives practical information about a dataframe?

In Python, pandas provides the method DataFrame.info(), called as data.info(). It reports information about a dataset such as the datatypes, memory usage, number of entries, etc.
More information about .info() can be found in the pandas documentation.
Is there also a function in R that gives me this kind of information?
So here we have a few options
Base R
Within base R there are a few options for getting this kind of information about your data:
str
You can use str to see the structure of a data frame
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary
Additionally, there is the summary function, which computes a five-number summary (plus the mean) for each numeric column and the counts per level for factors:
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
dplyr
dplyr provides something similar to str which shows some of the data types
library(dplyr)
glimpse(iris)
Observations: 150
Variables: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5...
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0, 3.0, 4.0, 4.4, 3.9, 3.5, 3.8, 3...
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.5, 1.3, 1.4, 1.7, 1...
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.2, 0.4, 0.4, 0.3, 0.3, 0...
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, ...
skimr
Finally, the skimr package provides an enhanced summary including little histograms
library(skimr)
skim(iris)
-- Data Summary ------------------------
Values
Name iris
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 4
________________________
Group variables None
-- Variable type: factor -------------------------------------------------------
skim_variable n_missing complete_rate ordered n_unique top_counts
1 Species 0 1 FALSE 3 set: 50, ver: 50, vir: 50
-- Variable type: numeric ------------------------------------------------------
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 Sepal.Length 0 1 5.84 0.828 4.3 5.1 5.8 6.4 7.9 ▆▇▇▅▂
2 Sepal.Width 0 1 3.06 0.436 2 2.8 3 3.3 4.4 ▁▆▇▂▁
3 Petal.Length 0 1 3.76 1.77 1 1.6 4.35 5.1 6.9 ▇▁▆▇▂
4 Petal.Width 0 1 1.20 0.762 0.1 0.3 1.3 1.8 2.5 ▇▁▇▅▃
Between those functions you can get a pretty good look at your data!
It's not a single function, but the first three things I always do are
library(tidyverse)
# Shows top 6 rows
iris %>% head()
# Gives dimensions of data.frame
iris %>% dim()
# Gives the classes of the data in each column (e.g. numeric, character etc)
iris %>% sapply(class)
The best package I use, that I haven't seen above, is inspectdf (mentioned by Niels in a comment above). inspectdf does much of the summarising you see with skimr in MDEWITT's answer via specific function calls; for instance, inspect_cat and inspect_num for categorical and numerical variable summaries, respectively.
The contribution of my comment is that inspectdf has two additional functions inspect_imb and inspect_cor which, respectively, look at the most common value per column and the correlation between numerical cols. I find these tremendously useful for data cleaning/pre-processing.

Standard error of values in array corresponding to values in another array

I have an array that contains numbers that are distances, and another that represents certain values at that distance. How do I calculate the standard error of all the data at a fixed value of the distance?
The standard error is the standard deviation divided by the square root of the number of observations.
e.g distances(d):
[1 1 14 6 1 12 14 6 6 7 4 3 7 9 1 3 3 6 5 8]
e.g data corresponding to the entry of the distances:
so value=3.3 at d=1; value=2.1 at d=1; value=3.5 at d=14; etc.
[3.3 2.1 3.5 2.5 4.6 7.4 2.6 7.8 9.2 10.11 14.3 2.5 6.7 3.4 7.5 8.5 9.7 4.3 2.8 4.1]
For example, at distance d=6 I should calculate the standard error of 2.5, 7.8, 9.2 and 4.3 which would be the standard deviation of these values divided by the square root of the total number of values (4 in this case).
I've used the following code, which works, but I don't know how to divide the result by the square root of the total number of values at each distance:
import numpy as np
result = []
for d in set(key):
    result.append(np.std([dist[i] for i in range(len(key)) if key[i] == d]))
Any help would be greatly appreciated. Thanks!
Does this help?
for d in set(key):
    group = [dist[i] for i in range(len(key)) if key[i] == d]  # values at distance d
    result.append(np.std(group) / np.sqrt(len(group)))
I'm having a bit of a hard time telling exactly how you want things structured, but I would recommend a dictionary, so that you can know which result is associated with which key value. If your data is like this:
>>> key
array([ 1, 1, 14, 6, 1, 12, 14, 6, 6, 7, 4, 3, 7, 9, 1, 3, 3,
6, 5, 8])
>>> values
array([ 3.3 , 2.1 , 3.5 , 2.5 , 4.6 , 7.4 , 2.6 , 7.8 , 9.2 ,
10.11, 14.3 , 2.5 , 6.7 , 3.4 , 7.5 , 8.5 , 9.7 , 4.3 ,
2.8 , 4.1 ])
You can set up a dictionary along these lines with a dict comprehension:
result = {f'distance_{i}':np.std(values[key==i]) / np.sqrt(sum(key==i)) for i in set(key)}
>>> result
{'distance_1': 1.0045988005169029, 'distance_3': 1.818424226264781, 'distance_4': 0.0, 'distance_5': 0.0, 'distance_6': 1.3372079120316331, 'distance_7': 1.2056170619230633, 'distance_8': 0.0, 'distance_9': 0.0, 'distance_12': 0.0, 'distance_14': 0.3181980515339463}
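Note that np.std defaults to ddof=0 (the population standard deviation), whereas the standard error of the mean is conventionally computed from the sample standard deviation (ddof=1). The sketch below makes that choice explicit. The function name standard_errors is hypothetical, and returning NaN for groups that are too small is an assumption about the desired behaviour:

```python
import numpy as np

key = np.array([1, 1, 14, 6, 1, 12, 14, 6, 6, 7, 4, 3, 7, 9, 1, 3, 3, 6, 5, 8])
values = np.array([3.3, 2.1, 3.5, 2.5, 4.6, 7.4, 2.6, 7.8, 9.2, 10.11,
                   14.3, 2.5, 6.7, 3.4, 7.5, 8.5, 9.7, 4.3, 2.8, 4.1])

def standard_errors(key, values, ddof=1):
    """Return {distance: standard error} for each distinct value in key."""
    result = {}
    for d in np.unique(key):
        group = values[key == d]   # all values observed at distance d
        n = group.size
        if n > ddof:
            result[int(d)] = group.std(ddof=ddof) / np.sqrt(n)
        else:
            # Not enough observations for this ddof (e.g. one value with ddof=1)
            result[int(d)] = float("nan")
    return result

se = standard_errors(key, values)
print(se)
```

Calling standard_errors(key, values, ddof=0) reproduces the numbers from the dict comprehension above.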

Meshgrid of z values that match x and y meshgrid values

Edit: Original question was flawed but I am leaving it here for reasons of transparency.
Original:
I have some x, y, z data where x and y are coordinates of a 2D grid and z is a scalar value corresponding to (x, y).
>>> import numpy as np
>>> # Dummy example data
>>> x = np.arange(0.0, 5.0, 0.5)
>>> y = np.arange(1.0, 2.0, 0.1)
>>> z = np.sin(x)**2 + np.cos(y)**2
>>> print("x = ", x, "\n", "y = ", y, "\n", "z = ", z)
x = [ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
y = [ 1. 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9]
z = [ 0.29192658 0.43559829 0.83937656 1.06655187 0.85571064 0.36317266
0.02076747 0.13964978 0.62437081 1.06008127]
Using xx, yy = np.meshgrid(x, y) I can get two grids containing x and y values corresponding to each grid position.
>>> xx, yy = np.meshgrid(x, y)
>>> print(xx)
[[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]]
>>> print(yy)
[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. ]
[ 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1]
[ 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2]
[ 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3]
[ 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4]
[ 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5]
[ 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6]
[ 1.7 1.7 1.7 1.7 1.7 1.7 1.7 1.7 1.7 1.7]
[ 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8]
[ 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9]]
Now I want an array of the same shape for z, where the grid values correspond to the matching x and y values in the original data! But I cannot find an elegant, built-in solution where I do not need to re-grid the data, and I think I am missing some understanding of how I should approach it.
I have tried following this solution (with my real data, not this simple example data, but it should have the same result) but my final grid was not fully populated.
Please help!
Corrected question:
As was pointed out by commenters, my original dummy data was unsuitable for the question I am asking. Here is an improved version of the question:
I have some x, y, z data where x and y are coordinates of a 2D grid and z is a scalar value corresponding to (x, y). The data is read from a text file "data.txt":
#x y z
1.4 0.2 1.93164166734
1.4 0.3 1.88377897779
1.4 0.4 1.81946452501
1.6 0.2 1.9596778849
1.6 0.3 1.91181519535
1.6 0.4 1.84750074257
1.8 0.2 1.90890970517
1.8 0.3 1.86104701562
1.8 0.4 1.79673256284
2.0 0.2 1.78735230743
2.0 0.3 1.73948961789
2.0 0.4 1.67517516511
Loading the text:
>>> import numpy as np
>>> inFile = r'C:\data.txt'
>>> x, y, z = np.loadtxt(inFile, unpack=True, usecols=(0, 1, 2), comments='#', dtype=float)
>>> print(x)
[ 1.4 1.4 1.4 1.6 1.6 1.6 1.8 1.8 1.8 2. 2. 2. ]
>>> print(y)
[ 0.2 0.3 0.4 0.2 0.3 0.4 0.2 0.3 0.4 0.2 0.3 0.4]
>>> print(z)
[ 1.93164167 1.88377898 1.81946453 1.95967788 1.9118152 1.84750074
1.90890971 1.86104702 1.79673256 1.78735231 1.73948962 1.67517517]
Using xx, yy= np.meshgrid(np.unique(x), np.unique(y)) I can get two grids containing x and y values corresponding to each grid position.
>>> xx, yy= np.meshgrid(np.unique(x), np.unique(y))
>>> print(xx)
[[ 1.4 1.6 1.8 2. ]
[ 1.4 1.6 1.8 2. ]
[ 1.4 1.6 1.8 2. ]]
>>> print(yy)
[[ 0.2 0.2 0.2 0.2]
[ 0.3 0.3 0.3 0.3]
[ 0.4 0.4 0.4 0.4]]
Now each cell position in xx and yy corresponds to one of the original grid point locations.
I simply need an equivalent array where the grid values correspond to the matching z values in the original data!
"""e.g.
[[ 1.93164166734 1.9596778849 1.90890970517 1.78735230743]
[ 1.88377897779 1.91181519535 1.86104701562 1.73948961789]
[ 1.81946452501 1.84750074257 1.79673256284 1.67517516511]]"""
But I cannot find an elegant, built-in solution where I do not need to re-grid the data, and I think I am missing some understanding of how I should approach it. For example, using xx, yy, zz = np.meshgrid(x, y, z) returns three 3D arrays that I don't think I can use.
Please help!
Edit:
I managed to make this example work thanks to the solution from Jaime: Fill 2D numpy array from three 1D numpy arrays
>>> x_vals, x_idx = np.unique(x, return_inverse=True)
>>> y_vals, y_idx = np.unique(y, return_inverse=True)
>>> vals_array = np.empty(x_vals.shape + y_vals.shape)
>>> vals_array.fill(np.nan) # or whatever your desired missing data flag is
>>> vals_array[x_idx, y_idx] = z
>>> zz = vals_array.T
>>> print(zz)
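For reference, here is a self-contained sketch of the same approach using the sample values from data.txt typed in directly (an assumption made so the example runs without the file). It also shows that, for a complete grid listed with x varying slowest as in this file, a plain reshape gives the same result:

```python
import numpy as np

x = np.repeat([1.4, 1.6, 1.8, 2.0], 3)   # each x value appears three times
y = np.tile([0.2, 0.3, 0.4], 4)          # the y values cycle within each x
z = np.array([1.93164166734, 1.88377897779, 1.81946452501,
              1.9596778849, 1.91181519535, 1.84750074257,
              1.90890970517, 1.86104701562, 1.79673256284,
              1.78735230743, 1.73948961789, 1.67517516511])

# Index of every sample along each axis of the grid
x_vals, x_idx = np.unique(x, return_inverse=True)
y_vals, y_idx = np.unique(y, return_inverse=True)

# Start with NaN so that missing grid points stay flagged
vals_array = np.full((x_vals.size, y_vals.size), np.nan)
vals_array[x_idx, y_idx] = z

# Transpose so that rows follow y and columns follow x, like np.meshgrid
zz = vals_array.T
xx, yy = np.meshgrid(x_vals, y_vals)
print(zz)

# Only valid for a complete grid stored with x varying slowest
zz_reshaped = z.reshape(x_vals.size, y_vals.size).T
```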
But the code (with real input data) that led me on this path was still failing. I found the problem now. I have been using scipy.ndimage.zoom to resample my gridded data to a higher resolution before generating zz.
>>> import scipy.ndimage
>>> zoom = 2
>>> x = scipy.ndimage.zoom(x, zoom)
>>> y = scipy.ndimage.zoom(y, zoom)
>>> z = scipy.ndimage.zoom(z, zoom)
With the zoomed x, y and z, the resulting zz contained many nan entries:
array([[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
...,
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan]])
When I skip the zoom stage, the correct array is produced:
array([[-22365.93400183, -22092.31794674, -22074.21420168, ...,
-14513.89091599, -12311.97437017, -12088.07062786],
[-29264.34039242, -28775.79743097, -29021.31886353, ...,
-21354.6799064 , -21150.76555669, -21046.41225097],
[-39792.93758344, -39253.50249278, -38859.2562673 , ...,
-24253.36838785, -25714.71895023, -29237.74277727],
...,
[ 44829.24733543, 44779.37084337, 44770.32987311, ...,
21041.42652441, 20777.00408692, 20512.58162671],
[ 44067.26616067, 44054.5398901 , 44007.62587598, ...,
21415.90416488, 21151.48168444, 20887.05918082],
[ 43265.35371973, 43332.5983711 , 43332.21743471, ...,
21780.32283309, 21529.39770759, 21278.47255848]])
