Compare every cell in a dataframe with its surrounding cells - python

I have a dataframe similar to this one, but a lot bigger (3000x3000):
   A  B  C  D  E
W  3  1  8  3  4
X  2  2  9  1  1
Y  5  7  1  3  7
Z  6  8  5  8  9
where [A, B, C, D, E] are the column names and [W, X, Y, Z] are the row indexes.
I want to compare every cell with its surrounding cells. If a cell has a greater value than a neighboring cell, create a directed edge (using the networkx package) from that cell to its smaller-valued neighbor. For example:
examining cell (X,B), we should add the following:
G.add_edge((X,B), (W,B)) and G.add_edge((X,B), (Y,C)) and so on for every cell in the dataframe.
Currently I am doing this with two nested loops, but that takes hours to finish and uses a lot of resources (RAM).
Is there a more efficient way to do it?

If you want to have edges in a networkx graph, then you will not be able to avoid the nested for loop.
The comparison is actually easy to optimize. You could make four copies of your matrix and shift each one step in one of the directions. You can then vectorize the comparison with a simple df > df_copy for every direction.
Nevertheless, when it comes to creating the edges in your graph, you still need to iterate over both axes.
My recommendation is to write the data preparation part in Cython. Also have a look at graph-tool, whose core is written in C++. With that many edges you will probably also run into performance issues in networkx itself.
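For illustration, here is a rough sketch of the shift-and-compare idea, with the shifts expressed as array slices instead of explicit copies (all names are mine; it assumes 8-connectivity to match the (Y,C) example above and a strict > comparison, so equal neighbours get no edge):
import numpy as np
import networkx as nx
import pandas as pd

def build_graph(df):
    values = df.to_numpy()
    n_rows, n_cols = values.shape
    G = nx.DiGraph()
    G.add_nodes_from((r, c) for r in df.index for c in df.columns)

    # All eight neighbour offsets; drop the diagonals for 4-connectivity.
    offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    for dr, dc in offsets:
        # Slice so that src[i, j] and dst[i, j] are a cell and its neighbour
        # shifted by (dr, dc); the comparison is then one vectorized operation.
        src = values[max(0, -dr):n_rows - max(0, dr), max(0, -dc):n_cols - max(0, dc)]
        dst = values[max(0, dr):n_rows - max(0, -dr), max(0, dc):n_cols - max(0, -dc)]
        r_idx, c_idx = np.nonzero(src > dst)

        # Translate array positions back to (index, column) labels and add the
        # edges from the larger cell to its smaller neighbour.
        G.add_edges_from(
            ((df.index[r + max(0, -dr)], df.columns[c + max(0, -dc)]),
             (df.index[r + max(0, dr)], df.columns[c + max(0, dc)]))
            for r, c in zip(r_idx, c_idx))
    return G

df = pd.DataFrame([[3, 1, 8, 3, 4], [2, 2, 9, 1, 1],
                   [5, 7, 1, 3, 7], [6, 8, 5, 8, 9]],
                  index=list("WXYZ"), columns=list("ABCDE"))
G = build_graph(df)
print(G.has_edge(("X", "B"), ("W", "B")))  # True, since 2 > 1
The comparison is vectorized per direction; only the edge insertion still loops, inside add_edges_from. For a 3000x3000 grid that is still tens of millions of edges, which is exactly where networkx itself becomes the bottleneck.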

Related

Get attributes from non-overlapping points and polygons

I have two geodatasets: one contains points (centroids from a different polygon layer; let's name it point_data) and the other is a polygon of a whole country (let's name it polygon_data). What I'm trying to do is get attributes from polygon_data and put them in point_data. The problem is that they do not overlap.
To better understand the context, the country is archipelagic by nature, and the points are outside the country (that's why they're not overlapping).
Some solutions that I've tried are:
1.) Buffering up polygon_data so that it touches point_data. Unfortunately this caused problems because shapes that are not on the shoreline were also buffered.
2.) Using the original polygon of point_data and doing a spatial join (intersects), but some points still returned null values and duplicate rows also occurred.
I want to make the process as seamless and easy as possible. Any ideas?
I'm proficient with both geopandas and QGIS, but I would prefer a geopandas solution as much as possible.
Thank you to whoever will be able to help. :)
I guess you can try to join your data based on the distance between the points and the polygon(s). By doing so, you can fetch the index of the nearest polygon feature for each of your points, then use this index to do the join.
To replicate your problem, I generated a layer of points and a layer of polygons (they have an attribute name that I want to put on the point layer).
One (naive) way to do so could be the following:
import geopandas as gpd

# read the polygon layer and the point layer
poly_data = gpd.read_file('poly_data.geojson')
pt_data = gpd.read_file('pt_data.geojson')

# Create the field to store the index
# of the nearest polygon feature
pt_data['join_field'] = 0

for idx, geom in pt_data['geometry'].items():
    # Compute the distance between this point and each polygon
    distances = [
        (idx_to_join, geom.distance(geom_poly))
        for idx_to_join, geom_poly in poly_data['geometry'].items()]
    # Sort the distances...
    distances.sort(key=lambda d: d[1])
    # ... and store the index of the nearest polygon feature
    pt_data.loc[idx, 'join_field'] = distances[0][0]

# make the join between pt_data and poly_data (except its geometry column)
# based on the value of 'join_field'
result = pt_data.join(
    poly_data[poly_data.columns.difference(['geometry'])],
    on='join_field')

# remove the `join_field` if needed
result.drop('join_field', axis=1, inplace=True)
Result: (the value in the name column is coming from the polygons)
id geometry name
0 1 POINT (-0.07109 0.40284) A
1 2 POINT (0.04739 0.49763) A
2 3 POINT (0.05450 0.29858) A
3 4 POINT (0.06635 0.11848) A
4 5 POINT (0.63744 0.73934) B
5 6 POINT (0.61611 0.53555) B
6 7 POINT (0.76540 0.44787) B
7 8 POINT (0.84597 0.36256) B
8 9 POINT (0.67062 -0.36493) C
9 10 POINT (0.54028 -0.37204) C
10 11 POINT (0.69194 -0.60900) C
11 12 POINT (0.62085 -0.65166) C
12 13 POINT (0.31043 -0.48578) C
13 14 POINT (0.36967 -0.81280) C
Depending on the size of your dataset you may want to consider more efficient methods (e.g. defining a maximum search radius around each point to avoid having to iterate across all polygons).
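On newer geopandas versions (0.10 and later) there is also a built-in nearest join that does roughly the same thing in one call; a minimal sketch, assuming the same two files and a projected CRS so that distances are meaningful:
import geopandas as gpd

poly_data = gpd.read_file('poly_data.geojson')
pt_data = gpd.read_file('pt_data.geojson')

# Attach the attributes of the nearest polygon to every point; the optional
# max_distance argument acts as the search radius mentioned above.
result = gpd.sjoin_nearest(pt_data, poly_data, how='left')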

Pysparkling 2 reset monotonically_increasing_id from 1

I want to split a Spark dataframe into two pieces and define a row number for each of the sub-dataframes. But I found that the function monotonically_increasing_id still defines the row numbers from the original dataframe.
Here is what I did in python:
# df is the original sparkframe
splits = df.randomSplit([7.0,3.0],400)
# add column rowid for the two subframes
set1 = splits[0].withColumn("rowid", monotonically_increasing_id())
set2 = splits[1].withColumn("rowid", monotonically_increasing_id())
# check the results
set1.select("rowid").show()
set2.select("rowid").show()
I would expect the first five elements of rowid for the two frames to both be 1 to 5 (or 0 to 4, I can't remember exactly):
set1: 1 2 3 4 5
set2: 1 2 3 4 5
But what I actually got is:
set1: 1 3 4 7 9
set2: 2 5 6 8 10
The two subframes' row ids are actually their row ids in the original dataframe df, not new ones.
As a Spark newbie, I am seeking help on why this happens and how to fix it.
First of all, what version of Spark are you using? The implementation of monotonically_increasing_id has been changed a few times. I can reproduce your problem in Spark 2.0, but it seems the behavior is different in Spark 2.2, so it could be a bug that was fixed in a newer Spark release.
That being said, you should NOT expect the values generated by monotonically_increasing_id to increase consecutively. In your code, I believe there is only one partition for the dataframe. According to http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html:
The generated ID is guaranteed to be monotonically increasing and
unique, but not consecutive. The current implementation puts the
partition ID in the upper 31 bits, and the record number within each
partition in the lower 33 bits. The assumption is that the data frame
has less than 1 billion partitions, and each partition has less than 8
billion records.
As an example, consider a DataFrame with two partitions, each with 3
records. This expression would return the following IDs: 0, 1, 2,
8589934592 (1L << 33), 8589934593, 8589934594.
So your code should not expect the rowid to increase consecutively.
Besides, you should also consider caching and failure scenarios. Even if monotonically_increasing_id worked the way you expect and increased the values consecutively, it still would not be reliable. What if a node fails? The partitions on the failed node will be regenerated from the source or the last cache/checkpoint, which might have a different order and thus different rowids. Eviction from cache also causes issues: suppose that after generating a dataframe you cache it in memory and it later gets evicted; a future action will regenerate the dataframe and thus produce different rowids.
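If you do need consecutive 1, 2, 3, ... numbering per sub-frame, one hedged workaround is row_number over a window; a sketch, assuming the splits from the question (note that a window without partitioning pulls the data to a single partition, which is only acceptable for moderately sized frames):
from pyspark.sql import Window
from pyspark.sql.functions import row_number, monotonically_increasing_id

# Use the (non-consecutive) monotonically increasing id only to fix a
# deterministic order, then number the rows 1, 2, 3, ... in each sub-frame.
w = Window.orderBy("mono_id")
set1 = (splits[0]
        .withColumn("mono_id", monotonically_increasing_id())
        .withColumn("rowid", row_number().over(w))
        .drop("mono_id"))
set2 = (splits[1]
        .withColumn("mono_id", monotonically_increasing_id())
        .withColumn("rowid", row_number().over(w))
        .drop("mono_id"))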

What is the Python syntax for accessing specific data in a function call?

I am not generally a python user. I do most things in R and Stata. However, I cannot find a good semantic similarity package/API in either of those.
I have two data frames in the environment. One is questions, which consists of 2 columns and 3 rows. The other is results, which has 3 columns and 3 rows.
I am trying to compare each question (individually) in the first column of the questions dataframe to all of the questions in the second column. Then I want the output to populate the results dataframe. The function takes two strings as arguments. So far my code looks like this:
for i in range(1, 3):
    results.iloc[i-1,i] = liteClient.compare(questions.iloc[0,i], questions.iloc[:,1])
I assume that I am pointing to the data incorrectly, but I really haven't been able to find good documentation about this seemingly straightforward issue.
By comparison, here is my working R code, which uses a slightly different function and only one dataframe.
for (i in 1:3){
    df[,i+2] <- levenshteinSim(df$yr1questions[i], df$yr2questions)
}
Any help would be greatly appreciated! I am trying to come up with proof-of-concept code to compare similar survey questions between years based on semantic meaning.
Bob
Let's try to compare (here, multiply) every question A with every question B:
import numpy as np
import pandas as pd

questions = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["question_A", "question_B"])
This gives:
question_A question_B
0 0 1
1 2 3
2 4 5
Then let's define a compare function:
def compare(row):
    return pd.Series(row[0] * questions['question_B'])

results = questions.apply(compare, axis=1)
That gives us:
0 1 2
0 0 0 0
1 2 6 10
2 4 12 20
As you pointed out in the comments, here is a version comparing only two strings at a time:
def compare(row):
    question_a = row[0]
    return pd.Series([liteClient.compare(question_a, question_b) for question_b in questions['question_B']])
Based on what you've posted so far, here are some issues with what you've written, which are understandable given your R background:
for i in range(1, 3):
In Python 3.x, this creates a range object, which you can think of as a special type of function (though it is really an object with iteration properties) that generates a sequence of numbers with a certain step size (default 1) up to an exclusive maximum. You also need to know that most programming languages, including Python, index starting at zero, not one.
What this range object does here is generate the sequence 1, 2 and that is it.
So i will not cover all the indices of the arrays you are indexing with it. What I believe you want is something like:
for i in range(3):
Notice there is only one number here; it is the exclusive maximum of the range, with 0 as the inclusive minimum, so this generates the sequence 0, 1, 2. If you have an array of size 3, this covers all possible indices for that array.
This next line is a bit confusing to me, since I'm not familiar with R, but I sort of understand what you were trying to do. If I understand correctly, you are comparing two columns of 3 questions each, matching each question in column 1 against the questions in column 2, and storing the resulting 3 x 3 matrix of comparison results in results. Assuming the sizes are already correct (i.e. results is 3x3), I'd like to explain some peculiarities I see in this code.
results.iloc[i-1,i] = liteClient.compare(questions.iloc[0,i], questions.iloc[:,1])
With results.iloc[i-1,i] you are indexing by row and column: i-1 is the row and i is the column. So, without changing range(1,3), only the positions (0,1) and (1,2) are ever assigned. I believe liteClient.compare(...) is supposed to return either a 1x3 dataframe or a list of size 3, based on what you were trying to do, though this may not be the case; I'm not sure what object you are calling that method on, so I don't know where its documentation lives. Assuming it does return a list of size 3 or a dataframe, you'll need to change the way you assign the data, via this:
results.iloc[i,:] = ...
What is happening here is that iloc[...] takes a row position and a column slice: you are assigning all the columns of the results matrix at that row to the values returned by compare. Combined with the changed for statement, this iterates over all rows of the dataframe.
liteClient.compare(questions.iloc[0,i], questions.iloc[:,1])
As this line currently stands, you iterate over each column of the first row of questions and compare each of those values to the second column across all rows of questions.
I believe what you want is to change this to:
liteClient.compare(questions.iloc[i,0], questions.iloc[:,1])
What this does is, for each i (0, 1, 2), take the question at row i of column 0 and compare it to every row in column 1. If your questions dataframe is actually organized as 2 columns and 3 rows this should work; otherwise you will need to change how you create questions as well.
In all, I believe the fixed program should look something like:
for i in range(3):
    results.iloc[i,:] = liteClient.compare(questions.iloc[i,0], questions.iloc[:,1])

pandas - how to combine selected rows in a DataFrame

I've been reading a huge (5 GB) gzip file in the form:
User1 User2 W
0 11 12 1
1 12 11 2
2 13 14 1
3 14 13 2
which is basically a directed graph representation of connections among users with a certain weight W. Since the file is so big, I tried to read it with networkx, building a directed graph and then converting it to undirected, but it took too much time. So I was thinking of doing the same thing by analysing a pandas dataframe instead. I would like to turn the previous dataframe into the form:
User1 User2 W
0 11 12 3
1 13 14 3
where links that appear in both directions have been merged into one, with W being the sum of the individual weights. Any help would be appreciated.
There is probably a more concise way, but this works. The main trick is just to normalize the data such that User1 is always the lower number ID. Then you can use groupby since 11,12 and 12,11 are now recognized as representing the same thing.
In [330]: df = pd.DataFrame({"User1":[11,12,13,14],"User2":[12,11,14,13],"W":[1,2,1,2]})
In [331]: df['U1'] = df[['User1','User2']].min(axis=1)
In [332]: df['U2'] = df[['User1','User2']].max(axis=1)
In [333]: df = df.drop(['User1','User2'],axis=1)
In [334]: df.groupby(['U1','U2'])['W'].sum()
Out[334]:
U1 U2
11 12 3
13 14 3
Name: W, dtype: int64
For more concise code that avoids creating new variables, you could replace the middle 3 steps with:
In [400]: df.loc[df.User1>df.User2,['User1','User2']] = df.loc[df.User1>df.User2,['User2','User1']].values
Note that column switching can be trickier than you'd think, see here: What is correct syntax to swap column values for selected rows in a pandas data frame using just one line?
As far as making this code fast in general, it will depend on your data, and I don't think the code above will matter as much as other things you might do. For example, your problem should be amenable to a chunking approach, where you iterate over sections of the file and gradually shrink the data on each pass (a rough sketch follows below). In that case, the main thing to think about is sorting the data before chunking, so as to minimize the number of passes you need. Doing it that way should allow you to do all the work in memory.
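As an illustration of that chunked approach (the file name, separator and column layout are assumptions about your data; pre-aggregating inside each chunk keeps the working set small):
import pandas as pd

chunks = pd.read_csv('edges.txt.gz', sep=r'\s+', compression='gzip',
                     names=['User1', 'User2', 'W'], chunksize=1_000_000)

partial = []
for chunk in chunks:
    # Normalize so the smaller ID is always in User1, then pre-aggregate
    # the duplicate pairs within this chunk.
    swap = chunk.User1 > chunk.User2
    chunk.loc[swap, ['User1', 'User2']] = chunk.loc[swap, ['User2', 'User1']].values
    partial.append(chunk.groupby(['User1', 'User2'], as_index=False)['W'].sum())

# Combine the per-chunk results and aggregate once more across chunks.
result = (pd.concat(partial)
            .groupby(['User1', 'User2'], as_index=False)['W'].sum())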

Counting non zeros in only 1 column of a numpy array

I have a Numpy array that is created as follows
data=np.zeros(500,dtype='float32, (50000,2)float32')
This array is filled with values that I acquire from some measurements, and is supposed to reflect that at each time point (with room for 500 time points) we can acquire 50,000 x- and y-coords.
Later in my code I use a bisect-like search, for which I need to know how many x-coords (measurement points) are actually in my array. I originally did this with np.count_nonzero(data), which yielded the following problem:
Fake data:
1 1
2 2
3 0
4 4
5 0
6 6
7 7
8 8
9 9
10 10
The non-zero count returns 18 here; the code then goes into the bisect-like search using data[time][1][0][0] as the min x-coord and data[time][1][np.count_nonzero(data)][0] as the max x-coord, which results in the search stopping at 9 instead of 10.
I could use a while loop to manually count the non-zero values in the x-coord column of the array, but that would be silly; I assume there is some built-in numpy functionality for this. My question is what built-in functionality or modification of my np.count_nonzero(data) call I need, since the documentation doesn't offer much information in that regard (link to numpy doc).
-- Simplified question --
Can I use Numpy functionality to count the non-zero values for a singular column only? (i.e. between data[time][1][0][0] and data[time][1][max][0] )
Maybe a better approach would be to filter the array using nonzero and iterate over the result:
nonZeroData = data[np.nonzero(data[time][1])]
To count the non-zero values only in the second column:
nonZeroYCount = np.count_nonzero(data[time][1][:, 1])
If I understand you correctly, to select elements from data[time][1][0][0] to data[time][1][max][0]:
data[time][1][:max+1,0]
EDIT:
To count all non-zero x-coords for every time point:
(data["f1"][:,:,0] != 0).sum(1)
Why not consider using data != 0 to get the bool matrix?
You can use:
stat = sum(data != 0) to count the non-zero entries.
I am not sure what shape your data array has, but I hope you can see what I mean. :)
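To tie both answers back to the structured array from the question, here is a small self-contained check (shapes shrunk so it runs instantly; the sample values are made up):
import numpy as np

data = np.zeros(5, dtype='float32, (10,2)float32')
time = 0
data[time][1][:4, 0] = [1, 2, 0, 4]   # x-coords, one of them zero
data[time][1][:4, 1] = [1, 0, 3, 4]   # y-coords

n_x = np.count_nonzero(data[time][1][:, 0])     # non-zero x-coords at this time -> 3
per_time_x = (data['f1'][:, :, 0] != 0).sum(1)  # the same count for every time point
print(n_x, per_time_x)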
