I have a DataFrame of temperature data with 1000s of rows (time series data) and 40 columns (40 points in a catchment). Entries in this DataFrame are zeros and ones (1 means an active part of the catchment, 0 means a non-active part). I want to place the number of intersected values in a separate column (named inter) in the same DataFrame.
I expect the output to look like this [attached image]:
The value in the first row of inter should be 0, since all entries are zero and no part is active on day 1.
The value in the 2nd row of inter should be 4, since four parts are active on day 2.
The value in the 3rd row of inter should be 3 (the number of intersected values across all rows above, including the 3rd row). Green boxes in the image show the values for the 3rd row.
The value in the 4th row of inter should be the number of intersected values across all rows above (the yellow shaded area in the image).
Similarly, blue boxes show the values for the 5th row, red boxes the values for the 6th row, and so on.
Note: for every row I will count the intersection with all rows above it.
I deserve a reward for this :)
Here is your answer:
import pandas as pd
import numpy as np

# setup test data
data = {'0': [0, 0, 0, 1, 0], '1': [0, 0, 1, 0, 1], '2': [0, 0, 0, 1, 0],
        '3': [0, 0, 1, 1, 1], '4': [0, 1, 1, 1, 0], '5': [0, 0, 0, 0, 1],
        '6': [0, 1, 1, 1, 0], '7': [0, 0, 1, 0, 1], '8': [0, 1, 0, 1, 0],
        '9': [0, 1, 1, 0, 0], '10': [0, 0, 1, 0, 0], '11': [0, 0, 0, 1, 1],
        '12': [0, 0, 0, 1, 1]}
data = pd.DataFrame(data=data)

# collect inter data
inter_data = []
for main_index, main_row in data.iterrows():
    # select data for calculations
    selected_data = data.loc[0:main_index, :]
    # handle rows with only 0 values
    if 1 not in main_row.values:
        inter_data.append(0)
    else:
        # handle the second row
        if selected_data.shape[0] == 2:
            inter_data.append(selected_data[1:2].values[0].sum())
        # handle the rest of the data
        else:
            # drop the last row from the selected data
            selected_data = selected_data[:-1]
            # sum the selected data column-wise
            summed_data = 0
            for index, row in selected_data.iterrows():
                summed_data += row.values
            # get the positions of 1s in the current row
            positions = np.where(main_row.values == 1)
            # get the summed data at those positions
            positions_data = summed_data[positions[0]]
            # count columns that were active at least once before
            inter_data.append((positions_data >= 1).sum())
# add inter data to the raw data
data['inter'] = pd.DataFrame(inter_data)
Output:
   0  1  2  3  4  5  6  7  8  9  10  11  12  inter
0  0  0  0  0  0  0  0  0  0  0   0   0   0      0
1  0  0  0  0  1  0  1  0  1  1   0   0   0      4
2  0  1  0  1  1  0  1  1  0  1   1   0   0      3
3  1  0  1  1  1  0  1  0  1  0   0   1   1      4
4  0  1  0  1  0  1  0  1  0  0   0   1   1      5
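For larger frames, the row-by-row loop above can get slow. As a vectorized sketch of the same logic (my own alternative, not part of the original answer; inter_fast is a hypothetical column name), you can build the column-wise activity of all rows above each row with a cumulative sum and intersect it with the current row, special-casing the second row as the question's spec requires:

import numpy as np

arr = data.drop(columns='inter').values
# activity of all rows above each row: cumulative sum minus the row itself
above = np.cumsum(arr, axis=0) - arr
# count columns that are active now and were active at least once before
inter = ((arr == 1) & (above >= 1)).sum(axis=1)
if len(inter) > 1:
    inter[1] = arr[1].sum()  # per the spec, day 2 counts its own active parts
data['inter_fast'] = inter  # hypothetical column, for comparison with 'inter'

On the test data above this reproduces the inter column exactly.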
This is the dictionary I have:
docs = {'computer': {'1': 1, '3': 5, '8': 2},
'politics': {'0': 2, '1': 2, '3': 1}}
I want to create a 2 * 9 tensor like this:
[
[0, 1, 0, 5, 0, 0, 0, 0, 2],
[2, 2, 0, 1, 0, 0, 0, 0, 0, 0]
]
Here, because the max key is 8, we have 9 columns. The number of rows and columns can increase based on the dictionary.
I have tried to implement this using a for-loop, but as the dictionary is big it is not efficient at all, and it also produces a plain list while I need a tensor.
maxr = 0
for i, val in docs.items():
    for j in val.keys():
        if int(j) > int(maxr):
            maxr = int(j)
final_lst = []
for val in docs.values():
    lst = [0] * (maxr + 1)
    for j, val2 in sorted(val.items()):
        lst[int(j)] = val2
    final_lst.append(lst)
print(final_lst)
If you are ok with using pandas and numpy, here's how you can do it.
import pandas as pd
import numpy as np
# Creates a dataframe with keys as index and values as cell values.
df = pd.DataFrame(docs)
# Create a new index covering the full range of the dictionary keys.
# Note the + 1: np.arange excludes its stop value, and we need index '8'.
new_index = np.arange(int(df.index.min()),
                      int(df.index.max()) + 1).astype(str)
# Apply the new index, fill the NaN values with 0, and take a transpose of the dataframe.
new_df = df.reindex(new_index).fillna(0).T.astype(int)

#           0  1  2  3  4  5  6  7  8
# computer  0  1  0  5  0  0  0  0  2
# politics  2  2  0  1  0  0  0  0  0
If you just want the array, you can call array = new_df.values.
#[[0 1 0 5 0 0 0 0 2]
# [2 2 0 1 0 0 0 0 0]]
If you want a tensor, you can use tf.convert_to_tensor(new_df.values) (after import tensorflow as tf).
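If you would rather skip pandas entirely, here is a plain-NumPy sketch (my addition, assuming all keys are non-negative integer strings): preallocate a zero array sized by the largest key and fill it from the dictionary.

import numpy as np

keys = list(docs)  # e.g. ['computer', 'politics']
max_col = max(int(j) for inner in docs.values() for j in inner)
arr = np.zeros((len(keys), max_col + 1), dtype=int)
for i, key in enumerate(keys):
    for col, value in docs[key].items():
        arr[i, int(col)] = value
# arr can then be passed to tf.convert_to_tensor as above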
I have a dataframe with results from a survey where there were options A-E and it was possible to select more than one option - a selection could be 'A' or 'A;C;D', etc.
I will be using the data for some machine learning and am looking to run it through OneHotEncoder to end up with the 5 columns with 1's and 0's.
An example of my initial survey data is :
survey_data = pd.DataFrame({'Q1': ['A','B','C','A;D', 'D;E', 'F']})
I initially tried LabelEncoder but obviously ended up with a lot of features (rather than just the A-E).
You can also use MultiLabelBinarizer for this:
from sklearn.preprocessing import MultiLabelBinarizer

inputX = [element.split(';') for element in survey_data['Q1']]
mlb = MultiLabelBinarizer()
transformedX = mlb.fit_transform(inputX)
#Out: transformedX
#array([[1, 0, 0, 0, 0, 0],
#       [0, 1, 0, 0, 0, 0],
#       [0, 0, 1, 0, 0, 0],
#       [1, 0, 0, 1, 0, 0],
#       [0, 0, 0, 1, 1, 0],
#       [0, 0, 0, 0, 0, 1]])
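The column order is available as mlb.classes_ (here array(['A', 'B', 'C', 'D', 'E', 'F'])), so you can wrap the result back into a labeled DataFrame if you want:

encoded = pd.DataFrame(transformedX, columns=mlb.classes_)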
Here's one approach, using get_dummies:
import pandas as pd
# example data provided by OP
survey_data = pd.DataFrame({'Q1': ['A','B','C','A;D', 'D;E', 'F']})
# split out rows with multiple chosen options into columns
tmp = survey_data.Q1.str.split(';').apply(pd.Series)
# one-hot encode columns with get_dummies, then overlay into one df
df = (pd.get_dummies(tmp[0])
        .add(pd.get_dummies(tmp[1]), fill_value=0)
        .astype(int))
print(df)
A B C D E F
0 1 0 0 0 0 0
1 0 1 0 0 0 0
2 0 0 1 0 0 0
3 1 0 0 1 0 0
4 0 0 0 1 1 0
5 0 0 0 0 0 1
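Note that the .add chain above only covers answers with at most two selections (tmp[0] and tmp[1]). For reference, pandas also ships Series.str.get_dummies, which handles any number of selections per row in one call (a simplification not in the original answers):

df = survey_data['Q1'].str.get_dummies(sep=';')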
Say I have a scene parsing map for an image, where each pixel indicates which object it belongs to. Now I want to get the bounding box of each object; how can I implement this in Python?
For a detailed example, say I have a scene parsing map like this:
0 0 0 0 0 0 0
0 1 1 0 0 0 0
1 1 1 1 0 0 0
0 0 1 1 1 0 0
0 0 1 1 1 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
So the bounding box is:
0 0 0 0 0 0 0
1 1 1 1 1 0 0
1 0 0 0 1 0 0
1 0 0 0 1 0 0
1 1 1 1 1 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
Actually, in my task, just knowing the width and height of each object is enough.
A basic idea is to search for the four edges in the scene parsing map, from the top, bottom, left, and right directions. But there might be a lot of small objects in the image, so this way is not time efficient.
A second way is to compute the coordinates of all non-zero elements and find the max/min x/y, then calculate the width and height using these x and y.
Is there any other more efficient way to do this? Thx.
If you are processing images, you can use scipy's ndimage library.
If there is only one object in the image, you can get the measurements with scipy.ndimage.measurements.find_objects (http://docs.scipy.org/doc/scipy-0.16.1/reference/generated/scipy.ndimage.measurements.find_objects.html):
import numpy as np
from scipy import ndimage

a = np.array([[0, 0, 0, 0, 0, 0, 0],
              [0, 1, 1, 0, 0, 0, 0],
              [1, 1, 1, 1, 0, 0, 0],
              [0, 0, 1, 1, 1, 0, 0],
              [0, 0, 1, 1, 1, 0, 0],
              [0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0]])

# Find the location of all objects
objs = ndimage.find_objects(a)
# Get the height and width
height = int(objs[0][0].stop - objs[0][0].start)
width = int(objs[0][1].stop - objs[0][1].start)
If there are many objects in the image, you first have to label each object and then get the measurements:
import numpy as np
from scipy import ndimage

a = np.array([[0, 0, 0, 0, 0, 0, 0],
              [0, 1, 1, 0, 0, 0, 0],
              [1, 1, 1, 1, 0, 0, 0],
              [0, 0, 1, 1, 1, 0, 0],
              [0, 0, 1, 1, 1, 0, 0],
              [0, 0, 0, 0, 0, 0, 0],
              [0, 0, 1, 1, 1, 0, 0]])  # Second object here

# Label objects
labeled_image, num_features = ndimage.label(a)
# Find the location of all objects
objs = ndimage.find_objects(labeled_image)
# Get the height and width
measurements = []
for ob in objs:
    measurements.append((int(ob[0].stop - ob[0].start), int(ob[1].stop - ob[1].start)))
If you check ndimage.measurements, you can get more measurements: center of mass, area...
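For instance, the center of mass and area (pixel count) of each labeled object can be read off directly (a small illustration using the labeled_image and a from above):

# index 1..num_features selects each labeled object in turn
centers = ndimage.center_of_mass(a, labeled_image, range(1, num_features + 1))
areas = ndimage.sum(a, labeled_image, range(1, num_features + 1))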
Using NumPy:
import numpy as np

arr = a  # the scene parsing map: a 2D array of 0s and 1s, e.g. the array a above
ind = np.nonzero(arr.any(axis=0))[0]  # indices of non-empty columns
width = ind[-1] - ind[0] + 1
ind = np.nonzero(arr.any(axis=1))[0]  # indices of non-empty rows
height = ind[-1] - ind[0] + 1
A bit more explanation:
arr.any(axis=0) gives a boolean array telling you whether each column is empty (False) or not (True). np.nonzero(arr.any(axis=0))[0] then extracts the indices of the True entries from that array. ind[0] is the first element of that array, hence the leftmost non-empty column, and ind[-1] is the last element, hence the rightmost non-empty column. The difference then gives the width, give or take 1 depending on whether you include the borders or not.
Similar logic gives the height, but on the other axis.
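As a quick check (my addition), running this on the first example array a from the scipy answer gives width = 5 and height = 4, matching the bounding box drawn in the question:

ind = np.nonzero(a.any(axis=0))[0]  # columns 0-4 contain ones
width = ind[-1] - ind[0] + 1        # 5
ind = np.nonzero(a.any(axis=1))[0]  # rows 1-4 contain ones
height = ind[-1] - ind[0] + 1       # 4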
I need to write code to read a .txt file, which contains a matrix displayed as below, and turn it into a new integer list matrix. However, I want to skip the first line of this .txt file without manually deleting it, and I do not know how to do that.
I have written some code. It is able to display the matrix, but I am unable to get rid of the first line:
def display_matrix(a_matrix):
    for row in a_matrix:
        print(row)
    return a_matrix

def numerical_form_of(a_list):
    return [int(a_list[i]) for i in range(len(a_list))]

def get_scoring_matrix():
    scoring_file = open("Scoring Matrix")
    row_num = 0
    while row_num <= NUMBER_OF_FRAGMENTS:  # NUMBER_OF_FRAGMENTS is defined elsewhere
        content_of_line = scoring_file.readline()
        content_list = content_of_line.split(' ')
        numerical_form = numerical_form_of(content_list[1:])
        scoring_matrix = []
        scoring_matrix.append(numerical_form)
        row_num += 1
        #print(scoring_matrix)
        display_matrix(scoring_matrix)
        # (Complement): row_num = NUMBER_OF_FRAGMENTS
    return scoring_matrix

get_scoring_matrix()
Scoring Matrix is a .txt file:
1 2 3 4 5 6 7
1 0 1 1 1 1 1 1
2 0 0 1 1 1 1 1
3 0 0 0 1 1 1 1
4 0 0 0 0 1 1 1
5 0 0 0 0 0 1 1
6 0 0 0 0 0 0 1
7 0 0 0 0 0 0 0
The result of my code:
[1, 2, 3, 4, 5, 6, 7]
[0, 1, 1, 1, 1, 1, 1]
[0, 0, 1, 1, 1, 1, 1]
[0, 0, 0, 1, 1, 1, 1]
[0, 0, 0, 0, 1, 1, 1]
[0, 0, 0, 0, 0, 1, 1]
[0, 0, 0, 0, 0, 0, 1]
[0, 0, 0, 0, 0, 0, 0]
Just put a scoring_file.readline() before the while loop.
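Applied to the question's function, that looks like this (a sketch of just the suggested change; note you may also want to adjust the loop bound, since one fewer line remains to be read):

def get_scoring_matrix():
    scoring_file = open("Scoring Matrix")
    scoring_file.readline()  # read and discard the header line
    row_num = 0
    while row_num <= NUMBER_OF_FRAGMENTS:
        ...  # rest of the loop unchanged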
I suggest using an automated tool:
import pandas
df = pandas.read_table("Scoring Matrix", delim_whitespace = True)
If you insist on doing it yourself, change the while loop:
while row_num <= NUMBER_OF_FRAGMENTS:
    content_of_line = scoring_file.readline()
    if row_num == 0:
        content_of_line = scoring_file.readline()
In my task, I represent a concave polygon as a matrix of ones and zeros, where one means that the given point belongs to the polygon. For instance, the following are a simple square and a u-shaped polygon:
0 0 0 0 0 0 0 0 0 0 0
0 1 1 0 0 1 1 0 0 1 1
0 1 1 0 0 1 1 1 1 1 1
0 0 0 0 0 1 1 1 1 1 1
However, sometimes I get an incomplete representation, in which: (1) all boundary points are included, and (2) some internal points are missing. For example, in the following enlarged version of the u-shaped polygon, the elements at positions (1,1), (1,6), (3,1), ..., (3,6)* are "unfilled". The goal is to fill them (i.e., change their value to 1).
1 1 1 0 0 1 1 1
1 0 1 0 0 1 0 1
1 1 1 1 1 1 0 1
1 0 0 0 0 0 0 1
1 1 1 1 1 1 1 1
Do you know if there's an easy way to do this in Python/NumPy?
*(row, column), starting counting from the top left corner
This is a very well-known problem in image processing that can be solved using morphological operators.
With that, you can use scipy's binary_fill_holes to fill the holes in your mask:
>>> import numpy as np
>>> from scipy.ndimage import binary_fill_holes
>>> data = np.array([[1, 1, 1, 0, 0, 1, 1, 1],
...                  [1, 0, 1, 0, 0, 1, 0, 1],
...                  [1, 1, 1, 1, 1, 1, 0, 1],
...                  [1, 0, 0, 0, 0, 0, 0, 1],
...                  [1, 1, 1, 1, 1, 1, 1, 1]])
>>> filled = binary_fill_holes(data).astype(int)
>>> filled
array([[1, 1, 1, 0, 0, 1, 1, 1],
       [1, 1, 1, 0, 0, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1]])
I do not believe a general-purpose solution for this exists in Python as such. This is a classic breadth-first graph search: for each 0, either there exists a path of adjacent zeros such that at least one of them lies at a position (y, x) on the border (x = 0, y = 0, x = maxx, or y = maxy), or that 0 should be changed to 1.
Maybe an answer here will be helpful to you: How to trace the path in a Breadth-First Search?
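For completeness, here is a minimal sketch of that border-flood-fill idea (my own illustration of the described approach, not code from the answer): zeros reachable from the border stay 0, everything else becomes 1.

from collections import deque
import numpy as np

def fill_holes_bfs(grid):
    # BFS from every zero on the border; zeros it cannot reach are holes
    grid = np.asarray(grid)
    h, w = grid.shape
    outside = np.zeros((h, w), dtype=bool)
    queue = deque((y, x) for y in range(h) for x in range(w)
                  if grid[y, x] == 0 and (y in (0, h - 1) or x in (0, w - 1)))
    for y, x in queue:
        outside[y, x] = True
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and grid[ny, nx] == 0 and not outside[ny, nx]:
                outside[ny, nx] = True
                queue.append((ny, nx))
    return np.where(outside, 0, 1)

On the 5x8 example above this produces the same result as binary_fill_holes.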