I have a dataframe with points on a 2-dimensional plane:
index x y
0 0 -0.032836 49.268820
1 0 4.160005 49.268820
2 0 4.105928 68.330440
3 0 -0.062953 68.342125
4 1 4.166139 49.269398
5 1 8.497650 49.278310
6 1 8.592334 68.336560
7 1 4.041361 68.336560
8 2 8.426349 49.278890
9 2 13.480260 49.278890
10 2 13.446286 68.336560
11 2 8.467557 68.336560
12 3 13.438516 49.278374
13 3 17.356792 49.287285
14 3 17.378400 68.338240
15 3 13.382163 68.333786
16 4 17.295988 49.289800
17 4 21.418156 49.289800
18 4 21.336264 67.359630
19 4 17.313816 67.359630
and I've been trying to find a way to draw lines between the (x,y) coordinates for each index. The resulting plot should be closed rectangles.
Now, I've tried to approach this by defining series:
x = df['x']
y = df['y']
and then
index_l = df.index.tolist()
for i in index_l:
plt.plot([df.x[i],df.y[i]])
This doesn't work at all. Any idea on how to proceed. A note: ideally, I would like to have a rectangle, but if doing this by even connecting diagonally is easier, I can live with it.
Thankful for any hints or solutions.
You can group by the index and then for x, y values of each group, append the first row to the end so that plt.plot plots a closed rectangle:
for idx, points in df.groupby("index")[["x", "y"]]:
points_to_plot = points.append(points.iloc[0])
plt.plot(points_to_plot.x, points_to_plot.y)
to get this plot
Related
I have an 120x70 matrix of which I want to graph diagonal lines.
for ease of typing here, I will explain my problem with a smaller 4x4 matrix.
index
2020
2021
2022
2023
0
1
2
5
7
1
3
5
8
10
0
1
2
5
3
1
3
5
8
4
I now want to graph for example starting at 2021 index 0
so that I get the following diagonal numbers in a graphs: 2, 8, 10
or if I started at 2020 I would get 1, 5, 5, 4.
Kind regards!
You can do this with a simple for-loop. e.g.:
matrix = np.array((120, 70))
graph_points = []
column_index = 0 # Change this to whatever column you want to start at
for i in range(matrix.shape[0]):
graph_points.append(matrix[i, column_index])
column_index += 1
if column_index >= matrix.shape[1]:
break
## Plot graph_points here
I have the following dataframe (it is actually several hundred MB long):
X Y Size
0 10 20 5
1 11 21 2
2 9 35 1
3 8 7 7
4 9 19 2
I want discard any X, Y point that has an euclidean distance from any other X, Y point in the dataframe of less than delta=3. In those cases I want to keep only the row with the bigger size.
In this example the intended result would be:
X Y Size
0 10 20 5
2 9 35 1
3 8 7 7
As the question is stated, the behavior of the desired algorithm is not clear about how to deal with the chaining of distances.
If chaining is allowed, one solution is to cluster the dataset using a density-based clustering algorithm such as DBSCAN.
You just need to set the neighboorhood radius epsto delta and the min_sample parameter to 1 to allow isolated points as clusters. Then, you can find in each group which point has the maximum size.
from sklearn.cluster import DBSCAN
X = df[['X', 'Y']]
db = DBSCAN(eps=3, min_samples=1).fit(X)
df['grp'] = db.labels_
df_new = df.loc[df.groupby('grp').idxmax()['Size']]
print(df_new)
>>>
X Y Size grp
0 10 20 5 0
2 9 35 1 1
3 8 7 7 2
You can use below script and also try improving it.
#get all euclidean distances using sklearn;
#it will create an array of euc distances;
#then get index from df whose euclidean distance is less than 3
from sklearn.metrics.pairwise import euclidean_distances
Z = df[['X', 'Y']]
euc = euclidean_distances(Z, Z)
idx = [(i, j) for i in range(len(euc)-1) for j in range(i+1, len(euc)) if euc[i, j] < 3]
# collect all index of df that has euc dist < 3 and get the max value
# then collect all index in df NOT in euc and add the row with max size
# create a new called df_new by combining the rest in df and row with max size
from itertools import chain
df_idx = list(set(chain(*idx)))
df2 = df.iloc[df_idx]
idx_max = df2[df2['Size'] == df2['Size'].max()].index.tolist()
df_new = pd.concat([df.iloc[~df.index.isin(df_idx)], df2.iloc[idx_max]])
df_new
Result:
X Y Size
2 9 35 1
3 8 7 7
0 10 20 5
I have a matrix as shown below (taken from a txt file with an argument), and every cell has neighbors. Once you pick a cell, that cell and all neighboring cells that containing the same number will disappear.
1 0 4 7 6 8
0 5 4 4 5 5
2 1 4 4 4 6
4 1 3 7 4 4
I've tried to do this with using recursion. I separated function four parts which are up(), down() , left() and right(). But I got an error message: RecursionError: maximum recursion depth exceeded in comparison
cmd=input("Row,column:")
cmdlist=command.split(",")
row,column=int(cmdlist[0]),int(cmdlist[1])
num=lines[row-1][column-1]
def up(x,y):
if lines[x-2][y-1]==num and x>1:
left(x,y)
right(x,y)
lines[x-2][y-1]=None
def left(x,y):
if lines[x-1][y-2]==num and y>1:
up(x,y)
down(x,y)
lines[x-1][y-2]=None
def right(x,y):
if lines[x-1][y]==num and y<len(lines[row-1]):
up(x,y)
down(x,y)
lines[x-1][y]=None
def down(x,y):
if lines[x][y-1]==num and x<len(lines):
left(x,y)
right(x,y)
lines[x][y-1]=None
up(row,column)
down(row,column)
for i in lines:
print(str(i).strip("[]").replace(",","").replace("None"," "))
When I give the input (3,3) which represents the number of "4", the output must be like this:
1 0 7 6 8
0 5 5 5
2 1 6
4 1 3 7
I don't need fixed code, just the main idea will be enough. Thanks a lot.
Recursion error happens when your recursion does not terminate.
You can solve this without recursing using set's of indexes:
search all indexes that contain the looked for number into all_num_idx
add the index you are currently at (your input) to a set tbd (to be deleted)
loop over the tbd and add all indexed from all_num_idx that differ only in -1/+1 in row or col to any index thats already in the set
do until tbd does no longer grow
delete all indexes from tbd:
t = """4 0 4 7 6 8
0 5 4 4 5 5
2 1 4 4 4 6
4 1 3 7 4 4"""
data = [k.strip().split() for k in t.splitlines()]
row,column=map(int,input("Row,column:").strip().split(";"))
num = data[row][column]
len_r =len(data)
len_c = len(data[0])
all_num_idx = set((r,c) for r in range(len_r) for c in range(len_c) if data[r][c]==num)
tbd = set( [ (row,column)] ) # inital field
tbd_size = 0 # different size to enter while
done = set() # we processed those already
while len(tbd) != tbd_size: # loop while growing
tbd_size=len(tbd)
for t in tbd:
if t in done:
continue
# only 4-piece neighbourhood +1 or -1 in one direction
poss_neighbours = set( [(t[0]+1,t[1]), (t[0],t[1]+1),
(t[0]-1,t[1]), (t[0],t[1]-1)] )
# 8-way neighbourhood with diagonals
# poss_neighbours = set((t[0]+a,t[1]+b) for a in range(-1,2) for b in range(-1,2))
tbd = tbd.union( poss_neighbours & all_num_idx)
# reduce all_num_idx by all those that we already addded
all_num_idx -= tbd
done.add(t)
# delete the indexes we collected
for r,c in tbd:
data[r][c]=None
# output
for line in data:
print(*(c or " " for c in line) , sep=" ")
Output:
Row,column: 3,4
4 0 7 6 8
0 5 5 5
2 1 6
4 1 3 7
This is a variant of a "flood-fill-algorythm" flooding only cells of a certain value. See https://en.wikipedia.org/wiki/Flood_fill
Maybe you should replace
def right(x,y):
if lines[x-1][y]==num and y<len(lines[row-1]):
up(x,y)
down(x,y)
lines[x-1][y]=None
by
def right(x,y):
if lines[x-1][y]==num and y<len(lines[row-1]):
lines[x-1][y]=None
up(x - 1,y)
down(x - 1,y)
right(x - 1, y)
and do the same for all the other functions.
Putting lines[x-1][y]=None ensure that your algorithm stops and changing the indices ensure that the next step of your algorithm will start from the neighbouring cell.
With lengthy column names, DataFrames will display in a very messy form seemingly no matter what options are set.
Info: I'm in Jupyter QtConsole, pandas 0.20.1, with the following relevant options specified at startup:
pd.set_option('display.max_colwidth', 20)
pd.set_option('expand_frame_repr', False)
pd.set_option('display.max_rows', 25)
Question: how can I truncate the DataFrame if necessary rather than wrapping the columns to the next line, while keeping expand_frame_repr=False?
Here's an example. Again, the issue doesn't depend on the number of columns but length of the columns.
This will not cause an issue:
df = pd.DataFrame(np.random.randn(1000, 1000),
columns=['col' + str(i) for i in range(1000)])
As the output is perfectly readable and looks like:
The same DataFrame with long column names causes the issue I'm talking about:
df = pd.DataFrame(np.random.randn(1000, 1000),
columns=['very_long_col_name_'
+ str(i) for i in range(1000)])
Is there any way to conform the second output to be like the first that I'm missing? (Through specifying an option, not through using .iloc every time I want to view.)
Use max_columns
from string import ascii_letters
df = pd.DataFrame(np.random.randint(10, size=(5, 52)), columns=list(ascii_letters))
with pd.option_context(
'display.max_colwidth', 20,
'expand_frame_repr', False,
'display.max_rows', 25,
'display.max_columns', 5,
):
print(df.add_prefix('really_long_column_name_'))
really_long_column_name_a really_long_column_name_b ... really_long_column_name_Y really_long_column_name_Z
0 8 1 ... 1 9
1 8 5 ... 2 1
2 5 0 ... 9 9
3 6 8 ... 0 9
4 1 2 ... 7 1
[5 rows x 52 columns]
Another idea... Obviously not exactly what you want, but maybe you can twist it to your needs.
d1 = df.add_suffix('_really_long_column_name')
with pd.option_context('display.max_colwidth', 4, 'expand_frame_repr', False):
mw = pd.get_option('display.max_colwidth')
print(d1.rename(columns=lambda x: x[:mw-3] + '...' if len(x) > mw else x))
a... b... c... d... e... f... g... h... i... j... ... Q... R... S... T... U... V... W... X... Y... Z...
0 6 5 5 5 8 3 5 0 7 6 ... 9 0 6 9 6 8 4 0 6 7
1 0 5 4 7 2 5 4 3 8 7 ... 8 1 5 3 5 9 4 5 5 3
2 7 2 1 6 5 1 0 1 3 1 ... 6 7 0 9 9 5 2 8 2 2
3 1 8 7 1 4 5 5 8 8 3 ... 3 6 5 7 1 0 8 1 4 0
4 7 5 6 2 4 9 7 9 0 5 ... 6 8 1 6 3 5 4 2 3 2
Looks like it will need an enhancement. The relevant code in the repr function appears to be here:
max_rows = get_option("display.max_rows")
max_cols = get_option("display.max_columns")
show_dimensions = get_option("display.show_dimensions")
if get_option("display.expand_frame_repr"):
width, _ = console.get_console_size()
else:
width = None
self.to_string(buf=buf, max_rows=max_rows, max_cols=max_cols,
line_width=width, show_dimensions=show_dimensions)
So either you pass expand_frame_repr=True and it wraps on the line width, or you pass expand_frame_repr=False and it shouldn't. But it looks like there is a bug in the code (this should be pandas 0.20.3 iirc):
in pd.io.formats.format.DataFrameFormatter:
def _chk_truncate(self):
"""
Checks whether the frame should be truncated. If so, slices
the frame up.
"""
from pandas.core.reshape.concat import concat
# Column of which first element is used to determine width of a dot col
self.tr_size_col = -1
# Cut the data to the information actually printed
max_cols = self.max_cols
max_rows = self.max_rows
if max_cols == 0 or max_rows == 0: # assume we are in the terminal
# (why else = 0)
(w, h) = get_terminal_size()
self.w = w
self.h = h
if self.max_rows == 0:
dot_row = 1
prompt_row = 1
if self.show_dimensions:
show_dimension_rows = 3
n_add_rows = (self.header + dot_row + show_dimension_rows +
prompt_row)
# rows available to fill with actual data
max_rows_adj = self.h - n_add_rows
self.max_rows_adj = max_rows_adj
# Format only rows and columns that could potentially fit the
# screen
if max_cols == 0 and len(self.frame.columns) > w:
max_cols = w
if max_rows == 0 and len(self.frame) > h:
max_rows = h
Looks like it intended to do what you wanted, but was unfinished. It's checking max_cols against the number of columns, not the total width of the columns.
So you could either create a show_df function that would calculate the correct number of columns and show it in an option_context like pi2Squared's answer, or fix it here (and maybe submit a patch if you need it distributed).
As others have pointed out, Pandas itself seems to be bugged or badly designed here, so a workaround is required.
Most of the time this problem occurs with numerical columns, since numbers are relatively short. Pandas will split the column heading onto multiple lines if there are spaces in it, so you can "hack in" the correct behavior by inserting spaces into column headings for numerical columns when you display the dataframe. I have a one-liner to do this:
def colfix(df, L=5): return df.rename(columns=lambda x: ' '.join(x.replace('_', ' ')[i:i+L] for i in range(0,len(x),L)) if df[x].dtype in ['float64','int64'] else x )
do display your dataframe, simply type
colfix(your_df)
note that the renaming is not going to permanently change the dataframe, it will only add spaces to the names for the purposes of displaying it that one time.
Results (in a Jupyter Notebook):
With colfix:
Without:
I have the following output containing two columns (line# and ID):
1 Q50331
2 P75247
3 P75544
4 P22446
5 P78027
6 P75271
7 P75176
8 P0ABB4
9 P63284
10 P0A6M8
11 P0AES4
12 P39452
13 P0A8T7
14 P0A698
How can I make the ID values of second column align at the top of each other, like the following:
1 Q50331
2 P75247
3 P75544
4 P22446
5 P78027
6 P75271
7 P75176
8 P0ABB4
9 P63284
10 P0A6M8
11 P0AES4
12 P39452
13 P0A8T7
14 P0A698
The problem I am facing is how to incorporate the solution to my code. I tried to use python tabulate, but found this is not working properly since what I am printing: row[0] is a unicode from the tuple row (See the following code).
count = 0
for row in c:
count += 1
print count, row[0]
Any idea how can I incorporate tabulate or other methods to align the unicode-type values in the column?
Use alignment specifiers:
data = {
1:'Q50331',
2:'P75247',
3:'P75544',
4:'P22446',
5:'P78027',
6:'P75271',
7:'P75176',
8:'P0ABB4',
9:'P63284',
10:'P0A6M8',
11:'P0AES4',
12:'P39452',
13:'P0A8T7',
14:'P0A698',
333:'P00bar'
}
length = len(str(max(data.keys())))+1
for k,v in data.items():
print "{:<{}}{}".format(k, length, v)
Output:
1 Q50331
2 P75247
3 P75544
4 P22446
5 P78027
6 P75271
7 P75176
8 P0ABB4
9 P63284
10 P0A6M8
11 P0AES4
12 P39452
13 P0A8T7
14 P0A698
333 P00bar
I've created length which will contain the length of the max value from data keys, +1. Then I pass that length value to my alignment specifier, which in this case is 4:
{:<4}{}