Good Morning,
I have two arrays of the same length, over 21 rows each. I would like to find matches between df and df1, then skip to the next 3 rows and find matches there, and so on. Then I would like to return the IDs and the matching values, and save the result to an Excel sheet.
Here's an example below:
df= np.array([[[100,1,2,4,5,6,8],
[87,1,6,20,22,23,34],
[99,1,12,13,34,45,46]],
[[64,1,10,14,29,32,33],
[55,1,22,13,23,33,35],
[66,1,6,7,8,9,10]],
[[77,1,2,3,5,6,8],
[811,1,2,5,6,8,10],
[118,1,7,8,22,44,56]]])
df1= np.array([[[89,2,4,6,8,10,12],
[81,2,7,25,28,33,54],
[91,2,13,17,33,45,48]],
[[66,2,20,24,25,36,38],
[56,1,24,33,43,53,65],
[78,3,14,15,18,19,20]],
[[120,4,10,23,25,26,38],
[59,5,22,35,36,38,40],
[125,1,17,18,32,44,56]]])
I would like to find matches as follows:
100 to 89,81,91
87 to 89,81,91
99 to 89,81,91
64 to 66,56,78
55 to 66,56,78
66 to 66,56,78
77 to 120,59,125
811 to 120,59,125
118 to 120,59,125
Output/save to an Excel sheet.
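One way this could be done is a nested loop over the corresponding blocks of 3 rows, intersecting the value portions of each row pair. This is only a sketch, assuming the first element of each row is the ID and a "match" means values shared between the remaining six numbers; 'matches.xlsx' is a hypothetical output name:
import numpy as np
import pandas as pd

records = []
for block_a, block_b in zip(df, df1):  # walk the two arrays block by block
    for row_a in block_a:
        for row_b in block_b:
            common = np.intersect1d(row_a[1:], row_b[1:])  # compare values, IDs excluded
            if common.size:
                records.append({'id_df': row_a[0],
                                'id_df1': row_b[0],
                                'matches': ', '.join(map(str, common))})

out = pd.DataFrame(records)
out.to_excel('matches.xlsx', index=False)  # writing .xlsx requires openpyxl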
I have this dataframe:
x y z parameter
0 26 24 25 Age
1 35 37 36 Age
2 57 52 54.5 Age
3 160 164 162 Hgt
4 182 163 172.5 Hgt
5 175 167 171 Hgt
6 95 71 83 Wgt
7 110 68 89 Wgt
8 89 65 77 Wgt
I'm using pandas to get this final result:
x y parameter
0 160 164 Hgt
1 182 163 Hgt
2 175 167 Hgt
I'm using groupby() to extract and isolate the rows that share the parameter Hgt from the original dataframe.
First, I added a column to set it as an index:
df = df.insert(0,'index', [count for count in range(df.shape[0])], True)
And the dataframe came out like this:
index x y z parameter
0 0 26 24 25 Age
1 1 35 37 36 Age
2 2 57 52 54.5 Age
3 3 160 164 162 Hgt
4 4 182 163 172.5 Hgt
5 5 175 167 171 Hgt
6 6 95 71 83 Wgt
7 7 110 68 89 Wgt
8 8 89 65 77 Wgt
Then, I used the following code to group based on index and extract the columns I need:
df1 = df.groupby('index')[['x', 'y','parameter']]
And the output was:
x y parameter
0 26 24 Age
1 35 37 Age
2 57 52 Age
3 160 164 Hgt
4 182 163 Hgt
5 175 167 Hgt
6 95 71 Wgt
7 110 68 Wgt
8 89 65 Wgt
After that, I used the following code to isolate only Hgt values:
df2 = df1[df1['parameter'] == 'Hgt']
When I ran df2, I got an error saying:
IndexError: Column(s) ['x', 'y', 'parameter'] already selected
Am I missing something here? What should I do to get the final result?
Because you asked what you did wrong, let me point out the useless/bad code.
Without any judgement (this is just to help you improve future code), almost everything is incorrect. It feels like a succession of complicated ways to do useless things. Let me give some details:
df = df.insert(0,'index', [count for count in range(df.shape[0])], True)
This seems a very convoluted way to do df.reset_index(). Even [count for count in range(df.shape[0])] could have been simplified by using range(df.shape[0]) directly. (Note also that DataFrame.insert works in place and returns None, so assigning its result back to df actually leaves you with None, not a DataFrame.)
But this step is not even needed for a groupby as you can group by index level:
df.groupby(level=0)
But... the groupby is useless anyways as you only have single membered groups.
Also, when you do:
df1 = df.groupby('index')[['x', 'y','parameter']]
df1 is not a DataFrame but a DataFrameGroupBy object. Storing one in a variable is very useful when you know what you're doing; here, however, it is the cause of the error, because you treated it as a DataFrame. You need to apply an aggregation or transformation method of the DataFrameGroupBy to get a DataFrame back, which you didn't (likely because, as seen above, there isn't much interesting to do on single-membered groups).
So when you run:
df1[df1['parameter'] == 'Hgt']
again, all is wrong, as df1['parameter'] is equivalent to df.groupby('index')[['x', 'y','parameter']]['parameter'] (the cause of the error, as you select 'parameter' twice). Even if you removed this error, the equality comparison would give a single True/False, as you still have your DataFrameGroupBy and not a DataFrame, and this would incorrectly try to subselect a nonexistent column of the DataFrameGroupBy.
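To see the type issue concretely, here is a minimal sketch on a toy frame:
import pandas as pd

df = pd.DataFrame({'x': [160, 182], 'y': [164, 163],
                   'parameter': ['Hgt', 'Hgt']})
g = df.groupby(level=0)[['x', 'y', 'parameter']]
print(type(g))    # <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
print(g.first())  # applying an aggregation such as first() returns a DataFrame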
I hope it helped!
Do you really need groupby?
>>> df.loc[df['parameter'] == 'Hgt', ['x', 'y', 'parameter']].reset_index(drop=True)
x y parameter
0 160 164 Hgt
1 182 163 Hgt
2 175 167 Hgt
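If you prefer, the same selection can also be written with query (equivalent here, assuming the column names contain no spaces):
df.query("parameter == 'Hgt'")[['x', 'y', 'parameter']].reset_index(drop=True)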
I have a lot of data points (in .CSV form) that I am trying to visualize. What I would like to do is read the CSV and look at the "result" column; if the value in that column is positive (I was trying to use an np.where condition), I would like to plot the A B C D E F G parameters corresponding to it, with the parameter values on the y-axis and the parameter names on the x-axis (something like a dot/scatter plot). I would like to plot all the values in the same graph. Furthermore, if there are more than 20 such points, I would like to use only the first 20 for plotting.
An example of the type of dataset is below. (Mine contains around 12000 rows)
A B C D E F G result
23 -54 36 27 98 39 80 -0.86
14 44 -16 47 28 29 26 1.65
67 84 26 67 -88 29 10 0.5
-45 14 76 37 68 59 90 0
24 34 56 27 38 79 48 -1.65
Any guidance on this would be appreciated!
From your question I assume that your data is a pandas dataframe. In this case you can do the selection with pandas and use its built-in plotting function:
df.loc[df.result>0, df.columns[:-1]].T.plot(ls='', marker='o')
If you want to plot the first 20 rows only, just add [:20] (or better, .iloc[:20]) after the df.loc[...] selection.
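Putting it together, a minimal sketch (assuming matplotlib, and 'data.csv' as a placeholder for your file name):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')  # hypothetical file name
ax = df.loc[df.result > 0, df.columns[:-1]].iloc[:20].T.plot(ls='', marker='o', legend=False)
ax.set_xlabel('parameter')
ax.set_ylabel('value')
plt.show()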
There's a Postgres DB with a table 'nlp' having columns 'id' (Integer), 'cycle' (Integer), 'thresh' (Numeric).
I'm using pd.read_sql to read the database.
tb_data = pd.read_sql(rmdb.nlp, conn, columns = ['cycle','id','thresh'])
tb_data is read correctly and is as follows:
cycle id thresh
0 76 57 85.0
1 55 38 98.5
2 56 38 98.5
3 57 38 98.5
4 48 33 98.5
.. ... ... ...
65 159 125 98.5
66 160 126 98.5
67 161 127 98.5
68 162 128 98.5
69 163 129 99.0
[70 rows x 3 columns]
Then I tried to find the 'cycle' value based on id using:
tb_data.loc[tb_data['id'] == self.id, 'cycle'].item()
But I'm getting an empty array, and .item() throws the error can only convert an array of size 1 to a Python scalar.
Expected value : 163
I also tried
int(tb_data[tb_data['id'] == self.id]['cycle'].max())
but still no data was returned; it throws a float NaN cannot be converted error.
Note: self.id is numeric, and I printed it just before the above statements; its value is 129 according to the current table data.
The thing is, when I run the same code on Google Colab after creating a dataframe from a dictionary, it runs as expected. I am not able to find the problem here. Any help would be appreciated.
You can select the first value if it always exists:
tb_data.loc[tb_data['id'] == self.id, 'cycle'].iat[0]
Or, more generally, if an empty Series may be returned, use next with iter to return the first value if it exists, else a custom string, here 'no match':
next(iter(tb_data.loc[tb_data['id'] == self.id, 'cycle']), 'no match')
It was my mistake: I didn't check the type of the id I was passing into the equality comparison. I was matching a string against an integer, which always gave False, so I was not able to get the cycle value.
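For illustration, the mismatch can be reproduced on a toy frame (a minimal sketch):
import pandas as pd

tb_data = pd.DataFrame({'cycle': [162, 163], 'id': [128, 129]})
print(tb_data.loc[tb_data['id'] == '129', 'cycle'])              # empty Series: a str never equals an int
print(tb_data.loc[tb_data['id'] == int('129'), 'cycle'].iat[0])  # 163, after casting to int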
I'm trying to write a piece of code in python to graph some data from a tab separated file with numerical data.
I'm very new to Python so I would appreciate it if any help could be dumbed down a little bit.
Basically, I have this file and I would like to take two columns from it, sort them each in ascending order, and then graph those sorted columns against each other.
First of all, you should not post code as images, since there is functionality to insert and format code here in the editor.
It's as simple as calling x.sort() and y.sort(), since both of them are slices of data, so that should work fine (assuming they are 1-dimensional arrays).
Here is an example:
import numpy as np
array = np.random.randint(0, 100, size=20)
print(array)
Output:
[89 47 4 10 29 21 91 95 32 12 97 66 59 70 20 20 36 79 23 4]
So if we use the method mentioned before (note that ndarray.sort() sorts in place and returns None, so we sort first and print afterwards):
array.sort()
print(array)
Output:
[ 4 4 10 12 20 20 21 23 29 32 36 47 59 66 70 79 89 91 95 97]
Easy as that :)
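For the original task of reading the tab-separated file and plotting the two sorted columns against each other, here is a minimal sketch (assuming a hypothetical file 'data.tsv' with one header line, where the first two columns are the ones of interest):
import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('data.tsv', delimiter='\t', skiprows=1)  # skiprows=1 skips the header line
x = np.sort(data[:, 0])  # np.sort returns a sorted copy, leaving data untouched
y = np.sort(data[:, 1])
plt.plot(x, y, marker='o', ls='')
plt.xlabel('first column (sorted)')
plt.ylabel('second column (sorted)')
plt.show()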