Extract information from an Excel file (by updating arrays) with Excel / Python

I have an Excel file with thousands of rows in the following format (each member number is followed by three rows of X, Y, Z values):

Member No.     X     Y     Z
1000          25    60   -30
             -69    38    68
              45     2    43
1001          24    55    79
               4    -7    89
              78    51    -2
1002          45   -55   149
              94    77  -985
              -2   559    56
I need a way to get a new table with the absolute maximum value from each column for each member. In this example, something like:

Member No.     X     Y     Z
1000          69    60    68
1001          78    55    89
1002          94   559   985
I have tried it in Excel, using VLOOKUP to find the member number in the first row and then HLOOKUP to find the values in the rows thereafter. The problem is that the HLOOKUP is not automatically updated with the new array (the array in which member number 1001 sits), so my solution works for member 1000 but not for 1001 and 1002: it always searches for the new value ONLY in the first block (the rows for member 1000).
I also tried reading the file with Python, but I am not well-versed enough to make much headway: once the dataset has been read, how do I read the next 3 rows and get the (absolute) maximum of each column?
Can someone please help? A solution in Python 3 or Excel (ideally Excel 2014) is required.

The below solution will get you your desired output using Python.
I first use ffill to fill down the blanks in your Member No. column. Then I convert the dataframe values to positive using abs. Lastly, grouping by Member No. and taking max gives the largest value in each column per member.
Assuming your dataframe is called data:
import pandas as pd

# Fill the member number down, take absolute values, then the max per member
data['Member No.'] = data['Member No.'].ffill().astype(int)
data = data.abs()
res = data.groupby('Member No.').max().reset_index()
Which will print you:
   Member No.   X    Y    Z    A    B    C
0        1000  69   60   68   60   74   69
1        1001  78   55   89   78   92   87
2        1002  94  559  985  985  971  976
Note that I added extra columns in your sample data to make sure that all the columns will return their max() value.
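As a self-contained sketch of the whole pipeline: in practice the frame would come from something like pd.read_excel("members.xlsx") (a hypothetical filename), where the blank Member No. cells read in as NaN; here the sample data from the question is built inline so the snippet runs on its own:

```python
import pandas as pd

# Sample frame mirroring the question; blank Member No. cells become NaN,
# as they would when reading the spreadsheet with pd.read_excel
data = pd.DataFrame({
    "Member No.": [1000, None, None, 1001, None, None, 1002, None, None],
    "X": [25, -69, 45, 24, 4, 78, 45, 94, -2],
    "Y": [60, 38, 2, 55, -7, 51, -55, 77, 559],
    "Z": [-30, 68, 43, 79, 89, -2, 149, -985, 56],
})

# Fill the member number down, take absolute values, then the max per member
data["Member No."] = data["Member No."].ffill().astype(int)
res = data.abs().groupby("Member No.").max().reset_index()
print(res)
```

The result can then be written back out with res.to_excel(...) if an Excel sheet is needed.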

Related

How to match 2 different array with the same length

Good morning,
I have 2 arrays with the same length, 21 rows each. I would like to find matches between df and df1, then skip to the next 3 rows and find matches there, and so on. Then I would like to return the IDs and the matching values, and save them to an Excel sheet.
Here's an example below:
df= np.array([[[100,1,2,4,5,6,8],
[87,1,6,20,22,23,34],
[99,1,12,13,34,45,46]],
[[64,1,10,14,29,32,33],
[55,1,22,13,23,33,35],
[66,1,6,7,8,9,10]],
[[77,1,2,3,5,6,8],
[811,1,2,5,6,8,10],
[118,1,7,8,22,44,56]]])
df1= np.array([[[89,2,4,6,8,10,12],
[81,2,7,25,28,33,54],
[91,2,13,17,33,45,48]],
[[66,2,20,24,25,36,38],
[56,1,24,33,43,53,65],
[78,3,14,15,18,19,20]],
[[120,4,10,23,25,26,38],
[59,5,22,35,36,38,40],
[125,1,17,18,32,44,56]]])
I would like to find matches as follows:
100 to 89,81,91
87 to 89,81,91
99 to 89,81,91
64 to 66,56,78
55 to 66,56,78
66 to 66,56,78
77 to 120,59,125
811 to 120,59,125
118 to 120,59,125
Then output/save the result to an Excel sheet.
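There is no accepted answer here, but one possible reading of the question (treating the first entry of each row as an ID and the rest as values, and intersecting each df row with each df1 row in the same 3-row block) can be sketched as follows. Note that this pairing logic is an assumption, not something the asker confirmed:

```python
import numpy as np
import pandas as pd

df = np.array([[[100, 1, 2, 4, 5, 6, 8],
                [87, 1, 6, 20, 22, 23, 34],
                [99, 1, 12, 13, 34, 45, 46]],
               [[64, 1, 10, 14, 29, 32, 33],
                [55, 1, 22, 13, 23, 33, 35],
                [66, 1, 6, 7, 8, 9, 10]],
               [[77, 1, 2, 3, 5, 6, 8],
                [811, 1, 2, 5, 6, 8, 10],
                [118, 1, 7, 8, 22, 44, 56]]])
df1 = np.array([[[89, 2, 4, 6, 8, 10, 12],
                 [81, 2, 7, 25, 28, 33, 54],
                 [91, 2, 13, 17, 33, 45, 48]],
                [[66, 2, 20, 24, 25, 36, 38],
                 [56, 1, 24, 33, 43, 53, 65],
                 [78, 3, 14, 15, 18, 19, 20]],
                [[120, 4, 10, 23, 25, 26, 38],
                 [59, 5, 22, 35, 36, 38, 40],
                 [125, 1, 17, 18, 32, 44, 56]]])

records = []
for block, block1 in zip(df, df1):      # walk the 3-row blocks in parallel
    for row in block:
        for row1 in block1:
            # values shared between the two rows (first entry is the ID)
            common = sorted(set(row[1:]) & set(row1[1:]))
            if common:                  # keep only pairs that share values
                records.append({"id": row[0], "id1": row1[0],
                                "matches": common})

out = pd.DataFrame(records)
# out.to_excel("matches.xlsx", index=False)  # hypothetical output file
print(out.head())
```

For example, ID 100 ([1, 2, 4, 5, 6, 8]) against ID 89 ([2, 4, 6, 8, 10, 12]) yields the common values [2, 4, 6, 8].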

Using groupby() on a pandas dataframe resulted in an IndexError

I have this dataframe:
x y z parameter
0 26 24 25 Age
1 35 37 36 Age
2 57 52 54.5 Age
3 160 164 162 Hgt
4 182 163 172.5 Hgt
5 175 167 171 Hgt
6 95 71 83 Wgt
7 110 68 89 Wgt
8 89 65 77 Wgt
I'm using pandas to get this final result:
x y parameter
0 160 164 Hgt
1 182 163 Hgt
2 175 167 Hgt
I'm using groupby() to extract and isolate the rows with the same parameter Hgt from the original dataframe.
First, I added a column to set as an index:
df = df.insert(0,'index', [count for count in range(df.shape[0])], True)
And the dataframe came out like this:
index x y z parameter
0 0 26 24 25 Age
1 1 35 37 36 Age
2 2 57 52 54.5 Age
3 3 160 164 162 Hgt
4 4 182 163 172.5 Hgt
5 5 175 167 171 Hgt
6 6 95 71 83 Wgt
7 7 110 68 89 Wgt
8 8 89 65 77 Wgt
Then, I used the following code to group based on index and extract the columns I need:
df1 = df.groupby('index')[['x', 'y','parameter']]
And the output was:
x y parameter
0 26 24 Age
1 35 37 Age
2 57 52 Age
3 160 164 Hgt
4 182 163 Hgt
5 175 167 Hgt
6 95 71 Wgt
7 110 68 Wgt
8 89 65 Wgt
After that, I used the following code to isolate only Hgt values:
df2 = df1[df1['parameter'] == 'Hgt']
When I ran df2, I got an error saying:
IndexError: Column(s) ['x', 'y', 'parameter'] already selected
Am I missing something here? What should I do to get the final result?
Because you asked what you did wrong, let me point out the useless/bad code.
Without any judgement (this is just to help you improve future code): almost everything is incorrect. It feels like a succession of complicated ways to do useless things. Let me give some details:
df = df.insert(0,'index', [count for count in range(df.shape[0])], True)
This is a very convoluted way to do df.reset_index(). Even [count for count in range(df.shape[0])] could have been simplified to range(df.shape[0]) directly. In fact, df.insert modifies the dataframe in place and returns None, so df = df.insert(...) would also have replaced your dataframe with None.
But this step is not even needed for a groupby as you can group by index level:
df.groupby(level=0)
But... the groupby is useless anyway, as you only have single-membered groups.
Also, when you do:
df1 = df.groupby('index')[['x', 'y','parameter']]
df1 is not a dataframe but a DataFrameGroupBy object. That is very useful to store in a variable when you know what you're doing; here, however, it is causing the error, because you thought it was a DataFrame. You need to apply an aggregation or transformation method of the DataFrameGroupBy object to get back a DataFrame, which you didn't (likely because, as seen above, there isn't much interesting to do on single-membered groups).
So when you run:
df1[df1['parameter'] == 'Hgt']
again, all is wrong, as df1['parameter'] is equivalent to df.groupby('index')[['x', 'y', 'parameter']]['parameter'] (the cause of the error, since you select 'parameter' twice). Even if you removed this error, the equality comparison would not give the boolean Series you expect, as you would still have a DataFrameGroupBy and not a DataFrame, and the bracket indexing would then incorrectly try to subselect a nonexistent column of the DataFrameGroupBy.
I hope it helped!
Do you really need groupby?
>>> df.loc[df['parameter'] == 'Hgt', ['x', 'y', 'parameter']].reset_index(drop=True)
x y parameter
0 160 164 Hgt
1 182 163 Hgt
2 175 167 Hgt
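For completeness, if one did want to go through groupby (purely illustrative; the loc one-liner above is simpler), get_group pulls a real DataFrame for one group out of the DataFrameGroupBy, which is exactly what the bare indexing attempt could not do. A sketch using the dataframe from the question:

```python
import pandas as pd

# Dataframe from the question
df = pd.DataFrame({
    "x": [26, 35, 57, 160, 182, 175, 95, 110, 89],
    "y": [24, 37, 52, 164, 163, 167, 71, 68, 65],
    "z": [25, 36, 54.5, 162, 172.5, 171, 83, 89, 77],
    "parameter": ["Age"] * 3 + ["Hgt"] * 3 + ["Wgt"] * 3,
})

# get_group returns a plain DataFrame for the 'Hgt' group,
# which can then be column-selected as usual
hgt = (df.groupby("parameter")
         .get_group("Hgt")[["x", "y", "parameter"]]
         .reset_index(drop=True))
print(hgt)
```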

Plotting of dot points based on np.where condition

I have a lot of data points (in .CSV form) that I am trying to visualize. What I would like to do is read the CSV, look at the "result" column and, where the value is positive (I was trying to use an np.where condition), plot the A B C D E F G parameters of that row so that the y-axis is the value of the parameters and the x-axis is the name of the parameter (something like a dot/scatter plot). I would like to plot all the values in the same graph. Furthermore, if there are more than 20 such points, I would like to use only the first 20 for the plotting.
An example of the type of dataset is below. (Mine contains around 12000 rows)
A B C D E F G result
23 -54 36 27 98 39 80 -0.86
14 44 -16 47 28 29 26 1.65
67 84 26 67 -88 29 10 0.5
-45 14 76 37 68 59 90 0
24 34 56 27 38 79 48 -1.65
Any help in guiding for this would be appreciated !
From your question I assume that your data is a pandas dataframe. In this case you can do the selection with pandas and use its built-in plotting function:
df.loc[df.result>0, df.columns[:-1]].T.plot(ls='', marker='o')
If you want to plot the first 20 rows only, just add [:20] (or better, .iloc[:20]) after the df.loc selection.
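A runnable sketch of the selection step with the sample data from the question; the final .plot(ls='', marker='o') call is left as a comment, since it simply renders the transposed frame:

```python
import pandas as pd

df = pd.DataFrame(
    [[23, -54, 36, 27, 98, 39, 80, -0.86],
     [14, 44, -16, 47, 28, 29, 26, 1.65],
     [67, 84, 26, 67, -88, 29, 10, 0.5],
     [-45, 14, 76, 37, 68, 59, 90, 0],
     [24, 34, 56, 27, 38, 79, 48, -1.65]],
    columns=list("ABCDEFG") + ["result"],
)

# Rows with a strictly positive result, parameter columns only, capped at 20
subset = df.loc[df["result"] > 0, df.columns[:-1]].iloc[:20]
print(subset)
# subset.T.plot(ls="", marker="o")  # parameter names on the x-axis, values on the y-axis
```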

Empty array returned while fetching a value from one column based on another column in a pandas dataframe

There's a Postgres DB with a table 'nlp' having columns 'id' (Integer), 'cycle' (Integer), 'thresh' (Numeric).
I'm using pd.read_sql to read the database:
tb_data = pd.read_sql(rmdb.nlp, conn, columns = ['cycle','id','thresh'])
tb_data is read correctly and is as follows:
cycle id thresh
0 76 57 85.0
1 55 38 98.5
2 56 38 98.5
3 57 38 98.5
4 48 33 98.5
.. ... ... ...
65 159 125 98.5
66 160 126 98.5
67 161 127 98.5
68 162 128 98.5
69 163 129 99.0
[70 rows x 3 columns]
Then I tried to find the 'cycle' value based on id using:
tb_data.loc[tb_data['id'] == self.id, 'cycle'].item()
but I'm getting an empty array, and .item() throws the error: can only convert an array of size 1 to a Python scalar.
Expected value : 163
I also tried
int(tb_data[tb_data['id'] == self.id]['cycle'].max())
but still no data was returned; it throws a "float NaN cannot be converted" error.
Note: self.id is numeric, and I printed it just before the above statements; its value is 129 according to the current table data.
The thing is, when I try to run the same code on Google Colab after creating a dataframe from a dictionary, it runs as expected. I am not able to find the problem here. Any help would be appreciated.
You can select first value if always exist:
tb_data.loc[tb_data['id'] == self.id, 'cycle'].iat[0]
Or, more generally, if an empty Series may be returned, use next with iter to return the first value if one exists, else a custom string, here 'no match':
next(iter(tb_data.loc[tb_data['id'] == self.id, 'cycle']), 'no match')
It was my mistake: I didn't check the type of the id I was passing into the equality comparison. I was matching a string against an integer, which always gave False, so I was not able to get the cycle value.
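A small sketch of both lookups on a toy frame (target stands in for self.id). The last line also reproduces the asker's actual bug: comparing the integer id column against a string matches nothing, so the safe variant falls back to 'no match':

```python
import pandas as pd

tb_data = pd.DataFrame({
    "cycle": [76, 55, 163],
    "id": [57, 38, 129],
    "thresh": [85.0, 98.5, 99.0],
})

target = 129  # stands in for self.id; must match the dtype of the 'id' column

# First matching value; assumes a match always exists
first = tb_data.loc[tb_data["id"] == target, "cycle"].iat[0]

# Safe variant: falls back to 'no match' on an empty selection.
# Passing the string "129" instead of the int reproduces the original bug.
safe = next(iter(tb_data.loc[tb_data["id"] == "129", "cycle"]), "no match")
print(first, safe)
```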

How do I sort columns of numerical file data in Python

I'm trying to write a piece of code in Python to graph some data from a tab-separated file of numerical data.
I'm very new to Python so I would appreciate it if any help could be dumbed down a little bit.
Basically, I have this file and I would like to take two columns from it, sort them each in ascending order, and then graph those sorted columns against each other.
First of all, you should not post code as images, since there is functionality to insert and format code right here in the editor.
It's as simple as calling x.sort() and y.sort(), since both of them are slices from data, so that should work fine (assuming they are 1-dimensional arrays).
Here is an example:
import numpy as np
array = np.random.randint(0, 100, size=20)
print(array)
Output:
[89 47 4 10 29 21 91 95 32 12 97 66 59 70 20 20 36 79 23 4]
So if we use the method mentioned before (note that ndarray.sort() sorts in place and returns None, so sort first and then print):
array.sort()
print(array)
Output:
[ 4 4 10 12 20 20 21 23 29 32 36 47 59 66 70 79 89 91 95 97]
Easy as that :)
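To tie it back to the original question, a minimal sketch of the full flow. The filename and the choice of columns are hypothetical; in practice the array would come from something like np.loadtxt("data.tsv", delimiter="\t"):

```python
import numpy as np

# Stand-in for two columns loaded from a tab-separated file, e.g.
# data = np.loadtxt("data.tsv", delimiter="\t")   # hypothetical filename
data = np.array([[5.0, 2.0],
                 [1.0, 9.0],
                 [3.0, 4.0]])

x = np.sort(data[:, 0])  # ascending copy of the first column
y = np.sort(data[:, 1])  # ascending copy of the second column
print(x, y)
# plt.plot(x, y, marker="o") would then graph the sorted columns against each other
```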
