Remove duplicate row in array based on specific column values in Python - python

I have an array like this:
array = [[0.91 0.33 0.09]
[0.52 0.63 0.05]
[0.91 0.33 0.11]
[0.52 0.63 0.07]
[0.62 0.41 0.01]
[0.36 0.37 0.01]]
I need it to remove the row with the larger value in the third column if the first two column values are duplicates. So this:
array2 = [[0.91 0.33 0.11]
[0.52 0.63 0.07]
[0.62 0.41 0.01]
[0.36 0.37 0.01]]
I want a pythonic way to do this without for loops if possible.

Two quick, common ways:
Use the built-in filter() function (note: filter() is a builtin, not part of itertools; itertools has the related filterfalse())
Use a list comprehension
Since you have the condition "if the first two column values are duplicates", you'll have to do some grouping first, and itertools also has groupby().
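For instance, a minimal sketch using sorted() plus itertools.groupby(), keeping the row with the smallest third value in each group (swap min for max if, as in your expected output, you actually want to keep the larger one):
from itertools import groupby
from operator import itemgetter

# the question's data written as a plain Python list of lists
array = [[0.91, 0.33, 0.09],
         [0.52, 0.63, 0.05],
         [0.91, 0.33, 0.11],
         [0.52, 0.63, 0.07],
         [0.62, 0.41, 0.01],
         [0.36, 0.37, 0.01]]

key = itemgetter(0, 1)                  # group on the first two columns
array2 = [min(grp, key=itemgetter(2))   # keep the smallest third value per group
          for _, grp in groupby(sorted(array, key=key), key=key)]
print(array2)                           # note: the result comes back sorted by the grouping key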

Related

Plot clusters of similar words from pandas dataframe

I have a big dataframe, here's a small subset:
key_words prove have read represent lead replace
be 0.58 0.49 0.48 0.17 0.23 0.89
represent 0.66 0.43 0 1 0 0.46
associate 0.88 0.23 0.12 0.43 0.11 0.67
induce 0.43 0.41 0.33 0.47 0 0.43
This shows how close each word in key_words is to each of the column words (based on their embedding distance).
I want to find a way to visualize this dataframe so that I see the clusters that are being formed among the words that are closest to each other.
Is there a simple way to do this, considering that the key_word column has string values?
One option is to set the key_words column as index and to use seaborn.clustermap to plot the clusters:
# pip install seaborn
import seaborn as sns

sns.clustermap(df.set_index('key_words'),  # data
               vmin=0, vmax=1,             # values mapped to the min/max colors (white/black here)
               cmap='Greys',               # color palette
               figsize=(5, 5))             # plot size
Output: a clustered heatmap of the word-similarity matrix (image not shown here).

Making arrays of columns (or rows) of a (space-delimited) textfile in Python

I have seen similar questions, but the answers always give strings of rows. I want to make arrays of the columns of a text file. My file looks like this (the real file has 106 rows and 19 columns):
O2 CO2 NOx Ash Other
20.9 1.6 0.04 0.0002 0.0
22.0 2.3 0.31 0.0005 0.0
19.86 2.1 0.05 0.0002 0.0
17.06 3.01 0.28 0.006 0.001
I expect to get arrays of columns (either a 2D array of all columns or a 1D array per column); the first row holds only the names, so that should become a list, since I would like to plot the data later.
The desired result for one column would be, for example:
array([0.04,
       0.31,
       0.05,
       0.28], dtype=float32)
and for the first row:
species = ['O2', 'CO2', 'NOx', 'Ash', 'Other']
I'd recommend not manually looping over the values in large data sets (in this case a sort of tab-separated relational model). Just use the methods of a safe and well-known library like NumPy:
import numpy as np
data = np.transpose(np.loadtxt("/path/to/file.txt", skiprows=1, delimiter="\t"))
The inner loadtxt reads your file, and the skiprows=1 parameter skips the first row (the column names) to avoid incompatible data types and extra conversions. If you need that row in the same structure, just insert it back as a new row at index 0. You then need to transpose the matrix, for which NumPy also provides a safe method. Here I used the output of loadtxt (a 2D array with one row per line of the file) directly as the input of transpose to get a one-liner, but it is better to keep the two steps separate, so you avoid "train wrecks", can see what happens in between, and can correct unwanted results.
PS: the delimiter parameter must be adjusted to match the one used in the original file; check the loadtxt documentation for more info. I assumed a tab here. For whitespace-separated files you can simply omit delimiter, since the default splits on any whitespace. #KostasCharitidis - thanks for your note
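Putting that together, a minimal sketch of the whole workflow, assuming the real file is whitespace-delimited (so delimiter can simply be omitted) and reading the header names into a separate list, as the question asks:
import numpy as np

with open("/path/to/file.txt") as f:
    species = f.readline().split()   # e.g. ['O2', 'CO2', 'NOx', 'Ash', 'Other']

data = np.loadtxt("/path/to/file.txt", skiprows=1).T   # one row per column of the file
o2 = data[0]                                           # e.g. the O2 column as a 1D array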
UPDATE3
st = open('file.txt', 'r').read()
dct = []
species = []
for row in st.split('\n')[0].split(' '):
    species.append(row)
for no, row in enumerate(st.split('\n')[1:]):
    dct.append([])
    for elem in row.split(' '):
        dct[no].append([float(elem)])
print(species)
print(dct)
RESULT
['O2', 'CO2', 'NOx', 'Ash', 'Other']
[[[20.9], [1.6], [0.04], [0.0002], [0.0]], [[22.0], [2.3], [0.31], [0.0005], [0.0]], [[19.86], [2.1], [0.05], [0.0002], [0.0]], [[17.06], [3.01], [0.28], [0.006], [0.001]]]
file.txt
O2 CO2 NOx Ash Other
20.9 1.6 0.04 0.0002 0.0
22.0 2.3 0.31 0.0005 0.0
19.86 2.1 0.05 0.0002 0.0
17.06 3.01 0.28 0.006 0.001

Create a row in pandas dataframe

I am trying to add a new row to my existing pandas dataframe, where the value of the new row is the result of a computation.
I have a dataframe that looks like the below:
Rating LE_St % Total
1.00 7.58 74.55
2.00 0.56 5.55
3.00 0.21 2.04
5.00 0.05 0.44
6.00 1.77 17.42
All 10.17 100.00
I want to add a row called "Metric" which is the sum of the "LE_St" values for "Rating" >= 4 and <= 6, divided by the "LE_St" value for "All", i.e. Metric = (0.05 + 1.77) / 10.17.
My output dataframe should look like below:
Rating LE_St % Total
1.00 7.58 74.55
2.00 0.56 5.55
3.00 0.21 2.04
5.00 0.05 0.44
6.00 1.77 17.42
All 10.17 100.00
Metric 0.44
I believe your approach to the dataframe is wrong.
Usually rows hold values that correlate with the columns in a manner that makes sense, not random pieces of information. The power of pandas and Python lies in holding and manipulating data: you can easily compute a value from a column (or from all columns) and store it in a "summary"-style dataframe or in separate variables. That might help you here as well.
For a computation on a column (i.e. a Series object) you can use the .sum() method (or any other of the computational tools) and slice your dataframe by values in the "Rating" column.
For ad-hoc computation of small statistics you may be better off with Excel :)
An example of a solution might look like this:
total = 10.17  # I don't know where this value comes from
ratings = pd.to_numeric(df['Rating'], errors='coerce')  # the 'All' row becomes NaN
sliced_df = df[ratings.between(4, 6)]                    # Rating between 4 and 6
metric = sliced_df['LE_St'].sum() / total
print(metric)  # or store it somewhere, however you like
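For completeness, a minimal self-contained sketch of the same idea that reads the "All" total from the dataframe itself instead of hard-coding it (column names assumed as in the question; the commented last line shows how to append the result as a "Metric" row if you really do want it inside the frame):
import pandas as pd

df = pd.DataFrame({'Rating': [1.00, 2.00, 3.00, 5.00, 6.00, 'All'],
                   'LE_St': [7.58, 0.56, 0.21, 0.05, 1.77, 10.17],
                   '% Total': [74.55, 5.55, 2.04, 0.44, 17.42, 100.00]})

ratings = pd.to_numeric(df['Rating'], errors='coerce')          # the 'All' row becomes NaN
total = df.loc[df['Rating'] == 'All', 'LE_St'].iloc[0]          # 10.17, read from the data
metric = df.loc[ratings.between(4, 6), 'LE_St'].sum() / total   # (0.05 + 1.77) / 10.17
print(metric)
# df.loc[len(df)] = ['Metric', metric, None]                    # only if you want it as a row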

Value distance between two columns in a data frame with sorted, float index

We have a data frame with a sorted float index and two columns that should be the same. Their values are not always present, and in the worst case scenario, they do not have overlaps in the index values. The goal is to be able to check how far they are from each other.
I was thinking about interpolating the missing values and then calculating the distance. This would result in a large collection of index values for which this distance can be calculated.
Another approach would be to compare the actual values, and come up with an index error for which this comparison would make sense.
The question is which approach would make more sense and how to calculate the distance. The result should tell us how close the two columns are to each other, with e.g. 0 meaning that they are the same.
Example
We have a data frame with two columns a1 and a2 and a sorted, float index.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a1': [6.1, np.nan, 6.8, 7.5, 7.9],
                   'a2': [6.2, 6.6, 6.8, np.nan, 7.7]},
                  index=[0.10, 0.11, 0.13, 0.16, 0.17])
a1 a2
0.10 6.1 6.2
0.11 NaN 6.6
0.13 6.8 6.8
0.16 7.5 NaN
0.17 7.9 7.7
If your objective is to get the absolute distance of the interpolated vectors you can proceed as follows:
r = df.interpolate()
absolute_sum = (r["a1"] - r["a2"]).abs().sum()
With the given example the result is 0.7000000000000011.
Though if you are interested in how similar the two columns are, you could take a look at the correlation coefficient.
r = df.interpolate()
correlation = r["a1"].corr(r["a2"])
With the given example the result is 0.9929580338258082.
Since you mention distance:
from scipy.spatial import distance

df = df.interpolate(axis=0)
pd.DataFrame(distance.cdist(df.values, df.values, 'euclidean'),
             columns=df.index, index=df.index)
Out[468]:
0.10 0.11 0.13 0.16 0.17
0.10 0.000000 0.531507 0.921954 1.750000 2.343075
0.11 0.531507 0.000000 0.403113 1.234909 1.820027
0.13 0.921954 0.403113 0.000000 0.832166 1.421267
0.16 1.750000 1.234909 0.832166 0.000000 0.602080
0.17 2.343075 1.820027 1.421267 0.602080 0.000000

Loop through grouped data - Python/Pandas

I'm trying to perform an action on grouped data in Pandas. For each group, I want to loop through the rows and compare them to the first row in the group. If conditions are met, then I want to print out the row details. My data looks like this:
Orig Dest Route Vol Per VolPct
ORD ICN A 2,251 0.64 0.78
ORD ICN B 366 0.97 0.13
ORD ICN C 142 0.14 0.05
DCA FRA A 9,059 0.71 0.85
DCA FRA B 1,348 0.92 0.13
DCA FRA C 281 0.8 0.03
My groups are Orig, Dest pairs. If a row in the group other than the first row has a Per greater than the first row and a VolPct greater than .1, I want to output the grouped pair and the route. In this example, the output would be:
ORD ICN B
DCA FRA B
My attempted code is as follows:
for lane in otp.groupby(otp['Orig','Dest']):
    X = lane.first(['Per'])
    for row in lane:
        if (row['Per'] > X and row['VolPct'] > .1):
            print(row['Orig','Dest','Route'])
However, this isn't working, so I'm obviously doing something wrong. I'm also not sure how to tell Python to ignore the first row when in the "row in lane" loop. Any ideas? Thanks!
You are pretty close as it is.
First, you are calling groupby incorrectly. You should just pass a list of the column names instead of a DataFrame object. So, instead of otp.groupby(otp['Orig','Dest']) you should use otp.groupby(['Orig','Dest']).
Once you are looping through the groups you will hit more issues. A group in a groupby object is actually a tuple. The first item in that tuple is the grouping key and the second is the grouped data. For example your first group would be the following tuple:
(('DCA', 'FRA'), Orig Dest Route Vol Per VolPct
3 DCA FRA A 9,059 0.71 0.85
4 DCA FRA B 1,348 0.92 0.13
5 DCA FRA C 281 0.80 0.03)
You will need to change the way you set X to reflect this. For example, X = lane.first(['Per']) should become X = lane[1].iloc[0].Per. After that you only have minor errors in the way you iterate through the rows and access multiple columns in a row. To wrap it all up, your loop should look something like this:
for key, lane in otp.groupby(['Orig', 'Dest']):
    X = lane.iloc[0].Per
    for idx, row in lane.iterrows():
        if (row['Per'] > X and row['VolPct'] > .1):
            print(row[['Orig', 'Dest', 'Route']])
Note that I use iterrows to iterate through the rows, and I use double brackets when accessing multiple columns in a DataFrame.
You don't really need to tell pandas to ignore the first row in each group as it should never trigger your if statement, but if you did want to skip it you could use lane[1:].iterrows().
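For reference, a minimal self-contained sketch of that loop using the sample data from the question (column names assumed as shown; sort=False keeps the groups in their original order):
import pandas as pd

otp = pd.DataFrame({'Orig':   ['ORD', 'ORD', 'ORD', 'DCA', 'DCA', 'DCA'],
                    'Dest':   ['ICN', 'ICN', 'ICN', 'FRA', 'FRA', 'FRA'],
                    'Route':  ['A', 'B', 'C', 'A', 'B', 'C'],
                    'Vol':    [2251, 366, 142, 9059, 1348, 281],
                    'Per':    [0.64, 0.97, 0.14, 0.71, 0.92, 0.80],
                    'VolPct': [0.78, 0.13, 0.05, 0.85, 0.13, 0.03]})

for key, lane in otp.groupby(['Orig', 'Dest'], sort=False):
    first_per = lane.iloc[0]['Per']
    for _, row in lane[1:].iterrows():                 # skip the first row of each group
        if row['Per'] > first_per and row['VolPct'] > .1:
            print(row['Orig'], row['Dest'], row['Route'])
# prints:
# ORD ICN B
# DCA FRA B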
