Collapsing rows by values within a given tolerance in pandas [closed] - python

import pandas as pd

ms_dict = {'Mz': [200.035234, 200.035523, 200.035125, 200.042546],
           'Type': ['Blank', 'Sample', 'Sample', 'Sample'],
           'Well': ['E17', 'E18', 'A04', 'H12'],
           'Intensity': [21655, 56415, 56456, 546454]}
df = pd.DataFrame.from_dict(ms_dict)
df
           Mz    Type Well  Intensity
0  200.035234   Blank  E17      21655
1  200.035523  Sample  E18      56415
2  200.035125  Sample  A04      56456
3  200.042546  Sample  H12     546454
In our data, we often have row features that are within a given tolerance of each other yet do not collapse into a single bucket under pre-determined limits such as rounding (e.g., to 2 decimal places instead of 5). In the provided example, we want to group the first three rows into a single bucket, with the values in the Mz and Intensity columns combined in some "meaningful" way (summed or averaged), but not the fourth row. We also want to show the distribution of the new bucket across the Well positions in which it is found.
The tolerance is expressed as the ppm difference (ppm_diff) between the values in two rows; rows whose ppm difference falls within the tolerance window should end up in the same bucket.
ppm_diff = ((value1 - value2) / value1) * 1e6
Is there a way in Python or pandas to group these rows based on the difference in a single column?
We have tried simple rounding, but it splits values inappropriately at the rounding edges, making the distribution matrix of individual features inaccurate (for example, 200.0355 and 200.0354 are well within a 2 ppm tolerance but end up in different buckets). Rounding to fewer decimal places gives larger buckets but can inappropriately group features that should remain distinct.
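One possible sketch (not from the original post; the 2 ppm tolerance is assumed, and chaining consecutive gaps means the extremes of a long chain could still drift beyond the tolerance): sort by Mz, start a new bucket whenever the ppm gap to the previous row exceeds the tolerance, then aggregate each bucket.
import pandas as pd

ppm_tol = 2  # assumed tolerance in ppm

df = df.sort_values('Mz')
ppm_gap = df['Mz'].diff().div(df['Mz']).mul(1e6)   # ppm difference to the previous row
bucket = ppm_gap.gt(ppm_tol).cumsum()              # new bucket wherever the gap exceeds the tolerance

grouped = df.groupby(bucket).agg(
    Mz=('Mz', 'mean'),               # combine Mz by averaging
    Intensity=('Intensity', 'sum'),  # combine Intensity by summing
    Wells=('Well', list),            # distribution across Well positions
    Types=('Type', list),
)
On the example data this puts the first three rows in one bucket and the fourth in its own.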

Related

Rank and color each cell based on a row wise comparison in a table python [duplicate]

I have a table (currently produced in Excel) where a row-wise comparison is made and each cell in a row is ranked from lowest to highest. The best score is strong green, the second-best score is less green, and the worst score is red. Cells with an equal score get a similar color based on their shared rank.
For some rows the ranking is based on an ascending score, while other rows have a descending ranking.
How can this be done using Python? Do you know any modules capable of doing something similar? I've used Seaborn for other heatmaps, but none of them were based on a row-wise comparison.
Any ideas?
The colors are not important. I just want to know how to rank the cells of each row compared to each other.
Use background_gradient. The RdYlGn colormap sounds like it matches your description, though it won't be a 100% reproduction of Excel's color map.
df.style.background_gradient("RdYlGn", axis=1)
List of color maps: https://matplotlib.org/stable/tutorials/colors/colormaps.html
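If you specifically want the coloring to follow the within-row rank, and some rows rank in the opposite direction, a rough sketch of one way to do it (the higher_is_better flag and the toy data are made up for illustration):
import pandas as pd

df = pd.DataFrame({'A': [1, 9, 3], 'B': [5, 2, 3], 'C': [7, 4, 1]})    # toy data
higher_is_better = pd.Series([True, False, True], index=df.index)      # assumed per-row direction

# Flip the sign of rows where lower is better, rank within each row
# (ties share a rank via method='min'), then color the ranks rather than the raw values.
sign = higher_is_better.map({True: 1, False: -1})
ranks = df.mul(sign, axis=0).rank(axis=1, method='min')
styled = ranks.style.background_gradient("RdYlGn", axis=1)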

How to calculate the average without the points that are very far from the others [closed]

I have some data points (cluster sizes) and I would like to calculate their average; however, some instant peaks need to be removed first. Usually these peaks are twice or three times the normal value, but not always. Any suggestion would be appreciated. Thank you.
Some instant peaks occur because of coalescence.
Assuming your data is in a dataframe with two columns, 'time' and 'size', and that there are around 500 observations in total (so a window size of 10 is sensible):
Calculate the median of a moving window.
If a value's 'size' is greater than or equal to the rolling median at that point multiplied by multiplier_thresh, treat it as an outlier and remove it:
wind_size = 10
multiplier_thresh = 1.5
# Calculate rolling median
rolling_median = df['size'].rolling(window=wind_size).median().bfill()
# Drop outliers
to_stay = df['size'] < rolling_median * multiplier_thresh
df_no_outliers = df[to_stay]
Mean of the values without the outliers:
df_no_outliers['size'].mean()
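For example, with some made-up data (column names follow the 'time'/'size' assumption above), the two peaks are dropped before taking the mean:
import pandas as pd

df = pd.DataFrame({
    'time': range(20),
    'size': [10, 11, 10, 12, 11, 10, 30, 11, 10, 12,
             11, 10, 11, 35, 10, 11, 12, 10, 11, 10],
})

wind_size = 10
multiplier_thresh = 1.5
rolling_median = df['size'].rolling(window=wind_size).median().bfill()
to_stay = df['size'] < rolling_median * multiplier_thresh
print(df.loc[to_stay, 'size'].mean())   # mean with the peaks at 30 and 35 removed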
A simpler approach:
Just remove the outliers of all your 'size' values.
You can use a variety of methods to detect and remove the outliers.
Here is a simple one:
q1 = df["size"].quantile(0.25)
q3 = df["size"].quantile(0.75)
iqr = q3 - q1 # Interquartile range
df_no_outliers = df[df["size"] < q3 + 1.5 * iqr]

Histogram only in a specific range [duplicate]

Example df
How do I create a histogram that in this case only uses the range of 2–5 points, instead of the entire points data range of 1–6?
I'm trying to only display the average data spread, not the extreme areas. Is there maybe a function to zoom in to the significant ranges? And is there a smart way to describe those?
For your specific data, you can first filter your DataFrame, then call .hist(). Note that Series.between(left, right) includes both the left and right values:
df[df['points'].between(2, 5)].hist()
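For instance, with a made-up 'points' column (the column name is taken from the question), you can either filter first as above, or pass a range straight through to matplotlib's hist:
import pandas as pd

df = pd.DataFrame({'points': [1, 2, 2, 3, 3, 3, 4, 4, 5, 6]})   # toy data

df[df['points'].between(2, 5)].hist()    # filter the rows, then plot (as above)
df['points'].hist(range=(2, 5))          # or restrict the bin range of the histogram itself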

Is there a way to compare and highlight differences between multiple csv files sequentially? [closed]

I am a newbie in the programming space and there is something that I need to do across multiple folders that I feel will be easier if I can code it out.
I have a folder containing 12 csv files for which I need to run a comparison in Python against a particular column. The files contain common columns and data collected in the twelve months of the year (Jan-Dec). Is there a way I can compare the January file against the February file, then February against March, March against April, and so on, highlighting the differences along the way and saving them in one dataframe in Python?
The data is numerical and I would like to run this comparison across this specific column.
If you happen to have an index column, then you can extract the insertions/deletions by comparing the indices of each dataframe (corresponding to each file). This, however, only works if you have an identifier that is unique across files; that is, a single observation or row will always have the same ID (value in the index column) no matter what file it is in.
import numpy as np
import pandas as pd


def series_diff(series1: pd.Series, series2: pd.Series) -> pd.DataFrame:
    """Compare two series via their indices, returning a table of differences.

    Returns the additions and deletions made to ``series1`` to obtain ``series2``.
    """
    added = series2.index.difference(series1.index)
    deleted = series1.index.difference(series2.index)
    return pd.concat(
        [
            series1.loc[deleted].to_frame(name="value").assign(action="deleted"),
            series2.loc[added].to_frame(name="value").assign(action="added"),
        ]
    )
For example, if you have the following files and want to compare the target column:
jan.csv:
id,filler1,target,filler2
0,spam,0.6059782788074047,eggs
1,spam,0.7333693611934982,eggs
2,spam,0.13894715672839875,eggs
3,spam,0.31267308385468695,eggs
4,spam,0.9972432813403187,eggs
5,spam,0.1281623754189607,eggs
6,spam,0.17899310595018803,eggs
7,spam,0.7529254287760938,eggs
8,spam,0.662160514309534,eggs
9,spam,0.7843101321411227,eggs
feb.csv:
id,filler1,target,filler2
0,spam,0.6059782788074047,eggs
1,spam,0.7333693611934982,eggs
2,spam,0.13894715672839875,eggs
4,spam,0.9972432813403187,eggs
5,spam,0.1281623754189607,eggs
6,spam,0.17899310595018803,eggs
8,spam,0.662160514309534,eggs
9,spam,0.7843101321411227,eggs
10,spam,0.09689439592486082,eggs
11,spam,0.058571285088035996,eggs
12,spam,0.9623959902103917,eggs
13,spam,0.6165574438945741,eggs
14,spam,0.08662996124854716,eggs
Here the index column is named id. Note that feb.csv contains additional rows with IDs 10 to 14, while rows 3 and 7 have been removed with respect to jan.csv.
Let's load the files:
jan = pd.read_csv("jan.csv", index_col="id")
feb = pd.read_csv("feb.csv", index_col="id")
And run the diff:
series_diff(jan["target"], feb["target"])
value action
id
3 0.312673 deleted
7 0.752925 deleted
10 0.096894 added
11 0.058571 added
12 0.962396 added
13 0.616557 added
14 0.086630 added
If you don't have an index column, it becomes difficult to accurately identify differences: for instance, two rows with the same value(s) could either be the same observation or they could be different observations that just happen to have the same value.
If we assume that the order of rows isn't shuffled around between files, and that any additions are made at the end of the previous table, one idea is to compare the rows line by line with a text diff, e.g. by using the difflib module, which will highlight additions and deletions.
import difflib

print(
    *difflib.unified_diff(
        [f"{x}\n" for x in jan["target"]],
        [f"{x}\n" for x in feb["target"]],
        fromfile="jan",
        tofile="feb",
    ),
    sep="",
)
--- jan
+++ feb
@@ -1,10 +1,13 @@
 0.6059782788074047
 0.7333693611934982
 0.13894715672839875
-0.31267308385468695
 0.9972432813403187
 0.1281623754189607
 0.17899310595018803
-0.7529254287760938
 0.662160514309534
 0.7843101321411227
+0.09689439592486082
+0.058571285088035996
+0.9623959902103917
+0.6165574438945741
+0.08662996124854716
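To run this month over month across all twelve files and collect everything in one dataframe (as asked), a loop along these lines should work, reusing the series_diff function above; the file names and the 'target'/'id' column names are the same assumptions as before:
import pandas as pd

files = ["jan.csv", "feb.csv", "mar.csv"]  # ... through "dec.csv"
frames = [pd.read_csv(f, index_col="id") for f in files]

diffs = []
for (prev_name, prev), (curr_name, curr) in zip(
    zip(files, frames), zip(files[1:], frames[1:])
):
    diff = series_diff(prev["target"], curr["target"])
    diff["compared"] = f"{prev_name} -> {curr_name}"   # which pair of months this change belongs to
    diffs.append(diff)

all_diffs = pd.concat(diffs)   # one dataframe holding every month-to-month difference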

Pandas: slicing the dataframe using index values [closed]

I have a dataframe, given below, which contains the same set of banks twice. I need to slice the data from the 0th index, which contains a bank name, up to the index that contains that same bank name again, in this case DEUTSCH BANK AG. I need to apply the same logic to any dataframe of this kind.
I tried the logic df25.iloc[0,1] == df25[1].any(), but it only returns True rather than the index position.
DataFrame (screenshots): https://i.stack.imgur.com/iJ1hJ.png, https://i.stack.imgur.com/J2aDX.png
You need to get the indices of all the rows that have the value you are looking for (in this case the bank name) and then slice the data frame using those indices.
Example:
df = pd.DataFrame({'Col1':list('abcdeafgbfhi')})
search_str = 'b'
idx_list = list(df[(df['Col1']==search_str)].index.values)
print(df[idx_list[0]:idx_list[1]])
Output:
Col1
1 b
2 c
3 d
4 e
5 a
6 f
7 g
Note that the assumption is that there will be only 2 rows with the same value. If there are more than 2, you have to play with the index list values to get what you need (see the sketch below). Hope this helps.
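A rough sketch of that extension (the data here is made up): pair up consecutive occurrences of the value and slice between each pair.
import pandas as pd

df = pd.DataFrame({'Col1': list('abcdeafgbfhib')})   # 'b' now appears three times
search_str = 'b'

idx_list = df.index[df['Col1'] == search_str].tolist()
# Slice between each pair of consecutive occurrences (adjacent slices share an endpoint).
slices = [df.loc[start:end] for start, end in zip(idx_list, idx_list[1:])]
for chunk in slices:
    print(chunk, end='\n\n')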
Keep in mind that posting a sample data set will always help you get more answers; people tend to move on to another question when they see images or screenshots, because reproducing the issue then takes extra steps.
