Removing numbers from dataframe that lie within range - python

I have a pandas dataframe that contains values between -1000 and 1000. I want to eliminate all the numbers in the range -0.00001 to 0.00001, i.e. replace them with NaN. It is worth mentioning that my df contains numerous very small positive and negative numbers that I want to include within this range as well, e.g. 6.26478E-52.
How do I go about doing this?
P.S. I am attaching an image of the first few rows of my df for reference.

IIUC, if you need values less than -0.00001 or less than 0.00001:
df = df.mask(df.lt(-0.00001) | df.lt(0.00001))
which is the same as simply below 0.00001:
df = df.mask(df.lt(0.00001))
Or, if you need the values between -0.00001 and 0.00001:
df = df.mask(df.gt(-0.00001) & df.lt(0.00001))
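For a quick illustration on made-up numbers (the column names below are invented, not from the attached df), the last variant masks only the tiny values:
import pandas as pd
# toy frame with one very small value that should become NaN
df = pd.DataFrame({'a': [5.0, 6.26478e-52, -3.2], 'b': [-0.000002, 250.0, 0.5]})
# replace everything strictly between -0.00001 and 0.00001 with NaN
df = df.mask(df.gt(-0.00001) & df.lt(0.00001))
print(df)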

Related

How to remove irrelevant data from a dataframe?

I have a large dataset (image: https://drive.google.com/file/d/1TFX1LQhQ-xwYp47fllL8PWwbwbv3wsb6/view?usp=drivesdk). The shape of the data is 10000x100. I want to remove irrelevant data using the condition -0.5 <= angle <= 0.8.
The angle columns are Angle_0, Angle_1, Angle_2, ..., Angle_20:
ang = [f"Angle_{i}" for i in range(20)]
I want to keep the rows that satisfy -0.5 <= angle <= 0.8 for every angle column and delete the other rows.
How do I do this in Python and pandas?
For example, Angle_0 has the value 0.1926715 (row number 24) and I want that row of the entire dataset; then Angle_1 has values such as 0.1926715 and 0.192497, and I need those rows (row numbers 7, 9, 14, 16, 19, 21, 23, 26, 28, 29); similarly for all the other angles.
I am a beginner in Python.
Thank you very much in advance.
If you need to remove the rows that contain at least one angle that does not satisfy your condition, this will solve your problem:
df.loc[((df >= -0.5) & (df <= 0.8)).all(axis=1)]
For a single angle column the equivalent filter is:
new_df = df[(df['angle'] >= -0.5) & (df['angle'] <= 0.8)]
Dirty way:
for n in range(20):
    column_name = "Angle_" + str(n)
    df = df[(df[column_name] >= -0.5) & (df[column_name] <= 0.8)]
Since you need to address several columns whose names start with Angle, you could use a regex like this to select them:
df = df.loc[(df.filter(regex='^Angle_') >= -0.5).all(axis=1) & (df.filter(regex='^Angle_') <= 0.8).all(axis=1)]
The regex-based solution is better if the number of angle columns varies or might change later.
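A minimal sketch of the regex approach on invented data (the column names and values are placeholders, not taken from the linked file):
import pandas as pd
df = pd.DataFrame({'Angle_0': [0.19, 0.95, -0.30],
                   'Angle_1': [0.20, 0.10, -0.70],
                   'Other': [10, 20, 30]})
angles = df.filter(regex='^Angle_')  # select only the Angle_* columns
mask = ((angles >= -0.5) & (angles <= 0.8)).all(axis=1)  # every angle within range
print(df.loc[mask])  # keeps only the first row; 'Other' is untouched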

Number of quarters gap to int

I calculate the number of quarters between two dates. Now I want to test whether the number of quarters in the gap is bigger than 2.
Thank you for your comments!
I'm actually running code from WRDS (Wharton Research Data Services). Below, fst_vint is a DataFrame with two date variables, rdate and lag_rdate. The first line seems to convert them to quarterly periods (e.g., 9/8/2019 to 2019Q3) and then take the difference between them, storing it in a new column qtr.
fst_vint.qtr >= 2 creates a problem, because the former is a QuarterEnd offset object, while the latter is an integer. How do I deal with this problem?
fst_vint['qtr'] = (fst_vint['rdate'].dt.to_period('Q') -
                   fst_vint['lag_rdate'].dt.to_period('Q'))
# label first_report flag
fst_vint['first_report'] = ((fst_vint.qtr.isnull()) | (fst_vint.qtr >= 2))
Using .diff() after the column is converted to an integer with .astype(int) gives the desired result. So the code in your case would be:
fst_vint['qtr'] = fst_vint['rdate'].astype(int).diff()
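If instead you want the row-wise gap between rdate and lag_rdate as a plain integer (so that the >= 2 comparison works), one possible sketch, assuming a recent pandas where subtracting quarterly Periods yields offset objects that expose an .n attribute, is:
import numpy as np
import pandas as pd
diff = (fst_vint['rdate'].dt.to_period('Q')
        - fst_vint['lag_rdate'].dt.to_period('Q'))
# pull the integer number of quarters out of each offset, keeping NaN where the lag is missing
fst_vint['qtr'] = diff.apply(lambda x: x.n if pd.notnull(x) else np.nan)
fst_vint['first_report'] = fst_vint['qtr'].isnull() | (fst_vint['qtr'] >= 2)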

How to filter out columns with binary classes, that are below a specific frequency in python?

I am pretty new to programming, and I am sure many solutions exist, but for now, mine seems not to be working. I have a dataset with over 200 predictor variables and the majority of them are binary 1= event, 0= no event. I want to filter out all variables that have an occurrence frequency below a certain threshold, e.g., 100 times.
I've tried something like this:
diag = luisa.T.reset_index().rename(columns={'index': 'diagnosis'})
frequency = pd.concat([diag.iloc[:, :1], pd.DataFrame(diag.sum(1))], axis=1).rename(columns={0: 'count'})
frequency.nlargest(150, 'count')
Please help!
You can take the column-wise sum and filter out the columns whose sums are below a certain value, keeping in mind that the sum represents the total number of events:
threshold = 100
col_sum = df.sum()
filtered_df = df[col_sum[col_sum > threshold].index]
This will store in filtered_df a subset of the original DataFrame without those columns.
If not all your columns are binary, then you need to include the additional step of performing this operation only on the binary columns, and then reversing the condition to find the columns which do not fulfil your criteria:
binary_columns = df.isin([0, 1]).all(axis=0)
binary_df = df.loc[:, binary_columns]
col_sum = binary_df.sum()
filtered_df = df.drop(columns=col_sum[col_sum < threshold].index)
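As a quick sanity check on invented data (the column names and event probabilities below are made up), low-frequency binary columns are dropped while non-binary columns survive:
import numpy as np
import pandas as pd
rng = np.random.default_rng(0)
df = pd.DataFrame({'rare_event': (rng.random(1000) < 0.05).astype(int),    # roughly 50 events
                   'common_event': (rng.random(1000) < 0.50).astype(int),  # roughly 500 events
                   'age': rng.integers(18, 90, 1000)})                     # non-binary, left alone
threshold = 100
binary_columns = df.isin([0, 1]).all(axis=0)
col_sum = df.loc[:, binary_columns].sum()
filtered_df = df.drop(columns=col_sum[col_sum < threshold].index)
print(filtered_df.columns.tolist())  # expected: ['common_event', 'age']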

Unable to drop rows in pandas DataFrame which contain zeros

I am editing a large dataframe in Python. How do you drop entire rows from the dataframe if a specific column's value in that row is 0.0?
When I drop the 0.0s in the overall_satisfaction column, the edits are not reflected in my scatterplot matrix of the large dataframe.
I have tried:
filtered_df = filtered_df.drop([('overall_satisfaction'==0)], axis=0)
also tried replacing 0.0 with nulls & dropping the nulls:
filtered_df = filtered_df.['overall_satisfaction'].replace(0.0, np.nan), axis=0)
filtered_df = filtered_df[filtered_NZ_df['overall_satisfaction'].notnull()]
What concept am I missing? Thanks :)
So it seems like your values are small enough to be displayed as zeros, but are not actually zeros. This usually happens when calculations produce very small numbers that approach zero without being exactly zero, so equality comparisons do not give you the result you're looking for.
In cases like this, numpy has a handy function called isclose that lets you test whether a number is within a certain tolerance of another number.
In your case,
df = df[~np.isclose(df['overall_satisfaction'], 0)]
seems to work.
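A minimal sketch on invented values (the column name comes from the question, the numbers are made up) showing why the tolerance matters:
import numpy as np
import pandas as pd
df = pd.DataFrame({'overall_satisfaction': [4.5, 1e-12, 0.0, 3.0]})
print((df['overall_satisfaction'] == 0).sum())  # 1: plain equality misses the 1e-12 residue
df = df[~np.isclose(df['overall_satisfaction'], 0)]  # default atol=1e-08 catches both
print(df)  # only the rows with 4.5 and 3.0 remain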

Converting a single pandas index into a three level MultiIndex in python

I have some data in a pandas dataframe which looks like this:
gene VIM
time:2|treatment:TGFb|dose:0.1 -0.158406
time:2|treatment:TGFb|dose:1 0.039158
time:2|treatment:TGFb|dose:10 -0.052608
time:24|treatment:TGFb|dose:0.1 0.157153
time:24|treatment:TGFb|dose:1 0.206030
time:24|treatment:TGFb|dose:10 0.132580
time:48|treatment:TGFb|dose:0.1 -0.144209
time:48|treatment:TGFb|dose:1 -0.093910
time:48|treatment:TGFb|dose:10 -0.166819
time:6|treatment:TGFb|dose:0.1 0.097548
time:6|treatment:TGFb|dose:1 0.026664
time:6|treatment:TGFb|dose:10 -0.008032
where the left-hand column is the index. This is just a subsection of the data, which is actually much larger. The index is composed of three components: time, treatment and dose. I want to reorganize this data so that I can access it easily by slicing. The way to do this is with a pandas MultiIndex, but I don't know how to convert my DataFrame with a single index into one with three levels. Does anybody know how to do this?
To clarify, the desired output here is the same data with a three-level index, the outer being treatment, the middle dose and the inner time. This would be useful so that I could access the data with something like df['time']['dose'] or df[0] (or something to that effect at least).
You can first replace the unnecessary strings (the index has to be converted to a Series by to_series, because replace doesn't work on an Index yet) and then use split. Finally, set the index names with rename_axis (new in pandas 0.18.0):
df.index = df.index.to_series().replace({'time:': '', 'treatment:': '', 'dose:': ''}, regex=True)
df.index = df.index.str.split('|', expand=True)
df = df.rename_axis(('time', 'treatment', 'dose'))
print (df)
                         VIM
time treatment dose
2    TGFb      0.1  -0.158406
               1     0.039158
               10   -0.052608
24   TGFb      0.1   0.157153
               1     0.206030
               10    0.132580
48   TGFb      0.1  -0.144209
               1    -0.093910
               10   -0.166819
6    TGFb      0.1   0.097548
               1     0.026664
               10   -0.008032
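Once the MultiIndex is in place you can slice it, for example with xs or loc; a small sketch (note that the level values are strings here because they came from str.split):
# all rows for time 2 (outermost level)
print(df.xs('2', level='time'))
# a single cell: time 24, treatment TGFb, dose 10
print(df.loc[('24', 'TGFb', '10'), 'VIM'])
# reorder to the treatment/dose/time layout mentioned in the question
df = df.reorder_levels(['treatment', 'dose', 'time']).sort_index()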
