How to divide two columns with different sizes (Pandas)? - python

I have two dataframes that are spectral measurements (both have two columns: Intensity and Wavelength) and I need to divide the intensity of one by the intensity of the other at a given Wavelength, as if I were dividing two functions (I1(λ) / I2(λ)). The difficulty is that the two dataframes have different sizes and the Wavelength values of one are not exactly the same as the other's (although they are obviously close).
One has approximately 200 rows (black line) and the other has 3648 (red line). In short, the red curve is much more densely sampled than the black one, but as I said before, the Wavelength values of the respective dataframes are not exactly the same.
They have different Wavelength ranges as well:
Black goes from 300.2 to 795.5 nm
Red goes from 199.975 to 1027.43 nm
What I would like to do is something like this:
Note that I divided the Intensity of the black one by that of the red one, and the result, together with its corresponding Wavelength, goes into a new df. Is it possible to generate a new dataframe with an equivalent Wavelength and make this division between intensities?

Here is a working solution to your problem. My current assumption is that the sampling rate of the instrument is the same for both. Since you didn't provide any sample data, I have generated some. The answer is based on concatenating both dataframes on the Wavelength column.
import pandas as pd
import numpy as np
##generating the test data
black_lambda = np.arange(300.2,795.5,0.1)
red_lambda = np.arange(199.975,1027.43,0.1)
I_black = np.random.random((1,len(black_lambda))).ravel()
I_red = np.random.random((1,len(red_lambda))).ravel()
df = pd.DataFrame([black_lambda,I_black]).T
df1 = pd.DataFrame([red_lambda,I_red]).T
df.columns=['lambda','I_black']
df1.columns=['lambda','I_red']
Continue from here:
#setting lambda as index for both dataframes
df.set_index(['lambda'],inplace=True)
df1.set_index(['lambda'],inplace=True)
#concatenating/merging both dataframes into one
df3 = pd.concat([df,df1],axis=1)
#since the dataframes do not cover exactly the same wavelengths, there will be missing values. Taking care of them by filling from neighbouring rows (optional).
df3.bfill(inplace=True)
df3.ffill(inplace=True)
#creating a new column 'division' to finish up the task
df3['division'] = df3['I_black'] / df3['I_red']
print(df3)
Output:
I_black I_red division
lambda
199.975 0.855777 0.683906 1.251308
200.075 0.855777 0.305783 2.798643
200.175 0.855777 0.497258 1.720993
200.275 0.855777 0.945699 0.904915
200.375 0.855777 0.910735 0.939655
... ... ... ...
1026.975 0.570973 0.637064 0.896258
1027.075 0.570973 0.457862 1.247042
1027.175 0.570973 0.429709 1.328743
1027.275 0.570973 0.564804 1.010924
1027.375 0.570973 0.246437 2.316917
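As a side note (not part of the answer above): since the two spectra are sampled at different wavelengths, another option is to interpolate the denser red spectrum onto the black wavelength grid with np.interp and then divide point by point. A minimal sketch reusing the arrays generated above:
# red intensity evaluated at the black wavelengths
I_red_on_black = np.interp(black_lambda, red_lambda, I_red)
result = pd.DataFrame({'lambda': black_lambda,
                       'I_black': I_black,
                       'division': I_black / I_red_on_black})
print(result.head())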

Related

Generate missing values on the dataset based on ZIPF distribution

Currently, I want to observe the impact of missing values on my dataset. I replace a certain share of the data points (10, 20, 90 %) with missing values and observe the impact. The function below replaces a given percentage of data points with missing values.
import random
import numpy as np

def dropout(df, percent):
    # create df copy
    mat = df.copy()
    # number of values to replace
    prop = int(mat.size * percent)
    # flat cell indices to mask
    mask = random.sample(range(mat.size), prop)
    # np.put does not work on a DataFrame, so translate the flat indices
    # into (row, column) positions and set those cells to NaN
    rows, cols = np.unravel_index(mask, mat.shape)
    for r, c in zip(rows, cols):
        mat.iat[r, c] = np.nan
    return mat
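For example (made-up numeric data, just to illustrate the call):
import pandas as pd
demo = pd.DataFrame(np.random.random((100, 5)), columns=list("abcde"))
demo_missing = dropout(demo, 0.10)
print(demo_missing.isna().sum())   # roughly 10% of all cells are now NaN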
My question is: I want to introduce missing values based on a Zipf distribution / power law / long tail. For instance, I have a dataset that contains 10 columns (5 columns of categorical data and 5 columns of numerical data). I want to replace some data points in the 5 categorical columns following Zipf's law, so that columns on the left side have more missing values than those on the right side.
I am using Python for this task.
I looked at the SciPy manual about the Zipf distribution at this link: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.zipf.html but it still doesn't help me much.
Zipf distributions are a family of distributions on the positive integers (1 to infinity), whereas you want to delete values from only 5 discrete columns, so you will have to make some arbitrary decisions to do this. Here is one way:
1. Pick a parameter for your Zipf distribution, say a = 2 as in the example given on the SciPy documentation page.
2. Looking at the plot given on that same page, you could decide to truncate at 10, i.e. if a sampled value of more than 10 comes up, you just discard it.
3. Then you could map the remaining domain of 0 to 10 linearly to your five categorical columns: any value between 0 and 2 corresponds to the first column, and so on.
4. Iteratively sample single values from your Zipf distribution using the SciPy function. For every sampled value, delete one data point in the column the value corresponds to (see 3.), until you have reached the overall desired percentage of missing values.
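A minimal sketch of these steps (the column names c1..c5, the random data and the 10% target are made up for illustration; sampling uses numpy's Generator.zipf):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# hypothetical frame: 5 categorical columns that should receive the missing values
df = pd.DataFrame({f"c{i}": rng.choice(list("xyz"), size=1000) for i in range(1, 6)})

a = 2.0             # Zipf parameter, as in the documentation example (step 1)
cutoff = 10         # discard sampled values larger than this (step 2)
target = 0.10       # overall fraction of cells to delete (made up)
n_to_delete = int(df.size * target)

deleted = 0
while deleted < n_to_delete:
    z = rng.zipf(a)
    if z > cutoff:
        continue                                                 # step 2: truncate the tail
    col = min((z - 1) * df.shape[1] // cutoff, df.shape[1] - 1)  # step 3: map 1..10 onto the 5 columns
    row = rng.integers(len(df))
    if pd.notna(df.iat[row, col]):
        df.iat[row, col] = np.nan                                # step 4: delete one data point there
        deleted += 1

print(df.isna().sum())   # left-hand columns end up with far more missing values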

pandas fill in 0 for non-existing categories in value_counts()

Problem: I'm grouping results in my DataFrame, looking at value_counts(normalize=True) and trying to plot the result in a barplot.
The problem is that the barplot should contain frequencies. In some groups, some values don't occur. In that case, the corresponding value_count is not 0; it simply doesn't exist. For the barplot, this 0 value is not taken into account and the resulting bar is too big.
Example: Here is a minimal example which illustrates the problem. Let's say the DataFrame contains observations for experiments. When you perform such an experiment, a series of observations is collected. The result of the experiment is the relative frequencies of the observations collected for it.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame()
df["id"] = [1]*3 + [2]*3 + [3]*3
df["experiment"] = ["a"]*6 + ["b"] * 3
df["observation"] = ["positive"]*3 + ["positive"]*2 + ["negative"]*1 + ["positive"]*2 + ["negative"]*1
there are two experiment types, "a" and "b"
observations that belong to the same evaluation of an experiment are given the same id.
So here, experiment a has been done 2 times, experiment b just once.
I need to group by id and experiment, then average the result.
plot_frame = pd.DataFrame(df.groupby(["id", "experiment"])["observation"].value_counts(normalize=True))
plot_frame = plot_frame.rename(columns={"observation":"percentage"})
In the picture above, you can already see the problem. The evaluation with id 1 has seen only positive observations. The relative frequency of "negative" should be 0. Instead, it doesn't exist. If I plot this, the corresponding bar is too high, the blue bars should add up to one:
sns.barplot(data=plot_frame.reset_index(),
            x="observation",
            hue="experiment",
            y="percentage")
plt.show()
You can add rows filled with 0 by using the unstack/stack methods with the argument fill_value=0. Try this:
df.groupby(["id", "experiment"])["observation"].value_counts(normalize=True).unstack(fill_value=0).stack()
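For example, wiring this into the barplot from the question (a sketch that assumes the df, sns and plt names defined there):
plot_frame = (df.groupby(["id", "experiment"])["observation"]
                .value_counts(normalize=True)
                .unstack(fill_value=0)     # adds the missing categories as 0
                .stack()
                .rename("percentage")
                .reset_index())
sns.barplot(data=plot_frame, x="observation", hue="experiment", y="percentage")
plt.show()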
I have found a hacky solution, by iterating over the index and manually filling in the missing values:
for a, b, _ in plot_frame.index:
    if (a, b, "negative") not in plot_frame.index:
        plot_frame.loc[(a, b, "negative"), "percentage"] = 0
Now this produces the desired plot:
I don't particularly like this solution, since it is very specific to my index and probably doesn't scale well if the categories become more complex.

Partition matrix into smaller matrices based on multiple values

So I have this giant matrix (~1.5 million rows x 7 columns) and am trying to figure out an efficient way to split it up. For simplicity of what I'm trying to do, I'll work with this much smaller matrix as an example for what I'm trying to do. The 7 columns consist of (in this order): item number, an x and y coordinate, 1st label (non-numeric), data #1, data #2, and 2nd label (non-numeric). So using pandas, I've imported from an excel sheet my matrix called A that looks like this:
What I need to do is partition this based on both labels (i.e. so I have one matrix that is all the 13G + Aa together, another matrix that is 14G + Aa, and another one that is 14G + Ab together -- this would have me wind up with 3 separate 2x7 matrices). The reason for this is because I need to run a bunch of statistics on the dataset of numbers of the "Marker" column for each individual matrix (e.g. in this example, break the 6 "marker" numbers into three sets of 2 "marker" numbers, and then run statistics on each set of two numbers). Since there are going to be hundreds of these smaller matrices on the real data set I have, I was trying to figure out some way to make the smaller matrices be labeled something like M1, M2, ..., M500 (or whatever number it ends up being) so that way later, I can use some loops to apply statistics to each individual matrix all at once without having to write it 500+ times.
What I've done so far is to use pandas to import my data set into python as a matrix with the command:
import pandas as pd
import numpy as np
df = pd.read_csv(r"C:\path\cancerdata.csv")
A = df.to_numpy() #Convert the DataFrame to a matrix (df.as_matrix() is removed in newer pandas)
A = np.delete(A, (0),axis=0) #Delete header row
Unfortunately, I haven't come across many resources for how to do what I want, which is why I wanted to ask here to see if anyone knows how to split up a matrix into smaller matrices based on multiple labels.
Your question has many implications, so instead of giving you a straight answer I'll try to give you some pointers on how to tackle this problem.
First off, don't transform your DataFrame into a Matrix. DataFrames are well-optimised for slicing and indexing operations (a Pandas Series object is in reality a fancy Numpy array anyway), so you only lose functionality by converting it to a Matrix.
You could probably convert your label columns into a MultiIndex. This way, you'll be able to access slices of your original DataFrame using df.loc, with a syntax similar to df.loc[label1].loc[label2].
A MultiIndex may sound confusing at first, but it really isn't. Try executing this code block and see for yourself what the resulting DataFrame looks like:
df = pd.read_csv(r"C:\path\cancerdata.csv")
labels01 = df["Label 1"].unique()
labels02 = df["Label 2"].unique()
df.set_index(["Label 1", "Label 2"], inplace=True)
print(df)
Here, we extracted all unique values in the columns "Label 1" and "Label 2" (they are reused in the loop further down), and then turned those two columns into a MultiIndex. In the df.set_index line, we pulled those columns out of the DataFrame - now they act as indices for your other columns. For example, in order to access the slice of your original DataFrame whose Label 1 = 13G and Label 2 = Aa, you can simply write:
sliced_df = df.loc["13G"].loc["Aa"]
And perform whatever calculations/statistics you need with it.
Lastly, instead of saving each sliced DataFrame into a list or dictionary and then iterating over them to perform the calculations, consider rearranging your code so that, as soon as you create your sliced DataFrame, you perform the calculations, save them to an output/results file/DataFrame, and move on to the next slicing operation. Something like:
for L1 in labels01:
    for L2 in labels02:
        # combinations that do not occur in the data will raise a KeyError; guard or skip them as needed
        sliced_df = df.loc[L1].loc[L2]
        results = perform_calculations(sliced_df)
        save_results(results)
This will both improve memory consumption and performance, which may be important considering your large dataset.
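As a side note (not claimed by the answer above), pandas groupby can also do the split-and-compute step in one pass. A sketch, assuming the "Label 1"/"Label 2" column names used above and the "Marker" column mentioned in the question:
# summary statistics of "Marker" for every label combination that actually occurs
stats = df.groupby(["Label 1", "Label 2"])["Marker"].describe()
print(stats)

# or iterate over the groups, mirroring the loop above
for (l1, l2), sliced_df in df.groupby(["Label 1", "Label 2"]):
    results = perform_calculations(sliced_df)
    save_results(results)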

Removing points which deviate too much from adjacent point in Pandas

So I'm doing some time series analysis in Pandas and have a peculiar pattern of outliers which I'd like to remove. The plot below is based on a dataframe with the first column as a date and the second column as the data.
As you can see, those interspersed points of similar values that look like lines are likely instrument quirks and should be removed. I've tried rolling mean, rolling median and removal based on standard deviation, to no avail. For an idea of density, it's daily measurements from 1984 to the present. Any ideas?
import pandas as pd

gauge = pd.read_csv('GaugeData.csv', parse_dates=[0], header=None)
gauge.columns = ['Date', 'Gauge']
gauge = gauge.set_index(['Date'])
gauge['1990':'1995'].plot(style='*')
And the result of applying rolling median
gauge = gauge.rolling(5, center=True).mean()  #gauge.diff()  (pd.rolling_mean is removed in newer pandas)
gauge['1990':'1995'].plot(style='*')
After rolling median
You can demand that each data point have at least "N" "nearby" data points within a certain distance "D".
N can be 2 or more.
"Nearby" for element gauge[i] can be a pair like gauge[i-1] and gauge[i+1], but since some points only have neighbours on one side, you can instead ask for at least two elements whose distance in indexes (dates) is less than 2. So, let's say at least 2 of {gauge[i-2], gauge[i-1], gauge[i+1], gauge[i+2]} should satisfy: distance(gauge[i], gauge[ix]) < D.
D you can decide based on how close you expect the real data points to be.
It won't be perfect, but it should get most of the noise out of the dataset.
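A rough sketch of that rule (N, D and the +/- 2-position window are assumptions to tune; 'Gauge' is the column name from the question):
import pandas as pd

def filter_outliers(series, N=2, D=5.0, window=2):
    # keep a point only if at least N of its neighbours within +/- window
    # positions are closer than D in value
    values = series.to_numpy()
    keep = []
    for i in range(len(values)):
        lo = max(0, i - window)
        hi = min(len(values), i + window + 1)
        neighbours = [values[j] for j in range(lo, hi) if j != i]
        close = sum(abs(values[i] - v) < D for v in neighbours)
        keep.append(close >= N)
    return series[pd.Series(keep, index=series.index)]

# e.g. on the raw data from the question (D depends on the units of the gauge)
clean = filter_outliers(gauge['Gauge'], N=2, D=5.0)
clean['1990':'1995'].plot(style='*')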

split pandas dataframe into multiple dataframes according to distribution of column

Changed question and picture (as I said before... it's complicated :)
I have a pandas dataframe 'df' that has a column 'score' (floating point values) with some distribution (let's say a normal distribution). I additionally have an integer 'splits' (let's say 3) and a floating point number 'gap' (let's say 0.5).
I would like to have two dataframes 'gaps_df' and 'rest_df'. 'gaps_df' should consist of all entries from df that are marked orange in the picture (every two red lines have distance 'gap'). 'rest_df' consists of all entries which are marked green.
Here is the tricky part: The green areas have to be of equal size!
To be clear:
the GREEN areas have to contain an equal number of entries!
the ORANGE areas have to consist of the entries within the gap range (their number doesn't matter) between the green areas
So far I have the following:
df = df.sort_values('score')
df = df.reset_index(drop=True)
split_markers = []
for marker_index in range(1, splits):
    split_markers.append(int(marker_index * len(df) / splits))
But the last two lines are wrong, since they split the WHOLE AREA into an equal number of entries. With a normal distribution, I could just move the markers 0.5*gap to the left and to the right. But in fact I do NOT have a normal distribution (that was just to quickly create a picture with equal green areas).
It's freaking me out. I really appreciate any help you can give! Maybe there is a much easier solution...
