Currently, I want to observe the impact of missing values on my dataset. I replace a given share of the data points (10, 20, 90 %) with missing values and observe the impact. The function below replaces a given fraction of the data points with missing values.
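import random

import numpy as np

def dropout(df, percent):
    # work on a copy so the caller's DataFrame is untouched
    mat = df.copy()
    # number of values to replace (percent is a fraction, e.g. 0.1 for 10 %)
    prop = int(mat.size * percent)
    # flat indices of the cells to mask, sampled without replacement
    idx = random.sample(range(mat.size), prop)
    # boolean mask with the DataFrame's shape; True marks a cell to blank out
    mask = np.zeros(mat.size, dtype=bool)
    mask[idx] = True
    # replace the masked cells with NaN
    return mat.mask(mask.reshape(mat.shape))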
My question: I want to introduce missing values according to a Zipf distribution (power law / long tail). For instance, I have a dataset with 10 columns (5 categorical and 5 numerical). I want to replace some data points in the 5 categorical columns according to Zipf's law, so that the columns on the left have more missing values than the columns on the right.
I am using Python for this task.
I looked at the documentation for the Zipf distribution at this link: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.zipf.html but it did not help me much.
Zipf distributions are a family of distributions on the positive integers (1 to infinity), whereas you want to delete values from only 5 discrete columns, so you will have to make some arbitrary decisions to do this. Here is one way:
1. Pick a parameter for your Zipf distribution, say a = 2 as in the example given on the linked documentation page.
2. Looking at the plot given on that same page, you could decide to truncate at 10, i.e. if any sampled value greater than 10 comes up, you just discard it.
3. Then map the remaining domain of 1 to 10 linearly onto your five categorical columns: the values 1 and 2 correspond to the first column, 3 and 4 to the second, and so on.
4. Iteratively sample single values from your Zipf distribution using numpy.random.zipf (the function documented on that page). For every sampled value, delete one data point in the column the value corresponds to (see 3.), until you have reached the overall desired percentage of missing values.
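The steps above can be sketched roughly as follows. This is only one possible reading of them, not a definitive implementation: the function name `zipf_dropout` and its parameters `columns`, `cutoff` and `seed` are made up for illustration, and it assumes that `percent` is a fraction of the cells in the selected columns and that the column names are unique.

```python
import numpy as np
import pandas as pd

def zipf_dropout(df, columns, percent, a=2.0, cutoff=10, seed=None):
    """Blank out about `percent` of the cells in `columns`, favouring the first columns.

    Hypothetical helper: `columns` is ordered from "most missing" to "least missing".
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    # rows that can still be blanked out, per column, in random order
    available = {
        c: list(rng.permutation(np.flatnonzero(out[c].notna().to_numpy())))
        for c in columns
    }
    n_target = int(len(out) * len(columns) * percent)
    n_target = min(n_target, sum(len(rows) for rows in available.values()))
    bin_width = cutoff / len(columns)  # e.g. 10 / 5 = 2 Zipf values per column
    n_done = 0
    while n_done < n_target:
        draw = rng.zipf(a)             # integer >= 1
        if draw > cutoff:
            continue                   # step 2: discard values past the cutoff
        col = columns[int((draw - 1) // bin_width)]  # step 3: value -> column
        if not available[col]:
            continue                   # this column is already fully missing
        row = available[col].pop()
        out.iloc[row, out.columns.get_loc(col)] = np.nan
        n_done += 1
    return out
```

Calling `zipf_dropout(df, ["c0", "c1", "c2", "c3", "c4"], 0.2, seed=0)` on a frame with those (hypothetical) column names would put most of the missing values in `c0` and very few in `c4`. The loop draws one value at a time to stay close to the description, so it is slow on large data.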
I'm doing some time series analysis in Pandas and have a peculiar pattern of outliers which I'd like to remove. The plot below is based on a dataframe with the date as the first column and the data as the second column.
As you can see, the points of similar value that are interspersed in the data and look like lines are likely instrument quirks and should be removed. I've tried rolling_mean, a rolling median, and removal based on standard deviation, all to no avail. For an idea of density, these are daily measurements from 1984 to the present. Any ideas?
import pandas as pd

gauge = pd.read_csv('GaugeData.csv', parse_dates=[0], header=None)
gauge.columns = ['Date', 'Gauge']
gauge = gauge.set_index(['Date'])
gauge['1990':'1995'].plot(style='*')
And the result of applying rolling median
gauge = gauge.rolling(5, center=True).mean()  # gauge.diff()
gauge['1990':'1995'].plot(style='*')
After rolling median
You can require that each data point has at least N "nearby" data points within a certain distance D.
N can be 2 or more.
"Nearby" for element gauge[i] can mean the pair gauge[i-1] and gauge[i+1], but since some points only have neighbors on one side, you can instead ask for at least two elements whose index (date) distance is at most 2. So, let's say at least 2 of {gauge[i-2], gauge[i-1], gauge[i+1], gauge[i+2]} should satisfy: Distance(gauge[i], gauge[ix]) < D
D: you can choose this based on how close you expect the real data points to be.
It won't be perfect, but it should get most of the noise out of the dataset.
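A minimal sketch of this neighbor test, assuming the data is a pandas Series of floats such as gauge['Gauge']. The function name `filter_isolated` and the parameters `n_required` (N), `max_dist` (D) and `window` (how many positions to look on each side) are invented for the example, and the default values are placeholders to tune.

```python
import numpy as np
import pandas as pd

def filter_isolated(series, n_required=2, max_dist=1.0, window=2):
    """Keep points that have at least n_required close neighbours.

    A neighbour is any point at most `window` positions away; it is "close"
    if its value differs from the point's value by less than max_dist.
    """
    values = series.to_numpy(dtype=float)
    close_count = np.zeros(len(values), dtype=int)
    for offset in range(1, window + 1):
        for shift in (-offset, offset):
            neighbour = series.shift(shift).to_numpy(dtype=float)
            # comparisons against NaN (series edges, missing data) are False
            close_count += np.abs(values - neighbour) < max_dist
    return series[close_count >= n_required]

# Small made-up example: the 9.0 has no close neighbours and is dropped.
demo = pd.Series([1.0, 1.1, 1.2, 9.0, 1.3, 1.4, 1.5])
cleaned = filter_isolated(demo, n_required=2, max_dist=0.5, window=2)
```

If the spurious points sit next to each other on several consecutive days they count as each other's neighbors, so a larger `window` or a larger `n_required` may be needed.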
Changed question and picture (as I said before... it's complicated :)
I have a pandas dataframe 'df' that has a column 'score' (floating point values) with some distribution (let's say a normal distribution). I additionally have an integer 'splits' (let's say 3) and a floating point number 'gap' (let's say 0.5).
I would like to have two dataframes 'gaps_df' and 'rest_df'. 'gaps_df' should consist of all entries from df that are marked orange in the picture (every two red lines have distance 'gap'). 'rest_df' consists of all entries which are marked green.
Here is the tricky part: The green areas have to be of equal size!
To be clear:
the GREEN areas have to contain an equal number of entries!
the ORANGE areas have to consist of the entries within the gap range between the green areas (their number doesn't matter)
So far I have the following:
df = df.sort_values('score')
df = df.reset_index(drop=True)
split_markers = []
for marker_index in range(1, classes):
    split_markers.append(marker_index * len(df) // classes)
But the last two lines are wrong, since they split the WHOLE AREA into parts with an equal number of entries. With a normal distribution, I could just move the markers 0.5*gap to the left and to the right. But in fact I do NOT have a normal distribution (that was just a quick way to create a picture with equal green areas).
This is driving me crazy. I really appreciate any help you can give! Maybe there is a much easier solution...
I have CSV files of 1200 rows x 3 columns. The number of rows can vary from as few as 500 to as many as 5000, but the columns remain the same.
I want to create a feature vector from each of these files that has a consistent number of cells (vector length), so that I can compute the distance between these vectors.
FILE_1
A, B, C
(267.09669678867186, 6.3664069175720197, 1257325.5809999991),
(368.24070923984374, 9.0808353424072301, 49603.662999999884),
(324.21470826328124, 11.489830970764199, 244391.04699999979),
(514.33452027500005, 7.5162401199340803, 56322.424999999988),
(386.19673340976561, 9.4927110671997106, 175958.77100000033),
(240.09965330898439, 10.3463039398193, 457819.8519411764),
(242.17559998691405, 8.4401674270629901, 144891.51100000029),
(314.23066895664061, 7.4405002593994096, 58433.818999999959),
(933.3073596304688, 7.1564397811889604, 41977.960000000014),
(274.04136473476564, 4.8482465744018599, 48782.314891525479),
(584.2639294320312, 7.90128517150879, 49730.705000000096),
(202.13173096835936, 10.559995651245099, 20847.805144088608),
(324.98563963710939, 2.2546300888061501, 43767.774800000007),
(464.35059935390626, 11.573680877685501, 1701597.3915132943),
(776.28339964687495, 8.7755222320556605, 106882.2469999999),
(310.11652952968751, 10.3175926208496, 710341.19162800116),
(331.19962889492189, 10.7578010559082, 224621.80632433048),
(452.31337752387947, 7.3100395202636701, 820707.26700000139),
(430.16615111171876, 10.134071350097701, 18197.691999999963),
(498.24687010585939, 11.0102319717407, 45423.269964585743),
.....,
.....,
500th row
FILE_2
(363.02781861484374, 8.8369808197021502, 72898.479666666608),
(644.20353882968755, 8.6263589859008807, 22776.78799999999),
(259.25105469882811, 9.8575859069824201, 499615.64068339905),
(410.19474608242189, 9.8795070648193395, 316146.18800000293),
(288.12153809726561, 4.7451887130737296, 58615.577999999943),
(376.25868409335936, 10.508985519409199, 196522.12200000012),
(261.11118895351564, 8.5228433609008807, 32721.110000000026),
(319.98896605312501, 3.2100667953491202, 60587.077000000027),
(286.94926268398439, 4.7687568664550799, 47842.133999999867),
(121.00206177890625, 7.9372291564941397, 239813.20531182736),
(308.19895750820314, 6.0029039382934597, 26354.519000000011),
(677.17011839687495, 9.0299625396728498, 10391.757655172449),
(182.1304913216797, 8.0010566711425799, 145583.55700000061),
(187.06341736972655, 9.9460496902465803, 77488.229000000007),
(144.07867615878905, 3.6044106483459499, 104651.56499999999),
(288.92317015468751, 4.3750333786010698, 151872.1949999998),
(228.2089825326172, 4.4475774765014604, 658120.07628214348),
(496.18831055820311, 11.422966003418001, 2371155.6659999997),
(467.30134398281251, 11.0771179199219, 109702.48440899582),
(163.08418089687501, 5.7271881103515598, 38107.106791666629),
.....,
.....,
3400th row
You can see that there is no correspondence between the two files, i.e. if someone asked you to calculate the distance between these two vectors, it would not be possible.
The aim is to interpolate the rows of both files in such a manner that there is consistency across all such files, i.e. when I look up the first row, it should represent the same feature in every file. Now let's look at FILE_1.
Range of values for three columns is (considering only 20 rows for time being)
A: 202.13173096835936,933.3073596304688
B: 2.2546300888061501, 11.573680877685501
C: 18197.691999999963,1701597.3915132943
I want to put these points into a 3D array whose grid cell size is .1X.1X.1 (or let's say 10X10X10, or any arbitrary grid cell size).
But for that to work we need to normalize the data (mean normalization etc.).
The data we have is 3D and needs to be normalized in order to interpolate it into this 3D array. The result need not be 3D; a flat vector will also do.
When I said I need to average the points, I meant that if two or more points happen to fall into the same cell (which will happen if the cell size is big, e.g. 100X100X100), then we take the average of their x, y, z coordinates as the value of that cell.
These interpolated vectors will have the same length and correspondence, because a given position in one vector represents the same point in all the other vectors.
NOTE: The min and max range of the coordinates across all files is A: 100 to 1000, B: 2 to 12, C: 10000 to 2000000.
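One way to read this description is sketched below: scale each column to [0, 1] with the global min and max from the note, assign every row to a cell of a regular grid, and store the average normalized coordinates of the rows in each cell. The function name `grid_feature_vector`, the `bins` parameter and the choice to leave empty cells at 0 are assumptions made for the sketch, not part of the question.

```python
import numpy as np

# Global min and max of columns A, B, C across all files, from the note above
LOWER = np.array([100.0, 2.0, 10000.0])
UPPER = np.array([1000.0, 12.0, 2000000.0])

def grid_feature_vector(points, bins=10):
    """Return a fixed-length vector for an (n_rows, 3) array of (A, B, C) rows.

    Each grid cell holds the mean normalized (A, B, C) of the rows that fall
    into it; empty cells stay at 0 (an assumption of this sketch).
    """
    points = np.asarray(points, dtype=float)
    scaled = (points - LOWER) / (UPPER - LOWER)            # columns in [0, 1]
    cell = np.clip((scaled * bins).astype(int), 0, bins - 1)
    flat = np.ravel_multi_index(tuple(cell.T), (bins, bins, bins))
    sums = np.zeros((bins ** 3, 3))
    counts = np.zeros(bins ** 3)
    np.add.at(sums, flat, scaled)                          # sum rows per cell
    np.add.at(counts, flat, 1)                             # rows per cell
    occupied = counts > 0
    sums[occupied] /= counts[occupied, None]               # mean per cell
    return sums.ravel()                                    # length bins**3 * 3
```

With `bins=10` every file maps to a vector of length 3000 regardless of its row count, so a distance such as `np.linalg.norm(v1 - v2)` between two files is well defined. Most cells will be empty for small files, which makes the choice of `bins` a trade-off between resolution and sparsity.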
I am looking to find the center of mass of points in N-dimensional space in Python. I have a dataframe with K columns (some contain text and some contain numbers)
{X1...Xk}
...
{Z1..Zk}
k > 10000
I need to calculate center of mass for all numerical values in the dataframe.
What is the best way to do it?
The center of mass is simply the mean of the values on each dimension, and you just want to calculate it on non-object columns, so:
df.loc[:, df.dtypes != 'O'].mean()
EDIT: although the OP only mentioned "text" and "numbers", the following alternative is indeed more general (thanks MaxU):
df.select_dtypes(include=['number']).mean()
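As a small illustration on a made-up dataframe (the column names and values here are invented), the numeric columns are averaged and the text column is ignored. Note that this is the unweighted center of mass, i.e. every row is assumed to have the same mass.

```python
import pandas as pd

# Toy data: two numeric dimensions and one text column
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0],
    "y": [10, 20, 60],
    "label": ["a", "b", "c"],
})

# Mean of every numeric column; "label" is dropped by select_dtypes
center = df.select_dtypes(include=["number"]).mean()
print(center)
```

The result is a Series indexed by the numeric column names, here `x` and `y`.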