I have a data frame
Rfm Count %
0 111 88824 57.13
1 112 5462 3.51
2 121 32209 20.72
3 122 15155 9.75
4 211 5002 3.22
5 212 1002 0.64
6 221 3054 1.96
7 222 4778 3.07
How can I plot a graph like this?
Background - The numbers are the RFM scores.
R is recency (number of days since the customer last ordered)
F is frequency (number of jobs from customers)
M is monetary (how much customer is paying)
The R,F and M scores are either 1 (bad) or 2 (good).
I would like to segment them into 4 Quadrants.
I would also like the size of the blob to be proportional to the percentage.
I.e. blob 111 (57%) will be much larger than blob 212 (0.64%).
I really want to get better at data visualization, please help a beginner out. I'm familiar with seaborn and matplotlib.
Ps: Is it possible to add a third dimension to the plot? 3rd Dim would be the frequency.
Edit: The second image is a simple static way of achieving my goal. Any input on doing it with matplotlib or seaborn, for a more interesting illustration?
[Second Image](https://i.stack.imgur.com/AuzEM.jpg)
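One possible starting point with matplotlib is sketched below. It rests on an assumption that is not stated in the question: the four quadrants are taken to be the combinations of R (x axis) and M (y axis), and F is shown through colour and a small horizontal offset. The scale factor 60 for the marker area, the colours and the output file name rfm_quadrants.png are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line to use plt.show() interactively
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Rfm": [111, 112, 121, 122, 211, 212, 221, 222],
    "Count": [88824, 5462, 32209, 15155, 5002, 1002, 3054, 4778],
    "%": [57.13, 3.51, 20.72, 9.75, 3.22, 0.64, 1.96, 3.07],
})

# Split the three-digit code into separate R, F and M scores (each 1 or 2).
codes = df["Rfm"].astype(str)
for pos, name in enumerate("RFM"):
    df[name] = codes.str[pos].astype(int)

# Assumption: quadrants are R (x) versus M (y). F shifts the bubble sideways
# and sets its colour so that both F values stay visible inside a quadrant.
x = df["R"] + (df["F"] - 1.5) * 0.4
colors = df["F"].map({1: "tab:red", 2: "tab:green"}).tolist()

fig, ax = plt.subplots(figsize=(6, 6))
# The `s` argument is the marker area, so it is scaled with the percentage.
ax.scatter(x, df["M"], s=df["%"] * 60, c=colors, alpha=0.6)
for xi, yi, label in zip(x, df["M"], df["Rfm"]):
    ax.annotate(str(label), (xi, yi), ha="center", va="center")

# Grey lines separate the four quadrants.
ax.axvline(1.5, color="grey")
ax.axhline(1.5, color="grey")
ax.set_xticks([1, 2])
ax.set_xticklabels(["R = 1", "R = 2"])
ax.set_yticks([1, 2])
ax.set_yticklabels(["M = 1", "M = 2"])
ax.set_xlim(0.5, 2.5)
ax.set_ylim(0.5, 2.5)
fig.savefig("rfm_quadrants.png")
```

Because the marker area rather than the radius follows the percentage, blob 111 ends up roughly 90 times larger in area than blob 212.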
Related
I am trying to implement a many-to-one RNN on time series data with TensorFlow, similar to the example given at https://www.tensorflow.org/tutorials/structured_data/time_series. The data looks similar to the one below
Time Latitude Longitude Speed Heading (deg)
0 20 20 5 180
1 19.9 20 5 180
2 19.8 20 5 180
3 19.7 20 5 180
Now my goal is to use the first 3 timesteps to predict the latitude of the next timestep. So my input would be
Latitude Longitude Speed Heading (deg)
20 20 5 180
19.9 20 5 180
19.8 20 5 180
and my output would be
19.7
My inputs may be "numbers", but they are all really categorical. For example, headings of 359 deg and 1 deg are nearly identical. I have tried one-hot encoding each feature and then concatenating the results into a "four-hot encoding" of the data, but with little success.
How do you encode the features I have in a format that makes sense?
You can set some boundaries for each of the areas. For example, if Latitude is less than 10, assign it to class 0; if it is between 10 and 20, to class 1; and if it is more than 20, to class 2.
You can do it by simply adding columns to your dataframe.
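A minimal sketch of that binning with pandas.cut, using the latitudes from the question. The column name Latitude_class is made up for illustration, and treating the boundary values 10 and 20 as belonging to the upper class (right=False) is an assumption, since the answer does not say where they go.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Latitude": [20, 19.9, 19.8, 19.7],
    "Longitude": [20, 20, 20, 20],
    "Speed": [5, 5, 5, 5],
    "Heading": [180, 180, 180, 180],
})

# Bin edges: (-inf, 10) -> class 0, [10, 20) -> class 1, [20, inf) -> class 2.
# right=False makes each bin closed on the left (an assumption, see above).
bins = [-np.inf, 10, 20, np.inf]
df["Latitude_class"] = pd.cut(
    df["Latitude"], bins=bins, labels=[0, 1, 2], right=False
).astype(int)

print(df[["Latitude", "Latitude_class"]])
```

The same call can be repeated for the other columns with their own bin edges.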
I have the following toy df:
FilterSystemO2Concentration (Percentage) ProcessChamberHumidityAbsolute (g/m3) ProcessChamberPressure (mbar)
0 0.156 1 29.5 28.4 29.6 28.4
2 0.149 1.3 29.567 28.9
3 0.149 1 29.567 28.9
4 0.148 1.6 29.6 29.4
This is just a sample. The original has over 1200 rows. What's the best way to oversample it while preserving its statistical properties?
I have googled it for some time and I have only come across resampling algorithms for imbalanced classes. But that's not what I want: I'm not interested in balancing the data, I would just like to produce more samples in a way that more or less preserves the original data distributions and statistical properties.
Thanks in advance
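scipy.stats.rv_histogram(np.histogram(data)).isf(np.random.random(size=n)) creates n new samples drawn at random from the distribution (histogram) of the data. You can do this for each column. Note that sampling the columns independently preserves each column's distribution (up to the histogram's bin resolution) but not the correlations between columns.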
Example:
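import numpy as np
import pandas as pd
import scipy.stats as stats
df = pd.DataFrame({'x': np.random.random(100)*3, 'y': np.random.random(100) * 4 -2})
n = 5  # number of new samples to draw
# for each column, draw n values from the histogram-based distribution of that column
new_values = pd.DataFrame({s: stats.rv_histogram(np.histogram(df[s])).isf(np.random.random(size=n)) for s in df.columns})
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df.assign(data_type='original'), new_values.assign(data_type='oversampled')])
df.tail(7)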
>> x y data_type
98 1.176073 -0.207858 original
99 0.734781 -0.223110 original
0 2.014739 -0.369475 oversampled
1 2.825933 -1.122614 oversampled
2 0.155204 1.421869 oversampled
3 1.072144 -1.834163 oversampled
4 1.251650 1.353681 oversampled
I would like to group my data into two categories. I used pandas to get a scatter plot of my data. See the figure: I want all the data in the upper left corner to be one group and the remaining data to be a second group. How can I sort my data based on the plotted figure?
Thanks for any help!
index x y
11 86.58 3.31
26 128.79 2.79
46 43.34 6.32
71 95.39 3.90
72 93.95 3.40
73 117.76 3.06
75 82.37 3.83
.............
g=sns.jointplot(x='x',y='y',data=GA_m_W,space = 0.1,ratio = 3)
figure
I want to know how to get the indices in both lists whenever there is a match between them. I came across the enumerate and zip functions, but they only work if the two lists have the same length. Since my inputs have different lengths, I want to get the matching indices for both lists.
# Simulated Time(msec) Simulated O/p Actual Time(msec) Actual o/p
0 12.57 0 12.55
50 12.58 100 12.56
100 12.55 200 12.60
150 12.59 300 12.45
200 12.53 400 12.59
250 12.87 500 12.78
300 12.50 600 12.57
350 12.75 700 12.66
400 12.80 800 12.78
...... ...... ..... ......
Also, my simulated data is in a different file and is generated at a 50Hz rate, which differs from my actual data. Hence the simulated data is longer than the actual data, but the actual timestamps are present in the simulated data. I want to get the indices for both lists. For example, Simulated Time(msec) 100 (i=2) matches index j=1 of the actual time. If I get both i and j, I can compare the corresponding simulated output and actual output at that particular instant.
Lastly, I want to iterate until the end of the simulated time.
Please suggest how I can solve this.
If sim and act contain unique values, here is a way to do that using the NumPy set routine np.in1d (newer NumPy versions recommend the equivalent np.isin):
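import numpy as np
sim = np.unique(np.random.randint(0, 10, 3)) * 10  # sample simulated times
act = np.unique(np.random.randint(0, 10, 5)) * 10  # sample actual times
i = np.arange(len(sim))[np.in1d(sim, act)]  # indices in sim whose value also occurs in act
j = np.arange(len(act))[np.in1d(act, sim)]  # indices in act whose value also occurs in sim
print(sim, act, i, j)
# example output (the inputs are random): [40 50 70] [10 30 40 50] [0 1] [2 3]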
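As an alternative sketch, np.intersect1d with return_indices=True returns the common values together with the matching indices in both arrays in one call. The arrays below are copied from the table in the question; the variable names (sim_time, sim_out, act_time, act_out) are made up for illustration, and the timestamps are assumed to be unique within each array.

```python
import numpy as np

# Values copied from the table in the question.
sim_time = np.array([0, 50, 100, 150, 200, 250, 300, 350, 400])
sim_out = np.array([12.57, 12.58, 12.55, 12.59, 12.53, 12.87, 12.50, 12.75, 12.80])
act_time = np.array([0, 100, 200, 300, 400, 500, 600, 700, 800])
act_out = np.array([12.55, 12.56, 12.60, 12.45, 12.59, 12.78, 12.57, 12.66, 12.78])

# common: timestamps present in both arrays
# i: positions of those timestamps in sim_time
# j: positions of those timestamps in act_time
common, i, j = np.intersect1d(sim_time, act_time, return_indices=True)

# Compare simulated and actual output at each shared instant.
diff = sim_out[i] - act_out[j]
for t, a, b, d in zip(common, i, j, diff):
    print(f"t={t} ms  i={a}  j={b}  sim-act={d:+.2f}")
```

Only timestamps up to the end of the simulated time can match, so the loop stops there on its own.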
I have a table of values which aren't logs but to find their relation I think I need to create a log-log plot. The values I have are:
R C
----------
0.2 103
2 13.9
20 2.72
200 0.800
2000 0.401
20000 0.433
How do I plot the logs of these values?
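A minimal sketch with matplotlib, assuming the table is typed in as two NumPy arrays. Axes.loglog puts both axes on a logarithmic scale while the data stay in their original units; taking np.log10 explicitly is the alternative if the log values themselves are needed. The output file name loglog.png is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line to use plt.show() interactively
import matplotlib.pyplot as plt
import numpy as np

# Values copied from the table in the question.
R = np.array([0.2, 2, 20, 200, 2000, 20000])
C = np.array([103, 13.9, 2.72, 0.800, 0.401, 0.433])

fig, ax = plt.subplots()
ax.loglog(R, C, marker="o")  # both axes logarithmic
ax.set_xlabel("R")
ax.set_ylabel("C")
fig.savefig("loglog.png")

# Alternative: compute the logs explicitly, e.g. to plot them on linear axes
# or to fit a straight line through them.
log_R = np.log10(R)
log_C = np.log10(C)
```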