I have a table of values that are not logarithms, but to find the relation between them I think I need a log-log plot. The values are:
R C
----------
0.2 103
2 13.9
20 2.72
200 0.800
2000 0.401
20000 0.433
How do I plot the logs of these values?
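One possible approach, as a minimal sketch assuming matplotlib and NumPy are available: keep the raw values and let matplotlib put both axes on a logarithmic scale with `loglog`. The arrays `log_R` and `log_C` are only there to show the equivalent explicit transformation.

```python
import numpy as np
import matplotlib.pyplot as plt

# Values copied from the table in the question.
R = np.array([0.2, 2, 20, 200, 2000, 20000])
C = np.array([103, 13.9, 2.72, 0.800, 0.401, 0.433])

fig, ax = plt.subplots()
# loglog() plots the raw values but uses log scaling on both axes.
ax.loglog(R, C, marker="o")
ax.set_xlabel("R")
ax.set_ylabel("C")

# Equivalent explicit transformation, if the log values themselves are needed.
log_R = np.log10(R)
log_C = np.log10(C)
```

Calling `plt.show()` afterwards displays the figure. Plotting `log_R` against `log_C` with a plain `ax.plot` would give the same shape, only with the axes labelled in exponents.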
I have some data which is organized as follows:
x y score
1 1 0.951
1 1 0.956
1 1 0.976
1 1 0.875
1.5 1.5 0.94
1.5 1.5 0.76
1.5 1.5 0.88
...
10 10 0.51
10 10 0.66
So what I want to do is aggregate the data by (x, y) values and then show a box plot for the score at those (x, y) values.
I realize that I will need 2 y-axes and I think matplotlib allows one to do that. I see an example here: https://matplotlib.org/gallery/api/two_scales.html
However, I am not sure if it is even possible to arrange this so that the scales will correspond to the means of these datasets.
So, I am guessing my question is whether this can be done or if there is a recommendation on how to visualise this sort of data?
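One possible reading of this, sketched under the assumption that a single box per distinct (x, y) pair on one shared score axis is acceptable: group with pandas and pass the per-group scores to matplotlib's `boxplot`. Only the rows shown above are used; the label format is an illustrative choice.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Only the rows visible in the question; the elided rows are left out.
df = pd.DataFrame({
    "x": [1, 1, 1, 1, 1.5, 1.5, 1.5, 10, 10],
    "y": [1, 1, 1, 1, 1.5, 1.5, 1.5, 10, 10],
    "score": [0.951, 0.956, 0.976, 0.875, 0.94, 0.76, 0.88, 0.51, 0.66],
})

# Collect one array of scores and one tick label per (x, y) pair.
labels, data = [], []
for (x, y), scores in df.groupby(["x", "y"])["score"]:
    labels.append(f"({x:g}, {y:g})")
    data.append(scores.to_numpy())

fig, ax = plt.subplots()
ax.boxplot(data)  # boxes are drawn at positions 1..len(data)
ax.set_xticks(list(range(1, len(labels) + 1)))
ax.set_xticklabels(labels)
ax.set_xlabel("(x, y)")
ax.set_ylabel("score")
```

With this layout a second y-axis is not needed, because every box shares the same score scale.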
I have the following toy df:
FilterSystemO2Concentration (Percentage) ProcessChamberHumidityAbsolute (g/m3) ProcessChamberPressure (mbar)
0 0.156 1 29.5 28.4 29.6 28.4
2 0.149 1.3 29.567 28.9
3 0.149 1 29.567 28.9
4 0.148 1.6 29.6 29.4
This is just a sample. The original has over 1200 rows. What's the best way to oversample it while preserving its statistical properties?
I have googled it for some time and have only come across resampling algorithms for imbalanced classes. But that's not what I want: I'm not interested in balancing the data, I would just like to produce more samples in a way that more or less preserves the original data distributions and statistical properties.
Thanks in advance
Using scipy.stats.rv_histogram(np.histogram(data)).isf(np.random.random(size=n)) will create n new samples randomly chosen from the distribution (histogram) of the data. You can do this for each column:
Example:
import numpy as np
import pandas as pd
import scipy.stats as stats
df = pd.DataFrame({'x': np.random.random(100)*3, 'y': np.random.random(100) * 4 -2})
n = 5
new_values = pd.DataFrame({s: stats.rv_histogram(np.histogram(df[s])).isf(np.random.random(size=n)) for s in df.columns})
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df.assign(data_type='original'), new_values.assign(data_type='oversampled')])
df.tail(7)
>> x y data_type
98 1.176073 -0.207858 original
99 0.734781 -0.223110 original
0 2.014739 -0.369475 oversampled
1 2.825933 -1.122614 oversampled
2 0.155204 1.421869 oversampled
3 1.072144 -1.834163 oversampled
4 1.251650 1.353681 oversampled
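The same idea can be wrapped in a small reusable function. This is only a sketch: the name `oversample_columns`, the `seed` parameter, and the `bins="auto"` choice are assumptions for illustration, and `ppf` (the inverse CDF) is used in place of `isf`, which draws from the same distribution. Note that each column is sampled independently, so per-column distributions are approximated but correlations between columns are not preserved.

```python
import numpy as np
import pandas as pd
from scipy import stats


def oversample_columns(df, n, seed=None):
    """Draw n new values per column from that column's histogram.

    Hypothetical helper: columns are treated independently, so any
    correlation between columns is lost.
    """
    rng = np.random.default_rng(seed)
    new_cols = {}
    for col in df.columns:
        # Build a piecewise-constant distribution from the histogram.
        dist = stats.rv_histogram(np.histogram(df[col], bins="auto"))
        # Inverse-CDF sampling: map uniform draws through ppf.
        new_cols[col] = dist.ppf(rng.random(n))
    return pd.DataFrame(new_cols)


# Synthetic demo data, in the same spirit as the example above.
rng = np.random.default_rng(0)
original = pd.DataFrame({"x": rng.random(100) * 3, "y": rng.random(100) * 4 - 2})
extra = oversample_columns(original, n=5, seed=1)
combined = pd.concat(
    [original.assign(data_type="original"), extra.assign(data_type="oversampled")],
    ignore_index=True,
)
```

If the relationship between columns matters, a method that models the joint distribution would be needed instead of this column-by-column approach.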
I have a data frame
Rfm Count %
0 111 88824 57.13
1 112 5462 3.51
2 121 32209 20.72
3 122 15155 9.75
4 211 5002 3.22
5 212 1002 0.64
6 221 3054 1.96
7 222 4778 3.07
How can I plot a graph like this?
Background - The numbers are the RFM scores.
R is recency (number of days since customer ordered)
F is frequency (number of jobs from customers)
M is monetary (how much customer is paying)
The R,F and M scores are either 1 (bad) or 2 (good).
I would like to segment them into 4 Quadrants.
I would also like the size of the blob to be proportional to the percentage.
I.e. blob 111 (57%) will be much larger than blob 212 (0.64%).
I really want to get better at data visualization, please help a beginner out. I'm familiar with seaborn and matplotlib.
PS: Is it possible to add a third dimension to the plot? The third dimension would be the frequency.
Edit: The second image is a simple static way of achieving my goal. Any input on doing it with matplotlib or seaborn, for a more interesting illustration?
[Second Image](https://i.stack.imgur.com/AuzEM.jpg)
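One way this could be sketched with matplotlib, under several assumptions that the question does not fix: the R score goes on the x-axis, the F score on the y-axis, the M score is shown by colour plus a small horizontal offset so the two blobs in each quadrant do not overlap, and `scale` is an arbitrary factor for the marker area. The `s` argument of `scatter` is an area, so blob area is proportional to the percentage.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Data copied from the table in the question.
df = pd.DataFrame({
    "Rfm": [111, 112, 121, 122, 211, 212, 221, 222],
    "Count": [88824, 5462, 32209, 15155, 5002, 1002, 3054, 4778],
    "%": [57.13, 3.51, 20.72, 9.75, 3.22, 0.64, 1.96, 3.07],
})

# Split the three-digit code into separate R, F and M scores.
digits = df["Rfm"].astype(str)
df["R"] = digits.str[0].astype(int)
df["F"] = digits.str[1].astype(int)
df["M"] = digits.str[2].astype(int)

scale = 100  # hypothetical factor: marker area in points^2 per percent
sizes = df["%"] * scale
# Assumed layout: nudge M=1 left and M=2 right inside each (R, F) quadrant.
x = df["R"] + (df["M"] - 1.5) * 0.3

fig, ax = plt.subplots()
ax.scatter(x, df["F"], s=sizes, c=df["M"], cmap="coolwarm", alpha=0.6)
for xi, yi, code in zip(x, df["F"], df["Rfm"]):
    ax.annotate(str(code), (xi, yi), ha="center", va="center")

# Draw the quadrant boundaries between score 1 and score 2.
ax.axhline(1.5, color="grey", linewidth=1)
ax.axvline(1.5, color="grey", linewidth=1)
ax.set_xticks([1, 2])
ax.set_yticks([1, 2])
ax.set_xlim(0.5, 2.5)
ax.set_ylim(0.5, 2.5)
ax.set_xlabel("R score")
ax.set_ylabel("F score")
```

Which two scores define the quadrants is a design choice; swapping the roles of R, F and M only requires changing the columns passed to `scatter`.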
I would like to split my data into two groups. I made a scatter plot of my data (see the figure). I want all the points in the upper left corner to form one group and the remaining points to form a second group. How can I split my data based on the plotted figure?
Thanks for any help!
index x y
11 86.58 3.31
26 128.79 2.79
46 43.34 6.32
71 95.39 3.90
72 93.95 3.40
73 117.76 3.06
75 82.37 3.83
.............
g = sns.jointplot(x='x', y='y', data=GA_m_W, space=0.1, ratio=3)
figure
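If the upper left corner can be described by simple cutoffs, a boolean mask may be enough. The sketch below uses only the rows shown above, and the thresholds `X_MAX` and `Y_MIN` are hypothetical values that would have to be read off the actual figure.

```python
import numpy as np
import pandas as pd

# Only the rows visible in the question.
GA_m_W = pd.DataFrame(
    {
        "x": [86.58, 128.79, 43.34, 95.39, 93.95, 117.76, 82.37],
        "y": [3.31, 2.79, 6.32, 3.90, 3.40, 3.06, 3.83],
    },
    index=[11, 26, 46, 71, 72, 73, 75],
)

# Hypothetical cutoffs for "upper left": small x and large y.
X_MAX = 60.0
Y_MIN = 5.0

in_corner = (GA_m_W["x"] < X_MAX) & (GA_m_W["y"] > Y_MIN)
GA_m_W["group"] = np.where(in_corner, "upper_left", "rest")

upper_left = GA_m_W[in_corner]
rest = GA_m_W[~in_corner]
```

The new `group` column can then be passed as `hue` to seaborn plotting functions to check visually that the split matches the figure.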
I've created a dataframe with stock information. When I go to create a scatter plot and annotate labels, not all of the labels are included. I'm only getting 3 labels out of 50 or so points. I can't figure out why it's not plotting all labels.
My Table:
Dividend ExpenseRatio Net_Assets PriceEarnings PriceSales
Ticker
FHLC 0.0128 0.08 6.056 22.95 1.78
ONEQ 0.0083 0.21 6.284 20.24 2.26
FTEC 0.0143 0.08 3.909 20.83 2.73
FDIS 0.0144 0.08 2.227 20.17 1.36
FENY 0.0262 0.08 4.386 25.97 1.34
My plotting code:
for ticker,row in df.iterrows():
    plt.scatter(row['PriceSales'], row['PriceEarnings'], c = np.random.rand(3,1), s = 300)
for i, txt in enumerate(ticker):
    plt.annotate(df.index[i-1], (df.PriceSales[i-1], df.PriceEarnings[i-1]))
plt.xlabel('PriceSales')
plt.ylabel('PriceEarnings')
plt.show()
My graph image:
By the time the second loop runs, ticker holds the ticker of the last row, e.g. "FENY". When you call enumerate(ticker), it yields one item per character of that string, so it sounds like your last ticker has 3 characters, which is why only 3 labels are drawn.
I think you can annotate points in the same loop as the scatter plot:
for ticker, row in df.iterrows():
    # one random RGB color per point; a (3, 1) array is rejected by recent matplotlib
    plt.scatter(row['PriceSales'], row['PriceEarnings'], color=np.random.rand(3), s=300)
    plt.annotate(ticker, (row['PriceSales'], row['PriceEarnings']))
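As a variation, the points can be drawn with a single `scatter` call and only the labels added in a loop. This is a sketch using the five rows from the question; colouring by row position with the `tab10` colormap is an illustrative choice, not something the question asks for.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# The two plotted columns, copied from the table in the question.
df = pd.DataFrame(
    {
        "PriceEarnings": [22.95, 20.24, 20.83, 20.17, 25.97],
        "PriceSales": [1.78, 2.26, 2.73, 1.36, 1.34],
    },
    index=pd.Index(["FHLC", "ONEQ", "FTEC", "FDIS", "FENY"], name="Ticker"),
)

fig, ax = plt.subplots()
# One call draws every point; c maps each row position to a colormap colour.
ax.scatter(df["PriceSales"], df["PriceEarnings"],
           c=np.arange(len(df)), cmap="tab10", s=300)

# One annotation per row, labelled with the row's index value (the ticker).
for ticker, row in df.iterrows():
    ax.annotate(ticker, (row["PriceSales"], row["PriceEarnings"]))

ax.set_xlabel("PriceSales")
ax.set_ylabel("PriceEarnings")
```

Because the loop iterates over rows rather than over the characters of one ticker, every point gets exactly one label.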