I have multiple dataframes that look similar to this:
x y x y
1 2 0.5 2
2 4 1.5 6
3 6 3 12
Here the x columns are my indices. I want to plot the average line for these datasets. My idea was to concatenate the two dataframes so that I have a scatterplot and can fit a best-fit line, but pandas throws the error "Reindexing only valid with uniquely valued Index objects". I've read other questions about this error message and have renamed my index names and column names to x_1, x_2, y_1 and y_2, but it still complains, I believe because some of the x values are the same. What am I doing wrong here?
I'm not sure I completely understand what your dataframes look like, but you can concatenate two (or more) dataframes df1, df2, ... by doing:
new_dataframe = pd.DataFrame(np.concatenate([df1,df2]),columns=['x','y'])
where my imports are
import pandas as pd
import numpy as np
Are you just looking for a best fit line for all the points? If so you can concat and use lmplot.
import pandas as pd
import seaborn as sns
df = pd.DataFrame({'x':[1,2,3],'y':[2,4,6]})
df2 = pd.DataFrame({'x':[.5,1.5,3], 'y':[2,6,12]})
out = pd.concat([df,df2])
sns.lmplot(data=out, x='x', y='y', ci=None);
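If x really is the index, as the question says, one way to avoid any alignment on repeated index labels is to move x back into an ordinary column before stacking the rows. The following is only a sketch under that assumption: it rebuilds the two example frames from the question, and the names combined, slope and intercept are illustrative. np.polyfit with degree 1 is used here as a plain least-squares line, which is one possible reading of "average line".

```python
import numpy as np
import pandas as pd

# The two example frames from the question, with x as the index.
df1 = pd.DataFrame({"y": [2, 4, 6]}, index=pd.Index([1, 2, 3], name="x"))
df2 = pd.DataFrame({"y": [2, 6, 12]}, index=pd.Index([0.5, 1.5, 3], name="x"))

# Move x into a regular column and stack the rows. ignore_index=True gives the
# result a fresh 0..n-1 index, so repeated x values are no longer index labels.
combined = pd.concat([df1.reset_index(), df2.reset_index()], ignore_index=True)

# Least-squares straight line through all points (degree-1 polynomial).
slope, intercept = np.polyfit(combined["x"], combined["y"], deg=1)

print(combined)
print(f"y = {slope:.3f} * x + {intercept:.3f}")
```

The combined frame can be passed to sns.lmplot(data=combined, x='x', y='y') in the same way as out above.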
I have a data frame (loading from CSV) file that looks like below one
Data Mean sd time__1 time__2 time__3 time__4 time__5
0 Data_1 0.947667 0.025263 0.501517 0.874750 0.929426 0.953847 0.958375
1 Data_2 0.031960 0.017314 0.377588 0.069185 0.037523 0.024028 0.021532
Now I want to plot two time series (Data_1 and Data_2) with time__1, time__2, etc. as the time points. The x-axis is time__1, time__2, etc. and the y-axis is their associated values.
The code I am trying
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("file.csv", delimiter=',', header=0)
data = data.drop(["Unnamed: 0"], axis=1)
# Set the date column as the index
data = data.set_index(["time__1", "time__2", "time__3", "time__4", "time__5"])
ax = data.plot(linewidth=2, fontsize=12)
ax.set_xlabel('Data')
ax.legend(fontsize=12)
plt.savefig("series.png")
plt.show()
The figure I am getting is not as expected.
I think I am doing something wrong with set_index(), as my time points are in different columns.
How can I plot a time series when the time points are in different columns?
Reproducible data in dictionary format:
{'Data': {(0.501517236232758, 0.874750375747681, 0.929425954818726, 0.953846752643585, 0.958374977111816): 'Data_1', (0.377588421106338, 0.069185301661491, 0.037522859871388, 0.0240284409374, 0.021532088518143): 'Data_2'}, 'Mean': {(0.501517236232758, 0.874750375747681, 0.929425954818726, 0.953846752643585, 0.958374977111816): 0.947667360305786, (0.377588421106338, 0.069185301661491, 0.037522859871388, 0.0240284409374, 0.021532088518143): 0.031959813088179}, 'sd': {(0.501517236232758, 0.874750375747681, 0.929425954818726, 0.953846752643585, 0.958374977111816): 0.025263005867601, (0.377588421106338, 0.069185301661491, 0.037522859871388, 0.0240284409374, 0.021532088518143): 0.017313838005066}}
IIUC you are getting the index wrong: if time__1, time__2, etc. are supposed to be your x-axis, that's what you want your index to be. The names of the plotted data series are taken from the columns. Therefore, you need to transpose your DataFrame. Using the csv data in your first table:
print(df)
# out:
Data Mean sd time__1 time__2 time__3 time__4 \
0 Data_1 0.947667 0.025263 0.501517 0.874750 0.929426 0.953847
1 Data_2 0.031960 0.017314 0.377588 0.069185 0.037523 0.024028
time__5
0 0.958375
1 0.021532
Dropping the Mean and sd columns, setting Data as the index, and transposing:
df = df.drop(["Mean", "sd"], axis=1).set_index("Data").T
yields an appropriately formatted dataframe:
Data Data_1 Data_2
time__1 0.501517 0.377588
time__2 0.874750 0.069185
time__3 0.929426 0.037523
time__4 0.953847 0.024028
time__5 0.958375 0.021532
which can simply be plotted:
df.plot()
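An equivalent sketch that picks the time columns by their prefix instead of dropping the others, which may be handy if the CSV gains more summary columns. The frame is rebuilt from the rounded values in the table above, and time_cols and series are illustrative names, not part of the original code.

```python
import pandas as pd

# Frame rebuilt from the table in the question (rounded values).
df = pd.DataFrame({
    "Data": ["Data_1", "Data_2"],
    "Mean": [0.947667, 0.031960],
    "sd": [0.025263, 0.017314],
    "time__1": [0.501517, 0.377588],
    "time__2": [0.874750, 0.069185],
    "time__3": [0.929426, 0.037523],
    "time__4": [0.953847, 0.024028],
    "time__5": [0.958375, 0.021532],
})

# Keep only the columns whose name starts with "time__".
time_cols = [c for c in df.columns if c.startswith("time__")]

# One row per time point, one column per data series.
series = df.set_index("Data")[time_cols].T
print(series)
```

Calling series.plot() then draws one line per column, with the time labels on the x-axis.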
Is there a NumPy or pandas function that does something like this:
For me, boundary values are the values farthest from the regression line.
That is:
the point farthest above the line and the point farthest below the line.
Suppose I have this data:
l1 = [0,1,4,3,4,3]
df = pd.DataFrame(l1)
It looks like this:
0
0 1
1 4
2 3
3 4
4 3
How can I find the data at index 1 and index 4?
I need to detect them in a Python script and remove them. I know how to remove them, but I do not know how to find them.
What I want to do:
First I am going to calculate a linear regression, then remove the outlying values, and then recalculate the linear regression without the farthest values.
To remove outliers, you can use Series.quantile.
Suppose you have the following dataframe:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(2022)
df = pd.DataFrame({'A': np.random.normal(5, 2, size=50)})
df.plot.hist(bins=25)
plt.xlim(0, 10)
plt.show()
Now filter the dataframe, keeping only the values between the 25th and 75th percentiles:
df1 = df.loc[df['A'].between(*df['A'].quantile([0.25, 0.75]).values)]
df1.plot.hist(bins=10)
plt.xlim(0, 10)
plt.show()
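The quantile filter above works on the values themselves. For the regression-based case described in the question, a minimal sketch could instead look at the residuals of a first fit. It assumes a straight-line fit with np.polyfit and uses the index as x; the column name y and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"y": [0, 1, 4, 3, 4, 3]})  # l1 from the question
x = df.index.to_numpy()

# First fit: straight line through all points.
slope, intercept = np.polyfit(x, df["y"], deg=1)

# Signed vertical distance of each point from the fitted line.
residuals = df["y"] - (slope * x + intercept)

# Index labels of the point farthest above and the point farthest below the line.
farthest_above = residuals.idxmax()
farthest_below = residuals.idxmin()

# Drop both points and refit on what is left.
trimmed = df.drop(index=[farthest_above, farthest_below])
slope2, intercept2 = np.polyfit(trimmed.index.to_numpy(), trimmed["y"], deg=1)

print(farthest_above, farthest_below)
```

For the six-element list l1 this selects positions 2 and 5, that is, the first 4 and the final 3.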
I have a dataset of 6 parameters with 500 values each, and I want to combine two of them to get the road curvature, but I am getting an error. Since I am new to Python, I am not sure whether I am using the correct logic. Please advise.
from asammdf import MDF
import pandas as pd
mdf = MDF('./Data.mf4')
c=['Vhcl.Yaw','Vhcl.a','Car.Road.tx', 'Car.Road.ty', 'Vhcl.v', 'Car.Width']
m = mdf.to_dataframe(channels=c, raster=0.02)
for i in range(0,500):
    mm = m.iloc[i].values
    y = pd.concat([mm[2], mm[3]])
plt.plot(y)
plt.show()
print(y)
Error:
TypeError: cannot concatenate object of type '<class 'numpy.float64'>'; only Series and DataFrame objs are valid
Starting from your dataframe m:
y = m.iloc[:, 2:4]
This creates another dataframe with all the rows and only the columns at positions 2 and 3, which are the mm[2] and mm[3] of the question (Car.Road.tx and Car.Road.ty, assuming the columns follow the order of the channel list c).
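The error in the question comes from passing NumPy scalars to pd.concat, which only accepts Series and DataFrame objects. Selecting the two columns by name keeps everything in a DataFrame and does not depend on column order. The sketch below uses a made-up stand-in for the frame returned by mdf.to_dataframe(...), so the values and the index name are assumptions; only the channel names come from the question.

```python
import numpy as np
import pandas as pd

# Stand-in for the frame returned by mdf.to_dataframe(...); values are made up.
t = np.arange(500) * 0.02
m = pd.DataFrame(
    {"Car.Road.tx": t, "Car.Road.ty": np.sin(t), "Vhcl.v": 10.0},
    index=pd.Index(t, name="timestamps"),
)

# Select both road coordinates at once; no loop and no concat needed.
road = m[["Car.Road.tx", "Car.Road.ty"]]
print(road.head())
```

road.plot(x="Car.Road.tx", y="Car.Road.ty") would then draw ty against tx.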
CSV1only is a dataframe loaded from a CSV file.
Suppose CSV1only has a single column such as:
TRADINGITEMID:
1233
2455
3123
1235
5098
as a small example
How can I make a scatterplot of this column, and specifically, what should I use for the y-axis?
I tried:
import pandas as pd
import matplotlib.pyplot as plt
CSV1only.plot(kind='scatter',x='TRADINGITEMID', y= [1,2], color='b')
plt.xlabel('TRADINGITEMID Numbers')
plt.ylabel('Range')
plt.title('Distribution of ItemIDNumbers')
and it doesn't work because of the y.
So, my main question is just how I can get a 0, 1, 2 y-axis for this scatter plot, as I want to make a distribution graph.
The following code doesn't work because its length doesn't match the number of rows in the original TRADINGITEMID column, which has 5000 rows:
newcolumn_values = [1, 2]
CSV1only['un et deux'] = newcolumn_values
#and then I changed the y = [1,2] from before into y = ['un et deux']
Therefore the solution would need to work for any integer from 1 to N, N being the number of rows. Yet the y-axis would only have a range of [0, 2] or some [0, m], m being some arbitrary integer.
There is no need to worry about the actual pandas dataframe CSV1only.
The TRADINGITEMID column contains 5000 rows of unique numbers, so I just want to plot those numbers on a line, with the y-axis being the number of instances (which will never exceed 1, since each number is unique).
I think you are looking for the following. You need to generate y-values from 0 to n-1, where n is the total number of rows:
import numpy as np
y = np.arange(len(CSV1only['TRADINGITEMID']))
plt.scatter(CSV1only['TRADINGITEMID'], y, c='DarkBlue')
I have data on factories and their error codes during production, such as below:
PlantID A B C D
1 0 1 2 4
1 3 0 2 0
3 0 0 0 1
4 0 1 1 5
Each row represents a production order.
I want to create a graph with the PlantIDs on the x-axis and A, B, C, D on the y-axis as separate bars.
That way I can see in one graph which factory has the most D errors, which has the most A errors, and so on.
I usually use plotly and seaborn, but I couldn't find a solution for this; the y-axis is a single column in every example.
Thanks in advance,
Seaborn likes its data in long or wide-form.
As mentioned above, seaborn will be most powerful when your datasets have a particular organization. This format is alternately called “long-form” or “tidy” data and is described in detail by Hadley Wickham in this academic paper. The rules can be simply stated:
Each variable is a column
Each observation is a row
The following code converts the original dataframe to a long-form dataframe by stacking the columns on top of each other, so that every row corresponds to a single record that specifies the column name and the value (the count).
import numpy as np
import pandas as pd
import seaborn as sns
# Generating some data
N = 20
PlantID = np.random.choice(np.arange(1, 4), size=N, replace=True)
data = dict((k, np.random.randint(0, 50, size=N)) for k in ['A', 'B', 'C', 'D'])
df = pd.DataFrame(data, index=PlantID)
df.index = df.index.set_names('PlantID')
# Stacking the columns and resetting the index to create a longformat. (And some renaming)
df = df.stack().reset_index().rename({'level_1' : 'column', 0: 'count'},axis=1)
sns.barplot(x='PlantID', y='count', hue='column', data=df)
Pandas has really clever built-in plotting functionality:
import matplotlib.pyplot as plt
df.plot(kind='bar')
plt.show()
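With the wide table from the question (one row per production order), plotting the frame directly gives one group of bars per order rather than per plant. A minimal sketch, assuming the per-plant totals are what should be compared: the frame is rebuilt from the question's table, and totals is an illustrative name.

```python
import pandas as pd

# The table from the question: one row per production order.
df = pd.DataFrame({
    "PlantID": [1, 1, 3, 4],
    "A": [0, 3, 0, 0],
    "B": [1, 0, 0, 1],
    "C": [2, 2, 0, 1],
    "D": [4, 0, 1, 5],
})

# Sum the error counts of all orders that belong to the same plant.
totals = df.groupby("PlantID").sum()
print(totals)
```

totals.plot(kind='bar') then draws four bars (A, B, C, D) for each PlantID on the x-axis.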