Plotting a nice graph with 3000 rows in dataset with matplotlib - python

I have a DataFrame (3440 rows x 2 columns) with two columns (int). I need to plot it with strain on the y axis (ylabel) and time on the x axis (xlabel), so that it matches the expected plot (linked below). There are several visual problems that I hope you can help me with, because I am very weak at visualization with Python.
Here is the datasource:
Here is the expected plot:
Here is result:
Here is my code:
df=pd.read_csv('https://www.gw-openscience.org/GW150914data/P150914/fig2-unfiltered-waveform-H.txt')
df= df['index'].str.split(' ', expand=True)
df.coulumns=['time (s)','strain (h1)']
x=df['time'][:200]
y=df['strain'][:200]
plt.figure(figsize=(14,8))
plt.scatter(x,y,c='blue')
plt.show()
Note: I have tried seaborn, but the result was the same. I also tried narrowing the data down to 200 rows, but the result still differs from the expected plot.
I would appreciate any help. Thank you very much!

The following works for me. I'm skipping the first row, because the column labels are not separated correctly. Furthermore, while loading the data I indicate that the columns are separated by a space.
I don't think that the file contains the data to plot the "reconstructed" line.
import pandas as pd
import matplotlib.pyplot as plt
# read the csv file, skip the first row, columns are separated by ' '
df=pd.read_csv('fig2-unfiltered-waveform-H.txt', skiprows=1, sep=' ')
# add proper column names
df.columns = ['index', 'strain (h1)']
# extract the index & strain variables
index=df['index']
strain=df['strain (h1)']
# plot the figure
plt.figure(figsize=(14,8))
plt.plot(index, strain, c='red', label='numerical relativity')
# label the y axis and show the legend
plt.ylabel('strain (h1)')
plt.legend(loc="upper left")
plt.show()
This is the resulting plot:
The same with seaborn, once you've imported the data with pandas:
import seaborn as sns
sns.lineplot(data = df, x="index", y="strain (h1)", color='red')
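As a rough sketch of the same loading step on a small in-memory sample: the values below are made up for illustration and are not taken from the GW150914 file. It assumes a single header line followed by space-separated numeric columns, and it passes the column names straight to read_csv (header=None, names=...) so that no data row is used up as a header and both columns come back as floats.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample standing in for the real file:
# one header line, then space-separated "time strain" pairs.
sample = io.StringIO(
    "# time (s) strain (h1)\n"
    "0.25 0.1\n"
    "0.26 -0.3\n"
    "0.27 0.5\n"
)

# Skip the header line and supply the column names ourselves.
df = pd.read_csv(sample, sep=" ", skiprows=1, header=None,
                 names=["time (s)", "strain (h1)"])

# A line plot connects consecutive samples, which suits a waveform.
fig, ax = plt.subplots(figsize=(14, 8))
ax.plot(df["time (s)"], df["strain (h1)"], c="red")
ax.set_xlabel("time (s)")
ax.set_ylabel("strain (h1)")
```

In an interactive session, plt.show() after the last line displays the figure.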

Related

Seaborn tick labels not complete & not aligned to the graph

I'm trying to plot a line chart based on 2 columns using seaborn from a dataframe imported as a .csv with pandas.
The data consists of ~97000 records across 19 years of timeframe.
First part of the code: (I assume the code directly below shouldn't contribute to the issue, but will list it just in case)
# use pandas to read CSV files and prepare the timestamp column for recognition
temporal_fires = pd.read_csv("D:/GIS/undergraduateThesis/data/fires_csv/mongolia/modis_2001-2019_Mongolia.csv")
temporal_fires = temporal_fires.rename(columns={"acq_date": "datetime"})
# recognize the datetime column from the data
temporal_fires["datetime"] = pd.to_datetime(temporal_fires["datetime"])
# add a year column to the dataframe
temporal_fires["year"] = temporal_fires["datetime"].dt.year
temporal_fires['count'] = temporal_fires['year'].map(temporal_fires['year'].value_counts())
The plotting part of the code:
# plotting (seaborn)
plot1 = sns.lineplot(x="year",
y="count",
data=temporal_fires,
color='firebrick')
plt.gca().xaxis.set_major_formatter(FuncFormatter(lambda x, _: int(x)))
plt.xlabel("Шаталт бүртгэгдсэн он", fontsize=10)
plt.ylabel("Бүртгэгдсэн шаталтын тоо")
plt.title("2001-2019 он тус бүрт бүртгэгдсэн шаталтын график")
plt.xticks(fontsize=7.5, rotation=45)
plt.yticks(fontsize=7.5)
Python doesn't return any errors and does show the figure:
... but (1) the labels are not properly aligned with the graph vertices and (2) I want the X label ticks to show each year instead of skipping some. For the latter, I did find a stackoverflow post, but it was for a heatmap, so I'm not sure how I'll advance in this case.
How do I align them properly and show all ticks?
Thank you.
I found my answer, just in case anyone makes the same mistake.
The line
plt.gca().xaxis.set_major_formatter(FuncFormatter(lambda x, _: int(x)))
only changed the X tick labels on my plot, truncating each one to an integer, while the tick positions stayed the same. The misalignment arose because I had merely relabeled a tick at "2001.5" as "2001" without changing where that tick actually sits.
As for the label intervals, the addition of this line...
plt.xticks(np.arange(min(temporal_fires['year']), max(temporal_fires['year'])+1, 1.0))
...showed me all of the year values in the plot instead of skipping them.
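A minimal sketch of that fix on its own, using made-up yearly counts rather than the fire data, and plain matplotlib instead of seaborn to keep it short. Once the ticks are placed on the integer years, no formatter is needed because the positions are already whole numbers.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up yearly counts, for illustration only.
counts = pd.DataFrame({"year": np.arange(2001, 2020),
                       "count": np.arange(19) * 10})

fig, ax = plt.subplots()
ax.plot(counts["year"], counts["count"], color="firebrick")

# Put one tick on every integer year so each label sits on a data point.
years = np.arange(counts["year"].min(), counts["year"].max() + 1)
ax.set_xticks(years)
ax.tick_params(axis="x", labelsize=7.5, labelrotation=45)
```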

Python Stacked barchart with dataframe

I'm trying to visualize a data frame I have with a stacked barchart, where the x is websites, the y is frequency and then the groups on the barchart are different groups using them.
This is the dataframe:
This is the plot created just by doing this:
web_data_roles.plot(kind='barh', stacked=True, figsize=(20,10))
As you can see, it's not what I want. I've tried changing the plot so that the axes match the different columns of the dataframe, but it just says "no numerical data to plot". I'm not sure how to go about this anymore, so all help is appreciated.
You need to organise your dataframe so that role is a column.
set_index() initial preparation
unstack() to move role out of index and make a column
droplevel() to clean up multi index columns
import io
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1, figsize=[10,5],
sharey=False, sharex=False, gridspec_kw={"hspace":0.3})
df = pd.read_csv(io.StringIO("""website,role,freq
www.bbc.co.uk,director,2000
www.bbc.co.uk,technical,500
www.twitter.com,director,4000
www.twitter.com,technical,1500
"""))
df.set_index(["website","role"]).unstack(1).droplevel(0,axis=1).plot(ax=ax, kind="barh", stacked=True)
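For comparison, the same reshaping can probably be written with DataFrame.pivot. This is only a sketch on the sample data above, assuming each (website, role) pair appears exactly once; pivot raises an error on duplicate pairs.

```python
import io

import pandas as pd

df = pd.read_csv(io.StringIO("""website,role,freq
www.bbc.co.uk,director,2000
www.bbc.co.uk,technical,500
www.twitter.com,director,4000
www.twitter.com,technical,1500
"""))

# One row per website, one column per role, freq as the cell values.
wide = df.pivot(index="website", columns="role", values="freq")
print(wide)

# Each role column becomes one segment of the stacked horizontal bars.
ax = wide.plot(kind="barh", stacked=True, figsize=(10, 5))
```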

Creating a plot like picture with categories and dates

I am wondering how I can make the following type of plot in Python (preferably matplotlib):
I would like four categories along the y-axis, and then the dates along the x-axis just as in the figure.
I have a CSV file with two columns [category], [date]. The date format is: dd-mm-yyyy.
Extract:
category1,05-01-2020
category1,02-02-2020
category3,06-03-2020
category2,12-04-2020
etc...
Help will be appreciated!
You can simply plot the categories vs. the dates as is. For the color code, you need to convert the categories to individual numbers, which can be easily achieved using pandas Categorical data type.
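from io import StringIO
import pandas as pd
import matplotlib.pyplot as plt
d = """category1,05-01-2020
category1,02-02-2020
category3,06-03-2020
category2,12-04-2020"""
# dayfirst=True because the dates are written day first (05-01-2020 is 5 January)
df = pd.read_csv(StringIO(d), sep=',', parse_dates=[1], dayfirst=True, header=None, names=['category','date'])
fig, ax = plt.subplots()
# color each marker by its category code
ax.scatter(df['date'],df['category'], marker='s', c=df['category'].astype('category').cat.codes, cmap='tab10')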

How can I plot a pandas dataframe as a scatter graph? I think I may have messed up the indexing and can't add a new index?

I am trying to plot my steps as a scatter graph and then eventually add a trend line.
I managed to get it to work with df.plot() but it is a line chart.
The following is the code I have tried:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
data_file = pd.read_csv('CSV/stepsgyro.csv')
# print(data_file.head())
# put in the correct data types
data_file = data_file.astype({"steps": int})
pd.to_datetime(data_file['date'])
# makes the date definitely the index at the bottom
data_file.set_index(['date'], inplace=True)
# sorts the data frame by the index
data_file.sort_values(by=['date'], inplace=True, ascending=True)
# data_file.columns.values[1] = 'date'
# plot the raw steps data
# data_file.plot()
plt.scatter(data_file.date, data_file.steps)
plt.title('Daily Steps')
plt.grid(alpha=0.3)
plt.show()
plt.close('all')
# plot the cumulative steps data
data_file = data_file.cumsum()
data_file.plot()
plt.title('Cumulative Daily Steps')
plt.grid(alpha=0.3)
plt.show()
plt.close('all')
and here is a screenshot of what it's looking like on my IDE:
any guidance would be greatly appreciated!
You have set the index to be the "date" column. From that moment on, there is no "date" column anymore, hence data_file.date fails.
Two options:
Don't set the index. Sorting doesn't seem to be needed anyways.
Plot the index, plt.scatter(data_file.index, data_file.steps)
I can't figure out just by looking at your example why you are getting that error. However, I can offer a quick and easy solution to plotting your data:
data_file.plot(marker='.', linestyle='none')
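You can use df.plot(kind='scatter', x=..., y=...) to avoid the line chart. Both x and y must be column names, so "date" has to stay a regular column (or be restored with reset_index()) for this to work.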
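A small sketch of the "plot the index" option, using made-up step counts in place of CSV/stepsgyro.csv. It also assigns the result of pd.to_datetime back to the column, because the function returns a new Series and does not change the DataFrame in place.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data standing in for the CSV file.
data_file = pd.DataFrame({
    "date": ["2020-01-03", "2020-01-01", "2020-01-02"],
    "steps": [7200, 5000, 6100],
})

# to_datetime returns a new Series, so assign it back to the column.
data_file["date"] = pd.to_datetime(data_file["date"])

# After set_index, "date" is the index and no longer a column.
data_file = data_file.set_index("date").sort_index()

fig, ax = plt.subplots()
ax.scatter(data_file.index, data_file["steps"])
ax.set_title("Daily Steps")
ax.grid(alpha=0.3)
```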

nan being displayed as label in histogram for Y axis

This is a python problem. I am a novice to python and visualization and tried to do some research before this. But I wasn't able to get the right answer.
I have a csv file with first column as names of countries and remaining with some numerical data. I am trying to plot a horizontal histogram with the countries on y axis and the respective first column data on x axis. However, with this code I am getting "nan" instead of country names. How can I make sure that the yticks are correctly showing country names and not nan?
Click here for image of the plot diagram
My code is as such: (displaying only first 5 rows)
import numpy as np
import matplotlib.pyplot as plt
my_data = np.genfromtxt('c:\drinks.csv', delimiter=',')
countries = my_data[0:5,0]
y_pos = np.arange(len(countries))
plt.figure()
plt.barh(y_pos, my_data[0:5:,1])
plt.yticks(y_pos, countries)
plt.show()
Here is the link to the csv file
This works, but you have a lot of countries on the y axis. I don't know whether you plan to plot only a few of them.
import numpy as np
import matplotlib.pyplot as plt
with open("drinks.csv") as file:
    lines = file.readlines()
# first field of each line is the country name, second is the value to plot
countries = [line.split(",")[0] for line in lines[0:10]]
my_data = [int(line.split(",")[1]) for line in lines[0:10]]
plt.figure()
y_pos = np.arange(len(countries))
plt.barh(y_pos, my_data)
plt.yticks(y_pos, countries)
plt.show()
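The "nan" labels most likely come from np.genfromtxt itself: with its default float dtype, any text field that cannot be converted to a number is stored as nan. The sketch below shows this on a tiny in-memory CSV. The column names and values are made up and only stand in for drinks.csv. It then reads the same text with pandas, which keeps the names as strings.

```python
import io

import numpy as np
import pandas as pd

# Made-up stand-in for drinks.csv: a header row, then name,value pairs.
csv_text = "country,beer_servings\nAfghanistan,0\nAlbania,89\nAlgeria,25\n"

# Default dtype is float, so the text column (and the header row) become nan.
raw = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
print(raw[:, 0])

# pandas infers a string column for the names and an integer column for the values.
table = pd.read_csv(io.StringIO(csv_text))
countries = table["country"].tolist()
values = table["beer_servings"].tolist()
print(countries, values)
```

The countries and values lists can then be passed to plt.yticks and plt.barh as in the answer above.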
