How to stop matplotlib from skipping gaps in data? - python

I have this simple csv:
date,count
2020-07-09,144.0
2020-07-10,143.5
2020-07-12,145.5
2020-07-13,144.5
2020-07-14,146.0
2020-07-20,145.5
2020-07-21,146.0
2020-07-24,145.5
2020-07-28,143.0
2020-08-05,146.0
2020-08-10,147.0
2020-08-11,147.5
2020-08-14,146.5
2020-09-01,143.5
2020-09-02,143.0
2020-09-09,144.5
2020-09-10,143.5
2020-09-25,144.0
2021-09-21,132.4
2021-09-23,131.2
2021-09-25,131.0
2021-09-26,130.8
2021-09-27,130.6
2021-09-28,128.4
2021-09-30,126.8
2021-10-02,126.2
If I copy it into excel and scatter plot it, it looks like this
This is correct; there should be a big gap in the middle (look carefully at the data, it jumps from 2020 to 2021)
However if I do this in python:
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('data.csv')
data.plot.scatter('date', 'count')
plt.show()
It looks like this:
It spaces them evenly and the gap is gone. How do I stop that behavior? I tried to do
plt.xticks = data.date
But that didn't do anything different.

I don't know the exact column types in data, but this is probably because the 'date' column is read as strings. Matplotlib then treats each value as a category label rather than a point on a time axis, so it spaces them evenly. Convert the column to datetime before plotting:
data['date'] = pd.to_datetime(data['date'])

I've tested:
import io
import pandas as pd
txt = """
date,count
2020-07-09,144.0
2020-07-10,143.5
2020-07-12,145.5
2020-07-13,144.5
2020-07-14,146.0
2020-07-20,145.5
2020-07-21,146.0
2020-07-24,145.5
2020-07-28,143.0
2020-08-05,146.0
2020-08-10,147.0
2020-08-11,147.5
2020-08-14,146.5
2020-09-01,143.5
2020-09-02,143.0
2020-09-09,144.5
2020-09-10,143.5
2020-09-25,144.0
2021-09-21,132.4
2021-09-23,131.2
2021-09-25,131.0
2021-09-26,130.8
2021-09-27,130.6
2021-09-28,128.4
2021-09-30,126.8
2021-10-02,126.2"""
data = pd.read_csv(io.StringIO(txt), sep=r",", parse_dates=["date"])
data.plot.scatter('date', 'count')
and the result is:
Two observations:
date must be of a datetime type, which the parse_dates=["date"] option ensures
importing matplotlib.pyplot is not necessary to draw the plot, because you used the pandas.DataFrame.plot.scatter method (in a plain script you still need plt.show() to display it).
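To see why the conversion matters, here is a minimal sketch using three dates from the CSV above. It only checks the column type before and after pd.to_datetime and prints the differences between consecutive dates, which is the spacing a datetime axis keeps. The variable names are illustrative, not part of the original answer.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Three dates taken from the CSV above, as read_csv would load them without parse_dates.
raw = pd.Series(["2020-09-10", "2020-09-25", "2021-09-21"])
parsed = pd.to_datetime(raw)

print(is_datetime64_any_dtype(raw))     # is the raw column a datetime column?
print(is_datetime64_any_dtype(parsed))  # and after conversion?

# Differences between consecutive dates: this is the spacing a datetime axis preserves.
gaps = parsed.diff()
print(gaps)
```

The last gap is almost a year long, which is the jump from 2020 to 2021 that the Excel chart shows.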

Related

Python "value_count" output to formatted table

I have value_count output data for a single column that I would like to feed into a table and format nicely. I would like to bold the headings, have "alternating colors" for the rows, change the font to "serif", and italicize the main column. Kind of like this.
I thought I found something applicable, but I do not know how to apply it to my data (or perhaps it is not suited for what I want to achieve).
I found "table styles" with the following example:
df4 = pd.DataFrame([[1,2],[3,4]])
s4 = df4.style
props = 'font-family: "Times New Roman", Times, serif; color: #e83e8c; font-size:1.3em;'
df4.style.applymap(lambda x: props, subset=[1])
Here is my code on its own. Please note I had to first split my data (here) so that I could properly count to end up with the value_count output data. These are a few libraries I have been working with (but there could be a few unnecessary ones in here).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
#access file
data = pd.read_csv('E:/testing_data.csv')
supplies = pd.DataFrame(data)
supplies.Toppings = supplies.Toppings.str.split('\r\n')
supplies = supplies.explode('Toppings').reset_index(drop=True)
supplies.Toppings.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'
Please be as specific as possible as I am still getting used to Python terms. Thank you.
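One possible approach, sketched under assumptions: turn the value_counts output into a two-column DataFrame first, then style that with the pandas Styler. The function names, the "Share" column label, and the colours below are made up for illustration; the Styler needs jinja2 installed and renders in a notebook.

```python
import pandas as pd

def value_counts_table(series, label="Toppings"):
    """Turn percentage value counts into a two-column DataFrame."""
    shares = series.value_counts(normalize=True).mul(100).round(1)
    table = shares.rename_axis(label).reset_index(name="Share")
    table["Share"] = table["Share"].astype(str) + "%"
    return table

def style_table(table, label="Toppings"):
    """Apply the requested formatting; returns a Styler (needs jinja2 installed)."""
    return (
        table.style
        .set_table_styles([
            # bold, serif headings
            {"selector": "th", "props": [("font-weight", "bold"), ("font-family", "serif")]},
            # serif body cells
            {"selector": "td", "props": [("font-family", "serif")]},
            # alternating row colours
            {"selector": "tbody tr:nth-child(even)", "props": [("background-color", "#f2f2f2")]},
        ])
        # italicize the main column
        .set_properties(subset=[label], **{"font-style": "italic"})
    )
```

In a notebook, putting style_table(value_counts_table(supplies.Toppings)) on the last line of a cell should display the formatted table.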

How to make it so that all available data is being pulled instead of specifically typing out a date range for this script?

The available options dates are below. How can I write a code so that it pulls all those dates instead of having to type them all out in a separate row?
2022-03-11, 2022-03-18, 2022-03-25, 2022-04-01, 2022-04-08, 2022-04-14, 2022-04-22, 2022-05-20, 2022-06-17, 2022-07-15, 2022-10-21, 2023-01-20, 2024-01-19
import yfinance as yf
gme = yf.Ticker("gme")
opt = gme.option_chain('2022-03-11')
print(opt)
First of all, as these dates have no regular pattern, you should create a list of the dates.
list1=['2022-03-11', '2022-03-18', '2022-03-25', '2022-04-01', '2022-04-08', '2022-04-14', '2022-04-22', '2022-05-20', '2022-06-17', '2022-07-15', '2022-10-21', '2023-01-20', '2024-01-19']
After you have created the list, you can initiate your code as how you have done:
import yfinance as yf
gme = yf.Ticker("gme")
Since you want everything output, and a file is easier to inspect than printed text (I checked the output and personally prefer CSV for yfinance), you can save each chain like this:
for date in list1:
    df = gme.option_chain(date)
    df_call = df[0]
    df_put = df[1]
    df_call.to_csv(f'call_{date}.csv')
    df_put.to_csv(f'put_{date}.csv')
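The list does not have to be typed by hand: a yfinance Ticker exposes the available expiration dates through its options attribute. A minimal sketch of looping over that instead, where the function name and the out_dir parameter are illustrative assumptions:

```python
from pathlib import Path

def save_all_option_chains(ticker, out_dir="."):
    """Save calls and puts for every expiration the ticker object reports.

    `ticker` is expected to behave like yfinance.Ticker: an `options`
    attribute listing expiration dates and an `option_chain(date)` method
    whose result has `calls` and `puts` DataFrames.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for date in ticker.options:
        chain = ticker.option_chain(date)
        chain.calls.to_csv(out_dir / f"call_{date}.csv")
        chain.puts.to_csv(out_dir / f"put_{date}.csv")
        saved.append(date)
    return saved
```

With the setup above, this would be called as save_all_option_chains(gme) after gme = yf.Ticker("gme").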

Filter columns in a csv file and output plot

I am trying to filter a column in a CSV just like you would in excel. Then based on that filter I would like to have it call a different column and output the data from that column into a plot.
I tried printing the filter expression on its own and it prints correctly, so I can search through the column. I am just not sure about the syntax for the rest:
data.head()
print('banana',
data[('Sub-Dept')] == 'Stow Each') #and data[('Sub-Dept')] == 'Stow Each Nike', 'Each Stow to Prime', 'Each Stow to Prime E', 'Each Stow to Prime W', 'Stow to Prime LeadPA')
But I do not know how to get it to filter that first, then plot underneath there. I am fairly new to this.
I have a column has many different filterable names inside of it. I want to call those names above.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
x = []
y = []
data = pd.read_csv(r'C:\Users\rmond\Downloads\PS_csvFile.csv', encoding="ISO-8859-1", skiprows=6)
new_data = data.loc[(data['Sub-Dept'] == 'Stow Each')]
sns.set(style="whitegrid") #this is strictly cosmetic, you can change it any time
ax = sns.countplot(x='U.S. OSHA Recordable?', data=new_data)
plt.bar(x, y, label='Loaded from file!')
plt.ylabel('Quantity of Injuries')
plt.title('Injuries (past 4 weeks)')
plt.show()
Right now I expect it to output one graph with two bars. Problem: it shows a quantity of 80 on one bar and 20 on the other. Desired result: after the data is filtered by the other column, it should show 21 on one bar and 7 on the other in the same graph.
The graphing portion works great, and so does pulling the data from the file. The only part I do not know how to do is filtering that column and then graphing based on that filter. I am not sure what the code should look like or where it should go. Please help.
CSV FILE HERE: https://drive.google.com/open?id=1yJ6iQL-bOvGSLAKPcPXqgk9aKLqUEmPK
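Try DataFrame.query()
Pandas query might be useful. Column names that are not valid Python identifiers (such as Sub-Dept) need backticks (pandas 1.0 or later), and string values need quotes:
data = pd.read_csv(r'C:\Users\rmond\Downloads\PS_csvFile.csv', encoding="ISO-8859-1", skiprows=6)
new_data = data.query("`Sub-Dept` == 'Stow Each' or `Sub-Dept` == 'RF_Pick'")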
I am so happy to have figured this out. I had trouble finding the answer to this on the internet, so I hope this helps someone else in the future. Thanks to Datanovice for the initial idea of using .loc, which helped me get to the next steps. The rest of my answer came from here: https://www.geeksforgeeks.org/python-pandas-extracting-rows-using-loc/
Sorry I left my comments in the code
import pandas as pd # data analysis library (DataFrames, CSV reading)
import numpy as np
import matplotlib.pyplot as plt # allows us to plot things
import csv # allows us to import and use CSV commands which are simple but effective
import seaborn as sns #https://seaborn.pydata.org/generated/seaborn.boxplot.html
# This website saved my life https://www.pythonforengineers.com/introduction-to-pandas/
# use this to check the available styles: plt.style.available
x = []
y = []
data = pd.read_csv(r'C:\Users\rmond\Downloads\PS_csvFile.csv', encoding="ISO-8859-1", skiprows=6, index_col="Sub-Dept") #skiprows skips the comment lines at the top; encoding lets pandas read this CSV; index_col makes Sub-Dept the index so .loc can select by it
new_data = data.loc[["Each Stow to Prime", "Each Stow to Prime E", "Each Stow to Prime W", "Stow Each", "Stow Each Nike", "Stow to Prime LeadPA",]]
sns.set(style="whitegrid") #this is strictly cosmetic, you can change it any time
ax = sns.countplot(x='U.S. OSHA Recordable?', data=new_data) #countplot counts the rows in each category of the column and draws one bar per category
# the key is that "countplot" does the counting for you. its awesome
plt.bar(x, y, label='Loaded from file!')
plt.ylabel('Quantity of Injuries')
plt.title('Stow Injuries (past 4 weeks)')
plt.show() # shows the plot to the user
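An alternative that avoids changing the index is a boolean mask built with Series.isin. The sketch below uses a small made-up DataFrame in place of the real CSV, so the rows and counts are illustrative only:

```python
import pandas as pd

# Hypothetical stand-in for the CSV: only the two columns used in the question.
data = pd.DataFrame({
    "Sub-Dept": ["Stow Each", "Stow Each Nike", "RF_Pick", "Stow Each", "Pack"],
    "U.S. OSHA Recordable?": ["Yes", "No", "Yes", "No", "No"],
})

# Keep only the rows whose Sub-Dept is one of the wanted names.
wanted = ["Stow Each", "Stow Each Nike"]
new_data = data[data["Sub-Dept"].isin(wanted)]

# These are the numbers countplot would draw as bar heights.
counts = new_data["U.S. OSHA Recordable?"].value_counts()
print(counts)
```

new_data can then be passed to sns.countplot(x='U.S. OSHA Recordable?', data=new_data) exactly as in the code above.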

How to fix "wrong number of items passed 5, placement implies 1"

I am trying to make 6 separate graphs from a dataframe that has 5 columns and multiple rows that is imported from Excel. I want to add two lines to the graph that are the point in the dataframe plus and minus the rolling standard deviation at each point in each column and row of the dataframe. To do this I am using a nested for loop and then graphing, however, it is saying wrong number of items pass placement implies 1. I do not know how to fix this.
I have tried converting the dataframe to a list and appending rows as well. Nothing seems to work. I know this could be easily done.
import pandas as pd
import matplotlib.pyplot as plt
excel_file = 'C:/Users/afrydman/Documents/Storage and Data Centers FFO Multiples Data.xlsx'
dfStorage = pd.read_excel(excel_file,sheet_name='Storage Data', index_col='Date')
dfrollingStd = dfStorage.rolling(12).std().shift(-11)
#dfrollingStd.fillna(0)
#print(dfStorage[1][3])
for k,p in dfStorage, dfrollingStd:
    dftemp = pd.DataFrame(dfStorage,columns=[k])
    dfnew=pd.DataFrame(dfrollingStd,columns=[p])
    for i,j in dfStorage, dfrollingStd:
        dftemp = pd.DataFrame(dfStorage,index=[i])
        dfnew=pd.DataFrame(dfrollingStd,index=[j])
        dftemp['-1std'] = pd.DataFrame(dftemp).subtract(dfnew)
        dftemp['+1std'] = pd.DataFrame(dftemp).add(dfnew)
        pd.DataFrame(dftemp).plot()
        plt.ylabel('P/FFO')
I expect the output to be 6 separate graphs each with 3 lines. Instead I am not getting anything. My loop is also not executing properly.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
excel_file = 'C:/Users/afrydman/Documents/Storage and Data Centers FFO Multiples Data.xlsx'
dfStorage = pd.read_excel(excel_file,sheet_name='Storage Data', index_col='Date')
dfrollingStd = dfStorage.rolling(12).std().shift(-11)
#dfrollingStd.fillna(0)
#print(dfStorage[1][3])
for i in dfStorage:
    dftemp = pd.DataFrame(dfStorage,columns=[i])
    for j in dfrollingStd:
        dfnew=pd.DataFrame(dfrollingStd,columns=[j])
        dftemp['-1std'] = pd.DataFrame(dftemp).subtract(dfnew)
        dftemp['+1std'] = pd.DataFrame(dftemp).add(dfnew)
        pd.DataFrame(dftemp).plot()
        plt.ylabel('P/FFO')
This is my updated code and I am still getting the same error. This time it is saying Wrong number of items passed 2, placement implies 1
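The error appears when a DataFrame with more than one column is assigned to a single column such as dftemp['-1std']. Subtracting two frames whose column names differ (or adding after '-1std' already exists) produces several columns, which cannot fit into one. A sketch of one way around it is to work column by column with Series, which always yield one column. The function names here are illustrative, and the rolling window simply mirrors the rolling(12).std().shift(-11) from the question:

```python
import pandas as pd

def std_bands(df, window=12):
    """Return {column name: DataFrame with the column and its -1std/+1std lines}."""
    # Same forward-aligned rolling standard deviation as in the question.
    rolling_std = df.rolling(window).std().shift(-(window - 1))
    bands = {}
    for col in df.columns:
        # df[col] and rolling_std[col] are Series, so each result is one column.
        bands[col] = pd.DataFrame({
            col: df[col],
            "-1std": df[col] - rolling_std[col],
            "+1std": df[col] + rolling_std[col],
        })
    return bands

def plot_bands(bands, ylabel="P/FFO"):
    """Draw one figure per column (needs matplotlib installed)."""
    for col, frame in bands.items():
        ax = frame.plot(title=col)
        ax.set_ylabel(ylabel)
```

Calling plot_bands(std_bands(dfStorage)) followed by plt.show() should give one graph per column, each with three lines.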

How to pass an array in python pandas to plot two axes?

I am trying to create an XY chart using Python and the Pygal library. The source data is contained in a CSV file with three columns; ID, Portfolio and Value. Unfortunately I can only plot one axis and I suspect it's an issue with the array. Can anyone point me in the right direction? Do I need to use numpy? Thank you!
import pygal
import pandas as pd
data = pd.read_csv("profit.csv")
data.columns = ["ID", "Portfolio", "Value"]
xy_chart = pygal.XY()
xy_chart.add('Portfolio', data['Portfolio','Value'] << I suspect this is wrong
)
xy_chart.render_in_browser()
With
import pygal
import pandas as pd
data = pd.read_csv("profit.csv")
data.columns = ["ID", "Portfolio", "Value"]
xy_chart = pygal.XY()
xy_chart.add('Portfolio', data['Portfolio']
)
xy_chart.render_in_browser()
I get:
A graph with a series of horizontal data points/values; i.e. it has the X values but no Y values.
With:
import pygal
import pandas as pd
data = pd.read_csv("profit.csv")
data.columns = ["ID", "Portfolio", "Value"]
xy_chart = pygal.XY()
xy_chart.add('Portfolio', data['Portfolio','Value']
)
xy_chart.render_in_browser()
I get:
KeyError: ('Portfolio', 'Value')
Sample data:
ID Portfolio Value
1 1 -2560.042036
2 2 1208.106958
3 3 5702.386949
4 4 -8827.63913
5 5 -3881.665733
6 6 5951.602484
Maybe a little late here, but I just did something similar. In your second example, multiple columns must be selected with a list (double brackets), and the DataFrame you get back then needs to be converted into a list of tuples.
import pygal
import pandas as pd
data = pd.read_csv("profit.csv")
data.columns = ["ID", "Portfolio", "Value"]
points = data[['Portfolio','Value']].to_records(index=False).tolist()
xy_chart = pygal.XY()
xy_chart.add('Portfolio', points)
xy_chart.render_in_browser()
There may be a more elegant use of the pandas or pygal API to get the columns into a list of tuples.
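One such alternative, as a sketch, is DataFrame.itertuples with name=None, which yields plain tuples directly. The DataFrame below is built inline from the first sample rows rather than read from profit.csv:

```python
import pandas as pd

# The sample rows from the question, built inline instead of read from profit.csv.
data = pd.DataFrame({
    "ID": [1, 2, 3],
    "Portfolio": [1, 2, 3],
    "Value": [-2560.042036, 1208.106958, 5702.386949],
})

# name=None makes itertuples yield plain tuples; index=False drops the row index.
points = list(data[["Portfolio", "Value"]].itertuples(index=False, name=None))
print(points)
```

The resulting points list can be passed to xy_chart.add('Portfolio', points) in the same way as above.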
