I am trying to make 6 separate graphs from a dataframe that has 5 columns and multiple rows, imported from Excel. On each graph I want to add two lines: the value at each point in the dataframe plus and minus the rolling standard deviation at that point. To do this I am using a nested for loop and then plotting; however, I get the error "Wrong number of items passed, placement implies 1". I do not know how to fix this.
I have tried converting the dataframe to a list and appending rows as well. Nothing seems to work. I am sure there is a simple way to do this.
import pandas as pd
import matplotlib.pyplot as plt
excel_file = 'C:/Users/afrydman/Documents/Storage and Data Centers FFO Multiples Data.xlsx'
dfStorage = pd.read_excel(excel_file,sheet_name='Storage Data', index_col='Date')
dfrollingStd = dfStorage.rolling(12).std().shift(-11)
#dfrollingStd.fillna(0)
#print(dfStorage[1][3])
for k,p in dfStorage, dfrollingStd:
    dftemp = pd.DataFrame(dfStorage,columns=[k])
    dfnew=pd.DataFrame(dfrollingStd,columns=[p])
    for i,j in dfStorage, dfrollingStd:
        dftemp = pd.DataFrame(dfStorage,index=[i])
        dfnew=pd.DataFrame(dfrollingStd,index=[j])
        dftemp['-1std'] = pd.DataFrame(dftemp).subtract(dfnew)
        dftemp['+1std'] = pd.DataFrame(dftemp).add(dfnew)
        pd.DataFrame(dftemp).plot()
        plt.ylabel('P/FFO')
I expect the output to be 6 separate graphs each with 3 lines. Instead I am not getting anything. My loop is also not executing properly.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
excel_file = 'C:/Users/afrydman/Documents/Storage and Data Centers FFO Multiples Data.xlsx'
dfStorage = pd.read_excel(excel_file,sheet_name='Storage Data', index_col='Date')
dfrollingStd = dfStorage.rolling(12).std().shift(-11)
#dfrollingStd.fillna(0)
#print(dfStorage[1][3])
for i in dfStorage:
    dftemp = pd.DataFrame(dfStorage,columns=[i])
    for j in dfrollingStd:
        dfnew=pd.DataFrame(dfrollingStd,columns=[j])
        dftemp['-1std'] = pd.DataFrame(dftemp).subtract(dfnew)
        dftemp['+1std'] = pd.DataFrame(dftemp).add(dfnew)
        pd.DataFrame(dftemp).plot()
        plt.ylabel('P/FFO')
This is my updated code, and I am still getting the same kind of error. This time the message is: Wrong number of items passed 2, placement implies 1
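One way to read the error: `dftemp.subtract(dfnew)` aligns the two frames on their column labels, so as soon as the labels differ the result has two columns, and a two-column frame cannot be assigned to the single column `'-1std'`. A minimal sketch that avoids this by working with one Series per column is shown here. The synthetic data, the column names `A` to `E`, and the `bands` dictionary are assumptions standing in for the Excel sheet, which is not available.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop this line for interactive use
import numpy as np
import pandas as pd

# Assumption: synthetic monthly data in place of the 'Storage Data' sheet.
rng = np.random.default_rng(0)
dates = pd.date_range("2015-01-01", periods=36, freq="MS")
dfStorage = pd.DataFrame(rng.normal(15, 2, size=(36, 5)),
                         index=dates, columns=list("ABCDE"))
dfStorage.index.name = "Date"

# Same rolling standard deviation as in the question.
dfrollingStd = dfStorage.rolling(12).std().shift(-11)

bands = {}  # hypothetical container, kept only so the frames can be inspected
for col in dfStorage.columns:
    value = dfStorage[col]    # a Series, not a one-column DataFrame
    std = dfrollingStd[col]   # the matching rolling std Series
    dftemp = pd.DataFrame({col: value,
                           "-1std": value - std,
                           "+1std": value + std})
    bands[col] = dftemp
    ax = dftemp.plot(title=col)  # one figure per column, three lines each
    ax.set_ylabel("P/FFO")
```

A single loop is enough here because both frames share the same column labels; each pass produces one figure with the original line and the two band lines.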
I have this simple csv:
date,count
2020-07-09,144.0
2020-07-10,143.5
2020-07-12,145.5
2020-07-13,144.5
2020-07-14,146.0
2020-07-20,145.5
2020-07-21,146.0
2020-07-24,145.5
2020-07-28,143.0
2020-08-05,146.0
2020-08-10,147.0
2020-08-11,147.5
2020-08-14,146.5
2020-09-01,143.5
2020-09-02,143.0
2020-09-09,144.5
2020-09-10,143.5
2020-09-25,144.0
2021-09-21,132.4
2021-09-23,131.2
2021-09-25,131.0
2021-09-26,130.8
2021-09-27,130.6
2021-09-28,128.4
2021-09-30,126.8
2021-10-02,126.2
If I copy it into excel and scatter plot it, it looks like this
This is correct; there should be a big gap in the middle (look carefully at the data, it jumps from 2020 to 2021)
However if I do this in python:
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('data.csv')
data.plot.scatter('date', 'count')
plt.show()
It looks like this:
It spaces the points evenly and the gap is gone. How do I stop that behavior? I tried to do
plt.xticks = data.date
But that didn't do anything different.
I don't know the exact column types in data, but the cause is probably that the 'date' column is of string type, so the values are treated as labels rather than as comparable dates. Try converting the column before plotting:
data['date'] = pd.to_datetime(data['date'])
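To check whether the conversion took effect, the column type can be inspected before and after. The following is a small sketch on four rows copied from the question's CSV; the variable names `was_datetime`, `is_datetime` and `largest_gap` are made up for this illustration.

```python
import io
import pandas as pd

# A few rows from the question's CSV, inlined so the sketch is self-contained.
txt = """date,count
2020-09-10,143.5
2020-09-25,144.0
2021-09-21,132.4
2021-09-23,131.2
"""
data = pd.read_csv(io.StringIO(txt))

# Without parse_dates the column holds plain strings.
was_datetime = pd.api.types.is_datetime64_any_dtype(data["date"])

data["date"] = pd.to_datetime(data["date"])
is_datetime = pd.api.types.is_datetime64_any_dtype(data["date"])

# With real dates, the jump from 2020 to 2021 is a measurable distance,
# which is what lets the plot show the gap.
largest_gap = data["date"].diff().max()
print(was_datetime, is_datetime, largest_gap)
```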
I've tested:
import io
import pandas as pd
txt = """
date,count
2020-07-09,144.0
2020-07-10,143.5
2020-07-12,145.5
2020-07-13,144.5
2020-07-14,146.0
2020-07-20,145.5
2020-07-21,146.0
2020-07-24,145.5
2020-07-28,143.0
2020-08-05,146.0
2020-08-10,147.0
2020-08-11,147.5
2020-08-14,146.5
2020-09-01,143.5
2020-09-02,143.0
2020-09-09,144.5
2020-09-10,143.5
2020-09-25,144.0
2021-09-21,132.4
2021-09-23,131.2
2021-09-25,131.0
2021-09-26,130.8
2021-09-27,130.6
2021-09-28,128.4
2021-09-30,126.8
2021-10-02,126.2"""
data = pd.read_csv(io.StringIO(txt), sep=r",", parse_dates=["date"])
data.plot.scatter('date', 'count')
and the result is:
Two observations:
date must be of a datetime type, which the parse_dates=["date"] option ensures.
Importing matplotlib.pyplot is not needed to create the plot, because you used the pandas.DataFrame.plot.scatter method; in a plain script, plt.show() is still needed to display it.
My printed output shows 2 variables with "array" in front of each of them. I am trying to remove that prefix and get plain arrays so I can convert them into a data frame.
(array([5.13431118, 2.75188667, 2.14270195, 1.85232761, 1.54816285,
1.07358247, 0.83953893, 0.79920618, 0.71898919, 0.68808879,
0.67637336, 0.65179984, 0.62325295, 0.59656284, 0.56309083,
0.54330533, 0.51451752, 0.49450315, 0.48263952, 0.448921 ,
0.42336611, 0.40067145, 0.38780448, 0.38185679, 0.26253902]), array([ 4.73899342, 2.39633009, 1.70232176, 1.35374982, 1.10683548,
0.61925838, 0.38851972, 0.34679717, 0.24324399, 0.23452607,
0.20272406, 0.16004198, 0.12964943, 0.11750156, 0.09877278,
0.0909908 , 0.07669317, 0.05444491, 0.04149164, 0.02239192,
0.01442321, 0.01144795, 0.00353863, 0.00107867, -0.00000164]))
Putting a * in front of the variable works inside the print function, but I can't actually save the result that way.
print('Eigen Values Test(>1.0):',*ev,sep="\n")
That is the example; my whole code is below.
#Import required libraries
import numpy as np
import pandas as pd
import scipy
from sklearn.datasets import load_iris
from factor_analyzer import FactorAnalyzer
import matplotlib.pyplot as plt
df = pd.read_csv("https://www.dropbox.com/s/110tmphef00ybyg/bfi.csv?raw=1")
#show existing columns
print(df.columns)
# Dropping unnecessary columns
df.drop(['gender', 'education', 'age', 'Unnamed: 0'],axis=1,inplace=True)
#Dropping missing values rows
df.dropna(inplace=True) # Removes Entire row that has NA
# Summary of data
print(df.info())
print(df.head())
#ADEQUACY TEST (Testing factorability)
print('')
#Bartlett’s test
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
chi_square_value,p_value=calculate_bartlett_sphericity(df)
print('Bartlett’s test:',chi_square_value, p_value,sep='\n')
#Kaiser-Meyer-Olkin (KMO) Test
from factor_analyzer.factor_analyzer import calculate_kmo
kmo_all,kmo_model=calculate_kmo(df)
print('Kaiser-Meyer-Olkin (KMO) Test(>0.8):', kmo_model,sep='\n')
#CHOOSING NUMBER OF FACTORS
# Create factor analysis object and perform factor analysis
fa = FactorAnalyzer(25, rotation=None)
print('')
# Check Eigenvalues
fa.fit(df)
ev = fa.get_eigenvalues()
print('Eigen Values Test(>1.0):',ev,sep="\n")
Ideally it should look like:
[5.13431118 2.75188667 2.14270195 1.85232761 1.54816285 1.07358247
0.83953893 0.79920618 0.71898919 0.68808879 0.67637336 0.65179984
0.62325295 0.59656284 0.56309083 0.54330533 0.51451752 0.49450315
0.48263952 0.448921 0.42336611 0.40067145 0.38780448 0.38185679
0.26253902]
[ 4.73899342 2.39633009 1.70232176 1.35374982 1.10683548 0.61925838
0.38851972 0.34679717 0.24324399 0.23452607 0.20272406 0.16004198
0.12964943 0.11750156 0.09877278 0.0909908 0.07669317 0.05444491
0.04149164 0.02239192 0.01442321 0.01144795 0.00353863 0.00107867
-0.00000164]
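The printed value is a tuple of two NumPy arrays; the `array(...)` text is only how a tuple displays its elements, so nothing needs to be stripped from the data itself. A minimal sketch of unpacking the tuple and building a DataFrame follows. The shortened stand-in tuple `ev` and the column names `original` and `common_factor` are assumptions for illustration; in the question's code `ev` would come from `fa.get_eigenvalues()`.

```python
import numpy as np
import pandas as pd

# Stand-in for ev = fa.get_eigenvalues(): a tuple of two equal-length arrays
# (values shortened from the question's output).
ev = (np.array([5.13431118, 2.75188667, 2.14270195]),
      np.array([4.73899342, 2.39633009, 1.70232176]))

# Tuple unpacking gives each array its own name, so each can be saved.
first, second = ev

# Column names are hypothetical labels chosen for this sketch.
eigen_df = pd.DataFrame({"original": first, "common_factor": second})
print(eigen_df)
```

Printing `first` and `second` separately gives the bracketed form shown above, since each is then printed as an array rather than as an element of a tuple.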
I am trying to filter a column in a CSV just like you would in Excel. Then, based on that filter, I would like to take a different column and plot the data from that column.
I have tried printing the filter expression on its own and it prints correctly, which shows that I can search through the column. I am just not sure about the syntax for the next step.
data.head()
print('banana',
data[('Sub-Dept')] == 'Stow Each') #and data[('Sub-Dept')] == 'Stow Each Nike', 'Each Stow to Prime', 'Each Stow to Prime E', 'Each Stow to Prime W', 'Stow to Prime LeadPA')
But I do not know how to apply that filter first and then plot the result. I am fairly new to this.
The column contains many different filterable names. I want to select the names listed above.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
x = []
y = []
data = pd.read_csv(r'C:\Users\rmond\Downloads\PS_csvFile.csv', encoding="ISO-8859-1", skiprows=6)
new_data = data.loc[(data['Sub-Dept'] == 'Stow Each')]
sns.set(style="whitegrid") #this is strictly cosmetic, you can change it any time
ax = sns.countplot(x='U.S. OSHA Recordable?', data=new_data)
plt.bar(x, y, label='Loaded from file!')
plt.ylabel('Quantity of Injuries')
plt.title('Injuries (past 4 weeks)')
plt.show()
Right now, I expect it to output 1 graph with 2 bars. Problem: it shows a quantity of 80 on one bar and 20 on the other. Expected result: after the data is filtered on the other column, it should show 21 on one bar and 7 on the other, in the same graph.
The graphing portion works great, and so does loading the data from the file. The only part I do not know how to do is filtering that column and then graphing based on that filter. I am not sure what the code should look like and where it should go. Please help.
CSV FILE HERE: https://drive.google.com/open?id=1yJ6iQL-bOvGSLAKPcPXqgk9aKLqUEmPK
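Try DataFrame.query(); it might be useful here. Because the column name contains a hyphen it has to be wrapped in backticks, and the string values need quotes.
data = pd.read_csv(r'C:\Users\rmond\Downloads\PS_csvFile.csv', encoding="ISO-8859-1", skiprows=6)
new_data = data.query("`Sub-Dept` == 'Stow Each' or `Sub-Dept` == 'RF_Pick'")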
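As a small self-contained sketch of the query approach, assuming a reasonably recent pandas (backtick quoting of names containing a hyphen needs pandas 1.0 or later). The rows below are invented for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical rows; only the column names are taken from the question.
data = pd.DataFrame({
    "Sub-Dept": ["Stow Each", "RF_Pick", "Pack", "Stow Each"],
    "U.S. OSHA Recordable?": ["Yes", "No", "No", "No"],
})

# Backticks let query() refer to a column whose name is not a valid
# Python identifier; the values being compared are quoted strings.
new_data = data.query("`Sub-Dept` == 'Stow Each' or `Sub-Dept` == 'RF_Pick'")
print(new_data)
```

The filtered frame `new_data` can then be passed to the plotting call in place of the full `data`.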
I am so happy to have figured this out. I had trouble finding the answer to this on the internet, so I hope this helps someone else in the future. Thanks to Datanovice for the initial idea of using .loc; that helped me get to the next steps. The rest of my answer came from here: https://www.geeksforgeeks.org/python-pandas-extracting-rows-using-loc/
Sorry, I left my comments in the code.
import pandas as pd # powerful data visualization library
import numpy as np
import matplotlib.pyplot as plt # allows us to plot things
import csv # allows us to import and use CSV commands which are simple but effective
import seaborn as sns #https://seaborn.pydata.org/generated/seaborn.boxplot.html
# This website saved my life https://www.pythonforengineers.com/introduction-to-pandas/
# use this to check the available styles: plt.style.available
x = []
y = []
data = pd.read_csv(r'C:\Users\rmond\Downloads\PS_csvFile.csv', encoding="ISO-8859-1", skiprows=6, index_col="Sub-Dept") # skiprows skips the comment lines at the top; encoding lets pandas read this CSV; index_col makes "Sub-Dept" the index so .loc can select by name
new_data = data.loc[["Each Stow to Prime", "Each Stow to Prime E", "Each Stow to Prime W", "Stow Each", "Stow Each Nike", "Stow to Prime LeadPA",]]
sns.set(style="whitegrid") #this is strictly cosmetic, you can change it any time
ax = sns.countplot(x='U.S. OSHA Recordable?', data=new_data) # seaborn's examples pull their data from a URL, but a pandas DataFrame can be passed in through data=
# the key is that "countplot" literally does the counting for you. It's awesome
plt.bar(x, y, label='Loaded from file!')
plt.ylabel('Quantity of Injuries')
plt.title('Stow Injuries (past 4 weeks)')
plt.show() # shows the plot to the user
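An alternative that keeps `Sub-Dept` as an ordinary column is a boolean mask built with `Series.isin`. This is only a sketch on invented rows, not the answer's method; the list `stow_depts` and the sample values are assumptions. `value_counts` shows the numbers a count plot of the filtered frame would draw.

```python
import pandas as pd

# Hypothetical rows; only the column names are taken from the question.
data = pd.DataFrame({
    "Sub-Dept": ["Stow Each", "Stow Each Nike", "Pack",
                 "Each Stow to Prime", "Pick"],
    "U.S. OSHA Recordable?": ["Yes", "No", "No", "Yes", "No"],
})

# Keep only the rows whose Sub-Dept is one of the wanted names.
stow_depts = ["Each Stow to Prime", "Stow Each", "Stow Each Nike"]
new_data = data[data["Sub-Dept"].isin(stow_depts)]

# These are the bar heights a count plot would show for the filtered rows.
counts = new_data["U.S. OSHA Recordable?"].value_counts()
print(counts)
```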
I am trying to create an XY chart using Python and the Pygal library. The source data is contained in a CSV file with three columns; ID, Portfolio and Value. Unfortunately I can only plot one axis and I suspect it's an issue with the array. Can anyone point me in the right direction? Do I need to use numpy? Thank you!
import pygal
import pandas as pd
data = pd.read_csv("profit.csv")
data.columns = ["ID", "Portfolio", "Value"]
xy_chart = pygal.XY()
xy_chart.add('Portfolio', data['Portfolio','Value']  # << I suspect this is wrong
)
xy_chart.render_in_browser()
With
import pygal
import pandas as pd
data = pd.read_csv("profit.csv")
data.columns = ["ID", "Portfolio", "Value"]
xy_chart = pygal.XY()
xy_chart.add('Portfolio', data['Portfolio']
)
xy_chart.render_in_browser()
I get:
A graph with a series of horizontal data points/values; i.e. it has the X values but no Y values.
With:
import pygal
import pandas as pd
data = pd.read_csv("profit.csv")
data.columns = ["ID", "Portfolio", "Value"]
xy_chart = pygal.XY()
xy_chart.add('Portfolio', data['Portfolio','Value']
)
xy_chart.render_in_browser()
I get:
KeyError: ('Portfolio', 'Value')
Sample data:
ID Portfolio Value
1 1 -2560.042036
2 2 1208.106958
3 3 5702.386949
4 4 -8827.63913
5 5 -3881.665733
6 6 5951.602484
Maybe a little late here, but I just did something similar. In your second example, multiple columns have to be selected with a list of names (note the double brackets), and the DataFrame you get back then needs to be converted into a list of tuples.
import pygal
import pandas as pd
data = pd.read_csv("profit.csv")
data.columns = ["ID", "Portfolio", "Value"]
points = data[['Portfolio','Value']].to_records(index=False).tolist()
xy_chart = pygal.XY()
xy_chart.add('Portfolio', points)
xy_chart.render_in_browser()
There may be a more elegant use of the pandas or pygal API to get the columns into a list of tuples.
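One possibly simpler route to the same list of tuples is plain `zip` over the two columns. The sketch below only builds the points and leaves the pygal calls out; the inlined CSV text is the first three rows of the sample data, with commas assumed as the separator.

```python
import io
import pandas as pd

# First rows of the sample data, inlined; the comma separator is an assumption.
csv_text = """ID,Portfolio,Value
1,1,-2560.042036
2,2,1208.106958
3,3,5702.386949
"""
data = pd.read_csv(io.StringIO(csv_text))

# Pair each Portfolio value with its Value to get (x, y) tuples.
points = list(zip(data["Portfolio"], data["Value"]))
print(points)
```

The resulting `points` list has the same shape as the one passed to `xy_chart.add('Portfolio', points)` above.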
How can I plot a word frequency histogram (for the author column) from a csv file using pandas and matplotlib? My csv looks like: id, author, title, language
Sometimes there is more than one author in the author column, separated by spaces.
import pandas as pd
file = 'c:/books.csv'
df = pd.read_csv(file)
print(df['author'])
Use collections.Counter for creating the histogram data, and follow the example given here, i.e.:
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Read CSV file, get author names and counts.
df = pd.read_csv("books.csv", index_col="id")
counter = Counter(df['author'])
author_names = list(counter.keys())
author_counts = list(counter.values())
# Plot histogram using matplotlib bar().
indexes = np.arange(len(author_names))
width = 0.7
plt.bar(indexes, author_counts, width)
# Bars are centered on their x positions by default, so the ticks go at indexes.
plt.xticks(indexes, author_names)
plt.show()
With this test file:
$ cat books.csv
id,author,title,language
1,peter,t1,de
2,peter,t2,de
3,bob,t3,en
4,bob,t4,de
5,peter,t5,en
6,marianne,t6,jp
the code above creates the following graph:
Edit:
You added a secondary condition, where the author column might contain multiple space-separated names. The following code handles this:
from itertools import chain
# Read CSV file, split each author field on whitespace, and count the names.
df = pd.read_csv("books2.csv", index_col="id")
authors_notflat = [a.split() for a in df['author']]
counter = Counter(chain.from_iterable(authors_notflat))
print(counter)
For this example:
$ cat books2.csv
id,author,title,language
1,peter harald,t1,de
2,peter harald,t2,de
3,bob,t3,en
4,bob,t4,de
5,peter,t5,en
6,marianne,t6,jp
it prints
$ python test.py
Counter({'peter': 3, 'bob': 2, 'harald': 2, 'marianne': 1})
Note that this code only works because strings are iterable.
This code is essentially free of pandas, except for the CSV-parsing part that produces the DataFrame df. If you need the default plot styling of pandas, there is also a suggestion for that in the mentioned thread.
You can count up the number of occurrences of each name using value_counts:
In [11]: df['author'].value_counts()
Out[11]:
peter       3
bob         2
marianne    1
dtype: int64
Series (and DataFrames) have a hist method for drawing histograms:
In [12]: df['author'].value_counts().hist()
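Note that calling `.hist()` on the result of `value_counts()` draws the distribution of the counts themselves, not one bar per author. If one bar per author is wanted, a bar plot of the counts is a possible alternative. The sketch below also splits space-separated names with `str.split` and `explode` (available in pandas 0.25 and later); the inlined CSV text mirrors the `books2.csv` example above.

```python
import io
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop this line for interactive use
import pandas as pd

# Same content as the books2.csv example, inlined so the sketch runs on its own.
csv_text = """id,author,title,language
1,peter harald,t1,de
2,peter harald,t2,de
3,bob,t3,en
4,bob,t4,de
5,peter,t5,en
6,marianne,t6,jp
"""
df = pd.read_csv(io.StringIO(csv_text), index_col="id")

# Split each field on whitespace, give every name its own row, then count.
counts = df["author"].str.split().explode().value_counts()

# One bar per author name.
ax = counts.plot(kind="bar")
ax.set_ylabel("count")
```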