Python "value_count" output to formatted table - python

I have value_count output data for a single column that I would like to feed into a table and format nicely. I would like to bold the headings, have "alternating colors" for the rows, change the font to "serif", and italicize the main column. Kind of like this.
I thought I found something applicable, but I do not know how to apply it to my data (or perhaps it is not suited for what I want to achieve).
I found "table styles" with the following example:
df4 = pd.DataFrame([[1,2],[3,4]])
s4 = df4.style
props = 'font-family: "Times New Roman", Times, serif; color: #e83e8c; font-size:1.3em;'
df4.style.applymap(lambda x: props, subset=[1])
Here is my code on its own. Please note I had to first split my data (here) so that I could properly count to end up with the value_count output data. These are a few libraries I have been working with (but there could be a few unnecessary ones in here).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
#access file
data = pd.read_csv('E:/testing_data.csv')
supplies = pd.DataFrame(data)
supplies.Toppings = supplies.Toppings.str.split('\r\n')
supplies = supplies.explode('Toppings').reset_index(drop=True)
supplies.Toppings.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'
Please be as specific as possible as I am still getting used to Python terms. Thank you.

Related

create variables in python with available data

I have read a data like this:
import numpy as np
arr = np.loadtxt("data/za.csv", delimiter=",")
display(arr)
Now the display looks like this:
array([[5.0e+01,1.8e+00,1.6e+00,1.75e+00],
[4.8e+01,1.77e+00,1.63e+00,1.75e+00],
[5.5e+01,1.8e+00,1.6e+00,1.75e+00],
...,
[5.0e+01,1.8e+00,1.6e+00,1.75e+00],
[4.8e+01,1.77e+00,1.63e+00,1.75e+00],
[5.0e+01,1.8e+00,1.6e+00,1.75e+00]])
Now I would like to assign variable names to the columns of this array:
the first is the weight of the person
the second is the height of the person
the third is the height of the mother
the fourth is the height of the father
How can I create variables that represent these columns?
Install the pandas library, then:
import pandas as pd
Use pd.read_csv("data/za.csv", header=None, names=["weight", "height", "mother_height", "father_height"]) to read the data.
Hope this solves it.
As it has already been advised, you may use pandas.read_csv for the purpose as per below:
df = pd.read_csv(**{
'filepath_or_buffer': "data/za.csv",
'header': None,
'names': ('weight_of_person', 'height_of_person', 'height_of_mother', 'height_of_father'),
})
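If staying with NumPy is preferred, the columns can also be unpacked into separate variables by transposing the array. This is a minimal sketch: the inline CSV text stands in for data/za.csv, and the variable names are illustrative choices, not part of the original question.

```python
import io
import numpy as np

# Inline stand-in for data/za.csv (weight, height, mother's height, father's height).
csv_text = "50,1.80,1.60,1.75\n48,1.77,1.63,1.75\n55,1.80,1.60,1.75\n"

arr = np.loadtxt(io.StringIO(csv_text), delimiter=",")

# arr.T has one row per original column, so it unpacks into four 1-D arrays.
weight, height, mother_height, father_height = arr.T

print(weight)
print(height)
```

Passing unpack=True to np.loadtxt returns the transposed array directly, which allows the same unpacking without the explicit `.T`.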

How to stop matplotlib from skipping gaps in data?

I have this simple csv:
date,count
2020-07-09,144.0
2020-07-10,143.5
2020-07-12,145.5
2020-07-13,144.5
2020-07-14,146.0
2020-07-20,145.5
2020-07-21,146.0
2020-07-24,145.5
2020-07-28,143.0
2020-08-05,146.0
2020-08-10,147.0
2020-08-11,147.5
2020-08-14,146.5
2020-09-01,143.5
2020-09-02,143.0
2020-09-09,144.5
2020-09-10,143.5
2020-09-25,144.0
2021-09-21,132.4
2021-09-23,131.2
2021-09-25,131.0
2021-09-26,130.8
2021-09-27,130.6
2021-09-28,128.4
2021-09-30,126.8
2021-10-02,126.2
If I copy it into excel and scatter plot it, it looks like this
This is correct; there should be a big gap in the middle (look carefully at the data, it jumps from 2020 to 2021)
However if I do this in python:
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('data.csv')
data.plot.scatter('date', 'count')
plt.show()
It looks like this:
It spaces the points evenly and the gap is gone. How do I stop that behavior? I tried to do
plt.xticks = data.date
But that didn't do anything different.
I don't know the exact column types in data, but this is probably because the 'date' column is of string type, so the plot cannot treat the dates as comparable values and spaces them evenly as categories. Before plotting, try converting its type:
data['date'] = pd.to_datetime(data['date'])
I've tested:
import io
import pandas as pd
txt = """
date,count
2020-07-09,144.0
2020-07-10,143.5
2020-07-12,145.5
2020-07-13,144.5
2020-07-14,146.0
2020-07-20,145.5
2020-07-21,146.0
2020-07-24,145.5
2020-07-28,143.0
2020-08-05,146.0
2020-08-10,147.0
2020-08-11,147.5
2020-08-14,146.5
2020-09-01,143.5
2020-09-02,143.0
2020-09-09,144.5
2020-09-10,143.5
2020-09-25,144.0
2021-09-21,132.4
2021-09-23,131.2
2021-09-25,131.0
2021-09-26,130.8
2021-09-27,130.6
2021-09-28,128.4
2021-09-30,126.8
2021-10-02,126.2"""
data = pd.read_csv(io.StringIO(txt), sep=r",", parse_dates=["date"])
data.plot.scatter('date', 'count')
and the result is:
Two observations:
date must be of date type, which is ensured by parse_dates=["date"] option
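importing matplotlib.pyplot is not necessary to create the plot, because you used the pandas.DataFrame.plot.scatter method (plt.show() is still needed to display the figure when running as a plain script).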
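A quick way to confirm the diagnosis is to compare the column type with and without parse_dates. The sketch below uses only two rows from the question's data, taken from either side of the gap.

```python
import io
import pandas as pd

# Two rows from either side of the gap in the question's CSV.
csv_text = "date,count\n2020-09-25,144.0\n2021-09-21,132.4\n"

raw = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Without parse_dates the column holds plain strings, not datetimes.
raw_is_datetime = pd.api.types.is_datetime64_any_dtype(raw["date"])
parsed_is_datetime = pd.api.types.is_datetime64_any_dtype(parsed["date"])
print(raw_is_datetime, parsed_is_datetime)

# With real datetimes the distance between rows is meaningful,
# which is what lets the plot show the gap.
gap = parsed["date"].diff().iloc[1]
print(gap.days)
```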

Filter columns in a csv file and output plot

I am trying to filter a column in a CSV just like you would in excel. Then based on that filter I would like to have it call a different column and output the data from that column into a plot.
I have tried printing the filter expression on its own and it prints correctly; I am just not sure about the syntax. When I print it, it shows that I can correctly search through a column:
data.head()
print('banana',
data[('Sub-Dept')] == 'Stow Each') #and data[('Sub-Dept')] == 'Stow Each Nike', 'Each Stow to Prime', 'Each Stow to Prime E', 'Each Stow to Prime W', 'Stow to Prime LeadPA')
But I do not know how to get it to filter that first, then plot underneath there. I am fairly new to this.
I have a column that has many different filterable names in it. I want to select the names listed above.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
x = []
y = []
data = pd.read_csv(r'C:\Users\rmond\Downloads\PS_csvFile.csv', encoding="ISO-8859-1", skiprows=6)
new_data = data.loc[(data['Sub-Dept'] == 'Stow Each')]
sns.set(style="whitegrid") #this is strictly cosmetic, you can change it any time
ax = sns.countplot(x='U.S. OSHA Recordable?', data=new_data)
plt.bar(x, y, label='Loaded from file!')
plt.ylabel('Quantity of Injuries')
plt.title('Injuries (past 4 weeks)')
plt.show()
Right now, I am expecting it to output one graph that has two bars. Problem: it shows a quantity of 80 on one bar and 20 on the other. Expected: after the data is filtered by the other column, it should show 21 on one bar and 7 on the other in the same graph.
The graphing portion works great and so does pulling the data from the excel. The only part I do not know how to do is filtering that column and then graphing based on that filter. I am not sure what the code should look like and where it should go. please help
CSV FILE HERE: https://drive.google.com/open?id=1yJ6iQL-bOvGSLAKPcPXqgk9aKLqUEmPK
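Try DataFrame.query()
The pandas query method might be useful. Because the column name contains a hyphen, wrap it in backticks, and put the string values in quotes:
data = pd.read_csv(r'C:\Users\rmond\Downloads\PS_csvFile.csv', encoding="ISO-8859-1", skiprows=6)
new_data = data.query("`Sub-Dept` == 'Stow Each' or `Sub-Dept` == 'RF_Pick'")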
I am so happy to have figured this out. I had trouble finding the answer to this on the internet, so I hope this helps someone else in the future. Thanks to Datanovice for the initial idea of using .loc. That helped me get to the next steps. The rest of my answer came from here: https://www.geeksforgeeks.org/python-pandas-extracting-rows-using-loc/
Sorry I left my comments in the code
import pandas as pd # powerful data analysis library
import numpy as np
import matplotlib.pyplot as plt # allows us to plot things
import csv # allows us to import and use CSV commands which are simple but effective
import seaborn as sns #https://seaborn.pydata.org/generated/seaborn.boxplot.html
# This website saved my life https://www.pythonforengineers.com/introduction-to-pandas/
# use this to check the available styles: plt.style.available
x = []
y = []
data = pd.read_csv(r'C:\Users\rmond\Downloads\PS_csvFile.csv', encoding="ISO-8859-1", skiprows=6, index_col="Sub-Dept") #skiprows allows you to skip the comments on top... & encoding allows pandas to read this CSV
new_data = data.loc[["Each Stow to Prime", "Each Stow to Prime E", "Each Stow to Prime W", "Stow Each", "Stow Each Nike", "Stow to Prime LeadPA",]]
sns.set(style="whitegrid") #this is strictly cosmetic, you can change it any time
ax = sns.countplot(x='U.S. OSHA Recordable?', data=new_data) #countplot takes the filtered pandas DataFrame directly through the data argument
# the key is that "countplot" literally does the work for you. its awesome
plt.bar(x, y, label='Loaded from file!')
plt.ylabel('Quantity of Injuries')
plt.title('Stow Injuries (past 4 weeks)')
plt.show() # shows the plot to the user
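An alternative that keeps "Sub-Dept" as an ordinary column is to filter with Series.isin. The sketch below uses a small hypothetical DataFrame in place of the real CSV and counts the values with value_counts, which gives the same numbers that countplot would draw.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: only the two columns used here.
data = pd.DataFrame({
    "Sub-Dept": ["Stow Each", "Pick", "Stow Each Nike", "Stow Each", "Pack"],
    "U.S. OSHA Recordable?": ["Yes", "No", "No", "Yes", "Yes"],
})

# Names to keep, as in an Excel column filter.
stow_names = ["Stow Each", "Stow Each Nike"]

# isin builds a boolean mask; indexing with it keeps only the matching rows.
new_data = data[data["Sub-Dept"].isin(stow_names)]

# These counts are what the bars of the count plot represent.
counts = new_data["U.S. OSHA Recordable?"].value_counts()
print(counts)
```

The filtered `new_data` can then be passed to sns.countplot(x='U.S. OSHA Recordable?', data=new_data) exactly as in the answer above.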

How to fix "wrong number of items passed 5, placement implies 1"

I am trying to make 6 separate graphs from a dataframe that has 5 columns and multiple rows and is imported from Excel. I want to add two lines to each graph: the value in the dataframe plus and minus the rolling standard deviation at each point in each column. To do this I am using a nested for loop and then graphing; however, I get the error "Wrong number of items passed 5, placement implies 1". I do not know how to fix this.
I have tried converting the dataframe to a list and appending rows as well. Nothing seems to work. I know this could be easily done.
import pandas as pd
import matplotlib.pyplot as plt
excel_file = 'C:/Users/afrydman/Documents/Storage and Data Centers FFO Multiples Data.xlsx'
dfStorage = pd.read_excel(excel_file,sheet_name='Storage Data', index_col='Date')
dfrollingStd = dfStorage.rolling(12).std().shift(-11)
#dfrollingStd.fillna(0)
#print(dfStorage[1][3])
for k,p in dfStorage, dfrollingStd:
    dftemp = pd.DataFrame(dfStorage,columns=[k])
    dfnew=pd.DataFrame(dfrollingStd,columns=[p])
    for i,j in dfStorage, dfrollingStd:
        dftemp = pd.DataFrame(dfStorage,index=[i])
        dfnew=pd.DataFrame(dfrollingStd,index=[j])
        dftemp['-1std'] = pd.DataFrame(dftemp).subtract(dfnew)
        dftemp['+1std'] = pd.DataFrame(dftemp).add(dfnew)
        pd.DataFrame(dftemp).plot()
        plt.ylabel('P/FFO')
I expect the output to be 6 separate graphs each with 3 lines. Instead I am not getting anything. My loop is also not executing properly.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
excel_file = 'C:/Users/afrydman/Documents/Storage and Data Centers FFO Multiples Data.xlsx'
dfStorage = pd.read_excel(excel_file,sheet_name='Storage Data', index_col='Date')
dfrollingStd = dfStorage.rolling(12).std().shift(-11)
#dfrollingStd.fillna(0)
#print(dfStorage[1][3])
for i in dfStorage:
    dftemp = pd.DataFrame(dfStorage,columns=[i])
    for j in dfrollingStd:
        dfnew=pd.DataFrame(dfrollingStd,columns=[j])
        dftemp['-1std'] = pd.DataFrame(dftemp).subtract(dfnew)
        dftemp['+1std'] = pd.DataFrame(dftemp).add(dfnew)
        pd.DataFrame(dftemp).plot()
        plt.ylabel('P/FFO')
This is my updated code and I am still getting the same kind of error. This time it says "Wrong number of items passed 2, placement implies 1".
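The error most likely comes from assigning a multi-column DataFrame to a single column name: subtracting or adding two DataFrames aligns them on column labels, so the result can have several columns while dftemp['-1std'] has room for only one. A possible fix, sketched with random data in place of the Excel sheet (the column names A to E and the dictionary band_frames are hypothetical), is to loop once over the columns and work with Series:

```python
import numpy as np
import pandas as pd

# Random stand-in for the 'Storage Data' sheet: 24 monthly rows, 5 columns.
rng = np.random.default_rng(0)
dates = pd.date_range("2018-01-01", periods=24, freq="MS")
df_storage = pd.DataFrame(
    rng.normal(15, 2, size=(24, 5)), index=dates, columns=list("ABCDE")
)
df_rolling_std = df_storage.rolling(12).std().shift(-11)

band_frames = {}
for col in df_storage.columns:
    values = df_storage[col]      # a Series, not a one-column DataFrame
    std = df_rolling_std[col]     # the matching rolling std Series
    # Series arithmetic returns a Series, so each new column gets one item.
    band_frames[col] = pd.DataFrame({
        col: values,
        "-1std": values - std,
        "+1std": values + std,
    })

print(band_frames["A"].head())
```

Calling band_frames[col].plot() inside the loop, followed by plt.ylabel('P/FFO'), should then produce one figure with three lines per column.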

Formatting Pandas DataFrame with Quantopian qgrid

I'm attempting to use quantopian qgrid to print dataframes in iPython notebook. Simple example based on example notebook:
import qgrid
qgrid.nbinstall(overwrite=True)
qgrid.set_defaults(remote_js=True, precision=2)
from pandas import Timestamp
from pandas_datareader.data import get_data_yahoo
data = get_data_yahoo(symbols='SPY',
start=Timestamp('2014-01-01'),
end=Timestamp('2016-01-01'),
adjust_price=True)
qgrid.show_grid(data, grid_options={'forceFitColumns': True})
Other than the precision argument, how do you format the column data? It seems to be possible to pass in grid options like formatterFactory or defaultFormatter, but it is unclear exactly how a naive user should use them.
Alternative approaches were suggested in another question, but I like the interaction the SlickGrid object provides.
Any help or suggestions much appreciated.
The short answer is that most grid options are passed through grid_options as a dictionary, e.g. to set options for a specific grid:
qgrid.show_grid(data,
grid_options={'fullWidthRows': True,
'syncColumnCellResize': True,
'forceFitColumns': True,
'rowHeight': 40,
'enableColumnReorder': True,
'enableTextSelectionOnCells': True,
'editable': True})
Please, see details here
I changed the float format using df.round (it would be useful to change the format using column_definitions instead).
However, the slider filter isn't in line with the values in the columns. Why?
import pandas as pd
import qgrid
import numpy as np
col_def = { 'B': {"Slick.Editors.Float.DefaultDecimalPlaces":2}}
np.random.seed(25)
df = pd.DataFrame(np.random.random([5, 4]), columns =["A", "B", "C", "D"])
df = df.round({"A":1, "B":2, "C":3, "D":4})
qgrid_widget = qgrid.show_grid(df, show_toolbar=True, column_definitions=col_def)
qgrid_widget
