Create variables in Python from available data

I have read some data like this:
import numpy as np
arr = np.loadtxt("data/za.csv", delimiter=",")
display(arr)
Now the display looks like this:
array([[5.0e+01, 1.8e+00, 1.6e+00, 1.75e+00],
       [4.8e+01, 1.77e+00, 1.63e+00, 1.75e+00],
       [5.5e+01, 1.8e+00, 1.6e+00, 1.75e+00],
       ...,
       [5.0e+01, 1.8e+00, 1.6e+00, 1.75e+00],
       [4.8e+01, 1.77e+00, 1.63e+00, 1.75e+00],
       [5.0e+01, 1.8e+00, 1.6e+00, 1.75e+00]])
Now I would like to assign variables to the columns of this array:
the first is the weight of the person
the second is the height of the person
the third is the height of the mother
the fourth is the height of the father
How can I create variables that represent these columns?

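Install the pandas library, then import it:
import pandas as pd
Read the data with pd.read_csv("data/za.csv", header=None, names=["weight", "height", "height_mother", "height_father"]). The names parameter sets the column labels, and header=None tells pandas that the file has no header row.
Hope this solves it.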

As already suggested, you can use pandas.read_csv for this:
import pandas as pd
df = pd.read_csv(
    "data/za.csv",
    header=None,  # the file has no header row
    names=["weight_of_person", "height_of_person", "height_of_mother", "height_of_father"],
)
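If you would rather stay with NumPy and have one plain variable per column, the transposed array can be unpacked directly. This is only a minimal sketch, assuming the file has exactly four numeric columns in the order given in the question; the small inline array stands in for `data/za.csv`, and the variable names are just illustrative.

```python
import numpy as np

# Stand-in for np.loadtxt("data/za.csv", delimiter=","); values are illustrative.
arr = np.array([
    [50.0, 1.80, 1.60, 1.75],
    [48.0, 1.77, 1.63, 1.75],
    [55.0, 1.80, 1.60, 1.75],
])

# Transposing turns columns into rows, so each row unpacks into one variable.
weight, height, mother_height, father_height = arr.T

print(weight)         # first column of arr
print(father_height)  # fourth column of arr
```

np.loadtxt also accepts unpack=True, which returns the transposed array, so the same unpacking works directly on its result.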

Related

Python "value_count" output to formatted table

I have value_count output data for a single column that I would like to feed into a table and format nicely. I would like to bold the headings, have "alternating colors" for the rows, change the font to "serif", and italicize the main column. Kind of like this.
I thought I found something applicable, but I do not know how to apply it to my data (or perhaps it is not suited for what I want to achieve).
I found "table styles" with the following example:
df4 = pd.DataFrame([[1,2],[3,4]])
s4 = df4.style
props = 'font-family: "Times New Roman", Times, serif; color: #e83e8c; font-size:1.3em;'
df4.style.applymap(lambda x: props, subset=[1])
Here is my code on its own. Please note I had to first split my data (here) so that I could properly count to end up with the value_count output data. These are a few libraries I have been working with (but there could be a few unnecessary ones in here).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
#access file
data = pd.read_csv('E:/testing_data.csv')
supplies = pd.DataFrame(data)
supplies.Toppings = supplies.Toppings.str.split('\r\n')
supplies = supplies.explode('Toppings').reset_index(drop=True)
supplies.Toppings.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'
Please be as specific as possible as I am still getting used to Python terms. Thank you.
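One possible approach, sketched under assumptions: turn the value_counts result into a two-column DataFrame and style that with the pandas Styler. The sample toppings, the column names "Topping" and "Share", the colour, and the helper style_table are all made up for illustration; the Styler also needs the optional jinja2 package installed.

```python
import pandas as pd

# Illustrative stand-in for the exploded Toppings column from the question.
toppings = pd.Series(
    ["cheese", "ham", "cheese", "olives", "cheese", "ham"], name="Toppings"
)

# Same percentage calculation as in the question.
share = toppings.value_counts(normalize=True).mul(100).round(1).astype(str) + "%"

# Turn the Series into a table with two explicitly named columns.
table = share.rename_axis("Topping").reset_index(name="Share")


def style_table(df):
    """Hypothetical helper: serif font, italic main column, bold headers, striped rows."""
    return (
        df.style
        .set_properties(**{"font-family": "serif"})
        .set_properties(subset=["Topping"], **{"font-style": "italic"})
        .set_table_styles([
            {"selector": "th",
             "props": [("font-weight", "bold"), ("font-family", "serif")]},
            {"selector": "tbody tr:nth-child(even)",
             "props": [("background-color", "#f2f2f2")]},
        ])
    )


print(table)
```

In a Jupyter notebook, putting style_table(table) on the last line of a cell should render the formatted table.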

Adding column titles between current titles in pandas

I'm relatively new to coding so may be an easy answer! Basically I'm using pandas to import data and I want to add a column header between the original header titles. I've added the code with the names= section showing essentially what I would like to see. Help with how that is actually implemented would be a great help as I am very stuck.
dfFQExp = pd.read_csv(fileFQExp, delimiter=r'\s+', names=["Original header1", "error1", "Original header2", "error2"....])
Thanks!
If you would like to rename the column names, you can do it this way:
By location:
dfFQExp.rename(columns={ dfFQExp.columns[0]: 'new header1'}, inplace = True)
By original name:
dfFQExp.rename(columns={ 'Original header1': 'new header1'}, inplace = True)
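To rename every column in one go, the same rename call can take a mapping built from the current labels. A small sketch, assuming a frame whose columns currently carry pandas' default integer labels; the data values are invented and the names are taken from the question.

```python
import pandas as pd

# Illustrative frame: each value column is followed by its error column,
# and pandas has assigned the default labels 0..3.
df = pd.DataFrame([[1.0, 0.1, 2.0, 0.2], [3.0, 0.3, 4.0, 0.4]])

new_names = ["Original header1", "error1", "Original header2", "error2"]

# Map each current label to its new name by position; rename returns a new frame.
renamed = df.rename(columns=dict(zip(df.columns, new_names)))

print(renamed.columns.tolist())
```

Without inplace=True the original frame is left untouched, which makes it easier to check the result before overwriting anything.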

How to stop matplotlib from skipping gaps in data?

I have this simple csv:
date,count
2020-07-09,144.0
2020-07-10,143.5
2020-07-12,145.5
2020-07-13,144.5
2020-07-14,146.0
2020-07-20,145.5
2020-07-21,146.0
2020-07-24,145.5
2020-07-28,143.0
2020-08-05,146.0
2020-08-10,147.0
2020-08-11,147.5
2020-08-14,146.5
2020-09-01,143.5
2020-09-02,143.0
2020-09-09,144.5
2020-09-10,143.5
2020-09-25,144.0
2021-09-21,132.4
2021-09-23,131.2
2021-09-25,131.0
2021-09-26,130.8
2021-09-27,130.6
2021-09-28,128.4
2021-09-30,126.8
2021-10-02,126.2
If I copy it into excel and scatter plot it, it looks like this
This is correct; there should be a big gap in the middle (look carefully at the data, it jumps from 2020 to 2021)
However if I do this in python:
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('data.csv')
data.plot.scatter('date', 'count')
plt.show()
It looks like this:
It spaces the points evenly and the gap is gone. How do I stop that behavior? I tried to do
plt.xticks = data.date
But that didn't do anything different.
I don't know the exact column types in data, but this is probably because the 'date' column is of string type, so matplotlib treats the values as unordered categories instead of comparable dates. Convert the column before plotting:
data['date'] = pd.to_datetime(data['date'])
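A quick way to check that the conversion worked, without plotting anything, is to look at the column dtype and the day gaps between consecutive rows. This is just a sketch on three rows copied from the question's CSV; the names csv_text and gaps are illustrative.

```python
import io
import pandas as pd

# Three rows from the question's CSV, spanning the 2020 -> 2021 jump.
csv_text = "date,count\n2020-09-25,144.0\n2021-09-21,132.4\n2021-09-23,131.2\n"

data = pd.read_csv(io.StringIO(csv_text))
print(data["date"].dtype)  # still a string-like dtype at this point

data["date"] = pd.to_datetime(data["date"])
print(data["date"].dtype)  # now a datetime dtype

# With real datetimes, differences between rows are actual time spans.
gaps = data["date"].diff().dt.days
print(gaps.tolist())
```

Once the column holds datetimes, the x axis is laid out on a real time scale, so the long gap between 2020 and 2021 shows up in the plot.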
I've tested:
import io
import pandas as pd
txt = """
date,count
2020-07-09,144.0
2020-07-10,143.5
2020-07-12,145.5
2020-07-13,144.5
2020-07-14,146.0
2020-07-20,145.5
2020-07-21,146.0
2020-07-24,145.5
2020-07-28,143.0
2020-08-05,146.0
2020-08-10,147.0
2020-08-11,147.5
2020-08-14,146.5
2020-09-01,143.5
2020-09-02,143.0
2020-09-09,144.5
2020-09-10,143.5
2020-09-25,144.0
2021-09-21,132.4
2021-09-23,131.2
2021-09-25,131.0
2021-09-26,130.8
2021-09-27,130.6
2021-09-28,128.4
2021-09-30,126.8
2021-10-02,126.2"""
data = pd.read_csv(io.StringIO(txt), sep=r",", parse_dates=["date"])
data.plot.scatter('date', 'count')
and the result is:
Two observations:
date must be of datetime type, which the parse_dates=["date"] option ensures
importing matplotlib.pyplot is not necessary in a notebook, because you used the pandas.DataFrame.plot.scatter method; in a plain script you still need it for plt.show().

Boxplot needs to use multiple groupby in Pandas

I am using pandas, Jupyter notebooks and python.
I have the following dataset as a dataframe
Cars,Country,Type
1564,Australia,Stolen
200,Australia,Stolen
579,Australia,Stolen
156,Japan,Lost
900,Africa,Burnt
2000,USA,Stolen
1000,Indonesia,Stolen
900,Australia,Lost
798,Australia,Lost
128,Australia,Lost
200,Australia,Burnt
56,Australia,Burnt
348,Australia,Burnt
1246,USA,Burnt
I would like to know how I can use a box plot to answer the following question "Number of cars in Australia that were affected by each type". So basically, I should have 3 boxplots(for each type) showing the number of cars affected in Australia.
Please keep in mind that this is a subset of the real dataset.
You can select only the rows corresponding to "Australia" from the column "Country" and group it by the column "Type" as shown:
from io import StringIO  # Python 3; the old StringIO module was Python 2 only
import pandas as pd
text_string = StringIO(
"""
Cars,Country,Type,Score
1564,Australia,Stolen,1
200,Australia,Stolen,2
579,Australia,Stolen,3
156,Japan,Lost,4
900,Africa,Burnt,5
2000,USA,Stolen,6
1000,Indonesia,Stolen,7
900,Australia,Lost,8
798,Australia,Lost,9
128,Australia,Lost,10
200,Australia,Burnt,11
56,Australia,Burnt,12
348,Australia,Burnt,13
1246,USA,Burnt,14
""")
df = pd.read_csv(text_string, sep = ",")
# Specifically checks in column name "Cars"
group = df.loc[df['Country'] == 'Australia'].boxplot(column = 'Cars', by = 'Type')
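To see the numbers behind those three boxes, the same filter can be combined with groupby and describe. A minimal sketch using the dataset from the question; the names australia and summary are illustrative.

```python
import io
import pandas as pd

csv_text = """Cars,Country,Type
1564,Australia,Stolen
200,Australia,Stolen
579,Australia,Stolen
156,Japan,Lost
900,Africa,Burnt
2000,USA,Stolen
1000,Indonesia,Stolen
900,Australia,Lost
798,Australia,Lost
128,Australia,Lost
200,Australia,Burnt
56,Australia,Burnt
348,Australia,Burnt
1246,USA,Burnt
"""

df = pd.read_csv(io.StringIO(csv_text))

# Keep only the Australian rows, then summarise Cars within each Type.
australia = df[df["Country"] == "Australia"]
summary = australia.groupby("Type")["Cars"].describe()

# count, min, median (50%) and max for each of the three types
print(summary[["count", "min", "50%", "max"]])
```

The quartile columns of describe ("25%", "50%", "75%") correspond to the box edges and median line that boxplot draws for each type.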

Formatting Pandas DataFrame with Quantopian qgrid

I'm attempting to use quantopian qgrid to print dataframes in iPython notebook. Simple example based on example notebook:
import qgrid
qgrid.nbinstall(overwrite=True)
qgrid.set_defaults(remote_js=True, precision=2)
from pandas import Timestamp
from pandas_datareader.data import get_data_yahoo
data = get_data_yahoo(symbols='SPY',
start=Timestamp('2014-01-01'),
end=Timestamp('2016-01-01'),
adjust_price=True)
qgrid.show_grid(data, grid_options={'forceFitColumns': True})
Other than the precision args, how do you format the column data? It seems to be possible to pass in grid options like formatterFactory or defaultFormatter but unclear exactly how to use them for naive user.
Alternative approaches suggested in another question but I like the interaction the SlickGrid object provides.
Any help or suggestions much appreciated.
The short answer is that most grid options are passed through grid_options as a dictionary, e.g. to set options for a specific grid:
qgrid.show_grid(data,
grid_options={'fullWidthRows': True,
'syncColumnCellResize': True,
'forceFitColumns': True,
'rowHeight': 40,
'enableColumnReorder': True,
'enableTextSelectionOnCells': True,
'editable': True})
Please, see details here
I changed the float format using df.round (it would be useful to be able to change the format through column_definitions instead).
However, the slider filter does not line up with the values in the columns. Why?
import pandas as pd
import qgrid
import numpy as np
col_def = { 'B': {"Slick.Editors.Float.DefaultDecimalPlaces":2}}
np.random.seed(25)
df = pd.DataFrame(np.random.random([5, 4]), columns =["A", "B", "C", "D"])
df = df.round({"A":1, "B":2, "C":3, "D":4})
qgrid_widget = qgrid.show_grid(df, show_toolbar=True, column_definitions=col_def)
qgrid_widget
