How to create a hexbin plot from a pandas dataframe - python

I have this dataframe:
! curl -O https://raw.githubusercontent.com/msu-cmse-courses/cmse202-S21-student/master/data/Dataset.data
import pandas as pd
#I read it in
data = pd.read_csv("Dataset.data", delimiter=' ', header=None)
#Now I want to add column titles to the file so I add them
data.columns = ['sex','length','diameter','height','whole_weight','shucked_weight','viscera_weight','shell_weight','rings']
print(data)
Now I want to grab the x variable column shell_weight and the y variable column rings and graph them as a histogram using plt.hexbin:
df = pd.DataFrame(data)
plt.hexbin(x='shell_weight', y='rings')
For some reason, when I run this code, it fails with:
ValueError: First argument must be a sequence
Can anyone help me graph these 2 variables?

The issue with plt.hexbin(x='shell_weight', y='rings') is that matplotlib doesn't know what shell_weight and rings are supposed to be. It doesn't know about df unless you specify it.
Since you already have a dataframe, it's simplest to plot with pandas, but pure matplotlib is still possible if you specify the source df:
df.plot.hexbin (simplest)
In this case, pandas will automatically infer the columns from df, so we can just pass the column names:
df.plot.hexbin(x='shell_weight', y='rings') # pandas infers the df source
plt.hexbin
With pure matplotlib, either pass the actual columns:
plt.hexbin(x=df.shell_weight, y=df.rings) # actual columns, not column names
# ^^^ ^^^
Or pass the column names while specifying the data source:
plt.hexbin(x='shell_weight', y='rings', data=df) # column names with df source
# ^^^^^^^
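Putting it all together, a minimal runnable sketch using the dataset from the question (gridsize is just an optional tweak to the hexagon density):
import matplotlib.pyplot as plt
import pandas as pd

cols = ['sex', 'length', 'diameter', 'height', 'whole_weight',
        'shucked_weight', 'viscera_weight', 'shell_weight', 'rings']
data = pd.read_csv('Dataset.data', delimiter=' ', header=None, names=cols)

data.plot.hexbin(x='shell_weight', y='rings', gridsize=25)  # pandas hexbin
plt.show()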


Issue Creating Data Frame out of Columns Pandas - Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
df = pd.read_csv('time_series_covid_19_deaths_US.csv')
df = df.drop(['UID','iso2','iso3','code3','FIPS','Admin2','Combined_Key'], axis=1)
for name, values in df.iteritems():
    if '/' in name:
        df.drop([name], axis=1, inplace=True)
df2 = df.set_index(['Lat','Long_'])
print(df2.head())
lat = df2[df2["Lat"]]
print(lat)
long = df2[df2['Long_']]
Code is above. I got the data set from https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset - using the US deaths.
I have attached an image of the output. I do not know what this error means.
Apologies if worded ambiguously / incorrectly, or if there is a preexisting answer somewhere
When you define an index using one or more columns, e.g. via set_index(), these columns are promoted to index and by default no longer accessible using the df[<colname>] notation. This behavior can be changed with set_index(..., drop=False) but that's usually not necessary.
With the index in place, use df.loc[] to access single rows by their index value, aka label. See the pandas documentation on label-based indexing.
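For illustration, a minimal sketch with made-up coordinates:
import pandas as pd

df = pd.DataFrame({'Lat': [40.7, 34.1], 'Long_': [-74.0, -118.2], 'deaths': [5, 3]})
df2 = df.set_index(['Lat', 'Long_'])
print(df2.loc[(40.7, -74.0)])  # access a row by its (Lat, Long_) label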
To access the values of your MultiIndex as you would do with a column, you can use df.index.get_level_values(<colname>).array (or .to_numpy()). So in your case you could write:
lat = df2.index.get_level_values('Lat').array
print(lat)
long = df2.index.get_level_values('Long_').array
print(long)
BTW: read_csv() has a useful usecols argument that lets you specify which columns to load (others will be ignored).
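For example, a hedged sketch (assuming you only need the coordinate columns):
import pandas as pd

df = pd.read_csv('time_series_covid_19_deaths_US.csv', usecols=['Lat', 'Long_'])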

removing columns in a loop from different size dataframes [duplicate]

I am reading from an Excel sheet and I want to read certain columns: column 0 because it is the row-index, and columns 22:37. Now here is what I do:
import pandas as pd
import numpy as np
file_loc = "path.xlsx"
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], parse_cols=37)
df = pd.concat([df[df.columns[0]], df[df.columns[22:]]], axis=1)
But I would hope there is a better way to do that! I know I can do parse_cols=[0, 22,..,37], but for large datasets this doesn't make sense.
I also did this:
s = pd.Series(0)
s[1] = 22
for i in range(2, 14):
    s[i] = s[i-1] + 1
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], parse_cols=s)
But it reads the first 15 columns, which is the length of s.
You can use Excel column letters and ranges like this:
import pandas as pd
import numpy as np
file_loc = "path.xlsx"
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], usecols="A,C:AA")
print(df)
Corresponding documentation:
usecols : int, str, list-like, or callable, default None
If None, then parse all columns.
If str, then indicates comma separated list of Excel column letters and column ranges (e.g. “A:E” or “A,C,E:F”). Ranges are inclusive of both sides.
If list of int, then indicates list of column numbers to be parsed.
If list of string, then indicates list of column names to be parsed.
New in version 0.24.0.
If callable, then evaluate each column name against it and parse the column if the callable returns True.
Returns a subset of the columns according to behavior above.
New in version 0.24.0.
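For instance, a hedged sketch of the callable form (the 'Q' prefix is just an example):
df = pd.read_excel(file_loc, usecols=lambda name: str(name).startswith('Q'))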
Note that parse_cols is deprecated; use usecols instead:
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], usecols="A,C:AA")
"usecols" should help, use range of columns (as per excel worksheet, A,B...etc.)
below are the examples
1. Selected columns
df = pd.read_excel(file_location, sheet_name='Sheet1', usecols="A,C,F")
2. Range of columns plus a selected column
df = pd.read_excel(file_location, sheet_name='Sheet1', usecols="A:F,H")
3. Multiple ranges
df = pd.read_excel(file_location, sheet_name='Sheet1', usecols="A:F,H,J:N")
4. Range of columns
df = pd.read_excel(file_location, sheet_name='Sheet1', usecols="A:N")
If you know the column names and do not want to use letters (A, B, D) or positions (0, 4, 7), this also works:
df = pd.read_excel(url)[['name of column','name of column','name of column','name of column','name of column']]
where "name of column" is each wanted column. Names are case- and whitespace-sensitive.
Read any single column's data from Excel:
import pandas as pd

name_of_file = "test.xlsx"
data = pd.read_excel(name_of_file)
required_column_name = "Post test Number"
print(data[required_column_name])
Unfortunately these methods still seem to read and convert the headers before returning the subselection. I have an Excel sheet with duplicate header names because the sheet contains several similar tables. I want to read those tables individually, so I would want to apply usecols. However, this still add suffixes to the duplicate column names.
To reproduce:
create an Excel sheet with headers named Header1, Header2, Header1, Header2 under columns A, B, C, D
pd.read_excel(filename, usecols='C:D')
df.columns will return ['Header1.1', 'Header2.1']
Is there a way to circumvent this, aside from splitting and joining the resulting headers? Especially when it is unknown whether there are duplicate columns, it is tricky to rename them, since splitting on '.' may corrupt a non-duplicate header.
Edit: additionally, the length (in rows) of a DataFrame based on a subset of columns is determined by the length of the full file. So if column A has 10 rows and column B only has 5, a DataFrame generated by usecols='B' will have 10 rows, of which 5 are filled with NaNs.
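A hedged workaround sketch for the duplicate-header mangling (the file name is hypothetical): read without a header so pandas cannot rename duplicates, then promote the first row to column names manually.
import pandas as pd

df = pd.read_excel('sheet.xlsx', usecols='C:D', header=None)
df.columns = df.iloc[0]                  # raw header text, duplicates preserved
df = df.iloc[1:].reset_index(drop=True)  # drop the promoted header row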

How to fix autogenerated index in dataframe by real index after getting data from pd.read_html

I am not able to figure out how to set my dataframe's column index properly.
I tried some methods but could not find the right one.
import pandas as pd
df = pd.read_html('sbi.html')
data = df[1]
I want the second row (the one containing "Narration") to become the column index.
Set the header parameter to 1 so that the second row is used for column names:
data = pd.read_html('sbi.html', header=1)[0]
Or use the skiprows parameter to drop the first row before parsing:
data = pd.read_html('sbi.html', skiprows=1)[0]
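Note that pd.read_html returns a list of DataFrames, so check which element holds your table; a quick sketch:
import pandas as pd

tables = pd.read_html('sbi.html', header=1)
print(len(tables))    # how many tables were found
data = tables[1]      # the question used the second table
print(data.columns)   # should now include 'Narration'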

python pandas dataframe - can't figure out how to lookup an index given a value from a df

I have 2 dataframes of numerical data. Given a value from one of the columns in the second df, I would like to look up the index for the value in the first df. More specifically, I would like to create a third df, which contains only index labels - using values from the second to look up its coordinates from the first.
listso = [[21,101],[22,110],[25,113],[24,112],[21,109],[28,108],[30,102],[26,106],[25,111],[24,110]]
data = pd.DataFrame(listso,index=list('abcdefghij'), columns=list('AB'))
rollmax = pd.DataFrame(data.rolling(center=False,window=5).max())
So for the third df, I hope to take the values from rollmax and figure out which row of data they came from. We can call this third df indexlookup.
For example, rollmax.loc['j','A'] = 30, so indexlookup.loc['j','A'] = 'g'.
Thanks!
You can build a Series with the indexing the other way around:
mapA = pd.Series(data.index, index=data.A)
Then mapA[rollmax.loc['j','A']] gives 'g'.
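A hedged sketch extending this to the whole frame (it assumes you want the first label when a value occurs more than once, since lookups need unique keys):
import pandas as pd

listso = [[21,101],[22,110],[25,113],[24,112],[21,109],
          [28,108],[30,102],[26,106],[25,111],[24,110]]
data = pd.DataFrame(listso, index=list('abcdefghij'), columns=list('AB'))
rollmax = data.rolling(window=5).max()

def lookup_labels(col):
    # value -> label map for this column, keeping the first duplicate
    mapper = pd.Series(data.index, index=data[col.name])
    mapper = mapper[~mapper.index.duplicated(keep='first')]
    return col.map(mapper)

indexlookup = rollmax.apply(lookup_labels)
print(indexlookup.loc['j', 'A'])  # 'g'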

Change one column of a DataFrame only

I'm using Pandas with Python 3. I have a dataframe with a bunch of columns, but I only want to change the data type of all the values in one of the columns and leave the others alone. The only way I could find to accomplish this is to edit the column, remove the original column, and then merge the edited one back. I would like to edit the column without having to remove and merge, leaving the rest of the dataframe unaffected. Is this possible?
Here is my solution now:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
def make_float(var):
    var = float(var)
    return var
#create a new dataframe with the value types I want
df2 = df1['column'].apply(make_float)
#remove the original column
df3 = df1.drop('column',1)
#merge the dataframes
df1 = pd.concat([df3,df2],axis=1)
It also doesn't work to apply the function to the dataframe directly. For example:
df1['column'].apply(make_float)
print(type(df1.iloc[1]['column']))
yields:
<class 'str'>
df1['column'] = df1['column'].astype(float)
It will raise an error if conversion fails for some row.
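If some rows might fail to convert, a hedged alternative is pd.to_numeric with errors='coerce', which turns unparseable values into NaN instead of raising:
df1['column'] = pd.to_numeric(df1['column'], errors='coerce')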
Apply does not work in place, but rather returns a new Series that you discard in this line:
df1['column'].apply(make_float)
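Assigning the result back fixes it:
df1['column'] = df1['column'].apply(make_float)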
Apart from Yakym's solution, you can also do this:
df['column'] += 0.0
Note that this only upcasts an already-numeric (e.g. integer) column to float; it will not convert strings.
