Issue Creating Data Frame out of Columns Pandas - Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
df = pd.read_csv('time_series_covid_19_deaths_US.csv')
df = df.drop(['UID','iso2','iso3','code3','FIPS','Admin2','Combined_Key'], axis=1)
for name, values in df.iteritems():
    if '/' in name:
        df.drop([name], axis=1, inplace=True)
df2 = df.set_index(['Lat','Long_'])
print(df2.head())
lat = df2[df2["Lat"]]
print(lat)
long = df2[df2['Long_']]
The code is above. I got the dataset from https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset, using the US deaths file.
I have attached an image of the output. I do not know what this error means.
Apologies if this is worded ambiguously or incorrectly, or if there is a preexisting answer somewhere.

When you define an index using one or more columns, e.g. via set_index(), these columns are promoted to index and by default no longer accessible using the df[<colname>] notation. This behavior can be changed with set_index(..., drop=False) but that's usually not necessary.
With the index in place, use df.loc[] to access single rows by their index value, aka their label; see the pandas documentation on label-based indexing.
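For example, a minimal sketch with made-up coordinates (the values are illustrative, not taken from the dataset):
import pandas as pd

# Toy frame standing in for the COVID data; all values are made up
df = pd.DataFrame({'Lat': [34.2, 30.3],
                   'Long_': [-86.1, -97.7],
                   'Population': [55869, 2227]})
df2 = df.set_index(['Lat', 'Long_'])

# With a MultiIndex, rows are addressed by (Lat, Long_) label tuples
print(df2.loc[(34.2, -86.1)])  # selects the row with those index labels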
To access the values of your MultiIndex as you would do with a column, you can use df.index.get_level_values(<colname>).array (or .to_numpy()). So in your case you could write:
lat = df2.index.get_level_values('Lat').array
print(lat)
long = df2.index.get_level_values('Long_').array
print(long)
BTW: read_csv() has a useful usecols argument that lets you specify which columns to load (others will be ignored).
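For example (a sketch; it assumes these column names exist in the CSV header):
df = pd.read_csv('time_series_covid_19_deaths_US.csv',
                 usecols=['Lat', 'Long_', 'Population'])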

Related

How to create a hexbin plot from a pandas dataframe

I have this dataframe:
! curl -O https://raw.githubusercontent.com/msu-cmse-courses/cmse202-S21-student/master/data/Dataset.data
import pandas as pd
import matplotlib.pyplot as plt

# I read it in
data = pd.read_csv("Dataset.data", delimiter=' ', header=None)
#Now I want to add column titles to the file so I add them
data.columns = ['sex','length','diameter','height','whole_weight','shucked_weight','viscera_weight','shell_weight','rings']
print(data)
Now I want to grab the x variable column shell_weight and the y variable column rings and graph them as a hexbin plot using plt.hexbin:
df = pd.DataFrame(data)
plt.hexbin(x='shell_weight', y='rings')
For some reason, when I try to plot, it is not working:
ValueError: First argument must be a sequence
Can anyone help me graph these 2 variables?
The issue with plt.hexbin(x='shell_weight', y='rings') is that matplotlib doesn't know what shell_weight and rings are supposed to be. It doesn't know about df unless you specify it.
Since you already have a dataframe, it's simplest to plot with pandas, but pure matplotlib is still possible if you specify the source df:
df.plot.hexbin (simplest)
In this case, pandas will automatically infer the columns from df, so we can just pass the column names:
df.plot.hexbin(x='shell_weight', y='rings') # pandas infers the df source
plt.hexbin
With pure matplotlib, either pass the actual columns:
plt.hexbin(x=df.shell_weight, y=df.rings) # actual columns, not column names
# ^^^ ^^^
Or pass the column names while specifying the data source:
plt.hexbin(x='shell_weight', y='rings', data=df) # column names with df source
# ^^^^^^^
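Putting it together, a minimal sketch using the pandas route (gridsize is an optional tweak, not part of the original question; it assumes `data` was loaded and renamed as in the question):
import matplotlib.pyplot as plt

data.plot.hexbin(x='shell_weight', y='rings', gridsize=25)
plt.show()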

How can I get the difference between values in a Pandas dataframe grouped by another field?

I have a CSV of data I've loaded into a dataframe that I'm trying to massage: I want to create a new column that contains the difference from one record to another, grouped by another field.
Here's my code:
import pandas as pd
import matplotlib.pyplot as plt
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv'
all_counties = pd.read_csv(url, dtype={"fips": str})
all_counties.date = pd.to_datetime(all_counties.date)
oregon = all_counties.loc[all_counties['state'] == 'Oregon']
oregon.set_index('date', inplace=True)
oregon.sort_values('county', inplace=True)
# This is not working; I was hoping to find the differences from one day to another on a per-county basis
oregon['delta'] = oregon.groupby(['state','county'])['cases'].shift(1, fill_value=0)
oregon.tail()
Unfortunately, I'm getting results where the delta is always the same as the cases.
I'm new at Pandas and relatively inexperienced with Python, so bonus points if you can point me towards how to best read the documentation.
Let's try diff() instead: shift() only returns the previous row's value (which your code assigns directly), whereas diff() computes the difference between consecutive rows within each group.
oregon['delta'] = oregon.groupby(['state','county'])['cases'].diff().fillna(0)
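To see what diff() does per group, here is a toy example with made-up numbers:
import pandas as pd

df = pd.DataFrame({'county': ['Baker', 'Baker', 'Baker', 'Benton', 'Benton'],
                   'cases':  [1, 3, 6, 2, 5]})
df['delta'] = df.groupby('county')['cases'].diff().fillna(0)
print(df)
# delta is 0, 2, 3 for Baker and 0, 3 for Benton:
# each row minus the previous row within the same county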

Using str.startswith to access a dataframe slice

I have a dataframe with temperature values over the years. What I want to do is put all the rows that are from the year 2015 into a new dataframe. Currently, the Date column is an object type, with the str format looking like this: YYYY-MM-DD
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv("C:\\whatever\weather.csv")
weather_2015 = df.loc[df.Date == df.Date.str.startswith("2015"), :]
weather_2015.head()
This is what the data looks like in the main dataframe (see the attached image).
NOTE: if I do something like
weather_2015 = df.loc[df.Date == "2015-02-03", :]
weather_2015.head()
I get what I'd expect: only dates that match 2015-02-03
pd.Series.str.startswith returns a boolean mask; you don't need to compare it to df.Date again. You can just index with it directly:
weather_2015 = df[df.Date.str.startswith("2015")]
You don't even need .loc here.
Note that if you want to make changes to this slice, you might prefer a copy, in which case you should call .copy():
weather_2015 = df[df.Date.str.startswith("2015")].copy()
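As a quick illustration of the boolean mask (the dates are made up):
import pandas as pd

df = pd.DataFrame({'Date': ['2014-12-31', '2015-01-01', '2015-06-15'],
                   'Temp': [1.2, 3.4, 5.6]})
mask = df.Date.str.startswith('2015')
print(mask.tolist())  # [False, True, True]
weather_2015 = df[mask].copy()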

Data Frame Indexing

Using Python 3, I wrote some code for calculating data. The code is as follows:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
def data(symbols):
    dates = pd.date_range('2016/01/01', '2016/12/23')
    df = pd.DataFrame(index=dates)
    for symbol in symbols:
        df_temp = pd.read_csv("/home/furqan/Desktop/Data/{}.csv".format(symbol),
                              index_col='Date', parse_dates=True,
                              usecols=['Date', 'Close'], na_values=['nan'])
        df_temp = df_temp.rename(columns={'Close': symbol})
        df = df.join(df_temp)
    df = df.fillna(method='ffill')
    df = df.fillna(method='bfill')
    df = df / df.ix[0, :]
    return df
symbols = ['FABL','HINOON']
df=data(symbols)
print(df)
p_value = np.zeros((2, 2), dtype="float")
p_value[0, 0] = 0.5
p_value[1, 1] = 0.5
print(df.shape[1])
print(p_value.shape[0])
df=np.dot(df,p_value)
print(df.shape[1])
print(df.shape[0])
print(df)
When I print df the second time, the index has vanished. I think the issue is due to the matrix multiplication. How can I get the index and column headings back into df?
The issue is that you are using a NumPy function: np.dot() returns a plain NumPy array, so the existing index and column labels are lost.
So instead of
df=np.dot(df,p_value)
you can do
df=df.dot(p_value)
Additionally, because p_value is a plain NumPy array, it carries no column names, so you can either create it as a DataFrame using the existing column names:
p_value = pd.DataFrame(np.zeros((2, 2), dtype="float"), columns=df.columns)
or just overwrite the column names directly after calculating the dot product like so:
df.columns = ['FABL', 'HINOON']
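A minimal sketch (with made-up prices) showing that DataFrame.dot() keeps the index where np.dot() would not:
import numpy as np
import pandas as pd

df = pd.DataFrame({'FABL': [1.0, 1.1], 'HINOON': [1.0, 0.9]},
                  index=pd.date_range('2016-01-01', periods=2))
p_value = np.zeros((2, 2))
p_value[0, 0] = p_value[1, 1] = 0.5

out = df.dot(p_value)  # DatetimeIndex preserved; columns become 0 and 1
out.columns = ['FABL', 'HINOON']  # restore the column names
print(out)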

Change one column of a DataFrame only

I'm using Pandas with Python 3. I have a dataframe with a bunch of columns, but I only want to change the data type of all the values in one of the columns and leave the others alone. The only way I could find to accomplish this is to edit the column, remove the original column, and then merge the edited one back. I would like to edit the column without having to remove and merge, leaving the rest of the dataframe unaffected. Is this possible?
Here is my current solution:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
def make_float(var):
    var = float(var)
    return var
#create a new dataframe with the value types I want
df2 = df1['column'].apply(make_float)
#remove the original column
df3 = df1.drop('column',1)
#merge the dataframes
df1 = pd.concat([df3,df2],axis=1)
It also doesn't work to apply the function to the dataframe directly. For example:
df1['column'].apply(make_float)
print(type(df1.iloc[1]['column']))
yields:
<class 'str'>
The simplest way is to convert the column and assign it back:
df1['column'] = df1['column'].astype(float)
Note that astype will raise an error if the conversion fails for some row.
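If some rows might not parse, pd.to_numeric with errors='coerce' turns them into NaN instead of raising (an alternative not in the original answer):
import pandas as pd

df1 = pd.DataFrame({'column': ['1.5', '2.0', 'bad']})
df1['column'] = pd.to_numeric(df1['column'], errors='coerce')  # 'bad' becomes NaN
print(df1['column'].dtype)  # float64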
apply() does not work in place; it returns a new Series, which you discard in this line:
df1['column'].apply(make_float)
To keep the result, assign it back:
df1['column'] = df1['column'].apply(make_float)
Apart from Yakym's solution, you can also do this (it works when the column is already numeric, e.g. int, and you want float):
df['column'] += 0.0
