Using str.startswith to access a dataframe slice - python

I have a dataframe with temperature values over the years. What I want to do is put all the rows from the year 2015 into a new dataframe. Currently, the Date column is an object (str) type with the format YYYY-MM-DD:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv("C:\\whatever\weather.csv")
weather_2015 = df.loc[df.Date == df.Date.str.startswith("2015"), :]
weather_2015.head()
This is what the data looks like in the main dataframe (screenshot omitted).
NOTE: if I do something like
weather_2015 = df.loc[df.Date == "2015-02-03", :]
weather_2015.head()
I get what I'd expect: only the rows whose date matches 2015-02-03.

pd.Series.str.startswith returns a boolean mask; you don't need to compare it to df.Date again. You can index with it directly:
weather_2015 = df[df.Date.str.startswith("2015")]
You don't even need .loc here.
Note that if you want to make changes on this slice, you might prefer a copy, in which case you should call .copy():
weather_2015 = df[df.Date.str.startswith("2015")].copy()
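Alternatively, if you'd rather work with real datetimes (a sketch, assuming the Date strings all parse cleanly as YYYY-MM-DD), you can convert the column once and filter on the year:
df['Date'] = pd.to_datetime(df['Date'])  # object dtype -> datetime64
weather_2015 = df[df['Date'].dt.year == 2015].copy()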

Related

Issue Creating Data Frame out of Columns Pandas - Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')

df = pd.read_csv('time_series_covid_19_deaths_US.csv')
df = df.drop(['UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Admin2', 'Combined_Key'], axis=1)

# drop the per-date columns (their names contain '/')
for name, values in df.iteritems():
    if '/' in name:
        df.drop([name], axis=1, inplace=True)

df2 = df.set_index(['Lat', 'Long_'])
print(df2.head())

lat = df2[df2["Lat"]]   # <-- this is the line that raises the error
print(lat)
long = df2[df2['Long_']]
Code is above. I got the data set from https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset - using the US deaths.
I have attached an image of the output. I do not know what this error means.
Apologies if this is worded ambiguously or incorrectly, or if there is a preexisting answer somewhere.
When you define an index using one or more columns, e.g. via set_index(), these columns are promoted to the index and are by default no longer accessible using the df[<colname>] notation. This behavior can be changed with set_index(..., drop=False), but that's usually not necessary.
With the index in place, use df.loc[] to access single rows by their index value, aka label (see the pandas documentation on label-based indexing).
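For example, with the (Lat, Long_) MultiIndex built above (the coordinate values here are hypothetical):
row = df2.loc[(34.05, -118.24)]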
To access the values of your MultiIndex as you would do with a column, you can use df.index.get_level_values(<colname>).array (or .to_numpy()). So in your case you could write:
lat = df2.index.get_level_values('Lat').array
print(lat)
long = df2.index.get_level_values('Long_').array
print(long)
BTW: read_csv() has a useful usecols argument that lets you specify which columns to load (others will be ignored).
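For example (a sketch; the exact column names are assumptions, so check them against the CSV header):
df = pd.read_csv('time_series_covid_19_deaths_US.csv',
                 usecols=['Province_State', 'Lat', 'Long_'])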

Data Frame Indexing

Using Python 3, I wrote some code for calculating data. The code is as follows:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

def data(symbols):
    dates = pd.date_range('2016/01/01', '2016/12/23')
    df = pd.DataFrame(index=dates)
    for symbol in symbols:
        df_temp = pd.read_csv("/home/furqan/Desktop/Data/{}.csv".format(symbol),
                              index_col='Date', parse_dates=True,
                              usecols=['Date', 'Close'],
                              na_values=['nan'])
        df_temp = df_temp.rename(columns={'Close': symbol})
        df = df.join(df_temp)
    df = df.fillna(method='ffill')
    df = df.fillna(method='bfill')
    df = df / df.iloc[0, :]   # normalize by the first row (df.ix is deprecated)
    return df

symbols = ['FABL', 'HINOON']
df = data(symbols)
print(df)

p_value = np.zeros((2, 2), dtype="float")
p_value[0, 0] = 0.5
p_value[1, 1] = 0.5

print(df.shape[1])
print(p_value.shape[0])

df = np.dot(df, p_value)
print(df.shape[1])
print(df.shape[0])
print(df)
When I print df the second time, the index has vanished. I think the issue is due to the matrix multiplication. How can I get the index and column headings back into df?
Because you are using a numpy method, it returns a plain numpy array, which is why the existing column and index labels are lost.
So instead of
df=np.dot(df,p_value)
you can do
df=df.dot(p_value)
Additionally, because p_value is a plain numpy array, it carries no labels of its own. You can either build it as a DataFrame labelled with the existing column names (note that df.dot aligns df's columns against the other operand's index, so label both axes):
p_value = pd.DataFrame(np.zeros((2, 2), dtype="float"), index=df.columns, columns=df.columns)
or just overwrite the column names directly after calculating the dot product like so:
df.columns = ['FABL', 'HINOON']
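A minimal sketch of the label-preserving version (toy values; assumes the two-symbol df from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((3, 2)), columns=['FABL', 'HINOON'],
                  index=pd.date_range('2016-01-01', periods=3))
p_value = pd.DataFrame(np.eye(2) * 0.5, index=df.columns, columns=df.columns)
print(df.dot(p_value))  # the date index and column labels survive the multiplication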

Change one column of a DataFrame only

I'm using Pandas with Python 3. I have a dataframe with a bunch of columns, but I only want to change the data type of all the values in one of the columns and leave the others alone. The only way I could find to accomplish this is to edit the column, remove the original column, and then merge the edited one back. I would like to edit the column without having to remove and merge, leaving the rest of the dataframe unaffected. Is this possible?
Here is my solution now:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

def make_float(var):
    var = float(var)
    return var

# create a new Series with the value type I want
df2 = df1['column'].apply(make_float)
# remove the original column
df3 = df1.drop('column', axis=1)
# merge the dataframes back together
df1 = pd.concat([df3, df2], axis=1)
Applying the function to the column directly doesn't work either. For example:
df1['column'].apply(make_float)
print(type(df1.iloc[1]['column']))
yields:
<class 'str'>
df1['column'] = df1['column'].astype(float)
It will raise an error if conversion fails for some row.
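If some rows may fail to convert, an alternative (not part of the original answer) is pd.to_numeric, which can turn unparseable values into NaN instead of raising:
df1['column'] = pd.to_numeric(df1['column'], errors='coerce')  # bad values become NaN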
apply() does not work in place; rather, it returns a new Series that you discard in this line:
df1['column'].apply(make_float)
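Assign the result back to get the effect you wanted:
df1['column'] = df1['column'].apply(make_float)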
Apart from Yakym's solution, you can also do this (note this trick only works if the column is already numeric, e.g. it converts int to float):
df['column'] += 0.0

Python pandas plot time-series with gap

I am trying to plot a pandas DataFrame with Timestamp indices that has a time gap in its index. Using pandas.plot() results in linear interpolation between the last Timestamp of the former segment and the first Timestamp of the next. I do not want linear interpolation, nor do I want empty space between the two date segments. Is there a way to do that?
Suppose we have a DataFrame with Timestamp indices:
>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> df = pd.DataFrame(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
>>> df = df.cumsum()
Now let's take two time chunks of it and plot it:
>>> df = pd.concat([df['Jan 2000':'Aug 2000'], df['Jan 2001':'Aug 2001']])
>>> df.plot()
>>> plt.show()
The resulting plot has an interpolation line connecting the Timestamps that enclose the gap. I cannot figure out how to upload pictures on this machine, but these pictures from Google Groups show my problem (interpolated.jpg, no-interpolation.jpg, and no-gaps.jpg). I can recreate the first as shown above. The second is achievable by replacing all gap values with NaN (see also this question). How can I achieve the third version, where the time gap is omitted?
Try:
df.plot(x=df.index.astype(str))
You may want to customize ticks and tick labels.
EDIT
That works for me using pandas 0.17.1 and numpy 1.10.4.
All you really need is a way to convert the DatetimeIndex to another type which is not datetime-like. In order to get meaningful labels I chose str. If x=df.index.astype(str) does not work with your combination of pandas/numpy/whatever you can try other options:
df.index.to_series().dt.strftime('%Y-%m-%d')
df.index.to_series().apply(lambda x: x.strftime('%Y-%m-%d'))
...
I realized that resetting the index is not necessary so I removed that part.
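As a concrete illustration of the tick customization mentioned above (a sketch, assuming the concatenated df from the question):
labels = df.index.astype(str)
ax = df.reset_index(drop=True).plot()  # integer x-axis: no datetime scale, so no gap
ax.set_xticks(range(0, len(df), 90))   # label roughly every three months
ax.set_xticklabels(labels[::90], rotation=45)
plt.show()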
In my case I had a DatetimeIndex rather than Timestamp values, but the following works for me in pandas 0.24.2 to eliminate the time-series gaps, after converting the index values to strings:
df = pd.read_sql_query(sql, sql_engine)
df.set_index('date', inplace=True)
df.index = df.index.map(str)

Pandas filtering - between_time on a non-index column

I need to filter out data with specific hours. The DataFrame method between_time seems to be the proper way to do that; however, it only works on the index of the dataframe, and I need the data in its original format (e.g. pivot tables will expect the datetime column under its proper name, not as the index).
This means that each filter looks something like this:
df.set_index(keys='my_datetime_field').between_time('8:00','21:00').reset_index()
Which implies that there are two reindexing operations every time such a filter is run.
Is this a good practice or is there a more appropriate way to do the same thing?
Create a DatetimeIndex, but store it in a variable, not the DataFrame.
Then call its indexer_between_time method. This returns an integer array, which can then be used to select rows from df using iloc:
import pandas as pd
import numpy as np

N = 100
df = pd.DataFrame(
    {'date': pd.date_range('2000-1-1', periods=N, freq='H'),
     'value': np.random.random(N)})

index = pd.DatetimeIndex(df['date'])
df.iloc[index.indexer_between_time('8:00', '21:00')]
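An alternative sketch that avoids building an index at all: compare the time component of the column directly (same toy df as above):
from datetime import time

mask = df['date'].dt.time.between(time(8, 0), time(21, 0))
df[mask]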
