How should I reduce the computing time in pandas on Kaggle? - python

I am working on the 2019 Data Science Bowl. The training and testing data take a long time to read with pandas, and I want to reduce that time so the machine can run the analysis efficiently.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as py
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
keep_cols = ['event_id', 'game_session', 'installation_id', 'event_count', 'event_code', 'title', 'game_time', 'type', 'world']
specs_df = pd.read_csv('/kaggle/input/data-science-bowl-2019/specs.csv')
train_df = pd.read_csv('/kaggle/input/data-science-bowl-2019/train.csv',usecols=keep_cols)
test_df = pd.read_csv('/kaggle/input/data-science-bowl-2019/test.csv')
train_labels_df = pd.read_csv('/kaggle/input/data-science-bowl-2019/train_labels.csv')

Pandas' read_csv method has a chunksize argument that returns an iterator yielding a certain number of rows at a time. This is useful for very large datasets, where you can train on a smaller subset of the data iteratively.
More information on iterating through files is described in the documentation here.
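For example, here is a minimal sketch of reading train.csv in chunks while keeping only the needed columns (the chunk size of 1,000,000 rows and the filter on Assessment events are arbitrary illustrations, not part of the original answer):
import pandas as pd
keep_cols = ['event_id', 'game_session', 'installation_id', 'event_count', 'event_code', 'title', 'game_time', 'type', 'world']
# read_csv with chunksize returns an iterator of DataFrames, each holding at most chunksize rows
chunks = pd.read_csv('/kaggle/input/data-science-bowl-2019/train.csv', usecols=keep_cols, chunksize=1_000_000)
processed = []
for chunk in chunks:
    # keep only the rows needed for this step (e.g. Assessment events) to shrink memory usage
    processed.append(chunk[chunk['type'] == 'Assessment'])
train_df = pd.concat(processed, ignore_index=True)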

Related

How can I fix this Import Error Seaborn Heatmap?

I am new to Python and I am trying to create a heatmap from a pivot table. Below is the code I am using. There are NaN values in my df, but I replaced them with 0's and I still get the same error. I have also read that this can happen when a file is named the same as a module in Python's standard library, but I do not recall doing this or know how to change it.
sns.heatmap(pivot, annot = True)
plt.yticks(rotation=0)
plt.xticks(rotation=90)
plt.show()
The error looks like this:
ImportError: cannot import name 'roperator'
These are the imports I have done:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
import matplotlib.pyplot as plt
import seaborn as sns

How to create a dataframe of one index of a dataset?

I have a dataset NearGrid with dimensions (index:225, time:25933) that contains daily temperature data for 225 locations.
How can I create a dataframe for the first location (index=0) where the columns are date and tmax and each row represents one day of data (i.e. 25933 rows x 2 columns)?
Here's what I'm trying:
#import libraries
import os
import matplotlib.pyplot as plt
from netCDF4 import Dataset as netcdf_dataset
import numpy as np
from cartopy import config
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import xarray as xr
import pandas as pd
#open netcdf files
df=xr.open_mfdataset('/glacier1/mmartin/data/NOAATmax/Tmax****.nc')
#open cm stations csv and create new dataset with only points closest to stations in CMStations
CMStations=pd.read_csv('Slope95.csv')
Lat=CMStations.lat
Lon=CMStations.lon
NearGrid=df.sel(lat=Lat.to_xarray(), lon=Lon.to_xarray(), method='nearest')
#create dataframe of first location in NearGrid
NearGrid.isel(index=0).to_dataframe()
but when I do this the code runs indefinitely and nothing happens.
The problem was the way the data was chunked. When I saved the subsetted data as a new netcdf file and then opened it in a new notebook, it worked. I did that as follows:
#import libraries
import os
import matplotlib.pyplot as plt
from netCDF4 import Dataset as netcdf_dataset
import numpy as np
from cartopy import config
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import xarray as xr
import pandas as pd
#open netcdf files
df=xr.open_mfdataset('/glacier1/mmartin/data/NOAATmax/Tmax****.nc')
#open cm stations csv and create new dataset with only points closest to stations in CMStations
CMStations=pd.read_csv('Slope95.csv')
Lat=CMStations.lat
Lon=CMStations.lon
NearGrid=df.sel(lat=Lat.to_xarray(), lon=Lon.to_xarray(), method='nearest')
#save as new netcdf file
NearGrid.to_netcdf('/glacier1/mmartin/data/NOAATmax/Tmax_CMStations_19510101-20211231.nc')
I then opened this file in a new notebook and manipulated the data there
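For completeness, a minimal sketch of what that follow-up notebook might look like (the variable name tmax inside the file is an assumption based on the file naming; adjust it to whatever the dataset actually contains):
import xarray as xr
# the subsetted file is small, so no multi-file chunking is involved here
NearGrid = xr.open_dataset('/glacier1/mmartin/data/NOAATmax/Tmax_CMStations_19510101-20211231.nc')
# first location only, then convert to a pandas DataFrame with date and tmax columns
first_location = NearGrid.isel(index=0).to_dataframe().reset_index()
tmax_df = first_location[['time', 'tmax']]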

Pyplot directly on yfinance object is fast. Pyplot on equivalent csv is slow

Pyplot directly on data from yfinance
Here's a little script which loads data into pyplot from yfinance:
import yfinance as yf
import matplotlib.pyplot as plt
data = yf.Ticker('MSFT').history(period="max").reset_index()[["Date", "Open"]]
plt.plot(data["Date"], data["Open"])
plt.show()
The UI loads quickly and is quite responsive.
Pyplot on equivalent data from csv
Here's a similar script which writes the data to CSV, loads it from CSV, then plots the graph using the data from CSV:
import yfinance as yf
import pandas as pd
import matplotlib.pyplot as plt
data = yf.Ticker('MSFT').history(period="max").reset_index()[["Date", "Open"]]
data.to_csv('out.csv')
from_csv = pd.read_csv('out.csv', index_col=0)
plt.plot(from_csv["Date"], from_csv["Open"])
plt.show()
This time:
The UI loads much more slowly
Zooming is slow
Panning is slow
Resizing is slow
The horizontal axis labels don't display clearly
Question
I'd like to avoid hitting the yfinance API each time the script is run so as to not burden their systems unnecessarily. (I'd include a bit more logic than in the script above which takes care of not accessing the API if a CSV is available. I kept the example simple for the sake of demonstration.)
Is there a way to get the CSV version to result in a pyplot UI that is as responsive as the direct-from-API version?
When loading from CSV, the Date column needs to be converted to an actual date value.
from_csv["Date"] = pd.to_datetime(from_csv["Date"])
Here's a fast version of the above script:
import yfinance as yf
import pandas as pd
import matplotlib.pyplot as plt
data = yf.Ticker('MSFT').history(period="max").reset_index()[["Date", "Open"]]
data.to_csv('out.csv')
from_csv = pd.read_csv('out.csv', index_col=0)
from_csv["Date"] = pd.to_datetime(from_csv["Date"])
plt.plot(from_csv["Date"], from_csv["Open"])
plt.show()
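Equivalently, the conversion can be done at read time with read_csv's parse_dates argument, which avoids the separate to_datetime step:
from_csv = pd.read_csv('out.csv', index_col=0, parse_dates=["Date"])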

Using Dask with Python causes issues when running Pandas code

I am trying to work with Dask because my dataframe has become so large that pandas by itself can't process it. I read my dataset in as follows and get a result that looks odd; I am not sure why it's not outputting the dataframe:
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
import dask.bag as db
import json
%matplotlib inline
Leads = db.read_text('Leads 6.4.18.txt')
Leads
This returns (instead of my pandas dataframe):
dask.bag<bag-fro..., npartitions=1>
Then when I try to rename a few columns:
Leads_updated = Leads.rename(columns={'Business Type':'Business_Type','Lender Type':'Lender_Type'})
Leads_updated
I get:
AttributeError: 'Bag' object has no attribute 'rename'
Can someone please explain what I am not doing correctly? The objective is to just use Dask for all these steps, since the data is too big for regular Python/pandas. My understanding is that the syntax used under Dask should be the same as pandas.
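For reference, db.read_text returns a Bag of raw text lines, which has no DataFrame methods such as rename; the pandas-like API lives in dask.dataframe. A minimal sketch, assuming the file is tab-delimited (the separator is a guess):
import dask.dataframe as dd
# dask.dataframe mirrors much of the pandas API and works out of core
Leads = dd.read_csv('Leads 6.4.18.txt', sep='\t')
# rename works here because this is a Dask DataFrame, not a Bag
Leads_updated = Leads.rename(columns={'Business Type': 'Business_Type', 'Lender Type': 'Lender_Type'})
Leads_updated.head()  # triggers computation on a small sample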

Plotting an ETF price for longer time period

I have the code below. If you run it, the graph shows price history for just one year. Can someone tell me how I can plot the SPY instrument for the whole period from 01.01.2008 until now?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas_datareader import data as web
from pylab import plt
plt.style.use('ggplot')
%matplotlib inline
spy=web.DataReader("SPY",data_source="google",start="2008-1-1")
spy["Close"].plot()
