pandas dataframe conditional selecting - python

I have a DataFrame and I'm trying to apply some ML algorithms to it.
I'm using pandas to handle it, but I'm having several problems:
As you can see in the 3rd cell, I have split Y into Ytr and Yts. After this, the DataFrame loses its column names. I've tried to name the column again, but it doesn't work.
In the 4th cell, I'm trying to use a conditional statement to create a subset of Y in which the Y values are 1 (it is named ytr1), but it returns an empty DataFrame.
Any suggestions on the whole code would be really appreciated, since I'm not very experienced with pandas.
Note: if you haven't worked with Jupyter Notebook, #%% just means a new cell.
#%%
from pandas import DataFrame as df
import random
import numpy as np
import pandas as pd
import re
#%%
# Preparing the DataFrame
labels = pd.read_csv(r'A:\Data Sets\Pima Indian Diabetes\labels.csv', header=None)
ll = labels.loc[:, 0].tolist()
data = pd.read_csv(r'A:\Data Sets\Pima Indian Diabetes\pima-indians-diabetes2.csv', names=ll)
i = data.columns.values.tolist() # i is the labels of the csv file
i[-1]
#%%
# Splitting the Dataset
X = data.drop(i[-1], axis=1)
Y = data.iloc[:, 8]
Y = Y.to_frame()
Y = pd.DataFrame(Y.values.reshape(-1, 1), columns=i[-1])
tr_idx = data.sample(frac=0.7).index
Xtr = df(X[X.index.isin(tr_idx)])
Xts = df(X[~X.index.isin(tr_idx)])
Ytr = df(Y[X.index.isin(tr_idx)], columns='result')
Yts = df(Y[~X.index.isin(tr_idx)], columns=i[-1])
#%%
# splitting the Classes
ytr1 = Ytr.drop(Ytr[Ytr.iloc[0]!=1].index)
X: all the columns except the label/class column, whose values are 0 or 1
Y: the last column of the CSV file, which holds the labels
Xtr: the fraction of X that I'm planning to use for training
Xts: the fraction of X that I'm planning to use for testing
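One possible way to do the split is sketched below, using a small synthetic frame in place of the Pima CSV (the column names glucose, bmi, and result are made up for illustration). Two things in the code above are worth noting: when the data passed to the DataFrame constructor is already a DataFrame, columns= selects existing columns by label rather than renaming them, and Ytr.iloc[0] is the first row, not the label column. Indexing with .loc and a boolean mask on the label column avoids both issues:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Pima data; column names are hypothetical.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "glucose": rng.integers(70, 200, size=20),
    "bmi": rng.uniform(18, 40, size=20),
    "result": [0, 1] * 10,
})

label = data.columns[-1]          # name of the last column
X = data.drop(columns=label)      # features
Y = data[[label]]                 # double brackets keep a one-column DataFrame

# Sample 70% of the row labels for training; the rest is the test set.
tr_idx = data.sample(frac=0.7, random_state=0).index
Xtr, Xts = X.loc[tr_idx], X.drop(tr_idx)
Ytr, Yts = Y.loc[tr_idx], Y.drop(tr_idx)

# A boolean mask on the label column selects the rows where the class is 1.
ytr1 = Ytr[Ytr[label] == 1]
```

Because Y is built with double brackets and sliced with .loc, the column name survives the split without any renaming step.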

Related

Confused about why my KNN code is throwing a ValueError

I am using sklearn for KNN regressor:
#importing libraries and data
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor as KNR
theta = pd.read_csv("train.csv")#pandas dataframe
#getting data wanted from theta and putting it in a new dataframe
a = theta.get("YearBuilt")
b = theta.get("YrSold")
A = a.to_frame()
B = b.to_frame()
glasses = [A,B]
x = pd.concat(glasses)
#getting target data
y = theta.get("SalePrice")
#using KNN
horses = KNR(n_neighbors = 3)
horses.fit(x,y)
I get this error message:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Could someone please explain this? My data is in the hundred thousands for the target and the thousands for the input, and there are no blanks in the data.
Before answering the question, let me refactor the code. You are using a dataframe, so you can index single or multiple fields of the dataframe without going through the extra steps you've used:
#importing libraries and data
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor as KNR
theta = pd.read_csv("train.csv") # pandas dataframe
#getting data wanted from theta and putting it in a new dataframe
x = theta[["YearBuilt", "YrSold"]] # index multiple fields
#getting target data
y = theta["SalePrice"] # index single field
#using KNN
horses = KNR(n_neighbors = 3)
horses.fit(x,y) # fit KNN
Regarding your error, it indicates that your data contains NaN, infinite, or very large values. You can make sure these don't occur by filtering out the NaN and infinite values in the columns you use, before selecting x and y:
import numpy as np
theta = theta.replace([np.inf, -np.inf], np.nan)
theta.dropna(subset=["YearBuilt", "YrSold", "SalePrice"], inplace=True)
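One likely source of the NaN values, assuming the CSV itself is complete as the asker says: pd.concat stacks along rows by default, so concatenating two single-column frames with different column names yields twice as many rows, with each column padded with NaN. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Tiny stand-in for train.csv; the values are invented.
theta = pd.DataFrame({
    "YearBuilt": [2003, 1976, 2001],
    "YrSold": [2008, 2007, 2008],
})

A = theta["YearBuilt"].to_frame()
B = theta["YrSold"].to_frame()

stacked = pd.concat([A, B])               # default axis=0: rows are stacked
side_by_side = pd.concat([A, B], axis=1)  # axis=1: columns are placed side by side

# Shape and total NaN count for each result.
print(stacked.shape, int(stacked.isna().sum().sum()))
print(side_by_side.shape, int(side_by_side.isna().sum().sum()))
```

Selecting both columns at once with theta[["YearBuilt", "YrSold"]], as in the refactored code, avoids the problem entirely.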

How to input dataset while using Salesforce-merlion package for timeseries forecasting

I have installed the Salesforce Merlion package in my conda environment. Now I want to use my own dataset to run a forecasting algorithm. I only need to forecast one univariate series, but I cannot figure out how to do that, because there are some variables that I don't know how to initialize. The example provided on GitHub uses a dataset that is already split. Can someone help me out here?
The GitHub example for forecasting looks like this:
from merlion.utils import TimeSeries
from ts_datasets.forecast import M4
# Data loader returns pandas DataFrames, which we convert to Merlion TimeSeries
time_series, metadata = M4(subset="Hourly")[0]
train_data = TimeSeries.from_pd(time_series[metadata.trainval])
test_data = TimeSeries.from_pd(time_series[~metadata.trainval])
The complete code with internal dataset is available in the following link:
https://github.com/salesforce/Merlion/tree/main/examples/forecast
(Here they are using their internal dataset M4)
Now, I have to use my dataset. So my code is like this:
from merlion.utils import TimeSeries
df = pd.read_csv(r'C:\Users\Doyel_De_Sarkar\Desktop\forecasting\15786_GIK.csv')
df.dropna(inplace=True)
df['ts'] = pd.to_datetime(df['ts'])
df.sort_values('ts', inplace=True)
trainval = []
for i in range(len(df)):
    if i <= (round((len(df)*0.75),0)):
        trainval.append(True)
    else:
        trainval.append(False)
df['trainval'] = trainval
df = df.drop(columns=['wday', 'hour'])
from merlion.utils import UnivariateTimeSeries
kpi = UnivariateTimeSeries(
    time_stamps=df.ts, # timestamps in units of seconds
    values=df.saps_total, # time series values
    name="kpi" # optional: a name for this univariate
)
kpi_label = UnivariateTimeSeries(
    time_stamps=df.ts, # timestamps in units of seconds
    values=df.trainval # time series values
)
from merlion.utils import TimeSeries
time_series, metadata = kpi, kpi_label
train_data = TimeSeries.from_pd(time_series[metadata.trainval])
test_data = TimeSeries.from_pd(time_series[~metadata.trainval])
I am getting the following error:
'UnivariateTimeSeries' object has no attribute 'trainval'
at this line:
train_data = TimeSeries.from_pd(time_series[metadata.trainval])
You're getting this error because trainval is not an attribute of the UnivariateTimeSeries class. In the GitHub example that you shared, metadata is a pandas DataFrame, but you're constructing a UnivariateTimeSeries object out of kpi_label.
I'm not sure exactly what your dataset looks like, but try using:
kpi_labels = df.trainval
instead.
Thank you, SalmonKiller, for taking the time to look into the issue. The dataset used in the GitHub example has a rather unusual data structure, so I had to create the trainval column and set metadata to df[['trainval']]. The UnivariateTimeSeries objects I had created were of no use. The issue was with indexing: after I set the timestamp column as the index, the issue was solved.
Here is the code that now runs fine:
import os
import numpy as np
import pandas as pd
from merlion.models.forecast.smoother import MSESConfig, MSES
from merlion.transform.resample import TemporalResample
from merlion.utils import TimeSeries
df = pd.read_csv(r'<file.csv>')
df['ts'] = pd.to_datetime(df['ts'])
df.set_index('ts', inplace=True)
df.sort_values('ts', inplace=True)
hours = pd.date_range(start=df.index[0], end=df.index[-1], freq='H')
mean = df.saps_total.mean()
df = df.reindex(hours, fill_value=mean)
trainval = []
for i in range(len(df)):
    if i <= (round((len(df)*0.75),0)):
        trainval.append(True)
    else:
        trainval.append(False)
df['trainval'] = trainval
df = df.drop(columns=['wday', 'hour'])
from merlion.utils import TimeSeries
time_series = df[['saps_total']]
metadata = df[['trainval']]
train_data = TimeSeries.from_pd(time_series[metadata.trainval])
test_data = TimeSeries.from_pd(time_series[~metadata.trainval])
from merlion.models.forecast.arima import Arima, ArimaConfig
config1 = ArimaConfig(max_forecast_steps=len(time_series[~metadata.trainval].index), order=(0, 1, 0),
                      transform=TemporalResample(granularity="1h"))
model1 = Arima(config1)
model1.train(train_data=train_data)
test_pred, test_err = model1.forecast(time_stamps=test_data.time_stamps)
print(test_pred)
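The loop that builds trainval can also be written as a single vectorized comparison. This is a sketch on a synthetic hourly series (the values are invented; only the saps_total column name is taken from the code above), and it involves pandas and NumPy only, not Merlion:

```python
import numpy as np
import pandas as pd

# Synthetic hourly series standing in for the real CSV.
index = pd.date_range("2021-01-01", periods=100, freq="h")
df = pd.DataFrame({"saps_total": np.arange(100, dtype=float)}, index=index)

# True for positions up to and including round(0.75 * len(df)), as in the loop.
cutoff = round(len(df) * 0.75)
df["trainval"] = np.arange(len(df)) <= cutoff

# One-column frames for the training and test portions.
train_df = df.loc[df["trainval"], ["saps_total"]]
test_df = df.loc[~df["trainval"], ["saps_total"]]
```

The resulting train_df and test_df are plain DataFrames with a DatetimeIndex, which is the shape the working code above passes to TimeSeries.from_pd.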

pandas.read_csv() returns strings from columns instead of numbers

I am trying to make a linear regression plot for the data provided:
import pandas
from pandas import DataFrame
import matplotlib.pyplot
data = pandas.read_csv('cost_revenue_clean.csv')
data.describe()
X = DataFrame(data,columns=['production_budget_usd'])
y = DataFrame(data,columns=['worldwide_gross_usd'])
when I try to plot it
matplotlib.pyplot.scatter(X,y)
matplotlib.pyplot.show()
the plot is completely empty,
and when I print the type of each element of X
for element in X:
    print(type(element))
it shows that the type is string. Where am I going wrong?
No need to make new DataFrames for X and y. Try astype(float) if you want them as numeric:
X = data['production_budget_usd'].astype(float)
y = data['worldwide_gross_usd'].astype(float)
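Note that iterating over a DataFrame yields its column labels, which are strings, so the loop in the question prints str regardless of what the column holds. Inspecting dtypes is a more direct check. The sketch below uses invented values; if a column does turn out to hold text, pd.to_numeric with errors="coerce" is one option, turning unparseable entries into NaN:

```python
import pandas as pd

# Invented stand-in for the CSV; one column is deliberately stored as text.
data = pd.DataFrame({
    "production_budget_usd": ["1000", "2500", "n/a"],
    "worldwide_gross_usd": [3000.0, 4000.0, 5000.0],
})

# Iterating a DataFrame gives column labels, not cell values.
labels = [element for element in data]

# dtypes reports how each column is actually stored.
print(data.dtypes)

# Convert text to numbers; entries that cannot be parsed become NaN.
budget = pd.to_numeric(data["production_budget_usd"], errors="coerce")
```

Unlike astype(float), which raises on an entry such as "n/a", this conversion keeps going and leaves a NaN that can be inspected or dropped afterwards.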

Split Pandas Dataframe With Equal Amount of Rows for each Column Value

This is for a machine learning project.
I have a CSV file which I have read in as a Pandas dataframe. The CSV looks like this:
id,label
f38a6374c348f90b587e046aac6079959adf3835,0
c18f2d887b7ae4f6742ee445113fa1aef383ed77,1
755db6279dae599ebb4d39a9123cce439965282d,0
bc3f0c64fb968ff4a8bd33af6971ecae77c75e08,0
068aba587a4950175d04c680d38943fd488d6a9d,0
acfe80838488fae3c89bd21ade75be5c34e66be7,0
a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da,1
7f6ccae485af121e0b6ee733022e226ee6b0c65f,1
559e55a64c9ba828f700e948f6886f4cea919261,0
8eaaa7a400aa79d36c2440a4aa101cc14256cda4,0
...
[220025 rows x 2 columns]
I have reduced the sample size and balanced the data, so that I have a dataframe with 60,000 rows: 30,000 rows with label 1 and 30,000 with label 0. I now want to split the dataframe into two, with one dataframe having 50,000 rows and the other having 10,000, but I want each dataframe to have an equal number of rows with label 1 and label 0.
There are some longer solutions, such as splitting the dataframe by label, using .sample(frac=...) to make two dataframes from each part, and then merging alternate ones, but that is unnecessarily complex.
Is there any method to split the dataframe so that each part has an equal number of rows for each label, but the two parts have different total row counts?
Here is the code I have used:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import cv2
import random
df = pd.read_csv("../input/histopathologic-cancer-detection/train_labels.csv")
ones_subset = df.loc[df["label"] == 1, :]
num_ones = len(ones_subset)
zeros_subset = df.loc[df["label"] == 0, :]
sampled_zeros = zeros_subset.sample(num_ones)
print(num_ones)
print(sampled_zeros)
df = pd.concat([ones_subset, sampled_zeros], ignore_index=True)
df = df.groupby("label").sample(30000).sample(frac=1).reset_index(drop=True)
print(df)
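Try scikit-learn's train_test_split with the stratify argument. Here x and y are the id and label columns of df. Note that test_size=0.16 gives about 9,600 test rows out of 60,000, while an integer such as test_size=10000 requests exactly that many:
from sklearn.model_selection import train_test_split
x, y = df[['id']], df['label']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.16, random_state=19, stratify=df['label'])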
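If scikit-learn is not available, a pandas-only sketch of the same idea is to sample the same number of rows from each label for the smaller frame and use the remainder as the larger one. The frame below is synthetic and scaled down (60 rows, split 50/10); id and label follow the question's column names, and the id values are invented:

```python
import pandas as pd

# Synthetic balanced frame: 30 rows with label 0 and 30 with label 1.
df = pd.DataFrame({
    "id": [f"img_{n}" for n in range(60)],
    "label": [0, 1] * 30,
})

# Draw 5 rows per label for the small frame (groupby sampling needs pandas >= 1.1).
small = df.groupby("label").sample(n=5, random_state=19)

# Everything not drawn goes to the large frame.
large = df.drop(small.index)

print(small["label"].value_counts().to_dict())
print(large["label"].value_counts().to_dict())
```

For the sizes in the question, n=5000 would give a 10,000-row frame with 5,000 rows per label, leaving 50,000 rows with 25,000 per label.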

Loading the Iris Dataset in from a CSV file?

I am practicing data processing with scikit-learn, and I'm looking at classification probability. I've successfully run the model using the data set imported from the library; now I want to do the same thing with a CSV file, so I've downloaded the same dataset and am trying to load it into my code.
iris = np.loadtxt('./iris.csv', delimiter=',', skiprows=1)
X = iris.data[:, 0:2]
y = iris.target
However, I get an error stating ValueError: could not convert string to float: 'setosa'. I understand that this comes from the CSV, as it is the name of the flower. Is there another way to import this CSV file that avoids this issue?
For this you can use pandas:
import pandas
data = pandas.read_csv("iris.csv")
data.head() # to see first 5 rows
X = data.drop(["target"], axis = 1)
Y = data["target"]
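Or you can try NumPy's genfromtxt, although I would personally recommend pandas. Note that with the default float dtype, text fields such as the flower name are read as nan: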
from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')
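A self-contained sketch of the pandas route, reading from an in-memory string instead of a file. The column names here (sepal_length, sepal_width, species) are assumptions; a downloaded iris CSV may use different headers. pd.factorize is one way to turn the flower names into integer codes:

```python
import io
import pandas as pd

# A few rows in the shape of an iris CSV; header names are assumed.
csv_text = """sepal_length,sepal_width,species
5.1,3.5,setosa
7.0,3.2,versicolor
6.3,3.3,virginica
4.9,3.0,setosa
"""

data = pd.read_csv(io.StringIO(csv_text))

# First two features as a NumPy array, like iris.data[:, 0:2] in the question.
X = data[["sepal_length", "sepal_width"]].to_numpy()

# Integer codes for the labels, plus the names each code maps to.
y, names = pd.factorize(data["species"])

print(X.shape, y.tolist(), list(names))
```

Codes are assigned in order of first appearance, so names[y] recovers the original flower names.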
