How to add LabelBinarizer columns to DataFrame - python

I recently started working with LabelBinarizer and ran the following code on my CSV file:
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
#import matplotlib.pyplot as plot
#--------------------------------
label_conv = LabelBinarizer()
appstore_original = pd.read_csv("AppleStore.csv")
#--------------------------------
lb_conv = label_conv.fit_transform(appstore_original["cont_rating"])
column_names = label_conv.classes_
print(column_names)
print(lb_conv)
This gives me lb_conv and the column names. How can I attach lb_conv to appstore_original, using column_names as the column names?
If anyone could help, that would be great.

Try this:
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
df = pd.read_csv("AppleStore.csv")
# Wrap the binarized array in a DataFrame, then join it on the shared index
df = df.join(pd.DataFrame(lb.fit_transform(df["cont_rating"]),
                          columns=lb.classes_,
                          index=df.index))
Passing index=df.index to the constructor gives the newly created DataFrame the same index as the original one, which join needs in order to align the rows.
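The same join can be tried without the CSV file. The sketch below uses a small made-up frame (the track_name values and the non-default index are assumptions for illustration) to show that the binarized columns line up with the original rows. Note that with only two distinct classes LabelBinarizer returns a single column, so the columns=lb.classes_ step assumes three or more classes, as in this sample.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Made-up sample data; the index is deliberately not 0..n-1
df = pd.DataFrame(
    {
        "track_name": ["App A", "App B", "App C", "App D"],
        "cont_rating": ["4+", "12+", "17+", "4+"],
    },
    index=[10, 20, 30, 40],
)

lb = LabelBinarizer()
# One indicator column per class, in the order given by lb.classes_
encoded = pd.DataFrame(
    lb.fit_transform(df["cont_rating"]),
    columns=lb.classes_,
    index=df.index,  # reuse the original index so join aligns row by row
)

result = df.join(encoded)
print(result)
```

Without index=df.index the new frame would get a default 0..3 index, and the join against labels 10..40 would produce only NaN values in the new columns.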

Related

Pandas DataFrame wrong indexing after reading from csv

I know very little about Python's pandas module. I need to create a DataFrame and store it in a .csv file for my project. I am using the to_csv and read_csv functions. However, when I compared the two frames (the one before exporting and the imported one), I got different results. This is the minimal reproducible example:
import sys
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
documents = []
documents.append("i love python")
documents.append("foo bar")
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
X = X.T.toarray()
df = pd.DataFrame(X, index=vectorizer.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
df.to_csv(path_or_buf = "db.csv")
df1 = pd.read_csv("db.csv")
print(df.axes)
print()
print(df1.axes)
And this is what is printed:
[Index(['bar', 'foo', 'love', 'python'], dtype='object'), RangeIndex(start=0, stop=2, step=1)]
[RangeIndex(start=0, stop=4, step=1), Index(['Unnamed: 0', '0', '1'], dtype='object')]
How can I make the DataFrame imported from a .csv file identical to the original one?
UPDATE: Give the index of the DataFrame you are exporting a name, and pass that name as index_col when reading the exported CSV. Here I am using vectors as the index name:
import sys
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
documents = []
documents.append("i love python")
documents.append("foo bar")
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
X = X.T.toarray()
df = pd.DataFrame(X, index=vectorizer.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
df.index.name = 'vectors'
df.to_csv(path_or_buf="db.csv")
df1 = pd.read_csv("db.csv",index_col='vectors')
print(df)
print()
print(df1)
Old answer: Try exporting the CSV without the index by setting index=False (note that this discards the row labels):
df.to_csv(path_or_buf="db.csv", index=False)
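The named-index round trip can be checked without scikit-learn or a file on disk. This is a minimal sketch that writes to an in-memory buffer; the numeric values are made up. It also shows one detail the snippets above do not cover: integer column labels come back from read_csv as strings, so they are converted back here.

```python
import io
import pandas as pd

# Made-up stand-in for the TF-IDF matrix: terms as rows, documents as columns
df = pd.DataFrame(
    [[0.0, 0.5], [0.0, 0.5], [0.5, 0.0], [0.5, 0.0]],
    index=["bar", "foo", "love", "python"],
)
df.index.name = "vectors"

# Write to an in-memory buffer instead of db.csv
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)

# Tell read_csv which column holds the index
df1 = pd.read_csv(buf, index_col="vectors")

# Column labels 0 and 1 were written as text, so they are read back as '0' and '1'
df1.columns = df1.columns.astype(int)

print(df.axes)
print(df1.axes)
```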

Can't select a column by name in a DataFrame

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv("data/opti.csv")
df03 = df.loc[df["%"]==0.3]
df04 = df.loc[df["%"]==0.4]
df06 = df.loc[df["%"]==0.6]
df08 = df.loc[df["%"]==0.8]
df1 = df.loc[df["%"]==1]
x = np.array([0.3,0.4,0.6,0.8,1])
e = np.array([np.std(df03),np.std(df04),np.std(df06),np.std(df08),np.std(df1)])
df.head(5)
df["Tablettenmasse"]
When I write df["Tablettenmasse"] I get a KeyError, but selecting the column with iloc works. Why isn't it working the normal way?
Edit: as mosc9575 suggested, there was a space before the column name. Thanks!
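A stray space in a header is easy to miss in the rendered table but shows up when the column names are printed as a list. The sketch below uses a made-up two-column CSV (the values are assumptions) and strips the whitespace from all column names after loading.

```python
import io
import pandas as pd

# Made-up CSV whose second header has a leading space
csv_text = "%, Tablettenmasse\n0.3,101.2\n0.4,99.8\n"

df = pd.read_csv(io.StringIO(csv_text))
columns_before = df.columns.tolist()
print(columns_before)  # ['%', ' Tablettenmasse']

# Remove leading and trailing whitespace from every column name
df.columns = df.columns.str.strip()
print(df["Tablettenmasse"])
```

Passing skipinitialspace=True to read_csv is another option; it skips spaces that follow the delimiter while the file is being parsed.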

how to drop a categorical value from a data frame column in python?

I am working with a data frame called price_df, and I would like to drop the rows that contain '4wd' in the column drive-wheels. I have tried price_df2 = price_df.drop(index='4wd', axis=0) and a few other variations after reading the pandas docs, but I keep getting errors. Could anyone point me to the correct way to drop the rows that contain the value 4wd in that column? Below is the code I ran before trying to drop the values:
# Cleaned up Dataset location
fileName = "https://library.startlearninglabs.uw.edu/DATASCI410/Datasets/Automobile%20price%20data%20_Raw_.csv"
# Import libraries
from scipy.stats import norm
import numpy as np
import pandas as pd
import math
import numpy.random as nr
price_df = pd.read_csv(fileName)
round(price_df.head(),2) #getting an overview of that data
price_df.loc[:,'drive-wheels'].value_counts()
price_df2 = price_df.drop(index='4wd', axis=0)
You can use pd.DataFrame.query, with backticks around the hyphenated column name:
price_df.query('`drive-wheels` != "4wd"')
Try this
price_df = pd.read_csv(fileName)
mask = price_df["drive-wheels"] =="4wd"
price_df = price_df[~mask]
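Get a subset of your data with this one-liner (the column needs bracket access, because price_df.drive-wheels would be parsed as a subtraction):
price_df2 = price_df[price_df['drive-wheels'] != '4wd']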
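The original attempt fails because drop(index=...) looks up row labels in the index, not values in a column. The sketch below compares the approaches from the answers on a small made-up frame (the prices are assumptions, not taken from the dataset) and adds a drop call that is given actual row labels.

```python
import pandas as pd

# Made-up sample with the same hyphenated column name
price_df = pd.DataFrame(
    {
        "drive-wheels": ["fwd", "4wd", "rwd", "4wd", "fwd"],
        "price": [13495, 17450, 16500, 9233, 6295],
    }
)

mask = price_df["drive-wheels"] == "4wd"

# 1. Boolean mask: keep the rows where the mask is False
by_mask = price_df[~mask]

# 2. query: backticks allow a column name that is not a valid identifier
by_query = price_df.query('`drive-wheels` != "4wd"')

# 3. drop: pass the index labels of the rows to remove
by_drop = price_df.drop(index=price_df[mask].index)

print(by_mask)
```

All three return a new frame and leave price_df unchanged; the surviving rows keep their original index labels.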

KeyError with Pandas CSV

Getting KeyError: Revenue
My CSV file
Product,Revenue
Onetap Master,538.07
Aimware Masterpack,306.06
Personal Config,159.94
Aimware Lua,29.95
Config Swap,22.76
The code
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(open('sales.csv'),index_col=1, sep=',')
print(df.columns.tolist())
pd.value_counts(df['Revenue']).plot.bar()
plt.show()
When I use Product instead of Revenue it works just fine
Simply drop index_col=1 from your pd.read_csv() call and it works. You can also skip the open() and sep=',' parts of pd.read_csv():
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('sales.csv')
print(df.columns.tolist()) # returns ['Product', 'Revenue']
pd.value_counts(df['Revenue']).plot.bar()
plt.show()
When you use index_col=1, the Revenue column becomes the index and the DataFrame looks like this:
                    Product
Revenue
538.07        Onetap Master
306.06   Aimware Masterpack
159.94      Personal Config
29.95           Aimware Lua
22.76           Config Swap
So it is a single-column DataFrame, which becomes evident when you examine df.columns: it is just ['Product'].
TL;DR: if you want to use Revenue as a column, do not put it into the index.
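The effect of index_col can be reproduced with the CSV from the question loaded from a string. This sketch reads the same text both ways and also shows reset_index as one way to turn an index back into a regular column if the frame has already been loaded with index_col.

```python
import io
import pandas as pd

csv_text = """Product,Revenue
Onetap Master,538.07
Aimware Masterpack,306.06
Personal Config,159.94
Aimware Lua,29.95
Config Swap,22.76
"""

# With index_col=1, Revenue becomes the index and is no longer a column
with_index = pd.read_csv(io.StringIO(csv_text), index_col=1)
print(with_index.columns.tolist())

# Without index_col, both fields are ordinary columns
plain = pd.read_csv(io.StringIO(csv_text))
print(plain.columns.tolist())

# reset_index moves the index back into the columns
restored = with_index.reset_index()
print(restored.columns.tolist())
```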

Plotting a multiple column in Pandas (converting strings to floats)

I'd like to plot "MJD" vs "MULTIPLE_MJD" for the data given here:
https://www.dropbox.com/s/cicgc1eiwrz93tg/DR14Q_pruned_several3cols.csv?dl=0
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import ast
filename = 'DR14Q_pruned_several3cols.csv'
datafile= path+filename
df = pd.read_csv(datafile)
df.plot.scatter(x='MJD', y='N_SPEC')
plt.show()
ser = df['MJD_DUPLICATE'].apply(ast.literal_eval).str[1]
df['MJD_DUPLICATE'] = pd.to_numeric(ser, errors='coerce')
df['MJD_DUPLICATE_NEW'] = pd.to_numeric(ser, errors='coerce')
df.plot.scatter(x='MJD', y='MJD_DUPLICATE')
plt.show()
This makes a plot, but only for one value of MJD_DUPLICATE:
print(df['MJD_DUPLICATE_NEW'])
0 55214
1 55209
...
Thoughts??
There are two issues here:
Telling Pandas to parse tuples within the CSV. This is covered here: Reading back tuples from a csv file with pandas
Transforming the tuples into multiple rows. This is covered here: Getting a tuple in a Dafaframe into multiple rows
Putting those together, here is one way to solve your problem:
# Following https://stackoverflow.com/questions/23661583/reading-back-tuples-from-a-csv-file-with-pandas
import pandas as pd
import ast
df = pd.read_csv("DR14Q_pruned_several3cols.csv",
                 converters={"MJD_DUPLICATE": ast.literal_eval})
# Following https://stackoverflow.com/questions/39790830/getting-a-tuple-in-a-dafaframe-into-multiple-rows
df2 = pd.DataFrame(df.MJD_DUPLICATE.tolist(), index=df.MJD)
df3 = df2.stack().reset_index(level=1, drop=True)
# Now just plot!
df3.plot(marker='.', linestyle='none')
If you want to remove the 0 and -1 values, a mask will work:
df3[df3 > 0].plot(marker='.', linestyle='none')
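The parse-then-stack steps can be followed on a tiny in-memory sample. The sketch below assumes the MJD_DUPLICATE field is stored as a quoted tuple literal and that all tuples have the same length; the two rows are made up and only mimic the shape of the real file. Plotting is left out so the reshaping itself is easy to inspect.

```python
import ast
import io
import pandas as pd

# Made-up rows: each MJD_DUPLICATE cell is a quoted tuple, padded with 0 and -1
csv_text = (
    "MJD,MJD_DUPLICATE\n"
    '55214,"(55214, 55300, -1)"\n'
    '55209,"(55209, 0, -1)"\n'
)

# The converter turns each cell from a string into a real Python tuple
df = pd.read_csv(io.StringIO(csv_text),
                 converters={"MJD_DUPLICATE": ast.literal_eval})

# One column per tuple position, indexed by MJD
df2 = pd.DataFrame(df["MJD_DUPLICATE"].tolist(), index=df["MJD"])

# Stack the columns into one long Series and drop the tuple-position level
df3 = df2.stack().reset_index(level=1, drop=True)

# Keep only real dates by masking out the 0 and -1 padding values
positive = df3[df3 > 0]
print(positive)
```

The result is a Series whose index is MJD and whose values are the individual duplicate dates, which is the shape the plot call in the answer expects.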
