Plotting a multiple column in Pandas (converting strings to floats) - python

I'd like to plot "MJD" vs "MJD_DUPLICATE" for the data given here:
https://www.dropbox.com/s/cicgc1eiwrz93tg/DR14Q_pruned_several3cols.csv?dl=0
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import ast
filename = 'DR14Q_pruned_several3cols.csv'
datafile = path + filename  # path (the directory holding the CSV) is not defined in this snippet
df = pd.read_csv(datafile)
df.plot.scatter(x='MJD', y='N_SPEC')
plt.show()
ser = df['MJD_DUPLICATE'].apply(ast.literal_eval).str[1]
df['MJD_DUPLICATE'] = pd.to_numeric(ser, errors='coerce')
df['MJD_DUPLICATE_NEW'] = pd.to_numeric(ser, errors='coerce')
df.plot.scatter(x='MJD', y='MJD_DUPLICATE')
plt.show()
This makes a plot, but only for one value of MJD_DUPLICATE:
print(df['MJD_DUPLICATE_NEW'])
0 55214
1 55209
...
Thoughts??

There are two issues here:
Telling Pandas to parse tuples within the CSV. This is covered here: Reading back tuples from a csv file with pandas
Transforming the tuples into multiple rows. This is covered here: Getting a tuple in a Dafaframe into multiple rows
Putting those together, here is one way to solve your problem:
# Following https://stackoverflow.com/questions/23661583/reading-back-tuples-from-a-csv-file-with-pandas
import pandas as pd
import ast
df = pd.read_csv("DR14Q_pruned_several3cols.csv",
                 converters={"MJD_DUPLICATE": ast.literal_eval})
# Following https://stackoverflow.com/questions/39790830/getting-a-tuple-in-a-dafaframe-into-multiple-rows
df2 = pd.DataFrame(df.MJD_DUPLICATE.tolist(), index=df.MJD)
df3 = df2.stack().reset_index(level=1, drop=True)
# Now just plot!
df3.plot(marker='.', linestyle='none')
If you want to remove the 0 and -1 values, a mask will work:
df3[df3 > 0].plot(marker='.', linestyle='none')
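The two steps above can be tried without the real file. The sketch below uses a small, made-up CSV (the two rows and their values are assumptions, not taken from the actual DR14Q data) to show what the converter and the stack step produce before plotting.

```python
import ast
import io

import pandas as pd

# Hypothetical stand-in for the real CSV: tuples are stored as quoted strings.
csv_text = io.StringIO(
    'MJD,MJD_DUPLICATE\n'
    '55214,"(55209, 55300, -1)"\n'
    '55209,"(55214, 0, -1)"\n'
)

# Parse the tuple strings into real Python tuples while reading.
df = pd.read_csv(csv_text, converters={"MJD_DUPLICATE": ast.literal_eval})

# One column per tuple element, indexed by MJD.
wide = pd.DataFrame(df["MJD_DUPLICATE"].tolist(), index=df["MJD"])

# One row per (MJD, duplicate value) pair; drop the tuple-position level.
stacked = wide.stack().reset_index(level=1, drop=True)

# Keep only real epochs (drop the 0 and -1 placeholders).
valid = stacked[stacked > 0]
print(valid)
```

Calling valid.plot(marker='.', linestyle='none') on the result would then give the scatter-style plot, with MJD on the x axis.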

Related

Filtering pandas dataframe in python

I have a csv file with rows and columns separated by commas. This file contains headers (str) and values. Now, I want to filter all the data with a condition. For example, there is a header called "pmra" and I want to keep all the information for pmra values between -2.6 and -2.0. How can I do that? I tried with np.where but it did not work. Thanks for your help.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
filename="NGC188_C.csv"
data = pd.read_csv(filename)
ra = data["ra"]
dec = data["dec"]
parallax = data["parallax"]
pm_ra = data["pmra"]
pm_dec = data["pmdec"]
g_band = data["phot_g_mean_mag"]
bp_rp = data["bp_rp"]
You can use something like:
data[(data["pmra"] >= -2.6) & (data["pmra"] <= -2)]
There is also another approach: the between method returns a boolean mask (inclusive at both ends by default) that you can use to index the DataFrame:
data[data["pmra"].between(-2.6, -2)]
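A minimal sketch comparing the two approaches on a tiny, made-up table (the values are assumptions; the real data comes from NGC188_C.csv):

```python
import pandas as pd

# Hypothetical sample standing in for the real catalogue.
data = pd.DataFrame({
    "pmra": [-3.0, -2.6, -2.3, -2.0, -1.5],
    "pmdec": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# between() returns a boolean Series; both bounds are included by default.
mask = data["pmra"].between(-2.6, -2.0)
filtered = data[mask]

# Equivalent explicit comparison; the parentheses around each condition are required.
same = data[(data["pmra"] >= -2.6) & (data["pmra"] <= -2.0)]

print(filtered)
```

Both forms keep every column of the matching rows, so the other quantities (ra, dec, parallax, and so on) can be taken from the filtered frame afterwards.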

Can't select a column by name in a DataFrame

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv("data/opti.csv")
df03 = df.loc[df["%"]==0.3]
df04 = df.loc[df["%"]==0.4]
df06 = df.loc[df["%"]==0.6]
df08 = df.loc[df["%"]==0.8]
df1 = df.loc[df["%"]==1]
x = np.array([0.3,0.4,0.6,0.8,1])
e = np.array([np.std(df03),np.std(df04),np.std(df06),np.std(df08),np.std(df1)])
df.head(5)
df["Tablettenmasse"]
df Output
When I write df["Tablettenmasse"] I get a key error. But when I select the column with iloc it works. Why isn't it working the normal way?
edit: as mosc9575 suggested, there was a space before the word. Thanks!
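For anyone hitting the same key error, here is a small sketch of how a leading space in a header can be spotted and removed. The CSV content is made up for illustration; only the column name comes from the question.

```python
import io

import pandas as pd

# Hypothetical file content: note the space after the comma in the header row.
csv_text = io.StringIO("%, Tablettenmasse\n0.3,101.2\n0.4,99.8\n")
df = pd.read_csv(csv_text)

# Printing the column list as Python strings makes stray whitespace visible.
raw_columns = df.columns.tolist()
print(raw_columns)

# Strip whitespace from every column name, then select by the clean name.
df.columns = df.columns.str.strip()
masses = df["Tablettenmasse"]
print(masses)
```

Passing skipinitialspace=True to pd.read_csv is another option, since it skips spaces that follow the delimiter.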

how to drop a categorical value from a data frame column in python?

I am working with a data frame called price_df, and I would like to drop the rows that contain '4wd' in the column drive-wheels. I have tried price_df2 = price_df.drop(index='4wd', axis=0) and a few other variations after reading the pandas docs, but I keep getting errors. Could anyone point me to the correct way to drop the rows whose drive-wheels value is 4wd? Below is the code I ran before trying to drop the values:
# Cleaned up Dataset location
fileName = "https://library.startlearninglabs.uw.edu/DATASCI410/Datasets/Automobile%20price%20data%20_Raw_.csv"
# Import libraries
from scipy.stats import norm
import numpy as np
import pandas as pd
import math
import numpy.random as nr
price_df = pd.read_csv(fileName)
round(price_df.head(),2) #getting an overview of that data
price_df.loc[:,'drive-wheels'].value_counts()
price_df2 = price_df.drop(index='4wd', axis=0)
You can use pd.DataFrame.query, wrapping the column name in backticks because it contains a hyphen:
price_df.query('`drive-wheels` != "4wd"')
Try this:
price_df = pd.read_csv(fileName)
mask = price_df["drive-wheels"] == "4wd"
price_df = price_df[~mask]
Get a subset of your data with this one-liner:
price_df2 = price_df[price_df['drive-wheels'] != '4wd']
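The original attempt fails because drop looks up index labels, not cell values. The sketch below, on a small made-up frame (the rows and prices are assumptions), shows boolean filtering next to a drop call that is given the matching index labels.

```python
import pandas as pd

# Hypothetical sample standing in for the automobile price data.
price_df = pd.DataFrame({
    "drive-wheels": ["fwd", "4wd", "rwd", "4wd"],
    "price": [13495, 17450, 16500, 7603],
})

# Boolean filtering: keep every row whose value is not '4wd'.
filtered = price_df[price_df["drive-wheels"] != "4wd"]

# drop() needs index labels, so first collect the labels of the matching rows.
labels_to_drop = price_df.index[price_df["drive-wheels"] == "4wd"]
dropped = price_df.drop(index=labels_to_drop)

print(filtered)
```

Both results keep the original index labels; chaining .reset_index(drop=True) renumbers the rows if that is wanted.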

How to add LabelBinarizer columns to DataFrame

I have recently started working with LabelBinarizer by running the following code. (here are the first couple of rows of the CSV file that I'm using):
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
#import matplotlib.pyplot as plot
#--------------------------------
label_conv = LabelBinarizer()
appstore_original = pd.read_csv("AppleStore.csv")
#--------------------------------
lb_conv = label_conv.fit_transform(appstore_original["cont_rating"])
column_names = label_conv.classes_
print(column_names)
print(lb_conv)
I get lb_conv and the column names. So my question is:
How can I attach lb_conv to appstore_original, using column_names as the column names?
If anyone could help, that would be great.
Try this:
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
df = pd.read_csv("AppleStore.csv")
# One indicator column per class, aligned to df's index so the join matches rows
df = df.join(pd.DataFrame(lb.fit_transform(df["cont_rating"]),
                          columns=lb.classes_,
                          index=df.index))
To make sure that the newly created DataFrame has the same index as the original one (needed for the join), we pass index=df.index to the constructor.
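To illustrate why the index argument matters, here is a pandas-only sketch. The encoded array and class list are hand-written stand-ins for what a label binarizer would produce (an assumption, so the example runs without scikit-learn), and the non-default index is chosen on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a non-default index.
df = pd.DataFrame({"cont_rating": ["4+", "12+", "17+"]}, index=[10, 20, 30])

# Hand-written stand-in for a one-hot encoding of cont_rating (assumption).
classes = ["12+", "17+", "4+"]
encoded = np.array([[0, 0, 1],
                    [1, 0, 0],
                    [0, 1, 0]])

# Without index=..., the new frame gets labels 0..2, which match nothing in df.
misaligned = df.join(pd.DataFrame(encoded, columns=classes))

# With index=df.index, the rows line up and the join fills in real values.
aligned = df.join(pd.DataFrame(encoded, columns=classes, index=df.index))

print(aligned)
```

With a default 0..n-1 index on both sides the two calls give the same result, which is why the problem only shows up after filtering or re-indexing.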

Pandas apply, how to combine the results returned

With pandas apply, my function receives each row of a DataFrame and returns another DataFrame. How can I combine (append) the DataFrames returned by all the calls into a single one? For example:
# this is an example
import pandas as pd
import numpy as np
def newdata(X, data2):
    return X - data2[data2['no']!=X['no']].sample(1,random_state=100)
col = ['no','a','b']
data1 = pd.DataFrame(np.column_stack((range(5),np.random.rand(5,2))),columns=col)
data2 = pd.DataFrame(np.column_stack((range(3),np.random.rand(3,2))),columns=col)
Newdata = data1.apply(newdata, args=(data2,), axis=1)
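One possible approach, offered only as a sketch and not as an answer from the original thread, is to call the function on each row explicitly and join the pieces with pd.concat. The seeded random generator and the names pieces and combined are introduced here for illustration.

```python
import numpy as np
import pandas as pd


def newdata(row, data2):
    # Pick one row of data2 whose 'no' differs from this row's 'no'.
    other = data2[data2["no"] != row["no"]].sample(1, random_state=100)
    # Series minus a one-row DataFrame aligns on column names
    # and yields a one-row DataFrame.
    return row - other


col = ["no", "a", "b"]
rng = np.random.default_rng(0)  # seeded so the example is repeatable
data1 = pd.DataFrame(np.column_stack((range(5), rng.random((5, 2)))), columns=col)
data2 = pd.DataFrame(np.column_stack((range(3), rng.random((3, 2)))), columns=col)

# Build one small DataFrame per row, then stack them vertically.
pieces = [newdata(row, data2) for _, row in data1.iterrows()]
combined = pd.concat(pieces, ignore_index=True)

print(combined)
```

Iterating with iterrows is slower than a vectorised solution, so this is meant for clarity on small inputs.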
