Filtering pandas dataframe in python - python

I have a csv file with rows and columns separated by commas. This file contains headers (str) and values. Now, I want to filter all the data with a condition. For example, there is a header called "pmra" and I want to keep all the information for pmra values between -2.6 and -2.0. How can I do that? I tried with np.where but it did not work. Thanks for your help.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
filename="NGC188_C.csv"
data = pd.read_csv(filename)
ra = data["ra"]
dec = data["dec"]
parallax = data["parallax"]
pm_ra = data["pmra"]
pm_dec = data["pmdec"]
g_band = data["phot_g_mean_mag"]
bp_rp = data["bp_rp"]

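You can index the DataFrame with a boolean mask:
data[(data["pmra"] >= -2.6) & (data["pmra"] <= -2)]
Another approach is the between method. It returns a boolean mask (inclusive at both ends by default), so use it to index the DataFrame:
data[data["pmra"].between(-2.6, -2)]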
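To make the filtering concrete, here is a minimal sketch on a small made-up table. The sample values and the names `mask`, `filtered` and `filtered_q` are assumptions for illustration; the real data would come from `pd.read_csv("NGC188_C.csv")`.

```python
import pandas as pd

# Hypothetical sample standing in for NGC188_C.csv
data = pd.DataFrame({
    "ra": [11.1, 11.2, 11.3, 11.4, 11.5],
    "pmra": [-3.0, -2.6, -2.3, -2.0, -1.5],
})

# Boolean Series: True where -2.6 <= pmra <= -2.0 (both ends inclusive by default)
mask = data["pmra"].between(-2.6, -2.0)

# Indexing with the mask keeps every column of the matching rows
filtered = data[mask]

# The same condition written as a query string
filtered_q = data.query("pmra >= -2.6 and pmra <= -2.0")

print(filtered)
```

Because the result is still a DataFrame, the other columns (`ra`, `dec`, and so on) can be pulled from `filtered` afterwards instead of from `data`.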

Related

Can't select a column by name in a DataFrame

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv("data/opti.csv")
df03 = df.loc[df["%"]==0.3]
df04 = df.loc[df["%"]==0.4]
df06 = df.loc[df["%"]==0.6]
df08 = df.loc[df["%"]==0.8]
df1 = df.loc[df["%"]==1]
x = np.array([0.3,0.4,0.6,0.8,1])
e = np.array([np.std(df03),np.std(df04),np.std(df06),np.std(df08),np.std(df1)])
df.head(5)
df["Tablettenmasse"]
When I write df["Tablettenmasse"] I get a KeyError, but selecting the column with iloc works. Why isn't it working the normal way?
edit: as mosc9575 suggested, there was a space before the word. Thanks!
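A stray space in a header is easy to miss because it is invisible in the printed table. The sketch below, using a made-up two-row frame (the values are assumptions), shows one way to spot and remove it.

```python
import pandas as pd

# Hypothetical frame whose first header has a leading space, as in the question
df = pd.DataFrame({" Tablettenmasse": [0.51, 0.49], "%": [0.3, 0.4]})

# Printing the names as a list shows the quotes, which exposes hidden whitespace
print(df.columns.tolist())

# Strip leading and trailing whitespace from every column name
df.columns = df.columns.str.strip()

# The column can now be selected by its plain name
masses = df["Tablettenmasse"]
print(masses)
```

When the file is read with `pd.read_csv`, passing `skipinitialspace=True` is another option that may avoid the problem at load time, since it skips spaces that follow the delimiter.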

how to drop a categorical value from a data frame column in python?

I am working with a data frame called price_df, and I would like to drop the rows that contain '4wd' in the column drive-wheels. I have tried price_df2 = price_df.drop(index='4wd', axis=0) and a few other variations after reading the pandas docs, but I keep getting errors. Could anyone point me to the correct way to drop the rows whose drive-wheels value is 4wd? Below is the code I ran before trying to drop the values:
# Cleaned up Dataset location
fileName = "https://library.startlearninglabs.uw.edu/DATASCI410/Datasets/Automobile%20price%20data%20_Raw_.csv"
# Import libraries
from scipy.stats import norm
import numpy as np
import pandas as pd
import math
import numpy.random as nr
price_df = pd.read_csv(fileName)
round(price_df.head(),2) #getting an overview of that data
price_df.loc[:,'drive-wheels'].value_counts()
price_df2 = price_df.drop(index='4wd', axis=0)
You can use pd.DataFrame.query and back ticks for this column name with a hyphen:
price_df.query('`drive-wheels` != "4wd"')
Try this
price_df = pd.read_csv(fileName)
mask = price_df["drive-wheels"] =="4wd"
price_df = price_df[~mask]
Get a subset of your data with this one-liner:
price_df2 = price_df[price_df['drive-wheels'] != '4wd']
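The original attempt fails because `drop(index=...)` looks for index labels, not for values inside a column. The following sketch compares the approaches on a small made-up frame (the rows and the names `by_mask`, `by_query`, `by_drop` are assumptions). Note that a column name containing a hyphen must be accessed with brackets, since attribute access would be parsed as a subtraction.

```python
import pandas as pd

# Hypothetical sample with the same column name as the automobile dataset
price_df = pd.DataFrame({
    "drive-wheels": ["fwd", "4wd", "rwd", "4wd", "fwd"],
    "price": [13495, 17450, 16500, 7603, 6295],
})

# Boolean mask: keep rows whose value is not "4wd"
by_mask = price_df[price_df["drive-wheels"] != "4wd"]

# query(): backticks quote the hyphenated column name
by_query = price_df.query('`drive-wheels` != "4wd"')

# drop() works too, if it is given the index labels of the rows to remove
rows_to_drop = price_df.index[price_df["drive-wheels"] == "4wd"]
by_drop = price_df.drop(index=rows_to_drop)

print(by_mask)
```

All three results keep the original index labels; calling `reset_index(drop=True)` afterwards renumbers the rows if that is wanted.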

How to delete \xa0 symbol from all dataframe in pandas?

I have a data frame with the 2011 census (Chile) information. Some column names and some values contain the \xa0 symbol, which makes it hard to select parts of the data frame. My code is the following:
import numpy as numpy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
#READ AND SAVE EACH DATA SET IN A DATA FRAME
data_2011 = pd.read_csv("2011.csv")
#SHOW FIRST 5 ROWS
print(data_2011.head())
This prints the first rows as expected. So far everything is fine, but when I list the column names using:
print("ATRIBUTOS DATOS CENSO 2011",list(data_2011))
the names contain the \xa0 symbol. How can I fix this? Thanks in advance.
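Try the following code, which replaces the non-breaking space (\xa0) with a regular space in both the column names and the string values:
data_2011.columns = data_2011.columns.str.replace("\xa0", " ", regex=False)
data_2011 = data_2011.replace("\xa0", " ", regex=True)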
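As a worked illustration, here is a minimal sketch on a made-up frame. The column names and values are assumptions, not the real census headers. It first lists which column names are affected and then replaces the non-breaking space everywhere.

```python
import pandas as pd

# Hypothetical stand-in for 2011.csv with non-breaking spaces (\xa0)
data_2011 = pd.DataFrame({
    "NOMBRE\xa0REGION": ["Región\xa0Metropolitana", "Valparaíso"],
    "POBLACION": [100, 200],
})

# Find the column names that contain the character
affected = [name for name in data_2011.columns if "\xa0" in name]
print(affected)

# Replace it with a regular space in the column names (literal match, no regex)
data_2011.columns = data_2011.columns.str.replace("\xa0", " ", regex=False)

# regex=True makes DataFrame.replace substitute inside strings
# instead of matching whole cell values; numeric columns are left alone
data_2011 = data_2011.replace("\xa0", " ", regex=True)

print(list(data_2011))
```

Printing the column names as a list, as the question already does, is what reveals the escape sequence; a plain `print` of the frame renders it as an ordinary blank.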

Plotting a multiple column in Pandas (converting strings to floats)

I'd like to plot "MJD" vs "MULTIPLE_MJD" for the data given here:
https://www.dropbox.com/s/cicgc1eiwrz93tg/DR14Q_pruned_several3cols.csv?dl=0
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import ast
filename = 'DR14Q_pruned_several3cols.csv'
datafile= path+filename
df = pd.read_csv(datafile)
df.plot.scatter(x='MJD', y='N_SPEC')
plt.show()
ser = df['MJD_DUPLICATE'].apply(ast.literal_eval).str[1]
df['MJD_DUPLICATE'] = pd.to_numeric(ser, errors='coerce')
df['MJD_DUPLICATE_NEW'] = pd.to_numeric(ser, errors='coerce')
df.plot.scatter(x='MJD', y='MJD_DUPLICATE')
plt.show()
This makes a plot, but only for one value of MJD_DUPLICATE:
print(df['MJD_DUPLICATE_NEW'])
0 55214
1 55209
...
Thoughts??
There are two issues here:
Telling Pandas to parse tuples within the CSV. This is covered here: Reading back tuples from a csv file with pandas
Transforming the tuples into multiple rows. This is covered here: Getting a tuple in a Dafaframe into multiple rows
Putting those together, here is one way to solve your problem:
# Following https://stackoverflow.com/questions/23661583/reading-back-tuples-from-a-csv-file-with-pandas
import pandas as pd
import ast
df = pd.read_csv("DR14Q_pruned_several3cols.csv",
                 converters={"MJD_DUPLICATE": ast.literal_eval})
# Following https://stackoverflow.com/questions/39790830/getting-a-tuple-in-a-dafaframe-into-multiple-rows
df2 = pd.DataFrame(df.MJD_DUPLICATE.tolist(), index=df.MJD)
df3 = df2.stack().reset_index(level=1, drop=True)
# Now just plot!
df3.plot(marker='.', linestyle='none')
If you want to remove the 0 and -1 values, a mask will work:
df3[df3 > 0].plot(marker='.', linestyle='none')
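To see what each step produces without downloading the file, the same pipeline can be run on a tiny in-memory CSV. The two rows below are invented, and the assumption that each tuple is stored as a quoted string such as "(55210, 55214, -1)" is inferred from the answer, not checked against the real file.

```python
import ast
import io

import pandas as pd

# Hypothetical two-row excerpt; tuples are stored as quoted strings
csv_text = (
    "MJD,MJD_DUPLICATE\n"
    '55214,"(55210, 55214, -1)"\n'
    '55209,"(55209, 0, -1)"\n'
)

# Parse the tuple strings into real tuples while reading
df = pd.read_csv(io.StringIO(csv_text),
                 converters={"MJD_DUPLICATE": ast.literal_eval})

# One column per tuple position, indexed by MJD
df2 = pd.DataFrame(df["MJD_DUPLICATE"].tolist(), index=df["MJD"])

# Stack the columns into one long Series and drop the tuple-position level,
# leaving one (MJD, duplicate value) pair per entry
df3 = df2.stack().reset_index(level=1, drop=True)

# Keep only positive entries, i.e. discard the 0 and -1 values
valid = df3[df3 > 0]
print(valid)
```

The index of `valid` holds the MJD of each row and the values hold the duplicate MJDs, so `valid.plot(marker='.', linestyle='none')` draws one point per pair.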

Subtracting the rows of a column from the preceding rows in a python pandas dataframe

I have a .dat file with thousands of rows in one column (say the column is time, t). I want to find the interval between consecutive rows, that is, the difference between the second row and the first, the third and the second, and so on (to find dt). Then I want to make a new column with those interval values and plot it against the original column. If a language other than Python is better suited to this, I would appreciate suggestions too.
I have written a pseudo python code for that:
import pandas as pd
import numpy as np
from sys import argv
from pylab import *
import csv
script, filename = argv
# read flash.dat to a list of lists
datContent = [i.strip().split() for i in open("./flash.dat").readlines()]
# write it as a new CSV file
with open("./flash.dat", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(datContent)
columns_to_keep = ['#time']
dataframe = pd.read_csv("./flash.csv", usecols=columns_to_keep)
df = pd.DataFrame({"#time"})
df["#time"] = df["#time"] + [pd.Timedelta(minutes=m) for m in np.random.choice(a=range(60), size=df.shape[0])]
df["value"] = np.random.normal(size=df.shape[0])
df["prev_time"] = [np.nan] + df.iloc[:-1]["#time"].tolist()
df["time_delta"] = df.time - df.prev_time
df
pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
dataframe.plot(x='#time', y='time_delta', style='r')
print dataframe
show()
Updated my code, and i am also sharing the .dat file I am working on.
https://www.dropbox.com/s/w4jbxmln9e83355/flash.dat?dl=0
One easy way to perform an operation involving values from different rows is to copy the required values onto the same row and then apply a simple row-wise operation.
For instance, in your example, we'd have a dataframe with one time column and some other data, like so:
import pandas as pd
import numpy as np
df = pd.DataFrame({"time": pd.date_range("24 sept 2016", periods=5*24, freq="1h")})
df["time"] = df["time"] + [pd.Timedelta(minutes=m) for m in np.random.choice(a=range(60), size=df.shape[0])]
df["value"] = np.random.normal(size=df.shape[0])
If you want to compute the time delta from the previous (or next, or whatever) row, you can simply copy the value from it, and then perform the subtraction:
df["prev_time"] = [np.nan] + df.iloc[:-1]["time"].tolist()
df["time_delta"] = df.time - df.prev_time
df
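pandas also has built-in methods for this pattern. As a minimal sketch, assuming a plain numeric time column (the values below are invented), `Series.shift` copies the previous row's value onto the current row and `Series.diff` performs the subtraction in one call.

```python
import pandas as pd

# Hypothetical numeric time column, standing in for the "#time" column of flash.dat
df = pd.DataFrame({"time": [0.0, 0.5, 1.5, 3.0]})

# diff() subtracts the previous row from each row; the first row has no
# predecessor, so its result is NaN
df["dt"] = df["time"].diff()

# Equivalent two-step version: copy the previous value, then subtract
df["prev_time"] = df["time"].shift(1)
df["dt_shift"] = df["time"] - df["prev_time"]

print(df)
```

The new column can then be plotted against the original with `df.plot(x="time", y="dt")`. The same calls work on datetime columns, where the result is a column of timedeltas.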
