I have a list of indices, and I want to select two columns from a dataframe according to those indices.
I am trying:
indices = full_train_df.query("primary == primary").index
X = train_df[["A","B"]][:clean_df_indices].values
y = train_df["year"][:clean_df_indices].values
However, it says that none of them are in the index. What can I do to solve this error?
Use loc, passing the index labels directly rather than as a slice:
indices = full_train_df.query("primary == primary").index
X = train_df.loc[indices, ["A","B"]].values
y = train_df.loc[indices, "year"].values
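For context, a minimal runnable sketch of the fix with invented toy data (a single train_df stands in for both frames in the question; the column names are taken from it):
import numpy as np
import pandas as pd

# Toy frame; the NaN row simulates a missing "primary" value
train_df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [5, 6, 7, 8],
    "year": [2001, 2002, 2003, 2004],
    "primary": [1.0, np.nan, 3.0, 4.0],
})

# "primary == primary" is only True where primary is not NaN
indices = train_df.query("primary == primary").index

X = train_df.loc[indices, ["A", "B"]].values  # shape (3, 2)
y = train_df.loc[indices, "year"].values      # shape (3,)
The point is that .loc takes the index labels themselves, not a slice ending at them.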
Having issues with plotting values above a set threshold using a pandas dataframe.
I have a dataframe that has 21453 rows and 20 columns, and one of the columns is just 1 and 0 values. I'm trying to plot this column using the following code:
lst1 = []
for x in range(0, len(df)):
    if(df_smooth['Active'][x] == 1):
        lst1.append(df_smooth['Time'][x])
plt.plot(df_smooth['Time'], df_smooth['CH1'])
plt.plot(df_smooth['Time'], lst1)
But I get the following error:
x and y must have same first dimension, but have shapes (21453,) and (9,)
Any suggestions on how to fix this?
The error is probably the result of the line plt.plot(df_smooth['Time'], lst1): lst1 is only a subset of df_smooth['Time'], while df_smooth['Time'] is the full series.
The solution I would go with is to also build a filtered x version, for example:
lst_X = []
lst_Y = []
for x in range(0, len(df_smooth)):
    if df_smooth['Active'][x] == 1:
        lst_X.append(df_smooth['Time'][x])
        lst_Y.append(df_smooth['Time'][x])  # replace 'Time' with your real Y column
plt.plot(lst_X, lst_Y)
Another option is to build a sub-dataframe:
sub_df = df_smooth[df_smooth['Active']==1]
plt.plot(sub_df['Time'], sub_df['Time'])
(assuming the correct Y column is Time; otherwise just replace it with the correct column)
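For completeness, a self-contained sketch of the sub-dataframe approach with invented toy data (CH1 is assumed as the Y column here):
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_smooth = pd.DataFrame({
    "Time": np.arange(100),
    "CH1": rng.normal(size=100),
    "Active": rng.integers(0, 2, size=100),
})

# Keep only the rows where Active == 1, so x and y stay the same length
sub_df = df_smooth[df_smooth["Active"] == 1]
plt.plot(df_smooth["Time"], df_smooth["CH1"])
plt.plot(sub_df["Time"], sub_df["CH1"], "o")
plt.show()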
It seems like you are trying to plot two data series of different lengths with the plt.plot() function; this causes the error, because plt.plot() expects x and y to have the same first dimension.
You will need to ensure that both data series have the same length before plotting them. One way to do this is to create a new list with the same number of elements as the df_smooth['Time'] series and fill its leading positions with the values from lst1.
# Create a new list with the same length as the 'Time' data series
lst2 = [0] * len(df_smooth['Time'])

# Loop through the 'lst1' data series and copy the values to the
# corresponding indices in the 'lst2' data series
for x in range(0, len(lst1)):
    lst2[x] = lst1[x]

# Plot the 'Time' and 'lst2' data series using the plt.plot() function
plt.plot(df_smooth['Time'], df_smooth['CH1'])
plt.plot(df_smooth['Time'], lst2)
I think this should work.
I want to use the index of high_accidents (a pandas Series), which is a list of cities, to get the rows of the dataframe whose df["City"] value matches.
count_city = df["City"].value_counts()
high_accidents = count_city[count_city >= 1000]
new_df = df.loc[df["City"].values == high_accidents.index]
This returns ValueError: Lengths must match to compare
Here df['City'].values gives you an array of cities, while high_accidents.index is a pandas Index. Comparing them element-wise with == requires both to have the same length, which is why the error says lengths must match to compare.
To get your desired result you can modify the code as follows:
count_city = df["City"].value_counts()
high_accidents = count_city[count_city >= 1000]
new_df = df.loc[df["City"].isin(high_accidents.index)]
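A minimal sketch of the fix on toy data (the city names and the lowered threshold are invented for illustration):
import pandas as pd

df = pd.DataFrame({"City": ["Springfield", "Springfield", "Springfield", "Shelbyville"],
                   "Severity": [2, 3, 1, 2]})

count_city = df["City"].value_counts()
high_accidents = count_city[count_city >= 3]           # threshold lowered for the toy data
new_df = df.loc[df["City"].isin(high_accidents.index)]
print(new_df)                                          # only the Springfield rows remain
isin builds an element-wise boolean mask over df["City"], so the two objects no longer need matching lengths.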
I have a table which contains laboratory results, including 'blind duplicate samples'. These are basically a sample taken twice, where the second sample was given a non-descript label. The corresponding original sample is indicated in a separate column.
import pandas as pd

Labels = ['A1-1', 'A1-2', 'A1-3', 'A1-4', 'B1-2', 'B1-3', 'B1-4', 'B1-5', 'Blank1', 'Blank2', 'Blank3']
Values = [8356532, 7616084, 5272477, 5076012, 411851, 415258, 8285777, 9700884, 9192185, 4466890, 830516]
Duplicate_of = ['', '', '', '', '', '', '', '', 'A1-1', 'A1-4', 'B1-3']
d = {'Labels': Labels, 'Values': Values, 'Duplicate_of': Duplicate_of}
df = pd.DataFrame(data=d)
df = df[['Labels', 'Values', 'Duplicate_of']]
I would like to add a column to the dataframe which holds the 'value' from the original sample for the duplicates. So a new column ('Original_value'), where for 'Blank1' the value of 'A1-1' is entered, for 'Blank2' the value of 'A1-4' is entered, etc. For rows where the 'Duplicate_of' field is empty, this new column is also empty.
In excel, this is very easy with Vlookup, but I haven't seen an easy way in Pandas (maybe other than joining the entire table with itself?)
Here is the easiest way to do this, in one line:
df["Original_value"] = df["Duplicate_of"].apply(lambda x: "" if x == "" else df.loc[df["Labels"] == x, "Values"].values[0])
Explanation:
This simply applies a lambda function to each element of the column "Duplicate_of"
First we check if the item is an empty string and we return an empty string if so:
"" if x == ""
is equivalent to:
if x == "" return ""
If it is not an empty string the following command is executed:
df.loc[df["Labels"] == x, "Values"].values[0]
This simply returns the value in the column "Values" where the condition df["Labels"] == x is true. If you are wondering about the .values[0] part, it is there because .loc returns a Series; our Series in this case holds just a single value, so we simply get it with .values[0].
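If the one-liner is hard to read, here is the same logic written as a named helper (the function name is made up for illustration):
def original_value(label):
    # Rows that are not duplicates keep an empty string
    if label == "":
        return ""
    # First "Values" entry whose "Labels" matches the referenced sample
    return df.loc[df["Labels"] == label, "Values"].values[0]

df["Original_value"] = df["Duplicate_of"].apply(original_value)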
Not a memory-efficient answer, but this works:
import numpy as np
dictionary = dict(zip(Labels, Values))
df["Original_value"] = df["Duplicate_of"].map(lambda x: np.nan if x not in dictionary else dictionary[x])
For the rest of the values in Original_value it gives NaN. You can decide what you want in place of that.
The dtype of the new column will not be integer; that can also be changed if needed.
With @jezrael's comment, the same thing can be done as:
import numpy as np
dictionary = dict(zip(Labels, Values))
df["Original_value"] = df["Duplicate_of"].map(dictionary)
Say I have a DataFrame as follows:
    chicago  chicago_bound1  chicago_bound2
0  -7541.18        -8589.95        -6492.41
1   -612.89        -1475.30          249.52
I'd like to create a new column whose value is the 2nd and 3rd columns combined into a list in a cell.
i.e.
combined
[-8589.95, -6492.41]
[-1475.30, 249.52]
Any ideas how to do this? I get this error:
ValueError: Length of values does not match length of index
when I try to do something like this:
DF['combined'] = [DF['chicago_bound1'], DF['chicago_bound2']]
Try:
df['combined'] = list(zip(df.chicago_bound1, df.chicago_bound2))
or
df['combined'] = df.apply(lambda x: [x.chicago_bound1, x.chicago_bound2], axis=1)
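Note the small difference between the two: zip fills each cell with a tuple, while the apply version fills it with a list. A quick check, with data reconstructed from the DataFrame in the question:
import pandas as pd

df = pd.DataFrame({"chicago": [-7541.18, -612.89],
                   "chicago_bound1": [-8589.95, -1475.30],
                   "chicago_bound2": [-6492.41, 249.52]})

df["combined"] = list(zip(df.chicago_bound1, df.chicago_bound2))
print(df["combined"].tolist())  # [(-8589.95, -6492.41), (-1475.3, 249.52)]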
You can do selection by position with integer slices, which you can then output to a list.
In this case your selection would be df.iloc[0:2, 1:3]
foo = df.iloc[0:2, 1:3].values.tolist()
df['combined']= foo
Output:
chicago chicago_bound1 chicago_bound2 combined
0 -7541.18 -8589.95 -6492.41 [-8589.95, -6492.41]
1 -612.89 -1475.30 249.52 [-1475.3, 249.52]
I am looking for a general way to do this:
raw_data = np.array(somedata)
filterColumn1 = raw_data[:,1]
filterColumn2 = raw_data[:,3]
cartesian_product = itertools.product(np.unique(filterColumn1), np.unique(filterColumn2))
for val1, val2 in cartesian_product:
    fixed_mask = (filterColumn1 == val1) & (filterColumn2 == val2)
    subset = raw_data[fixed_mask]
I want to be able to use any amount of filterColumns. So what I want is this:
filterColumns = [filterColumn1, filterColumn2, ...]
uniqueValues = map(np.unique, filterColumns)
cartesian_product = itertools.product(*uniqueValues)
for combination in cartesian_product:
    variable_mask = ????
    subset = raw_data[variable_mask]
Is there a simple syntax to do what I want? Otherwise, should I try a different approach?
Edit: This seems to be working
cartesian_product = itertools.product(*uniqueValues)
for combination in cartesian_product:
    variable_mask = True
    for idx, fc in enumerate(filterColumns):
        variable_mask &= (fc == combination[idx])
    subset = raw_data[variable_mask]
You could use numpy.all and index broadcasting for this:
filter_matrix = np.array(filterColumns)                          # shape (n_filters, n_rows)
combination_array = np.array(combination)                        # shape (n_filters,)
bool_matrix = filter_matrix == combination_array[:, np.newaxis]  # broadcast the comparison over rows
variable_mask = np.all(bool_matrix, axis=0)                      # True where every filter matches
subset = raw_data[variable_mask]
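Plugged into the loop from the question, a self-contained sketch with random test data (the column choices 1 and 3 mirror the question):
import itertools
import numpy as np

raw_data = np.random.randint(0, 4, size=(50, 5))
filterColumns = [raw_data[:, 1], raw_data[:, 3]]
uniqueValues = [np.unique(fc) for fc in filterColumns]

filter_matrix = np.array(filterColumns)  # shape (n_filters, n_rows)
for combination in itertools.product(*uniqueValues):
    variable_mask = np.all(filter_matrix == np.array(combination)[:, np.newaxis], axis=0)
    subset = raw_data[variable_mask]     # rows matching this combination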
There are, however, simpler ways of doing the same thing if your filters are within the matrix, notably through numpy argsort and numpy roll over an axis. First you roll axes until your filter columns come first, then you sort on them and slice the array vertically to get the rest of the matrix.
In general, if a for loop can be avoided in Python, it is better to avoid it.
Update:
Here is the full code without a for loop:
import numpy as np
# select filtering indexes
filter_indexes = [1, 3]
# generate the test data
raw_data = np.random.randint(0, 4, size=(50,5))
# create a column that we would use for indexing
index_columns = raw_data[:, filter_indexes]
# sort the index columns in lexicographic order over all the indexing columns
argsorts = np.lexsort(index_columns.T)
# sort both the index and the data column
sorted_index = index_columns[argsorts, :]
sorted_data = raw_data[argsorts, :]
# in each indexing column, find if number in row and row-1 are identical
# then group to check if all numbers in corresponding positions in row and row-1 are identical
autocorrelation = np.all(sorted_index[1:, :] == sorted_index[:-1, :], axis=1)
# find out the breakpoints: these are the positions where row and row-1 are not identical
breakpoints = np.nonzero(np.logical_not(autocorrelation))[0]+1
# finally find the desired subsets
subsets = np.split(sorted_data, breakpoints)
An alternative implementation would be to transform the indexing matrix into a string matrix, concatenate row-wise into a single key column, argsort over that now-unique key column, and split as above; see the sketch below.
For convenience, it might be more interesting to first roll the indexing columns to the beginning of the matrix, so that the sorting done above is clearer.
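A possible reading of that string-key alternative, reusing index_columns, raw_data and numpy as np from the snippet above (the '-' separator is an arbitrary choice):
# Build one string key per row, then sort and split where the key changes
keys = np.array(['-'.join(map(str, row)) for row in index_columns])
argsorts = np.argsort(keys)
sorted_keys = keys[argsorts]
sorted_data = raw_data[argsorts, :]
breakpoints = np.nonzero(sorted_keys[1:] != sorted_keys[:-1])[0] + 1
subsets = np.split(sorted_data, breakpoints)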
Something like this?
variable_mask = np.ones_like(filterColumns[0], dtype=bool)  # select all rows initially
for column, val in zip(filterColumns, combination):
    variable_mask &= (column == val)
subset = raw_data[variable_mask]