I have a dataset where I have to drop rows that match any of several values in a column. I tried this, but I don't know how to do it with multiple values:
import pandas as pd
df = pd.read_csv("data.csv")
new_df = df[df.location == 'New York']
new_df.count()
I also tried another method, but again I do not know how to do it with multiple values:
import pandas as pd
df = pd.read_csv("data.csv")
df.drop(df[df['location'] == 'New York'].index, inplace=True)
I want to delete the rows with the values New York, Boston, and Austin, and keep the remaining locations.
I also have to replace the values of a column:
if San Francisco, change the value to 1; if Miami, change it to 2, and so on, so that all values in location are replaced.
You can use the query method with a variable holding all the cities you want to filter out:
import numpy as np
import pandas as pd

np.random.seed(0)
cities = ['New York', 'Chicago', 'Miami']
data = pd.DataFrame(dict(cities=np.random.choice(cities, 10),
                         values=np.random.choice(10, 10)))
data.cities.unique()  # array(['New York', 'Chicago', 'Miami'], dtype=object)
filter = ['New York', 'Chicago']
data_filtered = data.query('cities not in @filter').copy()
data_filtered.cities.unique()  # array(['Miami'], dtype=object)
For the replacement values, you can set them manually with .loc:
data_filtered.loc[data_filtered.cities == 'Miami', ['values']] = 2
I don't quite follow what you mean by dropping rows with multiple columns, but to check for multiple values you could use: new_df = df[df.location.isin(['New York', 'Boston'])]
You can try:
# Drop the rows with location "New York", "Boston", "Austin" (1)
df = df[~df["location"].isin(["New York", "Boston", "Austin"])]
# Replace locations with numbers: (2)
loc_map = {"San Francisco": 1, "Miami": 2, ...}
df["location"] = df["location"].map(loc_map)
For step (2), in case you have many values, you can create loc_map automatically by:
loc_map = {city: i + 1 for i, city in enumerate(df.location.unique())}
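Alternatively, a minimal sketch using pandas' built-in factorize, which assigns integer codes in order of appearance (the same ordering the dict comprehension above produces):
# pd.factorize returns (codes, uniques); codes start at 0, so add 1 to start at 1
df["location"] = pd.factorize(df["location"])[0] + 1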
Hope this helps.
Not entirely sure what the problem here is?
When I run the code below I get the following warning. Why is this the case, and how is it fixed? Thanks
:18: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
city['average'] = np.mean(city.transaction_value)
Here is the code.
import pandas as pd
import numpy as np
# Create DataFrame
city = ['Paris', 'Paris', 'Paris', 'London', 'London', 'London', 'New York', 'New York', 'New York']
transaction = [100, 90, 40, 100, 110, 40, 150, 200, 100]
df = pd.DataFrame(list(zip(city, transaction)), columns=['city', 'transaction_value'])
# Create new DataFrame to work with
transactions = df.loc[:, ['city', 'transaction_value']]
city_averages = pd.DataFrame()
city_averages
for i in transactions['city'].unique():
    city = transactions[transactions['city'] == i]
    city['average'] = np.mean(city.transaction_value)
    city_averages = city_averages.append(city)
city_averages
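One way to make the warning go away, sketched on the question's own loop rather than offered as the only fix, is to take an explicit copy of each slice before assigning to it:
for i in transactions['city'].unique():
    # .copy() makes `city` an independent DataFrame, so the assignment
    # below writes to it directly instead of to a view of `transactions`
    city = transactions[transactions['city'] == i].copy()
    city['average'] = np.mean(city.transaction_value)
    city_averages = city_averages.append(city)
A groupby would also avoid the loop (and the warning) entirely: df['average'] = df.groupby('city')['transaction_value'].transform('mean')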
Hi, I am trying to add new columns to a multi-indexed pandas pivot table to do a countif (similar to Excel) depending on whether a level of the index contains a specific string. This is the sample data:
df = pd.DataFrame({'City': ['Houston', 'Austin', 'Hoover', 'Adak', 'Denver', 'Houston', 'Adak', 'Denver'],
                   'State': ['Texas', 'Texas', 'Alabama', 'Alaska', 'Colorado', 'Texas', 'Alaska', 'Colorado'],
                   'Name': ['Aria', 'Penelope', 'Niko', 'Susan', 'Aria', 'Niko', 'Aria', 'Niko'],
                   'Unit': ['Sales', 'Marketing', 'Operations', 'Sales', 'Operations', 'Operations', 'Sales', 'Operations'],
                   'Assigned': ['Yes', 'No', 'Maybe', 'No', 'Yes', 'Yes', 'Yes', 'Yes']},
                  columns=['City', 'State', 'Name', 'Unit', 'Assigned'])
pivot = df.pivot_table(index=['City', 'State'], columns=['Name', 'Unit'], values=['Assigned'],
                       aggfunc=lambda x: ', '.join(set(x)), fill_value='')
and this is the desired output (in screenshot). Thanks in advance!
try:
temp = pivot[('Assigned', 'Aria', 'Sales')].str.len() > 0
pivot['new col'] = temp.astype(int)
the result:
Based on your edit:
import numpy as np
temp = pivot.xs('Sales', level=2, drop_level=False, axis=1).apply(lambda x: np.sum([1 if y != '' else 0 for y in x]), axis=1)
pivot[('', 'total sales', 'count how many...')] = temp
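An equivalent, slightly more direct way to get the same count (a sketch; it relies on the fill_value='' used when building the pivot):
# comparing to '' gives a boolean frame; summing across columns counts the non-empty cells
temp = (pivot.xs('Sales', level=2, drop_level=False, axis=1) != '').sum(axis=1)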
I have the following task.
I have this data:
import pandas
import numpy as np
data = {'name': ['Todd', 'Chris', 'Jackie', 'Ben', 'Richard', 'Susan', 'Joe', 'Rick'],
        'phone': [912341.0, np.nan, 912343.0, np.nan, 912345.0, 912345.0, 912347.0, np.nan],
        ' email': ['todd@gmail.com', 'chris@gmail.com', np.nan, 'ben@gmail.com', np.nan, np.nan, 'joe@gmail.com', 'rick@gmail.com'],
        'most_visited_airport': ['Heathrow', 'Beijing', 'Heathrow', np.nan, 'Tokyo', 'Beijing', 'Tokyo', 'Heathrow'],
        'most_visited_place': ['Turkey', 'Spain', np.nan, 'Germany', 'Germany', 'Spain', np.nan, 'Spain']
        }
df = pandas.DataFrame(data)
What I have to do is, for every feature column (most_visited_airport etc.) and each of its values (Heathrow, Beijing, Tokyo), extract the matching personal information and output it to a file.
E.g., if we look at most_visited_airport and Heathrow,
I need to output three files containing the names, emails and phones of the people whose most visited airport is Heathrow.
Currently, I have this code to do the operation for both columns and all the values:
columns_to_iterate = [x for x in df.columns if 'most' in x]
for each in df[columns_to_iterate]:
    values = df[each].dropna().unique()
    for i in values:
        df1 = df.loc[df[each] == i, 'name']
        df2 = df.loc[df[each] == i, ' email']
        df3 = df.loc[df[each] == i, 'phone']
        df1.to_csv(f'{each}_{i}_{df1.name}.csv')
        df2.to_csv(f'{each}_{i}_{df2.name}.csv')
        df3.to_csv(f'{each}_{i}_{df3.name}.csv')
Is it possible to do this in a more elegant and maybe faster way? Currently I have a small dataset, but I am not sure this code will perform well with big data. My particular concern is the nested loops.
Thank you in advance!
You could replace the call to unique with a groupby, which would not only get the unique values, but split up the dataframe for you:
for column in df.filter(regex='^most'):
    for key, group in df.groupby(column):
        for attr in ('name', 'phone', ' email'):
            group[attr].dropna().to_csv(f'{column}_{key}_{attr}.csv')
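Note that the email column in the sample data is literally named ' email' with a leading space, so the attribute tuple has to match it exactly; stripping column names up front with df.columns = df.columns.str.strip() may be the cleaner move. The groupby also splits the frame once per column instead of building a fresh boolean mask for every column-value pair.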
You can do it this way.
cols = df.filter(regex='most').columns.values

def func_current_cols_to_csv(most_col):
    place = [i for i in df[most_col].dropna().unique().tolist()]
    csv_cols = ['name', 'phone', ' email']
    result = [df[df[most_col] == i][j].dropna().to_csv(f'{most_col}_{i}_{j}.csv', index=False)
              for i in place for j in csv_cols]
    return result

[func_current_cols_to_csv(i) for i in cols]
Also, among the options when writing to CSV you can keep the index, but do not forget to reset it before writing.
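For instance, a sketch of that variant, keeping a fresh 0-based index in the output file:
# reset_index(drop=True) renumbers the rows before the index is written out
df[df[most_col] == i][j].dropna().reset_index(drop=True).to_csv(f'{most_col}_{i}_{j}.csv', index=True)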
The class is composed of a set of attributes and functions including:
Attributes:
df : a pandas dataframe.
numerical_feature_names: df columns with a numeric value.
label_column_names: df string columns to be grouped.
Functions:
mean(nums): takes a list of numbers as input and returns the mean
fill_na(df, numerical_feature_names, label_column_names): takes class attributes as inputs and returns a transformed df.
And here's the class:
class PLUMBER():
    def __init__(self):
        ################# attributes ################
        self.df = df
        # specify label and numerical features names:
        self.numerical_feature_names = numerical_feature_names
        self.label_column_names = label_column_names

    ##################### mean ##############################
    def mean(self, nums):
        total = 0.0
        for num in nums:
            total = total + num
        return total / len(nums)

    ############ fill the numerical features ##################
    def fill_na(self, df, numerical_feature_names, label_column_names):
        # declaring parameters:
        df = self.df
        numerical_feature_names = self.numerical_feature_names
        label_column_names = self.label_column_names
        # now replacing NaN with group mean
        for numerical_feature_name in numerical_feature_names:
            df[numerical_feature_name] = df.groupby([label_column_names]).transform(lambda x: x.fillna(self.mean(x)))
        return df
When trying to apply it to a pandas df:
if __name__ == "__main__":
    # initialize class
    plumber = PLUMBER()
    # replace NaN with group mean
    df = plumber.fill_na(df=df, numerical_feature_names=numerical_feature_names, label_column_names=label_column_names)
The following error arises:
ValueError: Grouper and axis must be same length
Data and class parameters:
import numpy as np
import pandas as pd

d = {'month': ['01/01/2020', '01/02/2020', '01/03/2020', '01/01/2020', '01/02/2020', '01/03/2020'],
     'country': ['Japan', 'Japan', 'Japan', 'Poland', 'Poland', 'Poland'],
     'level': ['A01', 'A01', 'A01', 'A00', 'A00', 'A00'],
     'job title': ['Insights Manager', 'Insights Manager', 'Insights Manager', 'Sales Director', 'Sales Director', 'Sales Director'],
     'number': [np.nan, 450, 299, np.nan, 19, 29],
     'age': [np.nan, 30, 28, np.nan, 29, 18]}
df = pd.DataFrame(d)
# headers
column_names=df.columns.values.tolist()
column_names = [column_name.strip() for column_name in column_names]
# label_column_names (to be grouped)
label_column_names=['country', 'level', 'job title']
# numerical_features:
numerical_feature_names = [x for x in column_names if x not in label_column_names]
numerical_feature_names.remove('month')
How could I change the class in order to get the transformed df (i.e. the one that replaces np.nan with its group mean)?
First, the error occurs because label_column_names is already a list, so in the groupby you don't need the [] around it: it should be df.groupby(label_column_names)... instead of df.groupby([label_column_names])...
Now, to actually solve your problem, in the fill_na method of your class, replace the for loop (you don't actually need it) with:
df[numerical_feature_names] = (
    df[numerical_feature_names]
    .fillna(
        df.groupby(label_column_names)
        [numerical_feature_names].transform('mean')
    )
)
in which you fillna the columns numerical_feature_names with the result of the groupby.transform of the mean of these columns.
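Put together with the sample data above, a minimal sketch of the repaired logic (shown outside the class for brevity):
# group by the label columns and fill each numeric column with its group mean
df[numerical_feature_names] = df[numerical_feature_names].fillna(
    df.groupby(label_column_names)[numerical_feature_names].transform('mean')
)
print(df)  # the np.nan entries in 'number' and 'age' now hold their group means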
I have a dataframe that is dynamically created.
I create my first set of rows as:
df['tourist_spots'] = pd.Series(<A list of tourist spots in a city>)
To this df I add:
df['city'] = <City Name>
So far so good. A bunch of rows are created with the same city name for multiple tourist spots.
I want to add a new city. So I do:
df['tourist_spots'].append(pd.Series(<new data>))
Now, when I append a new city with:
df['city'].append('new city')
the previously updated city data is gone. It is as if the rows are replaced each time rather than appended.
Here's an example of what I want:
Step 1:
df['tourist_spot'] = pd.Series('Golden Gate Bridge' + a bunch of other spots)
For all the rows created by the above data I want:
df['city'] = 'San Francisco'
Step 2:
df['tourist_spot'].append(pd.Series('Times Square' + a bunch of other spots))
For all the rows created by the above data, I want:
df['city'] = 'New York'
How can I achieve this?
Use a dictionary to add rows to your DataFrame; it is a faster method.
Here is an example.
STEP 1
Create a list of dictionaries:
dict_df = [{'tourist_spots': 'Jones LLC', 'City': 'Boston'},
           {'tourist_spots': 'Alpha Co', 'City': 'Boston'},
           {'tourist_spots': 'Blue Inc', 'City': 'Singapore'}]
STEP 2
Convert the list of dictionaries to a DataFrame:
df = pd.DataFrame(dict_df)
STEP 3
Add new entries to the DataFrame in dictionary format:
df = df.append({'tourist_spots': 'New_Blue', 'City': 'Singapore'}, ignore_index=True)
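Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; an equivalent with pd.concat (a sketch) would be:
# wrap the new row in a one-row DataFrame, then concatenate
new_row = pd.DataFrame([{'tourist_spots': 'New_Blue', 'City': 'Singapore'}])
df = pd.concat([df, new_row], ignore_index=True)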
References:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_dict.html