SettingWithCopyWarning when I try to add a new column to a DataFrame - python

Not entirely sure what the problem is here.
When I run the code below I get the following warning. Why is this the case, and how can it be fixed? Thanks
:18: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  city['average'] = np.mean(city.transaction_value)
Here is the code.
import pandas as pd
import numpy as np
# Create DataFrame
city = ['Paris', 'Paris', 'Paris', 'London', 'London', 'London', 'New York', 'New York', 'New York']
transaction = [100, 90, 40, 100, 110, 40, 150, 200, 100]
df = pd.DataFrame(list(zip(city, transaction)), columns=['city', 'transaction_value'])
# Create new DataFrame to work with
transactions = df.loc[:, ['city', 'transaction_value']]
city_averages = pd.DataFrame()
city_averages
for i in transactions['city'].unique():
    city = transactions[transactions['city'] == i]
    city['average'] = np.mean(city.transaction_value)
    city_averages = city_averages.append(city)
city_averages
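For what it's worth, a minimal sketch of two common fixes, assuming the goal is a per-city average column: either take an explicit copy of the slice before assigning to it, or drop the loop entirely and let groupby().transform() broadcast the group mean back onto every row (note that DataFrame.append is deprecated in newer pandas, so the copy variant uses pd.concat):

# Fix 1: the slice is explicitly copied, so the assignment
# no longer targets a view of the original frame
city_averages = pd.DataFrame()
for i in transactions['city'].unique():
    city = transactions[transactions['city'] == i].copy()
    city['average'] = np.mean(city.transaction_value)
    city_averages = pd.concat([city_averages, city])

# Fix 2: no loop at all -- transform returns a Series aligned
# with the original rows
transactions['average'] = transactions.groupby('city')['transaction_value'].transform('mean')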

Related

Add a calculated column to a pivot table in pandas

Hi, I am trying to add new columns to a multi-indexed pandas pivot table to do a COUNTIF-style calculation (similar to Excel), depending on whether a level of the index contains a specific string. This is the sample data:
df = pd.DataFrame({'City': ['Houston', 'Austin', 'Hoover', 'Adak', 'Denver', 'Houston', 'Adak', 'Denver'],
                   'State': ['Texas', 'Texas', 'Alabama', 'Alaska', 'Colorado', 'Texas', 'Alaska', 'Colorado'],
                   'Name': ['Aria', 'Penelope', 'Niko', 'Susan', 'Aria', 'Niko', 'Aria', 'Niko'],
                   'Unit': ['Sales', 'Marketing', 'Operations', 'Sales', 'Operations', 'Operations', 'Sales', 'Operations'],
                   'Assigned': ['Yes', 'No', 'Maybe', 'No', 'Yes', 'Yes', 'Yes', 'Yes']},
                  columns=['City', 'State', 'Name', 'Unit', 'Assigned'])
pivot = df.pivot_table(index=['City', 'State'], columns=['Name', 'Unit'], values=['Assigned'],
                       aggfunc=lambda x: ', '.join(set(x)), fill_value='')
and this is the desired output (shown in the screenshot). Thanks in advance!
Try:
temp = pivot[('Assigned', 'Aria', 'Sales')].str.len() > 0
pivot['new col'] = temp.astype(int)
the result:
Based on your edit:
import numpy as np
temp = pivot.xs('Sales', level=2, drop_level=False, axis=1).apply(
    lambda x: np.sum([1 if y != '' else 0 for y in x]), axis=1)
pivot[('', 'total sales', 'count how many...')] = temp
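As a side note, the same count can be computed without apply, since fill_value='' makes "non-empty" a simple comparison (a sketch, assuming the column levels shown above):

# Boolean frame: True where a Sales cell is non-empty, summed row-wise
temp = (pivot.xs('Sales', level=2, axis=1) != '').sum(axis=1)
pivot[('', 'total sales', 'count how many...')] = temp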

Faster way to iterate over columns in pandas

I have the following task.
I have this data:
import pandas
import numpy as np
data = {'name': ['Todd', 'Chris', 'Jackie', 'Ben', 'Richard', 'Susan', 'Joe', 'Rick'],
        'phone': [912341.0, np.nan, 912343.0, np.nan, 912345.0, 912345.0, 912347.0, np.nan],
        ' email': ['todd@gmail.com', 'chris@gmail.com', np.nan, 'ben@gmail.com', np.nan, np.nan, 'joe@gmail.com', 'rick@gmail.com'],
        'most_visited_airport': ['Heathrow', 'Beijing', 'Heathrow', np.nan, 'Tokyo', 'Beijing', 'Tokyo', 'Heathrow'],
        'most_visited_place': ['Turkey', 'Spain', np.nan, 'Germany', 'Germany', 'Spain', np.nan, 'Spain']
        }
df = pandas.DataFrame(data)
What I have to do: for every feature column (most_visited_airport etc.) and each of its values (Heathrow, Beijing, Tokyo), extract the matching personal information and output it to a file.
E.g. if we look at most_visited_airport and Heathrow,
I need to output three files containing the names, emails and phones of the people who visited that airport the most.
Currently, I have this code to do the operation for both columns and all the values:
columns_to_iterate = [x for x in df.columns if 'most' in x]
for each in df[columns_to_iterate]:
    values = df[each].dropna().unique()
    for i in values:
        df1 = df.loc[df[each] == i, 'name']
        df2 = df.loc[df[each] == i, ' email']
        df3 = df.loc[df[each] == i, 'phone']
        df1.to_csv(f'{each}_{i}_{df1.name}.csv')
        df2.to_csv(f'{each}_{i}_{df2.name}.csv')
        df3.to_csv(f'{each}_{i}_{df3.name}.csv')
Is it possible to do this in a more elegant and maybe faster way? Currently I have a small dataset, but I am not sure this code will perform well with big data. My particular concern is the nested loops.
Thank you in advance!
You could replace the call to unique with a groupby, which not only gets the unique values but also splits up the dataframe for you:
for column in df.filter(regex='^most'):
    for key, group in df.groupby(column):
        for attr in ('name', 'phone', ' email'):
            group[attr].dropna().to_csv(f'{column}_{key}_{attr}.csv')
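A nice side effect of groupby here is that rows where the key column is NaN are skipped by default, so the dropna()/unique() pair from the original code is not needed for the grouping itself.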
You can do it this way.
cols = df.filter(regex='most').columns.values

def func_current_cols_to_csv(most_col):
    place = df[most_col].dropna().unique().tolist()
    csv_cols = ['name', 'phone', ' email']
    result = [df[df[most_col] == i][j].dropna().to_csv(f'{most_col}_{i}_{j}.csv', index=False)
              for i in place for j in csv_cols]
    return result

[func_current_cols_to_csv(i) for i in cols]
Also, when writing to CSV you can keep the index, but do not forget to reset it before writing.
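For example, a minimal sketch of that reset (the column and filename are just illustrative):

# drop=True discards the old row labels instead of writing them as a column
subset = df.loc[df['most_visited_airport'] == 'Heathrow', 'name'].dropna()
subset.reset_index(drop=True).to_csv('most_visited_airport_Heathrow_name.csv', index=False)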

drop rows with multiple column values

I have a dataset where I have to drop rows that match any of several values in a column. I tried this, but do not know how to do it with multiple values:
import pandas as pd
df = pd.read_csv("data.csv")
new_df = df[df.location == 'New York']
new_df.count()
I also tried another method, but do not know how to do it with multiple values:
import pandas as pd
df = pd.read_csv("data.csv")
df.drop(df[df['location'] == 'New York'].index, inplace=True)
I have to delete the rows with the values New York, Boston, and Austin, and keep the remaining locations.
I also have to replace the values of the location column:
if San Francisco then change the value to 1, if Miami change it to 2, and so on, so that all values in location are replaced.
You can use the query method with a variable holding all the cities you want to filter out:
import numpy as np
import pandas as pd

np.random.seed(0)
cities = ['New York', 'Chicago', 'Miami']
data = pd.DataFrame(dict(cities=np.random.choice(cities, 10),
                         values=np.random.choice(10, 10)))
data.cities.unique()  # array(['New York', 'Chicago', 'Miami'], dtype=object)
filter = ['New York', 'Chicago']
data_filtered = data.query('cities not in @filter').copy()
data_filtered.cities.unique()  # array(['Miami'], dtype=object)
For the values, you can set them manually:
data_filtered.loc[data_filtered.cities == 'Miami', ['values']] = 2
I don't quite follow what you mean by dropping rows with multiple columns, but to check for multiple values you could use: new_df = df[df.location.isin(['New York', 'Boston'])]
You can try:
# Drop the rows with location "New York", "Boston", "Austin" (1)
df = df[~df["location"].isin(["New York", "Boston", "Austin"])]
# Replace locations with numbers: (2)
loc_map = {"San Francisco": 1, "Miami": 2, ...}
df["location"] = df["location"].map(loc_map)
For step (2), in case you have many values, you can create loc_map automatically by:
loc_map = {city: i + 1 for i, city in enumerate(df.location.unique())}
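Alternatively, if you prefer not to build loc_map by hand at all, pd.factorize yields the same kind of 1-based codes (a sketch, assuming first-appearance order is acceptable):

# factorize returns 0-based integer codes plus the array of unique values
codes, uniques = pd.factorize(df["location"])
df["location"] = codes + 1  # shift to 1-based, matching loc_map above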
Hope this helps.

Pandas Dataframe to String with separator

I want to turn a dataframe into a string.
This topic How to turn a pandas dataframe row into a comma separated string is close to what I want. The only problem with this solution: I have a column 'Country' with strings that contain the separator (for example, with this solution the dataframe is converted into a string, but 'United States' becomes 'United,States').
So currently I just have the following code:
df = df.to_string(index=False).split('\n')
df = [','.join(ele.split()) for ele in df]
df = '\r\n'.join(df)
df = df.encode('utf8')
but for a dataframe like this one:
data = [['United States', 10, 12], ['United Kingdom', 15, 25], ['France', 14, 18]]
df = pd.DataFrame(data, columns = ['Country', 'Number1', 'Number2'])
I will have
b'Country,Number1,Number2\r\nUnited,States,10,12\r\nUnited,Kingdom,15,25\r\nFrance,14,18'
Instead of:
b'Country,Number1,Number2\r\nUnited States,10,12\r\nUnited Kingdom,15,25\r\nFrance,14,18'
Currently I have solved the problem with a manual replace:
df = df.replace('United,States', 'United States')
But it is not a really good solution, because each time a new country with a space comes up I have to update the script.
(The final goal is to convert the dataframe into a UTF-8 string so I can compute its md5, without using df.to_csv() and computing the md5 of the created file; if you have a better way than this trick, it would also help me.)
Thanks!
data = [['United States', 10, 12], ['United Kingdom', 15, 25], ['France', 14, 18]]
df = pd.DataFrame(data, columns = ['Country', 'Number1', 'Number2'])
df = df.to_csv(header=None, index=False).strip('\n').split('\n')
df_string = '\r\n'.join(df) # <= this is the string that you can use with md5
df_bytes = df_string.encode('utf8') # <= this is bytes object to write the file
print(df_bytes)
Use df_string for md5 and df_bytes to write the file.
df_bytes contains this:
b'United States,10,12\r\nUnited Kingdom,15,25\r\nFrance,14,18'
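To close the loop on the md5 goal, a minimal sketch using the standard library (md5 operates on bytes, so hash the encoded bytes object):

import hashlib

# df_bytes is the utf-8 encoded string from above
checksum = hashlib.md5(df_bytes).hexdigest()
print(checksum)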
Variant without sending it to csv:
import pandas as pd

data = [['United States', 10, 12], ['United Kingdom', 15, 25], ['France', 14, 18]]
df = pd.DataFrame(data, columns=['Country', 'Number1', 'Number2'])
# temporarily replace spaces so the whitespace split below keeps country names whole
df['Country'] = df['Country'].str.replace(' ', '_')
df = df.to_string(index=False).split('\n')
df = [','.join(ele.split()) for ele in df]
# restore the spaces
df = [element.replace('_', ' ') for element in df]
df = '\r\n'.join(df)
df = df.encode('utf8')
df

Combining Masking and Indexing in Pandas

Consider the following data frame :
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
pop = pd.Series(population_dict)
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
data = pd.DataFrame({'area': area, 'pop': pop})
data['density'] = data['pop'] / data['area']  # density is used in the queries below
I can perform masking and indexing on columns in the same line as follows :
In [492]:data.loc[data.density > 100, ['pop', 'density']]
Out[492]:
pop density
New York 19651127 139.076746
Florida 19552860 114.806121
But what if I need to do this masking and indexing on rows? Something like data.loc[data.density > 100, ['New York']]. This statement obviously gives an error.
If you just want to extract information, chaining loc works just fine:
data[data.density > 100].loc[['New York']]
Output:
area pop density
New York 141297 19651127 139.076746
Try using:
data2 = data.loc[data.density > 100, ['pop', 'density']]
print(data2.loc[data2.index == 'New York'])
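Both approaches work; if you want it in a single loc call, you can also combine the value mask with a mask on the index itself (a sketch, assuming the row label 'New York'):

# data.index == 'New York' is a boolean array aligned with the rows
print(data.loc[(data.density > 100) & (data.index == 'New York')])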
