Pandas DataFrame creation throws ValueError in loop

I have a nested dictionary (stats) which I'm trying to convert into a pandas DataFrame. When I run the code below, I get the desired result:
BAL_sp = pd.DataFrame(data = stats['sp']['Orioles'])
However, I need to do this 30 times and then concatenate the results. When I run a for loop, I get ValueError: DataFrame constructor not properly called! I don't understand this, because the loop clearly recognizes the key in stats as valid:
team_dict = {'LAA': 'Angels', 'ARI': 'Diamondbacks', 'BAL': 'Orioles', 'BOS': 'Red Sox', 'CHC': 'Cubs', 'CIN': 'Reds',
             'CLE': 'Indians', 'COL': 'Rockies', 'DET': 'Tigers', 'HOU': 'Astros', 'KC': 'Royals', 'LAD': 'Dodgers',
             'WSH': 'Nationals', 'NYM': 'Mets', 'OAK': 'Athletics', 'PIT': 'Pirates', 'SD': 'Padres', 'SEA': 'Mariners',
             'SF': 'Giants', 'STL': 'Cardinals', 'TB': 'Rays', 'TEX': 'Rangers', 'TOR': 'Blue Jays', 'MIN': 'Twins',
             'PHI': 'Phillies', 'ATL': 'Braves', 'CWS': 'White Sox', 'MIA': 'Marlins', 'NYY': 'Yankees', 'MIL': 'Brewers'}

frames = []
for team in team_dict.values():
    temp = pd.DataFrame(data=stats['sp'][team])
    frames.append(temp)
sp_df = pd.concat(frames)
It doesn't throw an error if I do data = [stats['sp'][team]], but that does not produce the desired result. Thank you for any help.
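A hedged diagnostic sketch (mine, not from the thread), assuming stats['sp'] maps team names to dicts or lists of records: pd.DataFrame raises exactly "DataFrame constructor not properly called!" when it is handed a plain string or other scalar, so checking the type of each entry inside the loop should pinpoint the offending team:
frames = []
for team in team_dict.values():
    payload = stats['sp'].get(team)
    if not isinstance(payload, (dict, list)):
        # a string or scalar here is what triggers
        # "ValueError: DataFrame constructor not properly called!"
        print(f"{team}: unexpected type {type(payload).__name__}")
        continue
    frames.append(pd.DataFrame(data=payload))
sp_df = pd.concat(frames)
This would also explain why data = [stats['sp'][team]] silences the error: a one-element list is always a valid data argument, it just builds a one-row frame rather than the desired result.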

Related

SettingWithCopyWarning when I try to add a new column to a DataFrame

Not entirely sure what the problem here is.
When I run the code below I get the following warning. Why is this the case, and how can it be fixed? Thanks.
:18: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  city['average'] = np.mean(city.transaction_value)
Here is the code.
import pandas as pd
import numpy as np

# Create DataFrame
city = ['Paris', 'Paris', 'Paris', 'London', 'London', 'London', 'New York', 'New York', 'New York']
transaction = [100, 90, 40, 100, 110, 40, 150, 200, 100]
df = pd.DataFrame(list(zip(city, transaction)), columns=['city', 'transaction_value'])

# Create new DataFrame to work with
transactions = df.loc[:, ['city', 'transaction_value']]
city_averages = pd.DataFrame()
city_averages

for i in transactions['city'].unique():
    city = transactions[transactions['city'] == i]
    city['average'] = np.mean(city.transaction_value)
    city_averages = city_averages.append(city)
city_averages
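A hedged sketch of two common fixes, assuming the goal is a per-city average column: take an explicit .copy() of each slice so the assignment targets an independent frame, or drop the loop and let groupby/transform broadcast the group mean (DataFrame.append is deprecated in recent pandas, so the sketch uses pd.concat):
# Fix 1: copy the slice; the new column is then set on a real frame, not a view
city_averages = pd.DataFrame()
for i in transactions['city'].unique():
    subset = transactions[transactions['city'] == i].copy()
    subset['average'] = subset['transaction_value'].mean()
    city_averages = pd.concat([city_averages, subset])

# Fix 2: no loop at all; transform repeats each group's mean on every row
transactions = transactions.copy()  # work on an independent frame
transactions['average'] = transactions.groupby('city')['transaction_value'].transform('mean')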

Faster way to iterate over columns in pandas

I have the following task.
I have this data:
import pandas
import numpy as np

data = {'name': ['Todd', 'Chris', 'Jackie', 'Ben', 'Richard', 'Susan', 'Joe', 'Rick'],
        'phone': [912341.0, np.nan, 912343.0, np.nan, 912345.0, 912345.0, 912347.0, np.nan],
        ' email': ['todd#gmail.com', 'chris#gmail.com', np.nan, 'ben#gmail.com', np.nan, np.nan, 'joe#gmail.com', 'rick#gmail.com'],
        'most_visited_airport': ['Heathrow', 'Beijing', 'Heathrow', np.nan, 'Tokyo', 'Beijing', 'Tokyo', 'Heathrow'],
        'most_visited_place': ['Turkey', 'Spain', np.nan, 'Germany', 'Germany', 'Spain', np.nan, 'Spain']}
df = pandas.DataFrame(data)
What I have to do: for every feature column (most_visited_airport etc.) and each of its values (Heathrow, Beijing, Tokyo), generate the personal information of the matching people and output it to a file.
E.g. if we look at most_visited_airport and Heathrow,
I need to output three files containing the names, emails and phones of the people who visited that airport the most.
Currently, I have this code to do the operation for both columns and all the values:
columns_to_iterate = [x for x in df.columns if 'most' in x]
for each in df[columns_to_iterate]:
    values = df[each].dropna().unique()
    for i in values:
        df1 = df.loc[df[each] == i, 'name']
        df2 = df.loc[df[each] == i, ' email']
        df3 = df.loc[df[each] == i, 'phone']
        df1.to_csv(f'{each}_{i}_{df1.name}.csv')
        df2.to_csv(f'{each}_{i}_{df2.name}.csv')
        df3.to_csv(f'{each}_{i}_{df3.name}.csv')
Is it possible to do this in a more elegant and maybe faster way? Currently I have a small dataset, but I'm not sure this code will perform well with big data. My particular concern is the nested loops.
Thank you in advance!
You could replace the call to unique with a groupby, which would not only get the unique values, but split up the dataframe for you:
for column in df.filter(regex='^most'):
    for key, group in df.groupby(column):
        for attr in ('name', 'phone', ' email'):  # note the leading space in ' email'
            group[attr].dropna().to_csv(f'{column}_{key}_{attr}.csv')
You can do it this way:
cols = df.filter(regex='most').columns.values

def func_current_cols_to_csv(most_col):
    place = [i for i in df[most_col].dropna().unique().tolist()]
    csv_cols = ['name', 'phone', ' email']
    result = [df[df[most_col] == i][j].dropna().to_csv(f'{most_col}_{i}_{j}.csv', index=False)
              for i in place for j in csv_cols]
    return result

[func_current_cols_to_csv(i) for i in cols]
Also, when writing to csv you can keep the index, but don't forget to reset it before writing.
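For instance, a hedged one-off illustration of that reset note (column and value picked arbitrarily from the question's data):
# renumber the rows 0..n-1 before writing, so the csv index is clean
out = df[df['most_visited_airport'] == 'Heathrow']['name'].dropna().reset_index(drop=True)
out.to_csv('most_visited_airport_Heathrow_name.csv')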

Style rows of a dataframe based on column value

I have a csv file that I'm trying to read into a dataframe and style in jupyter notebook. The csv file data is:
[[' ', 'Name', 'Title', 'Date', 'Transaction', 'Price', 'Shares', '$ Value'],
[0, 'Sneed Michael E', 'EVP, Global Corp Aff & COO', 'Dec 09', 'Sale', 152.93, 54662, 8359460],
[1, 'Wengel Kathryn E', 'EVP, Chief GSC Officer', 'Sep 02', 'Sale', 153.52, 16115, 2473938],
[2, 'McEvoy Ashley', 'EVP, WW Chair, Medical Devices', 'Jul 28', 'Sale', 147.47, 29000, 4276630],
[3, 'JOHNSON & JOHNSON', '10% Owner', 'Jun 30', 'Buy', 17.00, 725000, 12325000]]
My goal is to style the background color of the rows so that the row is colored green if the Transaction column value is 'Buy', and red if the Transaction column value is 'Sale'.
The code I've tried is:
import pandas as pd

data = pd.read_csv('/Users/broderickbonelli/Desktop/insider.csv', index_col='Unnamed: 0')

def red_or_green():
    if data.Transaction == 'Sale':
        return ['background-color: red']
    else:
        return ['background-color: green']

data.style.apply(red_or_green, axis=1)
display(data)
When I run the code it outputs an unstyled spreadsheet without giving me an error:
[screenshot: unstyled DataFrame output]
I'm not really sure what I'm doing wrong; I've tried it a number of different ways but can't seem to make it work. Any help would be appreciated!
If you want to compare the entire row when the condition matches, the following is faster than apply on axis=1, since we style the entire dataframe in one shot (this needs numpy imported, and the question's frame is named data rather than df):
import numpy as np

def red_or_green(dataframe):
    c = dataframe['Transaction'] == 'Sale'
    a = np.where(np.repeat(c.to_numpy()[:, None], dataframe.shape[1], axis=1),
                 'background-color: red', 'background-color: green')
    return pd.DataFrame(a, columns=dataframe.columns, index=dataframe.index)

data.style.apply(red_or_green, axis=None)  #.to_excel(.....)
Try (defining the function before it is used):
def red_or_green(transaction):
    if transaction == 'Sale':
        return 'red'
    else:
        return 'green'

data['background-color'] = data.apply(lambda x: red_or_green(x.Transaction), axis=1)
or you can use map:
data['background-color'] = data.Transaction.map(red_or_green)
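Note this records a colour name in a new column rather than styling the sheet. A hedged sketch of the remaining step (mine, not from the answers), feeding per-row colours through the Styler:
# return one CSS string per cell so the whole row gets the colour
def colour_rows(row):
    colour = 'red' if row['Transaction'] == 'Sale' else 'green'
    return [f'background-color: {colour}'] * len(row)

data.style.apply(colour_rows, axis=1)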

Combining Masking and Indexing in Pandas

Consider the following data frame:
import pandas as pd

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
pop = pd.Series(population_dict)
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
data = pd.DataFrame({'area': area, 'pop': pop})
data['density'] = data['pop'] / data['area']  # density column used in the masking below
I can perform masking and indexing on columns in the same line as follows:
In [492]: data.loc[data.density > 100, ['pop', 'density']]
Out[492]:
               pop     density
New York  19651127  139.076746
Florida   19552860  114.806121
But what if I need to do this masking together with indexing on rows? Something like:
data.loc[data.density > 100, ['New York']]
But this statement obviously gives an error, because loc's second argument selects columns, not row labels.
If you just want to extract information, chaining loc works just fine:
data[data.density > 100].loc[['New York']]
Output:
            area       pop     density
New York  141297  19651127  139.076746
Try using:
data2 = data.loc[data.density > 100, ['pop', 'density']]
print(data2.loc[data2.index == 'New York'])
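A further hedged variant (not from the original answers): the row-label test is itself just a boolean array, so both conditions can be combined into a single mask:
mask = (data.density > 100) & data.index.isin(['New York'])
print(data.loc[mask, ['pop', 'density']])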

pandas concat successful but error message following stops loop

I'm getting the error message
ValueError: No objects to concatenate
when using pd.concat. The concat appears to be successful, because when I print the resulting dataframe it prints fine, but the error message still terminates the loop.
state_list = ['Colorado', 'Ilinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Michigan', 'Minnesota', 'Missouri', 'Nebraska',
              'North Carolina', 'North Dakota', 'Ohio', 'Pennsylvania', 'South Dakota', 'Tennessee', 'Texas', 'Wisconsin']

for state_name in state_list:
    ### DF1 is a dataframe unique to each state ###
    condition_categories = df1['description'].unique()
    cats = []
    for cat in condition_categories:
        category_df = df1[['week', 'value']].where(df1['description'] == cat).dropna()
        category_df = category_df.set_index('week')
        category_df = category_df.rename(columns={'value': str(cat)})
        category_df.week = dtype=np.int
        cats.append(category_df)
        #print(category_df)
    df = pd.concat(cats, axis=1)
    print(df)
Sorry for the late response. It looks like there is an issue with the cats list: it is empty in one of the iterations of the outer for loop.
You can add a condition just above the concatenation line, like below; that should work:
if len(cats) > 0:
    df = pd.concat(cats, axis=1)
else:
    print("No records to concatenate")
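A hedged sketch of that guard in context, also reporting which state produced no records (assuming, as the question's comment says, that df1 is rebuilt for each state):
for state_name in state_list:
    # ... build cats for this state, as in the question ...
    if cats:  # an empty list would make pd.concat raise ValueError
        df = pd.concat(cats, axis=1)
        print(df)
    else:
        print(f"No records to concatenate for {state_name}")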
