I have a .csv file that when opened in Excel looks like this:
My code:
myfile = open("/Users/it/Desktop/Python/In-Class Programs/countries.csv", "rb")
countries = []
for item in myfile:
    a = item.split(",")
    countries.append(a)

hdi_list = []
for acountry in countries:
    hdi = acountry[3]
    try:
        hdi_list.append(float(hdi))
    except:
        pass

average = round(sum(hdi_list)/len(hdi_list), 2)
maxNumber = round(max(hdi_list), 2)
minNumber = round(min(hdi_list), 2)
This code works well. However, when I find the max, min, or average, I also need to grab the corresponding country name and print it.
How can I change my code to grab the country name for the min, max, and average as well?
Instead of putting the values straight into the list, use tuples, like this:
hdi_list.append((float(hdi), acountry[1]))
Then you can use this instead:
maxTuple = max(hdi_list)
maxNumber = round(maxTuple[0], 2)
maxCountry = maxTuple[1]
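min works the same way, since tuples compare by their first element:
minTuple = min(hdi_list)
minNumber = round(minTuple[0], 2)
minCountry = minTuple[1]
The average, of course, has no single corresponding country, so it stays a plain number.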
Using the pandas module, [4], [5], and [6] below show the max, min, and average respectively. Note that the data below doesn't match yours except for the country names.
In [1]: import pandas as pd
In [2]: df = pd.read_csv("hdi.csv")
In [3]: df
Out[3]:
         Country    HDI
0         Norway  83.27
1      Australia  80.77
2    Netherlands  87.00
3  United States  87.43
4    New Zealand  87.43
5         Canada  87.66
6        Ireland  75.47
7  Liechtenstein  88.97
8        Germany  86.31
9         Sweden  80.54
In [4]: df.loc[df["HDI"].idxmax()]
Out[4]:
Country    Liechtenstein
HDI                88.97
Name: 7, dtype: object
In [5]: df.loc[df["HDI"].idxmin()]
Out[5]:
Country    Ireland
HDI          75.47
Name: 6, dtype: object
In [6]: df["HDI"].mean()
Out[6]: 84.484999999999985
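Rounded to two decimals, as in your code:
In [7]: round(df["HDI"].mean(), 2)
Out[7]: 84.48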
Assuming both Liechtenstein and Germany have the max value:
In [15]: df
Out[15]:
         Country    HDI
0         Norway  83.27
1      Australia  80.77
2    Netherlands  87.00
3  United States  87.43
4    New Zealand  87.43
5         Canada  87.66
6        Ireland  75.47
7  Liechtenstein  88.97
8        Germany  88.97
9         Sweden  80.54
In [16]: df[df["HDI"] == df["HDI"].max()]
Out[16]:
         Country    HDI
7  Liechtenstein  88.97
8        Germany  88.97
The same logic can be applied for the minimum value.
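For example, filtering for the minimum the same way:
In [17]: df[df["HDI"] == df["HDI"].min()]
Out[17]:
   Country    HDI
6  Ireland  75.47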
The following approach is close enough to your implementation that I think it might be useful. However, if you start working with larger or more complicated csv files, you should look into the standard library's csv.reader or pandas (as previously mentioned); they are more robust and efficient for complex .csv data. You could also work with Excel files directly via the xlrd package.
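For reference, here is a minimal sketch of the csv-module route, using the same file path as above:
import csv

with open("/Users/it/Desktop/Python/In-Class Programs/countries.csv") as f:
    reader = csv.reader(f)  # handles splitting and quoted fields for you
    rows = list(reader)     # each row is already a list of fields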
In my opinion, the simplest way to keep country names tied to their values is to combine your for loops. Instead of looping through the data twice (in two separate for loops) and creating two separate lists, use a single for loop and build a dictionary per row with the relevant data (i.e. "Country Name", "Value of Interest"). You could also use a tuple (as previously mentioned), but I think dictionaries are more explicit.
myfile = open("/Users/it/Desktop/Python/In-Class Programs/countries.csv", "rb")
countries = []
for line in myfile:
    country_name = line.split(",")[1]
    value_of_interest = float(line.split(",")[3])
    countries.append(
        {"Country Name": country_name,
         "Value of Interest": value_of_interest})
ave_value = sum([country["Value of Interest"] for country in countries])/len(countries)
max_value = max([country["Value of Interest"] for country in countries])
min_value = min([country["Value of Interest"] for country in countries])
print "Country Average == ", ave_value
for country in countries:
if country["Value of Interest"] == max_value:
print "Max == {country}:{value}".format(country["Country Name"], country["Value of Interest"])
if country["Value of Interest"] == min_value:
print "Min == {country}:{value}".format(country["Country Name"], country["Value of Interest"])
Note that this method returns multiple countries if they have equal min/max values.
If you are dead-set on creating separate lists (like your current implementation), you might consider zip() to connect your lists (by index), where
zip(countries, hdi_list) == [(countries[0], hdi_list[0]), (countries[1], hdi_list[1]), ...]
For example:
for country in zip(countries, hdi_list):
    if country[1] == max_value:
        print country[0], country[1]
with similar logic applied to the min and average. This method works but is less explicit and more difficult to maintain.
I have a dataframe of people's addresses and names. I have a function that processes names, which I want to apply. I am creating sub-selections of people with matching addresses and applying the function to those groups.
So far I have been using .loc, as follows:
for x in df['address'].unique():
    sub_selection = df.loc[df['address'] == x]
    sub_selection.apply(lambda x: function(x), axis=1)
Is there a more efficient way to approach this? I am looking into pandas' .groupby() functionality, but I am struggling to get it to work:
df.groupby('address').agg(lambda x: function(x['names']))
Here is some sample data:
address, name, Unique_ID
1022 Boogie Woogie Ave, John Smith, np.nan
1022 Boogie Woogie Ave, Frederick Smith, np.nan
1022 Boogie Woogie Ave, John Jacob Smith, np.nan
3030 Sesame Street, Big Bird, np.nan
3030 Sesame Street, Elmo, np.nan
3030 Sesame Street, Big Yellow Bird, np.nan
My function itself has some moving parts, but basically I check the name against a reference dictionary I create. This process goes through a few other steps, but returns a list of indexes where the name matches. I use those indexes to assign a shared unique ID to matching names. In my example, Big Bird and Big Yellow Bird would match.
def function(x):
    match_list = []
    if x['name'] in __lookup_dict[0]:
        match_list.append(__lookup_dict[0][x['name']])
    # reduce all elements of the matching list to a single list of place ids matching all elements
    result = set(match_list[0])
    for s in match_list[1:]:
        if len(result.intersection(s)) != 0:
            result.intersection_update(s)
    # take the reduced lists and assign each place id a unique id.
    # note we are working with place ids, not the sub df's index; they don't match
    if pd.isnull(x['Unique_ID']):
        uid = str(uuid.uuid4())
        for g in result:
            df.at[df.index[df.index == g].tolist()[0], 'Unique_ID'] = uid
    else:
        pass
    return result
Try using
df.groupby('address').apply(lambda x: function(x['names']))
Edit:
Check this example; I've used a dataframe from another StackOverflow question:
import pandas as pd
df = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Mumbai", "Mumbai", "Lahore", "Lahore"],
    "Points": [90.1, 90.3, 94.1, 95, 89, 90.5],
    "Gender": ["Male", "Female", "Female", "Male", "Female", "Male"]
})
d = {k:v for v,k in enumerate(df.City.unique())}
df['idx'] = df['City'].replace(d)
print(df)
Output:
     City  Points  Gender  idx
0   Delhi    90.1    Male    0
1   Delhi    90.3  Female    0
2  Mumbai    94.1  Female    1
3  Mumbai    95.0    Male    1
4  Lahore    89.0  Female    2
5  Lahore    90.5    Male    2
So, try using
d = {k:v for v,k in enumerate(df['address'].unique())}
df['idx'] = df['address'].replace(d)
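For what it's worth, pandas can also build such group indices directly; a one-line alternative to the dictionary-and-replace approach is pd.factorize:
df['idx'] = pd.factorize(df['address'])[0]  # codes are assigned in order of first appearance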
I know how to get an iterable for all the rows in a selection, but I would like to add a GROUP BY statement to my selection and return a set of iterables instead, one for each group. That is, suppose I have a table like so:

id  name      nationality
1   Franz     Germany
2   Louis     France
3   Werner    Germany
4   Leonardo  Italy
5   Antoine   France
6   Marco     Italy
7   Nacho     Spain
and suppose I want to group by nationality; I would like a set (e.g. iterable) of iterables with the following rows: [1, 3], [2, 5], [4, 6], [7] (in no particular order). Currently I'm doing this manually using order_by on nationality and iterating over all the rows; is there a built-in way using sqlalchemy?
From what I've read, the group_by method only allows you to do what SQL GROUP BY does, which is somewhat limited - I'm looking for something like the pandas groupby which allows you to apply an arbitrary function to each group.
Not sure if this is what you were looking for:
# Import libraries
import pandas as pd
# Create DataFrame
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7],
    'name': ['Franz', 'Louis', 'Werner', 'Leonardo', 'Antoine', 'Marco', 'Nacho'],
    'nationality': ['Germany', 'France', 'Germany', 'Italy', 'France', 'Italy', 'Spain']
})
# Group
grouped = df.groupby(['nationality'])
# Iterate over Group
for name, group in grouped:
    print(name)
    print(group)
Output
France
   id     name nationality
1   2    Louis      France
4   5  Antoine      France
Germany
   id    name nationality
0   1   Franz     Germany
2   3  Werner     Germany
Italy
   id      name nationality
3   4  Leonardo       Italy
5   6     Marco       Italy
Spain
   id   name nationality
6   7  Nacho       Spain
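If you want to stay in sqlalchemy, the manual approach from the question (order_by plus iteration) can be packaged with itertools.groupby; a sketch, assuming a mapped Person class and an open session:
from itertools import groupby
from operator import attrgetter

# groupby only groups consecutive items, so order by the key first
people = session.query(Person).order_by(Person.nationality).all()
for nationality, group in groupby(people, key=attrgetter('nationality')):
    rows = list(group)  # one iterable of Person rows per nationality
    # apply an arbitrary function to each group here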
I have a dataframe with several columns of personal data for each row (person). I want to apply a function to look up each person's city or state in regional lists, and then apply the result to a new column "Region" in the same dataframe.
I have been able to make the same operation work with a very simplified dataframe with categories for colors and vehicles (see below). But when I try to do it with the personal data, it won't work the same way and I don't understand why.
I've read through many threads on lambda functions, but I think what I'm asking is too complex for that. Most solutions deal with numerical data, and I'm using strings; but as I said, I was able to make it work with one dataset. Obviously I'm new here. I'd also appreciate advice on how to build the new column as part of the function instead of as a separate step, but that isn't frustrating me as much as the main question.
This example works:
# Python: import pandas
import pandas as pd
# Simple dataframe. Empty column 'type'.
df = pd.DataFrame({'one': ['1','2','3','4','5','6','7','8'],
                   'two': ['A','B','C','D','E','F','G','H'],
                   'three': ['car','bus','red','blue','truck','pencil','yellow','green'],
                   'type': ''})
df displays:
  one two   three type
0   1   A     car
1   2   B     bus
2   3   C     red
3   4   D    blue
4   5   E   truck
5   6   F  pencil
6   7   G  yellow
7   8   H   green
Now define the lists and a custom function:
# Define lists of colors and vehicles
colors = ['red','blue','green','yellow']
vehicles = ['car','truck','bus','motorcycle']

# Create function 'celltype' to return values based on x
def celltype(x):
    if x in colors: return 'color'
    elif x in vehicles: return 'vehicle'
    else: return 'other'
Then construct a loop to iterate through each row and apply the function:
# Write loop to iterate through df rows and apply function 'celltype' to column 'three' in each row
for index, row in df.iterrows():
    row['type'] = celltype(row['three'])
And in this case the result is just what I want:
  one two   three     type
0   1   A     car  vehicle
1   2   B     bus  vehicle
2   3   C     red    color
3   4   D    blue    color
4   5   E   truck  vehicle
5   6   F  pencil    other
6   7   G  yellow    color
7   8   H   green    color
This example doesn't work, and I don't know why:
df1 = pd.DataFrame({'Last Name': ['SMITH','JONES','WILSON','DOYLE','ANDERSON'],
                    'First Name': ['TOM','DICK','HARRY','MICHAEL','KEVIN'],
                    'Code': [12,34,56,78,90],
                    'Deparment': ['Research','Management','Maintenance','Marketing','IT'],
                    'City': ['NEW YORK','BOSTON','SAN FRANCISCO','DALLAS','DETROIT'],
                    'State': ['NY','MA','CA','TX','MI'],
                    'Region': ''})
df1 displays:
  Last Name First Name  Code    Deparment           City State Region
0     SMITH        TOM    12     Research       NEW YORK    NY
1     JONES       DICK    34   Management         BOSTON    MA
2    WILSON      HARRY    56  Maintenance  SAN FRANCISCO    CA
3     DOYLE    MICHAEL    78    Marketing         DALLAS    TX
4  ANDERSON      KEVIN    90           IT        DETROIT    MI
Again, defining lists and functions:
# Define lists for regions
east = ['NEW YORK','BOSTON']
west = ['SAN FRANCISCO','LOS ANGELES']
south = ['TX']

# Create function 'region' to return values based on x
def region(x):
    if x in east: return 'east'
    elif x in west: return 'west'
    elif x in south: return 'south'
    else: return 'other'
# Write loop to iterate through df1 rows and apply function 'region' to column 'City' in each row
for index, row in df1.iterrows():
    row['Region'] = region(row['City'])
    if row['Region'] == 'other': row['Region'] = region(row['State'])
This results in an unchanged df1: the 'Region' column is still blank. I should see "east", "east", "west", "south", "other". The only difference in the code is the additional if statement, which catches Dallas by state (something I need for my real-world dataset). But I think that line is sound, and I get the same result without it.
First off, apply and iterrows are slow, so try not to use them, ever.
What I usually do in this situation is to create a pair of forward and backward dicts:
forward = {'east': east,
           'west': west,
           'south': south}
backward = {x: k for k, v in forward.items() for x in v}
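With the lists above, backward comes out as:
{'NEW YORK': 'east', 'BOSTON': 'east', 'SAN FRANCISCO': 'west', 'LOS ANGELES': 'west', 'TX': 'south'}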
And then update with map. Since you want to update based on two columns, fillna will be helpful:
df1['Region'] = (df1['State'].map(backward)
                 .fillna(df1['City'].map(backward))
                 .fillna('other'))
gives:
  Last Name First Name  Code    Deparment           City State Region
0     SMITH        TOM    12     Research       NEW YORK    NY   east
1     JONES       DICK    34   Management         BOSTON    MA   east
2    WILSON      HARRY    56  Maintenance  SAN FRANCISCO    CA   west
3     DOYLE    MICHAEL    78    Marketing         DALLAS    TX  south
4  ANDERSON      KEVIN    90           IT        DETROIT    MI  other
Your issue is with using iterrows. In general, you should never modify something you are iterating over. Here, iterrows creates a copy of each row, so assigning to row does not actually modify df1. Whether you get a copy or a view can depend on the circumstances, which is exactly why this pattern is best avoided.
You can make sure you modify the original by writing to the DataFrame directly with .at:
for index, row in df1.iterrows():
    df1.at[index, 'Region'] = region(row['City'])
    if df1.at[index, 'Region'] == 'other': df1.at[index, 'Region'] = region(row['State'])
I have a dataframe like the one below, and I am trying to keep rows whose schools value has more than 5 characters. Here is what I tried, but it removes 'of', 'U.', 'and', 'Arts', etc. I need to drop whole rows whose string has len less than 5, not remove the short words within each row.
id  schools
1   University of Hawaii
2   Dept in Colorado U.
3   Dept
4   College of Arts and Science
5   Dept
6   Bldg
Wrong output from my code:
0  University Hawaii
1  Colorado
2
3  College Science
4
5
Looking for output like this:
id  schools
1   University of Hawaii
2   Dept in Colorado U.
4   College of Arts and Science
Code:
l = [1,2,3,4,5,6]
s = ['University of Hawaii', 'Dept in Colorado U.', 'Dept', 'College of Arts and Science', 'Dept', 'Bldg']
df1 = pd.DataFrame({'id': l, 'schools': s})
df1 = df1['schools'].str.findall(r'\w{5,}').str.join(' ')  # not working
df1
Using a regex is a huge (and slow) overkill for this task. You can use simple pandas indexing:
filtered_df = df1[df1['schools'].str.len() > 5]  # or >= depending on the required logic
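With the sample frame above, this keeps ids 1, 2, and 4, matching the desired output:
   id                      schools
0   1         University of Hawaii
1   2          Dept in Colorado U.
3   4  College of Arts and Science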
There is a simpler filter for your data.
mask = df1['schools'].str.len() > 5
Then create a new data frame from the filter
df2 = df1[mask].copy()
import pandas as pd

name = ['University of Hawaii', 'Dept in Colorado U.', 'Dept', 'College of Arts and Science', 'Dept', 'Bldg']
labels = ['schools']
df = pd.DataFrame.from_records([[i] for i in name], columns=labels)
df[df['schools'].str.len() > 5]
I have a dataframe as follows:
       City     Name
0   Seattle    Alice
1   Seattle      Bob
2  Portland  Mallory
3  Portland      Bob
Given two names, I want to find whether or not they are in the same city.
What is an efficient way to do this?
I was thinking about grouping by "City", but I don't know how to check whether two names are in the same group.
(The dataframe I'm using is much larger, with millions of rows, and I want to find two or more people in the same city multiple times.)
A possible way to use groupby:
x = "Mallory"
y = "Alice"
any(any(names[1].str.contains(x)) and any(names[1].str.contains(y)) for names in df.groupby('City').Name)
# False
You could use:
names = ['Alice', 'Bob']
df[df.Name.isin(names)].groupby('City').Name.nunique() > 1
yields
City
Portland False
Seattle True
Name: Name, dtype: bool
Enclose with (..).any() to get a summary True / False result.
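For example:
(df[df.Name.isin(names)].groupby('City').Name.nunique() > 1).any()
# True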
If one person can appear several times in the same City, you could apply .drop_duplicates(['Name', 'City']) first.
Wrapped in a function:
def same_city(df, n1, n2):
    same = df[df.Name.isin([n1, n2])].groupby('City').Name.nunique() > 1
    return same, same.any()
result, summary = same_city(df, 'Alice', 'Bob')
yields:
City
Portland False
Seattle True
Name: Name, dtype: bool
True
Try this:
def bothInCity(df, n1, n2):
    s = {n1, n2}
    c = df.groupby('City').Name.apply(set)
    chk = lambda x: s.issubset(x)
    return c.loc[c.apply(chk)]
Then use it like:
bothInCity(df, 'Bob', 'Alice')
City
Seattle {Bob, Alice}
Name: Name, dtype: object