Get row iterables for each group via GROUP BY clause - python

I know how to get an iterable for all the rows in a selection, but I would like to add a GROUP BY statement to my selection and return a set of iterables instead, one for each group. That is, suppose I have a table like so:
and suppose I want to group by nationality. I would like an iterable of iterables, one per group, containing the following rows: [1, 3], [2, 5], [4, 6], [7] (in no particular order). Currently I do this manually by using order_by on nationality and iterating over all the rows; is there a built-in way to do it with SQLAlchemy?
From what I've read, the group_by method only does what SQL GROUP BY does, which is somewhat limited. I'm looking for something like the pandas groupby, which lets you apply an arbitrary function to each group.

Not sure if this is what you were looking for:
# Import libraries
import pandas as pd
# Create DataFrame
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7],
    'name': ['Franz', 'Louis', 'Werner', 'Leonardo', 'Antoine', 'Marco', 'Nacho'],
    'nationality': ['Germany', 'France', 'Germany', 'Italy', 'France', 'Italy', 'Spain']
})
# Group by a single label so that each group key is a plain string
grouped = df.groupby('nationality')
# Iterate over the groups
for name, group in grouped:
    print(name)
    print(group)
Output
France
   id     name nationality
1   2    Louis      France
4   5  Antoine      France
Germany
   id    name nationality
0   1   Franz     Germany
2   3  Werner     Germany
Italy
   id      name nationality
3   4  Leonardo       Italy
5   6     Marco       Italy
Spain
   id   name nationality
6   7  Nacho       Spain
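If you would rather not pull in pandas, the "order by, then iterate" approach from the question can be tidied up with itertools.groupby from the standard library. This is only a sketch: the rows list is a hypothetical stand-in for rows already fetched from the database, and the sort plays the role of ORDER BY nationality.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows (id, name, nationality), mirroring the example data above.
rows = [
    (1, "Franz", "Germany"),
    (2, "Louis", "France"),
    (3, "Werner", "Germany"),
    (4, "Leonardo", "Italy"),
    (5, "Antoine", "France"),
    (6, "Marco", "Italy"),
    (7, "Nacho", "Spain"),
]

# groupby only merges adjacent items, so the rows must be sorted by the
# grouping key first (the in-Python counterpart of ORDER BY nationality).
by_nationality = itemgetter(2)
rows_sorted = sorted(rows, key=by_nationality)

# Collect the ids of each group; any per-group function could be applied here.
groups = {
    nationality: [row[0] for row in group]
    for nationality, group in groupby(rows_sorted, key=by_nationality)
}
print(groups)
```

Each group yielded by groupby is itself a lazy iterator, so consume it (as the list comprehension does) before moving on to the next group.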

Related

Get index and column name for a particular value in Pandas Dataframe

I have the following Pandas DataFrame:
A B
0 Exporter Invoice No. & Date
1 ABC PVT LTD. ABC/1234/2022-23 DATED 20/08/2022
2 1234/B, XYZ,
3 ABCD, DELHI, INDIA Proforma Invoice No. Date.
4 AB/CDE/FGH/2022-23/1234 20.08.2022
5 Consignee Buyer (If other than consignee)
6 ABC Co.
8 P.O BOX NO. 54321
9 Berlin, Germany
Now I want to search for a value in this DataFrame, and store the index and column name in 2 different variables.
For example:
If I search "Consignee", I should get
index = 5
column = 'A'
Assuming you really want the index/column of the match, you can use a mask and stack:
df.where(df.eq('Consignee')).stack()
output:
5 A Consignee
dtype: object
As list:
df.where(df.eq('Consignee')).stack().index.tolist()
output: [(5, 'A')]
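To store the result in two separate variables, one possible alternative is to locate the match positionally with NumPy and then translate the positions back to labels. The small DataFrame below is a shortened, hypothetical version of the one in the question, and the sketch assumes at least one cell matches.

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the DataFrame in the question (assumption).
df = pd.DataFrame(
    {
        "A": ["Exporter", "ABC PVT LTD.", "Consignee"],
        "B": ["Invoice No. & Date", "ABC/1234/2022-23", "Buyer (If other than consignee)"],
    },
    index=[0, 1, 5],
)

# Boolean array marking the cells equal to the search value.
mask = df.eq("Consignee").to_numpy()

# Positions of all matching cells; take the first match.
row_pos, col_pos = np.nonzero(mask)
index = df.index[row_pos[0]]
column = df.columns[col_pos[0]]
print(index, column)
```

If nothing matches, row_pos is empty and row_pos[0] raises an IndexError, so check len(row_pos) first when a miss is possible.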

Categorize column according to lists and aggregate with result

Let's say I have a dataframe as follows:
d = {'name': ['spain', 'greece','belgium','germany','italy'], 'davalue': [3, 4, 6, 9, 3]}
df = pd.DataFrame(data=d)
index name davalue
0 spain 3
1 greece 4
2 belgium 6
3 germany 9
4 italy 3
I would like to aggregate and sum based on a list of strings in the name column. So for example, I may have: southern=['spain', 'greece', 'italy'] and northern=['belgium','germany'].
My goal is to aggregate by using sum, and obtain:
index name davalue
0 southern 10
1 northern 15
where 10=3+4+3 and 15=6+9
I imagined something like:
df.groupby(by=[['spain','greece','italy'],['belgium','germany']])
could exist. The docs say
A label or list of labels may be passed to group by the columns in self
but I'm not sure I understand what that means in terms of syntax.
I would build a dictionary and map:
d = {v:'southern' for v in southern}
d.update({v:'northern' for v in northern})
df['davalue'].groupby(df['name'].map(d)).sum()
Output:
name
northern 15
southern 10
Name: davalue, dtype: int64
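Put together as a self-contained sketch, the dictionary approach looks like this. The names region_of, totals and result are only illustrative; the final reset_index step is one way to get back the two-column shape the question asks for.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["spain", "greece", "belgium", "germany", "italy"],
    "davalue": [3, 4, 6, 9, 3],
})
southern = ["spain", "greece", "italy"]
northern = ["belgium", "germany"]

# Map every country to the label of the list it belongs to.
region_of = {country: "southern" for country in southern}
region_of.update({country: "northern" for country in northern})

# Group the values by the mapped label and sum each group.
totals = df.groupby(df["name"].map(region_of))["davalue"].sum()

# Turn the grouped Series back into a DataFrame with name/davalue columns.
result = totals.reset_index()
print(result)
```

Countries that appear in neither list map to NaN and are left out of the groups by default.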
One way could be using np.select and using the result as a grouper:
import numpy as np
southern=['spain', 'greece', 'italy']
northern=['belgium','germany']
g = np.select([df.name.isin(southern),
               df.name.isin(northern)],
              ['southern', 'northern'],
              'others')
# numeric_only=True keeps the string column 'name' out of the sum
df.groupby(g).sum(numeric_only=True)
          davalue
northern       15
southern       10
df["regional_group"] = df.apply(lambda x: "north" if x["name"] in ['belgium', 'germany'] else "south", axis=1)
This creates a new column, which you can then group by:
df.groupby("regional_group")["davalue"].sum()

Filtering out string values in Python Pandas Dataframe

I am using Python Pandas.
For example, I have a dataframe as follows
index, name, acct_no, city
1, alex, 10011, huntington
2, rider, 100AB, charleston
3, daniel, A1009, bonn
4, rice, AAAA1, new york
5, ricardo, 12121, london
From this dataset, I would like to get ONLY those records that do not have any letters in the acct_no column.
So, I would like to get the following result from the above dataset. In this result, none of the values in the acct_no column contain letters.
index, name, acct_no, city
1, alex, 10011, huntington
5, ricardo, 12121, london
Which code will give me such result?
You can use str.contains:
df1=df[~df.acct_no.str.contains('[a-zA-Z]')]
df1
Out[119]:
index name acct_no city
0 1 alex 10011 huntington
4 5 ricardo 12121 london
Or use to_numeric and filter with notna:
df[pd.to_numeric(df.acct_no,errors='coerce').notna()]
Another solution might be to use pd.to_numeric, which tries to convert each value to a number. When the conversion fails, errors='coerce' makes it return NaN instead, and we then drop the rows where acct_no is NaN:
df.acct_no = pd.to_numeric(df.acct_no, errors='coerce')
df.dropna(subset=['acct_no'])
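As a runnable sketch of the to_numeric idea on the sample data, the version below builds a boolean mask instead of overwriting the column, so acct_no keeps its original string values. It assumes acct_no is stored as strings; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["alex", "rider", "daniel", "rice", "ricardo"],
        "acct_no": ["10011", "100AB", "A1009", "AAAA1", "12121"],
        "city": ["huntington", "charleston", "bonn", "new york", "london"],
    },
    index=[1, 2, 3, 4, 5],
)

# True where acct_no parses as a number, False where coercion produced NaN.
is_numeric = pd.to_numeric(df["acct_no"], errors="coerce").notna()

# Keep only the rows whose account number is purely numeric.
numeric_rows = df[is_numeric]
print(numeric_rows)
```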

Comparing columns from two data frames

I am relatively new to Python. Suppose I have the following two dataframes, df1 and df2 respectively:
df1:
Id  Name  Job
1   Jim   Tester
2   Bob   Developer
3   Sam   Support
df2:
Name  Salary  Location
Jim   100     Japan
Bob   200     US
Si    300     UK
Sue   400     France
I want to compare the 'Name' column in df2 to df1 such that if the name of the person (in df2) does not exist in df1, then that row in df2 is output to another dataframe. So for the example above the output would be:
Name  Salary  Location
Si    300     UK
Sue   400     France
Si and Sue are output because they do not exist in the 'Name' column in df1.
You can use Boolean indexing:
res = df2[~df2['Name'].isin(df1['Name'].unique())]
We use hashing via pd.Series.unique as an optimization in case you have duplicate names in df1.
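Applied to the sample data from the question, a minimal sketch of this filter looks like the following (isin also accepts the column directly, without unique).

```python
import pandas as pd

df1 = pd.DataFrame({
    "Id": [1, 2, 3],
    "Name": ["Jim", "Bob", "Sam"],
    "Job": ["Tester", "Developer", "Support"],
})
df2 = pd.DataFrame({
    "Name": ["Jim", "Bob", "Si", "Sue"],
    "Salary": [100, 200, 300, 400],
    "Location": ["Japan", "US", "UK", "France"],
})

# Keep the rows of df2 whose Name does not appear anywhere in df1["Name"].
res = df2[~df2["Name"].isin(df1["Name"])]
print(res)
```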

Find max number in .CSV file in Python

I have a .csv file that when opened in Excel looks like this:
My code:
myfile = open("/Users/it/Desktop/Python/In-Class Programs/countries.csv", "rb")
countries = []
for item in myfile:
    a = item.split(",")
    countries.append(a)
hdi_list = []
for acountry in countries:
    hdi = acountry[3]
    try:
        hdi_list.append(float(hdi))
    except:
        pass
average = round(sum(hdi_list)/len(hdi_list), 2)
maxNumber = round(max(hdi_list), 2)
minNumber = round(min(hdi_list), 2)
This code works well. However, when I find the max, min, or average, I also need to grab the corresponding country name and print it.
How can I change my code to get the country name for the min, max, and average as well?
Instead of putting the bare values in the list, store tuples, like this:
hdi_list.append((float(hdi), acountry[1]))
Then you can use this instead:
maxTuple = max(hdi_list)
maxNumber = round(maxTuple[0], 2)
maxCountry = maxTuple[1]
Using the pandas module, [4], [5], and [6] below should show the max, min, and average respectively. Note that the data below doesn't match yours save for country.
In [1]: import pandas as pd
In [2]: df = pd.read_csv("hdi.csv")
In [3]: df
Out[3]:
Country HDI
0 Norway 83.27
1 Australia 80.77
2 Netherlands 87.00
3 United States 87.43
4 New Zealand 87.43
5 Canada 87.66
6 Ireland 75.47
7 Liechtenstein 88.97
8 Germany 86.31
9 Sweden 80.54
In [4]: df.loc[df["HDI"].idxmax()]
Out[4]:
Country Liechtenstein
HDI 88.97
Name: 7, dtype: object
In [5]: df.loc[df["HDI"].idxmin()]
Out[5]:
Country Ireland
HDI 75.47
Name: 6, dtype: object
In [6]: df["HDI"].mean()
Out[6]: 84.484999999999985
Assuming both Liechtenstein and Germany have max values:
In [15]: df
Out[15]:
Country HDI
0 Norway 83.27
1 Australia 80.77
2 Netherlands 87.00
3 United States 87.43
4 New Zealand 87.43
5 Canada 87.66
6 Ireland 75.47
7 Liechtenstein 88.97
8 Germany 88.97
9 Sweden 80.54
In [16]: df[df["HDI"] == df["HDI"].max()]
Out[16]:
Country HDI
7 Liechtenstein 88.97
8 Germany 88.97
The same logic can be applied for the minimum value.
The following approach is close enough to your implementation that I think it might be useful. However, if you start working with larger or more complicated csv files, you should look into tools like the standard library's csv module (csv.reader) or pandas (as previously mentioned). They are more robust and efficient when working with complex .csv data. You could also work with Excel files directly through the "xlrd" package.
In my opinion, the simplest way to keep country names together with their values is to combine your two for loops. Instead of looping through your data twice and creating two separate lists, use a single for loop and build a dictionary with the relevant data (i.e. "country name", "hdi") for each row. You could also create a tuple (as previously mentioned), but I think dictionaries are more explicit.
myfile = open("/Users/it/Desktop/Python/In-Class Programs/countries.csv", "r")
countries = []
for line in myfile:
    fields = line.split(",")
    try:
        value_of_interest = float(fields[3])
    except (ValueError, IndexError):
        # Skip rows without a numeric value (e.g. a header row)
        continue
    countries.append(
        {"Country Name": fields[1],
         "Value of Interest": value_of_interest})
myfile.close()
values = [country["Value of Interest"] for country in countries]
ave_value = sum(values) / len(values)
max_value = max(values)
min_value = min(values)
print("Country Average == {}".format(ave_value))
for country in countries:
    if country["Value of Interest"] == max_value:
        print("Max == {country}:{value}".format(
            country=country["Country Name"], value=country["Value of Interest"]))
    if country["Value of Interest"] == min_value:
        print("Min == {country}:{value}".format(
            country=country["Country Name"], value=country["Value of Interest"]))
Note that this method returns multiple countries if they have equal min/max values.
If you are dead-set on creating separate lists (like your current implementation), you might consider zip() to connect your lists by index. This only works if both lists have the same length and order, so that
zip(countries, hdi_list) yields (countries[0], hdi_list[0]), (countries[1], hdi_list[1]), ...
For example:
for country in zip(countries, hdi_list):
    if country[1] == max_value:
        print(country[0], country[1])
with similar logic applied to the min and average. This method works but is less explicit and more difficult to maintain.
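For comparison, here is a minimal sketch using the csv module mentioned above. The in-memory sample stands in for countries.csv: its header names and values are made up, and only the column positions (name at index 1, value at index 3) follow the question's code.

```python
import csv
import io

# Hypothetical stand-in for countries.csv; header names and values are made up.
sample = io.StringIO(
    "Rank,Country,Region,HDI\n"
    "1,Country A,Region X,0.95\n"
    "2,Country B,Region X,0.94\n"
    "3,Country C,Region Y,0.39\n"
)

records = []
for row in csv.reader(sample):
    try:
        # Keep the name and the value together so they never get out of sync.
        records.append((row[1], float(row[3])))
    except (ValueError, IndexError):
        # Skip the header row and any row without a numeric value.
        continue

# key= compares on the numeric value only, not on the country name.
max_country, max_hdi = max(records, key=lambda record: record[1])
min_country, min_hdi = min(records, key=lambda record: record[1])
average = round(sum(value for _, value in records) / len(records), 2)

print(max_country, max_hdi)
print(min_country, min_hdi)
print(average)
```

With a real file, replace sample with a file object opened in text mode with newline="". Note that max and min return only the first match when several countries tie.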
