Efficient way to write pandas groupby code by eliminating repetition - python

I have a DataFrame as below.
import pandas as pd

df = pd.DataFrame({
    'Country': ['A','A','A','A','A','A','B','B','B'],
    'City': ['C 1','C 1','C 1','B 2','B 2','B 2','C 1','C 1','C 1'],
    'Date': ['7/1/2020','7/2/2020','7/3/2020','7/1/2020','7/2/2020','7/3/2020','7/1/2020','7/2/2020','7/3/2020'],
    'Value': [46,90,23,84,89,98,31,84,41]
})
I need to create 2 averages:
Firstly, with both Country and City as the criteria.
Secondly, the average for the Country only.
To achieve this, we can easily write the code below:
df.groupby(['Country','City']).agg('mean')
+---------+------+-------+
| Country | City | Value |
+---------+------+-------+
| A       | B 2  | 90.33 |
|         +------+-------+
|         | C 1  | 53    |
+---------+------+-------+
| B       | C 1  | 52    |
+---------+------+-------+
df.groupby(['Country']).agg('mean')
+---------+-------+
| Country | Value |
+---------+-------+
| A       | 71.67 |
+---------+-------+
| B       | 52    |
+---------+-------+
The only difference between the two snippets above is the groupby criterion City; apart from that, everything is the same, so there is clear repetition/duplication of code (especially in more complex scenarios).
Now my question is: is there any way we could write one piece of code that covers both scenarios at once? DRY - Don't Repeat Yourself.
What I have in mind is something like the below.
Choice = 'City'  # <-- Here I set either 'City' or None based on the requirement. E.g. if None, the code below should ignore that criterion.
df.groupby(['Country',Choice]).agg('mean')
Is this possible? Or what is the best way to write the above code efficiently, without repetition?

I am not sure exactly what you want to accomplish, but why not just use an if?
columns = ['Country']
if Choice:
    columns.append(Choice)
df.groupby(columns).agg('mean')
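If this pattern comes up often, the same idea can be wrapped in a small helper so that the groupby keys are built in one place. A minimal sketch, reusing the df defined above and restricting the mean to the Value column (the helper name group_mean is just an illustration, not part of the original answer):
def group_mean(df, extra=None):
    # Mean of 'Value' grouped by 'Country', optionally also by one extra column (e.g. 'City').
    keys = ['Country'] + ([extra] if extra else [])
    return df.groupby(keys)['Value'].mean()

group_mean(df, 'City')   # average per Country and City
group_mean(df)           # average per Country only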

Related

Assign several DataFrame columns to match SQL table

I have several DataFrames that I need to flip into SQL tables. The SQL tables all share one schema, yet the DataFrames do not. I need to be able to easily match/change the df columns to the SQL table. Everything I have seen on here manipulates 1 or 2 fields using df.to_sql. I need to be able to manipulate at least 10 fields as easily as I do with lists. Below are example tables:
list1
+-------+--------+--------+--------+
| name  | hobby1 | hobby2 | hobby3 |
+-------+--------+--------+--------+
| kris  | ball   | swim   | dance  |
| james | eat    | sing   | sleep  |
| amy   | swim   | eat    | watch  |
+-------+--------+--------+--------+
df2
+---------+-----------+-----------+-----------+
| df2name | df2hobby1 | df2hobby2 | df2hobby3 |
+---------+-----------+-----------+-----------+
| kris    | ball      | swim      | dance     |
| james   | eat       | sing      | sleep     |
| amy     | swim      | eat       | watch     |
+---------+-----------+-----------+-----------+
sql1
+----------+------------+------------+------------+
| sql_name | sql_hobby1 | sql_hobby2 | sql_hobby3 |
+----------+------------+------------+------------+
| kris     | ball       | swim       | dance      |
| james    | eat        | sing       | sleep      |
| amy      | swim       | eat        | watch      |
+----------+------------+------------+------------+
Sometimes I receive the data as a Python dict; in that case I can easily transfer it using a kwargs function, and it works great. My function is below:
def transfer_dict(**kwargs):
    transfer = {'sqlname': ' ',
                'sqlhobby1': ' ',
                'sqlhobby2': ' ',
                'sqlhobby3': ' '}
    transfer.update(kwargs)
    return transfer
I transfer easily by doing:
new_list.append(transfer_dict(sqlname=name, sqlhobby1=hobby1, sqlhobby2=hobby2, sqlhobby3=hobby3))
Can I use my same kwargs transfer function to apply on DataFrame transfers to SQL? Or is there a better way?
The pandas.DataFrame.rename() method will accept a dict-like set of column names and names to rename them with. In many cases, the fastest solution to the problem you are describing (if I'm understanding you correctly) is to use a combination of rename() and drop() to change the source DataFrame so that it matches the SQL target, and then use to_sql() as you have described doing (but now, critically, all the column names match their intended targets). For example:
sql_mappings = {'df2name': 'sql_name', 'df2hobby1': 'sql_hobby1', 'df2hobby2': 'sql_hobby2', 'df2hobby3': 'sql_hobby3'}
sql_columns = list(sql_mappings.values())
df2 = df2.rename(columns=sql_mappings)
df2 = df2.drop(columns=[col for col in df2 if col not in sql_columns])
If you want to set things like the sql table name and execute to_sql dynamically, I can imagine a fairly straightforward wrapper function that does both tasks using this approach.
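For instance, a minimal sketch of such a wrapper, assuming a SQLAlchemy engine and reusing the rename/drop approach above (the name push_to_sql and its parameters are illustrative assumptions, not part of the original answer):
def push_to_sql(df, mappings, table_name, engine):
    # Rename columns per `mappings`, drop anything not in the SQL schema, then write.
    sql_columns = list(mappings.values())
    out = df.rename(columns=mappings)
    out = out.drop(columns=[col for col in out if col not in sql_columns])
    out.to_sql(table_name, engine, if_exists='append', index=False)

# usage with a hypothetical engine object:
# push_to_sql(df2, sql_mappings, 'sql1', engine)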

Link lists that share common elements

I have an issue similar to this one, with a few differences/complications.
I have a list of groups containing members. Rather than merging the groups that share members, I need to preserve the groupings and create a new set of edges based on which groups have members in common, and to do so conditionally, based on attributes of the groups.
The source data looks like this:
+----------+------------+-----------+
| Group ID | Group Type | Member ID |
+----------+------------+-----------+
| A        | Type 1     | 1         |
| A        | Type 1     | 2         |
| B        | Type 1     | 2         |
| B        | Type 1     | 3         |
| C        | Type 1     | 3         |
| C        | Type 1     | 4         |
| D        | Type 2     | 4         |
| D        | Type 2     | 5         |
+----------+------------+-----------+
Desired output is this:
+----------+-----------------+
| Group ID | Linked Group ID |
+----------+-----------------+
| A        | B               |
| B        | C               |
+----------+-----------------+
A is linked to B because it shares member 2.
B is linked to C because it shares member 3.
C is not linked to D; it has a member in common, but D is of a different type.
The number of shared members doesn't matter for my purposes; a single member in common means they're linked.
The output is being used as the edges of a graph, so if the output is a graph that fits the rules, that's fine.
The source dataset is large (hundreds of millions of rows), so performance is a consideration.
This poses a similar question; however, I'm new to Python and can't figure out how to get the source data to a point where I can use that answer, or how to work in the additional requirement of matching the group type.
Try something like this:
df1 = df.groupby(['Group Type','Member ID'])['Group ID'].apply(','.join).reset_index()
df2 = df1[df1['Group ID'].str.contains(",")]
This might not handle the case of cyclic grouping.
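To turn that grouped frame into explicit edge pairs, here is one hedged sketch (my addition, reusing the source df and the column names above; for hundreds of millions of rows a merge-based approach may scale better): collect the Group IDs per (Group Type, Member ID) and emit all pairwise combinations.
from itertools import combinations
import pandas as pd

# sets of Group IDs that share a member within the same Group Type
shared = df.groupby(['Group Type', 'Member ID'])['Group ID'].apply(set)

edges = set()
for group_ids in shared:
    if len(group_ids) > 1:
        edges.update(combinations(sorted(group_ids), 2))

edge_df = pd.DataFrame(sorted(edges), columns=['Group ID', 'Linked Group ID'])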

Creating new column from API lookup using groupby

I have a dataframe of weather data that looks like this:
+----+------------+----------+-----------+
| ID | Station_ID | Latitude | Longitude |
+----+------------+----------+-----------+
| 0  | 6010400    | 52.93    | -82.43    |
| 1  | 6010400    | 52.93    | -82.43    |
| 2  | 6010400    | 52.93    | -82.43    |
| 3  | 616I001    | 45.07    | -77.88    |
| 4  | 616I001    | 45.07    | -77.88    |
| 5  | 616I001    | 45.07    | -77.88    |
+----+------------+----------+-----------+
I want to create a new column called postal_code using an API lookup based on the latitude and longitude values. I cannot perform a lookup for each row in the dataframe as that would be inefficient, since there are over 500,000 rows and only 186 unique Station_IDs. It's also unfeasible due to rate limiting on the API I need to use.
I believe I need to perform a groupby transform but can't quite figure out how to get it to work correctly.
Any help with this would be greatly appreciated.
I believe you can use groupby only for aggregations, which is not what you want here.
First, combine 'Latitude' and 'Longitude'. This gives a new column of tuples:
df['coordinates'] = list(zip(df['Latitude'], df['Longitude']))
Then you can use this 'coordinates' column to build the set of all unique (Latitude, Longitude) values, so it contains no duplicates:
set(df['coordinates'])
Then fetch the postal codes for these coordinates using API calls, as you said, and store them as a dict.
Then you can use this dict to populate the postal code for each row:
postal_code_dict = {'key':'value'} #sample dictionary
df['postal_code'] = df['coordinates'].apply(lambda x: postal_code_dict[x])
Hope this helps.
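Putting those steps together, a minimal end-to-end sketch (lookup_postal_code is a hypothetical stand-in for the real API call, and .map is used here as an equivalent of the .apply lookup above):
def lookup_postal_code(lat, lon):
    # hypothetical placeholder for the actual rate-limited API call
    return 'UNKNOWN'

df['coordinates'] = list(zip(df['Latitude'], df['Longitude']))

# one API call per unique coordinate pair instead of one per row
postal_code_dict = {coord: lookup_postal_code(*coord)
                    for coord in set(df['coordinates'])}

df['postal_code'] = df['coordinates'].map(postal_code_dict)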

Find value that is a subset to a row in Pandas dataframe

This is actually a follow-up solution/question to one of my other questions: Python Pandas compare two dataframes to assign country to phone number
We have two data frames:
df1 = pd.DataFrame({"TEL": ["49123410", "49123411","49123412","49123413","49123414","49123710", "49123810"]})
df2 = pd.DataFrame({"BASE_NR": ["491234","491237","491238"],"NAME": ["A","B","C"]})
What I want to do is assign one of the df2 names to each df1 TEL. If we take the first value "491234", we see that the first five entries in df1 start with exactly this string. This should result in something like this:
+---+----------+--------+
|   | TEL      | PREFIX |
+---+----------+--------+
| 0 | 49123410 | 491234 |
| 1 | 49123411 | 491234 |
| 2 | 49123412 | 491234 |
| 3 | 49123413 | 491234 |
| 4 | 49123414 | 491234 |
| 5 | 49123710 | 491237 |
| 6 | 49123810 | 491238 |
+---+----------+--------+
Other than in Python Pandas compare two dataframes to assign country to phone number, I developed another approach that works much faster:
for i, s in df2.iterrows():
    df1.loc[df1["TEL"].str.startswith(s[0], na=False), "PREFIX"] = s[0]
So far it has worked perfectly, and I have been using it over and over again, as I have to match many different sources on phone numbers and their subsets. But lately I am experiencing more and more issues: the PREFIX column gets set up but stays empty. No matches are found any longer, where I had about 150,000 before.
Is there something fundamental that I am missing, or was it only luck that it worked this way? The input files (I am reading them in from a CSV) and data types have not changed. I also have not changed the pandas version (22).
PS: What would also be helpful is an idea of how to debug what happens in this part:
df1.loc[df1["TEL"].str.startswith(s[0], na=False), "PREFIX"] = s[0]
Well, if it is speed you are after, this should be faster:
mapping = dict(zip(df2['BASE_NR'].tolist(), df2['NAME'].tolist()))

def getName(tel):
    for k, v in mapping.items():
        if tel.startswith(k):
            return k, v
    return '', ''

df1['BASE_NR'], df1['NAME'] = zip(*df1['TEL'].apply(getName))
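As for the PS about debugging the masking line: a small hedged sketch of checks that often narrow down a sudden "no matches" problem (my addition, using the column names from the question):
# Are the TEL values really strings? In an object column, non-string values
# make .str.startswith return NaN, which na=False then treats as "no match".
print(df1["TEL"].dtype)
print(df1["TEL"].map(type).value_counts())

# How many rows does each prefix actually match?
for i, s in df2.iterrows():
    mask = df1["TEL"].astype(str).str.startswith(str(s["BASE_NR"]), na=False)
    print(s["BASE_NR"], mask.sum())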

Replicating GROUP_CONCAT for pandas.DataFrame

I have a pandas DataFrame df:
+------+---------+
| team | user    |
+------+---------+
| A    | elmer   |
| A    | daffy   |
| A    | bugs    |
| B    | dawg    |
| A    | foghorn |
| B    | speedy  |
| A    | goofy   |
| A    | marvin  |
| B    | pepe    |
| C    | petunia |
| C    | porky   |
+------+---------+
I want to find or write a function that returns the DataFrame I would get in MySQL using the following:
SELECT
    team,
    GROUP_CONCAT(user)
FROM
    df
GROUP BY
    team
for the following result:
+------+---------------------------------------+
| team | group_concat(user)                    |
+------+---------------------------------------+
| A    | elmer,daffy,bugs,foghorn,goofy,marvin |
| B    | dawg,speedy,pepe                      |
| C    | petunia,porky                         |
+------+---------------------------------------+
I can think of nasty ways to do this by iterating over rows and adding to a dictionary, but there's got to be a better way.
Do the following:
df.groupby('team').apply(lambda x: ','.join(x.user))
to get a Series of strings or
df.groupby('team').apply(lambda x: list(x.user))
to get a Series of lists of strings.
Here's what the results look like:
In [33]: df.groupby('team').apply(lambda x: ', '.join(x.user))
Out[33]:
team
A    elmer, daffy, bugs, foghorn, goofy, marvin
B    dawg, speedy, pepe
C    petunia, porky
dtype: object
In [34]: df.groupby('team').apply(lambda x: list(x.user))
Out[34]:
team
A    [elmer, daffy, bugs, foghorn, goofy, marvin]
B    [dawg, speedy, pepe]
C    [petunia, porky]
dtype: object
Note that in general any further operations on these types of Series will be slow and are generally discouraged. If there's another way to aggregate without putting a list inside of a Series you should consider using that approach instead.
A more general solution if you want to use agg:
df.groupby('team').agg({'user' : lambda x: ', '.join(x)})
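If the goal is a two-column frame like the MySQL result rather than a Series, one further variant (my addition, not part of the original answer) is to aggregate the user column directly and reset the index:
result = (df.groupby('team')['user']
            .agg(','.join)
            .reset_index()
            .rename(columns={'user': 'group_concat(user)'}))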
