Join pandas dataframes using regex on a column in Python 3

I have two pandas dataframes, df1 and df2. I would like to join the two dataframes using a regex on column 'CODE'.
df1
STR CODE
Nonrheumatic aortic valve disorders I35
Glaucoma suspect H40.0
df2
STR CODE
Nonrheumatic aortic valve disorders I35
Nonrheumatic 1 I35.1
Nonrheumatic 2 I35.2
Nonrheumatic 3 I35.3
Glaucoma suspect H40.0
Glaucoma 2 H40.1
Diabetes H50
Diabetes 1 H50.1
Diabetes 1 H50.2
The final output should be like this:
STR CODE
Nonrheumatic aortic valve disorders I35
Nonrheumatic 1 I35.1
Nonrheumatic 2 I35.2
Nonrheumatic 3 I35.3
Glaucoma suspect H40.0
Glaucoma 2 H40.1
Any help is highly appreciated!
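For reference, a minimal sketch reconstructing the two example frames shown above:
import pandas as pd

df1 = pd.DataFrame({'STR': ['Nonrheumatic aortic valve disorders', 'Glaucoma suspect'],
                    'CODE': ['I35', 'H40.0']})
df2 = pd.DataFrame({'STR': ['Nonrheumatic aortic valve disorders', 'Nonrheumatic 1', 'Nonrheumatic 2',
                            'Nonrheumatic 3', 'Glaucoma suspect', 'Glaucoma 2',
                            'Diabetes', 'Diabetes 1', 'Diabetes 1'],
                    'CODE': ['I35', 'I35.1', 'I35.2', 'I35.3', 'H40.0', 'H40.1',
                             'H50', 'H50.1', 'H50.2']})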

You can "align" the codes of the two dataframes by matching their prefixes (obtained by regex substitution) in a pandas.Series.where condition:
df1_codes = df1.CODE.str.replace(r'\..+', '', regex=True)
df2.loc[df2.CODE.str.replace(r'\..+', '', regex=True)
        .pipe(lambda s: s.where(s.isin(df1_codes)))
        .dropna().index]
STR CODE
0 Nonrheumatic aortic valve disorders I35
1 Nonrheumatic 1 I35.1
2 Nonrheumatic 2 I35.2
3 Nonrheumatic 3 I35.3
4 Glaucoma suspect H40.0
5 Glaucoma 2 H40.1

A possible solution, based on pandas.DataFrame.merge and pandas.Series.str.split:
(df1.assign(CODEA=df1['CODE'].str.split(r'\.', expand=True)[0])
    .merge(df2.assign(CODEA=df2['CODE'].str.split(r'\.', expand=True)[0]),
           on='CODEA', suffixes=['_x', ''])
    .loc[:, df1.columns])
Output:
STR CODE
0 Nonrheumatic aortic valve disorders I35
1 Nonrheumatic 1 I35.1
2 Nonrheumatic 2 I35.2
3 Nonrheumatic 3 I35.3
4 Glaucoma suspect H40.0
5 Glaucoma 2 H40.1

Related

Randomly sample from panel data by 3months periods

I have a pandas dataframe that is panel data, i.e. data on multiple customers over a timeframe. I want to sample (for bootstrapping) a continuous three-month period (I always want full months) of a random customer, 90 times.
I have googled a bit and found several sampling techniques, but none that sample a continuous three-month span.
I was considering making a list of all the month names and sampling three consecutive ones (although I'm not sure how to do consecutive). But how would I then be able to pick, e.g., Nov21-Dec21-Jan22?
Would appreciate the help a lot!
import pandas as pd
date_range = pd.date_range("2020-01-01", "2022-01-01")
df = pd.DataFrame({"value":3}, index=date_range)
df.groupby(df.index.quarter).sample(5)
This would output:
Out[12]:
value
2021-01-14 3
2021-02-27 3
2020-01-20 3
2021-02-03 3
2021-02-19 3
2021-04-27 3
2021-06-29 3
2021-04-12 3
2020-06-24 3
2020-06-05 3
2021-07-30 3
2020-08-29 3
2021-07-03 3
2020-07-17 3
2020-09-12 3
2020-12-22 3
2021-12-13 3
2021-11-29 3
2021-12-19 3
2020-10-18 3
It selects 5 sample values from each quarter group.
From here you can format the date column (index) to write the month as text.
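If you need a truly consecutive three-month window (e.g. Nov21-Dec21-Jan22) for a random customer, here is a hedged sketch; it assumes the frame has a DatetimeIndex and a customer_id column (both assumptions, not shown in the question):
import numpy as np
import pandas as pd

rng = np.random.default_rng()

def sample_three_month_block(df):
    # Pick a random customer (assumes a 'customer_id' column).
    customer = rng.choice(df['customer_id'].unique())
    # All months present in the data, in chronological order.
    months = df.index.to_period('M').unique().sort_values()
    # Random start month that leaves room for two following months,
    # so windows such as Nov21-Dec21-Jan22 arise naturally.
    start = months[rng.integers(len(months) - 2)]
    window = pd.period_range(start, periods=3, freq='M')
    mask = (df['customer_id'] == customer) & df.index.to_period('M').isin(window)
    return df.loc[mask]

# 90 bootstrap samples.
samples = [sample_three_month_block(df) for _ in range(90)]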

Pandas - Group by multiple columns and get count of 1 of the columns

This is a sample of what my df looks like:
normalized_page_url impression_id ts user_ip_country
0 viosdkfjki-o1 6954BBC 2022-01-08 za
1 vd/dkjfduof-at 061E9974B233 2022-01-08 pk
2 vd-le-se-fn-pase-ri 170331464 2022-01-08 gp
3 vntaetal-mia-mre 4EC9C93E4 2022-01-08 ru
4 viater-g-kfrom-id 6B4A846D6 2022-01-08 jp
However this is what I want it to look like :
normalized_page_url imp_id_count ts user_ip_country
0 blah blah blah 2 2022-01-08 za
1 vd/dkjfduof-at 2 2022-01-08 pk
2 extra blah blah. 1 2022-01-08 gp
3 vntaetal-mia-mre 2 2022-01-08 ru
4 viater-g-kfrom-id 1 2022-01-08 jp
I've tried this, but it just groups by all columns and doesn't return an impression_id count:
df.groupby(['normalized_page_url', 'ts', 'user_ip_country','impression_id'])
I also tried this, but it doesn't look like it did anything:
df.groupby(['normalized_page_url','impression_id', 'ts', 'user_ip_country']).agg({'impression_id':'count'})
If it helps, this is how I have the query running in Snowflake. It works as I'd like; I'm just trying to get the same result in pandas:
SELECT NORMALIZED_PAGE_URL, to_date(ts) as ts_date, USER_IP_COUNTRY, count(impression_id) as imp_id_count
FROM my_table
group by 1, 2,3
I think I got it!
df = df.groupby(['normalized_page_url', 'ts', 'user_ip_country']).agg(
    imp_id_count=('impression_id', 'count'))
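A small follow-up sketch: reset_index turns the group keys back into ordinary columns, which matches the flat shape of the Snowflake result above.
out = (df.groupby(['normalized_page_url', 'ts', 'user_ip_country'])
         .agg(imp_id_count=('impression_id', 'count'))
         .reset_index())  # group keys become columns again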

How to take groups of IDs in rows of one pandas dataframe and use them to extract records from another dataframe

I have two dataframes. One contains the contact information for individuals and households. The other contains an ID field for a Household, followed by the individuals in that household. I would like to select all records from the first dataframe and insert a column with their associated Household ID.
Minimum reproducible:
df1 = pd.DataFrame({'Constituent Id': ['111111', '222222', '333333', '444444', '555555', '666666', '777777'],
                    'Type': ['Individual', 'Household', 'Individual', 'Household',
                             'Individual', 'Individual', 'Individual'],
                    'Name': ['Panda Smith', 'Panda and Python', 'Python Jones', 'Postgres Family',
                             'Paul Postgres', 'Mary Postgres', 'Sqlite Postgres']})
df2 = pd.DataFrame({'Account_ID': ['ABCDEF', 'GHIJKL'],
                    'Household_0': ['222222', '444444'],
                    'Individual_0': ['111111', '555555'],
                    'Individual_1': ['333333', '666666'],
                    'Individual_2': ['', '777777']})
Resulting in:
>>> df1
Constituent Id Type Name
0 111111 Individual Panda Smith
1 222222 Household Panda and Python
2 333333 Individual Python Jones
3 444444 Household Postgres Family
4 555555 Individual Paul Postgres
5 666666 Individual Mary Postgres
6 777777 Individual Sqlite Postgres
>>> df2
Account_ID Household_0 Individual_0 Individual_1 Individual_2
0 ABCDEF 222222 111111 333333
1 GHIJKL 444444 555555 666666 777777
What I want to do is append a column to df1 with the Account_ID that applies to each of the individuals in the account. Households aren't necessary, but it's fine if I include those.
Because the number of individuals varies in each Household, I couldn't think of a great way to do this without iterating over each row. That seems very un-pandas and I'm sure there's a better way, perhaps by stacking or something.
In my example, the output would look like:
Constituent Id Type Name Account_ID
0 111111 Individual Panda Smith ABCDEF
1 222222 Household Panda and Python ABCDEF
2 333333 Individual Python Jones ABCDEF
3 444444 Household Postgres Family GHIJKL
4 555555 Individual Paul Postgres GHIJKL
5 666666 Individual Mary Postgres GHIJKL
6 777777 Individual Sqlite Postgres GHIJKL
IIUC you need melt then merge.
If Type isn't required you can omit it from the 2nd line and from the merge clause.
s = pd.melt(df2,id_vars='Account_ID',var_name='Type',value_name='Constituent Id')
s['Type'] = s['Type'].str.split('_',expand=True)[0]
print(s.head(5))
Account_ID Type Constituent Id
0 ABCDEF Household 222222
1 GHIJKL Household 444444
2 ABCDEF Individual 111111
3 GHIJKL Individual 555555
4 ABCDEF Individual 333333
df3 = pd.merge(df1,
               s,
               on=['Type', 'Constituent Id'],
               how='left')
print(df3)
Constituent Id Type Name Account_ID
0 111111 Individual Panda Smith ABCDEF
1 222222 Household Panda and Python ABCDEF
2 333333 Individual Python Jones ABCDEF
3 444444 Household Postgres Family GHIJKL
4 555555 Individual Paul Postgres GHIJKL
5 666666 Individual Mary Postgres GHIJKL
6 777777 Individual Sqlite Postgres GHIJKL
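One follow-up note: the empty placeholder in Individual_2 also ends up in the melted frame. Since the merge is a left join on df1 it is harmless, but it can be dropped first if preferred (a sketch):
s = s[s['Constituent Id'].ne('')]  # drop blank placeholder rows from the melt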

Fill up NaN values (column1) in an existing column based on another column (column2) using a pandas dataframe in Python

I have 2 columns, Area and Pincode:
Area Pincode
ABC - 1234
XYZ - 4118
qwe - 1023
rty - 1234
XYZ - ?
rty - ?
qwe - ?
ABC - ?
So I have multiple areas and want to fill up the Pincode column based on Area. Area and pin codes are available, but I notice that some pin codes are missing even though the area is the same.
Thanks!
df4.loc[df4.pins.isnull(),'pins'] = df4.loc[df4.pins.isnull(),'Area'].map(df4.loc[df4.pins.notnull()].set_index('Area')['pins'])
but this is not working
This should work; it's hard to say without a proper view of your data:
df.loc[df['pcode'].isnull()==True,'pcode'] = df['Unnnamed:']
IIUC, you can use fillna with groupby and transform to get your result.
Using your dummy data above (I replaced your '?' with true null values):
ABC 1234
0 XYZ 4118
1 qwe 1023
2 rty 1234
3 XYZ NaN
4 rty NaN
5 qwe NaN
df['1234'] = df['1234'].fillna(df.groupby('ABC')['1234'].transform('first'))
print(df)
ABC 1234
0 XYZ 4118
1 qwe 1023
2 rty 1234
3 XYZ 4118
4 rty 1234
5 qwe 1023
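With the question's actual column names (Area / Pincode), the same idea might look like the sketch below; it assumes the missing pin codes are real NaN values rather than '?':
import numpy as np
import pandas as pd

df = pd.DataFrame({'Area': ['ABC', 'XYZ', 'qwe', 'rty', 'XYZ', 'rty', 'qwe', 'ABC'],
                   'Pincode': [1234, 4118, 1023, 1234, np.nan, np.nan, np.nan, np.nan]})

# Fill each missing Pincode with the first known Pincode for the same Area.
df['Pincode'] = df['Pincode'].fillna(df.groupby('Area')['Pincode'].transform('first'))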

Transposing and pivoting using pandas dataframe [duplicate]

I have a dataframe. Rows are unique persons and columns are various action types taken. I need the data restructured to show the individual events by row. Here is my current and desired format, as well as the approach I've been trying to implement.
current = pd.DataFrame({'name': {0: 'ross', 1: 'allen', 2: 'jon'},
                        'action a': {0: '2017-10-04', 1: '2017-10-04', 2: '2017-10-04'},
                        'action b': {0: '2017-10-05', 1: '2017-10-05', 2: '2017-10-05'},
                        'action c': {0: '2017-10-06', 1: '2017-10-06', 2: '2017-10-06'}})
desired = pd.DataFrame({'name': ['ross', 'ross', 'ross', 'allen', 'allen', 'allen', 'jon', 'jon', 'jon'],
                        'action': ['action a', 'action b', 'action c', 'action a', 'action b', 'action c', 'action a', 'action b', 'action c'],
                        'date': ['2017-10-04', '2017-10-05', '2017-10-06', '2017-10-04', '2017-10-05', '2017-10-06', '2017-10-04', '2017-10-05', '2017-10-06']})
Use df.melt (v0.20+):
df
action a action b action c name
0 2017-10-04 2017-10-05 2017-10-06 ross
1 2017-10-04 2017-10-05 2017-10-06 allen
2 2017-10-04 2017-10-05 2017-10-06 jon
df = df.melt('name').sort_values('name')
df.columns = ['name', 'action', 'date']
df
name action date
1 allen action a 2017-10-04
4 allen action b 2017-10-05
7 allen action c 2017-10-06
2 jon action a 2017-10-04
5 jon action b 2017-10-05
8 jon action c 2017-10-06
0 ross action a 2017-10-04
3 ross action b 2017-10-05
6 ross action c 2017-10-06
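Equivalently, melt's var_name and value_name arguments avoid renaming the columns afterwards (a sketch using the current frame from the question):
out = (current.melt('name', var_name='action', value_name='date')
              .sort_values('name'))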
