I have two columns, and I want to check whether they have 4 or more characters in common, regardless of position; if they match, create a column that says OK, and KO otherwise.
How can I do this in Python or SQLite?
Example:
Dataset (semicolon-separated):
Street 1;Street 2
ASENSIO Y TOLEDO 15;AVILA 9
AVILA 9;AVILA 9
FISTERRA S/N;FINISTERRE S/N - SAN ROQUE
PASEO DEL PUER;PASEO DEL PUERTO SN
PASEO DEL PUER;PASEO DEL PUERTO SN
LA UNION 2;LA UNION 2
ALEGRIA 14;LA UNION 2
Thank you.
Code:
import pandas as pd
import pandasql as pdsql

def dataset():
    df_dataset = pd.read_csv("C:/Users/Documents/DATASET2.CSV", sep=';')
    print(df_dataset.columns.values)
    query = """
        SELECT INSTR("Street 1", "Street 2")
        FROM df_dataset
    """
    result = pdsql.sqldf(query)
    print(result)
In Python you can use sets to get the unique characters in a string, and then take the & (intersection) of the sets from Street 1 and Street 2 to get the characters they share. I'm also removing spaces from the match; you don't want to count them, right?
df['count'] = ['OK' if len(set(x) & set(y) - set(' ')) >= 4 else 'KO' for x, y in zip(df['Street 1'].fillna(''), df['Street 2'].fillna(''))]
print(df)
Output:
Street 1 Street 2 count
0 ASENSIO Y TOLEDO 15 AVILA 9 KO
1 AVILA 9 AVILA 9 OK
2 FISTERRA S/N FINISTERRE S/N - SAN ROQUE OK
3 PASEO DEL PUER PASEO DEL PUERTO SN OK
4 PASEO DEL PUER PASEO DEL PUERTO SN OK
5 LA UNION 2 LA UNION 2 OK
6 ALEGRIA 14 LA UNION 2 KO
Update: If you're looking for the length of the longest common substring between Street 1 and Street 2:
from difflib import SequenceMatcher
z = df.fillna('')
z['count'] = [len(x[m.a:m.a+m.size].replace(' ', '')) for x, m in
[(x, SequenceMatcher(None, x, y).find_longest_match(0, len(x), 0, len(y)))
for x, y in zip(z['Street 1'], z['Street 2'])]]
z['match'] = ['OK' if x >= 4 else 'KO' for x in z['count']]
print(z)
Output:
Street 1 Street 2 count match
0 ASENSIO Y TOLEDO 15 AVILA 9 1 KO
1 AVILA 9 AVILA 9 6 OK
2 FISTERRA S/N FINISTERRE S/N - SAN ROQUE 6 OK
3 PASEO DEL PUER PASEO DEL PUERTO SN 12 OK
4 PASEO DEL PUER PASEO DEL PUERTO SN 12 OK
5 LA UNION 2 LA UNION 2 8 OK
6 ALEGRIA 14 LA UNION 2 1 KO
7 JARILLO 7 BO IZD SAN AMBROSIO 1 KO
8 STREET AVE PARRA PARRA STREET 4 6 OK
9 PARRA 4 0 KO
Also using numpy.where(), with the same intersection-minus-spaces test as above:
import numpy as np
df['res'] = np.where([len(set(x) & set(y) - set(' ')) >= 4 for x, y in zip(df['Street 1'], df['Street 2'])], 'OK', 'KO')
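For the SQLite half of the question: plain SQLite has no common-characters function, but the sqlite3 module lets you register a Python function and call it from SQL. A minimal sketch, assuming the same 4-character rule; common_chars is a helper name chosen here, not a built-in:
import sqlite3

def common_chars(a, b):
    # Count the characters the two strings share, ignoring spaces.
    return len(set(a or '') & set(b or '') - set(' '))

conn = sqlite3.connect(':memory:')
conn.create_function('common_chars', 2, common_chars)
conn.execute('CREATE TABLE streets ("Street 1" TEXT, "Street 2" TEXT)')
conn.executemany('INSERT INTO streets VALUES (?, ?)',
                 [('AVILA 9', 'AVILA 9'), ('ALEGRIA 14', 'LA UNION 2')])
for row in conn.execute('''
        SELECT "Street 1", "Street 2",
               CASE WHEN common_chars("Street 1", "Street 2") >= 4
                    THEN 'OK' ELSE 'KO' END AS result
        FROM streets'''):
    print(row)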
I want to fix a data frame that looks like this:
etiqueta            suma
2015-10               33
Baja California       12
Campeche              21
2015-11               12
Colima                 6
Ciudad de México       6
2015-12               30
Ciudad de México      20
Quintana Roo          10
so that it looks like this:
fecha    Baja California  Campeche  Colima  Ciudad de México  Quintana Roo
2015-10               12        21       0                 0             0
2015-11                0         0       6                 6             0
2015-12                0         0       0                20            10
I already tried regex to create another column with the dates, but I'm stuck.
Use pd.to_datetime, then mask the rows that come back NaN and forward-fill them with ffill:
df['new'] = df['etiqueta'].mask(pd.to_datetime(df['etiqueta'], format = '%Y-%m', errors='coerce').isna()).ffill()
out = df.query('etiqueta!=new').pivot_table(index = 'new',columns = 'etiqueta',values= 'suma',fill_value=0)
Out[213]:
etiqueta Baja California Campeche Ciudad de México Colima Quintana Roo
new
2015-10 12 21 0 0 0
2015-11 0 0 6 6 0
2015-12 0 0 20 0 10
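To see what the helper column does before the pivot, print it next to etiqueta; the forward fill tags every row with the date that opened its group (output derived from the sample data):
print(df[['etiqueta', 'new']])
#            etiqueta      new
# 0           2015-10  2015-10
# 1   Baja California  2015-10
# 2          Campeche  2015-10
# 3           2015-11  2015-11
# 4            Colima  2015-11
# 5  Ciudad de México  2015-11
# 6           2015-12  2015-12
# 7  Ciudad de México  2015-12
# 8      Quintana Roo  2015-12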
1. Find rows where etiqueta matches the date pattern YYYY-MM.
2. Create groups based on where those dates appear using cumsum, then use a groupby transform to attach each group's date to every row in that group.
3. Use a pivot_table to transition to wide format based on date.
4. Reset the index and clean up the axis names.
import pandas as pd
df = pd.DataFrame({'etiqueta': {0: '2015-10', 1: 'Baja California',
2: 'Campeche', 3: '2015-11', 4: 'Colima',
5: 'Ciudad de México', 6: '2015-12',
7: 'Ciudad de México', 8: 'Quintana Roo'},
'suma': {0: 33, 1: 12, 2: 21, 3: 12, 4: 6, 5: 6, 6: 30,
7: 20, 8: 10}})
# Mask where matches date pattern
m = df['etiqueta'].str.match(r'\d{4}-\d{2}')
# Use Transform to add date to end of every row in each group
# (Date is the first element in each group)
df['fecha'] = df.groupby(m.cumsum())['etiqueta'].transform(lambda g: g.iloc[0])
# Pivot On groups (Excluding Date Rows)
out = df[~m].pivot_table(index='fecha',
                         columns='etiqueta',
                         values='suma',
                         fill_value=0)
# Reset Index and Drop Axis Name
out = out.reset_index().rename_axis(None, axis=1)
# For Display
print(out.to_string())
Out:
fecha Baja California Campeche Ciudad de México Colima Quintana Roo
0 2015-10 12 21 0 0 0
1 2015-11 0 0 6 6 0
2 2015-12 0 0 20 0 10
# Create column fecha by extracting the dates; cumsum starts a new group at each date row
df['fecha'] = df.groupby(df['etiqueta'].str.contains(r'\d').cumsum())['etiqueta'].apply(lambda x: x.str.extract(r'(\d{4}-\d{2})', expand=False)).ffill()
Once extracted, drop dates from etiqueta and pivot.
df[~df['etiqueta'].str.contains(r'-\d')].pivot(index='fecha', columns='etiqueta', values='suma').fillna(0).reset_index()
Following your comments, it looks like you have duplicates in the index.
df['fecha'] = df.groupby(df['etiqueta'].str.contains(r'\d').cumsum())['etiqueta'].apply(lambda x: x.str.extract(r'(\d{4}-\d{2})', expand=False)).ffill()
df2 = df[~df['etiqueta'].str.contains(r'-\d')]
pd.pivot(df2, index='fecha', columns='etiqueta', values='suma').fillna(0).reset_index()
etiqueta fecha Baja California Campeche Ciudad de México Colima \
0 2015-10 12.0 21.0 0.0 0.0
1 2015-11 0.0 0.0 6.0 6.0
2 2015-12 0.0 0.0 20.0 0.0
etiqueta Quintana Roo
0 0.0
1 0.0
2 10.0
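If true duplicates remain (the same etiqueta appearing twice under one date), pivot raises a ValueError; a duplicate-safe sketch is to switch to pivot_table, which aggregates instead:
pd.pivot_table(df2, index='fecha', columns='etiqueta', values='suma',
               aggfunc='sum', fill_value=0).reset_index()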
I have the following dataframe and would like to create a column at the end called "dup" showing the number of times the row shows up based on the "Seasons" and "Actor" columns. Ideally the dup column would look like this:
Name Seasons Actor dup
0 Stranger Things 3 Millie 1
1 Game of Thrones 8 Emilia 1
2 La Casa De Papel 4 Sergio 1
3 Westworld 3 Evan Rachel 1
4 Stranger Things 3 Millie 2
5 La Casa De Papel 4 Sergio 1
This should do what you need:
df['dup'] = df.groupby(['Seasons', 'Actor']).cumcount() + 1
Output:
Name Seasons Actor dup
0 Stranger Things 3 Millie 1
1 Game of Thrones 8 Emilia 1
2 La Casa De Papel 4 Sergio 1
3 Westworld 3 Evan Rachel 1
4 Stranger Things 3 Millie 2
5 La Casa De Papel 4 Sergio 2
As Scott Boston mentioned, according to your criteria the last row should also be 2 in the dup column.
Here is a similar post that can provide more information: SQL-like window functions in PANDAS.
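If what you actually want is the total number of occurrences on every row (so that both duplicated rows read 2) rather than a running counter, a sketch using transform:
df['dup'] = df.groupby(['Seasons', 'Actor'])['Name'].transform('size')
# Each La Casa De Papel / Sergio row now shows 2, the full count for that pair.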
Goal: if the name in df2 in row i is a substring of or an exact match for a name in df1 in some row N, and the state and district columns of row N in df1 match the respective state and district columns of df2 row i, combine them.
Breakdown of the data frame inputs:
1. df1 is a time-series style data frame.
2. df2 is a regular data frame.
3. df1 and df2 do not have the same length.
4. df1 names contain initials, titles, and even weird character encodings.
5. df2 names are just a combination of first name, space, and last name.
My attempts have centered on taking names, districts, and state into account. They try to allow for the fact that names in df1 have initials, second names, titles, etc., whereas df2 has simply first and last names. I tried to use str.contains('[A-Za-z]') to account for this difference.
# Data Frame Samples
# Data Frame 1
CandidateName = ['Theodorick A. Bland','Aedanus Rutherford Burke','Jason Lewis','Barbara Comstock','Theodorick Bland','Aedanus Burke','Jason Initial Lewis', '','']
State = ['VA', 'SC', 'MN','VA','VA', 'SC', 'MN','NH','NH']
District = [9,2,2,10,9,2,2,1,1]
Party = ['','', '','Democrat','','','Democrat','Whig','Whig']
data1 = {'CandidateName':CandidateName, 'State':State, 'District':District,'Party':Party }
df1 = pd.DataFrame(data = data1)
print(df1)
# CandidateName District Party State
#0 Theodorick A. Bland 9 VA
#1 Aedanus Rutherford Burke 2 SC
#2 Jason Lewis 2 Democrat MN
#3 Barbara Comstock 10 Democrat VA
#4 Theodorick Bland 9 VA
#5 Aedanus Burke 2 SC
#6 Jason Initial Lewis 2 Democrat MN
#7 '' 1 Whig NH
#8 '' 1 Whig NH
Name = ['Theodorick Bland','Aedanus Burke','Jason Lewis', 'Barbara Comstock']
State = ['VA', 'SC', 'MN','VA']
District = [9,2,2,10]
Party = ['','', 'Democrat','Democrat']
data2 = {'Name':Name, 'State':State, 'District':District, 'Party':Party}
df2 = pd.DataFrame(data = data2)
print(df2)
# Name District Party State
#0 Theodorick Bland 9 VA
#1 Aedanus Burke 2 SC
#2 Jason Lewis 2 Democrat MN
#3 Barbara Comstock 10 Democrat VA
# Attempt code
df3 = df1.merge(df2, left_on = (df1.State, df1.District,df1.CandidateName.str.contains('[A-Za-z]')), right_on=(df2.State, df2.District,df2.Name.str.contains('[A-Za-z]')))
I included merging on District and State in order to reduce redundancies and inaccuracies. When I removed district and state from left_on and right_on, the output df3 not only grew in size but also contained a lot of wrong matches.
Examples include CandidateName and Name being two different people:
Theodorick A. Bland sharing the same row as Jasson Lewis Sr.
Some of the row results with the Attempt Code above are as follows:
Header
key_0 key_1 key_2 CandidateName District_x Party_x State_x District_y Name Party_y State_y
Row 6, index 4
MN 2 True Jason Lewis 2 Democrat MN 2 Jasson Lewis Sr. Republican MN
Row 11, index 3
3 VA 10 True Barbara Comstock 10 VA 10 Barbara Comstock Democrat VA
We can use difflib for this to create an artificial key column to merge on. We call this column Name, like the one in df2:
import difflib
df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])
df_merge = df1.merge(df2.drop('Party', axis=1), on=['Name', 'State', 'District'])
print(df_merge)
CandidateName State District Party Name
0 Theodorick A. Bland VA 9 Theodorick Bland
1 Theodorick Bland VA 9 Theodorick Bland
2 Aedanus Rutherford Burke SC 2 Aedanus Burke
3 Aedanus Burke SC 2 Aedanus Burke
4 Jason Lewis MN 2 Jason Lewis
5 Jason Initial Lewis MN 2 Democrat Jason Lewis
6 Barbara Comstock VA 10 Democrat Barbara Comstock
Explanation of difflib.get_close_matches: it looks for the most similar strings in df2. This is what the new Name column in df1 looks like:
print(df1)
CandidateName State District Party Name
0 Theodorick A. Bland VA 9 Theodorick Bland
1 Aedanus Rutherford Burke SC 2 Aedanus Burke
2 Jason Lewis MN 2 Jason Lewis
3 Barbara Comstock VA 10 Democrat Barbara Comstock
4 Theodorick Bland VA 9 Theodorick Bland
5 Aedanus Burke SC 2 Aedanus Burke
6 Jason Initial Lewis MN 2 Democrat Jason Lewis
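One caveat: get_close_matches returns an empty list when nothing clears its similarity cutoff (the empty CandidateName rows in the sample, for instance), so indexing [0] would raise an IndexError. A guarded sketch, with closest_name being a helper name chosen here:
import difflib

def closest_name(candidate, choices, cutoff=0.6):
    # Return the best match, or None when no choice clears the cutoff.
    matches = difflib.get_close_matches(candidate, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None

df1['Name'] = df1['CandidateName'].apply(
    lambda x: closest_name(x, df2['Name'].tolist()))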
I have the following dataframe df_address containing addresses of students:
student_id address_type Address City
1 R 6th street MPLS
1 P 10th street SE Chicago
1 E 10th street SE Chicago
2 P Washington ST Boston
2 E Essex St NYC
3 E 1040 Taft Blvd Dallas
4 R 24th street NYC
4 P 8th street SE Chicago
5 T 10 Riverside Ave Boston
6 20th St NYC
Each student can have multiple address types: R stands for "Residential", P for "Permanent", E for "Emergency", T for "Temporary", and address_type can also be blank.
I want to populate the "IsPrimaryAddress" column based on the following logic:
1. If address_type R exists for a particular student, "Yes" should be written in front of address_type "R" in the IsPrimaryAddress column, and "No" in front of the other address types for that student_id.
2. If address_type R doesn't exist but P exists, then IsPrimaryAddress='Yes' for 'P' and 'No' for the rest of the types.
3. If neither P nor R exists, but E exists, then IsPrimaryAddress='Yes' for 'E'.
4. If P, R, and E don't exist, but 'T' exists, then IsPrimaryAddress='Yes' for 'T'.
Resultant dataframe would look like this:
student_id address_type Address City IsPrimaryAddress
1 R 6th street MPLS Yes
1 P 10th street SE Chicago No
1 E 10th street SE Chicago No
2 P Washington ST Boston Yes
2 E Essex St NYC No
3 E 1040 Taft Blvd Dallas Yes
4 R 24th street NYC Yes
4 P 8th street SE Chicago No
5 T 10 Riverside Ave Boston Yes
6 20th St NYC Yes
How can I achieve this? I tried the rank and cumcount functions on address_type but couldn't get them to work.
First, use Categorical so that address_type can be sorted in a custom order:
df.address_type = pd.Categorical(df.address_type, ['R', 'P', 'E', 'T', ''], ordered=True)
df = df.sort_values('address_type')  # then sort the values
# Since we sorted, the first value in each group is the one to mark as Yes
df['new'] = (df.groupby('student_id').address_type.transform('first') == df.address_type).map({True: 'Yes', False: 'No'})
df = df.sort_index()  # sort the index back to the original order
student_id address_type new
0 1 R Yes
1 1 P No
2 1 E No
3 2 P Yes
4 2 E No
5 3 E Yes
6 4 R Yes
7 4 P No
8 5 T Yes
9 6 Yes
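A sketch of an alternative without Categorical: map each address_type to a numeric priority and flag the row holding the minimum priority per student. The priority dict is an assumption mirroring the R > P > E > T ordering above:
# Lower number = higher priority; blank types rank last
priority = {'R': 0, 'P': 1, 'E': 2, 'T': 3, '': 4}
rank = df['address_type'].map(priority)
is_primary = rank == rank.groupby(df['student_id']).transform('min')
df['IsPrimaryAddress'] = is_primary.map({True: 'Yes', False: 'No'})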
My pandas DataFrame df produces the result below:
grouped = df[(df['X'] == 'venture') & (df['company_code'].isin(['TDS','XYZ','UVW']))].groupby(['company_code','sector'])['X_sector'].count()
The output of this is as follows:
company_code  sector
TDS           Meta              404
              Electrical        333
              Mechanical        533
              Agri              453
XYZ           Sports            331
              Electrical        354
              Movies            375
              Manufacturing     355
UVW           Sports            505
              Robotics          345
              Movies             56
              Health           3263
              Manufacturing     456
              Others            524
Name: X_sector, dtype: int64
What I want to get is the top three sectors within each company code.
What is the best way to do it?
You will have to chain a groupby here. Consider this example:
import pandas as pd
import numpy as np
np.random.seed(111)
names = [
'Robert Baratheon',
'Jon Snow',
'Daenerys Targaryen',
'Theon Greyjoy',
'Tyrion Lannister'
]
df = pd.DataFrame({
'season': np.random.randint(1, 7, size=100),
'actor': np.random.choice(names, size=100),
'appearance': 1
})
s = df.groupby(['season','actor'])['appearance'].count()
print(s.sort_values(ascending=False).groupby('season').head(1)) # <-- head(3) for 3 values
Returns:
season actor
4 Daenerys Targaryen 7
6 Robert Baratheon 6
3 Robert Baratheon 6
5 Jon Snow 5
2 Theon Greyjoy 5
1 Jon Snow 4
Where s is (display clipped at season 4):
season  actor
1       Daenerys Targaryen    2
        Jon Snow              4
        Robert Baratheon      2
        Theon Greyjoy         3
        Tyrion Lannister      4
2       Daenerys Targaryen    4
        Jon Snow              3
        Robert Baratheon      1
        Theon Greyjoy         5
        Tyrion Lannister      3
3       Daenerys Targaryen    2
        Jon Snow              1
        Robert Baratheon      6
        Theon Greyjoy         3
        Tyrion Lannister      3
4       ...
Why make things complicated when a simple one-liner will do:
Z = df.groupby('company_code')['sector'].value_counts().groupby(level=0).head(3).sort_values(ascending=False).to_frame('counts').reset_index()
Z
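Another sketch, assuming grouped is the MultiIndex Series from the question: nlargest per outer level returns the top three counts for each company_code directly:
top3 = grouped.groupby(level='company_code', group_keys=False).nlargest(3)
print(top3)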