To start with, I have 3 Excel files: canada.xlsx, mexico.xlsx and usa.xlsx. Each has 3 columns: id (a number), Col A (text like Val1) and country. Each Excel file has only the country of its name in the third column, e.g. only Canada in canada.xlsx.
I make a df:
import pandas as pd
import glob
savepath = '/home/pedro/myPython/pandas/xl_files/'
saveoutputpath = '/home/pedro/myPython/pandas/outputxl/'
# I put an extra column in each excel file named country with either Canada, Mexico or USA
filelist = glob.glob(savepath + "*.xlsx")
# open the xl files with the data
# put all the data in 1 df
df = pd.concat((pd.read_excel(f) for f in filelist))
# change the indexes to get unique indexes
# df.index.size gets how many indexes there are
indexes = []
for i in range(df.index.size):
    indexes.append(i)
# now change the indexes pass a list to df.index
# never good to have 2 indexes the same
df.index = indexes
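(As an aside, the re-indexing loop isn't needed: concat can produce unique indexes directly. A minimal sketch of the same read-and-combine step, under the same file layout:)
import glob
import pandas as pd

savepath = '/home/pedro/myPython/pandas/xl_files/'
filelist = glob.glob(savepath + "*.xlsx")
# ignore_index=True renumbers the combined rows 0..n-1, so every index is unique
df = pd.concat((pd.read_excel(f) for f in filelist), ignore_index=True)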
I make the output Excel; it has 4 columns: id, Canada, Mexico, USA. The point of the exercise is to write an X in each country column for the corresponding id number. For example, id 42345 may be in Canada and Mexico, so 42345 should get an X in those 2 columns.
I made this work, but I extracted the data from df to a dictionary first. I tried various ways of doing this with df.loc or df.iloc, but I can't seem to make it work. I don't use pandas much.
This is how I build the output df_out:
# get a list of the ids
mylist = df["id"].values.tolist()
# get a set of the unique ids
myset = set(mylist)
#create new DataFrame with unique values in the column id
df_out = pd.DataFrame(columns=['id', 'Canada', 'Mexico', 'USA'], index=range(0, len(myset)))
df_out.fillna(0, inplace=True)
# make a list of unique ids and sort them
id_names = list(myset)
id_names.sort()
# populate the id column with id_names
df_out["id"] = id_names
# see how many rows and columns
print(df_out.shape)
# mydict[key][0] is the id column, mydict[key][2] is the country
for key in mydict.keys():
    df_out.loc[df_out["id"] == mydict[key][0], mydict[key][2]] = "X"
Can you help me with a more "pandas way" of writing the X in df_out directly from df?
df:
        id   Col A country
0    42345  Test 1     USA
1   681593  Test 2     USA
2   331574  Test 3     USA
3    15786  Test 4     USA
4    93512    Chk1  Mexico
5   681593    Chk2  Mexico
6   331574    Chk3  Mexico
7    89153    Chk4  Mexico
8    42345    Val1  Canada
9    93512    Val2  Canada
10  331574    Val3  Canada
11   76543    Val4  Canada
df_out:
id Canada Mexico USA
0 15786 0 0 0
1 42345 0 0 0
2 76543 0 0 0
3 89153 0 0 0
4 93512 0 0 0
5 331574 0 0 0
6 681593 0 0 0
What you want is a pivot table.
# values must name one non-key column ('Col A' here) so the aggfunc has something to mark
out = (pd.pivot_table(df, index='id', columns='country', values='Col A',
                      aggfunc=lambda z: 'X', fill_value=0)
         .rename_axis(None, axis=1)
         .reset_index())
Input (the relevant columns of df):
id country
0 42345 USA
1 681593 USA
2 331574 USA
3 15786 USA
4 93512 Mexico
5 681593 Mexico
6 331574 Mexico
7 89153 Mexico
8 42345 Canada
9 93512 Canada
10 331574 Canada
11 76543 Canada
Output
id Canada Mexico USA
0 15786 0 0 X
1 42345 X 0 X
2 76543 X 0 0
3 89153 0 X 0
4 93512 X X 0
5 331574 X X X
6 681593 0 X X
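An equivalent approach that sidesteps picking a values column is pd.crosstab; a sketch against the same df:
counts = pd.crosstab(df['id'], df['country'])
# counts holds how many rows pair each id with each country; any positive count becomes an X
df_out = (counts.mask(counts.gt(0), 'X')
                .rename_axis(None, axis=1)
                .reset_index())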
I have two dataframes as below
ID,Name,Sub,Country
1,ABC,ENG,UK
1,ABC,MATHS,UK
1,ABC,Science,UK
2,ABE,ENG,USA
2,ABE,MATHS,USA
2,ABE,Science,USA
3,ABF,ENG,IND
3,ABF,MATHS,IND
3,ABF,Science,IND
df1 = pd.read_clipboard(sep=',')
ID,Name,class,age
11,ABC,ENG,21
12,ABC,MATHS,23
1,ABC,Science,25
22,ABE,ENG,19
23,ABE,MATHS,22
24,ABE,Science,26
33,ABF,ENG,24
31,ABF,MATHS,28
32,ABF,Science,26
df2 = pd.read_clipboard(sep=',')
I would like to do the following:
a) Check whether the ID and Name from df1 are present in df2.
b) If present in df2, put Yes in a Status column, otherwise No. Don't use the ~ or not in operator, because my df2 has millions of rows, so it would produce irrelevant results.
I tried the below:
ID_list = df1['ID'].unique().tolist()
Name_list = df1['Name'].unique().tolist()
filtered_df = df2[((df2['ID'].isin(ID_list)) & (df2['Name'].isin(Name_list)))]
filtered_df = filtered_df.groupby(['ID','Name','class']).size().reset_index()
The above code gives the matching ids and names between df1 and df2.
But I want to find the ids and names that are present in df1 but missing from df2. I cannot use the ~ operator, because it would return all the rows from df2 that don't have a match in df1, and in the real world my df2 has millions of rows. I only want to flag the missing df1 ids and names in a Status column.
I expect my output to be as below:
ID,Name,Sub,Country, Status
1,ABC,ENG,UK,No
1,ABC,MATHS,UK,No
1,ABC,Science,UK,Yes
2,ABE,ENG,USA,No
2,ABE,MATHS,USA,No
2,ABE,Science,USA,No
3,ABF,ENG,IND,No
3,ABF,MATHS,IND,No
3,ABF,Science,IND,No
Your expected output corresponds to matching by 3 columns:
import numpy as np

m = df1.merge(df2,
              left_on=['ID','Name','Sub'],
              right_on=['ID','Name','class'],
              indicator=True, how='left')['_merge'].eq('both')
df1['Status'] = np.where(m, 'Yes', 'No')
print (df1)
ID Name Sub Country Status
0 1 ABC ENG UK No
1 1 ABC MATHS UK No
2 1 ABC Science UK Yes
3 2 ABE ENG USA No
4 2 ABE MATHS USA No
5 2 ABE Science USA No
6 3 ABF ENG IND No
7 3 ABF MATHS IND No
8 3 ABF Science IND No
A solution that tests membership with isin:
idx1 = pd.MultiIndex.from_frame(df1[['ID','Name','Sub']])
idx2 = pd.MultiIndex.from_frame(df2[['ID','Name','class']])
df1['Status'] = np.where(idx1.isin(idx2), 'Yes', 'No')
print (df1)
ID Name Sub Country Status
0 1 ABC ENG UK No
1 1 ABC MATHS UK No
2 1 ABC Science UK Yes
3 2 ABE ENG USA No
4 2 ABE MATHS USA No
5 2 ABE Science USA No
6 3 ABF ENG IND No
7 3 ABF MATHS IND No
8 3 ABF Science IND No
Note that matching by only 2 columns gives a different output:
m = df1.merge(df2, on=['ID','Name'], indicator=True, how='left')['_merge'].eq('both')
df1['Status'] = np.where(m, 'Yes', 'No')
print (df1)
ID Name Sub Country Status
0 1 ABC ENG UK Yes
1 1 ABC MATHS UK Yes
2 1 ABC Science UK Yes
3 2 ABE ENG USA No
4 2 ABE MATHS USA No
5 2 ABE Science USA No
6 3 ABF ENG IND No
7 3 ABF MATHS IND No
8 3 ABF Science IND No
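A note on scale: because only membership matters, the lookup side of the isin test can be deduplicated first, which keeps the MultiIndex small when df2 has millions of rows. A sketch on the same frames:
# drop_duplicates() does not change the isin result, only the size of the lookup index
lookup = pd.MultiIndex.from_frame(df2[['ID','Name','class']].drop_duplicates())
df1['Status'] = np.where(
    pd.MultiIndex.from_frame(df1[['ID','Name','Sub']]).isin(lookup), 'Yes', 'No')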
import re

Matches = 0
for country in df1['country']:
    # df2 has no 'street' column; its 'country' column plays that role in the pattern
    for street, City in zip(df2.country, df2.City):
        if re.match(r'[A-Za-z ]+:' + street + r'\.' + City, country):
            s = re.match(r'[A-Za-z ]+:' + street + r'\.' + City + r'.+', country)
            Matches += 1
            print(s)
print(Matches)
df1:
UID country
0 1 Gervais Philippon:France.PARISPenthièvre25
1 2 Jed Turner:England.LONDONQueensway69
2 3 Lino Jimenez:Spain.MADRIDChavela33
df2:
UID country City
0 1 France PARIS
1 2 Spain MADRID
2 3 England LONDON
Expected output:
UID country UID_df2
0 1 Gervais Philippon:France.PARISPenthièvre25 1
1 2 Jed Turner:England.LONDONQueensway69 3
2 3 Lino Jimenez:Spain.MADRIDChavela33 2
The matches are shown correctly. How can I link the dataframes by assigning the matched string to the other dataframe, to get the ideal format shown above?
Thank you.
First I would rename country in df1 to data or something else, so it doesn't get confused with country in df2:
df1 = df1.rename(columns={'country': 'data'})
Get the country and City data
df1[['country', 'City']] = df1['data'].str.extract('(:([A-Z]+[a-z]*)).([A-Z]+)', expand=True)[[1, 2]]
Trim the stray capital letter captured at the end of the City name (this step could be removed by tightening the regex above):
df1['City'] = df1['City'].map(lambda x: x[:-1])
Finally merge with df2
df1.merge(df2, on=['country', 'City'])
UID_x data country City UID_y
0 1 Gervais Philippon:France.PARISPenthièvre25 France PARIS 1
1 2 Jed Turner:England.LONDONQueensway69 England LONDON 3
2 3 Lino Jimenez:Spain.MADRIDChavela33 Spain MADRID 2
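For what it's worth, a tighter pattern can stop the City before the capitalised street name, making the x[:-1] trim unnecessary; a sketch, assuming the City is all-caps and is always followed by a capitalised street name:
# the (?=[A-Z][a-z]) lookahead ends the all-caps City right before the street name
df1[['country', 'City']] = df1['data'].str.extract(r':([A-Za-z]+)\.([A-Z]+)(?=[A-Z][a-z])')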
I have dataframe 1
Place
0 New York
1 Los Angeles 1
2 Los Angeles- 2
3 Dallas -1
4 Dallas - 2
5 Dallas3
dataframe 2
Place target value1 value2
New York 1000 a b
Los Angeles 1500 c d
Dallas 1 2000 e f
Desired dataframe
Place target value1 value2
New York 1000 a b
Los Angeles 1 750 c d
Los Angeles- 2 750 c d
Dallas -1 666.6 e f
Dallas - 2 666.6 e f
Dallas3 666.6 e f
Explanation: we have to merge dataframe1 and dataframe2 on 'Place'. We have 1 New York, 2 Los Angeles and 3 Dallas rows in dataframe1, but only one of each in dataframe2. So we split the target by the count of each place in df1 (matching on the name only, ignoring numbers) and assign value1 and value2 to each respective place.
Is there any way to handle the spelling variants, whitespace and special characters using regex and obtain the desired dataframe?
This is the exact solution:
def extract_city(col):
    # keep only the alphabetic words, e.g. 'Los Angeles- 2' -> 'Los Angeles'
    return col.str.extract(r'([a-zA-Z]+(?:\s+[a-zA-Z]+)*)')[0]

df = pd.merge(df1, df2, left_on=extract_city(df1['Place']), right_on=extract_city(df2['Place']))
df = df.drop(['key_0', 'Place_y'], axis=1).rename({'Place_x': 'Place'}, axis=1)
df['target'] /= df.groupby(extract_city(df['Place']))['Place'].transform('count')
df
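For reference, a quick illustrative check of what the helper extracts before the merge (results shown as comments):
extract_city(pd.Series(['Los Angeles- 2', 'Dallas -1', 'Dallas3']))
# 0    Los Angeles
# 1         Dallas
# 2         Dallas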
An alternative method to do this is as follows:
import pandas as pd
df1 = pd.DataFrame({'Place':['New York','Los Angeles 1','Los Angeles- 2','Dallas -1','Dallas - 2','Dallas3']})
print (df1)
# create a column to compare both dataframes: remove numeric, '-' and space characters
df1['Place_compare'] = df1.Place.str.replace(r'\d+|-| ', '', regex=True)
df2 = pd.DataFrame({'Place':['New York','Los Angeles','Dallas 1'],
'target':[1000,1500,2000],
'value1':['a','c','e'],
'value2':['b','d','f']})
print (df2)
# create a column to compare both dataframes: remove numeric, '-' and space characters
df2['Place_compare'] = df2.Place.str.replace(r'\d+|-| ', '', regex=True)
#count number of times the unique values of Place occurs in df1. assign to df2
df2['counts'] = df2['Place_compare'].map(df1['Place_compare'].value_counts())
#calculate new target based on number of occurrences of Place in df1
df2['new_target'] = (df2['target'] / df2['counts']).round(2)
# repeat the rows by the number of times each appears in counts
df2 = df2.reindex(df2.index.repeat(df2['counts']))
#drop temp columns
df2.drop(['counts','Place_compare','target'], axis=1, inplace=True)
#rename new_target as target
df2 = df2.rename({'new_target': 'target'}, axis=1)
print (df2)
The output of this will be:
Dataframe1:
Place
0 New York
1 Los Angeles 1
2 Los Angeles- 2
3 Dallas -1
4 Dallas - 2
5 Dallas3
Dataframe2:
Place target value1 value2
0 New York 1000 a b
1 Los Angeles 1500 c d
2 Dallas 1 2000 e f
Updated DataFrame with repeated values:
Place value1 value2 target
0 New York a b 1000.00
1 Los Angeles c d 750.00
1 Los Angeles c d 750.00
2 Dallas 1 e f 666.67
2 Dallas 1 e f 666.67
2 Dallas 1 e f 666.67
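If the goal is the desired dataframe from the question (keeping df1's original spellings), the same compare-key can drive the merge directly; a minimal self-contained sketch using the cleaning rule above:
import pandas as pd

df1 = pd.DataFrame({'Place': ['New York', 'Los Angeles 1', 'Los Angeles- 2',
                              'Dallas -1', 'Dallas - 2', 'Dallas3']})
df2 = pd.DataFrame({'Place': ['New York', 'Los Angeles', 'Dallas 1'],
                    'target': [1000, 1500, 2000],
                    'value1': ['a', 'c', 'e'], 'value2': ['b', 'd', 'f']})

# strip digits, '-' and spaces to build a join key on both sides
key1 = df1.Place.str.replace(r'\d+|-| ', '', regex=True)
key2 = df2.Place.str.replace(r'\d+|-| ', '', regex=True)

out = df1.assign(key=key1).merge(df2.assign(key=key2), on='key', suffixes=('', '_y'))
# split each target evenly across the df1 rows that share the key
out['target'] /= out.groupby('key')['key'].transform('size')
out = out.drop(columns=['key', 'Place_y'])
print(out)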
How can I group by a couple of columns and count only the rows whose value in another column contains a given string anywhere in it?
For example, if I want to look at state and theatre name, but only count the titles that have the word dog anywhere in them, how can I group by and filter for that?
State  Theatre  Title          TicketPrice
NY     B        Dog in heaven  5.50
NJ     C        Basketball     3.33
NY     B        Cats           9.00
NY     B        Hair of Dog    44.00
NY     B        Lions          22.00
NJ     C        Dog Land       4.99
Grouping by State and Theatre, I want the count of titles where Dog appears as a word in the Title column, and the TicketPrice sum for each group, only for titles where Dog appears.
Thanks!
Compare the column with Series.str.contains to get a mask, convert it to integers (True -> 1, False -> 0) and count the 1s with sum:
df1 = (df.assign(count = df['Title'].str.contains('Dog').astype(int))
.groupby(['State', 'Theatre'])['count']
.sum()
.reset_index())
print (df1)
State Theatre count
0 NJ C 1
1 NY B 2
If you also want the aggregate sum of the TicketPrice column per group:
df2 = (df.assign(count = df['Title'].str.contains('Dog').astype(int))
       .groupby(['State', 'Theatre'])[['count', 'TicketPrice']]
       .sum()
       .reset_index())
print (df2)
State Theatre count TicketPrice
0 NJ C 1 8.32
1 NY B 2 80.50
Alternatively, filter the rows first and then count them, though this drops groups with no match:
df1 = (df[df['Title'].str.contains('Dog')]
.groupby(['State', 'Theatre'])['TicketPrice']
.size()
.reset_index(name='count'))
print (df1)
State Theatre count
0 NJ C 1
1 NY B 2
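Since the question asks for Dog as a word, a word-boundary pattern can be swapped in for the plain substring; a sketch on the question's data:
import pandas as pd

df = pd.DataFrame({'State': ['NY', 'NJ', 'NY', 'NY', 'NY', 'NJ'],
                   'Theatre': ['B', 'C', 'B', 'B', 'B', 'C'],
                   'Title': ['Dog in heaven', 'Basketball', 'Cats',
                             'Hair of Dog', 'Lions', 'Dog Land'],
                   'TicketPrice': [5.50, 3.33, 9.00, 44.00, 22.00, 4.99]})

# \bDog\b matches Dog only as a standalone word, so e.g. 'Dogma' would not count
out = (df.assign(count=df['Title'].str.contains(r'\bDog\b').astype(int))
         .groupby(['State', 'Theatre'])['count']
         .sum()
         .reset_index())
print(out)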