Split values among names in Python

I have dataframe 1:
Place
0 New York
1 Los Angeles 1
2 Los Angeles- 2
3 Dallas -1
4 Dallas - 2
5 Dallas3
dataframe 2
Place target value1 value2
New York 1000 a b
Los Angeles 1500 c d
Dallas 1 2000 e f
Desired dataframe
Place target value1 value2
New York 1000 a b
Los Angeles 1 750 c d
Los Angeles- 2 750 c d
Dallas -1 666.6 e f
Dallas - 2 666.6 e f
Dallas3 666.6 e f
Explanation: We have to merge dataframe1 and dataframe2 on 'Place'. We have one New York, two Los Angeles and three Dallas rows in dataframe1, but only one row per place in dataframe2. So we split the target based on the count of each place (name only, ignoring the numbers) in df1 and assign value1 and value2 to the respective places.
Is there a way to handle the varying spellings, whitespace and special characters with regex and obtain the desired dataframe?

This is the exact solution:
import pandas as pd

def extract_city(col):
    # keep only the alphabetic part of the place name (drops digits, dashes, trailing numbering)
    return col.str.extract(r'([a-zA-Z]+(?:\s+[a-zA-Z]+)*)')[0]

df = pd.merge(df1, df2, left_on=extract_city(df1['Place']), right_on=extract_city(df2['Place']))
df = df.drop(['key_0', 'Place_y'], axis=1).rename({'Place_x': 'Place'}, axis=1)
# split the target evenly among the rows that share the same city name
df['target'] /= df.groupby(extract_city(df['Place']))['Place'].transform('count')
df
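For reference, this is what the helper pulls out of the question's df1 (a quick check, assuming df1 is built as in the alternate method below):
extract_city(df1['Place'])
# 0       New York
# 1    Los Angeles
# 2    Los Angeles
# 3         Dallas
# 4         Dallas
# 5         Dallas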

An alternate method to do this is as follows:
import pandas as pd

df1 = pd.DataFrame({'Place': ['New York', 'Los Angeles 1', 'Los Angeles- 2',
                              'Dallas -1', 'Dallas - 2', 'Dallas3']})
print(df1)

# create a column to compare both dataframes: remove digits, '-' and spaces
df1['Place_compare'] = df1.Place.str.replace(r'\d+|-| ', '', regex=True)

df2 = pd.DataFrame({'Place': ['New York', 'Los Angeles', 'Dallas 1'],
                    'target': [1000, 1500, 2000],
                    'value1': ['a', 'c', 'e'],
                    'value2': ['b', 'd', 'f']})
print(df2)

# create a column to compare both dataframes: remove digits, '-' and spaces
df2['Place_compare'] = df2.Place.str.replace(r'\d+|-| ', '', regex=True)

# count how many times each normalized place occurs in df1 and assign to df2
df2['counts'] = df2['Place_compare'].map(df1['Place_compare'].value_counts())

# calculate the new target based on the number of occurrences of the place in df1
df2['new_target'] = (df2['target'] / df2['counts']).round(2)

# repeat the rows by the number of times each place appears in counts
df2 = df2.reindex(df2.index.repeat(df2['counts']))

# drop temporary columns
df2.drop(['counts', 'Place_compare', 'target'], axis=1, inplace=True)

# rename new_target to target
df2 = df2.rename({'new_target': 'target'}, axis=1)
print(df2)
The output of this will be:
Dataframe1:
Place
0 New York
1 Los Angeles 1
2 Los Angeles- 2
3 Dallas -1
4 Dallas - 2
5 Dallas3
Dataframe2:
Place target value1 value2
0 New York 1000 a b
1 Los Angeles 1500 c d
2 Dallas 1 2000 e f
Updated DataFrame with repeated values:
Place value1 value2 target
0 New York a b 1000.00
1 Los Angeles c d 750.00
1 Los Angeles c d 750.00
2 Dallas 1 e f 666.67
2 Dallas 1 e f 666.67
2 Dallas 1 e f 666.67
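Note that this alternate method repeats df2's own Place strings, so the output does not keep the numbered spellings from df1 ('Los Angeles 1', 'Dallas -1', ...). If those original spellings are wanted, as in the desired dataframe above, a small variation (a sketch, run before the reindex/drop steps and reusing the Place_compare and new_target columns from the code above) is to merge df1 onto the per-place values instead of repeating rows:
desired = (df1.merge(df2[['Place_compare', 'new_target', 'value1', 'value2']],
                     on='Place_compare')
              .drop(columns='Place_compare')
              .rename(columns={'new_target': 'target'}))
print(desired)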

Related

Use pandas to mark cell X if id in country

To start with, I have 3 Excel files, canada.xlsx, mexico.xlsx and usa.xlsx. Each has 3 columns: id (a number), ColA (a text like Val1) and Country. Each Excel file has only the country of its name in the third column, e.g. only Canada in canada.xlsx.
I make a df:
import pandas as pd
import glob

savepath = '/home/pedro/myPython/pandas/xl_files/'
saveoutputpath = '/home/pedro/myPython/pandas/outputxl/'
# I put an extra column in each excel file named country with either Canada, Mexico or USA
filelist = glob.glob(savepath + "*.xlsx")
# open the xl files with the data and put all the data in 1 df
df = pd.concat((pd.read_excel(f) for f in filelist))
# change the indexes to get unique indexes
# (never good to have 2 indexes the same; df.index.size gives how many rows there are)
indexes = []
for i in range(df.index.size):
    indexes.append(i)
# now change the indexes: pass a list to df.index
df.index = indexes
I make the output Excel; it has 4 columns: id, Canada, Mexico, USA. The point of the exercise is to write X in each country column for the corresponding id number. For example, id 42345 may appear for Canada and Mexico, so 42345 should get an X in those 2 columns.
I made this work, but I extracted the data from df into a dictionary. I tried various ways of doing this with df.loc() or df.iloc() but I can't seem to make it work. I don't use pandas much.
This is the output df_out
# get a list of the ids
mylist = df["id"].values.tolist()
# get a set of the unique ids
myset = set(mylist)
#create new DataFrame with unique values in the column id
df_out = pd.DataFrame(columns=['id', 'Canada', 'Mexico', 'USA'], index=range(0, len(myset)))
df_out.fillna(0, inplace=True)
# make a list of unique ids and sort them
id_names = list(myset)
id_names.sort()
# populate the id column with id_names
df_out["id"] = id_names
# see how many rows and columns
print(df_out.shape)
# mydict[key][0] is the id column , mydict[key][2]]is the country
for key in mydict.keys():
    df_out.loc[df_out["id"] == mydict[key][0], mydict[key][2]] = "X"
Can you help me with a more "pandas way" of writing the X in df_out directly from df?
df:
id Col A country
0 42345 Test 1 USA
1 681593 Test 2 USA
2 331574 Test 3 USA
3 15786 Test 4 USA
4 93512 Chk1 Mexico
5 681593 Chk2 Mexico
6 331574 Chk3 Mexico
7 89153 Chk4 Mexico
8 42345 Val1 Canada
9 93512 Val2 Canada
10 331574 Val3 Canada
11 76543 Val4 Canada
df_out:
id Canada Mexico USA
0 15786 0 0 0
1 42345 0 0 0
2 76543 0 0 0
3 89153 0 0 0
4 93512 0 0 0
5 331574 0 0 0
6 681593 0 0 0
What you want is a pivot table.
pd.pivot_table(df, index='id', columns='country', aggfunc=lambda z: 'X', fill_value=0).rename_axis(None, axis=1).reset_index()
Input
id country
0 42345 USA
1 681593 USA
2 331574 USA
3 15786 USA
4 93512 Mexico
5 681593 Mexico
6 331574 Mexico
7 89153 Mexico
8 42345 Canada
9 93512 Canada
10 331574 Canada
11 76543 Canada
Output
id Canada Mexico USA
0 15786 0 0 X
1 42345 X 0 X
2 76543 X 0 0
3 89153 0 X 0
4 93512 X X 0
5 331574 X X X
6 681593 0 X X
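If you prefer to avoid the lambda, a similar table can be built with pd.crosstab; a minimal sketch (not part of the original answer), marking the non-zero counts with 'X':
out = pd.crosstab(df['id'], df['country'])
out = (out.mask(out.gt(0), 'X')        # replace positive counts with 'X', keep 0 elsewhere
          .rename_axis(None, axis=1)
          .reset_index())
print(out)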

Pandas filter without ~ and not in operator

I have two dataframes like as below
ID,Name,Sub,Country
1,ABC,ENG,UK
1,ABC,MATHS,UK
1,ABC,Science,UK
2,ABE,ENG,USA
2,ABE,MATHS,USA
2,ABE,Science,USA
3,ABF,ENG,IND
3,ABF,MATHS,IND
3,ABF,Science,IND
df1 = pd.read_clipboard(sep=',')
ID,Name,class,age
11,ABC,ENG,21
12,ABC,MATHS,23
1,ABC,Science,25
22,ABE,ENG,19
23,ABE,MATHS,22
24,ABE,Science,26
33,ABF,ENG,24
31,ABF,MATHS,28
32,ABF,Science,26
df2 = pd.read_clipboard(sep=',')
I would like to do the below
a) Check whether the ID and Name from df1 are present in df2.
b) If present in df2, put Yes in a Status column, otherwise No. Don't use ~ or the not in operator, because my df2 has millions of rows and it will result in irrelevant results.
I tried the below
ID_list = df1['ID'].unique().tolist()
Name_list = df1['Name'].unique().tolist()
filtered_df = df2[((df2['ID'].isin(ID_list)) & (df2['Name'].isin(Name_list)))]
filtered_df = filtered_df.groupby(['ID','Name','Sub']).size().reset_index()
The above code gives the matching IDs and names between df1 and df2.
But I want to find the IDs and names that are present in df1 but missing from df2. I cannot use the ~ operator because it would return all the rows from df2 that don't have a match in df1, and in the real world my df2 has millions of rows. I only want to find the missing df1 IDs and names and add a status column.
I expect my output to be like as below
ID,Name,Sub,Country, Status
1,ABC,ENG,UK,No
1,ABC,MATHS,UK,No
1,ABC,Science,UK,Yes
2,ABE,ENG,USA,No
2,ABE,MATHS,USA,No
2,ABE,Science,USA,No
3,ABF,ENG,IND,No
3,ABF,MATHS,IND,No
3,ABF,Science,IND,No
Expected output is for a match by 3 columns:
import numpy as np

m = df1.merge(df2,
              left_on=['ID','Name','Sub'],
              right_on=['ID','Name','class'],
              indicator=True, how='left')['_merge'].eq('both')
df1['Status'] = np.where(m, 'Yes', 'No')
print (df1)
ID Name Sub Country Status
0 1 ABC ENG UK No
1 1 ABC MATHS UK No
2 1 ABC Science UK Yes
3 2 ABE ENG USA No
4 2 ABE MATHS USA No
5 2 ABE Science USA No
6 3 ABF ENG IND No
7 3 ABF MATHS IND No
8 3 ABF Science IND No
With testing by isin, the solution is:
idx1 = pd.MultiIndex.from_frame(df1[['ID','Name','Sub']])
idx2 = pd.MultiIndex.from_frame(df2[['ID','Name','class']])
df1['Status'] = np.where(idx1.isin(idx2), 'Yes', 'No')
print (df1)
ID Name Sub Country Status
0 1 ABC ENG UK No
1 1 ABC MATHS UK No
2 1 ABC Science UK Yes
3 2 ABE ENG USA No
4 2 ABE MATHS USA No
5 2 ABE Science USA No
6 3 ABF ENG IND No
7 3 ABF MATHS IND No
8 3 ABF Science IND No
Because if matching by 2 columns, the output is different:
m = df1.merge(df2, on=['ID','Name'], indicator=True, how='left')['_merge'].eq('both')
df1['Status'] = np.where(m, 'Yes', 'No')
print (df1)
ID Name Sub Country Status
0 1 ABC ENG UK Yes
1 1 ABC MATHS UK Yes
2 1 ABC Science UK Yes
3 2 ABE ENG USA No
4 2 ABE MATHS USA No
5 2 ABE Science USA No
6 3 ABF ENG IND No
7 3 ABF MATHS IND No
8 3 ABF Science IND No
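If only the df1 rows that are missing from df2 are needed, rather than a full Status column, the same MultiIndex mask can be reused; a minimal sketch building on the isin solution above (the ~ here only negates a boolean mask aligned to df1, so it cannot pull in unmatched rows from df2):
mask = idx1.isin(idx2)
missing = df1[~mask]          # df1 rows with no (ID, Name, Sub/class) match in df2
print(missing)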

Flag difference in pandas dataframe

I have a pandas dataset and want to create a column that flags the difference,
i.e. column B should have the same value for each value in column A and vice versa. If not, flag it as 1.
column A    Column B    New Column
Atlanta     GA          0
Atlanta     GA          0
Newyork     NY          1
Newyork     YN          1
company1    Com         1
company     Com         1
company     Com         1
Since the question is updated, here is a way of doing it.
I use this data:
df = pd.DataFrame({"column A": ["Atlanta", "Atlanta", "New York", "New York"], "column B": ["AT", "AT", "YN", "NY"]})
df
column A column B
0 Atlanta AT
1 Atlanta AT
2 New York YN
3 New York NY
With pd.groupby:
import numpy as np

df_gb = df.groupby("column A", as_index=False).nunique()
condition = [df_gb["column B"] == 1]
value = [0]
df_gb["difference"] = np.select(condition, value, default=1)
df_gb = df_gb[["column A", "difference"]]
Output[0] :
df_gb
column A difference
0 Atlanta 0
1 New York 1
Then finally :
df = df.merge(df_gb, on="column A", how="left")
Output[1] :
df
column A column B difference
0 Atlanta AT 0
1 Atlanta AT 0
2 New York YN 1
3 New York NY 1
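Note that the groupby above only checks that each column A value maps to a single column B value, so it would not flag the company/company1 rows from the question, where one column B value maps to two different column A values. A both-directions variant (a sketch, not from the original answer, assuming the question's column names 'column A', 'Column B' and 'New Column'):
# flag 1 when the A<->B mapping is not one-to-one in either direction
a_to_b = df.groupby('column A')['Column B'].transform('nunique')
b_to_a = df.groupby('Column B')['column A'].transform('nunique')
df['New Column'] = ((a_to_b > 1) | (b_to_a > 1)).astype(int)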
If you care about the order and repetition of each character in column B, you can get the similarity of each word in B to A.
def sim_lower(A, B):
    return ''.join([ch for ch in B.lower() if ch in A.lower()])

df['Flag'] = [sim_lower(A, B) == B.lower() for A, B in zip(df['column A'], df['column B'])]
Which returns
column A column B New Column Flag
0 Atlanta GA 0 False
1 Atlanta GA 0 False
2 Newyork NY 1 True
3 Newyork YN 1 True
4 company1 Com 1 True
5 company Com 1 True
6 company Com 1 True

Merge multiple tables and join the same column with comma split

I have about 15 csv files with the same set of unique IDs, and for each file col1 contains different text. How can I join them together to create a new table that contains all the information from those 15 files? I tried pd.merge, creating a new col1 by comma-joining the texts and deleting the duplicated col1 columns, but I end up with columns named col1_x, col1_y, etc. Is there any better way to implement this?
My input is,
df1:
ID col1 location gender
1 Airplane NY F
2 Bus CA M
3 NaN FL M
4 Bus WA F
df2:
ID col1 location gender
1 Apple NY F
2 Peach CA M
3 Melon FL M
4 Banana WA F
df3:
ID col1 location gender
1 NaN NY F
2 Football CA M
3 Boxing FL M
4 Running WA F
Expected output is,
ID col1 location gender
1 Airplane,Apple NY F
2 Bus,Peach,Football CA M
3 Melon,Boxing FL M
4 Bus,Banana,Running WA F
You could use concat + groupby:
merged = pd.concat([df1, df2, df3], sort=False)
result = (merged.dropna()
                .groupby(['location', 'gender'], as_index=False)
                .agg({'col1': ','.join})
                .reset_index(drop=True))
print(result)
Output
location gender col1
0 CA M Bus,Peach,Football
1 FL M Melon,Boxing
2 NY F Airplane,Apple
3 WA F Bus,Banana,Running
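Since the question mentions roughly 15 CSV files, the same concat + groupby idea can be driven from a file glob instead of hard-coded frames; a sketch assuming the files live in a hypothetical data/ directory and all share the ID, col1, location and gender columns (grouping by ID as well so it is kept in the result):
import glob
import pandas as pd

frames = [pd.read_csv(f) for f in glob.glob('data/*.csv')]   # hypothetical path pattern
merged = pd.concat(frames, sort=False)
result = (merged.dropna(subset=['col1'])
                .groupby(['ID', 'location', 'gender'], as_index=False)
                .agg({'col1': ','.join}))
print(result)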
For your data, you can do:
(pd.concat(df.melt(id_vars='ID').dropna() for df in [df1, df2, df3])
   .groupby(['ID', 'variable'])['value']
   .apply(lambda x: ','.join(x.unique()))
   .unstack()
)
Output:
variable col1 gender location
ID
1 Airplane,Apple F NY
2 Bus,Peach,Football M CA
3 Melon,Boxing M FL
4 Bus,Banana,Running F WA

Dividing each row by the previous one

I have pandas dataframe:
df = pd.DataFrame()
df['city'] = ['NY','NY','LA','LA']
df['hour'] = ['0','12','0','12']
df['value'] = [12,24,3,9]
city hour value
0 NY 0 12
1 NY 12 24
2 LA 0 3
3 LA 12 9
I want, for each city, to divide each row by the previous one and write the result into a new dataframe. The desired output is:
city ratio
NY 2
LA 3
What's the most pythonic way to do this?
First divide by the shifted values per group:
df['ratio'] = df['value'].div(df.groupby('city')['value'].shift(1))
print (df)
city hour value ratio
0 NY 0 12 NaN
1 NY 12 24 2.0
2 LA 0 3 NaN
3 LA 12 9 3.0
Then remove the NaNs and select only the city and ratio columns:
df = df.dropna(subset=['ratio'])[['city', 'ratio']]
print (df)
city ratio
1 NY 2.0
3 LA 3.0
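The two steps can also be written as a single chained expression (a minimal sketch of the same idea):
out = (df.assign(ratio=df['value'].div(df.groupby('city')['value'].shift()))
         .dropna(subset=['ratio'])[['city', 'ratio']])
print(out)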
You can use pct_change:
In [20]: df[['city']].assign(ratio=df.groupby('city').value.pct_change().add(1)).dropna()
Out[20]:
city ratio
1 NY 2.0
3 LA 3.0
This'll do it (note it works here only because each city has exactly two rows; with more rows per city, max/min is not the same as dividing consecutive rows):
df.groupby('city')['value'].agg(ratio=lambda x: x.max() / x.min()).reset_index()
#   city  ratio
# 0   LA      3
# 1   NY      2
This is one way using a custom function. It assumes you want to ignore the NaN rows in the result of dividing one series by a shifted version of itself.
def divider(x):
    return x['value'] / x['value'].shift(1)

res = df.groupby('city').apply(divider)\
        .dropna().reset_index()\
        .rename(columns={'value': 'ratio'})\
        .loc[:, ['city', 'ratio']]
print(res)
city ratio
0 LA 3.0
1 NY 2.0
One way is:
df.groupby(['city']).apply(lambda x: x['value'] / x['value'].shift(1))
For further improvement:
print(df.groupby(['city'])
        .apply(lambda x: (x['value'] / x['value'].shift(1)).bfill())
        .reset_index()
        .drop_duplicates(subset=['city'])
        .drop('level_1', axis=1))
city value
0 LA 3.0
2 NY 2.0
