Comparing two DataFrames of different lengths to find differences in a specific column - python

I have 2 DataFrames with the same columns but different lengths.
In [1]: df_g = pd.DataFrame([['EDC_TRAING_NO', 'EDU_T_N', 'NUMVER', '20'],
                             ['EDC_TRAING_NAME', 'EDU_T_NM', 'VARCHAR', '40'],
                             ['EDC_TRAING_ST', 'EDU_T_SD', 'DATETIME', '15'],
                             ['EDC_TRAING_END', 'EDU_T_ED', 'DATETIME', '15'],
                             ['EDC_PLACE_NM', 'EDU_P_NM', 'VARCHAR2', '40'],
                             ['ONLINE_REQST_POSBL_AT', 'ONLINE_R_P_A', 'VARCHAR2', '1']],
                            columns=['NAME', 'ID', 'TYPE', 'LEN'])
In [2]: df_n = pd.DataFrame([['EDC_TRAING_NO', 'EDU_TR_N', 'NUMVER', '20'],
                             ['EDC_TRAING_NAME', 'EDU_TR_NM', 'VARCHAR', '20'],
                             ['EDC_TRAING_ST', 'EDU_TR_SD', 'DATETIME', '15'],
                             ['EDC_TRAING_END', 'EDU_T_ED', 'DATETIME', '15'],
                             ['EDC_PLACE_NM', 'EDU_PL_NM', 'VARCHAR2', '40'],
                             ['ONLINE_REQST_POSBL_AT', 'ONLINE_REQ_P_A', 'VARCHAR2', '1']],
                            columns=['NAME', 'ID', 'TYPE', 'LEN'])
The result I want to get:
result = pd.DataFrame([['EDC_TRAING_NO', 'EDU_TR_N', 'NUMVER', '20'],
                       ['EDC_TRAING_ST', 'EDU_TR_SD', 'DATETIME', '15'],
                       ['EDC_TRAING_END', 'EDU_T_ED', 'DATETIME', '15'],
                       ['EDC_PLACE_NM', 'EDU_PL_NM', 'VARCHAR2', '40'],
                       ['ONLINE_REQST_POSBL_AT', 'ONLINE_REQ_P_A', 'VARCHAR2', '1']],
                      columns=['NAME', 'ID', 'TYPE', 'LEN'])
Each df has a length like this:
len(df_g) : 1000
len(df_n) : 5000
Each DataFrame has the columns 'NAME', 'ID', 'TYPE', 'LEN'.
I need to match rows on those columns (NAME, TYPE, LEN) across the two DataFrames and then compare the 'ID' column to see whether it has the same value or not.
So I tried this:
for i in df_g.index:
    for j in df_n.index:
        # take each row as an ndarray
        g_row = df_g.iloc[i].values
        n_row = df_n.iloc[j].values
        # build a string key from NAME, TYPE and LEN for the comparison
        g_str = g_row[0] + str(g_row[2]) + str(g_row[3])
        n_str = n_row[0] + str(n_row[2]) + str(n_row[3])
        # if the keys match but the IDs differ, report it
        if g_str == n_str and g_row[1] != n_row[1]:
            print(i, j)
            print(g_row[0])
I have the above code for the two DataFrames of different lengths.
First I tried iterrows() to compare the two DataFrames,
but it took too much time (very slow).
I looked for other ways to get better performance.
Possible ways I found:
Option 1:
transform the df to a dict with to_dict(), then compare in a nested for-loop
Option 2:
transform the df Series to ndarrays, then compare in a nested for-loop
Is there any other, better option?
Or any option that avoids the nested for-loop?
Thanks.

You can try merge.
If you are looking for records where the IDs mismatch, the following is one way of achieving it:
r1 = df_g.merge(df_n, on=['NAME', 'TYPE', 'LEN'], how='inner').query('ID_x != ID_y').rename(columns={'ID_x': 'ID'}).drop(columns='ID_y')
I have used an inner join (how='inner'), but based on your needs you can use any of the following joins:
{‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’
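For readability, the same operation can be split across lines. A quick sketch applied to the example frames above (it keeps only the merged rows whose IDs differ):
r1 = (
    df_g.merge(df_n, on=['NAME', 'TYPE', 'LEN'], how='inner')  # match rows on NAME, TYPE and LEN
        .query('ID_x != ID_y')                                 # keep rows where the IDs disagree
        .rename(columns={'ID_x': 'ID'})
        .drop(columns='ID_y')
)
print(r1)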

Related

Python DataFrame: use apply function with multiple lambdas

I have a short script. For example, I have a dataset where I group by id and take the first 3 rows per id, then group again, this time merging the name, url and house values into one row per id.
Example input and output:
Dataset (input CSV):
id,name,house
1,a,house1,
1,aa,house2
1,aaa,house3
2,b,house4
2,bb,house5
2,bbb,house6
3,c,house7
3,cc,house8
3,ccc,house9
4,d,house10
4,dd,house11
4,ddd,house12
4,dddd, house13
The output CSV:
1,a,house1,aa,house2,aaa,house3
2,b,house4,bb,house5,bbb,house6
3,c,house7,cc,house8,ccc,house9
4,d,house10,dd,house11,ddd,house12
Script:
df = pd.read_csv('test.csv', delim_whitespace=True)
df = df.sort_values(by=['id'])  # assign back, otherwise the sorted result is discarded
df = df.groupby('id').head(3).groupby('id').agg({
    'name': lambda l: ','.join(l),
    'house': lambda l: ','.join(l)
})
df[['name_first', 'name_second', 'name_third']] = df.name.str.split(',', expand=True)
df[['house_first', 'house_second', 'house_third']] = df.house.str.split(',', expand=True)
df = df.reset_index().drop(['name', 'house'], axis=1)
df.to_csv('output.csv')
I want to add a progress bar, but I couldn't. If I could switch the agg function to apply, I think I could switch it to progress_apply, but I couldn't work out how to do that. I need a progress bar because I have a really huge CSV file (over 10 million lines), so it is going to take time and I want to track the progress.
df = pd.DataFrame({'id': ['1', '1', '1', '2', '2', '2', '3', '3', '3', '4', '4', '4', '4'],
                   'name': ['a', 'aa', 'aaa', 'b', 'bb', 'bbb', 'c', 'cc', 'ccc', 'd', 'dd', 'ddd', 'dddd'],
                   'house': ['house1', 'house2', 'house3', 'house4', 'house5', 'house6', 'house7', 'house8', 'house9', 'house10', 'house11', 'house12', ' house13']
                   })
This approach creates a pivot table:
outcome = df.groupby('id').head(3)\
    .assign(count=df.groupby('id').cumcount())\
    .set_index(['id', 'count']).unstack()\
    .sort_index(axis=1, level=1)
and then we can save it after renaming the columns
outcome.columns = [f'{x}_{str(y)}' for x, y in outcome.columns]
outcome.to_csv('...')
But this does not come with a progress bar because I did not use apply.
To use progress bar for the sake of using it:
from tqdm.notebook import tqdm
tqdm.pandas()
outcome = df.groupby('id').progress_apply(
    lambda x: x.head(3).reset_index(drop=True).set_index('id', append=True).unstack(0),
).droplevel(0).sort_index(axis=1, level=1)
outcome.columns = [f'{x}_{str(y)}' for x, y in outcome.columns]
outcome.to_csv('...')
Please try both approaches and see which is faster.
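A rough way to time them on your own data (a sketch using the standard-library timer; df is the example frame above, and the progress_apply version can be wrapped the same way):
import time

start = time.perf_counter()
outcome = df.groupby('id').head(3)\
    .assign(count=df.groupby('id').cumcount())\
    .set_index(['id', 'count']).unstack()\
    .sort_index(axis=1, level=1)
outcome.columns = [f'{x}_{y}' for x, y in outcome.columns]
print(f'pivot approach: {time.perf_counter() - start:.3f} s')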

Modifying multiple Pandas columns with .map()

I'm trying to modify a few columns so that all the yeses and nos become 1 and 0:
df['Outbreak Associated', 'FSA'] = df['Outbreak Associated', 'FSA'].map({'yes': '1', 'no': '0'})
Doing them one at a time works, but two or more is giving me an error. I imagine there's something simple I'm missing but I can't think of what it is.
KeyError: ('Outbreak Associated', 'FSA')
Any thoughts?
You can use replace, BUT if there is no match it keeps the original value, not NaN like map does:
cols = ['Outbreak Associated', 'FSA']
df[cols] = df[cols].replace({'yes': '1', 'no': '0'})
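For example, on a value outside the mapping (a small sketch with a made-up 'maybe' entry), replace keeps it while map turns it into NaN:
import pandas as pd

s = pd.Series(['yes', 'no', 'maybe'])
print(s.replace({'yes': '1', 'no': '0'}))  # 'maybe' stays 'maybe'
print(s.map({'yes': '1', 'no': '0'}))      # 'maybe' becomes NaN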
Solutions for Series.map - you can loop over each column with DataFrame.apply and a lambda function:
df[cols] = df[cols].apply(lambda x: x.map({'yes': '1', 'no': '0'}))
Or use DataFrame.stack and Series.unstack:
df[cols] = df[cols].stack().map({'yes': '1', 'no': '0'}).unstack()
Or using map for each column separately ;):
d = {'yes': '1', 'no': '0'}
df['Outbreak Associated'] = df['Outbreak Associated'].map(d)
df['FSA'] = df['FSA'].map(d)

Updating/inserting data in an existing data table using Python

I would like some advice on how to update/insert new data into an already existing data table using Python/Databricks:
# Inserting and updating already existing data
# Original data
import pandas as pd
source_data = {'Customer Number': ['1', '2', '3'],
               'Colour': ['Red', 'Blue', 'Green'],
               'Flow': ['Good', 'Bad', 'Good']
               }
df1 = pd.DataFrame(source_data, columns=['Customer Number', 'Colour', 'Flow'])
print(df1)
# New data
new_data = {'Customer Number': ['1', '4'],
            'Colour': ['Blue', 'Blue'],
            'Flow': ['Bad', 'Bad']
            }
df2 = pd.DataFrame(new_data, columns=['Customer Number', 'Colour', 'Flow'])
print(df2)
# What the updated table will look like
updated_data = {'Customer Number': ['1', '2', '3', '4'],
                'Colour': ['Blue', 'Blue', 'Green', 'Blue'],
                'Flow': ['Bad', 'Bad', 'Good', 'Bad']
                }
df3 = pd.DataFrame(updated_data, columns=['Customer Number', 'Colour', 'Flow'])
print(df3)
What you can see here is that the original data has three customers. I then get 'new_data', which contains an update to customer 1's data and new data for customer 4, who was not in the original data. If you look at 'updated_data' you can see what the final data should look like: customer 1's data has been updated and customer 4's data has been inserted.
Does anyone know where I should start with this? Which module could I use?
I'm not expecting someone to solve this in terms of development, I just need a nudge in the right direction.
Edit: the data source is .txt or CSV, the output is JSON, but as I load the data to Cosmos DB it’ll automatically convert so don’t worry too much about that.
Thanks
Current data frame structure and 'pd.update'
With some preparation, you can use the pandas 'update' function.
First, the data frames must be indexed (this is often useful anyway).
Second, the source data frame must be extended by the new indices with dummy/NaN data so that it can be updated.
# set indices of original data frames
col = 'Customer Number'
df1.set_index(col, inplace=True)
df2.set_index(col, inplace=True)
df3.set_index(col, inplace=True)
# extend source data frame by new customer indices
df4 = df1.copy().reindex(index=df1.index.union(df2.index))
# update data
df4.update(df2)
# verify that new approach yields correct results
assert all(df3 == df4)
Current data frame structure and 'pd.concat'
A slightly easier approach joins the data frames and removes duplicate
rows (and sorts by index if wanted). However, the temporary concatenation requires
more memory which may limit the size of the data frames.
df5 = pd.concat([df1, df2])
df5 = df5.loc[~df5.index.duplicated(keep='last')].sort_index()
assert all(df3 == df5)
Alternative data structure
Given that 'Customer Number' is the crucial attribute of your data,
you may also consider restructuring your original dictionaries like that:
{'1': ['Red', 'Good'], '2': ['Blue', 'Bad'], '3': ['Green', 'Good']}
Then updating your data simply corresponds to (re)setting the key of the source data with the new data. Typically, working directly on dictionaries is faster than using data frames.
# define function to restructure data, for demonstration purposes only
def restructure(data):
    # transpose original data
    # https://stackoverflow.com/a/6473724/5350621
    vals = data.values()
    rows = list(map(list, zip(*vals)))
    # create new restructured dictionary with customers as keys
    restructured = dict()
    for row in rows:
        restructured[row[0]] = row[1:]
    return restructured

# restructure data
source_restructured = restructure(source_data)
new_restructured = restructure(new_data)

# simply (re)set new keys
final_restructured = source_restructured.copy()
for key, val in new_restructured.items():
    final_restructured[key] = val

# convert to data frame and check results
df6 = pd.DataFrame(final_restructured, index=['Colour', 'Flow']).T
assert all(df3 == df6)
PS: When setting 'df1 = pd.DataFrame(source_data, columns=[...])' you do not need the 'columns' argument because your dictionaries are nicely named and the keys are automatically taken as column names.
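For example (a quick illustration with the source_data dictionary from the question):
df1 = pd.DataFrame(source_data)  # column names are taken from the dict keys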
You can use set intersection to find the Customer Numbers to update, and set difference to find the new Customer Numbers to add.
Then you can first update the rows of the initial data frame by iterating through the intersection of Customer Numbers, and then concatenate to the initial data frame only the new rows with the new values.
# key column name, for clarity
cn = 'Customer Number'
# convert Customer Number values into integers to use sets
CusNum_df1 = [int(x) for x in df1[cn].values]
CusNum_df2 = [int(x) for x in df2[cn].values]
# find Customer Numbers to update and to add
CusNum_to_update = list(set(CusNum_df1).intersection(set(CusNum_df2)))
CusNum_to_add = list(set(CusNum_df2) - set(CusNum_df1))
# update rows in initial data frame
for num in CusNum_to_update:
    index_initial = df1.loc[df1[cn] == str(num)].index[0]
    index_new = df2.loc[df2[cn] == str(num)].index[0]
    for col in df1.columns:
        df1.at[index_initial, col] = df2.loc[index_new, col]
# concatenate new rows to initial data frame
for num in CusNum_to_add:
    df1 = pd.concat([df1, df2.loc[df2[cn] == str(num)]]).reset_index(drop=True)
out:
Customer Number Colour Flow
0 1 Blue Bad
1 2 Blue Bad
2 3 Green Good
3 4 Blue Bad
There are many ways, but in terms of readability, I would prefer to do this.
import pandas as pd

dict_source = {'Customer Number': ['1', '2', '3'],
               'Colour': ['Red', 'Blue', 'Green'],
               'Flow': ['Good', 'Bad', 'Good']
               }
df_origin = pd.DataFrame.from_dict(dict_source)
dict_new = {'Customer Number': ['1', '4'],
            'Colour': ['Blue', 'Blue'],
            'Flow': ['Bad', 'Bad']
            }
df_new = pd.DataFrame.from_dict(dict_new)
df_result = df_origin.copy()
df_result.set_index(['Customer Number'], inplace=True)
df_new.set_index(['Customer Number'], inplace=True)
df_result.update(df_new)  # update number 1
# handle number 4
df_result.reset_index(['Customer Number'], inplace=True)
df_new.reset_index(['Customer Number'], inplace=True)
df_result = df_result.merge(df_new, on=list(df_result), how='outer')
print(df_result)
Customer Number Colour Flow
0 1 Blue Bad
1 2 Blue Bad
2 3 Green Good
3 4 Blue Bad
You can use 'Customer Number' as index and use update method:
import pandas as pd
source_data = {'Customer Number': ['1', '2', '3'],
               'Colour': ['Red', 'Blue', 'Green'],
               'Flow': ['Good', 'Bad', 'Good']
               }
df1 = pd.DataFrame(source_data, index=source_data['Customer Number'], columns=['Colour', 'Flow'])
print(df1)
# New data
new_data = {'Customer Number': ['1', '4'],
            'Colour': ['Blue', 'Blue'],
            'Flow': ['Bad', 'Bad']
            }
df2 = pd.DataFrame(new_data, index=new_data['Customer Number'], columns=['Colour', 'Flow'])
print(df2)
df3 = df1.reindex(index=df1.index.union(df2.index))
df3.update(df2)
print(df3)
Colour Flow
1 Blue Bad
2 Blue Bad
3 Green Good
4 Blue Bad

For loop that appends rows to dataframe, starting from a list output

I've got an output from an API call as a list:
out = client.phrase_this(phrase='ciao', database='it')
out
[{'Keyword': 'ciao',
  'Search Volume': '673000',
  'CPC': '0.05',
  'Competition': '0',
  'Number of Results': '205000000'}]
type(out)
list
I'd like to create a dataframe and loop-append a new row to it, using the API output for multiple keywords.
index = ['ciao', 'google', 'microsoft']
columns = ['Keyword', 'Search Volume', 'CPC', 'Competition', 'Number of Results']
df = pd.DataFrame(index=index, columns=columns)
For loop that is not working:
for keyword in index:
    df.loc[keyword] = client.phrase_this(phrase=index, database='it')
Thanks!
The reason this is not working is that you are trying to assign a dictionary inside a list to the data frame row, rather than just a list.
You are receiving a list containing a dictionary. If you only want to use the first entry of this list the following solution should work:
for keyword in index:
    df.loc[keyword] = client.phrase_this(phrase=keyword, database='it')[0].values()
[0] gets the first entry of the list.
values() returns a list of all the values in the dictionary. https://www.tutorialspoint.com/python/dictionary_values.htm
for keyword in index:
    df.loc[keyword] = client.phrase_this(phrase=keyword, database='it')
This passes the keyword to the phrase_this function, instead of the entire index list.
Thanks for the answers, I found a workaround:
index = ['ciao', 'google', 'microsoft']
columns = ['Keyword', 'Search Volume', 'CPC', 'Competition', 'Number of Results']
out = []
for query in index:
    out.append(client.phrase_this(phrase=query, database='it')[0].values())
out
[dict_values(['ciao', '673000', '0.05', '0', '205000000']),
dict_values(['google', '24900000', '0.66', '0', '13020000000']),
dict_values(['microsoft', '110000', '0.12', '0.06', '77'])]
df = pd.DataFrame(out, columns=columns).set_index('Keyword')
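Since each API call already returns a list of dictionaries, a slightly simpler variant of the same workaround (a sketch, assuming client.phrase_this behaves as shown above) is to skip .values() and let pandas take the column names from the dict keys:
out = []
for query in index:
    out.append(client.phrase_this(phrase=query, database='it')[0])
df = pd.DataFrame(out).set_index('Keyword')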

Reorder a list based on regular expression matches in Python 2.7?

I would like to reorder a list of strings (column headers from Pandas) in Python 2.7.13 based on a regular expression. The desired output will have the current 0 index item in the same place, followed immediately by the matched strings found using the regular expression, followed by the remaining strings.
# Here's the input list:
cols = ['ID', 'MP', 'FC', 'Dest_MP', 'Dest_FC', 'Origin_MP', 'Origin_FC']
# And the desired output:
output_cols = ['ID', 'FC', 'Dest_FC', 'Origin_FC', 'MP', 'Dest_MP', 'Origin_MP']
I have a working code example. It's not pretty, and that's why I'm here.
import re
cols = ['ID', 'MP', 'FC', 'Dest_MP', 'Dest_FC', 'Origin_MP', 'Origin_FC']
pattern = re.compile(r'^FC|FC$')
matched_cols = filter(pattern.search, cols)
indices = [0] + [cols.index(match_column) for match_column in matched_cols]
output_cols, counter = [], 0
for index in indices:
    output_cols.append(cols.pop(index - counter))
    counter += 1
output_cols += cols
print(output_cols)
Is there a more readable, more pythonic way to accomplish this?
Isolate the first element; there is no way around that.
Then, on the rest of the list, use a sort key which returns a tuple:
- first priority: a boolean indicating whether the element matches the regex (negated so matches appear first)
- second priority: the element itself, to tiebreak among matching/non-matching elements
Like this:
import re
cols = ['ID', 'MP', 'FC', 'Dest_MP', 'Dest_FC', 'Origin_MP', 'Origin_FC']
new_cols = [cols[0]] + sorted(cols[1:],key=lambda x : (not bool(re.search("^FC|FC$",x)),x))
result:
['ID', 'Dest_FC', 'FC', 'Origin_FC', 'Dest_MP', 'MP', 'Origin_MP']
If you want FC to appear first, add a third value to the returned key. Let's choose the length of the strings (it is not clear what you really want as a tiebreaker):
key=lambda x : (not bool(re.search("^FC|FC$",x)),len(x),x)
result is now:
['ID', 'FC', 'Dest_FC', 'Origin_FC', 'MP', 'Dest_MP', 'Origin_MP']
note that sort is stable, so maybe you don't need a tiebreaker at all:
new_cols = [cols[0]] + sorted(cols[1:],key=lambda x : not bool(re.search("^FC|FC$",x)))
result:
['ID', 'FC', 'Dest_FC', 'Origin_FC', 'MP', 'Dest_MP', 'Origin_MP']
