pandas str contains with maximum value - python

I have two dataframes: one contains strings, and the other contains a date and a string.
df2 = pd.DataFrame({'Name': ['Tim', 'Timothy', 'Kistian', 'Kris cole', 'Ian'],
                    'Age': ['1-2-1997', '21-3-1998', '19-6-2000', '18-4-1996', '12-12-2001']})
df1 = pd.DataFrame({'string': ['Ti', 'Kri', 'ian'],
                    'MaxDate': [None, None, None]})
I want to assign to the MaxDate column the maximum date resulting from a str.contains(df1['string'][0]) operation on df2.
For example, df2[df2.Name.str.contains(df1['string'][0])] gives me 2 records, and I want to assign the maximum of their dates to the MaxDate entry corresponding to 'Ti'.
That is, the output of the first iteration would be:
df1 = pd.DataFrame({'string': ['Ti', 'Kri', 'ian'],
                    'MaxDate': ['21-3-1998', None, None]})
How can I do this for all entries of df1 using a loop?

If you need a loop solution, build a list of dictionaries holding each max, then pass it to the DataFrame constructor:
df2['Age'] = pd.to_datetime(df2['Age'], dayfirst=True)

out = []
for x in df1['string']:
    m = df2.loc[df2.Name.str.contains(x), 'Age'].max()
    out.append({'string': x, 'MaxDate': m})

df = pd.DataFrame(out)
print(df)
  string    MaxDate
0     Ti 1998-03-21
1    Kri 1996-04-18
2    ian 2000-06-19
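Note that str.contains is case-sensitive by default, which is why 'ian' matches 'Kistian' but not 'Ian' above. If you want case-insensitive matching (whether you do depends on your data), pass case=False inside the loop:

m = df2.loc[df2.Name.str.contains(x, case=False), 'Age'].max()

Be aware this can change other rows too; for example, 'Ti' would then also match 'Kistian'.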

Related

find common data between two dataframes on a specific range of date

I have two dataframes df1 and df2 based, respectively, on these dictionaries:
data1 = {'date': ['5/09/22', '7/09/22', '7/09/22', '10/09/22'],
         'second_column': ['first_value', 'second_value', 'third_value', 'fourth_value'],
         'id_number': ['AA576bdk89', 'GG6jabkhd589', 'BXV6jabd589', 'BXzadzd589'],
         'fourth_column': ['first_value', 'second_value', 'third_value', 'fourth_value']}

data2 = {'date': ['5/09/22', '7/09/22', '7/09/22', '7/09/22', '7/09/22', '11/09/22'],
         'second_column': ['first_value', 'second_value', 'third_value', 'fourth_value', 'fifth_value', 'sixth_value'],
         'id_number': ['AA576bdk89', 'GG6jabkhd589', 'BXV6jabd589', 'BXV6mkjdd589', 'GGdbkz589', 'BXhshhsd589'],
         'fourth_column': ['first_value', 'second_value', 'third_value', 'fourth_value', 'fifth_value', 'sixth_value']}
I want to compare df2 with df1 in order to show the id_number values of df2 that are also in df1.
I also want to restrict the comparison to the date range the two dataframes share.
For example, the shared date range between df1 and df2 should be from 5/09/22 to 10/09/22 (and not beyond).
How can I do this?
You can define a helper function that builds a dataframe from a dictionary and slices it to a given date range:
def format(dictionary, start, end):
    """Helper function.

    Args:
        dictionary: dictionary to format.
        start: start date (DD/MM/YY).
        end: end date (DD/MM/YY).

    Returns:
        Dataframe.
    """
    return (
        pd.DataFrame(dictionary)
        .pipe(lambda df_: df_.assign(date=pd.to_datetime(df_["date"], format="%d/%m/%y")))
        .pipe(
            lambda df_: df_.loc[
                (df_["date"] >= pd.to_datetime(start, format="%d/%m/%y"))
                & (df_["date"] <= pd.to_datetime(end, format="%d/%m/%y")),
                :,
            ]
        )
        .reset_index(drop=True)
    )
Then, with the dictionaries you provided, here is how you can show the id_number values of df2 that are in df1 for the desired date range:
df1 = format(data1, "05/09/22", "10/09/22")
df2 = format(data2, "05/09/22", "10/09/22")
print(df2[df2["id_number"].isin(df1["id_number"])]["id_number"])
# Output
0 AA576bdk89
1 GG6jabkhd589
2 BXV6jabd589
Name: id_number, dtype: object
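If you'd rather not hardcode the range, here is a minimal sketch that derives the shared range as the overlap of the two dataframes' date spans (it assumes the format helper above and the same DD/MM/YY date format):

d1 = pd.to_datetime(pd.Series(data1["date"]), format="%d/%m/%y")
d2 = pd.to_datetime(pd.Series(data2["date"]), format="%d/%m/%y")

# the shared range starts at the later of the two minima and ends at the earlier of the two maxima
start = max(d1.min(), d2.min()).strftime("%d/%m/%y")
end = min(d1.max(), d2.max()).strftime("%d/%m/%y")

df1 = format(data1, start, end)
df2 = format(data2, start, end)

With the sample data this yields 05/09/22 to 10/09/22, matching the range stated above.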

How to split a column into many columns when the column names keep changing

I build a dataframe inside a function, and the name of each column changes continuously, so I can't refer to a column by a fixed name such as df['name'] in order to split it. The number of columns and rows is not constant either. I need to split any column that contains more than one item into multiple component columns.
For example:
This is one of the dataframes I have:
name/one name/three
(192.26949,) (435.54,436.65,87.3,5432)
(189.4033245,) (45.51,56.612, 54253.543, 54.321)
(184.4593252,) (45.58,56.6412,654.876,765.66543)
I want to convert it to:
name/one name/three1 name/three2 name/three3 name/three4
192.26949 435.54 436.65 87.3 5432
189.4033245 45.51 56.612 54253.543 54.321
184.4593252 45.58 56.6412 654.876 765.66543
If all values are tuples across all rows and columns, use concat with the DataFrame constructor and DataFrame.add_prefix:
df = pd.concat([pd.DataFrame(df[c].tolist()).add_prefix(c) for c in df.columns], axis=1)
print(df)
name/one0 name/three0 name/three1 name/three2 name/three3
0 192.269490 435.54 436.6500 87.300 5432.00000
1 189.403324 45.51 56.6120 54253.543 54.32100
2 184.459325 45.58 56.6412 654.876 765.66543
If the values are string representations of tuples, parse them first with ast.literal_eval:
import ast

L = [pd.DataFrame([ast.literal_eval(y) for y in df[c]]).add_prefix(c) for c in df.columns]
df = pd.concat(L, axis=1)
print(df)
name/one0 name/three0 name/three1 name/three2 name/three3
0 192.269490 435.54 436.6500 87.300 5432.00000
1 189.403324 45.51 56.6120 54253.543 54.32100
2 184.459325 45.58 56.6412 654.876 765.66543
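If you want the suffixes to start at 1 and want no suffix at all for columns that expand to a single value (as in the desired output above), a sketch along these lines should work, again assuming all values are tuples:

parts = []
for c in df.columns:
    expanded = pd.DataFrame(df[c].tolist())
    if expanded.shape[1] == 1:
        expanded.columns = [c]  # single item: keep the original column name
    else:
        expanded.columns = [c + str(i + 1) for i in range(expanded.shape[1])]  # 1-based suffixes
    parts.append(expanded)
df = pd.concat(parts, axis=1)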

Combining dataframes in Python to a dictionary using one of the dataframes as key

I have 3 dataframes containing daily data: unique codes, names, and scores. The top-left cell of each is called Rank, the remaining column headers are dates, and the first column (used as the index) holds the rank number.
**df1** UNIQUE CODES
Rank 12/8/2017 12/9/2017 .... 1/3/2018
1 Code_1 Code_3 Code_4
2 Code_2 Code_1 Code_2
...
1000 Code_5 Code_6 Code_7
**df2** NAMES
Rank 12/8/2017 12/9/2017 .... 1/3/2018
1 Jon Maria Peter
2 Brian Jon Maria
...
1000 Chris Tim Charles
**df3** SCORES
Rank 12/8/2017 12/9/2017 .... 1/3/2018
1 10 20 30
2 15 10 40
...
1000 25 15 20
Desired output:
I want to combine these dataframes into a dictionary, using df1 codenames as keys, so it will look like this:
dictionary = {'Code_1': ['Jon', 20], 'Code_2': ['Brian', 15]}
As there are repeat competitors, I need to sum their scores across the whole data series; in the example above, Jon's score for Code_1 would combine his scores from 12/8/2017 and 12/9/2017.
There are 1000 rows and 26 columns plus the index, so I need a way to capture all of them. I think a nested loop could work here, but I don't have enough experience to build one that works.
In the end, I would like to sort the result by highest score. Please suggest a solution, or a more straightforward way to combine this data and get the score ranking.
I used the proposed solution below on my 3 dataframes. Note that hashtags stands for codes, players for names, and trophies for scores:
# reshape to get dates into rows
hashtags_reshaped = pd.melt(hashtags, id_vars=['Rank'],
                            value_vars=hashtags.columns,
                            var_name='Date',
                            value_name='Code').drop('Rank', axis=1)

# reshape to get dates into rows
players_reshaped = pd.melt(players, id_vars=['Rank'],
                           value_vars=hashtags.columns,
                           var_name='Date',
                           value_name='Name').drop('Rank', axis=1)

# reshape to get the dates into rows
trophies_reshaped = pd.melt(trophies, id_vars=['Rank'],
                            value_vars=hashtags.columns,
                            var_name='Date',
                            value_name='Score').drop('Rank', axis=1)

# merge the three together.
# This _assumes_ that the dfs are all in the same order and that all the data matches up.
merged_df = pd.DataFrame([hashtags_reshaped['Date'],
                          hashtags_reshaped['Code'], players_reshaped['Name'],
                          trophies_reshaped['Score']]).T
print(merged_df)

# group by code, name, and date; sum the scores together if multiple exist
# for a given code-name-date grouping
grouped_df = merged_df.groupby(['Code', 'Name', 'Date']).sum().sort_values('Score', ascending=False)
print(grouped_df)

summed_df = merged_df.drop('Date', axis=1) \
    .groupby(['Code', 'Name']).sum() \
    .sort_values('Score', ascending=False).reset_index()
summed_df['li'] = list(zip(summed_df.Name, summed_df.Score))
print(summed_df)
But I'm getting strange output: the summed scores should be in the hundreds or low thousands (the average score is 200-300 and the average participation frequency is 4-6 times). The scores I get are way off, even though the codes and names match up correctly.
summed_df:
0 (MandiBralaX, 996871590076253)
1 (Arso_C, 9955130513430)
2 (ThatRainbowGuy, 9946)
3 (fabi, 9940)
4 (Dogão, 991917)
5 (Hierbo, 99168)
6 (Clyde, 9916156180128)
7 (.A.R.M.I.N., 9916014310187143)
8 (keftedokofths, 9900)
9 (⚽AngelSosa⚽, 990)
10 (Totoo98, 99)
group_df:
Code Name Score \
0 #JL2J02LY MandiBralaX 996871590076253
1 #80JQ90VC Arso_C 9955130513430
2 #9GGC2CUQ ThatRainbowGuy 9946
3 #8LL989QV fabi 9940
4 #9PPC89L Dogão 991917
5 #2JPLQ8JP8 Hierbo 99168
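Those inflated totals are the signature of summing strings rather than numbers: '200' + '300' concatenates to '200300' instead of adding to 500. A likely fix, assuming the Score column was read in as text, is to convert it to numeric before grouping:

# convert Score to numbers so groupby().sum() adds instead of concatenating
merged_df['Score'] = pd.to_numeric(merged_df['Score'], errors='coerce')

summed_df = merged_df.drop('Date', axis=1) \
    .groupby(['Code', 'Name']).sum() \
    .sort_values('Score', ascending=False).reset_index()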
This should get you much of the way there. I didn't create a dictionary at the end as you specified; while you may need that format, you'd end up with nested dictionaries or lists, since each Code has one Name but possibly many Dates and Scores associated with it. How do you want those recorded: as a list, a dict, something else?
The code below returns a grouped dataframe; you can dump it straight into a dict (shown), but you'll probably want to pin down the format in detail, especially if you need an ordered dictionary. (Dictionaries are not guaranteed to be ordered; you'll have to use from collections import OrderedDict and review that documentation if you really need an ordered dictionary.)
import pandas as pd

# create the dfs; note that 'Code' is set up as a string
df1 = pd.DataFrame({'Rank': [1, 2], '12/8/2017': ['1', '2'], '12/9/2017': ['3', '1']})
df1.set_index('Rank', inplace=True)

# reshape to get dates into rows; reset_index() makes 'Rank' a column again so melt can see it
df1_reshaped = pd.melt(df1.reset_index(), id_vars=['Rank'],
                       value_vars=df1.columns,
                       var_name='Date',
                       value_name='Code').drop('Rank', axis=1)
#print(df1_reshaped)

# create the second df
df2 = pd.DataFrame({'Rank': [1, 2], '12/8/2017': ['Name_1', 'Name_2'], '12/9/2017': ['Name_3', 'Name_1']})
df2.set_index('Rank', inplace=True)

# reshape to get dates into rows
df2_reshaped = pd.melt(df2.reset_index(), id_vars=['Rank'],
                       value_vars=df1.columns,
                       var_name='Date',
                       value_name='Name').drop('Rank', axis=1)
#print(df2_reshaped)

# create the third df; scores are integers so that sum() adds them rather than concatenating strings
df3 = pd.DataFrame({'Rank': [1, 2], '12/8/2017': [10, 20], '12/9/2017': [30, 10]})
df3.set_index('Rank', inplace=True)

# reshape to get the dates into rows
df3_reshaped = pd.melt(df3.reset_index(), id_vars=['Rank'],
                       value_vars=df1.columns,
                       var_name='Date',
                       value_name='Score').drop('Rank', axis=1)
#print(df3_reshaped)

# merge the three together.
# This _assumes_ that the dfs are all in the same order and that all the data matches up.
merged_df = pd.DataFrame([df1_reshaped['Date'], df1_reshaped['Code'],
                          df2_reshaped['Name'], df3_reshaped['Score']]).T
print(merged_df)

# group by code, name, and date; sum the scores together if multiple exist
# for a given code-name-date grouping
grouped_df = merged_df.groupby(['Code', 'Name', 'Date']).sum().sort_values('Score', ascending=False)
print(grouped_df)

summed_df = merged_df.drop('Date', axis=1) \
    .groupby(['Code', 'Name']).sum() \
    .sort_values('Score', ascending=False).reset_index()
summed_df['li'] = list(zip(summed_df.Name, summed_df.Score))
print(summed_df)
Unsorted dict:
d = dict(zip(summed_df.Code, summed_df.li))
print(d)
You can make the OrderedDict directly, of course, and should:
from collections import OrderedDict
d2 = OrderedDict(zip(summed_df.Code, summed_df.li))
print(d2)
summed_df:
Code Name Score li
0 3 Name_3 30 (Name_3, 30)
1 1 Name_1 20 (Name_1, 20)
2 2 Name_2 20 (Name_2, 20)
d:
{'3': ('Name_3', 30), '1': ('Name_1', 20), '2': ('Name_2', 20)}
d2, sorted:
OrderedDict([('3', ('Name_3', 30)), ('1', ('Name_1', 20)), ('2', ('Name_2', 20))])
This returns your (name, score) pairs as tuples rather than lists, but it should get you most of the way there.

Diff between two dataframes in pandas

I have two dataframes, both with the same basic schema (four date fields, a couple of string fields, and four or five float fields). Call them df1 and df2.
What I want is essentially a "diff" of the two: all rows that are not shared between the two dataframes (i.e. not in the set intersection). Note that the two dataframes need not be the same length.
I tried pandas.merge(how='outer'), but I wasn't sure which column to pass as the 'key', since there really isn't one, and the various combinations I tried were not working. It is possible that df1 or df2 has two (or more) rows that are identical.
What is a good way to do this in pandas/Python?
Try this:
diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')
diff_df = diff_df.loc[diff_df['Exist'] != 'both']
You will get a dataframe of all rows that don't exist in both df1 and df2.
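For a self-contained illustration (the frames here are made up), the indicator column marks where each row came from:

import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b', 'c'], 'value': [1.0, 2.0, 3.0]})
df2 = pd.DataFrame({'name': ['b', 'c', 'd'], 'value': [2.0, 3.0, 9.0]})

diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')
print(diff_df.loc[diff_df['Exist'] != 'both'])
# rows only in df1 show 'left_only'; rows only in df2 show 'right_only'

Note that this merges on all shared columns, so the two frames must have identical column sets.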
IIUC, you can use pd.Index.symmetric_difference:
pd.concat([df1, df2]).loc[
    df1.index.symmetric_difference(df2.index)
]
You can use this function; the output is an ordered dict of 6 dataframes, which you can write to Excel for further analysis.
df1 and df2 refer to your input dataframes.
uid refers to the column or combination of columns that makes up the unique key (e.g. 'Fruits').
dedupe (default=True) drops duplicates in df1 and df2 (refer to Step 4 in the comments).
labels (default=('df1', 'df2')) lets you name the input dataframes. If a unique key exists in both dataframes but has different values in one or more columns, it is usually important to see those rows; they are stacked one on top of the other and labelled with the dataframe name so you know which dataframe each row came from.
drop can take a list of columns to be excluded from consideration when computing the difference.
Here goes:
df1 = pd.DataFrame([['apple', '1'], ['banana', 2], ['coconut', 3]], columns=['Fruits', 'Quantity'])
df2 = pd.DataFrame([['apple', '1'], ['banana', 3], ['durian', 4]], columns=['Fruits', 'Quantity'])
dict1 = diff_func(df1, df2, 'Fruits')

In [10]: dict1['df1_only']
Out[10]:
    Fruits Quantity
1  coconut        3

In [11]: dict1['df2_only']
Out[11]:
   Fruits Quantity
3  durian        4

In [12]: dict1['Diff']
Out[12]:
   Fruits Quantity df1 or df2
0  banana        2        df1
1  banana        3        df2

In [13]: dict1['Merge']
Out[13]:
  Fruits Quantity
0  apple        1
Here is the code:
import pandas as pd
from collections import OrderedDict as od

def diff_func(df1, df2, uid, dedupe=True, labels=('df1', 'df2'), drop=[]):
    dict_df = {labels[0]: df1, labels[1]: df2}
    col1 = df1.columns.values.tolist()
    col2 = df2.columns.values.tolist()

    # There could be columns known to be different, hence allow user to pass this as a list to be dropped.
    if drop:
        print('Ignoring columns {} in comparison.'.format(', '.join(drop)))
        col1 = list(filter(lambda x: x not in drop, col1))
        col2 = list(filter(lambda x: x not in drop, col2))
        df1 = df1[col1]
        df2 = df2[col2]

    # Step 1 - Check if no. of columns are the same:
    len_lr = len(col1), len(col2)
    assert len_lr[0] == len_lr[1], \
        'Cannot compare frames with different number of columns: {}.'.format(len_lr)

    # Step 2a - Check if the set of column headers are the same (order doesn't matter)
    assert set(col1) == set(col2), \
        'Left column headers are different from right column headers.' \
        + '\n   Left orphans: {}'.format(list(set(col1) - set(col2))) \
        + '\n   Right orphans: {}'.format(list(set(col2) - set(col1)))

    # Step 2b - Check if the column headers are in the same order
    if col1 != col2:
        print('[Note] Reordering right Dataframe...')
        df2 = df2[col1]

    # Step 3 - Check datatypes are the same [order is important]
    if set((df1.dtypes == df2.dtypes).tolist()) - {True}:
        print('dtypes are not the same.')
        df_dtypes = pd.DataFrame({labels[0]: df1.dtypes, labels[1]: df2.dtypes,
                                  'Diff': (df1.dtypes == df2.dtypes)})
        df_dtypes = df_dtypes[df_dtypes['Diff'] == False][[labels[0], labels[1], 'Diff']]
        print(df_dtypes)
    else:
        print('DataType check: Passed')

    # Step 4 - Check for duplicate rows
    if dedupe:
        for key, df in dict_df.items():
            if df.shape[0] != df.drop_duplicates().shape[0]:
                print(key + ': Duplicates exist; they will be dropped.')
                dict_df[key] = df.drop_duplicates()

    # Step 5 - Check for duplicate uids
    if type(uid) == str or type(uid) == list:
        print('Uniqueness check: {}'.format(uid))
        for key, df in dict_df.items():
            count_uid = df.shape[0]
            count_uid_unique = df[uid].drop_duplicates().shape[0]
            var = [0, 1][count_uid_unique == df.shape[0]]  # round off to the nearest integer if it is 100%
            pct = round(100 * count_uid_unique / df.shape[0], var)
            print('{}: {} out of {} are unique ({}%).'.format(key, count_uid_unique, count_uid, pct))

    # Checks complete, begin merge. (Remember to dedupe; provide labels for common_no_match.)
    dict_result = od()
    df_merge = pd.merge(df1, df2, on=col1, how='inner')
    if not df_merge.shape[0]:
        print('Error: Merged DataFrame is empty.')
    else:
        dict_result[labels[0]] = df1
        dict_result[labels[1]] = df2
        dict_result['Merge'] = df_merge
        if type(uid) == str:
            uid = [uid]
        if type(uid) == list:
            # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
            df1_only = pd.concat([df1, df_merge]).reset_index(drop=True)
            df1_only['Duplicated'] = df1_only.duplicated(keep=False)  # keep=False marks all duplicates as True
            df1_only = df1_only[df1_only['Duplicated'] == False]
            df2_only = pd.concat([df2, df_merge]).reset_index(drop=True)
            df2_only['Duplicated'] = df2_only.duplicated(keep=False)
            df2_only = df2_only[df2_only['Duplicated'] == False]

            label = labels[0] + ' or ' + labels[1]
            df_lc = df1_only.copy()
            df_lc[label] = labels[0]
            df_rc = df2_only.copy()
            df_rc[label] = labels[1]
            df_c = pd.concat([df_lc, df_rc]).reset_index(drop=True)
            df_c['Duplicated'] = df_c.duplicated(subset=uid, keep=False)
            df_c1 = df_c[df_c['Duplicated'] == True]
            df_c1 = df_c1.drop('Duplicated', axis=1)
            df_uc = df_c[df_c['Duplicated'] == False]
            df_uc_left = df_uc[df_uc[label] == labels[0]]
            df_uc_right = df_uc[df_uc[label] == labels[1]]
            dict_result[labels[0] + '_only'] = df_uc_left.drop(['Duplicated', label], axis=1)
            dict_result[labels[1] + '_only'] = df_uc_right.drop(['Duplicated', label], axis=1)
            dict_result['Diff'] = df_c1.sort_values(uid).reset_index(drop=True)
    return dict_result
Set df2.columns = df1.columns.
Now set every column as the index: df1 = df1.set_index(df1.columns.tolist()), and similarly for df2.
You can now do df1.index.difference(df2.index) and df2.index.difference(df1.index); the two results are the rows distinct to each dataframe.
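A minimal sketch of that recipe (the frames here are illustrative):

import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
df2 = pd.DataFrame({'a': [2, 3, 4], 'b': ['y', 'z', 'w']})

df2.columns = df1.columns
i1 = df1.set_index(df1.columns.tolist()).index
i2 = df2.set_index(df2.columns.tolist()).index

print(i1.difference(i2))  # rows that appear only in df1
print(i2.difference(i1))  # rows that appear only in df2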
With
left_df.merge(df, left_on=left_df.columns.tolist(), right_on=df.columns.tolist(), how='outer')
you can get the outer join result. Similarly, you can get the inner join result, and the difference between the two is what you want.

Dictionary in Pandas DataFrame, how to split the columns

I have a DataFrame consisting of one column ('Vals') whose values are dictionaries. The DataFrame looks more or less like this:
In[215]: fff
Out[213]:
Vals
0 {u'TradeId': u'JP32767', u'TradeSourceNam...
1 {u'TradeId': u'UUJ2X16', u'TradeSourceNam...
2 {u'TradeId': u'JJ35A12', u'TradeSourceNam...
When looking at an individual row the dictionary looks like this:
In[220]: fff['Vals'][100]
Out[218]:
{u'BrdsTraderBookCode': u'dffH',
u'Measures': [{u'AssetName': u'Ie0',
u'DefinitionId': u'6dbb',
u'MeasureValues': [{u'Amount': -18.64}],
u'ReportingCurrency': u'USD',
u'ValuationId': u'669bb'}],
u'SnapshotId': 12739,
u'TradeId': u'17304M',
u'TradeLegId': u'31827',
u'TradeSourceName': u'xxxeee',
u'TradeVersion': 1}
How can I split the columns and create a new DataFrame, so that I get one column with TradeId and another one with MeasureValues?
Try this:
l = []
for idx, row in df['Vals'].items():  # iteritems() was removed in pandas 2.0; items() is equivalent
    temp_df = pd.DataFrame(row['Measures'][0]['MeasureValues'])
    temp_df['TradeId'] = row['TradeId']
    l.append(temp_df)
pd.concat(l, axis=0)
Here's a way to get TradeId and the MeasureValues amount (your sample row is used twice to illustrate the iteration):
new_df = pd.DataFrame()
for id, data in fff.iterrows():
    d = {'TradeId': data.iloc[0]['TradeId']}  # .ix was removed from pandas; use positional .iloc
    d.update(data.iloc[0]['Measures'][0]['MeasureValues'][0])
    new_df = pd.concat([new_df, pd.DataFrame.from_dict(d, orient='index').T])
Amount TradeId
0 -18.64 17304M
0 -18.64 17304M
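Since the 'Vals' dictionaries are nested JSON-like records, pd.json_normalize is a more declarative alternative; a sketch assuming every row follows the Measures -> MeasureValues layout shown above:

import pandas as pd

# record_path walks the nested lists; meta pulls top-level keys alongside them
flat = pd.json_normalize(
    fff['Vals'].tolist(),
    record_path=['Measures', 'MeasureValues'],
    meta=['TradeId'],
)
print(flat[['TradeId', 'Amount']])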
