How to dynamically match rows from two pandas dataframes - python

I have a large dataframe of URLs and a smaller second dataframe containing columns of matching strings that I want to use to merge the two dataframes together. Data from the second df will be used to populate the larger first df.
The matching strings can contain * wildcards (and more than one), but the order of the fragments still matters; so "path/*path2" would match "exsample.com/eg_path/extrapath2.html" but not "exsample.com/eg_path2/path/test.html". How can I use the strings in the second dataframe to merge the two dataframes together? There can be more than one matching string per row in the second dataframe.
import pandas as pd

urls = {'url': ['https://stackoverflow.com/questions/56318782/',
                'https://www.google.com/',
                'https://en.wikipedia.org/wiki/Python_(programming_language)',
                'https://stackoverflow.com/questions/'],
        'hits': [1000, 500, 300, 7]}
metadata = {'group': ['group1', 'group2'],
            'matching_string_1': ['google', 'wikipedia*Python_'],
            'matching_string_2': ['stackoverflow*questions*56318782', '']}
result = {'url': ['https://stackoverflow.com/questions/56318782/',
                  'https://www.google.com/',
                  'https://en.wikipedia.org/wiki/Python_(programming_language)',
                  'https://stackoverflow.com/questions/'],
          'hits': [1000, 500, 300, 7],
          'group': ['group2', 'group1', 'group1', '']}

df1 = pd.DataFrame(urls)
df2 = pd.DataFrame(metadata)
what_I_am_after = pd.DataFrame(result)

Not very robust but gives the correct answer for my example.
import pandas as pd

urls = {'url': ['https://stackoverflow.com/questions/56318782/',
                'https://www.google.com/',
                'https://en.wikipedia.org/wiki/Python_(programming_language)',
                'https://stackoverflow.com/questions/'],
        'hits': [1000, 500, 300, 7]}
metadata = {'group': ['group1', 'group2'],
            'matching_string_1': ['google', 'wikipedia*Python_'],
            'matching_string_2': ['stackoverflow*questions*56318782', '']}
result = {'url': ['https://stackoverflow.com/questions/56318782/',
                  'https://www.google.com/',
                  'https://en.wikipedia.org/wiki/Python_(programming_language)',
                  'https://stackoverflow.com/questions/'],
          'hits': [1000, 500, 300, 7],
          'group': ['group2', 'group1', 'group1', '']}
df1 = pd.DataFrame(urls)
df2 = pd.DataFrame(metadata)

results = pd.DataFrame(columns=['url', 'hits', 'group'])
for index, row in df2.iterrows():
    for x in row[1:]:                                  # every matching_string_* column
        group = x.split('*')
        # turn each non-empty fragment into 'fragment.*' and join into one ordered regex
        rx = "".join([str(part) + ".*" if len(part) > 0 else '' for part in group])
        if rx == "":
            continue
        mask = df1['url'].str.contains(rx, na=False, regex=True)
        if mask.any():
            temp = df1[mask].copy()                    # copy to avoid SettingWithCopyWarning
            temp['group'] = row[0]                     # the 'group' value of this metadata row
            # DataFrame.append was removed in pandas 2.0, so use concat instead
            results = pd.concat([results, temp])

d3 = df1.merge(results, how='outer', on=['url', 'hits'])
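A more general variant of the same idea is to translate each wildcard string into a single regex up front, stack the metadata into long form with melt, and then look up the first matching group per URL. A minimal sketch on the same sample data; pattern_to_regex and first_group are illustrative helper names, not pandas API:
import re
import pandas as pd

def pattern_to_regex(pattern):
    # 'a*b*c' -> 'a.*b.*c', preserving fragment order and escaping regex metacharacters
    parts = [re.escape(p) for p in pattern.split('*') if p]
    return '.*'.join(parts)

# one row per (group, matching string); drop the empty placeholders
long_meta = (df2.melt(id_vars='group', value_name='pattern')
                .query("pattern != ''"))
long_meta['regex'] = long_meta['pattern'].map(pattern_to_regex)

def first_group(url):
    # return the group of the first pattern that matches, '' if none do
    for _, r in long_meta.iterrows():
        if re.search(r['regex'], url):
            return r['group']
    return ''

df1['group'] = df1['url'].map(first_group)
Whether this reproduces what_I_am_after exactly depends on how overlaps between groups are resolved; here the first matching pattern in metadata order wins.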

Pandas DataFrame combine rows by column value, where Date Rows are NULL

Scenario:
Parse a PDF bank statement and transform it into a clean, formatted CSV file.
What I've tried:
I managed to parse the PDF file (tabular format) using the camelot library, but failed to produce the desired result in terms of formatting.
Code:
import camelot
import pandas as pd

tables = camelot.read_pdf('test.pdf', pages='3')
for i, table in enumerate(tables):
    print(f'table_id:{i}')
    print(f'page:{table.page}')
    print(f'coordinates:{table._bbox}')

tables = camelot.read_pdf('test.pdf', flavor='stream', pages='3')
df = tables[0].df  # take the first detected table as a DataFrame

# promote the first row to the header
columns = df.iloc[0]
df.columns = columns
df = df.drop(0)
df.head()

# strip currency symbols and dashes from the text columns
for c in df.select_dtypes('object').columns:
    df[c] = df[c].str.replace('$', '', regex=False)
    df[c] = df[c].str.replace('-', '', regex=False)

def convert_to_float(num):
    try:
        return float(num.replace(',', ''))
    except (AttributeError, ValueError):
        return 0

for col in ['Deposits', 'Withdrawals', 'Balance']:
    df[col] = df[col].map(convert_to_float)
My_Result:
Desired_Output:
The logic I came up with is to move those rows up (to row n-1, I guess) if the Date column is NaN, but I don't know whether this logic is right. Can anyone help me sort this out properly?
I tried pandas groupby and aggregation functions, but they only merge the whole data and remove NaN and duplicate dates, which is not suitable because every entry is necessary.
Using transform:
# rows with a real Date start a new statement entry
df.loc[~df.Date.isna(), 'group'] = 1
# the cumulative sum gives every continuation row the same group id as the dated row above it
g = df.group.fillna(0).cumsum()
# join the Description pieces within each group, then keep only the rows that have a Date
df['Description'] = df.groupby(g)['Description'].transform(' '.join)
new_df = df.loc[~df['Date'].isna()]

Count occurrence of column values in other dataframe column

I have two dataframes and I want to count the occurrences of "classifier" in "fullname". My problem is that my script counts a word like "carrepair" towards only one classifier, and I would like to have a count for both classifiers. I would also like to add one random coordinate that matches the classifier.
First dataframe:
Second dataframe:
Result so far:
Desired Result:
My script so far:
import pandas as pd

fl = pd.read_excel(r'fullname.xlsx')
clas = pd.read_excel(r'classifier.xlsx')
fl.fullname = fl.fullname.str.lower()
clas.classifier = clas.classifier.str.lower()

pat = '({})'.format('|'.join(clas['classifier'].unique()))
fl['fullname'] = fl['fullname'].str.extract(pat, expand=False)
clas['count_of_classifier'] = clas['classifier'].map(fl['fullname'].value_counts())
print(clas)
Thanks!
You could try this:
import pandas as pd

fl = pd.read_excel(r'fullname.xlsx')
clas = pd.read_excel(r'classifier.xlsx')
fl.fullname = fl.fullname.str.lower()
clas.classifier = clas.classifier.str.lower()

# Add a new column to 'fl' containing either 'repair' or 'car'
for value in clas["classifier"].values:
    fl.loc[fl["fullname"].str.contains(value, case=False), value] = value

# Count values and create a new dataframe
new_clas = pd.DataFrame(
    {
        "classifier": [col for col in clas["classifier"].values],
        "count": [fl[col].count() for col in clas["classifier"].values],
    }
)

# Merge 'fl' and 'new_clas'
new_clas = pd.merge(
    left=new_clas, right=fl, how="left", left_on="classifier", right_on="fullname"
).reset_index(drop=True)

# Keep only expected columns
new_clas = new_clas.reindex(columns=["classifier", "count", "coordinate"])
print(new_clas)
# Outputs
classifier count coordinate
repair 3 52.520008, 13.404954
car 3 54.520008, 15.404954

Is there a way to dynamically find partial matching numbers between columns in pandas dataframes?

I'm looking for a way of comparing partial numeric values between columns from different dataframes. These columns are filled with something like social security numbers (they can't and won't repeat), so something like a dynamic isin() would be ideal.
These are representations of very large dataframes that I import from CSV files.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"S_number": ["271600", "860078", "342964", "763261", "215446", "205303", "973637", "814452", "399304", "404205"]})
df2 = pd.DataFrame({"Id_number": ["14452", "9930", "1544", "5303", "973637", "4205", "0271600", "342964", "763", "60078"]})
print(df1)
print(df2)

df2['Id_number_length'] = df2['Id_number'].str.len()
df2.groupby('Id_number_length').count()
count_list = df2.groupby('Id_number_length')[['Id_number_length']].count()
print('count_list:\n', count_list)

df1['S_number'] = pd.to_numeric(df1['S_number'], downcast='integer')
df2['Id_number'] = pd.to_numeric(df2['Id_number'], downcast='integer')

inner_join = pd.merge(df1, df2, left_on=['S_number'], right_on=['Id_number'], how='inner')
print('MATCH!:\n', inner_join)

outer_join = pd.merge(df1, df2, left_on=['S_number'], right_on=['Id_number'], how='outer', indicator=True)
anti_join = outer_join[~(outer_join._merge == 'both')].drop('_merge', axis=1)
print('UNMATCHED:\n', anti_join)
What I need to get is something like the following as the result of the inner join (or whatever method works):
df3 = pd.DataFrame({"S_number": ["271600", "860078", "342964", "763261", "215446", "205303", "973637", "814452", "399304", "404205"],
                    "Id_number": ["027160", "60078", "342964", "763", "1544", "5303", "973637", "14452", "9930", "4205"]})
print('MATCH!:\n', df3)
I thought that something like this (very crude) pseudocode would work, using count_list to strip parts of the numbers in df1 to fully match df2 instead of partially matching (notice that in df2 the missing or added digits are always at the beginning or the end):
for i in count_list:
    if i == 6:
        try inner join
        except empty output
    elif i == 5:
        try
            df1.loc[:, 'S_number'] = df_ib_c.loc[:, 'S_number'].str[1:]
            inner join with df2
        except empty output
        try
            df1.loc[:, 'S_number'] = df_ib_c.loc[:, 'S_number'].str[:-1]
            inner join with df2
    elif i == 4:
        same as above...
But the lengths in count_list are variable, so this for loop is an inefficient way to do it.
Any help with this would be much appreciated; I've been stuck on this for days. Thanks in advance.
You can 'explode' each line of df1 into up to 45 lines. For example, SSN 123456789 can be mapped to [1, 2, 3, ..., 9, 12, 23, 34, 45, ..., 89, ..., 12345678, 23456789, 123456789]. While this looks bad, from an algorithmic standpoint it is O(1) per row and therefore O(N) in total.
Using this new column as the key, a simple merge can combine the two DFs easily, which is usually O(N log N).
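A minimal sketch of that explode-and-merge idea, using a trimmed version of the sample data above (substrings is an illustrative helper, and this direction only catches Id_numbers that are substrings of an S_number):
import pandas as pd

df1 = pd.DataFrame({"S_number": ["271600", "860078", "342964"]})
df2 = pd.DataFrame({"Id_number": ["0271600", "60078", "342964"]})

def substrings(s):
    # all contiguous substrings of s: n*(n+1)/2 of them, i.e. 45 for a 9-digit number
    return sorted({s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)})

# explode df1 so each row is repeated once per substring, used as the join key
exploded = df1.assign(key=df1['S_number'].map(substrings)).explode('key')

# partial matches where an Id_number equals a substring of an S_number
matches = exploded.merge(df2, left_on='key', right_on='Id_number')[['S_number', 'Id_number']]
print(matches)
Since in this question the extra digits can also be on the df2 side (e.g. '0271600'), the same explode can be applied to df2 and df1['S_number'] merged against that key, with the two match sets concatenated.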
Here is an example of what I would do. I hope I've understood the question correctly; feel free to ask if it's not clear.
import pandas as pd
import joblib
from joblib import Parallel, delayed

# Building the base
df1 = pd.DataFrame({"S_number": ["271600", "860078", "342964", "763261", "215446", "205303", "973637", "814452", "399304", "404205"]})
df2 = pd.DataFrame({"Id_number": ["14452", "9930", "1544", "5303", "973637", "4205", "0271600", "342964", "763", "60078"]})

# Initiate an empty list for the indexes
IDX = []

# Using a function so it can be parallelized if the database is big
def func(x, y):
    if all(c in df2.Id_number[y] for c in df1.S_number[x]):
        return (x, y)

# Using the max number of processors
number_of_cpu = joblib.cpu_count()

# Preparing a delayed function to be parallelized
delayed_funcs = (delayed(func)(x, y) for x in range(len(df1)) for y in range(len(df2)))

# Fitting it with processes and not threads
parallel_pool = Parallel(n_jobs=number_of_cpu, prefer="processes")

# Filling the IDX list
IDX.append(parallel_pool(delayed_funcs))

# Dropping the None entries
IDX = list(filter(None, IDX[0]))

# Making df3 with the tuples of indexes
df3 = pd.DataFrame(IDX)

# Making it readable
df3['df1'] = df1.S_number[df3[0]].to_list()
df3['df2'] = df2.Id_number[df3[1]].to_list()
df3
OUTPUT :

Pandas - populate list with dataframe values as strings

I am reading CSV files from a folder and filtering them into a pandas dataframe, like so:
import glob
import os
import pandas as pd

results = []
for filename in glob.glob(os.path.join('/path/*.csv')):
    with open(filename) as p:
        df = pd.read_csv(p)
        filtered = df[(df['duration'] > low1) & (df['duration'] < high1)]
        artist = filtered['artist'].values
        print(artist)
        track = filtered['track'].values
        print(track)
where low1 = 0 and high1 = 0.5.
artist and track print hundreds of filtered items as normal strings, but if I try to append them to results in the loop:
artist = filtered['artist'].values
track = filtered['track'].values
results.append([track,artist])
I see that I am appending whole arrays with their dtypes rather than plain strings, and results ends up populated with only a fraction of the filtered items. I don't understand what is happening.
How do I populate results with all items as regular strings, in this fashion:
[['artist1', 'track1'], ['artist1', 'track2'], ...]
Create a list of DataFrames, then join them together with concat and finally convert to nested lists:
results = []
for filename in glob.glob(os.path.join('/path/*.csv')):
    df = pd.read_csv(filename)
    # filter by conditions and also select columns by name with .loc
    filtered = df.loc[(df['duration'] > low1) & (df['duration'] < high1), ['artist', 'track']]
    # alternative solution: between with both endpoints excluded
    # ('neither' in pandas >= 1.3, inclusive=False in older versions)
    filtered = df.loc[df['duration'].between(low1, high1, inclusive='neither'), ['artist', 'track']]
    results.append(filtered)

out = pd.concat(results).values.tolist()
Another solution is to append lists and finally flatten them with a list comprehension:
results = []
for filename in glob.glob(os.path.join('/path/*.csv')):
    df = pd.read_csv(filename)
    # filter by conditions and also select columns by name with .loc
    mask = df['duration'].between(low1, high1, inclusive='neither')
    filtered = df.loc[mask, ['artist', 'track']].values.tolist()
    results.append(filtered)

out = [y for x in results for y in x]

How do I filter columns of multiple DataFrames stored in a dictionary in an efficient way?

I am working with stock data, and I want my data sets to have equal lengths of data when performing certain types of analysis.
Problem
If I load data for Apple, I get daily data since 1985, but if I load data for a Natural Gas ETF it might only go back as far as 2012. I now want to filter Apple to only show history going back to 2012. The end date matters too: some of my datasets may not be up to date, e.g. the Apple data ranges from 1985 to 1-20-17 while the Natural Gas ETF data ranges from 2012 to 12-23-16, so I also want a filter that sets the max date. Now my Apple data set is filtered for dates ranging between 2012 and 12-23-16, and my datasets are equal.
Approach
I have a dictionary called Stocks which stores all of my dataframes. All the dataframes have a column named D, which is the Date column.
I wrote a function that populates a dictionary with the dataframes and also records the min and max dates for each df. I store all those min/max dates in two other dictionaries, DatesMax and DatesMin, and then take the min of the one and the max of the other to get the max and min dates that will be used as the filter values for all the dataframes.
The function below works; it gets the min and max dates of multiple dataframes and returns them in a dictionary named DatesMinMax.
import pandas as pd

def MinMaxDates(FileName):
    DatesMax = {}; DatesMin = {}
    DatesMinMax = {}; stocks = {}
    with open(FileName) as file_object:
        Current_indicators = file_object.read()
        tickers = Current_indicators.split('\n')
        for i in tickers:
            a = '/' in i
            if a == True:
                x = i.find("/") + 1
                df = pd.read_csv(str(i[x:]) + '_data.csv')
                stocks[i] = df
                maxDate = max(df.D)
                minDate = min(df.D)
                DatesMax[i] = maxDate
                DatesMin[i] = minDate
            else:
                df = pd.read_csv(i + '_data.csv')
                stocks[i] = df
                maxDate = max(df.D)
                minDate = min(df.D)
                DatesMax[i] = maxDate
                DatesMin[i] = minDate
    # earliest common end date and latest common start date across all dataframes
    x = min(DatesMax.values())
    y = max(DatesMin.values())
    DatesMinMax = {'MaxDate': x, 'MinDate': y}
    return DatesMinMax

print(DatesMinMax)
# {'MinDate': '2012-02-08', 'MaxDate': '2017-01-20'}
Question
Now I will have to loop over all the dataframes in the dict named Stocks to filter their date columns. It seems inefficient to loop over everything again, but I can't think of any other way to apply the filter.
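For reference, the straightforward version of that second pass is just a dict comprehension over the stocks dictionary using the bounds returned above. A rough sketch only: it assumes the stocks dict built inside MinMaxDates is also returned or otherwise kept around, that 'tickers.txt' stands in for the actual ticker list file, and that the D column holds ISO-formatted date strings that compare correctly:
bounds = MinMaxDates('tickers.txt')   # e.g. {'MinDate': '2012-02-08', 'MaxDate': '2017-01-20'}

# keep only the rows whose date falls inside the common [MinDate, MaxDate] window
stocks = {ticker: df[(df['D'] >= bounds['MinDate']) & (df['D'] <= bounds['MaxDate'])]
          for ticker, df in stocks.items()}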
Actually, you may not need to capture the min and max (since 2016-12-30 < 2017-01-20) for later filtering; you can simply run a full inner-join merge across all dataframes on the 'D' (Date) column.
Consider doing so with a chain merge, which ensures equal lengths across all dataframes, and then slice the resulting master dataframe by ticker columns to build the stocks dictionary. Of course, you can also use the wide master dataframe directly for analysis:
from functools import reduce   # reduce lives in functools on Python 3
import pandas as pd

with open(FileName) as file_object:
    Current_indicators = file_object.read()
    tickers = Current_indicators.split('\n')

# DATA FRAME LIST BUILD
dfs = []
for i in tickers:
    if '/' in i:
        x = i.find("/") + 1
        df = pd.read_csv(str(i[x:]) + '_data.csv')
        # PREFIX ALL NON-DATE COLS WITH TICKER PREFIX
        df.columns = [col if col == 'D' else i + '_' + str(col) for col in df.columns]
        dfs.append(df)
    else:
        df = pd.read_csv(i + '_data.csv')
        # PREFIX ALL NON-DATE COLS WITH TICKER PREFIX
        df.columns = [col if col == 'D' else i + '_' + str(col) for col in df.columns]
        dfs.append(df)

# CHAIN MERGE (INNER JOIN) ACROSS ALL DFS
masterdf = reduce(lambda left, right: pd.merge(left, right, on=['D']), dfs)

# DATA FRAME DICT BUILD
stocks = {}
for i in tickers:
    # SLICE CURRENT TICKER COLUMNS
    df = masterdf[['D'] + [col for col in masterdf.columns if col.startswith(i + '_')]]
    # REMOVE TICKER PREFIXES
    df.columns = [col.replace(i + '_', '') for col in df.columns]
    stocks[i] = df
