I'm trying to fill dictionary_of_columns with the columns that two DataFrames have in common.
My code:
list_of_columns = []
for column in dfUpdates.schema:
    list_of_columns.append(column.jsonValue()["name"].upper())
dictionary_of_columns = {}
dictionary_of_columns['BK_COLUMNS'] = []
dictionary_of_columns['COLUMNS'] = []
for row in df_metadata.dropDuplicates(['COLUMN_NAME', 'KeyType']).collect():
    if row.KeyType == 'PRIMARY KEY' and row.COLUMN_NAME.upper() in list_of_columns:
        dictionary_of_columns['BK_COLUMNS'].append(row.COLUMN_NAME.upper())
    elif row.KeyType != 'PRIMARY KEY' and row.COLUMN_NAME.upper() in list_of_columns:
        dictionary_of_columns['COLUMNS'].append(row.COLUMN_NAME.upper())
But when I run it, the result does not match: dictionary_of_columns has more columns in it than expected.
UPDATE:
dfUpdates: column names
df_metadata: values in COLUMN_NAME
The desired value of dictionary_of_columns is: {'BK_COLUMNS': ['CODE'], 'COLUMNS': ['DESCRIPTION']}
Here is a sketch of the approach: build a set from the column names of each DataFrame, then take the intersection to get the names they have in common.
You can adapt it to suit your needs, since I see you also filter on primary keys.
dfUpdates_cols = set(dfUpdates.columns)
df_metadata_cols = set(df_metadata.columns)
print(dfUpdates_cols & df_metadata_cols)
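In the question, the names to compare are the values in COLUMN_NAME of df_metadata rather than its column headers, so the intersection needs a small adaptation. The following is a minimal sketch in plain Python, assuming the metadata rows have already been collected. `MetadataRow`, `split_common_columns` and the sample rows are hypothetical stand-ins for `df_metadata.collect()` and `dfUpdates.columns`. One possible cause of the extra entries is that rows differing only in letter case (or in KeyType) survive dropDuplicates and are then appended twice after upper(), so the sketch also skips names it has already seen.

```python
from collections import namedtuple

# Hypothetical stand-in for the Row objects returned by df_metadata.collect().
MetadataRow = namedtuple("MetadataRow", ["COLUMN_NAME", "KeyType"])


def split_common_columns(update_columns, metadata_rows):
    """Split the shared column names into primary-key and other columns."""
    available = {name.upper() for name in update_columns}
    result = {"BK_COLUMNS": [], "COLUMNS": []}
    seen = set()
    for row in metadata_rows:
        name = row.COLUMN_NAME.upper()
        # Skip names missing from the updates frame and names already handled.
        if name not in available or name in seen:
            continue
        seen.add(name)
        key = "BK_COLUMNS" if row.KeyType == "PRIMARY KEY" else "COLUMNS"
        result[key].append(name)
    return result


# Assumed sample data, not taken from the question.
update_columns = ["code", "description"]
metadata_rows = [
    MetadataRow("Code", "PRIMARY KEY"),
    MetadataRow("Description", None),
    MetadataRow("Created_At", None),
    MetadataRow("CODE", "PRIMARY KEY"),
]
columns = split_common_columns(update_columns, metadata_rows)
print(columns)
```

The first occurrence of a name decides which list it lands in; if a column can legitimately appear with several key types, that rule would need to change.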
I have a dataframe called reactions_drugs
and I want to create a table called new_r_d that keeps track of how often I see a symptom for a given medication.
Here is the code I have, but I am running into errors such as "Unable to coerce to Series, length must be 3 given 0"
new_r_d = pd.DataFrame(columns = ['drugname', 'reaction', 'count'])
for i in range(len(reactions_drugs)):
    name = reactions_drugs.drugname[i]
    drug_rec_act = reactions_drugs.drug_rec_act[i]
    for rec in drug_rec_act:
        row = new_r_d.loc[(new_r_d['drugname'] == name) & (new_r_d['reaction'] == rec)]
        if row == []:
            # create new row
            new_r_d.append({'drugname': name, 'reaction': rec, 'count': 1})
        else:
            new_r_d.at[row,'count'] += 1
Assuming each row in your current reactions column (drug_rec_act) holds a single comma-separated string wrapped in a list, you can convert the values in that column to lists of strings by splitting each string on the comma, then use explode() and value_counts() to get your desired result:
df['drug_rec_act'] = df['drug_rec_act'].apply(lambda x: x[0].split(','))
df_long = df.explode('drug_rec_act')
result = df_long.groupby('drugname')['drug_rec_act'].value_counts().reset_index(name='count')
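To make the steps concrete, here is a small self-contained sketch on made-up data. The drug names and reactions are assumptions, and it counts with groupby(...).size() instead of value_counts(), which should give the same tallies while letting the output columns be named as in the question.

```python
import pandas as pd

# Assumed sample data: each cell holds one comma-separated string inside a list.
reactions_drugs = pd.DataFrame({
    "drugname": ["aspirin", "aspirin", "ibuprofen"],
    "drug_rec_act": [["nausea,headache"], ["nausea"], ["rash,nausea"]],
})

# Split each string into a list of reactions, then give every reaction its own row.
long_form = reactions_drugs.assign(
    drug_rec_act=reactions_drugs["drug_rec_act"].apply(lambda x: x[0].split(","))
).explode("drug_rec_act")

# Count how often each (drug, reaction) pair occurs.
new_r_d = (
    long_form.groupby(["drugname", "drug_rec_act"])
    .size()
    .reset_index(name="count")
    .rename(columns={"drug_rec_act": "reaction"})
)
print(new_r_d)
```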
I have created a dataframe as shown:
idx = pd.MultiIndex.from_product([['batch1', 'batch2','batch3', 'batch4', 'batch5'], ['quiz1', 'quiz2']])
cols=['noofpresent', 'lesserthan50', 'between50and60', 'between60and70', 'between70and80', 'greaterthan80']
statdf = pd.DataFrame('-', idx, cols)
statdf
statdf.loc['quiz1', 'noofpresent'] = qdf1.b4ispresent.count()
statdf.loc['quiz2', 'noofpresent'] = qdf2.b4ispresent.count()
statdf.noopresent = qdf1.b4ispresent.count()
statdf.noopresent = qdf2.b4ispresent.count()
statdf
Then I made some calculations. I now want to insert the resulting figures, 50 and 53, into column 'noofpresent' for ('batch4', 'quiz1') and ('batch4', 'quiz2') respectively. But instead this happened...
How can I insert my data into the right place?
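You can index it like this, passing the row key as a tuple and the column name in a single .loc call (chained indexing such as .loc[...][...] may assign to a temporary copy instead of statdf):
statdf.loc[('batch4', 'quiz1'), 'noofpresent'] = qdf1.b4ispresent.count()
statdf.loc[('batch4', 'quiz2'), 'noofpresent'] = qdf2.b4ispresent.count()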
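A runnable sketch of assigning into a single cell of a MultiIndex frame follows. The values 50 and 53 stand in for the two count() results from the question, the column list is shortened, and dtype=object is an added assumption so the placeholder strings and the integers can share a column.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["batch1", "batch2", "batch3", "batch4", "batch5"], ["quiz1", "quiz2"]]
)
cols = ["noofpresent", "lesserthan50"]
# dtype=object lets "-" placeholders and numbers live in the same column.
statdf = pd.DataFrame("-", index=idx, columns=cols, dtype=object)

# Stand-ins (assumed) for qdf1.b4ispresent.count() and qdf2.b4ispresent.count().
quiz1_present = 50
quiz2_present = 53

# The tuple selects one row of the MultiIndex; the second argument selects the column.
statdf.loc[("batch4", "quiz1"), "noofpresent"] = quiz1_present
statdf.loc[("batch4", "quiz2"), "noofpresent"] = quiz2_present
print(statdf.loc["batch4"])
```

Only the two targeted cells change; every other cell keeps its "-" placeholder.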
I imported an Excel file into a DataFrame and iterated over a column called "Title" to pick out the titles that contain certain keywords. The matching titles are stored in the list match_titles. What I want to do now is write a for loop that returns the value from the column before "Title" (the "Asin" column) for each title in match_titles. I'm not sure why the code is not working. Any help would be appreciated.
import pandas as pd
data = pd.read_excel(r'C:\Users\bryanmccormack\Downloads\asin_list.xlsx')
df = pd.DataFrame(data, columns=['Track','Asin','Title'])
excludes = ["Chainsaw", "Diaper pail", "Leaf Blower"]
my_excludes = [set(key_word.lower().split()) for key_word in excludes]
match_titles = [e for e in df.Title if
                any(keywords.issubset(e.lower().split()) for keywords in my_excludes)]
a = []
for i in match_titles:
    a.append(df['Asin'])
print(a)
In your for loop you are appending the whole, unfiltered column df['Asin'] to your list a once for every value in match_titles; df itself is never filtered.
One solution is to add a boolean column that records whether each title matches, then filter on that column and return Asin:
# make a function to perform your match analysis.
def is_match(title, excludes=["Chainsaw", "Diaper pail", "Leaf Blower"]):
    my_excludes = [set(key_word.lower().split()) for key_word in excludes]
    if any(keywords.issubset(title.lower().split()) for keywords in my_excludes):
        return True
    return False
# Make a new boolean column for the matches. This applies your
# function to each value in df['Title'] and puts the output in
# the new column.
df['match_titles'] = df['Title'].apply(is_match)
# Filter the df to only matches and return the column you want.
# Because the match_titles column is boolean it can be used as
# an index.
result = df[df['match_titles']]['Asin']
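The same idea can be tried end to end on a tiny in-memory frame. This is only a sketch: the rows below are invented in place of asin_list.xlsx, and the mask is kept as a separate Series instead of a new column.

```python
import pandas as pd

# Assumed sample rows standing in for the Excel file.
df = pd.DataFrame({
    "Track": [1, 2, 3],
    "Asin": ["B001", "B002", "B003"],
    "Title": ["Electric Chainsaw 16 inch", "Baby Bottle", "Cordless Leaf Blower"],
})

excludes = ["Chainsaw", "Diaper pail", "Leaf Blower"]
keyword_sets = [set(key_word.lower().split()) for key_word in excludes]


def is_match(title):
    """Return True if every word of at least one keyword phrase is in the title."""
    words = set(title.lower().split())
    return any(keywords.issubset(words) for keywords in keyword_sets)


# Boolean mask with one entry per row, used to filter the Asin column.
mask = df["Title"].apply(is_match)
matching_asins = df.loc[mask, "Asin"].tolist()
print(matching_asins)
```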
I have a large dataframe of URLs and a smaller second dataframe that contains columns of strings, which I want to use to merge the two dataframes. Data from the second dataframe will be used to populate the larger first one.
The matching strings can contain * wildcards (possibly more than one), but the order of the parts still matters, so "path/*path2" would match "exsample.com/eg_path/extrapath2.html" but not "exsample.com/eg_path2/path/test.html". How can I use the strings in the second dataframe to merge the two dataframes? There can be more than one matching string in the second dataframe.
import pandas as pd
urls = {'url':['https://stackoverflow.com/questions/56318782/','https://www.google.com/','https://en.wikipedia.org/wiki/Python_(programming_language)','https://stackoverflow.com/questions/'],
'hits':[1000,500,300,7]}
metadata = {'group':['group1','group2'],
'matching_string_1':['google','wikipedia*Python_'],
'matching_string_2':['stackoverflow*questions*56318782','']}
result = {'url':['https://stackoverflow.com/questions/56318782/','https://www.google.com/','https://en.wikipedia.org/wiki/Python_(programming_language)','https://stackoverflow.com/questions/'],
'hits':[1000,500,300,7],
'group':['group2','group1','group1','']}
df1 = pd.DataFrame(urls)
df2 = pd.DataFrame(metadata)
what_I_am_after = pd.DataFrame(result)
Not very robust but gives the correct answer for my example.
import pandas as pd
urls = {'url':['https://stackoverflow.com/questions/56318782/','https://www.google.com/','https://en.wikipedia.org/wiki/Python_(programming_language)','https://stackoverflow.com/questions/'],
'hits':[1000,500,300,7]}
metadata = {'group':['group1','group2'],
'matching_string_1':['google','wikipedia*Python_'],
'matching_string_2':['stackoverflow*questions*56318782','']}
result = {'url':['https://stackoverflow.com/questions/56318782/','https://www.google.com/','https://en.wikipedia.org/wiki/Python_(programming_language)','https://stackoverflow.com/questions/'],
'hits':[1000,500,300,7],
'group':['group2','group1','group1','']}
df1 = pd.DataFrame(urls)
df2 = pd.DataFrame(metadata)
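matches = []
for index, row in df2.iterrows():
    # Every column after 'group' holds one wildcard pattern.
    for pattern in row.iloc[1:]:
        parts = pattern.split('*')
        # Turn each non-empty part into "part.*" so '*' matches any run of characters.
        rx = "".join([part + ".*" for part in parts if len(part) > 0])
        if rx == "":
            continue
        mask = df1['url'].str.contains(rx, na=False, regex=True)
        if mask.any():
            # Copy before adding a column to avoid modifying a slice of df1.
            temp = df1[mask].copy()
            temp['group'] = row.iloc[0]
            matches.append(temp)
# DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
results = pd.concat(matches) if matches else pd.DataFrame(columns=['url', 'hits', 'group'])
d3 = df1.merge(results, how='outer', on=['url', 'hits'])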
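One reason the loop above is fragile is that the pattern parts go into the regex unescaped, so characters such as "." or "(" in a matching string would be read as regex syntax. A possible hardening, sketched here with only the standard library, is to escape each literal part and join the parts with ".*". The helper names `wildcard_to_regex` and `matches` are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a '*' wildcard pattern into a compiled regex (sketch)."""
    # Escape each literal part, then let every '*' match any run of characters.
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(parts))


def matches(pattern, url):
    """Return True if the parts of the pattern occur in the URL in order."""
    if pattern == "":
        return False  # empty cells in the metadata frame match nothing
    return wildcard_to_regex(pattern).search(url) is not None


# The two cases described in the question.
print(matches("path/*path2", "exsample.com/eg_path/extrapath2.html"))
print(matches("path/*path2", "exsample.com/eg_path2/path/test.html"))
```

With pandas, the compiled pattern's `.pattern` string could be passed to `str.contains(..., regex=True)` in place of the hand-built rx.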
I'm trying to compute the average and standard deviation of multiple columns of my dataset and save them as new columns in a new dataframe, i.e. for every 'GROUP' in the dataset, I want columns in the new dataframe holding its average and SD. I came up with the following script, but I'm not able to name the lists and columns dynamically.
Average_F1_S_list, Average_F1_M_list, SD_F1_S_list, SD_F1_M_list = ([] for i in range(4))
Groups= DF['GROUP'].unique().tolist()
for key in Groups:
    Average_F1_S = DF_DICT[key]['F1_S'].mean()
    Average_F1_S_list.append(Average_F1_S)
    SD_F1_S = DF_DICT[key]['F1_S'].std()
    SD_F1_S_list.append(SD_F1_S)
    Average_F1_M = DF_DICT[key]['F1_M'].mean()
    Average_F1_M_list.append(Average_F1_M)
    SD_F1_M = DF_DICT[key]['F1_M'].std()
    SD_F1_M_list.append(SD_F1_M)
df=pd.DataFrame({'Group':Groups,
                 'Average_F1_S':Average_F1_S_list,'Standard_Dev_F1_S':SD_F1_S_list,
                 'Average_F1_M':Average_F1_M_list,'Standard_Dev_F1_M':SD_F1_M_list},
                columns=['Group','Average_F1_S','Standard_Dev_F1_S','Average_F1_M', 'Standard_Dev_F1_M'])
This will not be a good solution as there are too many features. Is there any way I can create the lists dynamically?
This should do the trick! Hope this helps
# These are all the keys you want
key_names = ['F1_S', 'F1_M']
# Holds the data you want to pass to the dataframe.
df_info = {'Groups': Groups}
for group_name in Groups:
    # For each group in the groups, we iterate over all the keys we want.
    for key in key_names:
        # Generate a keyname that you want for your dataframe.
        avg_key_name = key + '_Average'
        std_key_name = key + '_Standard_Dev'
        if avg_key_name not in df_info:
            df_info[avg_key_name] = []
            df_info[std_key_name] = []
        df_info[avg_key_name].append(DF_DICT[group_name][key].mean())
        df_info[std_key_name].append(DF_DICT[group_name][key].std())
df = pd.DataFrame(df_info)
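If the per-group frames in DF_DICT are just slices of DF by 'GROUP' (an assumption, since the question does not show how DF_DICT is built), a different technique avoids the loops entirely: groupby with agg, followed by flattening the resulting two-level column names. The sample values below are invented for illustration.

```python
import pandas as pd

# Assumed sample data with the column names used in the question.
DF = pd.DataFrame({
    "GROUP": ["a", "a", "b", "b"],
    "F1_S": [1.0, 3.0, 2.0, 6.0],
    "F1_M": [10.0, 20.0, 30.0, 50.0],
})

feature_cols = ["F1_S", "F1_M"]
suffix = {"mean": "Average", "std": "Standard_Dev"}

# One row per group, with a (feature, statistic) column pair for each feature.
stats = DF.groupby("GROUP")[feature_cols].agg(["mean", "std"])

# Flatten the two-level column index into names like 'F1_S_Average'.
stats.columns = [f"{col}_{suffix[stat]}" for col, stat in stats.columns]
stats = stats.reset_index()
print(stats)
```

Adding a feature then only means extending feature_cols.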