Python: compare two dataframes of different sizes - python

I have a problem comparing two dataframes of different sizes and showing which values match and which do not.
Both dataframes contain countries. One lists all countries in the world (country_data_df), while the other contains only some of them (country_list_df).
Here is the structure of both dataframes
Index Country
0 Afghanistan
.. ..
Another problem is how to do the comparison with a contains-style match, e.g. Venezuela (Bolivarian Republic of) vs Venezuela.
Here is my code snippet.
seen_countries = []
unseen_countries = []
for a in country_list_df:
    if a in country_data_df:
        seen_countries.append(a)
    else:
        unseen_countries.append(a)
How can I solve this?

Clean your data
The 2nd part of your question deals with comparing dissimilar values in your data. The easiest thing to do would be to standardize the country names in your list of all countries to the values in your data. It's much easier to clean the smaller, more finite list of countries and reuse it against your larger input data set.
Once your country list has values that can be compared to your input data, do the following.
clean_data standardizes the values to lowercase and puts them into a set, which automatically gives you unique values.
seen_countries is created by applying clean_data to the Country column of your input data set.
unseen_countries is simply the set of all countries in the country list minus the seen_countries set.
#!/usr/bin/env python
import pandas as pd

def clean_data(x):
    # lowercase every value and collect into a set to get unique values
    retval = set(v.lower() for v in x)
    return retval

if __name__ == "__main__":
    country_data = ["C", "D", "E", "F", "a", "A"]
    country_list = ["a", "b", "c", "d", "e", "f", "g"]
    country_list_df = pd.DataFrame(country_list, columns=["Country"])
    country_data_df = pd.DataFrame(country_data, columns=["Country"])

    seen_countries = clean_data(country_data_df.Country)
    unseen_countries = clean_data(country_list_df.Country) - seen_countries

    print("__Seen Countries__")
    print(seen_countries)
    print("__Unseen Countries__")
    print(unseen_countries)
Output
__Seen Countries__
{'c', 'a', 'd', 'f', 'e'}
__Unseen Countries__
{'g', 'b'}
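For the second part of the question, partial matches such as Venezuela (Bolivarian Republic of) vs Venezuela, here is a minimal sketch using Series.str.contains, assuming the long official name always contains the short name as a substring (the small dataframes below are only illustrative):
import re
import pandas as pd

country_data_df = pd.DataFrame({"Country": ["Venezuela (Bolivarian Republic of)", "Afghanistan", "France"]})
country_list_df = pd.DataFrame({"Country": ["Venezuela", "Spain"]})

seen_countries = []
unseen_countries = []
for name in country_list_df["Country"]:
    # a country is "seen" if any official name contains it (case-insensitive)
    if country_data_df["Country"].str.contains(re.escape(name), case=False).any():
        seen_countries.append(name)
    else:
        unseen_countries.append(name)

print(seen_countries)    # ['Venezuela']
print(unseen_countries)  # ['Spain']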

Have you tried using Pandas isin? It is great for comparing dataframes, even if they are different sizes.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
Example:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f'], 'C': ['Z', 'V', 'W']})
other = pd.DataFrame({'A': [1, 3, 3, 2], 'B': ['e', 'f', 'f', 'e']})
print(df.isin(other))
results in:
A B C
0 True False False
1 False False False
2 True True False
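Applied to the original country problem, a minimal sketch along these lines, assuming both dataframes have a Country column as described in the question (the sample values are made up), would use Series.isin to split the list into seen and unseen countries:
import pandas as pd

country_data_df = pd.DataFrame({"Country": ["Afghanistan", "Albania", "Venezuela"]})
country_list_df = pd.DataFrame({"Country": ["Albania", "Venezuela", "Atlantis"]})

# boolean mask: True where the country also appears in the full country data
mask = country_list_df["Country"].isin(country_data_df["Country"])

seen_countries = country_list_df.loc[mask, "Country"].tolist()
unseen_countries = country_list_df.loc[~mask, "Country"].tolist()

print(seen_countries)    # ['Albania', 'Venezuela']
print(unseen_countries)  # ['Atlantis']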

Related

How to extract elements from a list in pandas through regex?

I'm looking to extract the string of numbers that come after 'accession' in this Dataframe. My dataframe looks like this:
targets_list = pd.DataFrame(targets_df[['target_components', 'target_chembl_id']])
and the elements in each row of the target_components column look like the following:
[{'accession': 'O43451', 'component_description': 'Maltase-glucoamylase, intestinal', 'component_id': 434, 'component_type': 'PROTEIN', 'relationship': 'SINGLE PROTEIN', 'target_component_synonyms',...}]
I would just like to extract the number code after 'accession'. As I thought it was the first element of the list, I tried tgt = targets_list['target_components'][0][0], but this returns the first element of that list, not the accession number.
I can see that each row contains a list, but how to parse that list, get the number, and add it to a column is what's missing for me. It should be possible with regex maybe? But I'm not sure how regex works at all.
You could try:
tgt = targets_list["target_components"].str[0].str["accession"]
Result for
targets_list = pd.DataFrame(
    {"target_components": [
        [{"accession": "O43451", "b": "c", "d": 1}],
        [{"accession": "012345", "b": "e", "d": 2}],
        [{"b": "f", "d": 3}],
        []]}
)
target_components
0 [{'accession': 'O43451', 'b': 'c', 'd': 1}]
1 [{'accession': '012345', 'b': 'e', 'd': 2}]
2 [{'b': 'f', 'd': 3}]
3 []
is
0 O43451
1 012345
2 None
3 NaN
Name: target_components, dtype: object
You can use the .str.findall() or .str.extract() function to get the id.
Refer to:
Use regular expression to extract elements from a pandas data frame
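For example, a minimal sketch with str.extract, assuming the target_components cells are converted to their string representation first (the sample rows below are made up), could look like this:
import pandas as pd

targets_list = pd.DataFrame({
    "target_components": [
        [{"accession": "O43451", "component_type": "PROTEIN"}],
        [{"accession": "P12345", "component_type": "PROTEIN"}],
    ]
})

# turn each cell into a string, then capture whatever follows 'accession':
tgt = targets_list["target_components"].astype(str).str.extract(r"'accession': '([^']+)'")

print(tgt)
#         0
# 0  O43451
# 1  P12345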
First, there is no need to use pd.DataFrame again to create a dataframe from existing columns:
targets_list = targets_df[['target_components', 'target_chembl_id']]
Then you can use apply to access each column element:
tgt = targets_list['target_components'].apply(lambda x: x[0]['accession'])
You can try this:
targets_list['target_components'].map(lambda x: x[0].get("accession") if x else '')

Nested for-loop optimization while iterating over Dataframes

I am fairly new to python and coding. I am looking for a way to optimize a nested for loop.
The nested for loop I have written works perfectly fine, but it takes a lot of time to run.
I have explained the basic idea behind my original code and what I have tried to do, below:
data = [['a', '35-44', 'male', ['b', 'z', 'x']], ['b', '15-24', 'female', ['a', 'z', 'q']],
        ['r', '35-44', 'male', ['z', 'a', 'd']], ['q', '15-24', 'female', ['u', 'k', 'b']]]
df = pd.DataFrame(data, columns=['ID', 'age_group', 'gender', 'matching_ids'])
df is the Dataframe that I am working on.
What I want to do is compare each 'ID' in df with every other 'ID' in the same df and check if it follows certain conditions.
If the age_group is equal.
If the gender is the same.
If the 'ID' is in 'matching_ids'.
If these conditions are met I need to append that row to a separate dataframe (sample_df)
This is the code with the nested for loop that works fine:
df_copy = df.copy()
sample_df = pd.DataFrame()

for i in range(len(df)):
    for j in range(len(df)):
        if (i != j) and (df.iloc[i]['ID'] in df_copy.iloc[j]['matching_ids']) and \
           (df.iloc[i]['gender'] == df_copy.iloc[j]['gender']) and \
           (df.iloc[i]['age_group'] == df_copy.iloc[j]['age_group']):
            sample_df = sample_df.append(df_copy.iloc[[j]])
I tried simplifying it by writing a function and using df.apply(func), but it still takes almost the same amount of time.
Below is the code written with using a function:
sample_df_func = pd.DataFrame()

def func_extract(x):
    global sample_df_func
    for k in range(len(df)):
        if (x['ID'] != df_copy.iloc[k]['ID']) and (x['ID'] in df_copy.iloc[k]['matching_ids']) and \
           (x['gender'] == df_copy.iloc[k]['gender']) and \
           (x['age_group'] == df_copy.iloc[k]['age_group']):
            sample_df_func = sample_df_func.append(df_copy.iloc[[k]])

df.apply(func_extract, axis=1)
sample_df_func
I am looking for ways to simplify this and optimize it further.
Forgive me, if the solution to this is very simple and I am not able to figure it out.
Thanks
PS: I've just started coding 2 months back.
We can form groups over age_group and gender to obtain subsets where the first two conditions hold automatically. For the third condition, we can explode the matching_ids and then check whether any of the ids isin the group's ID column, keeping only those rows within each group with boolean indexing:
out = (df.groupby(["age_group", "gender"])
         .apply(lambda s: s[s.matching_ids.explode().isin(s.ID).groupby(level=0).any()])
         .reset_index(drop=True))
where lastly we reset the index to get rid of grouping variables as index,
to get
>>> out
ID age_group gender matching_ids
0 b 15-24 female [a, z, q]
1 q 15-24 female [u, k, b]
2 r 35-44 male [z, a, d]

List of all dictionaries from list of keys and list of values

I have a dictionary (or list of tuples, doesn't matter):
dict(a=1, b=1, c=1)
I have a set of values:
set(['none', 'x', 'y', 'xy'])
I want to generate all possible patterns of the values applied to my dictionary, e.g.:
[{'a': 'none', 'b': 'none', 'c': 'none'},
...
{'a': 'xy', 'b': 'xy', 'c': 'xy'}]
I'm currently poking through the itertools package in python to accomplish this - but am open to solutions in R and bash as well. I'm having trouble figuring out how to word the question in googling for quick solutions.
Didn't try it out but this should work
from itertools import product

values = ["x", "y", "z", "a"]
keys = ["a", "b", "c"]

y = product(values, repeat=len(keys))

all_combos = []
for result in y:
    all_combos.append({key: value for key, value in zip(keys, result)})
I think that the complexity of itertools.product grows exponentially, so be careful if you have a big number of possible values.
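For the values from the question, a quick usage sketch of the same idea:
from itertools import product

values = ["none", "x", "y", "xy"]
keys = ["a", "b", "c"]

# one dict per combination of values over the three keys
all_combos = [dict(zip(keys, combo)) for combo in product(values, repeat=len(keys))]

print(len(all_combos))   # 4 ** 3 == 64
print(all_combos[0])     # {'a': 'none', 'b': 'none', 'c': 'none'}
print(all_combos[-1])    # {'a': 'xy', 'b': 'xy', 'c': 'xy'}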

Create list with all unique possible combination based on condition in dataframe in Python

I have the following dataset:
d = {
'Company':['A','A','A','A','B','B','B','B','C','C','C','C','D','D','D','D'],
'Individual': [1,2,3,4,1,5,6,7,1,8,9,10,10,11,12,13]
}
Now, I need to create a list in Python of all pairs of elements of 'Company', that correspond to the values in 'Individual'.
E.g. The output for above should be as follows for the dataset above:
((A,B),(A,C),(B,C),(C,D)). The first three tuples, since Individual 1 is affiliated with A, B and C, and the last one since Individual 10 is affiliated with C and D.
Further Explanation -
If Individual = 1, the above dataset has the values 'A', 'B' and 'C'. Now, I want to create all unique combinations of these three values (as tuples), so it should add the tuples (A,B), (A,C) and (B,C) to the list. The next is Individual = 2. Here it only has the value 'A', so there is no tuple to append to the list. For the next individuals there's only one corresponding company each, hence no further pairs. The only other tuple that has to be added is for Individual = 10, since it has the values 'C' and 'D' and should therefore add the tuple (C,D) to the list.
One solution is to use pandas:
import pandas as pd
d = {'Company':['A','A','A','B','B','B','C','C','C'],'Individual': [1,2,3,1,4,5,3,6,7]}
df = pd.DataFrame(d).groupby('Individual')['Company'].apply(list).reset_index()
companies = df.loc[df['Company'].map(len)>1, 'Company'].tolist()
# [['A', 'B'], ['A', 'C']]
This isn't the most efficient way, but it may be intuitive.
Here is a solution to your refined question:
from collections import defaultdict
from itertools import combinations
data = {'Company': ['A','A','A','A','B','B','B','B','C','C','C','C','D','D','D','D'],
        'Individual': [1,2,3,4,1,5,6,7,1,8,9,10,10,11,12,13]}

d = defaultdict(set)
for i, j in zip(data['Individual'], data['Company']):
    d[i].add(j)
res = {k: sorted(map(sorted, combinations(v, 2))) for k, v in d.items()}
# {1: [['A', 'B'], ['A', 'C'], ['B', 'C']],
# 2: [],
# 3: [],
# 4: [],
# 5: [],
# 6: [],
# 7: [],
# 8: [],
# 9: [],
# 10: [['C', 'D']],
# 11: [],
# 12: [],
# 13: []}
Try this,
temp = df[df.duplicated(subset=['Individual'], keep=False)]
print(temp.groupby(['Individual'])['Company'].unique())
>>>1 [A, B]
>>>3 [A, C]

How to get distinct rows in a pandas df and merge the duplicate items into a column?

I am in a bit of a weird situation. I have already solved my programming problem before but I am looking back on it and trying to implement it using pandas. I thought this would be a good place to practice using pandas.
I am querying a database, doing some calculations, and then displaying the results onto a GUI with a PyQt QTableWidget.
An example table after the calculations could look like this:
test_list = [["a", "b", "c", "d"],
             ["1", "3", "5", "7"],
             ["1", "4", "5", "7"],
             ["2", "3", "6", "8"],
             ["2", "4", "6", "9"]]
What I want to do before I display it is: get the distinct rows based on columns "a", "c", and "d", and merge the dropped elements from column "b" back into the column. The result I want looks like this:
['a', 'b', 'c', 'd']
['1', '3, 4', '5', '7']
['2', '3', '6', '8']
['2', '4', '6', '9']
Notice how in column "b", "3, 4" are both represented in their row.
Here is how I did it initially with lists and dictionaries:
def mergeDistinct(my_list):
    new_list_dict = {}
    for elem in my_list[1:]:
        key_str = (elem[0], elem[2], elem[3])
        if key_str in new_list_dict:
            # already seen this (a, c, d) combination: append to the "b" value
            new_list_dict[key_str][1] += ", " + elem[1]
        else:
            # first time: store a copy of the row
            new_list_dict[key_str] = elem[::]
    ret_list = list(new_list_dict.values())
    return [my_list[0]] + ret_list
I loop over all of the rows and use a dictionary to keep track of which distinct combinations of values I have seen so far. It feels a bit clunky, so I am trying my hand at the pandas library. I feel like it should definitely be possible, but maybe I don't know the right term to google to understand how to do it.
This is what I have so far:
df = pd.DataFrame(data=test_list[1:], columns=test_list[0])
def mergeDistinctPandas(my_df):
    # I feel like this is close but I don't know how to continue
    df = my_df.set_index(['a', 'b', 'c', 'd']).groupby(level=['a', 'c', 'd'])
    # for elem in df:
    #     print(elem)
    # new_df = pd.DataFrame()
    # for elem in df:
    #     merged = pd.concat([elem[1] for i, row in elem[1].iterrows()])  # .to_frame()
    #     merged.index = ['duplicate_{}'.format(i) for i in range(len(merged))]
    #     new_df = pd.concat([new_df, merged], axis=1)
    return False
If I print out what I have so far I see the rows are separated and I should be able to merge them back, leaving "b" separated, but I can't see how to do it.
If pandas isn't suited to this problem, that's fine too, I'm just trying to get to grips with it.
Thanks for the help.
Here are some related questions I have found:
How to "select distinct" across multiple data frame columns in pandas? and
How do I merge duplicate rows into one on a DataFrame when they have different values
df.groupby(['a', 'c', 'd']).b.apply(', '.join) \
  .reset_index()[df.columns]
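For the test_list from the question, a quick check of this one-liner, assuming df was built as in the question, gives the desired result:
import pandas as pd

test_list = [["a", "b", "c", "d"],
             ["1", "3", "5", "7"],
             ["1", "4", "5", "7"],
             ["2", "3", "6", "8"],
             ["2", "4", "6", "9"]]
df = pd.DataFrame(data=test_list[1:], columns=test_list[0])

out = df.groupby(['a', 'c', 'd']).b.apply(', '.join).reset_index()[df.columns]
print(out)
#    a     b  c  d
# 0  1  3, 4  5  7
# 1  2     3  6  8
# 2  2     4  6  9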
