Python dataframe merge on condition - python

I have two data frames, and 3 conditions to create new data frame
1)df1["Product"]==df2["Product"] and df2["Date"] >= df1["Date"]
2)Now need to loop df2["Product"] sum(df2["Count"]) while checking df1["Count"] on each iteration for df2["Count"] == df1["Count"]
Example
df1["Product"][2] = "147326.A" and df1["Date"][2] = "1/03/22" and df1["Count"][2] = 4,
now we check df2 if there is match df2["Product"][1] == df1["Product"][2] and df2["Date"][1] >= df1["Date"][2], first condition are met now we need to sum() the df2["Count"] end on each iteration compare it to df1["Count"] if df1["Count"]== df2[Count] add to new data frame
df1 = pd.DataFrame({"Date":["11/01/22", "1/02/22", "1/03/22", "1/04/22", "2/02/22"],"Product" :["315114.A", "147326.A", "147326.A", "91106.A", "283214.A"],"Count":[3,1,4,1,2]})
df2 = pd.DataFrame({"Date" : ["15/01/22", "4/02/22", "7/03/22", "1/04/22", "2/02/22", "15/01/22","1/06/22","1/06/22"],"Product" : ["315114.A", "147326.A ", "147326.A", "91106.A", "283214.A", "315114.A","147326.A","147326.A" ],"Count" : [1, 1, 2, 1, 2, 2, 1, 1]})
The following data should be a match:
df1 = pd.DataFrame({"Date" : ["01/03/2022"],"Product":["91106.A"],"Count":[2]})
df2 = pd.DataFrame({"Date" : ["01/03/2022", "7/03/2022", "7/03/2022", "7/03/2022","7/03/2022", "7/03/2022"],"Product" : ["91106.A", "91106.A","91106.A", "91106.A", "91106.A", "91106.A"],"Count" : [1, 1, 1, 1, 1, 1]})

You could solve this in a list comprehension (within a pd.DataFrame):
df3 = pd.DataFrame([j.to_dict() for i, j in df1.iterrows() if
j["Count"] == df2[(df2["Product"] == j["Product"]) &
(df2["Date"] >= j["Date"])]["Count"].sum()])
Splitting this up into lots of lines would look like this:
l = []
for i, j in df1.iterrows():
if j["Count"] == df2[(df2["Product"] == j["Product"]) &
(df2["Date"] >= j["Date"])]["Count"].sum():
x = j.to_dict()
l.append(x)
df3 = pd.DataFrame(l)

Related

count how many times a record appears in a pandas dataframe and create a new feature with this counter

I have this two dataframes dt_t and dt_u. I want to be able to count how many times a record in the text feature appears and I want to create a new feature in df_u where I associate to each id the counter. So id_u = 1 and id_u = 2 both will have counter = 3 since hello appears 3 times in df_t and both published a post with "hello" in the text.
import pandas as pd
import numpy as np
df_t = pd.DataFrame({'id_t': [0, 1, 2, 3, 4], 'id_u': [1, 1, 3, 2, 2], 'text': ["hello", "hello", "friend", "hello", "my"]})
print(df_t)
df_u = pd.DataFrame({'id_u': [1, 2, 3]})
print()
print(df_u)
df_u_new = pd.DataFrame({'id_u': [1, 2, 3], 'counter': [3, 3, 1]})
print()
print(df_u_new)
The code I wrote for the moment is this, but this is very slow and also I have a very huge dataset so it is impossible to do.
user_counter_dict = {}
tmp = dict(df_t["text"].value_counts())
# to speedup the process we set as index the text column
df_t.set_index(["text"], inplace=True)
for i, (k, v) in enumerate(tmp.items()):
row = (k, v)
text = row[0]
counter = row[1]
#this is slow and take much of the time
uniques_id = df_.loc[tweet]["id_u"].unique()
for elem in uniques_id:
value = user_counter_dict.setdefault(str(elem), counter)
if value < counter:
user_counter_dict[str(elem)] = counter
# and now I will put the date on the dict on a new column in df_u
Is there a very fast way to compute this?
You can do:
df_u_new = df_t.assign(counter=df_t["text"].map(df_t["text"].value_counts()))[
["id_u", "counter"]
].groupby("id_u", as_index=False).max()
Get the value_counts of text and groupby id_u and get the maximum value which is what you were trying to get IIUC.
print(df_u_new)
id_u counter
0 1 3
1 2 3
2 3 1

How do I remove repeated Row Comparisons?

I'm currently new to coding and would require to do pairwise comparisons using Pandas. Hence, I have to find a way to code row-by-row comparisons without any repetitions.
A mock data will be as follows:
Whereby i'm comparing males & their age. However, as seen in the image above, in index 1 there is a combination of Vyel & Allsebrook & in index 4 the same combination is seen with Allsebrook and Vyel.
Ideally, the desired output would be like:
Desired Results
I have managed to remove rows containing the same person twice, but is there a way i can code so i can avoid overlapping data comparisons? Would appreciate any feedback. Thank you!
Try This:
import pandas as pd
# Create the DF
id_x = [1, 1, 1, 1, 2]
last_name_x = ["Vyel", "Vyel", "Vyel", "Vyel", "Allsebrook"]
gender = ["Male", "Male", "Male", "Male", "Male"]
age_x = [66, 66, 66, 66, 50]
id_y = [1, 2, 3, 4, 1]
last_name_y = ["Vyel", "Allsebrook", "Prinett", "Jinda", "Vyel"]
age_y = [66, 50, 30, 31, 66]
df = pd.DataFrame(id_x, columns=['id_x'])
df['last_name_x'] = last_name_x
df['gender'] = gender
df['age_x'] = age_x
df['id_y'] = id_y
df['last_name_y'] = last_name_y
df['age_y'] = age_y
# Create the keys in order to check duplicates
df["key_x"] = df["id_x"].astype(str) + df["last_name_x"] + df["age_x"].astype(str)
df["key_y"] = df["id_y"].astype(str) + df["last_name_y"] + df["age_y"].astype(str)
df["combined"] = df['key_x'] + "|" +df['key_y']
s = set(df['key_y'] + "|" + df['key_x'])
d = {}
# mark all of the duplicated rows
def check_if_duplicate(row):
if row in s:
if row in d:
return True
else:
arr = row.split("|")
d[arr[1] + "|" + arr[0]] = 1
return False
df['duplicate'] = df['combined'].apply(check_if_duplicate)
# drop the duplicates and the rows we added in order to check the duplicates
df = df[df['duplicate'] != True]
df.drop(["key_x", "key_y", "combined", "duplicate"], axis=1, inplace=True)
print(df)

Frequency count based on column values in Pandas

For example I have a data frame which looks like this:
First Image
And I would like to make a new data frame which shows the number of times a word was marked as spam or ham. I want it to look like this:
Second image
I have tried the following code to make a list of only spam counts on a word to test but it does not seem to work and crashes the Kernel on Jupyter Notebook:
words = []
for word in df["Message"]:
words.extend(word.split())
sentences = []
for word in df["Message"]:
sentences.append(word.split())
spam = []
ham = []
for word in words:
sc = 0
hc = 0
for index,sentence in enumerate(sentences):
if word in sentence:
print(word)
if(df["Category"][index])=="ham":
hc+=1
else:
sc+=1
spam.append(sc)
spam
Where df is the data frame shown in the First Image.
How can I go about doing this?
You can form two dictionaries spam and ham to store the number of occurrences of different words in spam/ham message.
from collections import defaultdict as dd
spam = dd(int)
ham = dd(int)
for i in range(len(sentences)):
if df['Category'][i] == 'ham':
p = sentences[i]
for x in p:
ham[x] += 1
else:
p = sentences[i]
for x in p:
spam[x] += 1
The output obtained from the code above for similar input to yours is as below.
>>> spam
defaultdict(<class 'int'>, {'ok': 1, 'lar': 1, 'joking': 1, 'wtf': 1, 'u': 1, 'oni': 1, 'free': 1, 'entry': 1, 'in': 1, '2': 1, 'a': 1, 'wkly': 1, 'comp': 1})
>>> ham
defaultdict(<class 'int'>, {'go': 1, 'until': 1, 'jurong': 1, 'crazy': 1, 'available': 1, 'only': 1, 'in': 1, 'u': 1, 'dun': 1, 'say': 1, 's': 1, 'oearly': 1, 'nah': 1, 'I': 1, 'don’t': 1, 'think': 1, 'he': 1, 'goes': 1, 'to': 1, 'usf': 1})
Now can manipulate the data and export it in the required format.
EDIT:
answer = []
for x in spam:
answer.append([x,spam[x],ham[x]])
for x in ham:
if x not in spam:
answer.append([x,spam[x],ham[x]])
So here the numbers of rows in answer list in equal to the number of distinct words in all the messages. While the first column in every row is the word we are talking about and the second and third column is the number of occurrences of the word in spam and ham message respectively.
The output obtained for my code is as below.
['ok', 1, 0]
['lar', 1, 0]
['joking', 1, 0]
['wif', 1, 0]
['u', 1, 1]
['oni', 1, 0]
['free', 1, 0]
['entry', 1, 0]
['in', 1, 1]
This would be better:
https://docs.python.org/3.8/library/collections.html#collections.Counter
from collections import Counter
import pandas as pd
df # the data frame in your first image
df['Counter'] = df.Message.apply(lambda x: Counter(x.split()))
def func(df: pd.DataFrame):
for category, data in df.groupby('Category'):
count = Counter()
for var in data.Counter:
count += var
cur = pd.DataFrame.from_dict(count, orient='index', columns=[category])
yield cur
demo = func(df)
df2 = next(demo)
for cur in demo:
df2 = df2.merge(cur, how='outer', left_index=True, right_index=True)
EDIT:
from collections import Counter
import pandas as pd
df # the data frame in your first image. Suit both cases(whether it is a slice of the complete data frame or not)
def func(df: pd.DataFrame):
res = df.groupby('Category').Message.apply(' '.join).str.split().apply(Counter)
for category, count in res.to_dict().items():
yield pd.DataFrame.from_dict(count, orient='index', columns=[category])
demo = func(df)
df2 = next(demo)
for cur in demo:
df2 = df2.merge(cur, how='outer', left_index=True, right_index=True)

Create smaller dataframes from a large dataframe using the index values from a list

I have a list
a = [15, 50 , 75]
Using the above list I have to create smaller dataframes filtering out rows (the number of rows is defined by the list) on the index from the main dataframe.
let's say my main dataframe is df
the dataframes I'd like to have is df1 (from row index 0-15),df2 (from row index 15-65), df3 (from row index 65 - 125)
since these are just three I can easily use something like this below:
limit1 = a[0]
limit2 = a[1] + limit1
limit3 = a[2] + limit3
df1 = df.loc[df.index <= limit1]
df2 = df.loc[(df.index > limit1) & (df.index <= limit2)]
df2 = df2.reset_index(drop = True)
df3 = df.loc[(df.index > limit2) & (df.index <= limit3)]
df3 = df3.reset_index(drop = True)
But what if I want to implement this with a long list on the main dataframe df, I am looking for something which is iterable like the following (which doesn't work):
df1 = df.loc[df.index <= limit1]
for i in range(2,3):
for j in range(2,3):
for k in range(2,3):
df[i] = df.loc[(df.index > limit[j]) & (df.index <= limit[k])]
df[i] = df[i].reset_index(drop=True)
print(df[i])
you could modify your code by building dataframes from the main dataframe iteratively cutting out slices from the end of the dataframe.
dfs = [] # this list contains your partitioned dataframes
a = [15, 50 , 75]
for idx in a[::-1]:
dfs.insert(0, df.iloc[idx:])
df = df.iloc[:idx]
dfs.insert(0, df) # add the last remaining dataframe
print(dfs)
Another option is to use list expressions as follows:
a = [0, 15, 50 , 75]
dfs = [df.iloc[a[i]:a[i+1]] for i in range(len(a)-1)]
This does it. It's better to use dictionaries if you want to store multiple variables and call them later. It's bad practice to create variables in an iterative way, so always avoid it.
df = pd.DataFrame(np.linspace(1,75,75), columns=['a'])
a = [15, 50 , 25]
d = {}
b = 0
for n,i in enumerate(a):
d[f'df{n}'] = df.iloc[b:b+i]
b+=i
Output:

Pandas assignment using nested loops leading to memory error

I am using pandas and trying to do an assignment using a nested loops. I iterate over a dataframe and then run a distance function if it meets a certain criteria. I am faced with two problems:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
Memory Error. It doesn't work on large datasets. I end up having to terminate the process.
How should I change my solution to ensure it can scale with a larger dataset of 60,000 rows?
for i, row in df.iterrows():
listy = 0
school = []
if row['LS_Type'] == 'Primary (1-4)':
a = row['Northing']
b = row['Easting']
LS_ID = row['LS_ID']
for j, row2 in df.iterrows():
if row2['LS_Type'] == 'Primary (1-8)':
dist_km = distance(a,b, df.Northing[j], df.Easting[j])
if (listy == 0):
listy = dist_km
school.append([df.LS_Name[j], df.LS_ID[j]])
else:
if dist_km < listy:
listy = dist_km
school[0] = [df.LS_Name[j], int(df.LS_ID[j])]
df['dist_up_prim'][i] = listy
df["closest_up_prim"][i] = school[0]
else:
df['dist_up_prim'][i] = 0
The double for loop is what's killing you here. See if you can break it up into two separate apply steps.
Here is a toy example of using df.apply() and partial to do a nested for loop:
import math
import pandas as pd
from functools import partial
df = pd.DataFrame.from_dict({'A': [1, 2, 3, 4, 5, 6, 7, 8],
'B': [1, 2, 3, 4, 5, 6, 7, 8]})
def myOtherFunc(row):
if row['A'] <= 4:
return row['B']*row['A']
def myFunc(the_df, row):
if row['A'] <= 2:
other_B = the_df.apply(myOtherFunc, axis=1)
return other_B.mean()
return pd.np.NaN
apply_myFunc_on_df = partial(myFunc, df)
df.apply(apply_myFunc_on_df, axis=1)
You can rewrite your code in this form, which will be much faster.

Categories