Pandas - Why are there duplicates after a join? - python

I have a train DataFrame with 3756 rows and a test DataFrame with 500 rows; after the join I had 798974 rows.
Code for the join:
test.join(train.set_index('link_1')['claps_link_1_mean'], on='link_1', how='left')
Using drop_duplicates works, but it requires a lot of time and memory.

The reason is duplicated values in the link_1 column in both test and train, so each duplicated value produces every combination between the two (a Cartesian product):
import pandas as pd

train = pd.DataFrame({"link_1": [0, 0, 0, 0, 1, 1, 1, 1],
                      "claps_link_1_mean": range(8)})
test = pd.DataFrame({"link_1": [0, 1, 1, 1]})
df = test.join(train.set_index('link_1')['claps_link_1_mean'], on='link_1', how='left')
print (df)
link_1 claps_link_1_mean
0 0 0
0 0 1
0 0 2
0 0 3
1 1 4
1 1 5
1 1 6
1 1 7
2 1 4
2 1 5
2 1 6
2 1 7
3 1 4
3 1 5
3 1 6
3 1 7
If you remove duplicates in one of them before the join, everything works as expected:
test.join(train.drop_duplicates('link_1').set_index('link_1')['claps_link_1_mean'], on='link_1', how='left')
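If the duplicated link_1 keys in train carry several values you actually want to combine, another option is to aggregate train before joining. A minimal sketch, assuming the per-key mean is what you want:
# Collapse duplicated link_1 keys first (assumption: the mean per key is the
# desired value), so the right-hand side has a unique index and the join
# cannot multiply rows.
agg = train.groupby('link_1')['claps_link_1_mean'].mean()
df = test.join(agg, on='link_1', how='left')
print(df)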

Related

How to change several values of pandas DataFrame at once?

Let's consider a very simple data frame:
import pandas as pd
df = pd.DataFrame([[0, 1, 2, 3, 2, 5], [3, 4, 5, 0, 2, 7]]).transpose()
df.columns = ["A", "B"]
A B
0 0 3
1 1 4
2 2 5
3 3 0
4 2 2
5 5 7
I want to do two things with this dataframe:
All numbers below 3 have to be changed to 0
All numbers equal to 0 have to be changed to 10
The problem is that when we apply:
df[df < 3] = 0
df[df == 0] = 10
we are also going to change numbers which were not initially 0, obtaining:
A B
0 10 3
1 10 4
2 10 5
3 3 10
4 10 10
5 5 7
which is not the desired output, which should look like this:
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
My question is: is there any way to change both of those things at the same time? I.e., I want to change numbers which are smaller than 3 to 0 and numbers which equal 0 to 10, independently of each other.
Note! This example was created just to outline the problem. An obvious solution is to change the order of the replacements - first change 0 to 10, and then change numbers smaller than 3 to 0. But I'm struggling with a much more complex problem, and I want to know if it is possible to change both of these at once.
Use applymap() to apply a function to each element in the DataFrame:
df.applymap(lambda x: 10 if x == 0 else (0 if x < 3 else x))
results in
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
I would do it the following way:
import pandas as pd
df = pd.DataFrame([[0, 1, 2, 3, 2, 5], [3, 4, 5, 0, 2, 7]]).transpose()
df.columns = ["A", "B"]
df_orig = df.copy()
df[df_orig < 3] = 0
df[df_orig == 0] = 10
print(df)
output
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
Explanation: I use the .copy method to get a copy of the DataFrame, stored in the variable df_orig, and then use that DataFrame, which is not altered during the run of the program, to select where to put 0 and 10.
You can create the masks first and then change the values:
m1 = df < 3
m2 = df == 0
df[m1] = 0
df[m2] = 10
print(df)
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
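A related vectorized sketch: numpy.select evaluates both rules against the original values in a single call, so the replacements cannot interfere with each other (the more specific rule, == 0, is listed first):
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 2, 3, 2, 5], [3, 4, 5, 0, 2, 7]]).transpose()
df.columns = ["A", "B"]

# Conditions are checked against the unmodified frame; np.select picks the
# replacement for the first condition that matches and keeps df elsewhere.
out = pd.DataFrame(np.select([df == 0, df < 3], [10, 0], default=df),
                   columns=df.columns, index=df.index)
print(out)
Like the mask-based answer above, this keeps the original values around while both conditions are resolved, and it avoids the per-element Python calls that applymap makes.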

Trying to multiply a certain data cell by another certain data cell in pandas

Since my real scenario caused some misunderstanding, I am going to create a simplified one.
Here is the DataFrame.
import pandas as pd
num1df = pd.DataFrame({'Number 1': [1, 4, 3, 2, 100]})
num2df = pd.DataFrame({'Number 2': [1, 2, 'NaN', 4, 5]})
num3df = pd.DataFrame({'Number 3': [1, 2, 3, 1000, 0]})
numsdf = pd.concat([num1df, num2df, num3df], axis=1, join="inner")
print(numsdf)
Number 1 Number 2 Number 3
0 1 1 1
1 4 2 2
2 3 NaN 3
3 2 4 1000
4 100 5 0
I want to be able to do the following addition: column Number 1 row 4 plus column Number 3 row 3 = column Number 2 row 2, i.e. 100 + 1000 = 1100 (the answer should go in place of the NaN).
This should be the expected outcome:
Number 1 Number 2 Number 3
0 1 1 1
1 4 2 2
2 3 1100 3
3 2 4 1000
4 100 5 0
How would I do that? I cannot figure it out.
Note: this solution works only if the indices are the same in all 3 DataFrames.
If possible, replace the non-numeric values with missing values and then forward fill the last non-missing value in the same column:
marketcapdf['Market Cap'] = stockpricedf['Stock Price'] * pd.to_numeric(
    outstandingdf['Outstanding'], errors='coerce').ffill()
If working in one DataFrame:
df['Market Cap'] = df['Stock Price'] * pd.to_numeric(
    df['Outstanding'], errors='coerce').ffill()
EDIT: If you need to multiply by the shifted second column, leaving the first value unchanged, use the following (note: the sample output below uses different example values for Number 1 and Number 2 than the frame defined above):
numsdf['new'] = numsdf['Number 1'] * numsdf['Number 2'].shift(fill_value=1)
print(numsdf)
Number 1 Number 2 new
0 5 1 5
1 4 2 4
2 3 3 6
3 2 4 6
4 1 5 4
EDIT 1: I create new columns for better understanding:
import numpy as np

num1df = pd.DataFrame({'Number 1': [1, 4, 3, 2, 100]})
num2df = pd.DataFrame({'Number 2': [1, 2, np.nan, 4, 5]})
num3df = pd.DataFrame({'Number 3': [1, 2, 3, 1000, 0]})
numsdf = pd.concat([num1df, num2df, num3df], axis=1, join="inner")
#add by shifted values
numsdf['new'] = numsdf['Number 1'].shift(-1, fill_value=0) + numsdf['Number 3']
#shift again
numsdf['new1'] = numsdf['new'].shift(-1, fill_value=0)
#replace NaN by another column
numsdf['new2'] = numsdf['Number 2'].fillna(numsdf['new1'])
print(numsdf)
Number 1 Number 2 Number 3 new new1 new2
0 1 1.0 1 5 5 1.0
1 4 2.0 2 5 5 2.0
2 3 NaN 3 5 1100 1100.0
3 2 4.0 1000 1100 0 4.0
4 100 5.0 0 0 0 5.0
Alternatively, pick the two cells by position and assign the sum directly:
foo = numsdf.iloc[4, 0]   # column 'Number 1', row 4 -> 100
bar = numsdf.iloc[3, 2]   # column 'Number 3', row 3 -> 1000
numsdf.at[2, 'Number 2'] = foo + bar
Output:
Number 1 Number 2 Number 3
0 1 1 1
1 4 2 2
2 3 1100 3
3 2 4 1000
4 100 5 0
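A small sketch combining the two answers, assuming the 'NaN' entry is the string from the question's setup: coerce the column to numeric first, then fill the missing cell with the positional sum.
import pandas as pd

num1df = pd.DataFrame({'Number 1': [1, 4, 3, 2, 100]})
num2df = pd.DataFrame({'Number 2': [1, 2, 'NaN', 4, 5]})
num3df = pd.DataFrame({'Number 3': [1, 2, 3, 1000, 0]})
numsdf = pd.concat([num1df, num2df, num3df], axis=1, join="inner")

# Turn the string 'NaN' into a real missing value, then fill it with the
# sum of the two cells selected by position (100 + 1000).
numsdf['Number 2'] = pd.to_numeric(numsdf['Number 2'], errors='coerce')
numsdf['Number 2'] = numsdf['Number 2'].fillna(numsdf.iloc[4, 0] + numsdf.iloc[3, 2])
print(numsdf)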

Data frame mode function

Hi, I want to ask: I am using the df.mode() function to find the most common value in each row, with df.mode(axis=1). This gives me an extra column; how could I get only one column?
For example, I have a data frame:
0 1 2 3 4
1 1 0 1 1 1
2 0 1 0 0 1
3 0 0 1 1 0
so I want the output
1 1
2 0
3 0
but I am getting
1 1 NaN
2 0 NaN
3 0 NaN
Does anyone know why?
The code you tried gives the expected output in Python 3.7.6 with Pandas 1.0.3.
import pandas as pd
df = pd.DataFrame(
data=[[1, 0, 1, 1, 1], [0, 1, 0, 0, 1], [0, 0, 1, 1, 0]],
index=[1, 2, 3])
df
0 1 2 3 4
1 1 0 1 1 1
2 0 1 0 0 1
3 0 0 1 1 0
df.mode(axis=1)
0
1 1
2 0
3 0
There could be different data types in your columns, and mode cannot compare columns of different data types.
Use astype(str) or astype(int) to convert the columns to a suitable data type. Make sure that the data types are consistent in the df before employing mode(axis=1).
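A minimal sketch of that idea, assuming the extra column comes from one column having been read in as strings: cast everything to a single dtype before taking the row-wise mode, and keep only the first mode per row.
import pandas as pd

# Hypothetical frame where the last column was read in as strings.
df = pd.DataFrame(
    data=[[1, 0, 1, 1, '1'], [0, 1, 0, 0, '1'], [0, 0, 1, 1, '0']],
    index=[1, 2, 3])

# Cast every column to int so mode() compares like with like,
# then keep only the first mode column per row.
row_mode = df.astype(int).mode(axis=1)[0]
print(row_mode)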

Vectorizing for-loop

I have a very large dataframe (~10^8 rows) where I need to change some values. The algorithm I use is complex so I tried to break down the issue into a simple example below. I mostly programmed in C++, so I keep thinking in for-loops. I know I should vectorize but I am new to python and very new to pandas and cannot come up with a better solution. Any solutions which increase performance are welcome.
#!/usr/bin/python3
import numpy as np
import pandas as pd
data = {'eventID': [1, 1, 1, 2, 2, 3, 4, 5, 6, 6, 6, 6, 7, 8],
        'types': [0, -1, -1, -1, 1, 0, 0, 0, -1, -1, -1, 1, -1, -1]}
mydf = pd.DataFrame(data, columns=['eventID', 'types'])
print(mydf)
MyIntegerCodes = np.array([0, 1])
eventIDs = np.unique(mydf.eventID.values)  # can be up to 10^8 values
for val in eventIDs:
    currentTypes = mydf[mydf.eventID == val].types.values
    if (0 in currentTypes) & ~(1 in currentTypes):
        mydf.loc[mydf.eventID == val, 'types'] = 0
    if ~(0 in currentTypes) & (1 in currentTypes):
        mydf.loc[mydf.eventID == val, 'types'] = 1
print(mydf)
Any ideas?
EDIT: I was asked to explain what I do with my for-loops.
For every eventID I want to know whether the corresponding types contain a 1, a 0, or both. If they contain a 1 (and no 0), all values equal to -1 should be changed to 1. If they contain a 0 (and no 1), all values equal to -1 should be changed to 0. My problem is doing this efficiently for each eventID independently. There can be one or multiple entries per eventID.
Input of example:
eventID types
0 1 0
1 1 -1
2 1 -1
3 2 -1
4 2 1
5 3 0
6 4 0
7 5 0
8 6 -1
9 6 -1
10 6 -1
11 6 1
12 7 -1
13 8 -1
Output of example:
eventID types
0 1 0
1 1 0
2 1 0
3 2 1
4 2 1
5 3 0
6 4 0
7 5 0
8 6 1
9 6 1
10 6 1
11 6 1
12 7 -1
13 8 -1
First we create boolean masks m1 and m2 using Series.eq, then group each mask by mydf['eventID'] and transform with any; finally, np.select chooses 1 or 0 depending on the conditions m1 or m2:
m1 = mydf['types'].eq(1).groupby(mydf['eventID']).transform('any')
m2 = mydf['types'].eq(0).groupby(mydf['eventID']).transform('any')
mydf['types'] = np.select([m1, m2], [1, 0], mydf['types'])
Result:
# print(mydf)
eventID types
0 1 0
1 1 0
2 1 0
3 2 1
4 2 1
5 3 0
6 4 0
7 5 0
8 6 1
9 6 1
10 6 1
11 6 1
12 7 -1
13 8 -1
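If no eventID ever mixes a 0 with a 1 (which is the case in the example), an even shorter sketch is a per-group maximum, since the group's non-negative value then dominates the -1 placeholders:
# Assumes no eventID contains both a 0 and a 1: the group maximum is exactly
# the value every -1 in that group should take, and groups of only -1 keep -1.
mydf['types'] = mydf.groupby('eventID')['types'].transform('max')
print(mydf)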

Counting the number of missing/NaN in each row

I've got a dataset with a big number of rows. Some of the values are NaN, like this:
In [91]: df
Out[91]:
1 3 1 1 1
1 3 1 1 1
2 3 1 1 1
1 1 NaN NaN NaN
1 3 1 1 1
1 1 1 1 1
And I want to count the number of NaN values in each row, so that it would look like this:
In [91]: list = <somecode with df>
In [92]: list
Out[91]:
[0,
0,
0,
3,
0,
0]
What is the best and fastest way to do it?
You could first find whether each element is NaN or not with isnull() and then take the row-wise sum(axis=1):
In [195]: df.isnull().sum(axis=1)
Out[195]:
0 0
1 0
2 0
3 3
4 0
5 0
dtype: int64
And, if you want the output as a list, you can use:
In [196]: df.isnull().sum(axis=1).tolist()
Out[196]: [0, 0, 0, 3, 0, 0]
Or use count like
In [130]: df.shape[1] - df.count(axis=1)
Out[130]:
0 0
1 0
2 0
3 3
4 0
5 0
dtype: int64
To count NaNs only over specific columns, use
cols = ['col1', 'col2']
df['number_of_NaNs'] = df[cols].isna().sum(1)
or index the columns by position, e.g. count NaNs in the first 4 columns:
df['number_of_NaNs'] = df.iloc[:, :4].isna().sum(1)
