Invert selection in Pandas - python

I have two datasets. The code and data are shown below.
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
import matplotlib.pyplot as plt
data = {'type_sale': ['group_1','group_2','group_3','group_4','group_5','group_6','group_7','group_8','group_9','group_10'],
'id':[70,20,24,80,20,20,60,20,20,20],
}
df1 = pd.DataFrame(data, columns = ['type_sale',
'id',])
data = {'type_sale': ['group_1','group_2','group_3'],
'id':[70,20,24],
}
df2 = pd.DataFrame(data, columns = ['type_sale',
'id',])
This code creates the two datasets shown below:
Now I want to create a new dataset, df3, containing the rows of df1 whose id values do not appear in the id column of df2.
The final result should look like the picture below.
I tried the following code, but it does not give the desired result.
df = pd.concat((df1, df2))
print(df.drop_duplicates('id'))
Can anybody help me solve this problem?

Try as follows:
Use Series.isin to check, for each value in df1['id'], whether it is contained in df2['id'].
Next, invert the resulting boolean pd.Series with the unary operator ~ (tilde) and use it to select from df1.
Finally, reset the index.
In a one-liner:
df3 = df1[~df1['id'].isin(df2['id'])].reset_index(drop=True)
print(df3)
type_sale id
0 group_4 80
1 group_7 60
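The same steps can be written out one at a time to make the intermediate mask visible. This is a minimal sketch that rebuilds the question's data; the name in_df2 is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'type_sale': [f'group_{i}' for i in range(1, 11)],
    'id': [70, 20, 24, 80, 20, 20, 60, 20, 20, 20],
})
df2 = pd.DataFrame({
    'type_sale': ['group_1', 'group_2', 'group_3'],
    'id': [70, 20, 24],
})

# True where df1's id also occurs in df2's id column
in_df2 = df1['id'].isin(df2['id'])

# ~ flips the mask, so only the rows whose id is absent from df2 remain
df3 = df1[~in_df2].reset_index(drop=True)
print(df3)
```

Keeping the mask in its own variable makes it easy to inspect which rows were matched before they are dropped.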

Related

How can I add a column that has the same value

I was trying to add a new column to my dataset, but when I did, the column only had a value at one index.
Is there a way to put one value in all rows of a column?
import pandas as pd
df = pd.read_json('file_1.json', lines=True)
df2 = pd.read_json('file_2.json', lines=True)
df3 = pd.concat([df,df2])
df3 = df.loc[:, ['renderedContent']]
görüş_column = ['Milet İttifakı']
df3['Siyasi Yönelim'] = görüş_column
As I understand it, this could be a possible solution.
You mentioned these lines of code:
df3 = pd.concat([df,df2])
df3 = df.loc[:, ['renderedContent']]
You can change the first one to
df3 = pd.concat([df,df2],axis=1)  # axis=1 appends the second DataFrame as columns; the default axis=0 appends it as rows
Secondly,
df3 = df3.loc[:, ['renderedContent']]
I think you meant to write this instead of df3 = df.loc[:, ['renderedContent']].
I hope this solves your problem.
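Separately from the concat question, pandas broadcasts a scalar to every row when it is assigned to a column, which may be the most direct way to fill a column with one value. A minimal sketch, assuming a small made-up frame in place of the JSON data:

```python
import pandas as pd

# Hypothetical stand-in for the frame read from the JSON files
df3 = pd.DataFrame({'renderedContent': ['first tweet', 'second tweet', 'third tweet']})

# A scalar is broadcast to every row; a one-element list only fits a one-row frame
df3['Siyasi Yönelim'] = 'Milet İttifakı'
print(df3)
```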

How can I show only some columns using Python Pandas?

I have tried the following code and it works, but it shows extra columns that I don't need. This is the output showing the extra columns:
import pandas as pd
df = pd.read_csv("data.csv")
df = df.groupby(['City1', 'City2']).sum('PassengerTrips')
df['Vacancy'] = 1-df['PassengerTrips'] / df['Seats']
df = df.groupby(['City1','City2']).max('Vacancy')
df = df.sort_values('Vacancy', ascending =False)
print('The 10 routes with the highest proportion of vacant seats:')
print(df[:11])
I tried adding the following line after sorting by vacancy, but it gives me an error:
df = df[['City1', 'City2', 'Vacancy']]
City1 and City2 are in the index because you grouped by them.
You can move them back into columns with reset_index to get the expected result:
df = df.reset_index(drop=False)
df = df[['City1', 'City2', 'Vacancy']]
Or, if you want to keep City1 and City2 in the index, you can do as @Corralien said in his comment: df = df['Vacancy']
Or even df = df['Vacancy'].to_frame() to get a DataFrame instead of a Series.
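To see the effect on a small case, here is a sketch with made-up rows standing in for data.csv (the values and the names totals and result are invented for illustration):

```python
import pandas as pd

# Made-up rows standing in for data.csv
df = pd.DataFrame({
    'City1': ['A', 'A', 'B'],
    'City2': ['X', 'X', 'Y'],
    'PassengerTrips': [50, 30, 40],
    'Seats': [100, 100, 50],
})

totals = df.groupby(['City1', 'City2']).sum()
totals['Vacancy'] = 1 - totals['PassengerTrips'] / totals['Seats']

# City1/City2 are index levels at this point; reset_index turns them back into columns
result = totals.reset_index()[['City1', 'City2', 'Vacancy']]
result = result.sort_values('Vacancy', ascending=False)
print(result)
```

Selecting ['City1', 'City2', 'Vacancy'] before calling reset_index would raise a KeyError, since the two city names are not columns yet.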

Replacing a column values with another column

I have a data frame with 6 columns, of which the first 2 should be plotted as x and y. I want to replace the values of the 6th column with other values and then exclude the x, y points whose 6th-column value falls outside a range such as 0.0003-0.002. My attempt is below:
import numpy as np
import os
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('configuration_1000.out', sep=r"\s+", header=None)
#print(df)
col_5 = df.iloc[:,5]
g = col_5.abs()
g = g*0.00005
#print(g)
df.loc[:,5].replace(g, inplace=True)
#df.head()
selected = df[ (df.loc[:,5] > 0.0003) & (df.loc[:,5] < 0.002) ]
print(selected)
plt.plot(selected[0], selected[1],marker=".")
But when I do this, nothing changes.
You don't need iloc for this, nor do you need to go through the intermediate steps. Just manipulate the column directly.
df[df.columns[5]] = abs(df[df.columns[5]])*0.00005
To solve this problem, you just need to do this:
df.loc[:,5] = g
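Both suggestions come down to assigning the scaled values back to the column. A minimal sketch with a made-up six-column frame in place of configuration_1000.out:

```python
import pandas as pd

# Made-up six-column frame standing in for configuration_1000.out
df = pd.DataFrame([
    [0.0, 1.0, 0, 0, 0, -10.0],
    [1.0, 2.0, 0, 0, 0, 20.0],
    [2.0, 3.0, 0, 0, 0, -50.0],
])

# Overwrite the sixth column with its scaled absolute values
df[5] = df[5].abs() * 0.00005

# Keep rows whose scaled value lies strictly between the two thresholds
selected = df[(df[5] > 0.0003) & (df[5] < 0.002)]
print(selected)
```

With header=None the columns are labelled by integers, so df[5] refers to the sixth column here.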

Pandas - Operate on a column, filtered by another column in the dataset

I have a dataframe with several columns with dates - formatted as datetime.
I am trying to get the min/max value of a date, based on another date column being NaN
For now, I am doing this in two separate steps:
temp_df = df[(df['date1'] == np.nan)]
max_date = max(temp_df['date2'])
temp_df = None
I get the result I want, but I am using an unnecessary temporary dataframe.
How can I do this without it?
Is there any reference material to read on this?
Thanks
Here is an MCVE you can play with to obtain statistics from other columns for the rows where one column is null (NaN or NaT), tested with isnull(). This can be done in a one-liner.
import pandas as pd
import numpy as np
print(pd.__version__)
# sample date columns
daterange1 = pd.date_range('2017-01-01', '2018-01-01', freq='MS')
daterange2 = pd.date_range('2017-04-01', '2017-07-01', freq='MS')
daterange3 = pd.date_range('2017-06-01', '2018-02-01', freq='MS')
df1 = pd.DataFrame(data={'date1': daterange1})
df2 = pd.DataFrame(data={'date2': daterange2})
df3 = pd.DataFrame(data={'date3': daterange3})
# jam them together, making NaT's in non-overlapping ranges
df = pd.concat([df1, df2, df3], axis=0, sort=False)
df.reset_index(inplace=True)
max_date = df[(df['date1'].isnull())]['date2'].max()
print(max_date)
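The same selection can also be written with a single .loc call that takes the boolean mask and the column label together. The sketch below uses made-up dates and also shows why an equality test against np.nan is not a reliable way to find missing values:

```python
import numpy as np
import pandas as pd

# Made-up dates; NaT marks the missing entries in date1
df = pd.DataFrame({
    'date1': pd.to_datetime(['2017-01-01', None, None]),
    'date2': pd.to_datetime(['2017-04-01', '2017-05-01', '2017-06-01']),
})

# Missing values never compare equal to anything, so this mask is all False
eq_mask = df['date1'] == np.nan

# isna() is the reliable test; .loc takes the row mask and the column in one step
max_date = df.loc[df['date1'].isna(), 'date2'].max()
print(max_date)
```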

Efficiently reconstruct DataFrame using oversampled index

I have two DataFrames: df1 and df2
Both df1 and df2 are derived from the same original data set, which has a DatetimeIndex.
df2 still has a DatetimeIndex.
df1, however, has been oversampled and now has an int index, with the prior DatetimeIndex kept as a 'Date' column.
I need to reconstruct a df2 so that it aligns with df1, i.e. I'll need to oversample the rows that are oversampled and then order them and set them onto the same int index that df1 has.
Currently, I'm using these two functions below, but they are painfully slow. Is there any way to speed this up? I haven't been able to find any built-in function that does this. Is there?
def align_data(idx_col,data):
    new_data = pd.DataFrame(index=idx_col.index,columns=data.columns)
    for label,group in idx_col.groupby(idx_col):
        if len(group.index) > 1:
            slice = expanded(data.loc[label],len(group.index)).values
        else:
            slice = data.loc[label]
        new_data.loc[group.index] = slice
    return new_data
def expanded(row,l):
    return pd.DataFrame(data=[row for i in np.arange(l)],index=np.arange(l),columns=row.index)
A test can be generated using the code below:
import pandas as pd
import numpy as np
import datetime as dt
dt_idx = pd.date_range(start='1990-01-01',end='2018-07-02',freq='B')
df1 = pd.DataFrame(data=np.zeros((len(dt_idx),20)),index=dt_idx)
df1.index.name = 'Date'
df2 = df1.copy()
df1 = pd.concat([df1,df1.sample(len(dt_idx)//2)],axis=0)
df1.reset_index(drop=False,inplace=True)
t = dt.datetime.now()
df2_aligned = align_data(df1['Date'],df2)
print(dt.datetime.now()-t)
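One possible vectorized alternative is a label-based lookup, which repeats each row of df2 once for every occurrence of its date in df1['Date']. This is only a sketch on a small stand-in data set, and it assumes that df2's index is unique and that every date in df1['Date'] exists in df2:

```python
import numpy as np
import pandas as pd

# Small stand-in for the test data: a date-indexed frame and an oversampled copy
dt_idx = pd.date_range('2018-01-01', periods=6, freq='B')
df2 = pd.DataFrame({'value': np.arange(6.0)}, index=dt_idx)
df2.index.name = 'Date'
df1 = pd.concat([df2, df2.sample(3, random_state=0)]).reset_index()

# Label-based lookup repeats a row once per occurrence of its date in df1['Date']
df2_aligned = df2.loc[df1['Date']]
# Swap the repeated dates for df1's integer index
df2_aligned.index = df1.index
print(df2_aligned)
```

Because the lookup is a single indexing operation, it avoids the Python-level loop over groups in align_data.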
