How to extract data using groupby under a specific condition? - python

I have a data set as such:
import pandas as pd

x = {'column1': ['a','a','b','b','b','c','c','c','d'],
     'column2': [1,0,1,1,0,1,1,0,1]}
df = pd.DataFrame(x, columns=['column1', 'column2'])
print (df)
How would I extract only the rows where column2 has a value of one (like this):
x = {'column1': ['a','b','b','c','c','d'],
     'column2': [1,1,1,1,1,1]}
df = pd.DataFrame(x, columns=['column1', 'column2'])
print (df)
Also, how would I count the number of 1's for each value in column1 and put that count in a new column, aligned with the respective rows (for example, how many 1's does value a in column1 have?). So the dataframe turns into this format:
x = {'column1': ['a','b','b','c','c','d'],
     'column2': [1,1,1,1,1,1],
     'column3': [1,2,2,2,2,1]}
df = pd.DataFrame(x, columns=['column1', 'column2', 'column3'])
print (df)

First question:
df[df.column2==1].reset_index(drop=True)
will give you
column1 column2
0 a 1
1 b 1
2 b 1
3 c 1
4 c 1
5 d 1
Second question (run on the filtered frame from above):
df['column3'] = df.groupby('column1')['column2'].transform(len)
will give you
column1 column2 column3
0 a 1 1
1 b 1 2
2 b 1 2
3 c 1 2
4 c 1 2
5 d 1 1

Use boolean indexing with Series.eq (the method form of ==) and then Series.map with Series.value_counts:
df = df[df['column2'].eq(1)].copy()  # copy so the new column can be added without a SettingWithCopyWarning
df['column3'] = df['column1'].map(df['column1'].value_counts())
Alternative with GroupBy.transform and GroupBy.size:
df['column3'] = df.groupby('column1')['column1'].transform('size')
print (df)
column1 column2 column3
0 a 1 1
2 b 1 2
3 b 1 2
5 c 1 2
6 c 1 2
8 d 1 1
Last, for a default index, use DataFrame.reset_index with drop=True:
df = df.reset_index(drop=True)
print (df)
column1 column2 column3
0 a 1 1
1 b 1 2
2 b 1 2
3 c 1 2
4 c 1 2
5 d 1 1
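
For reference, the same three steps also work as a single method chain; a minimal sketch, assuming the original df from the question:
import pandas as pd

df = pd.DataFrame({'column1': ['a','a','b','b','b','c','c','c','d'],
                   'column2': [1,0,1,1,0,1,1,0,1]})

out = (df[df['column2'].eq(1)]      # keep only rows where column2 == 1
       .assign(column3=lambda d: d.groupby('column1')['column2'].transform('size'))  # per-value row count
       .reset_index(drop=True))     # restore a default 0..n-1 index
print (out)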


df.rename does not alter df column names, but df.columns and df.set_axis do (Pandas)

I have a pandas dataframe whose columns I want to rename.
When I run:
df.rename(columns={0:"C", 1:"D"}, inplace=True)
No change happens; it still has the original column names.
But if I do:
df.columns = ["C", "D"]
or
df.set_axis(["C", "D"], axis=1, inplace=True)
It works.
Why doesn't df.rename work?
NOTE: I specifically want to rename the first and second columns regardless of their names, which may change (in my case), so I can't specify them.
Example:
df = pd.DataFrame({"A": pd.Series(range(0,2)),"B": pd.Series(range(2,4))})
df
A B
0 0 2
1 1 3
df = pd.DataFrame({"A": pd.Series(range(0,2)),"B": pd.Series(range(2,4))})
df.rename(columns={0:"C", 1:"D"}, inplace=True)
df
A B
0 0 2
1 1 3
df = pd.DataFrame({"A": pd.Series(range(0,2)),"B": pd.Series(range(2,4))})
df.columns = ["C", "D"]
df
C D
0 0 2
1 1 3
df = pd.DataFrame({"A": pd.Series(range(0,2)),"B": pd.Series(range(2,4))})
df.set_axis(["C", "D"], axis=1, inplace=True)
df
C D
0 0 2
1 1 3
EDIT:
My original dataframe had the column names 0 and 1, which is why df.rename(columns={0:"C", 1:"D"}, inplace=True) worked there.
Example:
df = pd.DataFrame([range(2,4), range(4,6)])
df
0 1
0 2 3
1 4 5
df.rename(columns={0:"C", 1:"D"}, inplace=True)
df
C D
0 2 3
1 4 5
If you don't want to rename using the old names, you can zip the current columns with the new names and pass in as many items as you want to rename.
If you're using Python 3.7+, dict insertion order is preserved, so the mapping lines up with the column order.
Also, don't use inplace=True; assign the result back (df = df.rename(...)) instead.
print(df)
A B
0 0 2
1 1 3
df.rename(columns=dict(zip(df.columns, ['C','E'])))
C E
0 0 2
1 1 3
df.rename(columns=dict(zip(df.columns, ['E'])))
E B
0 0 2
1 1 3
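
A related one-liner, as a sketch: look the current labels up by position from df.columns, so the rename works whatever the columns happen to be called:
# rename the first two columns by position, regardless of their current labels
df = df.rename(columns={df.columns[0]: 'C', df.columns[1]: 'D'})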

How to find duplicate values (not rows) in an entire pandas dataframe?

Consider this dataframe.
df = pd.DataFrame(data={'one': list('abcd'),
                        'two': list('efgh'),
                        'three': list('ajha')})
one two three
0 a e a
1 b f j
2 c g h
3 d h a
How can I output all duplicate values and their respective index? The output can look something like this.
id value
0 2 h
1 3 h
2 0 a
3 0 a
4 3 a
Try .melt + .duplicated:
x = df.reset_index().melt("index")
print(
    x.loc[x.duplicated(["value"], keep=False), ["index", "value"]]
     .reset_index(drop=True)
     .rename(columns={"index": "id"})
)
Prints:
id value
0 0 a
1 3 h
2 0 a
3 2 h
4 3 a
We can stack the DataFrame, use Series.loc to keep only the rows where the value is duplicated (Series.duplicated), then Series.reset_index to convert back to a DataFrame:
new_df = (
    df.stack()                                  # convert to long form
      .droplevel(-1).rename_axis('id')          # handle the MultiIndex
      .loc[lambda x: x.duplicated(keep=False)]  # filter to duplicated values
      .reset_index(name='value')                # make the Series a DataFrame
)
new_df:
id value
0 0 a
1 0 a
2 2 h
3 3 h
4 3 a
Here I used melt to reshape and duplicated(keep=False) to select the duplicates:
(df.rename_axis('id')
   .reset_index()
   .melt(id_vars='id')
   .loc[lambda d: d['value'].duplicated(keep=False), ['id','value']]
   .sort_values(by='id')
   .reset_index(drop=True)
)
Output:
id value
0 0 a
1 0 a
2 2 h
3 3 h
4 3 a

PANDAS How to split a dataframe according to the value in the first column?

I would like to split a big dataframe into several smaller dataframes according to the value in the first column, if that's possible; I didn't find this online.
Example, I have this:
DF
Column1 Column2 Column3 Column4
A 1 2 1
A 1 1 2
A 3 2 2
B 2 1 2
B 3 1 1
split this into :
DF1
Column1 Column2 Column3 Column4
A 1 2 1
A 1 1 2
A 3 2 2
DF2
Column1 Column2 Column3 Column4
B 2 1 2
B 3 1 1
Simple as:
df1 = df[df['Column1'] == 'A']
df2 = df[df['Column1'] == 'B']
If you want one dataframe per unique value in the column, you could build a list of dataframes:
df_lst = []
unique_elements = df['Column1'].unique()
for elm in unique_elements:
    df_lst.append(df[df['Column1'] == elm])
You can solve this with pandas using groupby():
import pandas as pd

# load your file
df = pd.read_csv('sample.csv')
grouped = df.groupby(df.Column1)
A = grouped.get_group("A")
B = grouped.get_group("B")
print(A)
print(B)
You can use groupby as:
df = [['A',1,1,2], ['A',10,1,2], ['B',10,1,2], ['B',30,1,2]]
df = pd.DataFrame(df, columns=['a','b','c','d'])
# iterating a GroupBy yields (name, group) tuples, so this unpacking
# only works when there are exactly two groups
d1, d2 = df.groupby('a')
print(d1[1])
print()
print(d2[1])
a b c d
0 A 1 1 2
1 A 10 1 2
a b c d
2 B 10 1 2
3 B 30 1 2
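
When the number of groups isn't known in advance, a common pattern is to collect the groups into a dict keyed by the grouping value; a minimal sketch, assuming the sample DF above:
# one sub-dataframe per unique value in Column1
dfs = {key: group for key, group in df.groupby('Column1')}
print(dfs['A'])
print(dfs['B'])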

Looking up values in two pandas data frames and create new columns

I have two data frames in my problem.
df1
ID Value
1 A
2 B
3 C
df2:
ID F_ID S_ID
1 2 3
2 3 1
3 1 2
I want to create a column next to each ID column that stores the values looked up from df1. The output should look like this:
ID ID_Value F_ID F_ID_Value S_ID S_ID_Value
1 A 2 B 3 C
2 B 3 C 1 A
3 C 1 A 2 B
Basically looking up from df1 and creating a new column to store these values.
You can use map on each column of df2 with the values from df1:
s = df1.set_index('ID')['Value']
for col in df2.columns:
    df2[f'{col}_value'] = df2[col].map(s)
print (df2)
ID F_ID S_ID ID_value F_ID_value S_ID_value
0 1 2 3 A B C
1 2 3 1 B C A
2 3 1 2 C A B
or with apply and concat:
df_ = pd.concat([df2, df2.apply(lambda x: x.map(s)).add_suffix('_value')], axis=1)
df_ = df_.reindex(sorted(df_.columns), axis=1)
If column order is important (per the comments it isn't here, but for completeness), it is necessary to use DataFrame.insert with enumerate and a little arithmetic:
s = df1.set_index('ID')['Value']
for i, col in enumerate(df2.columns, 1):
    df2.insert(i * 2 - 1, f'{col}_value', df2[col].map(s))
print (df2)
ID ID_value F_ID F_ID_value S_ID S_ID_value
0 1 A 2 B 3 C
1 2 B 3 C 1 A
2 3 C 1 A 2 B
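
For reference, the same lookup also works in one pass with DataFrame.replace and a plain dict; a sketch, assuming the frames above (note the new columns are appended rather than interleaved):
mapping = df1.set_index('ID')['Value'].to_dict()  # {1: 'A', 2: 'B', 3: 'C'}
out = pd.concat([df2, df2.replace(mapping).add_suffix('_value')], axis=1)
print (out)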

How to pivot a dataframe into a square dataframe with number of intersections in other column as values

How can I pivot a dataframe into a square dataframe whose values are the number of intersections in the value column? My input dataframe is:
field value
a 1
a 2
b 3
b 1
c 2
c 5
Output should be
a b c
a 2 1 1
b 1 2 0
c 1 0 2
The values in the output dataframe should be the number of values in the value column shared between each pair of fields.
Use a self-merge on value with crosstab:
df = df.merge(df, on='value')
df = pd.crosstab(df['field_x'], df['field_y'])
print (df)
field_y a b c
field_x
a 2 1 1
b 1 2 0
c 1 0 2
Then remove the index and column names with rename_axis:
# pandas 0.24+
df = pd.crosstab(df['field_x'], df['field_y']).rename_axis(index=None, columns=None)
print (df)
a b c
a 2 1 1
b 1 2 0
c 1 0 2
# pandas below 0.24
df = pd.crosstab(df['field_x'], df['field_y']).rename_axis(None).rename_axis(None, axis=1)
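
Putting it all together, a minimal runnable sketch using the sample input above:
import pandas as pd

df = pd.DataFrame({'field': ['a','a','b','b','c','c'],
                   'value': [1, 2, 3, 1, 2, 5]})

# self-merge on value: every pair of rows that shares a value meets here,
# so counting (field_x, field_y) pairs gives the intersection sizes
pairs = df.merge(df, on='value')
out = pd.crosstab(pairs['field_x'], pairs['field_y']).rename_axis(index=None, columns=None)
print (out)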
