I have multiple categorical columns like Marital Status, Education, Gender, and City, and I want to check all the unique values (with their counts) in these columns at once instead of writing this code for every column:
df['Education'].value_counts()
I can only give an example with a few features, but I need a solution for when there are many categorical features and it's not practical to write the same code again and again to examine them.
Marital_Status  Education  City
Married         UG         LA
Single          PHD        CA
Single          UG         CA
Expected output:
Marital_Status    Education    City
Married   1       UG    2      LA   1
Single    2       PHD   1      CA   2
Is there any kind of method to do this in Python?
Thanks
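For reference, here is a minimal reproducible version of the sample above (a sketch; column names as shown in the table):
import pandas as pd

df = pd.DataFrame({
    'Marital_Status': ['Married', 'Single', 'Single'],
    'Education': ['UG', 'PHD', 'UG'],
    'City': ['LA', 'CA', 'CA'],
})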
Yes, you can get what you're looking for with the following approach (and you don't have to worry about whether your df has more columns than the four you specified):
Get all (and only) the categorical columns of your df into a list:
cat_cols = [i for i in df.columns if df[i].dtypes == 'O']
Then loop over your categorical columns, call .size() on each grouped object, and store each result (which is a DataFrame) in an empty list.
li = []
for col in cat_cols:
    li.append(df.groupby([col]).size().reset_index(name=col + '_count'))
Lastly, concatenate the newly created DataFrames in your list into one:
dat = pd.concat(li, axis=1)
All in 1 block:
cat_cols = [i for i in df.columns if df[i].dtypes == 'O']
li = []
for col in cat_cols:
    li.append(df.groupby([col]).size().reset_index(name=col + '_count'))
dat = pd.concat(li, axis=1)  # use axis=1 so that the concatenation is column-wise
Marital Status Marital Status_count ... City City_count
0 Divorced 4.0 ... Athens 4
1 Married 3.0 ... Berlin 2
2 Single 3.0 ... London 2
3 Widowed 2.0 ... New York 2
4 NaN NaN ... Singapore 2
Using value_counts, you can do the following
res = (df
       .apply(lambda x: x.value_counts())  # column-by-column value_counts is applied
       .stack()
       .reset_index(level=0)
       .sort_index(axis=0)
       .rename(columns={'level_0': 'Value', 0: 'value_counts'}))
Another format of the output:
res['Id'] = res.groupby(level=0).cumcount()
res = res.set_index('Id', append=True)
Explanation:
After applying value_counts, you will get the following:
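On the three-row sample from the question, this intermediate frame would look roughly like this (a sketch; exact row order may differ):
intermediate = df.apply(lambda x: x.value_counts())
print(intermediate)
#          Marital_Status  Education  City
# CA                  NaN        NaN   2.0
# LA                  NaN        NaN   1.0
# Married             1.0        NaN   NaN
# PHD                 NaN        1.0   NaN
# Single              2.0        NaN   NaN
# UG                  NaN        2.0   NaN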
Then, using stack, you can remove the NaN values and get everything "stacked up", after which you can do the formatting/ordering of the output.
To know how many repeated unique values you have for each column, you can try the drop_duplicates() method:
dataset.drop_duplicates()
I want to group by ID and get the three most frequent cities for each ID. For example, I have this original dataframe:
ID City
1 London
1 London
1 New York
1 London
1 New York
1 Berlin
2 Shanghai
2 Shanghai
and the result I want is like this:
ID first_frequent_city second_frequent_city third_frequent_city
1 London New York Berlin
2 Shanghai NaN NaN
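For reference, the sample above as a reproducible frame (a sketch):
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 1, 1, 2, 2],
    'City': ['London', 'London', 'New York', 'London', 'New York', 'Berlin', 'Shanghai', 'Shanghai'],
})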
The first step is to use SeriesGroupBy.value_counts to count the values of City per ID (with the advantage that the values are already sorted), then get a counter with GroupBy.cumcount, filter the first 3 values with loc, pivot with DataFrame.pivot, change the column names, and finally convert ID back to a column with DataFrame.reset_index:
df = (df.groupby('ID')['City'].value_counts()
        .groupby(level=0).cumcount()
        .loc[lambda x: x < 3]
        .reset_index(name='c')
        .pivot('ID', 'c', 'City')
        .rename(columns={0: 'first_', 1: 'second_', 2: 'third_'})
        .add_suffix('frequent_city')
        .rename_axis(None, axis=1)
        .reset_index())
print (df)
ID first_frequent_city second_frequent_city third_frequent_city
0 1 London New York Berlin
1 2 Shanghai NaN NaN
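Note: in newer pandas versions (2.0 and later) DataFrame.pivot takes keyword-only arguments, so the pivot step above would be written as:
.pivot(index='ID', columns='c', values='City')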
Another way: use the count as a reference to sort, then recreate the dataframe by looping through the groupby object:
df = (df.assign(count=df.groupby(["ID", "City"])["City"].transform("count"))
        .drop_duplicates(["ID", "City"])
        .sort_values(["ID", "count"], ascending=False))
print (pd.DataFrame([i["City"].unique()[:3] for _, i in df.groupby("ID")]).fillna(np.NaN))
0 1 2
0 London New York Berlin
1 Shanghai NaN NaN
A bit long; essentially you group by twice. The first part works on the idea that grouping sorts the data in ascending order; the second part allows us to split the data into individual columns:
(df
 .groupby("ID")
 .tail(3)
 .drop_duplicates()
 .groupby("ID")
 .agg(",".join)
 .City.str.split(",", expand=True)
 .set_axis(["first_frequent_city",
            "second_frequent_city",
            "third_frequent_city"],
           axis="columns")
)
first_frequent_city second_frequent_city third_frequent_city
ID
1 London New York Berlin
2 Shanghai None None
Get the count by ID and City, then use np.where() with .groupby() transforms of max, median and min. Then set the index and unstack the 'max' column from rows to columns.
df = df.assign(count=df.groupby(['ID', 'City'])['City'].transform('count')).drop_duplicates()
df['max'] = np.where((df['count'] == df.groupby('ID')['count'].transform('min')), 'third_frequent_city', np.nan)
df['max'] = np.where((df['count'] == df.groupby('ID')['count'].transform('median')), 'second_frequent_city', df['max'])
df['max'] = np.where((df['count'] == df.groupby('ID')['count'].transform('max')), 'first_frequent_city', df['max'])
df = df.drop('count',axis=1).set_index(['ID', 'max']).unstack(1)
output:
City
max first_frequent_city second_frequent_city third_frequent_city
ID
1 London New York Berlin
2 Shanghai NaN NaN
I have a dataframe as below:
Name Marks Place Points
John-->Hile 50 Germany-->Poland 1
Rog-->Oliver 60-->70 Australia-->US 2
Harry 80 UK 3
Faye-->George 90 Poland 4
I want a result, as below, that counts the values containing "-->" column-wise and transposes that into the following dataframe:
Column Count
Name 3
Marks 1
Place 2
This df is just an example. The dataframe is dynamic and can vary in each run; in a second run we might have Name, Marks, Place, or only Name, Marks, or anything else. So the code should be dynamic and able to run on any df.
You can select the object columns and, column by column, perform the check and a summation:
df.select_dtypes(object).apply(lambda x: x.str.contains('-->')).sum()
Name 3
Marks 1
Place 2
dtype: int64
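If you need the exact two-column frame from the question, the resulting Series can be reshaped, for example (a sketch):
counts = df.select_dtypes(object).apply(lambda x: x.str.contains('-->')).sum()
res = counts.rename_axis('Column').reset_index(name='Count')
print(res)
#   Column  Count
# 0   Name      3
# 1  Marks      1
# 2  Place      2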
Another weird, but interesting method with applymap:
(df.select_dtypes(object)
   .applymap(lambda x: '-->' in x if isinstance(x, str) else False)
   .sum())
Name 3
Marks 1
Place 2
dtype: int64
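Note that in recent pandas versions (2.1 and later) DataFrame.applymap is deprecated in favour of DataFrame.map; the same check would look roughly like this (a sketch):
(df.select_dtypes(object)
   .map(lambda x: '-->' in x if isinstance(x, str) else False)
   .sum())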
I have a dataframe like this, where the unique values of Country and Name for each ID have to be spread into new columns.
Expected output:
If a value is repeated, it need not be displayed in the new columns; it can be left blank.
I tried the code below, which works fine for one column, but what if I have 2 columns together and want to do the same task?
group = df.groupby('ID')
df1 = group.apply(lambda x: x['COUNTRY'].unique())
df1 = df1.apply(pd.Series)
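As a hypothetical sample in the shape described (consistent with the output shown in the second answer below; note that the first answer refers to these columns as Country and Name):
import pandas as pd

df = pd.DataFrame({
    'ID': ['20_001', '20_001', '20_002', '20_002', '20_003', '20_003', '20_004'],
    'COUNTRY': ['US', 'IN', 'US', 'US', 'US', 'EU', 'EU'],
    'NAME': ['LIZ', 'LAK', 'LIZ', 'CHRI', 'LIZ', 'LIZ', 'CHRI'],
})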
You can do the following,
# Create a dataframe where each element is aggregated as a list
new_df = df.groupby('ID').agg(lambda x: pd.Series(x).unique().tolist())
# Generate column names to be used after expanding the lists
country_cols = ['Country_' + str(i) for i in range(new_df["Country"].str.len().max())]
name_cols = ['Name_' + str(i) for i in range(new_df["Name"].str.len().max())]
# Drop the Country and Name columns from the aggregated frame, expand the Country
# and Name lists into their own columns, concat column-wise, and finally fillna
df2 = pd.concat(
    [new_df.drop(['Country', 'Name'], axis=1),
     pd.DataFrame.from_records(new_df["Country"], columns=country_cols, index=new_df.index),
     pd.DataFrame.from_records(new_df["Name"], columns=name_cols, index=new_df.index)
    ], axis=1
).fillna(' ')
We can do this with a simple function:
def unique_column_unstack(dataframe, agg_columns):
    dfs = []
    for col in agg_columns:
        agg_df = dataframe.groupby('ID')[col].apply(lambda x: pd.Series(x.unique().tolist())).unstack()
        agg_df.columns = agg_df.columns.map(lambda x: f"{col}_{x+1}")
        dfs.append(agg_df)
    return pd.concat(dfs, axis=1)
new_df = unique_column_unstack(df,['COUNTRY','NAME'])
print(new_df)
COUNTRY_1 COUNTRY_2 NAME_1 NAME_2
ID
20_001 US IN LIZ LAK
20_002 US NaN LIZ CHRI
20_003 US EU LIZ NaN
20_004 EU NaN CHRI NaN
I got a weird one today. I am scraping several thousand PDFs using Tabula-py and, for whatever reason, the same table (from different PDFs) with wrapped text is sometimes auto-merged along the table's actual splits, but on other occasions the pandas dataframe will have many NaN rows to account for the wrapped text. Generally the ratio is about 50:1 in favour of the merged case, so it makes sense to automate the merging process. Here is the example:
Desired DataFrame:
Column1 | Column2 | Column3
A Many Many ... Lots and ... This keeps..
B lots of text.. Many Texts.. Johns and jo..
C ...
D
Scraped returned Dataframe
Column1 | Column2 | Column3
A Many Many Lots This keeps Just
Nan Many Many and lots Keeps Going!
Nan Texts Nan Nan
B lots of Many Texts John and
Nan text here Johnson inc.
C ...
In this case the text should be merged up, such that "Many Many Many Many Texts" are all in cell A Column1 and so on.
I have solved this problem with the below solution, but it feels very dirty. There are a ton of index settings to avoid having to manage the columns and avoid dropping needed values. Is anyone aware of a better solution?
df = df.reset_index()
df['Unnamed: 0'] = df['Unnamed: 0'].fillna(method='ffill')
df = df.fillna('')
df = df.set_index('Unnamed: 0')
df = df.groupby(df.index)[df.columns].transform(lambda x: ' '.join(x))
df = df.reset_index()
df = df.drop_duplicates(keep = 'first')
df = df.set_index('Unnamed: 0')
Cheers
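For reference, a reproducible sketch of the A/B rows of the scraped frame above (with the row labels already in the index):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'Column1': ['Many Many', 'Many Many', 'Texts', 'lots of', 'text'],
     'Column2': ['Lots', 'and lots', np.nan, 'Many Texts', 'here'],
     'Column3': ['This keeps Just', 'Keeps Going!', np.nan, 'John and', 'Johnson inc.']},
    index=['A', np.nan, np.nan, 'B', np.nan])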
Similar to Ben's idea:
# fill the missing index
df.index = df.index.to_series().ffill()
(df.stack()                 # stack to kill the other NaN values
   .groupby(level=(0, 1))   # groupby (index, column)
   .apply(' '.join)         # join those strings
   .unstack(level=1)        # unstack to get the columns back
)
Output:
Column1 Column2 Column3
A Many Many Many Many Texts Lots and lots This keeps Just Keeps Going!
B lots of text Many Texts here John and Johnson inc.
Try this:
df.fillna('').groupby(df.index.to_series().ffill()).agg(' '.join)
Out[1390]:
Column1 Column2 \
Unnamed: 0
A Many Many Many Many Texts Lots and lots
B lots of text Many Texts here
Column3
Unnamed: 0
A This keeps Just Keeps Going!
B John and Johnson inc.
I think you can use ffill on the index directly in the groupby. Then use agg instead of transform.
# dummy input
df = pd.DataFrame({'a': list('abcdef'), 'b': list('123456')},
                  index=['A', np.nan, np.nan, 'B', 'C', np.nan])
print (df)
a b
A a 1
NaN b 2
NaN c 3
B d 4
C e 5
NaN f 6
# then groupby on the filled index and agg
new_df = (df.fillna('')
            .groupby(pd.Series(df.index).ffill().values)[df.columns]
            .agg(lambda x: ' '.join(x)))
print (new_df)
a b
A a b c 1 2 3
B d 4
C e f 5 6
I'm trying to change the structure of my dataset.
Currently I have:
RE id Country 0 1 2 ... n
1001 CN,TH CN TH nan ... nan
1002 UK UK nan nan ... nan
I've split the Country column out, hence the additional columns. Now I am trying to use df.melt to get this:
RE id var val
1001 0 CN
1001 1 TH
So I can eventually get to this by using a pivot
RE id Country
1001 TH
1001 CN
I've tried:
df = a.melt(id_vars=[a[[0]],a[[1]],a[[2]]], value_vars=['RE id'])
How can I select the range of columns in my dataframe to use as the identifer variables?
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.melt.html#pandas.DataFrame.melt
I think the problem was that you were referencing the column names incorrectly. Also, I believe you had id_vars (should be Re id, I think) and value_vars (column names 0 and 1) inverted in your code.
Here is how I approached this
Imports
import pandas as pd
import numpy as np
Here is a part of the data, sufficient to demonstrate the likely problem
a = [['Re id', 0, 1],[1001,'CN','TH'],[1002,'UK',np.nan]]
df = pd.DataFrame(a[1:], columns=a[0])
print(df)
Re id 0 1
0 1001 CN TH
1 1002 UK NaN
Now, use pd.melt with
id_vars pointing to Re id
value_vars as the 2 columns you want to melt (namely, column names 0 and 1)
df_melt = pd.melt(df, id_vars=['Re id'], value_vars=[0,1], value_name='Country')
df_melt.sort_values(by=['Re id', 'Country'], ascending=[True,False], inplace=True)
print(df_melt)
Re id variable Country
2 1001 1 TH
0 1001 0 CN
1 1002 0 UK
3 1002 1 NaN
Also, since you have the Country names in separate columns (0,1), I do not think that you need to use the Country column at all.
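If the goal is the final RE id / Country shape shown in the question, one option (a sketch based on the melted frame above) is to drop the NaN rows and the helper variable column:
df_final = (df_melt.dropna(subset=['Country'])
                   .drop(columns='variable')
                   .reset_index(drop=True))
print(df_final)
#    Re id Country
# 0   1001      TH
# 1   1001      CN
# 2   1002      UK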