Column-wise string counts for multiple columns in a pandas DataFrame - python

I have a dataframe as below:
Name Marks Place Points
John-->Hile 50 Germany-->Poland 1
Rog-->Oliver 60-->70 Australia-->US 2
Harry 80 UK 3
Faye-->George 90 Poland 4
I want a result like the one below: for each column, count the values containing "-->", then transpose the counts so that each column becomes a row:
Column Count
Name 3
Marks 1
Place 2
This df is just an example. The dataframe is dynamic and can vary in each run: in a second run we might have Name, Marks, Place, or only Name, Marks, or anything else. So the code should be dynamic and run on any df.

You can select the object columns, test each value for the substring column by column, and sum the matches:
df.select_dtypes(object).apply(lambda x: x.str.contains('-->')).sum()
Name 3
Marks 1
Place 2
dtype: int64
Another weird, but interesting method with applymap (note that applymap is deprecated since pandas 2.1 in favor of DataFrame.map):
(df.select_dtypes(object)
.applymap(lambda x: '-->' in x if isinstance(x, str) else False)
.sum())
Name 3
Marks 1
Place 2
dtype: int64
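For a self-contained version that also produces the two-column layout from the question, here is a minimal sketch. It assumes Marks is stored as strings, and as a variation on the answer above it casts every column to str instead of selecting object columns, so numeric columns are simply counted as 0 and then filtered out. The names counts and result are only illustrative.

```python
import pandas as pd

# Sample data from the question; Marks is assumed to hold strings.
df = pd.DataFrame({
    "Name": ["John-->Hile", "Rog-->Oliver", "Harry", "Faye-->George"],
    "Marks": ["50", "60-->70", "80", "90"],
    "Place": ["Germany-->Poland", "Australia-->US", "UK", "Poland"],
    "Points": [1, 2, 3, 4],
})

# Cast everything to str so the check works on any column type,
# then count the cells containing the literal substring "-->".
counts = df.astype(str).apply(lambda col: col.str.contains("-->", regex=False)).sum()

# Keep only columns with at least one match and reshape to Column/Count.
result = (
    counts[counts > 0]
    .rename_axis("Column")
    .reset_index(name="Count")
)
print(result)
```

Because nothing here refers to a column by name, the same lines should work on whatever columns a given run happens to contain. A string column with no matches is dropped by the counts > 0 filter; remove that filter if such columns should appear with a count of 0.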


How to compare two columns value in pandas

I have a dataframe which has some unique IDs in two of its columns, e.g.:
S.no. Column1 Column2
1 00001x 00002x
2 00003j 00005k
3 00002x 00001x
4 00004d 00008e
The values can be any strings.
I want to compare the two columns so that only one of rows 1 and 3 remains, since these IDs carry the same information; only their order is different.
Basically, if one row has X in Column1 and Y in Column2, and another row has Y in Column1 and X in Column2, then only one of those rows should remain.
Is that possible in Python?
You can convert your columns to a frozenset per row.
This gives an order-independent value to which duplicated can be applied.
Finally, slice the rows using the previous output as a mask:
mask = df.filter(like='Column').apply(frozenset, axis=1).duplicated()
df[~mask]
Previous answer, using set:
mask = df.filter(like='Column').apply(lambda x: tuple(set(x)), axis=1).duplicated()
df[~mask]
NB. Using set or sorted requires converting the result to a tuple (e.g. lambda x: tuple(sorted(x))), because duplicated hashes the values, which is not possible with mutable objects.
output:
S.no. Column1 Column2
0 1 00001x 00002x
1 2 00003j 00005k
3 4 00004d 00008e
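As a cross-check of the output above, here is a minimal sketch of a different route to the same result: sorting the values inside each row with NumPy instead of building a frozenset. This is a swapped-in variant, not the answer's method, and the names cols, sorted_pairs and deduped are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "S.no.": [1, 2, 3, 4],
    "Column1": ["00001x", "00003j", "00002x", "00004d"],
    "Column2": ["00002x", "00005k", "00001x", "00008e"],
})

cols = ["Column1", "Column2"]

# Sort the values inside each row so (X, Y) and (Y, X) look identical.
sorted_pairs = np.sort(df[cols].to_numpy(dtype=object), axis=1)

# Mark every row whose sorted pair has already been seen.
mask = pd.DataFrame(sorted_pairs, index=df.index).duplicated()

deduped = df[~mask]
print(deduped)
```

duplicated keeps the first occurrence by default, so row 1 survives and row 3 is dropped; passing keep='last' would keep row 3 instead.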

Check values of multiple categorical columns at a same time

I have multiple categorical columns like Marital Status, Education, Gender, City, and I want to check the unique values in all of these columns at once instead of writing this code for each one:
df['Education'].value_counts()
I can only give an example with a few features, but I need a solution for when there are so many categorical features that it's not practical to write the code again and again to examine them.
Maritial_Status Education City
Married UG LA
Single PHD CA
Single UG CA
Expected output:
Maritial_Status Education City
Married 1 UG 2 LA 1
Single 2 PHD 1 CA 2
Is there any kind of method to do this in Python?
Thanks
Yes, you can get what you're looking for with the following approach (you also don't have to worry if your df has more columns than the 4 you specified):
Get (only) all your categorical columns from your df in a list:
cat_cols = [i for i in df.columns if df[i].dtypes == 'O']
Then loop over your categorical columns, calling .size() on the grouped object for each one, and store each result (which is a df object) in a list.
li = []
for col in cat_cols:
    li.append(df.groupby([col]).size().reset_index(name=col+'_count'))
Lastly, concatenate the newly created dataframes in your list into one:
dat = pd.concat(li,axis=1)
All in one block:
import pandas as pd
cat_cols = [i for i in df.columns if df[i].dtypes == 'O']
li = []
for col in cat_cols:
    li.append(df.groupby([col]).size().reset_index(name=col+'_count'))
dat = pd.concat(li, axis=1)  # use axis=1 so that the concatenation is column-wise
Marital Status Marital Status_count ... City City_count
0 Divorced 4.0 ... Athens 4
1 Married 3.0 ... Berlin 2
2 Single 3.0 ... London 2
3 Widowed 2.0 ... New York 2
4 NaN NaN ... Singapore 2
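If a side-by-side table is not required, a dictionary of per-column counts is a lighter alternative. The following is a minimal sketch on the question's sample data, with the third City value written as CA to match the expected output; treating every non-numeric column as categorical is an assumption, and the name counts is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Maritial_Status": ["Married", "Single", "Single"],
    "Education": ["UG", "PHD", "UG"],
    "City": ["LA", "CA", "CA"],
})

# Treat every non-numeric column as categorical (an assumption).
cat_cols = df.select_dtypes(exclude="number").columns

# One value_counts Series per column, keyed by column name.
counts = {col: df[col].value_counts() for col in cat_cols}

for col, vc in counts.items():
    print(col)
    print(vc.to_string())
```

Each entry can then be inspected on its own, e.g. counts["Education"], without the NaN padding that column-wise concatenation introduces when the columns have different numbers of unique values.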
Using value_counts, you can do the following:
res = (df
.apply(lambda x: x.value_counts()) # column by column value_counts would be applied
.stack()
.reset_index(level=0).sort_index(axis=0)
.rename(columns={'level_0': 'Value', 0: 'value_counts'}))
Another format of the output:
res['Id'] = res.groupby(level=0).cumcount()
res.set_index('Id', append=True)
Explanation:
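After applying value_counts column by column, you get a dataframe indexed by the unique values, with one column per original column and NaN wherever a value does not occur in that column.
Then, using stack, you can remove the NaN and get everything "stacked up", after which you can do the formatting and ordering of the output.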
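A similar long-format result can also be sketched without the intermediate NaN table, by melting the frame first and then counting each (column, value) pair. This is a different route from the stack-based code above, shown on the question's sample data (with the third City value written as CA); the names Column, Value, Count and long_counts are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Maritial_Status": ["Married", "Single", "Single"],
    "Education": ["UG", "PHD", "UG"],
    "City": ["LA", "CA", "CA"],
})

# Reshape to one row per (column, value) pair, then count each pair.
long_counts = (
    df.melt(var_name="Column", value_name="Value")
      .groupby(["Column", "Value"])
      .size()
      .rename("Count")
)
print(long_counts)
```

The result is a Series with a two-level index, so long_counts["Education"] returns the counts for that one column.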
To see the distinct rows (the unique combinations of values across the columns) rather than per-column counts, you can try the drop_duplicates() method:
dataset.drop_duplicates()

How does (DataFrame - Groupby) match rows?

I can't figure out how (DataFrame - Groupby) works.
Specifically, given the following dataframe:
df = pd.DataFrame([['usera',1,100],['usera',5,130],['userc',1,100],['userd',5,100]])
df.columns = ['id','date','sum']
id date sum
0 usera 1 100
1 usera 5 130
2 userc 1 100
3 userd 5 100
Running the code below returns:
df['shift'] = df['date']-df.groupby(['id'])['date'].shift(1)
id date sum shift
0 usera 1 100 NaN
1 usera 5 130 4.0
2 userc 1 100 NaN
3 userd 5 100 NaN
How did Python know that I meant for it to match by the id column?
It doesn't even appear in df['date'].
Let us dissect the command df['shift'] = df['date']-df.groupby(['id'])['date'].shift(1).
1. df['shift'] appends a new column "shift" to the dataframe.
2. df['date'] returns a Series built from the date column of the dataframe.
0 1
1 5
2 1
3 5
Name: date, dtype: int64
3. df.groupby(['id'])['date'].shift(1): groupby(['id']) creates a groupby object. Selecting the date column from that object and calling shift(1) moves each value down one row within its group, so every row gets the previous date of the same id. By the way, this is also a Series, and it keeps the original row index.
df.groupby(['id'])['date'].shift(1)
0 NaN
1 1.0
2 NaN
3 NaN
Name: date, dtype: float64
4. The Series obtained in step 3 is subtracted (element-wise) from the Series obtained in step 2. The two are matched on the row index, not on id, which is why id does not need to appear in df['date']. The result is assigned to the df['shift'] column.
df['date']-df.groupby(['id'])['date'].shift(1)
0 NaN
1 4.0
2 NaN
3 NaN
Name: date, dtype: float64
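The index matching can be checked directly. The following minimal sketch rebuilds the question's dataframe, confirms that the shifted Series carries the same row labels as df, and compares the result with groupby(...).diff(), which computes the same per-group difference in one call. The names prev_date, same_index and the diff column are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [["usera", 1, 100], ["usera", 5, 130], ["userc", 1, 100], ["userd", 5, 100]],
    columns=["id", "date", "sum"],
)

# Previous date within each id group; NaN where a group has no earlier row.
prev_date = df.groupby("id")["date"].shift(1)

# The shifted Series has the same row labels as df, which is what
# the subtraction aligns on.
same_index = prev_date.index.equals(df.index)

df["shift"] = df["date"] - prev_date

# Per-group difference to the previous row, computed in one step.
df["diff"] = df.groupby("id")["date"].diff()
print(same_index)
print(df)
```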
I'm not exactly sure what you are trying to do, but the groupby() method is useful if a column contains several identical values (like your usera) and you want to calculate, for example, the sum(), mean() or max() of all columns or of just one specific column.
E.g. df.groupby(['id'])['sum'].sum() groups the rows by id, selects just the sum column and sums it within each group. For usera that gives 230; with .mean() it would give 115, and so on. It does the same for every other unique id in your id column. For the example above, the output is a single column with just three rows (usera, userc and userd).
Greetz, miGa
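The numbers quoted above can be reproduced with a short sketch on the question's dataframe. Passing a list to agg computes several statistics per group at once; the name summary is only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [["usera", 1, 100], ["usera", 5, 130], ["userc", 1, 100], ["userd", 5, 100]],
    columns=["id", "date", "sum"],
)

# One row per unique id, one column per requested statistic.
summary = df.groupby("id")["sum"].agg(["sum", "mean"])
print(summary)
```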

Pandas COUNTIF based on column value

I am trying to essentially do a COUNTIF in pandas to count how many items in a row match a number in the first column.
Dataframe:
a b c d
1 2 3 1
2 3 4 2
3 5 6 3
So I want to count instances in a row (b,c,d) that match a. In row 1 for instance it should be 1 as only d matches a.
I have searched quite a bit for this, but so far I have only found examples where the comparison is against a common number (like counting all values greater than 0), not against a dataframe column. I'm guessing it's some form of logic that masks based on the column, but df == df.a doesn't seem to work.
You can use eq, which takes an axis parameter to specify the direction of the comparison, then sum across each row to count the matched values (subtracting 1 because column a always matches itself):
df.eq(df.a, axis=0).sum(1) - 1
#0 1
#1 1
#2 1
#dtype: int64
Alternatively, with apply (typically slower, since it runs row by row):
df.apply(lambda x: (x == x.iloc[0]).sum()-1,axis=1)
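A minimal, self-contained sketch of the same count on the question's data follows. As a small variation, it drops column a before comparing, so no subtraction is needed; the name matches is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [2, 3, 5],
    "c": [3, 4, 6],
    "d": [1, 2, 3],
})

# Compare only the other columns against a, row by row (axis=0 aligns
# df["a"] with the row index), then count the True values per row.
matches = df.drop(columns="a").eq(df["a"], axis=0).sum(axis=1)
print(matches.tolist())
```

The plain df == df.a attempt from the question does not do this because the Series is aligned with the column labels rather than the row index, which is what axis=0 changes.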

Summing values in a group of a groupby object?

I am trying to sum up the values of a column in a groupby object for each of the entries by which I grouped.
Say I had a df like this:
Letters Numbers Items Bool
A 1 lamp 1
B 2 glass 1
B 2 table 1
C 5 pic 0
I group by Letters and then want to know the sum of the Bool values in each letter's group. How would I do this? I've been trying
df_new = df.groupby('letters').bool.sum()
...
df_new = df.groupby('letters').sum('bool')
and other variations...
In the end I would like to get a vector that contains the sum for each of the letters' groups. For this example, it would be [1, 2, 0].
You were really close! Given
>>> df
Letters Numbers Items Bool
0 A 1 lamp 1
1 B 2 glass 1
2 B 2 table 1
3 C 5 pic 0
You could sum everything and take the column you want:
>>> # slower
>>> df.groupby("Letters").sum()["Bool"] # sum everything, select Bool
Letters
A 1
B 2
C 0
Name: Bool, dtype: int64
Or better, take only the column you want and sum it:
>>> df.groupby("Letters")["Bool"].sum() # select Bool, sum it
Letters
A 1
B 2
C 0
Name: Bool, dtype: int64
I prefer to stick with the Series, because you can do more with it, but you can convert this to a list using list or .tolist() if you prefer.
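Putting the pieces together, here is a minimal sketch that rebuilds the example frame and produces the vector asked for in the question. The names bool_sums and as_list are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Letters": ["A", "B", "B", "C"],
    "Numbers": [1, 2, 2, 5],
    "Items": ["lamp", "glass", "table", "pic"],
    "Bool": [1, 1, 1, 0],
})

# Select the column first, then sum it within each Letters group.
bool_sums = df.groupby("Letters")["Bool"].sum()

# Plain Python list, ordered by the sorted group keys A, B, C.
as_list = bool_sums.tolist()
print(as_list)
```

Note that the column names are case-sensitive, so groupby('letters') as written in the question raises a KeyError on this frame; it has to be 'Letters'.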
