Unable to include column if all rows are zero in pandas crosstab - python

In pandas crosstab, now I am getting the output as below if the other col contains all zero value:
0
0 5
1 2
But I need to get an output for the other column even if it contains all zero.
0 1
0 5 0
1 2 0
I am using below code to create cross tab:
data_crosstab = pd.crosstab(data[df_all.columns[56]],
data[df_all.columns[57]],
margins = False,dropna=False)

Use DataFrame.reindex:
#margins=False is default value, so removed
#https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html
data_crosstab = pd.crosstab(data[df_all.columns[56]],
data[df_all.columns[57]], dropna=False)
data_crosstab = data_crosstab.reindex(columns=[0,1], fill_value=0)
More general solution:
data_crosstab = data_crosstab.reindex(columns=[0,1],index=[0,1], fill_value=0)

Related

nunique compare two Pandas dataframe with duplicates and pivot them

My input:
df1 = pd.DataFrame({'frame':[ 1,1,1,2,3,0,1,2,2,2,3,4,4,5,5,5,8,9,9,10,],
'label':['GO','PL','ICV','CL','AO','AO','AO','ICV','PL','TI','PL','TI','PL','CL','CL','AO','TI','PL','ICV','ICV'],
'user': ['user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1']})
df2 = pd.DataFrame({'frame':[ 1, 1, 2, 3, 4,0,1,2,2,2,4,4,5,6,6,7,8,9,10,11],
'label':['ICV','GO', 'CL','TI','PI','AO','GO','ICV','TI','PL','ICV','TI','PL','CL','CL','CL','AO','AO','PL','ICV'],
'user': ['user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2']})
df_c = pd.concat([df1,df2])
I trying compare two df, frame by frame, and check if label in df1 existing in same frame in df2. And make some calucation with result (pivot for example)
That my code:
m_df = df1.merge(df2,on=['frame'],how='outer' )
m_df['cross']=m_df.apply(lambda row: 'Matched'
if row['label_x']==row['label_y']
else 'Mismatched', axis='columns')
pv_m_unq= pd.pivot_table(m_df,
columns='cross',
index='label_x',
values='frame',
aggfunc=pd.Series.nunique,fill_value=0,margins=True)
pv_mc = pd.pivot_table(m_df,
columns='cross',
index='label_x',
values='frame',
aggfunc=pd.Series.count,fill_value=0,margins=True)
but this creates a some problem:
first, I can calqulate "simple" total (column All) of matched and missmatched as descipted in picture, or its "duplicated" as AO in pv_m or wrong number as in CL in pv_m_unq
and second, I think merge method as I use int not clever way, because I get if frame+label repetead in df(its happens often), in merged df I get number row in df1 X number of rows in df2 for this specific frame+label
I think maybe there is a smarter way to compare df and pivot them?
You got the unexpected result on margin total because the margin is making use the same function passed to aggfunc (i.e. pd.Series.nunique in this case) for its calculation and the values of Matched and Mismatched in these 2 rows are both the same as 1 (hence only one unique value of 1). (You are currently getting the unique count of frame id's)
Probably, you can achieve more or less what you want by taking the count on them (including margin, Matched and Mismatched) instead of the unique count of frame id's, by using pd.Series.count instead in the last line of codes:
pv_m = pd.pivot_table(m_df,columns='cross',index='label_x',values='frame', aggfunc=pd.Series.count, margins=True, fill_value=0)
Result
cross Matched Mismatched All
label_x
AO 0 1 1
CL 1 0 1
GO 1 1 2
ICV 1 1 2
PL 0 2 2
All 3 5 8
Edit
If all you need is to have the All column being the sum of Matched and Mismatched, you can do it as follows:
Change your code of generating pv_m_unq without building margin:
pv_m_unq= pd.pivot_table(m_df,
columns='cross',
index='label_x',
values='frame',
aggfunc=pd.Series.nunique,fill_value=0)
Then, we create the column All as the sum of Matched and Mismatched for each row, as follows:
pv_m_unq['All'] = pv_m_unq['Matched'] + pv_m_unq['Mismatched']
Finally, create the row All as the sum of Matched and Mismatched for each column and append it as the last row, as follows:
row_All = pd.Series({'Matched': pv_m_unq['Matched'].sum(),
'Mismatched': pv_m_unq['Mismatched'].sum(),
'All': pv_m_unq['All'].sum()},
name='All')
pv_m_unq = pv_m_unq.append(row_All)
Result:
print(pv_m_unq)
Matched Mismatched All
label_x
AO 1 3 4
CL 1 2 3
GO 1 1 2
ICV 2 4 6
PL 1 5 6
TI 2 3 5
All 8 18 26
You can use isin() function like this:
df3 =df1[df1.label.isin(df2.label)]

Append output from code to row it came from in python dataframe

Ive got some code which is working to create the output but id like to tack it back onto the row it came from, I'm struggling to do it with a join because there isn't a unique identifier so ideally would like to clean this step up.
first bit of code creates recent_transactions and then id like to append onto the end of each row the output from class_df
recent_transactions = pd.DataFrame(data)
recent_transactions
class_df = pd.DataFrame(json.loads(i['unknownStatementItem']) for i in recent_transactions.itemClassData)
You can create a unique identifier by using the index entry of a dataframe that has no duplicate index numbers. You can reset the index to guarantee a clean index with no duplicates. Ideally you can use vectorization rather than iteration, but at worst, get the unique index number from the dataframe and use it to find the correct row for setting a new column to the function's output.
row1list = [1, 2]
df = pd.DataFrame([row1list],
columns=['a', 'b'])
df = df.append(df) # duplicate index numbers, so clean that up next
print(df)
# a b
# 0 1 2
# 0 1 2
df = df.reset_index(drop=True).reset_index()
# drop old index with duplicates, make new clean 'index' and make it available as a column
print(df)
# index a b
# 0 0 1 2
# 1 1 1 2
df['result_of_some_function'] = -1 # start with a bogus value, upgrade as appropriate
for i in range(len(df)):
result_of_some_function = i * 2
df.loc[df['index'] == i, 'result_of_some_function'] = result_of_some_function
print(df)
# index a b result_of_some_function
# 0 0 1 2 0
# 1 1 1 2 2

replacing columns values only on specific columns using regex in pandas

I want to replace the values of specific columns. I can change the values one by one but, I have hundreds of columns and I need to change the columns starting with a specific string. Here is an example, I want to replace the string when the column name starts with "Q14"
df.filter(regex = 'Q14').replace(1, 'Selected').replace(0, 'Not selected')
The above code is working. But, how I can implement it in my dataframe? As this is the function so I can't use inplace.
Consider below df:
In [439]: df = pd.DataFrame({'Q14_A':[ 1,0,0,2], 'Q14_B':[0,1,1,2], 'Q12_A':[1,0,0,0]})
In [440]: df
Out[440]:
Q14_A Q14_B Q12_A
0 1 0 1
1 0 1 0
2 0 1 0
3 2 2 0
Filter columns that start with Q14, save it in a variable:
In [443]: cols = df.filter(regex='^Q14').columns
Now, change the above selected columns with your replace commands:
In [446]: df[cols] = df[cols].replace(1, 'Selected').replace(0, 'Not selected')
Output:
In [447]: df
Out[447]:
Q14_A Q14_B Q12_A
0 Selected Not selected 1
1 Not selected Selected 0
2 Not selected Selected 0
3 2 2 0
You can iterate over all columns and based on matched condition apply column transformation using apply command:
for column in df.columns:
if column.startswith("Q"):
df[column] = df[column].apply(lambda x: "Selected" if x == 1 else "Not selected")
Using pandas.Series.replace dict
df = pd.DataFrame({'Q14_A':[ 1,0,0,2], 'Q14_B':[0,1,1,2], 'Q12_A':[1,0,0,0]})
cols = df.filter(regex='^Q14').columns
replace_map = {
1: "Selected",
0 : "Not Selected"
}
df[cols] = df[cols].replace(replace_map)

How to process column names and create new columns

This is my pandas DataFrame with original column names.
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt
1 3 0 0
2 1 1 5
Firstly I want to extract all unique variations of cm, e.g. in this case cm1 and cm2.
After this I want to create a new column per each unique cm. In this example there should be 2 new columns.
Finally in each new column I should store the total count of non-zero original column values, i.e.
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt cm1 cm2
1 3 0 0 2 0
2 1 1 5 2 1
I implemented the first step as follows:
cols = pd.DataFrame(list(df.columns))
ind = [c for c in df.columns if 'cm' in c]
df.ix[:, ind].columns
How to proceed with steps 2 and 3, so that the solution is automatic (I don't want to manually define column names cm1 and cm2, because in original data set I might have many cm variations.
You can use:
print df
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt
0 1 3 0 0
1 2 1 1 5
First you can filter columns contains string cm, so columns without cm are removed.
df1 = df.filter(regex='cm')
Now you can change columns to new values like cm1, cm2, cm3.
print [cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm']
['cm1', 'cm1', 'cm2']
df1.columns = [cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm']
print df1
cm1 cm1 cm2
0 1 3 0
1 2 1 1
Now you can count non - zero values - change df1 to boolean DataFrame and sum - True are converted to 1 and False to 0. You need count by unique column names - so groupby columns and sum values.
df1 = df1.astype(bool)
print df1
cm1 cm1 cm2
0 True True False
1 True True True
print df1.groupby(df1.columns, axis=1).sum()
cm1 cm2
0 2 0
1 2 1
You need unique columns, which are added to original df:
print df1.columns.unique()
['cm1' 'cm2']
Last you can add new columns by df[['cm1','cm2']] from groupby function:
df[df1.columns.unique()] = df1.groupby(df1.columns, axis=1).sum()
print df
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt cm1 cm2
0 1 3 0 0 2 0
1 2 1 1 5 2 1
Once you know which columns have cm in them you can map them (with a dict) to the desired new column with an adapted version of this answer:
col_map = {c:'cm'+c[c.index('cm') + len('cm')] for c in ind}
# ^ if you are hard coding this in you might as well use 2
so that instead of the string after cm it is cm and the character directly following, in this case it would be:
{'old_dm_cm1': 'cm1', 'old_dt_cm1_tt': 'cm1', 'old_rr_cm2_epf': 'cm2'}
Then add the new columns to the DataFrame by iterating over the dict:
for col,new_col in col_map.items():
if new_col not in df:
df[new_col] =[int(a!=0) for a in df[col]]
else:
df[new_col]+=[int(a!=0) for a in df[col]]
note that int(a!=0) will simply give 0 if the value is 0 and 1 otherwise. The only issue with this is because dicts are inherently unordered it may be preferable to add the new columns in order according to the values: (like the answer here)
import operator
for col,new_col in sorted(col_map.items(),key=operator.itemgetter(1)):
if new_col in df:
df[new_col]+=[int(a!=0) for a in df[col]]
else:
df[new_col] =[int(a!=0) for a in df[col]]
to ensure the new columns are inserted in order.

Drop rows if value in a specific column is not an integer in pandas dataframe

If I have a dataframe and want to drop any rows where the value in one column is not an integer how would I do this?
The alternative is to drop rows if value is not within a range 0-2 but since I am not sure how to do either of them I was hoping someonelse might.
Here is what I tried but it didn't work not sure why:
df = df[(df['entrytype'] != 0) | (df['entrytype'] !=1) | (df['entrytype'] != 2)].all(1)
There are 2 approaches I propose:
In [212]:
df = pd.DataFrame({'entrytype':[0,1,np.NaN, 'asdas',2]})
df
Out[212]:
entrytype
0 0
1 1
2 NaN
3 asdas
4 2
If the range of values is as restricted as you say then using isin will be the fastest method:
In [216]:
df[df['entrytype'].isin([0,1,2])]
Out[216]:
entrytype
0 0
1 1
4 2
Otherwise we could cast to a str and then call .isdigit()
In [215]:
df[df['entrytype'].apply(lambda x: str(x).isdigit())]
Out[215]:
entrytype
0 0
1 1
4 2
str("-1").isdigit() is False
str("-1").lstrip("-").isdigit() works but is not nice.
df.loc[df['Feature'].str.match('^[+-]?\d+$')]
for your question the reverse set
df.loc[ ~(df['Feature'].str.match('^[+-]?\d+$')) ]
We have multiple ways to do the same, but I found this method easy and efficient.
Quick Examples
#Using drop() to delete rows based on column value
df.drop(df[df['Fee'] >= 24000].index, inplace = True)
# Remove rows
df2 = df[df.Fee >= 24000]
# If you have space in column name
# Specify column name with in single quotes
df2 = df[df['column name']]
# Using loc
df2 = df.loc[df["Fee"] >= 24000 ]
# Delect rows based on multiple column value
df2 = df[ (df['Fee'] >= 22000) & (df['Discount'] == 2300)]
# Drop rows with None/NaN
df2 = df[df.Discount.notnull()]

Categories