I have a CSV file with many columns. I want to replace the values 0, 1, 2 with null, 0, 1 respectively. My code works, but there is an issue: I think it is also replacing the 1, 2 and 0 in the date column, which I don't want. Below is my code:
df1 = df.replace(to_replace = [0,1,2], value = [np.nan,0, 1])
The above code replaces the given values throughout my df. I am using df1 in pivot_table, and when I check the output file it does not show the Date column (even though I put "Date" in the index inside pivot_table). Can anybody help?
Use a dictionary:
df1 = df.replace({0:float('nan'), 1:0, 2:1})
To limit the replacement to specific columns:
df1 = df.copy()
# assign the replaced columns back directly (DataFrame.update would skip the NaN replacements)
df1[['col1', 'col2']] = df1[['col1', 'col2']].replace({0: float('nan'), 1: 0, 2: 1})
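If the aim is simply to leave the date column untouched, a variant of the same idea (a sketch, assuming the column is literally named 'Date') is to replace on every column except it:
cols = df.columns.difference(['Date'])   # every column except the assumed 'Date' column
df1 = df.copy()
df1[cols] = df1[cols].replace({0: float('nan'), 1: 0, 2: 1})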
I have a pandas dataframe like this:
df:
uid maskeduid
VEH12345L0 72647hghghghg
VEH12323L3 hgh5454jjbjjb
VEH11145M4 jhj24j3j5bjnj
VEH78345L3 12kjkkndw2knk
VEH31345N3n 145jhjhjbvrkl
I want to get a dataframe df1 containing only the rows where the 9th character of the uid value is 'L':
df1:
uid maskeduid
VEH12345L0 72647hghghghg
VEH12323L3 hgh5454jjbjjb
VEH78345L3 12kjkkndw2knk
How can I achieve this using a pandas built-in function?
Python counts from 0, so to select the 9th character use str[8], compare with 'L' using Series.eq, and filter with boolean indexing:
df1 = df[df['uid'].str[8].eq('L')]
print (df1)
uid maskeduid
0 VEH12345L0 72647hghghghg
1 VEH12323L3 hgh5454jjbjjb
3 VEH78345L3 12kjkkndw2knk
I've done a dataframe aggregation and I want to add a new column that contains 1 if the row has a value > 0 for year 2020, otherwise 0.
This is my code, and the head of the dataframe:
df['year'] = pd.DatetimeIndex(df['TxnDate']).year # add column year
df['client'] = df['Customer'].str.split(' ').str[:3].str.join(' ') # add column with the first 3 words
Datedebut = df['year'].min()
Datefin = df['year'].max()
#print(df)
df1 = df.groupby(['client','year']).agg({'Amount': ['sum']}).unstack()
print(df1)
df1['nb2020']= np.where( df1['year']==2020, 1, 0)
Printing df1 before the last line shows the aggregated table as expected, but the last line raises the error: KeyError: 'year'.
Thanks.
When you performed the aggregation and unstacked (df.groupby(['client','year']).agg({'Amount': ['sum']}).unstack()), the values of the year column were expanded into columns, and those columns form a MultiIndex. You can inspect it by calling:
print (df1.columns)
And then you can select them.
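For example, assuming the data contains the years 2019 and 2020, the printed columns would look roughly like:
MultiIndex([('Amount', 'sum', 2019),
            ('Amount', 'sum', 2020)],
           names=[None, None, 'year'])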
Using the MultiIndex column
So to select the column which corresponds to 2020, you can use:
df1.loc[:, df1.columns.get_level_values(2).isin({2020})]
You can then get the correct column and check whether 2020 has a non-zero value using:
df1['nb2020'] = df1.loc[:,df1.columns.get_level_values('year').isin({2020})] > 0
If you would like 1 and 0 instead of booleans, you can convert to int using astype.
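For example (a sketch, assuming there is exactly one 2020 column after the aggregation):
mask_2020 = df1.columns.get_level_values('year').isin({2020})
# take the single matching column, compare to 0 and cast the booleans to 0/1
df1['nb2020'] = (df1.loc[:, mask_2020].iloc[:, 0] > 0).astype(int)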
Renaming the columns
If you think this is a bit complicated, you might prefer to flatten the columns to a single index, using something like:
df1.columns = df1.columns.get_level_values('year')
Or
df1.columns = df1.columns.get_level_values(2)
And then
df1['nb2020'] = (df1[2020] > 0).astype(int)
Windows 10, Python 3.6
I have a dataframe df
df=pd.DataFrame({'name':['boo', 'foo', 'too', 'boo', 'roo', 'too'],
'zip':['30004', '02895', '02895', '30750', '02895', '02895']})
I want to find the repeated records that have the same 'name' and 'zip', and record how many times they repeat. The ideal output is:
name repeat zip
0 too 1 02895
Because my dataframe has many more than six rows, I need an iterative method. I appreciate any tips.
I believe you need to group by all columns and use GroupBy.size:
#create the DataFrame from an online source
#df = pd.read_csv('someonline.csv')
#df = pd.read_html('someurl')[0]
#or build it in a loop and create the DataFrame from a list
#L = []
#for x in iterator:
#    L.append(x)
#df = pd.DataFrame(L)
df = df.groupby(df.columns.tolist()).size().reset_index(name='repeat')
#or, if you need to specify the columns explicitly:
#df = df.groupby(['name','zip']).size().reset_index(name='repeat')
print (df)
name zip repeat
0 boo 30004 1
1 boo 30750 1
2 foo 02895 1
3 roo 02895 1
4 too 02895 2
Pandas has a handy .duplicated() method that can help you identify duplicates.
df.duplicated()
By passing the duplicate vector into a selection you can get the duplicate record:
df[df.duplicated()]
You can count the duplicated records by chaining .sum():
df.duplicated().sum()
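If you want something closer to the name/repeat/zip output from the question, one sketch (assuming "repeat" means occurrences minus one, as in the example) is to keep only the duplicated pairs and count them:
# keep only rows whose (name, zip) pair occurs more than once
dups = df[df.duplicated(['name', 'zip'], keep=False)]
# occurrences per pair, minus one to count repeats rather than rows
print(dups.groupby(['name', 'zip']).size().sub(1).reset_index(name='repeat'))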
I'm trying to filter a dataframe based on the values within multiple columns, using a single condition, while keeping other columns to which I don't want to apply the filter at all.
I've reviewed these answers, with the third being the closest, but still no luck:
how do you filter pandas dataframes by multiple columns
Filtering multiple columns Pandas
Python Pandas - How to filter multiple columns by one value
Setup:
import pandas as pd
df = pd.DataFrame({
'month':[1,1,1,2,2],
'a':['A','A','A','A','NONE'],
'b':['B','B','B','B','B'],
'c':['C','C','C','NONE','NONE']
}, columns = ['month','a','b','c'])
l = ['month','a','c']
df = df.loc[df['month'] == df['month'].max(), df.columns.isin(l)].reset_index(drop = True)
Current Output:
month a c
0 2 A NONE
1 2 NONE NONE
Desired Output:
month a
0 2 A
1 2 NONE
I've tried:
sub = l[1:]
df = df[(df.loc[:, sub] != 'NONE').any(axis = 1)]
and many other variations (.all(), [sub, :], ~df.loc[...], (axis = 0)), but all with no luck.
Basically I want to drop any column (within the sub list) that has all 'NONE' values in it.
Any help is much appreciated.
You first want to substitute 'NONE' with np.nan so that it is recognized as a null value by dropna. Then use loc with your boolean series and column subset, and finally dropna with axis=1 and how='all':
df.replace('NONE', np.nan) \
.loc[df.month == df.month.max(), l].dropna(axis=1, how='all')
month a
3 2 A
4 2 NONE
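Note that the replace also turns the remaining 'NONE' in column a into NaN; if you want the original string back, you can chain a fillna at the end (a sketch):
df.replace('NONE', np.nan) \
  .loc[df.month == df.month.max(), l] \
  .dropna(axis=1, how='all') \
  .fillna('NONE')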
I have a dataframe where I want to group by some column, and then count the number of rows that have some exact string match for some other column. Assume all dtypes are 'object'.
In pseudo-code I'm looking for something like:
df.groupby('A').filter(x['B'] == '0').size()
I want to group by column 'A', then count the number of rows of column 'B' that have an exact string match to the string '0'.
Edit: I found an inelegant solution:
def counter(group):
    # count the items in the group that equal the string '0'
    i = 0
    for item in group:
        if item == '0':
            i = i + 1
    return i
df.groupby('A')['B'].agg(counter)
There must be a better way.
I don't see much wrong with the solution you proposed in your question. If you wanted to make it a one-liner you could do the following:
import numpy as np
import pandas as pd

data = np.array(list('abcdefabc')).reshape((3, 3))
df = pd.DataFrame(data, columns=list('ABC'))
df
A B C
0 a b c
1 d e f
2 a b c
df.groupby('A').agg(lambda x:list(x).count('c'))
B C
A
a 0 2
d 0 0
This has the advantage of giving the counts for every column in the original dataframe.
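Applied to the question's original columns (a sketch, assuming column 'B' holds the strings to match), a slightly more direct variant is to sum a boolean comparison per group:
df.groupby('A')['B'].agg(lambda s: (s == '0').sum())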
Try creating a temporary column which indicates whether the value is zero or not, and then make a pivot table based on this column. Hope this helps; let me know if it worked.
import pandas as pd
df=pd.DataFrame({'A':['one','one','one','one','one','one','one','two','two','two','two','two','two','two'],'B':[1,2,3,0,2,3,0,2,3,2,0,3,44,55]})
# create a new column indicating whether the value is zero or not
df['C'] = df['B'].apply(lambda x: 'EQUALS_ZERO' if x == 0 else 'NOT_EQUAL_ZERO')
# make a pivot table based on that column; this gives counts for both ==0 and !=0
x = pd.pivot_table(df, index=['A'], values='B', columns='C', aggfunc='count', fill_value=0)
print(x)
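With the sample data above, the printed pivot table should look roughly like this:
C    EQUALS_ZERO  NOT_EQUAL_ZERO
A
one            2               5
two            1               6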