I have a dataframe that I'm working with in pandas. It has two columns, and I want to determine whether they are not equal. An example of the data is as follows:
        A   B  Compare
1002  3.1  31  Not Equal
1003    5   5
1004    1   3  Not Equal
I want rows like the first one (1002) to show as equal because they contain the same numbers. Both columns A and B are float64 data types.
I have tried the following:
df['column_a'].replace('.','')
And I've also attempted to find a way to multiply a number by 10 on the condition that the value is not an integer (3.1, 2.2, 1.4, etc).
I believe I could also accomplish the same desired end result by taking all values that are greater than 5 in column B and divide them by 10. I only care about values 0 through 5. The only values I'm going to see above 5 can be divided by 10.
This is what I tried doing to accomplish that but I get an error (TypeError: invalid type comparison):
df['column_b'] = np.where(df['column_b'] > 5, /10,'')
What would be the best way to make the values equal in column A and B for row 1002?
This is worth a try:
df['Compare'] = df['A'].str.replace(".","").astype(int).eq(df['B'])
You were going in the right direction; just add .astype() and use .eq().
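Since both columns are float64, a purely numeric route may be simpler than the string round-trip. Here is a minimal sketch, assuming (as stated in the question) that the only values in B above 5 are ones that should be divided by 10:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [3.1, 5.0, 1.0], 'B': [31.0, 5.0, 3.0]},
                  index=[1002, 1003, 1004])

# Scale B back into the 0-5 range where it is out of range, then compare
# with a float-safe tolerance instead of relying on exact equality.
b_scaled = np.where(df['B'] > 5, df['B'] / 10, df['B'])
df['Compare'] = np.where(np.isclose(df['A'], b_scaled), '', 'Not Equal')
print(df)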
I have a dataframe called df that looks like this:
Date Reading1 Reading2 Reading3 Reading4
2000-05-01 15 13 14 11
2000-05-02 15 14 18 9
2000-05-03 14 12 15 8
2000-05-04 17 11 16 13
I used df.set_index('Date') to make the date the index.
I have 3 questions.
1) How do I display the number of days that had a reading greater than 13 anywhere in the data frame, not just in a single column?
I tried df.[(df.Reading1:df.Reading4>13)].shape[0] but obviously the syntax is wrong.
2) How do I display the values that happened on 2000-05-03 for columns Readings 1, 3, and 4?
I tried df.loc[["20000503"], ["Reading1", "Reading3", "Reading4"]]
but I got the error "None of [Index(['20000503'], dtype='object')] are in the [index]"
3) How do I display the dates for which the values in column Reading1 are twice as much as those in column Reading2? And how do I display those values (the ones in Reading1 that are twice as big) as well?
I have no idea where to even start this one.
Try this:
1. (df > 13).any(axis=1).sum()
Create a boolean dataframe, then check whether any value along each row is True, and sum the rows to get the number of days.
2. df.loc['2000-05-03', ['Reading1', 'Reading3', 'Reading4']]
Use partial string indexing on the DatetimeIndex to get a day, then filter columns with a list of column headers.
3. df.loc[df['Reading1'] > (df['Reading2'] * 2)].index
df.loc[df['Reading1'] > (df['Reading2'] * 2)].to_numpy().tolist()
Create a boolean series for boolean indexing and take the index to return the dates. Then convert the dataframe to a NumPy array and call tolist to get the values.
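For completeness, here is a self-contained sketch of all three answers on the sample data from the question (note that in this sample no Reading1 value is more than twice the corresponding Reading2, so part 3 returns empty results):
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2000-05-01', '2000-05-02',
                                           '2000-05-03', '2000-05-04']),
                   'Reading1': [15, 15, 14, 17],
                   'Reading2': [13, 14, 12, 11],
                   'Reading3': [14, 18, 15, 16],
                   'Reading4': [11, 9, 8, 13]}).set_index('Date')

# 1) number of days with at least one reading above 13
print((df > 13).any(axis=1).sum())

# 2) Readings 1, 3 and 4 on 2000-05-03 (partial string indexing on the DatetimeIndex)
print(df.loc['2000-05-03', ['Reading1', 'Reading3', 'Reading4']])

# 3) dates (and row values) where Reading1 is more than twice Reading2
mask = df['Reading1'] > (df['Reading2'] * 2)
print(df.loc[mask].index)
print(df.loc[mask].to_numpy().tolist())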
In pandas, axis=0 represents rows and axis=1 represents columns.
Therefore, to get the sum of values in each row, df.sum(axis=0) is called.
But it returns the sum of values in each column, and vice versa. Why?
import pandas as pd
df=pd.DataFrame({"x":[1,2,3,4,5],"y":[2,4,6,8,10]})
df.sum(axis=0)
Dataframe:
x y
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
Output:
x 15
y 30
Expected Output:
0 3
1 6
2 9
3 12
4 15
I think the right way to interpret the axis parameter is what axis you sum 'over' (or 'across'), rather than the 'direction' the sum is computed in. Specifying axis = 0 computes the sum over the rows, giving you a total for each column; axis = 1 computes the sum across the columns, giving you a total for each row.
I was reading the source code of the pandas project, and I think this comes from NumPy: that library uses axis the same way (0 sums vertically and 1 horizontally), and pandas uses NumPy under the hood to perform the sum.
In this link you can check that pandas uses the numpy.cumsum function to compute the sum.
And this link is for the NumPy documentation.
If you are looking for a way to remember how to use the axis parameter, the answer by 'anant' is a good approach: interpret the sum as being taken over the axis rather than across it. So when 0 is specified you are computing the sum over the rows (iterating over the index, to be more compliant with the pandas docs). When axis is 1 you are iterating over the columns.
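A short sketch of both directions on the dataframe from the question:
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 10]})

print(df.sum(axis=0))  # sum over the rows: one total per column (x 15, y 30)
print(df.sum(axis=1))  # sum across the columns: one total per row (3, 6, 9, 12, 15)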
UPDATE:
Please download my full dataset here.
my datatype is:
>>> df.dtypes
increment int64
spread float64
SYM_ROOT category
dtype: object
I have realized that the problem might have been caused by the fact that my SYM_ROOT is a category variable.
To replicate the issue you might want to do the following first:
df=pd.read_csv("sf.csv")
df['SYM_ROOT']=df['SYM_ROOT'].astype('category')
But I am still puzzled as to why my SYM_ROOT results in the gaps in increment being filled with NaN, unless grouping by a category and an integer value produces a balanced panel by default.
I noticed that the behaviour of pd.groupby().last is different from that of pd.groupby().tail(1).
For example, suppose I have the following data:
increment is an integer that spans from 0 to 4680. However, for some SYM_ROOT variable, there are gaps in between. For example, 4 could be missing from it.
What I want to do is to keep the last observation per group.
If I do df.groupby(['SYM_ROOT','increment']).last(), the dataframe becomes:
While if I do df.groupby(['SYM_ROOT','increment']).tail(1), the dataframe becomes:
It looks to me that the last() statement will create a balanced time-series data and fill in the gaps with NaN, while the tail(1) statement doesn't. Is it correct?
Update:
Your increment column is a category:
df=pd.DataFrame({'A':[1,1,2,2],'B':[1,1,2,3],'C':[1,1,1,1]})
df.B=df.B.astype('category')
df.groupby(['A','B']).last()
Out[590]:
C
A B
1 1 1.0
2 NaN
3 NaN
2 1 NaN
2 1.0
3 1.0
When you use tail it will not make up the missing levels, since tail operates on the dataframe as a whole rather than on single columns:
df.groupby(['A','B']).tail(1)
Out[593]:
A B C
1 1 1 1
2 2 2 1
3 2 3 1
After changing it back with astype:
df.B=df.B.astype('int')
df.groupby(['A','B']).last()
Out[591]:
C
A B
1 1 1
2 2 1
3 1
It is actually an issue here on GitHub, where the problem is mainly caused by groupby on categorical columns producing every category level, even the unobserved ones.
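If you need to keep the column as a category, one option (assuming a pandas version where groupby accepts the observed argument) is to restrict the result to the combinations that actually occur in the data:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 1, 2, 3], 'C': [1, 1, 1, 1]})
df.B = df.B.astype('category')

# observed=True keeps only the (A, B) pairs present in the data,
# so no NaN rows are created for the missing category levels.
print(df.groupby(['A', 'B'], observed=True).last())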
I have a dataframe with 4 columns an ID and three categories that results fell into
<80% 80-90 >90
id
1 2 4 4
2 3 6 1
3 7 0 3
I would like to convert it to percentages, i.e.:
<80% 80-90 >90
id
1 20% 40% 40%
2 30% 60% 10%
3 70% 0% 30%
This seems like it should be within pandas' capabilities, but I just can't figure it out.
Thanks in advance!
You can do this using basic pandas operators .div and .sum, using the axis argument to make sure the calculations happen the way you want:
cols = ['<80%', '80-90', '>90']
df[cols] = df[cols].div(df[cols].sum(axis=1), axis=0).multiply(100)
Calculate the sum of each row (df[cols].sum(axis=1)); axis=1 makes the summation happen across the columns, giving one total per row.
Divide the dataframe by the resulting series (df[cols].div(df[cols].sum(axis=1), axis=0)); axis=0 aligns the row totals with the dataframe's index, so every value in a row is divided by that row's total.
To finish, multiply the results by 100 so they are percentages between 0 and 100 instead of proportions between 0 and 1 (or you can skip this step and store them as proportions).
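A self-contained sketch on the sample data from the question:
import pandas as pd

df = pd.DataFrame({'<80%': [2, 3, 7], '80-90': [4, 6, 0], '>90': [4, 1, 3]},
                  index=pd.Index([1, 2, 3], name='id'))

cols = ['<80%', '80-90', '>90']
df[cols] = df[cols].div(df[cols].sum(axis=1), axis=0).multiply(100)
print(df)  # id 1 -> 20.0, 40.0, 40.0; id 2 -> 30.0, 60.0, 10.0; id 3 -> 70.0, 0.0, 30.0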
df/df.sum()
If you want to divide by the sums of the rows instead, transpose it first.
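For example, a small sketch of the transpose round-trip on the same sample data (equivalent to the .div(..., axis=0) approach above):
import pandas as pd

df = pd.DataFrame({'<80%': [2, 3, 7], '80-90': [4, 6, 0], '>90': [4, 1, 3]},
                  index=pd.Index([1, 2, 3], name='id'))

pct = (df.T / df.T.sum()).T * 100  # divide each row by its own total
print(pct)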
You could use the .apply() method:
df = df.apply(lambda x: x/sum(x)*100, axis=1)
Tim Tian's answer pretty much worked for me, but maybe this helps if you have a df with several columns and want to compute the % column-wise.
df_pct = df/df[df.columns].sum()*100
I was having trouble because I wanted to have the result of a pd.pivot_table expressed as a %, but couldn't get it to work. So I just used that code on the resulting table itself and it worked.
I'm relatively new to Python and totally new to Pandas, so my apologies if this is really simple. I have a dataframe, and I want to operate over all elements in a particular column, but only if a different column with the same index meets a certain criteria.
float_col int_col str_col
0 0.1 1 a
1 0.2 2 b
2 0.2 6 None
3 10.1 8 c
4 NaN -1 a
For example, if the value in float_col is greater than 5, I want to multiply the value in int_col (in the same row) by 2. I'm guessing I'm supposed to use one of the map, apply, or applymap functions, but I'm not sure which, or how.
There might be more elegant ways to do this, but once you understand how to use things like loc to get at a particular subset of your dataset, you can do it like this:
df.loc[df['float_col'] > 5, 'int_col'] = df.loc[df['float_col'] > 5, 'int_col'] * 2
You can also do it a bit more succinctly like this, since pandas is smart enough to match up the results based on the index of your dataframe and only use the relevant data from the df['int_col'] * 2 expression:
df.loc[df['float_col'] > 5, 'int_col'] = df['int_col'] * 2
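A runnable sketch on the sample data from the question, using the loc approach above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'float_col': [0.1, 0.2, 0.2, 10.1, np.nan],
                   'int_col': [1, 2, 6, 8, -1],
                   'str_col': ['a', 'b', None, 'c', 'a']})

# Double int_col only where float_col is greater than 5.
mask = df['float_col'] > 5
df.loc[mask, 'int_col'] = df.loc[mask, 'int_col'] * 2
print(df)  # only the row with float_col == 10.1 changes (int_col 8 -> 16)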