Closed. This question needs to be more focused. It is not currently accepting answers. (Closed 1 year ago.)
I wanted to know if there's a way to remove an entire row of a DataFrame given a condition (an outlier) in one column.
df:

Name    Value
Tree    300
Orange  50
Apple   75
Mango   60
Cherry  1
In this case I would want to remove Cherry and Tree since they're outliers.
This is quite a subjective question; there is no single definitive way to do it. In fact there are multiple ways. Use descriptive statistics to build the logic and then code it.
In this case I define the lower and upper limits and then filter using a pandas mask; passing inclusive='neither' to between() makes the bounds strict, so Cherry's value of exactly 1 is treated as an outlier too. Code below:

lower_limit = 1
upper_limit = 100
df = df.assign(Value=df['Value'].where(
    df['Value'].between(lower_limit, upper_limit, inclusive='neither'))).dropna()

     Name  Value
1  Orange   50.0
2   Apple   75.0
3   Mango   60.0
df[df.apply(lambda row: row['Value'] > lower_limit and row['Value'] < upper_limit, axis=1)]
apply() here returns a boolean Series indicating whether each df['Value'] lies strictly between the limits (i.e., is not an outlier); df['Value'].between(lower_limit, upper_limit, inclusive='neither') has the same effect. A DataFrame accepts boolean indexing: rows aligned with False are dropped and rows aligned with True are kept.
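If you would rather derive the limits from the data instead of hard-coding them, a common choice is the IQR rule. Below is a minimal sketch (the 1.5 multiplier is the conventional default, not something from the question):

import pandas as pd

df = pd.DataFrame({'Name': ['Tree', 'Orange', 'Apple', 'Mango', 'Cherry'],
                   'Value': [300, 50, 75, 60, 1]})

# flag anything more than 1.5 * IQR outside the quartiles as an outlier
q1, q3 = df['Value'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df['Value'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(df[mask])  # Tree (300) and Cherry (1) are dropped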
Closed. This question needs to be more focused. It is not currently accepting answers. (Closed 2 years ago.)
I have a dataframe that (simplified) looks something like:
col1 col2
1 a
2 b
3 c,ddd,ee,f,5,hfsf,a
In col2, I need to keep only the first three comma-separated values (dropping everything from the third comma on), and if a value doesn't have commas just keep it as is:
col1 col2
1 a
2 b
3 c,ddd,ee
Again, this is simplified; the solution needs to scale to something that has thousands of rows, and the length of the text between commas will not always be the same.
Edit: this got me on the right track (using str[:3] rather than str[:2], so that three values are kept):
df.col2 = df.col2.str.split(',').str[:3].str.join(',')
Pandas provides access to many familiar string functions, including slicing and selection, through the .str attribute:
df.col2.str.split(',').str[:3].str.join(',')
#0 a
#1 b
#2 c,ddd,ee
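For a self-contained check, here is a minimal runnable version using the sample frame from the question (the DataFrame construction is mine, not from the original post):

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': ['a', 'b', 'c,ddd,ee,f,5,hfsf,a']})

# Keep at most the first three comma-separated fields; values without
# commas pass through unchanged because split() yields a one-element list.
df['col2'] = df['col2'].str.split(',').str[:3].str.join(',')
print(df)
#    col1      col2
# 0     1         a
# 1     2         b
# 2     3  c,ddd,ee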
Closed. This question needs debugging details. It is not currently accepting answers. (Closed 3 years ago.)
Pandas: I have a dataframe, given below, which contains the same set of banks twice. I need to slice the data from the 0th index, which contains a bank name, up to the index that contains that same bank name again (here, DEUTSCH BANK AG). I need to be able to apply the same logic to any dataframe of this kind. Thanks.
I tried the logic df25.iloc[0,1] == df25[1].any(), but it only returns True, not the index position.
DataFrame: https://i.stack.imgur.com/iJ1hJ.png, https://i.stack.imgur.com/J2aDX.png
You need to get the indices of all the rows that hold the value you are looking for (in this case the bank name) and then slice the data frame using those indices.
Example:
import pandas as pd

df = pd.DataFrame({'Col1': list('abcdeafgbfhi')})
search_str = 'b'

# index positions of every row whose value matches the search string
idx_list = list(df[df['Col1'] == search_str].index.values)
print(df[idx_list[0]:idx_list[1]])
Output:
Col1
1 b
2 c
3 d
4 e
5 a
6 f
7 g
Note that the assumption is that there will be only two rows with the same value. If there are more than two, you have to play with the index list values to get what you need; a sketch for that case follows below. Hope this helps.
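If the value can occur more than twice, one sketch of that "play with the index list" step (assuming the default integer index, as in the example above) is to pair consecutive occurrences and slice between each pair:

import pandas as pd

df = pd.DataFrame({'Col1': list('abcdeafgbfhi')})
search_str = 'b'

# positions of every occurrence of the search value
hits = df.index[df['Col1'] == search_str].tolist()

# slice between each consecutive pair of occurrences
for start, end in zip(hits, hits[1:]):
    print(df.iloc[start:end])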
Keep in mind that posting a sample data set will always help you get more answers, as people tend to move on to another question when they see images or screenshots, because reproducing the issue then involves extra steps.
Closed. This question needs to be more focused. It is not currently accepting answers. (Closed 3 years ago.)
I have a pandas DataFrame and want to do a calculation based on an existing column. However, the apply() function is not working for some reason. It's something like, let's say:
df = pd.DataFrame({'Age': age, 'Input': input})
and the input column is something like [1.10001, 1.49999, 1.60001]
Now I want to add a new column to the Dataframe, that is doing the following:
Add 0.0001 to each element in column
Multiply each value by 10
Transform each value of new column to int
Use Series.add, Series.mul and Series.astype:
# input is a Python builtin, so better not to use it as a variable name
inp = [1.10001, 1.49999, 1.60001]
age = [10,20,30]
df = pd.DataFrame({'Age': age, 'Input': inp})
df['new'] = df['Input'].add(0.0001).mul(10).astype(int)
print (df)
Age Input new
0 10 1.10001 11
1 20 1.49999 15
2 30 1.60001 16
You could make a simple function and then apply it by row.
def f(row):
    # note the column name is 'Input' (capitalized), matching the DataFrame above
    return int((row['Input'] + 0.0001) * 10)

df['new'] = df.apply(f, axis=1)
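For what it's worth, the same result can be written with plain vectorized arithmetic, avoiding the per-row Python call that apply makes (an equivalent alternative, not from the original answers):

df['new'] = ((df['Input'] + 0.0001) * 10).astype(int)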
Closed. This question needs to be more focused. It is not currently accepting answers. (Closed 4 years ago.)
I want to groupby and sum on a dataframe. The standard groupby function groups exactly identical strings together, but I need this to be done on similar strings. For example:
United States | 10
Germnay | 23
Unaited Staetes | 20
Germany | 21
Germanny | 32
Uniited Staites | 30
This should result in:
United States 60
Germnay 76
The exact spelling of the names is not that important; the sums of the values are.
Thanks a lot :)
EDIT:
Perhaps it would be simpler to create an ID column that gives the same id for similar countries. I can just groupby on that then.
Not a full solution, but a hack that might help if you are doing something quick and dirty:
lowercase the country names
remove vowels from the country names
remove consecutive occurrences of the same consonant
After you transform the data this way, you can use a normal groupby and it should work pretty well; a sketch follows below. I suggest this since your data appears to be country names entered by users.
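A minimal sketch of that normalization, using the sample data from the question (the two regexes are my interpretation of the steps above, not code from the original answer):

import re
import pandas as pd

df = pd.DataFrame({'country': ['United States', 'Germnay', 'Unaited Staetes',
                               'Germany', 'Germanny', 'Uniited Staites'],
                   'value': [10, 23, 20, 21, 32, 30]})

def normalize(name):
    s = name.lower()
    s = re.sub(r'[aeiou]', '', s)    # drop vowels
    s = re.sub(r'(.)\1+', r'\1', s)  # collapse consecutive repeats
    return s

df['key'] = df['country'].map(normalize)
print(df.groupby('key', as_index=False)
        .agg(country=('country', 'first'), value=('value', 'sum')))

All three Germany variants normalize to 'grmny' and all three United States variants to 'ntd sts', so the sums come out to 76 and 60 as desired.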
Another idea, as a preprocessing step: use a spelling corrector trained on country names to guess the intended country from the wrong spelling (https://norvig.com/spell-correct.html), transform each row of the data with it, and then use a normal groupby.
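In the spirit of the ID-column idea from the question's edit, here is a hedged sketch that uses difflib from the standard library instead of a trained spell corrector; the 0.75 similarity cutoff is an assumption that would need tuning on real data:

import difflib
import pandas as pd

df = pd.DataFrame({'country': ['United States', 'Germnay', 'Unaited Staetes',
                               'Germany', 'Germanny', 'Uniited Staites'],
                   'value': [10, 23, 20, 21, 32, 30]})

seen = []  # canonical names encountered so far

def to_canonical(name):
    # map each name to the closest previously seen name, if one is similar enough
    match = difflib.get_close_matches(name, seen, n=1, cutoff=0.75)
    if match:
        return match[0]
    seen.append(name)
    return name

df['key'] = df['country'].map(to_canonical)
print(df.groupby('key')['value'].sum())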
Closed. This question needs details or clarity. It is not currently accepting answers. (Closed 4 years ago.)
I am trying to drop NaN values using the dropna() method provided by pandas. I've read the documentation and looked at other StackOverflow posts, but I still could not fix the error.
My code first reads an Excel file. If a row has the value "-", I change it to a NaN value. After that, I use dropna() to drop the NaN values and reassign the result to a new variable called mydf2. Below are my code and screenshots.
import pandas as pd

mydf = pd.read_excel('pandas lab datasets/singstats_maritalstatus.xlsx',
                     na_values='-')
mydf = mydf.set_index(['Variables'])
print(mydf.head(5)) # Original data
mydf2 = mydf.dropna()
print(mydf2)
dropna() has worked correctly. You have two print statements: the first one printed five rows, as asked for by print(mydf.head(5)).
The output of your second print statement, print(mydf2), is an empty dataframe [0 rows x 37 columns] because you apparently have a NaN in every single row (see the bottom of your screenshot).
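If dropping every row is too aggressive, dropna() has parameters worth knowing; a minimal sketch with illustrative values:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, np.nan, 6]})

print(df.dropna())           # default how='any': drops every row containing a NaN
print(df.dropna(how='all'))  # drops a row only if all of its values are NaN
print(df.dropna(thresh=2))   # keeps rows with at least 2 non-NaN values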
It sounds like the "-" placeholder is coming through as a string here rather than NaN, so do:
mydf2 = mydf.replace('-', np.nan).dropna()
I wrote a piece of code here; it works fine with my data, so try this out.

mydf = pd.read_excel('pandas lab datasets/singstats_maritalstatus.xlsx')

# collect the positions of rows that contain the "-" placeholder
to_del = []
for i in range(mydf.shape[0]):
    if "-" in list(mydf.iloc[i]):
        to_del.append(i)

# drop those rows by label (this works because the index is the default RangeIndex)
out_df = mydf.drop(to_del, axis=0)
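A hedged vectorized equivalent of that loop (same assumption that "-" is the only placeholder to look for):

# keep only rows in which no cell equals "-"
out_df = mydf[~mydf.isin(['-']).any(axis=1)]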
As you have not posted your data, I'm not sure whether every row has NaN values. If so, df.dropna() will simply drop every row. For example, the columns 1981 and 1982 are all NaN values in your image; using df.dropna(axis=1) will drop those two columns instead, and will not return an empty df.
import numpy as np
import pandas as pd

df = pd.DataFrame({'Variables': ['Total', 'Single', 'Married', 'Widowed', 'Divorced/Separated'],
                   '1980': range(5),
                   '1981': [np.nan] * 5})
df = df.set_index('Variables')  # assign back: set_index does not modify in place by default
print(df.dropna(axis=1))        # drops the all-NaN '1981' column while keeping every row