Dropping every row with len > 2 in pandas (Python)

Suppose I have a dataframe:

   Values
0      25
1     897
2      48
3      28
4     214
5      25
I am trying to drop all rows whose value has len > 2 with the following code, but nothing happens when I run it:

import pandas as pd

df = pd.read_csv('File.csv')
for index in df.index:
    if len(df.loc[index, 'Values']) > 2:
        df.drop([index])
    else:
        pass

Use Series.str.len in boolean indexing (as an aside, your loop has no effect because df.drop returns a new DataFrame instead of modifying df in place):

df1 = df[df['Values'].str.len() <= 2]

If the values are numbers, convert them to strings first:

df1 = df[df['Values'].astype(str).str.len() <= 2]
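Putting it together on the sample data, a minimal sketch (it assumes the column is named 'Values', as in the example above):

import pandas as pd

# sample data from the question, stored as integers
df = pd.DataFrame({'Values': [25, 897, 48, 28, 214, 25]})

# cast to string so str.len() counts digits, then keep rows of length <= 2
df1 = df[df['Values'].astype(str).str.len() <= 2]
print(df1)
#    Values
# 0      25
# 2      48
# 3      28
# 5      25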

Related

python dataframe subtract a column from multiple columns

I have a dataframe and I want to subtract a column from multiple columns.
code:

df = pd.DataFrame({'A': [10, 20, 30], 'B': [100, 200, 300], 'C': [15, 10, 50]})

# Create new A and B columns by subtracting C from A and B
df[['newA', 'newB']] = df[['A', 'B']] - df['C']
Present output:
raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis
You can use DataFrame.sub with axis=0. Plain df[['A','B']] - df['C'] aligns the Series' index against the frame's column labels rather than subtracting row-wise; axis=0 tells sub to align on the row index instead:

df[['newA', 'newB']] = df[['A', 'B']].sub(df['C'], axis=0)
df
Out[114]:
    A    B   C  newA  newB
0  10  100  15    -5    85
1  20  200  10    10   190
2  30  300  50   -20   250
As another option, you can convert column 'C' to a NumPy array with df[['C']].values; its (n, 1) shape broadcasts against the selected columns. The new code would be:

df[['newA', 'newB']] = df[['A', 'B']] - df[['C']].values
Try pandas' .apply() method: you can select columns and apply a function to them, in this case subtracting one of your existing columns:

df[['newA', 'newB']] = df[['A', 'B']].apply(lambda x: x - df['C'])
You can also reshape df['C'].values so that it broadcasts against df[['A', 'B']]:

df[['newA', 'newB']] = df[['A', 'B']] - df['C'].values[:, None]
print(df)
    A    B   C  newA  newB
0  10  100  15    -5    85
1  20  200  10    10   190
2  30  300  50   -20   250
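All three approaches give the same frame; a minimal sketch checking them against each other on the sample data:

import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': [100, 200, 300], 'C': [15, 10, 50]})

via_sub = df[['A', 'B']].sub(df['C'], axis=0)
via_values = df[['A', 'B']] - df[['C']].values        # (3, 1) array broadcasts
via_numpy = df[['A', 'B']] - df['C'].values[:, None]  # same idea, from the Series

assert via_sub.equals(via_values) and via_sub.equals(via_numpy)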

changing index of 1 row in pandas

I have the below df, built from a pivot of a larger df. In this table 'week' is the index (dtype = object) and I need to show week 53 as the first row instead of the last.
Can someone advise please? I tried reindex and custom sorting but can't find a way.
Thanks!
Here is the table:

        numbers
weeks
1     181519.23
2      18507.58
3      11342.63
4       6064.06
53      4597.90
Since you can't directly insert the row and push the others back, a trick you can use is to create a new ordering column:

# add a new column, "new", holding the original order
df['new'] = range(1, len(df) + 1)

# set the row whose index is 53 to 0 in the new column;
# note that this comparison has to match the index type,
# so if the weeks are object (strings), compare df.index == '53'
df.loc[df.index == 53, 'new'] = 0

# sort by the new column and drop it
df = df.sort_values('new').drop('new', axis=1)
Before:

        numbers
weeks
1     181519.23
2      18507.58
3      11342.63
4       6064.06
53      4597.90

After:

        numbers
weeks
53      4597.90
1     181519.23
2      18507.58
3      11342.63
4       6064.06
One way of doing this would be to reorder the index labels, moving the last one to the front:

import pandas as pd

df = pd.DataFrame(range(10))
new_df = df.loc[[df.index[-1]] + list(df.index[:-1])]
output:

   0
9  9
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
Alternate method, if the week is a regular column (here called "Year week") rather than the index:

new_df = pd.concat([df[df["Year week"] == 53], df[~(df["Year week"] == 53)]])
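A reindex with an explicit label order is another option; a minimal sketch on the question's data, assuming string week labels (per the object dtype):

import pandas as pd

df = pd.DataFrame(
    {'numbers': [181519.23, 18507.58, 11342.63, 6064.06, 4597.90]},
    index=pd.Index(['1', '2', '3', '4', '53'], name='weeks'),
)

# build the desired order explicitly: week 53 first, the rest unchanged
new_order = ['53'] + [w for w in df.index if w != '53']
print(df.reindex(new_order))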

How to select columns which do not have a single zero value from a DataFrame?

I want to select the columns of a DataFrame that do not contain even a single zero value.
How can I do that?
You can use boolean indexing here. Check where the DataFrame is not equal to 0, and then use all to select the columns which are all True for that condition:
df = df.loc[:, (df != 0).all()]
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(3,5))
# zeroes in the first and third column
df.iloc[0,0] = 0
df.iloc[2,2] = 0
# 0 1 2 3 4
# 0 0.000000 0.953372 0.268231 0.500892 0.555905
# 1 0.835321 0.539232 0.697369 0.662901 0.486734
# 2 0.431325 0.662009 0.000000 0.575064 0.259657
df = df.loc[:, (df != 0).all()]
# 1 3 4
# 0 0.953372 0.500892 0.555905
# 1 0.539232 0.662901 0.486734
# 2 0.662009 0.575064 0.259657
You can use this:
df[[i for i in df.columns if 0 not in set(df[i])]]
I'd first flag the columns that contain a 0 (isin plus any over the rows), invert that boolean mask, and use it to select the subset of the df:
df.loc[:, ~df.isin([0]).any(axis=0)]
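As a quick sanity check, both masks select the same columns; a minimal sketch mirroring the random example above:

import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame(np.random.rand(3, 5))
df.iloc[0, 0] = 0
df.iloc[2, 2] = 0

# columns where every entry is nonzero...
via_all = df.loc[:, (df != 0).all()]
# ...match the columns where no entry equals 0
via_any = df.loc[:, ~df.isin([0]).any(axis=0)]
assert via_all.equals(via_any)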

How do I replace the values of a DataFrame on a condition

I have a pandas dataframe with more than 50 columns. All the data except the 1st column is float. I want to replace any value greater than 5.75 with 100. Can someone advise a function to do this?
replace does not work here, since its to_replace argument only matches exact values, not a greater-than condition.
This can be done with np.where:

import numpy as np

df['ColumnName'] = np.where(df['ColumnName'] > 5.75, 100, df['ColumnName'])
You can make a custom function and pass it to apply:

import pandas as pd
import random

df = pd.DataFrame({'col_name': [random.randint(0, 10) for x in range(100)]})

def f(x):
    if x > 5.75:
        return 100
    return x

df['modified'] = df['col_name'].apply(f)
print(df.head())
   col_name  modified
0         2         2
1         5         5
2         7       100
3         1         1
4         9       100
If you have a dataframe:
import pandas as pd
import random
df = pd.DataFrame({'first_column': [random.uniform(5,6) for x in range(10)]})
print(df)
Gives me:
first_column
0 5.620439
1 5.640604
2 5.286608
3 5.642898
4 5.742910
5 5.096862
6 5.360492
7 5.923234
8 5.489964
9 5.127154
Then check if the value is greater than 5.75:
df[df > 5.75] = 100
print(df)
Gives me:
first_column
0 5.620439
1 5.640604
2 5.286608
3 5.642898
4 5.742910
5 5.096862
6 5.360492
7 100.000000
8 5.489964
9 5.127154
import numpy as np
import pandas as pd

# create df
np.random.seed(0)
df = pd.DataFrame(2 * np.random.randn(100, 50))

for col_name in df.columns[1:]:  # skip the first column
    # boolean row selection on the left-hand side avoids chained assignment
    df.loc[df[col_name] > 5.75, col_name] = 100
Or with np.where on a single column:

np.where(df.value > 5.75, 100, df.value)
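For the whole-frame case in the question (50+ columns, all float except the first), DataFrame.mask condenses this into one pass; a minimal sketch with made-up column names:

import numpy as np
import pandas as pd

# small stand-in for the real data: first column non-numeric, rest floats
np.random.seed(0)
df = pd.DataFrame(np.random.uniform(0, 10, (5, 3)), columns=['x', 'y', 'z'])
df.insert(0, 'name', list('abcde'))

# mask replaces values where the condition is True, leaving the rest alone
num_cols = df.columns[1:]
df[num_cols] = df[num_cols].mask(df[num_cols] > 5.75, 100)
print(df)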

How to select and delete columns with duplicate name in pandas DataFrame

I have a huge DataFrame where some columns have the same names. When I try to pick a column that exists twice (e.g. del df['col name'] or df2 = df['col name']), I get an error. What can I do?
You can address columns by index:
>>> df = pd.DataFrame([[1,2],[3,4],[5,6]], columns=['a','a'])
>>> df
a a
0 1 2
1 3 4
2 5 6
>>> df.iloc[:,0]
0 1
1 3
2 5
Or you can rename columns, like
>>> df.columns = ['a','b']
>>> df
a b
0 1 2
1 3 4
2 5 6
This is not a good situation to be in. Best would be to create a hierarchical column labelling scheme (pandas allows multi-level column labels or row index labels). Determine what actually makes the two columns with the same name different from each other, and leverage that to create a hierarchical column index.
In the meantime, if you know the positional location of the columns in the ordered list of columns (e.g. from dataframe.columns), you can use positional indexing, such as .iloc[], to retrieve values from the column positionally. (Older answers also suggest .ix[], but it has since been removed from pandas.)
You can also create copies of the columns with new names, such as:
dataframe["new_name"] = dataframe.iloc[:, column_position].values
where column_position references the positional location of the column you're trying to get (not the name).
These may not work for you if the data is too large, however. So the best option is to find a way to modify the construction process to get the hierarchical column index.
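To illustrate the hierarchical idea, a minimal sketch where two columns that would both be called 'price' get a disambiguating second level (the level names and values here are made up):

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])
df.columns = pd.MultiIndex.from_tuples(
    [('vendor_a', 'price'), ('vendor_b', 'price')],
    names=['source', 'field'],
)
print(df[('vendor_a', 'price')])  # selection is now unambiguous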
Another solution:

import numpy as np
import pandas as pd

def remove_dup_columns(frame):
    keep_names = set()
    keep_icols = list()
    for icol, name in enumerate(frame.columns):
        if name not in keep_names:
            keep_names.add(name)
            keep_icols.append(icol)
    return frame.iloc[:, keep_icols]

frame = pd.DataFrame(np.random.randint(0, 50, (5, 4)), columns=['A', 'A', 'B', 'B'])
print(frame)
print(remove_dup_columns(frame))
The output is

    A   A   B   B
0  18  44  13  47
1  41  19  35  28
2  49   0  30  16
3  39  29  43  41
4  26  19  48  13

    A   B
0  18  13
1  41  35
2  49  30
3  39  43
4  26  48
The following function removes columns with duplicate names and keeps only one. Not exactly what you asked for, but you can use pieces of it to solve your problem. The idea is to work with index numbers so you can address specific column positions directly; the positions are unique while the column names aren't.
def remove_multiples(df, varname):
    """
    Makes a copy of the first column of all columns with the same name,
    deletes all columns with that name and inserts the first column again.
    """
    from copy import deepcopy
    dfout = deepcopy(df)
    if varname in dfout.columns:
        tmp = dfout.iloc[:, min([i for i, x in enumerate(dfout.columns == varname) if x])]
        del dfout[varname]
        dfout[varname] = tmp
    return dfout
where
[i for i,x in enumerate(dfout.columns == varname) if x]
is the part you need: it collects the positional indices of every column whose name equals varname.
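On modern pandas, Index.duplicated gets the same deduplication in one line; a minimal sketch keeping the first occurrence of each name:

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'a'])
deduped = df.loc[:, ~df.columns.duplicated()]
print(deduped)
#    a
# 0  1
# 1  3
# 2  5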
