Speed up dropping rows based on pandas column values - Python

I have a very large pandas DataFrame, which looks something like
import pandas as pd

df = pd.DataFrame({"Station": [2, 2, 2, 5, 5, 5, 6, 6],
                   "Day": [1, 2, 3, 1, 2, 3, 1, 2],
                   "Temp": [-7.0, 2.7, -1.3, -1.9, 0.2, 0.5, 1.3, 6.4]})
and I would like to filter out, as efficiently (quickly) as possible, all rows whose 'Station' value does not occur exactly n times.
stations = pd.unique(df['Station'])
n = 3

def complete(temp):
    for k in range(len(stations)):
        if len(temp[temp['Station'] == stations[k]].Temp) != n:
            temp.drop(temp.index[temp['Station'] == stations[k]], inplace=True)
I've been looking into using @jit(nopython=True) or Cython along the lines of this enhancing-performance pandas tutorial, but in the examples I have found each column is treated separately. I'm wondering: is the fastest way to somehow use @jit to build a new list v of the df['Station'] values I want to keep and then use df = df[df.Station.isin(v)] to filter the rows of the entire data frame, or is there a better way?
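For reference, a plain-pandas sketch of that isin idea (no numba involved; keep is just an illustrative name for the list of stations to retain) might look like:
# keep only stations that occur exactly n times
keep = [s for s in stations if (df['Station'] == s).sum() == n]
df = df[df['Station'].isin(keep)]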

Use value_counts:
out = df[df['Station'].isin(df['Station'].value_counts().loc[lambda x: x==n].index)]
print(out)
# Output
   Station  Day  Temp
0        2    1  -7.0
1        2    2   2.7
2        2    3  -1.3
3        5    1  -1.9
4        5    2   0.2
5        5    3   0.5
Result of value_counts:
>>> df['Station'].value_counts()
2 3
5 3
6 2
Name: Station, dtype: int64
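For comparison, the same result can also be produced with groupby.filter, though this is often slower when there are many groups because the lambda runs once per group:
out = df.groupby('Station').filter(lambda g: len(g) == n)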

You can group by "Station", use transform with the 'count' method, and compare the counts with n to create a boolean Series. Then use this mask to filter the relevant rows:
n=3
msk = df.groupby('Station')['Temp'].transform('count').eq(n)
df = df[msk]
Output:
   Station  Day  Temp
0        2    1  -7.0
1        2    2   2.7
2        2    3  -1.3
3        5    1  -1.9
4        5    2   0.2
5        5    3   0.5
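For reference, this is what the intermediate mask looks like for the example frame (stations 2 and 5 occur three times, station 6 only twice):
>>> df.groupby('Station')['Temp'].transform('count').eq(n)
0     True
1     True
2     True
3     True
4     True
5     True
6    False
7    False
Name: Temp, dtype: bool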

Related

Apply different mathematical functions in a table in Python

I have two columns, Column A and Column B, with some values like below.
Now I want to apply a normal arithmetic function to each row and put the result in the next column, but a different arithmetic operator should be applied to each row, like:
A+B for first row
A-B for second row
A*B for third row
A/B for fourth row
and so on till the nth row, repeating the same cycle of mathematical functions.
Can someone please help me with this code in Python?
We can use:
row.name to access the index when using apply on a row
a dictionary to map indexes to operations
Code
import operator as _operator
import pandas as pd

# Data
d = {"A": [5, 6, 7, 8, 9, 10, 11],
     "B": [1, 2, 3, 4, 5, 6, 7]}
df = pd.DataFrame(d)
print(df)

# Mapping from index to mathematical operation
operator_map = {
    0: _operator.add,
    1: _operator.sub,
    2: _operator.mul,
    3: _operator.truediv,
}

# use row.name % 4 so the operators cycle with a period of 4
df['new'] = df.apply(lambda row: operator_map[row.name % 4](*row), axis=1)
Output
Initial df
A B
0 5 1
1 6 2
2 7 3
3 8 4
4 9 5
5 10 6
6 11 7
New df
A B new
0 5 1 6.0
1 6 2 4.0
2 7 3 21.0
3 8 4 2.0
4 9 5 14.0
5 10 6 4.0
6 11 7 77.0
IIUC, you can try DataFrame.apply on rows with operator
import operator
operators = [operator.add, operator.sub, operator.mul, operator.truediv]
df['C'] = df.apply(lambda row: operators[row.name](*row), axis=1)
print(df)
A B C
0 5 1 6.0
1 6 2 4.0
2 7 3 21.0
3 8 4 2.0
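Note that operators[row.name] assumes the index only runs from 0 to 3 (only those rows are shown above); for the longer seven-row frame from the question you could cycle through the list with a modulo, e.g.:
df['C'] = df.apply(lambda row: operators[row.name % len(operators)](*row), axis=1)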

Efficient way in Pandas to count occurrences of Series of values by row

I have a large dataframe for which I want to count, by row, the number of occurrences of a series of specific values (given by an external function). For reproducibility, let's assume the following simplified dataframe:
data = {'A': [3, 2, 1, 0], 'B': [4, 3, 2, 1], 'C': [1, 2, 3, 4], 'D': [1, 1, 2, 2], 'E': [4, 4, 4, 4]}
df = pd.DataFrame.from_dict(data)
df
A B C D E
0 3 4 1 1 4
1 2 3 2 1 3
2 1 2 3 2 2
3 0 1 4 2 4
How can I count the number of occurrences of specific values (given by a series with the same size) by row?
Again for simplicity, let's assume this value_series is given by the max of each row.
values_series = df.max(axis=1)
0 4
1 3
2 3
3 4
dtype: int64
The solution I came up with seems not very pythonic (e.g. it uses iterrows(), which is slow):
max_count = []
for index, row in df.iterrows():
    max_count.append(row.value_counts()[values_series.loc[index]])
df_counts = pd.Series(max_count)
Is there any more efficient way to do this?
We can compare the transposed df.T directly to the df.max series, thanks to broadcasting:
(df.T == df.max(axis=1)).sum()
# result
0 2
1 1
2 1
3 2
dtype: int64
(Transposing also has the added benefit that we can use sum without specifying the axis, i.e. with the default axis=0.)
You can try
df.eq(df.max(1),axis=0).sum(1)
Out[361]:
0 2
1 1
2 1
3 2
dtype: int64
The perfect job for numpy broadcasting:
a = df.to_numpy()
b = values_series.to_numpy()[:, None]
(a == b).sum(axis=1)
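If you want the result back as a pandas Series aligned with the original index (an assumption about how it will be used), you can wrap it:
df_counts = pd.Series((a == b).sum(axis=1), index=df.index)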

How to find all the zero cells in a Python pandas dataframe and replace them?

My data is like this:
df = pd.DataFrame({'a': [5,0,0, 6, 0, 0, 0 , 12]})
I want to count the zeros above the 6 and replace them and the original 6 with 6/(count+1) = 6/3 = 2.
I also want to do a similar thing with the zeros above the 12.
So, 12/(count+1) = 12/4 = 3.
So the final result will be:
[5, 2, 2, 2, 3, 3, 3, 3]
I am not sure how to start. Are there any functions that do this?
Thanks.
Use GroupBy.transform with 'mean' and custom groups created by testing for values not equal to 0, reversing the order, taking the cumulative sum, and reversing back to the original order:
g = df['a'].ne(0).iloc[::-1].cumsum().iloc[::-1]
df['b'] = df.groupby(g)['a'].transform('mean')
print (df)
a b
0 5 5
1 0 2
2 0 2
3 6 2
4 0 3
5 0 3
6 0 3
7 12 3
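To see how the grouping works, here is the intermediate series g for this frame; each run of zeros is grouped together with the non-zero value that ends it (index 0 forms its own group with mean 5, indices 1-3 average to 2, indices 4-7 average to 3):
>>> g
0    3
1    2
2    2
3    2
4    1
5    1
6    1
7    1
Name: a, dtype: int64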

Deleting columns of a DataFrame where the row value is constant for all rows

Starting with a pandas DataFrame such as
import pandas as pd
df = pd.DataFrame(
[[0, 3, 1.4, 3], [0, 3, 1.3, 1], [0, 3, 0.5, 3]]
)
or visually:
   0  1    2  3
0  0  3  1.4  3
1  0  3  1.3  1
2  0  3  0.5  3
and given a special value x = 3,
what would be a smart and scalable way to come up with a DataFrame that drops every column of df whose value is the constant x in EACH row?
The result in this example would be the DataFrame df without column 1.
df_altered =
   0    1  2
0  0  1.4  3
1  0  1.3  1
2  0  0.5  3
With a small DataFrame I could iterate over all rows for each column, but that would not scale to large DataFrames.
You can use DataFrame.drop():
df.drop(columns=df.columns[(df == 3).all()])
Output:
0 2 3
0 0 1.4 3
1 0 1.3 1
2 0 0.5 3
One way is to determine the columns with equal values using:
>>> (df == df.iloc[0]).all(axis=0)
0 True
1 True
2 False
3 False
dtype: bool
Then extract the inverse of the above mask:
>>> df.iloc[:, ~(df == df.iloc[0]).all(axis=0).to_numpy()]
2 3
0 1.4 3
1 1.3 1
2 0.5 3
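As a small variation (not from the original answer), the boolean mask can also be passed to .loc directly as a column selector, which avoids converting it to a numpy array:
>>> df.loc[:, ~(df == df.iloc[0]).all(axis=0)]
     2  3
0  1.4  3
1  1.3  1
2  0.5  3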
You can try this:
# number of unique values in each column
no_unique_val = df.nunique()
val = 3
for column_name in no_unique_val.index:
    if no_unique_val[column_name] == 1 and df[column_name].values[0] == val:
        df.drop(column_name, axis=1, inplace=True)
Output:
   0    2  3
0  0  1.4  3
1  0  1.3  1
2  0  0.5  3
You can use df.ne and then df.any as a boolean mask:
df.loc[:, df.ne(3).any()]
0 2 3
0 0 1.4 3
1 0 1.3 1
2 0 0.5 3

How to save the result of an equation (float) to a column, Python

I have a data frame that looks like this:
df:
1 2 3.4
-2 2 1.1
2 3 4
-5 5 5
I can use this data in my equation like:
result=abs(int(df[0])) +( int(df[1]) / 2 + float(df[2]) / 32)
After this calculation I receive a list with a result for each line of df, and the resulting type is float.
Question: How can I save it to a column or DataFrame, and add this column of results to another DataFrame that is the same as df?
I've tried pd.DataFrame(result), which doesn't work.
Assign directly to the new column you're trying to create. Note that abs and the arithmetic operators work element-wise on whole columns, so the int()/float() casts from the question aren't needed (int() on a Series would actually raise an error):
df[3] = df[0].abs() + df[1] / 2 + df[2] / 32
I think you need Series.astype to cast the columns, combined with Series.abs:
df = pd.DataFrame({0: [1, -2, 2, -5], 1: [2, 2, 3, 5], 2: [3.4, 1.1, 4.0, 5.0]})
print (df)
0 1 2
0 1 2 3.4
1 -2 2 1.1
2 2 3 4.0
3 -5 5 5.0
df[3] = df[0].astype(int).abs() + df[1].astype(int) / 2 + df[2].astype(float) / 32
print (df)
   0  1    2         3
0  1  2  3.4  2.106250
1 -2  2  1.1  3.034375
2  2  3  4.0  3.625000
3 -5  5  5.0  7.656250
