create a new data frame from existing data frame based on condition - python

I have a DataFrame df:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[0,1,1,0,1,0], [1,0,1,1,0,0], [1,1,0,0,0,1],[1,0,1,0,1,1],
[0,0,1,0,0,1]]))
df
Now, from DataFrame df I would like to create a new DataFrame based on a condition.
Condition: if a column contains three or more 1s, then the new DataFrame's value for that column is 1; otherwise it is 0.
Expected output of the new DataFrame:
1 0 1 0 0 1

You can also get it without apply. You can sum down each column with axis=0 and create a boolean mask with gt(2):
res = df.sum(axis=0).gt(2).astype(int)
print(res)
0 1
1 0
2 1
3 0
4 0
5 1
dtype: int32
As David pointed out, the result of the above is a Series. If you need a DataFrame, you can chain to_frame() at the end of it.
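Putting that together as a runnable sketch with the df from the question, and chaining to_frame().T to get the expected single-row layout:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([[0,1,1,0,1,0], [1,0,1,1,0,0], [1,1,0,0,0,1],
                            [1,0,1,0,1,1], [0,0,1,0,0,1]]))

# Sum each column (axis=0), compare against 2, convert booleans to 0/1
res = df.sum(axis=0).gt(2).astype(int)

# to_frame().T turns the Series into a single-row DataFrame
res_df = res.to_frame().T
print(res_df)
```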

You could do the following:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[0,1,1,0,1,0], [1,0,1,1,0,0], [1,1,0,0,0,1],[1,0,1,0,1,1],
[0,0,1,0,0,1]]))
df_res = pd.DataFrame(df.apply(lambda c: 1 if np.sum(c) > 2 else 0))
In [6]: df_res
Out[6]:
0
0 1
1 0
2 1
3 0
4 0
5 1
Instead of np.sum(c) you can also do c.sum()
And if you want it transposed just do the following instead:
df_res = pd.DataFrame(df.apply(lambda c: 1 if c.sum() > 2 else 0)).T

Related

How to replace values in one column with values from a second, then a third column in a dataframe - so if the first is 0 it gets replaced from the second, etc.

There is a DataFrame with many columns that have different names but contain the same data, though in some columns values can be missing (= 0). I need to combine all non-missing values from all columns into one column, and drop all duplicate columns except one. Real df case below:
supplierArticle       sa_name_2022-07-28    sa_name_2022-07-30  article
ATD-800               0                     ATD-800             0
0                     SK-CONST2-BZ23T       0                   0
STL-K61-A506-Y-WHITE  STL-K61-A506-Y-WHITE  0                   0
0                     V-V1011-i             0
SK-KR-5556            SK-KR-5556            0                   SK-KR-5556
I want to combine them all into one column:
supplierArticle
ATD-800
SK-CONST2-BZ23T
STL-K61-A506-Y-WHITE
V-V1011-i
SK-KR-5556
test example:
df:
A f s t
B 0 4 0
D 0 1 1
0 3 3 3
df_result (changing values in f column by values from s and then from t):
A f
B 4
D 1
0 3
Is there a pandas-idiomatic way?
Update:
Here is some working but complicated code I built:
import pandas as pd
import numpy as np

# test df
df = pd.DataFrame.from_dict({
    'A': ['B', 'D', 0],
    'f': [0, 0, 3],
    's': [4, 1, 3],
    't': [0, 1, 3],
})
print(df)

# iterate over the list of df columns from which we want to take the non-null values
def combine_duplicate_column(df, col_name_in: str, list_re_col_names: list):
    for col_name_from in list_re_col_names:
        df[col_name_in] = [
            _fill_missing_values(val_col_name_in, val_col_name_from)
            for val_col_name_in, val_col_name_from
            in zip(df[col_name_in], df[col_name_from])
        ]
    return df

# compare two values from the two columns; the non-zero one is returned
def _fill_missing_values(val_col_name_in, val_col_name_from):
    if val_col_name_in:
        return val_col_name_in
    return val_col_name_from

list_re_col_names = ['s', 't']
df = combine_duplicate_column(df, 'f', list_re_col_names)
df_result = df.drop(list_re_col_names, axis=1)
print(df_result)
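A more pandas-idiomatic sketch (assuming, as in the test df, that 0 is the only marker for a missing value): replace 0 with NaN and back-fill across the columns, so f picks up the first non-missing value from f, s, t:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['B', 'D', 0], 'f': [0, 0, 3], 's': [4, 1, 3], 't': [0, 1, 3]})

# Treat 0 as missing, then back-fill row-wise so 'f' takes the first
# non-missing value scanning left to right across f, s, t
df['f'] = df[['f', 's', 't']].replace(0, np.nan).bfill(axis=1)['f'].astype(int)
df_result = df[['A', 'f']]
print(df_result)
```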

How to select columns which do not have a single zero value from a DataFrame?

I want to select columns from a DataFrame, where those columns should not have even a single zero value.
How to do that?
You can use boolean indexing here. Check where the DataFrame is not equal to 0, and then use all to select the columns which are all True for that condition:
df = df.loc[:, (df != 0).all()]
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(3,5))
# zeroes in the first and third column
df.iloc[0,0] = 0
df.iloc[2,2] = 0
# 0 1 2 3 4
# 0 0.000000 0.953372 0.268231 0.500892 0.555905
# 1 0.835321 0.539232 0.697369 0.662901 0.486734
# 2 0.431325 0.662009 0.000000 0.575064 0.259657
df = df.loc[:, (df != 0).all()]
# 1 3 4
# 0 0.953372 0.500892 0.555905
# 1 0.539232 0.662901 0.486734
# 2 0.662009 0.575064 0.259657
You can use this:
df[[i for i in df.columns if 0 not in set(df[i])]]
I'd first flag all columns that contain a 0, reduce each column to a single boolean with any, invert that mask, and use it to select a subset of the data from the df.
df.loc[:, ~df.isin([0]).any(axis=0)]
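To make that last approach concrete, here is a small sketch (the column names and data are made up for illustration):

```python
import pandas as pd

# hypothetical data: columns 'a' and 'c' each contain a zero
df = pd.DataFrame({'a': [1, 0, 3], 'b': [4, 5, 6], 'c': [7, 8, 0]})

# isin([0]).any(axis=0) flags columns containing a 0; ~ inverts the mask
mask = ~df.isin([0]).any(axis=0)
result = df.loc[:, mask]
print(result)
```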

Why does one-hot encoding not apply, even though the file runs without errors?

I want to apply one-hot encoding to one column, "drive_wheels".
However, when I run it there is no error, but also no change to the dataset!
Is there an error in the code?
import pandas as pd
import numpy as np
df = pd.read_csv('onehotencoding.csv')
df.head()
obj_df = df.select_dtypes(include=['object']).copy()
obj_df.head()
pd.get_dummies(obj_df, columns=["drive_wheels"]).head()
pd.get_dummies() doesn't modify the DataFrame in place (there is no inplace switch); it returns a new DataFrame. Therefore, you need to join the result to your original. Also drop the .head() when assigning, since it would keep only the first five rows:
dummies = pd.get_dummies(obj_df, columns=["drive_wheels"])
combined = df.join(dummies)
For example:
df = pd.DataFrame(list('AABBABA'), columns=['cats'])
dummies = pd.get_dummies(df, columns=['cats'])
combined = df.join(dummies)
print(combined)
Which gives you:
cats cats_A cats_B
0 A 1 0
1 A 1 0
2 B 0 1
3 B 0 1
4 A 1 0
5 B 0 1
6 A 1 0
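One caveat for newer pandas versions (2.0 and later): get_dummies returns boolean columns by default, so pass dtype=int if you want the 0/1 output shown above. A sketch:

```python
import pandas as pd

df = pd.DataFrame(list('AABBABA'), columns=['cats'])

# dtype=int forces 0/1 columns; newer pandas otherwise defaults to True/False
dummies = pd.get_dummies(df['cats'], prefix='cats', dtype=int)
combined = df.join(dummies)
print(combined)
```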

How do I subtract an odd row value from an even row value?

I have the DataFrame below.
I want to subtract each even-indexed row's value from the following odd-indexed row's value (row 1 - row 0, row 3 - row 2, ...) and make a new DataFrame.
How can I do it?
import pandas as pd
import numpy as np
raw_data = {'Time': [281.54385, 436.55295, 441.74910, 528.36445,
974.48405, 980.67895, 986.65435, 1026.02485]}
data = pd.DataFrame(raw_data)
data
dataframe
Time
0 281.54385
1 436.55295
2 441.74910
3 528.36445
4 974.48405
5 980.67895
6 986.65435
7 1026.02485
Wanted result
ON_TIME
0 155.00910
1 86.61535
2 6.19490
3 39.37050
You can use NumPy indexing:
res = pd.DataFrame(data.values[1::2] - data.values[::2], columns=['Time'])
print(res)
Time
0 155.00910
1 86.61535
2 6.19490
3 39.37050
You can use shift for the subtraction, and then pick every 2nd element, starting with the 2nd element (index = 1):
(data.Time - data.Time.shift())[1::2].rename('On Time').reset_index(drop=True)
outputs:
0 155.00910
1 86.61535
2 6.19490
3 39.37050
Name: On Time, dtype: float64
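Another option, assuming the number of rows is even, is to reshape the values into pairs with NumPy and subtract columnwise (a sketch on the question's data):

```python
import pandas as pd

data = pd.DataFrame({'Time': [281.54385, 436.55295, 441.74910, 528.36445,
                              974.48405, 980.67895, 986.65435, 1026.02485]})

# Reshape into (n/2, 2) pairs: column 0 holds even rows, column 1 holds odd rows
pairs = data['Time'].to_numpy().reshape(-1, 2)
res = pd.DataFrame(pairs[:, 1] - pairs[:, 0], columns=['ON_TIME'])
print(res)
```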

Python - pandas - Append Series into Blank DataFrame

Say I have two pandas Series in python:
import pandas as pd
h = pd.Series(['g',4,2,1,1])
g = pd.Series([1,6,5,4,"abc"])
I can create a DataFrame with just h and then append g to it:
df = pd.DataFrame([h])
df1 = df.append(g, ignore_index=True)
I get:
>>> df1
0 1 2 3 4
0 g 4 2 1 1
1 1 6 5 4 abc
But now suppose that I have an empty DataFrame and I try to append h to it:
df2 = pd.DataFrame([])
df3 = df2.append(h, ignore_index=True)
This does not work. I think the problem is in the second-to-last line of code. I need to somehow define the blank DataFrame to have the proper number of columns.
By the way, the reason I am trying to do this is that I am scraping text from the internet using requests+BeautifulSoup and I am processing it and trying to write it to a DataFrame one row at a time.
So if you don't pass an empty list to the DataFrame constructor then it works:
In [16]:
df = pd.DataFrame()
h = pd.Series(['g',4,2,1,1])
df = df.append(h,ignore_index=True)
df
Out[16]:
0 1 2 3 4
0 g 4 2 1 1
[1 rows x 5 columns]
The difference between the two constructor approaches appears to be that the index dtypes are set differently: with nothing it is object, with an empty list it is int64:
In [21]:
df = pd.DataFrame()
print(df.index.dtype)
df = pd.DataFrame([])
print(df.index.dtype)
object
int64
Unclear to me why the above should affect the behaviour (I'm guessing here).
UPDATE
After revisiting this, I can confirm that this looks like a bug in pandas 0.12.0, as your original code works fine:
In [13]:
import pandas as pd
df = pd.DataFrame([])
h = pd.Series(['g',4,2,1,1])
df.append(h,ignore_index=True)
Out[13]:
0 1 2 3 4
0 g 4 2 1 1
[1 rows x 5 columns]
I am running pandas 0.13.1 and NumPy 1.8.1 (64-bit) with Python 3.3.5. I think the problem is in pandas, but I would upgrade both pandas and NumPy to be safe; I don't think this is a 32- versus 64-bit Python issue.
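A note for readers on current pandas: DataFrame.append was deprecated in 1.4 and removed in 2.0. The modern equivalent is to turn the Series into a one-row frame with to_frame().T and concatenate with pd.concat:

```python
import pandas as pd

h = pd.Series(['g', 4, 2, 1, 1])
g = pd.Series([1, 6, 5, 4, 'abc'])

# to_frame().T turns a Series into a one-row DataFrame
df = h.to_frame().T

# pd.concat replaces the removed DataFrame.append
df1 = pd.concat([df, g.to_frame().T], ignore_index=True)
print(df1)
```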
