Append a new column based on existing columns - python

Pandas newbie here.
I'm trying to create a new column in my data frame that will serve as a training label when I feed this into a classifier.
The value of the label column is 1.0 if a given Id has (Value1 > 0) or (Value2 > 0) for Apples or Pears, and 0.0 otherwise.
My dataframe is row indexed by Id and looks like this:
Out[30]:
                Value1                                                            Value2  \
ProductName        7Up      Apple    Cheetos      Onion       Pear    PopTart        7Up
ProductType     Drinks  Groceries     Snacks  Groceries  Groceries     Snacks     Drinks
Id
100                0.0        1.0        2.0        4.0        0.0        0.0        0.0
101                3.0        0.0        0.0        0.0        3.0        0.0        4.0
102                0.0        0.0        0.0        0.0        0.0        2.0        0.0

ProductName      Apple    Cheetos      Onion       Pear    PopTart
ProductType  Groceries     Snacks  Groceries  Groceries     Snacks
Id
100                1.0        3.0        3.0        0.0        0.0
101                0.0        0.0        0.0        2.0        0.0
102                0.0        0.0        0.0        0.0        1.0
If the pandas wizards could give me a hand with the syntax for this operation - my mind is struggling to put it all together.
Thanks!

The answer provided by @vlad.rad works, but it is not very efficient: apply with axis=1 calls a Python function once per row, so it cannot benefit from NumPy's vectorized operations. The following vectorized solution should be more efficient:
condition = (df['Value1'] > 0) | (df['Value2'] > 0)
df.loc[condition, 'label'] = 1.
df.loc[~condition, 'label'] = 0.
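The snippet above treats Value1 and Value2 as single columns, whereas the frame in the question has three column levels and only Apples and Pears should count. A minimal sketch of one way to handle that, assuming the levels are ordered as printed in the question; the small frame below is a hypothetical cut-down version of the question's data (the 5.0 for 7Up in row 102 is made up to show that other products are ignored):

```python
import pandas as pd

# Hypothetical miniature of the question's frame: the columns are a
# three-level MultiIndex (value kind, ProductName, ProductType).
columns = pd.MultiIndex.from_tuples(
    [
        ("Value1", "7Up", "Drinks"),
        ("Value1", "Apple", "Groceries"),
        ("Value1", "Pear", "Groceries"),
        ("Value2", "7Up", "Drinks"),
        ("Value2", "Apple", "Groceries"),
        ("Value2", "Pear", "Groceries"),
    ],
    names=[None, "ProductName", "ProductType"],
)
df = pd.DataFrame(
    [
        [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
        [3.0, 0.0, 3.0, 4.0, 0.0, 2.0],
        [5.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # only 7Up is non-zero here
    ],
    index=pd.Index([100, 101, 102], name="Id"),
    columns=columns,
)

# Boolean mask over the columns: True where the product is Apple or Pear,
# regardless of whether the column belongs to Value1 or Value2.
wanted = df.columns.get_level_values("ProductName").isin(["Apple", "Pear"])

# A row is labelled 1.0 if any of the selected columns is positive.
label = (df.loc[:, wanted] > 0).any(axis=1).astype(float)
print(label)
```

The label is kept as a separate Series indexed by Id here; whether to attach it to the multi-level frame or pass it to the classifier on its own is a design choice left open.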

Define your function:
def new_column(x):
    # x is one row; return the label as a float, as the question asks
    if x['Value1'] > 0:
        return 1.0
    if x['Value2'] > 0:
        return 1.0
    return 0.0
Apply it on your data and store the result in the new column:
df['label'] = df.apply(new_column, axis=1)

Related

How can I assign the words from a specific column as a label to a new dataframe

Hi Friend I'm new here 😊,
I want to build a matrix from the most frequent words in a specific column (A) and add it to my data frame, using the words themselves as the column labels.
What I have:
raw_data={"A":["This is yellow","That is green","These are orange","This is a pen","This is an Orange"]}
df=pd.DataFrame(raw_data)
My goal:
I want to:
1- Split the strings and count the words in a specific column
2- Make a zero matrix
3- Label the columns of the new matrix with the words found in step 1 (my problem)
4- Search every row: 1 if the word is found, else 0
The data frame I currently get as a result:
A word_count char_count 0 1 2 3 4 5 6 7 8 9 10 11
0 This is yellow 3 14 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 That is green 3 13 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2 These are orange 3 16 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
3 This is a pen 4 13 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
4 This is an Orange 4 17 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0
What I did:
import pandas as pd
import numpy as np
# 1- Data frame
raw_data={"A":["This is yellow","That is green","These are orange","This is a pen","This is an Orange"]}
df=pd.DataFrame(raw_data)
df
# 2- Count the words and characters in every row of column "A"
df['word_count'] = df['A'].agg(lambda x: len(x.split(" ")))
df['char_count'] = df['A'].agg(lambda x:len(x))
df
# 3- Count the separated words and how often each one is repeated
df_word_count=pd.DataFrame(df.A.str.split(' ').explode().value_counts()).reset_index().rename({'index':"A","A":"Count"},axis=1)
display(df_word_count)
df_word_count=list(df_word_count["A"])
len(df_word_count)
A Count
0 is 4
1 This 3
2 orange 1
3 That 1
4 yellow 1
5 Orange 1
6 are 1
7 a 1
8 an 1
9 These 1
10 green 1
11 pen 1
# 4- Make a ZERO-Matrix
allfeatures=np.zeros((df.shape[0],len(df_word_count)))
allfeatures.shape
# 5- Make a for loop
for i in range(len(df_word_count)):
    allfeatures[:,i]=df['A'].agg(lambda x:x.split().count(df_word_count[i]))
# 6- Concat the data
Complete_data=pd.concat([df,pd.DataFrame(allfeatures)],axis=1)
display(Complete_data)
What I want:
The words in "A" from step 3 should be the column labels of the new matrix instead of 0 1 2 ...
A word_count char_count is This orange etc.
0 This is yellow 3 14 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 That is green 3 13 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2 These are orange 3 16 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
3 This is a pen 4 13 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
4 This is an Orange 4 17 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0
I changed your code a little. Your step 3 now looks like this:
# 3- Count the separated words and how often each one is repeated
df_word_count=pd.DataFrame(df.A.str.split(' ').explode().value_counts()).reset_index().rename({'index':"A","A":"Count"},axis=1)
display(df_word_count)
list_word_count=list(df_word_count["A"])
len(list_word_count)
The main change is the variable name in list_word_count=list(df_word_count["A"]), so that the data frame df_word_count is no longer overwritten by the list.
The rest of the code, using the new variable, looks like this:
# 4- Make a ZERO-Matrix
allfeatures=np.zeros((df.shape[0],len(list_word_count)))
allfeatures.shape
# 5- Make a for loop
for i in range(len(list_word_count)):
    allfeatures[:,i]=df['A'].agg(lambda x:x.split().count(list_word_count[i]))
# 6- Concat the data
Complete_data=pd.concat([df,pd.DataFrame(allfeatures)],axis=1)
display(Complete_data)
Only the variable name differs from your code. I then add a seventh step:
# 7- Rename the columns from a list
# This creates a list of the words you wanted
l = list(df_word_count["A"])
# At this point the list holds only the words from column A, but the
# result you showed also has the columns A, word_count and char_count.
# So we insert those names at the beginning of the list.
l.insert(0,"char_count")
l.insert(0,"word_count")
l.insert(0,"A")
# Finally, I rename all the columns with the names that I have in the list l
Complete_data.columns = l
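After this, the columns of Complete_data are named A, word_count and char_count, followed by the words from step 3.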
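The renaming step can also be avoided by building the word matrix with named columns from the start. A minimal sketch of that idea, not the approach used above: the names tokens, words, features and complete are made up for illustration, and the counts come out as integers rather than floats.

```python
import pandas as pd

raw_data = {"A": ["This is yellow", "That is green", "These are orange",
                  "This is a pen", "This is an Orange"]}
df = pd.DataFrame(raw_data)

# Split every sentence once; each entry of `tokens` is a list of words.
tokens = df["A"].str.split()

# Unique words, most frequent first (case-sensitive, as in the question).
words = tokens.explode().value_counts().index.tolist()

# One column per word, keyed by the word itself, so no renaming is needed.
features = pd.DataFrame(
    {w: tokens.apply(lambda t, w=w: t.count(w)) for w in words}
)

complete = pd.concat([df, features], axis=1)
print(complete)
```

Because the dictionary keys become the column names, the order of the word columns follows the order of words.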

Python - Create a column equal to the value of another column but if two consecutive values occur in the first column then make new column equal to 0

I have a column ('bet_made1') of ones and zeros where 1 represents taking a long position in a stock.
I want to create a new column that will equal the value of 'bet_made1'. However, if two consecutive 1s occur in 'bet_made1', I would like the new column to equal 0.
More generally, I would like a long position to be followed by 4 days without another long being taken.
In other words, once I have a long position, I do not want to take another one until the 5th day after the initial buy order, so that I place a trade at most once every 5 days.
Hope that makes sense. I've included a table below showing what I'm aiming for.
Cheers!
bet_made1 long_pos
date
02/01/2019 0.0 0.0
03/01/2019 0.0 0.0
04/01/2019 0.0 0.0
07/01/2019 0.0 0.0
08/01/2019 0.0 0.0
09/01/2019 0.0 0.0
10/01/2019 0.0 0.0
11/01/2019 0.0 0.0
14/01/2019 0.0 0.0
15/01/2019 0.0 0.0
16/01/2019 1.0 1.0
17/01/2019 1.0 0.0
18/01/2019 0.0 0.0
22/01/2019 1.0 0.0
23/01/2019 1.0 0.0
24/01/2019 1.0 1.0
25/01/2019 1.0 0.0
28/01/2019 1.0 0.0
29/01/2019 0.0 0.0
30/01/2019 0.0 0.0
There might be a more compact and clever way, but creating a list with the new values and then adding it as a new column may be a good idea:
import pandas as pd
df = pd.DataFrame(data=dict(bet_made1=[0,1,0,1,1,1,0,0,1,1,1,1,1,1]))
long_pos = []
remains = 0
for b in df.bet_made1:
    if remains > 0:
        # still inside the 4-day pause after the last long
        remains -= 1
        long_pos.append(0)
    elif b == 1:
        # take the long and start a new 4-day pause
        remains = 4
        long_pos.append(1)
    else:
        long_pos.append(0)
df['long_pos'] = long_pos
This gives me the result:
bet_made1 long_pos
0 0
1 1
0 0
1 0
1 0
1 0
0 0
0 0
1 1
1 0
1 0
1 0
1 0
1 1
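The same loop can be wrapped in a small helper so that the length of the pause becomes a parameter. This is only a sketch: the function name apply_cooldown and its cooldown argument are made up for illustration, and the sample values are the last ten rows of the table in the question.

```python
import pandas as pd


def apply_cooldown(signal: pd.Series, cooldown: int = 4) -> pd.Series:
    """Keep a 1 only if no 1 was kept in the previous `cooldown` rows."""
    kept = []
    remaining = 0  # rows left in the current pause
    for value in signal:
        if remaining > 0:
            remaining -= 1
            kept.append(0)
        elif value == 1:
            remaining = cooldown
            kept.append(1)
        else:
            kept.append(0)
    # Reuse the original index so the result lines up with the input.
    return pd.Series(kept, index=signal.index, name="long_pos")


# bet_made1 from 16/01/2019 to 30/01/2019 in the question's table
bets = pd.Series([1, 1, 0, 1, 1, 1, 1, 1, 0, 0], name="bet_made1")
long_pos = apply_cooldown(bets)
print(long_pos.tolist())
```

The loop is kept rather than vectorized because each decision depends on the previous kept trade, which makes a plain Python loop the most readable option.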

Is there a way to compare two columns of a dataframe containing float values and create a new column to add labels based on it?

I have a dataframe which looks like below:
df
column_A column_B
0 0.0 0.0
1 0.0 0.0
2 0.0 1.0
3 0.0 0.0
4 0.0 0.0
5 1.0 0.0
I want to create an if condition like:
if (df['column_A'] == 0.0) & (df['column_B'] == 0.0):
    df['label'] = 'OK'
else:
    df['label'] = 'NO'
I tried this:
if((0.0 in df['column_A'] ) & (0.0 in df['column_B']))
for index, row in df.iterrows():
(df[((df['column_A'] == 0.0) & (df['column_B']== 0.0))])
Nothing really gave the expected outcome
I expect my output to be:
column_A column_B label
0 0.0 0.0 OK
1 0.0 0.0 OK
2 0.0 1.0 NO
3 0.0 0.0 OK
4 0.0 0.0 OK
5 1.0 0.0 NO
You can use np.where in order to create an array with either OK or NO depending on the result of the condition:
import numpy as np
df['label'] = np.where(df.column_A.add(df.column_B).eq(0), 'OK', 'NO')
column_A column_B label
0 0.0 0.0 OK
1 0.0 0.0 OK
2 0.0 1.0 NO
3 0.0 0.0 OK
4 0.0 0.0 OK
5 1.0 0.0 NO
Use numpy.where with DataFrame.any:
# solution if the columns contain only 1.0 and 0.0 values
df['label'] = np.where(df[['column_A', 'column_B']].any(axis=1), 'NO','OK')
# general solution: compare each value with 0
#df['label'] = np.where(df[['column_A', 'column_B']].eq(0).all(axis=1),'OK','NO')
print (df)
column_A column_B label
0 0.0 0.0 OK
1 0.0 0.0 OK
2 0.0 1.0 NO
3 0.0 0.0 OK
4 0.0 0.0 OK
5 1.0 0.0 NO
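For reference, the condition from the question can also be written out explicitly, comparing each column with 0.0 and combining the two element-wise masks with &. A self-contained sketch; the variable name both_zero is made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "column_A": [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    "column_B": [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
})

# Element-wise comparison gives one boolean per row; a plain `if` cannot
# be used here because it expects a single True/False value.
both_zero = (df["column_A"] == 0.0) & (df["column_B"] == 0.0)

df["label"] = np.where(both_zero, "OK", "NO")
print(df)
```

Unlike summing the two columns, this form does not rely on the values being non-negative.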

creating a dataframe from multiple dictionaries

I have multiple dictionaries that contain word frequency counts for a series of text files. I'm trying to find a way of collating them into a dataframe (so one dict = one text file = one row in the df), but I am fairly inexperienced with Python and unsure how to proceed.
I have approx 50 text files/dictionaries, but for simplicity say I have the following;
mydict = {'red': 2,'blue': 1,'yellow': 3}
mydict2 = {'blue': 1,'orange': 3,'red': 1}
mydict3 = {'purple': 1,'green': 3,'brown': 2}
How can I create a dataframe with the full list of colours as columns, the dictionaries/text files as rows, and the respective counts as the data points (with any colours not appearing in a particular dictionary registered as zero)?
I would have included a coding attempt, however I do not know how to begin with the task.
You can make a Series for each dictionary and then concat them.
import pandas as pd
mydicts = [mydict, mydict2, mydict3]
df = pd.concat([pd.Series(d) for d in mydicts], axis=1).fillna(0).T
df.index = ['mydict', 'mydict2', 'mydict3']
df
returns
blue brown green orange purple red yellow
mydict 1.0 0.0 0.0 0.0 0.0 2.0 3.0
mydict2 1.0 0.0 0.0 3.0 0.0 1.0 0.0
mydict3 0.0 2.0 3.0 0.0 1.0 0.0 0.0
Use pd.DataFrame.from_records():
In [6]: mydicts = [mydict, mydict2, mydict3]
In [7]: pd.DataFrame.from_records(mydicts).fillna(0)
Out[7]:
blue brown green orange purple red yellow
0 1.0 0.0 0.0 0.0 0.0 2.0 3.0
1 1.0 0.0 0.0 3.0 0.0 1.0 0.0
2 0.0 2.0 3.0 0.0 1.0 0.0 0.0
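With around 50 files it may be handy to keep the dictionaries in one mapping keyed by file name, so that the row labels come along for free. A minimal sketch under that assumption; the file names below are hypothetical, and the cast to int assumes every count is a whole number:

```python
import pandas as pd

# Hypothetical mapping from text file name to its word-frequency dict.
counts_by_file = {
    "file1.txt": {"red": 2, "blue": 1, "yellow": 3},
    "file2.txt": {"blue": 1, "orange": 3, "red": 1},
    "file3.txt": {"purple": 1, "green": 3, "brown": 2},
}

# A list of dicts becomes one row per dict; missing keys become NaN,
# which fillna(0) turns into zero counts.
df = pd.DataFrame(
    list(counts_by_file.values()),
    index=list(counts_by_file.keys()),
).fillna(0).astype(int)

print(df)
```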

Pandas: Replace dataframe columns based on Boolean list/dict

I have two pandas data frames that I would like to merge, but not in the way shown in the examples I've been able to find. I have a set of "old" data and a set of "new" data, in two data frames of equal shape with the same column names. After some analysis, I determine that I need to create a third dataset, taking some of the columns from the "old" data and some from the "new" data. As an example, let's say I have these two datasets:
df_old = pd.DataFrame(np.zeros([5,5]),columns=list('ABCDE'))
df_new = pd.DataFrame(np.ones([5,5]),columns=list('ABCDE'))
which are simply:
A B C D E
0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0
and
A B C D E
0 1.0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0 1.0
I do some analysis and find that I want to replace columns B and D. I can do that in a loop like this:
replace = dict(A=False,B=True,C=False,D=True,E=False)
df = pd.DataFrame({})
for k,v in sorted(replace.items()):
    df[k] = df_new[k] if v else df_old[k]
This gives me the data that I want:
A B C D E
0 0.0 1.0 0.0 1.0 0.0
1 0.0 1.0 0.0 1.0 0.0
2 0.0 1.0 0.0 1.0 0.0
3 0.0 1.0 0.0 1.0 0.0
4 0.0 1.0 0.0 1.0 0.0
but, this honestly seems a bit clunky, and I'd imagine that there is a better way to use pandas to do this. Plus, I'd like to preserve the order of my columns which may not be in alphabetical order like this example dataset, so sorting the dictionary may not be the way to go, although I could probably pull the column names from the data set if need be.
Is there a better way to do this using some of Pandas merge functionality?
A really rudimentary approach would just be to filter the Boolean dict and then assign directly.
to_rep = [k for k in replace if replace[k]]
df_old[to_rep] = df_new[to_rep]
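If you want to preserve your old DataFrame, you can use assign():
df_old.assign(**{k: df_new[k] for k in replace if replace[k]})
As mentioned by Nickil, assign() did not preserve argument order in older versions of pandas, because the keyword arguments arrive as a dict; it inserted newly created columns in alphabetical order at the end of the DataFrame. Since pandas 0.23 (on Python 3.6+), keyword order is preserved. Either way, columns that already exist, as B and D do here, are overwritten in place and keep their position.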
Demo
>>> df_old.assign(**{k: df_new[k] for k in replace if replace[k]})
A B C D E
0 0.0 1.0 0.0 1.0 0.0
1 0.0 1.0 0.0 1.0 0.0
2 0.0 1.0 0.0 1.0 0.0
3 0.0 1.0 0.0 1.0 0.0
4 0.0 1.0 0.0 1.0 0.0
Simply assign the new columns that you need:
df_old['B'] = df_new['B']
df_old['D'] = df_new['D']
Or, working on a copy so that df_old is left untouched:
df_changes = df_old.copy()
df_changes[['B', 'D']] = df_new[['B', 'D']]
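To address the column-order concern from the question, the list of columns to replace can be derived from df_old.columns rather than from the dict, so nothing is sorted. A small sketch; the reversed column order and the name to_replace are made up to show that the order is kept:

```python
import numpy as np
import pandas as pd

# Columns deliberately not in alphabetical order.
df_old = pd.DataFrame(np.zeros([5, 5]), columns=list("EDCBA"))
df_new = pd.DataFrame(np.ones([5, 5]), columns=list("EDCBA"))
replace = dict(A=False, B=True, C=False, D=True, E=False)

# Walk the existing columns in their own order; a column missing from
# the dict is treated as "do not replace".
to_replace = [col for col in df_old.columns if replace.get(col, False)]

df = df_old.copy()                   # leave df_old untouched
df[to_replace] = df_new[to_replace]  # overwrite in place, order unchanged
print(df)
```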
