I have two pandas data frames that I would like to merge together, but not in the way I've seen in the examples I've been able to find. I have a set of "old" data and a set of "new" data in two data frames that are equal in shape and have the same column names. I do some analysis and determine that I need to create a third dataset, taking some of the columns from the "old" data and some from the "new" data. As an example, let's say I have these two datasets:
df_old = pd.DataFrame(np.zeros([5,5]),columns=list('ABCDE'))
df_new = pd.DataFrame(np.ones([5,5]),columns=list('ABCDE'))
which are simply:
A B C D E
0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0
and
A B C D E
0 1.0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0 1.0
I do some analysis and find that I want to replace columns B and D. I can do that in a loop like this:
replace = dict(A=False,B=True,C=False,D=True,E=False)
df = pd.DataFrame({})
for k,v in sorted(replace.items()):
    df[k] = df_new[k] if v else df_old[k]
This gives me the data that I want:
A B C D E
0 0.0 1.0 0.0 1.0 0.0
1 0.0 1.0 0.0 1.0 0.0
2 0.0 1.0 0.0 1.0 0.0
3 0.0 1.0 0.0 1.0 0.0
4 0.0 1.0 0.0 1.0 0.0
but this honestly seems a bit clunky, and I'd imagine there is a better way to do this with pandas. Plus, I'd like to preserve the order of my columns, which may not be alphabetical as in this example dataset, so sorting the dictionary may not be the way to go, although I could probably pull the column names from the dataset if need be.
Is there a better way to do this using some of Pandas merge functionality?
A really rudimentary approach would just be to filter the Boolean dict and then assign directly.
to_rep = [k for k in replace if replace[k]]
df_old[to_rep] = df_new[to_rep]
If you wanted to preserve your old DataFrame, you could use assign()
df_old.assign(**{k: df_new[k] for k in replace if replace[k]})
As mentioned by Nickil, assign() evidently doesn't preserve argument order since we're passing a dict. However, to be predictable, it inserts newly created columns at the end of your DataFrame in alphabetical order; columns that already exist, like B and D here, are simply overwritten in place, which is why the demo keeps the original column order.
Demo
>>> df_old.assign(**{k: df_new[k] for k in replace if replace[k]})
A B C D E
0 0.0 1.0 0.0 1.0 0.0
1 0.0 1.0 0.0 1.0 0.0
2 0.0 1.0 0.0 1.0 0.0
3 0.0 1.0 0.0 1.0 0.0
4 0.0 1.0 0.0 1.0 0.0
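If you prefer a fully vectorized route that also preserves df_old's column order, here is a minimal sketch (my own, not part of the answers above, assuming the same df_old/df_new/replace as in the question): broadcast the Boolean dict into a mask and let DataFrame.where pick old vs. new values.
import numpy as np
import pandas as pd

df_old = pd.DataFrame(np.zeros([5, 5]), columns=list('ABCDE'))
df_new = pd.DataFrame(np.ones([5, 5]), columns=list('ABCDE'))
replace = dict(A=False, B=True, C=False, D=True, E=False)

keep_old = ~pd.Series(replace).reindex(df_old.columns)     # True -> keep the old value
mask = np.broadcast_to(keep_old.to_numpy(), df_old.shape)  # repeat the column flags for every row
df = df_old.where(mask, df_new)                            # take df_new wherever mask is False
Because the mask is built from df_old.columns, no sorting is involved and the result keeps whatever column order df_old already has.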
Simply assign the new columns that you need:
df_old['B'] = df_new['B']
df_old['D'] = df_new['D']
Or as one line:
df_changes = df_old.copy()
df_changes[['B', 'D']] = df_new[['B', 'D']]
Hi friends, I'm new here 😊.
I want to make a matrix from the most repeated words in a specific column A and add it to my data frame, with the found words as the column labels.
What I have:
raw_data={"A":["This is yellow","That is green","These are orange","This is a pen","This is an Orange"]}
df=pr.DataFrame(raw_data)
What is my goal:
I want to do:
1- Split the strings & count the words in the specific column
2- Make a zero matrix
3- Label the new matrix with the words found in step 1 (my problem)
4- Search every row: if the word is found then 1, else 0
The new data frame that I currently get as a result:
A word_count char_count 0 1 2 3 4 5 6 7 8 9 10 11
0 This is yellow 3 14 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 That is green 3 13 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2 These are orange 3 16 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
3 This is a pen 4 13 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
4 This is an Orange 4 17 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0
What I did:
import pandas as pd
import numpy as np
# 1- Data frame
raw_data={"A":["This is yellow","That is green","These are orange","This is a pen","This is an Orange"]}
df=pd.DataFrame(raw_data)
df
## 2- Count the words and characters in every row of column "A"
df['word_count'] = df['A'].agg(lambda x: len(x.split(" ")))
df['char_count'] = df['A'].agg(lambda x:len(x))
df
# 3- Counting the separated words and the frequency of repetition
df_word_count=pd.DataFrame(df.A.str.split(' ').explode().value_counts()).reset_index().rename({'index':"A","A":"Count"},axis=1)
display(df_word_count)
df_word_count=list(df_word_count["A"])
len(df_word_count)
A Count
0 is 4
1 This 3
2 orange 1
3 That 1
4 yellow 1
5 Orange 1
6 are 1
7 a 1
8 an 1
9 These 1
10 green 1
11 pen 1
# 4- Make a ZERO-Matrix
allfeatures=np.zeros((df.shape[0],len(df_word_count)))
allfeatures.shape
# 5- Make a For-Loop
for i in range(len(df_word_count)):
    allfeatures[:,i]=df['A'].agg(lambda x:x.split().count(df_word_count[i]))
# 6- Concat the data
Complete_data=pd.concat([df,pd.DataFrame(allfeatures)],axis=1)
display(Complete_data)
What I wanted:
The Words in "A" in step 3 should be label of new matrix instead 0 1 2 ...
A word_count char_count is This orange etc.
0 This is yellow 3 14 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 That is green 3 13 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2 These are orange 3 16 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
3 This is a pen 4 13 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
4 This is an Orange 4 17 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0
So I changed your code a little; your step 3 now looks like this:
# 3- Counting the separated words and the frequency of repetition
df_word_count=pd.DataFrame(df.A.str.split(' ').explode().value_counts()).reset_index().rename({'index':"A","A":"Count"},axis=1)
display(df_word_count)
list_word_count=list(df_word_count["A"])
len(list_word_count)
The big change is the name of the variable in list_word_count=list(df_word_count["A"]).
The rest of the code looks like this with the new variable:
# 4- Make a ZERO-Matrix
allfeatures=np.zeros((df.shape[0],len(list_word_count)))
allfeatures.shape
# 5- Make a For-Loop
for i in range(len(list_word_count)):
    allfeatures[:,i]=df['A'].agg(lambda x:x.split().count(list_word_count[i]))
# 6- Concat the data
Complete_data=pd.concat([df,pd.DataFrame(allfeatures)],axis=1)
display(Complete_data)
The only change is the different variable name. Then I add a seventh step:
# 7- Change the column names from a list
# This creates a list of the words you wanted
l = list(df_word_count["A"])
# By itself this list contains only the words found in column A, but the result
# you showed also includes columns such as word_count and char_count, so we add
# those by inserting them at the beginning of the list.
l.insert(0,"char_count")
l.insert(0,"word_count")
l.insert(0,"A")
# Finally, I rename all the columns with the names that I have in the list l
Complete_data.columns = l
With that, I get the DataFrame you wanted, with the found words as the column labels.
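As an aside, here is a shorter sketch (my own, not part of the code above) that builds the labelled word matrix directly with Series.str.get_dummies; note it records only presence as 0/1 (which matches the example values) and orders the word columns alphabetically rather than by frequency.
import pandas as pd

raw_data = {"A": ["This is yellow", "That is green", "These are orange", "This is a pen", "This is an Orange"]}
df = pd.DataFrame(raw_data)
df['word_count'] = df['A'].str.split().str.len()   # words per row
df['char_count'] = df['A'].str.len()               # characters per row

word_matrix = df['A'].str.get_dummies(sep=' ')     # one labelled 0/1 column per distinct word
Complete_data = pd.concat([df, word_matrix], axis=1)
print(Complete_data)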
I have a dataframe which looks like the one below:
df
column_A column_B
0 0.0 0.0
1 0.0 0.0
2 0.0 1.0
3 0.0 0.0
4 0.0 0.0
5 1.0 0.0
I want to create an if condition like this (pseudocode):
if (df['column_A'] == 0.0) & (df['column_B'] == 0.0):
    df['label'] = 'OK'
else:
    df['label'] = 'NO'
I tried this:
if((0.0 in df['column_A'] ) & (0.0 in df['column_B']))
for index, row in df.iterrows():
(df[((df['column_A'] == 0.0) & (df['column_B']== 0.0))])
Nothing really gave the expected outcome.
I expect my output to be:
column_A column_B label
0 0.0 0.0 OK
1 0.0 0.0 OK
2 0.0 1.0 NO
3 0.0 0.0 OK
4 0.0 0.0 OK
5 1.0 0.0 NO
You can use np.where to create an array of either OK or NO depending on the result of the condition (adding the two columns works here because the values are non-negative, so their sum is 0 only when both are 0):
import numpy as np
df['label'] = np.where(df.column_A.add(df.column_B).eq(0), 'OK', 'NO')
column_A column_B label
0 0.0 0.0 OK
1 0.0 0.0 OK
2 0.0 1.0 NO
3 0.0 0.0 OK
4 0.0 0.0 OK
5 1.0 0.0 NO
Use numpy.where with DataFrame.any:
# solution if values are only 1.0 / 0.0
df['label'] = np.where(df[['column_A', 'column_B']].any(axis=1), 'NO','OK')
# general solution comparing with 0
#df['label'] = np.where(df[['column_A', 'column_B']].eq(0).all(axis=1),'OK','NO')
print(df)
column_A column_B label
0 0.0 0.0 OK
1 0.0 0.0 OK
2 0.0 1.0 NO
3 0.0 0.0 OK
4 0.0 0.0 OK
5 1.0 0.0 NO
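A third, equivalent sketch along the same lines: compute the Boolean condition once and map it to the two labels.
import pandas as pd

df = pd.DataFrame({'column_A': [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
                   'column_B': [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]})
ok = df[['column_A', 'column_B']].eq(0).all(axis=1)   # True where both columns are 0
df['label'] = ok.map({True: 'OK', False: 'NO'})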
Let's say we are given a data frame like the following one:
import pandas as pd
import numpy as np
a = ['a', 'b']
b = ['i', 'ii']
mi = pd.MultiIndex.from_product([a,b], names=['first', 'second'])
A = pd.DataFrame(np.zeros([3,4]), columns=mi)
first a b
second i ii i ii
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
I would like to create a new column iii under every first-level column and assign the values of a new array (of matching size). I tried the following, to no avail.
A.loc[:,pd.IndexSlice[:,'iii']] = np.arange(6).reshape(3,-1)
The result should look like this:
a b
i ii iii i ii iii
0 0.0 0.0 0.0 0.0 0.0 1.0
1 0.0 0.0 2.0 0.0 0.0 3.0
2 0.0 0.0 4.0 0.0 0.0 5.0
Since you have a MultiIndex in the columns, I recommend creating an additional DataFrame to append, then concatenating it back:
appenddf=pd.DataFrame(np.arange(6).reshape(3,-1),
                      index=A.index,
                      columns=pd.MultiIndex.from_product([A.columns.levels[0],['iii']]))
appenddf
a b
iii iii
0 0 1
1 2 3
2 4 5
A=pd.concat([A,appenddf],axis=1).sort_index(level=0,axis=1)
A
first a b
second i ii iii i ii iii
0 0.0 0.0 0 0.0 0.0 1
1 0.0 0.0 2 0.0 0.0 3
2 0.0 0.0 4 0.0 0.0 5
Another workable solution:
for i,x in enumerate(A.columns.levels[0]):
    A[x,'iii']=np.arange(6).reshape(3,-1)[:,i]
A
first a b a b
second i ii i ii iii iii
0 0.0 0.0 0.0 0.0 0 1
1 0.0 0.0 0.0 0.0 2 3
2 0.0 0.0 0.0 0.0 4 5
# here I did not add `sort_index`
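A third possibility (a sketch of my own, not from the answer above): reindex the columns to include the new second-level label first, after which the slicer assignment from the question works because the 'iii' columns already exist.
import numpy as np
import pandas as pd

a = ['a', 'b']
b = ['i', 'ii']
mi = pd.MultiIndex.from_product([a, b], names=['first', 'second'])
A = pd.DataFrame(np.zeros([3, 4]), columns=mi)

new_cols = pd.MultiIndex.from_product([A.columns.levels[0], ['i', 'ii', 'iii']],
                                      names=A.columns.names)
A = A.reindex(columns=new_cols)                      # adds NaN-filled 'iii' columns in the right place
A.loc[:, pd.IndexSlice[:, 'iii']] = np.arange(6).reshape(3, -1)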
I am trying to create a Connect 4 game with a 6x7 array in Python, and I need column headers so that column 0 is named a, column 1 is named b, and so on. The purpose of this is for moves to be initiated by typing 'a' (drops a token in the first column), 'b' (drops a token in the second), etc. This is my code to create the array:
import numpy as np

def clear_board():
    board = np.zeros((6,7))
    return board
If you need column names, the easiest way is to use a pandas DataFrame instead of a NumPy array:
import numpy as np
import pandas as pd

def clear_board():
    board = pd.DataFrame(np.zeros((6,7)),columns=list('ABCDEFG'))
    return board
>>> clear_board()
A B C D E F G
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Beyond that, take a look at the options provided in this answer
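To tie this back to the typed-letter moves, a hypothetical drop_token helper (my own sketch, not part of the answer) could look like this:
import numpy as np
import pandas as pd

def clear_board():
    return pd.DataFrame(np.zeros((6, 7)), columns=list('ABCDEFG'))

def drop_token(board, letter, player=1):
    # hypothetical helper: drop a token into the lowest empty row of the chosen column
    col = letter.upper()
    empty_rows = board.index[board[col] == 0]   # rows still free in that column
    if len(empty_rows) == 0:
        raise ValueError(f"column {col} is full")
    board.loc[empty_rows.max(), col] = player   # fill from the bottom up
    return board

board = clear_board()
drop_token(board, 'a')   # player 1 drops a token in the first column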
I have multiple dictionaries that contain word frequency counts for a series of text files. I'm trying to find a way of collating them into a dataframe (so one dict = one text file = one row in the df), but I am fairly inexperienced with Python and unsure how to proceed.
I have approx 50 text files/dictionaries, but for simplicity say I have the following;
mydict = {'red': 2,'blue': 1,'yellow': 3}
mydict2 = {'blue': 1,'orange': 3,'red': 1}
mydict3 = {'purple': 1,'green': 3,'brown': 2}
How can I create a dataframe with the full list of colours as columns, the dictionaries/text files as rows, and the respective counts as the data points, with any colour not appearing in a particular file registered as zero?
I would have included a coding attempt, but I do not know how to begin with this task.
You can make a Series for each dict and then concat them:
mydicts = [mydict, mydict2, mydict3]
df = pd.concat([pd.Series(d) for d in mydicts], axis=1).fillna(0).T
df.index = ['mydict', 'mydict2', 'mydict3']
df
returns
blue brown green orange purple red yellow
mydict  1.0 0.0 0.0 0.0 0.0 2.0 3.0
mydict2 1.0 0.0 0.0 3.0 0.0 1.0 0.0
mydict3 0.0 2.0 3.0 0.0 1.0 0.0 0.0
Use pd.DataFrame.from_records():
In [6]: mydicts = [mydict, mydict2, mydict3]
In [7]: pd.DataFrame.from_records(mydicts).fillna(0)
Out[7]:
blue brown green orange purple red yellow
0 1.0 0.0 0.0 0.0 0.0 2.0 3.0
1 1.0 0.0 0.0 3.0 0.0 1.0 0.0
2 0.0 2.0 3.0 0.0 1.0 0.0 0.0
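For what it's worth, the plain pd.DataFrame constructor also accepts a list of dicts directly; a small sketch, with the index naming each source dictionary:
import pandas as pd

mydict = {'red': 2, 'blue': 1, 'yellow': 3}
mydict2 = {'blue': 1, 'orange': 3, 'red': 1}
mydict3 = {'purple': 1, 'green': 3, 'brown': 2}

df = pd.DataFrame([mydict, mydict2, mydict3],
                  index=['mydict', 'mydict2', 'mydict3']).fillna(0)
print(df)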