I have multiple dictionaries that contain word frequency counts for a series of text files. I'm trying to find a way of collating them into a dataframe (so one dict = one text file = one row in the df), but I am fairly inexperienced with Python and unsure how to proceed.
I have approximately 50 text files/dictionaries, but for simplicity say I have the following:
mydict = {'red': 2,'blue': 1,'yellow': 3}
mydict2 = {'blue': 1,'orange': 3,'red': 1}
mydict3 = {'purple': 1,'green': 3,'brown': 2}
How can I create a dataframe with the full list of colours as columns, the dictionaries/text files as rows, and the respective counts as the data points, with any colour not appearing in a particular dictionary registered as zero?
I would have included a coding attempt, however I do not know how to begin with the task.
You can make a Series for each dictionary and then pd.concat them:
import pandas as pd

mydicts = [mydict, mydict2, mydict3]
df = pd.concat([pd.Series(d) for d in mydicts], axis=1).fillna(0).T
df.index = ['mydict', 'mydict2', 'mydict3']
df
returns
blue brown green orange purple red yellow
mydict 1.0 0.0 0.0 0.0 0.0 2.0 3.0
mydict2 1.0 0.0 0.0 3.0 0.0 1.0 0.0
mydict3 0.0 2.0 3.0 0.0 1.0 0.0 0.0
Use pd.DataFrame.from_records():
In [6]: mydicts = [mydict, mydict2, mydict3]
In [7]: pd.DataFrame.from_records(mydicts).fillna(0)
Out[7]:
blue brown green orange purple red yellow
0 1.0 0.0 0.0 0.0 0.0 2.0 3.0
1 1.0 0.0 0.0 3.0 0.0 1.0 0.0
2 0.0 2.0 3.0 0.0 1.0 0.0 0.0
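Note that fillna(0) leaves float columns either way. If you want integer counts and named rows, a small variant works too (the plain pd.DataFrame constructor also accepts a list of dicts):
import pandas as pd

mydicts = [mydict, mydict2, mydict3]
df = pd.DataFrame(mydicts, index=['mydict', 'mydict2', 'mydict3']).fillna(0).astype(int)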
Hi friends, I'm new here 😊.
I want to make a matrix from the most repeated words in a specific column A and add it to my data frame, with the words found as the column labels.
What I have:
raw_data={"A":["This is yellow","That is green","These are orange","This is a pen","This is an Orange"]}
df=pd.DataFrame(raw_data)
What is my goal:
I want to do:
1- Separate the strings and count the words in a specific column
2- Make a zero matrix
3- Label the new matrix with the words found in step 1 (my problem)
4- Search every row: if the word is found, then 1, else 0
The new data frame I currently get as a result:
A word_count char_count 0 1 2 3 4 5 6 7 8 9 10 11
0 This is yellow 3 14 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 That is green 3 13 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2 These are orange 3 16 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
3 This is a pen 4 13 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
4 This is an Orange 4 17 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0
What I did:
import pandas as pd
import numpy as np
# 1- Data frame
raw_data={"A":["This is yellow","That is green","These are orange","This is a pen","This is an Orange"]}
df=pd.DataFrame(raw_data)
df
## 2- Count the words and characters in every row of column "A"
df['word_count'] = df['A'].agg(lambda x: len(x.split(" ")))
df['char_count'] = df['A'].agg(lambda x:len(x))
df
# 3- Counting the separated words and the frequency of repetition
df_word_count=pd.DataFrame(df.A.str.split(' ').explode().value_counts()).reset_index().rename({'index':"A","A":"Count"},axis=1)
display(df_word_count)
df_word_count=list(df_word_count["A"])
len(df_word_count)
A Count
0 is 4
1 This 3
2 orange 1
3 That 1
4 yellow 1
5 Orange 1
6 are 1
7 a 1
8 an 1
9 These 1
10 green 1
11 pen 1
# 4- Make a ZERO-Matrix
allfeatures=np.zeros((df.shape[0],len(df_word_count)))
allfeatures.shape
# 5- Make a For-Loop
for i in range(len(df_word_count)):
    allfeatures[:,i]=df['A'].agg(lambda x:x.split().count(df_word_count[i]))
# 5- Concat the data
Complete_data=pd.concat([df,pd.DataFrame(allfeatures)],axis=1)
display(Complete_data)
What I wanted:
The words in "A" from step 3 should be the labels of the new matrix instead of 0 1 2 ...
A word_count char_count is This orange etc.
0 This is yellow 3 14 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 That is green 3 13 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2 These are orange 3 16 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
3 This is a pen 4 13 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
4 This is an Orange 4 17 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0
So I changed your code a little; your step 3 now looks like this:
# 3- Counting the separated words and the frequency of repetition
df_word_count=pd.DataFrame(df.A.str.split(' ').explode().value_counts()).reset_index().rename({'index':"A","A":"Count"},axis=1)
display(df_word_count)
list_word_count=list(df_word_count["A"])
len(list_word_count)
The big change is the variable name in list_word_count=list(df_word_count["A"]).
The rest of the code looks like this with the new variable:
# 4- Make a ZERO-Matrix
allfeatures=np.zeros((df.shape[0],len(list_word_count)))
allfeatures.shape
# 5- Make a For-Loop
for i in range(len(list_word_count)):
    allfeatures[:,i]=df['A'].agg(lambda x:x.split().count(list_word_count[i]))
# 6- Concat the data
Complete_data=pd.concat([df,pd.DataFrame(allfeatures)],axis=1)
display(Complete_data)
The only change is the variable name. What I add is a seventh step:
# 7- change columns name from list
#This creates a list of the words you wanted
l = list(df_word_count["A"])
# If you display l, it shows only the words you have in column A.
# But the result dataset you wanted also had some columns with values
# such as word_count, etc. So we need to add those. We do this by
# inserting them at the beginning of the list:
l.insert(0,"char_count")
l.insert(0,"word_count")
l.insert(0,"A")
# Finally, I rename all the columns with the names that I have in the list l
Complete_data.columns = l
With that, I get the data frame with the words as column labels, as wanted.
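As an aside, if simple 0/1 presence flags are enough (as in your desired output), pandas' built-in Series.str.get_dummies can build the whole word matrix, with the words themselves as column labels, in one step. A minimal sketch on the same data:
import pandas as pd

raw_data = {"A": ["This is yellow", "That is green", "These are orange",
                  "This is a pen", "This is an Orange"]}
df = pd.DataFrame(raw_data)
df['word_count'] = df['A'].str.split().str.len()
df['char_count'] = df['A'].str.len()

# 0/1 indicator matrix, one column per distinct word
word_matrix = df['A'].str.get_dummies(sep=' ')
Complete_data = pd.concat([df, word_matrix], axis=1)
Note that 'orange' and 'Orange' stay separate columns, just as in your word list.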
I have a dataframe which looks like the one below:
df
column_A column_B
0 0.0 0.0
1 0.0 0.0
2 0.0 1.0
3 0.0 0.0
4 0.0 0.0
5 1.0 0.0
I want to create an if condition like this (pseudocode):
if (df['column_A'] == 0.0) & (df['column_B'] == 0.0):
    df['label'] = 'OK'
else:
    df['label'] = 'NO'
I tried this:
if((0.0 in df['column_A'] ) & (0.0 in df['column_B']))
for index, row in df.iterrows():
(df[((df['column_A'] == 0.0) & (df['column_B']== 0.0))])
Nothing really gave the expected outcome.
I expect my output to be:
column_A column_B label
0 0.0 0.0 OK
1 0.0 0.0 OK
2 0.0 1.0 NO
3 0.0 0.0 OK
4 0.0 0.0 OK
5 1.0 0.0 NO
You can use np.where in order to create an array with either OK or NO depending on the result of the condition. Since the columns only contain non-negative values, the sum of the two columns is zero exactly when both are zero:
import numpy as np
df['label'] = np.where(df.column_A.add(df.column_B).eq(0), 'OK', 'NO')
column_A column_B label
0 0.0 0.0 OK
1 0.0 0.0 OK
2 0.0 1.0 NO
3 0.0 0.0 OK
4 0.0 0.0 OK
5 1.0 0.0 NO
Use numpy.where with DataFrame.any:
# solution if the values are only 1.0 / 0.0
df['label'] = np.where(df[['column_A', 'column_B']].any(axis=1), 'NO','OK')
# general solution comparing with 0
#df['label'] = np.where(df[['column_A', 'column_B']].eq(0).all(axis=1),'OK','NO')
print(df)
column_A column_B label
0 0.0 0.0 OK
1 0.0 0.0 OK
2 0.0 1.0 NO
3 0.0 0.0 OK
4 0.0 0.0 OK
5 1.0 0.0 NO
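If you prefer to stay in pandas without numpy, a small equivalent sketch maps the boolean mask straight to the labels:
# True where both columns are 0, then map booleans to the two labels
mask = df[['column_A', 'column_B']].eq(0).all(axis=1)
df['label'] = mask.map({True: 'OK', False: 'NO'})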
I have a DataFrame (df1) as given below
Hair Feathers Legs Type Count
R1 1 NaN 0 1 1
R2 1 0 NaN 1 32
R3 1 0 2 1 4
R4 1 NaN 4 1 27
I want to merge rows based on different combinations of the values in each column, and also add up the count values for each merged row. The resultant dataframe (df2) will look like this:
Hair Feathers Legs Type Count
R1 1 0 0 1 33
R2 1 0 2 1 36
R3 1 0 4 1 59
The merging is performed in such a way that any NaN value is merged with a 0 or a 1. In df2, R1 is calculated by merging the NaN value of Feathers (df1, R1) with the 0 value of Feathers (df1, R2); similarly, the 0 in Legs (df1, R1) is merged with the NaN value of Legs (df1, R2), and then the counts of R1 (1) and R2 (32) are added. In the same manner, R2 and R3 are merged because the Feathers value in R2 (df1) matches R3 (df1) and the NaN in Legs is merged with the 2 in R3 (df1), adding the counts of R2 (32) and R3 (4).
I hope the explanation makes sense. Any help will be highly appreciated.
A possible way to do it is to replicate each row containing NaN and fill the gaps with the possible values for that column.
First, we need to get the possible not-null unique values per column:
unique_values = df.iloc[:, :-1].apply(
    lambda x: x.dropna().unique().tolist(), axis=0).to_dict()
> unique_values
{'Hair': [1.0], 'Feathers': [0.0], 'Legs': [0.0, 2.0, 4.0], 'Type': [1.0]}
Then iterate through each row of the dataframe and replace each NaN by the possible values for its column. We can do this using pandas.DataFrame.iterrows:
mask = df.iloc[:, :-1].isnull().any(axis=1)
# Keep the rows that do not contain NaN,
# then append the modified rows
list_of_df = [r for i, r in df[~mask].iterrows()]
for row_index, row in df[mask].iterrows():
    for c in row[row.isnull()].index:
        # For each null column of the row, replace
        # NaN by each possible value for that column
        for v in unique_values[c]:
            list_of_df.append(row.copy().fillna({c: v}))
df_res = pd.concat(list_of_df, axis=1, ignore_index=True).T
The result is a dataframe where all the NaN have been filled with possible values for the column:
> df_res
Hair Feathers Legs Type Count
0 1.0 0.0 2.0 1.0 4.0
1 1.0 0.0 0.0 1.0 1.0
2 1.0 0.0 0.0 1.0 32.0
3 1.0 0.0 2.0 1.0 32.0
4 1.0 0.0 4.0 1.0 32.0
5 1.0 0.0 4.0 1.0 27.0
To get the final result of Count grouping by possible combinations of ['Hair', 'Feathers', 'Legs', 'Type'] we just need to do:
> df_res.groupby(['Hair', 'Feathers', 'Legs', 'Type']).sum().reset_index()
Hair Feathers Legs Type Count
0 1.0 0.0 0.0 1.0 33.0
1 1.0 0.0 2.0 1.0 36.0
2 1.0 0.0 4.0 1.0 59.0
Hope it helps.
UPDATE
If one or more elements of a row are missing, the procedure looks for all possible combinations of the missing values at the same time. Let us add a new row with two elements missing:
> df
Hair Feathers Legs Type Count
0 1.0 NaN 0.0 1.0 1.0
1 1.0 0.0 NaN 1.0 32.0
2 1.0 0.0 2.0 1.0 4.0
3 1.0 NaN 4.0 1.0 27.0
4 1.0 NaN NaN 1.0 32.0
We proceed in a similar way, but the replacement combinations are obtained using itertools.product:
import itertools
unique_values = df.iloc[:, :-1].apply(
    lambda x: x.dropna().unique().tolist(), axis=0).to_dict()
mask = df.iloc[:, :-1].isnull().any(axis=1)
list_of_df = [r for i, r in df[~mask].iterrows()]
for row_index, row in df[mask].iterrows():
    cols = row[row.isnull()].index.tolist()
    for p in itertools.product(*[unique_values[c] for c in cols]):
        list_of_df.append(row.copy().fillna({c: v for c, v in zip(cols, p)}))
df_res = pd.concat(list_of_df, axis=1, ignore_index=True).T
> df_res.sort_values(['Hair', 'Feathers', 'Legs', 'Type']).reset_index(drop=True)
Hair Feathers Legs Type Count
1 1.0 0.0 0.0 1.0 1.0
2 1.0 0.0 0.0 1.0 32.0
6 1.0 0.0 0.0 1.0 32.0
0 1.0 0.0 2.0 1.0 4.0
3 1.0 0.0 2.0 1.0 32.0
7 1.0 0.0 2.0 1.0 32.0
4 1.0 0.0 4.0 1.0 32.0
5 1.0 0.0 4.0 1.0 27.0
8 1.0 0.0 4.0 1.0 32.0
> df_res.groupby(['Hair', 'Feathers', 'Legs', 'Type']).sum().reset_index()
Hair Feathers Legs Type Count
0 1.0 0.0 0.0 1.0 65.0
1 1.0 0.0 2.0 1.0 68.0
2 1.0 0.0 4.0 1.0 91.0
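To see what the product step generates for the last row (where both Feathers and Legs are missing), here is a quick illustration using the unique_values shown earlier:
import itertools

feathers = [0.0]         # unique_values['Feathers']
legs = [0.0, 2.0, 4.0]   # unique_values['Legs']
print(list(itertools.product(feathers, legs)))
# [(0.0, 0.0), (0.0, 2.0), (0.0, 4.0)]
These three combinations are exactly the three replacement rows appended for that row above.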
I have two pandas data-frames that I would like to merge together, but not in the way I've seen in the examples I've been able to find. I have a set of "old" data and a set of "new" data in two data frames that are equal in shape, with the same column names. I do some analysis and determine that I need to create a third dataset, taking some of the columns from the "old" data and some from the "new" data. As an example, let's say I have these two datasets:
import numpy as np
import pandas as pd

df_old = pd.DataFrame(np.zeros([5,5]),columns=list('ABCDE'))
df_new = pd.DataFrame(np.ones([5,5]),columns=list('ABCDE'))
which are simply:
A B C D E
0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0
and
A B C D E
0 1.0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0 1.0
I do some analysis and find that I want to replace columns B and D. I can do that in a loop like this:
replace = dict(A=False,B=True,C=False,D=True,E=False)
df = pd.DataFrame({})
for k,v in sorted(replace.items()):
    df[k] = df_new[k] if v else df_old[k]
This gives me the data that I want:
A B C D E
0 0.0 1.0 0.0 1.0 0.0
1 0.0 1.0 0.0 1.0 0.0
2 0.0 1.0 0.0 1.0 0.0
3 0.0 1.0 0.0 1.0 0.0
4 0.0 1.0 0.0 1.0 0.0
but this honestly seems a bit clunky, and I'd imagine there is a better way to do this with pandas. Plus, I'd like to preserve the order of my columns, which may not be alphabetical like in this example, so sorting the dictionary may not be the way to go (although I could pull the column names from the dataset if need be).
Is there a better way to do this using some of Pandas merge functionality?
A really rudimentary approach is just to filter the Boolean dict and then assign directly:
to_rep = [k for k in replace if replace[k]]
df_old[to_rep] = df_new[to_rep]
If you wanted to preserve your old DataFrame, you could use assign()
df_old.assign(**{k: df_new[k] for k in replace if replace[k]})
As mentioned by Nickil, on older Python versions assign() didn't guarantee keyword-argument order when passed a dict; on Python 3.6+ the order is preserved. Since B and D already exist in df_old, assign() simply overwrites them in place, so the column order is preserved.
Demo
>>> df_old.assign(**{k: df_new[k] for k in replace if replace[k]})
A B C D E
0 0.0 1.0 0.0 1.0 0.0
1 0.0 1.0 0.0 1.0 0.0
2 0.0 1.0 0.0 1.0 0.0
3 0.0 1.0 0.0 1.0 0.0
4 0.0 1.0 0.0 1.0 0.0
Simply assign the new columns that you need:
df_old['B'] = df_new['B']
df_old['D'] = df_new['D']
Or, assigning both columns at once on a copy:
df_changes = df_old.copy()
df_changes[['B', 'D']] = df_new[['B', 'D']]
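If preserving the original column order matters, as the question notes, one more sketch (reusing the replace dict from the question) builds the result column by column without mutating either frame:
import pandas as pd

# Walk df_old's columns in order, picking each column
# from df_new where the replace flag is True
df = pd.concat(
    [df_new[c] if replace.get(c, False) else df_old[c] for c in df_old.columns],
    axis=1,
)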
Pandas newbie here.
I'm trying to create a new column in my data frame that will serve as a training label when I feed this into a classifier.
The value of the label column is 1.0 if a given Id has (Value1 > 0) or (Value2 > 0) for Apples or Pears, and 0.0 otherwise.
My dataframe is row indexed by Id and looks like this:
Out[30]:
Value1 Value2 \
ProductName 7Up Apple Cheetos Onion Pear PopTart 7Up
ProductType Drinks Groceries Snacks Groceries Groceries Snacks Drinks
Id
100 0.0 1.0 2.0 4.0 0.0 0.0 0.0
101 3.0 0.0 0.0 0.0 3.0 0.0 4.0
102 0.0 0.0 0.0 0.0 0.0 2.0 0.0
ProductName Apple Cheetos Onion Pear PopTart
ProductType Groceries Snacks Groceries Groceries Snacks
Id
100 1.0 3.0 3.0 0.0 0.0
101 0.0 0.0 0.0 2.0 0.0
102 0.0 0.0 0.0 0.0 1.0
If the pandas wizards could give me a hand with the syntax for this operation - my mind is struggling to put it all together.
Thanks!
The answer provided by @vlad.rad works, but it is not very efficient, since pandas has to loop over all rows in Python and cannot take advantage of numpy's vectorized speedups. The following vectorized solution should be more efficient:
# .any(axis=1) reduces the per-product comparison to one boolean per Id;
# to restrict the check to particular products, select those columns first,
# e.g. df['Value1'][['Apple', 'Pear']]
condition = (df['Value1'] > 0).any(axis=1) | (df['Value2'] > 0).any(axis=1)
df.loc[condition, 'label'] = 1.
df.loc[~condition, 'label'] = 0.
Define your function:
def new_column(x):
    # x['Value1'] selects all Value1 sub-columns of the row,
    # so reduce with .any() to get a single boolean
    if (x['Value1'] > 0).any():
        return 1.0
    if (x['Value2'] > 0).any():
        return 1.0
    return 0.0
Apply it to your data:
df['label'] = df.apply(new_column, axis=1)
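Since the label is just the truth value of the condition, the two .loc assignments in the first answer can also be collapsed. A small variant, under the same assumptions about the Value1/Value2 column levels:
condition = (df['Value1'] > 0).any(axis=1) | (df['Value2'] > 0).any(axis=1)
df['label'] = condition.astype(float)  # True -> 1.0, False -> 0.0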