I want to fill the null values in the first column based on the value of the second column. For example:
For "Apples" in col2, the value should be 12 in place of NaN in col1.
For "Vegies" in col2, the value should be 134 in place of NaN in col1.
For every description, there is a specific code (number) in the first column. I need to map it somehow.
All I can think of is to make a dictionary of codes and replace the nulls, but that's very hardcoded.
Can anyone help?
col1   col2
12     Apple
134    Vegies
23     Oranges
NaN    Apples
NaN    Vegies
324    Sugar
NaN    Apples
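One non-hardcoded possibility (a minimal sketch, assuming the frame is called df and that every description with a missing code also appears somewhere with its code filled in) is to build the lookup from the rows that already have a code and fill the gaps with it:
# build a description -> code lookup from the rows that already have a code,
# then fill the missing codes by mapping col2 through that lookup
lookup = df.dropna(subset=['col1']).drop_duplicates('col2').set_index('col2')['col1']
df['col1'] = df['col1'].fillna(df['col2'].map(lookup))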
Update
Here I replicate your DataFrame and show the implementation:
import pandas as pd
import numpy as np
l1 = [12, 134, 23, np.nan, np.nan, 324, np.nan, np.nan, np.nan, np.nan]
l2 = ["Apple", "Vegies", "Oranges", "Apples", "Vegies", "Sugar", "Apples", "Melon", "Melon", "Grapes"]
df = pd.DataFrame(l1, columns=["col1"])
df["col2"] = l2
df
Out[26]:
col1 col2
0 12.0 Apple
1 134.0 Vegies
2 23.0 Oranges
3 NaN Apples
4 NaN Vegies
5 324.0 Sugar
6 NaN Apples
7 NaN Melon
8 NaN Melon
9 NaN Grapes
Then, to replace the null values based on your rules:
df.loc[df.col2 == "Vegies", 'col1'] = 134
df.loc[df.col2 == "Apple", 'col1'] = 12
If you want to apply this at a larger scale, consider making a dictionary first, for example:
item_dict = {"Apples":12, "Melon":65, "Vegies":134, "Grapes":78}
Then apply all of these to your dataframe with this custom function:
def item_mapping(df, dictionary, colsource, coltarget):
    dict_keys = list(dictionary.keys())
    dict_values = list(dictionary.values())
    for x in range(len(dict_keys)):
        df.loc[df[colsource] == dict_keys[x], coltarget] = dict_values[x]
    return df
Usage Examples:
item_mapping(df, item_dict, "col2", "col1")
col1 col2
0 12.0 Apple
1 134.0 Vegies
2 23.0 Oranges
3 12.0 Apples
4 134.0 Vegies
5 324.0 Sugar
6 12.0 Apples
7 65.0 Melon
8 65.0 Melon
9 78.0 Grapes
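For larger dictionaries you can also skip the loop entirely (a minimal sketch, reusing the same df and item_dict as above): map col2 through the dictionary and use the result to fill only the missing codes.
# map each description to its code, then use that as the fill values for col1;
# rows whose description is not in item_dict simply stay NaN
df['col1'] = df['col1'].fillna(df['col2'].map(item_dict))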
Related
I want to create a unique dataset of fruits. I don't know all the types (e.g. colour, store, price) that could be under each fruit. For each type, there could also be duplicate rows. Is there a way to detect all possible duplicates and capture all unique information in a fully generalisable way?
   type    val        detail
0  fruit   apple      NaN
1  colour  green      greenish
2  colour  yellow     NaN
3  store   walmart    usa
4  price   10         NaN
5  NaN     NaN        NaN
6  fruit   banana     NaN
7  colour  yellow     NaN
8  fruit   pear       NaN
9  fruit   jackfruit  NaN
...
Expected Output
   fruit      colour           store      price  detail           ...
0  apple      [green, yellow]  [walmart]  [10]   [greenish, usa]
1  banana     [yellow]         NaN        NaN    NaN
2  pear       NaN              NaN        NaN    NaN
3  jackfruit  NaN              NaN        NaN    NaN
I tried the following, but it does not get close to the expected output. It does not show the column names either.
df.groupby("type")["val"].agg(size=len, set=lambda x: set(x))
0 fruit {"apple",...}
1 colour ...
First, create a fruit column from the val values where type is fruit, replacing non-matched values with NaN and forward-filling the missing values. Then pivot with DataFrame.pivot_table using a custom function that keeps unique values without NaNs, and finally flatten the MultiIndex:
m = df['type'].eq('fruit')
df['fruit'] = df['val'].where(m).ffill()

df1 = (df.pivot_table(index='fruit', columns='type',
                      aggfunc=lambda x: list(dict.fromkeys(x.dropna())))
         .drop('fruit', axis=1, level=1))
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
print(df1)
detail_colour detail_price detail_store val_colour val_price \
fruit
apple [greenish] [] [usa] [green, yellow] [10]
banana [] NaN NaN [yellow] NaN
jackfruit NaN NaN NaN NaN NaN
pear NaN NaN NaN NaN NaN
val_store
fruit
apple [walmart]
banana NaN
jackfruit NaN
pear NaN
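As a side note on the aggfunc used above: list(dict.fromkeys(...)) keeps the first-seen order while dropping duplicates, which a plain set() would not. A tiny illustration with made-up values:
# dict.fromkeys preserves insertion order while removing duplicates,
# unlike set(), which does not guarantee any order
vals = ['green', 'yellow', 'green']
print(list(dict.fromkeys(vals)))   # ['green', 'yellow']
print(list(set(vals)))             # order not guaranteed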
Shown below is a dataframe where column col2 contains many NaNs. I want to fill only those NaN values by using col1 as the key into the dictionary dict_map and mapping the corresponding value into col2.
Reproducible code:
import pandas as pd
import numpy as np
dict_map = {'a':45,'b':23,'c':97,'z': -1}
df = pd.DataFrame()
df['tag'] = [1,2,3,4,5,6,7,8,9,10,11]
df['col1'] = ['a','b','c','b','a','a','z','c','b','c','b']
df['col2'] = [np.nan,909,34,56,np.nan,45,np.nan,11,61,np.nan,np.nan]
df['_'] = df['col1'].map(dict_map)
One method to produce the expected output is:
df['col3'] = np.where(df['col2'].isna(),df['_'],df['col2'])
df
I just wanted to know whether there is any other method, using a function and map, with which we can optimize this.
You can map col1 with your dict_map and then use that as input to fillna, as follows:
df['col3'] = df['col2'].fillna(df['col1'].map(dict_map))
You can achieve the very same result using a list comprehension; it is a very Pythonic solution, though for large frames the vectorized fillna approach above will usually scale better.
We simply read col2 and copy its value to col3 if it is not NaN. If it is, we look up col1's value as the key in dict_map and use the corresponding value instead.
df['col3'] = [df['col2'][idx] if not np.isnan(df['col2'][idx]) else dict_map[df['col1'][idx]] for idx in df.index.tolist()]
Output:
df
tag col1 col2 col3
0 1 a NaN 45.0
1 2 b 909.0 909.0
2 3 c 34.0 34.0
3 4 b 56.0 56.0
4 5 a NaN 45.0
5 6 a 45.0 45.0
6 7 z NaN -1.0
7 8 c 11.0 11.0
8 9 b 61.0 61.0
9 10 c NaN 97.0
10 11 b NaN 23.0
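For comparison, yet another vectorized one-liner (a sketch using the same df and dict_map) is Series.combine_first, which likewise keeps the existing col2 values and only fills the gaps:
# combine_first keeps col2 where it is not NaN and falls back to the mapped
# value from col1 otherwise - equivalent to the fillna approach above
df['col3'] = df['col2'].combine_first(df['col1'].map(dict_map))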
My dataframe is currently wide with many columns after the for statement below is executed. I want to stack multiple columns of data so that the dataframe is long, and remove any blank rows from col4 before the output dataframe is generated. The reason for the latter part (removing blanks before the output is generated) is that the dataframe will be too large for any output to be created with the blank values included.
code:
# dataframe
df0 = pd.DataFrame(data={'col1': [123, 123, 456, 456],
                         'col2': ['one two three', 'green yellow',
                                  'four five six', 'green yellow']})

# words to search for
search_words1 = ['one', 'three', 'four', 'six', 'green yellow']

# create columns for each search word and indicate if search word is found for each row
for n in search_words1:
    df0[n] = np.where(df0['col2'].str.contains(n), n, '')

# stack all search word columns created and remove blank rows in col4 before output is generated
df0 = pd.concat([
    df0[['col1']].melt(value_name='col3'),
    df0[['one', 'three', 'four', 'six', 'green yellow']].melt(value_name='col4')],
    axis=1)
df0.loc[:, ['col3', 'col4']]
current output:
col3 col4
0 123.0 one
1 123.0
2 456.0
3 456.0
4 NaN three
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN four
11 NaN
12 NaN
13 NaN
14 NaN six
15 NaN
16 NaN
17 NaN green yellow
18 NaN
19 NaN green yellow
desired output:
col3 col4
0 123.0 one
1 123.0 three
2 123.0 green yellow
3 456.0 four
4 456.0 six
5 456.0 green yellow
try this:
search_words1 = ['one','three','four','six','green yellow']
search_words1 = '|'.join(search_words1)
df0['col2'] = df0.col2.str.findall(search_words1)
df0.explode('col2')
>>>
col1 col2
0 123 one
0 123 three
1 123 green yellow
2 456 four
2 456 six
3 456 green yellow
df0['col2'] = df0.col2.str.findall(search_words1)
In this step, you will get the following result:
col1 col2
0 123 [one, three]
1 123 [green yellow]
2 456 [four, six]
3 456 [green yellow]
The last step, explode 'col2'
df0 = df0.explode('col2')
print(df0)
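If you also want the column names from your desired output (col3/col4) and a clean 0..n index, a small follow-up sketch after the explode could be:
# rename to match the desired output's column names and reset the index
out = df0.rename(columns={'col1': 'col3', 'col2': 'col4'}).reset_index(drop=True)
print(out)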
You can remove all NaN and blank values like so:
col3 = df0['col3']
col4 = df0['col4']
three = col3[col3.notna()]
four = col4[col4 != ""]
print(three, '\n', four)
out:
0 123.0
1 123.0
2 456.0
3 456.0
Name: col3, dtype: float64
0 one
4 three
10 four
14 six
17 green yellow
19 green yellow
Name: col4, dtype: object
Let's say I have two dataframes;
both have different lengths but the same number of columns.
df1 = pd.DataFrame({'country': ['Russia','Mexico','USA','Argentina','Denmark','Syngapore'],
'population': [41,12,26,64,123,24]})
df2 = pd.DataFrame({'country': ['Russia','Argentina','Australia','USA'],
'population': [44,12,23,64]})
Let's assume that some of the data in df1 is outdated and I've received a new dataframe that contains some new data, which may or may not already exist in the outdated dataframe.
I want to find out if any of the values of df2.country are inside df1.country.
By doing the following I'm able to return a boolean Series:
df = df1.country.isin(df2.country)
print(df)
Unfortunately this just creates a new Series containing the answer to my question:
0 True
1 False
2 True
3 True
4 False
5 False
Name: country, dtype: bool
My goal here is to delete the rows of df1 whose values match df2 and then add the new data, kind of like an update.
I've managed to come up with something like this:
df = df1.country.isin(df2.country)
i = 0
for x in df:
    if x:
        df1.drop(i, inplace=True)
    i += 1

frames = [df1, df2]
df1 = pd.concat(frames)
df1.reset_index(drop=True, inplace=True)
print(df1)
which in fact works and updates the dataframe:
country population
0 Mexico 12
1 Denmark 123
2 Syngapore 24
3 Russia 44
4 Argentina 12
5 Australia 23
6 USA 64
But I really believe there's a better way of doing the same thing that is quicker and more practical, considering that the real dataframe is much bigger and updates every few seconds.
I'd love to hear some suggestions. Thanks!
Assuming col1 remains unique in the original dataframe, you can join the two tables together. Once you have them in the same dataframe, you can apply your logic, i.e. update the value from the new dataframe if it is not null. You actually don't need to check whether col2 has changed for every entry in col1; you can simply take the new col2 value whenever it is not NaN (based on your sample output).
df1 = pd.DataFrame({'col1': ['a','f','r','g','d','s'], 'col2': [41,12,26,64,123,24]})
df2 = pd.DataFrame({'col1': ['a','g','o','r'], 'col2': [44,12,23,64]})
# do the join
x = pd.merge(df1, df2, how='outer',
             left_on="col1", right_on="col1")
col1 col2_x col2_y
0 a 41.0 44.0
1 f 12.0 NaN
2 r 26.0 64.0
3 g 64.0 12.0
4 d 123.0 NaN
5 s 24.0 NaN
6 o NaN 23.0
# apply your update rules
x['col2_x'] = np.where(
    ~x['col2_y'].isnull(),
    x['col2_y'], x['col2_x']
)
col1 col2_x col2_y
0 a 44.0 44.0
1 f 12.0 NaN
2 r 64.0 64.0
3 g 12.0 12.0
4 d 123.0 NaN
5 s 24.0 NaN
6 o 23.0 23.0
#clean up
x.drop("col2_y", axis=1, inplace = True)
x.columns = ["col1", "col2"]
col1 col2
0 a 44.0
1 f 12.0
2 r 64.0
3 g 12.0
4 d 123.0
5 s 24.0
6 o 23.0
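As a side note, the same update rule can be expressed with fillna on the merged frame (a sketch under the same column names), which reads a little more directly:
# take the new value (col2_y) where it exists, otherwise keep the old one (col2_x)
x['col2'] = x['col2_y'].fillna(x['col2_x'])
x = x.drop(columns=['col2_x', 'col2_y'])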
The isin approach is so close! Simply use the results from isin as a mask, then concat the rows from df1 that are not in (~) df2 with df2:
m = df1['country'].isin(df2['country'])
df3 = pd.concat((df1[~m], df2), ignore_index=True)
df3:
country population
0 Mexico 12
1 Denmark 123
2 Syngapore 24
3 Russia 44
4 Argentina 12
5 Australia 23
6 USA 64
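Another common pattern for this kind of update, if you prefer a single chain (a sketch in which keep='last' makes df2's rows win for duplicated countries), is concat followed by drop_duplicates on the key column:
# stack old and new, then keep only the most recent row per country
df3 = (pd.concat([df1, df2], ignore_index=True)
         .drop_duplicates(subset='country', keep='last')
         .reset_index(drop=True))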
I have a test df like this:
df = pd.DataFrame({'A': ['Apple','Apple', 'Apple','Orange','Orange','Orange','Pears','Pears'],
'B': [1,2,9,6,4,3,2,1]
})
A B
0 Apple 1
1 Apple 2
2 Apple 9
3 Orange 6
4 Orange 4
5 Orange 3
6 Pears 2
7 Pears 1
Now I need to add a new column with the respective % differences in col 'B'. How is this possible? I cannot get this to work.
I have looked at
update column value of pandas groupby().last()
Not sure that it is pertinent to my problem.
And this which looks promising
Pandas Groupby and Sum Only One Column
I need to compute the maximum percentage change in col 'B' per group of col 'A' and insert it into a column maxpercchng for all rows in the group.
So I have come up with this code:
grouppercchng = ((df.groupby['A'].max() - df.groupby['A'].min())/df.groupby['A'].iloc[0])*100
and try to add it to the group col 'maxpercchng' like so
group['maxpercchng'] = grouppercchng
Or like so
df_kpi_hot.groupby(['A'], as_index=False)['maxpercchng'] = grouppercchng
Does anyone know how to add the maxpercchng col to all rows in each group?
I believe you need transform, which returns a Series the same size as the original DataFrame, filled with the aggregated values:
g = df.groupby('A')['B']
df['maxpercchng'] = (g.transform('max') - g.transform('min')) / g.transform('first') * 100
print (df)
A B maxpercchng
0 Apple 1 800.0
1 Apple 2 800.0
2 Apple 9 800.0
3 Orange 6 50.0
4 Orange 4 50.0
5 Orange 3 50.0
6 Pears 2 50.0
7 Pears 1 50.0
Or:
g = df.groupby('A')['B']
df1 = ((g.max() - g.min()) / g.first() * 100).reset_index()
print (df1)
A B
0 Apple 800.0
1 Orange 50.0
2 Pears 50.0
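Equivalently, the whole calculation can be done in a single transform with a lambda (a sketch; a bit slower than the three vectorized transforms above, but arguably easier to read):
# one pass per group: (max - min) / first value * 100, broadcast back to every row
df['maxpercchng'] = df.groupby('A')['B'].transform(
    lambda s: (s.max() - s.min()) / s.iloc[0] * 100
)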