I've been looking for ways to do this natively for a little while now and can't find a solution.
I have a large dataframe where I would like to set the value in other_col to True for all rows where any of a list of columns is empty.
This works for a single column page_title:
df.loc[df['page_title'].isna(), ['other_col']] = ''
But not when using a list, since calling .isna() on several columns returns a boolean DataFrame rather than the boolean Series that .loc expects:
df.loc[df[['page_title','brand','name']].isna(), ['other_col']] = ''
Any ideas of how I could do this without using Numpy or looping through all rows?
Thanks
Maybe this is what you are looking for:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['1', '2', '3', np.nan],
    'B': ['10', np.nan, np.nan, '40'],
    'C': ['test', 'test', 'test', 'test']})
# Set C wherever A or B is NaN in that row
df.loc[df[['A', 'B']].isna().any(axis=1), ['C']] = 'value'
print(df)
Result:
A B C
0 1 10 test
1 2 NaN value
2 3 NaN value
3 NaN 40 value
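Applied to the columns from the question, the same pattern would look like this sketch (assuming other_col should end up holding True, as the question describes):
mask = df[['page_title', 'brand', 'name']].isna().any(axis=1)  # one flag per row
df.loc[mask, 'other_col'] = True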
This will let you choose which columns to check for np.nan and set a True/False indicator accordingly:
import numpy as np
import pandas as pd

data = {
    'Column1': [1, 2, 3, np.nan],
    'Column2': [1, 2, 3, 4],
    'Column3': [1, 2, np.nan, 4]
}
df = pd.DataFrame(data)
# True where any of the three columns is NaN in that row
df['other_col'] = np.where(df['Column1'].isna() | df['Column2'].isna() | df['Column3'].isna(), True, False)
df
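If the list of columns grows, chaining | conditions gets unwieldy; here is a sketch of the same idea driven by a column list (the column names are just the ones from the example above):
cols = ['Column1', 'Column2', 'Column3']
# isna() gives a boolean frame; any(axis=1) reduces it to one flag per row
df['other_col'] = df[cols].isna().any(axis=1)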
Current Pandas DataFrame
import numpy as np
import pandas as pd

fn1 = pd.DataFrame([['A', np.nan, np.nan, 9, 6],
                    ['B', np.nan, 2, np.nan, 7],
                    ['C', 3, 2, np.nan, 10],
                    ['D', np.nan, 7, np.nan, np.nan],
                    ['E', np.nan, np.nan, 3, 3],
                    ['F', np.nan, np.nan, 7, np.nan]],
                   columns=['Symbol', 'Condition1', 'Condition2', 'Condition3', 'Condition4'])
fn1.set_index('Symbol', inplace=True)
Condition1 Condition2 Condition3 Condition4
Symbol
A NaN NaN 9 6
B NaN 2 NaN 7
C 3 2 NaN 10
D NaN 7 NaN NaN
E NaN NaN 3 3
F NaN NaN 7 NaN
I'm currently working with a Pandas DataFrame that looks like the one above. I'm trying to go column by column and substitute the values that are not NaN with the 'Symbol' associated with that row, then collapse each column (or write to a new DataFrame) so that each column becomes a list of the 'Symbol's present for that 'Condition', as shown in the desired output:
Desired Output
I've been able to get the 'Symbol's present for each condition into a list of lists (see below), but I want to keep the same column names, and I've had trouble adding the lists to an ever-growing new DataFrame because their lengths vary and I'm looping through columns.
ls2 = []
for col in fn1.columns:
    # Keep only the symbols whose value for this condition is present (> 0)
    fn2 = fn1[fn1[col] > 0]
    ls2.append(list(fn2.index))
Here fn1 is the DataFrame shown above, with the 'Symbol' column set as the index.
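For what it's worth, the variable lengths need not block the DataFrame step: wrapping each list in pd.Series pads the shorter ones with NaN on construction. A minimal sketch building on ls2 from the loop above:
# Each list becomes a Series; shorter columns are padded with NaN
fn3 = pd.DataFrame({col: pd.Series(symbols) for col, symbols in zip(fn1.columns, ls2)})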
Thank you in advance for any help.
Another approach is slicing, as below (explanations in the comments):
import numpy as np
import pandas as pd
df = pd.DataFrame.from_dict({
    "Symbol": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"],
    "Condition1": [1, np.nan, 3, np.nan, np.nan, np.nan, 7, np.nan, np.nan, 8, 12],
    "Condition2": [np.nan, 2, 2, 7, np.nan, np.nan, 5, 11, 14, np.nan, np.nan],
})
new_df = pd.concat(
    [
        # Keep symbols where this condition is not null; drop the old index (as your output suggests)
        df["Symbol"][df[column].notnull()].reset_index(drop=True)
        for column in list(df)[1:]  # iterate over all columns except "Symbol"
    ],
    axis=1,  # column-wise concatenation
)
# Rename columns
new_df.columns = list(df)[1:]
# You can leave NaNs or replace them with empty string, your choice
new_df.fillna("", inplace=True)
Output of this operation will be:
Condition1 Condition2
0 a b
1 c c
2 g d
3 j g
4 k h
5 i
If you need any further clarification, post a comment down below.
You can map the symbols to each of the columns and then take the set of non-null values (this assumes 'Symbol' is still a regular column rather than the index):
fn1_symbols = fn1.apply(lambda x: x.map(fn1['Symbol'].to_dict()))
condition_symbols = {col: sorted(set(fn1_symbols[col].dropna())) for col in fn1.columns[1:]}
This will give you a dictionary:
{'Condition1': ['B', 'D'],
'Condition2': ['C', 'H'],
'Condition3': ['D', 'H', 'J'],
'Condition4': ['D', 'G', 'H', 'K']}
I know you asked for a DataFrame, but since each list has a different length, it doesn't map cleanly onto one. If you still want a DataFrame, you can run this:
pd.DataFrame({k: pd.Series(v) for k, v in condition_symbols.items()})
This gives you the following output:
Condition1 Condition2 Condition3 Condition4
0 B C D D
1 D H H G
2 NaN NaN J H
3 NaN NaN NaN K
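An equivalent one-liner, if you would rather let pandas do the padding, might be (same condition_symbols dict as above):
# orient='index' pads the shorter lists with NaN; .T restores one column per condition
pd.DataFrame.from_dict(condition_symbols, orient='index').T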
I have a column in a dataset of categorical company sizes, where '-' hyphens currently represent missing data.
I want to change the '-' missing values to nulls so I can analyse the missing data. However, when I use the pandas replace tool (see the following code) with a None value, it also mangles genuine entries, since they contain hyphens too (e.g. 51-200).
df['Company Size'].replace({'-': None},inplace =True, regex= True)
How can I replace only standalone hyphens and leave the other entries untouched?
You don't need regex=True; without it, replace only matches whole cell values:
df['Company Size'].replace({'-': None},inplace =True)
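A quick sketch to confirm the behaviour (the sample values here are assumptions based on the 51-200 example from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Company Size': ['51-200', '-', '1-50', '-']})
df['Company Size'] = df['Company Size'].replace({'-': np.nan})
print(df)
#   Company Size
# 0       51-200
# 1          NaN
# 2         1-50
# 3          NaN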
You could also just do:
df['column_name'] = df['column_name'].replace('-', np.nan)
(Replacing with the string 'None' would store literal text rather than a real null.)
import numpy as np
df.replace('-', np.nan, inplace=True)
This code worked for me.
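If you do want to keep regex=True, anchoring the pattern so it only matches a cell that is exactly one hyphen should also work (a sketch, not taken from the answers above):
# ^-$ matches only cells whose entire content is a single hyphen,
# so values like '51-200' are left untouched
df['Company Size'] = df['Company Size'].replace({'^-$': np.nan}, regex=True)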
You can do it like this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
'B': [5, 6, 7, 8, 9],
'C': ['a', '-', 'c--', 'd', 'e']})
df['C'] = df['C'].replace('-', np.nan)
df = df.where((pd.notnull(df)), None)
# can also use this -> df['C'] = df['C'].where((pd.notnull(df)), None)
print(df)
output:
A B C
0 0 5 a
1 1 6 None
2 2 7 c--
3 3 8 d
4 4 9 e
Another example:
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
'B': ['5-5', '-', 7, 8, 9],
'C': ['a', 'b', 'c--', 'd', 'e']})
df['B'] = df['B'].replace('-', np.nan)
df = df.where((pd.notnull(df)), None)
print(df)
output:
A B C
0 0 5-5 a
1 1 None b
2 2 7 c--
3 3 8 d
4 4 9 e
My question is similar to one asked here. I have a dataframe and I want to repeat each row of the dataframe k number of times. Along with it, I also want to create a column with values 0 to k-1. So
import pandas as pd
df = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'n' : [ 1, 2, 3],
'v' : [ 10, 13, 8]
})
what_i_want = pd.DataFrame(data={
'id': ['A', 'B', 'B', 'C', 'C', 'C'],
'n' : [ 1, 2, 2, 3, 3, 3],
'v' : [ 10, 13, 13, 8, 8, 8],
'repeat_id': [0, 0, 1, 0, 1, 2]
})
The command below does half of the job; I am looking for a pandas way of adding the repeat_id column.
df.loc[df.index.repeat(df.n)]
Use GroupBy.cumcount, and copy() to avoid SettingWithCopyWarning:
If you modify values in df1 later, the modifications will not propagate back to the original data (df), and pandas will not raise the warning.
df1 = df.loc[df.index.repeat(df.n)].copy()
df1['repeat_id'] = df1.groupby(level=0).cumcount()
df1 = df1.reset_index(drop=True)
print (df1)
id n v repeat_id
0 A 1 10 0
1 B 2 13 0
2 B 2 13 1
3 C 3 8 0
4 C 3 8 1
5 C 3 8 2
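The same steps can also be written as a single chain, which sidesteps the copy question entirely since assign returns a new frame (just a stylistic variant of the answer above):
out = (df.loc[df.index.repeat(df.n)]
         .assign(repeat_id=lambda d: d.groupby(level=0).cumcount())
         .reset_index(drop=True))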
I have column names in a dictionary and would like to select those columns from a dataframe.
In the example below, how do I select the dictionary values 'b' and 'c' and save the result to df_out?
import pandas as pd
ds = {'cols': ['b', 'c']}
d = {'a': [2, 3], 'b': [3, 4], 'c': [4, 5]}
df_in = pd.DataFrame(data=d)
print(ds)
print(df_in)
df_out = df_in[[ds['cols']]]
print(df_out)
TypeError: unhashable type: 'list'
Remove the nested list brackets - []:
df_out = df_in[ds['cols']]
print(df_out)
b c
0 3 4
1 4 5
According to ref, you just need to drop one set of brackets.
df_out = df_in[ds['cols']]
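As a side note, plain bracket selection raises a KeyError if any name in ds['cols'] is missing from the frame. If that can happen, reindex is a forgiving alternative (a sketch, assuming missing columns should come back as all-NaN):
# Missing columns appear filled with NaN instead of raising KeyError
df_out = df_in.reindex(columns=ds['cols'])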
I have been trying to get a count on multiple columns using value_counts. Right now, I have it working on one column, but not multiple.
EDIT: I needed a count of unique IDs previously, hence the count on 'id', but now I want a count of the services under 'id'. I'm editing the data below to more accurately explain the situation.
import pandas as pd
d = {'id': [1, 1, 2, 3, 3], 'service': [3, 3, 4, 2, 3], 'name': ['Joe', 'Joe', 'Bob', 'Val', 'Val']}
df = pd.DataFrame(data=d)
df['count'] = df['id'].map(df['id'].value_counts())
If I try
df['count'] = df['id'].map(df['id']['service'].value_counts())
I get a KeyError on service.
If I try
df['count'] = df['id']['service'].map(df['id'].value_counts())
I get the same error.
I'm hoping to get something along the lines of:
id  service  count
1   3        2
2   4        1
3   2        1
3   3        1
Am I using the wrong function?
A couple of ways: either use groupby with count, or create a tuple column and apply value_counts.
Both methods produce results that can be indexed via tuples.
Setup
import pandas as pd
d = {'id': [1, 2, 1], 'service': [3, 4, 3], 'name': ['Joe', 'Bob', 'Mark']}
df = pd.DataFrame(d)
Groupby method
As suggested by #Dark:
res = df.groupby(['id', 'service']).count()
# name
# id service
# 1 3 2
# 2 4 1
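A small note: count() counts non-null values per remaining column, which equals the row count here only because name has no nulls; size() counts rows directly and avoids relying on that (a minor variation, not part of the original answer):
res = df.groupby(['id', 'service']).size()
# id  service
# 1   3          2
# 2   4          1
# dtype: int64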
Tuple column method
df['id_service'] = list(zip(df.id, df.service))
res = df['id_service'].value_counts()
# (1, 3) 2
# (2, 4) 1
# Name: id_service, dtype: int64
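On newer pandas (1.1+), DataFrame.value_counts accepts multiple columns directly, and transform('size') maps the group sizes back onto each row, which is close to the df['count'] = ... line the question started from (the first call is version-dependent):
# Counts of each (id, service) pair (pandas >= 1.1)
res = df[['id', 'service']].value_counts()

# Attach each pair's count to every row, like the question's 'count' column
df['count'] = df.groupby(['id', 'service'])['id'].transform('size')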