Pulling column values based on conditions - python

I have the following dataframe
import pandas as pd

df = pd.DataFrame({
    'Column_1': ['Position', 'Start', 'End', 'Position'],
    'Original_1': ['Open', 'Barn', 'Grass', 'Bubble'],
    'Latest_1': ['Shut', 'Horn', 'Date', 'Dinner'],
    'Column_2': ['Start', 'Position', 'End', 'During'],
    'Original_2': ['Sky', 'Hold', 'Car', 'House'],
    'Latest_2': ['Pedal', 'Lap', 'Two', 'Force'],
    'Column_3': ['Start', 'End', 'Position', 'During'],
    'Original_3': ['Leave', 'Dog', 'Block', 'Hope'],
    'Latest_3': ['Sear', 'Crawl', 'Enter', 'Night']
})
For every instance where the word Position is in 'Column_1', 'Column_2', or 'Column_3', I want to capture the associated values in 'Original_1', 'Original_2', 'Original_3' and assign them to the new column named 'Original_Values'.
The following code can accomplish that, but only on a column by column basis.
df['Original_Value1'] = df.loc[df['Column_1'] == 'Position', 'Original_1']
df['Original_Value2'] = df.loc[df['Column_2'] == 'Position', 'Original_2']
df['Original_Value3'] = df.loc[df['Column_3'] == 'Position', 'Original_3']
Is there a way to recreate the above code so that it iterates over the entire data frame (not by specified columns)?
I'm hoping to create one column ('Original_Values') with the following result:
0 Open
1 Hold
2 Block
3 Bubble
Name: Original_Values, dtype: object

One way to do it, with df.apply():
def choose_orig(row):
    if row['Column_1'] == 'Position':
        return row['Original_1']
    elif row['Column_2'] == 'Position':
        return row['Original_2']
    elif row['Column_3'] == 'Position':
        return row['Original_3']
    return ''

df['Original_Values'] = df.apply(choose_orig, axis=1)
The axis=1 argument to df.apply() causes the choose_orig() function to be called once for each row of the dataframe.
Note that this uses a default value of the empty string, '', when none of the columns match the word 'Position'.
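For larger frames, a vectorized alternative is numpy.select; this is only a sketch on top of the question's columns, not part of the original answer:
import numpy as np

conditions = [df['Column_1'] == 'Position',
              df['Column_2'] == 'Position',
              df['Column_3'] == 'Position']
choices = [df['Original_1'], df['Original_2'], df['Original_3']]

# np.select takes the first matching choice per row and '' when nothing matches
df['Original_Values'] = np.select(conditions, choices, default='')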

How about creating a mask from the three Column_* columns (select them by name or position) and multiplying it by the values of the three Original_* columns? Where the mask is False the product is an empty string, so taking the row-wise max() keeps the matched value.
mask = df[['Column_1', 'Column_2', 'Column_3']] == 'Position'
df['Original_Values'] = (mask * df[['Original_1', 'Original_2', 'Original_3']].values).max(1)
print(df['Original_Values'])
Returns:
0 Open
1 Hold
2 Block
3 Bubble
Name: Original_Values, dtype: object

Here's a kinda silly way to do it with some stacking, which might perform better if you have a very large df and need to avoid axis=1.
Stack the first three columns to create a list of the index and which 'Original' column the value corresponds to
Stack the columns from which you want to get the values. Use the above list to reindex it, so you return the appropriate value.
Bring those values back to the original df based on the original row index.
Here's the code:
import re
mask_list = ['Column_1', 'Column_2', 'Column_3']
val_list = ['Original_1', 'Original_2', 'Original_3']
idx = df[mask_list].stack()[df[mask_list].stack() == 'Position'].index.tolist()
idx = [(x , re.sub('(.*_)', 'Original_', y)) for x, y in idx]
df['Original_Values'] = df[val_list].stack().reindex(idx).reset_index(level=1).drop(columns='level_1')
df is now:
Column_1 Column_2 Column_3 ... Original_Values
0 Position Start Start ... Open
1 Start Position End ... Hold
2 End End Position ... Block
3 Position During During ... Bubble
If 'Position' is not found in any of the columns in mask_list, Original_Values becomes NaN for that row. If you need to scale it to more columns, simply add them to mask_list and val_list.
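If the number of column pairs grows, the two lists can also be built from the column names instead of being typed out; a small sketch assuming the Column_<n> / Original_<n> naming used above:
# derive the mask columns and their matching value columns from the names themselves
mask_list = [c for c in df.columns if c.startswith('Column_')]
val_list = [c.replace('Column_', 'Original_') for c in mask_list]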

Related

How do I print the cell values that cause pandas.DataFrame.any to return True?

The code below tells whether a cell of dataframe Df3 has the same value as a cell of another dataframe within an array, dataframe_arrays. However, I want to print the cell value and the specific dataframe within dataframe_arrays that has the same value as Df3. Here is what I have tried:
import pandas as pd
dataframe_arrays = []
Df1 = pd.DataFrame({'IDs': ['Marc', 'Jake', 'Sam', 'Brad']})
dataframe_arrays.append(Df1)
Df2 = pd.DataFrame({'IDs': ['TIm', 'Tom', 'harry', 'joe', 'bill']})
dataframe_arrays.append(Df2)
Df3 = pd.DataFrame({'IDs': ['kob', 'ham', 'konard', 'jupyter', 'Marc']})
repeat = False
for i in dataframe_arrays:
    repeat = Df3.IDs.isin(i.IDs).any()
    if repeat:
        print("i = ", i)
        break
My objective is to compare my current dataframe column with columns belonging to another set of dataframes and identify which values are repeating.
If your data is not that large, you can simply use a nested loop with .iterrows() to go through the data row by row and dataframe by dataframe. You can also use globals() to get the variable name of the dataframe that contains the duplicate.
def get_var_name(variable):
    globals_dict = globals()
    return [var_name for var_name in globals_dict if globals_dict[var_name] is variable]

for index, row in Df3.iterrows():
    for i in range(len(dataframe_arrays)):
        if row['IDs'] in dataframe_arrays[i]['IDs'].values:
            print("{} is in {}".format(row['IDs'], get_var_name(dataframe_arrays[i])[0]))
output:
> Marc is in Df1
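If you would rather avoid globals(), a sketch of an alternative is to keep the frames in a dictionary keyed by name, so the name travels with the frame (the dict and its keys below are assumptions, not part of the question):
named_frames = {'Df1': Df1, 'Df2': Df2}

for name, frame in named_frames.items():
    # values of Df3 that also appear in this frame
    matches = Df3.loc[Df3['IDs'].isin(frame['IDs']), 'IDs']
    for value in matches:
        print("{} is in {}".format(value, name))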

Why is mapping a data series to dataframe column names so slow with the map function

Hi all, I have a dataframe of approx 400k rows with a column of interest. I would like to map each element in the column to a category (LU, HU, etc.). This is obtained from a smaller dataframe where the column names are the categories. The function below, however, runs very slowly for only 400k rows and I'm not sure why. In the example below it is of course fast, since there are only 5 elements.
cwp_sector_mapping = {
    'LU': ['C2P34', 'C2P35', 'C2P36'],
    'HU': ['C2P37', 'C2P38', 'C2P39'],
    'EH': ['C2P40', 'C2P41', 'C2P42'],
    'EL': ['C2P43', 'C2P44', 'C2P45'],
    'WL': ['C2P12', 'C2P13', 'C2P14'],
    'WH': ['C2P15', 'C2P16', 'C2P17'],
    'NL': ['C2P18', 'C2P19', 'C2P20'],
}
df_cwp = pd.DataFrame.from_dict(cwp_sector_mapping)
columns = df_cwp.columns
ls = pd.Series(['C2P44', 'C2P43', 'C2P12', 'C2P1'])
temp = list((map(lambda pos: columns[df_cwp.eq(pos).any()][0] if
columns[df_cwp.eq(pos).any()].size != 0 else 'UN', ls)))
Use the next with iter trick to get the first matched column name, falling back to the default value 'UN' when there is no match:
temp = [next(iter(columns[df_cwp.eq(pos).any()]), 'UN') for pos in ls]
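Since the question is really about speed, another option (a sketch, not part of the answer above) is to invert cwp_sector_mapping into a flat code-to-category dict once and use Series.map, which avoids scanning df_cwp for every element:
# invert the mapping once: every code points to its category
code_to_cat = {code: cat for cat, codes in cwp_sector_mapping.items() for code in codes}

temp = ls.map(code_to_cat).fillna('UN').tolist()
print(temp)  # ['EL', 'EL', 'WL', 'UN']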

pandas str contains with maximum value

I have two dataframes; one of them contains strings and the other contains a timestamp and a string.
df2= pd.DataFrame({'Name':['Tim', 'Timothy', 'Kistian', 'Kris cole','Ian'],
'Age':['1-2-1997', '21-3-1998', '19-6-2000', '18-4-1996','12-12-2001']})
df1= pd.DataFrame({'string':['Ti', 'Kri' ,'ian' ],
'MaxDate':[None, None, None]})
I want to assign to the MaxDate column the maximum date from a str.contains(df1['string'][0]) operation on df2.
For example, df2[df2.Name.str.contains(df1['string'][0])] gives me 2 records.
I want to assign the maximum of these dates to the MaxDate entry corresponding to 'Ti',
i.e. the output after the first iteration will be:
df1= pd.DataFrame({'string':['Ti', 'Kri' ,'ian' ],
'MaxDate':['1-2-1997', None, None]})
How can I do this for all entries of df1 using a loop?
If you need a loop solution, create a list of dictionaries with the max values and pass it to the DataFrame constructor:
df2['Age'] = pd.to_datetime(df2['Age'], dayfirst=True)

out = []
for x in df1['string']:
    m = df2.loc[df2.Name.str.contains(x), 'Age'].max()
    out.append({'string': x, 'MaxDate': m})

df = pd.DataFrame(out)
print(df)
string MaxDate
0 Ti 1998-03-21
1 Kri 1996-04-18
2 ian 2000-06-19
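If the goal is simply to fill the existing MaxDate column of df1, the same logic can be written back directly; a sketch assuming Age has already been converted with pd.to_datetime as above:
# one max date per lookup string, assigned straight into df1
df1['MaxDate'] = [df2.loc[df2.Name.str.contains(x), 'Age'].max() for x in df1['string']]
print(df1)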

Save data frame from inside for loop

I have a function that takes in a dataframe and returns a (reduced) dataframe, e.g. like this:
def transforming_data(dataframe, col_1, col_2, normalized=True):
    ''' takes in dataframe, groups col_1 according to col_2 and returns dataframe '''
    df = dataframe[col_1].groupby(dataframe[col_2]).value_counts(normalize=normalized).unstack(fill_value=0)
    return df
For the following code, this gives me:
import pandas as pd
import numpy as np
np.random.seed(12)
def transforming_data(dataframe, col_1, col_2, normalized=True):
    ''' takes in dataframe, groups col_1 according to col_2 and returns df '''
    df = dataframe[col_1].groupby(dataframe[col_2]).value_counts(normalize=normalized).unstack(fill_value=0)
    return df
numrows = 1000
dataframe = pd.DataFrame({'Numerical': np.random.randn(numrows),
'Category': np.random.choice(['Panda', 'Elephant', 'Anaconda'], numrows),
'Response 1': np.random.choice(['Yes', 'Maybe', 'No', 'Don\'t know'], numrows),
'Response 2': np.random.choice(['Very Much', 'Much', 'A bit', 'Not at all'], numrows)})
test = transforming_data(dataframe, 'Response 1', 'Category')
print(test)
# Output
# Response 1 Don't know Maybe No Yes
# Category
# Anaconda 0.275229 0.232416 0.217125 0.275229
# Elephant 0.220588 0.270588 0.255882 0.252941
# Panda 0.258258 0.222222 0.273273 0.246246
So far, so good.
Now I want to use the function transforming_data inside a for loop for every column in dataframe (as I have lots of columns, not just two) and save the resulting dataframe to a new dataframe, e.g. test_response_1 and test_response_2 for this example.
Can someone point me in the right direction - i.e. how to implement the loop correctly?
So far, I am using something like this - but cannot figure out how to save the data frame
for column in dataframe.columns.tolist():
    temp_df = transforming_data(dataframe, column, 'Category')
    # here, I need to save temp_df outside of the loop but don't know how to
Thanks a lot for pointers and help. (Note: the most similar question I found does not talk about actually saving the data frame, so it doesn't help me with this.)
If you want to save (in memory) all of the temp_df's from your loop, you can append them to a list that you can then index afterwards:
temp_dfs = []
for column in dataframe.columns.tolist():  # you don't actually need the tolist() method here
    temp_df = transforming_data(dataframe, column, 'Category')
    temp_dfs.append(temp_df)
If you'd rather be able to access these temp_df's by the column name that was used to transform them, then you could assign each to a dictionary, using the column as the key:
temp_dfs = {}
for column in dataframe.columns.tolist():
    temp_df = transforming_data(dataframe, column, 'Category')
    temp_dfs[column] = temp_df
If by "save" you meant "write to disk", then you can use one of the many to_<file_format>() methods that pandas provides:
for column in dataframe.columns.tolist():
    temp_df = transforming_data(dataframe, column, 'Category')
    temp_df.to_csv('temp_df{}.csv'.format(column))
Here's the to_csv() docs.
The simplest solution would be to save the result dataframes into a list. Assuming that all columns you want to loop over contain the text Response in their column name:
result_dframes = []
for col_name in dataframe.filter(like='Response').columns:
    result_dframe = transforming_data(dataframe, col_name, 'Category')
    result_dframes.append(result_dframe)
Alternatively you can also obtain the exact same result with a list comprehension instead of a for-loop:
result_dframes = [
    transforming_data(dataframe, col_name, 'Category')
    for col_name in dataframe.filter(like='Response')
]
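If you later want everything in a single frame, the list can be concatenated with pd.concat, keyed by the column that produced each part (a sketch assuming result_dframes from the snippet above):
# one frame with a MultiIndex of (source column, Category)
combined = pd.concat(result_dframes, keys=dataframe.filter(like='Response').columns)
print(combined)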

Pandas DataFrame - Creating a new column from a comparison

I'm trying to create a column called 'city_code' with values from the 'code' column. But in order to do this I need to compare whether the 'ds_city' and 'city' values are equal.
Here is a table sample:
https://i.imgur.com/093GJF1.png
I've tried this:
def find_code(data):
    if data['ds_city'] == data['city']:
        return data['code']
    else:
        return 'UNKNOWN'

df['code_city'] = df.apply(find_code, axis=1)
But since there are duplicates in the 'ds_city' column, that's the result:
https://i.imgur.com/geHyVUA.png
Here is a image of the expected result:
https://i.imgur.com/HqxMJ5z.png
How can I work around this?
You can use pandas merge:
df = pd.merge(df, df[['code', 'city']], how='left',
left_on='ds_city', right_on='city',
suffixes=('', '_right')).drop(columns='city_right')
# output:
# code city ds_city code_right
# 0 1500107 ABAETETUBA ABAETETUBA 1500107
# 1 2900207 ABARE ABAETETUBA 1500107
# 2 2100055 ACAILANDIA ABAETETUBA 1500107
# 3 2300309 ACOPIARA ABAETETUBA 1500107
# 4 5200134 ACREUNA ABARE 2900207
Here's pandas.merge's documentation. It takes the input dataframe and left-joins its own code and city columns onto it where ds_city equals city.
The above code fills code_right with NaN when the city is not found. You can then fill those values with 'UNKNOWN':
df['code_right'] = df['code_right'].fillna('UNKNOWN')
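Since this is essentially a city-to-code lookup, another hedged sketch (not part of the answer above) builds the mapping once and uses Series.map, which also copes with the duplicated ds_city values:
# build a city -> code lookup from the existing columns, then map ds_city through it
city_to_code = dict(zip(df['city'], df['code']))
df['code_city'] = df['ds_city'].map(city_to_code).fillna('UNKNOWN')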
This is a case for np.where:
import numpy as np

df['code_city'] = np.where(df['ds_city'] == df['city'], df['code'], 'UNKNOWN')
You could try this out:
# Begin with a column of only 'UNKNOWN' values.
df['code_city'] = 'UNKNOWN'

# Iterate through the cities in the ds_city column.
for i, lookup_city in enumerate(df['ds_city']):
    # Note the row which contains the corresponding city name in the city column.
    row = df['city'].tolist().index(lookup_city)
    # Reassign the current row's code_city column to the code from the row found in the last step.
    df['code_city'][i] = df['code'][row]
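Note that df['code_city'][i] = ... is chained assignment and can trigger a SettingWithCopyWarning; a safer sketch of the same idea writes through .at (assuming the default RangeIndex) and skips cities that are missing from the city column:
df['code_city'] = 'UNKNOWN'
city_list = df['city'].tolist()

for i, lookup_city in enumerate(df['ds_city']):
    if lookup_city in city_list:
        # .at writes a single cell and avoids chained assignment
        df.at[i, 'code_city'] = df['code'].iloc[city_list.index(lookup_city)]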
