Setting values to columns of a multi-hierarchical pandas dataframe - python

I have a multi-hierarchical pandas dataframe shown below. How, for a given attribute, attr ('rh', 'T', 'V'), can I set certain values (say values > 0.5) to NaN over the entire set of pLevs? I have seen answers on how to set a specific column (e.g., df['rh', 50]) but have not seen how to select the entire set.
attr          rh                            T                             V
pLev          50        75       100        50        75       100        50        75       100
refIdx
0       0.225026  0.013868  0.306472  0.144581  0.379578  0.760685  0.686463  0.476179  0.185635
1       0.496020  0.956295  0.471268  0.492284  0.836456  0.852873  0.088977  0.090494  0.604290
2       0.898723  0.733030  0.175646  0.841776  0.517127  0.685937  0.094648  0.857104  0.135651
3       0.136525  0.443102  0.759630  0.148536  0.426558  0.731955  0.523390  0.965385  0.094153
To facilitate assistance, I am including code to create the dataframe here:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random((4,9)))
df.columns = pd.MultiIndex.from_product([['rh','T','V'],[50,75,100]])
df.columns.names = ['attr', 'pLev']
df.index.name = 'refIdx'

The notation is mildly annoying, but you can use pd.IndexSlice:
df.loc[:, pd.IndexSlice['rh', :]] = np.nan
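To blank out only the values above the threshold rather than the whole attribute, the same selection can be combined with mask; a small sketch, using the 0.5 cutoff from the question:
cols = pd.IndexSlice['rh', :]
# values > 0.5 become NaN across every pLev under 'rh'; everything else is kept
df.loc[:, cols] = df.loc[:, cols].mask(df.loc[:, cols] > 0.5)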

If your 'given attribute' is 'rh' then you can take a cross-section with the following:
df_xs = df.xs('rh', level='attr', axis=1, drop_level=False)
Then you can update the original df as follows:
df[df_xs > 0.5] = np.nan
This works because drop_level=False was passed to .xs, so df_xs keeps the full ('rh', pLev) column labels and the boolean mask lines up with the columns of the original df.

Related

Python: Copy a Row Value and Add to Another Row where a cell is empty

I'm trying to figure out how to do this in pandas but currently can't work it out.
I would like to copy a value from Col A where a cell in Col B is blank and prepend it to the Col A values of the rows below, but only until the next blank cell in Col B is reached, where the process should start again.
My Python isn't the strongest, so any pointers would be appreciated, as I currently haven't got a clue where to start with this one! I have included an example below of how the data currently looks and how I'd like it to look. I'm currently just manipulating and cleaning the data in pandas.
Current:
A               B
Supply Voltage  BLANK
Rated Value     10
Limit           20
Size            BLANK
Height          10
Width           20
Desired:
A                             B
Supply Voltage                BLANK
Supply Voltage - Rated Value  10
Supply Voltage - Limit        20
Size                          BLANK
Size - Height                 10
Size - Width                  20
Alessandro answers the original question perfectly; however, in my case the data looks more like the table below, with boolean Yes/No values and unique values mixed in. Would a groupby and fill still work in this case?
A               B
Supply Voltage  BLANK
Rated Value     10
Limit           20
Work            Yes
Size            BLANK
Height          11
Depth           14
Width           55
Description     BLANK
Time            1432
Date            10/12/2022
Quote           Hello World
Below you can find a working example:
import pandas as pd
import numpy as np

# Recreate example DataFrame
df = pd.DataFrame({
    'A': ['Supply Voltage', 'Rated Value', 'Limit', 'Size', 'Height', 'Width'],
    'B': [np.nan, 10, 20, np.nan, 10, 20]
})

# Add helper column that numbers each section (it increments at every blank cell in B)
l = []
c = 0
for i in df['B']:
    if pd.isnull(i):
        c = c + 1
    l.append(c)
df['C'] = l

# Extract names associated with blank cells
blank_names = df.loc[df['B'].isna(), ['A', 'C']]
blank_names.columns = ['BLANK_NAME', 'C']

# Add names associated with blank cells to original DataFrame
df = df.merge(blank_names, on='C', how='left')
df['A'] = np.where(df['B'].notna(), df['BLANK_NAME'] + ' - ' + df['A'], df['A'])

# Display final output
df = df.drop(columns=['C', 'BLANK_NAME'])
df
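For the follow-up with Yes/No and other mixed values, the same idea should still work, because it only keys on B being blank; here is a rough groupby-free sketch (the SECTION helper column is just an illustrative name):
# blanks in B mark the section headers, so forward-filling the header name is enough
# regardless of what kind of value sits in B below each header
header = df['B'].isna()
df['SECTION'] = df['A'].where(header).ffill()
df['A'] = np.where(header, df['A'], df['SECTION'] + ' - ' + df['A'])
df = df.drop(columns='SECTION')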

Python most efficient way to dictionary mapping in pandas dataframe

I have a dictionary of dictionaries and each contains a mapping for each column of my dataframe.
My goal is to find the most efficient way to perform mapping for my dataframe with 1 row and 300 columns.
My dataframe's values are randomly sampled from range(mapping_size), and my dictionaries map values from range(mapping_size) to random.randint(mapping_size+1, mapping_size*2).
I can see from the answer provided by jpp that map is possibly the most efficient way to go but I am looking for something which is even faster than map. Can you think of any? I am happy if the data structure of the input is something else instead of pandas dataframe.
Here is the code for setting up the question and results using map and replace:
# import packages
import random
import pandas as pd
import numpy as np
import timeit

# specify parameters
ncol = 300          # number of columns
nrow = 1            # number of rows
mapping_size = 10   # length of each dictionary

# create a dictionary of dictionaries for mapping
mapping_dict = {}
random.seed(123)
for idx1 in range(ncol):
    # create an empty dictionary for this column
    mapping_dict['col_' + str(idx1)] = {}
    for inx2 in range(mapping_size):
        # map values from range(mapping_size) to random.randint(mapping_size+1, mapping_size*2)
        mapping_dict['col_' + str(idx1)][inx2 + 1] = random.randint(mapping_size + 1, mapping_size * 2)

# Create a dataframe with values sampled from range(mapping_size)
d = {}
random.seed(123)
for idx1 in range(ncol):
    d['col_' + str(idx1)] = np.random.choice(range(mapping_size), nrow)
df = pd.DataFrame(data=d)
Results using map and replace:
%%timeit -n 20
df.replace(mapping_dict)                                      # 296 ms

%%timeit -n 20
for key in mapping_dict.keys():
    df[key] = df[key].map(mapping_dict[key]).fillna(df[key])  # 221 ms

%%timeit -n 20
for key in mapping_dict.keys():
    df[key] = df[key].map(mapping_dict[key])                  # 181 ms
Just use pandas without Python-level iteration.
# runtime ~ 1 s (1000 rows)
# create a map Series with a MultiIndex of (column name, value)
df_dict = pd.DataFrame(mapping_dict)
obj_dict = df_dict.T.stack()
# obj_dict
# col_0  1    10
#        2    14
#        3    11
# Length: 3000, dtype: int64

# convert df to the map Series' index; df can have more than 1 row
obj_idx = pd.Series(df.values.flatten())
obj_idx.index = pd.Index(df.columns.to_list() * df.shape[0])
idx = obj_idx.to_frame().reset_index().set_index(['index', 0]).index
result = obj_dict[idx]

# handle null values (keys missing from the mapping keep their original value)
cond = result.isnull()
result[cond] = pd.Series(result[cond].index.values).str[1].values

# transform into the result DataFrame
df_result = pd.DataFrame(result.values.reshape(df.shape))
df_result.columns = df.columns
df_result
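If you are willing to step outside pandas for the hot path, another option is to pre-build one small NumPy lookup table per column and map the whole frame with a single fancy-indexing operation. This is only a sketch under the question's assumptions (values lie in range(mapping_size); keys missing from a dictionary fall back to the original value, as in the .map().fillna() timing above), and it is not benchmarked here:
cols = list(df.columns)
lut = np.empty((len(cols), mapping_size), dtype=np.int64)
for j, col in enumerate(cols):
    for v in range(mapping_size):
        # fall back to the value itself when the dictionary has no key for it
        lut[j, v] = mapping_dict[col].get(v, v)
vals = df.to_numpy()                               # shape (nrow, ncol)
mapped = lut[np.arange(len(cols))[None, :], vals]  # look up every cell at once
df_mapped = pd.DataFrame(mapped, columns=cols)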

Apply def function to a dataframe recursively

I am trying to calculate the values of the Cat column "recursively".
Every loop should calculate the Cat column's max value (Catz) for a group of x. If the date range becomes <= 60, the Cat column value should be updated with Catz += 1. I have an arcpy version of this process working; however, I have thousands of other data sets that I do not want to convert into an arcpy-friendly format, and I am not very familiar with pandas.
I referred to [1]: Calculate DataFrame values recursively and [2]: python pandas - apply function with two arguments to columns. I still haven't quite understood the Series/DataFrame concept or how to apply either answer.
import pandas as pd
import numpy as np
from datetime import datetime
from datetime import datetime as dt
from datetime import timedelta
import time
from datetime import date

dict = {'x': ["ASPELBJNMI", "JUNRNEXCRG", "ASPELBJNMI", "JUNRNEXCRG"],
        'start': ["6/27/2018", "8/4/2018", "8/22/2018", "8/12/2018"],
        'finish': ["8/11/2018", "10/3/2018", "8/31/2018", "10/26/2018"],
        'DateRange': [0, 0, 0, 0],
        'Cat': [-1, -1, -1, -1],
        'ID': [1, 2, 3, 4]}
df = pd.DataFrame(dict)
df.set_index('ID')

def classd(houp):
    Catz = houp.Cat.min()
    Catz += 1
    houp = houp.groupby('x')
    for x, houp2 in houp:
        houp.DateRange = (pd.to_datetime(houp.finish.loc[:]).min() - houp.start.loc[:]).astype('timedelta64[D]')
        houp.Cat = np.where(houp.DateRange <= 60, Catz, -1)
    return houp

df['Cat'] = df[['x', 'DateRange', 'Cat']].apply(classd, axis=1).Cat
print df
I get the following Traceback when I run my code
Catz = houp.Cat.min()
AttributeError: ("'long' object has no attribute 'min'", u'occurred at index 0')
Desired outcome
OBJECTID_1 *  Conc *      ID     start      finish      DateRange  Cat
1             ASPELBJNMI  LAPMT  6/27/2018  8/11/2018          45    0
2             ASPELBJNMI  KLKIY  8/22/2018  8/31/2018           9    1
15            JUNRNEXCRG  CGCHK  8/4/2018   10/3/2018          60    1
16            JUNRNEXCRG  IQYGJ  8/12/2018  10/26/2018         83   -1
Your program is a little bit complicated to comprehend, but I would suggest trying something simple with the apply function:
s.apply(lambda x: x ** 2)
Here s is a Series.
https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html
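As a loose sketch of how the per-group DateRange calculation from the question might look with groupby (this follows the formula already in classd, the group's earliest finish minus each start, and does not implement the full Cat recursion): apply with axis=1 hands classd one row at a time, so houp.Cat is a plain integer and .min() fails; passing whole groups avoids that.
def date_range_days(group):
    # per-group calculation: days between each start and the group's earliest finish
    start = pd.to_datetime(group['start'])
    finish = pd.to_datetime(group['finish'])
    group['DateRange'] = (finish.min() - start).dt.days
    return group

df = df.groupby('x', group_keys=False).apply(date_range_days)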

Python fuzzy string matching as correlation style table/matrix

I have a file with x number of string names and their associated IDs. Essentially two columns of data.
What I would like is a correlation-style table of size x by x (with the data in question as both the x-axis and the y-axis), but instead of correlation, I would like the output of the fuzzywuzzy library's fuzz.ratio(x, y) function, using the string names as input. Essentially, I want to run every entry against every entry.
This is sort of what I had in mind. Just to show my intent:
import pandas as pd
from fuzzywuzzy import fuzz
df = pd.read_csv('random_data_file.csv')
df = df[['ID','String']]
df['String_Dup'] = df['String'] #creating duplicate of data in question
df = df.set_index('ID')
df = df.groupby('ID')[['String','String_Dup']].apply(fuzz.ratio())
But clearly this approach is not working for me at the moment. It doesn't have to be pandas; it is just an environment I am relatively familiar with.
I hope my issue is clearly worded; really, any input is appreciated.
Use pandas' crosstab function, followed by a column-wise apply to compute the fuzz.
This is considerably more elegant than my first answer.
import pandas as pd
from fuzzywuzzy import fuzz
# Create sample data frame.
df = pd.DataFrame([(1, 'abracadabra'), (2, 'abc'), (3, 'cadra'), (4, 'brabra')],
                  columns=['id', 'strings'])
# Create the cartesian product between the strings column with itself.
ct = pd.crosstab(df['strings'], df['strings'])
# Note: for pandas versions <0.22, the two series must have different names.
# In case you observe a "Level XX not found" error, the following may help:
# ct = pd.crosstab(df['strings'].rename(), df['strings'].rename())
# Apply the fuzz (column-wise). Argument col has type pd.Series.
ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])
# This results in the following:
# strings abc abracadabra brabra cadra
# strings
# abc 100 43 44 25
# abracadabra 43 100 71 62
# brabra 44 71 100 55
# cadra 25 62 55 100
For simplicity, I omitted the groupby operation as suggested in your question. In case you want to apply the fuzzy string matching within groups, simply create a separate function:
def cross_fuzz(df):
    ct = pd.crosstab(df['strings'], df['strings'])
    ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])
    return ct

df.groupby('id').apply(cross_fuzz)
In pandas, the cartesian cross product between two columns can be created using a dummy variable and pd.merge. The fuzz operation is applied using apply. A final pivot operation will extract the format you had in mind. For simplicity, I omitted the groupby operation, but of course, you could apply the procedure to all group-tables by moving the code below into a separate function.
Here is what this could look like:
import pandas as pd
from fuzzywuzzy import fuzz
# Create sample data frame.
df = pd.DataFrame([(1, 'abracadabra'), (2, 'abc'), (3, 'cadra'), (4, 'brabra')],
                  columns=['id', 'strings'])
# Cross product, using a temporary column.
df['_tmp'] = 0
mrg = pd.merge(df, df, on='_tmp', suffixes=['_1','_2'])
# Apply the function between the two strings.
mrg['fuzz'] = mrg.apply(lambda s: fuzz.ratio(s['strings_1'], s['strings_2']), axis=1)
# Reorganize data.
ret = mrg.pivot(index='strings_1', columns='strings_2', values='fuzz')
ret.index.name = None
ret.columns.name = None
# This results in the following:
# abc abracadabra brabra cadra
# abc 100 43 44 25
# abracadabra 43 100 71 62
# brabra 44 71 100 55
# cadra 25 62 55 100
import csv
from fuzzywuzzy import fuzz
import numpy as np

input_file = csv.DictReader(open('random_data_file.csv'))
string = []
for row in input_file:              # the file is read row by row into dictionaries
    string.append(row["String"])    # the keys of each dict are the CSV headers
# now you have a list of the string values
length = len(string)
resultMat = np.zeros((length, length))  # zeros 2D matrix, with size X * X
for i in range(length):
    for j in range(length):
        resultMat[i][j] = fuzz.ratio(string[i], string[j])
print(resultMat)
I did the implementation as a NumPy 2D matrix. I am not that good with pandas, but I think what you were doing is adding another column and comparing it to the string column, meaning string[i] would be matched with string_dup[i], so all results would be 100.
Hope it helps.

Selectively replacing DataFrames column names

I have a time series dataset in a .csv file that I want to process with Pandas (using Canopy). The column names from the file are a mix of strings and isotopic numbers.
   cycles      40   38.02   35.98      P4
0       1  1.1e-8  4.4e-8  7.7e-8  8.8e-7
1       2  2.2e-8  5.5e-8  8.8e-8  8.7e-7
2       3  3.3e-8  6.6e-8  9.9e-8  8.6e-7
I would like this DataFrame to look like this:
   cycles      40      38      36      P4
0       1  1.1e-8  4.4e-8  7.7e-8  8.8e-7
1       2  2.2e-8  5.5e-8  8.8e-8  8.7e-7
2       3  3.3e-8  6.6e-8  9.9e-8  8.6e-7
The .csv files won't always have exactly the same column names; the numbers could be slightly different from file to file. To handle this, I've sampled the column names and rounded the values to the nearest integer. This is what my code looks like so far:
import pandas as pd
import numpy as np
df = {'cycles':[1,2,3],'40':[1.1e-8,2.2e-8,3.3e-8],'38.02':[4.4e-8,5.5e-8, 6.6e-8],'35.98':[7.7e-8,8.8e-8,9.9e-8,],'P4':[8.8e-7,8.7e-7,8.6e-7]}
df = pd.DataFrame(df, columns=['cycles', '40', '38.02', '35.98', 'P4'])
colHeaders = df.columns.values.tolist()
colHeaders[1:4] = list(map(float, colHeaders[1:4]))
colHeaders[1:4] = list(map(np.around, colHeaders[1:4]))
colHeaders[1:4] = list(map(int, colHeaders[1:4]))
colHeaders = list(map(str, colHeaders))
I tried df.rename(columns={df.loc[ 1 ]:colHeaders[ 0 ]}, ...), but I get this error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I've read this post as well as the pandas 0.17 documentation, but I can't figure out how to use it to selectively replace the column names in a way that doesn't require me to assign new column names manually like this post.
I'm fairly new to Python and I've never posted on StackOverflow before, so any help would be greatly appreciated.
You could use a variant of your approach, but assign the new columns directly:
>>> cols = list(df.columns)
>>> cols[1:-1] = [int(round(float(x))) for x in cols[1:-1]]
>>> df.columns = cols
>>> df
cycles 40 38 36 P4
0 1 1.100000e-08 4.400000e-08 7.700000e-08 8.800000e-07
1 2 2.200000e-08 5.500000e-08 8.800000e-08 8.700000e-07
2 3 3.300000e-08 6.600000e-08 9.900000e-08 8.600000e-07
>>> df.columns
Index(['cycles', 40, 38, 36, 'P4'], dtype='object')
Or you could pass a function to rename:
>>> df = df.rename(columns=lambda x: x if x[0].isalpha() else int(round(float(x))))
>>> df.columns
Index(['cycles', 40, 38, 36, 'P4'], dtype='object')
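If a header could be neither alphabetic nor parseable as a number, a slightly more defensive variant (just a sketch, not needed for the data shown) leaves anything unparseable untouched:
>>> def round_header(x):
...     # hypothetical helper: round numeric-looking headers, leave everything else alone
...     try:
...         return int(round(float(x)))
...     except (TypeError, ValueError):
...         return x
...
>>> df = df.rename(columns=round_header)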
