I am looking for a way to do some conditional mapping using multiple comparisons.
I have millions of rows that I am investigating using sample SQL extracts read into pandas DataFrames. Along with the SQL extracts I also have some rules tables, each with a few columns (these are also loaded into DataFrames).
This is what I want to do: where a row in my SQL extract matches the conditions expressed in any one row in my rules table, I would like to generate a 1, else a 0. In the end I would like to add a column to my SQL extract called RULE_RESULT holding these 1s and 0s.
I have a system that works using df.merge, but it produces many extra duplicate rows in the process that must then be removed afterwards. I am looking for a better, faster, more elegant solution and would be grateful for any suggestions.
Here is a working example of the problem and the current solution code:
import pandas as pd
import numpy as np
#Create a set of test data
test_df = pd.DataFrame()
test_df['A'] = [1,120,982,1568,29,455,None, None, None]
test_df['B'] = ['EU','US',None, 'GB','DE','EU','US', 'GB',None]
test_df['C'] = [1111,1121,1111,1821,1131,1111,1121,1821,1723]
test_df['C_C'] = test_df['C']
test_df
#Create a rules_table
rules_df = pd.DataFrame()
rules_df['A_MIN'] = [0,500,None,600,200]
rules_df['A_MAX'] = [10,1000,500,1200,800]
rules_df['B_B'] = ['GB','GB','US','EU','EU']
rules_df['C_C'] = [1111,1821,1111,1111,None]
rules_df
def map_rules_to_df(df, rules):
    #create a column that mimics the index to keep track of later duplication
    df['IND'] = df.index
    #merge the rules with the data on C values
    df = df.merge(rules, left_on='C_C', right_on='C_C', how='left')
    #create a rule result column with a default value of zero
    df['RULE_RESULT'] = 0
    #create a mask identifying those test_df rows that fit with a rule_df row
    mask = df[
        ((df['A'] > df['A_MIN']) | (df['A_MIN'].isnull())) &
        ((df['A'] < df['A_MAX']) | (df['A_MAX'].isnull())) &
        ((df['B'] == df['B_B']) | (df['B_B'].isnull())) &
        ((df['C'] == df['C_C']) | (df['C_C'].isnull()))
    ]
    #use mask.index to replace 0's in the result column with a 1
    df.loc[mask.index, 'RULE_RESULT'] = 1
    #drop the redundant rule_df columns
    df = df.drop(['B_B', 'C_C', 'A_MIN', 'A_MAX'], axis=1)
    #drop duplicate rows
    df = df.drop_duplicates(keep='first')
    #drop rows where the original index is duplicated and the rule result is zero
    df = df[~(df['IND'].duplicated(keep=False) & (df['RULE_RESULT'] == 0))]
    #reset the df index with the original index
    df.index = df['IND'].values
    #drop the now redundant second index column (IND)
    df = df.drop('IND', axis=1)
    print('df shape', df.shape)
    return df
#map the rules
result_df = map_rules_to_df(test_df,rules_df)
result_df
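For what it's worth, here is a rough, untested sketch of the kind of vectorized alternative I was imagining, using numpy broadcasting to compare every row against every rule at once (note my assumption that a null C_C in a rule should mean "match any C"; the merge approach above never pairs a null-C_C rule with any row, since NaN keys don't join):
def map_rules_broadcast(df, rules):
    #column vectors from the data, shape (n_rows, 1)
    a = df['A'].to_numpy(dtype=float)[:, None]
    b = df['B'].to_numpy()[:, None]
    c = df['C'].to_numpy(dtype=float)[:, None]
    #row vectors from the rules, shape (1, n_rules)
    a_min = rules['A_MIN'].to_numpy(dtype=float)[None, :]
    a_max = rules['A_MAX'].to_numpy(dtype=float)[None, :]
    b_b = rules['B_B'].to_numpy()[None, :]
    c_c = rules['C_C'].to_numpy(dtype=float)[None, :]
    #(n_rows, n_rules) boolean matrix: does row i satisfy rule j in full?
    ok = (((a > a_min) | np.isnan(a_min)) &
          ((a < a_max) | np.isnan(a_max)) &
          ((b == b_b) | pd.isnull(b_b)) &
          ((c == c_c) | np.isnan(c_c)))
    #a row scores 1 if any single rule matches it in full
    df['RULE_RESULT'] = ok.any(axis=1).astype(int)
    return df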
I hope I have made what I would like to do clear and thank you for your help.
PS, my rep is non-existent, so I was not allowed to post more than two supporting images.
I'm using a for loop to generate an Excel file to graph the data from a df, so I'm using value_counts, but I would like to add under each counts df a second one with the same data but as percentages. My code is this:
li = []
for i in range(0, len(df.columns)):
    value_counts = df.iloc[:, i].value_counts().to_frame().reset_index()
    value_percentage = df.iloc[:, i].value_counts(normalize=True).to_frame().reset_index()#.drop(columns='index')
    value_percentage = (value_percentage*100).astype(str)+'%'
    li.append(value_counts)
    li.append(value_percentage)
data = pd.concat(li, axis=1)
data.to_excel("resultdf.xlsx") #index cleaned
Basically I need it to look like this:
As long as the column names match between the two data frames, you should be able to use pd.concat() to concatenate them. To stack them vertically, I think you should use axis=0 instead of axis=1 (see the pd.concat docs).
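For example, a minimal sketch for a single column (assuming a df like the dummy one built below):
counts = df['Pop'].value_counts().to_frame().reset_index()
percent = df['Pop'].value_counts(normalize=True).to_frame().reset_index()
#axis=0 stacks percent below counts instead of side by side
stacked = pd.concat([counts, percent], axis=0, ignore_index=True)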
Data
Let's prepare some dummy data to work with. Based on the provided screenshot, I'm assuming that the raw data are music genres graded on a scale of 1 to 5. So I'm gonna use something like this as data:
import pandas as pd
from numpy.random import default_rng
rng = default_rng(0)
columns = ['Pop', 'Dance', 'Rock', 'Jazz']
data = rng.integers(1, 5, size=(100, len(columns)), endpoint=True)
df = pd.DataFrame(data, columns=columns)
Notes on the original code
There's no need to iterate by a column index. We can iterate through column names, as in for column in df.columns: df[column] ...
I think it's better to format the data with the help of map('{:.0%}'.format) before transforming them to a frame.
Instead of appending counted and normalized values one by one, we'd better pd.concat them vertically into a single frame and append that to the list.
So the original code may be rewritten like this:
li = []
for col in df.columns:
    value_counts = df[col].value_counts()
    value_percentage = df[col].value_counts(normalize=True).map('{:.0%}'.format)
    li.append(pd.concat([value_counts, value_percentage]).to_frame().reset_index())
resultdf = pd.concat(li, axis=1)
resultdf.to_excel("resultdf.xlsx")
Let Excel do formatting
What if we let Excel format the data as percentages on its own? I think the easiest way to do this is to use Styler. But before that, I suggest getting rid of the Index columns. As I can see, all of them refer to the same grades 1,2,3,4,5, so we can use them as the common index, thus making the indexes meaningful. Also, I'm gonna use a MultiIndex to separate counted and normalized values like this:
formula = ['counts', 'percent']
values = [1, 2, 3, 4, 5]
counted = pd.DataFrame(index=pd.MultiIndex.from_product([formula, values], names=['formula', 'values']))
counted is our data container and it's empty at the moment. Let's fill it in:
for col in df.columns:
    counts = df[col].value_counts()
    percent = counts / counts.sum()
    counted[col] = pd.concat([counts, percent], keys=formula)
Having these data, let's apply some style to them and only then transform into an Excel file:
styled_data = (
    counted.style
    .set_properties(**{'number-format': '0'}, subset=pd.IndexSlice['counts', columns])
    .set_properties(**{'number-format': '0%'}, subset=pd.IndexSlice['percent', columns])
)
styled_data.to_excel('test.xlsx')
Now our data in Excel are looking like this:
All of them are numbers and we can use them in further calculations.
Full code
from pandas import DataFrame, MultiIndex, IndexSlice, concat
from numpy.random import default_rng
# Initial parameters
rng = default_rng(0)
data_length = 100
genres = ['Pop', 'Dance', 'Rock', 'Jazz']
values = [1, 2, 3, 4, 5]
formula = ['counts', 'percent']
file_name = 'test.xlsx'
# Prepare data
data = rng.integers(min(values), max(values), size=(data_length, len(genres)), endpoint=True)
df = DataFrame(data, columns=genres)
# Prepare a container for counted data
index = MultiIndex.from_product([formula, values], names=['formula', 'values'])
counted = DataFrame(index=index)
# Fill in counted data
for col in df.columns:
    counts = df[col].value_counts()
    percent = counts / counts.sum()
    counted[col] = concat([counts, percent], keys=formula)
# Apply number formatting and save the data in an Excel file
styled_data = (
    counted.style
    .set_properties(**{'number-format': '0'}, subset=IndexSlice['counts', :])
    .set_properties(**{'number-format': '0%'}, subset=IndexSlice['percent', :])
)
styled_data.to_excel(file_name)
P.S.
Note, to avoid confusion: with the dummy data used here we can see identical values in the counts and percent parts. That's because of how the data were built: I used 100 values in total in the initial data frame df, so each value_counts number and its percentage coincide.
python 3.11.0
pandas 1.5.1
numpy 1.23.4
Update
If we wanna keep the values for each column of the original data, but use Styler to set a number format for the second half of the output frame, then we should somehow rename the Index columns, because Styler requires unique column/index labels in a passed DataFrame. We can either rename them somehow (e.g. "Values.Pop", etc.) or use multi-indexing for the columns, which IMO looks better. Also, let's take into account that the number of unique values may differ between columns, which means we have to collect the data separately for counts and percent values before combining them:
import pandas as pd
from numpy.random import default_rng
# Prepare dummy data with missing values in some columns
rng = default_rng(0)
columns = ['Pop', 'Dance', 'Rock', 'Jazz']
data = rng.integers(1, 5, size=(100, len(columns)), endpoint=True)
df = pd.DataFrame(data, columns=columns)
df['Pop'] = df['Pop'].replace([1, 5], 2)
df['Dance'] = df['Dance'].replace(3, 5)
# Collect counted values and their percentage
counts, percent = [], []
for col in df.columns:
    item = (
        df[col].value_counts()
        .rename('count')
        .rename_axis('value')
        .to_frame()
        .reset_index()
    )
    counts.append(item)
    percent.append(item.assign(count=item['count']/item['count'].sum()))
# Combine counts and percent in a single data frame
counts = pd.concat(counts, axis=1, keys=df.columns)
percent = pd.concat(percent, axis=1, keys=df.columns)
resultdf = pd.concat([counts, percent], ignore_index=True)
# Note: In order to use resultdf in styling we should produce
# unique index labels for the output data.
# For this purpose we can use ignore_index=True
# or assign some keys for each part, e.g. key=['counted', 'percent']
# Format the second half of resultdf as Percent, i.e. "0%" in Excel terminology
styled_result = (
    resultdf.style
    .set_properties(
        **{'number-format': '0%'},
        subset=pd.IndexSlice[len(resultdf)//2:, pd.IndexSlice[:, 'count']])
    # if we used keys instead of ignore_index to produce resultdf,
    # then len(resultdf)//2: should be replaced with 'percent',
    # i.e. the name of the percent part.
)
styled_result.to_excel('my_new_excel.xlsx')
The output in this case is gonna look like this:
Wondering what the best way to tackle this issue is. If I have a DF with the following columns
df1
type_of_fruit    name_of_fruit    price
.....            .....            .....
and a list called
expected_cols = ['name_of_fruit','price']
what's the best way to automate the check of df1 against the expected_cols list? I was trying something like
df_cols=df1.columns.values.tolist()
if df_cols != expected_cols:
And then drop any columns not in expected_cols into another df, but this doesn't seem like a great idea to me. Is there a way to save the "dropped" columns?
df2 = df1.drop(columns=expected_cols)
But this seems problematic depending on column ordering, and also in cases where df1 has either more columns than expected or fewer. In cases where there are fewer columns than expected (i.e. df1 only contains the column name_of_fruit) I'm planning on using
df1.reindex(columns=expected_cols)
But I'm a bit iffy on how to do this programmatically, and on how to handle the case where there are more columns than expected.
You can use set difference using -:
Assuming df1 having cols:
In [542]: df1_cols = df1.columns # ['type_of_fruit', 'name_of_fruit', 'price']
In [539]: expected_cols = ['name_of_fruit','price']
In [541]: unwanted_cols = list(set(df1_cols) - set(expected_cols))
In [542]: df2 = df1[unwanted_cols]
In [543]: df1.drop(unwanted_cols, axis=1, inplace=True)
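Note that a set difference does not preserve the original column order; if order matters, a list comprehension (same idea, order-preserving) can be used instead:
unwanted_cols = [c for c in df1.columns if c not in expected_cols]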
Use groupby along the columns axis to split the DataFrame succinctly. In this case, check whether the columns are in your list to form the grouper, and you can store the results in a dict where the True key gets the DataFrame with the subset of columns in the list and the False key has the subset of columns not in the list.
Sample Data
import pandas as pd
df = pd.DataFrame(data = [[1,2,3]],
columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit','price']
Code
d = dict(tuple(df.groupby(df.columns.isin(expected_cols), axis=1)))
# If you need to ensure columns are always there then do
#d[True] = d[True].reindex(expected_cols)
d[True]
# name_of_fruit price
#0 2 3
d[False]
# type_of_fruit
#0 1
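Note that groupby(..., axis=1) is deprecated in recent pandas (2.1+). A boolean column mask gives an equivalent split there (a sketch of the same idea):
mask = df.columns.isin(expected_cols)
d = {True: df.loc[:, mask], False: df.loc[:, ~mask]}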
I am new to pandas, so I assume I must be missing something obvious...
Summary:
I have a DataFrame with 300K+ rows. I retrieve a row of new data which may or may not be related to an existing subset of rows in the DF (identified by Group ID), either retrieve the existing Group ID or generate a new one, and finally insert the row with its Group ID.
Pandas seems very slow for this.
Please advise : What am I missing / should I be using something else?
Details:
Columns are (example):
columnList = ['groupID','timeStamp'] + list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
Each groupID can have many unique timeStamp's
groupID is internally generated:
Either using an existing one (by matching the row to existing data, say by column 'D')
Generate new groupID
Thus (in my view at least) I cannot do updates/inserts in bulk, I have to do it row by row
I used an SQL DB analogy and created an index as a concat of groupID and timeStamp (I have tried a MultiIndex but it seems even slower).
Finally I insert/update using .loc[ind, columnName]
Code:
import pandas as pd
import numpy as np
import time
columnList = ['groupID','timeStamp'] + list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
columnTypeDict = {'groupID':'int64','timeStamp':'int64'}
startID = 1234567
df = pd.DataFrame(columns=columnList)
df = df.astype(columnTypeDict)
fID = list(range(startID,startID+300000))
df['groupID'] = fID
ts = [1000000000]*150000 + [10000000001]*150000
df['timeStamp'] = ts
indx = [str(i) + str(j) for i, j in zip(fID, ts)]
df['Index'] = indx
df['Index'] = df['Index'].astype('uint64')
df = df.set_index('Index')
startTime = time.time()
for groupID in range(startID+49000, startID+50000):
    timeStamp = 1000000003
    # Obtain/generate an index
    ind = int(str(groupID) + str(timeStamp))
    #print(ind)
    df.loc[ind,'A'] = 1
print(df)
print(time.time()-startTime,"secs")
If the index entry already exists it's fast, but if it doesn't, 10,000 inserts take 140 secs.
I think accessing DataFrame rows one at a time is a relatively expensive operation.
You can temporarily save these values and use them to create a DataFrame that is merged with the original one as follows:
startTime = time.time()
temporary_idx = []
temporary_values = []
for groupID in range(startID+49000, startID+50000):
    timeStamp = 1000000003
    # Obtain/generate an index
    ind = int(str(groupID) + str(timeStamp))
    temporary_idx.append(ind)
    temporary_values.append(1)
# create a dataframe with new values and apply a join with the original dataframe
df = df.drop(columns=["A"])\
    .merge(
        pd.DataFrame({"A": temporary_values}, index=temporary_idx).rename_axis("Index", axis="index"),
        how="outer", right_index=True, left_index=True
    )
print(df)
print(time.time()-startTime,"secs")
When I benchmarked this, it took less than 2 seconds to execute.
I don't know what your real use case is exactly, but this works for the case of inserting column A as you stated in your example. If your use case is more complex than that, there might be a better solution.
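Another batching sketch along the same lines (my assumption: the generated indexes do not already exist in df): collect the new rows as plain dicts, build one DataFrame at the end, and concatenate once:
new_rows = []
for groupID in range(startID+49000, startID+50000):
    timeStamp = 1000000003
    ind = int(str(groupID) + str(timeStamp))
    new_rows.append({'Index': ind, 'A': 1})
additions = pd.DataFrame(new_rows).set_index('Index')
# one concat instead of 10,000 single .loc inserts;
# new rows get NaN in the columns that additions does not provide
df = pd.concat([df, additions])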
I want to compare 2 CSVs (A and B) and find the rows which are present in B but not in A, based only on specific columns.
I found a few answers to this, but they still don't give the result I expect.
Answer 1:
df = new[~new['column1', 'column2'].isin(old['column1', 'column2'].values)]
This doesn't work. It works for a single column but not for multiple columns.
Answer 2:
df = pd.concat([old, new]) # concat dataframes
df = df.reset_index(drop=True) # reset the index
df_gpby = df.groupby(list(df.columns)) #group by
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1] #reindex
final = df.reindex(idx)
This takes as an input specific columns and also outputs specific columns. I want to print the whole record and not only the specific columns of the record.
I tried this and it gave me the rows:
import pandas as pd
columns = [{Name of columns you want to use}]
new = pd.merge(A, B, how = 'right', on = columns)
col = new['{Any column from the first DataFrame which isn't in the list columns. You will probably have to add an '_x' at the end of the column name}']
col = col.dropna()
new = new[~new['{Any column from the first DataFrame which isn't in the list columns. You will probably have to add an '_x' at the end of the column name}'].isin(col)]
This will give you the rows based on the columns list. Sorry for the bad naming. If you want to rename the columns a bit too, here's the code for that:
for column in new.columns:
    if '_x' in column:
        new = new.drop(column, axis=1)
    elif '_y' in column:
        new = new.rename(columns={column: column[:column.find('_y')]})
Tell me if it works.
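A tidier alternative worth knowing (a sketch, assuming A and B are your two DataFrames and columns is your list of key columns): an indicator merge flags the rows that only exist in B, while keeping all of B's columns:
merged = B.merge(A[columns].drop_duplicates(), on=columns, how='left', indicator=True)
only_in_B = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')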
I have a MultiIndex DataFrame and I'm trying to select data in it based on certain criteria, so far so good. The problem is that once I have selected my data using .loc and pd.IndexSlice, the resulting DataFrame, which should logically have fewer rows and fewer elements in the first level of the MultiIndex, keeps exactly the same MultiIndex, with some keys in it referring to empty DataFrames.
I've tried creating a completely new DataFrame with a new index, but the structure of my data set is complicated and there is not always the same number of elements in a given level, so it is not easy to create a DataFrame with the right shape into which I can put the data.
import numpy as np
import pandas as pd
np.random.seed(3) #so my example is reproducible
idx = pd.IndexSlice
iterables = [['A','B','C'],[0,1,2],['some','rdm','data']]
my_index = pd.MultiIndex.from_product(iterables,
                                      names=['first','second','third'])
my_columns = ['col1','col2','col3']
df1 = pd.DataFrame(data=np.random.randint(10, size=(len(my_index), len(my_columns))),
                   index=my_index,
                   columns=my_columns)
#Ok, so let's say I want to keep only the elements in the first level of my index (["A","B","C"]) for
#which the total sum in column 3 is less than 35, for some reason
boolean_mask = (df1.groupby(level = "first").col3.sum() < 35).tolist()
first_level_to_keep = df1.index.levels[0][boolean_mask].tolist()
#lets select the wanted data and put it in df2
df2 = df1.loc[idx[first_level_to_keep,:,:],:]
So far, everything is as expected
The problem is when I want to access the df2 index. I expected the following:
df2.index.levels[0].tolist() == ['B','C']
to be true. But this is what gives a True statement:
df2.index.levels[0].tolist() == ['A','B','C']
So my question is the following: is there a way to select data and get in return a DataFrame with a MultiIndex reflecting what is actually in it? Because I find it weird to be able to select non-existing data in my df2:
I tried to put some images of the dataframes in question but I couldn't because I don't have enough «reputation»... sorry about that.
Thank you for your time!
Even if you delete the rows corresponding to a particular value in an index level, that value still exists. You can reset the index and then set those columns back as an index in order to generate a MultiIndex with new level values.
df2 = df2.reset_index().set_index(['first','second','third'])
print(df2.index.levels[0].tolist() == ['B','C'])
True
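Alternatively (assuming pandas >= 0.20), MultiIndex.remove_unused_levels does the same cleanup without rebuilding the index from columns:
df2.index = df2.index.remove_unused_levels()
print(df2.index.levels[0].tolist() == ['B','C'])
True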