Related
I would like to go through every row (entry) in my df and remove every entry that has the value of " " (which yes is an empty string).
So if my data set is:
Name Gender Age
Jack 5
Anna F 6
Carl M 7
Jake M 7
Therefore Jack would be removed from the dataset.
On another note, I would also like to remove entries that has the value "Unspecified" and "Undetermined" as well.
Eg:
Name Gender Age Address
Jack 5 *address*
Anna F 6 *address*
Carl M 7 Undetermined
Jake M 7 Unspecified
Now,
Jack will be removed due to empty field.
Carl will be removed due to the value Undetermined present in a column.
Jake will be removed due to the value Unspecified present in a column.
For now, this has been my approach but I keep getting a TypeError.
list = []
for i in df.columns:
if df[i] == "":
# everytime there is an empty string, add 1 to list
list.append(1)
# count list to see how many entries there are with empty string
len(list)
Please help me with this. I would prefer a for loop being used due to there being about 22 columns and 9000+ rows in my actual dataset.
Note - I do understand that there are other questions asked like this, its just that none of them apply to my situation, meaning that most of them are only useful for a few columns and I do not wish to hardcode all 22 columns.
Edit - Thank you for all your feedbacks, you all have been incredibly helpful.
To delete a row based on a condition use the following:
df = df.drop(df[condition].index)
For example:
df = df.drop(df[Age==5].index) , will drop the row where the Age is 5.
I've come across a post regarding the same dating back to 2017, it should help you understand it more clearer.
Regarding question 2, here's how to remove rows with the specified values in a given column:
df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]
Let's assume we have a Pandas DataFrame object df.
To remove every row given your conditions, simply do:
df = df[df.Gender == " " or df.df.Age == " " or df.Address in [" ", "Undetermined", "Unspecified"]]
If the unspecified fields are NaN, you can also do:
df = df.dropna(how="any", axis = 0)
Answer from #ThatCSFresher or #Bence will help you out in removing rows based on single column... Which is great!
However, I think there are multiple condition in your query needed to check across multiple columns at once in a loop. So, probably apply-lambda can do the job; Try the following code;
df = pd.DataFrame({"Name":["Jack","Anna","Carl","Jake"],
"Gender":["","F","M","M"],
"Age":[5,6,7,7],
"Address":["address","address","Undetermined","Unspecified"]})
df["Noise_Tag"] = df.apply(lambda x: "Noise" if ("" in list(x)) or ("Undetermined" in list(x)) or ("Unspecified" in list(x)) else "No Noise",axis=1)
df1 = df[df["Noise_Tag"] == "No Noise"]
del df1["Noise_Tag"]
# Output of df;
Name Gender Age Address Noise_Tag
0 Jack 5 address Noise
1 Anna F 6 address No Noise
2 Carl M 7 Undetermined Noise
3 Jake M 7 Unspecified Noise
# Output of df1;
Name Gender Age Address
1 Anna F 6 address
Well, OP actually wants to delete any column with "empty" string.
df = df[~(df=="").any(axis=1)] # deletes all rows that have empty string in any column.
If you want to delete specifically for address column, then you can just delete using
df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]
Or if any column with Undetermined or Unspecified, try similar as the first solution in my post, just by replacing the empty string with Undertermined or Unspecified.
df = df[~((df=="Undetermined") | (df=="Unspecified")).any(axis=1)]
You can build masks and then filter the df according to it:
m1 = df.eq('').any(axis=1)
# m1 is True if any cell in a row has an empty string
m2 = df['Address'].isin(['Undetermined', 'Unspecified'])
# m2 is True if a row has one of the values in the list in column 'Address'
out = df[~m1 & ~m2] # invert both condition and get the desired output
print(out)
Output:
Name Gender Age Address
1 Anna F 6 *address*
Used Input:
df = pd.DataFrame({'Name': ['Jack', 'Anna', 'Carl', 'Jake'],
'Gender': ['', 'F', 'M', 'M'],
'Age': [5, 6, 7, 7],
'Address': ['*address*', '*address*', 'Undetermined', 'Unspecified']}
)
using lambda fun
Code:
df[df.apply(lambda x: False if (x.Address in ['Undetermined', 'Unspecified'] or '' in list(x)) else True, axis=1)]
Output:
Name Gender Age Address
1 Anna F 6 *add
I have two data frames that I am looking to apply two separate functions to that will perform validation checks on each data frame independently, and then any differences that arise will get concatenated into one transformed list.
The issue I am facing is that the first validation check should happen ONLY if numeric values exist in ALL of the numeric columns of whichever of the two data frames it is analyzing. If there are ANY NaN values in a row line for the first validation check, then that row should be skipped.
The second validation check does not need that specification.
Here are the data frames, functions, and transformations:
import pandas as pd
import numpy as np
df1 = {'Fruits': ["Banana","Blueberry","Apple","Cherry","Mango","Pineapple","Watermelon","Papaya","Pear","Coconut"],
'Price': [2,1.5,np.nan,2.5,3,4,np.nan,3.5,1.5,2],'Amount':[40,19,np.nan,np.nan,60,70,80,np.nan,45,102],
'Quantity Frozen':[3,4,np.nan,15,np.nan,9,12,8,np.nan,80],
'Quantity Fresh':[37,12,np.nan,45,np.nan,61,np.nan,24,14,20],
'Multiple':[74,17,np.nan,112.5,np.nan,244,np.nan,84,21,40]}
df1 = pd.DataFrame(df1, columns = ['Fruits', 'Price','Amount','Quantity Frozen','Quantity Fresh','Multiple'])
df2 = {'Fruits': ["Banana","Blueberry","Apple","Cherry","Mango","Pineapple","Watermelon","Papaya","Pear","Coconut"],
'Price': [2,1.5,np.nan,2.6,3,4,np.nan,3.5,1.5,2],'Amount':[40,16,np.nan,np.nan,60,72,80,np.nan,45,100],
'Quantity Frozen':[3,4,np.nan,np.nan,np.nan,9,12,8,np.nan,80],
'Quantity Fresh':[np.nan,12,np.nan,45,np.nan,61,np.nan,24,15,20],
'Multiple':[74,17,np.nan,112.5,np.nan,244,np.nan,84,20,40]}
df2 = pd.DataFrame(df2, columns = ['Fruits', 'Price','Amount','Quantity Frozen','Quantity Fresh','Multiple'])
#Validation Check 1:
for name, dataset in {'Fruit Dataset1':df1,'Fruit Dataset2':df2}.items():
dataset['dif_Stock on Hand'] = dataset['Quantity Fresh']+dataset['Quantity Frozen']
for varname,var in {'Stock on Hand vs. Quantity Fresh + Quantity Frozen':'dif_Stock on Hand'}.items():
print('{} differences in {}:'.format(name, varname))
print(dataset[var].value_counts())
print('\n')
#Validation Check 2:
for name, dataset in {'Fruit Dataset1':df1,'Fruit Dataset2':df2}.items():
dataset['dif_Multiple'] = dataset['Price'] * dataset['Quantity Fresh']
for varname,var in {'Multiple vs. Price x Quantity Fresh':'dif_Multiple'}.items():
print('{} differences in {}:'.format(name, varname))
print(dataset[var].value_counts())
print('\n')
# #Wrangling internal inconsistency data frames to be in correct format
inconsistency_vars = ['dif_Stock on Hand','dif_Multiple']
inconsistency_var_betternames = {'dif_Stock on Hand':'Stock on Hand = Quantity Fresh + Quantity Frozen','dif_Multiple':'Multiple = Price x Quantity on Hand'}
# #Rollup1
idvars1=['Fruits']
df1 = df1[idvars1 + inconsistency_vars]
df2 = df2[idvars1 + inconsistency_vars]
df1 = df1.melt(id_vars = idvars1, value_vars = inconsistency_vars, value_name = 'Difference Magnitude')
df2 = df2.melt(id_vars = idvars1, value_vars = inconsistency_vars, value_name = 'Difference Magnitude')
df1['dataset'] = 'Fruit Dataset1'
df2['dataset'] = 'Fruit Dataset2'
# #First table in Internal Inconsistencies Sheet (Table 5)
inconsistent = pd.concat([df1,df2])
inconsistent = inconsistent[['variable','Difference Magnitude','dataset','Fruits']]
inconsistent['variable'] = inconsistent['variable'].map(inconsistency_var_betternames)
inconsistent = inconsistent[inconsistent['Difference Magnitude'] != 0]
Here is the desired output, which for the first validation check skips rows in either data frame that have ANY NaN values in the numeric columns (every column but 'Fruits'):
#Desired output
inconsistent_true = {'variable': ["Stock on Hand = Quantity Fresh + Quantity Frozen","Stock on Hand = Quantity Fresh + Quantity Frozen","Multiple = Price x Quantity on Hand",
"Multiple = Price x Quantity on Hand","Multiple = Price x Quantity on Hand"],
'Difference Magnitude': [1,2,1,4.5,2.5],
'dataset':["Fruit Dataset1","Fruit Dataset1","Fruit Dataset2","Fruit Dataset2","Fruit Datset2"],
'Fruits':["Blueberry","Coconut","Blueberry","Cherry","Pear"]}
inconsistent_true = pd.DataFrame(inconsistent_true, columns = ['variable', 'Difference Magnitude','dataset','Fruits'])
A pandas function that may come in handy is pd.isnull() return True for np.nan value-
For example take df1-
pd.isnull(df1['Amount'][2])
True
This can be added as a check to all your numeric columns as such and then use only rows that have column 'numeric_check' value as 1-
df1['numeric_check'] = df1.apply(lambda x: 0 if (pd.isnull(x['Amount']) or
pd.isnull(x['Price']) or pd.isnull(x['Quantity Frozen']) or
pd.isnull(x['Quantity Fresh']) or pd.isnull(x['Multiple'])) else 1, axis =1)
refer the modified validation check 1 -
#Validation Check 1:
for name, dataset in {'Fruit Dataset1':df1,'Fruit Dataset2':df2}.items():
if '1' in name: # check to implement condition for only df1
# Adding the 'numeric_check' column to dataset df
dataset['numeric_check'] = dataset.apply(lambda x: 0 if (pd.isnull(x['Amount']) or
pd.isnull(x['Price']) or pd.isnull(x['Quantity Frozen']) or
pd.isnull(x['Quantity Fresh']) or pd.isnull(x['Multiple'])) else 1, axis =1)
# filter out Nan rows, they will not be considered for this check
dataset = dataset.loc[dataset['numeric_check']==1]
dataset['dif_Stock on Hand'] = dataset['Quantity Fresh']+dataset['Quantity Frozen']
for varname,var in {'Stock on Hand vs. Quantity Fresh + Quantity Frozen':'dif_Stock on Hand'}.items():
print('{} differences in {}:'.format(name, varname))
print(dataset[var].value_counts())
print('\n')
I hope I got your intention.
# make boolean mask, True if all numeric values are not NaN
mask = df1.select_dtypes('number').notna().all(axis=1)
print(df1[mask])
Fruits Price Amount Quantity Frozen Quantity Fresh Multiple
0 Banana 2.0 40.0 3.0 37.0 74.0
1 Blueberry 1.5 19.0 4.0 12.0 17.0
5 Pineapple 4.0 70.0 9.0 61.0 244.0
9 Coconut 2.0 102.0 80.0 20.0 40.0
I am stumbled upon a trivial problem in pandas. I have two dataframes. The first one, df_1 is as follows
vendor_name date company_name state
PERTH is june 2019 Abc enterprise Kentucky
Megan Ent 25-april-2019 Xyz Fincorp Texas
The second one df_2 contains the correct values for each column in df_1.
df_2
Field wrong value correct value
vendor_name PERTH Perth Enterprise
date is 15 ## this means that is should be read as 15
company_name Abc enterprise ABC International Enterprise Inc.
In order to replace the values with correct ones in df_1 (except date field) I am using pandas.loc method. Below is the code snippet
vend = df_1['vendor_Name'].tolist()
comp = df_1['company_name'].tolist()
state = df_1['state'].tolist()
for i in vend:
if df_2['wrong value'].str.contains(i):
crct = df_2.loc[df_2['wrong value'] == i,'correct value'].tolist()
Similarly, for company and state I have followed the above way.
However, the crct is returning a blank series. Ideally it should return
['Perth Enterprise','Abc International Enterprise Inc']
The next step would be to replace the respective field values by the above list.
With the above, I have three questions:
Why the above code is generating a blank list? What I am missing here?
How can I replace the respective fields using df_1.replace method?
What should be a correct approach to replace the portion of date in df_1 by the correct one in df_2?
Edit: when data has looping replacement(i.e overlaping keys and values), replacement on whole dataframe will fail. In this case, doing it column by column and concat them together. Finally, use join to adding any missing columns from df1:
df_replace = pd.concat([df1[k].replace(val, regex=True) for k, val in d.items()], axis=1).join(df1.state)
Original:
I tried your code in my interactive and it gives error ValueError: The truth value of a Series is ambiguous on df_2['wrong value'].str.contains(i).
assume you have multiple vendor names, so the simple way is construct a dictionary from groupby of df2 and use it with df.replace on df1.
d = {k: gp.set_index('wrong value')['correct value'].to_dict()
for k, gp in df2.groupby('Field')}
Out[64]:
{'company_name': {'Abc enterprise': 'ABC International Enterprise Inc. '},
'date': {'is': '15'},
'vendor_name': {'PERTH': 'Perth Enterprise'}}
df_replace = df1.replace(d, regex=True)
print(df_replace)
In [68]:
vendor_name date company_name \
0 Perth Enterprise 15 june 2019 ABC International Enterprise Inc.
1 Megan Ent 25-april-2019 Xyz Fincorp
state
0 Kentucky
1 Texas
Note: your sample df2 has only value for vendor PERTH, so it only replace first row. When you have all vendor_names in df2, it will replace them all in df1.
A simple way to do that is to iterate over the first dataframe and then replace the wrong values :
Result = pd.DataFrame()
for i in range(len(df1)):
vendor_name = df1.iloc[i]['vendor_name']
date = df1.iloc[i]['date']
company_name = df1.iloc[i]['company_name']
if vendor_name in df2['wrong value'].values:
vendor_name = df2.loc[df2['wrong value'] == vendor_name]['correct value'].values[0]
if company_name in df2['wrong value'].values:
company_name = df2.loc[df2['wrong value'] == company_name]['correct value'].values[0]
new_row = {'vendor_name':[vendor_name],'date':[date],'company_name':[company_name]}
new_row = pd.DataFrame(new_row,columns=['vendor_name','date','company_name'])
Result = Result.append(new_row,ignore_index=True)
Result :
Define the following replace function:
def repl(row):
fld = row.Field
v1 = row['wrong value']
v2 = row['correct value']
updInd = df_1[df_1[fld].str.contains(v1)].index
df_1.loc[updInd, fld] = df_1.loc[updInd, fld]\
.str.replace(re.escape(v1), v2)
Then call it for each row in df_2:
for _, row in df_2.iterrows():
repl(row)
Note that str.replace alone does not require to import re (Pandas
imports it under the hood).
But in the above function re.escape is called explicitely, from our code,
hence import re is required.
Have a pandas dataframe that includes multiple columns of monthly finance data. I have an input of period that is specified by the person running the program. It's currently just saved as period like shown below within the code.
#coded into python
period = ?? (user adds this in from input screen)
I need to create another column of data that uses the input period number to perform a calculation of other columns.
So, in the above table I'd like to create a new column 'calculation' that depends on the period input. For example, if a period of 1 was used the following calc1 would be completed (with math actually done). Period = 2 - then calc2. Period = 3 - then calc3. I only need one column calculated depending on the period number but added three examples in below picture for example of how it'd work.
I can do this in SQL using case when. So using the input period then sum what columns I need to.
select Account #,
'&Period' AS Period,
'&Year' AS YR,
case
When '&Period' = '1' then sum(d_cf+d_1)
when '&Period' = '2' then sum(d_cf+d_1+d_2)
when '&Period' = '3' then sum(d_cf+d_1+d_2+d_3)
I am unsure on how to do this easily in python (newer learner). Yes, I could create a column that does each calculation via new column for every possible period (1-12), and then only select that column but I'd like to learn and do it a more efficient way.
Can you help more or point me in a better direction?
You could certainly do something like
df[['d_cf'] + [f'd_{i}' for i in range(1, period+1)]].sum(axis=1)
You can do this using a simple function in python:
def get_calculation(df, period=NULL):
'''
df = pandas data frame
period = integer type
'''
if period == 1:
return df.apply(lambda x: x['d_0'] +x['d_1'], axis=1)
if period == 2:
return df.apply(lambda x: x['d_0'] +x['d_1']+ x['d_2'], axis=1)
if period == 3:
return df.apply(lambda x: x['d_0'] +x['d_1']+ x['d_2'] + x['d_3'], axis=1)
new_df = get_calculation(df, period = 1)
Setup:
df = pd.DataFrame({'d_0':list(range(1,7)),
'd_1': list(range(10,70,10)),
'd_2':list(range(100,700,100)),
'd_3': list(range(1000,7000,1000))})
Setup:
import pandas as pd
ddict = {
'Year':['2018','2018','2018','2018','2018',],
'Account_Num':['1111','1122','1133','1144','1155'],
'd_cf':['1','2','3','4','5'],
}
data = pd.DataFrame(ddict)
Create value calculator:
def get_calcs(period):
# Convert period to integer
s = str(period)
# Convert to string value
n = int(period) + 1
# This will repeat the period number by the value of the period number
return ''.join([i * n for i in s])
Main function copies data frame, iterates through period values, and sets calculated values to the correct spot index-wise for each relevant column:
def process_data(data_frame=data, period_column='d_cf'):
# Copy data_frame argument
df = data_frame.copy(deep=True)
# Run through each value in our period column
for i in df[period_column].values.tolist():
# Create a temporary column
new_column = 'd_{}'.format(i)
# Pass the period into our calculator; Capture the result
calculated_value = get_calcs(i)
# Create a new column based on our period number
df[new_column] = ''
# Use indexing to place the calculated value into our desired location
df.loc[df[period_column] == i, new_column] = calculated_value
# Return the result
return df
Start:
Year Account_Num d_cf
0 2018 1111 1
1 2018 1122 2
2 2018 1133 3
3 2018 1144 4
4 2018 1155 5
Result:
process_data(data)
Year Account_Num d_cf d_1 d_2 d_3 d_4 d_5
0 2018 1111 1 11
1 2018 1122 2 222
2 2018 1133 3 3333
3 2018 1144 4 44444
4 2018 1155 5 555555
Given I have the following csv data.csv:
id,category,price,source_id
1,food,1.00,4
2,drink,1.00,4
3,food,5.00,10
4,food,6.00,10
5,other,2.00,7
6,other,1.00,4
I want to group the data by (price, source_id) and I am doing it with the following code
import pandas as pd
df = pd.read_csv('data.csv', names=['id', 'category', 'price', 'source_id'])
grouped = df.groupby(['price', 'source_id'])
valid_categories = ['food', 'drink']
for price_source, group in grouped:
if group.category.size < 2:
continue
categories = group.category.tolist()
if 'other' in categories and len(set(categories).intersection(valid_categories)) > 0:
pass
"""
Valid data in this case is:
1,food,1.00,4
2,drink,1.00,4
6,other,1.00,4
I will need all of the above data including the id for other purposes
"""
Is there an alternate way to perform the above filtering in pandas before the for loop and if it's possible, will it be any faster than the above?
The criteria for filtering is:
size of the group is greater than 1
the group by data should contain category other and at least one of either food or drink
You could directly apply a custom filter to the GroupBy object, something like
crit = lambda x: all((x.size > 1,
'other' in x.category.values,
set(x.category) & {'food', 'drink'}))
df.groupby(['price', 'source_id']).filter(crit)
Outputs
category id price source_id
0 food 1 1.0 4
1 drink 2 1.0 4
5 other 6 1.0 4