How to refactor simple dataframe parsing code with Pandas - python

I am using Pandas to parse a dataframe that I have created:
# Initial DF
    A    B    C
0  -1  qqq  XXX
1  20  www  CCC
2  30  eee  VVV
3  -1  rrr  BBB
4  50  ttt  NNN
5  60  yyy  MMM
6  70  uuu  LLL
7  -1  iii  KKK
8  -1  ooo  JJJ
My goal is to analyze column A and apply the following conditions to the dataframe:
Investigate every row:
determine if df['A'].iloc[index] == -1
if true and index == 0, mark the first row as to be removed
if true and index == N-1 (the last row), mark the last row as to be removed
if 0 < index < N-1 and df['A'].iloc[index] == -1 and the previous or following row contains -1 (df['A'].iloc[index+1] == -1 or
df['A'].iloc[index-1] == -1), mark the row as to be removed; else replace the
-1 with the average of the previous and following values
The final dataframe should look like this:
# Final DF
    A    B    C
0  20  www  CCC
1  30  eee  VVV
2  40  rrr  BBB
3  50  ttt  NNN
4  60  yyy  MMM
5  70  uuu  LLL
I was able to achieve my goal by writing a simple code that applies the conditions mentioned above:
import pandas as pd

# create dataframe
data = {'A':[-1,20,30,-1,50,60,70,-1,-1],
        'B':['qqq','www','eee','rrr','ttt','yyy','uuu','iii','ooo'],
        'C':['XXX','CCC','VVV','BBB','NNN','MMM','LLL','KKK','JJJ']}
df = pd.DataFrame(data)

# If df['A'].iloc[index]==-1:
# - option 1: remove row if it is the first or last row
# - option 2: remove row if the previous or following row also contains -1 (df['A'].iloc[index-1]==-1 or df['A'].iloc[index+1]==-1)
# - option 3: otherwise (neither neighbour is -1) replace df['A'].iloc[index] with the average of the previous and following values

N = len(df.index)  # number of rows
index_vect = []    # store indexes of rows to be deleted
for index in range(0, N):
    # option 1: first row
    if index == 0 and df['A'].iloc[index] == -1:
        index_vect.append(index)
    elif 0 < index < N-1 and df['A'].iloc[index] == -1:
        # option 2
        if df['A'].iloc[index-1] == -1 or df['A'].iloc[index+1] == -1:
            index_vect.append(index)
        # option 3
        else:
            df['A'].iloc[index] = int((df['A'].iloc[index+1] + df['A'].iloc[index-1])/2)
    # option 1: last row
    elif index == N-1 and df['A'].iloc[index] == -1:
        index_vect.append(index)

# remove rows to be deleted
df = df.drop(index_vect).reset_index(drop=True)
As you can see the code is pretty long, and I would like to know if you can suggest a smarter and more efficient way to obtain the same result.
Furthermore, I noticed my code returns a warning message caused by the line df['A'].iloc[index] = int((df['A'].iloc[index+1]+df['A'].iloc[index-1])/2)
Do you know how I could optimize that line of code?

Here's a solution:
import numpy as np
# Let's replace -1 by Not a Number (NaN)
# (the old .ix indexer has been removed from pandas, so use .loc)
df.loc[df.A == -1, 'A'] = np.nan
# If df.A is NaN and either the previous or next value is also NaN, we don't select the row.
# This takes care of the condition on the first and last row too,
# since shift introduces a NaN at the boundary.
df = df[~(df.A.isnull() & (df.A.shift(1).isnull() | df.A.shift(-1).isnull()))]
# Use interpolate to fill each remaining NaN with the average of its neighbours
df.A = df.A.interpolate(method='linear', limit=1)
Here's the resulting df:
      A    B    C
1  20.0  www  CCC
2  30.0  eee  VVV
3  40.0  rrr  BBB
4  50.0  ttt  NNN
5  60.0  yyy  MMM
6  70.0  uuu  LLL
You can then reset the index if you want to.
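Regarding the warning from your loop: df['A'].iloc[index] = ... is chained indexing, which typically triggers pandas' SettingWithCopyWarning. Writing through a single .loc call avoids it; a minimal sketch of that replacement line, using the same variables as in your loop:
# one-step assignment via .loc instead of chained df['A'].iloc[index] = ...
# (df.index[index] turns the position into a label that .loc accepts)
df.loc[df.index[index], 'A'] = int((df['A'].iloc[index+1] + df['A'].iloc[index-1]) / 2)
And to renumber the rows at the end:
df = df.reset_index(drop=True)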

Related

Loop over data in column 1 from dataframe_1 & assign to rows of a different dataframe_2 in a new column, with col 2 on dataframe_1 showing the limit

I have the below 2 dataframes
Dataframe_1
Salesguy  limit
A         10
B         11
C         0
D         14
E         6
There is another dataframe2, which contains some shop details with 10 columns and say 1000 rows. I need to assign the salesguys to the rows in dataframe2 in a new 11th column in round robin manner (ABCDE ABCDE ..so on). But the assignment needs to stop once the corresponding limit (in column 2 of dataframe_1) for the salesguy is reached.
For example, since the limit for C is 0, the assignment should be ABDE ABDE;
after 6 iterations it becomes ABD ABD (as the limit for E will be exhausted after 6 iterations).
Can anyone please help with the Python code for this?
I am able to assign the salesguys in the round robin manner using a list
l = ['A', 'B', 'C', 'D', 'E']
dataframe_2['New'] = [l[i % len(l)] for i in range(len(dataframe_2))]  # cycle A..E down the rows
But I am unable to figure out how to use column 2 to set the corresponding limits for each salesguy.
You can replicate the values with Series.repeat, sort them in round robin with sort_values and groupby.cumcount:
df2['New'] = (df1['Salesguy'].repeat(df1['limit'])
                             .sort_values(key=lambda s: s.groupby(s).cumcount(),
                                          kind='stable', ignore_index=True)
              )
print(df2)
Example:
dummy New
0 82 A
1 2 B
2 11 D
3 7 E
4 58 A
.. ... ...
995 35 NaN
996 32 NaN
997 89 NaN
998 36 NaN
999 81 NaN
[1000 rows x 2 columns]
Used input:
df2 = pd.DataFrame({'dummy': np.random.randint(0,100, size=1000)})
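To see why the cumcount key produces a round-robin order, here is a quick sketch (rebuilding df1 from the table above):
import pandas as pd
df1 = pd.DataFrame({'Salesguy': list('ABCDE'), 'limit': [10, 11, 0, 14, 6]})
s = df1['Salesguy'].repeat(df1['limit'])  # A x10, B x11, D x14, E x6 (C has limit 0)
# cumcount numbers each occurrence of a value (0 for the first A, B, D, E,
# 1 for the second, ...), so a stable sort on it interleaves the salesguys
print(s.groupby(s).cumcount().head())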
Here is a lengthier way of doing it. After a series of for-loops builds the list, you extend it to match the length of axis 0.
def get_salesguy_list(df1):
    dict1 = df1.set_index('Salesguy')['limit'].to_dict()
    dict2 = dict1.copy()
    lst = []
    for salesguy in dict1:
        for i in range(dict1[salesguy]):
            for s in dict1:
                if dict2[s] > 0:
                    # print(s, dict2[s])
                    lst.append(s)
                    dict2[s] -= 1
    return lst
a_list = get_salesguy_list(df1)
b_list = []
iter_range = int(df2.shape[0] / len(a_list)) + 1  # maths to get the number of repeated appendings
for i in range(iter_range):
    for item in a_list:
        b_list.append(item)
b_list = b_list[:df2.shape[0]]     # discard the extra items
df2['col_11'] = pd.Series(b_list)  # your column 11 of dataframe_2
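For comparison, here is a minimal sketch of the same round-robin-with-limits idea using itertools.cycle (assuming df1 and df2 as above; rows beyond the total limit are left as None):
from itertools import cycle
limits = df1.set_index('Salesguy')['limit'].to_dict()
assigned = []
for guy in cycle(limits):
    if len(assigned) >= len(df2) or all(v == 0 for v in limits.values()):
        break
    if limits[guy] > 0:  # skip salesguys whose limit is exhausted
        assigned.append(guy)
        limits[guy] -= 1
assigned += [None] * (len(df2) - len(assigned))  # pad to the length of df2
df2['col_11'] = assigned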

How to create a new column based on a condition in another column

In pandas, how can I create a new column B based on column A in df, such that:
B=1 if A_(i+1)-A_(i) > 5 or A_(i) <= 10
B=0 if A_(i+1)-A_(i) <= 5
However, the first B_i value is always one
Example:
 A   B
 5   1  (the first B_i)
12   1
14   0
22   1
20   0
33   1
Use diff, compare to your threshold with le, then invert and convert from boolean to int:
N = 5
df['B'] = (~df['A'].diff().le(N)).astype(int)
NB: using a le(N) comparison plus inversion is what yields 1 for the first value (its diff is NaN, which fails the le test).
output:
A B
0 5 1
1 12 1
2 14 0
3 22 1
4 20 0
5 33 1
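For reference, the intermediate diff on the example data looks like this:
print(df['A'].diff())
0     NaN
1     7.0
2     2.0
3     8.0
4    -2.0
5    13.0
Name: A, dtype: float64
The NaN in row 0 fails .le(5), so the inversion turns it into 1.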
Updated answer: simply combine a second condition with OR (|):
df['B'] = (~df['A'].diff().le(5)|df['A'].lt(10)).astype(int)
output: same as above with the provided data
I was a little confused by your row numbering, because if B_i were computed from the condition A_(i+1)-A_(i), the missing value would fall on the last row rather than the first (the first row has both A_(i) and A_(i+1), while the last row lacks A_(i+1)).
Anyway, based on your example I assumed that we calculate B_(i+1).
import pandas as pd
df = pd.DataFrame(columns=["A"], data=[5, 12, 14, 22, 20, 33])
df['shifted_A'] = df['A'].shift(1)  # this column can be removed - it is added only to show how shift works on the final dataframe
df['B'] = ''
df.loc[((df['A'] - df['A'].shift(1)) > 5) + (df['A'].shift(1) <= 10), 'B'] = 1  # update rows that fulfil one of the conditions with 1
df.loc[(df['A'] - df['A'].shift(1)) <= 5, 'B'] = 0  # update rows that fulfil the condition with 0
df.loc[df.index == 0, 'B'] = 1  # update the first row of column B
print(df)
That prints:
A shifted_A B
0 5 NaN 1
1 12 5.0 1
2 14 12.0 0
3 22 14.0 1
4 20 22.0 0
5 33 20.0 1
I am not sure if it is the fastest way, but I guess it is one of the easier ones to understand.
A little explanation:
df.loc[mask, columnname] = newvalue lets us update values in the given column wherever the condition (mask) is fulfilled
((df['A'] - df['A'].shift(1)) > 5) + (df['A'].shift(1) <= 10)
Each condition here returns a boolean Series. Adding them gives True if either one is True (which is simply OR). If we needed AND instead, we could multiply the conditions.
Use Series.diff, fill the first missing value with N so that it passes the comparison, then compare for greater-or-equal with Series.ge:
N = 5
df['B'] = (df.A.diff().fillna(N).ge(N) | df.A.lt(10)).astype(int)
print (df)
A B
0 5 1
1 12 1
2 14 0
3 22 1
4 20 0
5 33 1
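A minimal equivalent sketch with numpy.where, for comparison (same df as in the question):
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [5, 12, 14, 22, 20, 33]})
# diff() is NaN on the first row; NaN.le(5) is False, so ~le(5) is True there
df['B'] = np.where(~df['A'].diff().le(5) | df['A'].lt(10), 1, 0)
print(df)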

Using iterations (or some other method) to apply a function across a dataframe and tabulate multiple output values

I have a function that needs to be mapped across a dataframe, to every cell in columns B and C. For each cell input, the function outputs 4 values which I would like to track, along with the ID in column A and the column on which the iteration acted.
def function(cell):
    add_2 = cell + 2
    subtract_2 = cell - 2
    multiply_2 = cell*2
    divide_2 = cell/2
    return add_2, subtract_2, multiply_2, divide_2  # return all four results
For example:
df
[A] [B] [C]
AAA 2 4
BBB 6 10
Goal Output:
df2
[ID] [COL] [Add_2] [Subtract_2] [Multiply_2] [Divide_2]
AAA B 4 0 4 1
AAA C 6 2 8 2
BBB B 8 4 12 3
BBB C 12 8 20 5
I've explored the option of writing a for loop and populating an empty dataframe through the .append() method, but I cannot seem to get the proper loop to output what I'm looking for. I also read about .applymap() as a means to iterate across a dataframe, but I haven't found a way to tabulate my output dataframe. Any help is greatly appreciated.
Here's one approach to getting your target output:
df = df.melt(id_vars='[A]', var_name='COL', value_name='VAL')
val = df['VAL']
df.assign(
    Add_2=val.add(2),
    Subtract_2=val.sub(2),
    Multiply_2=val.mul(2),
    Divide_2=val.div(2)
).drop('VAL', axis=1)
# [A] COL Add_2 Divide_2 Multiply_2 Subtract_2
# 0 AAA [B] 4 1.0 4 0
# 1 BBB [B] 8 3.0 12 4
# 2 AAA [C] 6 2.0 8 2
# 3 BBB [C] 12 5.0 20 8
... and here's how you can write that into a function that takes an arbitrary num argument, rather than just 2:
def function(dframe, num):
    # here, "melt" returns 3 columns: [A], COL, value
    temp_df = dframe.melt(id_vars='[A]', var_name='COL')
    # store the base column for the calculations,
    # so we only look it up once
    val = temp_df['value']
    # store the num argument as a string,
    # so we can add column suffixes
    str_num = str(num)
    # create a dict of column names + transformed Series objects
    # to pass into "assign"
    transformations = {
        ("Add_" + str_num): val.add(num),
        ("Subtract_" + str_num): val.sub(num),
        ("Multiply_" + str_num): val.mul(num),
        ("Divide_" + str_num): val.div(num)
    }
    return temp_df.assign(**transformations).drop('value', axis=1)
# example:
function(df, 10)
# [A] COL Add_10 Divide_10 Multiply_10 Subtract_10
# 0 AAA [B] 12 0.2 20 -8
# 1 BBB [B] 16 0.6 60 -4
# 2 AAA [C] 14 0.4 40 -6
# 3 BBB [C] 20 1.0 100 0
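As a sanity check, calling it with num=2 reproduces the goal output from the question (modulo column order):
function(df, 2)
#   [A]  COL  Add_2  Divide_2  Multiply_2  Subtract_2
# 0 AAA  [B]      4       1.0           4           0
# 1 BBB  [B]      8       3.0          12           4
# 2 AAA  [C]      6       2.0           8           2
# 3 BBB  [C]     12       5.0          20           8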

Filter pandas dataframe with specific column names in python

I have a pandas dataframe and a list as follows
mylist = ['nnn', 'mmm', 'yyy']
mydata =
xxx yyy zzz nnn ddd mmm
0 0 10 5 5 5 5
1 1 9 2 3 4 4
2 2 8 8 7 9 0
Now, I want to get only the columns mentioned in mylist and save it as a csv file.
i.e.
yyy nnn mmm
0 10 5 5
1 9 3 4
2 8 7 0
My current code is as follows.
mydata = pd.read_csv(input_file, header=0)
for item in mylist:
    mydata_new = mydata[item]
    print(mydata_new)
mydata_new.to_csv(file_name)
It seems to me that my new dataframe produces wrong results. Where am I going wrong? Please help me!
Just pass a list of column names to index df:
df[['nnn', 'mmm', 'yyy']]
nnn mmm yyy
0 5 5 10
1 3 4 9
2 7 0 8
If you need to handle non-existent column names in your list, try filtering with df.columns.isin -
df.loc[:, df.columns.isin(['nnn', 'mmm', 'yyy', 'zzzzzz'])]
yyy nnn mmm
0 10 5 5
1 9 3 4
2 8 7 0
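To then write the selection straight to a CSV (reusing file_name from your code):
df[['nnn', 'mmm', 'yyy']].to_csv(file_name, index=False)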
You can just put mylist inside [] and pandas will select it for you.
mydata_new = mydata[mylist]
Not sure whether your yyy is a typo.
The reason your code goes wrong is that each iteration of the loop reassigns mydata_new to a single column:
for item in mylist:
    mydata_new = mydata[item] # <-
Thus, you end up with a single Series (the last column) rather than the whole df you want.
If some names in the list are not in your data frame, you can always check with
len(set(mylist) - set(mydata.columns)) > 0
and print them out:
print(set(mylist) - set(mydata.columns))
Then see if there are typos or other unintended behaviors.
If mylist contains some column names which are not in mydata.columns, you will get an error like
KeyError: "['fff'] not in index"
In this case, you can use the df.filter function:
mydata.filter(['nnn', 'mmm', 'yyy', 'fff'])
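Alternatively, if you want missing names to show up as empty (NaN) columns instead of being dropped, reindex is one option:
mydata.reindex(columns=['nnn', 'mmm', 'yyy', 'fff'])  # 'fff' becomes an all-NaN column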

How to count particular column values in python pandas?

I have a dataframe like below:
df1_data = {'sym'      :{0:'AAA',1:'BBB',2:'CCC',3:'AAA',4:'CCC',5:'DDD',6:'EEE',7:'EEE',8:'FFF'},
            'identity' :{0:'AD',1:'AD',2:'AU',3:'AU',4:'AU',5:'AZ',6:'AU',7:'AZ',8:'AZ'}}
I want to work on the sym column of my dataframe. My intention is to generate two different files: one containing the same two columns in a different order, and a second file containing the columns sym,sym_count,AD_count,AU_count,neglected_count.
Edit 1 -
I want to ignore identities other than AD & AU: in both output files I don't want results for any identity other than AD & AU. The neglected_count column is optional.
Expected Result-
result.csv
sym,identity
AAA,AD
AAA,AU
BBB,AD
CCC,AU
CCC,AU
EEE,AU
result_count.csv
sym,sym_count,AD_count,AU_count,neglected_count
AAA,2,1,1,0
BBB,1,1,0,0
CCC,2,0,2,0
EEE,2,0,1,1
How can I perform this type of calculation in python pandas?
I think you need crosstab, with insert for adding the sum column at the first position and add_suffix for the column names.
Last, write the results with to_csv.
df1_data = {'sym'      :{0:'AAA',1:'BBB',2:'CCC',3:'AAA',4:'CCC',5:'DDD',6:'EEE',7:'EEE',8:'FFF'},
            'identity' :{0:'AD',1:'AD',2:'AU',3:'AU',4:'AU',5:'AZ',6:'AU',7:'AZ',8:'AZ'}}
df = pd.DataFrame(df1_data, columns=['sym','identity'])
print (df)
sym identity
0 AAA AD
1 BBB AD
2 CCC AU
3 AAA AU
4 CCC AU
5 DDD AZ
6 EEE AU
7 EEE AZ
8 FFF AZ
#write to csv
df.to_csv('result.csv', index=False)
#need vals only in identity
vals = ['AD','AU']
#replace another values to neglected
neglected = df.loc[~df.identity.isin(vals), 'identity'].unique().tolist()
neglected = {x:'neglected' for x in neglected}
print (neglected)
{'AZ': 'neglected'}
df.identity = df.identity.replace(neglected)
df1 = pd.crosstab(df['sym'], df['identity'])
df1.insert(0, 'sym', df1.sum(axis=1))
df2 = df1.add_suffix('_count').reset_index()
#find all rows where is 0 in columns with vals
mask = ~df2.filter(regex='|'.join(vals)).eq(0).all(axis=1)
print (mask)
0 True
1 True
2 True
3 False
4 True
5 False
dtype: bool
#boolean indexing
df2 = df2[mask]
print (df2)
identity sym sym_count AD_count AU_count neglected_count
0 AAA 2 1 1 0
1 BBB 1 1 0 0
2 CCC 2 0 2 0
4 EEE 2 0 1 1
df2.to_csv('result_count.csv', index=False)
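For reference, the crosstab step is equivalent to a groupby + unstack; a minimal sketch of the same counting step (picking up df after the 'neglected' replacement above):
df1 = df.groupby(['sym', 'identity']).size().unstack(fill_value=0)
df1.insert(0, 'sym', df1.sum(axis=1))  # total per sym, placed first
df2 = df1.add_suffix('_count').reset_index()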
