Skip empty cells - Python Pandas

I am writing a script to count the percentage of cells that have a specific value. However, when it counts the rows, it does not exclude the cells that are NaN. Basically, I do not want the script to count a cell with the value NaN as a row. I have tried everything from != "" to .isnan.
What I'm trying to do is calculate the percentage of cells that have a specific value, which is not possible if the function counts the rows with NaN values.
RELEVANT CODE
df2 = pd.DataFrame(supplier_data_df, columns=['supplier keywords', 'supplier in ocr'])
total_suppliers = df2[(df2["supplier in ocr"] != "") & (df2["supplier keywords"] != "")]
percentilesupplierkeyword = len(supplier_filtered_df)/len(total_suppliers) * 100
print(percentilesupplierkeyword,"% of supplier-keywords have an issue")
Thank you in advance.

I hope you're doing well.
You can either drop the NaN values or exclude them from your dataframe, and then perform the computations that follow.
If you want to drop the NaN values:
df2.dropna(inplace=True)
Or you could use the fillna method to fill the NaN values with 0.
df2.fillna(0, inplace=True)
If you want to get the index list of the NaN values:
df2[df2["col1"].isna()].index.tolist()
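For the original percentage calculation, a minimal self-contained sketch with made-up data could look like this; only the two column names come from the question, and the "issue" condition and the supplier_filtered_df logic are invented for illustration:
import pandas as pd

# made-up example data
supplier_data_df = pd.DataFrame({
    "supplier keywords": ["acme", "globex", None, "initech"],
    "supplier in ocr": ["acme", None, "umbrella", "initech"],
})

df2 = supplier_data_df[["supplier keywords", "supplier in ocr"]]

# dropna removes rows where either column is NaN, so they no longer
# inflate the denominator
total_suppliers = df2.dropna(subset=["supplier keywords", "supplier in ocr"])

# hypothetical "issue" condition: OCR text does not match the keyword
supplier_filtered_df = total_suppliers[
    total_suppliers["supplier keywords"] != total_suppliers["supplier in ocr"]
]

percentilesupplierkeyword = len(supplier_filtered_df) / len(total_suppliers) * 100
print(percentilesupplierkeyword, "% of supplier-keywords have an issue")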

Related

Lookup based on row and column header Pandas

How do I use the QuantityFormula column to iterate over the column headers? For example, to find where count (from QuantityFormula) == count (from the headers):
Take the value of that row.
Produce a new column called Quantity with that value.
Do the same for all of Count, Area and Volume.
It needs to work if new rows are added too.
I found this code online; to start with, I am looking to modify it or create a new piece of code that does what I need. How do I loop and compare a column to the headers (lookup_array == lookup_value) and store the row value of that match?
Note: the NaN columns (Count, Area, Volume) could have values in them in future tables.
def xlookup(lookup_value, lookup_array, return_array, if_not_found: str = ''):
    match_value = return_array.loc[lookup_array == lookup_value]
    if match_value.empty:
        return f'"{lookup_value}" not found!' if if_not_found == '' else if_not_found
    else:
        return match_value.tolist()[0]

Merged['Quantity'] = Merged['QuantityFormula'].apply(xlookup, args=(Merged['NRM'], left['UoM']))
I have XLOOKUP functionality, but I need something slightly different.
Here is one way to do it.
I used a made-up DataFrame; if you had shared your dataframe as code (preferably) or text, I would have used that. Refer to https://stackoverflow.com/help/minimal-reproducible-example
# use apply to pick up, for each row, the value of the column named in the formula column
df['quantity'] = df.apply(lambda x: x[x['formula']], axis=1)
df
   count  area formula  quantity
0    1.0   NaN   count       1.0
1    1.0   NaN   count       1.0
2    NaN   1.4    area       1.4
3    NaN   0.6    area       0.6
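If the real column names are the ones mentioned in the question (QuantityFormula, Count, Area, Volume) and the frame is called Merged, the same idea would look roughly like the line below; both names are assumptions on my part.
# assumption: QuantityFormula holds the name of the column to read for each row
Merged['Quantity'] = Merged.apply(lambda row: row[row['QuantityFormula']], axis=1)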
With your current data, you have NaN in the columns that aren't the one you want, and only have a real value in the one you do.
So, I say you just add up those three columns, which will effectively be the_number_you_want + 0 + 0. You can use np.nansum() to treat the NaN values as zero when adding.
...
import numpy as np
...
df['Quantity'] = np.nansum(df[['Count','Area','Volume']],axis=1)
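As a small runnable illustration of the np.nansum idea (made-up data; only the column names are taken from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Count":  [1.0, 1.0, np.nan, np.nan],
    "Area":   [np.nan, np.nan, 1.4, 0.6],
    "Volume": [np.nan, np.nan, np.nan, np.nan],
})

# np.nansum treats NaN as 0, so each row's sum is just its single real value
df['Quantity'] = np.nansum(df[['Count', 'Area', 'Volume']], axis=1)
print(df)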

Pandas - Lookup value for each item in list

I am relatively new to Python and Pandas. I have two dataframes; one contains a column of codes separated by commas - the number of codes in each list can vary, and a list can instead contain a string such as 'Not Applicable' or a blank. The other is a lookup table of the codes and a value. I want to look up the value of each individual code in each list and calculate the maximum value within that list. For example, ['H302','H304'] would be [18,11], and the maximum value of those two would be 18. I then want to return the maximum value of each list as a new column in df2. If it contains anything else, return blank.
This process was originally written in VBA, I solved the problem there by splitting each set of codes by delimiter to a new column, then dynamically running index/matches against each code to return the value. Then it would calculate the maximum value and delete out all the generated columns. I thought at the time it was a messy way to do it and I don't want to replicate this in the Python version.
I would post what I've tried, but I can't figure out how I'd go about this - any help is appreciated!
import pandas as pd

df1 = [['H302', 18],
       ['H312', 17],
       ['H315', 16],
       ['H316', 15],
       ['H319', 14],
       ['H320', 13],
       ['H332', 12],
       ['H304', 11]]
df1 = pd.DataFrame(df1, columns=['Code', 'Value'])

df2 = [['H302,H304'],
       ['H332,H319,H312,H320,H316,H315,H302,H304'],
       ['H315,H312,H316'],
       ['H320,H332,H316,H315,H304,H302,H312'],
       ['H315,H319,H312,H316,H332'],
       ['H312'],
       ['Not Applicable'],
       ['']]
df2 = pd.DataFrame(df2, columns=['Code'])

df3 = []
for i in range(len(df2)):
    df3.append(df2['Code'][i].split(","))

max_values = []
for i in range(len(df3)):
    for j in range(len(df3[i])):
        for index in range(len(df1)):
            if df1['Code'][index] == df3[i][j]:
                df3[i][j] = df1['Value'][index]
    max_values.append(max(df3[i]))
df2["Max Value"] = max_values
First, df2 seems to be defined wrongly (single quotes between the commas are required). Also, don't generate a data frame from it, since you need the flexibility to handle any number of elements.
Second, you would need to define the codes as the index in order to look up elements in the data frame. So you would define the data frame as:
df1 = pd.DataFrame(df1, columns=['Code', 'Value']).set_index('Code')
Third, you need to loop through the second list of lists and index the elements you want before calculating the maximum using .loc. Also, you need to filter out the codes that are not in the first data frame.
result = []
for codes in df2:
    c = [_ for _ in codes if _ in df1.index]
    result.append(df1.loc[c, 'Value'].max())
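A usage sketch under the assumptions above, with df1 indexed by Code and the second input kept as a plain list of code lists rather than a DataFrame:
# an illustrative subset of the question's data, as a plain list of lists
codes_lists = [
    ['H302', 'H304'],
    ['H315', 'H312', 'H316'],
    ['Not Applicable'],
    [''],
]

result = []
for codes in codes_lists:
    c = [code for code in codes if code in df1.index]  # drop unknown codes
    result.append(df1.loc[c, 'Value'].max())           # NaN when nothing matches

print(result)  # roughly [18, 17, nan, nan]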
Try:
df2.join(df2['Code'].str.split(',')
                    .explode()
                    .map(df1.set_index('Code')['Value'])
                    .groupby(level=0).max()
                    .rename('Value'))
Output:
                                       Code  Value
0                                 H302,H304   18.0
1  H332,H319,H312,H320,H316,H315,H302,H304    18.0
2                            H315,H312,H316   17.0
3        H320,H332,H316,H315,H304,H302,H312   18.0
4                  H315,H319,H312,H316,H332   17.0
5                                      H312   17.0
6                            Not Applicable    NaN
7                                              NaN

Filling NaN values

I have a dataset that contains NaN values. These values depend on another variable, and I am trying to clean the data using it. I wrote code to replace the NaN values, but it doesn't work. The code is:
df.loc[(df["house"]=="rented") & (df["car"]=="yes")]["debt"].fillna(2, inplace=True)
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
Conditional that returns a boolean Series with column labels specified
df.loc[df['shield'] > 6, ['max_speed']]
max_speed
sidewinder 7
Based on the documentation it should be converted to this:
df.loc['filter','selected column']
Give it a try like this (assigning the filled values back through .loc, so the change is applied to the original frame rather than a temporary copy):
df.loc[(df["house"]=="rented") & (df["car"]=="yes"), ["debt"]] = df.loc[(df["house"]=="rented") & (df["car"]=="yes"), ["debt"]].fillna(2)
Or switch from df.loc to a plain loop:
for val in df.index:
    if (df["house"][val] == "rented") and (df["car"][val] == "yes"):
        df["debt"][val] = 2
If I understand you correctly, you do not want to just fill in all the NaN values. Rather, you'd like to fill the NaN values only when the house is rented and you have a car. To fill all NaN values in the "debt" column,
df["debt"].fillna(2, inplace=True)
should be used rather than your second line of code.
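For reference, a minimal runnable sketch of the conditional fill with made-up data; assigning through a single .loc call makes sure the change lands in the original frame rather than in a temporary copy:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "house": ["rented", "rented", "owned"],
    "car":   ["yes", "yes", "yes"],
    "debt":  [np.nan, 1.0, np.nan],
})

mask = (df["house"] == "rented") & (df["car"] == "yes")
df.loc[mask, "debt"] = df.loc[mask, "debt"].fillna(2)
print(df)  # only the first row's missing debt becomes 2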

Update main dataframe based on sub dataframes coming from groupby

I am pretty new to pandas and trying to learn it. So, any advice would be appreciated :)
This is just a small part of my whole dataframe DF2:
   Chromosome_Name Sequence_Source Sequence_Feature   Start     End Strand            Gene_ID        Gene_Name
0                1  ensembl_havana             gene   14363   34806      -  "ENSG00000227232"         "WASH7P"
1                1          havana             gene   89295  138566      -  "ENSG00000238009"   "RP11-34P13.7"
2                1          havana             gene  141474  178862      -  "ENSG00000241860"  "RP11-34P13.13"
3                1          havana             gene  227615  272253      -  "ENSG00000228463"     "AP006222.2"
4                1  ensembl_havana             gene  312720  453948      +  "ENSG00000237094"  "RP4-669L17.10"
These are my conditions:
Condition 1: Reference row's "Start" value <= Other row's "End" value.
Condition 2: Reference row's "End" value >= Other row's "Start" value.
This is what I have done so far:
chromosome_list = ["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","X","Y"]
dataFrame = DF2.groupby(["Chromosome_Name"])
for chromosome in chromosome_list:
    CHR = dataFrame.get_group(chromosome)
    for i in range(0, len(CHR)-1):
        for j in range(i+1, len(CHR)):
            Overlap_index = DF2[(DF2.loc[i, ["Chromosome_Name"] == chromosome]) & (DF2.loc[i, ["Start"]] <= DF2.loc[j, ["End"]]) & (DF2.loc[i, ["End"]] >= DF2.loc[j, ["Start"]])].index
            DF2 = DF2.drop(Overlap_index)
The chromosome_list is all the unique values of column "Chromosome_Name".
Mainly, I want to check for each row whether the "Start" and "End" column values satisfy the conditions above. I believe I need to compare a single row (the reference row) against the other rows found in the data frame. However, to achieve this I need to take the value of the first column, "Chromosome_Name", into account.
More specifically, every row in DF2 should be checked according to the conditions stated above, but, for example, a row with Chromosome_Name = 5 shouldn't be checked against a row with Chromosome_Name = 12. Therefore, I first thought that I should split the dataframe using pd.groupby() according to Chromosome_Name and then, using these sub dataframes' indexes, manipulate (drop the given rows from) DF2. However, it did not work :)
P.S. After DF2 is split into sub dataframes (according to the unique Chromosome_Name values), each sub dataframe has a different size, e.g. there are 641 rows for Chromosome_Name = X but 19342 rows for Chromosome_Name = 1.
If you know how to correct my code or provide me another solution, I would be glad.
Thanks in advance.
I am new to pandas too, so I do not want to give you wrong insights and advice, but have you ever thought of converting the Start and End columns to lists? That way you can use plain if statements if you are not comfortable with pandas and your task is urgent. However, I am aware that converting the dataframe into lists is somewhat opposite to the point of using pandas.
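Building on that idea, here is a rough, untested sketch that keeps the per-chromosome grouping from the question and works on plain lists with if statements inside each group. It assumes DF2 and its column names are exactly as shown above.
# collect the indexes of rows that overlap an earlier row within the same
# chromosome, then drop them from DF2 in one call
drop_index = set()
for _, group in DF2.groupby("Chromosome_Name"):
    starts = group["Start"].tolist()
    ends = group["End"].tolist()
    idx = group.index.tolist()
    for i in range(len(group) - 1):
        for j in range(i + 1, len(group)):
            # condition 1 and condition 2 from the question
            if starts[i] <= ends[j] and ends[i] >= starts[j]:
                drop_index.add(idx[j])

DF2 = DF2.drop(index=list(drop_index))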

Pandas Dataframes - How do you maintain an index post a group by/aggregation operation?

This should be easy, but I'm having a surprisingly annoying time with it. The code below shows me doing a Pandas groupby operation so I can calculate variance by symbol. Unfortunately, the aggregation command seems to get rid of the integer index, so I am trying to create a new integer list, add it as a column to the table, and set it as the new index.
vardataframe = voldataframe.groupby('Symbol')
vardataframe = vardataframe.aggregate(np.var)
vardataframe['newindex']= np.arange(1,(len(vardataframe)+1))
vardataframe.set_index(['newindex'])
vardataframe = vardataframe.ix[:,['newindex','Symbol','volatility']]
However what comes out is the below vardataframe.head() result, which does not properly change the index of the table from Symbol back to numeric. And this hurts me in a line or two when I try to do a merge command.
        newindex  Symbol  volatility
Symbol
A              1     NaN    0.000249
AA             2     NaN    0.000413
AAIT           3     NaN    0.000237
AAL            4     NaN    0.001664
AAME           5     NaN    0.001283
As you see the problems with the above are now there are two Symbol columns and the index hasn't been set correctly. What I'd like to do is get rid of the second Symbol column and make newindex the new index. Anyone know what I'm doing wrong here? (Perhaps a misunderstanding of the ix command). Much appreciated!
You can use as_index=False to preserve the integer index. You only need one line to do what you need:
vardataframe = voldataframe.groupby('Symbol', as_index=False).var()
A couple of things in your code:
vardataframe.set_index(['newindex'])
will set newindex as the index, but returns a new dataframe which is not used. You can do vardataframe.set_index(['newindex'], inplace=True) if you want this.
vardataframe.ix[:,['newindex','Symbol','volatility']]
gives you a column Symbol of all NaN because Symbol is not a column of vardataframe; it only exists in its index. Querying a non-existent column with ix gives all NaN. As #user2600939 mentioned, you can do vardataframe.reset_index(inplace=True) (or vardataframe = vardataframe.reset_index()) to put Symbol back as a column.
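For completeness, here is a small runnable sketch of the as_index=False route with made-up data (only the Symbol and volatility column names are taken from the question):
import pandas as pd

voldataframe = pd.DataFrame({
    "Symbol": ["A", "A", "AA", "AA"],
    "volatility": [0.01, 0.02, 0.03, 0.05],
})

# as_index=False keeps Symbol as a regular column and leaves a plain integer index
vardataframe = voldataframe.groupby('Symbol', as_index=False).var()
print(vardataframe)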
Instead of making a new index manually, just reset it:
df = df.reset_index()
