Filling nan values - python

I have a dataset that contains NaN values. These values depend on another variable, and I am trying to clean the data using it. I wrote code to replace the NaN values, but it doesn't work. The code is:
df.loc[(df["house"]=="rented") & (df["car"]=="yes")]["debt"].fillna(2, inplace=True)

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
Conditional that returns a boolean Series with column labels specified
df.loc[df['shield'] > 6, ['max_speed']]
max_speed
sidewinder 7
Based on the documentation it should be converted to this:
df.loc['filter','selected column']
Give it a try like this:
df.loc[(df["house"]=="rented") & (df["car"]=="yes"), ["debt"]].fillna(2, inplace=True)

Switch the df.loc line to a loop:
for val in df.index:
    if (df["house"][val] == "rented") and (df["car"][val] == "yes"):
        df["debt"][val] = 2
If I understand you correctly, you do not want to just fill in the NaN values. Rather, you'd like to fill the NaN values only when the house is rented and you have a car. To fill all NaN values in the "debt" column,
df["debt"].fillna(2, inplace=True)
should be used rather than your second line of code.


Pandas - Lookup value for each item in list

I am relatively new to Python and Pandas. I have two dataframes: one contains a column of codes separated by commas - the number of codes in each list can vary, and a list can contain a string such as 'Not Applicable' or a blank. The other is a lookup table of codes and values. I want to look up the value of each individual code in each list and calculate the maximum value within that list. For example, ['H302','H304'] would be [18,11] and the maximum of those two would be 18. I then want to return the maximum value of each list as a new column in df2. If a list contains anything else, return blank.
This process was originally written in VBA, I solved the problem there by splitting each set of codes by delimiter to a new column, then dynamically running index/matches against each code to return the value. Then it would calculate the maximum value and delete out all the generated columns. I thought at the time it was a messy way to do it and I don't want to replicate this in the Python version.
I would post what I've tried, but I can't figure out how I'd go about this - any help is appreciated!
import pandas as pd
df1 = [['H302',18],
['H312',17],
['H315',16],
['H316',15],
['H319',14],
['H320',13],
['H332',12],
['H304',11]]
df1 = pd.DataFrame(df1, columns=['Code', 'Value'])
df2 = [['H302,H304'],
['H332,H319,H312,H320,H316,H315,H302,H304'],
['H315,H312,H316'],
['H320,H332,H316,H315,H304,H302,H312'],
['H315,H319,H312,H316,H332'],
['H312'],
['Not Applicable'],
['']]
df2 = pd.DataFrame(df2, columns=['Code'])
df3 = []
for i in range(len(df2)):
    df3.append(df2['Code'][i].split(","))
max_values = []
for i in range(len(df3)):
    for j in range(len(df3[i])):
        for index in range(len(df1)):
            if df1['Code'][index] == df3[i][j]:
                df3[i][j] = df1['Value'][index]
    max_values.append(max(df3[i]))
df2["Max Value"] = max_values
First, df2 seems to be defined wrongly (single quotes between commas are required). Also, don't generate a data frame from it, since you need the flexibility to hold any number of elements.
Second, you would need to define the codes as the index to look for elements in the data frame. So, you would define the data frame as:
df1 = pd.DataFrame(df1, columns=['Code', 'Value']).set_index('Code')
Third, you need to loop through the second list of lists and index the elements you want before calculating the maximum using .loc. Also, you need to filter out the codes that are not in the first data frame.
result = []
for codes in df2:
    c = [_ for _ in codes if _ in df1.index]
    result.append(df1.loc[c, 'Value'].max())
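For completeness, a self-contained sketch of that approach using the question's df1 and an abbreviated df2 (the splitting and the empty-list guard are my own additions, not part of the original attempt):
import pandas as pd

df1 = pd.DataFrame(
    [['H302', 18], ['H312', 17], ['H315', 16], ['H316', 15],
     ['H319', 14], ['H320', 13], ['H332', 12], ['H304', 11]],
    columns=['Code', 'Value']).set_index('Code')
df2 = pd.DataFrame([['H302,H304'], ['Not Applicable'], ['']], columns=['Code'])

result = []
for codes in df2['Code'].str.split(','):
    known = [c for c in codes if c in df1.index]   # drops 'Not Applicable', '', etc.
    result.append(df1.loc[known, 'Value'].max() if known else None)
df2['Max Value'] = result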
Try:
df2.join(df2['Code'].str.split(',')
         .explode()
         .map(df1.set_index('Code')['Value'])
         .groupby(level=0).max()
         .rename('Value'))
Output:
Code Value
0 H302,H304 18.0
1 H332,H319,H312,H320,H316,H315,H302,H304 18.0
2 H315,H312,H316 17.0
3 H320,H332,H316,H315,H304,H302,H312 18.0
4 H315,H319,H312,H316,H332 17.0
5 H312 17.0
6 Not Applicable NaN
7 NaN

Skip Empty cell Python Pandas

I am writing a script to count the percentage of cells that have a specific value. However, when it counts the rows, it does not exclude the cells that are NaN. Basically, I do not want the script to count a cell with the value NaN as a row. I have tried everything from != ""
to .isnan
What I'm trying to do is calculate the percentage of cells that have a specific value, which is not possible if the function counts the rows with NaN values.
RELEVANT CODE
df2 = pd.DataFrame(supplier_data_df, columns=['supplier keywords', 'supplier in ocr'])
total_suppliers = df2[(df2["supplier in ocr"] != "") & (df2["supplier keywords"] != "")]
percentilesupplierkeyword = len(supplier_filtered_df)/len(total_suppliers) * 100
print(percentilesupplierkeyword,"% of supplier-keywords have an issue")
Thank you in advance.
I hope you're doing good.
You can either drop the NaN values or exclude them from your dataframe, and then perform your computations.
If you want to drop the NaN values:
df2.dropna(inplace=True)
Or you could use the fillna method to fill the NaN values with 0:
df2.fillna(0, inplace=True)
If you want to get the list of indexes of the NaN values:
df2[df2["col1"].isna()].index.tolist()

Pandas concat flips all my values in the DataFrame

I have a dataframe called 'running_tally'
list jan_to jan_from
0 LA True False
1 NY False True
I am trying to append new data to it in the form of a single column dataframe called 'new_data'
list
0 HOU
1 LA
I concat these two dfs based on their 'list' column for further processing, but immediately after I do that all the boolean values unexpectedly flip.
running_tally = pd.concat([running_tally,new_data]).groupby('list',as_index=False).first()
the above statement will produce:
list jan_to jan_from
0 LA False True
1 NY True False
2 HOU NaN NaN
NaN values are expected for the new row, but I don't know why the bools all flip. What could be the reason for this? The code logically makes sense to me so I'm not sure where I'm going wrong. Thanks
EDIT: I made an edit to 'new_data' to include a repeat with LA. The final output should not have repeats, which my code currently handles correctly; it just has the boolean flipping.
EDIT 2: Turns out that when concatenating, the columns flip in order, which led me to believe the bools flipped. Still an open issue, however.
I am not sure why you want to use a groupby in this case... when using concat there is no need to specify which columns you want to use, as long as their names are identical.
Simple concatenation like this should do:
running_tally = pd.concat([running_tally,new_data], ignore_index=True, sort=False)
EDIT to take question edit into account: this should do the same job, without duplicates.
running_tally = running_tally.merge(new_data, on="list", how="outer")
I don't get the booleans flipped like you do, but you can try this too:
running_tally=running_tally.append(new_data,ignore_index=True)
print(running_tally)
Output:
list jan_to jan_from
0 LA True False
1 NY False True
2 HOU NaN NaN
EDIT: Since the question was edited, you could try with:
running_tally=running_tally.append(new_data,ignore_index=True).groupby('list',as_index=False).first()
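As a side note, DataFrame.append was deprecated and removed in pandas 2.0, so on newer versions roughly the same thing can be written with concat:
running_tally = (pd.concat([running_tally, new_data], ignore_index=True)
                 .groupby('list', as_index=False)
                 .first())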
The actual column order was being flipped when using concat in pandas 0.20.1:
How to concat pandas Dataframes without changing the column order in Pandas 0.20.1?

Python, regular expressions - search dots in pandas data frame

I have a pandas DataFrame with a column 'Country'; its head is below:
0 tmp
1 Environmental Indicators: Energy
2 tmp
3 Energy Supply and Renewable Electricity Produc...
4 NaN
5 NaN
6 NaN
7 Choose a country from the following drop-down ...
8 NaN
9 Country
When I use this line:
energy['Country'] = energy['Country'].str.replace(r'[...]', 'a')
There is no change.
But when I use this line instead:
energy['Country'] = energy['Country'].str.replace(r'[...]', np.nan)
All values are NaN.
Why does only the second line change the output? My goal is to change values containing a triple dot only.
Is this what you want when you say "I need change whole values, not just the triple dots"?
mask = df.Country.str.contains(r'\.\.\.', na=False)
df.Country[mask] = 'a'
.replace(r'[...]', 'a') treats the first parameter as a regular expression, but you want to treat it literally. So, you need .replace(r'\.\.\.', 'a').
As for your actual question: .str.replace requires a string as the second parameter. It attempts to convert np.nan to a string (which is not possible) and fails. For reasons not known to me, instead of raising a TypeError, it returns np.nan for each row.
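A small sketch of both routes, assuming the energy frame from the question: replacing the literal dots, or blanking the matching rows with a mask (str.replace accepts regex=False to treat the pattern literally):
import numpy as np

# replace the literal three dots
energy['Country'] = energy['Country'].str.replace('...', 'a', regex=False)

# or set the matching rows to NaN via a boolean mask and .loc
mask = energy['Country'].str.contains(r'\.\.\.', na=False)
energy.loc[mask, 'Country'] = np.nan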

Pandas Dataframes - How do you maintain an index post a group by/aggregation operation?

This should be easy but I'm having a surprisingly annoying time at it. The code below shows me doing a Pandas groupby operation so I can calculate variance by symbol. Unfortunately what happens is that the aggregation command seems to get rid of the integer index, so I am trying to create a new integer list and add this as a column to the table and set as a new index.
vardataframe = voldataframe.groupby('Symbol')
vardataframe = vardataframe.aggregate(np.var)
vardataframe['newindex']= np.arange(1,(len(vardataframe)+1))
vardataframe.set_index(['newindex'])
vardataframe = vardataframe.ix[:,['newindex','Symbol','volatility']]
However what comes out is the below vardataframe.head() result, which does not properly change the index of the table from Symbol back to numeric. And this hurts me in a line or two when I try to do a merge command.
newindex Symbol volatility
Symbol
A 1 NaN 0.000249
AA 2 NaN 0.000413
AAIT 3 NaN 0.000237
AAL 4 NaN 0.001664
AAME 5 NaN 0.001283
As you see the problems with the above are now there are two Symbol columns and the index hasn't been set correctly. What I'd like to do is get rid of the second Symbol column and make newindex the new index. Anyone know what I'm doing wrong here? (Perhaps a misunderstanding of the ix command). Much appreciated!
You can use as_index=False to preserve integer index. You need only one line to do what you need:
vardataframe = voldataframe.groupby('Symbol', as_index=False).var()
A couple of things in your code:
vardataframe.set_index(['newindex'])
will set newindex as the index, but returns a new dataframe which is not used. You can do vardataframe.set_index(['newindex'], inplace=True) if you want this.
vardataframe.ix[:,['newindex','Symbol','volatility']]
gives you a column Symbol of all NaN because Symbol is not a column of vardataframe; it only exists in its index. Querying a non-existent column with ix gives all NaN. As #user2600939 mentioned, you can do vardataframe.reset_index(inplace=True) (or vardataframe = vardataframe.reset_index()) to put Symbol back as a column.
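A short sketch of the corrected flow, keeping the question's own variable and column names (voldataframe, Symbol, volatility are assumed from the question):
import numpy as np

vardataframe = voldataframe.groupby('Symbol').aggregate(np.var)
vardataframe = vardataframe.reset_index()              # 'Symbol' becomes a regular column again
vardataframe = vardataframe[['Symbol', 'volatility']]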
Instead of making a new index manually, just reset it:
df = df.reset_index()
