Can somebody tell me why in my for loop
df_all = pd.read_csv("assembly_summary.txt", delimiter='\t', index_col=0)

for row in df_all.index:
    if pd.isnull(df_all.infraspecific_name[row]) and pd.isnull(df_all.isolate[row]):
        df_all.infraspecific_name.fillna('NA', inplace=True)

print(df_all[['infraspecific_name', 'isolate']])
.fillna fills the specified cell even when the column referred to in the second part of the if statement is not null?
I am trying to use .fillna ONLY if both of the cells referred to in my if statement are null.
I also tried changing the second-to-last line to df_all.infraspecific_name[row].fillna('NA', inplace=True), which doesn't work either.
df_all.loc[row, ['infraspecific_name']].fillna('NA', inplace=True) corrects the problem, but then when both cells, infraspecific_name and isolate, ARE null, it doesn't fill the cell with 'NA'.
I am not sure if my lack of understanding is in Python loops or Pandas.
The .csv file I am using can be found at ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
Since you are indexing on your first column, you could use update:
df_all['infraspecific_name']

returns a Series of only the specified column. The following performs .fillna only on the selected rows, i.e. the elements where the condition is True:

df_all['infraspecific_name'][(df_all['infraspecific_name'].isnull()) & (df_all['isolate'].isnull())].fillna('NA')

You can achieve all your steps in one line by combining the above and wrapping it all in update:
df_all.update(df_all['infraspecific_name'][(df_all['infraspecific_name'].isnull()) & (df_all['isolate'].isnull())].fillna('NA'))
Number of rows changed:

len(df_all[df_all['infraspecific_name'] == 'NA'])
1825
The rest of the dataframe should be intact.
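A toy illustration of the same update pattern, with made-up data (the real file's values will differ):

import pandas as pd
import numpy as np

# Two columns mirroring the question; only the last row has both cells null
df_all = pd.DataFrame({'infraspecific_name': ['strain=K-12', np.nan, np.nan],
                       'isolate': [np.nan, 'JKD6159', np.nan]})

# Fill only where BOTH columns are null, then write the result back with update
df_all.update(df_all['infraspecific_name'][(df_all['infraspecific_name'].isnull())
                                           & (df_all['isolate'].isnull())].fillna('NA'))
print(df_all)  # only the last row's infraspecific_name becomes 'NA'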
This should get you what you want:
csvfile = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt'
df_all = pd.read_csv(csvfile, delimiter='\t', index_col=0)
mask = df_all[['infraspecific_name', 'isolate']].isnull().all(axis=1)
df_all.loc[mask, 'infraspecific_name'] = 'NA'
The 3rd line takes the two columns df_all[['infraspecific_name', 'isolate']], tests each value for nulls with .isnull(), and then .all(axis=1) checks whether every column in each row is True (i.e. both cells are null).
The 4th line uses that boolean mask to locate the values that need changing.
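The same mask on a toy frame, to make the True/False logic concrete (made-up data):

import pandas as pd
import numpy as np

df_all = pd.DataFrame({'infraspecific_name': ['strain=K-12', np.nan, np.nan],
                       'isolate': [np.nan, 'JKD6159', np.nan]})

mask = df_all[['infraspecific_name', 'isolate']].isnull().all(axis=1)
print(mask.tolist())  # [False, False, True] -- only the last row has both cells null
df_all.loc[mask, 'infraspecific_name'] = 'NA'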
Here I'm trying to make the cell in the days_employed column NaN for any row that has 'retiree' in a different column. As it is, my code makes the entire days_employed column NaN, whereas I only want that specific cell in that row to be NaN:
for row in df['income_type']:
    if row == 'retiree':
        df['days_employed'] = float('Nan')
Is there something similar to row that I could use on the left-hand side, i.e. in df['days_employed'] = float('Nan')?
Since this seems like pandas to me, you can vectorize this by applying a condition on the entire column, then assigning those cells to NaN:
df.loc[df['income_type'].eq('retiree'), 'days_employed'] = np.nan
Or, reassign the column using np.where:
df['days_employed'] = np.where(
    df['income_type'].eq('retiree'), np.nan, df['days_employed'])
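Both lines assume numpy is imported; a self-contained toy demo with made-up data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'income_type': ['retiree', 'employee', 'retiree'],
                   'days_employed': [100.0, 200.0, 300.0]})

df.loc[df['income_type'].eq('retiree'), 'days_employed'] = np.nan
print(df)  # rows 0 and 2 now hold NaN in days_employed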
When you write df[column] you are specifying the entire column, which is why all of days_employed is being changed to NaN. Try this and let me know if it works:
for x, row in df['income_type'].items():
    if row == 'retiree':
        df.loc[x, 'days_employed'] = float('nan')
Note: x is the index label of the specific cell in days_employed that should be changed to NaN. If you do not know that index in advance, or if it changes often, iterating with .items() (as above) hands you each index label together with its value, so you can set the cell as soon as you find it.
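A complete runnable version of the loop, with made-up data:

import pandas as pd

df = pd.DataFrame({'income_type': ['retiree', 'employee'],
                   'days_employed': [100.0, 200.0]})

for x, row in df['income_type'].items():
    if row == 'retiree':
        df.loc[x, 'days_employed'] = float('nan')

print(df)  # only row 0's days_employed is now NaN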
I have a dataframe with a column named rDREB% which contains missing values. I tried:
playersData['rDREB%'] = playersData['rDREB%'].fillna(0, inplace=True)
After executing the code, the whole column is empty when I check. Isn't the code supposed to replace only the null values with 0? I am confused.
P.S. I am also trying to replace other columns with missing values, i.e. ScoreVal, PlayVal, rORB%, OBPM, BPM...
Using inplace=True means fillna returns None, which is what you're assigning to your column. Either remove inplace, or don't assign the return value to the column:
playersData['rDREB%'] = playersData['rDREB%'].fillna(0)
or
playersData['rDREB%'].fillna(0, inplace=True)
The first approach is recommended. See this question for more info: In pandas, is inplace = True considered harmful, or not?
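Since the P.S. mentions several more columns with missing values, a minimal sketch that fills them all at once (column names taken from the question):

cols = ['rDREB%', 'ScoreVal', 'PlayVal', 'rORB%', 'OBPM', 'BPM']
playersData[cols] = playersData[cols].fillna(0)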
I'm stuck on a particular Python question here. I have two dataframes, DF1 and DF2. In both, I have two columns, pID and yID (which are not indexed, just default). I'm looking to add a column Found to DF1 indicating where the respective values of the columns (pID and yID) were found in DF2. Also, I would like to restrict the match to values in DF2 where aID == 'Text'.
I believe the below gets me the first part of this question; however, I'm unsure how to incorporate the where condition.
DF1['Found'] = (DF1[['pID', 'yID']] == DF2[['pID','yID']]).all(axis=1).astype(bool)
Suggestions or answers would be most appreciated. Thanks.
You could subset the second dataframe to the rows containing aID == 'Text' to get a reduced DF, from which you select the portions of columns to be compared against the first dataframe.
Use DF.isin() to check whether the values present under these column names match. .all(axis=1) returns True only when both columns are True for a row, and False otherwise. Convert the boolean Series to integers via astype(int) and assign the result to the new column, Found.
df1_sub = df1[['pID', 'yID']]
df2_sub = df2.query('aID=="Text"')[['pID', 'yID']]
df1['Found'] = df1_sub.isin(df2_sub).all(axis=1).astype(int)
df1
Demo DF's used:
df1 = pd.DataFrame(dict(pID=[1,2,3,4,5],
                        yID=[10,20,30,40,50]))

df2 = pd.DataFrame(dict(pID=[1,2,8,4,5],
                        yID=[10,12,30,40,50],
                        aID=['Text','Best','Text','Best','Text']))
If it does not matter where those matches occur, then merge the two dataframes on the common key columns 'pID' and 'yID', resetting df1_sub's index first so that its original row labels survive the merge and can be used afterwards.
Those surviving labels indicate where matches were found: assign the value 1 to a new column named Found at those locations, then fill its remaining missing elements with 0s throughout.
matched = pd.merge(df1_sub.reset_index(), df2_sub, on=['pID', 'yID'])
df1.loc[matched['index'], 'Found'] = 1
df1['Found'] = df1['Found'].fillna(0)
df1 should be modified accordingly after the above steps.
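For the demo frames above, both approaches should flag rows 0 and 4 (Found == 1), since (1, 10) and (5, 50) are the only (pID, yID) pairs that also appear in df2 where aID == 'Text'.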
Hello, I've been struggling with this problem. I'm trying to iterate over rows, select data from them, and then assign the values to variables. This is the first time I'm using pandas and I'm not sure how to select the data:
reader = pd.read_csv(file_path, sep="\t", lineterminator='\r', usecols=[0, 1, 2, 9, 10])
for row in reader:
    print(row)
    # id_number = row[0]
    # name = row[2]
    # ip_address = row[1]
    # latitude = row[9]
and this is the output from the row that I want to assign to the variables:
050000
129.240.228.138
planetlab2.simula.no
59.93
Edit: Perhaps this is not a problem for pandas but for general Python. I am fairly new to Python; what I'm trying to achieve is to parse a tab-separated file line by line, assign the data to variables, and print them in one loop.
this is the input file sample:
050263 128.2.211.113 planetlab-1.cmcl.cs.cmu.edu NA US Allegheny County Pittsburgh http://www.cs.cmu.edu/ Carnegie Mellon University 40.4446 -79.9427 unknown
050264 128.2.211.115 planetlab-3.cmcl.cs.cmu.edu NA US Allegheny County Pittsburgh http://www.cs.cmu.edu/ Carnegie Mellon University 40.4446 -79.9427 unknown
The general workflow you're describing is: you want to read in a csv, find a row in the file with a certain ID, and unpack all the values from that row into variables. This is simple to do with pandas.
It looks like the CSV file has at least 10 columns in it. Providing the usecols arg should filter out the columns that you're not interested in, and read_csv will ignore them when loading into the pandas DataFrame object (which you've called reader).
Steps to do what you want (a combined sketch follows the list):

1. Read the data file using pd.read_csv(). You've already done this, but I recommend calling this variable df instead of reader, as read_csv returns a DataFrame object, not a Reader object. You'll also find it convenient to use the names argument of read_csv to assign column names to the dataframe. It looks like you want names=['id', 'ip_address', 'name', 'latitude', 'longitude'] to get those as columns. (Assuming column 10 is longitude, which makes sense since columns 9 and 10 would be a lat/long pair.)

2. Query the dataframe for the row with the ID that you're interested in. There are a variety of ways to do this; one is the query syntax. It's hard to know why you want that specific row without more details, but you can look up more information about index lookups in pandas. Example: row = df.query("id == 50000")

3. Given a single row, extract the row values into variables. This is easy once you've assigned column names to your dataframe: you can treat the row like a dictionary of values, e.g. lat = row['latitude'], lon = row['longitude'].
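Putting the steps together, a minimal sketch (it assumes the column order shown in the sample lines, no header row in the file, and file_path pointing at your actual file):

import pandas as pd

# Read only the columns of interest and name them
# (assumed order from the sample: id, ip address, hostname, latitude, longitude)
df = pd.read_csv(file_path, sep='\t', lineterminator='\r',
                 usecols=[0, 1, 2, 9, 10],
                 names=['id', 'ip_address', 'name', 'latitude', 'longitude'])

# Pick out the row with the wanted id, then unpack its values into variables
row = df.query("id == 50000").iloc[0]
id_number, ip_address, name = row['id'], row['ip_address'], row['name']
latitude, longitude = row['latitude'], row['longitude']
print(id_number, ip_address, name, latitude, longitude)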
You can use iterrows():
df = pd.read_csv(file_path, sep=',')
for index, row in df.iterrows():
    value = row['col_name']
Or, if you want to access the cell by the column's position:

df = pd.read_csv(file_path, sep=',')
for index, row in df.iterrows():
    value = row.iloc[0]
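A tiny runnable demo of both access styles, with made-up data:

import pandas as pd

df = pd.DataFrame({'col_name': ['a', 'b'], 'other': [1, 2]})
for index, row in df.iterrows():
    # label-based and position-based access reach the same cell here
    print(index, row['col_name'], row.iloc[0])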
Are the values you need to add the same for each row, or does determining the addition require processing each value? If the addition is consistent, you can apply the sum as a simple pandas matrix operation on the dataset. If it requires row-by-row processing, the above solution is the correct one for sure.

If it is a table of variables that must be added row by row, you can dump them all into a column aligned with your dataset, do the addition by row using pandas, and simply print out the complete dataframe. Assume you have three columns to add, which you put into a new column e:
df['e'] = df.a + df.b + df.d
or, if it is a constant:
df['e'] = df.a + df.b + constant  # where constant is a number
Then drop the columns you don't need (e.g. df['a'] and df['b'] in the above).
Obviously, then, if you need to calculate based on unique values for each row, put the values into another column and sum as above.
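A minimal sketch of the sum-then-drop pattern described above, with made-up data and column names:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'd': [5, 6]})
df['e'] = df.a + df.b + df.d           # row-wise sum of the three columns
df = df.drop(columns=['a', 'b', 'd'])  # drop the inputs once summed
print(df)                              # only column e remains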
my question is very similar to here: Find unique values in a Pandas dataframe, irrespective of row or column location
I am very new to coding, so I apologize in advance for any cringing.
I have a .csv file which I open as a pandas dataframe, and I would like to be able to return the unique values across the entire dataframe, as well as all the unique strings.
I have tried:
for row in df:
    pd.unique(df.values.ravel())
This fails to iterate through rows.
The following code prints what I want:
for index, row in df.iterrows():
    if isinstance(row, object):
        print('%s\n%s' % (index, row))
However, trying to place these values into a previously defined set (myset = set()) fails with a NoneType error when I hit a blank column:
for index, row in df.iterrows():
    if isinstance(row, object):
        myset.update(print('%s\n%s' % (index, row)))
I get closest to what I want when I try the following:

for index, row in df.iterrows():
    if isinstance(row, object):
        myset.update('%s\n%s' % (index, row))
However, my set ends up as a collection of individual characters rather than the strings/floats/values that appear on my screen when I print above (set.update iterates its argument, and iterating a string yields its characters one by one).
Someone please help point out where I fail miserably at this task. Thanks!
I think the following should work for almost any dataframe. It will extract each value that is unique in the entire dataframe.
Post a comment if you encounter a problem; I'll try to solve it.
# Replace all Nones/NaNs with empty strings - so they won't bother us later
df = df.fillna('')

# Preparing a list
list_sets = []

# Iterate over all columns (much faster than iterating rows)
for col in df.columns:
    # List containing all the unique values of this column
    this_set = list(set(df[col].values))
    # Creating a combined list
    list_sets = list_sets + this_set

# Taking a set of the combined list
final_set = list(set(list_sets))

# For completeness, remove the empty string introduced by the fillna step
final_set.remove('')
Edit:

I think I know what happens: you must have some float columns, and fillna is failing on those, since the code I gave you replaces missing values with an empty string. Try one of these:

df = df.fillna(np.nan)

or

df = df.fillna(0)

For the first option, you'll need to import numpy first (import numpy as np). It must already be installed, since you have pandas.
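For reference, the one-liner already tried in the question can do the whole job without a loop; a short sketch (assuming df is already loaded):

import pandas as pd

# Flatten every cell into one array, take the unique values, then drop missing ones
unique_vals = [v for v in pd.unique(df.values.ravel()) if pd.notna(v)]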