Replacing empty cells in column with variable value - python

I'm trying to replace the empty cells of a column called 'City' with the most common value in that same column, using the Python library pandas.
(working with a CSV file here)
This is what I've tried; assume the file is already read and ready to be edited:
location = df['City'].mode()
basicdf = "df['City'].replace('',"+location+", inplace=True)"
basicdf
So the logic here was to use .mode(), which gives the most frequent value in the column, store that value in the variable 'location',
and then insert that variable into the second line of code.
(I don't know how to do any of this in the correct way at all.)
Building the command as a string seemed to be the only way to get whatever variable I desire into this .replace() call.
Edit: I have tried this code instead, but it ends up writing into other columns as well, not just 'City', which is not great.
df['City'].replace('',np.nan,inplace=True)
df = df.fillna(df['City'].value_counts().index[0])
Any tips would be appreciated, mainly on how to achieve what I'm trying to do (without needing to restart from scratch, because I have a lot of other code in the file using the pandas library) and
on how to insert variables into these pandas commands (if that's even possible).
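For reference, a variable can be passed straight into .replace() as an ordinary argument; there is no need to build the command as a string. A minimal sketch with made-up data (note that .mode() returns a Series, so you take its first element):

import pandas as pd

df = pd.DataFrame({'City': ['Paris', '', 'Paris', 'London']})  # toy data
location = df.loc[df['City'] != '', 'City'].mode()[0]  # most common non-empty value
df['City'] = df['City'].replace('', location)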

Found the answer, thanks mainly to Pygirl,
df['City'].replace('',np.nan,inplace=True)
df['City'].fillna(df['City'].value_counts().index[0], inplace=True)
These first replace the blank or empty cells with NaN, and then 'fill' the NaNs with the most common value in the selected column, in this case 'City'.
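For completeness, the same fix as a self-contained sketch with made-up data (value_counts().index[0] is simply the most frequent value, i.e. the mode):

import pandas as pd
import numpy as np

df = pd.DataFrame({'City': ['Paris', '', 'London', 'Paris', '']})  # toy data
df['City'].replace('', np.nan, inplace=True)
df['City'].fillna(df['City'].value_counts().index[0], inplace=True)
print(df['City'].tolist())  # ['Paris', 'Paris', 'London', 'Paris', 'Paris']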

Related

Is there a Python pandas function for retrieving a specific value of a dataframe based on its content?

I've got multiple Excel files and I need a specific value from each, but the cell holding that value changes position slightly from file to file. However, this value is always preceded by a generic description of it, which remains constant in all the files.
I was wondering if there is a way to ask Python to grab the value to the right of the element containing the string "xxx".
Try iterating over the Excel files (I guess you loaded each as a separate pandas object?),
something like for df in [dataframe1, dataframe2, ..., dataframeN].
Then you could pick the column you need (if the column stays constant), e.g. df['columnX'], and find which index the marker has:
df.index[df['columnX']=="xxx"]. It may make sense to add .tolist() at the end, so that if "xxx" is a value that repeats more than once, you get all occurrences in a list.
The last step would be to take the cell one column over at that index to get the value to the right.
Hope it was helpful.
In general I would highly suggest being more specific in your questions and providing code / examples.
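A rough sketch of that approach; the file paths, the column names, and the marker string "xxx" are all placeholders:

import pandas as pd

dataframes = [pd.read_excel(p) for p in ['file1.xlsx', 'file2.xlsx']]  # hypothetical files
for df in dataframes:
    col_pos = df.columns.get_loc('columnX')           # position of the description column
    hits = df.index[df['columnX'] == 'xxx'].tolist()  # all rows where the marker appears
    for row in hits:
        value = df.loc[row, df.columns[col_pos + 1]]  # cell immediately to the right
        print(value)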

How to avoid pandas throwing a SettingWithCopyWarning when transforming a subset of a series?

mask = ~df.bar.isna()
df.bar.loc[mask] = df.bar.loc[mask].map(f)
This sets off a SettingWithCopyWarning, even though I am using loc.
I am aware of df.mask, but that will not work either, as the column contains missing values that throw errors when the mapping function is applied to them.
You get a SettingWithCopyWarning because pandas cannot be sure whether you want to manipulate the dataframe via a reference or manipulate a copy of it. Try adding .copy() to the end of the second line and see if the warning goes away. Sometimes the cause of the warning is actually somewhere in the code a few lines before the place where you get it.
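For reference, the warning in the snippet above comes from the chained selection on the left-hand side (df.bar.loc[mask]); one common way to avoid it is a single .loc call on the frame with the row mask and the column label together. A minimal sketch with toy data and a placeholder f:

import pandas as pd
import numpy as np

df = pd.DataFrame({'bar': [1.0, np.nan, 3.0]})  # toy data
f = lambda v: v * 10                            # placeholder mapping function

mask = ~df['bar'].isna()
df.loc[mask, 'bar'] = df.loc[mask, 'bar'].map(f)  # single .loc on the frame: no warning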

Is it possible to skip blank lines in a DataFrame? If yes, how can I do this?

I am trying to run this code
num = df_out.drop_duplicates(subset=['Name', 'No.']).groupby(['Name']).size()
But when I do I get this error:
ValueError: not enough values to unpack (expected 2, got 0)
If we think of my dataframe (df_out) as an Excel file, I do have blank cells, but no full column or full row is blank. I need to skip the blank lines to run the code without changing the dataframe's structure.
Is this possible?
Thank you
Consider using df.dropna(). It is used to remove rows that contain NA values. See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html for more information.
First, though, you probably want your "blank cells" to be converted to NA values so that dropna() can drop them. This can be done in various ways, notably df.replace(r'\s+', np.nan, regex=True) (with numpy imported as np; the pandas.np alias used in the original answer has since been removed from pandas). If your "blank cells" are all empty strings, or fixed strings equal to some value s, you can directly use (first case) df.replace('', np.nan) or (second case) df.replace(s, np.nan).
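Putting the pieces together, a small sketch with toy data; the anchored pattern r'^\s*$' matches only cells that are empty or pure whitespace, which is a bit safer than r'\s+':

import pandas as pd
import numpy as np

df_out = pd.DataFrame({'Name': ['A', '', 'B', 'A'], 'No.': [1, None, 2, 1]})  # toy data
df_out = df_out.replace(r'^\s*$', np.nan, regex=True)  # blank cells -> NaN
df_out = df_out.dropna()                               # drop rows containing NaN
num = df_out.drop_duplicates(subset=['Name', 'No.']).groupby(['Name']).size()
print(num)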

Dataframe not showing in Pycharm

I am using PyCharm 2016.2.1. When I try to view a pandas DataFrame through the newly added 'View as DataFrame' feature in the debugger, this works as expected for a small (e.g. 4x4) DataFrame.
However when I try to view a DataFrame (generated by custom script) of ~10,000 rows x ~50 columns, I get the message: "Nothing to show".
When I run the same script (that generates the DataFrame) in Spyder, I am able to view it, so I am pretty sure it's not an error in my script.
Does anyone know if there is a maximum size to the DataFrames that can be viewed in PyCharm, and if there is a way to change this?
EDIT:
It seems that the maximum size allowed is 1000 x 15, as in some cases the view gets truncated to that size (when the number of rows is too large; when there are too many columns, PyCharm just says 'Nothing to show').
Still, I would like to know if there is a way to increase the maximum allowed rows and columns viewable through the DataFrame viewer.
I have faced the same problem with PyCharm 2018.2.2. The reason was a special character in a column's name, as mentioned by Yunzhao.
If you're having a column name like 'R&D', changing it to 'RnD' will fix the problem. It's really strange that JetBrains hasn't solved this problem for over 2 years.
As you said in your edit, there's a limit on the number of columns (on my PC, though, it's far fewer than 15). However, you can see the whole thing by typing:
df.values
It will show you the whole dataframe, but without the names of the columns.
Edit:
To show the column names as well (with numpy imported as np):
np.vstack([df.columns, df.values])
I have met the same problem.
I figured it was because of special characters in the column names (in my case).
In my case, I had "%" in a column name, and the 'View as DataFrame' function wouldn't show the data. After I removed it, everything was shown correctly.
Please double-check whether you also have special characters in your column names.
This might be helpful for some people experiencing a similar problem:
As of August 2019, SciView in PyCharm does struggle with displaying DataFrames that contain the nullable integer type; see the issue on the JetBrains tracker.
I use PyCharm 2019.1.1 (Community Edition). When I right-clicked and chose "View as DataFrame", I got the message "Nothing to show".
But when I clicked the "...View as DataFrame" button on the object's tail, it worked.
It turned out that my DataFrame object was a parameter of another object. Right-clicking "View as DataFrame" doesn't pass the class name along, so you need to type in the class's name and the parameter's name yourself.
Hope this can help somebody.
In case you don't strictly need the functionality of the DataFrame viewer, you can print the whole DataFrame to the output window using:
import pandas as pd

def print_full(x):
    # temporarily lift the row display limit, print, then restore it
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')
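It would then be called, for example, as:

print_full(df)  # prints every row of df to the console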
I use PyCharm 2019.1.1 (Community Edition) and I run Python 3.7.
When I first click on "View as DataFrame" there seems to be the same issue, but if I wait a few seconds the content pops up. For me it is a matter of loading.
For the sake of completeness: I faced the same problem because some elements in the index of the dataframe contained a question mark '?'. One should avoid that too if you still want to use the data viewer. The data viewer still worked when the index strings contained hashes or less-than/greater-than signs, though.
In my situation, the problem was caused by two identical column names in my dataframe.
Check it with: len(list(df)) == len(set(df))
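If that check comes back False, pandas can also list the offending names, for example:

df.columns[df.columns.duplicated()]  # shows every repeated column name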
As of 2020-10-02, using PyCharm 2020.1.4, I found that this issue also occurs if the DataFrame contains a column containing a tuple.
The same thing is happening to me in version 2021.3.3
In my case, it seems to have something to do with the column dtype being Int64 while the column is completely full of <NA> values. As soon as I change even a single row's value in the offending column to an actual integer, it renders again.
So, if feasible, you can fix it by dropping the column, or setting all (or at least one) of the values to some meaningful replacement value, like -1 or -99999 or something:
df["col"] = df["col"].fillna(value=-1)

Offset in reading columns of a text file with matplotlib

I have a text file containing an array of numbers from which I want to plot certain columns vs other columns. I defined a column function so I can assign a name to each column and then plot them, as in this sample code:
def column(matrix, i):
    return [float(row.split()[i]) for row in matrix]

Db = file('ResolutionEffects', 'r')  # Python 2's file(); in Python 3 this would be open()
HIcontour = column(Db, 1)
Db.seek(1)
However, when I display a column in my terminal to check that Python is indeed reading the right one, it appears that the first value of the column (as returned in my terminal) is actually the first value of the NEXT column in the text file. All the other numbers come from the correct column. There are no blank spaces or lines in the text file. As far as I can tell, this offset happens to every column after the first one.
If anyone can tell me why this is happening, or can suggest a more robust way to read columns from text files, I would greatly appreciate it.
Indeed, I found loadtxt to be a lot more robust. After converting my text file to a data file (.dat), I simply use this:
import numpy as np
import matplotlib.pyplot as plt

fig, ax1 = plt.subplots()
a = np.loadtxt('ResolutionEffects.dat', usecols=(0, 1, 11, 12))
ax1.plot(a[:, 0], a[:, 1], 'dk', label='HI')
ax1.plot(a[:, 2], a[:, 3], 'dr', label='CO')
No weird offsets or bugs anymore :) Thanks Ajean and jedwards!
