Showing cells with a particular symbol in pandas dataframe - python

I have not seen such a question, so if you happen to know the answer or have seen the same question, please let me know.
I have a dataframe in pandas with 4 columns and 5k rows. One of the columns is "price" and I need to do some manipulations with it, but the data was parsed from a web page and is not clean, so I cannot convert this column to an integer type even after getting rid of the dollar signs and commas. I found out that it also contains data in the format 3500/mo, so I need to filter the cells with /mo and decide whether I can drop them, based on how many of them there are and what the prices are.
Now, I have managed to count those cells using
df["price"].str.contains("/").sum()
But when I want to see those cells, I cannot, because when I create another variable to extract the slash-containing cells and use "contains" or something similar, I get a Series of True/False values showing whether each cell contains the slash, while I actually need to see the cells themselves. Any ideas?

You need to use the boolean mask returned by df["price"].str.contains("/") as an index to get the respective rows, i.e., df[df["price"].str.contains("/")] (cf. the pandas docs on indexing).
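A minimal sketch of that pattern, with made-up price strings standing in for the scraped data:
import pandas as pd

# hypothetical price strings mimicking the scraped column
df = pd.DataFrame({"price": ["$3,500", "3500/mo", "$1,200", "2,800/mo"]})

mask = df["price"].str.contains("/")   # boolean Series: True where the cell has a slash
print(mask.sum())                      # how many "/mo" cells there are
print(df[mask])                        # the offending rows themselves, not just True/False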

Related

Why is `df.columns` an empty list while I can see the column names if I print out the dataframe? (Python Pandas)

import pandas as pd
DATA = pd.read_csv(url)
DATA.head()
I have a large dataset that has dozens of columns. After loading it as above into Colab, I can see the name of each column. But running DATA.columns just returns Index([], dtype='object'). What's happening here?
Now I find it impossible to pick out a few columns without column names. One way is to specify names = [...] when I load it, but I'm reluctant to do that since there are too many columns. So I'm looking for a way to index columns by integers, like in R, where df[:,[1,2,3]] would simply give me the first three columns of a dataframe. Somehow pandas seems to focus on column names and makes integer indexing very inconvenient, though.
So what I'm asking is (1) What did I do wrong? Can I obtain those column names as well when I load the dataframe? (2) If not, how can I pick out the [0, 1, 10]th column by a list of integers?
It seems that the problem is in the loading, as DATA.shape returns (10000, 0). I reran the loading code a few times and, all of a sudden, things went back to normal. Maybe Colab was taking a nap or something?
You can do that with df.iloc[:, [0, 1, 2]] (iloc selects by integer position), but I would suggest you use the names, because if the columns ever change order or you insert new columns, the code can break.
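If you do want positional selection, here is a small sketch of the difference between loc and iloc (the column names below are made up):
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

first_three = df.iloc[:, [0, 1, 2]]   # iloc selects by integer position; df.iloc[:, :3] also works
# df.loc[:, [1, 2, 3]] would select by *label* and raises a KeyError here, since the labels are "a".."d"
print(first_three)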

Pandas: Dealing with missing column in input dataframe

I have Python code which performs mathematical calculations on multiple columns of a dataframe. The input comes from various sources, so there is a possibility that sometimes a column is missing.
The column is missing because it's insignificant, but I need at least a null column for the code to run without errors.
I can add a null column with an if check, but there are around 120 columns and I do not want to slow down the code. Is there any other way for the code to check whether each column is present in the original dataframe, add a null column for any that is not present, and then start executing the actual code?
If you know that the column name is the same for every dataframe, you could do something like this without having to loop over all the column names:
if col_name not in df.columns:
    df[col_name] = ''  # or whatever value you want to set it to
If speed is a big concern, which I can't tell, you could always convert the columns to a set with set(df.columns) and reduce each membership check to O(1) time, because it will be a hashed lookup. You can read more detail on the efficiency of the in operator at this link: How efficient is Python's 'in' or 'not in' operators?
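A minimal sketch of that check, assuming the expected column names are kept in a list (required_cols and the sample frame below are made up):
import pandas as pd
import numpy as np

required_cols = ["price", "qty", "discount"]                # hypothetical expected columns
df = pd.DataFrame({"price": [10.0, 12.5], "qty": [3, 4]})   # "discount" is missing here

# add every missing column as null (NaN) so the downstream maths still runs
for col in set(required_cols) - set(df.columns):
    df[col] = np.nan

# equivalently, reindex adds the missing columns in one call and keeps the existing data
# df = df.reindex(columns=df.columns.union(required_cols))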

Iteration & Computation Pandas Dataframe

As a very new beginner with Python & Pandas, I am looking for your support regarding an issue.
I need to iterate over columns, find the maximum value across the relevant columns of a dataframe, and write it to a new column for each row. The number of columns is hardly manageable, almost 200, therefore I do not want to write each required column id manually. Most importantly, I need to start from a given column id and continue in increments of two columns until a given last column id.
I will appreciate sample code; see the attachment too.
Try:
df['x'] = df.max(axis=1)
Replace x with the name for your desired output column.
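To restrict the maximum to every second column between a given first and last column, a sketch along these lines should work (the start/stop positions and column names are placeholders):
import pandas as pd

df = pd.DataFrame({f"c{i}": range(i, i + 3) for i in range(8)})   # dummy 8-column frame
start, stop = 2, 7                      # hypothetical first and last column positions
subset = df.iloc[:, start:stop:2]       # every second column in that range
df["row_max"] = subset.max(axis=1)      # row-wise maximum written to a new column
print(df)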

How does the isnull() method work to return all rows that are missing in my data frame?

I'm new to Python and just trying to figure out how this small bit of code works. Hoping this'll be easy to explain without an example data frame.
My data frame, called df_train, contains a column called Age. This column is NaN for 177 records.
I submit the following code...
df_train[df_train['Age'].isnull()]
... and it returns all the records where Age is missing.
Now if I submit df_train['Age'].isnull(), all I get is a boolean list of values. How does the dataframe object then convert this boolean list into the rows we actually want?
I don't understand how passing the boolean list to the data frame again results in just the 177 records that we need - could someone please ELI5 for a newbie?
You will have to create subsets of the dataframe you want to use. Suppose you want to use only those rows where df_train['Age'] is not null. In that case, you have to select
df_train_to_use = df_train[df_train['Age'].notnull()]
Now, you may cross-check whether any other column you want to use has nulls, like
df_train['Column_name'].isnull().any()
If this returns True, you may go ahead and replace the nulls with default values, the average, zeros, or whatever method you prefer, as is usually done for machine learning programs.
Example
df_train = df_train.dropna(subset=['Column_name'])              # drop the rows with nulls instead
df_train['Column_name'] = df_train['Column_name'].fillna('')    # for strings
df_train['Column_name'] = df_train['Column_name'].fillna(0)     # for ints
df_train['Column_name'] = df_train['Column_name'].fillna(0.0)   # for floats
Etc.
I hope this helps explain it.
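To see the mechanism the question actually asks about, here is a tiny sketch (with a made-up Age column) of what the boolean Series does when passed back to the frame:
import pandas as pd
import numpy as np

df_train = pd.DataFrame({"Age": [22.0, np.nan, 35.0, np.nan]})

mask = df_train["Age"].isnull()   # boolean Series aligned to df_train's index
print(mask)                       # True on the rows where Age is NaN
print(df_train[mask])             # passing the mask back keeps only the True rows
print(df_train[~mask])            # the complement: rows that do have an Age value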

Filtering pandas rows based on first row

I have this table:
And I just want all the rows that match the first row, so I wrote
df[(df['product']==df.loc[[0],['product']])&(df['program_code']==df.loc[[0],['program_code']])]
etc. for all the other columns that aren't sum.
Which should return the first ~30 rows
Instead I get Must pass DataFrame with boolean values only
If I check to see whether they're boolean, which you would think they are, as I'm comparing the values to themselves and they're homogeneous, I get:
It's like my dataframe somehow shifted and I get two NaNs. I'm sure there's a feature whereby this shifting is important, but I don't even know where I did it.
And even if you solve that, I get:
But if I HAND type it, I get
Success!
So maybe the item isn't right.
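For what it's worth, the error message points at the likely cause: df.loc[[0], ['product']] returns a one-cell DataFrame rather than a scalar, so the comparison yields a DataFrame mask instead of a boolean Series. A sketch of the difference, with made-up data and the column names from the question:
import pandas as pd

df = pd.DataFrame({"product": ["A", "A", "B"], "program_code": ["X", "X", "Y"]})

# df.loc[[0], ['product']] is a 1x1 DataFrame, so comparing a column against it
# produces a label-aligned DataFrame result -- hence "Must pass DataFrame with
# boolean values only" when that result is used to index df.

# df.loc[0, 'product'] is the scalar "A", so the comparison gives a boolean Series
mask = (df["product"] == df.loc[0, "product"]) & (df["program_code"] == df.loc[0, "program_code"])
print(df[mask])   # the rows whose product and program_code match the first row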
