I have a column in my dataframe that I would like to convert to datatype int. However it is throwing an error because some of the rows have letters in their entries. I would like to create a new dataframe that only has entries in this column with pure numeric entries (or at least no letters).
So my question is: Is there a way to do something like the following,
df = df[df['addzip'].str.contains("a") == False]
But with a list where the "a" is? See the example below,
df = df[df['addzip'].str.contains(list(str(string.ascii_lowercase) + str(string.ascii_uppercase))) == False]
I know that this is very possible to do with an apply command, but I would like to keep this as vectorized as possible, so that is not what I am looking for. So far I haven't found any solutions anywhere else on Stack Overflow.
Just use a regular expression:
df = df[~df['addzip'].str.contains("[a-zA-Z]").fillna(False)]
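A minimal sketch of that filter on hypothetical data (the column values here are made up for illustration). Note that `fillna(False)` means NaN entries are treated as letter-free and therefore kept:

```python
import pandas as pd

# Hypothetical sample data: zip-like entries, some containing letters
df = pd.DataFrame({"addzip": ["12345", "98x21", "00501", None, "ABCDE"]})

# Flag entries containing any letter; fillna(False) keeps NaN rows,
# since NaN is not a letter-containing string
mask = ~df["addzip"].str.contains("[a-zA-Z]").fillna(False)
clean = df[mask]  # "98x21" and "ABCDE" are dropped
```

From here, `pd.to_numeric(clean["addzip"], errors="coerce")` is one way to finish the conversion while turning any remaining non-numeric leftovers into NaN.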
I have not seen such a question before, so if you happen to know the answer or have seen the same question, please let me know.
I have a dataframe in pandas with 4 columns and 5k rows; one of the columns is "price", and I need to do some manipulations with it. But the data was parsed from a web page and is not clean, so I cannot convert this column to integer type even after getting rid of the dollar sign and commas. I found out that it also contains data in the format 3500/mo, so I need to filter cells with /mo and decide whether I can drop them, based on how many of those I have and what the prices are.
I have managed to count those cells using
df["price"].str.contains("/").sum()
But when I want to see those cells, I cannot: when I create another variable to extract the slash-containing cells and use "contains" or similar, I get a Series of True/False values, showing whether each cell does or does not contain the slash, while I actually need to see the cells themselves. Any ideas?
You need to use the boolean mask returned by df["price"].str.contains("/") as index to get the respective rows, i.e., df[df["price"].str.contains("/")] (cf. the pandas docs on indexing).
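A short sketch of the difference, on made-up price data:

```python
import pandas as pd

# Hypothetical price data mixing plain prices and "/mo" entries
df = pd.DataFrame({"price": ["$1,200", "3500/mo", "$980", "2,750/mo"]})

mask = df["price"].str.contains("/")  # boolean Series: one flag per row
slash_rows = df[mask]                 # the actual rows, not True/False
```

`mask.sum()` gives the count (as in the question), while `df[mask]` uses the same mask as an index to show the rows themselves.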
I originally have some time series data, which looks like this and have to do the following:
First import it as dataframe
Set date column as datetime index
Add some indicators such as moving average etc, as new columns
Do some rounding (values of the whole column)
Shift a column one row up or down (just to manipulate the data)
Then convert the df to a list (because I need to loop over it based on some conditions; looping over a list is a lot faster than looping over a df, and I need speed)
But now I want to convert df to dict instead of list because I want to keep the column names, it's more convenient
But I found out that converting to a dict takes a lot longer than converting to a list, even if I do it manually instead of using the Python built-in method.
My question is: is there a better way to do it? Maybe not import as a dataframe in the first place, and still be able to do points 2 to 5? At the end I need a dict which allows me to do the loop and keeps the column names as keys. Thanks.
P.S. The dict should look something like this; the format is similar to the df, where each row is basically the date with the corresponding data.
On item #7: If you want to convert to a dictionary, you can use df.to_dict()
On item #6: You don't need to convert the df to a list or loop over it: Here are better options. Look for the second answer (it says DON'T)
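A sketch of both points on a small stand-in dataframe (column names `close` and `ma` are invented for illustration): `to_dict(orient="index")` keys the dict by date with column names as inner keys, and `itertuples` lets you loop fast without converting at all.

```python
import pandas as pd

# A small stand-in for the time-series dataframe described above
df = pd.DataFrame(
    {"close": [10.0, 11.0], "ma": [9.5, 10.5]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02"]),
)

# orient="index": {date -> {column -> value}}, so column names survive
d = df.to_dict(orient="index")

# itertuples is usually much faster than building a list or dict first,
# and it still exposes the column names as attributes
rows = [(t.Index, t.close, t.ma) for t in df.itertuples()]
```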
I have loaded a multiindex matrix from Excel into a pandas dataframe.
df = pd.read_excel("filename.xlsx", sheet_name=sheet, header=1, index_col=[0,1,2], usecols="B:AZ")
The dataframe has three index columns and one header row, so it has 4 indices. Part of the dataframe looks like this:
When I want to show a particular value, it works like this:
df[index1][index2][index3][index4]
I now want to change certain values in the dataframe. After searching through different forums for a while, it seemed like df.at would be the right method to do this. So I tried this:
df.at[index1,index2,index3,index4]=10 (or any random value)
but I got a ValueError: Not enough indexers for scalar access (setting)!. I also tried different orders of the indices inside the df.at brackets.
Any help on this would be much appreciated!
It seems you need something like this:
df.loc[('Kosten','LKW','Sofia'),'Ruse']=10
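The key point is to pass the three row labels as one tuple rather than as separate indexers. A sketch on a made-up MultiIndex shaped like the one in the question (`df.at` accepts the same tuple form for scalar access):

```python
import pandas as pd

# Hypothetical 3-level row MultiIndex resembling the one in the question
idx = pd.MultiIndex.from_tuples(
    [("Kosten", "LKW", "Sofia"), ("Kosten", "LKW", "Varna")]
)
df = pd.DataFrame({"Ruse": [1.0, 2.0], "Plovdiv": [3.0, 4.0]}, index=idx)

# Pass the row key as ONE tuple, then the column label
df.loc[("Kosten", "LKW", "Sofia"), "Ruse"] = 10
df.at[("Kosten", "LKW", "Varna"), "Ruse"] = 20  # df.at works the same way
```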
I am quite new in Python coding, and I am dealing with a big dataframe for my internship.
I had an issue because sometimes there are wrong values in my dataframe. For example, I find string type values ("broken leaf") instead of integer type values such as ("120 cm") or (NaN).
I know there is the df.replace() function, but to use it you need to know that there are wrong values. So how do I find whether there are any wrong values in my dataframe?
Thank you in advance
"120 cm" is a string, not an integer, so that's a confusing example. Some ways to find "unexpected" values include:
Use "describe" to examine the range of numerical values, to see if there are any far outside of your expected range.
Use "unique" to see the set of all values for cases where you expect a small number of permitted values, like a gender field.
Look at the datatypes of columns to see whether there are strings creeping into fields that are supposed to be numerical.
Use regexps if valid values for a particular column follow a predictable pattern.
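The regexp approach can be sketched like this, on a made-up column (the name `height` and the "<number> cm" pattern are assumptions for illustration): flag everything that fails the expected pattern and inspect it.

```python
import pandas as pd

# Hypothetical column that should hold "<number> cm" strings
df = pd.DataFrame({"height": ["120 cm", "135 cm", "broken leaf", None]})

# str.match anchors at the start; na=False counts NaN as non-matching,
# so missing values also show up in the "bad" set
valid = df["height"].str.match(r"\d+ cm$", na=False)
bad = df[~valid]  # rows to inspect or clean up
```

`df.describe()`, `df["height"].unique()`, and `df.dtypes` cover the other three bullets with one-liners.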
My pandas data has columns that were read as objects. I want to change these into floats. Following the post linked below (1), I tried:
pdos[cols] = pdos[cols].astype(float)
But pandas gives me an error saying that an object can't be recast as float.
ValueError: invalid literal for float(): 17_d
But when I search for 17_d in my data set, it tells me it's not there.
>>> '17_d' in pdos
False
I can look at the raw data to see what's happening outside of Python, but I feel that if I'm going to take Python seriously, I should know how to deal with this sort of issue. Why doesn't this search work? How could I do a search over object columns for strings in pandas? Any advice?
Pandas: change data type of columns
Of course it says it's not there, because you're only looking in the column list!
'17_d' in pdos
checks whether '17_d' is in pdos.columns.
So what you want to do is pdos[cols] == '17_d', which will give you a truth table. If you want to find which row it is in, you can do (pdos[cols] == '17_d').any(axis=1).
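A small sketch of the difference, with made-up values (`17_d` planted in a hypothetical column `a`):

```python
import pandas as pd

# Hypothetical frame where a stray string hides among numeric-looking values
pdos = pd.DataFrame({"a": ["1.5", "17_d"], "b": ["2.0", "3.0"]})
cols = ["a", "b"]

# `in` on a DataFrame only checks column labels, not cell values
labels_only = "17_d" in pdos     # False: no column is named '17_d'

mask = pdos[cols] == "17_d"      # element-wise truth table
row_hits = mask.any(axis=1)      # which rows contain the value
```

`pdos[row_hits]` then shows the offending rows so the stray value can be fixed before `astype(float)`.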