add array to pandas data frame - python

I have an issue here and would like to ask for support.
Suppose you have the following frame:
frame = pd.DataFrame({"Arbitrary Number": [1, 2, 3, 4]})
I want to add an additional column whose entries are np.arrays. I add the entry the following way:
frame["new col"] = '[8,8,8,8]'
However, at a later stage I need the entries as arrays. If I apply
frame["new col"] = frame["new col"].apply(np.array)
I still get object as the column type and cannot use the entries to do any math. I have to go the route of
np.array([eval(xxx)])
to get an array.
The question is: is there a nice and clean way to add arrays as column values without converting them to strings before assigning them?
Or, if this is not possible and I do need to assign the list as a string, is there a way to change the column type to np.array format?
The solution I mentioned above is not working.
Thanks a lot for any kind of help
Cheers
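
One way that avoids strings entirely (a minimal sketch, not from the original thread): assign a list of ndarray objects whose length matches the frame, so pandas stores one array per cell. The column dtype will still report as object, since that is how pandas stores non-scalar values, but each entry is a real np.array you can do math with:

import numpy as np
import pandas as pd

frame = pd.DataFrame({"Arbitrary Number": [1, 2, 3, 4]})

# One ndarray per row; the outer list's length must equal len(frame)
frame["new col"] = [np.array([8, 8, 8, 8]) for _ in range(len(frame))]

print(type(frame["new col"].iloc[0]))  # <class 'numpy.ndarray'>
print(frame["new col"].iloc[0] * 2)    # array([16, 16, 16, 16])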

Related

showing cells with particular symbols in a pandas dataframe

I have not seen such a question, so if you happen to know the answer or have seen the same one, please let me know.
I have a dataframe in pandas with 4 columns and 5k rows. One of the columns is "price" and I need to do some manipulations with it, but the data was parsed from a web page and is not clean, so I cannot convert the column to an integer type even after getting rid of the dollar signs and commas. I found out that it also contains data in the format 3500/mo, so I need to filter the cells with /mo and decide whether I can drop them, based on how many of them there are and what the prices look like.
Now, I have managed to count those cells using
df["price"].str.contains("/").sum()
but when I want to see those cells, I cannot. When I create another variable to extract the slash-containing cells using "contains" or something similar, I get a series of True/False values showing whether each cell contains the slash or not, while I actually need to see the cells themselves. Any ideas?
You need to use the boolean mask returned by df["price"].str.contains("/") as an index to get the respective rows, i.e., df[df["price"].str.contains("/")] (cf. the pandas docs on indexing).
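
For illustration, a small sketch with made-up prices (the asker's real frame is not shown):

import pandas as pd

df = pd.DataFrame({"price": ["$1,200", "3500/mo", "$980"]})

mask = df["price"].str.contains("/")  # boolean Series, one value per row
print(mask.sum())  # how many cells contain a slash
print(df[mask])    # the offending rows themselves, not just True/False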

What's the fastest way to do these tasks?

I originally have some time series data (which looks like this) and have to do the following:
1. Import it as a dataframe
2. Set the date column as a datetime index
3. Add some indicators, such as moving averages, as new columns
4. Do some rounding (on the values of whole columns)
5. Shift a column one row up or down (just to manipulate the data)
6. Convert the df to a list (because I need to loop over it based on some conditions, and that is a lot faster than looping over a df; I need speed)
7. Convert the df to a dict instead of a list, because I want to keep the column names; it's more convenient
But now I have found out that converting to a dict takes a lot longer than converting to a list, even when I do it manually instead of using the built-in method.
My question is: is there a better way to do this? Maybe not import as a dataframe in the first place, and still be able to do points 2 to 5? At the end I need to convert to a dict that allows me to do the loop and keeps the column names as keys. Thanks.
P.S. The dict should look something like this; the format is similar to the df, where each row is basically the date with the corresponding data.
On item #7: if you want to convert to a dictionary, you can use df.to_dict().
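
A sketch of what that can look like; the column names and values here are made up, and orient="index" keys the outer dict by the date index, matching the row-per-date format described above:

import pandas as pd

df = pd.DataFrame(
    {"open": [1.0, 2.0], "close": [1.5, 2.5]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02"]),
)

records = df.to_dict(orient="index")
# {Timestamp('2020-01-01'): {'open': 1.0, 'close': 1.5},
#  Timestamp('2020-01-02'): {'open': 2.0, 'close': 2.5}}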
On item #6: you don't need to convert the df to a list or loop over it; there are better options. Look for the second answer (it says DON'T).
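
For example, points 3 to 5 can all be done as vectorized column operations with no loop at all (a sketch; the column name, window size, and values are assumptions):

import pandas as pd

df = pd.DataFrame({"close": [10.04, 10.52, 11.23, 10.91]})

df["ma_2"] = df["close"].rolling(2).mean()  # moving-average indicator
df["close"] = df["close"].round(1)          # round a whole column
df["prev_close"] = df["close"].shift(1)     # shift one row down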

how do I create a new column out of a dictionary's substring on a pandas dataframe

I have the following repo for the files: https://github.com/Glarez/learning.git
(screenshot of the dataframe)
I need to create a column with the bold part of that string under the params column: "ufield_18":"ONLY". I don't see how I can get that, since I'm learning to code from scratch. The solution would be nice, but what I would really appreciate is for you to point me in the right direction so I can get the answer myself. Thanks!
Since you do not want the exact answer, I will outline one of the ways to achieve this:
filter the params column into a dictionary variable
create a loop to access the keys of the dictionary
append each key to the pandas df you have (df[key] = np.nan); make sure you add some values while appending the column if your df already has rows, or just add np.nan
note: np is the numpy library, which needs to be imported
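
A minimal sketch of those steps, assuming the params column holds dict-like strings (the sample value below is made up; the real data lives in the linked repo):

import ast
import numpy as np
import pandas as pd

df = pd.DataFrame({"params": ['{"ufield_18": "ONLY", "other": 1}']})

# Parse each string into a real dict, then pull out the key we need
parsed = df["params"].apply(ast.literal_eval)
df["ufield_18"] = parsed.apply(lambda d: d.get("ufield_18", np.nan))
print(df["ufield_18"])  # 0    ONLY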

How does the isnull() method work to return all rows that are missing in my data frame?

I'm new to Python and just trying to figure out how this small bit of code works. Hoping this'll be easy to explain without an example data frame.
My data frame, called df_train, contains a column called Age. This column is NaN for 177 records.
I submit the following code...
df_train[df_train['Age'].isnull()]
... and it returns all the records where Age is missing.
Now if I submit df_train['Age'].isnull() on its own, all I get is a boolean list of values. How does the data frame object then turn this boolean list into the rows we actually want?
I don't understand how passing the boolean list to the data frame again results in just the 177 records that we need. Could someone please ELI5 for a newbie?
You will have to create subsets of the dataframe you want to use. Suppose you want to use only those rows where df_train['Age'] is not null. In that case, you have to select
df_train_to_use = df_train[df_train['Age'].notnull()]
Now, you may also cross-check any other column that you want to use and that may have nulls, like
df_train['Column_name'].isnull().any()
If this returns True, you may go ahead and replace the nulls with default values, averages, zeros, or whatever method you prefer, as is usually done in machine learning programs.
Example
df_train = df_train.dropna(subset=['Column_name'])  # drop rows with nulls
df_train['Column_name'] = df_train['Column_name'].fillna('')   # for strings
df_train['Column_name'] = df_train['Column_name'].fillna(0)    # for int
df_train['Column_name'] = df_train['Column_name'].fillna(0.0)  # for float
Etc.
I hope this helps explain it.
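
To make the mechanism itself concrete (a toy frame, not the asker's data): indexing a DataFrame with a boolean Series is called boolean masking. Pandas aligns the Series to the frame's index and keeps exactly those rows whose mask value is True:

import numpy as np
import pandas as pd

df_train = pd.DataFrame({"Age": [22.0, np.nan, 35.0, np.nan]})

mask = df_train["Age"].isnull()  # boolean Series: False, True, False, True
print(df_train[mask])            # only the two rows where Age is NaN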

changing column types of a pandas data frame -- finding offending rows that prevent casting

My pandas data has columns that were read as objects. I want to change these into floats. Following the post linked below (1), I tried:
pdos[cols] = pdos[cols].astype(float)
But pandas gives me an error saying that an object can't be recast as a float:
ValueError: invalid literal for float(): 17_d
But when I search for 17_d in my data set, it tells me it's not there:
>>> '17_d' in pdos
False
I can look at the raw data to see what's happening outside of Python, but I feel that if I'm going to take Python seriously, I should know how to deal with this sort of issue. Why doesn't this search work? How could I search objects for strings in pandas? Any advice?
Pandas: change data type of columns
Of course it does, because you're only looking in the column list!
'17_d' in pdos
checks whether '17_d' is in pdos.columns.
So what you want is pdos[cols] == '17_d', which will give you a truth table. If you want to find which rows contain it, you can do (pdos[cols] == '17_d').any(axis=1).
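
Putting it together with toy data (pd.to_numeric with errors="coerce" is an alternative that surfaces every value that fails to parse, not just the one you already know about):

import pandas as pd

pdos = pd.DataFrame({"a": ["1.0", "17_d"], "b": ["2.5", "3.0"]})
cols = ["a", "b"]

print("17_d" in pdos)  # False -- `in` tests column labels, not cell values

bad_rows = (pdos[cols] == "17_d").any(axis=1)  # rows with the exact string
print(pdos[bad_rows])

coerced = pdos[cols].apply(pd.to_numeric, errors="coerce")
print(pdos[coerced.isnull().any(axis=1)])  # rows with any unparseable value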
