If I have a dataframe df and want to access the unique values of ID, I can do something like this.
UniqueFactor = df.ID.unique()
But how can I convert this into a function in order to access different variables? I tried this, but it doesn't work
def separate_by_factor(df, factor):
# Separating the readings by given factor
UniqueFactor = df.factor.unique()
separate_by_factor('ID')
And it shouldn't, because I'm passing a string as a variable name. How can I get around this?
I don't know how I can better word the question, sorry for being too vague.
When you create a DataFrame, every column that is a valid identifier it's treated as an attribute. To access a column based on its name (like in your example), you need to use df[factor].unique().
Related
I am trying to clean up some data in a file that I have. In the column I'm trying to "clean" there is Last Name, First Name. The issue is sometimes it will come in as "#123;#Last Name, First Name".enter image description here
Normally with columns like this I would use a string partition, like:
df['Name'] = df['Name'].str.partition('#')[2]
But in this case when I apply it to this column, it blanks out all of the names that are coming in properly as Last Name, First Name.
Is there a way to partition the values only when the "#123;#" before the Last Name, First Name occurs? By the way, the number "123" varies, so I wouldn't want to constrain it on specifically equaling any specific number.
Try something like -
df['Name'] = df['Name'].str.split('#')[-1]
I originally have some time series data, which looks like this and have to do the following:
First import it as dataframe
Set date column as datetime index
Add some indicators such as moving average etc, as new columns
Do some rounding (values of the whole column)
Shift a column one row up or down (just to manipulate the data)
Then convert the df to list (because I need to loop it based on some conditions, it's a lot faster than looping a df because I need speed)
But now I want to convert df to dict instead of list because I want to keep the column names, it's more convenient
But now I found out that convert to dict takes a lot longer than list. Even I do it manually instead of using python built-in method.
My question is, is there a better way to do it? Maybe not to import as dataframe in the first place? And still able to do Point 2 to Point 5? At the end I need to convert to dict which allows me to do the loop, keep the column names as keys? THanks.
P.S. the dict should look something like this, the format is similar to df, each row is basically the date with the corresponding data.
On item #7: If you want to convert to a dictionary, you can use df.to_dict()
On item #6: You don't need to convert the df to a list or loop over it: Here are better options. Look for the second answer (it says DON'T)
I have here an issue and would like to ask for support
Suppose you have a following frame
frame=pd.Dataframe({"Arbitary Number":[1,2,3,4]})
I want to add an additional column, whose entries are np.arrays. I add the entry the following way
frame["new col"]='[8,8,8,8]'
How ever in a later stage I need the entries as array. If I apply
frame["new col"]=frame["new col"].appy(np.array)
I still get object as column type and cannot use the entries to do some math work. I need to go the way with
np.array([eval(xxx)])
to have an array
The question is: Is there a nice and clean way to add arrays as column values without transforming them as strings before assigning them as value?
Or if this is not the case and I do need to assign the list as string, is there a way to change the column type to np.array format?
My mentioned solution is not working
Thanks a lot for any kind of help
Cheers
I am quite new in Python coding, and I am dealing with a big dataframe for my internship.
I had an issue as sometimes there are wrong values in my dataframe. For example I find string type values ("broken leaf") instead of integer type values as ("120 cm") or (NaN).
I know there is the df.replace() function, but therefore you need to know that there are wrong values. So how do I find if there are any wrong values inside my dataframe?
Thank you in advance
"120 cm" is a string, not an integer, so that's a confusing example. Some ways to find "unexpected" values include:
Use "describe" to examine the range of numerical values, to see if there are any far outside of your expected range.
Use "unique" to see the set of all values for cases where you expect a small number of permitted values, like a gender field.
Look at the datatypes of columns to see whether there are strings creeping in to fields that are supposed to be numerical.
Use regexps if valid values for a particular column follow a predictable pattern.
Using ipython for interactive manipulation, the autocomplete feature helps expanding columns names quickly.
But given the column object, I'd like to get it's name but I haven't found a simple way to do it. Is there one?
I'm trying to avoid typing the full "ALongVariableName"
x = "ALongVariableName"
relevantColumn = df[x]
instead I type "df.AL<\Tab>" to get my series. So I have:
relevantColumn = df.ALongVariableName #now how can I get x?
But that series object doesn't carry its name or index in the dataframe. Did I miss it?
Thanks!