Python: Create New Column based on values of other column using len() - python

My dataframe is a pandas dataframe df with many rows & columns.
Now i want to create a new column (series) based on the values of an object column. e.g.:
df.iloc[0, 'oldcolumn'] Output is 0 should give me 0 in a new column and
df.iloc[1, 'oldcolumn'] Output is 'ab%$.' should give me 5 in the same new column (number of literals incl. space).
in addition, is there a way to avoid loops or own functions?
Thank U

To create a new column based on the length of the value in another column, you should do
df['newcol'] = df['oldcol'].apply(lambda x: len(str(x)))
Although this is a generic way of creating a new column based on data from existing columns, Henry's approach is also a good one.
In addition, is there a way to avoid loops or own functions?
I recommend you take a look at How To Make Your Pandas Loop 71803 Times Faster.

You can try this:
df['strlen'] = df['oldcolumn'].apply(len)
print(df)

Related

Create Dataframe with a certain number of columns

I have the following Dataframe:
Now i want to copy the column "Power" as often as i want to another column in the same Dataframe.
The column names should be: Power_1; Power_2; Power_3.....
Creating the Dataframe is too complicated to share, but a simple example how to add the columns with a while-loop would be sufficient.
for i in range(10):
df[f"Power_{i}"] = df["Power"]

How to add a column with values depending on existing rows with lower index in pandas?

Is there a fast way of adding a column to a data frame df with values depending on all the rows of df with smaller index? A very simple example where the new column only depends on the value of one other column would be df["new_col"] = df["old_col"].cumsum() (if df is ordered), but I have something more complicated in mind. Ideally, I'd like to write something like
df["new_col"] = df.[some function here](f),
where [some function] sets the i-th value of df["new_col"] to f(df[df.index <= df.index[i]]). (Ideally [some function] can also be applied to groupby() objects.)
At the moment I loop through rows, add a temporary column containing a dict of relevant values and then apply a function, but this is very slow, memory-inefficient, etc.

How to get rows from one dataframe based on another dataframe

I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'A' in dataframe DT is a subset of the column 'A' in dataframe MR, they both are just similar (not equal) in this ID column, the rest of the columns are different as well as the number of rows.
How can I get the rows from dataframe MR['ID'] that are equal to the dataframe DT['ID']? Knowing that values in 'ID' can appear several times in the same column.
The DT is 1538 rows and MR is 2060 rows).
I tried some lines proposed here >https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results as I don't fully understand the methods they proposed (and the goal is little different)
Thanks!
Take a look at pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or if you want to just get a new dataframe of combined records for the same ID you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.

Adding columns to a pandas.DataFrame with previous row values before calling apply()

I need to add a new column to a pandas dataframe, where the value is calculated from the value of a column in the previous row.
Coming from a non-functional background (C#), I am trying to avoid loops since I read it is an anti-pattern.
My plan is to use series.shift to add a new column to the dataframe for the previous value, call dataframe.apply and finally remove the additional column. E.g.:
def my_function(row):
# perform complex calculations with row.time, row.time_previous and other values
# return the result
df["time_previous"] = df.time.shift(1)
df.apply(my_function, axis = 1)
df.drop("time.previous", axis=1)
In reality, I need to create four additional columns like this. Is there a better alternative to accomplish this without a loop? Is this a good idea at all?

How to add values to a new column in pandas dataframe?

I want to create a new named column in a Pandas dataframe, insert first value into it, and then add another values to the same column:
Something like:
import pandas
df = pandas.DataFrame()
df['New column'].append('a')
df['New column'].append('b')
df['New column'].append('c')
etc.
How do I do that?
If I understand, correctly you want to append a value to an existing column in a pandas data frame. The thing is with DFs you need to maintain a matrix-like shape so the number of rows is equal for each column what you can do is add a column with a default value and then update this value with
for index, row in df.iterrows():
df.at[index, 'new_column'] = new_value
Dont do it, because it's slow:
updating an empty frame a-single-row-at-a-time. I have seen this method used WAY too much. It is by far the slowest. It is probably common place (and reasonably fast for some python structures), but a DataFrame does a fair number of checks on indexing, so this will always be very slow to update a row at a time. Much better to create new structures and concat.
Better to create a list of data and create DataFrame by contructor:
vals = ['a','b','c']
df = pandas.DataFrame({'New column':vals})
If in case you need to add random values to the newly created column, you could also use
df['new_column']= np.random.randint(1, 9, len(df))

Categories