add column with string length of another column and cumsum? - python

Given the following dataframe:
df = pd.DataFrame({'col1': ["kuku", "pu", "d", "fgf"]})
I want to calculate the length of each string and add a cumsum column.
I am trying to do this with df.str.len("col1") but it throws an error.

Use str.len()
Ex:
import pandas as pd
df = pd.DataFrame({"col1": ["kuku", "pu", "d", "fgf"]})
df["New"] = df["col1"].str.len()
print(df)
print(df["New"].cumsum()) #cumulative sum
Output:
   col1  New
0  kuku    4
1    pu    2
2     d    1
3   fgf    3
0     4
1     6
2     7
3    10
Name: New, dtype: int64
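Both columns can also be added in a single chained expression with assign (a sketch; the column names New and cum are illustrative, not from the question):

```python
import pandas as pd

df = pd.DataFrame({"col1": ["kuku", "pu", "d", "fgf"]})

# assign evaluates keyword arguments in order, so the lambda for "cum"
# can refer to the freshly created "New" column
df = df.assign(
    New=df["col1"].str.len(),
    cum=lambda d: d["New"].cumsum(),
)
print(df)
```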

If your dataframe initialization is what raises the error, make sure it looks like this:
>>> df = pd.DataFrame({'col1': ["kuku", "pu", "d", "fgf"]})
>>> df
col1
0 kuku
1 pu
2 d
3 fgf
Alternatively, you can use map as well.
>>> df.col1.map(lambda x: len(x))
0 4
1 2
2 1
3 3
To add the lengths as a column:
>>> df['len'] = df.col1.str.len()
>>> df
col1 len
0 kuku 4
1 pu 2
2 d 1
3 fgf 3

Or
import pandas as pd
df = pd.DataFrame({ "col1" : ["kuku", "pu", "d", "fgf"]})
df['new'] = df.col1.apply(lambda x: len(x))

If you passed col1 as a keyword argument, pd.DataFrame() will reject it as unknown. Pass your dict via the data argument instead, then add your new column with the lengths:
data = {'col1': ["kuku", "pu", "d", "fgf"]}
df = pd.DataFrame(data=data)
df["col1 lengths"] = df["col1"].str.len()
print(df)

Here is another alternative that solved my issue:
df = pd.DataFrame({"col1": ['dilly macaroni recipe salad', 'gazpacho', 'bake crunchy onion potato', 'cool creamy easy pie watermelon', 'beef easy skillet tropical', 'chicken grilled tea thigh', 'cake dump rhubarb strawberry', 'parfaits yogurt', 'bread nut zucchini', 'la salad salmon']})
df["title_len"] = df["col1"].str.len()
df["cum_len"] = df["title_len"].cumsum()

Related

How to drop only the numbers in a column that mixes strings and numbers (object dtype), pandas

d = {'col1': ['Son', 2, 'Dad'], 'col2': [3, 4, 5]}
df = pd.DataFrame(data=d)
I want to change the second row to 'Unknown':
col1 col2
0 Son 3
1 2 4
2 Dad 5
change to
col1 col2
0 Son 3
1 Unknown 4
2 Dad 5
pandas str functions turn numeric values into NaN:
df['col1'] = df['col1'].str[:].fillna('Unknown')
output:
df
col1 col2
0 Son 3
1 Unknown 4
2 Dad 5
You can use regex:
df['col1'] = df['col1'].astype(str)
df['col1'] = df['col1'].str.replace(r'\d+', 'Unknown', regex=True)  # find numbers and replace with "Unknown"
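If you would rather avoid regex, a mask built with pd.to_numeric(errors='coerce') also identifies the numeric entries; a sketch, assuming col1 holds a mix of strings and numbers as in the question:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['Son', 2, 'Dad'], 'col2': [3, 4, 5]})

# to_numeric turns anything non-numeric into NaN, so notna() flags the numbers
is_number = pd.to_numeric(df['col1'], errors='coerce').notna()
df.loc[is_number, 'col1'] = 'Unknown'
print(df)
```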

Python pandas: swap column values of a DataFrame slice

I have a DataFrame like:
df = pd.DataFrame({"type":['pet', 'toy', 'toy', 'car'], 'symbol': ['A', 'B', 'C', 'D'], 'desc': [1, 2, 3, 4]})
df
Out[22]:
type symbol desc
0 pet A 1
1 toy B 2
2 toy C 3
3 car D 4
My goal is to swap the values of symbol and desc for the rows whose type is toy:
type symbol desc
0 pet A 1
1 toy 2 B # <-- B and 2 are swapped
2 toy 3 C # <-- C and 3 are swapped
3 car D 4
So I took a slice first and tried the swap on the slice, but it failed. My script, the warning, and the result are:
df[df['type']=='toy'][['symbol', 'desc']] = df[df['type']=='toy'][['desc', 'symbol']]
/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py:3191: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self[k1] = value[k2]
df
Out[31]:
type symbol desc
0 pet A 1
1 toy B 2 # <-- didn't work :(
2 toy C 3
3 car D 4
Is there any advice?
Let us do
m = df.type=='toy'
l = ['symbol','desc']
df.loc[m,l] = df.loc[m,l[::-1]].values
df
Out[89]:
type symbol desc
0 pet A 1
1 toy 2 B
2 toy 3 C
3 car D 4
Or try with rename
m = df.type=='toy'
l = ['symbol','desc']
out = pd.concat([df[~m],df[m].rename(columns=dict(zip(l,l[::-1])))]).sort_index()
Let's try something like:
import pandas as pd
df = pd.DataFrame({"type": ['pet', 'toy', 'toy', 'car'],
                   'symbol': ['A', 'B', 'C', 'D'],
                   'desc': [1, 2, 3, 4]})
m = df['type'] == 'toy'
df.loc[m, ['symbol', 'desc']] = df.loc[m, ['desc', 'symbol']].to_numpy()
print(df)
Output:
type symbol desc
0 pet A 1
1 toy 2 B
2 toy 3 C
3 car D 4
Use to_numpy() / values so the assignment doesn't align the values to their old column names.
You can also use pandas where:
df[['symbol', 'desc']] = df[['desc', 'symbol']].where(df['type'] == 'toy',
                                                      df[['symbol', 'desc']].values)
Output:
type symbol desc
0 pet A 1
1 toy 2 B
2 toy 3 C
3 car D 4
You can also use list(zip(...)):
m = df['type']=='toy'
df.loc[m, ['symbol', 'desc']] = list(zip(df.loc[m, 'desc'], df.loc[m, 'symbol']))
Or simply use .values:
m = df['type']=='toy'
df.loc[m, ['symbol', 'desc']] = df.loc[m, ['desc', 'symbol']].values
Result:
print(df)
type symbol desc
0 pet A 1
1 toy 2 B
2 toy 3 C
3 car D 4

How to sort the data frame by the "Title" column in alphabetical order

I have a dataframe of movies (not reproduced here), and this is my code:
movies_taxes['Total Taxes'] = movies_taxes.apply(lambda x:(0.2)* x['US Gross'] + (0.18) * x['Worldwide Gross'], axis=1)
movies_taxes
Simple example:
import pandas as pd
df = pd.DataFrame({'player': ['C','B','A'], 'data': [1,2,3]})
df = df.sort_values(by ='player')
Output:
From:
player data
0 C 1
1 B 2
2 A 3
To:
player data
2 A 3
1 B 2
0 C 1
Another example:
df = pd.DataFrame({
    'student': ['monica', 'nathalia', 'anastasia', 'marina', 'ema'],
    'grade': ['excellent', 'excellent', 'good', 'very good', 'good']
})
print (df)
student grade
0 monica excellent
1 nathalia excellent
2 anastasia good
3 marina very good
4 ema good
Pre pandas 0.17:
Sort by ascending student name
df.sort('student')
reverse ascending
df.sort('student', ascending=False)
Pandas 0.17+ (as mentioned in the other answers):
ascending
df.sort_values('student')
reverse ascending
df.sort_values('student', ascending=False)
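In pandas 1.1+, sort_values also accepts a key callable, which helps when plain alphabetical order is not quite what you want (for example, case-insensitive sorting); a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'student': ['monica', 'Nathalia', 'anastasia']})

# key is applied to the column before comparing, here lower-casing each name
out = df.sort_values('student', key=lambda s: s.str.lower())
print(out)
```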
This ought to do it:
>>> import pandas as pd
>>> s = pd.Series(['banana', 'apple', 'friends', '3 dog and cat', '10 old man'])
>>> import numpy as np
# Build a mask that is True for entries that do not start with a digit
>>> mask = np.array([not any(x.startswith(str(n)) for n in range(10)) for x in s])
>>> s[mask]
0 banana
1 apple
2 friends
dtype: object
# Stack the sorted, non-starting-with-a-number array and the sorted, starting-with-a-number array
>>> pd.concat((s[mask].sort_values(), s[~mask].sort_values(ascending=False)))
1 apple
0 banana
2 friends
3 3 dog and cat
4 10 old man

How to quickly select dataframe rows by multiple column values in pandas

I want to filter rows by multi-column values.
For example, given the following dataframe,
import pandas as pd
df = pd.DataFrame({"name": ["Amy", "Amy", "Amy", "Bob", "Bob"],
                   "group": [1, 1, 1, 1, 2],
                   "place": ['a', 'a', 'a', 'b', 'b'],
                   "y": [1, 2, 3, 1, 2]})
print(df)
Original dataframe:
name group place y
0 Amy 1 a 1
1 Amy 1 a 2
2 Amy 1 a 3
3 Bob 1 b 1
4 Bob 2 b 2
I want to select the samples that satisfy the columns combination [name, group, place] in selectRow.
selectRow = [["Amy", 1, "a"], ["Amy", 2, "b"]]
Then the expected dataframe is :
name group place y
0 Amy 1 a 1
1 Amy 1 a 2
2 Amy 1 a 3
I have tried it, but my method is not efficient and runs for a long time, especially when there are many samples in the original dataframe.
My Simple Method:
newdf = pd.DataFrame({})
for item in selectRow:
    print(item)
    tmp = df.loc[(df['name'] == item[0]) & (df['group'] == item[1]) & (df['place'] == item[2])]
    newdf = newdf.append(tmp)
newdf = newdf.reset_index(drop=True)
newdf.tail()
print(newdf)
I hope for a more efficient method to achieve this.
Try using isin:
cols = list(zip(*selectRow))
mask = df['name'].isin(cols[0]) & df['group'].isin(cols[1]) & df['place'].isin(cols[2])
print(df[mask])
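Note that three independent isin checks can also keep rows that mix values from different entries of selectRow (e.g. a row with name "Amy", group 2, place "a" would pass). To match whole (name, group, place) combinations exactly, one option is to compare row tuples; a sketch:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Amy", "Amy", "Amy", "Bob", "Bob"],
                   "group": [1, 1, 1, 1, 2],
                   "place": ['a', 'a', 'a', 'b', 'b'],
                   "y": [1, 2, 3, 1, 2]})
selectRow = [["Amy", 1, "a"], ["Amy", 2, "b"]]

# Compare whole (name, group, place) tuples instead of each column separately
wanted = set(map(tuple, selectRow))
rows = pd.Series(list(zip(df['name'], df['group'], df['place'])), index=df.index)
result = df[rows.isin(wanted)]
print(result)
```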

How to create data frame with links between data in two different data frames

I have one pandas dataframe for persons like:
pid name job
1 Mike A
2 Lucy A
3 Jeff B
And a second one for jobs like:
id name
1 A
2 B
3 C
What I want to produce is a third dataframe where I list the connections between people and jobs, so in this dummy example the desired result will be:
personid jobid
1 1
2 1
3 2
How can I accomplish this with pandas? I don't understand how to join in this case, since it isn't a row-by-row operation...
Try with pandas, suppose you have df1 and df2:
import pandas as pd
df1 = pd.read_csv('Data1.csv')
df2 = pd.read_csv('Data2.csv')
print(df1)
print(df2)
df1 :
pid name job
0 1 Mike A
1 2 Lucy A
2 3 Jeff B
and df2:
id name
0 1 A
1 2 B
2 3 C
then:
df2['job'] = df2['name']
df_result = df1.merge(df2, on='job', how='left')
print(df_result[['pid', 'id']])
It will print out:
pid id
0 1 1
1 2 1
2 3 2
Is this what you're looking for?
output = pd.merge(persons, jobs, how='left', left_on='job', right_on='name')[['pid', 'id']]
Output:
pid id
0 1 1
1 2 1
2 3 2
The two given dataframes are the following:
import pandas as pd
people_df = pd.DataFrame([[1, "Mike", "A"], [2, "Lucy", "A"], [3, "Jeff", "B"]], columns=["pid", "name", "job"])
jobs_df = pd.DataFrame([[1, "A"], [2, "B"], [3, "C"]], columns=["id", "name"])
You can get the desired result by using merge method.
merged_df = pd.merge(people_df, jobs_df, left_on='job', right_on='name')
result = merged_df[['pid', 'id']].rename(columns={'pid': 'personid', 'id': 'jobid'}) # for extracting and renaming data
merge performs an inner join by default; pass the how option if you want a different type of join.
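If you also need to spot people whose job has no match in the jobs table, merge's indicator=True flag adds a _merge column; a sketch with a modified example where Jeff's job "D" (an assumption for illustration) is absent from jobs_df:

```python
import pandas as pd

people_df = pd.DataFrame([[1, "Mike", "A"], [2, "Lucy", "A"], [3, "Jeff", "D"]],
                         columns=["pid", "name", "job"])
jobs_df = pd.DataFrame([[1, "A"], [2, "B"], [3, "C"]], columns=["id", "name"])

# indicator=True records whether each row matched ('both') or not ('left_only')
merged = people_df.merge(jobs_df, how='left', left_on='job', right_on='name',
                         indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['pid', 'job']])
```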
