I have a dataframe:
df = pd.DataFrame([[0, 1, 2]], columns=['3m3a', '1z6n', '11p66d'])
Now I would like to apply 2 * value * (the trailing number of the column name), e.g. for the last column 2 * 2 * 66.
df.apply(lambda x: 2*x) for step 1.
Step 2 is the hardest part
Can do a new dataframe like df2 = df.stack().reset_index().apply(lambda x: x[re.search('[a-zA-Z]+', x).end():]) and then multiply by 2.
What’s a more pythonic way?
For DataFrame:
3m3a 1z6n 11p66d
0 0 1 2
You can use .columns.str.extract and then DataFrame.multiply:
vals = df.columns.str.extract(r"(\d+)[a-z]*?$").T.astype(int)
df = df.multiply(2 * vals.values, axis=1)
print(df)
Prints:
3m3a 1z6n 11p66d
0 0 12 264
Late to the party, having found almost the same answer, but using a negative look-behind regex:
newdf = df.multiply(
    2 * df.columns.str.extract(r'.*(?<!\d)(\d+)\D*').astype(int).values.ravel(),
    axis=1)
>>> newdf
3m3a 1z6n 11p66d
0 0 12 264
Thank you, both work.
What if I would like to split the column name in 2 parts, one up to and including the first letter, and the second the part after?
df.columns.str.split(r"(\d+\D+)",n=1,expand=True)
works but gives me 3 parts with the first one blank
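If the goal is the two pieces themselves, a two-group str.extract avoids the empty leading part that the capturing split produces. A minimal sketch (the `(\d+\D)(.*)` pattern is an assumption about the naming scheme: digits, then one letter, then the rest):

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2]], columns=['3m3a', '1z6n', '11p66d'])

# Group 1: digits up to and including the first letter; group 2: the rest.
parts = df.columns.str.extract(r'^(\d+\D)(.*)$')
print(parts)
```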
Say I have one dataframe
import pandas as pd
input_df = pd.DataFrame(dict(a=[1, 2], b=[2, 3]))
Also I have a function f that maps each row to another dataframe. Here's an example of such a function. Note that in general the function could take any form so I'm not looking for answers that use agg to reimplement the f below.
def f(row):
    return pd.DataFrame(dict(x=[row['a'] * row['b'], row['a'] + row['b']],
                             y=[row['a']**2, row['b']**2]))
I want to create one dataframe that is the concatenation of the function applied to each of the first dataframe's rows. What is the idiomatic way to do this?
output_df = pd.concat([f(row) for _, row in input_df.iterrows()])
I thought I should be able to use apply or similar for this purpose but nothing seemed to work.
x y
0 2 1
1 3 4
0 6 4
1 5 9
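For what it's worth, apply can be made to work here too: with axis=1, a function that returns a DataFrame per row yields a Series whose elements are DataFrames, which concat can stitch together. A sketch under that assumption:

```python
import pandas as pd

input_df = pd.DataFrame(dict(a=[1, 2], b=[2, 3]))

def f(row):
    return pd.DataFrame(dict(x=[row['a'] * row['b'], row['a'] + row['b']],
                             y=[row['a']**2, row['b']**2]))

# apply(f, axis=1) yields a Series of DataFrames; concatenating them
# matches the iterrows-based result, duplicated index and all.
output_df = pd.concat(input_df.apply(f, axis=1).tolist())
print(output_df)
```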
You can use DataFrame.agg to calculate prod and sum plus numpy.ndarray.reshape, and df.pow(2)/np.square for calculating the squares.
out = pd.DataFrame({'x': df.agg(['prod', 'sum'], axis=1).to_numpy().reshape(-1),
                    'y': np.square(df).to_numpy().reshape(-1)})
out
x y
0 2 1
1 3 4
2 6 4
3 5 9
You should avoid iterating rows (see How to iterate over rows in a DataFrame in Pandas).
Instead try:
df = df.assign(product=df.a*df.b, sum=df.sum(axis=1),
asq=df.a**2, bsq=df.b**2)
Then:
df = [[[p, s], [asq, bsq]] for p, s, asq, bsq in df[['product', 'sum', 'asq', 'bsq']].to_numpy()]
My problem is quite hard to explain but easily understandable with an example:
From this dataframe
pd.DataFrame([[2,"1523974569"],[3,"3214569871"],[0,"9384927512"]])
I would like to obtain:
pd.DataFrame(["15","321",""])
It means that the first column tells me how many characters I should extract from the second column, starting from the beginning.
Thanks
You could get it using apply and a lambda on the dataframe as below:
df = pd.DataFrame([[2,"1523974569"],[3,"3214569871"],[0,"9384927512"]])
df[2] = df.apply(lambda x : x[1][:x[0]], axis=1)
df
it will give you the output
0 1 2
0 2 1523974569 15
1 3 3214569871 321
2 0 9384927512
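An alternative sketch without apply: zip the two columns and slice in a plain list comprehension, which is typically faster than a row-wise apply on larger frames:

```python
import pandas as pd

df = pd.DataFrame([[2, "1523974569"], [3, "3214569871"], [0, "9384927512"]])

# Pair each length with its string and slice from the start.
df[2] = [s[:n] for n, s in zip(df[0], df[1])]
print(df)
```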
I have a dataframe like this:
A B
exa 3
example 6
exam 4
hello 4
hell 3
I want to delete the rows that are substrings of another row and keep the longest one (Notice that B is already the length of A)
I want my table to look like this:
A B
example 6
hello 4
I thought about the following boolean filter but it does not work :(
df['Check'] = df.apply(lambda row: df.count(row['A'] in row['A'])>1, axis=1)
This is non-trivial, but we can take advantage of B to sort the data and compare each value only with the strings longer than itself, for a solution slightly better than O(N^2).
df = df.sort_values('B')
v = df['A'].tolist()
df[[not any(b.startswith(a) for b in v[i + 1:]) for i, a in enumerate(v)]].sort_index()
A B
1 example 6
3 hello 4
Like what cold provided, my solution is O(m*n) as well (in your case m=n):
df[np.sum(np.array([[y in x for x in df.A.values] for y in df.A.values]),1)==1]
Out[30]:
A B
1 example 6
3 hello 4
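If only prefixes matter (as in this sample, where every shorter string is a prefix of a longer one), sorting A lexicographically reduces the check to one comparison with the immediate successor, roughly O(n log n). A sketch under that prefix-only assumption:

```python
import pandas as pd

df = pd.DataFrame({'A': ['exa', 'example', 'exam', 'hello', 'hell'],
                   'B': [3, 6, 4, 4, 3]})

# After a lexicographic sort, a string that is a prefix of any other
# string is a prefix of its immediate successor.
s = df.sort_values('A')
vals = s['A'].tolist()
keep = [not nxt.startswith(cur) for cur, nxt in zip(vals, vals[1:])] + [True]
out = s[keep].sort_index()
print(out)
```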
I have a situation where I am creating a pivot table in pandas where it makes more sense to calculate the fields separately and just use .pivot_table() for the pivot step. However, I am running into some difficulty trying to calculate the denominator for my percentages. Essentially, due to the data format I appear to need to do something like "groupby transform unique sum" on the second line below (which is where I am stuck):
df['numerator'] = df.groupby(['category1','category2'])['customer_id'].transform('nunique')
df['denominator'] = df.groupby(['category2'])['numerator'].nunique().transform('sum')
df['percentage'] = (df['numerator'] / df['denominator'])
df_pivot = df.pivot_table(index='category1',
                          columns=['category2'],
                          values=['numerator', 'percentage']) \
             .swaplevel(0, 1, axis=1)
df_pivot.loc['total', :] = df_pivot.sum().values
My apologies for not being able to provide any dummy data, but I would appreciate any tips if I have hopefully provided enough detail to reason about.
I believe you need a lambda function with unique and sum:
df = pd.DataFrame({'numerator':[3,1,1,9,2,2],
'category2':list('aaabbb')})
#print (df)
df['denominator']=df.groupby(['category2'])['numerator'].transform(lambda x: x.unique().sum())
Alternative solution with sets and sums:
df['denominator']=df.groupby(['category2'])['numerator'].transform(lambda x: sum(set(x)))
print (df)
category2 numerator denominator
0 a 3 4
1 a 1 4
2 a 1 4
3 b 9 11
4 b 2 11
5 b 2 11
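Another option along the same lines is to drop the duplicate (category2, numerator) pairs first, sum per group, and map the result back onto the original rows; a sketch on the same dummy data:

```python
import pandas as pd

df = pd.DataFrame({'numerator': [3, 1, 1, 9, 2, 2],
                   'category2': list('aaabbb')})

# Sum each group's unique numerators, then broadcast back per row.
denom = (df.drop_duplicates(['category2', 'numerator'])
           .groupby('category2')['numerator'].sum())
df['denominator'] = df['category2'].map(denom)
print(df)
```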
I have a dataframe that I want to sort on one of my columns (a date).
However, I have a loop running on the index (while i < df.shape[0]); I need the loop to go over my dataframe once it is sorted by date.
Is the current index modified accordingly by the sorting, or should I use df.reset_index()?
Maybe I'm not understanding the question, but a simple check shows that sort_values does modify the index:
df = pd.DataFrame({'x':['a','c','b'], 'y':[1,3,2]})
df = df.sort_values(by = 'x')
Yields:
x y
0 a 1
2 b 2
1 c 3
And a subsequent:
df = df.reset_index(drop = True)
Yields:
x y
0 a 1
1 b 2
2 c 3
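So the labels follow the rows rather than the new order. For a while i < df.shape[0] loop, positional access with .iloc walks the sorted order without needing reset_index at all; a small sketch:

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'c', 'b'], 'y': [1, 3, 2]})
df = df.sort_values(by='x')

# .iloc is positional, so it follows the sorted order even though the
# original index labels (0, 2, 1) are preserved.
i = 0
seen = []
while i < df.shape[0]:
    seen.append(df.iloc[i]['x'])
    i += 1
print(seen)
```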