I have a dataset like:
a b c
1 x1 c1
2 x2 c2
3 x3 c3
and I want to apply a function f only to column b.
I did something like:
d2 = d['b'].apply(f)
But I get a result like:
a b
1 xt
2 xt
3 xt
But I want to keep column c, for a result like:
a b c
1 xt c1
2 xt c2
3 xt c3
How can I do that without merging with the first dataset?
I think you should try not to use apply, because it is slower; it is better to use pandas API functions.
e.g. if you need to replace the column with a new constant value:
df['b'] = 'xt'
print (df)
a b c
0 1 xt c1
1 2 xt c2
2 3 xt c3
But if apply is necessary:
def f(x):
    return 'xt'
df['b'] = df.b.apply(f)
print (df)
a b c
0 1 xt c1
1 2 xt c2
2 3 xt c3
If you need a new DataFrame, first make a copy:
d = df.copy()
def f(x):
    return 'xt'
d['b'] = d.b.apply(f)
print (d)
a b c
0 1 xt c1
1 2 xt c2
2 3 xt c3
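As a variation, DataFrame.assign builds the new frame in one step, so no explicit copy is needed. A minimal sketch, assuming the same f as above:
# assign returns a copy with column b replaced; df itself is untouched
d2 = df.assign(b=df['b'].apply(f))
print(d2)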
Related
I have a DataFrame like so:
C1 C2 C3 C4
1 A B C E
2 C D E F
3 A C A B
4 A A B G
5 B nan C E
And a list:
filt = ['A', 'B', 'C']
What I need is a filter that keeps only the rows that have all the values from filt, in any order or position. So output here would be:
C1 C2 C3 C4
1 A B C E
3 A C A B
I've looked at previous questions like Check multiple columns for multiple values and return a dataframe. In that case, however, the OP is only partially matching. In my case, all values must be present, in any order, for the row to be matched.
One solution
Use:
fs_filt = frozenset(filt)
mask = df.apply(frozenset, axis=1) >= fs_filt
res = df[mask]
print(res)
Output
C1 C2 C3 C4
0 A B C E
2 A C A B
The idea is to convert each row to a frozenset and then verify that the frozenset of filt is a subset of the row's elements (row >= fs_filt means the row's set of values is a superset of fs_filt).
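For reference, a self-contained version of the idea; the frame and filt literals are reconstructed from the tables above:
import pandas as pd
import numpy as np

df = pd.DataFrame({'C1': ['A', 'C', 'A', 'A', 'B'],
                   'C2': ['B', 'D', 'C', 'A', np.nan],
                   'C3': ['C', 'E', 'A', 'B', 'C'],
                   'C4': ['E', 'F', 'B', 'G', 'E']})
filt = ['A', 'B', 'C']

fs_filt = frozenset(filt)
# a row matches when its set of values contains every element of filt
mask = df.apply(frozenset, axis=1) >= fs_filt
print(df[mask])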
I am trying to extract the top-level domain (TLD), second-level domain (SLD), etc. from a column in a dataframe and add them as new columns. Currently I have a solution where I split the column into a list and then use tolist, but since this appends the pieces sequentially, it does not work correctly: if a URL has three levels, the mapping gets messed up.
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4],"C":["xyz[.]com","abc123[.]pro","xyzabc[.]gouv[.]fr"]})
df['C'] = df.C.apply(lambda x: x.split('[.]'))
df.head()
A B C
0 1 2 [xyz, com]
1 2 3 [abc123, pro]
2 3 4 [xyzabc, gouv, fr]
d = [pd.DataFrame(df[col].tolist()).add_prefix(col) for col in df.columns]
df = pd.concat(d, axis=1)
df.head()
A0 B0 C0 C1 C2
0 1 2 xyz com None
1 2 3 abc123 pro None
2 3 4 xyzabc gouv fr
I want C2 to always contain the TLD (com, pro, fr) and C1 to always contain the SLD.
I am sure there is a better way to do this correctly and would appreciate any pointers.
You can shift the Cx columns:
df.loc[:, "C0":] = df.loc[:, "C0":].apply(
lambda x: x.shift(periods=x.isna().sum()), axis=1
)
print(df)
Prints:
A0 B0 C0 C1 C2
0 1 2 NaN xyz com
1 2 3 NaN abc123 pro
2 3 4 xyzabc gouv fr
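Putting the whole pipeline together, a sketch that simply chains the question's split step with the shift above:
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4],
                   "C": ["xyz[.]com", "abc123[.]pro", "xyzabc[.]gouv[.]fr"]})
df['C'] = df.C.apply(lambda x: x.split('[.]'))
d = [pd.DataFrame(df[col].tolist()).add_prefix(col) for col in df.columns]
df = pd.concat(d, axis=1)

# shift each row right by its NaN count so the TLD lands in the last column
df.loc[:, "C0":] = df.loc[:, "C0":].apply(
    lambda x: x.shift(periods=x.isna().sum()), axis=1
)
print(df)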
You can also use a regex with a negative lookahead and the built-in pandas split with expand=True:
df[['C0', 'C2']] = df.C.str.split(r'\[\.\](?!.*\[\.\])', expand=True)
df[['C0', 'C1']] = df.C0.str.split(r'\[\.\]', expand=True)
that gives
A B C C0 C2 C1
0 1 2 xyz[.]com xyz com None
1 2 3 abc123[.]pro abc123 pro None
2 3 4 xyzabc[.]gouv[.]fr xyzabc fr gouv
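The same approach as a self-contained snippet (raw strings keep the regex backslashes intact):
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4],
                   "C": ["xyz[.]com", "abc123[.]pro", "xyzabc[.]gouv[.]fr"]})
# the negative lookahead matches only the last "[.]", so C2 is always the TLD
df[['C0', 'C2']] = df.C.str.split(r'\[\.\](?!.*\[\.\])', expand=True)
# split the remainder once more to peel an optional extra level into C1
df[['C0', 'C1']] = df.C0.str.split(r'\[\.\]', expand=True)
print(df)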
data frame issue
ID C1  C2  M1
1  A   B   X
2  A   NaN Y
3  C   NaN W
4  G   H   Z
result wanted
ID C
1  A
1  B
2  B
3  C
4  C
4  G
The main problem is that today the dataset has C1 and C2, but tomorrow we could have C1, C2, C3, ..., Cn. The filename will be provided, and my task is to read it and get the result regardless of how many C-related columns the file may have. Column M1 is not needed.
What I tried:
df = pd.read_csv(r"C:\Users\JIRAdata_TEST.csv")
df = df.filter(regex='ID|C')
print(df)
This returns all ID and C-related columns and removes the M1 column as part of data cleanup (don't know if that helps).
Then... I am stuck!
Use df.melt with df.dropna:
In [1295]: x = df.filter(regex='ID|C').melt('ID', value_name='C').sort_values('ID').dropna().drop(columns='variable')
In [1296]: x
Out[1296]:
ID C
0 1 A
4 1 B
5 2 A
2 3 C
3 4 G
7 4 H
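A self-contained sketch of the whole flow; the frame literal below is an assumption reconstructed from the tables above, standing in for the CSV read:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'C1': ['A', 'A', 'C', 'G'],
                   'C2': ['B', None, None, 'H'],
                   'M1': ['X', 'Y', 'W', 'Z']})

x = (df.filter(regex='ID|C')        # keep ID and every Cx column, drop M1
       .melt('ID', value_name='C')  # wide -> long: one row per (ID, Cx) pair
       .sort_values('ID')
       .dropna()                    # discard the empty Cx cells
       .drop(columns='variable'))   # the melted column-name column is not needed
print(x)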
I have a pandas dataframe that looks like:
c1 c2 c3 c4 result
a b c d 1
b c d a 1
a e d b 1
g a f c 1
but I want to randomly select 50% of the rows, swap their column order, and also flip the result column from 1 to 0 (as shown below):
c1 c2 c3 c4 result
a b c d 1
d a b c 0 (we swapped c3 and c4 with c1 and c2)
a e d b 1
f c g a 0 (we swapped c3 and c4 with c1 and c2)
What's the idiomatic way to accomplish this?
You had the general idea. Shuffle the DataFrame and split it in half, then modify one half and join back. Renaming the columns on one half is enough to swap the values, because pd.concat re-aligns on column names when joining.
import numpy as np
np.random.seed(410112)
dfs = np.array_split(df.sample(frac=1), 2) # Shuffle then split in 1/2
# On one half set result to 0 and swap the columns
dfs[1]['result'] = 0
dfs[1] = dfs[1].rename(columns={'c1': 'c3', 'c2': 'c4', 'c3': 'c1', 'c4': 'c2'})
# Join Back
df = pd.concat(dfs).sort_index()
c1 c2 c3 c4 result
0 a b c d 1
1 d a b c 0
2 d b a e 0
3 g a f c 1
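An alternative sketch that swaps in place instead of splitting and re-concatenating; the frame literal is reconstructed from the question:
import pandas as pd

df = pd.DataFrame([list('abcd'), list('bcda'), list('aedb'), list('gafc')],
                  columns=['c1', 'c2', 'c3', 'c4'])
df['result'] = 1

idx = df.sample(frac=0.5, random_state=410112).index  # pick 50% of the rows
# .to_numpy() bypasses column alignment, so the values actually move
df.loc[idx, ['c1', 'c2', 'c3', 'c4']] = df.loc[idx, ['c3', 'c4', 'c1', 'c2']].to_numpy()
df.loc[idx, 'result'] = 0
print(df)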
I have a df:
df =
c1 c2 c3 c4 c5
0 K 6 nan Y V
1 H nan g 5 nan
2 U B g Y L
And a string
s = 'HKg5'
I want to return rows where s[0] equals the value of c1, s[1] equals the value of c2, and so on; a row should still count as a match where its value is nan.
For example, row 1 in df above matches with the string
row 1=
c1 c2 c3 c4 c5
1 H nan g 5 nan
match=True, even though c2 and c5 are nan
s = H K g 5
Also, the string length is dynamic, so my df columns can go above c10.
I am trying df.apply but I can't figure it out. I want to write a function to pass to df.apply, passing the string at the same time.
Thanks for any help!
Output from Chris's answer
df=
c1 c2 c3 c4 c5
0 K 6 NaN Y V
1 H NaN g 5 NaN
2 U B g Y L
s = 'HKg5'
s1 = pd.Series(list(s), index=[f'c{x+1}' for x in range(len(s))])
df.loc[((df == s1) | (df.isna())).all(1)]
Output (only the header prints, i.e. no rows matched):
c1 c2 c3 c4 c5
Create a helper Series from your string and use boolean logic to filter:
s1 = pd.Series(list(s), index=[f'c{x+1}' for x in range(len(s))])
# print(s1)
# c1 H
# c2 K
# c3 g
# c4 5
# dtype: object
The logic is: df equals (==) the value OR (|) the cell is nan (isna).
Then use all along axis 1 to keep only the rows where every column is True:
df.loc[((df == s1) | (df.isna())).all(1)]
[out]
c1 c2 c3 c4 c5
1 H NaN g 5 NaN
So, as a function, you could do:
def df_match_string(frame, string):
    s1 = pd.Series(list(string), index=[f'c{x+1}' for x in range(len(string))])
    return ((frame == s1) | (frame.isna())).all(1)
df_match_string(df, s)
[out]
0 False
1 True
2 False
dtype: bool
Update
I can't reproduce your issue with the example provided. My guess is that some of the values in your DataFrame may have leading/trailing whitespace?
Before trying the above solution, try this preprocessing step:
for col in df:
    df[col] = df[col].str.strip()
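For completeness, a runnable sketch combining the strip step with the matcher above; the frame literal is reconstructed from the question, with whitespace injected into one cell to mimic the suspected problem:
import pandas as pd
import numpy as np

df = pd.DataFrame({'c1': ['K', 'H', 'U'], 'c2': ['6', np.nan, 'B'],
                   'c3': [np.nan, 'g', 'g'], 'c4': ['Y', '5 ', 'Y'],
                   'c5': ['V', np.nan, 'L']})  # note the trailing space in '5 '

for col in df:
    df[col] = df[col].str.strip()  # normalize before matching

print(df[df_match_string(df, 'HKg5')])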