data frame issue
ID  C1  C2  M1
1   A   B   X
2       A   Y
3   C       W
4   G   H   Z
result wanted:
ID  C
1   A
1   B
2   A
3   C
4   G
4   H
The main problem is that the first dataset today has C1 and C2; tomorrow we could have C1, C2, C3, ..., Cn. The filename will be provided, and my task is to read it and produce the result regardless of how many C-related columns the file may have. The M1 column is not needed.
----- What I tried:
import pandas as pd

df = pd.read_csv(r"C:\Users\JIRAdata_TEST.csv")
df = df.filter(regex='ID|C')
print(df)
This returns all the ID and C-related columns and drops the M1 column as part of data cleanup; I don't know if that helps.
Then... I am stuck!
Use df.melt with df.dropna:
In [1295]: x = df.filter(regex='ID|C').melt('ID', value_name='C').sort_values('ID').dropna().drop(columns='variable')
In [1296]: x
Out[1296]:
ID C
0 1 A
4 1 B
5 2 A
2 3 C
3 4 G
7 4 H
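If the one-liner is hard to follow, here is a minimal step-by-step sketch of the same pipeline, rebuilding the sample frame from the question's values:

import pandas as pd
import numpy as np

# Sample data from the question (blank cells become NaN).
df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'C1': ['A', np.nan, 'C', 'G'],
    'C2': ['B', 'A', np.nan, 'H'],
    'M1': ['X', 'Y', 'W', 'Z'],
})

kept = df.filter(regex='ID|C')           # keeps ID plus however many C columns exist
long = kept.melt('ID', value_name='C')   # one row per (ID, Cn) cell
result = (long.dropna()                  # drop the empty cells
              .sort_values('ID')
              .drop(columns='variable')  # the Cn label is no longer needed
              .reset_index(drop=True))
print(result)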
I am trying to extract the top-level domain (TLD), second-level domain (SLD), etc. from a column in a dataframe and add them as new columns. Currently I have a solution where I convert the column to lists and expand them with tolist, but since the parts are appended sequentially, the mapping gets messed up when a URL has three levels instead of two.
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4],"C":["xyz[.]com","abc123[.]pro","xyzabc[.]gouv[.]fr"]})
df['C'] = df.C.apply(lambda x: x.split('[.]'))
df.head()
A B C
0 1 2 [xyz, com]
1 2 3 [abc123, pro]
2 3 4 [xyzabc, gouv, fr]
d = [pd.DataFrame(df[col].tolist()).add_prefix(col) for col in df.columns]
df = pd.concat(d, axis=1)
df.head()
A0 B0 C0 C1 C2
0 1 2 xyz com None
1 2 3 abc123 pro None
2 3 4 xyzabc gouv fr
I want C2 to always contain the TLD (com, pro, fr) and C1 to always contain the SLD.
I am sure there is a better way to do this correctly and would appreciate any pointers.
You can shift the Cx columns to the right, row by row, by each row's number of missing values:
df.loc[:, "C0":] = df.loc[:, "C0":].apply(
lambda x: x.shift(periods=x.isna().sum()), axis=1
)
print(df)
Prints:
A0 B0 C0 C1 C2
0 1 2 NaN xyz com
1 2 3 NaN abc123 pro
2 3 4 xyzabc gouv fr
You can also use a regex with a negative lookahead and pandas' built-in str.split with expand=True:
df[['C0', 'C2']] = df.C.str.split(r'\[\.\](?!.*\[\.\])', expand=True)
df[['C0', 'C1']] = df.C0.str.split(r'\[\.\]', expand=True)
That gives:
A B C C0 C2 C1
0 1 2 xyz[.]com xyz com None
1 2 3 abc123[.]pro abc123 pro None
2 3 4 xyzabc[.]gouv[.]fr xyzabc fr gouv
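For completeness, a minimal sketch of a third approach that avoids both the shift and the regex by left-padding the split lists; the C0..Cn column names follow the question's convention:

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4],
                   "C": ["xyz[.]com", "abc123[.]pro", "xyzabc[.]gouv[.]fr"]})

# Split on the literal '[.]' separator, then left-pad each list so the
# TLD always lands in the last column and the SLD just before it.
parts = df["C"].apply(lambda s: s.split("[.]"))
width = parts.map(len).max()
aligned = parts.apply(lambda p: [None] * (width - len(p)) + p)

expanded = pd.DataFrame(aligned.tolist(), index=df.index)
expanded.columns = [f"C{i}" for i in range(width)]
print(pd.concat([df, expanded], axis=1))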
I have two Series (df1 and df2) of equal length that need to be combined into a single DataFrame column, as follows. Each index holds one value or no value, never two (e.g., if df1 has the value 'A' at index 0, then df2 is empty at index 0, and vice versa).
df1 =  c1     df2 =  c2
0      A      0
1      B      1
2             2      C
3      D      3
4      E      4
5             5      F
6             6
7      G      7
The result I want is this:
0 A
1 B
2 C
3 D
4 E
5 F
6
7 G
I have tried .concat, .append and .union, but these do not produce the desired result. What is the correct approach then?
You can try this:
df1['new'] = df1['c1'] + df2['c2']
For an in-place solution, I recommend pd.Series.replace:
df1['c1'].replace('', df2['c2'], inplace=True)
print(df1)
c1
0 A
1 B
2 C
3 D
4 E
5 F
6
7 G
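If the empty slots are NaN rather than empty strings, combine_first is the idiomatic tool; a minimal sketch with the question's data:

import pandas as pd
import numpy as np

s1 = pd.Series(['A', 'B', np.nan, 'D', 'E', np.nan, np.nan, 'G'], name='c1')
s2 = pd.Series([np.nan, np.nan, 'C', np.nan, np.nan, 'F', np.nan, np.nan], name='c2')

# combine_first fills s1's missing entries with s2's value at the same
# index; index 6, missing in both, stays NaN.
print(s1.combine_first(s2))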
There is a pandas dataframe:
df = pd.DataFrame({'c1':['a','b','c','d'],'c2':[1,2,3,4]})
c1 c2
0 a 1
1 b 2
2 c 3
3 d 4
And a pandas Series:
list1 = pd.Series(['b','c','e','f'])
Out[6]:
0    b
1    c
2    e
3    f
dtype: object
How do I create a new dataframe that contains the rows where c1 is in list1?
Desired output:
c1 c2
0 b 2
1 c 3
You can use df.isin:
In [582]: df[df.c1.isin(list1)]
Out[582]:
c1 c2
1 b 2
2 c 3
Or, using df.loc, if you want to modify your slice:
In [584]: df.loc[df.c1.isin(list1), :]
Out[584]:
c1 c2
1 b 2
2 c 3
Using query:
In [1133]: df.query('c1 in @list1')
Out[1133]:
c1 c2
1 b 2
2 c 3
Or, using isin
In [1134]: df[df.c1.isin(list1)]
Out[1134]:
c1 c2
1 b 2
2 c 3
Both @JohnGalt's and @COLDSPEED's answers are more idiomatic pandas. Please don't actually use the answers below; they are intended to be fun and to illustrate other parts of the pandas and numpy API.
Alt 1
This utilizes numpy.in1d, which acts as a proxy for pd.Series.isin:
df[np.in1d(df.c1.values, list1.values)]
c1 c2
1 b 2
2 c 3
Alt 2
Use set logic, building a boolean mask from per-element set intersections:
df[df.c1.apply(lambda x: bool({x} & set(list1)))]
c1 c2
1 b 2
2 c 3
Alt 3
Use pd.Series.str.match
df[df.c1.str.match('|'.join(list1))]
c1 c2
1 b 2
2 c 3
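Note that str.match only anchors at the start of the string, so a stray value like 'bb' would also match 'b|c|e|f'. With pandas >= 1.1 you can require a whole-string match instead:
df[df.c1.str.fullmatch('|'.join(list1))]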
For the sake of completeness,
yet another way (definitely not the best one) to achieve this:
In [4]: df.merge(list1.to_frame(name='c1'))
Out[4]:
c1 c2
0 b 2
1 c 3
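The inverse filter ('not in') is simply the negated boolean mask; a minimal sketch:

import pandas as pd

df = pd.DataFrame({'c1': ['a', 'b', 'c', 'd'], 'c2': [1, 2, 3, 4]})
list1 = pd.Series(['b', 'c', 'e', 'f'])

print(df[df.c1.isin(list1)])    # 'in'     -> rows with b and c
print(df[~df.c1.isin(list1)])   # 'not in' -> rows with a and d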
I am trying to add column d across d1 and d2 wherever columns a, b, and c are the same (like a groupby).
For example, given
d1 = pd.DataFrame([[1,2,3,4]],columns=['a','b','c','d'])
d2 = pd.DataFrame([[1,2,3,4],[2,3,4,5]],columns=['a','b','c','d'])
then I'd like to get an output as
a b c d
0 1 2 3 8
1 2 3 4 5
That is, merge the two data frames and add the d values where a, b, c are the same. d1.add(d2) or radd gives me an element-wise sum over all columns instead.
The solution should be a DataFrame that can itself be added to another one in the same way.
Any help is appreciated.
You can use set_index first:
print(d2.set_index(['a','b','c'])
        .add(d1.set_index(['a','b','c']), fill_value=0)
        .astype(int)
        .reset_index())
a b c d
0 1 2 3 8
1 2 3 4 5
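Since the result should itself be addable to another frame the same way, the set_index pattern can be wrapped in a small helper (add_on_keys is a hypothetical name of my own):

import pandas as pd

def add_on_keys(left, right, keys=('a', 'b', 'c')):
    # Align both frames on the key columns, add d with missing rows
    # counted as 0, and return a flat frame usable as input again.
    keys = list(keys)
    return (left.set_index(keys)
                .add(right.set_index(keys), fill_value=0)
                .astype(int)
                .reset_index())

d1 = pd.DataFrame([[1, 2, 3, 4]], columns=['a', 'b', 'c', 'd'])
d2 = pd.DataFrame([[1, 2, 3, 4], [2, 3, 4, 5]], columns=['a', 'b', 'c', 'd'])

d3 = add_on_keys(d2, d1)   # -> (1,2,3,8) and (2,3,4,5)
d4 = add_on_keys(d3, d2)   # the result can be combined again
print(d4)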
Alternatively, concatenate and aggregate with groupby:
df = pd.concat([d1, d2]).groupby(['a','b','c'], as_index=False).sum()
print(df)
   a  b  c  d
0  1  2  3  8
1  2  3  4  5
I have a dataset like:
a b c
1 x1 c1
2 x2 c2
3 x3 c3
and I want to apply a function f only to column b.
I did something like:
d2 = d['b'].apply(f)
But that gives a result like:
a b
1 xt
2 xt
3 xt
But I want to keep column c, with a result like:
a b c
1 xt c1
2 xt c2
3 xt c3
How can I do that without merging with the first dataset?
I suggest you avoid apply, because it is slower; it is better to use pandas API functions.
E.g., if you need to replace the column with a new constant value:
df['b'] = 'xt'
print (df)
a b c
0 1 xt c1
1 2 xt c2
2 3 xt c3
But if apply is necessary:
def f(x):
    return 'xt'
df['b'] = df.b.apply(f)
print (df)
a b c
0 1 xt c1
1 2 xt c2
2 3 xt c3
If you need a new DataFrame, first use copy:
d = df.copy()
def f(x):
    return 'xt'
d['b'] = d.b.apply(f)
print (d)
a b c
0 1 xt c1
1 2 xt c2
2 3 xt c3