I have a dataframe which looks like this:

   a     b     c  ...     z
0  1  NULL  NULL  ...     1
1  2     1  NULL  ...  NULL
2  3  NULL     1  ...  NULL

The first column (a) is always populated, and there are many others to the right of it. Of columns b through z, exactly one per row is populated; the rest are not.
I would like to transform this dataframe into a two-column dataframe with the headers of columns b through z in the second column. The example above would be transformed to this:

   a The_Column
0  1          z
1  2          b
2  3          c
The pandas.melt() function is close to what I need, but it doesn't handle the NULL values. I only care about the one cell in columns b through z which is populated.
Is there an elegant way to handle this problem?
You need melt and then dropna(), that's it. This should work:

df.melt(id_vars='a').dropna(subset=['value'])
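For reference, a minimal reproduction of this approach (a sketch, assuming a toy frame with only b, c, and z as value columns):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [np.nan, 1, np.nan],
                   'c': [np.nan, np.nan, 1],
                   'z': [1, np.nan, np.nan]})

# melt keeps 'a' as an identifier and stacks the remaining headers into
# a 'variable' column; dropna removes the unpopulated combinations
out = (df.melt(id_vars='a')
         .dropna(subset=['value'])
         .rename(columns={'variable': 'The_Column'})
         .sort_values('a')[['a', 'The_Column']])
print(out)
#    a The_Column
# 6  1          z
# 1  2          b
# 5  3          c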
Using stack (which drops NaNs by default):

x = (df.set_index('a')
       .stack()
       .reset_index()
       .drop(columns=0)
       .rename(columns={'level_1': 'The_Column'}))
print(x)
Output:

   a The_Column
0  1          z
1  2          b
2  3          c
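For what it's worth, since exactly one of the value columns is populated per row, the populated header can also be read off with idxmax over a notna() mask (a sketch, using the same toy frame as in the reproduction above):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [np.nan, 1, np.nan],
                   'c': [np.nan, np.nan, 1],
                   'z': [1, np.nan, np.nan]})

# idxmax(axis=1) returns the label of the first True per row, which here
# is the single non-null column
print(df.set_index('a').notna().idxmax(axis=1))
# a
# 1    z
# 2    b
# 3    c
# dtype: object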
I have a dataframe which has some unique IDs in two of the columns, e.g.:

S.no.  Column1  Column2
1      00001x   00002x
2      00003j   00005k
3      00002x   00001x
4      00004d   00008e

The values can be any string.
I want to compare the two columns in such a way that only one of rows 1 and 3 remains, since these IDs contain the same information; only their order differs.
Basically, if one row has X in Column1 and Y in Column2, and another row has Y in Column1 and X in Column2, then only one of those rows should remain.
Is that possible in Python?
You can convert your columns to a frozenset per row.
This gives an order-independent (and hashable) representation on which to apply duplicated.
Finally, slice the rows using the previous output as a mask:
mask = df.filter(like='Column').apply(frozenset, axis=1).duplicated()
df[~mask]
A previous version of this answer used set:
mask = df.filter(like='Column').apply(lambda x: tuple(set(x)), axis=1).duplicated()
df[~mask]
NB: using set or sorted requires converting to a tuple (lambda x: tuple(sorted(x))), as the duplicated function hashes the values, which is not possible with mutable objects.
Output:

   S.no. Column1 Column2
0      1  00001x  00002x
1      2  00003j  00005k
3      4  00004d  00008e
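To illustrate the point about order and hashability, a minimal check (independent of the dataframe):

# frozensets compare equal regardless of element order, and are hashable,
# so duplicated() can hash them directly
print(frozenset(['00001x', '00002x']) == frozenset(['00002x', '00001x']))  # True

# plain sets are unhashable, so duplicated() would fail on them;
# a sorted tuple is the hashable, order-normalized workaround
print(tuple(sorted(['00002x', '00001x'])))  # ('00001x', '00002x')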
As the title says, I would like to find a way to erase the values in a dataframe from a given column through to the end of the dataframe, but I can't find a way to do so.
I would like to start with
A  B  C
-------
1  1  1
1  1  1
1  1  1

and get

A  B  C
-------
1
1
1
I was trying with

df.drop(df.loc[:, 'B':].columns, axis=1, inplace=True)

But this deletes the columns themselves too:
A
-
1
1
1
Am I missing something?
If you only know the column name that you want to keep:
import pandas as pd
new_df = pd.DataFrame(df["A"])
If you only know the column names that you want to drop:
new_df = df.drop(["B", "C"], axis=1)
For your case, to keep the columns, but remove the content, one possible way is:
new_df = pd.DataFrame(df["A"], columns=df.columns)
The resulting df contains column "A" with its values, and columns "B" and "C" without values (NaN instead).
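For reference, a minimal end-to-end sketch of that last approach on the 3x3 example:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1], 'B': [1, 1, 1], 'C': [1, 1, 1]})

# Building a new frame from the 'A' Series and reindexing to the full
# column list keeps A's values and leaves B and C as all-NaN
new_df = pd.DataFrame(df["A"], columns=df.columns)
print(new_df)
#    A   B   C
# 0  1 NaN NaN
# 1  1 NaN NaN
# 2  1 NaN NaN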
I have a pandas dataframe as below. How can I drop any column which is a subset of any of the remaining columns? I would like to do this without using fillna.
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 3, 3], [np.nan, 2, np.nan, 4]], columns=['A', 'B', 'C', 'D'])
df

     A  B    C  D
0  1.0  1  3.0  3
1  NaN  2  NaN  4
I can identify here that column A is a subset of B and column C is a subset of D with something like this:

if all(df['A'][df['A'].notnull()].isin(df['B']))
I could run a loop over all columns and drop the subset columns. But is there a more efficient way to accomplish this, so that I have the following result:
df

   B  D
0  1  3
1  2  4
Thanks.
It still requires iteration, but you can use this list comprehension (with an if statement similar to the one you provided) to get columns to keep:
keep_cols = [x for x in df if not any(df.drop(x, axis=1).apply(lambda y: df[x].dropna().isin(y).all()))]
# ['B', 'D']
And then use the result with filter:
df.filter(items=keep_cols)
# B D
# 0 1 3
# 1 2 4
This should be fast enough, since it still uses apply at its core, and seems to be safer/more efficient than dropping columns within a loop.
If you're keen on a one-line solution, of course assigning the list to a variable is an optional step:
df.filter(items=[x for x in df if not any(df.drop(x, axis=1).apply(lambda y: df[x].dropna().isin(y).all()))])
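A quick end-to-end check with the example frame (a sketch of the same comprehension):

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 3, 3], [np.nan, 2, np.nan, 4]],
                  columns=['A', 'B', 'C', 'D'])

# Keep a column only if its non-null values are not all contained
# in one of the other columns
keep_cols = [x for x in df
             if not any(df.drop(x, axis=1)
                          .apply(lambda y: df[x].dropna().isin(y).all()))]
print(keep_cols)                  # ['B', 'D']
print(df.filter(items=keep_cols))
#    B  D
# 0  1  3
# 1  2  4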
I have a very simple dataframe like so:
In [8]: df
Out[8]:
A B C
0 2 a a
1 3 s 3
2 4 c !
3 1 f 1
My goal is to extract the first row in such a way that it looks like this:
A B C
0 2 a a
As you can see the dataframe shape (1x3) is preserved and the first row still has 3 columns.
However, when I type the command df.loc[0], the output is this:
df.loc[0]
Out[9]:
A 2
B a
C a
Name: 0, dtype: object
As you can see, the row has turned into a column with 3 rows! (1x3 has become 3x1.) How is this possible? How can I simply extract the row and preserve its shape as described in my goal? Could you provide a smart and elegant way to do it?
I tried to use the transpose command .T but without success... I know I could create another dataframe where the columns are extracted from the original dataframe, but that way is quite tedious and not elegant, I would say (pd.DataFrame({'A':[2], 'B':'a', 'C':'a'})).
Here is the dataframe if you need it:
import pandas as pd
df = pd.DataFrame({'A':[2,3,4,1], 'B':['a','s','c','f'], 'C':['a', 3, '!', 1]})
You need to add [] to get a DataFrame:
#select by index value
print (df.loc[[0]])
A B C
0 2 a a
Or:
print (df.iloc[[0]])
A B C
0 2 a a
If you need to transpose the Series, first convert it to a DataFrame with to_frame:
print (df.loc[0].to_frame())
0
A 2
B a
C a
print (df.loc[0].to_frame().T)
A B C
0 2 a a
Using a range selector will preserve the DataFrame format.
df.iloc[0:1]
Out[221]:
A B C
0 2 a a
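Both selectors return a one-row DataFrame rather than a Series; a quick shape check:

import pandas as pd

df = pd.DataFrame({'A': [2, 3, 4, 1],
                   'B': ['a', 's', 'c', 'f'],
                   'C': ['a', 3, '!', 1]})

print(df.loc[[0]].shape)   # (1, 3) -- DataFrame, shape preserved
print(df.iloc[0:1].shape)  # (1, 3) -- same via the range selector
print(df.loc[0].shape)     # (3,)   -- a Series, hence the sideways display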
I am trying to essentially do a COUNTIF in pandas to count how many items in a row match a number in the first column.
Dataframe:
a b c d
1 2 3 1
2 3 4 2
3 5 6 3
So I want to count instances in a row (b, c, d) that match a. In row 1, for instance, the count should be 1, as only d matches a.
I have searched quite a bit for this, but so far have only found examples that compare against a constant (like counting all values more than 0), not against a dataframe column. I'm guessing it's some form of logic that masks based on the column, but df == df.a doesn't seem to work.
You can use eq, to which you can pass an axis parameter specifying the direction of the comparison; then take a row sum to count the matched values (subtracting 1 so that a matching itself is not counted):
df.eq(df.a, axis=0).sum(1) - 1
#0 1
#1 1
#2 1
#dtype: int64
Alternatively, row-wise with apply:

df.apply(lambda x: (x == x.iloc[0]).sum() - 1, axis=1)
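A quick check that both approaches agree on the example data:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 5],
                   'c': [3, 4, 6],
                   'd': [1, 2, 3]})

# Compare every column against 'a' down the rows, sum the matches per
# row, and subtract 1 so 'a' does not count itself
print(df.eq(df.a, axis=0).sum(1) - 1)
print(df.apply(lambda x: (x == x.iloc[0]).sum() - 1, axis=1))
# Both print:
# 0    1
# 1    1
# 2    1
# dtype: int64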