I have a dataframe where I need to drop rows if any of the combinations in my nested list are met. Here's the sample dataframe:
df = pd.DataFrame([['A','Green',10],['A','Red',20],['B','Blue',5],['B','Red',15],['C','Orange',25]],columns = ['Letter','Color','Value'])
print df
Letter Color Value
0 A Green 10
1 A Red 20
2 B Blue 5
3 B Red 15
4 C Orange 25
I have a list of letter/color combinations that I need to remove from the dataframe:
dropList = [['A','Green'],['B','Red']]
How can I drop rows from the dataframe where the letter/color combination appears in any of the nested lists?
Approaches I can do if necessary, but want to avoid:
Write a .apply function
Any form of brute force iteration
Convert the dropList to a df and merge
#df_out = code here to drop if letter/color combo appears in my droplist
print df_out
Letter Color Value
0 A Red 20
1 B Blue 5
2 C Orange 25
I imagine there is some simple one/two line solution that I just can't see...Thanks!
You can create a helper DF:
In [36]: drp = pd.DataFrame(dropList, columns=['Letter','Color'])
Merge (left) your main DF with the helper DF and select only those rows that are missing from the right DF:
In [37]: df.merge(drp, how='left', indicator=True) \
             .query("_merge=='left_only'") \
             .drop('_merge', axis=1)
Out[37]:
Letter Color Value
1 A Red 20
2 B Blue 5
4 C Orange 25
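Because no on= is passed, the merge keys default to the columns the two frames have in common (Letter and Color here). If your real df shares other column names with the helper DF, naming the keys explicitly avoids surprises; a hedged variant of the same call:
df_out = (df.merge(drp, on=['Letter', 'Color'], how='left', indicator=True)
            .query("_merge == 'left_only'")
            .drop(columns='_merge'))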
You can use the difference between the Letter/Color index and dropList to reindex the DF.
result = (
    df.set_index(['Letter','Color'])
      .pipe(lambda x: x.reindex(x.index.difference(dropList)))
      .reset_index()
)
result
Out[45]:
Letter Color Value
0 A Red 20
1 B Blue 5
2 C Orange 25
Here is a crazy use of isin(), though my first choice would be @MaxU's solution:
new_df = df[~df[['Letter', 'Color']].apply(','.join, axis=1).isin([s[0] + ',' + s[1] for s in dropList])]
Letter Color Value
1 A Red 20
2 B Blue 5
4 C Orange 25
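A variant of the same isin() idea that skips the string concatenation is to compare (Letter, Color) tuples directly; a small sketch, not from the original answer:
# Build a Series of (Letter, Color) tuples aligned with df's index and test membership.
pairs = pd.Series(list(zip(df['Letter'], df['Color'])), index=df.index)
new_df = df[~pairs.isin([tuple(s) for s in dropList])]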
Multi-indexing on the columns you use in dropList should do what you're after. Subtract the elements to be dropped from the full set of multiindex elements, then slice the dataframe by that remainder.
Note that the elements of dropList need to be tuples for the lookup.
dropSet = {tuple(elem) for elem in dropList}
# Creates a multi-index on letter/colour.
temp = df.set_index(['Letter', 'Color'])
# Keep all elements of the index except those in droplist.
temp = temp.loc[list(set(temp.index) - dropSet)]
# Reset index to get the original column layout.
df_dropped = temp.reset_index()
This returns:
In [4]: df_dropped
Out[4]:
Letter Color Value
0 B Blue 5
1 A Red 20
2 C Orange 25
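Note that the set subtraction does not preserve the original row order (B Blue comes out before A Red above). If order matters, a boolean mask on the multi-index keeps it; a sketch along the same lines, not part of the original answer:
temp = df.set_index(['Letter', 'Color'])
# ~isin keeps df's original row order, unlike the set subtraction above.
df_dropped = temp.loc[~temp.index.isin(dropSet)].reset_index()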
Transform the list of lists into a dictionary (this assumes each letter appears only once in dropList, since duplicate keys would be collapsed):
mapper = dict(dropList)
Now filter rows out by mapping the dictionary over the dataframe:
df[df.Letter.map(mapper) != df.Color]
Yields
Letter Color Value
1 A Red 20
2 B Blue 5
4 C Orange 25
This post is inspired by @Wen's solution to a later problem, please upvote there.
df2 = pd.DataFrame(dropList, columns=['Letter', 'Color'])
# Left-merge on Letter/Color: rows that match dropList get a non-null 'a',
# and their index positions are then excluded from df.
df.loc[~df.index.isin(df.merge(df2.assign(a='key'), how='left').dropna().index)]
I have this example CSV file:
Name,Dimensions,Color
Chair,!12:88:33!!9:10:50!!40:23:11!,Red
Table,!9:10:50!!40:23:11!,Brown
Couch,!40:23:11!!12:88:33!,Blue
I read it into a dataframe, then split Dimensions by ! and take the first value of each !..:..:..!-section. I append these as new columns to the dataframe, and delete Dimensions. (code for this below)
import pandas as pd
df = pd.read_csv("./data.csv")
df[["first","second","third"]] = (df['Dimensions']
.str.strip('!')
.str.split('!{1,}', expand=True)
.apply(lambda x: x.str.split(':').str[0]))
df = df.drop("Dimensions", axis=1)
And I get this:
Name Color first second third
0 Chair Red 12 9 40
1 Table Brown 9 40 None
2 Couch Blue 40 12 None
I named them ["first","second","third"] manually here.
But what if there are more than 3 in the future, or only 2, or I don't know how many there will be, and I want them to be named using a string + an enumerating number?
Like this:
Name Color data_0 data_1 data_2
0 Chair Red 12 9 40
1 Table Brown 9 40 None
2 Couch Blue 40 12 None
Question:
How do I make the naming automatic, based on the string "data_" so it gives each column the name "data_" + the number of the column? (So I don't have to type in names manually)
Use DataFrame.pop to extract and drop the Dimensions column in one step, add a prefix to the default column names with DataFrame.add_prefix, and append the result back to the original DataFrame with DataFrame.join:
df = (df.join(df.pop('Dimensions')
                .str.strip('!')
                .str.split('!{1,}', expand=True)
                .apply(lambda x: x.str.split(':').str[0]).add_prefix('data_')))
print (df)
Name Color data_0 data_1 data_2
0 Chair Red 12 9 40
1 Table Brown 9 40 None
2 Couch Blue 40 12 None
Nevermind, hahah, I solved it.
import pandas as pd
df = pd.read_csv("./data.csv")
df2 = (df['Dimensions']
         .str.strip('!')
         .str.split('!{1,}', expand=True)
         .apply(lambda x: x.str.split(':').str[0]))
df[[ ("data_"+str(i)) for i in range(len(df2.columns)) ]] = df2
df = df.drop("Dimensions", axis=1)
This question might be common, but I am new to Python and would like to learn more from the community. I have 2 map files which have data mapping like this:
map1 : A --> B
map2 : B --> C,D,E
I want to create a new map file which will be A --> C
What is the most efficient way to achieve this in Python? A generic approach would be very helpful, as I need to apply the same logic to different files and different columns.
Example:
Map1:
1,100
2,453
3,200
Map2:
100,25,30,
200,300,,
250,190,20,1
My map3 should be:
1,25
2,0
3,300
As 453 is not present in map2, our map3 contains value 0 for key 2.
First create DataFrames:
df1 = pd.read_csv(Map1, header=None)
df2 = pd.read_csv(Map2, header=None)
Then use Series.map on the second column with a Series created from df2 by setting its first column as the index, and finally replace missing values with 0 for unmatched keys:
df1[1] = df1[1].map(df2.set_index(0)[1]).fillna(0, downcast='int')
print (df1)
0 1
0 1 25
1 2 0
2 3 300
EDIT: to map multiple columns, use a left join, remove the columns where all values are missing with DataFrame.dropna, drop the join columns b and c, and finally replace the remaining missing values:
df1.columns=['a','b']
df2.columns=['c','d','e','f']
df = (df1.merge(df2, how='left', left_on='b', right_on='c')
         .dropna(how='all', axis=1)
         .drop(['b','c'], axis=1)
         .fillna(0)
         .convert_dtypes())
print (df)
a d e
0 1 25 30
1 2 0 0
2 3 300 0
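Since the question asks for a generic approach that can be reused on different files and columns, the single-column version above can be wrapped in a small helper. This is only a sketch: the function name, file paths, and parameters are hypothetical, and it assumes both maps are header-less CSV files as in the example.
import pandas as pd

def combine_maps(map1_path, map2_path, out_path, key_col=1, value_col=1):
    # Read both maps without headers, as in the answer above.
    df1 = pd.read_csv(map1_path, header=None)
    df2 = pd.read_csv(map2_path, header=None)
    # Map df1's key column through df2 (first column as lookup index), defaulting to 0.
    df1[key_col] = df1[key_col].map(df2.set_index(0)[value_col]).fillna(0, downcast='int')
    df1.to_csv(out_path, header=False, index=False)

combine_maps('Map1.csv', 'Map2.csv', 'Map3.csv')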
I have the following dataframe:
df
Group Dist
0 A 5
1 B 2
2 A 3
3 B 1
4 B 0
5 A 5
I am trying to drop all rows of a Group if the Dist column equals zero anywhere in that group. This works to delete row 4:
df = df[df.Dist != 0]
however I also want to delete rows 1 and 3 so I am left with:
df
Group Dist
0 A 5
2 A 3
5 A 5
Any ideas on how to drop the groups based on this condition?
Thanks!
First get all Group values where Dist == 0, then filter them out by checking the Group column against them and inverting the mask with ~:
df1 = df[~df['Group'].isin(df.loc[df.Dist == 0, 'Group'])]
print (df1)
Group Dist
0 A 5
2 A 3
5 A 5
Or you can use GroupBy.transform with all to test whether a group has no 0 values:
df1 = df[(df.Dist != 0).groupby(df['Group']).transform('all')]
EDIT: to remove all groups that contain missing values:
df2 = df[df['Dist'].notna().groupby(df['Group']).transform('all')]
To test for missing values:
print (df[df['Dist'].isna()])
If it returns nothing, there are no missing values (NaN or None). It is also possible to check a scalar, e.g. the value in the row with index 10:
print (df.loc[10, 'Dist'])
print (type(df.loc[10, 'Dist']))
You can use groupby and the filter method:
df.groupby('Group').filter(lambda x: x['Dist'].ne(0).all())
Output:
Group Dist
0 A 5
2 A 3
5 A 5
If you want to filter out groups with missing values:
df.groupby('Group').filter(lambda x: x['Dist'].notna().all())
Data aggregation parsed from the file at the moment:
obj price1*red price1*blue price2*red price2*blue
a 5 7 10 12
b 15 17 20 22
desired outcome:
obj color price1 price2
a red 5 7
a blue 10 12
b red 15 17
b blue 20 22
This example is simplified. The real use case consists of 404 columns and on the order of 10'000 rows. The data has around 99 colors and 4 different kinds of price lists (there are always 4 price lists).
I already tried a different approach from another part I programmed before in Python:
df_pricelist = pd.melt(df_pricelist, id_vars=["object_nr"], var_name='color', value_name='prices')
but that approach was originally used to pivot data from a single attribute into multiple rows, or in other words only 1 cell for the different price lists instead of multiple cells.
There I also used assign to add the different blocks of the string to different column cells.
To get all the relevant columns into the dataframe I use str.startswith, so I don't have to know in advance all the different colors there could be.
A solution that makes use of a MultiIndex as an intermediate step:
import pandas as pd
# Construct example dataframe
col_names = ["obj", "price1*red", "price1*blue", "price2*red", "price2*blue"]
data = [
    ["a", 5, 7, 10, 12],
    ["b", 15, 17, 20, 22],
]
df = pd.DataFrame(data, columns=col_names)
# Convert objects column into rows index
df2 = df.set_index("obj")
# Convert columns index into two-level multi-index by splitting name strings
color_price_pairs = [tuple(col_name.split("*")) for col_name in df2.columns]
df2.columns = pd.MultiIndex.from_tuples(color_price_pairs, names=("price", "color"))
# Stack colors-level of the columns index into a rows index level
df2 = df2.stack()
df2.columns.name = ""
# Optional: convert rows index (containing objects and colors) into columns
df2 = df2.reset_index()
This is a print-out that shows both the original dataframe df and the result dataframe df2:
In [1] df
Out[1]:
obj price1*red price1*blue price2*red price2*blue
0 a 5 7 10 12
1 b 15 17 20 22
In [2]: df2
Out[2]:
obj color price1 price2
0 a blue 7 12
1 a red 5 10
2 b blue 17 22
3 b red 15 20
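For comparison, the same reshape can be written with pd.wide_to_long, which does the column-name splitting internally; a sketch that assumes the price-list prefixes (price1, price2) are known up front, which seems reasonable here since there are always four price lists:
df2 = (pd.wide_to_long(df, stubnames=['price1', 'price2'],
                       i='obj', j='color', sep='*', suffix=r'\w+')
         .reset_index())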
I have the following data frame:
import pandas as pd
df = pd.DataFrame({'a':[0,0,1,1], 'b':[0,1,0,1],'tag':['apple','orange','grapes','lemon']})
df = df[["tag","a","b"]]
That looks like this:
In [37]: df
Out[37]:
tag a b
0 apple 0 0
1 orange 0 1
2 grapes 1 0
3 lemon 1 1
What I want to do is remove rows where all the numerical columns are zero, resulting in this:
tag a b
orange 0 1
grapes 1 0
lemon 1 1
How can I achieve that?
Note that in actuality the number of columns can be more than 2 and the column names can vary, so we need a general solution.
I tried this, but it has no effect:
df[(df.T != 0).any()]
There are a few different things going on in this answer; let me know if anything is particularly confusing:
df.loc[~ (df.select_dtypes(include=['number']) == 0).all(axis='columns'), :]
So:
Filtering to find just the numeric columns
Applying the .all() method across columns rather than rows (rows is the default)
Negating with ~
Passing the resulting boolean series to df.loc[]
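As an aside, the attempt from the question, df[(df.T != 0).any()], has no effect because the comparison also includes the non-numeric tag column; a string like 'apple' never equals 0, so every row produces at least one True. Restricting the test to the numeric columns fixes that; the same check written positively, as a small sketch:
# Keep rows that have at least one non-zero value among the numeric columns.
num = df.select_dtypes(include=['number'])
df_out = df[(num != 0).any(axis=1)]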
Get the numeric columns:
import numpy as np
numcols = df.dtypes == np.int64
Create the indexer and filter:
I = np.sum(df.loc[:, numcols] != 0, axis=1) != 0
df[I]
tag a b
1 orange 0 1
2 grapes 1 0
3 lemon 1 1