I have a DataFrame which is something like this:
Victim Sex        Female    Male  Unknown
Perpetrator Sex
Female             10850   37618       24
Male               99354  299781       92
Unknown            33068  156545      148
I'm planning to drop both the row indexed as 'Unknown' and the column named 'Unknown'. I know how to drop a row and a column separately, but I was wondering whether you can drop a row and a column at the same time in pandas? If yes, how can it be done?
This should do the job; it's not really at the same time, but no intermediate object is returned to you:
df.drop("Unknown", axis=1).drop("Unknown", axis=0)
So, for a concrete example:
df = pd.DataFrame([[1,2],[3,4]], columns=['A', 'B'], index=['C','D'])
print(df)
   A  B
C  1  2
D  3  4
the call
df.drop('B', axis=1).drop('C', axis=0)
returns
   A
D  3
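For completeness, newer pandas versions (0.21+) let drop take index and columns keywords, so both can go in one call; a minimal sketch on the same example frame:

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'], index=['C', 'D'])

# A single drop call removes the row 'C' and the column 'B' at once (pandas >= 0.21)
print(df.drop(index='C', columns='B'))
#    A
# D  3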
I think the closest to 'at the same time' is selecting by loc with Index.difference (note that difference returns a sorted Index, so row and column order may change):
print(df.index.difference(['Unknown']))
Index(['Female', 'Male'], dtype='object')
print(df.columns.difference(['Unknown']))
Index(['Female', 'Male'], dtype='object')
df = df.loc[df.index.difference(['Unknown']), df.columns.difference(['Unknown'])]
print(df)
Victim Sex       Female    Male
Perpetrator Sex
Female            10850   37618
Male              99354  299781
You can delete columns and rows at the same time in one line, using just their positions. For example, if you want to delete columns 2, 3 and 5, and at the same time remove rows 0, 1 and 3 along with the last row of the DataFrame, you can do it as follows:
df.drop(df.columns[[2, 3, 5]], axis=1).drop(df.index[[0, 1, 3, -1]])
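A minimal runnable sketch of that call, on a hypothetical 6x6 frame (the shape and column letters are assumptions, purely for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(36).reshape(6, 6), columns=list('ABCDEF'))

# Drop the columns at positions 2, 3 and 5, then the rows at positions 0, 1, 3 and the last one
result = df.drop(df.columns[[2, 3, 5]], axis=1).drop(df.index[[0, 1, 3, -1]])
print(result)
#     A   B   E
# 2  12  13  16
# 4  24  25  28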
I have two DataFrames that, after a merge by "Name", return NaN in some rows because the "Name" values are incomplete.
df1:

Name       Info 1
Walter     Adress 1
john wick  Adress 1

df2:

Name          Info 2
Walter White  Male
john wick     Male
df2 = pd.merge(df1,df2,on='Name', how='left')
I'm getting:

Name       Info 1    Info 2
Walter     NaN       NaN
john wick  Adress 1  Male
I want:

Name          Info 1    Info 2
Walter White  Adress 1  Male
john wick     Adress 1  Male
How can I treat those rows, to try to match values by substring when the merge returns NaN? I don't know if using merge in the first place was the best logic.
Try this:
df2 = pd.merge_asof(df1, df2, on='Name')
Note that merge_asof takes no how parameter and requires the join key to be sorted and numeric or datetime-like, so whether this helps depends on the resemblance of the different values.
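If the names only partially overlap, another angle (a sketch using difflib from the standard library, not merge_asof; the 0.6 cutoff is an assumed threshold you would tune) is to map each name in df1 to its closest counterpart in df2 before the ordinary merge:

import difflib
import pandas as pd

df1 = pd.DataFrame({'Name': ['Walter', 'john wick'], 'Info 1': ['Adress 1', 'Adress 1']})
df2 = pd.DataFrame({'Name': ['Walter White', 'john wick'], 'Info 2': ['Male', 'Male']})

def closest(name, candidates, cutoff=0.6):
    # cutoff is an assumed similarity threshold; tune it for your data
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else name

# Replace each df1 name with its closest df2 name, then merge as usual
df1['Name'] = df1['Name'].map(lambda n: closest(n, df2['Name'].tolist()))
print(pd.merge(df1, df2, on='Name', how='left'))
#            Name    Info 1 Info 2
# 0  Walter White  Adress 1   Male
# 1     john wick  Adress 1   Male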
The reason it's not working is that pandas doesn't consider "Walter" and "Walter White" to be the same value.
Thus, when you perform a left join on df1, it keeps all the rows of df1 and adds the values from df2 that have the same "Name" column value. Since "Walter" is not present in df2, it puts NaN in the Info 2 column (again, "Walter" and "Walter White" are different).
One way you could solve this is by creating two separate columns, "First_Name" and "Last_Name", and then merging on "First_Name".
Something like:
df1["First_Name"] = df1.apply(lambda row: row['Name'].split()[0], axis=1)
df2["First_Name"] = df2.apply(lambda row: row['Name'].split()[0], axis=1)
Then simply use the same merge as you did:
df2 = pd.merge(df1,df2,on='First_Name', how='left')
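A slightly more idiomatic variant of the same idea, sketched with the sample frames from the question, uses the vectorised .str accessor instead of a row-wise apply:

import pandas as pd

df1 = pd.DataFrame({'Name': ['Walter', 'john wick'], 'Info 1': ['Adress 1', 'Adress 1']})
df2 = pd.DataFrame({'Name': ['Walter White', 'john wick'], 'Info 2': ['Male', 'Male']})

# First whitespace-separated token, computed column-wise rather than row by row
df1['First_Name'] = df1['Name'].str.split().str[0]
df2['First_Name'] = df2['Name'].str.split().str[0]

# The merged frame keeps both originals as Name_x and Name_y
print(pd.merge(df1, df2, on='First_Name', how='left'))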
I have two dataframes, df1 and df2, which have a common column heading, Name. The Name values are unique within df1 and df2. df1's Name values are a subset of those in df2; df2 has more rows -- about 17,300 -- than df1 -- about 6,900 -- but each Name value in df1 is in df2. I would like to create a list of Name values in df1 that meet certain criteria in other columns of the corresponding rows in df2.
Example:
df1:

   Name  Age    Hair
0   Jim   25   black
1  Mary   58   brown
3   Sue   15  purple

df2:

    Name Country  phoneOS
0  Shari      GB  Android
1    Jim      US  Android
2  Alain      TZ      iOS
3    Sue      PE      iOS
4   Mary      US  Android
I would like a list of only those Name values in df1 that have df2 Country and phoneOS values of US and Android. The example result would be [Jim, Mary].
I have successfully selected rows within one dataframe that meet multiple criteria in order to copy those rows to a new dataframe. In that case pandas/Python does the iteration over rows internally. I guess I could write a "manual" iteration over the rows of df1 and access df2 on each iteration. I was hoping for a more efficient solution whereby the iteration was handled internally as in the single-dataframe case. But my searches for such a solution have been fruitless.
Try:
df1.loc[df1.Name.isin(df2.loc[df2.Country.eq('US') &
                              df2.phoneOS.eq('Android'), 'Name']), 'Name']
Result:
0 Jim
1 Mary
Name: Name, dtype: object
If you want the result as a list, just add .to_list() at the end.
data = df1.merge(df2, on='Name')
data.loc[((data.phoneOS == 'Android') & (data.Country == "US")), 'Name'].values.tolist()
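Put together with the sample frames from the question, that merge-based variant looks like this (a self-contained sketch):

import pandas as pd

df1 = pd.DataFrame({'Name': ['Jim', 'Mary', 'Sue'],
                    'Age': [25, 58, 15],
                    'Hair': ['black', 'brown', 'purple']})
df2 = pd.DataFrame({'Name': ['Shari', 'Jim', 'Alain', 'Sue', 'Mary'],
                    'Country': ['GB', 'US', 'TZ', 'PE', 'US'],
                    'phoneOS': ['Android', 'Android', 'iOS', 'iOS', 'Android']})

# The inner merge keeps only Names present in both frames; then filter on the criteria
data = df1.merge(df2, on='Name')
print(data.loc[(data.phoneOS == 'Android') & (data.Country == 'US'), 'Name'].values.tolist())
# ['Jim', 'Mary']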
I have two data frames: DF1 and DF2.
DF2 is essentially a randomly generated subset of rows in DF1.
I want to get the (integer) indexes of the rows of DF1 whose column values completely match a row of DF2.
I'm trying to do this with a multi-index:
So if I have the following:
DF1:

Index  Name   Age  Gender  Label
0      Kate    24       F      1
1      Bill    23       M      0
2      Bob     22       M      0
3      Billy   21       M      0

DF2:

MultiIndex    Name   Age  Gender  Label
(Bob,22,M)    Bob     22       M      0
(Billy,21,M)  Billy   21       M      0
Desired Output: [2,3]
How can I use that MultiIndex in DF2 to check DF1 for those matches?
I found the following while searching, but I think it requires you to specify beforehand what value you want? I can't find this exact use case.
df2.loc[(df2.index.get_level_values('Name') == 'xxx') &
        (df2.index.get_level_values('Age') == x) &
        (df2.index.get_level_values('Gender') == x)]
Please let me know the best way.
Thanks!
Edit (Code to generate df1):
Pseudocode: merge two dataframes to get a total of 10 columns, then drop everything except 4 columns.
Edit (Code to generate df2):
if amount_needed - len(lowest_value_keys) > 0:
    extra_samples = df1[df1.Label == 0].sample(n=amount_needed - len(lowest_value_keys), replace=False)
    lowest_value_df = pd.DataFrame(data=lower_value_keys, columns=["Name", "Age", "Gender"])
    samples = pd.concat([lowest_value_df, extra_samples])
    samples.index = pd.MultiIndex.from_frame(samples[["Name", "Age", "Gender"]])
else:
    all_samples = pd.DataFrame(data=lower_value_keys, columns=["Name", "Age", "Gender"])
    samples = all_samples.sample(n=amount_needed, replace=False)
    samples.index = pd.MultiIndex.from_frame(samples[["Name", "Age", "Gender"]])
Not sure if this answers your query, but what if we first reset the index of df1 to get it as another column, 'Index', then set_index on Name, Age and Gender to find the matches from df2, and just take the resulting Index column?
So that would be:
df1.reset_index().set_index(['Name', 'Age', 'Gender']) \
   .loc[df2.set_index(['Name', 'Age', 'Gender']).index]['Index'].values
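An equivalent way to get the integer positions, sketched with the sample data from the question, is Index.get_indexer, which returns -1 for keys that are missing:

import pandas as pd

df1 = pd.DataFrame({'Name': ['Kate', 'Bill', 'Bob', 'Billy'],
                    'Age': [24, 23, 22, 21],
                    'Gender': ['F', 'M', 'M', 'M'],
                    'Label': [1, 0, 0, 0]})
df2 = df1.iloc[[2, 3]].copy()
df2.index = pd.MultiIndex.from_frame(df2[['Name', 'Age', 'Gender']])

# Position of each df2 key among df1's (Name, Age, Gender) tuples
key = pd.MultiIndex.from_frame(df1[['Name', 'Age', 'Gender']])
print(key.get_indexer(df2.index).tolist())  # [2, 3]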
I have multiple categorical columns, like Marital Status, Education, Gender and City, and I wanted to check all the unique values inside these columns at once, instead of writing this code every time:
df['Education'].value_counts()
I can only give an example with a few features, but I need a solution for when there are so many categorical features that it's not practical to write the code again and again to examine them.
Maritial_Status  Education  City
Married          UG         LA
Single           PHD        CA
Single           UG         Ca

Expected output:

Maritial_Status    Education    City
Married  1         UG   2       LA  1
Single   2         PHD  1       CA  2
Is there any kind of method to do this in Python?
Thanks
Yes, you can get what you're looking for with the following approach (and you don't have to worry about whether your df has more data than the columns you specified).
Get (only) the categorical columns of your df in a list:
cat_cols = [i for i in df.columns if df[i].dtypes == 'O']
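(As an aside, pandas can pick those columns for you; this one-liner should give the same list:)

cat_cols = df.select_dtypes(include='object').columns.tolist()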
Then run a loop over your categorical columns, performing .size() on each grouped object, and store each result (which is a DataFrame object) in an empty list:
li = []
for col in cat_cols:
    li.append(df.groupby([col]).size().reset_index(name=col + '_count'))
Lastly, concat the newly created DataFrames within your list into one:
dat = pd.concat(li,axis=1)
All in 1 block:
cat_cols = [i for i in df.columns if df[i].dtypes == 'O']
li = []
for col in cat_cols:
    li.append(df.groupby([col]).size().reset_index(name=col + '_count'))
dat = pd.concat(li, axis=1)  # axis=1, so that the concatenation is column-wise
Marital Status Marital Status_count ... City City_count
0 Divorced 4.0 ... Athens 4
1 Married 3.0 ... Berlin 2
2 Single 3.0 ... London 2
3 Widowed 2.0 ... New York 2
4 NaN NaN ... Singapore 2
Using value_counts, you can do the following:
res = (df
       .apply(lambda x: x.value_counts())  # value_counts applied column by column
       .stack()
       .reset_index(level=0).sort_index(axis=0)
       .rename(columns={'level_0': 'Value', 0: 'value_counts'}))
Another format of the output:
res['Id'] = res.groupby(level=0).cumcount()
res = res.set_index('Id', append=True)
Explanation:
After applying value_counts column by column, you get a DataFrame with one row per distinct value and NaN wherever a value does not occur in that column.
Then, using stack, you can remove the NaN, get everything "stacked up", and do the formatting/ordering of the output.
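For instance, with the three-column sample frame from the question, the intermediate result of the apply step looks like this (a sketch; the exact row order may vary by pandas version):

import pandas as pd

df = pd.DataFrame({'Maritial_Status': ['Married', 'Single', 'Single'],
                   'Education': ['UG', 'PHD', 'UG'],
                   'City': ['LA', 'CA', 'Ca']})

# One column of counts per original column; NaN where a value never occurs there
print(df.apply(lambda x: x.value_counts()))
#          Maritial_Status  Education  City
# CA                   NaN        NaN   1.0
# Ca                   NaN        NaN   1.0
# LA                   NaN        NaN   1.0
# Married              1.0        NaN   NaN
# PHD                  NaN        1.0   NaN
# Single               2.0        NaN   NaN
# UG                   NaN        2.0   NaN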
To know how many repeated unique values you have for each column, you can try the drop_duplicates() method:
dataset.drop_duplicates()
Sorry for the seemingly confusing title. I was reading Excel data using pandas. However, the original Excel data has multiple header rows, and some of the cells are merged. It sort of looks like this:
It shows in my Jupyter Notebook like this
My plan is to use just the 2nd level as my column names and drop level 0. But the original data has about 15 columns that show up as "Unnamed...", and I wonder if I can rename those before dropping the level-0 column names.
The desirable output looks like:
I may do this repeatedly, so I didn't save it as CSV first and then read that into pandas. Now I have spent longer than I care to admit on fixing the column names. I wonder if there is a way to do this with a function, instead of renaming every individual column of interest.
Thanks.
I think the simplest approach here is a list comprehension: take the value from the second level of the MultiIndex only if it contains no 'Unnamed' text, otherwise fall back to the first level:
df.columns = [first if 'Unnamed' in second else second for first, second in df.columns]
print(df)

    Purchase/sell_time  Quantity  Price Side
0  2020-04-09 15:22:00        20     43    B
1  2020-04-09 16:22:00        30     56    S
But if more levels are possible in the real data, some columns may end up duplicated, and then you cannot select a single one of them (selecting by a duplicated column name returns all matching columns, not only one, e.g. with df['dup_column_name']).
You can test for it:
print(df.columns[df.columns.duplicated(keep=False)])
In that case, I suggest joining all non-Unnamed levels to prevent it:
df.columns = ['_'.join(y for y in x if 'Unnamed' not in y) for x in df.columns]
print(df)

    Purchase/sell_time  Purchase/sell_time_Quantity  Purchase/sell_time_Price \
0  2020-04-09 15:22:00                           20                        43
1  2020-04-09 16:22:00                           30                        56

  Side
0    B
1    S
Your columns are a MultiIndex, and indexes are immutable, meaning you can't change only a part of them. This is why I suggest retrieving both levels of the MultiIndex, then creating an array with your desired column names and replacing the DataFrame's columns with it, as follows:
# First I reproduce your dataframe
import numpy as np
import pandas as pd

df1 = pd.DataFrame({("Purchase/sell_time", "Unnamed:"): pd.date_range("2020-04-09 15:22:00",
                                                                      freq="H", periods=2),
                    ("Purchase/sell_time", "Quantity"): [20, 30],
                    ("Purchase/sell_time", "Price"): [43, 56],
                    ("Side", "Unnamed:"): ["B", "S"]})
df1 = df1.sort_index()
It looks like this:

    Purchase/sell_time                 Side
              Unnamed: Quantity Price  Unnamed:
0  2020-04-09 15:22:00       20    43         B
1  2020-04-09 16:22:00       30    56         S
The columns are a MultiIndex, as you can see:
MultiIndex([('Purchase/sell_time', 'Unnamed:'),
('Purchase/sell_time', 'Quantity'),
('Purchase/sell_time', 'Price'),
( 'Side', 'Unnamed:')],
)
# I retrieve the first and second level of the multiindex then create an array conditionally
# on the second level not starting with "Unnamed"
first_header = df1.columns.get_level_values(0)
second_header = df1.columns.get_level_values(1)
merge_header = np.where(second_header.str.startswith("Unnamed:"),
first_header, second_header)
df1.columns = merge_header
Here is the result:
Purchase/sell_time Quantity Price Side
0 2020-04-09 15:22:00 20 43 B
1 2020-04-09 16:22:00 30 56 S
Hope it helps