I am new to this, and I need to split a column that contains two strings into 2 columns, like this:
Initial dataframe:
Full String
0 Orange Juice
1 Pink Bird
2 Blue Ball
3 Green Tea
4 Yellow Sun
Final dataframe:
First String Second String
0 Orange Juice
1 Pink Bird
2 Blue Ball
3 Green Tea
4 Yellow Sun
I tried this, but it doesn't work:
df['First String'] , df['Second String'] = df['Full String'].str.split()
and this:
df['First String', 'Second String'] = df['Full String'].str.split()
How can I make it work? Thank you!
The key here is to include the parameter expand=True in your str.split() to expand the split strings into separate columns.
Type it like this:
df[['First String','Second String']] = df['Full String'].str.split(expand=True)
Output:
Full String First String Second String
0 Orange Juice Orange Juice
1 Pink Bird Pink Bird
2 Blue Ball Blue Ball
3 Green Tea Green Tea
4 Yellow Sun Yellow Sun
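For reference, here is a minimal runnable sketch of the same idea. The `n=1` argument is an addition not present in the answer above: it limits the split to the first whitespace, so the result has at most two columns even if some value contained three or more words. The variable name `as_lists` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Full String': ['Orange Juice', 'Pink Bird', 'Blue Ball',
                                   'Green Tea', 'Yellow Sun']})

# Without expand=True, str.split() returns a single Series whose values are
# lists, which is why assigning it to two columns does not work.
as_lists = df['Full String'].str.split()

# expand=True returns a DataFrame with one column per piece.
# n=1 (an assumption added here) splits only once, at the first whitespace.
df[['First String', 'Second String']] = df['Full String'].str.split(n=1, expand=True)
print(df)
```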
Have you tried this solution?
https://stackoverflow.com/a/14745484/15320403
df = pd.DataFrame(df['Full String'].str.split(' ', n=1).tolist(), columns = ['First String', 'Second String'])
I am having difficulties with calculating how much of an item is present in all order portfolios in percentage?
Items are toys that people usually buy: bear, rabbit, moose, dog, horse, cat, mouse, pig, chicken, eagle, raccoon, dolphin, shark, and whale.
I have an order_portfolio_num column, which identifies the person buying the toys, and order_position_X columns, where X is the position of the item in the order, with a total of 8 positions. A person ordering toys never buys the same toy twice, so items never repeat within one portfolio/row. Please note that my original dataframe contains NaN values, so I included them here as well.
>>> import pandas as pd
>>> from numpy import nan
>>>
>>> data = pd.DataFrame({'order_portfolio_num': [1,2,3,4,5,6,7,8],
... 'order_position_1':['dog', 'horse', 'cat','shark', 'dog', 'rabbit', 'rabbit', 'cat'],
... 'order_position_2':['mouse', 'bear', 'dog', 'dolphin', 'cat', 'bear', 'eagle', 'shark'],
... 'order_position_3':['bear', 'dog', 'raccoon', 'dog', 'whale', 'mouse', 'cat', 'moose'],
... 'order_position_4':['dolphin', 'cat', 'chicken', nan, 'horse', 'pig', 'dog', 'chicken'],
... 'order_position_5':['pig', 'chicken', 'eagle', nan, 'bear', 'raccoon', 'whale', nan],
... 'order_position_6':[nan, 'whale', nan, nan, 'eagle', 'moose', nan, nan],
... 'order_position_7':[nan, 'dolphin', nan, nan, nan, 'chicken', nan, nan]})
>>>
>>> data
order_portfolio_num order_position_1 order_position_2 order_position_3 order_position_4 order_position_5 order_position_6 order_position_7
0 1 dog mouse bear dolphin pig NaN NaN
1 2 horse bear dog cat chicken whale dolphin
2 3 cat dog raccoon chicken eagle NaN NaN
3 4 shark dolphin dog NaN NaN NaN NaN
4 5 dog cat whale horse bear eagle NaN
5 6 rabbit bear mouse pig raccoon moose chicken
6 7 rabbit eagle cat dog whale NaN NaN
7 8 cat shark moose chicken NaN NaN NaN
I would like to calculate the top 5 most common toys ordered across all portfolios, as a percentage. For example, if I have 10 order portfolios and the toy bear is present in 4 of them, the bear toy will then have a value of 40%. My goal is to have something that looks like this:
toy percent
dog 60%
cat 48%
mouse 36%
bear 28%
shark 19%
I tried to sum across all toys in the dataframe, but I got the number of occurrences of each toy across all portfolios. I am unsure how to calculate the percentage from that (which value represents 100%?), or whether it is even what I am looking for, since it would give me the percentage of all toy occurrences rather than of portfolios. So I am unsure how to proceed. This is what I tried:
>>> cols = ['order_position_1', 'order_position_2', 'order_position_3', 'order_position_4',
... 'order_position_5', 'order_position_6', 'order_position_7']
>>>
>>> position_values = data[cols].melt().groupby('value').size().reset_index(name='count')
>>>
>>> position_values.sort_values(by = 'count', ascending = False)
value count
3 dog 6
1 cat 5
0 bear 4
2 chicken 4
4 dolphin 3
5 eagle 3
13 whale 3
6 horse 2
7 moose 2
8 mouse 2
9 pig 2
10 rabbit 2
11 raccoon 2
12 shark 2
Any ideas?
Use DataFrame.melt with Series.value_counts and divide by original number of rows:
df = data.melt('order_portfolio_num')['value'].value_counts().div(len(data)).mul(100).head()
print (df)
dog 75.0
cat 62.5
bear 50.0
chicken 50.0
dolphin 37.5
Name: value, dtype: float64
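The one-liner above relies on the question's guarantee that a toy never repeats within a row, so occurrences and portfolios coincide. As a hedged sketch, the variant below does not depend on that guarantee: it drops duplicate (portfolio, toy) pairs before counting and formats the result as percent strings. The names `long`, `per_portfolio`, `share` and `top5` are illustrative, not from the original answer.

```python
import pandas as pd
from numpy import nan

data = pd.DataFrame({
    'order_portfolio_num': [1, 2, 3, 4, 5, 6, 7, 8],
    'order_position_1': ['dog', 'horse', 'cat', 'shark', 'dog', 'rabbit', 'rabbit', 'cat'],
    'order_position_2': ['mouse', 'bear', 'dog', 'dolphin', 'cat', 'bear', 'eagle', 'shark'],
    'order_position_3': ['bear', 'dog', 'raccoon', 'dog', 'whale', 'mouse', 'cat', 'moose'],
    'order_position_4': ['dolphin', 'cat', 'chicken', nan, 'horse', 'pig', 'dog', 'chicken'],
    'order_position_5': ['pig', 'chicken', 'eagle', nan, 'bear', 'raccoon', 'whale', nan],
    'order_position_6': [nan, 'whale', nan, nan, 'eagle', 'moose', nan, nan],
    'order_position_7': [nan, 'dolphin', nan, nan, nan, 'chicken', nan, nan],
})

# Long format: one row per (portfolio, position); empty positions are dropped.
long = data.melt('order_portfolio_num', value_name='toy').dropna(subset=['toy'])

# Count each toy at most once per portfolio, in case a toy could repeat in a row.
per_portfolio = long.drop_duplicates(['order_portfolio_num', 'toy'])

# 100% corresponds to the number of distinct portfolios.
share = (per_portfolio['toy'].value_counts()
         .div(data['order_portfolio_num'].nunique())
         .mul(100))

top5 = share.head(5)
print(top5.map('{:.1f}%'.format))
```

The denominator is the number of portfolios, not the total number of toys, which answers the "which value represents 100%" part of the question.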
This is the general idea:
First, get the names of all the toys.
Then check, for every toy, whether it appears in each row, and store that count.
Finally, compute the frequency.
unique_values = data.drop(columns="order_portfolio_num").stack().unique()
count = pd.Series([(data == x).any(axis=1).sum() for x in unique_values], unique_values)
frec = count / len(data) * 100
print(frec.head())
dog 75.0
mouse 25.0
bear 50.0
dolphin 37.5
pig 25.0
Documentation
pandas.DataFrame.drop
pandas.DataFrame.stack
pandas.unique
pandas.Series.size
pandas.DataFrame.any
pandas.Series.sum
List Comprehensions
Divide each value by the number of order portfolios (N) and multiply by 100 to get a percentage:
position_values['percent'] = position_values['count'] / data['order_portfolio_num'].count() * 100
Then sort by the percentage and take the head.
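Put together, the steps might look like the sketch below. It uses a small made-up sample rather than the question's data, and the names `n_portfolios` and `top` are hypothetical.

```python
import pandas as pd
from numpy import nan

# Small made-up sample with the same column layout as in the question.
data = pd.DataFrame({
    'order_portfolio_num': [1, 2, 3, 4],
    'order_position_1': ['dog', 'cat', 'dog', 'bear'],
    'order_position_2': ['cat', 'dog', nan, 'dog'],
})

cols = [c for c in data.columns if c.startswith('order_position_')]

# groupby() skips NaN keys by default, so empty positions are not counted.
position_values = (data[cols].melt()
                   .groupby('value').size()
                   .reset_index(name='count'))

# N = number of portfolios; this is the value that represents 100%.
n_portfolios = data['order_portfolio_num'].count()
position_values['percent'] = position_values['count'] / n_portfolios * 100

# Sort by share and keep the top rows.
top = position_values.sort_values('percent', ascending=False).head(5)
print(top)
```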
I am working on a pandas project and am pretty new to it. I have two huge data frames with a structure similar to the one below:
Data frame 1:
Animals Plants
Dog Amaryllis
Cat Angel Wing Begonia
Dragon African Violet
Data frame 2:
Animals Planents Amaryllis Angel Wing Begonia
Dog Earth x x
Cat Pluto na na
Dragon Mars na x
I need all the plants from dataframe 1 to be compared against the 'x' values in dataframe 2. If 'x' is present in a particular column, I have to pick the column name (e.g. Amaryllis is present in the first row), the animal name, and the planet name, and write them to another file.
Expected output :
Amaryllis , Dog, Earth
Angel wing Begonia , Dog, Earth
Angel wing Begonia , Dragon, Mars
Currently I have only tried reading the rows where one column contains the x value:
DATA = df_xlsx[df_xlsx['Amaryllis'].str.contains('X', na=False)]
DATA
I'm not sure what's the significance of df1. As far as I can see, it looks like a melt/stack and filter:
(df2.melt(['Animals','Planents'], var_name='Plants')
.query('value=="x"')
.iloc[:,:-1]
)
Output:
Animals Planents Plants
0 Dog Earth Amaryllis
3 Dog Earth Angel Wing Begonia
5 Dragon Mars Angel Wing Begonia
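A runnable sketch of this approach follows, extended in two ways that are assumptions rather than part of the answer above: df1 is used to restrict the melt to plants that actually exist as columns in df2, and the result is reordered to plant, animal, planet and rendered as CSV text. The frames are rebuilt by hand from the question's sample, and the names `plants`, `result` and `csv_text` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Animals': ['Dog', 'Cat', 'Dragon'],
    'Plants': ['Amaryllis', 'Angel Wing Begonia', 'African Violet'],
})
df2 = pd.DataFrame({
    'Animals': ['Dog', 'Cat', 'Dragon'],
    'Planents': ['Earth', 'Pluto', 'Mars'],  # column name spelled as in the question
    'Amaryllis': ['x', 'na', 'na'],
    'Angel Wing Begonia': ['x', 'na', 'x'],
})

# Keep only the plants from df1 that are present as columns in df2.
plants = [p for p in df1['Plants'] if p in df2.columns]

result = (df2.melt(['Animals', 'Planents'], value_vars=plants, var_name='Plants')
             .query('value == "x"')
             [['Plants', 'Animals', 'Planents']])

# With no path, to_csv returns the text; pass a file name to write to disk.
csv_text = result.to_csv(index=False, header=False)
print(csv_text)
```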
I have 2 dataframes (df_a and df_b) with 2 columns: 'Animal' and 'Name'.
In the bigger dataframe, there are more animals of the same type than in the other. How do I find the extra animals of the same type by name? i.e. (df_a - df_b)
Dataframe A
Animal Name
dog john
dog henry
dog betty
dog smith
cat charlie
fish tango
lion foxtrot
lion lima
Dataframe B
Animal Name
dog john
cat charlie
dog betty
fish tango
lion foxtrot
dog smith
In this case, the extra would be:
Animal Name
dog henry
lion lima
Attempt: I tried using
df_c = df_a.subtract(df_b, axis='columns')
but got the following error "unsupported operand type(s) for -: 'unicode' and 'unicode'", which makes sense since they are strings not numbers. Is there any other way?
You are looking for a left_only merge.
merged = pd.merge(df_a,df_b, how='outer', indicator=True)
merged.loc[merged['_merge'] == 'left_only'][['Animal', 'Name']]
Output
Animal Name
1 dog henry
7 lion lima
Explanation:
merged = pd.merge(df_a,df_b, how='outer', indicator=True)
Gives:
Animal Name _merge
0 dog john both
1 dog henry left_only
2 dog betty both
3 dog smith both
4 cat charlie both
5 fish tango both
6 lion foxtrot both
7 lion lima left_only
The extra animals are in df_a only, which is denoted by left_only.
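A self-contained sketch of the same technique is shown below, with the two frames rebuilt from the question. It uses how='left' instead of how='outer', which is a deliberate variation: a left merge keeps the row order of df_a and is enough when only the left_only rows are wanted. The name `extra` is illustrative.

```python
import pandas as pd

df_a = pd.DataFrame({
    'Animal': ['dog', 'dog', 'dog', 'dog', 'cat', 'fish', 'lion', 'lion'],
    'Name': ['john', 'henry', 'betty', 'smith', 'charlie', 'tango', 'foxtrot', 'lima'],
})
df_b = pd.DataFrame({
    'Animal': ['dog', 'cat', 'dog', 'fish', 'lion', 'dog'],
    'Name': ['john', 'charlie', 'betty', 'tango', 'foxtrot', 'smith'],
})

# With no `on=` argument, merge joins on all shared columns (Animal and Name).
# indicator=True adds a `_merge` column saying where each row was found.
merged = df_a.merge(df_b, how='left', indicator=True)

# Rows found only in df_a are the "extra" animals.
extra = merged.loc[merged['_merge'] == 'left_only', ['Animal', 'Name']]
print(extra)
```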
Using isin on the row-wise concatenation of the columns:
df_a[~df_a.sum(axis=1).isin(df_b.sum(axis=1))]
Out[611]:
Animal Name
1 dog henry
7 lion lima
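One caveat with concatenating the columns before isin: two different rows can produce the same concatenated string. The sketch below uses made-up data to show such a collision, and compares (Animal, Name) pairs through a MultiIndex instead. This is a swapped-in alternative, not the method of the answer above, and the names `by_concat`, `keys_a`, `keys_b` and `extra` are hypothetical.

```python
import pandas as pd

# Made-up rows chosen so that 'cat' + 'fish' and 'catf' + 'ish' collide.
df_a = pd.DataFrame({'Animal': ['dog', 'cat'], 'Name': ['john', 'fish']})
df_b = pd.DataFrame({'Animal': ['dog', 'catf'], 'Name': ['john', 'ish']})

# Concatenation-based comparison: both rows of df_a appear to be in df_b.
by_concat = df_a[~(df_a['Animal'] + df_a['Name']).isin(df_b['Animal'] + df_b['Name'])]

# Pair-based comparison: each row is compared as an (Animal, Name) tuple.
keys_a = pd.MultiIndex.from_frame(df_a[['Animal', 'Name']])
keys_b = pd.MultiIndex.from_frame(df_b[['Animal', 'Name']])
extra = df_a[~keys_a.isin(keys_b)]

print(by_concat)  # empty: the collision hides the extra row
print(extra)      # the ('cat', 'fish') row
```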
I have a dataframe with 4 categorical index levels:
grey cat male ralph ...
grey cat female bessie ...
yellow parrot female lisa ...
black dog male fido ...
orange parrot female janie ...
orange parrot male pete ...
black dog male will ...
grey cat female wanda ...
white dog female karen ...
black cat male albert ...
I want to sort the data in the following order not referring specifically to the index values:
First by animal
Second by color
Third by gender
I'll potentially want to slice by (animal,color,gender) groups.
Where the first, second, and third categorical values are the same, I want the records sorted in ascending alphabetical order of the 4th (see grey, cat, female below: bessie is ordered before wanda). On reflection, level 4 (name) may not need to be an index level.
So the resulting dataframe would look like the following (only indices shown):
black cat male albert ...
grey cat female bessie ...
grey cat female wanda ...
grey cat male ralph ...
black dog male fido ...
black dog male will ...
white dog female karen ...
orange parrot female janie ...
orange parrot male pete ...
yellow parrot female lisa ...
I may use the code for other data sets, so I want to write it generically (not referring to the specific contents of the data set).
I'm stumped. Can someone provide some guidance?
Thanks.
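For illustration only, here is a minimal sketch of one way the described ordering could be produced. It assumes the four index levels hold plain strings and sit in the positional order color, animal, gender, name; the level names and the `value` column are hypothetical. Levels are reordered by position, so the code does not refer to any specific index values.

```python
import pandas as pd

rows = [
    ('grey', 'cat', 'male', 'ralph'), ('grey', 'cat', 'female', 'bessie'),
    ('yellow', 'parrot', 'female', 'lisa'), ('black', 'dog', 'male', 'fido'),
    ('orange', 'parrot', 'female', 'janie'), ('orange', 'parrot', 'male', 'pete'),
    ('black', 'dog', 'male', 'will'), ('grey', 'cat', 'female', 'wanda'),
    ('white', 'dog', 'female', 'karen'), ('black', 'cat', 'male', 'albert'),
]
# Level names are assumptions; the question does not state them.
index = pd.MultiIndex.from_tuples(rows, names=['color', 'animal', 'gender', 'name'])
df = pd.DataFrame({'value': range(len(rows))}, index=index)

# Move animal (position 1) in front of color (position 0), then sort
# lexicographically: animal, color, gender, and finally name.
sorted_df = df.reorder_levels([1, 0, 2, 3]).sort_index()
print(sorted_df)

# With the index sorted in this order, an (animal, color, gender) group
# can be selected with a partial key.
group = sorted_df.loc[('cat', 'grey', 'female')]
print(group)
```

If the levels were true categoricals with a custom category order, the sort would follow that order rather than the alphabet, so the result could differ from this sketch.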