So, I have a data frame like this (the important column is the third one):
| ABC | DEF | fruit |
----------------------------
1 | 12 | LO | banana
2 | 45 | KA | orange
3 | 65 | JU | banana
4 | 25 | UY | grape
5 | 23 | TE | apple
6 | 28 | YT | orange
7 | 78 | TR | melon
I want to keep the rows that have the 5 most occurring fruits and drop the rest, so I made a variable to hold those fruits to keep in a list, like this:
fruits = df['fruit'].value_counts()
fruits_to_keep = fruits[:5].reset_index()
fruits_to_keep.drop(['fruit'], inplace=True, axis=1)
fruits_to_keep = fruits_to_keep.to_numpy()
fruits_to_keep = fruits_to_keep.tolist()
fruits_to_keep
[['banana'], ['orange'], ['apple'], ['melon'], ['grape']]
I have the feeling that I took some unnecessary steps, but anyway, the problem arises when I try to select the rows containing those fruits_to_keep:
df = df.set_index('fruit')
df = df.loc[fruits_to_keep,:]
Then I get a KeyError saying that "None of [Index([('banana',), \n ('orange',), \n ('apple',)...... dtype='object', name='fruit')] are in the [index]"
I also tried:
df[df.fruit in fruits_to_keep]
But then I get the following error:
('Lengths must match to compare', (43987,), (1,))
Note: I actually have 43k rows, many 'fruits' that I don't want in the dataframe, and 30k+ rows with the 5 most occurring 'fruits'.
Thanks in advance!
To keep the rows with the top N values you can use value_counts and isin.
By default, value_counts returns the elements in descending order of frequency.
N = 5
df[df['col'].isin(df['col'].value_counts().index[:N])]
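Applied to the fruit column from the question, this might look like the sketch below. The sample values are copied from the question's table and N = 2 is an assumption chosen only to keep the output small; when counts are tied at the cutoff, which of the tied values is kept depends on the order value_counts returns.

```python
import pandas as pd

# Sample data modelled on the question's table
df = pd.DataFrame({
    "ABC": [12, 45, 65, 25, 23, 28, 78],
    "DEF": ["LO", "KA", "JU", "UY", "TE", "YT", "TR"],
    "fruit": ["banana", "orange", "banana", "grape", "apple", "orange", "melon"],
})

N = 2  # the question uses 5; 2 keeps this example small

# value_counts sorts by frequency, so the first N index labels are the top N fruits
top_fruits = df["fruit"].value_counts().index[:N]

# isin builds a boolean mask with one entry per row, which is what df[...] expects
filtered = df[df["fruit"].isin(top_fruits)]
print(filtered)
```

A flat list such as df['fruit'].value_counts().index[:N].tolist() also works with isin. The nested list [['banana'], ['orange'], ...] built in the question appears to be what led to the KeyError, since each one-element list was treated as a tuple key.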
I have two dfs that have same columns and contain same information, but from different sources:
df_orders = pd.DataFrame({'id':[1,2,3],'model':['A1','A3','A6'], 'color':['Red','Blue','Green']})
df_billed = pd.DataFrame({'id':[1,6,7],'model':['A1','A7','B1'], 'color':['Purple','Pink','Red']})
Then I do a left merge on df_billed by id and add suffixes, as the column names overlap:
merge_df = pd.merge(df_billed,df_orders,on='id',how='left',suffixes=('_order','_billed'))
Results in
id|model_order|color_order | model_billed | color_billed
0 1 | A1 | Purple | A1 | Red
1 6 | A7 | Pink | NaN | NaN
2 7 | B1 | Red | NaN | NaN
The column with the _order suffix has higher priority than the one with _billed, and I would like to end up with a dataframe where, if there is no billed info, we take the order info, and the suffixes are removed:
id|model_billed | color_billed |
0 1 | A1 | Red |
1 6 | A7 | Pink |
2 7 | B1 | Purple |
Ideally I thought of doing a combine_first to coalesce the columns and renaming them at the end, but that looks a bit dirty in code and I am looking for a better-designed solution.
You can just use .fillna() and use the _order columns to fill the NAs
merge_df['model_billed'] = merge_df['model_billed'].fillna(merge_df['model_order'])
merge_df['color_billed'] = merge_df['color_billed'].fillna(merge_df['color_order'])
Output
merge_df[['id', 'model_billed', 'color_billed']]
id model_billed color_billed
0 1 A1 Red
1 6 A7 Pink
2 7 B1 Red
UPDATE
If there are more such columns, you can just use a loop like this:
col_names = ['model', 'color']
for col in col_names:
    merge_df[col+'_billed'] = merge_df[col+'_billed'].fillna(merge_df[col+'_order'])
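Put together with the frames from the question, the whole flow could be sketched as follows. The merge call is copied unchanged from the question, so df_billed (the left frame) receives the '_order' suffix; the variable name result is only illustrative.

```python
import pandas as pd

df_orders = pd.DataFrame({'id': [1, 2, 3], 'model': ['A1', 'A3', 'A6'], 'color': ['Red', 'Blue', 'Green']})
df_billed = pd.DataFrame({'id': [1, 6, 7], 'model': ['A1', 'A7', 'B1'], 'color': ['Purple', 'Pink', 'Red']})

# Same merge as in the question: the left frame's columns get the first suffix
merge_df = pd.merge(df_billed, df_orders, on='id', how='left', suffixes=('_order', '_billed'))

col_names = ['model', 'color']
for col in col_names:
    # Where the *_billed column is NaN, fall back to the matching *_order value
    merge_df[col + '_billed'] = merge_df[col + '_billed'].fillna(merge_df[col + '_order'])

# Keep only the id and the coalesced columns
result = merge_df[['id'] + [col + '_billed' for col in col_names]]
print(result)
```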
I have a DataFrame like this:
df = pd.DataFrame({'Source1': ['Corona,Corona,Corona','Sars,Sars','Corona,Sars',
'Sars,Corona','Sars'],
'Area': ['A,A,A,B','A','A,B,B,C','C,C,B,C','A,B,C']})
df
Source1 Area
0 Corona,Corona,Corona A,A,A,B
1 Sars,Sars A
2 Corona,Sars A,B,B,C
3 Sars,Corona C,C,B,C
4 Sars A,B,C
I want to check each cell in each column (the real data has many columns) and find the frequency of each unique word (we can distinguish the unique words by ','), and replace the whole entry by the most frequent word.
In the case of a tie, it doesn't matter which word to replace. So the desired output would look like this:
df
Source1 Area
0 Corona A
1 Sars A
2 Corona B
3 Sars C
4 Sars A
In this case, I randomly chose to pick the first word when there is a tie, but it really doesn't matter.
Thanks in advance.
Split each column into a DataFrame with Series.str.split and expand=True, then use DataFrame.mode with axis=1 and select the first column by position:
df['Source1'] = df['Source1'].str.split(',', expand=True).mode(axis=1).iloc[:, 0]
df['Area'] = df['Area'].str.split(',', expand=True).mode(axis=1).iloc[:, 0]
print (df)
Source1 Area
0 Corona A
1 Sars A
2 Corona B
3 Sars C
4 Sars A
Another idea with collections.Counter.most_common:
from collections import Counter
f = lambda x: [Counter(y.split(',')).most_common(1)[0][0] for y in x]
df[['Source1', 'Area']] = df[['Source1', 'Area']].apply(f)
#all columns
#df = df.apply(f)
print (df)
Source1 Area
0 Corona A
1 Sars A
2 Corona B
3 Sars C
4 Sars A
Here would be my offering that can be executed in a single line for each series and requires no extra imports.
df['Area'] = df['Area'].apply(lambda x: max(x.replace(',',''), key=x.count))
After removing every , from the field in the Area series, max with the key=x.count argument returns the character with the greatest number of occurrences (or the first such character in the case of a tie), and that character replaces the field. This works here because every Area value is a single character.
You could also use something similar (demonstrated with the Source1 series), returning the maximum from the list of elements created by splitting the field.
df['Source1'] = df['Source1'].apply(lambda x: max(list(x.split(',')), key=x.count))
+---+---------+------+
| | Source1 | Area |
+---+---------+------+
| 0 | Corona | A |
| 1 | Sars | A |
| 2 | Corona | B |
| 3 | Sars | C |
| 4 | Sars | A |
+---+---------+------+
Two methods shown above to highlight choices; both would work adequately on either or both series.
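One caveat with key=x.count: str.count counts substring occurrences in the whole field, so the result can be off when one word is contained in another. A minimal sketch that counts whole tokens on the split list instead; the helper name most_frequent and the sample string are hypothetical and not part of the original answer.

```python
def most_frequent(cell, sep=","):
    """Return the most frequent token in a delimited string (first one wins ties)."""
    parts = cell.split(sep)
    # list.count compares whole tokens, unlike str.count, which matches substrings
    return max(parts, key=parts.count)

sample = "Sars2,Sars2,Sars"  # hypothetical value where one word contains another

substring_based = max(sample.split(","), key=sample.count)  # 'Sars' occurs 3 times as a substring
token_based = most_frequent(sample)                         # 'Sars2' occurs twice as a token

print(substring_based, token_based)
```

It can be applied in the same way as the lambdas above, for example df['Source1'].apply(most_frequent).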
I am new to Python and Stack Overflow, so please bear with me. I have a large data file of roughly 140k rows stored as a CSV. The file is split into sections based on age groups, i.e. 16-24, 24-50 etc. At every break there are information lines about the age and ethnicity of the subjects.
After loading the CSV into pandas, I tried to break the dataframe up into several smaller ones by dividing on the information lines of the age groups using iloc. Now I have a list of dataframes. I can access each dataframe in the list, no problem; however (I guess due to the information lines) pandas displays all information in one column.
Is there a way to format the output and make pandas display the column headers and put the information lines into the header above the column headers? I'm sorry if this is not very clear, please feel free to suggest any edits.
The data in the csv looks something like this:
0 Some information
1 Some information
2 Some information
3
4
5 a | b | c | d |
6 a | 1 | 1 | 1 |
7 a | 1 | 1 | 1 |
8 a | 1 | 1 | 1 |
9
10 Some information
11 Some information
12 Some information
13
14
15 a | b | c | d |
16 a | 1 | 1 | 1 |
17 a | 1 | 1 | 1 |
18 a | 1 | 1 | 1 |
I used iloc to break this up on the information lines by row index.
36065,43278,50491,57704,
64917,72130,79343,86556,
93769,100982,108195,115408,
122621,129834,137047]
l_mod = [0] + l + [max(l)+1]
list_of_dfs = [mydata_df.iloc[l_mod[n]:l_mod[n+1]] for n in range(len(l_mod)-1)]
To access one of them I used: df1_df = list_of_dfs[1]
The output is currently as follows:
0
--------------------
1 a,b,c
2 a,1,1,
I hope this makes sense, please suggest edits and I'll do my best to explain.
You can try df[0].str.split(',', expand=True), which expands your dataframe based on every split on a comma. Then you can assign the new column names to it, since it will give column names [0, 1, 2, 3.. etc]
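A rough sketch of that suggestion, assuming each chunk is a single-column frame whose column is named 0 and whose first row holds the header text. The sample rows and the name chunk are made up for illustration.

```python
import pandas as pd

# Hypothetical chunk: one column (0) of comma-separated text,
# with the header line as the first row
chunk = pd.DataFrame({0: ["a,b,c", "x,1,1", "y,1,1"]})

# Split every row on commas into separate columns (named 0, 1, 2, ...)
expanded = chunk[0].str.split(",", expand=True)

# Promote the first row to column names, then drop it from the data
expanded.columns = expanded.iloc[0].tolist()
expanded = expanded.iloc[1:].reset_index(drop=True)
print(expanded)
```

The values stay strings after the split, so numeric columns would still need a conversion such as pd.to_numeric.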
I'm stuck with a little problem with python and regular expressions.
I have a pandas table whose records are constructed in different orders, see below.
+----------------------------------------------+
| Total |
+----------------------------------------------+
| Total Price: 4 x 2 = 8 |
| Total Price 200 Price_per_piece 10 Amount 20 |
+----------------------------------------------+
I want to separate the records in the ‘Total’ column to 3 other columns like below.
Do I first need to split the records into 2 subsets and apply a different regular expression to each, or do you have other solutions/ideas?
+-------+-----------------+--------+
| Total | Price_per_piece | Amount |
+-------+-----------------+--------+
| 8 | 4 | 2 |
| 200 | 10 | 20 |
+-------+-----------------+--------+
Try this one:
import re
import pandas as pd

dtotal = {"Total": ["Total Price: 4 x 2 = 8", "Total Price 200 Price_per_piece 10 Amount 20"]}
dt = pd.DataFrame(dtotal)
data = []
for item in dt['Total']:
    # capture the three numbers in the order they appear in the string
    numbers = re.findall(r"(\d+)\D+(\d+)\D+(\d+)", item)
    # findall returns a list of tuples; convert the first match to ints
    data.append(list(map(int, numbers[0])))
dftotal = pd.DataFrame(data, columns=['Total', 'Price_per_piece', 'Amount'])
print(dftotal)
Output:
Total Price_per_piece Amount
0 4 2 8
1 200 10 20
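Note that the positional pattern assigns the numbers in their order of appearance, so the first record comes out as 4, 2, 8 rather than the 8, 4, 2 asked for in the question. A sketch of one way around this is to try one pattern per layout and use named groups; the two patterns below are assumptions based only on the two sample strings, and parse_total is a hypothetical helper name.

```python
import re
import pandas as pd

# One pattern per layout seen in the sample data (assumed to be the only two layouts)
PATTERNS = [
    re.compile(r"Total Price:\s*(?P<Price_per_piece>\d+)\s*x\s*(?P<Amount>\d+)\s*=\s*(?P<Total>\d+)"),
    re.compile(r"Total Price\s+(?P<Total>\d+)\s+Price_per_piece\s+(?P<Price_per_piece>\d+)\s+Amount\s+(?P<Amount>\d+)"),
]

def parse_total(text):
    """Return a dict of ints for the first matching layout, or None if nothing matches."""
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return {name: int(value) for name, value in match.groupdict().items()}
    return None

dt = pd.DataFrame({"Total": ["Total Price: 4 x 2 = 8",
                             "Total Price 200 Price_per_piece 10 Amount 20"]})

# Named groups let each number land in the right column regardless of its position
parsed = pd.DataFrame([parse_total(s) for s in dt["Total"]],
                      columns=["Total", "Price_per_piece", "Amount"])
print(parsed)
```

Records that match neither layout would need another pattern in the list, or separate handling for the None that parse_total returns.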
I got the following problem which I got stuck on and unfortunately cannot resolve by myself or by similar questions that I found on stackoverflow.
To keep it simple, I'll give a short example of my problem:
I got a Dataframe with several columns and one column that indicates the ID of a user. It might happen that the same user has several entries in this data frame:
| | userID | col2 | col3 |
+---+-----------+----------------+-------+
| 1 | 1 | a | b |
| 2 | 1 | c | d |
| 3 | 2 | a | a |
| 4 | 3 | d | e |
Something like this. Now I want to know the number of rows that belong to a certain userID. For this operation I tried to use df.groupby('userID').size(), whose result I want to use for another simple calculation, such as a division.
But when I try to save the results of the calculation in a separate column, I keep getting NaN values.
Is there a way to solve this so that I get the result of the calculations in a separate column?
Thanks for your help!
edit//
To make clear what my output should look like: the dataframe above is my main data frame, so to say. Besides this frame I have a second frame that looks like this:
| | userID | value | value/appearances |
+---+-----------+----------------+-------+
| 1 | 1 | 10 | 10 / 2 = 5 |
| 3 | 2 | 20 | 20 / 1 = 20 |
| 4 | 3 | 30 | 30 / 1 = 30 |
So I basically want in the column 'value/appearances' to have the result of the number in the value column divided by the number of appearances of this certain user in the main dataframe. For user with ID=1 this would be 10/2, as this user has a value of 10 and has 2 rows in the main dataframe.
I hope this makes it a bit clearer.
IIUC you want to do the following, groupby on 'userID' and call transform on the grouped column and pass 'size' to identify the method to call:
In [54]:
df['size'] = df.groupby('userID')['userID'].transform('size')
df
Out[54]:
userID col2 col3 size
1 1 a b 2
2 1 c d 2
3 2 a a 1
4 3 d e 1
What you tried:
In [55]:
df.groupby('userID').size()
Out[55]:
userID
1 2
2 1
3 1
dtype: int64
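When assigned back to the df, this Series aligns on the df index rather than on userID, so each count lands on the row whose index label equals the userID, and the last row (index 4) gets NaN: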
In [57]:
df['size'] = df.groupby('userID').size()
df
Out[57]:
userID col2 col3 size
1 1 a b 2
2 1 c d 1
3 2 a a 1
4 3 d e NaN
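For the second frame described in the question's edit, one option is to count the rows per user in the main frame and look those counts up by userID in the second frame. This is a sketch under assumptions: the frame names df_main and df_values are hypothetical, and the data is rebuilt from the tables in the question.

```python
import pandas as pd

# Main frame from the question (index labels 1..4)
df_main = pd.DataFrame({'userID': [1, 1, 2, 3],
                        'col2': ['a', 'c', 'a', 'd'],
                        'col3': ['b', 'd', 'a', 'e']}, index=[1, 2, 3, 4])

# Second frame from the question's edit
df_values = pd.DataFrame({'userID': [1, 2, 3], 'value': [10, 20, 30]}, index=[1, 3, 4])

# Rows per user in the main frame, indexed by userID
appearances = df_main.groupby('userID').size()

# map looks each userID up in `appearances`, so the match is by userID, not by row index
df_values['value/appearances'] = df_values['value'] / df_values['userID'].map(appearances)
print(df_values)
```

A userID that is missing from the main frame would map to NaN, so the division would give NaN for that row.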