Not sure of the right title for this, but I need to take a column from a dataframe and show the top five results. The column is a mix of integers and n/a values. As an example, I create a basic dataframe:
regiona col1
a n/a
a 1
a 200
b 208
b 400
b 560
b 600
c 800
c 1120
c 1200
c 1680
d n/a
d n/a
And so run:
import pandas as pd
df = pd.read_csv('test_data.csv')
I then created a basic function so I could use it on different columns:
def max_search(indicator):
    displaced_count = df[df[indicator] != 'n/a']
    table = displaced_count.sort_values([indicator], ascending=[False])
    return table.head(5)
But when I run
max_search('col1')
It returns:
regiona col1
7 c 800
6 b 600
5 b 560
4 b 400
3 b 208
So it misses anything greater than 800. The steps I think the function should be doing are:
Filter out n/a values.
Return the top five values.
However, it is not returning anything over 800. Am I missing something very obvious?
Check your dataframe's dtypes: col1 is currently object, so the values are sorted as strings. First make sure col1's dtype is numeric.
Use na_values in pd.read_csv() and your function will work as expected:
df = pd.read_csv('test_data.csv', na_values='n/a')
# check with df.dtypes: col1 should now be float64 instead of object
You could also do:
df['col1'] = pd.to_numeric(df['col1'], errors='coerce')
df.dropna().sort_values(['col1'], ascending=False).head(5)
regiona col1
10 c 1680.0
9 c 1200.0
8 c 1120.0
7 c 800.0
6 b 600.0
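For reference, a minimal sketch of the same helper once col1 holds real NaN values (nlargest ignores NaN in the chosen column, so the string filter is no longer needed):
def max_search(indicator):
    # NaN rows are ignored by nlargest, so no 'n/a' comparison is needed
    return df.nlargest(5, indicator)

max_search('col1')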
I would like to know how to get the unique combination of two column values when the same pair of values can appear in either order. Below is the dataframe.
I tried the code below, but my expected output is different:
df.groupby(['column1', 'column2'], as_index = False).agg({'expense' : 'sum'})
This is a variant of this question, but an important distinction is that it seems you don't care about the order of column1 or column2. Before I share the solution, here's the pseudocode:
Create an id column which we can use to find rows where the sets of column1 and column2 are the same
Apply the approach from the linked post to id.
Drop duplicates based on id
Here's my manual transcription of the data. In the future, please provide the sample data as text, instead of as a screenshot.
column1,column2,salary
ram,shyam,100
sita,geeta,500
geeta,sita,300
shyam,ram,600
sohan,mohan,200
mohan,sohan,400
And here's the code
>>> import pandas as pd
>>> df = pd.read_csv('data.csv')
>>> hash_func = lambda n: hash("-".join(sorted(n)))
>>> df['id'] = df[['column1','column2']].apply(hash_func, axis=1)
>>> df
column1 column2 salary id
0 ram shyam 100 7227562739062788100
1 sita geeta 500 6328366926112663723
2 geeta sita 300 6328366926112663723
3 shyam ram 600 7227562739062788100
4 sohan mohan 200 -3239226935758438599
5 mohan sohan 400 -3239226935758438599
>>> df['expense'] = df.groupby('id')['salary'].transform('sum')
>>> df
column1 column2 salary id expense
0 ram shyam 100 7227562739062788100 700
1 sita geeta 500 6328366926112663723 800
2 geeta sita 300 6328366926112663723 800
3 shyam ram 600 7227562739062788100 700
4 sohan mohan 200 -3239226935758438599 600
5 mohan sohan 400 -3239226935758438599 600
>>> df = df.drop_duplicates(subset=['id'])
>>> df
column1 column2 salary id expense
0 ram shyam 100 7227562739062788100 700
1 sita geeta 500 6328366926112663723 800
4 sohan mohan 200 -3239226935758438599 600
>>> df = df.drop(columns=['id','salary']) # some extra cleanup
>>> df
column1 column2 expense
0 ram shyam 700
1 sita geeta 800
4 sohan mohan 600
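One caveat worth noting: Python's hash() for strings is salted per interpreter session, so the id values above will come out different every time the script runs (they are only stable within a run). Since the id is just a grouping key, a reproducible alternative is to use the sorted, joined names directly; a minimal sketch:
>>> df['id'] = df[['column1','column2']].apply(lambda n: "-".join(sorted(n)), axis=1)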
I followed these steps
df['pairs'] = df['col1'] + '-' + df['col2']
Then apply a foo function to this column.
The idea of this function is to take the pairs column data and sort it based on the first character of each element in the pair.
For example, for the input ram-shyam or shyam-ram we get the output ram-shyam.
This is the foo function:
def foo(s):
    # split the pair back into its two names
    lst_s = s.split('-')
    # map each position to the first character of its name
    temp = {}
    for idx, name in enumerate(lst_s):
        temp[idx] = name[0]
    # order the positions by that first character
    temp = dict(sorted(temp.items(), key=lambda item: item[1]))
    # rebuild the pair in the sorted order
    final = []
    for key in temp.keys():
        final.append(lst_s[key])
    return '-'.join(final)
Now apply this function to the pairs column:
df['unique-pair'] = df['pairs'].apply(foo)
The output now looks like this:
col1 col2 salary unique-pair pairs
0 ram shyam 100 ram-shyam ram-shyam
1 sita gita 500 gita-sita sita-gita
2 gita sita 300 gita-sita gita-sita
3 shyam ram 600 ram-shyam shyam-ram
4 sohan mohan 200 mohan-sohan sohan-mohan
5 mohan sohan 400 mohan-sohan mohan-sohan
Now you can do a groupby:
df.groupby(['unique-pair']).agg({'salary': 'sum'})
The final output is:
salary
unique-pair
gita-sita 800
mohan-sohan 600
ram-shyam 700
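A note on this approach: foo orders the pair by the first character of each name only, so two names that share a first letter could still come out in either order. Sorting on the full names avoids that edge case; a minimal one-line sketch:
df['unique-pair'] = df['pairs'].apply(lambda s: '-'.join(sorted(s.split('-'))))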
You can sort the first and second columns row-wise so that a,b and b,a are treated as the same key in groupby.
Since DataFrame.sort() is deprecated, we can use numpy's sort and re-create a new dataframe.
Assuming the following csv_file:
column1,column2,salary
a,b,1
c,b,3
b,a,10
b,c,30
d,e,99
We can do it as follows:
import pandas as pd
import numpy as np

df = pd.read_csv("csvfile.csv")
print("Original:\n ", df.head())

# sort each row's pair so that (a, b) and (b, a) become the same key
sorted_pairs = pd.DataFrame(np.sort(df[df.columns[:2]], axis=1), columns=df.columns[:2])

# stitch the sorted pair back together with the salary column and aggregate
grouped = pd.concat([sorted_pairs, df["salary"]], axis=1).groupby(["column1", "column2"]).sum()

print("\nGrouped sum:\n")
print(grouped)
The output is shown below:
Original:
column1 column2 salary
0 a b 1
1 c b 3
2 b a 10
3 b c 30
4 d e 99
Grouped sum:
salary
column1 column2
a b 11
b c 33
d e 99
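If you would rather have flat columns than a MultiIndex in the result (the question's own groupby call used as_index=False), a minimal variation on the sketch above:
flat = pd.concat([sorted_pairs, df["salary"]], axis=1).groupby(["column1", "column2"], as_index=False).sum()
print(flat)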
I have a big df with tariffs for aviation routes. You can look up data for a specific route, for example by airport of origin, airport of destination, aircraft, and month.
Plain example of df:
data = {'orig':['A','A','A','B','B','B'],
'dest':['C','C','C','D','D','D'],
'currency':['RUB','USD','RUB','USD','RUB','USD'],
'tarif':[100,10,120,20,150,30]}
df=pd.DataFrame(data)
df
orig dest currency tarif
0 A C RUB 100
1 A C USD 10
2 A C RUB 120
3 B D USD 20
4 B D RUB 150
5 B D USD 30
I have df2, which contains the aviation plan for a specific company. It holds the same kind of info, like month, orig, dest, aircraft.
Plain example of df2:
data2={'orig':['A','B'],
'dest':['C','D']}
df2=pd.DataFrame(data2)
df2
orig dest
0 A C
1 B D
Task: for each row in df2, sum the tarif values that match that row's conditions.
What I expect:
orig dest RUB USD
0 A C 220 10
1 B D 150 50
Thanks.
Hmmm
df = df.groupby(["orig", "dest", "currency"]).agg(sum).unstack()
df.columns = ['_'.join(col).strip() for col in df.columns.values]
df
gives me
tarif_RUB tarif_USD
orig dest
A C 220 10
B D 150 50
Which is your desired result, but I haven't looked at df2 yet, so I am afraid you will have to describe or extend your example better before I can do anything with df2.
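If df2 simply lists the routes of interest, one possible follow-up (assuming orig and dest are the join keys, and reusing the grouped frame built above) is to merge it back onto df2:
result = df2.merge(df.reset_index(), on=["orig", "dest"], how="left")
result
This gives one row per df2 route with its tarif_RUB and tarif_USD totals.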
I am looking to increase the speed of an operation in pandas, and I have learned that it is generally best to do so via vectorization. The problem I need help with is vectorizing the following operation.
Setup:
df1 = a table with a date-time column, and city column
df2 = another (considerably larger) table with a date-time column, and city column
The Operation:
for i, row in df2.iterrows():
    for x, row2 in df1.iterrows():
        if row['date-time'] - row2['date-time'] > pd.Timedelta('8 hours') and row['city'] == row2['city']:
            df2.at[i, 'result'] = True
            break
As you might imagine, this operation is insanely slow on any dataset of a decent size. I am also just beginning to learn pandas vector operations and would like some help in figuring out a more optimal way to solve this problem.
I think what you need is merge() with numpy.where() to achieve the same result.
Since you don't have a reproducible sample in your question, kindly consider this:
>>> df1 = pd.DataFrame({'time':[24,20,15,10,5], 'city':['A','B','C','D','E']})
>>> df2 = pd.DataFrame({'time':[2,4,6,8,10,12,14], 'city':['A','B','C','F','G','H','D']})
>>> df1
time city
0 24 A
1 20 B
2 15 C
3 10 D
4 5 E
>>> df2
time city
0 2 A
1 4 B
2 6 C
3 8 F
4 10 G
5 12 H
6 14 D
From what I understand, you only need the rows in your df2 whose city value also appears in df1, and where the times differ by more than 8 hours.
To do that, we need to merge on your city column:
>>> new_df = df2.merge(df1, how = 'inner', left_on = 'city', right_on = 'city')
>>> new_df
time_x city time_y
0 2 A 24
1 4 B 20
2 6 C 15
3 14 D 10
time_x is basically the time from your df2 dataframe, and time_y is the one from your df1.
Now we need to check the difference between those times and retain the rows where it is greater than 8, using numpy.where() to flag them so we can filter later:
>>> new_df['flag'] = np.where(new_df['time_y'] - new_df['time_x'] > 8, ['Retain'], ['Remove'])
>>> new_df
time_x city time_y flag
0 2 A 24 Retain
1 4 B 20 Retain
2 6 C 15 Retain
3 14 D 10 Remove
Now that you have that, you can simply filter new_df by the flag column, dropping the flag column from the final output like so:
>>> final_df = new_df[new_df['flag'].isin(['Retain'])][['time_x', 'city', 'time_y']]
>>> final_df
time_x city time_y
0 2 A 24
1 4 B 20
2 6 C 15
And there you go, no looping needed. Hope this helps :D
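If you instead need the boolean result column on df2 itself, as in the original loop, here is a possible sketch (mirroring the time_y - time_x > 8 comparison used above; swap the operands if your real data subtracts the other way around):
>>> matched = df2.reset_index().merge(df1, on='city', suffixes=('_df2', '_df1'))
>>> hits = matched.loc[matched['time_df1'] - matched['time_df2'] > 8, 'index'].unique()
>>> df2['result'] = df2.index.isin(hits)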
I have a data frame which consists of 5 columns. I need to extract the first 3 columns, and I need to create two new columns from the last two columns.
column A
column B
column c
column D
column E
df[df[1:3]] will give me the first three columns,
but from columns D and E I need to extract only the last two characters.
How can I do that extraction in the same code?
Broadly, I see that your dataframe has 5 columns, say A B C D E, and in this DataFrame you want to add two new columns based on D and E. This is what your question suggests.
df[['A', 'B', 'C', 'D', 'E']]  # the dataframe's five columns
x = df.D
y = df.E
df['F'] = x # or just do df.F = df.D
df['G'] = y # or df.G = df.E
You will have F and G new columns 'based on' (well technically equal-to) your last two columns D and E.
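If the new columns should hold just the last two characters of D and E, as the question asks, a minimal sketch assuming D and E contain strings:
df['F'] = df['D'].str[-2:]  # last two characters of column D
df['G'] = df['E'].str[-2:]  # last two characters of column E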
IIUC, this should do the trick
df1 = df.iloc[:, :3]   # first three columns
df2 = df.iloc[:, 3:].copy()   # remaining columns
a = df2.columns[:]
df2[a + '_extracted'] = df2[a].apply(lambda x: x.str[-2:])   # keep only the last two characters
Input
script call_put strike animals codes
a 280 280 rat nill
a 260 260 cat fill
a 275 275 pat dill
b 280 280 mat grill
b 285 285 bat shrill
Output
df1
script call_put strike
0 a 280 280
1 a 260 260
2 a 275 275
3 b 280 280
4 b 285 285
df2
animals codes animals_extracted codes_extracted
0 rat nill at ll
1 cat fill at ll
2 pat dill at ll
3 mat grill at ll
4 bat shrill at ll
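If you then want everything back in one frame, one option is to concatenate the pieces column-wise, e.g. keeping only the extracted versions next to the first three columns:
result = pd.concat([df1, df2[['animals_extracted', 'codes_extracted']]], axis=1)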
I have columns like names, FDR%, age, FCR%, income. I want to select the columns whose names contain '%' and multiply them by 100. Finally, I would like to return the entire dataframe with the '%' column values changed. I tried this:
df_final=df_1.filter(like='%', axis=1).apply(lambda x:x*100)
df_final
This just returns the subset, i.e. only the columns operated on (FDR% and FCR%). I need the entire dataframe returned with the corresponding changes.
Also, is there a better method of achieving the same?
You can select the columns from the DataFrame returned by filter and multiply them by 100:
df_1 = pd.DataFrame({
'A':list('abcdef'),
'FDR%':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'FCR%':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
})
cols = df_1.filter(like='%').columns
df_1[cols] *= 100
print (df_1)
A FDR% C FCR% E
0 a 400 7 100 5
1 b 500 8 300 3
2 c 400 9 500 6
3 d 500 4 700 9
4 e 500 2 100 2
5 f 400 3 0 4
Or build a mask on the column names with str.contains or str.endswith and select the columns with DataFrame.loc:
mask = df_1.columns.str.contains('%')
#alternative
#mask = df_1.columns.str.endswith('%')
df_1.loc[:, mask] *= 100
Notice:
apply is a bad choice for the multiplication here, because it loops under the hood and is slow. The fast solution is to multiply by the scalar directly.