I have the following data frame:
import pandas as pd
pandas_df = pd.DataFrame([
["SEX", "Male"],
["SEX", "Female"],
["EXACT_AGE", None],
["Country", "Afghanistan"],
["Country", "Albania"]],
columns=['FullName', 'ResponseLabel']
)
Now I need to add a sort order to this dataframe. Each new "FullName" should increment it by 100, and each consecutive "ResponseLabel" within a given "FullName" should increment it by 1 (for that specific "FullName"). So I basically create two different sort orders that I sum later on.
pandas_full_name_increment = pandas_df[['FullName']].drop_duplicates()
pandas_full_name_increment = pandas_full_name_increment.reset_index()
pandas_full_name_increment.index += 1
pandas_full_name_increment['SortOrderFullName'] = pandas_full_name_increment.index * 100
pandas_df['SortOrderResponseLabel'] = pandas_df.groupby(['FullName']).cumcount() + 1
pandas_df = pd.merge(pandas_df, pandas_full_name_increment, on = ['FullName'], how = 'left')
pandas_df['SortOrder'] = pandas_df['SortOrderFullName'] + pandas_df['SortOrderResponseLabel']
Result:
FullName ResponseLabel SortOrderResponseLabel index SortOrderFullName SortOrder
0 SEX Male 1 0 100 101
1 SEX Female 2 0 100 102
2 EXACT_AGE NULL 1 2 200 201
3 Country Afghanistan 1 3 300 301
4 Country Albania 2 3 300 302
The result in my "SortOrder" column is correct, but I wonder whether there is a better, more idiomatic pandas approach.
Thank you!
One way to do this is to combine ngroup and cumcount:
name_group = pandas_df.groupby('FullName')
pandas_df['sort_order'] = (
name_group.ngroup(ascending=False).add(1).mul(100) +
name_group.cumcount().add(1)
)
Output
FullName ResponseLabel sort_order
0 SEX Male 101
1 SEX Female 102
2 EXACT_AGE None 201
3 Country Afghanistan 301
4 Country Albania 302
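Note that ngroup(ascending=False) matches the expected numbering here only because the three names happen to appear in reverse alphabetical order (groupby sorts the keys by default). A minimal sketch, assuming the goal is to number groups in order of first appearance: pass sort=False to groupby so that ngroup follows the row order. The frame below is a hypothetical reordering of the sample data, chosen to show the difference.

```python
import pandas as pd

# Hypothetical reordering of the sample data: names are NOT in reverse alphabetical order
df = pd.DataFrame(
    [["Country", "Afghanistan"],
     ["Country", "Albania"],
     ["SEX", "Male"],
     ["SEX", "Female"],
     ["EXACT_AGE", None]],
    columns=["FullName", "ResponseLabel"],
)

# sort=False keeps groups in order of first appearance, so ngroup() numbers them 0, 1, 2 in row order
grouped = df.groupby("FullName", sort=False)
df["sort_order"] = (grouped.ngroup() + 1) * 100 + grouped.cumcount() + 1
print(df)
```

With this variant the numbering no longer depends on how the names sort alphabetically.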
I have two dataframes, and I need to update the first one based on values in the second one where a match exists. In the sample below, a student_Id in df_student should be replaced with the corresponding 'new_id' whenever it appears in the 'old_id' column of df_updated_id.
import pandas as pd
import numpy as np
student = {
'Name': ['John', 'Jay', 'sachin', 'Geetha', 'Amutha', 'ganesh'],
'gender': ['male', 'male', 'male', 'female', 'female', 'male'],
'math score': [50, 100, 70, 80, 75, 40],
'student_Id': ['1234', '6788', 'xyz', 'abcd', 'ok83', '234v'],
}
updatedId = {
'old_id' : ['ok83', '234v'],
'new_id' : ['83ko', 'v432'],
}
df_student = pd.DataFrame(student)
df_updated_id = pd.DataFrame(updatedId)
print(df_student)
print(df_updated_id)
# Method with np.where
for index, row in df_updated_id.iterrows():
    df_student['student_Id'] = np.where(df_student['student_Id'] == row['old_id'], row['new_id'], df_student['student_Id'])
# print(df_student)
# Method with dataframe.mask (assigned back instead of a chained inplace=True)
for index, row in df_updated_id.iterrows():
    df_student['student_Id'] = df_student['student_Id'].mask(df_student['student_Id'] == row['old_id'], row['new_id'])
print(df_student)
Both methods above work and yield the correct result:
Name gender math score student_Id
0 John male 50 1234
1 Jay male 100 6788
2 sachin male 70 xyz
3 Geetha female 80 abcd
4 Amutha female 75 ok83
5 ganesh male 40 234v
old_id new_id
0 ok83 83ko
1 234v v432
Name gender math score student_Id
0 John male 50 1234
1 Jay male 100 6788
2 sachin male 70 xyz
3 Geetha female 80 abcd
4 Amutha female 75 83ko
5 ganesh male 40 v432
However, the actual student data has about 500,000 rows and updated_id has 6,000 rows.
I therefore run into performance issues because the loop is very slow.
I added a simple timer to see how the run time grows with the number of df_updated_id rows processed:
100 rows - numpy Time=3.9020769596099854; mask Time=3.9169061183929443
500 rows - numpy Time=20.42293930053711; mask Time=19.768696784973145
1000 rows - numpy Time=40.06309795379639; mask Time=37.26559829711914
My question is whether I can optimize this using a merge (join), or otherwise get rid of iterrows. I tried to follow the two posts below but failed to get it to work:
Replace dataframe column values based on matching id in another dataframe, and How to iterate over rows in a DataFrame in Pandas
Please advise.
You can also try with map:
df_student['student_Id'] = (
df_student['student_Id'].map(df_updated_id.set_index('old_id')['new_id'])
.fillna(df_student['student_Id'])
)
print(df_student)
# Output
Name gender math score student_Id
0 John male 50 1234
1 Jay male 100 6788
2 sachin male 70 xyz
3 Geetha female 80 abcd
4 Amutha female 75 83ko
5 ganesh male 40 v432
Update
I believe the updated_id isn't unique, so I need to further pre-process the data.
In that case, you could drop the duplicates first, assuming that the last value (keep='last') is the most recent one for a given old_id:
sr = (df_updated_id.drop_duplicates('old_id', keep='last')
      .set_index('old_id')['new_id'])
df_student['student_Id'] = (df_student['student_Id'].map(sr)
                            .fillna(df_student['student_Id']))
Note: this is exactly what #BENY's answer does. Because it creates a dict, only the last occurrence of an old_id is kept. However, if you want to keep the first value that appears, that code doesn't work. With drop_duplicates, you can adjust the keep parameter.
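To make the effect of the keep parameter concrete, here is a small sketch. The lookup table with a repeated old_id and the helper name remap are hypothetical, introduced only for illustration.

```python
import pandas as pd

df_student = pd.DataFrame({"student_Id": ["1234", "ok83", "234v"]})

# Hypothetical lookup table in which 'ok83' appears twice with different new ids
df_updated_id = pd.DataFrame({
    "old_id": ["ok83", "234v", "ok83"],
    "new_id": ["83ko", "v432", "ko38"],
})

def remap(ids, lookup, keep):
    """Map ids through lookup, resolving duplicate old_id rows with `keep`."""
    mapping = lookup.drop_duplicates("old_id", keep=keep).set_index("old_id")["new_id"]
    # Ids without a match become NaN after map; fillna restores the original value
    return ids.map(mapping).fillna(ids)

first = remap(df_student["student_Id"], df_updated_id, keep="first")
last = remap(df_student["student_Id"], df_updated_id, keep="last")
print(first.tolist())
print(last.tolist())
```

Only the entry for 'ok83' differs between the two results, depending on which duplicate row is kept.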
We can just use replace:
df_student.replace({'student_Id':df_updated_id.set_index('old_id')['new_id']},inplace=True)
df_student
Out[337]:
Name gender math score student_Id
0 John male 50 1234
1 Jay male 100 6788
2 sachin male 70 xyz
3 Geetha female 80 abcd
4 Amutha female 75 83ko
5 ganesh male 40 v432
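Since the question asks about a merge, here is a minimal sketch of a join-based variant. It assumes old_id is unique in the lookup table (a left merge would otherwise duplicate student rows) and uses a shortened, hypothetical version of the sample data.

```python
import pandas as pd

# Shortened version of the sample data, for illustration only
df_student = pd.DataFrame({
    "Name": ["John", "Amutha", "ganesh"],
    "student_Id": ["1234", "ok83", "234v"],
})
df_updated_id = pd.DataFrame({
    "old_id": ["ok83", "234v"],
    "new_id": ["83ko", "v432"],
})

# Left join keeps every student; new_id is NaN where no old_id matched
merged = df_student.merge(
    df_updated_id, how="left", left_on="student_Id", right_on="old_id"
)
# Prefer new_id where present, otherwise keep the existing student_Id
merged["student_Id"] = merged["new_id"].fillna(merged["student_Id"])
result = merged.drop(columns=["old_id", "new_id"])
print(result)
```

Like map, this does the lookup in one vectorized pass instead of one pass per row of the lookup table.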
I'll try to illustrate my problem with the simple example below. I have the main csv shown below and I am trying to split it into two or more csv files based on the column headers, keeping the unique id column intact in every csv file.
Below is the code I am trying, but I am not quite getting the result.
import pandas as pd
df = pd.read_csv('abc.csv')
df[['id','name','age']] = df['csv1'].str.split(' ', expand=True)
csv
id name age color Gender
0 101 Jack 23 white M
1 102 Mary 25 black F
2 103 Tom 24 brown M
Output required
csv1
id name age
0 101 Jack 23
1 102 Mary 25
2 103 Tom 24
csv2 -
id color Gender
0 101 white M
1 102 black F
2 103 brown M
UPDATE
I found a better approach with np.array_split.
I used this example df:
x y R TR x_c y_c xxx yyy RRR TTTR xxx_c yyy_c
id
1256780.0 13989 6241 6.689222 20.986341 14050.83 6315.33 213989 36241 46.689222 520.986341 614050.83 76315.33
12000.0 14013 6278 53.152036 0.000000 14060.00 6288.00 214013 36278 453.152036 5.000000 614060.00 76288.00
1100.0 14111 6379 87.598357 5.000000 14070.55 7000.00 214111 36379 487.598357 55.000000 614070.55 76288.00
which has 12 columns.
import numpy as np
# the 4 means: split the df into 4 evenly sized chunks (by columns, since axis=1)
chunks = np.array_split(df, 4, axis=1)
chunks is a list containing the separate dataframes.
Output:
# chunks[0]
x y R
id
1256780.0 13989.0 6241.0 6.689222
12000.0 14013.0 6278.0 53.152036
1100.0 14111.0 6379.0 87.598357
# chunks[1]
TR x_c y_c
id
1256780.0 20.986341 14050.83 6315.33
12000.0 0.000000 14060.00 6288.00
1100.0 5.000000 14070.55 7000.00
# chunks[2]
xxx yyy RRR
id
1256780.0 213989.0 36241.0 46.689222
12000.0 214013.0 36278.0 453.152036
1100.0 214111.0 36379.0 487.598357
# chunks[3]
TTTR xxx_c yyy_c
id
1256780.0 520.986341 614050.83 76315.33
12000.0 5.000000 614060.00 76288.00
1100.0 55.000000 614070.55 76288.00
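Recent pandas versions may emit a deprecation warning when np.array_split is applied to a DataFrame directly. A minimal sketch of a variant that splits only the column labels and then selects each group; the 12-column frame and its column names are hypothetical stand-ins for the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 12 columns and 3 rows
df = pd.DataFrame(
    np.arange(36).reshape(3, 12),
    columns=[f"c{i}" for i in range(12)],
)

# Split the array of column labels into 4 groups, then select each group of columns
column_groups = np.array_split(df.columns.to_numpy(), 4)
chunks = [df[list(cols)] for cols in column_groups]

for chunk in chunks:
    print(chunk.columns.tolist())
```

The index is carried along unchanged in every chunk, as in the output shown above.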
Old answer:
You could calculate half of the columns and then use iloc to split the dataframes into two parts.
df = df.set_index('id')
half = len(df.columns)//2
df1, df2 = df.iloc[:,:half], df.iloc[:,half:]
df1 = df1.reset_index()
df2 = df2.reset_index()
Output:
#df1
id name age
0 101 Jack 23
1 102 Mary 25
2 103 Tom 24
#df2
id color Gender
0 101 white M
1 102 black F
2 103 brown M
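If the split should follow specific column names rather than column positions, a sketch along these lines may be easier to adapt. The layout dict and its keys are assumptions made for illustration; they are not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [101, 102, 103],
    "name": ["Jack", "Mary", "Tom"],
    "age": [23, 25, 24],
    "color": ["white", "black", "brown"],
    "Gender": ["M", "F", "M"],
})

# Hypothetical mapping: output name -> columns to include in addition to 'id'
layout = {"csv1": ["name", "age"], "csv2": ["color", "Gender"]}

# Every part keeps the 'id' column followed by its own columns
parts = {name: df[["id"] + cols] for name, cols in layout.items()}

for name, part in parts.items():
    # to_csv() without a path returns the CSV text; pass a filename to write a file instead
    print(name)
    print(part.to_csv(index=False))
```

This works for any number of output files, since each entry in the mapping yields one frame.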
I have a dataframe as shown below
Token Label StartID EndID
0 Germany Country 0 2
1 Berlin Capital 6 9
2 Frankfurt City 15 18
3 four million Number 21 24
4 Sweden Country 26 27
5 United Kingdom Country 32 34
6 ten million Number 40 45
7 London Capital 50 55
I am trying to select rows based on a condition, i.e. associate each row with the label Number with the closest Capital, here Berlin:
3 four million Number 21 24 -> 1 Berlin Capital 6 9
or something like:
df[row3] -> df [row1]
Pseudo logic:
First, find the rows with the label Number, then look for a row with the label Capital (the assumption is that the city is always two rows above or below). However, a Capital row always comes after a Country row.
What I have done so far:
import pandas as pd
columnsName = ['Token', 'Label', 'StartID', 'EndID']
df = pd.read_csv('resources/testcsv.csv', index_col= 0, skip_blank_lines=True, header=0)
print(df)
key_number = 'Number'
df_with_number = (df[df['Label'].str.lower().str.contains(r"\b{}\b".format(key_number), regex=True, case=False)])
print(df_with_number)
key_capital = 'Capital'
df_with_capitals = (df[df['Label'].str.lower().str.contains(r"\b{}\b".format(key_capital), regex=True, case=False)])
print(df_with_capitals)
key_country = 'Country'
df_with_country = (df[df['Label'].str.lower().str.contains(r"\b{}\b".format(key_country), regex=True, case=False)])
print(df_with_country)
The logic is to compare the indexes and then build the possible associations,
i.e.
df[row3] -> [ df [row1], df[row7]]
You could use merge_asof with the parameter direction='nearest', for example:
df_nb_cap = pd.merge_asof(df_with_number.reset_index(),
df_with_capitals.reset_index(),
on='index',
suffixes=('_nb', '_cap'), direction='nearest')
print (df_nb_cap)
index Token_nb Label_nb StartID_nb EndID_nb Token_cap Label_cap \
0 3 four_million Number 21 24 Berlin Capital
1 6 ten_million Number 40 45 London Capital
StartID_cap EndID_cap
0 6 9
1 50 55
from io import StringIO
import numpy as np
import pandas as pd
# adjusted sample data
s = """Token,Label,StartID,EndID
Germany,Country,0,2
Berlin,Capital,6,9
Frankfurt,City,15,18
four million,Number,21,24
Sweden,Country,26,27
United Kingdom,Country,32,34
ten million,Number,40,45
London,Capital,50,55
ten million,Number,40,45
ten million,Number,40,45"""
df = pd.read_csv(StringIO(s))
# create a mask for number where capital is 2 above or below
# and where country is three above number or one below number
mask = (df['Label'] == 'Number') & (((df['Label'].shift(2) == 'Capital') |
(df['Label'].shift(-2) == 'Capital')) &
((df['Label'].shift(3) == 'Country') |
(df['Label'].shift(-1) == 'Country')))
# create a mask for capital where number is 2 above or below
# and where country is one above capital
mask2 = (df['Label'] == 'Capital') & (((df['Label'].shift(2) == 'Number') |
(df['Label'].shift(-2) == 'Number')) &
(df['Label'].shift(1) == 'Country'))
# hstack your two masks and create a frame
new_df = pd.DataFrame(np.hstack([df[mask].to_numpy(), df[mask2].to_numpy()]))
print(new_df)
0 1 2 3 4 5 6 7
0 four million Number 21 24 Berlin Capital 6 9
I have a dataframe of samples, with a country column. The number of records in each country is:
d1.groupby("country").size()
country
Australia 21
Cambodia 58
China 280
India 133
Indonesia 195
Malaysia 138
Myanmar 51
Philippines 49
Singapore 1268
Taiwan 47
Thailand 273
Vietnam 288
How do I select, say, 100 random samples from each country, if that country has > 100 samples? (if the country has <= 100 samples, do nothing). Currently, I do this for, say, Singapore:
import random
names_nonsg_ls = []
names_sg_ls = []
# if the country is not SG, add it to names_nonsg_ls.
# else, add it to names_sg_ls, which will be subsampled later.
for index, row in d0.iterrows():
    if str(row["country"]) != "Singapore":
        names_nonsg_ls.append(str(row["header"]))
    else:
        names_sg_ls.append(str(row["header"]))
# Select 100 random names from names_sg_ls
names_sg_ls = random.sample(names_sg_ls, 100)
# Form the list of names to retain
names_ls = names_nonsg_ls + names_sg_ls
# create new dataframe
d1 = d0.loc[d0["header"].isin(names_ls)]
But manually creating a new list for each country that has > 100 names is just poor form, not to mention that I first have to manually pick out the countries with > 100 names.
You can group by country, then sample based on the group size:
d1.groupby("country", group_keys=False).apply(lambda g: g.sample(100) if len(g) > 100 else g)
Example:
df = pd.DataFrame({
'A': ['a','a','a','a','b','b','b','c','d'],
'B': list(range(9))
})
df.groupby('A', group_keys=False).apply(lambda g: g.sample(3) if len(g) > 3 else g)
# A B
#2 a 2
#0 a 0
#1 a 1
#4 b 4
#5 b 5
#6 b 6
#7 c 7
#8 d 8
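An alternative sketch that avoids apply: shuffle all rows once, then keep at most n rows per group with groupby(...).head(n). The random_state value is an arbitrary choice made here so the example is reproducible.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["a", "a", "a", "a", "b", "b", "b", "c", "d"],
    "B": list(range(9)),
})

# Shuffle all rows, then keep at most 3 rows per group;
# groups with 3 or fewer rows are returned whole
capped = df.sample(frac=1, random_state=0).groupby("A").head(3)

# Restore the original row order
capped = capped.sort_index()
print(capped)
```

Because the rows are shuffled before head is taken, the rows kept from each oversized group are a random subset.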
The ordering of my age, height and weight columns is changing with each run of the code. I need to keep the order of my agg columns static because I ultimately refer to this output file according to the column locations. What can I do to make sure age, height and weight are output in the same order every time?
d = pd.read_csv(input_file, na_values=[''])
df = pd.DataFrame(d)
df.index_col = ['name', 'address']
df_out = df.groupby(df.index_col).agg({'age':np.mean, 'height':np.sum, 'weight':np.sum})
df_out.to_csv(output_file, sep=',')
I think you can select the columns in the order you need after aggregating:
df_out = (df.groupby(df.index_col)
            .agg({'age':np.mean, 'height':np.sum, 'weight':np.sum})[['age','height','weight']])
You can also pass function names as strings or Python built-ins instead of NumPy functions:
df_out = (df.groupby(df.index_col)
            .agg({'age':'mean', 'height':sum, 'weight':sum})[['age','height','weight']])
Sample:
df = pd.DataFrame({'name':['q','q','a','a'],
'address':['a','a','s','s'],
'age':[7,8,9,10],
'height':[1,3,5,7],
'weight':[5,3,6,8]})
print (df)
address age height name weight
0 a 7 1 q 5
1 a 8 3 q 3
2 s 9 5 a 6
3 s 10 7 a 8
df.index_col = ['name', 'address']
df_out = (df.groupby(df.index_col)
            .agg({'age':'mean', 'height':sum, 'weight':sum})[['age','height','weight']])
print (df_out)
age height weight
name address
a s 9.5 12 14
q a 7.5 4 8
EDIT by suggestion: add reset_index; as_index=False does not work here if you need the index values too:
df_out = (df.groupby(df.index_col)
            .agg({'age':'mean', 'height':sum, 'weight':sum})[['age','height','weight']]
            .reset_index())
print (df_out)
name address age height weight
0 a s 9.5 12 14
1 q a 7.5 4 8
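In pandas 0.25 and later, named aggregation is another way to pin the column order, since the output columns follow the order of the keyword arguments. A minimal sketch on the sample data above:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["q", "q", "a", "a"],
    "address": ["a", "a", "s", "s"],
    "age": [7, 8, 9, 10],
    "height": [1, 3, 5, 7],
    "weight": [5, 3, 6, 8],
})

# Each keyword is an output column: name=(source column, aggregation)
df_out = (
    df.groupby(["name", "address"], as_index=False)
      .agg(age=("age", "mean"), height=("height", "sum"), weight=("weight", "sum"))
)
print(df_out)
```

The group keys come first, followed by age, height and weight in the order they are listed.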
If you care mostly about the order when written to a file, and not while it's still in a DataFrame object, you can set the columns parameter of the to_csv() method:
>>> df = pd.DataFrame(
{'age': [28,63,28,45],
'height': [183,156,170,201],
'weight': [70.2, 62.5, 65.9, 81.0],
'name': ['Kim', 'Pat', 'Yuu', 'Sacha']},
columns=['name','age','weight', 'height'])
>>> df
name age weight height
0 Kim 28 70.2 183
1 Pat 63 62.5 156
2 Yuu 28 65.9 170
3 Sacha 45 81.0 201
>>> df_out = df.groupby(['age'], as_index=False).agg(
{'weight': sum, 'height': sum})
>>> df_out
age height weight
0 28 353 136.1
1 45 201 81.0
2 63 156 62.5
>>> df_out.to_csv('out.csv', sep=',', columns=['age','height','weight'])
out.csv then looks like this:
,age,height,weight
0,28,353,136.10000000000002
1,45,201,81.0
2,63,156,62.5