I am a beginner in Python and am trying to solve the following problem.
I have a CSV file:
name, mark
Anna,24
John,19
Mike,22
Monica,20
Alex, 17
Daniel, 26
And an xls file:
name, group
John, red
Anna, blue
Monica, blue
Mike, yellow
Alex, red
I am trying to get the result:
group, mark
Red, 36
Blue, 44
Yellow, 22
The number in result shows the total mark for the whole group.
I tried to find similar problems but was not successful, and I do not have enough experience to work out exactly what I need to do or which commands to use.
Use pd.read_csv with DataFrame.merge and GroupBy.sum:
In [89]: df1 = pd.read_csv('file1.csv')
In [89]: df1
Out[89]:
name mark
0 Anna 24
1 John 19
2 Mike 22
3 Monica 20
4 Alex 17
5 Daniel 26
In [90]: df2 = pd.read_csv('file2.csv')
In [90]: df2
Out[90]:
name group
0 John red
1 Anna blue
2 Monica blue
3 Mike yellow
4 Alex red
In [94]: df = df1.merge(df2).groupby('group').sum().reset_index()
In [95]: df
Out[95]:
group mark
0 blue 44
1 red 36
2 yellow 22
EDIT: If you have other columns which you don't want summed, aggregate only the mark column:
In [284]: df1.merge(df2).groupby('group').agg({'mark': 'sum'}).reset_index()
Out[284]:
group mark
0 blue 44
1 red 36
2 yellow 22
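Putting the pieces together as one runnable sketch, with the file contents inlined. For the real files, pd.read_csv('file1.csv', skipinitialspace=True) would absorb the stray spaces after the commas, and pd.read_excel('file2.xls') reads the xls (assuming an engine such as xlrd or openpyxl is installed); the inline frames below just stand in for those reads:

```python
import pandas as pd

# Stand-ins for file1.csv and file2.xls from the question
df1 = pd.DataFrame({"name": ["Anna", "John", "Mike", "Monica", "Alex", "Daniel"],
                    "mark": [24, 19, 22, 20, 17, 26]})
df2 = pd.DataFrame({"name": ["John", "Anna", "Monica", "Mike", "Alex"],
                    "group": ["red", "blue", "blue", "yellow", "red"]})

# Inner merge drops Daniel (he has no group), then sum marks per group
result = df1.merge(df2, on="name").groupby("group", as_index=False)["mark"].sum()
print(result)
```

Note that the default inner merge silently drops names missing from either file; pass how='left' to df1.merge if you want to keep them.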
Related
I have two dataframes df1 and df2 with different row counts but the same columns. The ID column is common to both dataframes. I want to write the differences to a text file. For example:
df1:
ID Name Age Profession sex
1 Tom 20 engineer M
2 nick 21 doctor M
3 krishi 19 lawyer F
4 jacky 18 dentist F
df2:
ID Name Age Profession sex
1 Tom 20 plumber M
2 nick 21 doctor M
3 krishi 23 Analyst F
4 jacky 18 dentist F
The resultant text file should look like:
ID : 1
Profession_old Profession_new
engineer plumber
ID : 3
Age_old Age_new Profession_old Profession_new
19 23 lawyer Analyst
You can use compare and a loop:
df3 = df1.set_index('ID').compare(df2.set_index('ID'))
df3.columns = (df3.rename({'self': 'old', 'other': 'new'}, level=1, axis=1)
                  .columns.map('_'.join)
               )

for id, row in df3.iterrows():
    print(f'ID : {id}')
    print(row.dropna().to_frame().T.to_string(index=False))
    print()
output:
ID : 1
Profession_old Profession_new
engineer plumber
ID : 3
Age_old Age_new Profession_old Profession_new
19.0 23.0 lawyer Analyst
NB. using print here for the demo; to write to a file instead, open it in write mode and move the loop inside:
with open('file.txt', 'w') as f:
    for id, row in df3.iterrows():
        f.write(f'ID : {id}\n')
        f.write(row.dropna().to_frame().T.to_string(index=False))
        f.write('\n\n')
You could also directly use df3:
Age_old Age_new Profession_old Profession_new
ID
1 NaN NaN engineer plumber
3 19.0 23.0 lawyer Analyst
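For completeness, here is the whole approach as a self-contained sketch, with the sample frames rebuilt inline and io.StringIO standing in for the output file:

```python
import io
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'Name': ['Tom', 'nick', 'krishi', 'jacky'],
                    'Age': [20, 21, 19, 18],
                    'Profession': ['engineer', 'doctor', 'lawyer', 'dentist'],
                    'sex': ['M', 'M', 'F', 'F']})
df2 = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'Name': ['Tom', 'nick', 'krishi', 'jacky'],
                    'Age': [20, 21, 23, 18],
                    'Profession': ['plumber', 'doctor', 'Analyst', 'dentist'],
                    'sex': ['M', 'M', 'F', 'F']})

# compare keeps only differing cells; identical rows/columns are dropped
df3 = df1.set_index('ID').compare(df2.set_index('ID'))
df3.columns = df3.columns.map(lambda c: f'{c[0]}_{"old" if c[1] == "self" else "new"}')

buf = io.StringIO()  # stand-in for open('file.txt', 'w')
for idx, row in df3.iterrows():
    buf.write(f'ID : {idx}\n')
    buf.write(row.dropna().to_frame().T.to_string(index=False))
    buf.write('\n\n')
print(buf.getvalue())
```

DataFrame.compare requires pandas 1.1 or later.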
Here's a simple piece of code, similar to what I am doing. I'm trying to replace the value after a 1 with -1. But how would I do it when I don't know where the 1's are, in a dataframe with thousands of rows?
import pandas as pd
df = pd.DataFrame({'Name':['Craig', 'Davis', 'Anthony', 'Tony'], 'Age':[22, 27, 24, 33], 'Employed':[0, 1, 0, 0]})
df
I have this...
      Name  Age  Employed
0    Craig   22         0
1    Davis   27         1
2  Anthony   24         0
3     Tony   33         0
I want something like this, but done programmatically across thousands of rows:
      Name  Age  Employed
0    Craig   22         0
1    Davis   27         1
2  Anthony   24        -1
3     Tony   33         0
Use shift to get the next row after a 1:
df.loc[df['Employed'].shift() == 1, 'Employed'] = -1
print(df)
# Output
Name Age Employed
0 Craig 22 0
1 Davis 27 1
2 Anthony 24 -1
3 Tony 33 0
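As a runnable sketch on the sample data: shift() moves every value down one row, so comparing the shifted column to 1 flags exactly the rows that follow a 1, however many rows the dataframe has:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Craig', 'Davis', 'Anthony', 'Tony'],
                   'Age': [22, 27, 24, 33],
                   'Employed': [0, 1, 0, 0]})

# True for each row whose *previous* row had Employed == 1
mask = df['Employed'].shift() == 1
df.loc[mask, 'Employed'] = -1
print(df)
```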
I'm working on transforming a dataframe to show the top 3 earners.
The dataframe looks like this
data = {'Name': ['Allistair', 'Bob', 'Carrie', 'Diane', 'Allistair', 'Bob', 'Carrie','Evelyn'], 'Sale': [20, 21, 19, 18, 5, 300, 35, 22]}
df = pd.DataFrame(data)
print(df)
Name Sale
0 Allistair 20
1 Bob 21
2 Carrie 19
3 Diane 18
4 Allistair 5
5 Bob 300
6 Carrie 35
7 Evelyn 22
In my actual dataset I have several more columns and rows, and I want to get something like
Name Sale
0 Bob 321
1 Carrie 54
2 Allistair 25
Every variation I've tried doesn't quite get there, because I run into:
'Name' is both an index level and a column label, which is ambiguous.
Use groupby:
>>> df.groupby('Name').sum().sort_values('Sale', ascending=False)
Sale
Name
Bob 321
Carrie 54
Allistair 25
Evelyn 22
Diane 18
Thanks to @Andrej Kasely above,
df.groupby("Name")["Sale"].sum().nlargest(3)
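The ambiguity error in the question typically appears when 'Name' ends up as both the index and a column (for example after an earlier set_index). Passing as_index=False keeps 'Name' as a plain column and sidesteps it; a sketch on the sample data:

```python
import pandas as pd

data = {'Name': ['Allistair', 'Bob', 'Carrie', 'Diane',
                 'Allistair', 'Bob', 'Carrie', 'Evelyn'],
        'Sale': [20, 21, 19, 18, 5, 300, 35, 22]}
df = pd.DataFrame(data)

# as_index=False keeps 'Name' as a regular column, avoiding the
# "both an index level and a column label" ambiguity on sort_values
top3 = (df.groupby('Name', as_index=False)['Sale'].sum()
          .sort_values('Sale', ascending=False)
          .head(3)
          .reset_index(drop=True))
print(top3)
```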
I have a dataframe that is:
A
1 king, crab, 2008
2 green, 2010
3 blue
4 green no. 4
5 green, house
I want to split the dates out into:
A B
1 king, crab 2008
2 green 2010
3 blue
4 green no. 4
5 green, house
I can't split at the first instance of ", ", because that would give:
  A            B
1 king         crab, 2008
2 green        2010
3 blue
4 green no. 4
5 green        house
I can't split at the last instance of ", ", because that would give:
  A            B
1 king, crab   2008
2 green        2010
3 blue
4 green no. 4
5 green        house
I also can't split on numbers, because that would give:
  A            B
1 king, crab   2008
2 green        2010
3 blue
4 green no.    4
5 green, house
Is there some way to split on ", " followed by a 4-digit number that lies between two values? The range condition would be extra safety to filter out accidental 4-digit numbers that are clearly not years. For example:
Split by:
", " + (four digit number between 1000 - 2021)
Also appreciated are answers that split by:
", " + four digit number
Even better would be an answer that took into account that the number is ALWAYS at the end of the string.
Or you can just use Series.str.extract and str.replace:
df = pd.DataFrame({"A":["king, crab, 2008","green, 2010","blue","green no. 4","green, house"]})
df["year"] = df["A"].str.extract(r"(\d{4})")
df["A"] = df["A"].str.replace(r",\s\d{4}", "", regex=True)
print (df)
A year
0 king, crab 2008
1 green 2010
2 blue NaN
3 green no. 4 NaN
4 green, house NaN
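For the extra safety the question asks for, the pattern can be anchored to the end of the string, so only a trailing ", <4 digits>" is treated as a year (this uses the fact that the number is always at the end); a sketch:

```python
import pandas as pd

df = pd.DataFrame({"A": ["king, crab, 2008", "green, 2010", "blue",
                         "green no. 4", "green, house"]})

# $ anchors the match to the end of the string, so "green no. 4"
# (a single digit, not at a ", <dddd>" boundary) is left untouched
df["B"] = df["A"].str.extract(r",\s(\d{4})$", expand=False)
df["A"] = df["A"].str.replace(r",\s\d{4}$", "", regex=True)
print(df)
```

A numeric range check (e.g. keeping only 1000-2021) could then be applied to df["B"].astype(float) if stray 4-digit numbers are still a concern.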
import pandas as pd

list_dict_Input = [{'A': 'king, crab, 2008'},
                   {'A': 'green, 2010'},
                   {'A': 'green no. 4'},
                   {'A': 'green no. 4'}]

df = pd.DataFrame(list_dict_Input)

for row_Index in range(len(df)):
    text = df.iloc[row_Index]['A'].strip()
    last_4_Char = text[-4:]
    if last_4_Char.isdigit() and 1000 <= int(last_4_Char) <= 2021:
        df.at[row_Index, 'B'] = last_4_Char
print(df)
I have two dataframes DfMaster and DfError
DfMaster which looks like:
Id Name Building
0 4653 Jane Smith A
1 3467 Steve Jones B
2 34 Kim Lee F
3 4567 John Evans A
4 3643 Kevin Franks S
5 244 Stella Howard D
and DfError looks like
Id Name Building
0 4567 John Evans A
1 244 Stella Howard D
In DfMaster I would like to change the Building value for a record to DD if it appears in the DfError data-frame. So my desired output would be:
Id Name Building
0 4653 Jane Smith A
1 3467 Steve Jones B
2 34 Kim Lee F
3 4567 John Evans DD
4 3643 Kevin Franks S
5 244 Stella Howard DD
I am trying to use the following:
DfMaster.loc[DfError['Id'], 'Building'] = 'DD'
however I get an error:
KeyError: "None of [Int64Index([4567,244], dtype='int64')] are in the [index]"
What have I done wrong?
Try this using np.where:
import numpy as np
errors = list(DfError['Id'].unique())
DfMaster['Building'] = np.where(DfMaster['Id'].isin(errors), 'DD', DfMaster['Building'])
DataFrame.loc expects that you input an index or a Boolean series, not a value from a column.
I believe this should do the trick:
DfMaster.loc[DfMaster['Id'].isin(DfError['Id']), 'Building'] = 'DD'
Basically, it says: for every row whose Id value is present in DfError['Id'], set the value of 'Building' to 'DD'.
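A quick end-to-end check of that line on the sample frames (values typed in from the question):

```python
import pandas as pd

DfMaster = pd.DataFrame({'Id': [4653, 3467, 34, 4567, 3643, 244],
                         'Name': ['Jane Smith', 'Steve Jones', 'Kim Lee',
                                  'John Evans', 'Kevin Franks', 'Stella Howard'],
                         'Building': ['A', 'B', 'F', 'A', 'S', 'D']})
DfError = pd.DataFrame({'Id': [4567, 244],
                        'Name': ['John Evans', 'Stella Howard'],
                        'Building': ['A', 'D']})

# isin builds a Boolean mask over DfMaster's own index, which is what
# .loc expects -- unlike indexing with the raw Id values themselves
DfMaster.loc[DfMaster['Id'].isin(DfError['Id']), 'Building'] = 'DD'
print(DfMaster)
```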