I have a data frame like this, and I am adding some columns to it via a mapping and a simple calculation.
code month of entry name reports
0 JJ 20171002 Jason 14
1 MM 20171206 Molly 24
2 TT 20171208 Tina 31
3 JJ 20171018 Jake 22
4 AA 20090506 Amy 34
5 DD 20171128 Daisy 16
6 RR 20101216 River 47
7 KK 20171230 Kate 32
8 DD 20171115 David 14
9 JJ 20171030 Jack 10
10 NN 20171216 Nancy 28
What the code below does is select some rows, look up values from a dictionary, and insert a further column from a simple calculation. It works fine:
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy', 'Daisy', 'River', 'Kate', 'David', 'Jack', 'Nancy'],
'code' : ['JJ', 'MM', 'TT', 'JJ', 'AA', 'DD', 'RR', 'KK', 'DD', 'JJ', 'NN'],
'month of entry': ["20171002", "20171206", "20171208", "20171018", "20090506", "20171128", "20101216", "20171230", "20171115", "20171030", "20171216"],
'reports': [14, 24, 31, 22, 34, 16, 47, 32, 14, 10, 28]}
df = pd.DataFrame(data)
dict_hour = {'JasonJJ' : 3, 'MollyMM' : 6, 'TinaTT' : 2, 'JakeJJ' : 3, 'AmyAA' : 8, 'DaisyDD' : 6, 'RiverRR' : 4, 'KateKK' : 8, 'DavidDD' : 5, 'JackJJ' : 5, 'NancyNN' : 2}
wanted = ['JasonJJ', 'TinaTT', 'AmyAA', 'DaisyDD', 'KateKK']
df['name_code'] = df['name'].astype(str) + df['code'].astype(str)
df1 = df[df['name_code'].isin(wanted)]
df1['hour'] = df1['name_code'].map(dict_hour).astype(float)
df1['coefficient'] = df1['reports'] / df1['hour'] - 1
But the last two lines each raise the same warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
How can the code be improved accordingly? Thank you.
You need copy:
df1 = df[df['name_code'].isin(wanted)].copy()
If you modify values in df1 later, you will find that the modifications do not propagate back to the original data (df), and pandas raises this warning to alert you to that.
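For completeness, here is the tail of the question's snippet with the fix applied (same df, dict_hour and wanted as above); with the explicit copy, the two assignments no longer warn:
df['name_code'] = df['name'].astype(str) + df['code'].astype(str)
df1 = df[df['name_code'].isin(wanted)].copy()  # explicit, independent copy
df1['hour'] = df1['name_code'].map(dict_hour).astype(float)
df1['coefficient'] = df1['reports'] / df1['hour'] - 1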
I have a dataframe with the following sample data:
Product quantity sold
a 30
at 20
am 10
b 5
bn 7
bt 90
c 76
c1 67
ct 54
m 12
t 87
n 12
I want to group the products that start with "a" under a new product name "Art", those that start with "b" under "Brt", and those that start with "c" under "Crt", leaving products m, t and n as they are, giving something like the below:
Product quantity sold
Art 60
Brt 102
Crt 197
m 12
t 87
n 12
Since you have complex conditions, it might be easiest to just rename the ones you want and then group.
import pandas as pd
df = pd.DataFrame({'Product': ['a', 'at', 'am', 'b', 'bn', 'bt', 'c', 'c1', 'ct', 'm', 't', 'n'],
'quantity sold': [30, 20, 10, 5, 7, 90, 76, 67, 54, 12, 87, 12]})
df.loc[df['Product'].str.startswith('a'), 'Product'] = 'Art'
df.loc[df['Product'].str.startswith('b'), 'Product'] = 'Brt'
df.loc[df['Product'].str.startswith('c'), 'Product'] = 'Crt'
df.groupby('Product', as_index=False).sum()
Output
Product quantity sold
0 Art 60
1 Brt 102
2 Crt 197
3 m 12
4 n 12
5 t 87
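One caveat: the .loc assignments above rename the products in df itself. If the original Product names are still needed afterwards, apply the renaming to a copy first (df2 here is just an illustrative name):
df2 = df.copy()  # keep df intact; rename and group on the copy
df2.loc[df2['Product'].str.startswith('a'), 'Product'] = 'Art'
df2.loc[df2['Product'].str.startswith('b'), 'Product'] = 'Brt'
df2.loc[df2['Product'].str.startswith('c'), 'Product'] = 'Crt'
df2.groupby('Product', as_index=False).sum()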
You can also do it using str indexing and map with a dictionary:
grp = df['Product'].str[0].map({'a': 'Art', 'b': 'Brt', 'c': 'Crt'}).fillna(df['Product'])
df.groupby(grp)['quantity sold'].sum()
Output:
Product
Art 60
Brt 102
Crt 197
m 12
n 12
t 87
Name: quantity sold, dtype: int64
Here, str[0] is a shortcut for .str.get(0): it indexes the first character of each string. map then builds the desired groups, and fillna replaces the values that were not in the mapping with the original values from df['Product']. Lastly, we group by the newly created series.
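For reference, the intermediate grp series built from the sample data looks like this:
0     Art
1     Art
2     Art
3     Brt
4     Brt
5     Brt
6     Crt
7     Crt
8     Crt
9       m
10      t
11      n
Name: Product, dtype: object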
I am given the following pandas DataFrame:
d = {'ID': ['1169', '1234', '2456', '9567', '1234', '4321', '9567', '0169'], 'YEAR': ['2001', '2013', '2009', '1989', '2012', '2013', '2002', '2012'], 'VALUE': [8, 24, 50, 75, 3, 6, 150, 47]}
df = pd.DataFrame(data=d)
print(df)
ID YEAR VALUE
0 1169 2001 8
1 1234 2013 24
2 2456 2009 50
3 9567 1989 75
4 1234 2012 3
5 4321 2013 6
6 9567 2002 150
7 1169 2012 47
I now want to merge two rows of the DataFrame that have two different IDs, so that ultimately only one of them remains. The merge should only take place if the values in the "YEAR" column match, and the values in the "VALUE" column should be added.
The output should look like this:
ID YEAR VALUE
0 1169 2001 8
1 1234 2013 30
2 2456 2009 50
3 9567 1989 75
4 1234 2012 3
5 9567 2002 150
6 1169 2012 47
Rows 1 and 5 have been merged: row 5 is removed and row 1 remains with its previous ID, but the VALUEs of rows 1 and 5 have been added together.
I would like to specify later which two rows, or rather which two IDs, should be merged; one of the two should always remain. The two IDs to merge come from another function.
I experimented with groupby(), but I don't know how to merge two different IDs with it; I only managed it for identical values in the "ID" column, which looked like this:
df.groupby(['ID', 'YEAR'])['VALUE'].sum().reset_index(name ='VALUE')
Unfortunately, even after extensive searching, I have not found anything suitable. I would be very happy if someone can help me! I would like to apply the whole thing later to a much larger DataFrame with more rows. Thanks in advance and best regards!
Try this: just group on 'ID' and take the max YEAR and the sum of VALUE:
df.groupby('ID', as_index=False).agg({'YEAR':'max', 'VALUE':'sum'})
Output:
ID YEAR VALUE
0 0169 2012 47
1 1169 2001 8
2 1234 2013 27
3 2456 2009 50
4 4321 2013 6
5 9567 2002 225
Or group on YEAR and take the first ID:
df.groupby('YEAR', as_index=False).agg({'ID':'first', 'VALUE':'sum'})
Output:
YEAR ID VALUE
0 1989 9567 75
1 2001 1169 8
2 2002 9567 150
3 2009 2456 50
4 2012 1234 50
5 2013 1234 30
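As an aside: if the two IDs are known up front, the same merge can be done without locating individual rows. This is only a sketch (merge_ids is a hypothetical helper, not code from the thread): relabel the ID to be dropped wherever its YEAR matches the kept ID, then sum the duplicates.
import pandas as pd
d = {'ID': ['1169', '1234', '2456', '9567', '1234', '4321', '9567', '0169'], 'YEAR': ['2001', '2013', '2009', '1989', '2012', '2013', '2002', '2012'], 'VALUE': [8, 24, 50, 75, 3, 6, 150, 47]}
df = pd.DataFrame(d)
def merge_ids(df, keep_id, drop_id):
    # relabel drop_id rows as keep_id where a keep_id row with the same YEAR exists
    years = set(df.loc[df['ID'] == keep_id, 'YEAR'])
    mask = (df['ID'] == drop_id) & df['YEAR'].isin(years)
    out = df.copy()
    out.loc[mask, 'ID'] = keep_id
    # sort=False keeps the groups in their order of first appearance
    return out.groupby(['ID', 'YEAR'], as_index=False, sort=False)['VALUE'].sum()
merge_ids(df, '1234', '4321')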
Based on all the comments and the update to the question, it sounds like logic along these lines (maybe not this exact code) is required...
Try:
import pandas as pd
d = {'ID': ['1169', '1234', '2456', '9567', '1234', '4321', '9567', '0169'], 'YEAR': ['2001', '2013', '2009', '1989', '2012', '2013', '2002', '2012'], 'VALUE': [8, 24, 50, 75, 3, 6, 150, 47]}
df = pd.DataFrame(d)
df['ID'] = df['ID'].astype(int)

def correctRows(rows, i):
    # from the candidate row indexes, pick the one whose YEAR matches row i's;
    # returns None if no row matches
    for x in rows:
        if df.loc[x, 'YEAR'] == df.loc[i, 'YEAR']:
            return x
    return None

def rowRepr(row):
    # render a row as e.g. "ID 1234, YEAR 2013, VALUE 24"
    return ', '.join(f'{k} {v}' for k, v in row.items())

def mergeRows(a, b):
    # mutates the global df in place
    rowa = list(df[df['ID'] == a].index)
    rowb = list(df[df['ID'] == b].index)
    # if an ID occurs more than once, disambiguate by matching YEAR
    if len(rowa) > 1:
        rowa = correctRows(rowa, rowb[0])
    else:
        rowa = rowa[0]
    if len(rowb) > 1:
        if isinstance(rowa, list):
            rowb = correctRows(rowb, rowa[0])
        else:
            rowb = correctRows(rowb, rowa)
    else:
        rowb = rowb[0]
    print('Keeping:', rowRepr(df.loc[rowa]))
    print('Dropping:', rowRepr(df.loc[rowb]))
    # add the dropped row's VALUE to the kept row, then drop the row
    df.loc[rowa, 'VALUE'] = df.loc[rowa, 'VALUE'] + df.loc[rowb, 'VALUE']
    df.drop(df.index[rowb], inplace=True)
    df.reset_index(drop=True, inplace=True)
    return None

# Pass the two IDs: the first is kept, the second is dropped, but the second's
# VALUE is added to the first's.
# Note: df['ID'] was cast to int above, hence integers are required here.
# mergeRows(4321, 1234)
mergeRows(1234, 4321)
Outputs:
Keeping: ID 1234, YEAR 2013, VALUE 24
Dropping: ID 4321, YEAR 2013, VALUE 6
The frame now looks like:
ID YEAR VALUE
0 1169 2001 8
1 1234 2013 30 #<-- sum of 6 + 24
2 2456 2009 50
3 9567 1989 75
4 1234 2012 3
5 9567 2002 150
6 169 2012 47
I have ~200 million entries in a dictionary index_data:
index_data = {
    3396623046050748: [0, 1],
    3749192045350356: [2],
    4605074846433127: [3],
    112884719857303: [4],
    507466746864539: [5],
    .....
}
Each key is a CustID value, and each value is the list of row indexes of that CustID in df_data.
I have a DataFrame df_data:
CustID Score Number1 Number2 Phone
3396623046050748 2 2 3 0000
3396623046050748 6 2 3 0000
3749192045350356 1 56 23 2222
4605074846433127 67 532 321 3333
112884719857303 3 11 66 4444
507466746864539 7 22 96 5555
NOTE: For duplicated CustIDs, only the Score column differs between rows.
I want to create a new list of dicts, where Total_Score is the average Score per CustID and Number is Number2 divided by Number1:
result = [
{'CustID' :3396623046050748,
'Total_Score': 4,
'Number' : 1.5,
'Phone' : 0000
},
{'CustID' :3749192045350356,
'Total_Score': 1,
'Number' : 0.41,
'Phone' : 2222
},
{'CustID' :4605074846433127,
'Total_Score': 67,
'Number' : 0.6,
'Phone' : 3333
},
.........
]
My solution is to loop over the dictionary and use multiprocessing:
from multiprocessing import Process, Manager

def calculateTime(ns, value):
    # get the shared data inside each worker process
    df_data2 = ns.df_data
    result2 = ns.result
    # build a new frame from this CustID's row indexes in the old one
    df_sampleresult = df_data2.loc[value].reset_index(drop=True)
    # collect the values that need to go into the final result
    dict_sample = {}
    dict_sample['CustID'] = df_sampleresult['CustID'][0]
    dict_sample['Total_Score'] = df_sampleresult['Score'].mean()
    result2.append(dict_sample)
    ns.result = result2

if __name__ == '__main__':
    result = list()
    manager = Manager()
    ns = manager.Namespace()
    ns.df_data = df_data
    ns.result = result
    jobs = [Process(target=calculateTime, args=(ns, value))
            for key, value in index_data.items()]
    _ = [p.start() for p in jobs]
    _ = [p.join() for p in jobs]
But it's not working well: performance is slow and memory usage is high. Is my multiprocessing setup right? Is there another way to do this?
In [353]: df
Out[353]:
CustID Score Number1 Number2 Phone
0 3396623046050748 2 2 3 0000
1 3396623046050748 6 2 3 0000
2 3749192045350356 1 56 23 2222
3 4605074846433127 67 532 321 3333
4 112884719857303 3 11 66 4444
5 507466746864539 7 22 96 5555
In [351]: d = (df.groupby(['CustID', 'Phone', round(df.Number2.div(df.Number1), 2)])['Score']
     ...:        .mean()
     ...:        .reset_index(name='Total_Score')
     ...:        .rename(columns={'level_2': 'Number'})
     ...:        .to_dict('records'))
In [352]: d
Out[352]:
[{'CustID': 112884719857303, 'Phone': '4444', 'Number': 6.0, 'Total_Score': 3},
 {'CustID': 507466746864539, 'Phone': '5555', 'Number': 4.36, 'Total_Score': 7},
 {'CustID': 3396623046050748, 'Phone': '0000', 'Number': 1.5, 'Total_Score': 4},
 {'CustID': 3749192045350356, 'Phone': '2222', 'Number': 0.41, 'Total_Score': 1},
 {'CustID': 4605074846433127, 'Phone': '3333', 'Number': 0.6, 'Total_Score': 67}]
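The one-liner can also be spelled out step by step; naming the ratio series lets reset_index label the column directly, so the level_2 rename is no longer needed:
number = df['Number2'].div(df['Number1']).round(2).rename('Number')
result = (df.groupby(['CustID', 'Phone', number])['Score']
            .mean()
            .reset_index(name='Total_Score')
            .to_dict('records'))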
Say we have a dataframe like
names = ['bob', 'alice', 'bob', 'bob', 'ali', 'alice', 'bob', 'ali', 'moji', 'ali', 'moji', 'ali', 'bob', 'bob', 'bob']
times = [21 , 34, 37, 40, 55, 65, 67, 84, 88, 90 , 91, 97, 104,105, 108]
df = pd.DataFrame({'name' : names , 'time_of_action' : times})
We would like to group it by name and, within each group, measure times from 0; that is, in each group we want to subtract the group's minimum time_of_action from all times in that group. How could we do this systematically with pandas?
If I understand correctly, you want this:
df['new time'] = df['time_of_action']-df.groupby('name')['time_of_action'].transform('min')
df:
name time_of_action new time
0 bob 21 0
1 alice 34 0
2 bob 37 16
3 bob 40 19
4 ali 55 0
5 alice 65 31
6 bob 67 46
7 ali 84 29
8 moji 88 0
9 ali 90 35
10 moji 91 3
11 ali 97 42
12 bob 104 83
13 bob 105 84
14 bob 108 87
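This works because transform('min') broadcasts each group's minimum back as a series aligned with df's original index, so the subtraction is a plain elementwise operation rather than a per-group loop.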
Try this:
df['new_time'] = df.groupby('name')['time_of_action'].apply(lambda x: x - x.min())
df
Output:
name time_of_action new_time
0 bob 21 0
1 alice 34 0
2 bob 37 16
3 bob 40 19
4 ali 55 0
5 alice 65 31
...
Others have already answered, but here's mine:
import pandas as pd
names = ['bob', 'alice', 'bob', 'bob', 'ali', 'alice', 'bob', 'ali', 'moji', 'ali', 'moji', 'ali', 'bob', 'bob', 'bob']
times = [21 , 34, 37, 40, 55, 65, 67, 84, 88, 90 , 91, 97, 104,105, 108]
df = pd.DataFrame({'name' : names , 'time_of_action' : times})
def subtract_min(df):
df['new_time'] = df['time_of_action'] - df['time_of_action'].min()
return df
df.groupby('name').apply(subtract_min).sort_values('name')
As others have said, I am kind of guessing as well.
My text file:
Name Surname Age Sex Grade X
Chris M. 14 M 4 10 05 2010
Adam A. 17 M 11 12 2011
Jack O. M 8 08 04 2009
...
I want to count the years.
Example output:
{'2010': 1, '2011': 1, ...}
but I got "KeyError: 'Year'".
import pandas as pd
df = pd.read_fwf("file.txt")
df.join(df['X'].str.split(' ', 2, expand = True).rename(columns={0: '1', 1: '2', 2: '3'}))
df.columns=["1"]
df["1"].value_counts().dict()
What's wrong with my code?
Your df remains the original one; you have to assign the result back after you join the new columns, and then you will get a df with a Year column. Try this:
import pandas as pd
df = pd.read_fwf("file.txt")
df = df.join(df['X'].str.split(' ', 2, expand = True).rename(columns={1: 'Year', 0: 'Month'}))
df["Year"].value_counts().to_dict()
Output:
{'2009': 1, '2010': 1, '2011': 1}
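As an alternative sketch (assuming the year is always the last space-separated token of the X column), the split/join can be skipped entirely:
import pandas as pd
df = pd.read_fwf("file.txt")
# take the last whitespace-separated token of each X value as the year
df['X'].str.split().str[-1].value_counts().to_dict()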