Subtracting values between groups within a dataframe - python

I am attempting to efficiently calculate the differences between two groups that may have mismatched data.
The following dataframe, df,
df = pd.DataFrame({'type': ['A', 'A', 'A', 'W', 'W', 'W'],
                   'code': ['1', '2', '3', '1', '2', '4'],
                   'values': [50, 25, 25, 50, 10, 40]})
has two types with mismatched "codes" -- notably, code 3 is not present for the 'W' type and code 4 is not present for the 'A' type. I have represented the codes as strings because in my particular case they sometimes are strings.
I would like to subtract the values for matching codes between the two types so that we obtain,
result = pd.DataFrame({'code': ['1', '2', '3', '4'],
                       'diff': [0, 15, 25, -40]})
The sign would indicate which type had the greater value.
I have spent some time examining variations on groupby diff methods here, but have not seen anything that deals with the particular issue of subtracting between two potentially mismatched columns. Instead, most questions appear to concern the intended use of the diff() method.
The route I've tried most recently is using a list comprehension on df.groupby('type') to split into two dataframes, but then I am left with a similar problem of subtracting mismatched cases.

Group by code, then substitute the missing value with 0:
import numpy as np
import pandas as pd

df = pd.DataFrame({'type': ['A', 'A', 'A', 'W', 'W', 'W'],
                   'code': ['1', '2', '3', '1', '2', '4'],
                   'values': [50, 25, 25, 50, 10, 40]})

def my_func(x):
    # What if there is more than one value for a type/code combo?
    a_value = x[x.type == 'A']['values'].max()
    w_value = x[x.type == 'W']['values'].max()
    a_value = 0 if np.isnan(a_value) else a_value
    w_value = 0 if np.isnan(w_value) else w_value
    return a_value - w_value

df_new = df.groupby('code').apply(my_func)
df_new = df_new.reset_index()
df_new = df_new.rename(columns={0: 'diff'})
print(df_new)
  code  diff
0    1     0
1    2    15
2    3    25
3    4   -40
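For comparison, a vectorized sketch of the same idea using pivot. This assumes at most one value per type/code pair; with duplicates you would need to aggregate first (e.g. with pivot_table):
import pandas as pd

df = pd.DataFrame({'type': ['A', 'A', 'A', 'W', 'W', 'W'],
                   'code': ['1', '2', '3', '1', '2', '4'],
                   'values': [50, 25, 25, 50, 10, 40]})

# Pivot so each type becomes a column, fill the missing codes with 0, then subtract.
wide = df.pivot(index='code', columns='type', values='values').fillna(0)
result = (wide['A'] - wide['W']).reset_index(name='diff')
print(result)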

Related

How can I add a column with dates (starting from today)?

How can I add dates (starting from today) to column C? Column B is an example of what I want to get.
df = pd.DataFrame({'N': ['1', '2', '3', '4'],
                   'B': ['16.11.2021', '17.11.2021', '18.11.2021', '19.11.2021'],
                   'C': ['nan', 'nan', 'nan', 'nan']})
If I understood your question correctly, you want something like this:
import datetime
base = datetime.datetime.today()
# Build the dates in ascending order directly; sorting the formatted '%d.%m.%Y'
# strings would order them incorrectly across month boundaries.
date_list = [(base + datetime.timedelta(days=x)).strftime('%d.%m.%Y') for x in range(len(df))]
df['C'] = date_list
This will produce the same result as in column B (when run on 16.11.2021).
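A pandas-native alternative is pd.date_range; a short sketch, assuming the same day-first format:
import pandas as pd

# Generate len(df) consecutive dates starting today and format them day-first.
df['C'] = pd.date_range(start=pd.Timestamp.today().normalize(),
                        periods=len(df)).strftime('%d.%m.%Y')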

Which Pandas function do I need? group_by or pivot

I'm still relatively new to Pandas and I can't tell which of the functions I'm best off using to get to my answer. I have looked at pivot, pivot_table, group_by and aggregate but I can't seem to get it to do what I require. Quite possibly user error, for which I apologise!
I have data like this:
Code to create df:
import pandas as pd
df = pd.DataFrame([
    ['1', '1', 'A', 3, 7],
    ['1', '1', 'B', 2, 9],
    ['1', '1', 'C', 2, 9],
    ['1', '2', 'A', 4, 10],
    ['1', '2', 'B', 4, 0],
    ['1', '2', 'C', 9, 8],
    ['2', '1', 'A', 3, 8],
    ['2', '1', 'B', 10, 4],
    ['2', '1', 'C', 0, 1],
    ['2', '2', 'A', 1, 6],
    ['2', '2', 'B', 10, 2],
    ['2', '2', 'C', 10, 3]
], columns=['Field1', 'Field2', 'Type', 'Price1', 'Price2'])
print(df)
I am trying to get data like this:
My end goal, though, is to end up with one column for A, one for B, and one for C, as A will use Price1 and B & C will use Price2.
I don't want to necessarily get the max or min or average or sum of the Price as theoretically (although unlikely) there could be two different Price1's for the same Fields & Type.
What's the best function to use in Pandas to get to what I need?
Use DataFrame.set_index with DataFrame.unstack to reshape. The output has a MultiIndex in the columns, so sort the second level with DataFrame.sort_index, flatten the column names, and finally turn the Field levels back into columns:
df1 = (df.set_index(['Field1', 'Field2', 'Type'])
         .unstack(fill_value=0)
         .sort_index(axis=1, level=1))
df1.columns = [f'{b}-{a}' for a, b in df1.columns]
df1 = df1.reset_index()
print(df1)
  Field1 Field2  A-Price1  A-Price2  B-Price1  B-Price2  C-Price1  C-Price2
0      1      1         3         7         2         9         2         9
1      1      2         4        10         4         0         9         8
2      2      1         3         8        10         4         0         1
3      2      2         1         6        10         2        10         3
A solution with DataFrame.pivot_table is also possible, but it aggregates values when the first three columns contain duplicates (here with the default mean function):
df2 = (df.pivot_table(index=['Field1', 'Field2'],
                      columns='Type',
                      values=['Price1', 'Price2'],
                      aggfunc='mean')
         .sort_index(axis=1, level=1))
df2.columns = [f'{b}-{a}' for a, b in df2.columns]
df2 = df2.reset_index()
print(df2)
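For the end goal mentioned in the question (one column per Type, where A uses Price1 and B & C use Price2), a small sketch selecting from the df1 built above; the plain A/B/C column names are illustrative:
# Keep A's Price1 and B's/C's Price2, then rename to plain type letters.
final = df1[['Field1', 'Field2', 'A-Price1', 'B-Price2', 'C-Price2']]
final = final.rename(columns={'A-Price1': 'A', 'B-Price2': 'B', 'C-Price2': 'C'})
print(final)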
Use pivot_table:
pd.pivot_table(df, values=['Price1', 'Price2'], index=['Field1', 'Field2'], columns='Type').reset_index()

Rows to columns based on another column

I've merged two dataframes, but now there are duplicate rows. I want to move my rows to columns based on/grouped by a column value.
I have already merged the two dataframes:
df_merge = pd.merge(top_emails_df, keyword_df, on='kmed_idf')
The new dataframe looks like this:
import pandas as pd
df = pd.DataFrame({'kmed_idf': ['1', '1', '1', '2', '2'],
                   'n_docs': [796, 796, 796, 200, 200],
                   'email_from': ['foo', 'foo', 'foo', 'bar', 'bar']})
I tried to stack the dataframe:
newtest = df_merge.set_index(['kmed_idf']).stack(level=0)
newtest = newtest.to_frame()
But this only created a series. When converted to a dataframe it's still not very useful.
What I would like is a dataframe where each row is a unique value of 'kmed_idf', and the duplicated rows become columns. Something like this:
import pandas as pd
df = pd.DataFrame({'kmed_idf': ['1', '2', '3'],
                   'n_docs': [796],
                   'n_docs2': [796],
                   'n_docs3': [796]})
This will make it easier to delete the duplicates. I've also tried using the drop_duplicates pandas function, but to no avail.
If all you want is to remove duplicates, I think the .drop_duplicates function should be the way to go.
I don't know why it didn't work for you, but please try this:
import pandas as pd
df = pd.DataFrame({'kmed_idf': ['1', '1', '1', '2', '2'],
                   'n_docs': [796, 796, 796, 200, 200],
                   'email_from': ['foo', 'foo', 'foo', 'bar', 'bar']})
df.drop_duplicates(inplace=True)
print(df)
Output:
  email_from kmed_idf  n_docs
0        foo        1     796
3        bar        2     200
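If the goal really is one row per kmed_idf with the duplicated values spread across numbered columns, here is a sketch using groupby.cumcount and unstack; the n_docs1/n_docs2/... names are illustrative:
import pandas as pd

df = pd.DataFrame({'kmed_idf': ['1', '1', '1', '2', '2'],
                   'n_docs': [796, 796, 796, 200, 200],
                   'email_from': ['foo', 'foo', 'foo', 'bar', 'bar']})

# Number the occurrences within each kmed_idf group, then pivot them to columns.
occurrence = df.groupby('kmed_idf').cumcount() + 1
wide = df.set_index(['kmed_idf', occurrence])['n_docs'].unstack()
wide.columns = [f'n_docs{c}' for c in wide.columns]
print(wide.reset_index())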

Creating array from CSV in Python while limiting columns used

I am working with a CSV file with the following format,
ST    1   2   3   4
WA   10  10   5   2
OR    0   7   3   9
CA   11   5   4  12
AZ -999   0   0  11
The first row represents days 1-4. I want to be able to take the data for each state (for example WA: 10, 10, 5, 2) and create a sorted array of just the numbers in that row. If I omit the first index, which is WA, I can do this using:
sorted(list, key=int)
Doing so would give me a list, [2,5,10,10].
What I want to do is:
1. Read each line of the CSV.
2. Create an array of numbers using the numerical data.
3. Run some calculations using the array (percent rank).
4. Combine the calculated values with the correct state fields. For instance, if I want to add a value of 3 to the array for WA,
b.insert(1, 3)
to get
[2, 3, 5, 10, 10]
so I can calculate rank. (Note: I am unable to use scipy so I must calculate rank using a function which I've already figured out.)
5. End by writing the state and rank value to a new CSV, something like:
ST  Rank
WA    30
CA    26
OR    55
where Rank is the rank of the given value in the array.
I am pretty new to Python, so any help or pointers would be greatly appreciated. I am also limited to basic Python modules (numpy, csv, etc.).
UPDATE CODE:
with open(outputDir + "needy.csv", 'rb') as f:
    first = {row[0]: sorted(row[1:], key=int) for row in list(csv.reader(f))}
for key, value in first.items():
    if addn in first:
        g = "yes"
        print key, addn, g
        #print d
    else:
        g = "no"
        print key, addn, g
    value.append(300)
    value.append(22)
    value = sorted(value, key=int)
    print "State:", key, value
When I do this, the values I append are properly added and the dict is properly sorted, but when I define addn as a value, it will not be found. Example below.
{'WA': ['1', '1', '1', '2', '2', '2', '3', '4', '4', '4', '5', '5', '5', '5', '6', '6', '7', '7', '8', '8', '8', '8', '9', '10', '10', '10', '10', '11', '11']}
The above line is what happens if I simply print out first.
If I utilize the for loop and specify addn as 11 as a global variable, I get:
WA 11 no
State: WA ['1', '1', '1', '2', '2', '2', '3', '4', '4', '4', '5', '5', '5', '5', '6', '6', '7', '7', '8', '8', '8', '8', '9', '10', '10', '10', '10', '11', '11',..]
Being that 11 is part of that value list, it should return yes, etc.
You can use simple commands and a dictionary to organize your data:
fid = open('out.txt')  # Just copy what you put in your question into a file.
l = fid.readlines()    # Read the whole file into a list of lines.
d = {}                 # Create a dictionary.
for i in l:
    s = i.split()      # Split the line on whitespace (the default).
    d[s[0]] = [int(s[j]) for j in range(1, len(s))]  # Convert the number fields to ints.
print(d)
The result is:
{'CA': [11, 5, 4, 12], 'ST': [1, 2, 3, 4], 'OR': [0, 7, 3, 9], 'WA': [10, 10, 5, 2], 'AZ': [-999, 0, 0, 11]}
From this point you can do whatever you wish to your entries in the dictionary including append.
d['CA'].append(3)
EDIT: @J.R.W., building the dictionary the way I recommended, followed by your code (plus the correction I gave):
fid = open('out.txt')  # Just copy what you put in your question into a file.
l = fid.readlines()    # Read the whole file into a list of lines.
first = {}             # Create a dictionary.
for i in l:
    s = i.split()      # Split the line on whitespace (the default).
    first[s[0]] = [int(s[j]) for j in range(1, len(s))]  # Convert the number fields to ints.
print(first)
addn = 11
for key, value in first.items():
    if addn in value:
        g = "yes"
        print(key, addn, g)
    else:
        g = "no"
        print(key, addn, g)
    value.append(300)
    value.append(22)
    value = sorted(value, key=int)
    print("State:", key, value)
This results in:
{'ST': [1, 2, 3, 4], 'CA': [11, 5, 4, 12], 'OR': [0, 7, 3, 9], 'AZ': [-999, 0, 0, 11], 'WA': [10, 10, 5, 2]}
ST 11 no
State: ST [1, 2, 3, 4, 22, 300]
CA 11 yes
State: CA [4, 5, 11, 12, 22, 300]
OR 11 no
State: OR [0, 3, 7, 9, 22, 300]
AZ 11 yes
State: AZ [-999, 0, 0, 11, 22, 300]
WA 11 no
State: WA [2, 5, 10, 10, 22, 300]
This says yes when 11 exists (your own test) and no when it doesn't.
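Pulling the question's steps together, here is a minimal end-to-end sketch. It assumes the file is actually comma-separated (as the csv.reader usage in the question implies), the file names are illustrative, and the percent-rank formula is one common definition standing in for the OP's own function:
import csv

def percent_rank(values, x):
    # Percentage of values strictly below x (one common definition).
    return 100.0 * sum(1 for v in values if v < x) / len(values)

with open('needy.csv') as f:
    rows = list(csv.reader(f))

addn = 3  # illustrative value to insert before ranking
with open('ranks.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['ST', 'Rank'])
    for row in rows:
        if row[0] == 'ST':  # skip the header row of day numbers
            continue
        values = sorted(int(v) for v in row[1:])
        values.append(addn)  # add the new value...
        values.sort()        # ...and keep the list sorted
        writer.writerow([row[0], percent_rank(values, addn)])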

Understanding Matrix to List of List and then Numpy Array

I want to construct a matrix like:
     Col1 Col2 Col3 Coln
row1    1    2    4    2
row2    3    8    3    3
row3    8    7    7    3
rown    n    n    n    n
I have yet to find anything in the Python documentation that states how a list of lists is assembled. Is it like:
a = [[1,2,4,2],[3,8,3,3],[8,7,7,3],[n,n,n,n]]
where each row is a list item, or should each column be a list item:
b = [[1,3,8,n],[2,8,7,n],[4,3,7,n],[2,3,3,n]]
I would think that this would be a common question but I can't seem to find a straight answer.
Based on the documentation I'm guessing that I can convert this to a numpy array by simply:
np.array(a)
Can anyone help?
You want the first version:
a = [[1,2,4,2],[3,8,3,3],[8,7,7,3],[n,n,n,n]]
When accessing an element in a matrix, you typically use matrix[row][col], so with the above Python list format a[i] would give you row i, and a[i][j] would give you the jth element from the ith row.
To convert it to a numpy array, np.array(a) is the correct method.
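As a quick check of the indexing convention (a small sketch):
a = [[1, 2, 4, 2], [3, 8, 3, 3], [8, 7, 7, 3]]
print(a[1])     # row 1 -> [3, 8, 3, 3]
print(a[1][2])  # row 1, column 2 -> 3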
This:
a = [[1,2,4,2],[3,8,3,3],[8,7,7,3],[n,n,n,n]]
will create the list you want, and yes, np.array(a) will convert it to a numpy array.
Also, this is the 'pythonish' way of creating an array with m rows and n columns (and setting all the elements to 0):
a = [[0 for i in range(n)] for j in range(m)]
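Note that the superficially similar [[0]*n]*m is a common trap, since it repeats one reference to the same inner list m times; a quick sketch of the difference:
m, n = 2, 3
bad = [[0] * n] * m                               # m references to the SAME inner list
good = [[0 for i in range(n)] for j in range(m)]  # m independent inner lists
bad[0][0] = 1
good[0][0] = 1
print(bad)   # [[1, 0, 0], [1, 0, 0]] -- both rows changed
print(good)  # [[1, 0, 0], [0, 0, 0]]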
Since you mention "matrix" let me also add that you have the np.matrix() option as well.
For example: You can use
A = [[1,2,3],[4,5,6],[7,8,9]]
to create a list (of lists), with each inner list representing a row.
Then
AA = np.array(A)
will create a 2D array with the appearance of a matrix, but not all the properties of a matrix.
Whereas
AM = np.matrix(A)
will create a matrix.
If you perform arithmetic operations on these two then you'll see the difference. For example
AA**2
will square each element in the 2D array. However
AM**2
will perform matrix multiplication of AM by itself.
BTW. The above usage assumes "import numpy as np" of course.
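A short sketch of that difference, using the 3x3 example above:
import numpy as np

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
AA = np.array(A)
AM = np.matrix(A)

print(AA ** 2)  # element-wise squares: [[1 4 9] [16 25 36] [49 64 81]]
print(AM ** 2)  # matrix product of A with itself: [[30 36 42] [66 81 96] [102 126 150]]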
Use the first convention. If the transpose is needed:
>>> a = [[1,2,4,2],[3,8,3,3],[8,7,7,3],['n','n','n','n']]
>>> trans = []
>>> for i in range(len(a)):
...     trans.append([row[i] for row in a])
...
>>> trans
[[1, 3, 8, 'n'], [2, 8, 7, 'n'], [4, 3, 7, 'n'], [2, 3, 3, 'n']]
An element is then a[row][col] vs trans[col][row] (with respect to a of your example)
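As a side note, the same transpose can be written more compactly with the built-in zip (a sketch):
a = [[1, 2, 4, 2], [3, 8, 3, 3], [8, 7, 7, 3], ['n', 'n', 'n', 'n']]
trans = [list(col) for col in zip(*a)]
print(trans)  # [[1, 3, 8, 'n'], [2, 8, 7, 'n'], [4, 3, 7, 'n'], [2, 3, 3, 'n']]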
The first convention is the one Python itself uses, and it is easy to see why you should prefer it once the list is laid out:
a = [[1,2,4,2],
     [3,8,3,3],
     [8,7,7,3],
     ['n','n','n','n']]
Certainly when you use numpy, use the first convention since that is used by numpy:
>>> np.array(a)
array([['1', '2', '4', '2'],
       ['3', '8', '3', '3'],
       ['8', '7', '7', '3'],
       ['n', 'n', 'n', 'n']],
      dtype='|S1')
>>> np.array(trans)
array([['1', '3', '8', 'n'],
       ['2', '8', '7', 'n'],
       ['4', '3', '7', 'n'],
       ['2', '3', '3', 'n']],
      dtype='|S1')
Note: numpy converts the ints to strings because of the 'n' in the final row/col.
When you actually start to print that table, here is one way:
def pprint_table(table):
    def format_field(field, fmt='{:,.0f}'):
        if type(field) is str: return field
        if type(field) is tuple: return field[1].format(field[0])
        return fmt.format(field)
    def get_max_col_w(table, index):
        return max([len(format_field(row[index])) for row in table])
    col_paddings = [get_max_col_w(table, i) for i in range(len(table[0]))]
    for i, row in enumerate(table):
        # left col
        row_tab = [row[0].ljust(col_paddings[0])]
        # rest of the cols
        row_tab += [format_field(row[j]).rjust(col_paddings[j]) for j in range(1, len(row))]
        print(' '.join(row_tab))

pprint_table([
    ['', 'Col 1', 'Col 2', 'Col 3', 'Col 4'],
    ['row 1', '1', '2', '4', '2'],
    ['row 2', '3', '8', '3', '3'],
    ['row 3', '8', '7', '7', '3'],
    ['row 4', 'n', 'n', 'n', 'n']])
Prints:
      Col 1 Col 2 Col 3 Col 4
row 1     1     2     4     2
row 2     3     8     3     3
row 3     8     7     7     3
row 4     n     n     n     n
